9700 ZFS resilvered mirror does not balance reads
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
NEX-17931 Getting panic: vfs_mountroot: cannot mount root after split mirror syspool
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9989 Changing volume names can result in double imports and data corruption
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-6855 System fails to boot up after a large number of datasets created
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-8711 backport illumos 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-7550 zpool remove mirrored slog or special vdev causes system panic due to a NULL pointer dereference in "zfs" module
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6884 KRRP: replication deadlock due to unavailable resources
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6000 zpool destroy/export with autotrim=on panics due to lock assertion
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5702 Special vdev cannot be removed if it was used as slog
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5637 enablespecial property should be disabled after special vdev removal
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5068 In-progress scrub can drastically increase zpool import times
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-5219 WBC: Add capability to delay migration
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5078 Want ability to see progress of freeing data and how much is left to free after large file delete patch
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5019 wrcache activation races vs. 'zpool create -O wrc_mode='
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4934 Add capability to remove special vdev
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4830 writecache=off leaks data on special vdev (the data will never migrate)
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4876 On-demand TRIM shouldn't use system_taskq and should queue jobs
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4679 Autotrim taskq doesn't get destroyed on pool export
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
NEX-4567 KRRP: L2L replication inside of one pool causes ARC-deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
6529 Properly handle updates of variably-sized SA entries.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6175 sdev can create bogus zvol directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6174 /dev/zvol does not show pool directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
6046 SPARC boot should support com.delphix:hole_birth
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6041 SPARC boot should support LZ4
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6044 SPARC zfs reader is using wrong size for objset_phys
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
backout 5997: breaks "zpool add"
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
5770 Add load_nvlist() error handling
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Revert "NEX-4476 WRC: Allow to use write back cache per tree of datasets"
This reverts commit fe97b74444278a6f36fec93179133641296312da.
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-4077 taskq_dispatch in on-demand TRIM can sometimes fail
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Revert "NEX-3965 System may panic on the importing of pool with WRC"
This reverts commit 45bc50222913cddafde94621d28b78d6efaea897.
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3817 'zpool add' of special devices causes system panic
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3213 need to load vdev props for all vdev including spares and l2arc vdevs
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-2112 `zdb -e <pool>` assertion failed for thread 0xfffffd7fff172a40
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-1228 Panic importing pool with active unsupported features
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Harold Shaw <harold.shaw@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-140 Duplicate entries in mantools and doctools manifests
NEX-1078 Replaced ASSERT with if-statement
NEX-521 Single threaded rpcbind is not scalable
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
NEX-1088 partially rolled back 641841bb
to fix a regression that caused an assert in read-only import.
OS-115 Heap leaks related to OS-114 and SUP-577
SUP-577 deadlock between zpool detach and syseventd
OS-103 handle CoS descriptor persistent references across vdev operations
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Make special vdev subtree topology the same as regular vdev subtree to simplify testcase setup
Fixup merge issues
Fix default properties' values after export/import
zfsxx issue #11: support for spare device groups
Issue #34: Add feature flag for the compound checksum - sha1crc32
           Contributors: Boris Protopopov
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
Issue #27: Auto best-effort dedup enable/disable - settable per pool
Issue #7: Reconcile L2ARC and "special" use by datasets
Issue #9: Support for persistent CoS/vdev attributes with feature flags
          Support for feature flags for special tier
          Contributors: Daniil Lunev, Boris Protopopov
Issue #2: optimize DDE lookup in DDT objects
Added an option to control the number of classes of DDEs in the DDT.
The new default is one, i.e. all DDEs are stored together
regardless of refcount.
Issue #3: Add support for parametrized number of copies for DDTs
Issue #25: Add a pool-level property that controls the number of copies of DDTs in the pool.
Fixup merge results
re #13850 Refactor ZFS config discovery IOCs to libzfs_core patterns
re #13748 added zpool export -c option
The zpool export -c command exports the specified pool while keeping its latest
configuration in the cache file for a subsequent zpool import -c.
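A minimal usage sketch of this option (the pool name is illustrative, and
/etc/zfs/zpool.cache is assumed as the default cache file location):

    # export 'tank' but keep its configuration in the cache file
    zpool export -c tank
    # later, re-import it using the cached configuration
    zpool import -c /etc/zfs/zpool.cache tank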
re #13333 rb4362 - eliminated spa_update_iotime() to fix the stats
re #12684 rb4206 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
    
      
          --- old/usr/src/uts/common/fs/zfs/spa.c
          +++ new/usr/src/uts/common/fs/zfs/spa.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24      - * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
  25      - * Copyright (c) 2015, Nexenta Systems, Inc.  All rights reserved.
       24 + * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  26   25   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
       26 + * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  27   27   * Copyright 2013 Saso Kiselkov. All rights reserved.
  28   28   * Copyright (c) 2014 Integros [integros.com]
  29   29   * Copyright 2016 Toomas Soome <tsoome@me.com>
  30      - * Copyright 2017 Joyent, Inc.
       30 + * Copyright 2018 Joyent, Inc.
  31   31   * Copyright (c) 2017 Datto Inc.
  32      - * Copyright 2018 OmniOS Community Edition (OmniOSce) Association.
  33   32   */
  34   33  
  35   34  /*
  36   35   * SPA: Storage Pool Allocator
  37   36   *
  38   37   * This file contains all the routines used when modifying on-disk SPA state.
  39   38   * This includes opening, importing, destroying, exporting a pool, and syncing a
  40   39   * pool.
  41   40   */
  42   41  
  43   42  #include <sys/zfs_context.h>
  
  44   43  #include <sys/fm/fs/zfs.h>
  45   44  #include <sys/spa_impl.h>
  46   45  #include <sys/zio.h>
  47   46  #include <sys/zio_checksum.h>
  48   47  #include <sys/dmu.h>
  49   48  #include <sys/dmu_tx.h>
  50   49  #include <sys/zap.h>
  51   50  #include <sys/zil.h>
  52   51  #include <sys/ddt.h>
  53   52  #include <sys/vdev_impl.h>
  54      -#include <sys/vdev_removal.h>
  55      -#include <sys/vdev_indirect_mapping.h>
  56      -#include <sys/vdev_indirect_births.h>
  57   53  #include <sys/metaslab.h>
  58   54  #include <sys/metaslab_impl.h>
  59   55  #include <sys/uberblock_impl.h>
  60   56  #include <sys/txg.h>
  61   57  #include <sys/avl.h>
  62      -#include <sys/bpobj.h>
  63   58  #include <sys/dmu_traverse.h>
  64   59  #include <sys/dmu_objset.h>
  65   60  #include <sys/unique.h>
  66   61  #include <sys/dsl_pool.h>
  67   62  #include <sys/dsl_dataset.h>
  68   63  #include <sys/dsl_dir.h>
  69   64  #include <sys/dsl_prop.h>
  70   65  #include <sys/dsl_synctask.h>
  71   66  #include <sys/fs/zfs.h>
  72   67  #include <sys/arc.h>
  73   68  #include <sys/callb.h>
  74   69  #include <sys/systeminfo.h>
  75   70  #include <sys/spa_boot.h>
  76   71  #include <sys/zfs_ioctl.h>
  77   72  #include <sys/dsl_scan.h>
  78   73  #include <sys/zfeature.h>
  79   74  #include <sys/dsl_destroy.h>
       75 +#include <sys/cos.h>
       76 +#include <sys/special.h>
       77 +#include <sys/wbc.h>
  80   78  #include <sys/abd.h>
  81   79  
  82   80  #ifdef  _KERNEL
  83   81  #include <sys/bootprops.h>
  84   82  #include <sys/callb.h>
  85   83  #include <sys/cpupart.h>
  86   84  #include <sys/pool.h>
  87   85  #include <sys/sysdc.h>
  88   86  #include <sys/zone.h>
  89   87  #endif  /* _KERNEL */
  90   88  
  91   89  #include "zfs_prop.h"
  92   90  #include "zfs_comutil.h"
  93   91  
  94   92  /*
  95   93   * The interval, in seconds, at which failed configuration cache file writes
  96   94   * should be retried.
  97   95   */
  98      -int zfs_ccw_retry_interval = 300;
       96 +static int zfs_ccw_retry_interval = 300;
  99   97  
 100   98  typedef enum zti_modes {
 101   99          ZTI_MODE_FIXED,                 /* value is # of threads (min 1) */
 102  100          ZTI_MODE_BATCH,                 /* cpu-intensive; value is ignored */
 103  101          ZTI_MODE_NULL,                  /* don't create a taskq */
 104  102          ZTI_NMODES
 105  103  } zti_modes_t;
 106  104  
 107  105  #define ZTI_P(n, q)     { ZTI_MODE_FIXED, (n), (q) }
 108  106  #define ZTI_BATCH       { ZTI_MODE_BATCH, 0, 1 }
 109  107  #define ZTI_NULL        { ZTI_MODE_NULL, 0, 0 }
 110  108  
 111  109  #define ZTI_N(n)        ZTI_P(n, 1)
 112  110  #define ZTI_ONE         ZTI_N(1)
 113  111  
 114  112  typedef struct zio_taskq_info {
 115  113          zti_modes_t zti_mode;
 116  114          uint_t zti_value;
 117  115          uint_t zti_count;
 118  116  } zio_taskq_info_t;
 119  117  
 120  118  static const char *const zio_taskq_types[ZIO_TASKQ_TYPES] = {
 121  119          "issue", "issue_high", "intr", "intr_high"
 122  120  };
 123  121  
 124  122  /*
 125  123   * This table defines the taskq settings for each ZFS I/O type. When
 126  124   * initializing a pool, we use this table to create an appropriately sized
 127  125   * taskq. Some operations are low volume and therefore have a small, static
 128  126   * number of threads assigned to their taskqs using the ZTI_N(#) or ZTI_ONE
 129  127   * macros. Other operations process a large amount of data; the ZTI_BATCH
 130  128   * macro causes us to create a taskq oriented for throughput. Some operations
  131  129   * are so high frequency and short-lived that the taskq itself can become a
 132  130   * point of lock contention. The ZTI_P(#, #) macro indicates that we need an
 133  131   * additional degree of parallelism specified by the number of threads per-
 134  132   * taskq and the number of taskqs; when dispatching an event in this case, the
 135  133   * particular taskq is chosen at random.
 136  134   *
 137  135   * The different taskq priorities are to handle the different contexts (issue
 138  136   * and interrupt) and then to reserve threads for ZIO_PRIORITY_NOW I/Os that
 139  137   * need to be handled with minimum delay.
 140  138   */
  
 141  139  const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
 142  140          /* ISSUE        ISSUE_HIGH      INTR            INTR_HIGH */
 143  141          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* NULL */
 144  142          { ZTI_N(8),     ZTI_NULL,       ZTI_P(12, 8),   ZTI_NULL }, /* READ */
 145  143          { ZTI_BATCH,    ZTI_N(5),       ZTI_N(8),       ZTI_N(5) }, /* WRITE */
 146  144          { ZTI_P(12, 8), ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* FREE */
 147  145          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* CLAIM */
 148  146          { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* IOCTL */
 149  147  };
 150  148  
      149 +static sysevent_t *spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl,
      150 +    const char *name);
      151 +static void spa_event_notify_impl(sysevent_t *ev);
 151  152  static void spa_sync_version(void *arg, dmu_tx_t *tx);
 152  153  static void spa_sync_props(void *arg, dmu_tx_t *tx);
      154 +static void spa_vdev_sync_props(void *arg, dmu_tx_t *tx);
      155 +static int spa_vdev_prop_set_nosync(vdev_t *, nvlist_t *, boolean_t *);
 153  156  static boolean_t spa_has_active_shared_spare(spa_t *spa);
 154      -static int spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
 155      -    boolean_t reloading);
      157 +static int spa_load_impl(spa_t *spa, uint64_t, nvlist_t *config,
      158 +    spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
      159 +    char **ereport);
 156  160  static void spa_vdev_resilver_done(spa_t *spa);
      161 +static void spa_auto_trim(spa_t *spa, uint64_t txg);
      162 +static void spa_vdev_man_trim_done(spa_t *spa);
      163 +static void spa_vdev_auto_trim_done(spa_t *spa);
      164 +static uint64_t spa_min_trim_rate(spa_t *spa);
 157  165  
 158  166  uint_t          zio_taskq_batch_pct = 75;       /* 1 thread per cpu in pset */
 159  167  id_t            zio_taskq_psrset_bind = PS_NONE;
 160  168  boolean_t       zio_taskq_sysdc = B_TRUE;       /* use SDC scheduling class */
 161  169  uint_t          zio_taskq_basedc = 80;          /* base duty cycle */
 162  170  
 163  171  boolean_t       spa_create_process = B_TRUE;    /* no process ==> no sysdc */
 164  172  extern int      zfs_sync_pass_deferred_free;
 165  173  
 166  174  /*
 167      - * Report any spa_load_verify errors found, but do not fail spa_load.
 168      - * This is used by zdb to analyze non-idle pools.
 169      - */
 170      -boolean_t       spa_load_verify_dryrun = B_FALSE;
 171      -
 172      -/*
 173      - * This (illegal) pool name is used when temporarily importing a spa_t in order
 174      - * to get the vdev stats associated with the imported devices.
 175      - */
 176      -#define TRYIMPORT_NAME  "$import"
 177      -
 178      -/*
 179      - * For debugging purposes: print out vdev tree during pool import.
 180      - */
 181      -boolean_t       spa_load_print_vdev_tree = B_FALSE;
 182      -
 183      -/*
 184      - * A non-zero value for zfs_max_missing_tvds means that we allow importing
 185      - * pools with missing top-level vdevs. This is strictly intended for advanced
 186      - * pool recovery cases since missing data is almost inevitable. Pools with
 187      - * missing devices can only be imported read-only for safety reasons, and their
 188      - * fail-mode will be automatically set to "continue".
 189      - *
 190      - * With 1 missing vdev we should be able to import the pool and mount all
 191      - * datasets. User data that was not modified after the missing device has been
 192      - * added should be recoverable. This means that snapshots created prior to the
 193      - * addition of that device should be completely intact.
 194      - *
 195      - * With 2 missing vdevs, some datasets may fail to mount since there are
 196      - * dataset statistics that are stored as regular metadata. Some data might be
 197      - * recoverable if those vdevs were added recently.
 198      - *
 199      - * With 3 or more missing vdevs, the pool is severely damaged and MOS entries
 200      - * may be missing entirely. Chances of data recovery are very low. Note that
 201      - * there are also risks of performing an inadvertent rewind as we might be
 202      - * missing all the vdevs with the latest uberblocks.
 203      - */
 204      -uint64_t        zfs_max_missing_tvds = 0;
 205      -
 206      -/*
 207      - * The parameters below are similar to zfs_max_missing_tvds but are only
 208      - * intended for a preliminary open of the pool with an untrusted config which
 209      - * might be incomplete or out-dated.
 210      - *
 211      - * We are more tolerant for pools opened from a cachefile since we could have
 212      - * an out-dated cachefile where a device removal was not registered.
 213      - * We could have set the limit arbitrarily high but in the case where devices
 214      - * are really missing we would want to return the proper error codes; we chose
 215      - * SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available
 216      - * and we get a chance to retrieve the trusted config.
 217      - */
 218      -uint64_t        zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1;
 219      -/*
 220      - * In the case where config was assembled by scanning device paths (/dev/dsks
 221      - * by default) we are less tolerant since all the existing devices should have
 222      - * been detected and we want spa_load to return the right error codes.
 223      - */
 224      -uint64_t        zfs_max_missing_tvds_scan = 0;
 225      -
 226      -/*
 227  175   * ==========================================================================
 228  176   * SPA properties routines
 229  177   * ==========================================================================
 230  178   */
 231  179  
 232  180  /*
 233  181   * Add a (source=src, propname=propval) list to an nvlist.
 234  182   */
 235  183  static void
 236  184  spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, char *strval,
 237  185      uint64_t intval, zprop_source_t src)
 238  186  {
 239  187          const char *propname = zpool_prop_to_name(prop);
 240  188          nvlist_t *propval;
 241  189  
 242  190          VERIFY(nvlist_alloc(&propval, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 243  191          VERIFY(nvlist_add_uint64(propval, ZPROP_SOURCE, src) == 0);
 244  192  
 245  193          if (strval != NULL)
 246  194                  VERIFY(nvlist_add_string(propval, ZPROP_VALUE, strval) == 0);
 247  195          else
 248  196                  VERIFY(nvlist_add_uint64(propval, ZPROP_VALUE, intval) == 0);
 249  197  
 250  198          VERIFY(nvlist_add_nvlist(nvl, propname, propval) == 0);
 251  199          nvlist_free(propval);
  
 252  200  }
 253  201  
 254  202  /*
 255  203   * Get property values from the spa configuration.
 256  204   */
 257  205  static void
 258  206  spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
 259  207  {
 260  208          vdev_t *rvd = spa->spa_root_vdev;
 261  209          dsl_pool_t *pool = spa->spa_dsl_pool;
      210 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
 262  211          uint64_t size, alloc, cap, version;
 263  212          zprop_source_t src = ZPROP_SRC_NONE;
 264  213          spa_config_dirent_t *dp;
 265  214          metaslab_class_t *mc = spa_normal_class(spa);
 266  215  
 267  216          ASSERT(MUTEX_HELD(&spa->spa_props_lock));
 268  217  
 269  218          if (rvd != NULL) {
 270  219                  alloc = metaslab_class_get_alloc(spa_normal_class(spa));
 271  220                  size = metaslab_class_get_space(spa_normal_class(spa));
 272  221                  spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
 273  222                  spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
 274  223                  spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
 275  224                  spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
 276  225                      size - alloc, src);
      226 +                spa_prop_add_list(*nvp, ZPOOL_PROP_ENABLESPECIAL, NULL,
      227 +                    (uint64_t)spa->spa_usesc, src);
      228 +                spa_prop_add_list(*nvp, ZPOOL_PROP_MINWATERMARK, NULL,
      229 +                    spa->spa_minwat, src);
      230 +                spa_prop_add_list(*nvp, ZPOOL_PROP_HIWATERMARK, NULL,
      231 +                    spa->spa_hiwat, src);
      232 +                spa_prop_add_list(*nvp, ZPOOL_PROP_LOWATERMARK, NULL,
      233 +                    spa->spa_lowat, src);
      234 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPMETA_DITTO, NULL,
      235 +                    spa->spa_ddt_meta_copies, src);
 277  236  
      237 +                spa_prop_add_list(*nvp, ZPOOL_PROP_META_PLACEMENT, NULL,
      238 +                    mp->spa_enable_meta_placement_selection, src);
      239 +                spa_prop_add_list(*nvp, ZPOOL_PROP_SYNC_TO_SPECIAL, NULL,
      240 +                    mp->spa_sync_to_special, src);
      241 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_META_TO_METADEV, NULL,
      242 +                    mp->spa_ddt_meta_to_special, src);
      243 +                spa_prop_add_list(*nvp, ZPOOL_PROP_ZFS_META_TO_METADEV,
      244 +                    NULL, mp->spa_zfs_meta_to_special, src);
      245 +                spa_prop_add_list(*nvp, ZPOOL_PROP_SMALL_DATA_TO_METADEV, NULL,
      246 +                    mp->spa_small_data_to_special, src);
      247 +
 278  248                  spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
 279  249                      metaslab_class_fragmentation(mc), src);
 280  250                  spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
 281  251                      metaslab_class_expandable_space(mc), src);
 282  252                  spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
 283  253                      (spa_mode(spa) == FREAD), src);
 284  254  
      255 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_DESEGREGATION, NULL,
      256 +                    (spa->spa_ddt_class_min == spa->spa_ddt_class_max), src);
      257 +
 285  258                  cap = (size == 0) ? 0 : (alloc * 100 / size);
 286  259                  spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
 287  260  
      261 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_BEST_EFFORT, NULL,
      262 +                    spa->spa_dedup_best_effort, src);
      263 +
      264 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, NULL,
      265 +                    spa->spa_dedup_lo_best_effort, src);
      266 +
      267 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, NULL,
      268 +                    spa->spa_dedup_hi_best_effort, src);
      269 +
 288  270                  spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
 289  271                      ddt_get_pool_dedup_ratio(spa), src);
 290  272  
      273 +                spa_prop_add_list(*nvp, ZPOOL_PROP_DDTCAPPED, NULL,
      274 +                    spa->spa_ddt_capped, src);
      275 +
 291  276                  spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
 292  277                      rvd->vdev_state, src);
 293  278  
 294  279                  version = spa_version(spa);
 295  280                  if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
 296  281                          src = ZPROP_SRC_DEFAULT;
 297  282                  else
 298  283                          src = ZPROP_SRC_LOCAL;
 299  284                  spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL, version, src);
 300  285          }
 301  286  
 302  287          if (pool != NULL) {
 303  288                  /*
 304  289                   * The $FREE directory was introduced in SPA_VERSION_DEADLISTS,
 305  290                   * when opening pools before this version freedir will be NULL.
 306  291                   */
 307  292                  if (pool->dp_free_dir != NULL) {
 308  293                          spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
 309      -                            dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes,
      294 +                            dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes +
      295 +                            pool->dp_long_freeing_total,
 310  296                              src);
 311  297                  } else {
 312  298                          spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
 313      -                            NULL, 0, src);
      299 +                            NULL, pool->dp_long_freeing_total, src);
 314  300                  }
 315  301  
 316  302                  if (pool->dp_leak_dir != NULL) {
 317  303                          spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
 318  304                              dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
 319  305                              src);
 320  306                  } else {
 321  307                          spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
 322  308                              NULL, 0, src);
 323  309                  }
 324  310          }
 325  311  
 326  312          spa_prop_add_list(*nvp, ZPOOL_PROP_GUID, NULL, spa_guid(spa), src);
 327  313  
 328  314          if (spa->spa_comment != NULL) {
 329  315                  spa_prop_add_list(*nvp, ZPOOL_PROP_COMMENT, spa->spa_comment,
 330  316                      0, ZPROP_SRC_LOCAL);
 331  317          }
 332  318  
 333  319          if (spa->spa_root != NULL)
 334  320                  spa_prop_add_list(*nvp, ZPOOL_PROP_ALTROOT, spa->spa_root,
 335  321                      0, ZPROP_SRC_LOCAL);
 336  322  
 337  323          if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS)) {
 338  324                  spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
 339  325                      MIN(zfs_max_recordsize, SPA_MAXBLOCKSIZE), ZPROP_SRC_NONE);
 340  326          } else {
 341  327                  spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
 342  328                      SPA_OLD_MAXBLOCKSIZE, ZPROP_SRC_NONE);
 343  329          }
 344  330  
 345  331          if ((dp = list_head(&spa->spa_config_list)) != NULL) {
 346  332                  if (dp->scd_path == NULL) {
 347  333                          spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
 348  334                              "none", 0, ZPROP_SRC_LOCAL);
 349  335                  } else if (strcmp(dp->scd_path, spa_config_path) != 0) {
 350  336                          spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
 351  337                              dp->scd_path, 0, ZPROP_SRC_LOCAL);
 352  338                  }
 353  339          }
 354  340  }
 355  341  
 356  342  /*
 357  343   * Get zpool property values.
 358  344   */
 359  345  int
 360  346  spa_prop_get(spa_t *spa, nvlist_t **nvp)
 361  347  {
 362  348          objset_t *mos = spa->spa_meta_objset;
 363  349          zap_cursor_t zc;
 364  350          zap_attribute_t za;
 365  351          int err;
 366  352  
 367  353          VERIFY(nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 368  354  
 369  355          mutex_enter(&spa->spa_props_lock);
 370  356  
 371  357          /*
 372  358           * Get properties from the spa config.
 373  359           */
 374  360          spa_prop_get_config(spa, nvp);
 375  361  
 376  362          /* If no pool property object, no more prop to get. */
 377  363          if (mos == NULL || spa->spa_pool_props_object == 0) {
 378  364                  mutex_exit(&spa->spa_props_lock);
 379  365                  return (0);
 380  366          }
 381  367  
 382  368          /*
  
 383  369           * Get properties from the MOS pool property object.
 384  370           */
 385  371          for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
 386  372              (err = zap_cursor_retrieve(&zc, &za)) == 0;
 387  373              zap_cursor_advance(&zc)) {
 388  374                  uint64_t intval = 0;
 389  375                  char *strval = NULL;
 390  376                  zprop_source_t src = ZPROP_SRC_DEFAULT;
 391  377                  zpool_prop_t prop;
 392  378  
 393      -                if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
      379 +                if ((prop = zpool_name_to_prop(za.za_name)) == ZPROP_INVAL)
 394  380                          continue;
 395  381  
 396  382                  switch (za.za_integer_length) {
 397  383                  case 8:
 398  384                          /* integer property */
 399  385                          if (za.za_first_integer !=
 400  386                              zpool_prop_default_numeric(prop))
 401  387                                  src = ZPROP_SRC_LOCAL;
 402  388  
 403  389                          if (prop == ZPOOL_PROP_BOOTFS) {
 404  390                                  dsl_pool_t *dp;
 405  391                                  dsl_dataset_t *ds = NULL;
 406  392  
 407  393                                  dp = spa_get_dsl(spa);
 408  394                                  dsl_pool_config_enter(dp, FTAG);
 409  395                                  if (err = dsl_dataset_hold_obj(dp,
 410  396                                      za.za_first_integer, FTAG, &ds)) {
 411  397                                          dsl_pool_config_exit(dp, FTAG);
 412  398                                          break;
 413  399                                  }
 414  400  
 415  401                                  strval = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN,
 416  402                                      KM_SLEEP);
 417  403                                  dsl_dataset_name(ds, strval);
 418  404                                  dsl_dataset_rele(ds, FTAG);
 419  405                                  dsl_pool_config_exit(dp, FTAG);
 420  406                          } else {
 421  407                                  strval = NULL;
 422  408                                  intval = za.za_first_integer;
 423  409                          }
 424  410  
 425  411                          spa_prop_add_list(*nvp, prop, strval, intval, src);
 426  412  
 427  413                          if (strval != NULL)
 428  414                                  kmem_free(strval, ZFS_MAX_DATASET_NAME_LEN);
 429  415  
 430  416                          break;
 431  417  
 432  418                  case 1:
 433  419                          /* string property */
 434  420                          strval = kmem_alloc(za.za_num_integers, KM_SLEEP);
 435  421                          err = zap_lookup(mos, spa->spa_pool_props_object,
 436  422                              za.za_name, 1, za.za_num_integers, strval);
 437  423                          if (err) {
 438  424                                  kmem_free(strval, za.za_num_integers);
 439  425                                  break;
 440  426                          }
 441  427                          spa_prop_add_list(*nvp, prop, strval, 0, src);
 442  428                          kmem_free(strval, za.za_num_integers);
 443  429                          break;
 444  430  
 445  431                  default:
 446  432                          break;
 447  433                  }
 448  434          }
 449  435          zap_cursor_fini(&zc);
 450  436          mutex_exit(&spa->spa_props_lock);
 451  437  out:
 452  438          if (err && err != ENOENT) {
 453  439                  nvlist_free(*nvp);
 454  440                  *nvp = NULL;
 455  441                  return (err);
 456  442          }
 457  443  
 458  444          return (0);
 459  445  }
 460  446  
 461  447  /*
  
 462  448   * Validate the given pool properties nvlist and modify the list
 463  449   * for the property values to be set.
 464  450   */
 465  451  static int
 466  452  spa_prop_validate(spa_t *spa, nvlist_t *props)
 467  453  {
 468  454          nvpair_t *elem;
 469  455          int error = 0, reset_bootfs = 0;
 470  456          uint64_t objnum = 0;
 471  457          boolean_t has_feature = B_FALSE;
      458 +        uint64_t lowat = spa->spa_lowat, hiwat = spa->spa_hiwat,
      459 +            minwat = spa->spa_minwat;
 472  460  
 473  461          elem = NULL;
 474  462          while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
 475  463                  uint64_t intval;
 476  464                  char *strval, *slash, *check, *fname;
 477  465                  const char *propname = nvpair_name(elem);
 478  466                  zpool_prop_t prop = zpool_name_to_prop(propname);
      467 +                spa_feature_t feature;
 479  468  
 480  469                  switch (prop) {
 481      -                case ZPOOL_PROP_INVAL:
      470 +                case ZPROP_INVAL:
 482  471                          if (!zpool_prop_feature(propname)) {
 483  472                                  error = SET_ERROR(EINVAL);
 484  473                                  break;
 485  474                          }
 486  475  
 487  476                          /*
 488  477                           * Sanitize the input.
 489  478                           */
 490  479                          if (nvpair_type(elem) != DATA_TYPE_UINT64) {
 491  480                                  error = SET_ERROR(EINVAL);
 492  481                                  break;
 493  482                          }
 494  483  
 495  484                          if (nvpair_value_uint64(elem, &intval) != 0) {
  
 496  485                                  error = SET_ERROR(EINVAL);
 497  486                                  break;
 498  487                          }
 499  488  
 500  489                          if (intval != 0) {
 501  490                                  error = SET_ERROR(EINVAL);
 502  491                                  break;
 503  492                          }
 504  493  
 505  494                          fname = strchr(propname, '@') + 1;
 506      -                        if (zfeature_lookup_name(fname, NULL) != 0) {
      495 +                        if (zfeature_lookup_name(fname, &feature) != 0) {
 507  496                                  error = SET_ERROR(EINVAL);
 508  497                                  break;
 509  498                          }
 510  499  
      500 +                        if (feature == SPA_FEATURE_WBC &&
      501 +                            !spa_has_special(spa)) {
      502 +                                error = SET_ERROR(ENOTSUP);
      503 +                                break;
      504 +                        }
      505 +
 511  506                          has_feature = B_TRUE;
 512  507                          break;
 513  508  
 514  509                  case ZPOOL_PROP_VERSION:
 515  510                          error = nvpair_value_uint64(elem, &intval);
 516  511                          if (!error &&
 517  512                              (intval < spa_version(spa) ||
 518  513                              intval > SPA_VERSION_BEFORE_FEATURES ||
 519  514                              has_feature))
 520  515                                  error = SET_ERROR(EINVAL);
 521  516                          break;
 522  517  
 523  518                  case ZPOOL_PROP_DELEGATION:
 524  519                  case ZPOOL_PROP_AUTOREPLACE:
 525  520                  case ZPOOL_PROP_LISTSNAPS:
 526  521                  case ZPOOL_PROP_AUTOEXPAND:
      522 +                case ZPOOL_PROP_DEDUP_BEST_EFFORT:
      523 +                case ZPOOL_PROP_DDT_DESEGREGATION:
      524 +                case ZPOOL_PROP_META_PLACEMENT:
      525 +                case ZPOOL_PROP_FORCETRIM:
      526 +                case ZPOOL_PROP_AUTOTRIM:
 527  527                          error = nvpair_value_uint64(elem, &intval);
 528  528                          if (!error && intval > 1)
 529  529                                  error = SET_ERROR(EINVAL);
 530  530                          break;
 531  531  
      532 +                case ZPOOL_PROP_DDT_META_TO_METADEV:
      533 +                case ZPOOL_PROP_ZFS_META_TO_METADEV:
      534 +                        error = nvpair_value_uint64(elem, &intval);
      535 +                        if (!error && intval > META_PLACEMENT_DUAL)
      536 +                                error = SET_ERROR(EINVAL);
      537 +                        break;
      538 +
      539 +                case ZPOOL_PROP_SYNC_TO_SPECIAL:
      540 +                        error = nvpair_value_uint64(elem, &intval);
      541 +                        if (!error && intval > SYNC_TO_SPECIAL_ALWAYS)
      542 +                                error = SET_ERROR(EINVAL);
      543 +                        break;
      544 +
      545 +                case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
      546 +                        error = nvpair_value_uint64(elem, &intval);
      547 +                        if (!error && intval > SPA_MAXBLOCKSIZE)
      548 +                                error = SET_ERROR(EINVAL);
      549 +                        break;
      550 +
 532  551                  case ZPOOL_PROP_BOOTFS:
 533  552                          /*
 534  553                           * If the pool version is less than SPA_VERSION_BOOTFS,
 535  554                           * or the pool is still being created (version == 0),
 536  555                           * the bootfs property cannot be set.
 537  556                           */
 538  557                          if (spa_version(spa) < SPA_VERSION_BOOTFS) {
 539  558                                  error = SET_ERROR(ENOTSUP);
 540  559                                  break;
 541  560                          }
 542  561  
 543  562                          /*
 544  563                           * Make sure the vdev config is bootable
 545  564                           */
 546  565                          if (!vdev_is_bootable(spa->spa_root_vdev)) {
 547  566                                  error = SET_ERROR(ENOTSUP);
 548  567                                  break;
 549  568                          }
 550  569  
 551  570                          reset_bootfs = 1;
 552  571  
 553  572                          error = nvpair_value_string(elem, &strval);
 554  573  
 555  574                          if (!error) {
 556  575                                  objset_t *os;
 557  576                                  uint64_t propval;
 558  577  
 559  578                                  if (strval == NULL || strval[0] == '\0') {
 560  579                                          objnum = zpool_prop_default_numeric(
 561  580                                              ZPOOL_PROP_BOOTFS);
 562  581                                          break;
 563  582                                  }
 564  583  
 565  584                                  if (error = dmu_objset_hold(strval, FTAG, &os))
 566  585                                          break;
 567  586  
 568  587                                  /*
 569  588                                   * Must be ZPL, and its property settings
 570  589                                   * must be supported by GRUB (compression
 571  590                                   * is not gzip, and large blocks are not used).
 572  591                                   */
 573  592  
 574  593                                  if (dmu_objset_type(os) != DMU_OST_ZFS) {
 575  594                                          error = SET_ERROR(ENOTSUP);
 576  595                                  } else if ((error =
 577  596                                      dsl_prop_get_int_ds(dmu_objset_ds(os),
 578  597                                      zfs_prop_to_name(ZFS_PROP_COMPRESSION),
  
 579  598                                      &propval)) == 0 &&
 580  599                                      !BOOTFS_COMPRESS_VALID(propval)) {
 581  600                                          error = SET_ERROR(ENOTSUP);
 582  601                                  } else {
 583  602                                          objnum = dmu_objset_id(os);
 584  603                                  }
 585  604                                  dmu_objset_rele(os, FTAG);
 586  605                          }
 587  606                          break;
 588  607  
      608 +                case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
      609 +                        error = nvpair_value_uint64(elem, &intval);
      610 +                        if ((intval < 0) || (intval > 100) ||
      611 +                            (intval >= spa->spa_dedup_hi_best_effort))
      612 +                                error = SET_ERROR(EINVAL);
      613 +                        break;
      614 +
      615 +                case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
      616 +                        error = nvpair_value_uint64(elem, &intval);
      617 +                        if ((intval < 0) || (intval > 100) ||
      618 +                            (intval <= spa->spa_dedup_lo_best_effort))
      619 +                                error = SET_ERROR(EINVAL);
      620 +                        break;
      621 +
 589  622                  case ZPOOL_PROP_FAILUREMODE:
 590  623                          error = nvpair_value_uint64(elem, &intval);
 591  624                          if (!error && (intval < ZIO_FAILURE_MODE_WAIT ||
 592  625                              intval > ZIO_FAILURE_MODE_PANIC))
 593  626                                  error = SET_ERROR(EINVAL);
 594  627  
 595  628                          /*
 596  629                           * This is a special case which only occurs when
 597  630                           * the pool has completely failed. This allows
 598  631                           * the user to change the in-core failmode property
 599  632                           * without syncing it out to disk (I/Os might
 600  633                           * currently be blocked). We do this by returning
 601  634                           * EIO to the caller (spa_prop_set) to trick it
 602  635                           * into thinking we encountered a property validation
 603  636                           * error.
 604  637                           */
 605  638                          if (!error && spa_suspended(spa)) {
 606  639                                  spa->spa_failmode = intval;
 607  640                                  error = SET_ERROR(EIO);
 608  641                          }
 609  642                          break;
 610  643  
 611  644                  case ZPOOL_PROP_CACHEFILE:
 612  645                          if ((error = nvpair_value_string(elem, &strval)) != 0)
 613  646                                  break;
 614  647  
 615  648                          if (strval[0] == '\0')
 616  649                                  break;
 617  650  
 618  651                          if (strcmp(strval, "none") == 0)
 619  652                                  break;
 620  653  
 621  654                          if (strval[0] != '/') {
 622  655                                  error = SET_ERROR(EINVAL);
 623  656                                  break;
 624  657                          }
 625  658  
 626  659                          slash = strrchr(strval, '/');
 627  660                          ASSERT(slash != NULL);
 628  661  
 629  662                          if (slash[1] == '\0' || strcmp(slash, "/.") == 0 ||
 630  663                              strcmp(slash, "/..") == 0)
 631  664                                  error = SET_ERROR(EINVAL);
 632  665                          break;
 633  666  
 634  667                  case ZPOOL_PROP_COMMENT:
 635  668                          if ((error = nvpair_value_string(elem, &strval)) != 0)
 636  669                                  break;
 637  670                          for (check = strval; *check != '\0'; check++) {
 638  671                                  /*
 639  672                                   * The kernel doesn't have an easy isprint()
  
 640  673                                   * check.  For this kernel check, we merely
 641  674                                   * check ASCII apart from DEL.  Fix this if
 642  675                                   * there is an easy-to-use kernel isprint().
 643  676                                   */
 644  677                                  if (*check >= 0x7f) {
 645  678                                          error = SET_ERROR(EINVAL);
 646  679                                          break;
 647  680                                  }
 648  681                          }
 649  682                          if (strlen(strval) > ZPROP_MAX_COMMENT)
 650      -                                error = E2BIG;
      683 +                                error = SET_ERROR(E2BIG);
 651  684                          break;
 652  685  
 653  686                  case ZPOOL_PROP_DEDUPDITTO:
 654  687                          if (spa_version(spa) < SPA_VERSION_DEDUP)
 655  688                                  error = SET_ERROR(ENOTSUP);
 656  689                          else
 657  690                                  error = nvpair_value_uint64(elem, &intval);
 658  691                          if (error == 0 &&
 659  692                              intval != 0 && intval < ZIO_DEDUPDITTO_MIN)
 660  693                                  error = SET_ERROR(EINVAL);
 661  694                          break;
      695 +
      696 +                case ZPOOL_PROP_MINWATERMARK:
      697 +                        error = nvpair_value_uint64(elem, &intval);
      698 +                        if (!error && (intval > 100))
      699 +                                error = SET_ERROR(EINVAL);
      700 +                        minwat = intval;
      701 +                        break;
      702 +                case ZPOOL_PROP_LOWATERMARK:
      703 +                        error = nvpair_value_uint64(elem, &intval);
      704 +                        if (!error && (intval > 100))
      705 +                                error = SET_ERROR(EINVAL);
      706 +                        lowat = intval;
      707 +                        break;
      708 +                case ZPOOL_PROP_HIWATERMARK:
      709 +                        error = nvpair_value_uint64(elem, &intval);
      710 +                        if (!error && (intval > 100))
      711 +                                error = SET_ERROR(EINVAL);
      712 +                        hiwat = intval;
      713 +                        break;
      714 +                case ZPOOL_PROP_DEDUPMETA_DITTO:
      715 +                        error = nvpair_value_uint64(elem, &intval);
      716 +                        if (!error && (intval > SPA_DVAS_PER_BP))
      717 +                                error = SET_ERROR(EINVAL);
      718 +                        break;
      719 +                case ZPOOL_PROP_SCRUB_PRIO:
      720 +                case ZPOOL_PROP_RESILVER_PRIO:
      721 +                        error = nvpair_value_uint64(elem, &intval);
      722 +                        if (error || intval > 100)
      723 +                                error = SET_ERROR(EINVAL);
      724 +                        break;
 662  725                  }
 663  726  
 664  727                  if (error)
 665  728                          break;
 666  729          }
 667  730  
      731 +        /* check if low watermark is less than high watermark */
      732 +        if (lowat != 0 && lowat >= hiwat)
      733 +                error = SET_ERROR(EINVAL);
      734 +
      735 +        /* check if min watermark is less than low watermark */
      736 +        if (minwat != 0 && minwat >= lowat)
      737 +                error = SET_ERROR(EINVAL);
      738 +
 668  739          if (!error && reset_bootfs) {
 669  740                  error = nvlist_remove(props,
 670  741                      zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
 671  742  
 672  743                  if (!error) {
 673  744                          error = nvlist_add_uint64(props,
 674  745                              zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
 675  746                  }
 676  747          }
 677  748  
 678  749          return (error);
 679  750  }
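
Editor's note: the new watermark cases above validate each value as a percentage (0-100) and the code after the property loop cross-checks their ordering. A minimal sketch of the combined rule, assuming all three properties arrive in one nvlist; the helper name and values are hypothetical and not part of the webrev:

static int
check_watermark_order(uint64_t minwat, uint64_t lowat, uint64_t hiwat)
{
	/* each watermark is expressed as a percentage (0-100) */
	if (minwat > 100 || lowat > 100 || hiwat > 100)
		return (SET_ERROR(EINVAL));
	/* the low watermark must stay below the high watermark */
	if (lowat != 0 && lowat >= hiwat)
		return (SET_ERROR(EINVAL));
	/* the min watermark must stay below the low watermark */
	if (minwat != 0 && minwat >= lowat)
		return (SET_ERROR(EINVAL));
	return (0);
}
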
 680  751  
 681  752  void
 682  753  spa_configfile_set(spa_t *spa, nvlist_t *nvp, boolean_t need_sync)
 683  754  {
 684  755          char *cachefile;
 685  756          spa_config_dirent_t *dp;
 686  757  
 687  758          if (nvlist_lookup_string(nvp, zpool_prop_to_name(ZPOOL_PROP_CACHEFILE),
 688  759              &cachefile) != 0)
 689  760                  return;
 690  761  
 691  762          dp = kmem_alloc(sizeof (spa_config_dirent_t),
 692  763              KM_SLEEP);
 693  764  
 694  765          if (cachefile[0] == '\0')
 695  766                  dp->scd_path = spa_strdup(spa_config_path);
 696  767          else if (strcmp(cachefile, "none") == 0)
 697  768                  dp->scd_path = NULL;
 698  769          else
 699  770                  dp->scd_path = spa_strdup(cachefile);
 700  771  
 701  772          list_insert_head(&spa->spa_config_list, dp);
 702  773          if (need_sync)
 703  774                  spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
 704  775  }
 705  776  
 706  777  int
 707  778  spa_prop_set(spa_t *spa, nvlist_t *nvp)
 708  779  {
 709  780          int error;
 710  781          nvpair_t *elem = NULL;
 711  782          boolean_t need_sync = B_FALSE;
 712  783  
 713  784          if ((error = spa_prop_validate(spa, nvp)) != 0)
  
 714  785                  return (error);
 715  786  
 716  787          while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
 717  788                  zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
 718  789  
 719  790                  if (prop == ZPOOL_PROP_CACHEFILE ||
 720  791                      prop == ZPOOL_PROP_ALTROOT ||
 721  792                      prop == ZPOOL_PROP_READONLY)
 722  793                          continue;
 723  794  
 724      -                if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
      795 +                if (prop == ZPOOL_PROP_VERSION || prop == ZPROP_INVAL) {
 725  796                          uint64_t ver;
 726  797  
 727  798                          if (prop == ZPOOL_PROP_VERSION) {
 728  799                                  VERIFY(nvpair_value_uint64(elem, &ver) == 0);
 729  800                          } else {
 730  801                                  ASSERT(zpool_prop_feature(nvpair_name(elem)));
 731  802                                  ver = SPA_VERSION_FEATURES;
 732  803                                  need_sync = B_TRUE;
 733  804                          }
 734  805  
 735  806                          /* Save time if the version is already set. */
 736  807                          if (ver == spa_version(spa))
 737  808                                  continue;
 738  809  
 739  810                          /*
 740  811                           * In addition to the pool directory object, we might
 741  812                           * create the pool properties object, the features for
 742  813                           * read object, the features for write object, or the
 743  814                           * feature descriptions object.
 744  815                           */
 745  816                          error = dsl_sync_task(spa->spa_name, NULL,
 746  817                              spa_sync_version, &ver,
 747  818                              6, ZFS_SPACE_CHECK_RESERVED);
 748  819                          if (error)
 749  820                                  return (error);
 750  821                          continue;
 751  822                  }
 752  823  
 753  824                  need_sync = B_TRUE;
 754  825                  break;
 755  826          }
 756  827  
 757  828          if (need_sync) {
 758  829                  return (dsl_sync_task(spa->spa_name, NULL, spa_sync_props,
 759  830                      nvp, 6, ZFS_SPACE_CHECK_RESERVED));
 760  831          }
 761  832  
 762  833          return (0);
 763  834  }
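
Editor's note: for context, a sketch of how a kernel caller might feed a single pool property through spa_prop_set(); spa_prop_validate() runs first and rejects out-of-range values. The wrapper name and the chosen property are hypothetical:

static int
example_set_hiwatermark(spa_t *spa, uint64_t pct)
{
	nvlist_t *props;
	int error;

	VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP) == 0);
	VERIFY(nvlist_add_uint64(props,
	    zpool_prop_to_name(ZPOOL_PROP_HIWATERMARK), pct) == 0);

	error = spa_prop_set(spa, props);	/* EINVAL if pct > 100 */
	nvlist_free(props);
	return (error);
}
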
 764  835  
 765  836  /*
 766  837   * If the bootfs property value is dsobj, clear it.
 767  838   */
 768  839  void
 769  840  spa_prop_clear_bootfs(spa_t *spa, uint64_t dsobj, dmu_tx_t *tx)
 770  841  {
 771  842          if (spa->spa_bootfs == dsobj && spa->spa_pool_props_object != 0) {
 772  843                  VERIFY(zap_remove(spa->spa_meta_objset,
 773  844                      spa->spa_pool_props_object,
 774  845                      zpool_prop_to_name(ZPOOL_PROP_BOOTFS), tx) == 0);
 775  846                  spa->spa_bootfs = 0;
 776  847          }
 777  848  }
 778  849  
 779  850  /*ARGSUSED*/
 780  851  static int
 781  852  spa_change_guid_check(void *arg, dmu_tx_t *tx)
 782  853  {
 783  854          uint64_t *newguid = arg;
 784  855          spa_t *spa = dmu_tx_pool(tx)->dp_spa;
 785  856          vdev_t *rvd = spa->spa_root_vdev;
 786  857          uint64_t vdev_state;
 787  858  
 788  859          spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
 789  860          vdev_state = rvd->vdev_state;
 790  861          spa_config_exit(spa, SCL_STATE, FTAG);
 791  862  
 792  863          if (vdev_state != VDEV_STATE_HEALTHY)
 793  864                  return (SET_ERROR(ENXIO));
 794  865  
 795  866          ASSERT3U(spa_guid(spa), !=, *newguid);
 796  867  
 797  868          return (0);
 798  869  }
 799  870  
 800  871  static void
 801  872  spa_change_guid_sync(void *arg, dmu_tx_t *tx)
 802  873  {
 803  874          uint64_t *newguid = arg;
 804  875          spa_t *spa = dmu_tx_pool(tx)->dp_spa;
 805  876          uint64_t oldguid;
 806  877          vdev_t *rvd = spa->spa_root_vdev;
 807  878  
 808  879          oldguid = spa_guid(spa);
 809  880  
 810  881          spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
 811  882          rvd->vdev_guid = *newguid;
 812  883          rvd->vdev_guid_sum += (*newguid - oldguid);
 813  884          vdev_config_dirty(rvd);
 814  885          spa_config_exit(spa, SCL_STATE, FTAG);
 815  886  
 816  887          spa_history_log_internal(spa, "guid change", tx, "old=%llu new=%llu",
 817  888              oldguid, *newguid);
 818  889  }
 819  890  
 820  891  /*
 821  892   * Change the GUID for the pool.  This is done so that we can later
 822  893   * re-import a pool built from a clone of our own vdevs.  We will modify
 823  894   * the root vdev's guid, our own pool guid, and then mark all of our
 824  895   * vdevs dirty.  Note that we must make sure that all our vdevs are
 825  896   * online when we do this, or else any vdevs that weren't present
 826  897   * would be orphaned from our pool.  We are also going to issue a
 827  898   * sysevent to update any watchers.
 828  899   */
 829  900  int
 830  901  spa_change_guid(spa_t *spa)
 831  902  {
 832  903          int error;
  
 833  904          uint64_t guid;
 834  905  
 835  906          mutex_enter(&spa->spa_vdev_top_lock);
 836  907          mutex_enter(&spa_namespace_lock);
 837  908          guid = spa_generate_guid(NULL);
 838  909  
 839  910          error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
 840  911              spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
 841  912  
 842  913          if (error == 0) {
 843      -                spa_write_cachefile(spa, B_FALSE, B_TRUE);
      914 +                spa_config_sync(spa, B_FALSE, B_TRUE);
 844  915                  spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
 845  916          }
 846  917  
 847  918          mutex_exit(&spa_namespace_lock);
 848  919          mutex_exit(&spa->spa_vdev_top_lock);
 849  920  
 850  921          return (error);
 851  922  }
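
Editor's note: the usual caller of spa_change_guid() is the pool re-guid ioctl, which opens the pool, changes the GUID, and closes it again. A sketch of that pattern, with the function name simplified:

static int
example_pool_reguid(const char *poolname)
{
	spa_t *spa;
	int error;

	error = spa_open(poolname, &spa, FTAG);
	if (error == 0) {
		error = spa_change_guid(spa);
		spa_close(spa, FTAG);
	}
	return (error);
}
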
 852  923  
 853  924  /*
 854  925   * ==========================================================================
 855  926   * SPA state manipulation (open/create/destroy/import/export)
 856  927   * ==========================================================================
 857  928   */
 858  929  
 859  930  static int
 860  931  spa_error_entry_compare(const void *a, const void *b)
 861  932  {
 862  933          spa_error_entry_t *sa = (spa_error_entry_t *)a;
 863  934          spa_error_entry_t *sb = (spa_error_entry_t *)b;
 864  935          int ret;
 865  936  
 866  937          ret = bcmp(&sa->se_bookmark, &sb->se_bookmark,
 867  938              sizeof (zbookmark_phys_t));
 868  939  
 869  940          if (ret < 0)
 870  941                  return (-1);
 871  942          else if (ret > 0)
 872  943                  return (1);
 873  944          else
 874  945                  return (0);
 875  946  }
 876  947  
 877  948  /*
 878  949   * Utility function which retrieves copies of the current logs and
 879  950   * re-initializes them in the process.
 880  951   */
 881  952  void
 882  953  spa_get_errlists(spa_t *spa, avl_tree_t *last, avl_tree_t *scrub)
 883  954  {
 884  955          ASSERT(MUTEX_HELD(&spa->spa_errlist_lock));
 885  956  
 886  957          bcopy(&spa->spa_errlist_last, last, sizeof (avl_tree_t));
 887  958          bcopy(&spa->spa_errlist_scrub, scrub, sizeof (avl_tree_t));
 888  959  
 889  960          avl_create(&spa->spa_errlist_scrub,
 890  961              spa_error_entry_compare, sizeof (spa_error_entry_t),
 891  962              offsetof(spa_error_entry_t, se_avl));
 892  963          avl_create(&spa->spa_errlist_last,
 893  964              spa_error_entry_compare, sizeof (spa_error_entry_t),
 894  965              offsetof(spa_error_entry_t, se_avl));
 895  966  }
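
Editor's note: spa_get_errlists() hands ownership of the current error trees to the caller and leaves fresh, empty trees behind; the caller is expected to hold spa_errlist_lock across the call. A sketch of that usage (helper name hypothetical, tree processing elided):

static void
example_take_errlists(spa_t *spa)
{
	avl_tree_t last, scrub;

	mutex_enter(&spa->spa_errlist_lock);
	spa_get_errlists(spa, &last, &scrub);
	mutex_exit(&spa->spa_errlist_lock);

	/* ... walk and empty 'last' and 'scrub', then avl_destroy() them ... */
}
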
 896  967  
 897  968  static void
 898  969  spa_taskqs_init(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
 899  970  {
 900  971          const zio_taskq_info_t *ztip = &zio_taskqs[t][q];
 901  972          enum zti_modes mode = ztip->zti_mode;
 902  973          uint_t value = ztip->zti_value;
 903  974          uint_t count = ztip->zti_count;
 904  975          spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
 905  976          char name[32];
 906  977          uint_t flags = 0;
 907  978          boolean_t batch = B_FALSE;
 908  979  
 909  980          if (mode == ZTI_MODE_NULL) {
 910  981                  tqs->stqs_count = 0;
 911  982                  tqs->stqs_taskq = NULL;
 912  983                  return;
 913  984          }
 914  985  
 915  986          ASSERT3U(count, >, 0);
 916  987  
 917  988          tqs->stqs_count = count;
 918  989          tqs->stqs_taskq = kmem_alloc(count * sizeof (taskq_t *), KM_SLEEP);
 919  990  
 920  991          switch (mode) {
 921  992          case ZTI_MODE_FIXED:
 922  993                  ASSERT3U(value, >=, 1);
 923  994                  value = MAX(value, 1);
 924  995                  break;
 925  996  
 926  997          case ZTI_MODE_BATCH:
 927  998                  batch = B_TRUE;
 928  999                  flags |= TASKQ_THREADS_CPU_PCT;
 929 1000                  value = zio_taskq_batch_pct;
 930 1001                  break;
 931 1002  
 932 1003          default:
 933 1004                  panic("unrecognized mode for %s_%s taskq (%u:%u) in "
 934 1005                      "spa_activate()",
 935 1006                      zio_type_name[t], zio_taskq_types[q], mode, value);
 936 1007                  break;
 937 1008          }
 938 1009  
 939 1010          for (uint_t i = 0; i < count; i++) {
 940 1011                  taskq_t *tq;
 941 1012  
 942 1013                  if (count > 1) {
 943 1014                          (void) snprintf(name, sizeof (name), "%s_%s_%u",
 944 1015                              zio_type_name[t], zio_taskq_types[q], i);
 945 1016                  } else {
 946 1017                          (void) snprintf(name, sizeof (name), "%s_%s",
 947 1018                              zio_type_name[t], zio_taskq_types[q]);
 948 1019                  }
 949 1020  
 950 1021                  if (zio_taskq_sysdc && spa->spa_proc != &p0) {
 951 1022                          if (batch)
 952 1023                                  flags |= TASKQ_DC_BATCH;
 953 1024  
 954 1025                          tq = taskq_create_sysdc(name, value, 50, INT_MAX,
 955 1026                              spa->spa_proc, zio_taskq_basedc, flags);
 956 1027                  } else {
 957 1028                          pri_t pri = maxclsyspri;
 958 1029                          /*
 959 1030                           * The write issue taskq can be extremely CPU
 960 1031                           * intensive.  Run it at slightly lower priority
 961 1032                           * than the other taskqs.
 962 1033                           */
 963 1034                          if (t == ZIO_TYPE_WRITE && q == ZIO_TASKQ_ISSUE)
 964 1035                                  pri--;
 965 1036  
 966 1037                          tq = taskq_create_proc(name, value, pri, 50,
 967 1038                              INT_MAX, spa->spa_proc, flags);
 968 1039                  }
 969 1040  
 970 1041                  tqs->stqs_taskq[i] = tq;
 971 1042          }
 972 1043  }
 973 1044  
 974 1045  static void
 975 1046  spa_taskqs_fini(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
 976 1047  {
 977 1048          spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
 978 1049  
 979 1050          if (tqs->stqs_taskq == NULL) {
 980 1051                  ASSERT0(tqs->stqs_count);
 981 1052                  return;
 982 1053          }
 983 1054  
 984 1055          for (uint_t i = 0; i < tqs->stqs_count; i++) {
 985 1056                  ASSERT3P(tqs->stqs_taskq[i], !=, NULL);
 986 1057                  taskq_destroy(tqs->stqs_taskq[i]);
 987 1058          }
 988 1059  
 989 1060          kmem_free(tqs->stqs_taskq, tqs->stqs_count * sizeof (taskq_t *));
 990 1061          tqs->stqs_taskq = NULL;
 991 1062  }
 992 1063  
 993 1064  /*
 994 1065   * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
 995 1066   * Note that a type may have multiple discrete taskqs to avoid lock contention
 996 1067   * on the taskq itself. In that case we choose which taskq at random by using
 997 1068   * the low bits of gethrtime().
 998 1069   */
 999 1070  void
1000 1071  spa_taskq_dispatch_ent(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
1001 1072      task_func_t *func, void *arg, uint_t flags, taskq_ent_t *ent)
1002 1073  {
1003 1074          spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
1004 1075          taskq_t *tq;
1005 1076  
1006 1077          ASSERT3P(tqs->stqs_taskq, !=, NULL);
1007 1078          ASSERT3U(tqs->stqs_count, !=, 0);
1008 1079  
1009 1080          if (tqs->stqs_count == 1) {
1010 1081                  tq = tqs->stqs_taskq[0];
1011 1082          } else {
1012 1083                  tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
1013 1084          }
1014 1085  
1015 1086          taskq_dispatch_ent(tq, func, arg, flags, ent);
1016 1087  }
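
Editor's note: a sketch of the typical caller, modeled on the way zio.c hands a zio to its issue taskq; the wrapper name is hypothetical and flag handling is simplified:

static void
example_dispatch_write_issue(spa_t *spa, zio_t *zio)
{
	/* the taskq entry is embedded in the zio, so dispatch cannot fail */
	spa_taskq_dispatch_ent(spa, ZIO_TYPE_WRITE, ZIO_TASKQ_ISSUE,
	    (task_func_t *)zio_execute, zio, TQ_SLEEP, &zio->io_tqent);
}
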
1017 1088  
1018 1089  static void
1019 1090  spa_create_zio_taskqs(spa_t *spa)
1020 1091  {
1021 1092          for (int t = 0; t < ZIO_TYPES; t++) {
1022 1093                  for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1023 1094                          spa_taskqs_init(spa, t, q);
1024 1095                  }
1025 1096          }
1026 1097  }
1027 1098  
1028 1099  #ifdef _KERNEL
1029 1100  static void
1030 1101  spa_thread(void *arg)
1031 1102  {
1032 1103          callb_cpr_t cprinfo;
1033 1104  
1034 1105          spa_t *spa = arg;
1035 1106          user_t *pu = PTOU(curproc);
1036 1107  
1037 1108          CALLB_CPR_INIT(&cprinfo, &spa->spa_proc_lock, callb_generic_cpr,
1038 1109              spa->spa_name);
1039 1110  
1040 1111          ASSERT(curproc != &p0);
1041 1112          (void) snprintf(pu->u_psargs, sizeof (pu->u_psargs),
1042 1113              "zpool-%s", spa->spa_name);
1043 1114          (void) strlcpy(pu->u_comm, pu->u_psargs, sizeof (pu->u_comm));
1044 1115  
1045 1116          /* bind this thread to the requested psrset */
1046 1117          if (zio_taskq_psrset_bind != PS_NONE) {
1047 1118                  pool_lock();
1048 1119                  mutex_enter(&cpu_lock);
1049 1120                  mutex_enter(&pidlock);
1050 1121                  mutex_enter(&curproc->p_lock);
1051 1122  
1052 1123                  if (cpupart_bind_thread(curthread, zio_taskq_psrset_bind,
1053 1124                      0, NULL, NULL) == 0)  {
1054 1125                          curthread->t_bind_pset = zio_taskq_psrset_bind;
1055 1126                  } else {
1056 1127                          cmn_err(CE_WARN,
1057 1128                              "Couldn't bind process for zfs pool \"%s\" to "
1058 1129                              "pset %d\n", spa->spa_name, zio_taskq_psrset_bind);
1059 1130                  }
1060 1131  
1061 1132                  mutex_exit(&curproc->p_lock);
1062 1133                  mutex_exit(&pidlock);
1063 1134                  mutex_exit(&cpu_lock);
1064 1135                  pool_unlock();
1065 1136          }
1066 1137  
1067 1138          if (zio_taskq_sysdc) {
1068 1139                  sysdc_thread_enter(curthread, 100, 0);
1069 1140          }
1070 1141  
1071 1142          spa->spa_proc = curproc;
1072 1143          spa->spa_did = curthread->t_did;
1073 1144  
1074 1145          spa_create_zio_taskqs(spa);
1075 1146  
1076 1147          mutex_enter(&spa->spa_proc_lock);
1077 1148          ASSERT(spa->spa_proc_state == SPA_PROC_CREATED);
1078 1149  
1079 1150          spa->spa_proc_state = SPA_PROC_ACTIVE;
1080 1151          cv_broadcast(&spa->spa_proc_cv);
1081 1152  
1082 1153          CALLB_CPR_SAFE_BEGIN(&cprinfo);
1083 1154          while (spa->spa_proc_state == SPA_PROC_ACTIVE)
1084 1155                  cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1085 1156          CALLB_CPR_SAFE_END(&cprinfo, &spa->spa_proc_lock);
1086 1157  
1087 1158          ASSERT(spa->spa_proc_state == SPA_PROC_DEACTIVATE);
1088 1159          spa->spa_proc_state = SPA_PROC_GONE;
1089 1160          spa->spa_proc = &p0;
1090 1161          cv_broadcast(&spa->spa_proc_cv);
1091 1162          CALLB_CPR_EXIT(&cprinfo);       /* drops spa_proc_lock */
1092 1163  
1093 1164          mutex_enter(&curproc->p_lock);
1094 1165          lwp_exit();
1095 1166  }
1096 1167  #endif
1097 1168  
1098 1169  /*
1099 1170   * Activate an uninitialized pool.
1100 1171   */
  
1101 1172  static void
1102 1173  spa_activate(spa_t *spa, int mode)
1103 1174  {
1104 1175          ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
1105 1176  
1106 1177          spa->spa_state = POOL_STATE_ACTIVE;
1107 1178          spa->spa_mode = mode;
1108 1179  
1109 1180          spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
1110 1181          spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);
     1182 +        spa->spa_special_class = metaslab_class_create(spa, zfs_metaslab_ops);
1111 1183  
1112 1184          /* Try to create a covering process */
1113 1185          mutex_enter(&spa->spa_proc_lock);
1114 1186          ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
1115 1187          ASSERT(spa->spa_proc == &p0);
1116 1188          spa->spa_did = 0;
1117 1189  
1118 1190          /* Only create a process if we're going to be around a while. */
1119 1191          if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
1120 1192                  if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
1121 1193                      NULL, 0) == 0) {
1122 1194                          spa->spa_proc_state = SPA_PROC_CREATED;
1123 1195                          while (spa->spa_proc_state == SPA_PROC_CREATED) {
1124 1196                                  cv_wait(&spa->spa_proc_cv,
1125 1197                                      &spa->spa_proc_lock);
1126 1198                          }
1127 1199                          ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1128 1200                          ASSERT(spa->spa_proc != &p0);
1129 1201                          ASSERT(spa->spa_did != 0);
1130 1202                  } else {
1131 1203  #ifdef _KERNEL
1132 1204                          cmn_err(CE_WARN,
1133 1205                              "Couldn't create process for zfs pool \"%s\"\n",
1134 1206                              spa->spa_name);
  
1135 1207  #endif
1136 1208                  }
1137 1209          }
1138 1210          mutex_exit(&spa->spa_proc_lock);
1139 1211  
1140 1212          /* If we didn't create a process, we need to create our taskqs. */
1141 1213          if (spa->spa_proc == &p0) {
1142 1214                  spa_create_zio_taskqs(spa);
1143 1215          }
1144 1216  
1145      -        for (size_t i = 0; i < TXG_SIZE; i++)
1146      -                spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0);
1147      -
1148 1217          list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
1149 1218              offsetof(vdev_t, vdev_config_dirty_node));
1150 1219          list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
1151 1220              offsetof(objset_t, os_evicting_node));
1152 1221          list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
1153 1222              offsetof(vdev_t, vdev_state_dirty_node));
1154 1223  
1155 1224          txg_list_create(&spa->spa_vdev_txg_list, spa,
1156 1225              offsetof(struct vdev, vdev_txg_node));
1157 1226  
1158 1227          avl_create(&spa->spa_errlist_scrub,
1159 1228              spa_error_entry_compare, sizeof (spa_error_entry_t),
1160 1229              offsetof(spa_error_entry_t, se_avl));
1161 1230          avl_create(&spa->spa_errlist_last,
1162 1231              spa_error_entry_compare, sizeof (spa_error_entry_t),
1163 1232              offsetof(spa_error_entry_t, se_avl));
1164 1233  }
1165 1234  
1166 1235  /*
1167 1236   * Opposite of spa_activate().
1168 1237   */
1169 1238  static void
1170 1239  spa_deactivate(spa_t *spa)
1171 1240  {
1172 1241          ASSERT(spa->spa_sync_on == B_FALSE);
1173 1242          ASSERT(spa->spa_dsl_pool == NULL);
1174 1243          ASSERT(spa->spa_root_vdev == NULL);
1175 1244          ASSERT(spa->spa_async_zio_root == NULL);
1176 1245          ASSERT(spa->spa_state != POOL_STATE_UNINITIALIZED);
1177 1246  
1178 1247          spa_evicting_os_wait(spa);
1179 1248  
1180 1249          txg_list_destroy(&spa->spa_vdev_txg_list);
1181 1250  
  
1182 1251          list_destroy(&spa->spa_config_dirty_list);
1183 1252          list_destroy(&spa->spa_evicting_os_list);
1184 1253          list_destroy(&spa->spa_state_dirty_list);
1185 1254  
1186 1255          for (int t = 0; t < ZIO_TYPES; t++) {
1187 1256                  for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1188 1257                          spa_taskqs_fini(spa, t, q);
1189 1258                  }
1190 1259          }
1191 1260  
1192      -        for (size_t i = 0; i < TXG_SIZE; i++) {
1193      -                ASSERT3P(spa->spa_txg_zio[i], !=, NULL);
1194      -                VERIFY0(zio_wait(spa->spa_txg_zio[i]));
1195      -                spa->spa_txg_zio[i] = NULL;
1196      -        }
1197      -
1198 1261          metaslab_class_destroy(spa->spa_normal_class);
1199 1262          spa->spa_normal_class = NULL;
1200 1263  
1201 1264          metaslab_class_destroy(spa->spa_log_class);
1202 1265          spa->spa_log_class = NULL;
1203 1266  
     1267 +        metaslab_class_destroy(spa->spa_special_class);
     1268 +        spa->spa_special_class = NULL;
     1269 +
1204 1270          /*
1205 1271           * If this was part of an import or the open otherwise failed, we may
1206 1272           * still have errors left in the queues.  Empty them just in case.
1207 1273           */
1208 1274          spa_errlog_drain(spa);
1209 1275  
1210 1276          avl_destroy(&spa->spa_errlist_scrub);
1211 1277          avl_destroy(&spa->spa_errlist_last);
1212 1278  
1213 1279          spa->spa_state = POOL_STATE_UNINITIALIZED;
1214 1280  
1215 1281          mutex_enter(&spa->spa_proc_lock);
1216 1282          if (spa->spa_proc_state != SPA_PROC_NONE) {
1217 1283                  ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1218 1284                  spa->spa_proc_state = SPA_PROC_DEACTIVATE;
1219 1285                  cv_broadcast(&spa->spa_proc_cv);
1220 1286                  while (spa->spa_proc_state == SPA_PROC_DEACTIVATE) {
1221 1287                          ASSERT(spa->spa_proc != &p0);
1222 1288                          cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1223 1289                  }
1224 1290                  ASSERT(spa->spa_proc_state == SPA_PROC_GONE);
1225 1291                  spa->spa_proc_state = SPA_PROC_NONE;
1226 1292          }
1227 1293          ASSERT(spa->spa_proc == &p0);
1228 1294          mutex_exit(&spa->spa_proc_lock);
1229 1295  
1230 1296          /*
1231 1297           * We want to make sure spa_thread() has actually exited the ZFS
1232 1298           * module, so that the module can't be unloaded out from underneath
1233 1299           * it.
1234 1300           */
1235 1301          if (spa->spa_did != 0) {
1236 1302                  thread_join(spa->spa_did);
1237 1303                  spa->spa_did = 0;
1238 1304          }
1239 1305  }
1240 1306  
1241 1307  /*
1242 1308   * Verify a pool configuration, and construct the vdev tree appropriately.  This
1243 1309   * will create all the necessary vdevs in the appropriate layout, with each vdev
1244 1310   * in the CLOSED state.  This will prep the pool before open/creation/import.
1245 1311   * All vdev validation is done by the vdev_alloc() routine.
1246 1312   */
1247 1313  static int
1248 1314  spa_config_parse(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent,
1249 1315      uint_t id, int atype)
1250 1316  {
1251 1317          nvlist_t **child;
1252 1318          uint_t children;
1253 1319          int error;
1254 1320  
1255 1321          if ((error = vdev_alloc(spa, vdp, nv, parent, id, atype)) != 0)
1256 1322                  return (error);
1257 1323  
1258 1324          if ((*vdp)->vdev_ops->vdev_op_leaf)
1259 1325                  return (0);
1260 1326  
1261 1327          error = nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
1262 1328              &child, &children);
1263 1329  
1264 1330          if (error == ENOENT)
1265 1331                  return (0);
1266 1332  
1267 1333          if (error) {
1268 1334                  vdev_free(*vdp);
1269 1335                  *vdp = NULL;
1270 1336                  return (SET_ERROR(EINVAL));
1271 1337          }
1272 1338  
1273 1339          for (int c = 0; c < children; c++) {
1274 1340                  vdev_t *vd;
1275 1341                  if ((error = spa_config_parse(spa, &vd, child[c], *vdp, c,
1276 1342                      atype)) != 0) {
1277 1343                          vdev_free(*vdp);
1278 1344                          *vdp = NULL;
1279 1345                          return (error);
1280 1346                  }
1281 1347          }
1282 1348  
1283 1349          ASSERT(*vdp != NULL);
1284 1350  
1285 1351          return (0);
1286 1352  }
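
Editor's note: spa_config_parse() is normally pointed at the ZPOOL_CONFIG_VDEV_TREE nvlist of a pool config, with the SCL_ALL config locks held as writer. A sketch of that call pattern (wrapper name hypothetical, lock handling shown explicitly):

static int
example_parse_root_vdev(spa_t *spa, nvlist_t *config, vdev_t **rvdp)
{
	nvlist_t *nvroot;
	int error;

	VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
	    &nvroot) == 0);

	spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
	error = spa_config_parse(spa, rvdp, nvroot, NULL, 0,
	    VDEV_ALLOC_LOAD);
	spa_config_exit(spa, SCL_ALL, FTAG);

	return (error);
}
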
1287 1353  
  
1288 1354  /*
1289 1355   * Opposite of spa_load().
1290 1356   */
1291 1357  static void
1292 1358  spa_unload(spa_t *spa)
1293 1359  {
1294 1360          int i;
1295 1361  
1296 1362          ASSERT(MUTEX_HELD(&spa_namespace_lock));
1297 1363  
1298      -        spa_load_note(spa, "UNLOADING");
     1364 +        /*
     1365 +         * Stop manual trim before stopping spa sync, because manual trim
     1366 +         * needs to execute a synctask (trim timestamp sync) at the end.
     1367 +         */
     1368 +        mutex_enter(&spa->spa_auto_trim_lock);
     1369 +        mutex_enter(&spa->spa_man_trim_lock);
     1370 +        spa_trim_stop_wait(spa);
     1371 +        mutex_exit(&spa->spa_man_trim_lock);
     1372 +        mutex_exit(&spa->spa_auto_trim_lock);
1299 1373  
1300 1374          /*
1301 1375           * Stop async tasks.
1302 1376           */
1303 1377          spa_async_suspend(spa);
1304 1378  
1305 1379          /*
1306 1380           * Stop syncing.
1307 1381           */
1308 1382          if (spa->spa_sync_on) {
1309 1383                  txg_sync_stop(spa->spa_dsl_pool);
1310 1384                  spa->spa_sync_on = B_FALSE;
1311 1385          }
1312 1386  
1313 1387          /*
1314 1388           * Even though vdev_free() also calls vdev_metaslab_fini, we need
1315 1389           * to call it earlier, before we wait for async i/o to complete.
1316 1390           * This ensures that there is no async metaslab prefetching, by
1317 1391           * calling taskq_wait(mg_taskq).
1318 1392           */
1319 1393          if (spa->spa_root_vdev != NULL) {
1320 1394                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1321 1395                  for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++)
1322 1396                          vdev_metaslab_fini(spa->spa_root_vdev->vdev_child[c]);
1323 1397                  spa_config_exit(spa, SCL_ALL, FTAG);
1324 1398          }
1325 1399  
  
1326 1400          /*
1327 1401           * Wait for any outstanding async I/O to complete.
1328 1402           */
1329 1403          if (spa->spa_async_zio_root != NULL) {
1330 1404                  for (int i = 0; i < max_ncpus; i++)
1331 1405                          (void) zio_wait(spa->spa_async_zio_root[i]);
1332 1406                  kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
1333 1407                  spa->spa_async_zio_root = NULL;
1334 1408          }
1335 1409  
1336      -        if (spa->spa_vdev_removal != NULL) {
1337      -                spa_vdev_removal_destroy(spa->spa_vdev_removal);
1338      -                spa->spa_vdev_removal = NULL;
1339      -        }
1340      -
1341      -        if (spa->spa_condense_zthr != NULL) {
1342      -                ASSERT(!zthr_isrunning(spa->spa_condense_zthr));
1343      -                zthr_destroy(spa->spa_condense_zthr);
1344      -                spa->spa_condense_zthr = NULL;
1345      -        }
1346      -
1347      -        spa_condense_fini(spa);
1348      -
1349 1410          bpobj_close(&spa->spa_deferred_bpobj);
1350 1411  
1351 1412          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1352 1413  
1353 1414          /*
     1415 +         * Stop autotrim tasks.
     1416 +         */
     1417 +        mutex_enter(&spa->spa_auto_trim_lock);
     1418 +        if (spa->spa_auto_trim_taskq)
     1419 +                spa_auto_trim_taskq_destroy(spa);
     1420 +        mutex_exit(&spa->spa_auto_trim_lock);
     1421 +
     1422 +        /*
1354 1423           * Close all vdevs.
1355 1424           */
1356 1425          if (spa->spa_root_vdev)
1357 1426                  vdev_free(spa->spa_root_vdev);
1358 1427          ASSERT(spa->spa_root_vdev == NULL);
1359 1428  
1360 1429          /*
1361 1430           * Close the dsl pool.
1362 1431           */
1363 1432          if (spa->spa_dsl_pool) {
1364 1433                  dsl_pool_close(spa->spa_dsl_pool);
1365 1434                  spa->spa_dsl_pool = NULL;
1366 1435                  spa->spa_meta_objset = NULL;
1367 1436          }
1368 1437  
1369 1438          ddt_unload(spa);
1370 1439  
1371 1440          /*
1372 1441           * Drop and purge level 2 cache
1373 1442           */
1374 1443          spa_l2cache_drop(spa);
1375 1444  
1376 1445          for (i = 0; i < spa->spa_spares.sav_count; i++)
1377 1446                  vdev_free(spa->spa_spares.sav_vdevs[i]);
1378 1447          if (spa->spa_spares.sav_vdevs) {
1379 1448                  kmem_free(spa->spa_spares.sav_vdevs,
1380 1449                      spa->spa_spares.sav_count * sizeof (void *));
1381 1450                  spa->spa_spares.sav_vdevs = NULL;
1382 1451          }
1383 1452          if (spa->spa_spares.sav_config) {
1384 1453                  nvlist_free(spa->spa_spares.sav_config);
1385 1454                  spa->spa_spares.sav_config = NULL;
1386 1455          }
1387 1456          spa->spa_spares.sav_count = 0;
1388 1457  
1389 1458          for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
1390 1459                  vdev_clear_stats(spa->spa_l2cache.sav_vdevs[i]);
1391 1460                  vdev_free(spa->spa_l2cache.sav_vdevs[i]);
1392 1461          }
1393 1462          if (spa->spa_l2cache.sav_vdevs) {
1394 1463                  kmem_free(spa->spa_l2cache.sav_vdevs,
1395 1464                      spa->spa_l2cache.sav_count * sizeof (void *));
  
1396 1465                  spa->spa_l2cache.sav_vdevs = NULL;
1397 1466          }
1398 1467          if (spa->spa_l2cache.sav_config) {
1399 1468                  nvlist_free(spa->spa_l2cache.sav_config);
1400 1469                  spa->spa_l2cache.sav_config = NULL;
1401 1470          }
1402 1471          spa->spa_l2cache.sav_count = 0;
1403 1472  
1404 1473          spa->spa_async_suspended = 0;
1405 1474  
1406      -        spa->spa_indirect_vdevs_loaded = B_FALSE;
1407      -
1408 1475          if (spa->spa_comment != NULL) {
1409 1476                  spa_strfree(spa->spa_comment);
1410 1477                  spa->spa_comment = NULL;
1411 1478          }
1412 1479  
1413 1480          spa_config_exit(spa, SCL_ALL, FTAG);
1414 1481  }
1415 1482  
1416 1483  /*
1417 1484   * Load (or re-load) the current list of vdevs describing the active spares for
1418 1485   * this pool.  When this is called, we have some form of basic information in
1419 1486   * 'spa_spares.sav_config'.  We parse this into vdevs, try to open them, and
1420 1487   * then re-generate a more complete list including status information.
1421 1488   */
1422      -void
     1489 +static void
1423 1490  spa_load_spares(spa_t *spa)
1424 1491  {
1425 1492          nvlist_t **spares;
1426 1493          uint_t nspares;
1427 1494          int i;
1428 1495          vdev_t *vd, *tvd;
1429 1496  
1430 1497          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1431 1498  
1432 1499          /*
1433 1500           * First, close and free any existing spare vdevs.
1434 1501           */
1435 1502          for (i = 0; i < spa->spa_spares.sav_count; i++) {
1436 1503                  vd = spa->spa_spares.sav_vdevs[i];
1437 1504  
1438 1505                  /* Undo the call to spa_activate() below */
1439 1506                  if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1440 1507                      B_FALSE)) != NULL && tvd->vdev_isspare)
1441 1508                          spa_spare_remove(tvd);
1442 1509                  vdev_close(vd);
1443 1510                  vdev_free(vd);
1444 1511          }
1445 1512  
1446 1513          if (spa->spa_spares.sav_vdevs)
1447 1514                  kmem_free(spa->spa_spares.sav_vdevs,
1448 1515                      spa->spa_spares.sav_count * sizeof (void *));
1449 1516  
1450 1517          if (spa->spa_spares.sav_config == NULL)
1451 1518                  nspares = 0;
1452 1519          else
1453 1520                  VERIFY(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
1454 1521                      ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
1455 1522  
1456 1523          spa->spa_spares.sav_count = (int)nspares;
1457 1524          spa->spa_spares.sav_vdevs = NULL;
1458 1525  
1459 1526          if (nspares == 0)
1460 1527                  return;
1461 1528  
1462 1529          /*
1463 1530           * Construct the array of vdevs, opening them to get status in the
 1464 1531           * process.   For each spare, there are potentially two different vdev_t
1465 1532           * structures associated with it: one in the list of spares (used only
1466 1533           * for basic validation purposes) and one in the active vdev
1467 1534           * configuration (if it's spared in).  During this phase we open and
1468 1535           * validate each vdev on the spare list.  If the vdev also exists in the
1469 1536           * active configuration, then we also mark this vdev as an active spare.
1470 1537           */
1471 1538          spa->spa_spares.sav_vdevs = kmem_alloc(nspares * sizeof (void *),
1472 1539              KM_SLEEP);
1473 1540          for (i = 0; i < spa->spa_spares.sav_count; i++) {
1474 1541                  VERIFY(spa_config_parse(spa, &vd, spares[i], NULL, 0,
1475 1542                      VDEV_ALLOC_SPARE) == 0);
1476 1543                  ASSERT(vd != NULL);
1477 1544  
1478 1545                  spa->spa_spares.sav_vdevs[i] = vd;
1479 1546  
1480 1547                  if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1481 1548                      B_FALSE)) != NULL) {
1482 1549                          if (!tvd->vdev_isspare)
1483 1550                                  spa_spare_add(tvd);
1484 1551  
1485 1552                          /*
1486 1553                           * We only mark the spare active if we were successfully
1487 1554                           * able to load the vdev.  Otherwise, importing a pool
1488 1555                           * with a bad active spare would result in strange
 1489 1556                           * behavior, because multiple pools would think the spare
1490 1557                           * is actively in use.
1491 1558                           *
1492 1559                           * There is a vulnerability here to an equally bizarre
1493 1560                           * circumstance, where a dead active spare is later
1494 1561                           * brought back to life (onlined or otherwise).  Given
1495 1562                           * the rarity of this scenario, and the extra complexity
1496 1563                           * it adds, we ignore the possibility.
1497 1564                           */
1498 1565                          if (!vdev_is_dead(tvd))
1499 1566                                  spa_spare_activate(tvd);
1500 1567                  }
1501 1568  
1502 1569                  vd->vdev_top = vd;
1503 1570                  vd->vdev_aux = &spa->spa_spares;
1504 1571  
1505 1572                  if (vdev_open(vd) != 0)
1506 1573                          continue;
1507 1574  
1508 1575                  if (vdev_validate_aux(vd) == 0)
1509 1576                          spa_spare_add(vd);
1510 1577          }
1511 1578  
1512 1579          /*
1513 1580           * Recompute the stashed list of spares, with status information
1514 1581           * this time.
1515 1582           */
1516 1583          VERIFY(nvlist_remove(spa->spa_spares.sav_config, ZPOOL_CONFIG_SPARES,
1517 1584              DATA_TYPE_NVLIST_ARRAY) == 0);
1518 1585  
1519 1586          spares = kmem_alloc(spa->spa_spares.sav_count * sizeof (void *),
1520 1587              KM_SLEEP);
1521 1588          for (i = 0; i < spa->spa_spares.sav_count; i++)
1522 1589                  spares[i] = vdev_config_generate(spa,
1523 1590                      spa->spa_spares.sav_vdevs[i], B_TRUE, VDEV_CONFIG_SPARE);
1524 1591          VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
1525 1592              ZPOOL_CONFIG_SPARES, spares, spa->spa_spares.sav_count) == 0);
1526 1593          for (i = 0; i < spa->spa_spares.sav_count; i++)
1527 1594                  nvlist_free(spares[i]);
1528 1595          kmem_free(spares, spa->spa_spares.sav_count * sizeof (void *));
  
1529 1596  }
1530 1597  
1531 1598  /*
1532 1599   * Load (or re-load) the current list of vdevs describing the active l2cache for
1533 1600   * this pool.  When this is called, we have some form of basic information in
1534 1601   * 'spa_l2cache.sav_config'.  We parse this into vdevs, try to open them, and
1535 1602   * then re-generate a more complete list including status information.
1536 1603   * Devices which are already active have their details maintained, and are
1537 1604   * not re-opened.
1538 1605   */
1539      -void
     1606 +static void
1540 1607  spa_load_l2cache(spa_t *spa)
1541 1608  {
1542 1609          nvlist_t **l2cache;
1543 1610          uint_t nl2cache;
1544 1611          int i, j, oldnvdevs;
1545 1612          uint64_t guid;
1546 1613          vdev_t *vd, **oldvdevs, **newvdevs;
1547 1614          spa_aux_vdev_t *sav = &spa->spa_l2cache;
1548 1615  
1549 1616          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1550 1617  
1551 1618          if (sav->sav_config != NULL) {
1552 1619                  VERIFY(nvlist_lookup_nvlist_array(sav->sav_config,
1553 1620                      ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
1554 1621                  newvdevs = kmem_alloc(nl2cache * sizeof (void *), KM_SLEEP);
1555 1622          } else {
1556 1623                  nl2cache = 0;
1557 1624                  newvdevs = NULL;
1558 1625          }
1559 1626  
1560 1627          oldvdevs = sav->sav_vdevs;
1561 1628          oldnvdevs = sav->sav_count;
1562 1629          sav->sav_vdevs = NULL;
1563 1630          sav->sav_count = 0;
1564 1631  
1565 1632          /*
1566 1633           * Process new nvlist of vdevs.
1567 1634           */
1568 1635          for (i = 0; i < nl2cache; i++) {
1569 1636                  VERIFY(nvlist_lookup_uint64(l2cache[i], ZPOOL_CONFIG_GUID,
1570 1637                      &guid) == 0);
1571 1638  
1572 1639                  newvdevs[i] = NULL;
1573 1640                  for (j = 0; j < oldnvdevs; j++) {
1574 1641                          vd = oldvdevs[j];
1575 1642                          if (vd != NULL && guid == vd->vdev_guid) {
1576 1643                                  /*
1577 1644                                   * Retain previous vdev for add/remove ops.
1578 1645                                   */
1579 1646                                  newvdevs[i] = vd;
1580 1647                                  oldvdevs[j] = NULL;
1581 1648                                  break;
1582 1649                          }
1583 1650                  }
1584 1651  
1585 1652                  if (newvdevs[i] == NULL) {
1586 1653                          /*
1587 1654                           * Create new vdev
1588 1655                           */
1589 1656                          VERIFY(spa_config_parse(spa, &vd, l2cache[i], NULL, 0,
1590 1657                              VDEV_ALLOC_L2CACHE) == 0);
1591 1658                          ASSERT(vd != NULL);
1592 1659                          newvdevs[i] = vd;
1593 1660  
1594 1661                          /*
1595 1662                           * Commit this vdev as an l2cache device,
1596 1663                           * even if it fails to open.
1597 1664                           */
1598 1665                          spa_l2cache_add(vd);
1599 1666  
  
1600 1667                          vd->vdev_top = vd;
1601 1668                          vd->vdev_aux = sav;
1602 1669  
1603 1670                          spa_l2cache_activate(vd);
1604 1671  
1605 1672                          if (vdev_open(vd) != 0)
1606 1673                                  continue;
1607 1674  
1608 1675                          (void) vdev_validate_aux(vd);
1609 1676  
1610      -                        if (!vdev_is_dead(vd))
1611      -                                l2arc_add_vdev(spa, vd);
     1677 +                        if (!vdev_is_dead(vd)) {
     1678 +                                boolean_t do_rebuild = B_FALSE;
     1679 +
     1680 +                                (void) nvlist_lookup_boolean_value(l2cache[i],
     1681 +                                    ZPOOL_CONFIG_L2CACHE_PERSISTENT,
     1682 +                                    &do_rebuild);
     1683 +                                l2arc_add_vdev(spa, vd, do_rebuild);
     1684 +                        }
1612 1685                  }
1613 1686          }
1614 1687  
1615 1688          /*
1616 1689           * Purge vdevs that were dropped
1617 1690           */
1618 1691          for (i = 0; i < oldnvdevs; i++) {
1619 1692                  uint64_t pool;
1620 1693  
1621 1694                  vd = oldvdevs[i];
1622 1695                  if (vd != NULL) {
1623 1696                          ASSERT(vd->vdev_isl2cache);
1624 1697  
1625 1698                          if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
1626 1699                              pool != 0ULL && l2arc_vdev_present(vd))
1627 1700                                  l2arc_remove_vdev(vd);
1628 1701                          vdev_clear_stats(vd);
1629 1702                          vdev_free(vd);
1630 1703                  }
1631 1704          }
1632 1705  
1633 1706          if (oldvdevs)
1634 1707                  kmem_free(oldvdevs, oldnvdevs * sizeof (void *));
1635 1708  
1636 1709          if (sav->sav_config == NULL)
1637 1710                  goto out;
1638 1711  
1639 1712          sav->sav_vdevs = newvdevs;
1640 1713          sav->sav_count = (int)nl2cache;
1641 1714  
1642 1715          /*
1643 1716           * Recompute the stashed list of l2cache devices, with status
1644 1717           * information this time.
1645 1718           */
1646 1719          VERIFY(nvlist_remove(sav->sav_config, ZPOOL_CONFIG_L2CACHE,
1647 1720              DATA_TYPE_NVLIST_ARRAY) == 0);
1648 1721  
1649 1722          l2cache = kmem_alloc(sav->sav_count * sizeof (void *), KM_SLEEP);
1650 1723          for (i = 0; i < sav->sav_count; i++)
1651 1724                  l2cache[i] = vdev_config_generate(spa,
1652 1725                      sav->sav_vdevs[i], B_TRUE, VDEV_CONFIG_L2CACHE);
1653 1726          VERIFY(nvlist_add_nvlist_array(sav->sav_config,
1654 1727              ZPOOL_CONFIG_L2CACHE, l2cache, sav->sav_count) == 0);
1655 1728  out:
1656 1729          for (i = 0; i < sav->sav_count; i++)
1657 1730                  nvlist_free(l2cache[i]);
1658 1731          if (sav->sav_count)
1659 1732                  kmem_free(l2cache, sav->sav_count * sizeof (void *));
1660 1733  }
1661 1734  
1662 1735  static int
1663 1736  load_nvlist(spa_t *spa, uint64_t obj, nvlist_t **value)
1664 1737  {
1665 1738          dmu_buf_t *db;
1666 1739          char *packed = NULL;
1667 1740          size_t nvsize = 0;
1668 1741          int error;
1669 1742          *value = NULL;
1670 1743  
1671 1744          error = dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db);
1672 1745          if (error != 0)
1673 1746                  return (error);
1674 1747  
1675 1748          nvsize = *(uint64_t *)db->db_data;
1676 1749          dmu_buf_rele(db, FTAG);
1677 1750  
1678 1751          packed = kmem_alloc(nvsize, KM_SLEEP);
  
1679 1752          error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
1680 1753              DMU_READ_PREFETCH);
1681 1754          if (error == 0)
1682 1755                  error = nvlist_unpack(packed, nvsize, value, 0);
1683 1756          kmem_free(packed, nvsize);
1684 1757  
1685 1758          return (error);
1686 1759  }
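
Editor's note: load_nvlist() unpacks an nvlist stored packed in a DMU object, where the object's bonus buffer holds the packed size. A sketch of how the spare configuration is loaded and then instantiated, following the spa_load() pattern; the object number is assumed to have come from the pool directory ZAP:

static int
example_load_spares(spa_t *spa, uint64_t spares_obj)
{
	int error;

	error = load_nvlist(spa, spares_obj, &spa->spa_spares.sav_config);
	if (error != 0)
		return (error);

	spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
	spa_load_spares(spa);
	spa_config_exit(spa, SCL_ALL, FTAG);

	return (0);
}
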
1687 1760  
1688 1761  /*
1689      - * Concrete top-level vdevs that are not missing and are not logs. At every
1690      - * spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds.
1691      - */
1692      -static uint64_t
1693      -spa_healthy_core_tvds(spa_t *spa)
1694      -{
1695      -        vdev_t *rvd = spa->spa_root_vdev;
1696      -        uint64_t tvds = 0;
1697      -
1698      -        for (uint64_t i = 0; i < rvd->vdev_children; i++) {
1699      -                vdev_t *vd = rvd->vdev_child[i];
1700      -                if (vd->vdev_islog)
1701      -                        continue;
1702      -                if (vdev_is_concrete(vd) && !vdev_is_dead(vd))
1703      -                        tvds++;
1704      -        }
1705      -
1706      -        return (tvds);
1707      -}
1708      -
1709      -/*
1710 1762   * Checks to see if the given vdev could not be opened, in which case we post a
1711 1763   * sysevent to notify the autoreplace code that the device has been removed.
1712 1764   */
1713 1765  static void
1714 1766  spa_check_removed(vdev_t *vd)
1715 1767  {
1716      -        for (uint64_t c = 0; c < vd->vdev_children; c++)
     1768 +        for (int c = 0; c < vd->vdev_children; c++)
1717 1769                  spa_check_removed(vd->vdev_child[c]);
1718 1770  
1719 1771          if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
1720      -            vdev_is_concrete(vd)) {
     1772 +            !vd->vdev_ishole) {
1721 1773                  zfs_post_autoreplace(vd->vdev_spa, vd);
1722 1774                  spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
1723 1775          }
1724 1776  }
1725 1777  
1726      -static int
1727      -spa_check_for_missing_logs(spa_t *spa)
     1778 +static void
     1779 +spa_config_valid_zaps(vdev_t *vd, vdev_t *mvd)
1728 1780  {
1729      -        vdev_t *rvd = spa->spa_root_vdev;
     1781 +        ASSERT3U(vd->vdev_children, ==, mvd->vdev_children);
1730 1782  
     1783 +        vd->vdev_top_zap = mvd->vdev_top_zap;
     1784 +        vd->vdev_leaf_zap = mvd->vdev_leaf_zap;
     1785 +
     1786 +        for (uint64_t i = 0; i < vd->vdev_children; i++) {
     1787 +                spa_config_valid_zaps(vd->vdev_child[i], mvd->vdev_child[i]);
     1788 +        }
     1789 +}
     1790 +
     1791 +/*
     1792 + * Validate the current config against the MOS config
     1793 + */
     1794 +static boolean_t
     1795 +spa_config_valid(spa_t *spa, nvlist_t *config)
     1796 +{
     1797 +        vdev_t *mrvd, *rvd = spa->spa_root_vdev;
     1798 +        nvlist_t *nv;
     1799 +
     1800 +        VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nv) == 0);
     1801 +
     1802 +        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     1803 +        VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
     1804 +
1731 1805          /*
     1806 +         * One of the earliest signs of a stale config is a mismatch
     1807 +         * in the number of child vdevs.
     1808 +         */
     1809 +        if (rvd->vdev_children != mrvd->vdev_children) {
     1810 +                vdev_free(mrvd);
     1811 +                spa_config_exit(spa, SCL_ALL, FTAG);
     1812 +                return (B_FALSE);
     1813 +        }
     1814 +        /*
1732 1815           * If we're doing a normal import, then build up any additional
1733      -         * diagnostic information about missing log devices.
     1816 +         * diagnostic information about missing devices in this config.
1734 1817           * We'll pass this up to the user for further processing.
1735 1818           */
1736 1819          if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
1737 1820                  nvlist_t **child, *nv;
1738 1821                  uint64_t idx = 0;
1739 1822  
1740 1823                  child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **),
1741 1824                      KM_SLEEP);
1742 1825                  VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
1743 1826  
1744      -                for (uint64_t c = 0; c < rvd->vdev_children; c++) {
     1827 +                for (int c = 0; c < rvd->vdev_children; c++) {
1745 1828                          vdev_t *tvd = rvd->vdev_child[c];
     1829 +                        vdev_t *mtvd  = mrvd->vdev_child[c];
1746 1830  
1747      -                        /*
1748      -                         * We consider a device as missing only if it failed
1749      -                         * to open (i.e. offline or faulted is not considered
1750      -                         * as missing).
1751      -                         */
1752      -                        if (tvd->vdev_islog &&
1753      -                            tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1754      -                                child[idx++] = vdev_config_generate(spa, tvd,
1755      -                                    B_FALSE, VDEV_CONFIG_MISSING);
1756      -                        }
     1831 +                        if (tvd->vdev_ops == &vdev_missing_ops &&
     1832 +                            mtvd->vdev_ops != &vdev_missing_ops &&
     1833 +                            mtvd->vdev_islog)
     1834 +                                child[idx++] = vdev_config_generate(spa, mtvd,
     1835 +                                    B_FALSE, 0);
1757 1836                  }
1758 1837  
1759      -                if (idx > 0) {
1760      -                        fnvlist_add_nvlist_array(nv,
1761      -                            ZPOOL_CONFIG_CHILDREN, child, idx);
1762      -                        fnvlist_add_nvlist(spa->spa_load_info,
1763      -                            ZPOOL_CONFIG_MISSING_DEVICES, nv);
     1838 +                if (idx) {
     1839 +                        VERIFY(nvlist_add_nvlist_array(nv,
     1840 +                            ZPOOL_CONFIG_CHILDREN, child, idx) == 0);
     1841 +                        VERIFY(nvlist_add_nvlist(spa->spa_load_info,
     1842 +                            ZPOOL_CONFIG_MISSING_DEVICES, nv) == 0);
1764 1843  
1765      -                        for (uint64_t i = 0; i < idx; i++)
     1844 +                        for (int i = 0; i < idx; i++)
1766 1845                                  nvlist_free(child[i]);
1767 1846                  }
1768 1847                  nvlist_free(nv);
1769 1848                  kmem_free(child, rvd->vdev_children * sizeof (char **));
     1849 +        }
1770 1850  
1771      -                if (idx > 0) {
1772      -                        spa_load_failed(spa, "some log devices are missing");
1773      -                        return (SET_ERROR(ENXIO));
1774      -                }
1775      -        } else {
1776      -                for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1777      -                        vdev_t *tvd = rvd->vdev_child[c];
     1851 +        /*
     1852 +         * Compare the root vdev tree with the information we have
     1853 +         * from the MOS config (mrvd). Check each top-level vdev
     1854 +         * with the corresponding MOS config top-level (mtvd).
     1855 +         */
     1856 +        for (int c = 0; c < rvd->vdev_children; c++) {
     1857 +                vdev_t *tvd = rvd->vdev_child[c];
     1858 +                vdev_t *mtvd  = mrvd->vdev_child[c];
1778 1859  
1779      -                        if (tvd->vdev_islog &&
1780      -                            tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
     1860 +                /*
     1861 +                 * Resolve any "missing" vdevs in the current configuration.
     1862 +                 * If we find that the MOS config has more accurate information
     1863 +                 * about the top-level vdev, then use that vdev instead.
     1864 +                 */
     1865 +                if (tvd->vdev_ops == &vdev_missing_ops &&
     1866 +                    mtvd->vdev_ops != &vdev_missing_ops) {
     1867 +
     1868 +                        if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG))
     1869 +                                continue;
     1870 +
     1871 +                        /*
     1872 +                         * Device specific actions.
     1873 +                         */
     1874 +                        if (mtvd->vdev_islog) {
1781 1875                                  spa_set_log_state(spa, SPA_LOG_CLEAR);
1782      -                                spa_load_note(spa, "some log devices are "
1783      -                                    "missing, ZIL is dropped.");
1784      -                                break;
     1876 +                        } else {
     1877 +                                /*
     1878 +                                 * XXX - once we have 'readonly' pool
     1879 +                                 * support we should be able to handle
     1880 +                                 * missing data devices by transitioning
     1881 +                                 * the pool to readonly.
     1882 +                                 */
     1883 +                                continue;
1785 1884                          }
     1885 +
     1886 +                        /*
     1887 +                         * Swap the missing vdev with the data we were
     1888 +                         * able to obtain from the MOS config.
     1889 +                         */
     1890 +                        vdev_remove_child(rvd, tvd);
     1891 +                        vdev_remove_child(mrvd, mtvd);
     1892 +
     1893 +                        vdev_add_child(rvd, mtvd);
     1894 +                        vdev_add_child(mrvd, tvd);
     1895 +
     1896 +                        spa_config_exit(spa, SCL_ALL, FTAG);
     1897 +                        vdev_load(mtvd);
     1898 +                        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     1899 +
     1900 +                        vdev_reopen(rvd);
     1901 +                } else {
     1902 +                        if (mtvd->vdev_islog) {
     1903 +                                /*
     1904 +                                 * Load the slog device's state from the MOS
     1905 +                                 * config since it's possible that the label
     1906 +                                 * does not contain the most up-to-date
     1907 +                                 * information.
     1908 +                                 */
     1909 +                                vdev_load_log_state(tvd, mtvd);
     1910 +                                vdev_reopen(tvd);
     1911 +                        }
     1912 +
     1913 +                        /*
     1914 +                         * Per-vdev ZAP info is stored exclusively in the MOS.
     1915 +                         */
     1916 +                        spa_config_valid_zaps(tvd, mtvd);
1786 1917                  }
1787 1918          }
1788 1919  
1789      -        return (0);
     1920 +        vdev_free(mrvd);
     1921 +        spa_config_exit(spa, SCL_ALL, FTAG);
     1922 +
     1923 +        /*
     1924 +         * Ensure we were able to validate the config.
     1925 +         */
     1926 +        return (rvd->vdev_guid_sum == spa->spa_uberblock.ub_guid_sum);
1790 1927  }
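
The routine above resolves a "missing" top-level vdev by swapping in its counterpart from the MOS config, and then accepts the configuration only when the root vdev's guid sum matches the uberblock's. A minimal userland sketch of that idea, using hypothetical toy_* types rather than the real vdev_t tree:

#include <stdint.h>
#include <stdbool.h>

typedef struct toy_vdev {
        uint64_t        guid;                   /* 0 stands for "missing" */
        struct toy_vdev *children[8];
        int             nchildren;
} toy_vdev_t;

/* Sum the guids of every vdev in the tree. */
uint64_t
toy_guid_sum(const toy_vdev_t *vd)
{
        uint64_t sum = vd->guid;

        for (int i = 0; i < vd->nchildren; i++)
                sum += toy_guid_sum(vd->children[i]);
        return (sum);
}

bool
toy_config_valid(toy_vdev_t *rvd, toy_vdev_t *mrvd, uint64_t expected_sum)
{
        /* A child-count mismatch is an early sign of a stale config. */
        if (rvd->nchildren != mrvd->nchildren)
                return (false);

        for (int c = 0; c < rvd->nchildren; c++) {
                /* Prefer the MOS copy when the label-based child is missing. */
                if (rvd->children[c]->guid == 0 &&
                    mrvd->children[c]->guid != 0) {
                        toy_vdev_t *tmp = rvd->children[c];

                        rvd->children[c] = mrvd->children[c];
                        mrvd->children[c] = tmp;
                }
        }
        /* Accept the config only if the guid sum matches the uberblock's. */
        return (toy_guid_sum(rvd) == expected_sum);
}
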
1791 1928  
1792 1929  /*
1793 1930   * Check for missing log devices
1794 1931   */
1795 1932  static boolean_t
1796 1933  spa_check_logs(spa_t *spa)
1797 1934  {
1798 1935          boolean_t rv = B_FALSE;
1799 1936          dsl_pool_t *dp = spa_get_dsl(spa);
1800 1937  
1801 1938          switch (spa->spa_log_state) {
1802 1939          case SPA_LOG_MISSING:
1803 1940                  /* need to recheck in case slog has been restored */
1804 1941          case SPA_LOG_UNKNOWN:
1805 1942                  rv = (dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1806 1943                      zil_check_log_chain, NULL, DS_FIND_CHILDREN) != 0);
1807 1944                  if (rv)
1808 1945                          spa_set_log_state(spa, SPA_LOG_MISSING);
1809 1946                  break;
1810 1947          }
1811 1948          return (rv);
1812 1949  }
1813 1950  
1814 1951  static boolean_t
1815 1952  spa_passivate_log(spa_t *spa)
1816 1953  {
1817 1954          vdev_t *rvd = spa->spa_root_vdev;
1818 1955          boolean_t slog_found = B_FALSE;
1819 1956  
1820 1957          ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1821 1958  
1822 1959          if (!spa_has_slogs(spa))
1823 1960                  return (B_FALSE);
1824 1961  
1825 1962          for (int c = 0; c < rvd->vdev_children; c++) {
1826 1963                  vdev_t *tvd = rvd->vdev_child[c];
1827 1964                  metaslab_group_t *mg = tvd->vdev_mg;
1828 1965  
1829 1966                  if (tvd->vdev_islog) {
1830 1967                          metaslab_group_passivate(mg);
1831 1968                          slog_found = B_TRUE;
1832 1969                  }
1833 1970          }
1834 1971  
1835 1972          return (slog_found);
1836 1973  }
1837 1974  
1838 1975  static void
1839 1976  spa_activate_log(spa_t *spa)
1840 1977  {
1841 1978          vdev_t *rvd = spa->spa_root_vdev;
1842 1979  
1843 1980          ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1844 1981  
  
  
1845 1982          for (int c = 0; c < rvd->vdev_children; c++) {
1846 1983                  vdev_t *tvd = rvd->vdev_child[c];
1847 1984                  metaslab_group_t *mg = tvd->vdev_mg;
1848 1985  
1849 1986                  if (tvd->vdev_islog)
1850 1987                          metaslab_group_activate(mg);
1851 1988          }
1852 1989  }
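
spa_passivate_log() and spa_activate_log() simply walk the top-level vdevs and disable or re-enable allocation on each log device's metaslab group. A tiny sketch of the passivation walk, with a hypothetical toy_tvd_t standing in for the vdev/metaslab-group pair:

#include <stdbool.h>

typedef struct toy_tvd {
        bool    is_log;
        bool    alloc_enabled;  /* stand-in for an (in)active metaslab group */
} toy_tvd_t;

/* Disable allocation on every log vdev; report whether any were found. */
bool
toy_passivate_logs(toy_tvd_t *children, int nchildren)
{
        bool slog_found = false;

        for (int c = 0; c < nchildren; c++) {
                if (children[c].is_log) {
                        children[c].alloc_enabled = false;
                        slog_found = true;
                }
        }
        return (slog_found);
}
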
1853 1990  
1854 1991  int
1855      -spa_reset_logs(spa_t *spa)
     1992 +spa_offline_log(spa_t *spa)
1856 1993  {
1857 1994          int error;
1858 1995  
1859      -        error = dmu_objset_find(spa_name(spa), zil_reset,
     1996 +        error = dmu_objset_find(spa_name(spa), zil_vdev_offline,
1860 1997              NULL, DS_FIND_CHILDREN);
1861 1998          if (error == 0) {
1862 1999                  /*
1863 2000                   * We successfully offlined the log device, sync out the
1864 2001                   * current txg so that the "stubby" block can be removed
1865 2002                   * by zil_sync().
1866 2003                   */
1867 2004                  txg_wait_synced(spa->spa_dsl_pool, 0);
1868 2005          }
1869 2006          return (error);
1870 2007  }
1871 2008  
1872 2009  static void
1873 2010  spa_aux_check_removed(spa_aux_vdev_t *sav)
1874 2011  {
1875 2012          for (int i = 0; i < sav->sav_count; i++)
1876 2013                  spa_check_removed(sav->sav_vdevs[i]);
1877 2014  }
1878 2015  
1879 2016  void
1880 2017  spa_claim_notify(zio_t *zio)
1881 2018  {
1882 2019          spa_t *spa = zio->io_spa;
1883 2020  
1884 2021          if (zio->io_error)
1885 2022                  return;
1886 2023  
1887 2024          mutex_enter(&spa->spa_props_lock);      /* any mutex will do */
1888 2025          if (spa->spa_claim_max_txg < zio->io_bp->blk_birth)
1889 2026                  spa->spa_claim_max_txg = zio->io_bp->blk_birth;
1890 2027          mutex_exit(&spa->spa_props_lock);
1891 2028  }
1892 2029  
1893 2030  typedef struct spa_load_error {
1894 2031          uint64_t        sle_meta_count;
1895 2032          uint64_t        sle_data_count;
1896 2033  } spa_load_error_t;
1897 2034  
1898 2035  static void
  
  
1899 2036  spa_load_verify_done(zio_t *zio)
1900 2037  {
1901 2038          blkptr_t *bp = zio->io_bp;
1902 2039          spa_load_error_t *sle = zio->io_private;
1903 2040          dmu_object_type_t type = BP_GET_TYPE(bp);
1904 2041          int error = zio->io_error;
1905 2042          spa_t *spa = zio->io_spa;
1906 2043  
1907 2044          abd_free(zio->io_abd);
1908 2045          if (error) {
1909      -                if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
1910      -                    type != DMU_OT_INTENT_LOG)
     2046 +                if (BP_IS_METADATA(bp) && type != DMU_OT_INTENT_LOG)
1911 2047                          atomic_inc_64(&sle->sle_meta_count);
1912 2048                  else
1913 2049                          atomic_inc_64(&sle->sle_data_count);
1914 2050          }
1915 2051  
1916 2052          mutex_enter(&spa->spa_scrub_lock);
1917 2053          spa->spa_scrub_inflight--;
1918 2054          cv_broadcast(&spa->spa_scrub_io_cv);
1919 2055          mutex_exit(&spa->spa_scrub_lock);
1920 2056  }
1921 2057  
1922 2058  /*
1923 2059   * Maximum number of concurrent scrub i/os to create while verifying
1924 2060   * a pool during import.
1925 2061   */
1926 2062  int spa_load_verify_maxinflight = 10000;
1927 2063  boolean_t spa_load_verify_metadata = B_TRUE;
1928 2064  boolean_t spa_load_verify_data = B_TRUE;
1929 2065  
1930 2066  /*ARGSUSED*/
1931 2067  static int
1932 2068  spa_load_verify_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
1933 2069      const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
1934 2070  {
1935 2071          if (bp == NULL || BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp))
1936 2072                  return (0);
1937 2073          /*
1938 2074           * Note: normally this routine will not be called if
1939 2075           * spa_load_verify_metadata is not set.  However, it may be useful
1940 2076           * to manually set the flag after the traversal has begun.
1941 2077           */
1942 2078          if (!spa_load_verify_metadata)
1943 2079                  return (0);
1944 2080          if (!BP_IS_METADATA(bp) && !spa_load_verify_data)
1945 2081                  return (0);
1946 2082  
1947 2083          zio_t *rio = arg;
1948 2084          size_t size = BP_GET_PSIZE(bp);
1949 2085  
1950 2086          mutex_enter(&spa->spa_scrub_lock);
1951 2087          while (spa->spa_scrub_inflight >= spa_load_verify_maxinflight)
1952 2088                  cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
1953 2089          spa->spa_scrub_inflight++;
1954 2090          mutex_exit(&spa->spa_scrub_lock);
1955 2091  
1956 2092          zio_nowait(zio_read(rio, spa, bp, abd_alloc_for_io(size, B_FALSE), size,
1957 2093              spa_load_verify_done, rio->io_private, ZIO_PRIORITY_SCRUB,
1958 2094              ZIO_FLAG_SPECULATIVE | ZIO_FLAG_CANFAIL |
1959 2095              ZIO_FLAG_SCRUB | ZIO_FLAG_RAW, zb));
1960 2096          return (0);
1961 2097  }
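
spa_load_verify_cb() bounds the number of in-flight verification reads at spa_load_verify_maxinflight by sleeping on spa_scrub_io_cv, and spa_load_verify_done() releases a slot and wakes the waiters as each read completes. A minimal userland sketch of the same throttle, assuming POSIX pthread primitives rather than the kernel mutex/condvar:

#include <pthread.h>

/* Hypothetical stand-in for the spa's scrub-throttle state. */
typedef struct io_throttle {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        int             inflight;
        int             maxinflight;
} io_throttle_t;

/* Called before issuing an I/O: block until a slot is free. */
void
throttle_enter(io_throttle_t *t)
{
        pthread_mutex_lock(&t->lock);
        while (t->inflight >= t->maxinflight)
                pthread_cond_wait(&t->cv, &t->lock);
        t->inflight++;
        pthread_mutex_unlock(&t->lock);
}

/* Called from the I/O completion path: release the slot. */
void
throttle_exit(io_throttle_t *t)
{
        pthread_mutex_lock(&t->lock);
        t->inflight--;
        pthread_cond_broadcast(&t->cv);
        pthread_mutex_unlock(&t->lock);
}
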
1962 2098  
1963 2099  /* ARGSUSED */
1964 2100  int
1965 2101  verify_dataset_name_len(dsl_pool_t *dp, dsl_dataset_t *ds, void *arg)
1966 2102  {
1967 2103          if (dsl_dataset_namelen(ds) >= ZFS_MAX_DATASET_NAME_LEN)
1968 2104                  return (SET_ERROR(ENAMETOOLONG));
1969 2105  
1970 2106          return (0);
1971 2107  }
1972 2108  
1973 2109  static int
1974 2110  spa_load_verify(spa_t *spa)
1975 2111  {
1976 2112          zio_t *rio;
1977 2113          spa_load_error_t sle = { 0 };
1978 2114          zpool_rewind_policy_t policy;
1979 2115          boolean_t verify_ok = B_FALSE;
1980 2116          int error = 0;
1981 2117  
1982 2118          zpool_get_rewind_policy(spa->spa_config, &policy);
1983 2119  
1984 2120          if (policy.zrp_request & ZPOOL_NEVER_REWIND)
1985 2121                  return (0);
1986 2122  
1987 2123          dsl_pool_config_enter(spa->spa_dsl_pool, FTAG);
1988 2124          error = dmu_objset_find_dp(spa->spa_dsl_pool,
  
  
1989 2125              spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
1990 2126              DS_FIND_CHILDREN);
1991 2127          dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
1992 2128          if (error != 0)
1993 2129                  return (error);
1994 2130  
1995 2131          rio = zio_root(spa, NULL, &sle,
1996 2132              ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
1997 2133  
1998 2134          if (spa_load_verify_metadata) {
1999      -                if (spa->spa_extreme_rewind) {
2000      -                        spa_load_note(spa, "performing a complete scan of the "
2001      -                            "pool since extreme rewind is on. This may take "
2002      -                            "a very long time.\n  (spa_load_verify_data=%u, "
2003      -                            "spa_load_verify_metadata=%u)",
2004      -                            spa_load_verify_data, spa_load_verify_metadata);
2005      -                }
2006      -                error = traverse_pool(spa, spa->spa_verify_min_txg,
     2135 +                zbookmark_phys_t zb = { 0 };
     2136 +                error = traverse_pool(spa, spa->spa_verify_min_txg, UINT64_MAX,
2007 2137                      TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA,
2008      -                    spa_load_verify_cb, rio);
     2138 +                    spa_load_verify_cb, rio, &zb);
2009 2139          }
2010 2140  
2011 2141          (void) zio_wait(rio);
2012 2142  
2013 2143          spa->spa_load_meta_errors = sle.sle_meta_count;
2014 2144          spa->spa_load_data_errors = sle.sle_data_count;
2015 2145  
2016      -        if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) {
2017      -                spa_load_note(spa, "spa_load_verify found %llu metadata errors "
2018      -                    "and %llu data errors", (u_longlong_t)sle.sle_meta_count,
2019      -                    (u_longlong_t)sle.sle_data_count);
2020      -        }
2021      -
2022      -        if (spa_load_verify_dryrun ||
2023      -            (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2024      -            sle.sle_data_count <= policy.zrp_maxdata)) {
     2146 +        if (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
     2147 +            sle.sle_data_count <= policy.zrp_maxdata) {
2025 2148                  int64_t loss = 0;
2026 2149  
2027 2150                  verify_ok = B_TRUE;
2028 2151                  spa->spa_load_txg = spa->spa_uberblock.ub_txg;
2029 2152                  spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
2030 2153  
2031 2154                  loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
2032 2155                  VERIFY(nvlist_add_uint64(spa->spa_load_info,
2033 2156                      ZPOOL_CONFIG_LOAD_TIME, spa->spa_load_txg_ts) == 0);
2034 2157                  VERIFY(nvlist_add_int64(spa->spa_load_info,
2035 2158                      ZPOOL_CONFIG_REWIND_TIME, loss) == 0);
2036 2159                  VERIFY(nvlist_add_uint64(spa->spa_load_info,
2037 2160                      ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
2038 2161          } else {
2039 2162                  spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
2040 2163          }
2041 2164  
2042      -        if (spa_load_verify_dryrun)
2043      -                return (0);
2044      -
2045 2165          if (error) {
2046 2166                  if (error != ENXIO && error != EIO)
2047 2167                          error = SET_ERROR(EIO);
2048 2168                  return (error);
2049 2169          }
2050 2170  
2051 2171          return (verify_ok ? 0 : EIO);
2052 2172  }
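
The acceptance test above reduces to: the traversal must have succeeded and the metadata and data error counts must stay within the rewind policy's zrp_maxmeta/zrp_maxdata limits; otherwise spa_load_max_txg is recorded so a rewind can be attempted. A small sketch of that predicate with hypothetical condensed structs:

#include <stdint.h>
#include <stdbool.h>

typedef struct toy_verify_counts {
        uint64_t        meta_errors;
        uint64_t        data_errors;
} toy_verify_counts_t;

typedef struct toy_rewind_limits {
        uint64_t        max_meta;
        uint64_t        max_data;
} toy_rewind_limits_t;

/* True when the pool may be imported at this txg without rewinding. */
bool
toy_verify_ok(int traverse_error, const toy_verify_counts_t *c,
    const toy_rewind_limits_t *lim)
{
        return (traverse_error == 0 &&
            c->meta_errors <= lim->max_meta &&
            c->data_errors <= lim->max_data);
}
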
2053 2173  
2054 2174  /*
2055 2175   * Find a value in the pool props object.
2056 2176   */
2057 2177  static void
  
  
2058 2178  spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
2059 2179  {
2060 2180          (void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
2061 2181              zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
2062 2182  }
2063 2183  
2064 2184  /*
2065 2185   * Find a value in the pool directory object.
2066 2186   */
2067 2187  static int
2068      -spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent)
     2188 +spa_dir_prop(spa_t *spa, const char *name, uint64_t *val)
2069 2189  {
2070      -        int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2071      -            name, sizeof (uint64_t), 1, val);
     2190 +        return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     2191 +            name, sizeof (uint64_t), 1, val));
     2192 +}
2072 2193  
2073      -        if (error != 0 && (error != ENOENT || log_enoent)) {
2074      -                spa_load_failed(spa, "couldn't get '%s' value in MOS directory "
2075      -                    "[error=%d]", name, error);
     2194 +static void
     2195 +spa_set_ddt_classes(spa_t *spa, int desegregation)
     2196 +{
     2197 +        /*
     2198 +         * If desegregation is turned on, set up ddt_class restrictions
     2199 +         */
     2200 +        if (desegregation) {
     2201 +                spa->spa_ddt_class_min = DDT_CLASS_DUPLICATE;
     2202 +                spa->spa_ddt_class_max = DDT_CLASS_DUPLICATE;
     2203 +        } else {
     2204 +                spa->spa_ddt_class_min = DDT_CLASS_DITTO;
     2205 +                spa->spa_ddt_class_max = DDT_CLASS_UNIQUE;
2076 2206          }
2077      -
2078      -        return (error);
2079 2207  }
2080 2208  
2081 2209  static int
2082 2210  spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
2083 2211  {
2084 2212          vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
2085      -        return (SET_ERROR(err));
     2213 +        return (err);
2086 2214  }
2087 2215  
2088      -static void
2089      -spa_spawn_aux_threads(spa_t *spa)
2090      -{
2091      -        ASSERT(spa_writeable(spa));
2092      -
2093      -        ASSERT(MUTEX_HELD(&spa_namespace_lock));
2094      -
2095      -        spa_start_indirect_condensing_thread(spa);
2096      -}
2097      -
2098 2216  /*
2099 2217   * Fix up config after a partly-completed split.  This is done with the
2100 2218   * ZPOOL_CONFIG_SPLIT nvlist.  Both the splitting pool and the split-off
2101 2219   * pool have that entry in their config, but only the splitting one contains
2102 2220   * a list of all the guids of the vdevs that are being split off.
2103 2221   *
2104 2222   * This function determines what to do with that list: either rejoin
2105 2223   * all the disks to the pool, or complete the splitting process.  To attempt
2106 2224   * the rejoin, each disk that is offlined is marked online again, and
2107 2225   * we do a reopen() call.  If the vdev label for every disk that was
2108 2226   * marked online indicates it was successfully split off (VDEV_AUX_SPLIT_POOL)
2109 2227   * then we call vdev_split() on each disk, and complete the split.
2110 2228   *
2111 2229   * Otherwise we leave the config alone, with all the vdevs in place in
2112 2230   * the original pool.
2113 2231   */
2114 2232  static void
2115 2233  spa_try_repair(spa_t *spa, nvlist_t *config)
2116 2234  {
2117 2235          uint_t extracted;
2118 2236          uint64_t *glist;
2119 2237          uint_t i, gcount;
2120 2238          nvlist_t *nvl;
2121 2239          vdev_t **vd;
2122 2240          boolean_t attempt_reopen;
2123 2241  
2124 2242          if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) != 0)
2125 2243                  return;
2126 2244  
2127 2245          /* check that the config is complete */
2128 2246          if (nvlist_lookup_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST,
2129 2247              &glist, &gcount) != 0)
2130 2248                  return;
2131 2249  
2132 2250          vd = kmem_zalloc(gcount * sizeof (vdev_t *), KM_SLEEP);
2133 2251  
2134 2252          /* attempt to online all the vdevs & validate */
2135 2253          attempt_reopen = B_TRUE;
2136 2254          for (i = 0; i < gcount; i++) {
2137 2255                  if (glist[i] == 0)      /* vdev is hole */
2138 2256                          continue;
2139 2257  
2140 2258                  vd[i] = spa_lookup_by_guid(spa, glist[i], B_FALSE);
2141 2259                  if (vd[i] == NULL) {
2142 2260                          /*
2143 2261                           * Don't bother attempting to reopen the disks;
2144 2262                           * just do the split.
2145 2263                           */
2146 2264                          attempt_reopen = B_FALSE;
2147 2265                  } else {
2148 2266                          /* attempt to re-online it */
2149 2267                          vd[i]->vdev_offline = B_FALSE;
2150 2268                  }
2151 2269          }
2152 2270  
2153 2271          if (attempt_reopen) {
2154 2272                  vdev_reopen(spa->spa_root_vdev);
2155 2273  
2156 2274                  /* check each device to see what state it's in */
2157 2275                  for (extracted = 0, i = 0; i < gcount; i++) {
2158 2276                          if (vd[i] != NULL &&
2159 2277                              vd[i]->vdev_stat.vs_aux != VDEV_AUX_SPLIT_POOL)
2160 2278                                  break;
2161 2279                          ++extracted;
2162 2280                  }
2163 2281          }
2164 2282  
2165 2283          /*
2166 2284           * If every disk has been moved to the new pool, or if we never
2167 2285           * even attempted to look at them, then we split them off for
2168 2286           * good.
2169 2287           */
2170 2288          if (!attempt_reopen || gcount == extracted) {
  
  
2171 2289                  for (i = 0; i < gcount; i++)
2172 2290                          if (vd[i] != NULL)
2173 2291                                  vdev_split(vd[i]);
2174 2292                  vdev_reopen(spa->spa_root_vdev);
2175 2293          }
2176 2294  
2177 2295          kmem_free(vd, gcount * sizeof (vdev_t *));
2178 2296  }
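
The block comment above spa_try_repair() describes the policy: re-online every vdev named in the split list, reopen the tree, and complete the split only if each of those vdevs reports VDEV_AUX_SPLIT_POOL (or if they could not be looked up at all, in which case the split is simply completed). A compact sketch of that decision with hypothetical toy_disk_t stand-ins:

#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_AUX_NONE, TOY_AUX_SPLIT_POOL } toy_aux_t;

typedef struct toy_disk {
        bool            offline;
        toy_aux_t       aux;    /* state reported by the label after reopen */
} toy_disk_t;

bool
toy_should_complete_split(toy_disk_t **disks, size_t count)
{
        bool attempt_reopen = true;
        size_t extracted = 0;

        for (size_t i = 0; i < count; i++) {
                if (disks[i] == NULL)
                        attempt_reopen = false; /* can't check; just split */
                else
                        disks[i]->offline = false;      /* try to re-online */
        }

        /* (a real implementation would reopen the vdev tree here) */

        if (attempt_reopen) {
                for (extracted = 0; extracted < count; extracted++) {
                        if (disks[extracted] != NULL &&
                            disks[extracted]->aux != TOY_AUX_SPLIT_POOL)
                                break;
                }
        }

        /* Split off for good only if every disk moved, or we never checked. */
        return (!attempt_reopen || extracted == count);
}
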
2179 2297  
2180 2298  static int
2181      -spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type)
     2299 +spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type,
     2300 +    boolean_t mosconfig)
2182 2301  {
     2302 +        nvlist_t *config = spa->spa_config;
2183 2303          char *ereport = FM_EREPORT_ZFS_POOL;
     2304 +        char *comment;
2184 2305          int error;
     2306 +        uint64_t pool_guid;
     2307 +        nvlist_t *nvl;
2185 2308  
2186      -        spa->spa_load_state = state;
     2309 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid))
     2310 +                return (SET_ERROR(EINVAL));
2187 2311  
2188      -        gethrestime(&spa->spa_loaded_ts);
2189      -        error = spa_load_impl(spa, type, &ereport, B_FALSE);
     2312 +        ASSERT(spa->spa_comment == NULL);
     2313 +        if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
     2314 +                spa->spa_comment = spa_strdup(comment);
2190 2315  
2191 2316          /*
     2317 +         * Versioning wasn't explicitly added to the label until later, so if
     2318 +         * it's not present treat it as the initial version.
     2319 +         */
     2320 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
     2321 +            &spa->spa_ubsync.ub_version) != 0)
     2322 +                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
     2323 +
     2324 +        (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
     2325 +            &spa->spa_config_txg);
     2326 +
     2327 +        if ((state == SPA_LOAD_IMPORT || state == SPA_LOAD_TRYIMPORT) &&
     2328 +            spa_guid_exists(pool_guid, 0)) {
     2329 +                error = SET_ERROR(EEXIST);
     2330 +        } else {
     2331 +                spa->spa_config_guid = pool_guid;
     2332 +
     2333 +                if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT,
     2334 +                    &nvl) == 0) {
     2335 +                        VERIFY(nvlist_dup(nvl, &spa->spa_config_splitting,
     2336 +                            KM_SLEEP) == 0);
     2337 +                }
     2338 +
     2339 +                nvlist_free(spa->spa_load_info);
     2340 +                spa->spa_load_info = fnvlist_alloc();
     2341 +
     2342 +                gethrestime(&spa->spa_loaded_ts);
     2343 +                error = spa_load_impl(spa, pool_guid, config, state, type,
     2344 +                    mosconfig, &ereport);
     2345 +        }
     2346 +
     2347 +        /*
2192 2348           * Don't count references from objsets that are already closed
2193 2349           * and are making their way through the eviction process.
2194 2350           */
2195 2351          spa_evicting_os_wait(spa);
2196 2352          spa->spa_minref = refcount_count(&spa->spa_refcount);
2197 2353          if (error) {
2198 2354                  if (error != EEXIST) {
2199 2355                          spa->spa_loaded_ts.tv_sec = 0;
2200 2356                          spa->spa_loaded_ts.tv_nsec = 0;
2201 2357                  }
2202 2358                  if (error != EBADF) {
2203 2359                          zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
2204 2360                  }
2205 2361          }
2206 2362          spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
2207 2363          spa->spa_ena = 0;
2208      -
2209 2364          return (error);
2210 2365  }
2211 2366  
2212 2367  /*
2213 2368   * Count the number of per-vdev ZAPs associated with all of the vdevs in the
2214 2369   * vdev tree rooted in the given vd, and ensure that each ZAP is present in the
2215 2370   * spa's per-vdev ZAP list.
2216 2371   */
2217 2372  static uint64_t
2218 2373  vdev_count_verify_zaps(vdev_t *vd)
2219 2374  {
2220 2375          spa_t *spa = vd->vdev_spa;
2221 2376          uint64_t total = 0;
2222 2377          if (vd->vdev_top_zap != 0) {
2223 2378                  total++;
2224 2379                  ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2225 2380                      spa->spa_all_vdev_zaps, vd->vdev_top_zap));
2226 2381          }
2227 2382          if (vd->vdev_leaf_zap != 0) {
2228 2383                  total++;
2229 2384                  ASSERT0(zap_lookup_int(spa->spa_meta_objset,
  
  
2230 2385                      spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
2231 2386          }
2232 2387  
2233 2388          for (uint64_t i = 0; i < vd->vdev_children; i++) {
2234 2389                  total += vdev_count_verify_zaps(vd->vdev_child[i]);
2235 2390          }
2236 2391  
2237 2392          return (total);
2238 2393  }
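
vdev_count_verify_zaps() is a straightforward recursive walk: count the top-level and leaf ZAPs recorded on each vdev (asserting each one is also present in the pool-wide list) and add up the children. A generic sketch of the same count over a hypothetical tree type:

#include <stdint.h>

typedef struct toy_node {
        uint64_t        top_zap;        /* 0 if absent */
        uint64_t        leaf_zap;       /* 0 if absent */
        struct toy_node **children;
        uint64_t        nchildren;
} toy_node_t;

uint64_t
toy_count_zaps(const toy_node_t *n)
{
        uint64_t total = 0;

        if (n->top_zap != 0)
                total++;
        if (n->leaf_zap != 0)
                total++;
        for (uint64_t i = 0; i < n->nchildren; i++)
                total += toy_count_zaps(n->children[i]);
        return (total);
}
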
2239 2394  
     2395 +/*
     2396 + * Load an existing storage pool, using the pool's builtin spa_config as a
     2397 + * source of configuration information.
     2398 + */
2240 2399  static int
2241      -spa_verify_host(spa_t *spa, nvlist_t *mos_config)
     2400 +spa_load_impl(spa_t *spa, uint64_t pool_guid, nvlist_t *config,
     2401 +    spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
     2402 +    char **ereport)
2242 2403  {
2243      -        uint64_t hostid;
2244      -        char *hostname;
2245      -        uint64_t myhostid = 0;
2246      -
2247      -        if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config,
2248      -            ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2249      -                hostname = fnvlist_lookup_string(mos_config,
2250      -                    ZPOOL_CONFIG_HOSTNAME);
2251      -
2252      -                myhostid = zone_get_hostid(NULL);
2253      -
2254      -                if (hostid != 0 && myhostid != 0 && hostid != myhostid) {
2255      -                        cmn_err(CE_WARN, "pool '%s' could not be "
2256      -                            "loaded as it was last accessed by "
2257      -                            "another system (host: %s hostid: 0x%llx). "
2258      -                            "See: http://illumos.org/msg/ZFS-8000-EY",
2259      -                            spa_name(spa), hostname, (u_longlong_t)hostid);
2260      -                        spa_load_failed(spa, "hostid verification failed: pool "
2261      -                            "last accessed by host: %s (hostid: 0x%llx)",
2262      -                            hostname, (u_longlong_t)hostid);
2263      -                        return (SET_ERROR(EBADF));
2264      -                }
2265      -        }
2266      -
2267      -        return (0);
2268      -}
2269      -
2270      -static int
2271      -spa_ld_parse_config(spa_t *spa, spa_import_type_t type)
2272      -{
2273 2404          int error = 0;
2274      -        nvlist_t *nvtree, *nvl, *config = spa->spa_config;
2275      -        int parse;
     2405 +        nvlist_t *nvroot = NULL;
     2406 +        nvlist_t *label;
2276 2407          vdev_t *rvd;
2277      -        uint64_t pool_guid;
2278      -        char *comment;
     2408 +        uberblock_t *ub = &spa->spa_uberblock;
     2409 +        uint64_t children, config_cache_txg = spa->spa_config_txg;
     2410 +        int orig_mode = spa->spa_mode;
     2411 +        int parse;
     2412 +        uint64_t obj;
     2413 +        boolean_t missing_feat_write = B_FALSE;
     2414 +        spa_meta_placement_t *mp;
2279 2415  
2280 2416          /*
2281      -         * Versioning wasn't explicitly added to the label until later, so if
2282      -         * it's not present treat it as the initial version.
     2417 +         * If this is an untrusted config, access the pool in read-only mode.
     2418 +         * This prevents things like resilvering recently removed devices.
2283 2419           */
2284      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2285      -            &spa->spa_ubsync.ub_version) != 0)
2286      -                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
     2420 +        if (!mosconfig)
     2421 +                spa->spa_mode = FREAD;
2287 2422  
2288      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) {
2289      -                spa_load_failed(spa, "invalid config provided: '%s' missing",
2290      -                    ZPOOL_CONFIG_POOL_GUID);
2291      -                return (SET_ERROR(EINVAL));
2292      -        }
     2423 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
2293 2424  
2294      -        if ((spa->spa_load_state == SPA_LOAD_IMPORT || spa->spa_load_state ==
2295      -            SPA_LOAD_TRYIMPORT) && spa_guid_exists(pool_guid, 0)) {
2296      -                spa_load_failed(spa, "a pool with guid %llu is already open",
2297      -                    (u_longlong_t)pool_guid);
2298      -                return (SET_ERROR(EEXIST));
2299      -        }
     2425 +        spa->spa_load_state = state;
2300 2426  
2301      -        spa->spa_config_guid = pool_guid;
2302      -
2303      -        nvlist_free(spa->spa_load_info);
2304      -        spa->spa_load_info = fnvlist_alloc();
2305      -
2306      -        ASSERT(spa->spa_comment == NULL);
2307      -        if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2308      -                spa->spa_comment = spa_strdup(comment);
2309      -
2310      -        (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2311      -            &spa->spa_config_txg);
2312      -
2313      -        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0)
2314      -                spa->spa_config_splitting = fnvlist_dup(nvl);
2315      -
2316      -        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) {
2317      -                spa_load_failed(spa, "invalid config provided: '%s' missing",
2318      -                    ZPOOL_CONFIG_VDEV_TREE);
     2427 +        if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot))
2319 2428                  return (SET_ERROR(EINVAL));
2320      -        }
2321 2429  
     2430 +        parse = (type == SPA_IMPORT_EXISTING ?
     2431 +            VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
     2432 +
2322 2433          /*
2323 2434           * Create "The Godfather" zio to hold all async IOs
2324 2435           */
2325 2436          spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
2326 2437              KM_SLEEP);
2327 2438          for (int i = 0; i < max_ncpus; i++) {
2328 2439                  spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
2329 2440                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
2330 2441                      ZIO_FLAG_GODFATHER);
2331 2442          }
2332 2443  
2333 2444          /*
2334 2445           * Parse the configuration into a vdev tree.  We explicitly set the
2335 2446           * value that will be returned by spa_version() since parsing the
2336 2447           * configuration requires knowing the version number.
2337 2448           */
2338 2449          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2339      -        parse = (type == SPA_IMPORT_EXISTING ?
2340      -            VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2341      -        error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse);
     2450 +        error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, parse);
2342 2451          spa_config_exit(spa, SCL_ALL, FTAG);
2343 2452  
2344      -        if (error != 0) {
2345      -                spa_load_failed(spa, "unable to parse config [error=%d]",
2346      -                    error);
     2453 +        if (error != 0)
2347 2454                  return (error);
2348      -        }
2349 2455  
2350 2456          ASSERT(spa->spa_root_vdev == rvd);
2351 2457          ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
2352 2458          ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
2353 2459  
2354 2460          if (type != SPA_IMPORT_ASSEMBLE) {
2355 2461                  ASSERT(spa_guid(spa) == pool_guid);
2356 2462          }
2357 2463  
2358      -        return (0);
2359      -}
2360      -
2361      -/*
2362      - * Recursively open all vdevs in the vdev tree. This function is called twice:
2363      - * first with the untrusted config, then with the trusted config.
2364      - */
2365      -static int
2366      -spa_ld_open_vdevs(spa_t *spa)
2367      -{
2368      -        int error = 0;
2369      -
2370 2464          /*
2371      -         * spa_missing_tvds_allowed defines how many top-level vdevs can be
2372      -         * missing/unopenable for the root vdev to be still considered openable.
     2465 +         * Try to open all vdevs, loading each label in the process.
2373 2466           */
2374      -        if (spa->spa_trust_config) {
2375      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds;
2376      -        } else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) {
2377      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile;
2378      -        } else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) {
2379      -                spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan;
2380      -        } else {
2381      -                spa->spa_missing_tvds_allowed = 0;
2382      -        }
2383      -
2384      -        spa->spa_missing_tvds_allowed =
2385      -            MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed);
2386      -
2387 2467          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2388      -        error = vdev_open(spa->spa_root_vdev);
     2468 +        error = vdev_open(rvd);
2389 2469          spa_config_exit(spa, SCL_ALL, FTAG);
     2470 +        if (error != 0)
     2471 +                return (error);
2390 2472  
2391      -        if (spa->spa_missing_tvds != 0) {
2392      -                spa_load_note(spa, "vdev tree has %lld missing top-level "
2393      -                    "vdevs.", (u_longlong_t)spa->spa_missing_tvds);
2394      -                if (spa->spa_trust_config && (spa->spa_mode & FWRITE)) {
2395      -                        /*
2396      -                         * Although theoretically we could allow users to open
2397      -                         * incomplete pools in RW mode, we'd need to add a lot
2398      -                         * of extra logic (e.g. adjust pool space to account
2399      -                         * for missing vdevs).
2400      -                         * This limitation also prevents users from accidentally
2401      -                         * opening the pool in RW mode during data recovery and
2402      -                         * damaging it further.
2403      -                         */
2404      -                        spa_load_note(spa, "pools with missing top-level "
2405      -                            "vdevs can only be opened in read-only mode.");
2406      -                        error = SET_ERROR(ENXIO);
2407      -                } else {
2408      -                        spa_load_note(spa, "current settings allow for maximum "
2409      -                            "%lld missing top-level vdevs at this stage.",
2410      -                            (u_longlong_t)spa->spa_missing_tvds_allowed);
2411      -                }
2412      -        }
2413      -        if (error != 0) {
2414      -                spa_load_failed(spa, "unable to open vdev tree [error=%d]",
2415      -                    error);
2416      -        }
2417      -        if (spa->spa_missing_tvds != 0 || error != 0)
2418      -                vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2);
     2473 +        /*
     2474 +         * We need to validate the vdev labels against the configuration that
     2475 +         * we have in hand, which is dependent on the setting of mosconfig. If
     2476 +         * mosconfig is true then we're validating the vdev labels based on
     2477 +         * that config.  Otherwise, we're validating against the cached config
     2478 +         * (zpool.cache) that was read when we loaded the zfs module, and then
     2479 +         * later we will recursively call spa_load() and validate against
     2480 +         * the vdev config.
     2481 +         *
     2482 +         * If we're assembling a new pool that's been split off from an
     2483 +         * existing pool, the labels haven't yet been updated so we skip
     2484 +         * validation for now.
     2485 +         */
     2486 +        if (type != SPA_IMPORT_ASSEMBLE) {
     2487 +                spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
     2488 +                error = vdev_validate(rvd, mosconfig);
     2489 +                spa_config_exit(spa, SCL_ALL, FTAG);
2419 2490  
2420      -        return (error);
2421      -}
     2491 +                if (error != 0)
     2492 +                        return (error);
2422 2493  
2423      -/*
2424      - * We need to validate the vdev labels against the configuration that
2425      - * we have in hand. This function is called twice: first with an untrusted
2426      - * config, then with a trusted config. The validation is more strict when the
2427      - * config is trusted.
2428      - */
2429      -static int
2430      -spa_ld_validate_vdevs(spa_t *spa)
2431      -{
2432      -        int error = 0;
2433      -        vdev_t *rvd = spa->spa_root_vdev;
2434      -
2435      -        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2436      -        error = vdev_validate(rvd);
2437      -        spa_config_exit(spa, SCL_ALL, FTAG);
2438      -
2439      -        if (error != 0) {
2440      -                spa_load_failed(spa, "vdev_validate failed [error=%d]", error);
2441      -                return (error);
     2494 +                if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
     2495 +                        return (SET_ERROR(ENXIO));
2442 2496          }
2443 2497  
2444      -        if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) {
2445      -                spa_load_failed(spa, "cannot open vdev tree after invalidating "
2446      -                    "some vdevs");
2447      -                vdev_dbgmsg_print_tree(rvd, 2);
2448      -                return (SET_ERROR(ENXIO));
2449      -        }
2450      -
2451      -        return (0);
2452      -}
2453      -
2454      -static int
2455      -spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type)
2456      -{
2457      -        vdev_t *rvd = spa->spa_root_vdev;
2458      -        nvlist_t *label;
2459      -        uberblock_t *ub = &spa->spa_uberblock;
2460      -
2461 2498          /*
2462 2499           * Find the best uberblock.
2463 2500           */
2464 2501          vdev_uberblock_load(rvd, ub, &label);
2465 2502  
2466 2503          /*
2467 2504           * If we weren't able to find a single valid uberblock, return failure.
2468 2505           */
2469 2506          if (ub->ub_txg == 0) {
2470 2507                  nvlist_free(label);
2471      -                spa_load_failed(spa, "no valid uberblock found");
2472 2508                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
2473 2509          }
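
vdev_uberblock_load() scans every label in the tree and keeps the best uberblock; by convention the highest txg wins, with the timestamp used as a tie-breaker. A simplified sketch of that comparison (not the kernel's exact selection code):

#include <stdint.h>

typedef struct toy_uberblock {
        uint64_t        ub_txg;
        uint64_t        ub_timestamp;
} toy_uberblock_t;

/* Pick the better of two uberblocks: higher txg, then newer timestamp. */
const toy_uberblock_t *
toy_best_uberblock(const toy_uberblock_t *a, const toy_uberblock_t *b)
{
        if (a->ub_txg != b->ub_txg)
                return (a->ub_txg > b->ub_txg ? a : b);
        return (a->ub_timestamp >= b->ub_timestamp ? a : b);
}
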
2474 2510  
2475      -        spa_load_note(spa, "using uberblock with txg=%llu",
2476      -            (u_longlong_t)ub->ub_txg);
2477      -
2478 2511          /*
2479 2512           * If the pool has an unsupported version we can't open it.
2480 2513           */
2481 2514          if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
2482 2515                  nvlist_free(label);
2483      -                spa_load_failed(spa, "version %llu is not supported",
2484      -                    (u_longlong_t)ub->ub_version);
2485 2516                  return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
2486 2517          }
2487 2518  
2488 2519          if (ub->ub_version >= SPA_VERSION_FEATURES) {
2489 2520                  nvlist_t *features;
2490 2521  
2491 2522                  /*
2492 2523                   * If we weren't able to find what's necessary for reading the
2493 2524                   * MOS in the label, return failure.
2494 2525                   */
2495      -                if (label == NULL) {
2496      -                        spa_load_failed(spa, "label config unavailable");
2497      -                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2498      -                            ENXIO));
2499      -                }
2500      -
2501      -                if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ,
2502      -                    &features) != 0) {
     2526 +                if (label == NULL || nvlist_lookup_nvlist(label,
     2527 +                    ZPOOL_CONFIG_FEATURES_FOR_READ, &features) != 0) {
2503 2528                          nvlist_free(label);
2504      -                        spa_load_failed(spa, "invalid label: '%s' missing",
2505      -                            ZPOOL_CONFIG_FEATURES_FOR_READ);
2506 2529                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2507 2530                              ENXIO));
2508 2531                  }
2509 2532  
2510 2533                  /*
2511 2534                   * Update our in-core representation with the definitive values
2512 2535                   * from the label.
2513 2536                   */
2514 2537                  nvlist_free(spa->spa_label_features);
2515 2538                  VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0);
2516 2539          }
2517 2540  
2518 2541          nvlist_free(label);
2519 2542  
2520 2543          /*
2521 2544           * Look through entries in the label nvlist's features_for_read. If
2522 2545           * there is a feature listed there which we don't understand then we
2523 2546           * cannot open a pool.
2524 2547           */
2525 2548          if (ub->ub_version >= SPA_VERSION_FEATURES) {
2526 2549                  nvlist_t *unsup_feat;
2527 2550  
2528 2551                  VERIFY(nvlist_alloc(&unsup_feat, NV_UNIQUE_NAME, KM_SLEEP) ==
2529 2552                      0);
2530 2553  
2531 2554                  for (nvpair_t *nvp = nvlist_next_nvpair(spa->spa_label_features,
2532 2555                      NULL); nvp != NULL;
2533 2556                      nvp = nvlist_next_nvpair(spa->spa_label_features, nvp)) {
  
  
2534 2557                          if (!zfeature_is_supported(nvpair_name(nvp))) {
2535 2558                                  VERIFY(nvlist_add_string(unsup_feat,
2536 2559                                      nvpair_name(nvp), "") == 0);
2537 2560                          }
2538 2561                  }
2539 2562  
2540 2563                  if (!nvlist_empty(unsup_feat)) {
2541 2564                          VERIFY(nvlist_add_nvlist(spa->spa_load_info,
2542 2565                              ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0);
2543 2566                          nvlist_free(unsup_feat);
2544      -                        spa_load_failed(spa, "some features are unsupported");
2545 2567                          return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2546 2568                              ENOTSUP));
2547 2569                  }
2548 2570  
2549 2571                  nvlist_free(unsup_feat);
2550 2572          }
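
The loop above collects every name from the label's features_for_read set that this build does not recognize and refuses to open the pool if any were found. A standalone sketch of the same filtering over plain string arrays (the feature names and helpers are illustrative, not the real zfeature interface):

#include <stdio.h>
#include <string.h>

/* Illustrative list of read-features this hypothetical build understands. */
static const char *toy_supported[] = {
        "com.delphix:hole_birth",
        "com.delphix:embedded_data",
};

static int
toy_feature_is_supported(const char *name)
{
        for (size_t i = 0;
            i < sizeof (toy_supported) / sizeof (toy_supported[0]); i++)
                if (strcmp(toy_supported[i], name) == 0)
                        return (1);
        return (0);
}

/* Returns the number of read-features we cannot honor (0 == pool opens). */
int
toy_check_features_for_read(const char **label_features, size_t count)
{
        int unsup = 0;

        for (size_t i = 0; i < count; i++) {
                if (!toy_feature_is_supported(label_features[i])) {
                        fprintf(stderr, "unsupported feature: %s\n",
                            label_features[i]);
                        unsup++;
                }
        }
        return (unsup);
}
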
2551 2573  
     2574 +        /*
     2575 +         * If the vdev guid sum doesn't match the uberblock, we have an
     2576 +         * incomplete configuration.  We first check to see if the pool
     2577 +         * is aware of the complete config (i.e. ZPOOL_CONFIG_VDEV_CHILDREN).
     2578 +         * If it is, defer the vdev_guid_sum check till later so we
     2579 +         * can handle missing vdevs.
     2580 +         */
     2581 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
     2582 +            &children) != 0 && mosconfig && type != SPA_IMPORT_ASSEMBLE &&
     2583 +            rvd->vdev_guid_sum != ub->ub_guid_sum)
     2584 +                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
     2585 +
2552 2586          if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
2553 2587                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2554      -                spa_try_repair(spa, spa->spa_config);
     2588 +                spa_try_repair(spa, config);
2555 2589                  spa_config_exit(spa, SCL_ALL, FTAG);
2556 2590                  nvlist_free(spa->spa_config_splitting);
2557 2591                  spa->spa_config_splitting = NULL;
2558 2592          }
2559 2593  
2560 2594          /*
2561 2595           * Initialize internal SPA structures.
2562 2596           */
2563 2597          spa->spa_state = POOL_STATE_ACTIVE;
2564 2598          spa->spa_ubsync = spa->spa_uberblock;
2565 2599          spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
2566 2600              TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
2567 2601          spa->spa_first_txg = spa->spa_last_ubsync_txg ?
2568 2602              spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
2569 2603          spa->spa_claim_max_txg = spa->spa_first_txg;
2570 2604          spa->spa_prev_software_version = ub->ub_software_version;
2571 2605  
2572      -        return (0);
2573      -}
2574      -
2575      -static int
2576      -spa_ld_open_rootbp(spa_t *spa)
2577      -{
2578      -        int error = 0;
2579      -        vdev_t *rvd = spa->spa_root_vdev;
2580      -
2581 2606          error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
2582      -        if (error != 0) {
2583      -                spa_load_failed(spa, "unable to open rootbp in dsl_pool_init "
2584      -                    "[error=%d]", error);
     2607 +        if (error)
2585 2608                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2586      -        }
2587 2609          spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
2588 2610  
2589      -        return (0);
2590      -}
2591      -
2592      -static int
2593      -spa_ld_load_trusted_config(spa_t *spa, spa_import_type_t type,
2594      -    boolean_t reloading)
2595      -{
2596      -        vdev_t *mrvd, *rvd = spa->spa_root_vdev;
2597      -        nvlist_t *nv, *mos_config, *policy;
2598      -        int error = 0, copy_error;
2599      -        uint64_t healthy_tvds, healthy_tvds_mos;
2600      -        uint64_t mos_config_txg;
2601      -
2602      -        if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE)
2603      -            != 0)
     2611 +        if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object) != 0)
2604 2612                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2605 2613  
2606      -        /*
2607      -         * If we're assembling a pool from a split, the config provided is
2608      -         * already trusted so there is nothing to do.
2609      -         */
2610      -        if (type == SPA_IMPORT_ASSEMBLE)
2611      -                return (0);
2612      -
2613      -        healthy_tvds = spa_healthy_core_tvds(spa);
2614      -
2615      -        if (load_nvlist(spa, spa->spa_config_object, &mos_config)
2616      -            != 0) {
2617      -                spa_load_failed(spa, "unable to retrieve MOS config");
2618      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2619      -        }
2620      -
2621      -        /*
2622      -         * If we are doing an open, pool owner wasn't verified yet, thus do
2623      -         * the verification here.
2624      -         */
2625      -        if (spa->spa_load_state == SPA_LOAD_OPEN) {
2626      -                error = spa_verify_host(spa, mos_config);
2627      -                if (error != 0) {
2628      -                        nvlist_free(mos_config);
2629      -                        return (error);
2630      -                }
2631      -        }
2632      -
2633      -        nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE);
2634      -
2635      -        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2636      -
2637      -        /*
2638      -         * Build a new vdev tree from the trusted config
2639      -         */
2640      -        VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
2641      -
2642      -        /*
2643      -         * Vdev paths in the MOS may be obsolete. If the untrusted config was
2644      -         * obtained by scanning /dev/dsk, then it will have the right vdev
2645      -         * paths. We update the trusted MOS config with this information.
2646      -         * We first try to copy the paths with vdev_copy_path_strict, which
2647      -         * succeeds only when both configs have exactly the same vdev tree.
2648      -         * If that fails, we fall back to a more flexible method that has a
2649      -         * best effort policy.
2650      -         */
2651      -        copy_error = vdev_copy_path_strict(rvd, mrvd);
2652      -        if (copy_error != 0 || spa_load_print_vdev_tree) {
2653      -                spa_load_note(spa, "provided vdev tree:");
2654      -                vdev_dbgmsg_print_tree(rvd, 2);
2655      -                spa_load_note(spa, "MOS vdev tree:");
2656      -                vdev_dbgmsg_print_tree(mrvd, 2);
2657      -        }
2658      -        if (copy_error != 0) {
2659      -                spa_load_note(spa, "vdev_copy_path_strict failed, falling "
2660      -                    "back to vdev_copy_path_relaxed");
2661      -                vdev_copy_path_relaxed(rvd, mrvd);
2662      -        }
2663      -
2664      -        vdev_close(rvd);
2665      -        vdev_free(rvd);
2666      -        spa->spa_root_vdev = mrvd;
2667      -        rvd = mrvd;
2668      -        spa_config_exit(spa, SCL_ALL, FTAG);
2669      -
2670      -        /*
2671      -         * We will use spa_config if we decide to reload the spa or if spa_load
2672      -         * fails and we rewind. We must thus regenerate the config using the
2673      -         * MOS information with the updated paths. Rewind policy is an import
2674      -         * setting and is not in the MOS. We copy it over to our new, trusted
2675      -         * config.
2676      -         */
2677      -        mos_config_txg = fnvlist_lookup_uint64(mos_config,
2678      -            ZPOOL_CONFIG_POOL_TXG);
2679      -        nvlist_free(mos_config);
2680      -        mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE);
2681      -        if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_REWIND_POLICY,
2682      -            &policy) == 0)
2683      -                fnvlist_add_nvlist(mos_config, ZPOOL_REWIND_POLICY, policy);
2684      -        spa_config_set(spa, mos_config);
2685      -        spa->spa_config_source = SPA_CONFIG_SRC_MOS;
2686      -
2687      -        /*
2688      -         * Now that we got the config from the MOS, we should be more strict
2689      -         * in checking blkptrs and can make assumptions about the consistency
2690      -         * of the vdev tree. spa_trust_config must be set to true before opening
2691      -         * vdevs in order for them to be writeable.
2692      -         */
2693      -        spa->spa_trust_config = B_TRUE;
2694      -
2695      -        /*
2696      -         * Open and validate the new vdev tree
2697      -         */
2698      -        error = spa_ld_open_vdevs(spa);
2699      -        if (error != 0)
2700      -                return (error);
2701      -
2702      -        error = spa_ld_validate_vdevs(spa);
2703      -        if (error != 0)
2704      -                return (error);
2705      -
2706      -        if (copy_error != 0 || spa_load_print_vdev_tree) {
2707      -                spa_load_note(spa, "final vdev tree:");
2708      -                vdev_dbgmsg_print_tree(rvd, 2);
2709      -        }
2710      -
2711      -        if (spa->spa_load_state != SPA_LOAD_TRYIMPORT &&
2712      -            !spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) {
2713      -                /*
2714      -                 * Sanity check to make sure that we are indeed loading the
2715      -                 * latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds
2716      -                 * in the config provided and they happened to be the only ones
2717      -                 * to have the latest uberblock, we could involuntarily perform
2718      -                 * an extreme rewind.
2719      -                 */
2720      -                healthy_tvds_mos = spa_healthy_core_tvds(spa);
2721      -                if (healthy_tvds_mos - healthy_tvds >=
2722      -                    SPA_SYNC_MIN_VDEVS) {
2723      -                        spa_load_note(spa, "config provided misses too many "
2724      -                            "top-level vdevs compared to MOS (%lld vs %lld). ",
2725      -                            (u_longlong_t)healthy_tvds,
2726      -                            (u_longlong_t)healthy_tvds_mos);
2727      -                        spa_load_note(spa, "vdev tree:");
2728      -                        vdev_dbgmsg_print_tree(rvd, 2);
2729      -                        if (reloading) {
2730      -                                spa_load_failed(spa, "config was already "
2731      -                                    "provided from MOS. Aborting.");
2732      -                                return (spa_vdev_err(rvd,
2733      -                                    VDEV_AUX_CORRUPT_DATA, EIO));
2734      -                        }
2735      -                        spa_load_note(spa, "spa must be reloaded using MOS "
2736      -                            "config");
2737      -                        return (SET_ERROR(EAGAIN));
2738      -                }
2739      -        }
2740      -
2741      -        error = spa_check_for_missing_logs(spa);
2742      -        if (error != 0)
2743      -                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2744      -
2745      -        if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) {
2746      -                spa_load_failed(spa, "uberblock guid sum doesn't match MOS "
2747      -                    "guid sum (%llu != %llu)",
2748      -                    (u_longlong_t)spa->spa_uberblock.ub_guid_sum,
2749      -                    (u_longlong_t)rvd->vdev_guid_sum);
2750      -                return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
2751      -                    ENXIO));
2752      -        }
2753      -
2754      -        return (0);
2755      -}
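The guid-sum test just above is the last gate in this helper: the uberblock records the sum of every vdev guid at the time it was written, so re-deriving that sum from the tree we just opened is a cheap way to detect a config that is missing or carrying extra devices. A minimal sketch of how such a sum could be derived (illustrative only; the in-kernel code relies on rvd->vdev_guid_sum, which the vdev layer appears to keep up to date rather than walking the tree):

	/*
	 * Illustrative sketch, not part of this change: recompute a vdev
	 * tree's guid sum the slow way by walking every child.
	 */
	static uint64_t
	example_guid_sum(vdev_t *vd)
	{
		uint64_t sum = vd->vdev_guid;

		for (uint64_t c = 0; c < vd->vdev_children; c++)
			sum += example_guid_sum(vd->vdev_child[c]);
		return (sum);
	}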
2756      -
2757      -static int
2758      -spa_ld_open_indirect_vdev_metadata(spa_t *spa)
2759      -{
2760      -        int error = 0;
2761      -        vdev_t *rvd = spa->spa_root_vdev;
2762      -
2763      -        /*
2764      -         * Everything that we read before spa_remove_init() must be stored
2765      -         * on concreted vdevs.  Therefore we do this as early as possible.
2766      -         */
2767      -        error = spa_remove_init(spa);
2768      -        if (error != 0) {
2769      -                spa_load_failed(spa, "spa_remove_init failed [error=%d]",
2770      -                    error);
2771      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2772      -        }
2773      -
2774      -        /*
2775      -         * Retrieve information needed to condense indirect vdev mappings.
2776      -         */
2777      -        error = spa_condense_init(spa);
2778      -        if (error != 0) {
2779      -                spa_load_failed(spa, "spa_condense_init failed [error=%d]",
2780      -                    error);
2781      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
2782      -        }
2783      -
2784      -        return (0);
2785      -}
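The ordering constraint described in the comment above is the whole point of this helper: until spa_remove_init() has loaded the indirect mappings, a DVA that points at a removed (indirect) vdev cannot be followed, so everything read before this point must live on concrete vdevs. A hypothetical check illustrating that constraint (the helper name is made up; vdev_lookup_top() and vdev_is_concrete() are the existing primitives it leans on):

	/*
	 * Hypothetical helper, sketched for illustration: before
	 * spa_remove_init() completes, only DVAs on concrete vdevs are safe
	 * to read, because indirect vdevs still lack their remap tables.
	 */
	static boolean_t
	example_dva_readable_before_remove_init(spa_t *spa, const dva_t *dva)
	{
		vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));

		return (vd != NULL && vdev_is_concrete(vd));
	}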
2786      -
2787      -static int
2788      -spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep)
2789      -{
2790      -        int error = 0;
2791      -        vdev_t *rvd = spa->spa_root_vdev;
2792      -
2793 2614          if (spa_version(spa) >= SPA_VERSION_FEATURES) {
2794 2615                  boolean_t missing_feat_read = B_FALSE;
2795 2616                  nvlist_t *unsup_feat, *enabled_feat;
2796 2617  
2797 2618                  if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
2798      -                    &spa->spa_feat_for_read_obj, B_TRUE) != 0) {
     2619 +                    &spa->spa_feat_for_read_obj) != 0) {
2799 2620                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2800 2621                  }
2801 2622  
2802 2623                  if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
2803      -                    &spa->spa_feat_for_write_obj, B_TRUE) != 0) {
     2624 +                    &spa->spa_feat_for_write_obj) != 0) {
2804 2625                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2805 2626                  }
2806 2627  
2807 2628                  if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
2808      -                    &spa->spa_feat_desc_obj, B_TRUE) != 0) {
     2629 +                    &spa->spa_feat_desc_obj) != 0) {
2809 2630                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2810 2631                  }
2811 2632  
2812 2633                  enabled_feat = fnvlist_alloc();
2813 2634                  unsup_feat = fnvlist_alloc();
2814 2635  
2815 2636                  if (!spa_features_check(spa, B_FALSE,
2816 2637                      unsup_feat, enabled_feat))
2817 2638                          missing_feat_read = B_TRUE;
2818 2639  
2819      -                if (spa_writeable(spa) ||
2820      -                    spa->spa_load_state == SPA_LOAD_TRYIMPORT) {
     2640 +                if (spa_writeable(spa) || state == SPA_LOAD_TRYIMPORT) {
2821 2641                          if (!spa_features_check(spa, B_TRUE,
2822 2642                              unsup_feat, enabled_feat)) {
2823      -                                *missing_feat_writep = B_TRUE;
     2643 +                                missing_feat_write = B_TRUE;
2824 2644                          }
2825 2645                  }
2826 2646  
2827 2647                  fnvlist_add_nvlist(spa->spa_load_info,
2828 2648                      ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
2829 2649  
2830 2650                  if (!nvlist_empty(unsup_feat)) {
2831 2651                          fnvlist_add_nvlist(spa->spa_load_info,
2832 2652                              ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
2833 2653                  }
2834 2654  
2835 2655                  fnvlist_free(enabled_feat);
2836 2656                  fnvlist_free(unsup_feat);
2837 2657  
2838 2658                  if (!missing_feat_read) {
2839 2659                          fnvlist_add_boolean(spa->spa_load_info,
2840 2660                              ZPOOL_CONFIG_CAN_RDONLY);
2841 2661                  }
2842 2662  
2843 2663                  /*
2844 2664                   * If the state is SPA_LOAD_TRYIMPORT, our objective is
2845 2665                   * twofold: to determine whether the pool is available for
2846 2666                   * import in read-write mode and (if it is not) whether the
2847 2667                   * pool is available for import in read-only mode. If the pool
2848 2668                   * is available for import in read-write mode, it is displayed
2849 2669                   * as available in userland; if it is not available for import
2850 2670                   * in read-only mode, it is displayed as unavailable in
2851 2671                   * userland. If the pool is available for import in read-only
  
      (18 lines elided)
  
2852 2672                   * mode but not read-write mode, it is displayed as unavailable
2853 2673                   * in userland with a special note that the pool is actually
2854 2674                   * available for open in read-only mode.
2855 2675                   *
2856 2676                   * As a result, if the state is SPA_LOAD_TRYIMPORT and we are
2857 2677                   * missing a feature for write, we must first determine whether
2858 2678                   * the pool can be opened read-only before returning to
2859 2679                   * userland in order to know whether to display the
2860 2680                   * abovementioned note.
2861 2681                   */
2862      -                if (missing_feat_read || (*missing_feat_writep &&
     2682 +                if (missing_feat_read || (missing_feat_write &&
2863 2683                      spa_writeable(spa))) {
2864      -                        spa_load_failed(spa, "pool uses unsupported features");
2865 2684                          return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2866 2685                              ENOTSUP));
2867 2686                  }
2868 2687  
2869 2688                  /*
2870 2689                   * Load refcounts for ZFS features from disk into an in-memory
2871 2690                   * cache during SPA initialization.
2872 2691                   */
2873 2692                  for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
2874 2693                          uint64_t refcount;
2875 2694  
2876 2695                          error = feature_get_refcount_from_disk(spa,
2877 2696                              &spa_feature_table[i], &refcount);
2878 2697                          if (error == 0) {
2879 2698                                  spa->spa_feat_refcount_cache[i] = refcount;
2880 2699                          } else if (error == ENOTSUP) {
2881 2700                                  spa->spa_feat_refcount_cache[i] =
2882 2701                                      SPA_FEATURE_DISABLED;
2883 2702                          } else {
2884      -                                spa_load_failed(spa, "error getting refcount "
2885      -                                    "for feature %s [error=%d]",
2886      -                                    spa_feature_table[i].fi_guid, error);
2887 2703                                  return (spa_vdev_err(rvd,
2888 2704                                      VDEV_AUX_CORRUPT_DATA, EIO));
2889 2705                          }
2890 2706                  }
2891 2707          }
2892 2708  
2893 2709          if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
2894 2710                  if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
2895      -                    &spa->spa_feat_enabled_txg_obj, B_TRUE) != 0)
     2711 +                    &spa->spa_feat_enabled_txg_obj) != 0)
2896 2712                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2897 2713          }
2898 2714  
2899      -        return (0);
2900      -}
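The unsupported-feature nvlists built above are not only used to fail the load; they are attached to spa_load_info so the import path can show them to the user. A small userland-style sketch of how such a list could be walked (the load_info parameter and helper name are stand-ins for whatever nvlist the import caller actually receives):

	/*
	 * Sketch only: walk the ZPOOL_CONFIG_UNSUP_FEAT list that the code
	 * above attaches to spa_load_info.
	 */
	static void
	example_print_unsup_features(nvlist_t *load_info)
	{
		nvlist_t *unsup;

		if (nvlist_lookup_nvlist(load_info, ZPOOL_CONFIG_UNSUP_FEAT,
		    &unsup) != 0)
			return;
		for (nvpair_t *nvp = nvlist_next_nvpair(unsup, NULL);
		    nvp != NULL; nvp = nvlist_next_nvpair(unsup, nvp))
			(void) printf("unsupported feature: %s\n",
			    nvpair_name(nvp));
	}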
2901      -
2902      -static int
2903      -spa_ld_load_special_directories(spa_t *spa)
2904      -{
2905      -        int error = 0;
2906      -        vdev_t *rvd = spa->spa_root_vdev;
2907      -
2908 2715          spa->spa_is_initializing = B_TRUE;
2909 2716          error = dsl_pool_open(spa->spa_dsl_pool);
2910 2717          spa->spa_is_initializing = B_FALSE;
2911      -        if (error != 0) {
2912      -                spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error);
     2718 +        if (error != 0)
2913 2719                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2914      -        }
2915 2720  
2916      -        return (0);
2917      -}
     2721 +        if (!mosconfig) {
     2722 +                uint64_t hostid;
     2723 +                nvlist_t *policy = NULL, *nvconfig;
2918 2724  
2919      -static int
2920      -spa_ld_get_props(spa_t *spa)
2921      -{
2922      -        int error = 0;
2923      -        uint64_t obj;
2924      -        vdev_t *rvd = spa->spa_root_vdev;
     2725 +                if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
     2726 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2925 2727  
     2728 +                if (!spa_is_root(spa) && nvlist_lookup_uint64(nvconfig,
     2729 +                    ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
     2730 +                        char *hostname;
     2731 +                        unsigned long myhostid = 0;
     2732 +
     2733 +                        VERIFY(nvlist_lookup_string(nvconfig,
     2734 +                            ZPOOL_CONFIG_HOSTNAME, &hostname) == 0);
     2735 +
     2736 +#ifdef  _KERNEL
     2737 +                        myhostid = zone_get_hostid(NULL);
     2738 +#else   /* _KERNEL */
     2739 +                        /*
     2740 +                         * We're emulating the system's hostid in userland, so
     2741 +                         * we can't use zone_get_hostid().
     2742 +                         */
     2743 +                        (void) ddi_strtoul(hw_serial, NULL, 10, &myhostid);
     2744 +#endif  /* _KERNEL */
     2745 +                        if (hostid != 0 && myhostid != 0 &&
     2746 +                            hostid != myhostid) {
     2747 +                                nvlist_free(nvconfig);
     2748 +                                cmn_err(CE_WARN, "pool '%s' could not be "
     2749 +                                    "loaded as it was last accessed by "
     2750 +                                    "another system (host: %s hostid: 0x%lx). "
     2751 +                                    "See: http://illumos.org/msg/ZFS-8000-EY",
     2752 +                                    spa_name(spa), hostname,
     2753 +                                    (unsigned long)hostid);
     2754 +                                return (SET_ERROR(EBADF));
     2755 +                        }
     2756 +                }
     2757 +                if (nvlist_lookup_nvlist(spa->spa_config,
     2758 +                    ZPOOL_REWIND_POLICY, &policy) == 0)
     2759 +                        VERIFY(nvlist_add_nvlist(nvconfig,
     2760 +                            ZPOOL_REWIND_POLICY, policy) == 0);
     2761 +
     2762 +                spa_config_set(spa, nvconfig);
     2763 +                spa_unload(spa);
     2764 +                spa_deactivate(spa);
     2765 +                spa_activate(spa, orig_mode);
     2766 +
     2767 +                return (spa_load(spa, state, SPA_IMPORT_EXISTING, B_TRUE));
     2768 +        }
     2769 +
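The hostid check in the block above only rejects a pool when both sides have a real hostid; a pool last written on a system with hostid 0 stays importable. A minimal sketch of that predicate (hypothetical helper, condensed from the code above):

	/*
	 * Sketch of the guard above: reject only when the on-disk hostid and
	 * the local hostid are both nonzero and disagree.
	 */
	static boolean_t
	example_hostid_conflict(uint64_t pool_hostid, unsigned long my_hostid)
	{
		return (pool_hostid != 0 && my_hostid != 0 &&
		    pool_hostid != (uint64_t)my_hostid);
	}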
2926 2770          /* Grab the secret checksum salt from the MOS. */
2927 2771          error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2928 2772              DMU_POOL_CHECKSUM_SALT, 1,
2929 2773              sizeof (spa->spa_cksum_salt.zcs_bytes),
2930 2774              spa->spa_cksum_salt.zcs_bytes);
2931 2775          if (error == ENOENT) {
2932 2776                  /* Generate a new salt for subsequent use */
2933 2777                  (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
2934 2778                      sizeof (spa->spa_cksum_salt.zcs_bytes));
2935 2779          } else if (error != 0) {
2936      -                spa_load_failed(spa, "unable to retrieve checksum salt from "
2937      -                    "MOS [error=%d]", error);
2938 2780                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2939 2781          }
2940 2782  
2941      -        if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0)
     2783 +        if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj) != 0)
2942 2784                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2943 2785          error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
2944      -        if (error != 0) {
2945      -                spa_load_failed(spa, "error opening deferred-frees bpobj "
2946      -                    "[error=%d]", error);
     2786 +        if (error != 0)
2947 2787                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2948      -        }
2949 2788  
2950 2789          /*
2951 2790           * Load the bit that tells us to use the new accounting function
2952 2791           * (raid-z deflation).  If we have an older pool, this will not
2953 2792           * be present.
2954 2793           */
2955      -        error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE);
     2794 +        error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate);
2956 2795          if (error != 0 && error != ENOENT)
2957 2796                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2958 2797  
2959 2798          error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
2960      -            &spa->spa_creation_version, B_FALSE);
     2799 +            &spa->spa_creation_version);
2961 2800          if (error != 0 && error != ENOENT)
2962 2801                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2963 2802  
2964 2803          /*
2965 2804           * Load the persistent error log.  If we have an older pool, this will
2966 2805           * not be present.
2967 2806           */
2968      -        error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last,
2969      -            B_FALSE);
     2807 +        error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last);
2970 2808          if (error != 0 && error != ENOENT)
2971 2809                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2972 2810  
2973 2811          error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
2974      -            &spa->spa_errlog_scrub, B_FALSE);
     2812 +            &spa->spa_errlog_scrub);
2975 2813          if (error != 0 && error != ENOENT)
2976 2814                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2977 2815  
2978 2816          /*
2979 2817           * Load the history object.  If we have an older pool, this
2980 2818           * will not be present.
2981 2819           */
2982      -        error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE);
     2820 +        error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history);
2983 2821          if (error != 0 && error != ENOENT)
2984 2822                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2985 2823  
2986 2824          /*
2987 2825           * Load the per-vdev ZAP map. If we have an older pool, this will not
2988 2826           * be present; in this case, defer its creation to a later time to
2989 2827           * avoid dirtying the MOS this early / out of sync context. See
2990 2828           * spa_sync_config_object.
2991 2829           */
2992 2830  
2993 2831          /* The sentinel is only available in the MOS config. */
2994 2832          nvlist_t *mos_config;
2995      -        if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) {
2996      -                spa_load_failed(spa, "unable to retrieve MOS config");
     2833 +        if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0)
2997 2834                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2998      -        }
2999 2835  
3000 2836          error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
3001      -            &spa->spa_all_vdev_zaps, B_FALSE);
     2837 +            &spa->spa_all_vdev_zaps);
3002 2838  
3003 2839          if (error == ENOENT) {
3004 2840                  VERIFY(!nvlist_exists(mos_config,
3005 2841                      ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
3006 2842                  spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
3007 2843                  ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3008 2844          } else if (error != 0) {
3009 2845                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3010 2846          } else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
3011 2847                  /*
3012 2848                   * An older version of ZFS overwrote the sentinel value, so
3013 2849                   * we have orphaned per-vdev ZAPs in the MOS. Defer their
3014 2850                   * destruction to later; see spa_sync_config_object.
  
      (3 lines elided)
  
3015 2851                   */
3016 2852                  spa->spa_avz_action = AVZ_ACTION_DESTROY;
3017 2853                  /*
3018 2854                   * We're assuming that no vdevs have had their ZAPs created
3019 2855                   * before this. Better be sure of it.
3020 2856                   */
3021 2857                  ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3022 2858          }
3023 2859          nvlist_free(mos_config);
3024 2860  
3025      -        spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
3026      -
3027      -        error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object,
3028      -            B_FALSE);
3029      -        if (error && error != ENOENT)
3030      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3031      -
3032      -        if (error == 0) {
3033      -                uint64_t autoreplace;
3034      -
3035      -                spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
3036      -                spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
3037      -                spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
3038      -                spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
3039      -                spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
3040      -                spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
3041      -                    &spa->spa_dedup_ditto);
3042      -
3043      -                spa->spa_autoreplace = (autoreplace != 0);
3044      -        }
3045      -
3046 2861          /*
3047      -         * If we are importing a pool with missing top-level vdevs,
3048      -         * we enforce that the pool doesn't panic or get suspended on
3049      -         * error since the likelihood of missing data is extremely high.
3050      -         */
3051      -        if (spa->spa_missing_tvds > 0 &&
3052      -            spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE &&
3053      -            spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3054      -                spa_load_note(spa, "forcing failmode to 'continue' "
3055      -                    "as some top level vdevs are missing");
3056      -                spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE;
3057      -        }
3058      -
3059      -        return (0);
3060      -}
3061      -
3062      -static int
3063      -spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type)
3064      -{
3065      -        int error = 0;
3066      -        vdev_t *rvd = spa->spa_root_vdev;
3067      -
3068      -        /*
3069 2862           * If we're assembling the pool from the split-off vdevs of
3070 2863           * an existing pool, we don't want to attach the spares & cache
3071 2864           * devices.
3072 2865           */
3073 2866  
3074 2867          /*
3075 2868           * Load any hot spares for this pool.
3076 2869           */
3077      -        error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object,
3078      -            B_FALSE);
     2870 +        error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object);
3079 2871          if (error != 0 && error != ENOENT)
3080 2872                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3081 2873          if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3082 2874                  ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
3083 2875                  if (load_nvlist(spa, spa->spa_spares.sav_object,
3084      -                    &spa->spa_spares.sav_config) != 0) {
3085      -                        spa_load_failed(spa, "error loading spares nvlist");
     2876 +                    &spa->spa_spares.sav_config) != 0)
3086 2877                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3087      -                }
3088 2878  
3089 2879                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3090 2880                  spa_load_spares(spa);
3091 2881                  spa_config_exit(spa, SCL_ALL, FTAG);
3092 2882          } else if (error == 0) {
3093 2883                  spa->spa_spares.sav_sync = B_TRUE;
3094 2884          }
3095 2885  
3096 2886          /*
3097 2887           * Load any level 2 ARC devices for this pool.
3098 2888           */
3099 2889          error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
3100      -            &spa->spa_l2cache.sav_object, B_FALSE);
     2890 +            &spa->spa_l2cache.sav_object);
3101 2891          if (error != 0 && error != ENOENT)
3102 2892                  return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3103 2893          if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3104 2894                  ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
3105 2895                  if (load_nvlist(spa, spa->spa_l2cache.sav_object,
3106      -                    &spa->spa_l2cache.sav_config) != 0) {
3107      -                        spa_load_failed(spa, "error loading l2cache nvlist");
     2896 +                    &spa->spa_l2cache.sav_config) != 0)
3108 2897                          return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3109      -                }
3110 2898  
3111 2899                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3112 2900                  spa_load_l2cache(spa);
3113 2901                  spa_config_exit(spa, SCL_ALL, FTAG);
3114 2902          } else if (error == 0) {
3115 2903                  spa->spa_l2cache.sav_sync = B_TRUE;
3116 2904          }
3117 2905  
3118      -        return (0);
3119      -}
     2906 +        mp = &spa->spa_meta_policy;
3120 2907  
3121      -static int
3122      -spa_ld_load_vdev_metadata(spa_t *spa)
3123      -{
3124      -        int error = 0;
3125      -        vdev_t *rvd = spa->spa_root_vdev;
     2908 +        spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
     2909 +        spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
     2910 +        spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
     2911 +        spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
     2912 +        spa->spa_dedup_lo_best_effort =
     2913 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
     2914 +        spa->spa_dedup_hi_best_effort =
     2915 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
3126 2916  
     2917 +        mp->spa_enable_meta_placement_selection =
     2918 +            zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
     2919 +        mp->spa_sync_to_special =
     2920 +            zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
     2921 +        mp->spa_ddt_meta_to_special =
     2922 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
     2923 +        mp->spa_zfs_meta_to_special =
     2924 +            zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
     2925 +        mp->spa_small_data_to_special =
     2926 +            zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
     2927 +        spa_set_ddt_classes(spa,
     2928 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_DESEGREGATION));
     2929 +
     2930 +        spa->spa_resilver_prio =
     2931 +            zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
     2932 +        spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
     2933 +
     2934 +        error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object);
     2935 +        if (error && error != ENOENT)
     2936 +                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
     2937 +
     2938 +        if (error == 0) {
     2939 +                uint64_t autoreplace;
     2940 +                uint64_t val = 0;
     2941 +
     2942 +                spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
     2943 +                spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
     2944 +                spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
     2945 +                spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
     2946 +                spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
     2947 +                spa_prop_find(spa, ZPOOL_PROP_BOOTSIZE, &spa->spa_bootsize);
     2948 +                spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
     2949 +                    &spa->spa_dedup_ditto);
     2950 +                spa_prop_find(spa, ZPOOL_PROP_FORCETRIM, &spa->spa_force_trim);
     2951 +
     2952 +                mutex_enter(&spa->spa_auto_trim_lock);
     2953 +                spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_auto_trim);
     2954 +                if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     2955 +                        spa_auto_trim_taskq_create(spa);
     2956 +                mutex_exit(&spa->spa_auto_trim_lock);
     2957 +
     2958 +                spa_prop_find(spa, ZPOOL_PROP_HIWATERMARK, &spa->spa_hiwat);
     2959 +                spa_prop_find(spa, ZPOOL_PROP_LOWATERMARK, &spa->spa_lowat);
     2960 +                spa_prop_find(spa, ZPOOL_PROP_MINWATERMARK, &spa->spa_minwat);
     2961 +                spa_prop_find(spa, ZPOOL_PROP_DEDUPMETA_DITTO,
     2962 +                    &spa->spa_ddt_meta_copies);
     2963 +                spa_prop_find(spa, ZPOOL_PROP_DDT_DESEGREGATION, &val);
     2964 +                spa_set_ddt_classes(spa, val);
     2965 +
     2966 +                spa_prop_find(spa, ZPOOL_PROP_RESILVER_PRIO,
     2967 +                    &spa->spa_resilver_prio);
     2968 +                spa_prop_find(spa, ZPOOL_PROP_SCRUB_PRIO,
     2969 +                    &spa->spa_scrub_prio);
     2970 +
     2971 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_BEST_EFFORT,
     2972 +                    &spa->spa_dedup_best_effort);
     2973 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT,
     2974 +                    &spa->spa_dedup_lo_best_effort);
     2975 +                spa_prop_find(spa, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT,
     2976 +                    &spa->spa_dedup_hi_best_effort);
     2977 +
     2978 +                spa_prop_find(spa, ZPOOL_PROP_META_PLACEMENT,
     2979 +                    &mp->spa_enable_meta_placement_selection);
     2980 +                spa_prop_find(spa, ZPOOL_PROP_SYNC_TO_SPECIAL,
     2981 +                    &mp->spa_sync_to_special);
     2982 +                spa_prop_find(spa, ZPOOL_PROP_DDT_META_TO_METADEV,
     2983 +                    &mp->spa_ddt_meta_to_special);
     2984 +                spa_prop_find(spa, ZPOOL_PROP_ZFS_META_TO_METADEV,
     2985 +                    &mp->spa_zfs_meta_to_special);
     2986 +                spa_prop_find(spa, ZPOOL_PROP_SMALL_DATA_TO_METADEV,
     2987 +                    &mp->spa_small_data_to_special);
     2988 +
     2989 +                spa->spa_autoreplace = (autoreplace != 0);
     2990 +        }
     2991 +
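Every property handled in the block above follows the same shape: seed the in-core field from the property-table default, then let spa_prop_find() overwrite it only when the MOS pool-props object is present. A condensed sketch of that pattern for a single property (the explicit guard on spa_pool_props_object stands in for the error == 0 check above; error handling is omitted):

	/* Sketch only: default first, then override from the MOS props ZAP. */
	spa->spa_delegation =
	    zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
	if (spa->spa_pool_props_object != 0)
		spa_prop_find(spa, ZPOOL_PROP_DELEGATION,
		    &spa->spa_delegation);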
     2992 +        error = spa_dir_prop(spa, DMU_POOL_COS_PROPS,
     2993 +            &spa->spa_cos_props_object);
     2994 +        if (error == 0)
     2995 +                (void) spa_load_cos_props(spa);
     2996 +        error = spa_dir_prop(spa, DMU_POOL_VDEV_PROPS,
     2997 +            &spa->spa_vdev_props_object);
     2998 +        if (error == 0)
     2999 +                (void) spa_load_vdev_props(spa);
     3000 +
     3001 +        (void) spa_dir_prop(spa, DMU_POOL_TRIM_START_TIME,
     3002 +            &spa->spa_man_trim_start_time);
     3003 +        (void) spa_dir_prop(spa, DMU_POOL_TRIM_STOP_TIME,
     3004 +            &spa->spa_man_trim_stop_time);
     3005 +
3127 3006          /*
3128 3007           * If the 'autoreplace' property is set, then post a resource notifying
3129 3008           * the ZFS DE that it should not issue any faults for unopenable
3130 3009           * devices.  We also iterate over the vdevs, and post a sysevent for any
3131 3010           * unopenable vdevs so that the normal autoreplace handler can take
3132 3011           * over.
3133 3012           */
3134      -        if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
     3013 +        if (spa->spa_autoreplace && state != SPA_LOAD_TRYIMPORT) {
3135 3014                  spa_check_removed(spa->spa_root_vdev);
3136 3015                  /*
3137 3016                   * For the import case, this is done in spa_import(), because
3138 3017                   * at this point we're using the spare definitions from
3139 3018                   * the MOS config, not necessarily from the userland config.
3140 3019                   */
3141      -                if (spa->spa_load_state != SPA_LOAD_IMPORT) {
     3020 +                if (state != SPA_LOAD_IMPORT) {
3142 3021                          spa_aux_check_removed(&spa->spa_spares);
3143 3022                          spa_aux_check_removed(&spa->spa_l2cache);
3144 3023                  }
3145 3024          }
3146 3025  
3147 3026          /*
3148      -         * Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc.
     3027 +         * Load the vdev state for all toplevel vdevs.
3149 3028           */
3150      -        error = vdev_load(rvd);
3151      -        if (error != 0) {
3152      -                spa_load_failed(spa, "vdev_load failed [error=%d]", error);
3153      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
3154      -        }
     3029 +        vdev_load(rvd);
3155 3030  
3156 3031          /*
3157      -         * Propagate the leaf DTLs we just loaded all the way up the vdev tree.
     3032 +         * Propagate the leaf DTLs we just loaded all the way up the tree.
3158 3033           */
3159 3034          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3160 3035          vdev_dtl_reassess(rvd, 0, 0, B_FALSE);
3161 3036          spa_config_exit(spa, SCL_ALL, FTAG);
3162 3037  
3163      -        return (0);
3164      -}
3165      -
3166      -static int
3167      -spa_ld_load_dedup_tables(spa_t *spa)
3168      -{
3169      -        int error = 0;
3170      -        vdev_t *rvd = spa->spa_root_vdev;
3171      -
3172      -        error = ddt_load(spa);
3173      -        if (error != 0) {
3174      -                spa_load_failed(spa, "ddt_load failed [error=%d]", error);
3175      -                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3176      -        }
3177      -
3178      -        return (0);
3179      -}
3180      -
3181      -static int
3182      -spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, char **ereport)
3183      -{
3184      -        vdev_t *rvd = spa->spa_root_vdev;
3185      -
3186      -        if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) {
3187      -                boolean_t missing = spa_check_logs(spa);
3188      -                if (missing) {
3189      -                        if (spa->spa_missing_tvds != 0) {
3190      -                                spa_load_note(spa, "spa_check_logs failed "
3191      -                                    "so dropping the logs");
3192      -                        } else {
3193      -                                *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3194      -                                spa_load_failed(spa, "spa_check_logs failed");
3195      -                                return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG,
3196      -                                    ENXIO));
3197      -                        }
3198      -                }
3199      -        }
3200      -
3201      -        return (0);
3202      -}
3203      -
3204      -static int
3205      -spa_ld_verify_pool_data(spa_t *spa)
3206      -{
3207      -        int error = 0;
3208      -        vdev_t *rvd = spa->spa_root_vdev;
3209      -
3210 3038          /*
3211      -         * We've successfully opened the pool, verify that we're ready
3212      -         * to start pushing transactions.
     3039 +         * Load the DDTs (dedup tables).
3213 3040           */
3214      -        if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3215      -                error = spa_load_verify(spa);
3216      -                if (error != 0) {
3217      -                        spa_load_failed(spa, "spa_load_verify failed "
3218      -                            "[error=%d]", error);
3219      -                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3220      -                            error));
3221      -                }
3222      -        }
3223      -
3224      -        return (0);
3225      -}
3226      -
3227      -static void
3228      -spa_ld_claim_log_blocks(spa_t *spa)
3229      -{
3230      -        dmu_tx_t *tx;
3231      -        dsl_pool_t *dp = spa_get_dsl(spa);
3232      -
3233      -        /*
3234      -         * Claim log blocks that haven't been committed yet.
3235      -         * This must all happen in a single txg.
3236      -         * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3237      -         * invoked from zil_claim_log_block()'s i/o done callback.
3238      -         * Price of rollback is that we abandon the log.
3239      -         */
3240      -        spa->spa_claiming = B_TRUE;
3241      -
3242      -        tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3243      -        (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3244      -            zil_claim, tx, DS_FIND_CHILDREN);
3245      -        dmu_tx_commit(tx);
3246      -
3247      -        spa->spa_claiming = B_FALSE;
3248      -
3249      -        spa_set_log_state(spa, SPA_LOG_GOOD);
3250      -}
3251      -
3252      -static void
3253      -spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg,
3254      -    boolean_t reloading)
3255      -{
3256      -        vdev_t *rvd = spa->spa_root_vdev;
3257      -        int need_update = B_FALSE;
3258      -
3259      -        /*
3260      -         * If the config cache is stale, or we have uninitialized
3261      -         * metaslabs (see spa_vdev_add()), then update the config.
3262      -         *
3263      -         * If this is a verbatim import, trust the current
3264      -         * in-core spa_config and update the disk labels.
3265      -         */
3266      -        if (reloading || config_cache_txg != spa->spa_config_txg ||
3267      -            spa->spa_load_state == SPA_LOAD_IMPORT ||
3268      -            spa->spa_load_state == SPA_LOAD_RECOVER ||
3269      -            (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3270      -                need_update = B_TRUE;
3271      -
3272      -        for (int c = 0; c < rvd->vdev_children; c++)
3273      -                if (rvd->vdev_child[c]->vdev_ms_array == 0)
3274      -                        need_update = B_TRUE;
3275      -
3276      -        /*
3277      -         * Update the config cache asynchronously in case we're the
3278      -         * root pool, in which case the config cache isn't writable yet.
3279      -         */
3280      -        if (need_update)
3281      -                spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3282      -}
3283      -
3284      -static void
3285      -spa_ld_prepare_for_reload(spa_t *spa)
3286      -{
3287      -        int mode = spa->spa_mode;
3288      -        int async_suspended = spa->spa_async_suspended;
3289      -
3290      -        spa_unload(spa);
3291      -        spa_deactivate(spa);
3292      -        spa_activate(spa, mode);
3293      -
3294      -        /*
3295      -         * We save the value of spa_async_suspended as it gets reset to 0 by
3296      -         * spa_unload(). We want to restore it back to the original value before
3297      -         * returning as we might be calling spa_async_resume() later.
3298      -         */
3299      -        spa->spa_async_suspended = async_suspended;
3300      -}
3301      -
3302      -/*
3303      - * Load an existing storage pool, using the config provided. This config
3304      - * describes which vdevs are part of the pool and is later validated against
3305      - * partial configs present in each vdev's label and an entire copy of the
3306      - * config stored in the MOS.
3307      - */
3308      -static int
3309      -spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
3310      -    boolean_t reloading)
3311      -{
3312      -        int error = 0;
3313      -        boolean_t missing_feat_write = B_FALSE;
3314      -
3315      -        ASSERT(MUTEX_HELD(&spa_namespace_lock));
3316      -        ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
3317      -
3318      -        /*
3319      -         * Never trust the config that is provided unless we are assembling
3320      -         * a pool following a split.
3321      -         * This means don't trust blkptrs and the vdev tree in general. This
3322      -         * also effectively puts the spa in read-only mode since
3323      -         * spa_writeable() checks for spa_trust_config to be true.
3324      -         * We will later load a trusted config from the MOS.
3325      -         */
3326      -        if (type != SPA_IMPORT_ASSEMBLE)
3327      -                spa->spa_trust_config = B_FALSE;
3328      -
3329      -        if (reloading)
3330      -                spa_load_note(spa, "RELOADING");
3331      -        else
3332      -                spa_load_note(spa, "LOADING");
3333      -
3334      -        /*
3335      -         * Parse the config provided to create a vdev tree.
3336      -         */
3337      -        error = spa_ld_parse_config(spa, type);
     3041 +        error = ddt_load(spa);
3338 3042          if (error != 0)
3339      -                return (error);
     3043 +                return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3340 3044  
3341      -        /*
3342      -         * Now that we have the vdev tree, try to open each vdev. This involves
3343      -         * opening the underlying physical device, retrieving its geometry and
3344      -         * probing the vdev with a dummy I/O. The state of each vdev will be set
3345      -         * based on the success of those operations. After this we'll be ready
3346      -         * to read from the vdevs.
3347      -         */
3348      -        error = spa_ld_open_vdevs(spa);
3349      -        if (error != 0)
3350      -                return (error);
     3045 +        spa_update_dspace(spa);
3351 3046  
3352 3047          /*
3353      -         * Read the label of each vdev and make sure that the GUIDs stored
3354      -         * there match the GUIDs in the config provided.
3355      -         * If we're assembling a new pool that's been split off from an
3356      -         * existing pool, the labels haven't yet been updated so we skip
3357      -         * validation for now.
     3048 +         * Validate the config, using the MOS config to fill in any
     3049 +         * information which might be missing.  If we fail to validate
     3050 +         * the config then declare the pool unfit for use. If we're
     3051 +         * assembling a pool from a split, the log is not transferred
     3052 +         * over.
3358 3053           */
3359 3054          if (type != SPA_IMPORT_ASSEMBLE) {
3360      -                error = spa_ld_validate_vdevs(spa);
3361      -                if (error != 0)
3362      -                        return (error);
3363      -        }
     3055 +                nvlist_t *nvconfig;
3364 3056  
3365      -        /*
3366      -         * Read vdev labels to find the best uberblock (i.e. latest, unless
3367      -         * spa_load_max_txg is set) and store it in spa_uberblock. We get the
3368      -         * list of features required to read blkptrs in the MOS from the vdev
3369      -         * label with the best uberblock and verify that our version of zfs
3370      -         * supports them all.
3371      -         */
3372      -        error = spa_ld_select_uberblock(spa, type);
3373      -        if (error != 0)
3374      -                return (error);
     3057 +                if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
     3058 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3375 3059  
3376      -        /*
3377      -         * Pass that uberblock to the dsl_pool layer which will open the root
3378      -         * blkptr. This blkptr points to the latest version of the MOS and will
3379      -         * allow us to read its contents.
3380      -         */
3381      -        error = spa_ld_open_rootbp(spa);
3382      -        if (error != 0)
3383      -                return (error);
     3060 +                if (!spa_config_valid(spa, nvconfig)) {
     3061 +                        nvlist_free(nvconfig);
     3062 +                        return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
     3063 +                            ENXIO));
     3064 +                }
     3065 +                nvlist_free(nvconfig);
3384 3066  
3385      -        /*
3386      -         * Retrieve the trusted config stored in the MOS and use it to create
3387      -         * a new, exact version of the vdev tree, then reopen all vdevs.
3388      -         */
3389      -        error = spa_ld_load_trusted_config(spa, type, reloading);
3390      -        if (error == EAGAIN) {
3391      -                VERIFY(!reloading);
3392 3067                  /*
3393      -                 * Redo the loading process with the trusted config if it is
3394      -                 * too different from the untrusted config.
     3068 +                 * Now that we've validated the config, check the state of the
     3069 +                 * root vdev.  If it can't be opened, it indicates one or
     3070 +                 * more toplevel vdevs are faulted.
3395 3071                   */
3396      -                spa_ld_prepare_for_reload(spa);
3397      -                return (spa_load_impl(spa, type, ereport, B_TRUE));
3398      -        } else if (error != 0) {
3399      -                return (error);
     3072 +                if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
     3073 +                        return (SET_ERROR(ENXIO));
     3074 +
     3075 +                if (spa_writeable(spa) && spa_check_logs(spa)) {
     3076 +                        *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
     3077 +                        return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ENXIO));
     3078 +                }
3400 3079          }
3401 3080  
3402      -        /*
3403      -         * Retrieve the mapping of indirect vdevs. Those vdevs were removed
3404      -         * from the pool and their contents were re-mapped to other vdevs. Note
3405      -         * that everything that we read before this step must have been
3406      -         * rewritten on concrete vdevs after the last device removal was
3407      -         * initiated. Otherwise we could be reading from indirect vdevs before
3408      -         * we have loaded their mappings.
3409      -         */
3410      -        error = spa_ld_open_indirect_vdev_metadata(spa);
3411      -        if (error != 0)
3412      -                return (error);
3413      -
3414      -        /*
3415      -         * Retrieve the full list of active features from the MOS and check if
3416      -         * they are all supported.
3417      -         */
3418      -        error = spa_ld_check_features(spa, &missing_feat_write);
3419      -        if (error != 0)
3420      -                return (error);
3421      -
3422      -        /*
3423      -         * Load several special directories from the MOS needed by the dsl_pool
3424      -         * layer.
3425      -         */
3426      -        error = spa_ld_load_special_directories(spa);
3427      -        if (error != 0)
3428      -                return (error);
3429      -
3430      -        /*
3431      -         * Retrieve pool properties from the MOS.
3432      -         */
3433      -        error = spa_ld_get_props(spa);
3434      -        if (error != 0)
3435      -                return (error);
3436      -
3437      -        /*
3438      -         * Retrieve the list of auxiliary devices - cache devices and spares -
3439      -         * and open them.
3440      -         */
3441      -        error = spa_ld_open_aux_vdevs(spa, type);
3442      -        if (error != 0)
3443      -                return (error);
3444      -
3445      -        /*
3446      -         * Load the metadata for all vdevs. Also check if unopenable devices
3447      -         * should be autoreplaced.
3448      -         */
3449      -        error = spa_ld_load_vdev_metadata(spa);
3450      -        if (error != 0)
3451      -                return (error);
3452      -
3453      -        error = spa_ld_load_dedup_tables(spa);
3454      -        if (error != 0)
3455      -                return (error);
3456      -
3457      -        /*
3458      -         * Verify the logs now to make sure we don't have any unexpected errors
3459      -         * when we claim log blocks later.
3460      -         */
3461      -        error = spa_ld_verify_logs(spa, type, ereport);
3462      -        if (error != 0)
3463      -                return (error);
3464      -
3465 3081          if (missing_feat_write) {
3466      -                ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT);
     3082 +                ASSERT(state == SPA_LOAD_TRYIMPORT);
3467 3083  
3468 3084                  /*
3469 3085                   * At this point, we know that we can open the pool in
3470 3086                   * read-only mode but not read-write mode. We now have enough
3471 3087                   * information and can return to userland.
3472 3088                   */
3473      -                return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT,
3474      -                    ENOTSUP));
     3089 +                return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP));
3475 3090          }
3476 3091  
3477 3092          /*
3478      -         * Traverse the last txgs to make sure the pool was left off in a safe
3479      -         * state. When performing an extreme rewind, we verify the whole pool,
3480      -         * which can take a very long time.
     3093 +         * We've successfully opened the pool, verify that we're ready
     3094 +         * to start pushing transactions.
3481 3095           */
3482      -        error = spa_ld_verify_pool_data(spa);
3483      -        if (error != 0)
3484      -                return (error);
     3096 +        if (state != SPA_LOAD_TRYIMPORT) {
     3097 +                if (error = spa_load_verify(spa)) {
     3098 +                        return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
     3099 +                            error));
     3100 +                }
     3101 +        }
3485 3102  
3486      -        /*
3487      -         * Calculate the deflated space for the pool. This must be done before
3488      -         * we write anything to the pool because we'd need to update the space
3489      -         * accounting using the deflated sizes.
3490      -         */
3491      -        spa_update_dspace(spa);
3492      -
3493      -        /*
3494      -         * We have now retrieved all the information we needed to open the
3495      -         * pool. If we are importing the pool in read-write mode, a few
3496      -         * additional steps must be performed to finish the import.
3497      -         */
3498      -        if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER ||
     3103 +        if (spa_writeable(spa) && (state == SPA_LOAD_RECOVER ||
3499 3104              spa->spa_load_max_txg == UINT64_MAX)) {
3500      -                uint64_t config_cache_txg = spa->spa_config_txg;
     3105 +                dmu_tx_t *tx;
     3106 +                int need_update = B_FALSE;
     3107 +                dsl_pool_t *dp = spa_get_dsl(spa);
3501 3108  
3502      -                ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT);
     3109 +                ASSERT(state != SPA_LOAD_TRYIMPORT);
3503 3110  
3504 3111                  /*
3505      -                 * Traverse the ZIL and claim all blocks.
     3112 +                 * Claim log blocks that haven't been committed yet.
     3113 +                 * This must all happen in a single txg.
     3114 +                 * Note: spa_claim_max_txg is updated by spa_claim_notify(),
     3115 +                 * invoked from zil_claim_log_block()'s i/o done callback.
     3116 +                 * Price of rollback is that we abandon the log.
3506 3117                   */
3507      -                spa_ld_claim_log_blocks(spa);
     3118 +                spa->spa_claiming = B_TRUE;
3508 3119  
3509      -                /*
3510      -                 * Kick-off the syncing thread.
3511      -                 */
     3120 +                tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
     3121 +                (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
     3122 +                    zil_claim, tx, DS_FIND_CHILDREN);
     3123 +                dmu_tx_commit(tx);
     3124 +
     3125 +                spa->spa_claiming = B_FALSE;
     3126 +
     3127 +                spa_set_log_state(spa, SPA_LOG_GOOD);
3512 3128                  spa->spa_sync_on = B_TRUE;
3513 3129                  txg_sync_start(spa->spa_dsl_pool);
3514 3130  
3515 3131                  /*
3516 3132                   * Wait for all claims to sync.  We sync up to the highest
3517 3133                   * claimed log block birth time so that claimed log blocks
3518 3134                   * don't appear to be from the future.  spa_claim_max_txg
3519      -                 * will have been set for us by ZIL traversal operations
3520      -                 * performed above.
     3135 +                 * will have been set for us by either zil_check_log_chain()
     3136 +                 * (invoked from spa_check_logs()) or zil_claim() above.
3521 3137                   */
3522 3138                  txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
3523 3139  
3524 3140                  /*
3525      -                 * Check if we need to request an update of the config. On the
3526      -                 * next sync, we would update the config stored in vdev labels
3527      -                 * and the cachefile (by default /etc/zfs/zpool.cache).
     3141 +                 * If the config cache is stale, or we have uninitialized
     3142 +                 * metaslabs (see spa_vdev_add()), then update the config.
     3143 +                 *
     3144 +                 * If this is a verbatim import, trust the current
     3145 +                 * in-core spa_config and update the disk labels.
3528 3146                   */
3529      -                spa_ld_check_for_config_update(spa, config_cache_txg,
3530      -                    reloading);
     3147 +                if (config_cache_txg != spa->spa_config_txg ||
     3148 +                    state == SPA_LOAD_IMPORT ||
     3149 +                    state == SPA_LOAD_RECOVER ||
     3150 +                    (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
     3151 +                        need_update = B_TRUE;
3531 3152  
     3153 +                for (int c = 0; c < rvd->vdev_children; c++)
     3154 +                        if (rvd->vdev_child[c]->vdev_ms_array == 0)
     3155 +                                need_update = B_TRUE;
     3156 +
3532 3157                  /*
     3158 +                 * Update the config cache asychronously in case we're the
     3159 +                 * root pool, in which case the config cache isn't writable yet.
     3160 +                 */
     3161 +                if (need_update)
     3162 +                        spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
     3163 +
     3164 +                /*
3533 3165                   * Check all DTLs to see if anything needs resilvering.
3534 3166                   */
3535 3167                  if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
3536      -                    vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
     3168 +                    vdev_resilver_needed(rvd, NULL, NULL))
3537 3169                          spa_async_request(spa, SPA_ASYNC_RESILVER);
3538 3170  
3539 3171                  /*
3540 3172                   * Log the fact that we booted up (so that we can detect if
3541 3173                   * we rebooted in the middle of an operation).
3542 3174                   */
3543 3175                  spa_history_log_version(spa, "open");
3544 3176  
3545      -                /*
3546      -                 * Delete any inconsistent datasets.
3547      -                 */
3548      -                (void) dmu_objset_find(spa_name(spa),
3549      -                    dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN);
     3177 +                dsl_destroy_inconsistent(spa_get_dsl(spa));
3550 3178  
3551 3179                  /*
3552 3180                   * Clean up any stale temporary dataset userrefs.
3553 3181                   */
3554 3182                  dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
3555      -
3556      -                spa_restart_removal(spa);
3557      -
3558      -                spa_spawn_aux_threads(spa);
3559 3183          }
3560 3184  
3561      -        spa_load_note(spa, "LOADED");
     3185 +        spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD);
3562 3186  
3563 3187          return (0);
3564 3188  }
3565 3189  
3566 3190  static int
3567      -spa_load_retry(spa_t *spa, spa_load_state_t state)
     3191 +spa_load_retry(spa_t *spa, spa_load_state_t state, int mosconfig)
3568 3192  {
3569 3193          int mode = spa->spa_mode;
3570 3194  
3571 3195          spa_unload(spa);
3572 3196          spa_deactivate(spa);
3573 3197  
3574 3198          spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
3575 3199  
3576 3200          spa_activate(spa, mode);
3577 3201          spa_async_suspend(spa);
3578 3202  
3579      -        spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
3580      -            (u_longlong_t)spa->spa_load_max_txg);
3581      -
3582      -        return (spa_load(spa, state, SPA_IMPORT_EXISTING));
     3203 +        return (spa_load(spa, state, SPA_IMPORT_EXISTING, mosconfig));
3583 3204  }
3584 3205  
3585 3206  /*
3586 3207   * If spa_load() fails this function will try loading prior txg's. If
3587 3208   * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
3588 3209   * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
3589 3210   * function will not rewind the pool and will return the same error as
3590 3211   * spa_load().
3591 3212   */
3592 3213  static int
3593      -spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request,
3594      -    int rewind_flags)
     3214 +spa_load_best(spa_t *spa, spa_load_state_t state, int mosconfig,
     3215 +    uint64_t max_request, int rewind_flags)
3595 3216  {
3596 3217          nvlist_t *loadinfo = NULL;
3597 3218          nvlist_t *config = NULL;
3598 3219          int load_error, rewind_error;
3599 3220          uint64_t safe_rewind_txg;
3600 3221          uint64_t min_txg;
3601 3222  
3602 3223          if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
3603 3224                  spa->spa_load_max_txg = spa->spa_load_txg;
3604 3225                  spa_set_log_state(spa, SPA_LOG_CLEAR);
3605 3226          } else {
3606 3227                  spa->spa_load_max_txg = max_request;
3607 3228                  if (max_request != UINT64_MAX)
3608 3229                          spa->spa_extreme_rewind = B_TRUE;
3609 3230          }
3610 3231  
3611      -        load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING);
     3232 +        load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING,
     3233 +            mosconfig);
3612 3234          if (load_error == 0)
3613 3235                  return (0);
3614 3236  
3615 3237          if (spa->spa_root_vdev != NULL)
3616 3238                  config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3617 3239  
3618 3240          spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
3619 3241          spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
3620 3242  
3621 3243          if (rewind_flags & ZPOOL_NEVER_REWIND) {
3622 3244                  nvlist_free(config);
3623 3245                  return (load_error);
3624 3246          }
3625 3247  
3626 3248          if (state == SPA_LOAD_RECOVER) {
3627 3249                  /* Price of rolling back is discarding txgs, including log */
3628 3250                  spa_set_log_state(spa, SPA_LOG_CLEAR);
3629 3251          } else {
3630 3252                  /*
3631 3253                   * If we aren't rolling back save the load info from our first
3632 3254                   * import attempt so that we can restore it after attempting
3633 3255                   * to rewind.
3634 3256                   */
3635 3257                  loadinfo = spa->spa_load_info;
3636 3258                  spa->spa_load_info = fnvlist_alloc();
3637 3259          }
3638 3260  
3639 3261          spa->spa_load_max_txg = spa->spa_last_ubsync_txg;
3640 3262          safe_rewind_txg = spa->spa_last_ubsync_txg - TXG_DEFER_SIZE;
3641 3263          min_txg = (rewind_flags & ZPOOL_EXTREME_REWIND) ?
  
    [20 lines elided]
  
3642 3264              TXG_INITIAL : safe_rewind_txg;
3643 3265  
3644 3266          /*
3645 3267           * Continue as long as we're finding errors, we're still within
3646 3268           * the acceptable rewind range, and we're still finding uberblocks
3647 3269           */
3648 3270          while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
3649 3271              spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
3650 3272                  if (spa->spa_load_max_txg < safe_rewind_txg)
3651 3273                          spa->spa_extreme_rewind = B_TRUE;
3652      -                rewind_error = spa_load_retry(spa, state);
     3274 +                rewind_error = spa_load_retry(spa, state, mosconfig);
3653 3275          }
3654 3276  
3655 3277          spa->spa_extreme_rewind = B_FALSE;
3656 3278          spa->spa_load_max_txg = UINT64_MAX;
3657 3279  
3658 3280          if (config && (rewind_error || state != SPA_LOAD_RECOVER))
3659 3281                  spa_config_set(spa, config);
3660 3282          else
3661 3283                  nvlist_free(config);
3662 3284  
3663 3285          if (state == SPA_LOAD_RECOVER) {
3664 3286                  ASSERT3P(loadinfo, ==, NULL);
3665 3287                  return (rewind_error);
3666 3288          } else {
3667 3289                  /* Store the rewind info as part of the initial load info */
3668 3290                  fnvlist_add_nvlist(loadinfo, ZPOOL_CONFIG_REWIND_INFO,
3669 3291                      spa->spa_load_info);
3670 3292  
3671 3293                  /* Restore the initial load info */
3672 3294                  fnvlist_free(spa->spa_load_info);
3673 3295                  spa->spa_load_info = loadinfo;
3674 3296  
3675 3297                  return (load_error);
3676 3298          }
3677 3299  }
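To make the rewind window above concrete, a hedged worked example (the arithmetic assumes the stock TXG_DEFER_SIZE of 2): if the newest synced uberblock is txg 1000, then spa_last_ubsync_txg is 1000 and safe_rewind_txg is 1000 - 2 = 998; the first spa_load_retry() caps spa_load_max_txg at 999 so the next-older uberblock is tried. By default the loop gives up once ub_txg drops below 998, while ZPOOL_EXTREME_REWIND lowers that floor to TXG_INITIAL.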
3678 3300  
3679 3301  /*
3680 3302   * Pool Open/Import
3681 3303   *
3682 3304   * The import case is identical to an open except that the configuration is sent
3683 3305   * down from userland, instead of grabbed from the configuration cache.  For the
3684 3306   * case of an open, the pool configuration will exist in the
3685 3307   * POOL_STATE_UNINITIALIZED state.
3686 3308   *
3687 3309   * The stats information (gen/count/ustats) is used to gather vdev statistics at
3688 3310   * the same time open the pool, without having to keep around the spa_t in some
  
    [26 lines elided]
  
3689 3311   * ambiguous state.
3690 3312   */
3691 3313  static int
3692 3314  spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
3693 3315      nvlist_t **config)
3694 3316  {
3695 3317          spa_t *spa;
3696 3318          spa_load_state_t state = SPA_LOAD_OPEN;
3697 3319          int error;
3698 3320          int locked = B_FALSE;
     3321 +        boolean_t open_with_activation = B_FALSE;
3699 3322  
3700 3323          *spapp = NULL;
3701 3324  
3702 3325          /*
3703 3326           * As disgusting as this is, we need to support recursive calls to this
3704 3327           * function because dsl_dir_open() is called during spa_load(), and ends
3705 3328           * up calling spa_open() again.  The real fix is to figure out how to
3706 3329           * avoid dsl_dir_open() calling this in the first place.
3707 3330           */
3708 3331          if (mutex_owner(&spa_namespace_lock) != curthread) {
3709 3332                  mutex_enter(&spa_namespace_lock);
3710 3333                  locked = B_TRUE;
3711 3334          }
3712 3335  
3713 3336          if ((spa = spa_lookup(pool)) == NULL) {
3714 3337                  if (locked)
3715 3338                          mutex_exit(&spa_namespace_lock);
3716 3339                  return (SET_ERROR(ENOENT));
3717 3340          }
3718 3341  
3719 3342          if (spa->spa_state == POOL_STATE_UNINITIALIZED) {
3720 3343                  zpool_rewind_policy_t policy;
  
    [12 lines elided]
  
3721 3344  
3722 3345                  zpool_get_rewind_policy(nvpolicy ? nvpolicy : spa->spa_config,
3723 3346                      &policy);
3724 3347                  if (policy.zrp_request & ZPOOL_DO_REWIND)
3725 3348                          state = SPA_LOAD_RECOVER;
3726 3349  
3727 3350                  spa_activate(spa, spa_mode_global);
3728 3351  
3729 3352                  if (state != SPA_LOAD_RECOVER)
3730 3353                          spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
3731      -                spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
3732 3354  
3733      -                zfs_dbgmsg("spa_open_common: opening %s", pool);
3734      -                error = spa_load_best(spa, state, policy.zrp_txg,
     3355 +                error = spa_load_best(spa, state, B_FALSE, policy.zrp_txg,
3735 3356                      policy.zrp_request);
3736 3357  
3737 3358                  if (error == EBADF) {
3738 3359                          /*
3739 3360                           * If vdev_validate() returns failure (indicated by
3740 3361                           * EBADF), one of the vdevs is reporting that the pool
3741 3362                           * has been exported or destroyed.  If
3742 3363                           * this is the case, the config cache is out of sync and
3743 3364                           * we should remove the pool from the namespace.
3744 3365                           */
3745 3366                          spa_unload(spa);
3746 3367                          spa_deactivate(spa);
3747      -                        spa_write_cachefile(spa, B_TRUE, B_TRUE);
     3368 +                        spa_config_sync(spa, B_TRUE, B_TRUE);
3748 3369                          spa_remove(spa);
3749 3370                          if (locked)
3750 3371                                  mutex_exit(&spa_namespace_lock);
3751 3372                          return (SET_ERROR(ENOENT));
3752 3373                  }
3753 3374  
3754 3375                  if (error) {
3755 3376                          /*
3756 3377                           * We can't open the pool, but we still have useful
3757 3378                           * information: the state of each vdev after the
3758 3379                           * attempted vdev_open().  Return this to the user.
3759 3380                           */
3760 3381                          if (config != NULL && spa->spa_config) {
3761 3382                                  VERIFY(nvlist_dup(spa->spa_config, config,
3762 3383                                      KM_SLEEP) == 0);
3763 3384                                  VERIFY(nvlist_add_nvlist(*config,
3764 3385                                      ZPOOL_CONFIG_LOAD_INFO,
  
    [7 lines elided]
  
3765 3386                                      spa->spa_load_info) == 0);
3766 3387                          }
3767 3388                          spa_unload(spa);
3768 3389                          spa_deactivate(spa);
3769 3390                          spa->spa_last_open_failed = error;
3770 3391                          if (locked)
3771 3392                                  mutex_exit(&spa_namespace_lock);
3772 3393                          *spapp = NULL;
3773 3394                          return (error);
3774 3395                  }
     3396 +
     3397 +                open_with_activation = B_TRUE;
3775 3398          }
3776 3399  
3777 3400          spa_open_ref(spa, tag);
3778 3401  
3779 3402          if (config != NULL)
3780 3403                  *config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3781 3404  
3782 3405          /*
3783 3406           * If we've recovered the pool, pass back any information we
3784 3407           * gathered while doing the load.
3785 3408           */
3786 3409          if (state == SPA_LOAD_RECOVER) {
3787 3410                  VERIFY(nvlist_add_nvlist(*config, ZPOOL_CONFIG_LOAD_INFO,
  
    [3 lines elided]
  
3788 3411                      spa->spa_load_info) == 0);
3789 3412          }
3790 3413  
3791 3414          if (locked) {
3792 3415                  spa->spa_last_open_failed = 0;
3793 3416                  spa->spa_last_ubsync_txg = 0;
3794 3417                  spa->spa_load_txg = 0;
3795 3418                  mutex_exit(&spa_namespace_lock);
3796 3419          }
3797 3420  
     3421 +        if (open_with_activation)
     3422 +                wbc_activate(spa, B_FALSE);
     3423 +
3798 3424          *spapp = spa;
3799 3425  
3800 3426          return (0);
3801 3427  }
3802 3428  
3803 3429  int
3804 3430  spa_open_rewind(const char *name, spa_t **spapp, void *tag, nvlist_t *policy,
3805 3431      nvlist_t **config)
3806 3432  {
3807 3433          return (spa_open_common(name, spapp, tag, policy, config));
3808 3434  }
3809 3435  
3810 3436  int
3811 3437  spa_open(const char *name, spa_t **spapp, void *tag)
3812 3438  {
3813 3439          return (spa_open_common(name, spapp, tag, NULL, NULL));
3814 3440  }
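As a quick orientation for how these entry points are consumed, a minimal caller sketch follows; the function name is hypothetical, error handling is trimmed, and it assumes the usual FTAG tagging convention used elsewhere in this file:

        static int
        example_with_pool(const char *name)
        {
                spa_t *spa;
                int error;

                /* Takes a reference on the pool, loading it first if necessary. */
                if ((error = spa_open(name, &spa, FTAG)) != 0)
                        return (error);

                /* ... use the pool while the reference is held ... */

                spa_close(spa, FTAG);   /* drop the reference taken by spa_open() */
                return (0);
        }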
3815 3441  
3816 3442  /*
3817 3443   * Lookup the given spa_t, incrementing the inject count in the process,
3818 3444   * preventing it from being exported or destroyed.
3819 3445   */
3820 3446  spa_t *
3821 3447  spa_inject_addref(char *name)
3822 3448  {
3823 3449          spa_t *spa;
3824 3450  
3825 3451          mutex_enter(&spa_namespace_lock);
3826 3452          if ((spa = spa_lookup(name)) == NULL) {
3827 3453                  mutex_exit(&spa_namespace_lock);
3828 3454                  return (NULL);
3829 3455          }
3830 3456          spa->spa_inject_ref++;
3831 3457          mutex_exit(&spa_namespace_lock);
3832 3458  
3833 3459          return (spa);
3834 3460  }
3835 3461  
3836 3462  void
3837 3463  spa_inject_delref(spa_t *spa)
3838 3464  {
3839 3465          mutex_enter(&spa_namespace_lock);
3840 3466          spa->spa_inject_ref--;
3841 3467          mutex_exit(&spa_namespace_lock);
3842 3468  }
3843 3469  
3844 3470  /*
3845 3471   * Add spares device information to the nvlist.
3846 3472   */
3847 3473  static void
3848 3474  spa_add_spares(spa_t *spa, nvlist_t *config)
3849 3475  {
3850 3476          nvlist_t **spares;
3851 3477          uint_t i, nspares;
3852 3478          nvlist_t *nvroot;
3853 3479          uint64_t guid;
3854 3480          vdev_stat_t *vs;
3855 3481          uint_t vsc;
3856 3482          uint64_t pool;
3857 3483  
3858 3484          ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3859 3485  
3860 3486          if (spa->spa_spares.sav_count == 0)
3861 3487                  return;
3862 3488  
3863 3489          VERIFY(nvlist_lookup_nvlist(config,
3864 3490              ZPOOL_CONFIG_VDEV_TREE, &nvroot) == 0);
3865 3491          VERIFY(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
3866 3492              ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
3867 3493          if (nspares != 0) {
3868 3494                  VERIFY(nvlist_add_nvlist_array(nvroot,
3869 3495                      ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
3870 3496                  VERIFY(nvlist_lookup_nvlist_array(nvroot,
3871 3497                      ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
3872 3498  
3873 3499                  /*
3874 3500                   * Go through and find any spares which have since been
3875 3501                   * repurposed as an active spare.  If this is the case, update
3876 3502                   * their status appropriately.
3877 3503                   */
3878 3504                  for (i = 0; i < nspares; i++) {
3879 3505                          VERIFY(nvlist_lookup_uint64(spares[i],
3880 3506                              ZPOOL_CONFIG_GUID, &guid) == 0);
3881 3507                          if (spa_spare_exists(guid, &pool, NULL) &&
3882 3508                              pool != 0ULL) {
3883 3509                                  VERIFY(nvlist_lookup_uint64_array(
3884 3510                                      spares[i], ZPOOL_CONFIG_VDEV_STATS,
3885 3511                                      (uint64_t **)&vs, &vsc) == 0);
3886 3512                                  vs->vs_state = VDEV_STATE_CANT_OPEN;
3887 3513                                  vs->vs_aux = VDEV_AUX_SPARED;
3888 3514                          }
3889 3515                  }
3890 3516          }
3891 3517  }
3892 3518  
3893 3519  /*
3894 3520   * Add l2cache device information to the nvlist, including vdev stats.
3895 3521   */
3896 3522  static void
3897 3523  spa_add_l2cache(spa_t *spa, nvlist_t *config)
3898 3524  {
3899 3525          nvlist_t **l2cache;
3900 3526          uint_t i, j, nl2cache;
3901 3527          nvlist_t *nvroot;
3902 3528          uint64_t guid;
3903 3529          vdev_t *vd;
3904 3530          vdev_stat_t *vs;
3905 3531          uint_t vsc;
3906 3532  
3907 3533          ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3908 3534  
3909 3535          if (spa->spa_l2cache.sav_count == 0)
3910 3536                  return;
3911 3537  
3912 3538          VERIFY(nvlist_lookup_nvlist(config,
3913 3539              ZPOOL_CONFIG_VDEV_TREE, &nvroot) == 0);
3914 3540          VERIFY(nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
3915 3541              ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
3916 3542          if (nl2cache != 0) {
3917 3543                  VERIFY(nvlist_add_nvlist_array(nvroot,
3918 3544                      ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
3919 3545                  VERIFY(nvlist_lookup_nvlist_array(nvroot,
3920 3546                      ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
3921 3547  
3922 3548                  /*
3923 3549                   * Update level 2 cache device stats.
3924 3550                   */
3925 3551  
3926 3552                  for (i = 0; i < nl2cache; i++) {
3927 3553                          VERIFY(nvlist_lookup_uint64(l2cache[i],
3928 3554                              ZPOOL_CONFIG_GUID, &guid) == 0);
3929 3555  
3930 3556                          vd = NULL;
3931 3557                          for (j = 0; j < spa->spa_l2cache.sav_count; j++) {
3932 3558                                  if (guid ==
3933 3559                                      spa->spa_l2cache.sav_vdevs[j]->vdev_guid) {
3934 3560                                          vd = spa->spa_l2cache.sav_vdevs[j];
3935 3561                                          break;
3936 3562                                  }
3937 3563                          }
3938 3564                          ASSERT(vd != NULL);
3939 3565  
3940 3566                          VERIFY(nvlist_lookup_uint64_array(l2cache[i],
3941 3567                              ZPOOL_CONFIG_VDEV_STATS, (uint64_t **)&vs, &vsc)
3942 3568                              == 0);
3943 3569                          vdev_get_stats(vd, vs);
3944 3570                  }
3945 3571          }
3946 3572  }
3947 3573  
3948 3574  static void
3949 3575  spa_add_feature_stats(spa_t *spa, nvlist_t *config)
3950 3576  {
3951 3577          nvlist_t *features;
3952 3578          zap_cursor_t zc;
3953 3579          zap_attribute_t za;
3954 3580  
3955 3581          ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3956 3582          VERIFY(nvlist_alloc(&features, NV_UNIQUE_NAME, KM_SLEEP) == 0);
3957 3583  
3958 3584          if (spa->spa_feat_for_read_obj != 0) {
3959 3585                  for (zap_cursor_init(&zc, spa->spa_meta_objset,
3960 3586                      spa->spa_feat_for_read_obj);
3961 3587                      zap_cursor_retrieve(&zc, &za) == 0;
3962 3588                      zap_cursor_advance(&zc)) {
3963 3589                          ASSERT(za.za_integer_length == sizeof (uint64_t) &&
3964 3590                              za.za_num_integers == 1);
3965 3591                          VERIFY3U(0, ==, nvlist_add_uint64(features, za.za_name,
3966 3592                              za.za_first_integer));
3967 3593                  }
3968 3594                  zap_cursor_fini(&zc);
3969 3595          }
3970 3596  
3971 3597          if (spa->spa_feat_for_write_obj != 0) {
3972 3598                  for (zap_cursor_init(&zc, spa->spa_meta_objset,
3973 3599                      spa->spa_feat_for_write_obj);
3974 3600                      zap_cursor_retrieve(&zc, &za) == 0;
3975 3601                      zap_cursor_advance(&zc)) {
3976 3602                          ASSERT(za.za_integer_length == sizeof (uint64_t) &&
3977 3603                              za.za_num_integers == 1);
3978 3604                          VERIFY3U(0, ==, nvlist_add_uint64(features, za.za_name,
3979 3605                              za.za_first_integer));
3980 3606                  }
3981 3607                  zap_cursor_fini(&zc);
3982 3608          }
3983 3609  
3984 3610          VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_FEATURE_STATS,
3985 3611              features) == 0);
3986 3612          nvlist_free(features);
3987 3613  }
3988 3614  
3989 3615  int
3990 3616  spa_get_stats(const char *name, nvlist_t **config,
3991 3617      char *altroot, size_t buflen)
3992 3618  {
3993 3619          int error;
3994 3620          spa_t *spa;
3995 3621  
3996 3622          *config = NULL;
3997 3623          error = spa_open_common(name, &spa, FTAG, NULL, config);
3998 3624  
3999 3625          if (spa != NULL) {
4000 3626                  /*
4001 3627                   * This still leaves a window of inconsistency where the spares
4002 3628                   * or l2cache devices could change and the config would be
4003 3629                   * self-inconsistent.
4004 3630                   */
4005 3631                  spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
4006 3632  
4007 3633                  if (*config != NULL) {
4008 3634                          uint64_t loadtimes[2];
4009 3635  
4010 3636                          loadtimes[0] = spa->spa_loaded_ts.tv_sec;
4011 3637                          loadtimes[1] = spa->spa_loaded_ts.tv_nsec;
4012 3638                          VERIFY(nvlist_add_uint64_array(*config,
4013 3639                              ZPOOL_CONFIG_LOADED_TIME, loadtimes, 2) == 0);
4014 3640  
4015 3641                          VERIFY(nvlist_add_uint64(*config,
4016 3642                              ZPOOL_CONFIG_ERRCOUNT,
4017 3643                              spa_get_errlog_size(spa)) == 0);
4018 3644  
4019 3645                          if (spa_suspended(spa))
4020 3646                                  VERIFY(nvlist_add_uint64(*config,
4021 3647                                      ZPOOL_CONFIG_SUSPENDED,
4022 3648                                      spa->spa_failmode) == 0);
4023 3649  
4024 3650                          spa_add_spares(spa, *config);
4025 3651                          spa_add_l2cache(spa, *config);
4026 3652                          spa_add_feature_stats(spa, *config);
4027 3653                  }
4028 3654          }
4029 3655  
4030 3656          /*
4031 3657           * We want to get the alternate root even for faulted pools, so we cheat
4032 3658           * and call spa_lookup() directly.
4033 3659           */
4034 3660          if (altroot) {
4035 3661                  if (spa == NULL) {
4036 3662                          mutex_enter(&spa_namespace_lock);
4037 3663                          spa = spa_lookup(name);
4038 3664                          if (spa)
4039 3665                                  spa_altroot(spa, altroot, buflen);
4040 3666                          else
4041 3667                                  altroot[0] = '\0';
4042 3668                          spa = NULL;
4043 3669                          mutex_exit(&spa_namespace_lock);
4044 3670                  } else {
4045 3671                          spa_altroot(spa, altroot, buflen);
4046 3672                  }
4047 3673          }
4048 3674  
4049 3675          if (spa != NULL) {
4050 3676                  spa_config_exit(spa, SCL_CONFIG, FTAG);
4051 3677                  spa_close(spa, FTAG);
4052 3678          }
4053 3679  
4054 3680          return (error);
4055 3681  }
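For reference, a hedged sketch of how a caller typically consumes spa_get_stats(); the pool name is a placeholder, and a config nvlist may be handed back even when an error is returned (the faulted-pool case described above):

        nvlist_t *config = NULL;
        char altroot[MAXPATHLEN];
        int error;

        error = spa_get_stats("tank", &config, altroot, sizeof (altroot));
        if (config != NULL) {
                /* ... inspect the vdev tree, error counts, feature stats ... */
                nvlist_free(config);    /* the caller owns the returned nvlist */
        }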
4056 3682  
4057 3683  /*
4058 3684   * Validate that the auxiliary device array is well formed.  We must have an
4059 3685   * array of nvlists, each of which describes a valid leaf vdev.  If this is an
4060 3686   * import (mode is VDEV_ALLOC_SPARE), then we allow corrupted spares to be
4061 3687   * specified, as long as they are well-formed.
4062 3688   */
4063 3689  static int
4064 3690  spa_validate_aux_devs(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode,
4065 3691      spa_aux_vdev_t *sav, const char *config, uint64_t version,
4066 3692      vdev_labeltype_t label)
4067 3693  {
4068 3694          nvlist_t **dev;
4069 3695          uint_t i, ndev;
4070 3696          vdev_t *vd;
4071 3697          int error;
4072 3698  
4073 3699          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
4074 3700  
4075 3701          /*
4076 3702           * It's acceptable to have no devs specified.
4077 3703           */
4078 3704          if (nvlist_lookup_nvlist_array(nvroot, config, &dev, &ndev) != 0)
4079 3705                  return (0);
4080 3706  
4081 3707          if (ndev == 0)
4082 3708                  return (SET_ERROR(EINVAL));
4083 3709  
4084 3710          /*
4085 3711           * Make sure the pool is formatted with a version that supports this
4086 3712           * device type.
4087 3713           */
4088 3714          if (spa_version(spa) < version)
4089 3715                  return (SET_ERROR(ENOTSUP));
4090 3716  
4091 3717          /*
4092 3718           * Set the pending device list so we correctly handle device in-use
4093 3719           * checking.
4094 3720           */
4095 3721          sav->sav_pending = dev;
4096 3722          sav->sav_npending = ndev;
4097 3723  
4098 3724          for (i = 0; i < ndev; i++) {
4099 3725                  if ((error = spa_config_parse(spa, &vd, dev[i], NULL, 0,
4100 3726                      mode)) != 0)
4101 3727                          goto out;
4102 3728  
4103 3729                  if (!vd->vdev_ops->vdev_op_leaf) {
4104 3730                          vdev_free(vd);
4105 3731                          error = SET_ERROR(EINVAL);
4106 3732                          goto out;
4107 3733                  }
4108 3734  
4109 3735                  /*
4110 3736                   * The L2ARC currently only supports disk devices in
4111 3737                   * kernel context.  For user-level testing, we allow it.
4112 3738                   */
4113 3739  #ifdef _KERNEL
4114 3740                  if ((strcmp(config, ZPOOL_CONFIG_L2CACHE) == 0) &&
4115 3741                      strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) != 0) {
4116 3742                          error = SET_ERROR(ENOTBLK);
4117 3743                          vdev_free(vd);
4118 3744                          goto out;
4119 3745                  }
4120 3746  #endif
4121 3747                  vd->vdev_top = vd;
4122 3748  
4123 3749                  if ((error = vdev_open(vd)) == 0 &&
4124 3750                      (error = vdev_label_init(vd, crtxg, label)) == 0) {
4125 3751                          VERIFY(nvlist_add_uint64(dev[i], ZPOOL_CONFIG_GUID,
4126 3752                              vd->vdev_guid) == 0);
4127 3753                  }
4128 3754  
4129 3755                  vdev_free(vd);
4130 3756  
4131 3757                  if (error &&
4132 3758                      (mode != VDEV_ALLOC_SPARE && mode != VDEV_ALLOC_L2CACHE))
4133 3759                          goto out;
4134 3760                  else
4135 3761                          error = 0;
4136 3762          }
4137 3763  
4138 3764  out:
4139 3765          sav->sav_pending = NULL;
4140 3766          sav->sav_npending = 0;
4141 3767          return (error);
4142 3768  }
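To illustrate the input shape this routine validates, a hedged, free-standing sketch of an nvroot carrying a single spare entry; the device path is a placeholder:

        nvlist_t *spare, *nvroot;

        VERIFY(nvlist_alloc(&spare, NV_UNIQUE_NAME, KM_SLEEP) == 0);
        VERIFY(nvlist_add_string(spare, ZPOOL_CONFIG_TYPE, VDEV_TYPE_DISK) == 0);
        VERIFY(nvlist_add_string(spare, ZPOOL_CONFIG_PATH,
            "/dev/dsk/c0t1d0s0") == 0);

        VERIFY(nvlist_alloc(&nvroot, NV_UNIQUE_NAME, KM_SLEEP) == 0);
        VERIFY(nvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
            &spare, 1) == 0);

        /* nvroot would then be passed as the 'nvroot' argument above. */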
4143 3769  
4144 3770  static int
4145 3771  spa_validate_aux(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode)
4146 3772  {
4147 3773          int error;
4148 3774  
4149 3775          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
4150 3776  
4151 3777          if ((error = spa_validate_aux_devs(spa, nvroot, crtxg, mode,
4152 3778              &spa->spa_spares, ZPOOL_CONFIG_SPARES, SPA_VERSION_SPARES,
4153 3779              VDEV_LABEL_SPARE)) != 0) {
4154 3780                  return (error);
4155 3781          }
4156 3782  
4157 3783          return (spa_validate_aux_devs(spa, nvroot, crtxg, mode,
4158 3784              &spa->spa_l2cache, ZPOOL_CONFIG_L2CACHE, SPA_VERSION_L2CACHE,
4159 3785              VDEV_LABEL_L2CACHE));
4160 3786  }
4161 3787  
4162 3788  static void
4163 3789  spa_set_aux_vdevs(spa_aux_vdev_t *sav, nvlist_t **devs, int ndevs,
4164 3790      const char *config)
4165 3791  {
4166 3792          int i;
4167 3793  
4168 3794          if (sav->sav_config != NULL) {
4169 3795                  nvlist_t **olddevs;
4170 3796                  uint_t oldndevs;
4171 3797                  nvlist_t **newdevs;
4172 3798  
4173 3799                  /*
4174 3800                   * Generate a new dev list by concatenating with the
4175 3801                   * current dev list.
4176 3802                   */
4177 3803                  VERIFY(nvlist_lookup_nvlist_array(sav->sav_config, config,
4178 3804                      &olddevs, &oldndevs) == 0);
4179 3805  
4180 3806                  newdevs = kmem_alloc(sizeof (void *) *
4181 3807                      (ndevs + oldndevs), KM_SLEEP);
4182 3808                  for (i = 0; i < oldndevs; i++)
4183 3809                          VERIFY(nvlist_dup(olddevs[i], &newdevs[i],
4184 3810                              KM_SLEEP) == 0);
4185 3811                  for (i = 0; i < ndevs; i++)
4186 3812                          VERIFY(nvlist_dup(devs[i], &newdevs[i + oldndevs],
4187 3813                              KM_SLEEP) == 0);
4188 3814  
4189 3815                  VERIFY(nvlist_remove(sav->sav_config, config,
4190 3816                      DATA_TYPE_NVLIST_ARRAY) == 0);
4191 3817  
4192 3818                  VERIFY(nvlist_add_nvlist_array(sav->sav_config,
4193 3819                      config, newdevs, ndevs + oldndevs) == 0);
4194 3820                  for (i = 0; i < oldndevs + ndevs; i++)
4195 3821                          nvlist_free(newdevs[i]);
4196 3822                  kmem_free(newdevs, (oldndevs + ndevs) * sizeof (void *));
4197 3823          } else {
4198 3824                  /*
4199 3825                   * Generate a new dev list.
4200 3826                   */
4201 3827                  VERIFY(nvlist_alloc(&sav->sav_config, NV_UNIQUE_NAME,
4202 3828                      KM_SLEEP) == 0);
4203 3829                  VERIFY(nvlist_add_nvlist_array(sav->sav_config, config,
4204 3830                      devs, ndevs) == 0);
4205 3831          }
4206 3832  }
4207 3833  
4208 3834  /*
4209 3835   * Stop and drop level 2 ARC devices
4210 3836   */
4211 3837  void
4212 3838  spa_l2cache_drop(spa_t *spa)
4213 3839  {
4214 3840          vdev_t *vd;
4215 3841          int i;
4216 3842          spa_aux_vdev_t *sav = &spa->spa_l2cache;
4217 3843  
4218 3844          for (i = 0; i < sav->sav_count; i++) {
4219 3845                  uint64_t pool;
4220 3846  
4221 3847                  vd = sav->sav_vdevs[i];
4222 3848                  ASSERT(vd != NULL);
4223 3849  
4224 3850                  if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
4225 3851                      pool != 0ULL && l2arc_vdev_present(vd))
4226 3852                          l2arc_remove_vdev(vd);
4227 3853          }
4228 3854  }
4229 3855  
4230 3856  /*
4231 3857   * Pool Creation
4232 3858   */
4233 3859  int
4234 3860  spa_create(const char *pool, nvlist_t *nvroot, nvlist_t *props,
4235 3861      nvlist_t *zplprops)
4236 3862  {
  
    [429 lines elided]
  
4237 3863          spa_t *spa;
4238 3864          char *altroot = NULL;
4239 3865          vdev_t *rvd;
4240 3866          dsl_pool_t *dp;
4241 3867          dmu_tx_t *tx;
4242 3868          int error = 0;
4243 3869          uint64_t txg = TXG_INITIAL;
4244 3870          nvlist_t **spares, **l2cache;
4245 3871          uint_t nspares, nl2cache;
4246 3872          uint64_t version, obj;
4247      -        boolean_t has_features;
     3873 +        boolean_t has_features = B_FALSE, wbc_feature_exists = B_FALSE;
     3874 +        spa_meta_placement_t *mp;
4248 3875  
4249 3876          /*
4250 3877           * If this pool already exists, return failure.
4251 3878           */
4252 3879          mutex_enter(&spa_namespace_lock);
4253 3880          if (spa_lookup(pool) != NULL) {
4254 3881                  mutex_exit(&spa_namespace_lock);
4255 3882                  return (SET_ERROR(EEXIST));
4256 3883          }
4257 3884  
4258 3885          /*
4259 3886           * Allocate a new spa_t structure.
4260 3887           */
4261 3888          (void) nvlist_lookup_string(props,
4262 3889              zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4263 3890          spa = spa_add(pool, NULL, altroot);
4264 3891          spa_activate(spa, spa_mode_global);
4265 3892  
4266      -        if (props && (error = spa_prop_validate(spa, props))) {
4267      -                spa_deactivate(spa);
4268      -                spa_remove(spa);
4269      -                mutex_exit(&spa_namespace_lock);
4270      -                return (error);
4271      -        }
     3893 +        if (props != NULL) {
     3894 +                nvpair_t *wbc_feature_nvp = NULL;
4272 3895  
4273      -        has_features = B_FALSE;
4274      -        for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
4275      -            elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
4276      -                if (zpool_prop_feature(nvpair_name(elem)))
4277      -                        has_features = B_TRUE;
     3896 +                for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
     3897 +                    elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
     3898 +                        const char *propname = nvpair_name(elem);
     3899 +                        if (zpool_prop_feature(propname)) {
     3900 +                                spa_feature_t feature;
     3901 +                                int err;
     3902 +                                const char *fname = strchr(propname, '@') + 1;
     3903 +
     3904 +                                err = zfeature_lookup_name(fname, &feature);
     3905 +                                if (err == 0 && feature == SPA_FEATURE_WBC) {
     3906 +                                        wbc_feature_nvp = elem;
     3907 +                                        wbc_feature_exists = B_TRUE;
     3908 +                                }
     3909 +
     3910 +                                has_features = B_TRUE;
     3911 +                        }
     3912 +                }
     3913 +
     3914 +                /*
     3915 +                 * We do not want to enable feature@wbc if this pool
     3916 +                 * does not have a special vdev.  Remove the feature
     3917 +                 * from the common list at this stage; it will be
     3918 +                 * enabled later, once we have verified that a special
     3919 +                 * vdev is available.
     3920 +                 */
     3921 +                if (wbc_feature_nvp != NULL)
     3922 +                        fnvlist_remove_nvpair(props, wbc_feature_nvp);
     3923 +
     3924 +                if ((error = spa_prop_validate(spa, props)) != 0) {
     3925 +                        spa_deactivate(spa);
     3926 +                        spa_remove(spa);
     3927 +                        mutex_exit(&spa_namespace_lock);
     3928 +                        return (error);
     3929 +                }
4278 3930          }
4279 3931  
     3932 +
4280 3933          if (has_features || nvlist_lookup_uint64(props,
4281 3934              zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
4282 3935                  version = SPA_VERSION;
4283 3936          }
4284 3937          ASSERT(SPA_VERSION_IS_SUPPORTED(version));
4285 3938  
4286 3939          spa->spa_first_txg = txg;
4287 3940          spa->spa_uberblock.ub_txg = txg - 1;
4288 3941          spa->spa_uberblock.ub_version = version;
4289 3942          spa->spa_ubsync = spa->spa_uberblock;
4290 3943          spa->spa_load_state = SPA_LOAD_CREATE;
4291      -        spa->spa_removing_phys.sr_state = DSS_NONE;
4292      -        spa->spa_removing_phys.sr_removing_vdev = -1;
4293      -        spa->spa_removing_phys.sr_prev_indirect_vdev = -1;
4294 3944  
4295 3945          /*
4296 3946           * Create "The Godfather" zio to hold all async IOs
4297 3947           */
4298 3948          spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
4299 3949              KM_SLEEP);
4300 3950          for (int i = 0; i < max_ncpus; i++) {
4301 3951                  spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
4302 3952                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
4303 3953                      ZIO_FLAG_GODFATHER);
4304 3954          }
4305 3955  
4306 3956          /*
4307 3957           * Create the root vdev.
4308 3958           */
4309 3959          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4310 3960  
4311 3961          error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, VDEV_ALLOC_ADD);
4312 3962  
4313 3963          ASSERT(error != 0 || rvd != NULL);
4314 3964          ASSERT(error != 0 || spa->spa_root_vdev == rvd);
4315 3965  
4316 3966          if (error == 0 && !zfs_allocatable_devs(nvroot))
4317 3967                  error = SET_ERROR(EINVAL);
4318 3968  
4319 3969          if (error == 0 &&
4320 3970              (error = vdev_create(rvd, txg, B_FALSE)) == 0 &&
4321 3971              (error = spa_validate_aux(spa, nvroot, txg,
4322 3972              VDEV_ALLOC_ADD)) == 0) {
4323 3973                  for (int c = 0; c < rvd->vdev_children; c++) {
4324 3974                          vdev_metaslab_set_size(rvd->vdev_child[c]);
4325 3975                          vdev_expand(rvd->vdev_child[c], txg);
4326 3976                  }
4327 3977          }
4328 3978  
4329 3979          spa_config_exit(spa, SCL_ALL, FTAG);
4330 3980  
4331 3981          if (error != 0) {
4332 3982                  spa_unload(spa);
4333 3983                  spa_deactivate(spa);
4334 3984                  spa_remove(spa);
4335 3985                  mutex_exit(&spa_namespace_lock);
4336 3986                  return (error);
4337 3987          }
4338 3988  
4339 3989          /*
4340 3990           * Get the list of spares, if specified.
4341 3991           */
4342 3992          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4343 3993              &spares, &nspares) == 0) {
4344 3994                  VERIFY(nvlist_alloc(&spa->spa_spares.sav_config, NV_UNIQUE_NAME,
4345 3995                      KM_SLEEP) == 0);
4346 3996                  VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4347 3997                      ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4348 3998                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4349 3999                  spa_load_spares(spa);
4350 4000                  spa_config_exit(spa, SCL_ALL, FTAG);
4351 4001                  spa->spa_spares.sav_sync = B_TRUE;
4352 4002          }
4353 4003  
4354 4004          /*
4355 4005           * Get the list of level 2 cache devices, if specified.
4356 4006           */
4357 4007          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4358 4008              &l2cache, &nl2cache) == 0) {
4359 4009                  VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4360 4010                      NV_UNIQUE_NAME, KM_SLEEP) == 0);
4361 4011                  VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4362 4012                      ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4363 4013                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4364 4014                  spa_load_l2cache(spa);
4365 4015                  spa_config_exit(spa, SCL_ALL, FTAG);
4366 4016                  spa->spa_l2cache.sav_sync = B_TRUE;
4367 4017          }
4368 4018  
4369 4019          spa->spa_is_initializing = B_TRUE;
4370 4020          spa->spa_dsl_pool = dp = dsl_pool_create(spa, zplprops, txg);
4371 4021          spa->spa_meta_objset = dp->dp_meta_objset;
4372 4022          spa->spa_is_initializing = B_FALSE;
4373 4023  
4374 4024          /*
4375 4025           * Create DDTs (dedup tables).
4376 4026           */
4377 4027          ddt_create(spa);
4378 4028  
4379 4029          spa_update_dspace(spa);
4380 4030  
4381 4031          tx = dmu_tx_create_assigned(dp, txg);
4382 4032  
4383 4033          /*
4384 4034           * Create the pool config object.
4385 4035           */
4386 4036          spa->spa_config_object = dmu_object_alloc(spa->spa_meta_objset,
4387 4037              DMU_OT_PACKED_NVLIST, SPA_CONFIG_BLOCKSIZE,
4388 4038              DMU_OT_PACKED_NVLIST_SIZE, sizeof (uint64_t), tx);
4389 4039  
4390 4040          if (zap_add(spa->spa_meta_objset,
4391 4041              DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CONFIG,
4392 4042              sizeof (uint64_t), 1, &spa->spa_config_object, tx) != 0) {
4393 4043                  cmn_err(CE_PANIC, "failed to add pool config");
4394 4044          }
4395 4045  
4396 4046          if (spa_version(spa) >= SPA_VERSION_FEATURES)
4397 4047                  spa_feature_create_zap_objects(spa, tx);
4398 4048  
4399 4049          if (zap_add(spa->spa_meta_objset,
4400 4050              DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CREATION_VERSION,
4401 4051              sizeof (uint64_t), 1, &version, tx) != 0) {
4402 4052                  cmn_err(CE_PANIC, "failed to add pool version");
4403 4053          }
4404 4054  
4405 4055          /* Newly created pools with the right version are always deflated. */
4406 4056          if (version >= SPA_VERSION_RAIDZ_DEFLATE) {
4407 4057                  spa->spa_deflate = TRUE;
4408 4058                  if (zap_add(spa->spa_meta_objset,
4409 4059                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
4410 4060                      sizeof (uint64_t), 1, &spa->spa_deflate, tx) != 0) {
4411 4061                          cmn_err(CE_PANIC, "failed to add deflate");
4412 4062                  }
4413 4063          }
4414 4064  
4415 4065          /*
4416 4066           * Create the deferred-free bpobj.  Turn off compression
4417 4067           * because sync-to-convergence takes longer if the blocksize
4418 4068           * keeps changing.
4419 4069           */
4420 4070          obj = bpobj_alloc(spa->spa_meta_objset, 1 << 14, tx);
4421 4071          dmu_object_set_compress(spa->spa_meta_objset, obj,
4422 4072              ZIO_COMPRESS_OFF, tx);
4423 4073          if (zap_add(spa->spa_meta_objset,
4424 4074              DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SYNC_BPOBJ,
4425 4075              sizeof (uint64_t), 1, &obj, tx) != 0) {
4426 4076                  cmn_err(CE_PANIC, "failed to add bpobj");
  
    [123 lines elided]
  
4427 4077          }
4428 4078          VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
4429 4079              spa->spa_meta_objset, obj));
4430 4080  
4431 4081          /*
4432 4082           * Create the pool's history object.
4433 4083           */
4434 4084          if (version >= SPA_VERSION_ZPOOL_HISTORY)
4435 4085                  spa_history_create_obj(spa, tx);
4436 4086  
     4087 +        mp = &spa->spa_meta_policy;
     4088 +
4437 4089          /*
4438 4090           * Generate some random noise for salted checksums to operate on.
4439 4091           */
4440 4092          (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
4441 4093              sizeof (spa->spa_cksum_salt.zcs_bytes));
4442 4094  
4443 4095          /*
4444 4096           * Set pool properties.
4445 4097           */
4446 4098          spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
4447 4099          spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
4448 4100          spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
4449 4101          spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
     4102 +        spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
     4103 +        spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
     4104 +        spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
     4105 +        spa->spa_ddt_meta_copies =
     4106 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUPMETA_DITTO);
     4107 +        spa->spa_dedup_best_effort =
     4108 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_BEST_EFFORT);
     4109 +        spa->spa_dedup_lo_best_effort =
     4110 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
     4111 +        spa->spa_dedup_hi_best_effort =
     4112 +            zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
     4113 +        spa->spa_force_trim = zpool_prop_default_numeric(ZPOOL_PROP_FORCETRIM);
4450 4114  
     4115 +        spa->spa_resilver_prio =
     4116 +            zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
     4117 +        spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
     4118 +
     4119 +        mutex_enter(&spa->spa_auto_trim_lock);
     4120 +        spa->spa_auto_trim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM);
     4121 +        if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     4122 +                spa_auto_trim_taskq_create(spa);
     4123 +        mutex_exit(&spa->spa_auto_trim_lock);
     4124 +
     4125 +        mp->spa_enable_meta_placement_selection =
     4126 +            zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
     4127 +        mp->spa_sync_to_special =
     4128 +            zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
     4129 +        mp->spa_ddt_meta_to_special =
     4130 +            zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
     4131 +        mp->spa_zfs_meta_to_special =
     4132 +            zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
     4133 +        mp->spa_small_data_to_special =
     4134 +            zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
     4135 +
     4136 +        spa_set_ddt_classes(spa, 0);
     4137 +
4451 4138          if (props != NULL) {
4452 4139                  spa_configfile_set(spa, props, B_FALSE);
4453 4140                  spa_sync_props(props, tx);
4454 4141          }
4455 4142  
     4143 +        if (spa_has_special(spa)) {
     4144 +                spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
     4145 +                spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4146 +
     4147 +                if (wbc_feature_exists)
     4148 +                        spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
     4149 +        }
     4150 +
4456 4151          dmu_tx_commit(tx);
4457 4152  
4458 4153          spa->spa_sync_on = B_TRUE;
4459 4154          txg_sync_start(spa->spa_dsl_pool);
4460 4155  
4461 4156          /*
4462 4157           * We explicitly wait for the first transaction to complete so that our
4463 4158           * bean counters are appropriately updated.
4464 4159           */
4465 4160          txg_wait_synced(spa->spa_dsl_pool, txg);
4466 4161  
4467      -        spa_spawn_aux_threads(spa);
4468      -
4469      -        spa_write_cachefile(spa, B_FALSE, B_TRUE);
     4162 +        spa_config_sync(spa, B_FALSE, B_TRUE);
4470 4163          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
4471 4164  
4472 4165          spa_history_log_version(spa, "create");
4473 4166  
4474 4167          /*
4475 4168           * Don't count references from objsets that are already closed
4476 4169           * and are making their way through the eviction process.
4477 4170           */
4478 4171          spa_evicting_os_wait(spa);
4479 4172          spa->spa_minref = refcount_count(&spa->spa_refcount);
4480 4173          spa->spa_load_state = SPA_LOAD_NONE;
4481 4174  
4482 4175          mutex_exit(&spa_namespace_lock);
4483 4176  
     4177 +        wbc_activate(spa, B_TRUE);
     4178 +
4484 4179          return (0);
4485 4180  }
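For context on the props handling at the top of spa_create(), a hedged sketch of a minimal properties nvlist a caller might pass in (the altroot value is a placeholder):

        nvlist_t *props;

        VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP) == 0);
        VERIFY(nvlist_add_string(props,
            zpool_prop_to_name(ZPOOL_PROP_ALTROOT), "/a") == 0);
        VERIFY(nvlist_add_uint64(props,
            zpool_prop_to_name(ZPOOL_PROP_VERSION), SPA_VERSION) == 0);

        /* ... pass as the 'props' argument; the caller frees it afterwards ... */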
4486 4181  
     4182 +
     4183 +/*
     4184 + * See if the pool has special tier, and if so, enable/activate
     4185 + * See if the pool has a special tier and, if so, enable/activate
     4186 + */
     4187 +static void
     4188 +spa_check_special_feature(spa_t *spa)
     4189 +{
     4190 +        if (spa_has_special(spa)) {
     4191 +                nvlist_t *props = NULL;
     4192 +
     4193 +                if (!spa_feature_is_enabled(spa, SPA_FEATURE_META_DEVICES)) {
     4194 +                        VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, 0) == 0);
     4195 +                        VERIFY(nvlist_add_uint64(props,
     4196 +                            FEATURE_META_DEVICES, 0) == 0);
     4197 +                        VERIFY(spa_prop_set(spa, props) == 0);
     4198 +                        nvlist_free(props);
     4199 +                }
     4200 +
     4201 +                if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
     4202 +                        dmu_tx_t *tx =
     4203 +                            dmu_tx_create_dd(spa->spa_dsl_pool->dp_mos_dir);
     4204 +
     4205 +                        VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
     4206 +                        spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4207 +                        dmu_tx_commit(tx);
     4208 +                }
     4209 +        }
     4210 +}
     4211 +
     4212 +static void
     4213 +spa_special_feature_activate(void *arg, dmu_tx_t *tx)
     4214 +{
     4215 +        spa_t *spa = (spa_t *)arg;
     4216 +
     4217 +        if (spa_has_special(spa)) {
     4218 +                /* enable and activate as needed */
     4219 +                spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
     4220 +                if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
     4221 +                        spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
     4222 +                }
     4223 +
     4224 +                spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
     4225 +        }
     4226 +}
     4227 +
4487 4228  #ifdef _KERNEL
4488 4229  /*
4489 4230   * Get the root pool information from the root disk, then import the root pool
4490 4231   * during the system boot up time.
4491 4232   */
4492 4233  extern int vdev_disk_read_rootlabel(char *, char *, nvlist_t **);
4493 4234  
4494 4235  static nvlist_t *
4495 4236  spa_generate_rootconf(char *devpath, char *devid, uint64_t *guid)
4496 4237  {
4497 4238          nvlist_t *config;
4498 4239          nvlist_t *nvtop, *nvroot;
4499 4240          uint64_t pgid;
4500 4241  
4501 4242          if (vdev_disk_read_rootlabel(devpath, devid, &config) != 0)
4502 4243                  return (NULL);
4503 4244  
4504 4245          /*
4505 4246           * Add this top-level vdev to the child array.
4506 4247           */
4507 4248          VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4508 4249              &nvtop) == 0);
4509 4250          VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID,
4510 4251              &pgid) == 0);
4511 4252          VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_GUID, guid) == 0);
4512 4253  
4513 4254          /*
4514 4255           * Put this pool's top-level vdevs into a root vdev.
4515 4256           */
4516 4257          VERIFY(nvlist_alloc(&nvroot, NV_UNIQUE_NAME, KM_SLEEP) == 0);
4517 4258          VERIFY(nvlist_add_string(nvroot, ZPOOL_CONFIG_TYPE,
4518 4259              VDEV_TYPE_ROOT) == 0);
4519 4260          VERIFY(nvlist_add_uint64(nvroot, ZPOOL_CONFIG_ID, 0ULL) == 0);
4520 4261          VERIFY(nvlist_add_uint64(nvroot, ZPOOL_CONFIG_GUID, pgid) == 0);
4521 4262          VERIFY(nvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN,
4522 4263              &nvtop, 1) == 0);
4523 4264  
4524 4265          /*
4525 4266           * Replace the existing vdev_tree with the new root vdev in
4526 4267           * this pool's configuration (remove the old, add the new).
4527 4268           */
4528 4269          VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, nvroot) == 0);
4529 4270          nvlist_free(nvroot);
4530 4271          return (config);
4531 4272  }
4532 4273  
4533 4274  /*
4534 4275   * Walk the vdev tree and see if we can find a device with "better"
4535 4276   * configuration. A configuration is "better" if the label on that
4536 4277   * device has a more recent txg.
4537 4278   */
4538 4279  static void
4539 4280  spa_alt_rootvdev(vdev_t *vd, vdev_t **avd, uint64_t *txg)
4540 4281  {
4541 4282          for (int c = 0; c < vd->vdev_children; c++)
4542 4283                  spa_alt_rootvdev(vd->vdev_child[c], avd, txg);
4543 4284  
4544 4285          if (vd->vdev_ops->vdev_op_leaf) {
4545 4286                  nvlist_t *label;
4546 4287                  uint64_t label_txg;
4547 4288  
4548 4289                  if (vdev_disk_read_rootlabel(vd->vdev_physpath, vd->vdev_devid,
4549 4290                      &label) != 0)
4550 4291                          return;
4551 4292  
4552 4293                  VERIFY(nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
4553 4294                      &label_txg) == 0);
4554 4295  
4555 4296                  /*
4556 4297                   * Do we have a better boot device?
4557 4298                   */
4558 4299                  if (label_txg > *txg) {
4559 4300                          *txg = label_txg;
4560 4301                          *avd = vd;
4561 4302                  }
4562 4303                  nvlist_free(label);
4563 4304          }
4564 4305  }
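A hedged example of the "better" test above: if a mirrored root pool has one leg whose label is at txg 512 and another at txg 517 (say, one disk was offline while the other kept syncing), the walk settles on the txg-517 leaf; spa_import_rootpool() below then notices that this is not the device we booted from and asks the administrator to boot from the newer leg instead.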
4565 4306  
4566 4307  /*
4567 4308   * Import a root pool.
4568 4309   *
4569 4310   * For x86, devpath_list will consist of the devid and/or physpath name of
4570 4311   * the vdev (e.g. "id1,sd@SSEAGATE..." or "/pci@1f,0/ide@d/disk@0,0:a").
4571 4312   * The GRUB "findroot" command will return the vdev we should boot.
4572 4313   *
4573 4314   * For SPARC, devpath_list consists of the physpath name of the booting
4574 4315   * device, regardless of whether the root pool is single-device or mirrored.
4575 4316   * e.g.
4576 4317   *      "/pci@1f,0/ide@d/disk@0,0:a"
4577 4318   */
4578 4319  int
4579 4320  spa_import_rootpool(char *devpath, char *devid)
4580 4321  {
4581 4322          spa_t *spa;
4582 4323          vdev_t *rvd, *bvd, *avd = NULL;
4583 4324          nvlist_t *config, *nvtop;
4584 4325          uint64_t guid, txg;
4585 4326          char *pname;
4586 4327          int error;
4587 4328  
4588 4329          /*
4589 4330           * Read the label from the boot device and generate a configuration.
4590 4331           */
4591 4332          config = spa_generate_rootconf(devpath, devid, &guid);
4592 4333  #if defined(_OBP) && defined(_KERNEL)
4593 4334          if (config == NULL) {
4594 4335                  if (strstr(devpath, "/iscsi/ssd") != NULL) {
4595 4336                          /* iscsi boot */
4596 4337                          get_iscsi_bootpath_phy(devpath);
4597 4338                          config = spa_generate_rootconf(devpath, devid, &guid);
4598 4339                  }
4599 4340          }
4600 4341  #endif
4601 4342          if (config == NULL) {
  
    [105 lines elided]
  
4602 4343                  cmn_err(CE_NOTE, "Cannot read the pool label from '%s'",
4603 4344                      devpath);
4604 4345                  return (SET_ERROR(EIO));
4605 4346          }
4606 4347  
4607 4348          VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
4608 4349              &pname) == 0);
4609 4350          VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
4610 4351  
4611 4352          mutex_enter(&spa_namespace_lock);
4612      -        if ((spa = spa_lookup(pname)) != NULL) {
     4353 +        if ((spa = spa_lookup(pname)) != NULL || spa_config_guid_exists(guid)) {
4613 4354                  /*
4614 4355                   * Remove the existing root pool from the namespace so that we
4615 4356                   * can replace it with the correct config we just read in.
4616 4357                   */
4617 4358                  spa_remove(spa);
4618 4359          }
4619 4360  
4620 4361          spa = spa_add(pname, config, NULL);
4621 4362          spa->spa_is_root = B_TRUE;
4622 4363          spa->spa_import_flags = ZFS_IMPORT_VERBATIM;
4623      -        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
4624      -            &spa->spa_ubsync.ub_version) != 0)
4625      -                spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
4626 4364  
4627 4365          /*
4628 4366           * Build up a vdev tree based on the boot device's label config.
4629 4367           */
4630 4368          VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4631 4369              &nvtop) == 0);
4632 4370          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4633 4371          error = spa_config_parse(spa, &rvd, nvtop, NULL, 0,
4634 4372              VDEV_ALLOC_ROOTPOOL);
4635 4373          spa_config_exit(spa, SCL_ALL, FTAG);
4636 4374          if (error) {
4637 4375                  mutex_exit(&spa_namespace_lock);
4638 4376                  nvlist_free(config);
4639 4377                  cmn_err(CE_NOTE, "Can not parse the config for pool '%s'",
4640 4378                      pname);
4641 4379                  return (error);
4642 4380          }
4643 4381  
4644 4382          /*
4645 4383           * Get the boot vdev.
4646 4384           */
4647 4385          if ((bvd = vdev_lookup_by_guid(rvd, guid)) == NULL) {
4648 4386                  cmn_err(CE_NOTE, "Can not find the boot vdev for guid %llu",
4649 4387                      (u_longlong_t)guid);
4650 4388                  error = SET_ERROR(ENOENT);
4651 4389                  goto out;
4652 4390          }
4653 4391  
4654 4392          /*
4655 4393           * Determine if there is a better boot device.
4656 4394           */
4657 4395          avd = bvd;
4658 4396          spa_alt_rootvdev(rvd, &avd, &txg);
4659 4397          if (avd != bvd) {
4660 4398                  cmn_err(CE_NOTE, "The boot device is 'degraded'. Please "
4661 4399                      "try booting from '%s'", avd->vdev_path);
4662 4400                  error = SET_ERROR(EINVAL);
4663 4401                  goto out;
4664 4402          }
4665 4403  
4666 4404          /*
4667 4405           * If the boot device is part of a spare vdev then ensure that
4668 4406           * we're booting off the active spare.
4669 4407           */
4670 4408          if (bvd->vdev_parent->vdev_ops == &vdev_spare_ops &&
4671 4409              !bvd->vdev_isspare) {
4672 4410                  cmn_err(CE_NOTE, "The boot device is currently spared. Please "
4673 4411                      "try booting from '%s'",
4674 4412                      bvd->vdev_parent->
4675 4413                      vdev_child[bvd->vdev_parent->vdev_children - 1]->vdev_path);
4676 4414                  error = SET_ERROR(EINVAL);
4677 4415                  goto out;
4678 4416          }
4679 4417  
4680 4418          error = 0;
4681 4419  out:
4682 4420          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4683 4421          vdev_free(rvd);
4684 4422          spa_config_exit(spa, SCL_ALL, FTAG);
4685 4423          mutex_exit(&spa_namespace_lock);
4686 4424  
4687 4425          nvlist_free(config);
4688 4426          return (error);
4689 4427  }
4690 4428  
4691 4429  #endif
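
The duplicate-pool guard added in spa_import_rootpool() above (and in spa_import() below) calls spa_config_guid_exists(), whose body is not part of this webrev. The following is a minimal sketch, not part of this change, of what such a helper could look like if it simply walks the spa namespace under spa_namespace_lock; the actual implementation may differ.

/*
 * Illustrative sketch only -- not part of this change.  Assumes the helper
 * scans every pool already known to the namespace for a matching pool guid,
 * which is what the duplicate-import guard needs (NEX-9989).
 */
static boolean_t
spa_config_guid_exists_sketch(uint64_t guid)
{
	spa_t *search = NULL;

	ASSERT(MUTEX_HELD(&spa_namespace_lock));

	/* Walk every pool currently in the spa namespace. */
	while ((search = spa_next(search)) != NULL) {
		if (spa_guid(search) == guid)
			return (B_TRUE);
	}
	return (B_FALSE);
}
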
4692 4430  
4693 4431  /*
4694 4432   * Import a non-root pool into the system.
4695 4433   */
4696 4434  int
4697 4435  spa_import(const char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
4698 4436  {
  
4699 4437          spa_t *spa;
4700 4438          char *altroot = NULL;
4701 4439          spa_load_state_t state = SPA_LOAD_IMPORT;
4702 4440          zpool_rewind_policy_t policy;
4703 4441          uint64_t mode = spa_mode_global;
4704 4442          uint64_t readonly = B_FALSE;
4705 4443          int error;
4706 4444          nvlist_t *nvroot;
4707 4445          nvlist_t **spares, **l2cache;
4708 4446          uint_t nspares, nl2cache;
     4447 +        uint64_t guid;
4709 4448  
     4449 +        if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid) != 0)
     4450 +                return (SET_ERROR(EINVAL));
     4451 +
4710 4452          /*
4711 4453           * If a pool with this name exists, return failure.
4712 4454           */
4713 4455          mutex_enter(&spa_namespace_lock);
4714      -        if (spa_lookup(pool) != NULL) {
     4456 +        if (spa_lookup(pool) != NULL || spa_config_guid_exists(guid)) {
4715 4457                  mutex_exit(&spa_namespace_lock);
4716 4458                  return (SET_ERROR(EEXIST));
4717 4459          }
4718 4460  
4719 4461          /*
4720 4462           * Create and initialize the spa structure.
4721 4463           */
4722 4464          (void) nvlist_lookup_string(props,
4723 4465              zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4724 4466          (void) nvlist_lookup_uint64(props,
4725 4467              zpool_prop_to_name(ZPOOL_PROP_READONLY), &readonly);
4726 4468          if (readonly)
4727 4469                  mode = FREAD;
4728 4470          spa = spa_add(pool, config, altroot);
  
4729 4471          spa->spa_import_flags = flags;
4730 4472  
4731 4473          /*
4732 4474           * Verbatim import - Take a pool and insert it into the namespace
4733 4475           * as if it had been loaded at boot.
4734 4476           */
4735 4477          if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
4736 4478                  if (props != NULL)
4737 4479                          spa_configfile_set(spa, props, B_FALSE);
4738 4480  
4739      -                spa_write_cachefile(spa, B_FALSE, B_TRUE);
     4481 +                spa_config_sync(spa, B_FALSE, B_TRUE);
4740 4482                  spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4741      -                zfs_dbgmsg("spa_import: verbatim import of %s", pool);
     4483 +
4742 4484                  mutex_exit(&spa_namespace_lock);
4743 4485                  return (0);
4744 4486          }
4745 4487  
4746 4488          spa_activate(spa, mode);
4747 4489  
4748 4490          /*
4749 4491           * Don't start async tasks until we know everything is healthy.
4750 4492           */
4751 4493          spa_async_suspend(spa);
4752 4494  
4753 4495          zpool_get_rewind_policy(config, &policy);
4754 4496          if (policy.zrp_request & ZPOOL_DO_REWIND)
4755 4497                  state = SPA_LOAD_RECOVER;
4756 4498  
4757      -        spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT;
4758      -
4759      -        if (state != SPA_LOAD_RECOVER) {
     4499 +        /*
     4500 +         * Pass off the heavy lifting to spa_load().  Pass TRUE for mosconfig
     4501 +         * because the user-supplied config is actually the one to trust when
     4502 +         * doing an import.
     4503 +         */
     4504 +        if (state != SPA_LOAD_RECOVER)
4760 4505                  spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
4761      -                zfs_dbgmsg("spa_import: importing %s", pool);
4762      -        } else {
4763      -                zfs_dbgmsg("spa_import: importing %s, max_txg=%lld "
4764      -                    "(RECOVERY MODE)", pool, (longlong_t)policy.zrp_txg);
4765      -        }
4766      -        error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request);
4767 4506  
     4507 +        error = spa_load_best(spa, state, B_TRUE, policy.zrp_txg,
     4508 +            policy.zrp_request);
     4509 +
4768 4510          /*
4769 4511           * Propagate anything learned while loading the pool and pass it
4770 4512           * back to caller (i.e. rewind info, missing devices, etc).
4771 4513           */
4772 4514          VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4773 4515              spa->spa_load_info) == 0);
4774 4516  
4775 4517          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4776 4518          /*
4777 4519           * Toss any existing sparelist, as it doesn't have any validity
4778 4520           * anymore, and conflicts with spa_has_spare().
4779 4521           */
4780 4522          if (spa->spa_spares.sav_config) {
4781 4523                  nvlist_free(spa->spa_spares.sav_config);
4782 4524                  spa->spa_spares.sav_config = NULL;
4783 4525                  spa_load_spares(spa);
4784 4526          }
4785 4527          if (spa->spa_l2cache.sav_config) {
4786 4528                  nvlist_free(spa->spa_l2cache.sav_config);
4787 4529                  spa->spa_l2cache.sav_config = NULL;
4788 4530                  spa_load_l2cache(spa);
4789 4531          }
4790 4532  
4791 4533          VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4792 4534              &nvroot) == 0);
4793 4535          if (error == 0)
4794 4536                  error = spa_validate_aux(spa, nvroot, -1ULL,
4795 4537                      VDEV_ALLOC_SPARE);
4796 4538          if (error == 0)
4797 4539                  error = spa_validate_aux(spa, nvroot, -1ULL,
4798 4540                      VDEV_ALLOC_L2CACHE);
4799 4541          spa_config_exit(spa, SCL_ALL, FTAG);
4800 4542  
4801 4543          if (props != NULL)
4802 4544                  spa_configfile_set(spa, props, B_FALSE);
  
4803 4545  
4804 4546          if (error != 0 || (props && spa_writeable(spa) &&
4805 4547              (error = spa_prop_set(spa, props)))) {
4806 4548                  spa_unload(spa);
4807 4549                  spa_deactivate(spa);
4808 4550                  spa_remove(spa);
4809 4551                  mutex_exit(&spa_namespace_lock);
4810 4552                  return (error);
4811 4553          }
4812 4554  
4813      -        spa_async_resume(spa);
4814      -
4815 4555          /*
4816 4556           * Override any spares and level 2 cache devices as specified by
4817 4557           * the user, as these may have correct device names/devids, etc.
4818 4558           */
4819 4559          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4820 4560              &spares, &nspares) == 0) {
4821 4561                  if (spa->spa_spares.sav_config)
4822 4562                          VERIFY(nvlist_remove(spa->spa_spares.sav_config,
4823 4563                              ZPOOL_CONFIG_SPARES, DATA_TYPE_NVLIST_ARRAY) == 0);
4824 4564                  else
4825 4565                          VERIFY(nvlist_alloc(&spa->spa_spares.sav_config,
4826 4566                              NV_UNIQUE_NAME, KM_SLEEP) == 0);
4827 4567                  VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4828 4568                      ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4829 4569                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4830 4570                  spa_load_spares(spa);
4831 4571                  spa_config_exit(spa, SCL_ALL, FTAG);
4832 4572                  spa->spa_spares.sav_sync = B_TRUE;
4833 4573          }
4834 4574          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4835 4575              &l2cache, &nl2cache) == 0) {
4836 4576                  if (spa->spa_l2cache.sav_config)
4837 4577                          VERIFY(nvlist_remove(spa->spa_l2cache.sav_config,
4838 4578                              ZPOOL_CONFIG_L2CACHE, DATA_TYPE_NVLIST_ARRAY) == 0);
4839 4579                  else
  
4840 4580                          VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4841 4581                              NV_UNIQUE_NAME, KM_SLEEP) == 0);
4842 4582                  VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4843 4583                      ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4844 4584                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4845 4585                  spa_load_l2cache(spa);
4846 4586                  spa_config_exit(spa, SCL_ALL, FTAG);
4847 4587                  spa->spa_l2cache.sav_sync = B_TRUE;
4848 4588          }
4849 4589  
     4590 +        /* At this point, we can load spare props */
     4591 +        (void) spa_load_vdev_props(spa);
     4592 +
4850 4593          /*
4851 4594           * Check for any removed devices.
4852 4595           */
4853 4596          if (spa->spa_autoreplace) {
4854 4597                  spa_aux_check_removed(&spa->spa_spares);
4855 4598                  spa_aux_check_removed(&spa->spa_l2cache);
4856 4599          }
4857 4600  
4858 4601          if (spa_writeable(spa)) {
4859 4602                  /*
4860 4603                   * Update the config cache to include the newly-imported pool.
4861 4604                   */
4862 4605                  spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
4863 4606          }
4864 4607  
4865 4608          /*
     4609 +         * Start async resume as late as possible to reduce I/O activity when
     4610 +         * importing a pool.  This lets any pending txgs (e.g. from a scrub or
     4611 +         * resilver) complete quickly, thereby reducing import times in such
     4612 +         * cases.
     4613 +         */
     4614 +        spa_async_resume(spa);
     4615 +
     4616 +        /*
4866 4617           * It's possible that the pool was expanded while it was exported.
4867 4618           * We kick off an async task to handle this for us.
4868 4619           */
4869 4620          spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
4870 4621  
     4622 +        /* Set/activate meta feature as needed */
     4623 +        if (!spa_writeable(spa))
     4624 +                spa_check_special_feature(spa);
4871 4625          spa_history_log_version(spa, "import");
4872 4626  
4873 4627          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4874 4628  
4875 4629          mutex_exit(&spa_namespace_lock);
4876 4630  
4877      -        return (0);
     4631 +        if (!spa_writeable(spa))
     4632 +                return (0);
     4633 +
     4634 +        wbc_activate(spa, B_FALSE);
     4635 +
     4636 +        return (dsl_sync_task(spa->spa_name, NULL, spa_special_feature_activate,
     4637 +            spa, 3, ZFS_SPACE_CHECK_RESERVED));
4878 4638  }
4879 4639  
4880 4640  nvlist_t *
4881 4641  spa_tryimport(nvlist_t *tryconfig)
4882 4642  {
4883 4643          nvlist_t *config = NULL;
4884      -        char *poolname, *cachefile;
     4644 +        char *poolname;
4885 4645          spa_t *spa;
4886 4646          uint64_t state;
4887 4647          int error;
4888      -        zpool_rewind_policy_t policy;
4889 4648  
4890 4649          if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
4891 4650                  return (NULL);
4892 4651  
4893 4652          if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
4894 4653                  return (NULL);
4895 4654  
4896 4655          /*
4897 4656           * Create and initialize the spa structure.
4898 4657           */
4899 4658          mutex_enter(&spa_namespace_lock);
4900 4659          spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
4901 4660          spa_activate(spa, FREAD);
4902 4661  
4903 4662          /*
4904      -         * Rewind pool if a max txg was provided. Note that even though we
4905      -         * retrieve the complete rewind policy, only the rewind txg is relevant
4906      -         * for tryimport.
     4663 +         * Pass off the heavy lifting to spa_load().
     4664 +         * Pass TRUE for mosconfig because the user-supplied config
     4665 +         * is actually the one to trust when doing an import.
4907 4666           */
4908      -        zpool_get_rewind_policy(spa->spa_config, &policy);
4909      -        if (policy.zrp_txg != UINT64_MAX) {
4910      -                spa->spa_load_max_txg = policy.zrp_txg;
4911      -                spa->spa_extreme_rewind = B_TRUE;
4912      -                zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld",
4913      -                    poolname, (longlong_t)policy.zrp_txg);
4914      -        } else {
4915      -                zfs_dbgmsg("spa_tryimport: importing %s", poolname);
4916      -        }
     4667 +        error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING, B_TRUE);
4917 4668  
4918      -        if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile)
4919      -            == 0) {
4920      -                zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile);
4921      -                spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
4922      -        } else {
4923      -                spa->spa_config_source = SPA_CONFIG_SRC_SCAN;
4924      -        }
4925      -
4926      -        error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING);
4927      -
4928 4669          /*
4929 4670           * If 'tryconfig' was at least parsable, return the current config.
4930 4671           */
4931 4672          if (spa->spa_root_vdev != NULL) {
4932 4673                  config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
4933 4674                  VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
4934 4675                      poolname) == 0);
4935 4676                  VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
4936 4677                      state) == 0);
4937 4678                  VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
4938 4679                      spa->spa_uberblock.ub_timestamp) == 0);
4939 4680                  VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4940 4681                      spa->spa_load_info) == 0);
4941 4682  
4942 4683                  /*
4943 4684                   * If the bootfs property exists on this pool then we
4944 4685                   * copy it out so that external consumers can tell which
4945 4686                   * pools are bootable.
4946 4687                   */
4947 4688                  if ((!error || error == EEXIST) && spa->spa_bootfs) {
4948 4689                          char *tmpname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
4949 4690  
4950 4691                          /*
4951 4692                           * We have to play games with the name since the
4952 4693                           * pool was opened as TRYIMPORT_NAME.
4953 4694                           */
4954 4695                          if (dsl_dsobj_to_dsname(spa_name(spa),
4955 4696                              spa->spa_bootfs, tmpname) == 0) {
4956 4697                                  char *cp;
4957 4698                                  char *dsname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
4958 4699  
4959 4700                                  cp = strchr(tmpname, '/');
4960 4701                                  if (cp == NULL) {
4961 4702                                          (void) strlcpy(dsname, tmpname,
4962 4703                                              MAXPATHLEN);
4963 4704                                  } else {
4964 4705                                          (void) snprintf(dsname, MAXPATHLEN,
4965 4706                                              "%s/%s", poolname, ++cp);
4966 4707                                  }
4967 4708                                  VERIFY(nvlist_add_string(config,
4968 4709                                      ZPOOL_CONFIG_BOOTFS, dsname) == 0);
4969 4710                                  kmem_free(dsname, MAXPATHLEN);
4970 4711                          }
4971 4712                          kmem_free(tmpname, MAXPATHLEN);
4972 4713                  }
4973 4714  
4974 4715                  /*
4975 4716                   * Add the list of hot spares and level 2 cache devices.
4976 4717                   */
4977 4718                  spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
4978 4719                  spa_add_spares(spa, config);
4979 4720                  spa_add_l2cache(spa, config);
4980 4721                  spa_config_exit(spa, SCL_CONFIG, FTAG);
4981 4722          }
4982 4723  
4983 4724          spa_unload(spa);
4984 4725          spa_deactivate(spa);
4985 4726          spa_remove(spa);
4986 4727          mutex_exit(&spa_namespace_lock);
4987 4728  
4988 4729          return (config);
4989 4730  }
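
Taken together, spa_tryimport() and spa_import() form the usual two-step import path: probe the on-disk labels first, then import under the name recorded in the label. The sketch below is only illustrative (the example_ name is invented, and error handling and nvlist ownership are simplified).

/*
 * Illustrative sketch only: probe a scanned config with spa_tryimport()
 * and, if the label parses, import the pool under the name found there.
 */
static int
example_probe_and_import(nvlist_t *scanned_config, nvlist_t *props,
    uint64_t flags)
{
	nvlist_t *config;
	char *pool;
	int error;

	/* Generates a trial config without inserting the pool for real. */
	config = spa_tryimport(scanned_config);
	if (config == NULL)
		return (SET_ERROR(EINVAL));

	VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
	    &pool) == 0);

	/* The real import; fails with EEXIST on a name or guid collision. */
	error = spa_import(pool, config, props, flags);

	nvlist_free(config);
	return (error);
}
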
4990 4731  
4991 4732  /*
  
4992 4733   * Pool export/destroy
4993 4734   *
4994 4735   * The act of destroying or exporting a pool is very simple.  We make sure there
4995 4736   * is no more pending I/O and any references to the pool are gone.  Then, we
4996 4737   * update the pool state and sync all the labels to disk, removing the
4997 4738   * configuration from the cache afterwards. If the 'hardforce' flag is set, then
4998 4739   * we don't sync the labels or remove the configuration cache.
4999 4740   */
5000 4741  static int
5001 4742  spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
5002      -    boolean_t force, boolean_t hardforce)
     4743 +    boolean_t force, boolean_t hardforce, boolean_t saveconfig)
5003 4744  {
5004 4745          spa_t *spa;
     4746 +        zfs_autosnap_t *autosnap;
     4747 +        boolean_t wbcthr_stopped = B_FALSE;
5005 4748  
5006 4749          if (oldconfig)
5007 4750                  *oldconfig = NULL;
5008 4751  
5009 4752          if (!(spa_mode_global & FWRITE))
5010 4753                  return (SET_ERROR(EROFS));
5011 4754  
5012 4755          mutex_enter(&spa_namespace_lock);
5013 4756          if ((spa = spa_lookup(pool)) == NULL) {
5014 4757                  mutex_exit(&spa_namespace_lock);
5015 4758                  return (SET_ERROR(ENOENT));
5016 4759          }
5017 4760  
5018 4761          /*
5019      -         * Put a hold on the pool, drop the namespace lock, stop async tasks,
5020      -         * reacquire the namespace lock, and see if we can export.
     4762 +         * Put a hold on the pool, drop the namespace lock, stop async tasks
     4763 +         * and write cache thread, reacquire the namespace lock, and see
     4764 +         * if we can export.
5021 4765           */
5022 4766          spa_open_ref(spa, FTAG);
5023 4767          mutex_exit(&spa_namespace_lock);
     4768 +
     4769 +        autosnap = spa_get_autosnap(spa);
     4770 +        mutex_enter(&autosnap->autosnap_lock);
     4771 +
     4772 +        if (autosnap_has_children_zone(autosnap,
     4773 +            spa_name(spa), B_TRUE)) {
     4774 +                mutex_exit(&autosnap->autosnap_lock);
     4775 +                spa_close(spa, FTAG);
     4776 +                return (EBUSY);
     4777 +        }
     4778 +
     4779 +        mutex_exit(&autosnap->autosnap_lock);
     4780 +
     4781 +        wbcthr_stopped = wbc_stop_thread(spa); /* stop write cache thread */
     4782 +        autosnap_destroyer_thread_stop(spa);
5024 4783          spa_async_suspend(spa);
5025 4784          mutex_enter(&spa_namespace_lock);
5026 4785          spa_close(spa, FTAG);
5027 4786  
5028 4787          /*
5029 4788           * The pool will be in core if it's openable,
5030 4789           * in which case we can modify its state.
5031 4790           */
5032 4791          if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) {
5033 4792                  /*
5034 4793                   * Objsets may be open only because they're dirty, so we
5035 4794                   * have to force it to sync before checking spa_refcnt.
5036 4795                   */
5037 4796                  txg_wait_synced(spa->spa_dsl_pool, 0);
5038 4797                  spa_evicting_os_wait(spa);
5039 4798  
  
5040 4799                  /*
5041 4800                   * A pool cannot be exported or destroyed if there are active
5042 4801                   * references.  If we are resetting a pool, allow references by
5043 4802                   * fault injection handlers.
5044 4803                   */
5045 4804                  if (!spa_refcount_zero(spa) ||
5046 4805                      (spa->spa_inject_ref != 0 &&
5047 4806                      new_state != POOL_STATE_UNINITIALIZED)) {
5048 4807                          spa_async_resume(spa);
5049 4808                          mutex_exit(&spa_namespace_lock);
     4809 +                        if (wbcthr_stopped)
     4810 +                                (void) wbc_start_thread(spa);
     4811 +                        autosnap_destroyer_thread_start(spa);
5050 4812                          return (SET_ERROR(EBUSY));
5051 4813                  }
5052 4814  
5053 4815                  /*
5054 4816                   * A pool cannot be exported if it has an active shared spare.
5055 4817                   * This is to prevent other pools stealing the active spare
5056 4818                   * from an exported pool.  At the user's discretion, such a
5057 4819                   * pool can be forcibly exported.
5058 4820                   */
5059 4821                  if (!force && new_state == POOL_STATE_EXPORTED &&
5060 4822                      spa_has_active_shared_spare(spa)) {
5061 4823                          spa_async_resume(spa);
5062 4824                          mutex_exit(&spa_namespace_lock);
     4825 +                        if (wbcthr_stopped)
     4826 +                                (void) wbc_start_thread(spa);
     4827 +                        autosnap_destroyer_thread_start(spa);
5063 4828                          return (SET_ERROR(EXDEV));
5064 4829                  }
5065 4830  
5066 4831                  /*
5067 4832                   * We want this to be reflected on every label,
5068 4833                   * so mark them all dirty.  spa_unload() will do the
5069 4834                   * final sync that pushes these changes out.
5070 4835                   */
5071 4836                  if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
5072 4837                          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
5073 4838                          spa->spa_state = new_state;
  
5074 4839                          spa->spa_final_txg = spa_last_synced_txg(spa) +
5075 4840                              TXG_DEFER_SIZE + 1;
5076 4841                          vdev_config_dirty(spa->spa_root_vdev);
5077 4842                          spa_config_exit(spa, SCL_ALL, FTAG);
5078 4843                  }
5079 4844          }
5080 4845  
5081 4846          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
5082 4847  
5083 4848          if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
     4849 +                wbc_deactivate(spa);
     4850 +
5084 4851                  spa_unload(spa);
5085 4852                  spa_deactivate(spa);
5086 4853          }
5087 4854  
5088 4855          if (oldconfig && spa->spa_config)
5089 4856                  VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0);
5090 4857  
5091 4858          if (new_state != POOL_STATE_UNINITIALIZED) {
5092 4859                  if (!hardforce)
5093      -                        spa_write_cachefile(spa, B_TRUE, B_TRUE);
     4860 +                        spa_config_sync(spa, !saveconfig, B_TRUE);
     4861 +
5094 4862                  spa_remove(spa);
5095 4863          }
5096 4864          mutex_exit(&spa_namespace_lock);
5097 4865  
5098 4866          return (0);
5099 4867  }
5100 4868  
5101 4869  /*
5102 4870   * Destroy a storage pool.
5103 4871   */
5104 4872  int
5105 4873  spa_destroy(char *pool)
5106 4874  {
5107 4875          return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
5108      -            B_FALSE, B_FALSE));
     4876 +            B_FALSE, B_FALSE, B_FALSE));
5109 4877  }
5110 4878  
5111 4879  /*
5112 4880   * Export a storage pool.
5113 4881   */
5114 4882  int
5115 4883  spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
5116      -    boolean_t hardforce)
     4884 +    boolean_t hardforce, boolean_t saveconfig)
5117 4885  {
5118 4886          return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
5119      -            force, hardforce));
     4887 +            force, hardforce, saveconfig));
5120 4888  }
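
With the saveconfig argument added in this change, callers of spa_export() choose whether the exported pool keeps its entry in the cache file: spa_export_common() passes !saveconfig as the "removing" flag to spa_config_sync(). A hedged example call follows; the example_ wrapper name is invented for illustration.

/*
 * Illustrative sketch only: export a pool, hand its final config back to
 * the caller, and keep its cachefile entry (saveconfig = B_TRUE).
 */
static int
example_export_keep_cachefile(char *pool, nvlist_t **oldconfig)
{
	/*
	 * force = B_FALSE:     fail with EXDEV if a shared spare is active.
	 * hardforce = B_FALSE: sync labels and update the cache file.
	 * saveconfig = B_TRUE: spa_config_sync() runs with
	 *                      removing == B_FALSE, so the entry stays.
	 */
	return (spa_export(pool, oldconfig, B_FALSE, B_FALSE, B_TRUE));
}
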
5121 4889  
5122 4890  /*
5123 4891   * Similar to spa_export(), this unloads the spa_t without actually removing it
5124 4892   * from the namespace in any way.
5125 4893   */
5126 4894  int
5127 4895  spa_reset(char *pool)
5128 4896  {
5129 4897          return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
5130      -            B_FALSE, B_FALSE));
     4898 +            B_FALSE, B_FALSE, B_FALSE));
5131 4899  }
5132 4900  
5133 4901  /*
5134 4902   * ==========================================================================
5135 4903   * Device manipulation
5136 4904   * ==========================================================================
5137 4905   */
5138 4906  
5139 4907  /*
5140 4908   * Add a device to a storage pool.
5141 4909   */
5142 4910  int
5143 4911  spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
5144 4912  {
5145 4913          uint64_t txg, id;
5146 4914          int error;
5147 4915          vdev_t *rvd = spa->spa_root_vdev;
5148 4916          vdev_t *vd, *tvd;
5149 4917          nvlist_t **spares, **l2cache;
5150 4918          uint_t nspares, nl2cache;
     4919 +        dmu_tx_t *tx = NULL;
5151 4920  
5152 4921          ASSERT(spa_writeable(spa));
5153 4922  
5154 4923          txg = spa_vdev_enter(spa);
5155 4924  
5156 4925          if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
5157 4926              VDEV_ALLOC_ADD)) != 0)
5158 4927                  return (spa_vdev_exit(spa, NULL, txg, error));
5159 4928  
5160 4929          spa->spa_pending_vdev = vd;     /* spa_vdev_exit() will clear this */
5161 4930  
5162 4931          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, &spares,
5163 4932              &nspares) != 0)
5164 4933                  nspares = 0;
5165 4934  
5166 4935          if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE, &l2cache,
5167 4936              &nl2cache) != 0)
5168 4937                  nl2cache = 0;
5169 4938  
5170 4939          if (vd->vdev_children == 0 && nspares == 0 && nl2cache == 0)
5171 4940                  return (spa_vdev_exit(spa, vd, txg, EINVAL));
5172 4941  
5173 4942          if (vd->vdev_children != 0 &&
5174 4943              (error = vdev_create(vd, txg, B_FALSE)) != 0)
  
5175 4944                  return (spa_vdev_exit(spa, vd, txg, error));
5176 4945  
5177 4946          /*
5178 4947           * We must validate the spares and l2cache devices after checking the
5179 4948           * children.  Otherwise, vdev_inuse() will blindly overwrite the spare.
5180 4949           */
5181 4950          if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
5182 4951                  return (spa_vdev_exit(spa, vd, txg, error));
5183 4952  
5184 4953          /*
5185      -         * If we are in the middle of a device removal, we can only add
5186      -         * devices which match the existing devices in the pool.
5187      -         * If we are in the middle of a removal, or have some indirect
5188      -         * vdevs, we can not add raidz toplevels.
     4954 +         * Transfer each new top-level vdev from vd to rvd.
5189 4955           */
5190      -        if (spa->spa_vdev_removal != NULL ||
5191      -            spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5192      -                for (int c = 0; c < vd->vdev_children; c++) {
5193      -                        tvd = vd->vdev_child[c];
5194      -                        if (spa->spa_vdev_removal != NULL &&
5195      -                            tvd->vdev_ashift !=
5196      -                            spa->spa_vdev_removal->svr_vdev->vdev_ashift) {
5197      -                                return (spa_vdev_exit(spa, vd, txg, EINVAL));
5198      -                        }
5199      -                        /* Fail if top level vdev is raidz */
5200      -                        if (tvd->vdev_ops == &vdev_raidz_ops) {
5201      -                                return (spa_vdev_exit(spa, vd, txg, EINVAL));
5202      -                        }
5203      -                        /*
5204      -                         * Need the top level mirror to be
5205      -                         * a mirror of leaf vdevs only
5206      -                         */
5207      -                        if (tvd->vdev_ops == &vdev_mirror_ops) {
5208      -                                for (uint64_t cid = 0;
5209      -                                    cid < tvd->vdev_children; cid++) {
5210      -                                        vdev_t *cvd = tvd->vdev_child[cid];
5211      -                                        if (!cvd->vdev_ops->vdev_op_leaf) {
5212      -                                                return (spa_vdev_exit(spa, vd,
5213      -                                                    txg, EINVAL));
5214      -                                        }
5215      -                                }
5216      -                        }
5217      -                }
5218      -        }
5219      -
5220 4956          for (int c = 0; c < vd->vdev_children; c++) {
5221 4957  
5222 4958                  /*
5223 4959                   * Set the vdev id to the first hole, if one exists.
5224 4960                   */
5225 4961                  for (id = 0; id < rvd->vdev_children; id++) {
5226 4962                          if (rvd->vdev_child[id]->vdev_ishole) {
5227 4963                                  vdev_free(rvd->vdev_child[id]);
5228 4964                                  break;
5229 4965                          }
5230 4966                  }
5231 4967                  tvd = vd->vdev_child[c];
5232 4968                  vdev_remove_child(vd, tvd);
5233 4969                  tvd->vdev_id = id;
5234 4970                  vdev_add_child(rvd, tvd);
5235 4971                  vdev_config_dirty(tvd);
5236 4972          }
5237 4973  
5238 4974          if (nspares != 0) {
5239 4975                  spa_set_aux_vdevs(&spa->spa_spares, spares, nspares,
5240 4976                      ZPOOL_CONFIG_SPARES);
5241 4977                  spa_load_spares(spa);
5242 4978                  spa->spa_spares.sav_sync = B_TRUE;
5243 4979          }
5244 4980  
5245 4981          if (nl2cache != 0) {
5246 4982                  spa_set_aux_vdevs(&spa->spa_l2cache, l2cache, nl2cache,
5247 4983                      ZPOOL_CONFIG_L2CACHE);
5248 4984                  spa_load_l2cache(spa);
5249 4985                  spa->spa_l2cache.sav_sync = B_TRUE;
5250 4986          }
5251 4987  
5252 4988          /*
5253 4989           * We have to be careful when adding new vdevs to an existing pool.
5254 4990           * If other threads start allocating from these vdevs before we
5255 4991           * sync the config cache, and we lose power, then upon reboot we may
5256 4992           * fail to open the pool because there are DVAs that the config cache
5257 4993           * can't translate.  Therefore, we first add the vdevs without
5258 4994           * initializing metaslabs; sync the config cache (via spa_vdev_exit());
5259 4995           * and then let spa_config_update() initialize the new metaslabs.
5260 4996           *
5261 4997           * spa_load() checks for added-but-not-initialized vdevs, so that
  
5262 4998           * if we lose power at any point in this sequence, the remaining
5263 4999           * steps will be completed the next time we load the pool.
5264 5000           */
5265 5001          (void) spa_vdev_exit(spa, vd, txg, 0);
5266 5002  
5267 5003          mutex_enter(&spa_namespace_lock);
5268 5004          spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
5269 5005          spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
5270 5006          mutex_exit(&spa_namespace_lock);
5271 5007  
     5008 +        /*
     5009 +         * "spa_last_synced_txg(spa) + 1" is used because:
     5010 +         *   - spa_vdev_exit() calls txg_wait_synced() for "txg"
     5011 +         *   - spa_config_update() calls txg_wait_synced() for
     5012 +         *     "spa_last_synced_txg(spa) + 1"
     5013 +         */
     5014 +        tx = dmu_tx_create_assigned(spa_get_dsl(spa),
     5015 +            spa_last_synced_txg(spa) + 1);
     5016 +        spa_special_feature_activate(spa, tx);
     5017 +        dmu_tx_commit(tx);
     5018 +
     5019 +        wbc_activate(spa, B_FALSE);
     5020 +
5272 5021          return (0);
5273 5022  }
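
spa_vdev_add() expects nvroot to be a root-type vdev nvlist whose children describe the new top-level vdevs, plus optional spare and l2cache arrays. In the kernel this nvlist arrives from userland via the add ioctl; the sketch below only illustrates the minimal shape for a single new disk, and the example_ helper name is invented.

/*
 * Illustrative sketch only: build the smallest nvroot spa_vdev_add() will
 * accept -- a root vdev with one disk child.
 */
static nvlist_t *
example_single_disk_nvroot(const char *path)
{
	nvlist_t *root, *disk;

	VERIFY(nvlist_alloc(&root, NV_UNIQUE_NAME, KM_SLEEP) == 0);
	VERIFY(nvlist_alloc(&disk, NV_UNIQUE_NAME, KM_SLEEP) == 0);

	/* One leaf disk child, identified by its device path. */
	VERIFY(nvlist_add_string(disk, ZPOOL_CONFIG_TYPE,
	    VDEV_TYPE_DISK) == 0);
	VERIFY(nvlist_add_string(disk, ZPOOL_CONFIG_PATH, path) == 0);

	/* Wrap it in a root vdev with a single-entry children array. */
	VERIFY(nvlist_add_string(root, ZPOOL_CONFIG_TYPE,
	    VDEV_TYPE_ROOT) == 0);
	VERIFY(nvlist_add_nvlist_array(root, ZPOOL_CONFIG_CHILDREN,
	    &disk, 1) == 0);

	nvlist_free(disk);
	return (root);
}
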
5274 5023  
5275 5024  /*
5276 5025   * Attach a device to a mirror.  The arguments are the path to any device
5277 5026   * in the mirror, and the nvroot for the new device.  If the path specifies
5278 5027   * a device that is not mirrored, we automatically insert the mirror vdev.
5279 5028   *
5280 5029   * If 'replacing' is specified, the new device is intended to replace the
5281 5030   * existing device; in this case the two devices are made into their own
5282 5031   * mirror using the 'replacing' vdev, which is functionally identical to
5283 5032   * the mirror vdev (it actually reuses all the same ops) but has a few
5284 5033   * extra rules: you can't attach to it after it's been created, and upon
5285 5034   * completion of resilvering, the first disk (the one being replaced)
5286 5035   * is automatically detached.
5287 5036   */
5288 5037  int
5289 5038  spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
5290 5039  {
5291 5040          uint64_t txg, dtl_max_txg;
5292 5041          vdev_t *rvd = spa->spa_root_vdev;
5293 5042          vdev_t *oldvd, *newvd, *newrootvd, *pvd, *tvd;
5294 5043          vdev_ops_t *pvops;
  
5295 5044          char *oldvdpath, *newvdpath;
5296 5045          int newvd_isspare;
5297 5046          int error;
5298 5047  
5299 5048          ASSERT(spa_writeable(spa));
5300 5049  
5301 5050          txg = spa_vdev_enter(spa);
5302 5051  
5303 5052          oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5304 5053  
5305      -        if (spa->spa_vdev_removal != NULL ||
5306      -            spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5307      -                return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5308      -        }
5309      -
5310 5054          if (oldvd == NULL)
5311 5055                  return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5312 5056  
5313 5057          if (!oldvd->vdev_ops->vdev_op_leaf)
5314 5058                  return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5315 5059  
5316 5060          pvd = oldvd->vdev_parent;
5317 5061  
5318 5062          if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
5319 5063              VDEV_ALLOC_ATTACH)) != 0)
5320 5064                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5321 5065  
5322 5066          if (newrootvd->vdev_children != 1)
5323 5067                  return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5324 5068  
5325 5069          newvd = newrootvd->vdev_child[0];
5326 5070  
5327 5071          if (!newvd->vdev_ops->vdev_op_leaf)
5328 5072                  return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5329 5073  
5330 5074          if ((error = vdev_create(newrootvd, txg, replacing)) != 0)
5331 5075                  return (spa_vdev_exit(spa, newrootvd, txg, error));
5332 5076  
5333 5077          /*
5334 5078           * Spares can't replace logs
5335 5079           */
5336 5080          if (oldvd->vdev_top->vdev_islog && newvd->vdev_isspare)
5337 5081                  return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5338 5082  
5339 5083          if (!replacing) {
5340 5084                  /*
5341 5085                   * For attach, the only allowable parent is a mirror or the root
5342 5086                   * vdev.
5343 5087                   */
5344 5088                  if (pvd->vdev_ops != &vdev_mirror_ops &&
5345 5089                      pvd->vdev_ops != &vdev_root_ops)
5346 5090                          return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5347 5091  
5348 5092                  pvops = &vdev_mirror_ops;
5349 5093          } else {
5350 5094                  /*
5351 5095                   * Active hot spares can only be replaced by inactive hot
5352 5096                   * spares.
5353 5097                   */
5354 5098                  if (pvd->vdev_ops == &vdev_spare_ops &&
5355 5099                      oldvd->vdev_isspare &&
5356 5100                      !spa_has_spare(spa, newvd->vdev_guid))
5357 5101                          return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5358 5102  
5359 5103                  /*
5360 5104                   * If the source is a hot spare, and the parent isn't already a
5361 5105                   * spare, then we want to create a new hot spare.  Otherwise, we
5362 5106                   * want to create a replacing vdev.  The user is not allowed to
5363 5107                   * attach to a spared vdev child unless the 'isspare' state is
5364 5108                   * the same (spare replaces spare, non-spare replaces
5365 5109                   * non-spare).
5366 5110                   */
5367 5111                  if (pvd->vdev_ops == &vdev_replacing_ops &&
5368 5112                      spa_version(spa) < SPA_VERSION_MULTI_REPLACE) {
5369 5113                          return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5370 5114                  } else if (pvd->vdev_ops == &vdev_spare_ops &&
5371 5115                      newvd->vdev_isspare != oldvd->vdev_isspare) {
5372 5116                          return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5373 5117                  }
5374 5118  
5375 5119                  if (newvd->vdev_isspare)
5376 5120                          pvops = &vdev_spare_ops;
5377 5121                  else
5378 5122                          pvops = &vdev_replacing_ops;
5379 5123          }
5380 5124  
5381 5125          /*
5382 5126           * Make sure the new device is big enough.
5383 5127           */
5384 5128          if (newvd->vdev_asize < vdev_get_min_asize(oldvd))
5385 5129                  return (spa_vdev_exit(spa, newrootvd, txg, EOVERFLOW));
5386 5130  
5387 5131          /*
5388 5132           * The new device cannot have a higher alignment requirement
5389 5133           * than the top-level vdev.
5390 5134           */
5391 5135          if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
5392 5136                  return (spa_vdev_exit(spa, newrootvd, txg, EDOM));
5393 5137  
5394 5138          /*
5395 5139           * If this is an in-place replacement, update oldvd's path and devid
5396 5140           * to make it distinguishable from newvd, and unopenable from now on.
5397 5141           */
5398 5142          if (strcmp(oldvd->vdev_path, newvd->vdev_path) == 0) {
5399 5143                  spa_strfree(oldvd->vdev_path);
5400 5144                  oldvd->vdev_path = kmem_alloc(strlen(newvd->vdev_path) + 5,
5401 5145                      KM_SLEEP);
5402 5146                  (void) sprintf(oldvd->vdev_path, "%s/%s",
5403 5147                      newvd->vdev_path, "old");
5404 5148                  if (oldvd->vdev_devid != NULL) {
5405 5149                          spa_strfree(oldvd->vdev_devid);
5406 5150                          oldvd->vdev_devid = NULL;
5407 5151                  }
5408 5152          }
5409 5153  
5410 5154          /* mark the device being resilvered */
5411 5155          newvd->vdev_resilver_txg = txg;
5412 5156  
5413 5157          /*
5414 5158           * If the parent is not a mirror, or if we're replacing, insert the new
5415 5159           * mirror/replacing/spare vdev above oldvd.
5416 5160           */
5417 5161          if (pvd->vdev_ops != pvops)
5418 5162                  pvd = vdev_add_parent(oldvd, pvops);
5419 5163  
5420 5164          ASSERT(pvd->vdev_top->vdev_parent == rvd);
5421 5165          ASSERT(pvd->vdev_ops == pvops);
5422 5166          ASSERT(oldvd->vdev_parent == pvd);
5423 5167  
5424 5168          /*
5425 5169           * Extract the new device from its root and add it to pvd.
5426 5170           */
5427 5171          vdev_remove_child(newrootvd, newvd);
5428 5172          newvd->vdev_id = pvd->vdev_children;
5429 5173          newvd->vdev_crtxg = oldvd->vdev_crtxg;
5430 5174          vdev_add_child(pvd, newvd);
5431 5175  
5432 5176          tvd = newvd->vdev_top;
5433 5177          ASSERT(pvd->vdev_top == tvd);
5434 5178          ASSERT(tvd->vdev_parent == rvd);
5435 5179  
5436 5180          vdev_config_dirty(tvd);
5437 5181  
5438 5182          /*
5439 5183           * Set newvd's DTL to [TXG_INITIAL, dtl_max_txg) so that we account
5440 5184           * for any dmu_sync-ed blocks.  It will propagate upward when
5441 5185           * spa_vdev_exit() calls vdev_dtl_reassess().
5442 5186           */
5443 5187          dtl_max_txg = txg + TXG_CONCURRENT_STATES;
5444 5188  
5445 5189          vdev_dtl_dirty(newvd, DTL_MISSING, TXG_INITIAL,
5446 5190              dtl_max_txg - TXG_INITIAL);
5447 5191  
5448 5192          if (newvd->vdev_isspare) {
5449 5193                  spa_spare_activate(newvd);
5450 5194                  spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_SPARE);
5451 5195          }
5452 5196  
5453 5197          oldvdpath = spa_strdup(oldvd->vdev_path);
5454 5198          newvdpath = spa_strdup(newvd->vdev_path);
5455 5199          newvd_isspare = newvd->vdev_isspare;
5456 5200  
5457 5201          /*
5458 5202           * Mark newvd's DTL dirty in this txg.
5459 5203           */
5460 5204          vdev_dirty(tvd, VDD_DTL, newvd, txg);
5461 5205  
5462 5206          /*
5463 5207           * Schedule the resilver to restart in the future. We do this to
5464 5208           * ensure that dmu_sync-ed blocks have been stitched into the
  
5465 5209           * respective datasets.
5466 5210           */
5467 5211          dsl_resilver_restart(spa->spa_dsl_pool, dtl_max_txg);
5468 5212  
5469 5213          if (spa->spa_bootfs)
5470 5214                  spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
5471 5215  
5472 5216          spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
5473 5217  
5474 5218          /*
     5219 +         * If the old vdev has a CoS property, take a reference on it for the new vdev
     5220 +         */
     5221 +        if (oldvd->vdev_queue.vq_cos) {
     5222 +                cos_hold(oldvd->vdev_queue.vq_cos);
     5223 +                newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos;
     5224 +        }
     5225 +
     5226 +        /*
5475 5227           * Commit the config
5476 5228           */
5477 5229          (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
5478 5230  
5479 5231          spa_history_log_internal(spa, "vdev attach", NULL,
5480 5232              "%s vdev=%s %s vdev=%s",
5481 5233              replacing && newvd_isspare ? "spare in" :
5482 5234              replacing ? "replace" : "attach", newvdpath,
5483 5235              replacing ? "for" : "to", oldvdpath);
5484 5236  
5485 5237          spa_strfree(oldvdpath);
5486 5238          spa_strfree(newvdpath);
5487 5239  
5488 5240          return (0);
5489 5241  }
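
As the block comment above spa_vdev_attach() explains, the same entry point covers both a plain mirror attach and a device replacement, selected by the replacing flag. A hedged usage sketch follows, reusing the invented example_single_disk_nvroot() helper sketched after spa_vdev_add().

/*
 * Illustrative sketch only: replace a failed leaf with a new disk.  With
 * replacing = B_TRUE a 'replacing' vdev is inserted above the old disk,
 * and the old disk is detached automatically once resilvering completes.
 */
static int
example_replace_disk(spa_t *spa, uint64_t old_disk_guid, const char *new_path)
{
	nvlist_t *nvroot = example_single_disk_nvroot(new_path);
	int error;

	error = spa_vdev_attach(spa, old_disk_guid, nvroot, B_TRUE);

	nvlist_free(nvroot);
	return (error);
}
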
5490 5242  
5491 5243  /*
5492 5244   * Detach a device from a mirror or replacing vdev.
5493 5245   *
5494 5246   * If 'replace_done' is specified, only detach if the parent
5495 5247   * is a replacing vdev.
5496 5248   */
5497 5249  int
5498 5250  spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pguid, int replace_done)
5499 5251  {
5500 5252          uint64_t txg;
5501 5253          int error;
5502 5254          vdev_t *rvd = spa->spa_root_vdev;
5503 5255          vdev_t *vd, *pvd, *cvd, *tvd;
5504 5256          boolean_t unspare = B_FALSE;
5505 5257          uint64_t unspare_guid = 0;
5506 5258          char *vdpath;
5507 5259  
5508 5260          ASSERT(spa_writeable(spa));
5509 5261  
5510 5262          txg = spa_vdev_enter(spa);
5511 5263  
5512 5264          vd = spa_lookup_by_guid(spa, guid, B_FALSE);
5513 5265  
5514 5266          if (vd == NULL)
5515 5267                  return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5516 5268  
5517 5269          if (!vd->vdev_ops->vdev_op_leaf)
5518 5270                  return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5519 5271  
5520 5272          pvd = vd->vdev_parent;
5521 5273  
5522 5274          /*
5523 5275           * If the parent/child relationship is not as expected, don't do it.
5524 5276           * Consider M(A,R(B,C)) -- that is, a mirror of A with a replacing
5525 5277           * vdev that's replacing B with C.  The user's intent in replacing
5526 5278           * is to go from M(A,B) to M(A,C).  If the user decides to cancel
5527 5279           * the replace by detaching C, the expected behavior is to end up
5528 5280           * M(A,B).  But suppose that right after deciding to detach C,
5529 5281           * the replacement of B completes.  We would have M(A,C), and then
5530 5282           * ask to detach C, which would leave us with just A -- not what
5531 5283           * the user wanted.  To prevent this, we make sure that the
5532 5284           * parent/child relationship hasn't changed -- in this example,
5533 5285           * that C's parent is still the replacing vdev R.
5534 5286           */
5535 5287          if (pvd->vdev_guid != pguid && pguid != 0)
5536 5288                  return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5537 5289  
5538 5290          /*
5539 5291           * Only 'replacing' or 'spare' vdevs can be replaced.
5540 5292           */
5541 5293          if (replace_done && pvd->vdev_ops != &vdev_replacing_ops &&
5542 5294              pvd->vdev_ops != &vdev_spare_ops)
5543 5295                  return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5544 5296  
5545 5297          ASSERT(pvd->vdev_ops != &vdev_spare_ops ||
5546 5298              spa_version(spa) >= SPA_VERSION_SPARES);
5547 5299  
5548 5300          /*
5549 5301           * Only mirror, replacing, and spare vdevs support detach.
5550 5302           */
5551 5303          if (pvd->vdev_ops != &vdev_replacing_ops &&
5552 5304              pvd->vdev_ops != &vdev_mirror_ops &&
5553 5305              pvd->vdev_ops != &vdev_spare_ops)
5554 5306                  return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5555 5307  
5556 5308          /*
5557 5309           * If this device has the only valid copy of some data,
5558 5310           * we cannot safely detach it.
5559 5311           */
5560 5312          if (vdev_dtl_required(vd))
5561 5313                  return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5562 5314  
5563 5315          ASSERT(pvd->vdev_children >= 2);
5564 5316  
5565 5317          /*
5566 5318           * If we are detaching the second disk from a replacing vdev, then
5567 5319           * check to see if we changed the original vdev's path to have "/old"
5568 5320           * at the end in spa_vdev_attach().  If so, undo that change now.
5569 5321           */
5570 5322          if (pvd->vdev_ops == &vdev_replacing_ops && vd->vdev_id > 0 &&
5571 5323              vd->vdev_path != NULL) {
5572 5324                  size_t len = strlen(vd->vdev_path);
5573 5325  
5574 5326                  for (int c = 0; c < pvd->vdev_children; c++) {
5575 5327                          cvd = pvd->vdev_child[c];
5576 5328  
5577 5329                          if (cvd == vd || cvd->vdev_path == NULL)
5578 5330                                  continue;
5579 5331  
5580 5332                          if (strncmp(cvd->vdev_path, vd->vdev_path, len) == 0 &&
5581 5333                              strcmp(cvd->vdev_path + len, "/old") == 0) {
5582 5334                                  spa_strfree(cvd->vdev_path);
5583 5335                                  cvd->vdev_path = spa_strdup(vd->vdev_path);
5584 5336                                  break;
5585 5337                          }
5586 5338                  }
5587 5339          }
5588 5340  
5589 5341          /*
5590 5342           * If we are detaching the original disk from a spare, then it implies
5591 5343           * that the spare should become a real disk, and be removed from the
5592 5344           * active spare list for the pool.
5593 5345           */
5594 5346          if (pvd->vdev_ops == &vdev_spare_ops &&
5595 5347              vd->vdev_id == 0 &&
5596 5348              pvd->vdev_child[pvd->vdev_children - 1]->vdev_isspare)
5597 5349                  unspare = B_TRUE;
5598 5350  
5599 5351          /*
5600 5352           * Erase the disk labels so the disk can be used for other things.
5601 5353           * This must be done after all other error cases are handled,
5602 5354           * but before we disembowel vd (so we can still do I/O to it).
5603 5355           * But if we can't do it, don't treat the error as fatal --
5604 5356           * it may be that the unwritability of the disk is the reason
5605 5357           * it's being detached!
5606 5358           */
5607 5359          error = vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
5608 5360  
5609 5361          /*
5610 5362           * Remove vd from its parent and compact the parent's children.
5611 5363           */
5612 5364          vdev_remove_child(pvd, vd);
5613 5365          vdev_compact_children(pvd);
5614 5366  
5615 5367          /*
5616 5368           * Remember one of the remaining children so we can get tvd below.
5617 5369           */
5618 5370          cvd = pvd->vdev_child[pvd->vdev_children - 1];
5619 5371  
5620 5372          /*
5621 5373           * If we need to remove the remaining child from the list of hot spares,
5622 5374           * do it now, marking the vdev as no longer a spare in the process.
5623 5375           * We must do this before vdev_remove_parent(), because that can
5624 5376           * change the GUID if it creates a new toplevel GUID.  For a similar
5625 5377           * reason, we must remove the spare now, in the same txg as the detach;
5626 5378           * otherwise someone could attach a new sibling, change the GUID, and
5627 5379           * the subsequent attempt to spa_vdev_remove(unspare_guid) would fail.
5628 5380           */
5629 5381          if (unspare) {
5630 5382                  ASSERT(cvd->vdev_isspare);
5631 5383                  spa_spare_remove(cvd);
5632 5384                  unspare_guid = cvd->vdev_guid;
5633 5385                  (void) spa_vdev_remove(spa, unspare_guid, B_TRUE);
5634 5386                  cvd->vdev_unspare = B_TRUE;
5635 5387          }
5636 5388  
5637 5389          /*
5638 5390           * If the parent mirror/replacing vdev only has one child,
5639 5391           * the parent is no longer needed.  Remove it from the tree.
5640 5392           */
5641 5393          if (pvd->vdev_children == 1) {
5642 5394                  if (pvd->vdev_ops == &vdev_spare_ops)
5643 5395                          cvd->vdev_unspare = B_FALSE;
5644 5396                  vdev_remove_parent(cvd);
5645 5397          }
5646 5398  
5647 5399  
5648 5400          /*
5649 5401           * We don't set tvd until now because the parent we just removed
5650 5402           * may have been the previous top-level vdev.
5651 5403           */
5652 5404          tvd = cvd->vdev_top;
5653 5405          ASSERT(tvd->vdev_parent == rvd);
5654 5406  
5655 5407          /*
5656 5408           * Reevaluate the parent vdev state.
5657 5409           */
5658 5410          vdev_propagate_state(cvd);
5659 5411  
5660 5412          /*
5661 5413           * If the 'autoexpand' property is set on the pool then automatically
5662 5414           * try to expand the size of the pool. For example if the device we
5663 5415           * just detached was smaller than the others, it may be possible to
5664 5416           * add metaslabs (i.e. grow the pool). We need to reopen the vdev
5665 5417           * first so that we can obtain the updated sizes of the leaf vdevs.
5666 5418           */
5667 5419          if (spa->spa_autoexpand) {
5668 5420                  vdev_reopen(tvd);
5669 5421                  vdev_expand(tvd, txg);
5670 5422          }
5671 5423  
5672 5424          vdev_config_dirty(tvd);
5673 5425  
5674 5426          /*
5675 5427           * Mark vd's DTL as dirty in this txg.  vdev_dtl_sync() will see that
5676 5428           * vd->vdev_detached is set and free vd's DTL object in syncing context.
5677 5429           * But first make sure we're not on any *other* txg's DTL list, to
  
5678 5430           * prevent vd from being accessed after it's freed.
5679 5431           */
5680 5432          vdpath = spa_strdup(vd->vdev_path);
5681 5433          for (int t = 0; t < TXG_SIZE; t++)
5682 5434                  (void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
5683 5435          vd->vdev_detached = B_TRUE;
5684 5436          vdev_dirty(tvd, VDD_DTL, vd, txg);
5685 5437  
5686 5438          spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
5687 5439  
     5440 +        /*
     5441 +         * Release the references to CoS descriptors if any
     5442 +         */
     5443 +        if (vd->vdev_queue.vq_cos) {
     5444 +                cos_rele(vd->vdev_queue.vq_cos);
     5445 +                vd->vdev_queue.vq_cos = NULL;
     5446 +        }
     5447 +
5688 5448          /* hang on to the spa before we release the lock */
5689 5449          spa_open_ref(spa, FTAG);
5690 5450  
5691 5451          error = spa_vdev_exit(spa, vd, txg, 0);
5692 5452  
5693 5453          spa_history_log_internal(spa, "detach", NULL,
5694 5454              "vdev=%s", vdpath);
5695 5455          spa_strfree(vdpath);
5696 5456  
5697 5457          /*
5698 5458           * If this was the removal of the original device in a hot spare vdev,
5699 5459           * then we want to go through and remove the device from the hot spare
5700 5460           * list of every other pool.
5701 5461           */
5702 5462          if (unspare) {
5703 5463                  spa_t *altspa = NULL;
5704 5464  
5705 5465                  mutex_enter(&spa_namespace_lock);
5706 5466                  while ((altspa = spa_next(altspa)) != NULL) {
5707 5467                          if (altspa->spa_state != POOL_STATE_ACTIVE ||
5708 5468                              altspa == spa)
5709 5469                                  continue;
5710 5470  
5711 5471                          spa_open_ref(altspa, FTAG);
5712 5472                          mutex_exit(&spa_namespace_lock);
5713 5473                          (void) spa_vdev_remove(altspa, unspare_guid, B_TRUE);
5714 5474                          mutex_enter(&spa_namespace_lock);
5715 5475                          spa_close(altspa, FTAG);
5716 5476                  }
5717 5477                  mutex_exit(&spa_namespace_lock);
5718 5478  
5719 5479                  /* search the rest of the vdevs for spares to remove */
5720 5480                  spa_vdev_resilver_done(spa);
5721 5481          }
5722 5482  
5723 5483          /* all done with the spa; OK to release */
5724 5484          mutex_enter(&spa_namespace_lock);
5725 5485          spa_close(spa, FTAG);
5726 5486          mutex_exit(&spa_namespace_lock);
5727 5487  
5728 5488          return (error);
5729 5489  }
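
A minimal caller sketch may help orient readers of spa_vdev_detach(); it is illustrative only and not part of this change. The helper name and its arguments are hypothetical, while spa_open()/spa_close() and the FTAG convention are the ones used throughout this file.

        /*
         * Illustrative caller sketch -- not part of this change.
         */
        static int
        example_vdev_detach(const char *poolname, uint64_t guid)
        {
                spa_t *spa;
                int error;

                if ((error = spa_open(poolname, &spa, FTAG)) != 0)
                        return (error);

                /* pguid == 0, replace_done == B_FALSE: an explicit detach */
                error = spa_vdev_detach(spa, guid, 0, B_FALSE);

                spa_close(spa, FTAG);
                return (error);
        }
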
5730 5490  
5731 5491  /*
5732 5492   * Split a set of devices from their mirrors, and create a new pool from them.
5733 5493   */
5734 5494  int
5735 5495  spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
5736 5496      nvlist_t *props, boolean_t exp)
5737 5497  {
5738 5498          int error = 0;
5739 5499          uint64_t txg, *glist;
5740 5500          spa_t *newspa;
5741 5501          uint_t c, children, lastlog;
5742 5502          nvlist_t **child, *nvl, *tmp;
5743 5503          dmu_tx_t *tx;
5744 5504          char *altroot = NULL;
5745 5505          vdev_t *rvd, **vml = NULL;                      /* vdev modify list */
5746 5506          boolean_t activate_slog;
5747 5507  
5748 5508          ASSERT(spa_writeable(spa));
5749 5509  
     5510 +        /*
     5511 +         * Splitting a pool with an active WBC is not supported yet;
     5512 +         * it will be implemented in a future release.
     5513 +         */
     5514 +        if (spa_feature_is_active(spa, SPA_FEATURE_WBC))
     5515 +                return (SET_ERROR(ENOTSUP));
     5516 +
5750 5517          txg = spa_vdev_enter(spa);
5751 5518  
5752 5519          /* clear the log and flush everything up to now */
5753 5520          activate_slog = spa_passivate_log(spa);
5754 5521          (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5755      -        error = spa_reset_logs(spa);
     5522 +        error = spa_offline_log(spa);
5756 5523          txg = spa_vdev_config_enter(spa);
5757 5524  
5758 5525          if (activate_slog)
5759 5526                  spa_activate_log(spa);
5760 5527  
5761 5528          if (error != 0)
5762 5529                  return (spa_vdev_exit(spa, NULL, txg, error));
5763 5530  
5764 5531          /* check new spa name before going any further */
5765 5532          if (spa_lookup(newname) != NULL)
5766 5533                  return (spa_vdev_exit(spa, NULL, txg, EEXIST));
5767 5534  
5768 5535          /*
5769 5536           * scan through all the children to ensure they're all mirrors
5770 5537           */
5771 5538          if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 ||
5772 5539              nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN, &child,
5773 5540              &children) != 0)
5774 5541                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5775 5542  
5776 5543          /* first, check to ensure we've got the right child count */
5777 5544          rvd = spa->spa_root_vdev;
5778 5545          lastlog = 0;
5779 5546          for (c = 0; c < rvd->vdev_children; c++) {
5780 5547                  vdev_t *vd = rvd->vdev_child[c];
5781 5548  
5782 5549                  /* don't count the holes & logs as children */
5783      -                if (vd->vdev_islog || !vdev_is_concrete(vd)) {
     5550 +                if (vd->vdev_islog || vd->vdev_ishole) {
5784 5551                          if (lastlog == 0)
5785 5552                                  lastlog = c;
5786 5553                          continue;
5787 5554                  }
5788 5555  
5789 5556                  lastlog = 0;
5790 5557          }
5791 5558          if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
5792 5559                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5793 5560  
5794 5561          /* next, ensure no spare or cache devices are part of the split */
5795 5562          if (nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_SPARES, &tmp) == 0 ||
5796 5563              nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_L2CACHE, &tmp) == 0)
5797 5564                  return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5798 5565  
5799 5566          vml = kmem_zalloc(children * sizeof (vdev_t *), KM_SLEEP);
5800 5567          glist = kmem_zalloc(children * sizeof (uint64_t), KM_SLEEP);
5801 5568  
5802 5569          /* then, loop over each vdev and validate it */
5803 5570          for (c = 0; c < children; c++) {
5804 5571                  uint64_t is_hole = 0;
5805 5572  
5806 5573                  (void) nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_IS_HOLE,
5807 5574                      &is_hole);
5808 5575  
5809 5576                  if (is_hole != 0) {
5810 5577                          if (spa->spa_root_vdev->vdev_child[c]->vdev_ishole ||
5811 5578                              spa->spa_root_vdev->vdev_child[c]->vdev_islog) {
5812 5579                                  continue;
5813 5580                          } else {
5814 5581                                  error = SET_ERROR(EINVAL);
5815 5582                                  break;
5816 5583                          }
5817 5584                  }
5818 5585  
5819 5586                  /* which disk is going to be split? */
5820 5587                  if (nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_GUID,
5821 5588                      &glist[c]) != 0) {
5822 5589                          error = SET_ERROR(EINVAL);
5823 5590                          break;
5824 5591                  }
5825 5592  
5826 5593                  /* look it up in the spa */
5827 5594                  vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
5828 5595                  if (vml[c] == NULL) {
5829 5596                          error = SET_ERROR(ENODEV);
5830 5597                          break;
5831 5598                  }
5832 5599  
5833 5600                  /* make sure there's nothing stopping the split */
5834 5601                  if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
5835 5602                      vml[c]->vdev_islog ||
5836      -                    !vdev_is_concrete(vml[c]) ||
     5603 +                    vml[c]->vdev_ishole ||
5837 5604                      vml[c]->vdev_isspare ||
5838 5605                      vml[c]->vdev_isl2cache ||
5839 5606                      !vdev_writeable(vml[c]) ||
5840 5607                      vml[c]->vdev_children != 0 ||
5841 5608                      vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
5842 5609                      c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
5843 5610                          error = SET_ERROR(EINVAL);
5844 5611                          break;
5845 5612                  }
5846 5613  
5847 5614                  if (vdev_dtl_required(vml[c])) {
5848 5615                          error = SET_ERROR(EBUSY);
5849 5616                          break;
5850 5617                  }
5851 5618  
5852 5619                  /* we need certain info from the top level */
5853 5620                  VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_ARRAY,
5854 5621                      vml[c]->vdev_top->vdev_ms_array) == 0);
5855 5622                  VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_SHIFT,
5856 5623                      vml[c]->vdev_top->vdev_ms_shift) == 0);
5857 5624                  VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_ASIZE,
5858 5625                      vml[c]->vdev_top->vdev_asize) == 0);
5859 5626                  VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_ASHIFT,
5860 5627                      vml[c]->vdev_top->vdev_ashift) == 0);
5861 5628  
5862 5629                  /* transfer per-vdev ZAPs */
5863 5630                  ASSERT3U(vml[c]->vdev_leaf_zap, !=, 0);
5864 5631                  VERIFY0(nvlist_add_uint64(child[c],
5865 5632                      ZPOOL_CONFIG_VDEV_LEAF_ZAP, vml[c]->vdev_leaf_zap));
5866 5633  
5867 5634                  ASSERT3U(vml[c]->vdev_top->vdev_top_zap, !=, 0);
5868 5635                  VERIFY0(nvlist_add_uint64(child[c],
5869 5636                      ZPOOL_CONFIG_VDEV_TOP_ZAP,
5870 5637                      vml[c]->vdev_parent->vdev_top_zap));
5871 5638          }
5872 5639  
5873 5640          if (error != 0) {
5874 5641                  kmem_free(vml, children * sizeof (vdev_t *));
5875 5642                  kmem_free(glist, children * sizeof (uint64_t));
5876 5643                  return (spa_vdev_exit(spa, NULL, txg, error));
5877 5644          }
5878 5645  
5879 5646          /* stop writers from using the disks */
5880 5647          for (c = 0; c < children; c++) {
5881 5648                  if (vml[c] != NULL)
5882 5649                          vml[c]->vdev_offline = B_TRUE;
5883 5650          }
5884 5651          vdev_reopen(spa->spa_root_vdev);
5885 5652  
5886 5653          /*
5887 5654           * Temporarily record the splitting vdevs in the spa config.  This
5888 5655           * will disappear once the config is regenerated.
5889 5656           */
5890 5657          VERIFY(nvlist_alloc(&nvl, NV_UNIQUE_NAME, KM_SLEEP) == 0);
5891 5658          VERIFY(nvlist_add_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST,
5892 5659              glist, children) == 0);
5893 5660          kmem_free(glist, children * sizeof (uint64_t));
5894 5661  
5895 5662          mutex_enter(&spa->spa_props_lock);
5896 5663          VERIFY(nvlist_add_nvlist(spa->spa_config, ZPOOL_CONFIG_SPLIT,
5897 5664              nvl) == 0);
5898 5665          mutex_exit(&spa->spa_props_lock);
5899 5666          spa->spa_config_splitting = nvl;
5900 5667          vdev_config_dirty(spa->spa_root_vdev);
5901 5668  
5902 5669          /* configure and create the new pool */
5903 5670          VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME, newname) == 0);
5904 5671          VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
5905 5672              exp ? POOL_STATE_EXPORTED : POOL_STATE_ACTIVE) == 0);
5906 5673          VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
5907 5674              spa_version(spa)) == 0);
5908 5675          VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG,
5909 5676              spa->spa_config_txg) == 0);
5910 5677          VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_GUID,
5911 5678              spa_generate_guid(NULL)) == 0);
5912 5679          VERIFY0(nvlist_add_boolean(config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
5913 5680          (void) nvlist_lookup_string(props,
5914 5681              zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
5915 5682  
5916 5683          /* add the new pool to the namespace */
5917 5684          newspa = spa_add(newname, config, altroot);
5918 5685          newspa->spa_avz_action = AVZ_ACTION_REBUILD;
5919 5686          newspa->spa_config_txg = spa->spa_config_txg;
5920 5687          spa_set_log_state(newspa, SPA_LOG_CLEAR);
5921 5688  
5922 5689          /* release the spa config lock, retaining the namespace lock */
5923 5690          spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5924 5691  
5925 5692          if (zio_injection_enabled)
5926 5693                  zio_handle_panic_injection(spa, FTAG, 1);
5927 5694  
5928 5695          spa_activate(newspa, spa_mode_global);
5929 5696          spa_async_suspend(newspa);
5930 5697  
5931      -        newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT;
5932      -
5933 5698          /* create the new pool from the disks of the original pool */
5934      -        error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE);
     5699 +        error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE, B_TRUE);
5935 5700          if (error)
5936 5701                  goto out;
5937 5702  
5938 5703          /* if that worked, generate a real config for the new pool */
5939 5704          if (newspa->spa_root_vdev != NULL) {
5940 5705                  VERIFY(nvlist_alloc(&newspa->spa_config_splitting,
5941 5706                      NV_UNIQUE_NAME, KM_SLEEP) == 0);
5942 5707                  VERIFY(nvlist_add_uint64(newspa->spa_config_splitting,
5943 5708                      ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa)) == 0);
5944 5709                  spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
5945 5710                      B_TRUE));
5946 5711          }
5947 5712  
5948 5713          /* set the props */
5949 5714          if (props != NULL) {
5950 5715                  spa_configfile_set(newspa, props, B_FALSE);
5951 5716                  error = spa_prop_set(newspa, props);
5952 5717                  if (error)
5953 5718                          goto out;
5954 5719          }
5955 5720  
5956 5721          /* flush everything */
5957 5722          txg = spa_vdev_config_enter(newspa);
5958 5723          vdev_config_dirty(newspa->spa_root_vdev);
5959 5724          (void) spa_vdev_config_exit(newspa, NULL, txg, 0, FTAG);
5960 5725  
5961 5726          if (zio_injection_enabled)
5962 5727                  zio_handle_panic_injection(spa, FTAG, 2);
5963 5728  
5964 5729          spa_async_resume(newspa);
5965 5730  
5966 5731          /* finally, update the original pool's config */
5967 5732          txg = spa_vdev_config_enter(spa);
5968 5733          tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
5969 5734          error = dmu_tx_assign(tx, TXG_WAIT);
5970 5735          if (error != 0)
5971 5736                  dmu_tx_abort(tx);
5972 5737          for (c = 0; c < children; c++) {
5973 5738                  if (vml[c] != NULL) {
     5739 +                        vdev_t *tvd = vml[c]->vdev_top;
     5740 +
     5741 +                        /*
     5742 +                         * Need to be sure the detachable VDEV is not
     5743 +                         * on any *other* txg's DTL list to prevent it
     5744 +                         * from being accessed after it's freed.
     5745 +                         */
     5746 +                        for (int t = 0; t < TXG_SIZE; t++) {
     5747 +                                (void) txg_list_remove_this(
     5748 +                                    &tvd->vdev_dtl_list, vml[c], t);
     5749 +                        }
     5750 +
5974 5751                          vdev_split(vml[c]);
5975 5752                          if (error == 0)
5976 5753                                  spa_history_log_internal(spa, "detach", tx,
5977 5754                                      "vdev=%s", vml[c]->vdev_path);
5978 5755  
5979 5756                          vdev_free(vml[c]);
5980 5757                  }
5981 5758          }
5982 5759          spa->spa_avz_action = AVZ_ACTION_REBUILD;
5983 5760          vdev_config_dirty(spa->spa_root_vdev);
5984 5761          spa->spa_config_splitting = NULL;
5985 5762          nvlist_free(nvl);
5986 5763          if (error == 0)
5987 5764                  dmu_tx_commit(tx);
5988 5765          (void) spa_vdev_exit(spa, NULL, txg, 0);
5989 5766  
5990 5767          if (zio_injection_enabled)
5991 5768                  zio_handle_panic_injection(spa, FTAG, 3);
5992 5769  
5993 5770          /* split is complete; log a history record */
5994 5771          spa_history_log_internal(newspa, "split", NULL,
5995 5772              "from pool %s", spa_name(spa));
5996 5773  
5997 5774          kmem_free(vml, children * sizeof (vdev_t *));
5998 5775  
5999 5776          /* if we're not going to mount the filesystems in userland, export */
6000 5777          if (exp)
6001 5778                  error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
6002      -                    B_FALSE, B_FALSE);
     5779 +                    B_FALSE, B_FALSE, B_FALSE);
6003 5780  
6004 5781          return (error);
6005 5782  
6006 5783  out:
6007 5784          spa_unload(newspa);
6008 5785          spa_deactivate(newspa);
6009 5786          spa_remove(newspa);
6010 5787  
6011 5788          txg = spa_vdev_config_enter(spa);
6012 5789  
6013 5790          /* re-online all offlined disks */
6014 5791          for (c = 0; c < children; c++) {
6015 5792                  if (vml[c] != NULL)
6016 5793                          vml[c]->vdev_offline = B_FALSE;
6017 5794          }
6018 5795          vdev_reopen(spa->spa_root_vdev);
6019 5796  
6020 5797          nvlist_free(spa->spa_config_splitting);
6021 5798          spa->spa_config_splitting = NULL;
6022 5799          (void) spa_vdev_exit(spa, NULL, txg, error);
6023 5800  
6024 5801          kmem_free(vml, children * sizeof (vdev_t *));
6025 5802          return (error);
6026 5803  }
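
To make the validation above easier to follow, here is a sketch of the shape of the config nvlist that spa_vdev_split_mirror() expects; the guids are placeholders, not values from this change.

        /*
         * Illustrative config layout (placeholder guids), matching the
         * ZPOOL_CONFIG_VDEV_TREE / ZPOOL_CONFIG_CHILDREN / ZPOOL_CONFIG_GUID
         * lookups performed above:
         *
         *   config
         *     vdev_tree                       (ZPOOL_CONFIG_VDEV_TREE)
         *       children[0]: guid = <guid of one healthy leaf of mirror-0>
         *       children[1]: guid = <guid of one healthy leaf of mirror-1>
         *
         * Spare and cache entries must not appear in the tree, and the
         * number of children must equal the number of non-log, non-hole
         * top-level vdevs in the source pool.
         */
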
6027 5804  
     5805 +static nvlist_t *
     5806 +spa_nvlist_lookup_by_guid(nvlist_t **nvpp, int count, uint64_t target_guid)
     5807 +{
     5808 +        for (int i = 0; i < count; i++) {
     5809 +                uint64_t guid;
     5810 +
     5811 +                VERIFY(nvlist_lookup_uint64(nvpp[i], ZPOOL_CONFIG_GUID,
     5812 +                    &guid) == 0);
     5813 +
     5814 +                if (guid == target_guid)
     5815 +                        return (nvpp[i]);
     5816 +        }
     5817 +
     5818 +        return (NULL);
     5819 +}
     5820 +
     5821 +static void
     5822 +spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count,
     5823 +    nvlist_t *dev_to_remove)
     5824 +{
     5825 +        nvlist_t **newdev = NULL;
     5826 +
     5827 +        if (count > 1)
     5828 +                newdev = kmem_alloc((count - 1) * sizeof (void *), KM_SLEEP);
     5829 +
     5830 +        for (int i = 0, j = 0; i < count; i++) {
     5831 +                if (dev[i] == dev_to_remove)
     5832 +                        continue;
     5833 +                VERIFY(nvlist_dup(dev[i], &newdev[j++], KM_SLEEP) == 0);
     5834 +        }
     5835 +
     5836 +        VERIFY(nvlist_remove(config, name, DATA_TYPE_NVLIST_ARRAY) == 0);
     5837 +        VERIFY(nvlist_add_nvlist_array(config, name, newdev, count - 1) == 0);
     5838 +
     5839 +        for (int i = 0; i < count - 1; i++)
     5840 +                nvlist_free(newdev[i]);
     5841 +
     5842 +        if (count > 1)
     5843 +                kmem_free(newdev, (count - 1) * sizeof (void *));
     5844 +}
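
A short illustration of what spa_vdev_remove_aux() does to the config; the device names below are placeholders.

        /*
         * Illustrative effect (placeholder device names): removing c2t1d0
         * from a two-entry ZPOOL_CONFIG_SPARES array
         *
         *   before: spares = [ c2t0d0, c2t1d0 ]
         *   after:  spares = [ c2t0d0 ]
         *
         * The surviving entries are duplicated into a new array, the old
         * array is replaced in the config, and the duplicates are freed
         * once nvlist_add_nvlist_array() has copied them.
         */
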
     5845 +
6028 5846  /*
     5847 + * Evacuate the device.
     5848 + */
     5849 +static int
     5850 +spa_vdev_remove_evacuate(spa_t *spa, vdev_t *vd)
     5851 +{
     5852 +        uint64_t txg;
     5853 +        int error = 0;
     5854 +
     5855 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
     5856 +        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
     5857 +        ASSERT(vd == vd->vdev_top);
     5858 +
     5859 +        /*
     5860 +         * Evacuate the device.  We don't hold the config lock as writer
     5861 +         * since we need to do I/O but we do keep the
     5862 +         * spa_namespace_lock held.  Once this completes the device
     5863 +         * should no longer have any blocks allocated on it.
     5864 +         */
     5865 +        if (vd->vdev_islog) {
     5866 +                if (vd->vdev_stat.vs_alloc != 0)
     5867 +                        error = spa_offline_log(spa);
     5868 +        } else {
     5869 +                error = SET_ERROR(ENOTSUP);
     5870 +        }
     5871 +
     5872 +        if (error)
     5873 +                return (error);
     5874 +
     5875 +        /*
     5876 +         * The evacuation succeeded.  Remove any remaining MOS metadata
     5877 +         * associated with this vdev, and wait for these changes to sync.
     5878 +         */
     5879 +        ASSERT0(vd->vdev_stat.vs_alloc);
     5880 +        txg = spa_vdev_config_enter(spa);
     5881 +        vd->vdev_removing = B_TRUE;
     5882 +        vdev_dirty_leaves(vd, VDD_DTL, txg);
     5883 +        vdev_config_dirty(vd);
     5884 +        spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
     5885 +
     5886 +        return (0);
     5887 +}
     5888 +
     5889 +/*
     5890 + * Complete the removal by cleaning up the namespace.
     5891 + */
     5892 +static void
     5893 +spa_vdev_remove_from_namespace(spa_t *spa, vdev_t *vd)
     5894 +{
     5895 +        vdev_t *rvd = spa->spa_root_vdev;
     5896 +        uint64_t id = vd->vdev_id;
     5897 +        boolean_t last_vdev = (id == (rvd->vdev_children - 1));
     5898 +
     5899 +        ASSERT(MUTEX_HELD(&spa_namespace_lock));
     5900 +        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
     5901 +        ASSERT(vd == vd->vdev_top);
     5902 +
     5903 +        /*
     5904 +         * Only remove any devices which are empty.
     5905 +         */
     5906 +        if (vd->vdev_stat.vs_alloc != 0)
     5907 +                return;
     5908 +
     5909 +        (void) vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
     5910 +
     5911 +        if (list_link_active(&vd->vdev_state_dirty_node))
     5912 +                vdev_state_clean(vd);
     5913 +        if (list_link_active(&vd->vdev_config_dirty_node))
     5914 +                vdev_config_clean(vd);
     5915 +
     5916 +        vdev_free(vd);
     5917 +
     5918 +        if (last_vdev) {
     5919 +                vdev_compact_children(rvd);
     5920 +        } else {
     5921 +                vd = vdev_alloc_common(spa, id, 0, &vdev_hole_ops);
     5922 +                vdev_add_child(rvd, vd);
     5923 +        }
     5924 +        vdev_config_dirty(rvd);
     5925 +
     5926 +        /*
     5927 +         * Reassess the health of our root vdev.
     5928 +         */
     5929 +        vdev_reopen(rvd);
     5930 +}
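
A brief illustration of the placeholder behaviour implemented above: removing anything but the last top-level vdev leaves a hole vdev behind so the remaining vdev ids stay stable.

        /*
         * Illustrative example: with four top-level vdevs and id 1 removed,
         *
         *   before: [ vdev0, vdev1, vdev2, vdev3 ]
         *   after:  [ vdev0, hole,  vdev2, vdev3 ]
         *
         * so ids 2 and 3 are unchanged.  Only when the removed vdev is the
         * last child (id == vdev_children - 1) is the children array
         * compacted instead.
         */
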
     5931 +
     5932 +/*
     5933 + * Remove a device from the pool -
     5934 + *
     5935 + * Removing a device from the vdev namespace requires several steps
     5936 + * and can take a significant amount of time.  As a result we use
     5937 + * the spa_vdev_config_[enter/exit] functions which allow us to
     5938 + * grab and release the spa_config_lock while still holding the namespace
     5939 + * lock.  During each step the configuration is synced out.
     5940 + *
     5941 + * Currently, this supports removing only hot spares, slogs, level 2 ARC
     5942 + * and special devices.
     5943 + */
     5944 +int
     5945 +spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare)
     5946 +{
     5947 +        vdev_t *vd;
     5948 +        sysevent_t *ev = NULL;
     5949 +        metaslab_group_t *mg;
     5950 +        nvlist_t **spares, **l2cache, *nv;
     5951 +        uint64_t txg = 0;
     5952 +        uint_t nspares, nl2cache;
     5953 +        int error = 0;
     5954 +        boolean_t locked = MUTEX_HELD(&spa_namespace_lock);
     5955 +
     5956 +        ASSERT(spa_writeable(spa));
     5957 +
     5958 +        if (!locked)
     5959 +                txg = spa_vdev_enter(spa);
     5960 +
     5961 +        vd = spa_lookup_by_guid(spa, guid, B_FALSE);
     5962 +
     5963 +        if (spa->spa_spares.sav_vdevs != NULL &&
     5964 +            nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
     5965 +            ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0 &&
     5966 +            (nv = spa_nvlist_lookup_by_guid(spares, nspares, guid)) != NULL) {
     5967 +                /*
     5968 +                 * Only remove the hot spare if it's not currently in use
     5969 +                 * in this pool.
     5970 +                 */
     5971 +                if (vd == NULL || unspare) {
     5972 +                        if (vd == NULL)
     5973 +                                vd = spa_lookup_by_guid(spa, guid, B_TRUE);
     5974 +
     5975 +                        /*
     5976 +                         * Release the references to CoS descriptors if any
     5977 +                         */
     5978 +                        if (vd != NULL && vd->vdev_queue.vq_cos) {
     5979 +                                cos_rele(vd->vdev_queue.vq_cos);
     5980 +                                vd->vdev_queue.vq_cos = NULL;
     5981 +                        }
     5982 +
     5983 +                        ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
     5984 +                        spa_vdev_remove_aux(spa->spa_spares.sav_config,
     5985 +                            ZPOOL_CONFIG_SPARES, spares, nspares, nv);
     5986 +                        spa_load_spares(spa);
     5987 +                        spa->spa_spares.sav_sync = B_TRUE;
     5988 +                } else {
     5989 +                        error = SET_ERROR(EBUSY);
     5990 +                }
     5991 +        } else if (spa->spa_l2cache.sav_vdevs != NULL &&
     5992 +            nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
     5993 +            ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0 &&
     5994 +            (nv = spa_nvlist_lookup_by_guid(l2cache, nl2cache, guid)) != NULL) {
     5995 +                /*
     5996 +                 * Cache devices can always be removed.
     5997 +                 */
     5998 +                if (vd == NULL)
     5999 +                        vd = spa_lookup_by_guid(spa, guid, B_TRUE);
     6000 +                /*
     6001 +                 * Release the references to CoS descriptors if any
     6002 +                 */
     6003 +                if (vd != NULL && vd->vdev_queue.vq_cos) {
     6004 +                        cos_rele(vd->vdev_queue.vq_cos);
     6005 +                        vd->vdev_queue.vq_cos = NULL;
     6006 +                }
     6007 +
     6008 +                ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
     6009 +                spa_vdev_remove_aux(spa->spa_l2cache.sav_config,
     6010 +                    ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache, nv);
     6011 +                spa_load_l2cache(spa);
     6012 +                spa->spa_l2cache.sav_sync = B_TRUE;
     6013 +        } else if (vd != NULL && vd->vdev_islog) {
     6014 +                ASSERT(!locked);
     6015 +
     6016 +                if (vd != vd->vdev_top)
     6017 +                        return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
     6018 +
     6019 +                mg = vd->vdev_mg;
     6020 +
     6021 +                /*
     6022 +                 * Stop allocating from this vdev.
     6023 +                 */
     6024 +                metaslab_group_passivate(mg);
     6025 +
     6026 +                /*
     6027 +                 * Wait for the youngest allocations and frees to sync,
     6028 +                 * and then wait for the deferral of those frees to finish.
     6029 +                 */
     6030 +                spa_vdev_config_exit(spa, NULL,
     6031 +                    txg + TXG_CONCURRENT_STATES + TXG_DEFER_SIZE, 0, FTAG);
     6032 +
     6033 +                /*
     6034 +                 * Attempt to evacuate the vdev.
     6035 +                 */
     6036 +                error = spa_vdev_remove_evacuate(spa, vd);
     6037 +
     6038 +                txg = spa_vdev_config_enter(spa);
     6039 +
     6040 +                /*
     6041 +                 * If we couldn't evacuate the vdev, unwind.
     6042 +                 */
     6043 +                if (error) {
     6044 +                        metaslab_group_activate(mg);
     6045 +                        return (spa_vdev_exit(spa, NULL, txg, error));
     6046 +                }
     6047 +
     6048 +                /*
     6049 +                 * Release the references to CoS descriptors if any
     6050 +                 */
     6051 +                if (vd->vdev_queue.vq_cos) {
     6052 +                        cos_rele(vd->vdev_queue.vq_cos);
     6053 +                        vd->vdev_queue.vq_cos = NULL;
     6054 +                }
     6055 +
     6056 +
     6057 +                /*
     6058 +                 * Clean up the vdev namespace.
     6059 +                 */
     6060 +                ev = spa_event_create(spa, vd, NULL,
     6061 +                    ESC_ZFS_VDEV_REMOVE_DEV);
     6062 +                spa_vdev_remove_from_namespace(spa, vd);
     6063 +
     6064 +        } else if (vd != NULL && vdev_is_special(vd)) {
     6065 +                ASSERT(!locked);
     6066 +
     6067 +                if (vd != vd->vdev_top)
     6068 +                        return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
     6069 +
     6070 +                error = spa_special_vdev_remove(spa, vd, &txg);
     6071 +                if (error == 0) {
     6072 +                        ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
     6073 +                        spa_vdev_remove_from_namespace(spa, vd);
     6074 +
     6075 +                        /*
     6076 +                         * The user sees this field as the
     6077 +                         * 'enablespecial' pool-level property.
     6078 +                         */
     6079 +                        spa->spa_usesc = B_FALSE;
     6080 +                }
     6081 +        } else if (vd != NULL) {
     6082 +                /*
     6083 +                 * Normal vdevs cannot be removed (yet).
     6084 +                 */
     6085 +                error = SET_ERROR(ENOTSUP);
     6086 +        } else {
     6087 +                /*
     6088 +                 * There is no vdev of any kind with the specified guid.
     6089 +                 */
     6090 +                error = SET_ERROR(ENOENT);
     6091 +        }
     6092 +
     6093 +        if (!locked)
     6094 +                error = spa_vdev_exit(spa, NULL, txg, error);
     6095 +
     6096 +        if (ev)
     6097 +                spa_event_notify_impl(ev);
     6098 +
     6099 +        return (error);
     6100 +}
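
As with detach, a minimal caller sketch may help; the helper name is hypothetical, and passing unspare == B_FALSE means an in-use hot spare is refused with EBUSY rather than forcibly removed.

        /*
         * Illustrative caller sketch -- not part of this change.
         */
        static int
        example_vdev_remove(const char *poolname, uint64_t guid)
        {
                spa_t *spa;
                int error;

                if ((error = spa_open(poolname, &spa, FTAG)) != 0)
                        return (error);

                error = spa_vdev_remove(spa, guid, B_FALSE);

                spa_close(spa, FTAG);
                return (error);
        }
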
     6101 +
     6102 +/*
6029 6103   * Find any device that's done replacing, or a vdev marked 'unspare' that's
6030 6104   * currently spared, so we can detach it.
6031 6105   */
6032 6106  static vdev_t *
6033 6107  spa_vdev_resilver_done_hunt(vdev_t *vd)
6034 6108  {
6035 6109          vdev_t *newvd, *oldvd;
6036 6110  
6037 6111          for (int c = 0; c < vd->vdev_children; c++) {
6038 6112                  oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
6039 6113                  if (oldvd != NULL)
6040 6114                          return (oldvd);
6041 6115          }
6042 6116  
6043 6117          /*
6044 6118           * Check for a completed replacement.  We always consider the first
6045 6119           * vdev in the list to be the oldest vdev, and the last one to be
6046 6120           * the newest (see spa_vdev_attach() for how that works).  In
6047 6121           * the case where the newest vdev is faulted, we will not automatically
6048 6122           * remove it after a resilver completes.  This is OK as it will require
6049 6123           * user intervention to determine which disk the admin wishes to keep.
6050 6124           */
6051 6125          if (vd->vdev_ops == &vdev_replacing_ops) {
6052 6126                  ASSERT(vd->vdev_children > 1);
6053 6127  
6054 6128                  newvd = vd->vdev_child[vd->vdev_children - 1];
6055 6129                  oldvd = vd->vdev_child[0];
6056 6130  
6057 6131                  if (vdev_dtl_empty(newvd, DTL_MISSING) &&
6058 6132                      vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6059 6133                      !vdev_dtl_required(oldvd))
6060 6134                          return (oldvd);
6061 6135          }
6062 6136  
6063 6137          /*
6064 6138           * Check for a completed resilver with the 'unspare' flag set.
     6139 +         * Also potentially update faulted state.
6065 6140           */
6066 6141          if (vd->vdev_ops == &vdev_spare_ops) {
6067 6142                  vdev_t *first = vd->vdev_child[0];
6068 6143                  vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
6069 6144  
6070 6145                  if (last->vdev_unspare) {
6071 6146                          oldvd = first;
6072 6147                          newvd = last;
6073 6148                  } else if (first->vdev_unspare) {
6074 6149                          oldvd = last;
6075 6150                          newvd = first;
6076 6151                  } else {
6077 6152                          oldvd = NULL;
6078 6153                  }
6079 6154  
6080 6155                  if (oldvd != NULL &&
6081 6156                      vdev_dtl_empty(newvd, DTL_MISSING) &&
6082 6157                      vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6083 6158                      !vdev_dtl_required(oldvd))
6084 6159                          return (oldvd);
6085 6160  
     6161 +                vdev_propagate_state(vd);
     6162 +
6086 6163                  /*
6087 6164                   * If there are more than two spares attached to a disk,
6088 6165                   * and those spares are not required, then we want to
6089 6166                   * attempt to free them up now so that they can be used
6090 6167                   * by other pools.  Once we're back down to a single
6091 6168                   * disk+spare, we stop removing them.
6092 6169                   */
6093 6170                  if (vd->vdev_children > 2) {
6094 6171                          newvd = vd->vdev_child[1];
6095 6172  
6096 6173                          if (newvd->vdev_isspare && last->vdev_isspare &&
6097 6174                              vdev_dtl_empty(last, DTL_MISSING) &&
6098 6175                              vdev_dtl_empty(last, DTL_OUTAGE) &&
6099 6176                              !vdev_dtl_required(newvd))
6100 6177                                  return (newvd);
6101 6178                  }
6102 6179          }
6103 6180  
6104 6181          return (NULL);
6105 6182  }
6106 6183  
6107 6184  static void
6108 6185  spa_vdev_resilver_done(spa_t *spa)
6109 6186  {
6110 6187          vdev_t *vd, *pvd, *ppvd;
6111 6188          uint64_t guid, sguid, pguid, ppguid;
6112 6189  
6113 6190          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6114 6191  
6115 6192          while ((vd = spa_vdev_resilver_done_hunt(spa->spa_root_vdev)) != NULL) {
6116 6193                  pvd = vd->vdev_parent;
6117 6194                  ppvd = pvd->vdev_parent;
6118 6195                  guid = vd->vdev_guid;
6119 6196                  pguid = pvd->vdev_guid;
6120 6197                  ppguid = ppvd->vdev_guid;
6121 6198                  sguid = 0;
6122 6199                  /*
6123 6200                   * If we have just finished replacing a hot spared device, then
6124 6201                   * we need to detach the parent's first child (the original hot
6125 6202                   * spare) as well.
6126 6203                   */
6127 6204                  if (ppvd->vdev_ops == &vdev_spare_ops && pvd->vdev_id == 0 &&
6128 6205                      ppvd->vdev_children == 2) {
6129 6206                          ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
6130 6207                          sguid = ppvd->vdev_child[1]->vdev_guid;
6131 6208                  }
6132 6209                  ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));
6133 6210  
6134 6211                  spa_config_exit(spa, SCL_ALL, FTAG);
6135 6212                  if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
6136 6213                          return;
6137 6214                  if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
6138 6215                          return;
6139 6216                  spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6140 6217          }
6141 6218  
6142 6219          spa_config_exit(spa, SCL_ALL, FTAG);
6143 6220  }
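
spa_vdev_resilver_done() is normally reached asynchronously: when a resilver completes, or when spa_scan() below finds no resilver work to do, the SPA_ASYNC_RESILVER_DONE task is requested and later serviced by spa_async_thread(). A one-line sketch of the trigger:

        /* illustrative: queue the resilver-done cleanup for the async thread */
        spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
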
6144 6221  
6145 6222  /*
6146      - * Update the stored path or FRU for this vdev.
6147      - */
6148      -int
6149      -spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value,
6150      -    boolean_t ispath)
6151      -{
6152      -        vdev_t *vd;
6153      -        boolean_t sync = B_FALSE;
6154      -
6155      -        ASSERT(spa_writeable(spa));
6156      -
6157      -        spa_vdev_state_enter(spa, SCL_ALL);
6158      -
6159      -        if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
6160      -                return (spa_vdev_state_exit(spa, NULL, ENOENT));
6161      -
6162      -        if (!vd->vdev_ops->vdev_op_leaf)
6163      -                return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
6164      -
6165      -        if (ispath) {
6166      -                if (strcmp(value, vd->vdev_path) != 0) {
6167      -                        spa_strfree(vd->vdev_path);
6168      -                        vd->vdev_path = spa_strdup(value);
6169      -                        sync = B_TRUE;
6170      -                }
6171      -        } else {
6172      -                if (vd->vdev_fru == NULL) {
6173      -                        vd->vdev_fru = spa_strdup(value);
6174      -                        sync = B_TRUE;
6175      -                } else if (strcmp(value, vd->vdev_fru) != 0) {
6176      -                        spa_strfree(vd->vdev_fru);
6177      -                        vd->vdev_fru = spa_strdup(value);
6178      -                        sync = B_TRUE;
6179      -                }
6180      -        }
6181      -
6182      -        return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0));
6183      -}
6184      -
6185      -int
6186      -spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath)
6187      -{
6188      -        return (spa_vdev_set_common(spa, guid, newpath, B_TRUE));
6189      -}
6190      -
6191      -int
6192      -spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru)
6193      -{
6194      -        return (spa_vdev_set_common(spa, guid, newfru, B_FALSE));
6195      -}
6196      -
6197      -/*
6198 6223   * ==========================================================================
6199 6224   * SPA Scanning
6200 6225   * ==========================================================================
6201 6226   */
6202 6227  int
6203 6228  spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
6204 6229  {
6205 6230          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6206 6231  
6207 6232          if (dsl_scan_resilvering(spa->spa_dsl_pool))
6208 6233                  return (SET_ERROR(EBUSY));
6209 6234  
6210 6235          return (dsl_scrub_set_pause_resume(spa->spa_dsl_pool, cmd));
6211 6236  }
6212 6237  
6213 6238  int
6214 6239  spa_scan_stop(spa_t *spa)
6215 6240  {
6216 6241          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6217 6242          if (dsl_scan_resilvering(spa->spa_dsl_pool))
6218 6243                  return (SET_ERROR(EBUSY));
6219 6244          return (dsl_scan_cancel(spa->spa_dsl_pool));
6220 6245  }
6221 6246  
6222 6247  int
6223 6248  spa_scan(spa_t *spa, pool_scan_func_t func)
6224 6249  {
6225 6250          ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6226 6251  
6227 6252          if (func >= POOL_SCAN_FUNCS || func == POOL_SCAN_NONE)
6228 6253                  return (SET_ERROR(ENOTSUP));
6229 6254  
6230 6255          /*
6231 6256           * If a resilver was requested, but there is no DTL on a
6232 6257           * writeable leaf device, we have nothing to do.
6233 6258           */
6234 6259          if (func == POOL_SCAN_RESILVER &&
6235 6260              !vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) {
6236 6261                  spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
6237 6262                  return (0);
6238 6263          }
6239 6264  
6240 6265          return (dsl_scan(spa->spa_dsl_pool, func));
6241 6266  }
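
A hedged usage sketch of the scanning entry points above, assuming the POOL_SCAN_SCRUB and POOL_SCRUB_PAUSE values from the scan and scrub-command enums; error handling is elided.

        /* illustrative usage of the scan entry points (error checks elided) */
        (void) spa_scan(spa, POOL_SCAN_SCRUB);                  /* start */
        (void) spa_scrub_pause_resume(spa, POOL_SCRUB_PAUSE);   /* pause */
        (void) spa_scan_stop(spa);                              /* cancel */
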
6242 6267  
6243 6268  /*
6244 6269   * ==========================================================================
6245 6270   * SPA async task processing
6246 6271   * ==========================================================================
6247 6272   */
6248 6273  
6249 6274  static void
6250 6275  spa_async_remove(spa_t *spa, vdev_t *vd)
6251 6276  {
6252 6277          if (vd->vdev_remove_wanted) {
6253 6278                  vd->vdev_remove_wanted = B_FALSE;
6254 6279                  vd->vdev_delayed_close = B_FALSE;
6255 6280                  vdev_set_state(vd, B_FALSE, VDEV_STATE_REMOVED, VDEV_AUX_NONE);
6256 6281  
6257 6282                  /*
6258 6283                   * We want to clear the stats, but we don't want to do a full
6259 6284                   * vdev_clear() as that will cause us to throw away
6260 6285                   * degraded/faulted state as well as attempt to reopen the
6261 6286                   * device, all of which is a waste.
6262 6287                   */
6263 6288                  vd->vdev_stat.vs_read_errors = 0;
6264 6289                  vd->vdev_stat.vs_write_errors = 0;
6265 6290                  vd->vdev_stat.vs_checksum_errors = 0;
6266 6291  
6267 6292                  vdev_state_dirty(vd->vdev_top);
6268 6293          }
6269 6294  
6270 6295          for (int c = 0; c < vd->vdev_children; c++)
6271 6296                  spa_async_remove(spa, vd->vdev_child[c]);
6272 6297  }
6273 6298  
6274 6299  static void
6275 6300  spa_async_probe(spa_t *spa, vdev_t *vd)
6276 6301  {
6277 6302          if (vd->vdev_probe_wanted) {
6278 6303                  vd->vdev_probe_wanted = B_FALSE;
6279 6304                  vdev_reopen(vd);        /* vdev_open() does the actual probe */
6280 6305          }
6281 6306  
6282 6307          for (int c = 0; c < vd->vdev_children; c++)
6283 6308                  spa_async_probe(spa, vd->vdev_child[c]);
6284 6309  }
6285 6310  
6286 6311  static void
6287 6312  spa_async_autoexpand(spa_t *spa, vdev_t *vd)
6288 6313  {
6289 6314          sysevent_id_t eid;
6290 6315          nvlist_t *attr;
6291 6316          char *physpath;
6292 6317  
6293 6318          if (!spa->spa_autoexpand)
6294 6319                  return;
6295 6320  
6296 6321          for (int c = 0; c < vd->vdev_children; c++) {
6297 6322                  vdev_t *cvd = vd->vdev_child[c];
6298 6323                  spa_async_autoexpand(spa, cvd);
6299 6324          }
6300 6325  
6301 6326          if (!vd->vdev_ops->vdev_op_leaf || vd->vdev_physpath == NULL)
6302 6327                  return;
6303 6328  
6304 6329          physpath = kmem_zalloc(MAXPATHLEN, KM_SLEEP);
6305 6330          (void) snprintf(physpath, MAXPATHLEN, "/devices%s", vd->vdev_physpath);
6306 6331  
6307 6332          VERIFY(nvlist_alloc(&attr, NV_UNIQUE_NAME, KM_SLEEP) == 0);
6308 6333          VERIFY(nvlist_add_string(attr, DEV_PHYS_PATH, physpath) == 0);
6309 6334  
6310 6335          (void) ddi_log_sysevent(zfs_dip, SUNW_VENDOR, EC_DEV_STATUS,
6311 6336              ESC_DEV_DLE, attr, &eid, DDI_SLEEP);
6312 6337  
6313 6338          nvlist_free(attr);
6314 6339          kmem_free(physpath, MAXPATHLEN);
6315 6340  }
6316 6341  
6317 6342  static void
6318 6343  spa_async_thread(void *arg)
6319 6344  {
6320 6345          spa_t *spa = (spa_t *)arg;
6321 6346          int tasks;
6322 6347  
6323 6348          ASSERT(spa->spa_sync_on);
6324 6349  
6325 6350          mutex_enter(&spa->spa_async_lock);
6326 6351          tasks = spa->spa_async_tasks;
6327 6352          spa->spa_async_tasks = 0;
6328 6353          mutex_exit(&spa->spa_async_lock);
6329 6354  
6330 6355          /*
6331 6356           * See if the config needs to be updated.
6332 6357           */
6333 6358          if (tasks & SPA_ASYNC_CONFIG_UPDATE) {
6334 6359                  uint64_t old_space, new_space;
6335 6360  
6336 6361                  mutex_enter(&spa_namespace_lock);
6337 6362                  old_space = metaslab_class_get_space(spa_normal_class(spa));
6338 6363                  spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
6339 6364                  new_space = metaslab_class_get_space(spa_normal_class(spa));
6340 6365                  mutex_exit(&spa_namespace_lock);
6341 6366  
6342 6367                  /*
6343 6368                   * If the pool grew as a result of the config update,
6344 6369                   * then log an internal history event.
6345 6370                   */
6346 6371                  if (new_space != old_space) {
6347 6372                          spa_history_log_internal(spa, "vdev online", NULL,
6348 6373                              "pool '%s' size: %llu(+%llu)",
6349 6374                              spa_name(spa), new_space, new_space - old_space);
6350 6375                  }
6351 6376          }
6352 6377  
6353 6378          /*
6354 6379           * See if any devices need to be marked REMOVED.
6355 6380           */
6356 6381          if (tasks & SPA_ASYNC_REMOVE) {
6357 6382                  spa_vdev_state_enter(spa, SCL_NONE);
6358 6383                  spa_async_remove(spa, spa->spa_root_vdev);
6359 6384                  for (int i = 0; i < spa->spa_l2cache.sav_count; i++)
6360 6385                          spa_async_remove(spa, spa->spa_l2cache.sav_vdevs[i]);
6361 6386                  for (int i = 0; i < spa->spa_spares.sav_count; i++)
6362 6387                          spa_async_remove(spa, spa->spa_spares.sav_vdevs[i]);
6363 6388                  (void) spa_vdev_state_exit(spa, NULL, 0);
6364 6389          }
6365 6390  
6366 6391          if ((tasks & SPA_ASYNC_AUTOEXPAND) && !spa_suspended(spa)) {
6367 6392                  spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
6368 6393                  spa_async_autoexpand(spa, spa->spa_root_vdev);
6369 6394                  spa_config_exit(spa, SCL_CONFIG, FTAG);
6370 6395          }
6371 6396  
6372 6397          /*
6373 6398           * See if any devices need to be probed.
6374 6399           */
6375 6400          if (tasks & SPA_ASYNC_PROBE) {
6376 6401                  spa_vdev_state_enter(spa, SCL_NONE);
6377 6402                  spa_async_probe(spa, spa->spa_root_vdev);
6378 6403                  (void) spa_vdev_state_exit(spa, NULL, 0);
6379 6404          }
6380 6405  
6381 6406          /*
6382 6407           * If any devices are done replacing, detach them.
6383 6408           */
6384 6409          if (tasks & SPA_ASYNC_RESILVER_DONE)
6385 6410                  spa_vdev_resilver_done(spa);
6386 6411  
6387 6412          /*
6388 6413           * Kick off a resilver.
6389 6414           */
6390 6415          if (tasks & SPA_ASYNC_RESILVER)
6391 6416                  dsl_resilver_restart(spa->spa_dsl_pool, 0);
6392 6417  
6393 6418          /*
     6419 +         * Kick off L2 cache rebuilding.
     6420 +         */
     6421 +        if (tasks & SPA_ASYNC_L2CACHE_REBUILD)
     6422 +                l2arc_spa_rebuild_start(spa);
     6423 +
     6424 +        if (tasks & SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY) {
     6425 +                mutex_enter(&spa->spa_man_trim_lock);
     6426 +                spa_man_trim_taskq_destroy(spa);
     6427 +                mutex_exit(&spa->spa_man_trim_lock);
     6428 +        }
     6429 +
     6430 +        /*
6394 6431           * Let the world know that we're done.
6395 6432           */
6396 6433          mutex_enter(&spa->spa_async_lock);
6397 6434          spa->spa_async_thread = NULL;
6398 6435          cv_broadcast(&spa->spa_async_cv);
6399 6436          mutex_exit(&spa->spa_async_lock);
6400 6437          thread_exit();
6401 6438  }
6402 6439  
6403 6440  void
6404 6441  spa_async_suspend(spa_t *spa)
6405 6442  {
6406 6443          mutex_enter(&spa->spa_async_lock);
6407 6444          spa->spa_async_suspended++;
6408 6445          while (spa->spa_async_thread != NULL)
6409 6446                  cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
6410 6447          mutex_exit(&spa->spa_async_lock);
6411      -
6412      -        spa_vdev_remove_suspend(spa);
6413      -
6414      -        zthr_t *condense_thread = spa->spa_condense_zthr;
6415      -        if (condense_thread != NULL && zthr_isrunning(condense_thread))
6416      -                VERIFY0(zthr_cancel(condense_thread));
6417 6448  }
6418 6449  
6419 6450  void
6420 6451  spa_async_resume(spa_t *spa)
6421 6452  {
6422 6453          mutex_enter(&spa->spa_async_lock);
6423 6454          ASSERT(spa->spa_async_suspended != 0);
6424 6455          spa->spa_async_suspended--;
6425 6456          mutex_exit(&spa->spa_async_lock);
6426      -        spa_restart_removal(spa);
6427      -
6428      -        zthr_t *condense_thread = spa->spa_condense_zthr;
6429      -        if (condense_thread != NULL && !zthr_isrunning(condense_thread))
6430      -                zthr_resume(condense_thread);
6431 6457  }
6432 6458  
6433 6459  static boolean_t
6434 6460  spa_async_tasks_pending(spa_t *spa)
6435 6461  {
6436 6462          uint_t non_config_tasks;
6437 6463          uint_t config_task;
6438 6464          boolean_t config_task_suspended;
6439 6465  
6440 6466          non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
6441 6467          config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
6442 6468          if (spa->spa_ccw_fail_time == 0) {
6443 6469                  config_task_suspended = B_FALSE;
6444 6470          } else {
6445 6471                  config_task_suspended =
6446 6472                      (gethrtime() - spa->spa_ccw_fail_time) <
6447 6473                      (zfs_ccw_retry_interval * NANOSEC);
6448 6474          }
6449 6475  
6450 6476          return (non_config_tasks || (config_task && !config_task_suspended));
6451 6477  }
6452 6478  
6453 6479  static void
6454 6480  spa_async_dispatch(spa_t *spa)
6455 6481  {
6456 6482          mutex_enter(&spa->spa_async_lock);
6457 6483          if (spa_async_tasks_pending(spa) &&
6458 6484              !spa->spa_async_suspended &&
6459 6485              spa->spa_async_thread == NULL &&
6460 6486              rootdir != NULL)
6461 6487                  spa->spa_async_thread = thread_create(NULL, 0,
6462 6488                      spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
6463 6489          mutex_exit(&spa->spa_async_lock);
6464 6490  }
6465 6491  
6466 6492  void
6467 6493  spa_async_request(spa_t *spa, int task)
6468 6494  {
6469 6495          zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
6470 6496          mutex_enter(&spa->spa_async_lock);
6471 6497          spa->spa_async_tasks |= task;
6472 6498          mutex_exit(&spa->spa_async_lock);
6473 6499  }
6474 6500  
     6501 +void
     6502 +spa_async_unrequest(spa_t *spa, int task)
     6503 +{
     6504 +        zfs_dbgmsg("spa=%s async unrequest task=%u", spa->spa_name, task);
     6505 +        mutex_enter(&spa->spa_async_lock);
     6506 +        spa->spa_async_tasks &= ~task;
     6507 +        mutex_exit(&spa->spa_async_lock);
     6508 +}
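
For orientation, a small sketch of the request/dispatch pattern: a task bit is set (or cleared) under spa_async_lock, and spa_async_dispatch(), typically reached from the sync path, starts spa_async_thread() to service whatever bits are pending.

        /* illustrative: queue an autoexpand check, then let dispatch pick it up */
        spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
        spa_async_dispatch(spa);
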
     6509 +
6475 6510  /*
6476 6511   * ==========================================================================
6477 6512   * SPA syncing routines
6478 6513   * ==========================================================================
6479 6514   */
6480 6515  
6481 6516  static int
6482 6517  bpobj_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6483 6518  {
6484 6519          bpobj_t *bpo = arg;
6485 6520          bpobj_enqueue(bpo, bp, tx);
6486 6521          return (0);
6487 6522  }
6488 6523  
6489 6524  static int
6490 6525  spa_free_sync_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6491 6526  {
6492 6527          zio_t *zio = arg;
6493 6528  
6494 6529          zio_nowait(zio_free_sync(zio, zio->io_spa, dmu_tx_get_txg(tx), bp,
6495 6530              zio->io_flags));
6496 6531          return (0);
6497 6532  }
6498 6533  
6499 6534  /*
6500 6535   * Note: this simple function is not inlined to make it easier to dtrace the
6501 6536   * amount of time spent syncing frees.
6502 6537   */
6503 6538  static void
6504 6539  spa_sync_frees(spa_t *spa, bplist_t *bpl, dmu_tx_t *tx)
6505 6540  {
6506 6541          zio_t *zio = zio_root(spa, NULL, NULL, 0);
6507 6542          bplist_iterate(bpl, spa_free_sync_cb, zio, tx);
6508 6543          VERIFY(zio_wait(zio) == 0);
6509 6544  }
6510 6545  
6511 6546  /*
6512 6547   * Note: this simple function is not inlined to make it easier to dtrace the
6513 6548   * amount of time spent syncing deferred frees.
6514 6549   */
6515 6550  static void
6516 6551  spa_sync_deferred_frees(spa_t *spa, dmu_tx_t *tx)
6517 6552  {
6518 6553          zio_t *zio = zio_root(spa, NULL, NULL, 0);
6519 6554          VERIFY3U(bpobj_iterate(&spa->spa_deferred_bpobj,
6520 6555              spa_free_sync_cb, zio, tx), ==, 0);
6521 6556          VERIFY0(zio_wait(zio));
6522 6557  }
6523 6558  
6524 6559  
6525 6560  static void
6526 6561  spa_sync_nvlist(spa_t *spa, uint64_t obj, nvlist_t *nv, dmu_tx_t *tx)
6527 6562  {
6528 6563          char *packed = NULL;
6529 6564          size_t bufsize;
6530 6565          size_t nvsize = 0;
6531 6566          dmu_buf_t *db;
6532 6567  
6533 6568          VERIFY(nvlist_size(nv, &nvsize, NV_ENCODE_XDR) == 0);
6534 6569  
6535 6570          /*
6536 6571           * Write full (SPA_CONFIG_BLOCKSIZE) blocks of configuration
6537 6572           * information.  This avoids the dmu_buf_will_dirty() path and
6538 6573           * saves us a pre-read to get data we don't actually care about.
6539 6574           */
6540 6575          bufsize = P2ROUNDUP((uint64_t)nvsize, SPA_CONFIG_BLOCKSIZE);
6541 6576          packed = kmem_alloc(bufsize, KM_SLEEP);
6542 6577  
6543 6578          VERIFY(nvlist_pack(nv, &packed, &nvsize, NV_ENCODE_XDR,
6544 6579              KM_SLEEP) == 0);
6545 6580          bzero(packed + nvsize, bufsize - nvsize);
6546 6581  
6547 6582          dmu_write(spa->spa_meta_objset, obj, 0, bufsize, packed, tx);
6548 6583  
6549 6584          kmem_free(packed, bufsize);
6550 6585  
6551 6586          VERIFY(0 == dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db));
6552 6587          dmu_buf_will_dirty(db, tx);
6553 6588          *(uint64_t *)db->db_data = nvsize;
6554 6589          dmu_buf_rele(db, FTAG);
6555 6590  }
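
A quick worked example of the rounding above, assuming SPA_CONFIG_BLOCKSIZE is 16K (1 << 14, the same block size used for DMU_OT_PACKED_NVLIST objects below):

        /*
         * Illustrative arithmetic: a 37,000-byte packed nvlist rounds up to
         * P2ROUNDUP(37000, 16384) == 49,152 bytes, i.e. three full 16K
         * blocks; the 12,152 trailing bytes are zeroed so the dmu_write()
         * covers whole blocks and avoids a read-modify-write of the tail.
         */
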
6556 6591  
6557 6592  static void
6558 6593  spa_sync_aux_dev(spa_t *spa, spa_aux_vdev_t *sav, dmu_tx_t *tx,
6559 6594      const char *config, const char *entry)
6560 6595  {
6561 6596          nvlist_t *nvroot;
6562 6597          nvlist_t **list;
6563 6598          int i;
6564 6599  
6565 6600          if (!sav->sav_sync)
6566 6601                  return;
6567 6602  
6568 6603          /*
6569 6604           * Update the MOS nvlist describing the list of available devices.
6570 6605           * spa_validate_aux() will have already made sure this nvlist is
6571 6606           * valid and the vdevs are labeled appropriately.
6572 6607           */
6573 6608          if (sav->sav_object == 0) {
6574 6609                  sav->sav_object = dmu_object_alloc(spa->spa_meta_objset,
6575 6610                      DMU_OT_PACKED_NVLIST, 1 << 14, DMU_OT_PACKED_NVLIST_SIZE,
6576 6611                      sizeof (uint64_t), tx);
6577 6612                  VERIFY(zap_update(spa->spa_meta_objset,
6578 6613                      DMU_POOL_DIRECTORY_OBJECT, entry, sizeof (uint64_t), 1,
6579 6614                      &sav->sav_object, tx) == 0);
6580 6615          }
6581 6616  
6582 6617          VERIFY(nvlist_alloc(&nvroot, NV_UNIQUE_NAME, KM_SLEEP) == 0);
6583 6618          if (sav->sav_count == 0) {
6584 6619                  VERIFY(nvlist_add_nvlist_array(nvroot, config, NULL, 0) == 0);
6585 6620          } else {
6586 6621                  list = kmem_alloc(sav->sav_count * sizeof (void *), KM_SLEEP);
6587 6622                  for (i = 0; i < sav->sav_count; i++)
6588 6623                          list[i] = vdev_config_generate(spa, sav->sav_vdevs[i],
6589 6624                              B_FALSE, VDEV_CONFIG_L2CACHE);
6590 6625                  VERIFY(nvlist_add_nvlist_array(nvroot, config, list,
6591 6626                      sav->sav_count) == 0);
6592 6627                  for (i = 0; i < sav->sav_count; i++)
6593 6628                          nvlist_free(list[i]);
6594 6629                  kmem_free(list, sav->sav_count * sizeof (void *));
6595 6630          }
6596 6631  
6597 6632          spa_sync_nvlist(spa, sav->sav_object, nvroot, tx);
6598 6633          nvlist_free(nvroot);
6599 6634  
6600 6635          sav->sav_sync = B_FALSE;
6601 6636  }
6602 6637  
6603 6638  /*
6604 6639   * Rebuild spa's all-vdev ZAP from the vdev ZAPs indicated in each vdev_t.
6605 6640   * The all-vdev ZAP must be empty.
6606 6641   */
6607 6642  static void
6608 6643  spa_avz_build(vdev_t *vd, uint64_t avz, dmu_tx_t *tx)
6609 6644  {
6610 6645          spa_t *spa = vd->vdev_spa;
6611 6646          if (vd->vdev_top_zap != 0) {
6612 6647                  VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
6613 6648                      vd->vdev_top_zap, tx));
6614 6649          }
6615 6650          if (vd->vdev_leaf_zap != 0) {
6616 6651                  VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
6617 6652                      vd->vdev_leaf_zap, tx));
6618 6653          }
6619 6654          for (uint64_t i = 0; i < vd->vdev_children; i++) {
6620 6655                  spa_avz_build(vd->vdev_child[i], avz, tx);
6621 6656          }
6622 6657  }
6623 6658  
6624 6659  static void
6625 6660  spa_sync_config_object(spa_t *spa, dmu_tx_t *tx)
6626 6661  {
6627 6662          nvlist_t *config;
6628 6663  
6629 6664          /*
6630 6665           * If the pool is being imported from a pre-per-vdev-ZAP version of ZFS,
6631 6666           * its config may not be dirty but we still need to build per-vdev ZAPs.
6632 6667           * Similarly, if the pool is being assembled (e.g. after a split), we
6633 6668           * need to rebuild the AVZ although the config may not be dirty.
6634 6669           */
6635 6670          if (list_is_empty(&spa->spa_config_dirty_list) &&
6636 6671              spa->spa_avz_action == AVZ_ACTION_NONE)
6637 6672                  return;
6638 6673  
6639 6674          spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
6640 6675  
6641 6676          ASSERT(spa->spa_avz_action == AVZ_ACTION_NONE ||
6642 6677              spa->spa_avz_action == AVZ_ACTION_INITIALIZE ||
6643 6678              spa->spa_all_vdev_zaps != 0);
6644 6679  
6645 6680          if (spa->spa_avz_action == AVZ_ACTION_REBUILD) {
6646 6681                  /* Make and build the new AVZ */
6647 6682                  uint64_t new_avz = zap_create(spa->spa_meta_objset,
6648 6683                      DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx);
6649 6684                  spa_avz_build(spa->spa_root_vdev, new_avz, tx);
6650 6685  
6651 6686                  /* Diff old AVZ with new one */
6652 6687                  zap_cursor_t zc;
6653 6688                  zap_attribute_t za;
6654 6689  
6655 6690                  for (zap_cursor_init(&zc, spa->spa_meta_objset,
6656 6691                      spa->spa_all_vdev_zaps);
6657 6692                      zap_cursor_retrieve(&zc, &za) == 0;
6658 6693                      zap_cursor_advance(&zc)) {
6659 6694                          uint64_t vdzap = za.za_first_integer;
6660 6695                          if (zap_lookup_int(spa->spa_meta_objset, new_avz,
6661 6696                              vdzap) == ENOENT) {
6662 6697                                  /*
6663 6698                                   * ZAP is listed in old AVZ but not in new one;
6664 6699                                   * destroy it
6665 6700                                   */
6666 6701                                  VERIFY0(zap_destroy(spa->spa_meta_objset, vdzap,
6667 6702                                      tx));
6668 6703                          }
6669 6704                  }
6670 6705  
6671 6706                  zap_cursor_fini(&zc);
6672 6707  
6673 6708                  /* Destroy the old AVZ */
6674 6709                  VERIFY0(zap_destroy(spa->spa_meta_objset,
6675 6710                      spa->spa_all_vdev_zaps, tx));
6676 6711  
6677 6712                  /* Replace the old AVZ in the dir obj with the new one */
6678 6713                  VERIFY0(zap_update(spa->spa_meta_objset,
6679 6714                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP,
6680 6715                      sizeof (new_avz), 1, &new_avz, tx));
6681 6716  
6682 6717                  spa->spa_all_vdev_zaps = new_avz;
6683 6718          } else if (spa->spa_avz_action == AVZ_ACTION_DESTROY) {
6684 6719                  zap_cursor_t zc;
6685 6720                  zap_attribute_t za;
6686 6721  
6687 6722                  /* Walk through the AVZ and destroy all listed ZAPs */
6688 6723                  for (zap_cursor_init(&zc, spa->spa_meta_objset,
6689 6724                      spa->spa_all_vdev_zaps);
6690 6725                      zap_cursor_retrieve(&zc, &za) == 0;
6691 6726                      zap_cursor_advance(&zc)) {
6692 6727                          uint64_t zap = za.za_first_integer;
6693 6728                          VERIFY0(zap_destroy(spa->spa_meta_objset, zap, tx));
6694 6729                  }
6695 6730  
6696 6731                  zap_cursor_fini(&zc);
6697 6732  
6698 6733                  /* Destroy and unlink the AVZ itself */
6699 6734                  VERIFY0(zap_destroy(spa->spa_meta_objset,
6700 6735                      spa->spa_all_vdev_zaps, tx));
6701 6736                  VERIFY0(zap_remove(spa->spa_meta_objset,
6702 6737                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP, tx));
6703 6738                  spa->spa_all_vdev_zaps = 0;
6704 6739          }
6705 6740  
6706 6741          if (spa->spa_all_vdev_zaps == 0) {
6707 6742                  spa->spa_all_vdev_zaps = zap_create_link(spa->spa_meta_objset,
6708 6743                      DMU_OTN_ZAP_METADATA, DMU_POOL_DIRECTORY_OBJECT,
6709 6744                      DMU_POOL_VDEV_ZAP_MAP, tx);
6710 6745          }
6711 6746          spa->spa_avz_action = AVZ_ACTION_NONE;
6712 6747  
6713 6748          /* Create ZAPs for vdevs that don't have them. */
6714 6749          vdev_construct_zaps(spa->spa_root_vdev, tx);
6715 6750  
6716 6751          config = spa_config_generate(spa, spa->spa_root_vdev,
6717 6752              dmu_tx_get_txg(tx), B_FALSE);
6718 6753  
6719 6754          /*
6720 6755           * If we're upgrading the spa version then make sure that
6721 6756           * the config object gets updated with the correct version.
6722 6757           */
6723 6758          if (spa->spa_ubsync.ub_version < spa->spa_uberblock.ub_version)
6724 6759                  fnvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
6725 6760                      spa->spa_uberblock.ub_version);
6726 6761  
6727 6762          spa_config_exit(spa, SCL_STATE, FTAG);
6728 6763  
6729 6764          nvlist_free(spa->spa_config_syncing);
6730 6765          spa->spa_config_syncing = config;
6731 6766  
6732 6767          spa_sync_nvlist(spa, spa->spa_config_object, config, tx);
6733 6768  }
6734 6769  
6735 6770  static void
6736 6771  spa_sync_version(void *arg, dmu_tx_t *tx)
6737 6772  {
6738 6773          uint64_t *versionp = arg;
6739 6774          uint64_t version = *versionp;
6740 6775          spa_t *spa = dmu_tx_pool(tx)->dp_spa;
6741 6776  
6742 6777          /*
6743 6778           * Setting the version is special cased when first creating the pool.
6744 6779           */
6745 6780          ASSERT(tx->tx_txg != TXG_INITIAL);
6746 6781  
6747 6782          ASSERT(SPA_VERSION_IS_SUPPORTED(version));
6748 6783          ASSERT(version >= spa_version(spa));
6749 6784  
6750 6785          spa->spa_uberblock.ub_version = version;
6751 6786          vdev_config_dirty(spa->spa_root_vdev);
6752 6787          spa_history_log_internal(spa, "set", tx, "version=%lld", version);
6753 6788  }
6754 6789  
6755 6790  /*
6756 6791   * Set zpool properties.
6757 6792   */
6758 6793  static void
6759 6794  spa_sync_props(void *arg, dmu_tx_t *tx)
6760 6795  {
6761 6796          nvlist_t *nvp = arg;
6762 6797          spa_t *spa = dmu_tx_pool(tx)->dp_spa;
     6798 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
6763 6799          objset_t *mos = spa->spa_meta_objset;
6764 6800          nvpair_t *elem = NULL;
6765 6801  
6766 6802          mutex_enter(&spa->spa_props_lock);
6767 6803  
6768 6804          while ((elem = nvlist_next_nvpair(nvp, elem))) {
6769 6805                  uint64_t intval;
6770 6806                  char *strval, *fname;
6771 6807                  zpool_prop_t prop;
6772 6808                  const char *propname;
6773 6809                  zprop_type_t proptype;
6774 6810                  spa_feature_t fid;
6775 6811  
6776 6812                  switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
6777      -                case ZPOOL_PROP_INVAL:
     6813 +                case ZPROP_INVAL:
6778 6814                          /*
6779 6815                           * We checked this earlier in spa_prop_validate().
6780 6816                           */
6781 6817                          ASSERT(zpool_prop_feature(nvpair_name(elem)));
6782 6818  
6783 6819                          fname = strchr(nvpair_name(elem), '@') + 1;
6784 6820                          VERIFY0(zfeature_lookup_name(fname, &fid));
6785 6821  
6786 6822                          spa_feature_enable(spa, fid, tx);
6787 6823                          spa_history_log_internal(spa, "set", tx,
6788 6824                              "%s=enabled", nvpair_name(elem));
6789 6825                          break;
6790 6826  
6791 6827                  case ZPOOL_PROP_VERSION:
6792 6828                          intval = fnvpair_value_uint64(elem);
6793 6829                          /*
6794 6830                           * The version is synced separately before other
6795 6831                           * properties and should be correct by now.
6796 6832                           */
6797 6833                          ASSERT3U(spa_version(spa), >=, intval);
6798 6834                          break;
6799 6835  
6800 6836                  case ZPOOL_PROP_ALTROOT:
6801 6837                          /*
6802 6838                           * 'altroot' is a non-persistent property. It should
6803 6839                           * have been set temporarily at creation or import time.
6804 6840                           */
6805 6841                          ASSERT(spa->spa_root != NULL);
6806 6842                          break;
6807 6843  
6808 6844                  case ZPOOL_PROP_READONLY:
6809 6845                  case ZPOOL_PROP_CACHEFILE:
6810 6846                          /*
6811 6847                           * 'readonly' and 'cachefile' are also non-persistent
6812 6848                           * properties.
6813 6849                           */
6814 6850                          break;
6815 6851                  case ZPOOL_PROP_COMMENT:
6816 6852                          strval = fnvpair_value_string(elem);
6817 6853                          if (spa->spa_comment != NULL)
6818 6854                                  spa_strfree(spa->spa_comment);
6819 6855                          spa->spa_comment = spa_strdup(strval);
6820 6856                          /*
6821 6857                           * We need to dirty the configuration on all the vdevs
6822 6858                           * so that their labels get updated.  It's unnecessary
6823 6859                           * to do this for pool creation since the vdev's
6824 6860                           * configuration has already been dirtied.
6825 6861                           */
6826 6862                          if (tx->tx_txg != TXG_INITIAL)
6827 6863                                  vdev_config_dirty(spa->spa_root_vdev);
6828 6864                          spa_history_log_internal(spa, "set", tx,
6829 6865                              "%s=%s", nvpair_name(elem), strval);
6830 6866                          break;
6831 6867                  default:
6832 6868                          /*
6833 6869                           * Set pool property values in the poolprops mos object.
6834 6870                           */
6835 6871                          if (spa->spa_pool_props_object == 0) {
6836 6872                                  spa->spa_pool_props_object =
6837 6873                                      zap_create_link(mos, DMU_OT_POOL_PROPS,
6838 6874                                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_PROPS,
6839 6875                                      tx);
6840 6876                          }
6841 6877  
6842 6878                          /* normalize the property name */
6843 6879                          propname = zpool_prop_to_name(prop);
6844 6880                          proptype = zpool_prop_get_type(prop);
6845 6881  
6846 6882                          if (nvpair_type(elem) == DATA_TYPE_STRING) {
6847 6883                                  ASSERT(proptype == PROP_TYPE_STRING);
6848 6884                                  strval = fnvpair_value_string(elem);
6849 6885                                  VERIFY0(zap_update(mos,
6850 6886                                      spa->spa_pool_props_object, propname,
6851 6887                                      1, strlen(strval) + 1, strval, tx));
6852 6888                                  spa_history_log_internal(spa, "set", tx,
6853 6889                                      "%s=%s", nvpair_name(elem), strval);
6854 6890                          } else if (nvpair_type(elem) == DATA_TYPE_UINT64) {
6855 6891                                  intval = fnvpair_value_uint64(elem);
6856 6892  
6857 6893                                  if (proptype == PROP_TYPE_INDEX) {
6858 6894                                          const char *unused;
6859 6895                                          VERIFY0(zpool_prop_index_to_string(
6860 6896                                              prop, intval, &unused));
6861 6897                                  }
6862 6898                                  VERIFY0(zap_update(mos,
6863 6899                                      spa->spa_pool_props_object, propname,
6864 6900                                      8, 1, &intval, tx));
6865 6901                                  spa_history_log_internal(spa, "set", tx,
6866 6902                                      "%s=%lld", nvpair_name(elem), intval);
6867 6903                          } else {
6868 6904                                  ASSERT(0); /* not allowed */
6869 6905                          }
6870 6906  
6871 6907                          switch (prop) {
6872 6908                          case ZPOOL_PROP_DELEGATION:
6873 6909                                  spa->spa_delegation = intval;
6874 6910                                  break;
     6911 +                        case ZPOOL_PROP_DDT_DESEGREGATION:
     6912 +                                spa_set_ddt_classes(spa, intval);
     6913 +                                break;
     6914 +                        case ZPOOL_PROP_DEDUP_BEST_EFFORT:
     6915 +                                spa->spa_dedup_best_effort = intval;
     6916 +                                break;
     6917 +                        case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
     6918 +                                spa->spa_dedup_lo_best_effort = intval;
     6919 +                                break;
     6920 +                        case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
     6921 +                                spa->spa_dedup_hi_best_effort = intval;
     6922 +                                break;
6875 6923                          case ZPOOL_PROP_BOOTFS:
6876 6924                                  spa->spa_bootfs = intval;
6877 6925                                  break;
6878 6926                          case ZPOOL_PROP_FAILUREMODE:
6879 6927                                  spa->spa_failmode = intval;
6880 6928                                  break;
     6929 +                        case ZPOOL_PROP_FORCETRIM:
     6930 +                                spa->spa_force_trim = intval;
     6931 +                                break;
     6932 +                        case ZPOOL_PROP_AUTOTRIM:
     6933 +                                mutex_enter(&spa->spa_auto_trim_lock);
     6934 +                                if (intval != spa->spa_auto_trim) {
     6935 +                                        spa->spa_auto_trim = intval;
     6936 +                                        if (intval != 0)
     6937 +                                                spa_auto_trim_taskq_create(spa);
     6938 +                                        else
     6939 +                                                spa_auto_trim_taskq_destroy(
     6940 +                                                    spa);
     6941 +                                }
     6942 +                                mutex_exit(&spa->spa_auto_trim_lock);
     6943 +                                break;
6881 6944                          case ZPOOL_PROP_AUTOEXPAND:
6882 6945                                  spa->spa_autoexpand = intval;
6883 6946                                  if (tx->tx_txg != TXG_INITIAL)
6884 6947                                          spa_async_request(spa,
6885 6948                                              SPA_ASYNC_AUTOEXPAND);
6886 6949                                  break;
6887 6950                          case ZPOOL_PROP_DEDUPDITTO:
6888 6951                                  spa->spa_dedup_ditto = intval;
6889 6952                                  break;
     6953 +                        case ZPOOL_PROP_MINWATERMARK:
     6954 +                                spa->spa_minwat = intval;
     6955 +                                break;
     6956 +                        case ZPOOL_PROP_LOWATERMARK:
     6957 +                                spa->spa_lowat = intval;
     6958 +                                break;
     6959 +                        case ZPOOL_PROP_HIWATERMARK:
     6960 +                                spa->spa_hiwat = intval;
     6961 +                                break;
     6962 +                        case ZPOOL_PROP_DEDUPMETA_DITTO:
     6963 +                                spa->spa_ddt_meta_copies = intval;
     6964 +                                break;
     6965 +                        case ZPOOL_PROP_META_PLACEMENT:
     6966 +                                mp->spa_enable_meta_placement_selection =
     6967 +                                    intval;
     6968 +                                break;
     6969 +                        case ZPOOL_PROP_SYNC_TO_SPECIAL:
     6970 +                                mp->spa_sync_to_special = intval;
     6971 +                                break;
     6972 +                        case ZPOOL_PROP_DDT_META_TO_METADEV:
     6973 +                                mp->spa_ddt_meta_to_special = intval;
     6974 +                                break;
     6975 +                        case ZPOOL_PROP_ZFS_META_TO_METADEV:
     6976 +                                mp->spa_zfs_meta_to_special = intval;
     6977 +                                break;
     6978 +                        case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
     6979 +                                mp->spa_small_data_to_special = intval;
     6980 +                                break;
     6981 +                        case ZPOOL_PROP_RESILVER_PRIO:
     6982 +                                spa->spa_resilver_prio = intval;
     6983 +                                break;
     6984 +                        case ZPOOL_PROP_SCRUB_PRIO:
     6985 +                                spa->spa_scrub_prio = intval;
     6986 +                                break;
6890 6987                          default:
6891 6988                                  break;
6892 6989                          }
6893 6990                  }
6894 6991  
6895 6992          }
6896 6993  
6897 6994          mutex_exit(&spa->spa_props_lock);
6898 6995  }
6899 6996  
6900 6997  /*
6901 6998   * Perform one-time upgrade on-disk changes.  spa_version() does not
6902 6999   * reflect the new version this txg, so there must be no changes this
6903 7000   * txg to anything that the upgrade code depends on after it executes.
6904 7001   * Therefore this must be called after dsl_pool_sync() does the sync
6905 7002   * tasks.
6906 7003   */
6907 7004  static void
6908 7005  spa_sync_upgrades(spa_t *spa, dmu_tx_t *tx)
6909 7006  {
6910 7007          dsl_pool_t *dp = spa->spa_dsl_pool;
6911 7008  
6912 7009          ASSERT(spa->spa_sync_pass == 1);
6913 7010  
6914 7011          rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);
6915 7012  
6916 7013          if (spa->spa_ubsync.ub_version < SPA_VERSION_ORIGIN &&
6917 7014              spa->spa_uberblock.ub_version >= SPA_VERSION_ORIGIN) {
6918 7015                  dsl_pool_create_origin(dp, tx);
6919 7016  
6920 7017                  /* Keeping the origin open increases spa_minref */
6921 7018                  spa->spa_minref += 3;
6922 7019          }
6923 7020  
6924 7021          if (spa->spa_ubsync.ub_version < SPA_VERSION_NEXT_CLONES &&
6925 7022              spa->spa_uberblock.ub_version >= SPA_VERSION_NEXT_CLONES) {
6926 7023                  dsl_pool_upgrade_clones(dp, tx);
6927 7024          }
6928 7025  
6929 7026          if (spa->spa_ubsync.ub_version < SPA_VERSION_DIR_CLONES &&
6930 7027              spa->spa_uberblock.ub_version >= SPA_VERSION_DIR_CLONES) {
6931 7028                  dsl_pool_upgrade_dir_clones(dp, tx);
6932 7029  
6933 7030                  /* Keeping the freedir open increases spa_minref */
6934 7031                  spa->spa_minref += 3;
6935 7032          }
6936 7033  
6937 7034          if (spa->spa_ubsync.ub_version < SPA_VERSION_FEATURES &&
6938 7035              spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
6939 7036                  spa_feature_create_zap_objects(spa, tx);
6940 7037          }
6941 7038  
6942 7039          /*
6943 7040           * The LZ4_COMPRESS feature's behaviour was changed to
6944 7041           * activate_on_enable when the ability to use lz4 compression for
6945 7042           * metadata was added.  Old pools that have this feature enabled
6946 7043           * must be upgraded to have this feature active.
6947 7044           */
6948 7045          if (spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
6949 7046                  boolean_t lz4_en = spa_feature_is_enabled(spa,
6950 7047                      SPA_FEATURE_LZ4_COMPRESS);
6951 7048                  boolean_t lz4_ac = spa_feature_is_active(spa,
6952 7049                      SPA_FEATURE_LZ4_COMPRESS);
6953 7050  
6954 7051                  if (lz4_en && !lz4_ac)
6955 7052                          spa_feature_incr(spa, SPA_FEATURE_LZ4_COMPRESS, tx);
6956 7053          }
6957 7054  
6958 7055          /*
6959 7056           * If we haven't written the salt, do so now.  Note that the
6960 7057           * feature may not be activated yet, but that's fine since
6961 7058           * the presence of this ZAP entry is backwards compatible.
6962 7059           */
6963 7060          if (zap_contains(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
6964 7061              DMU_POOL_CHECKSUM_SALT) == ENOENT) {
6965 7062                  VERIFY0(zap_add(spa->spa_meta_objset,
6966 7063                      DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
6967 7064                      sizeof (spa->spa_cksum_salt.zcs_bytes),
6968 7065                      spa->spa_cksum_salt.zcs_bytes, tx));
6969 7066          }
6970 7067  
6971 7068          rrw_exit(&dp->dp_config_rwlock, FTAG);
6972 7069  }
6973 7070  
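           +/*
           + * Set up the DVA allocation throttle for both the normal and the
           + * special metaslab classes at the start of a txg sync: each class
           + * gets queue_depth_total allocation slots and picks up the current
           + * zio_dva_throttle_enabled setting.
           + */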
6974 7071  static void
6975      -vdev_indirect_state_sync_verify(vdev_t *vd)
     7072 +spa_initialize_alloc_trees(spa_t *spa, uint32_t max_queue_depth,
     7073 +    uint64_t queue_depth_total)
6976 7074  {
6977      -        vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
6978      -        vdev_indirect_births_t *vib = vd->vdev_indirect_births;
     7075 +        vdev_t *rvd = spa->spa_root_vdev;
     7076 +        boolean_t dva_throttle_enabled = zio_dva_throttle_enabled;
     7077 +        metaslab_class_t *mcs[2] = {
     7078 +                spa_normal_class(spa),
     7079 +                spa_special_class(spa)
     7080 +        };
     7081 +        size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
6979 7082  
6980      -        if (vd->vdev_ops == &vdev_indirect_ops) {
6981      -                ASSERT(vim != NULL);
6982      -                ASSERT(vib != NULL);
6983      -        }
     7083 +        for (size_t i = 0; i < mcs_len; i++) {
     7084 +                metaslab_class_t *mc = mcs[i];
6984 7085  
6985      -        if (vdev_obsolete_sm_object(vd) != 0) {
6986      -                ASSERT(vd->vdev_obsolete_sm != NULL);
6987      -                ASSERT(vd->vdev_removing ||
6988      -                    vd->vdev_ops == &vdev_indirect_ops);
6989      -                ASSERT(vdev_indirect_mapping_num_entries(vim) > 0);
6990      -                ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0);
     7086 +                ASSERT0(refcount_count(&mc->mc_alloc_slots));
     7087 +                mc->mc_alloc_max_slots = queue_depth_total;
     7088 +                mc->mc_alloc_throttle_enabled = dva_throttle_enabled;
6991 7089  
6992      -                ASSERT3U(vdev_obsolete_sm_object(vd), ==,
6993      -                    space_map_object(vd->vdev_obsolete_sm));
6994      -                ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=,
6995      -                    space_map_allocated(vd->vdev_obsolete_sm));
     7090 +                ASSERT3U(mc->mc_alloc_max_slots, <=,
     7091 +                    max_queue_depth * rvd->vdev_children);
6996 7092          }
6997      -        ASSERT(vd->vdev_obsolete_segments != NULL);
     7093 +}
6998 7094  
6999      -        /*
7000      -         * Since frees / remaps to an indirect vdev can only
7001      -         * happen in syncing context, the obsolete segments
7002      -         * tree must be empty when we start syncing.
7003      -         */
7004      -        ASSERT0(range_tree_space(vd->vdev_obsolete_segments));
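           +/*
           + * Verify that the allocation queues of the normal and special
           + * metaslab classes are empty; called from spa_sync() when no
           + * throttled allocations should be outstanding.
           + */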
     7095 +static void
     7096 +spa_check_alloc_trees(spa_t *spa)
     7097 +{
     7098 +        metaslab_class_t *mcs[2] = {
     7099 +                spa_normal_class(spa),
     7100 +                spa_special_class(spa)
     7101 +        };
     7102 +        size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
     7103 +
     7104 +        for (size_t i = 0; i < mcs_len; i++) {
     7105 +                metaslab_class_t *mc = mcs[i];
     7106 +
     7107 +                mutex_enter(&mc->mc_alloc_lock);
     7108 +                VERIFY0(avl_numnodes(&mc->mc_alloc_tree));
     7109 +                mutex_exit(&mc->mc_alloc_lock);
     7110 +        }
7005 7111  }
7006 7112  
7007 7113  /*
7008 7114   * Sync the specified transaction group.  New blocks may be dirtied as
7009 7115   * part of the process, so we iterate until it converges.
7010 7116   */
7011 7117  void
7012 7118  spa_sync(spa_t *spa, uint64_t txg)
7013 7119  {
7014 7120          dsl_pool_t *dp = spa->spa_dsl_pool;
7015 7121          objset_t *mos = spa->spa_meta_objset;
7016 7122          bplist_t *free_bpl = &spa->spa_free_bplist[txg & TXG_MASK];
7017 7123          vdev_t *rvd = spa->spa_root_vdev;
7018 7124          vdev_t *vd;
7019 7125          dmu_tx_t *tx;
7020 7126          int error;
7021 7127          uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
7022 7128              zfs_vdev_queue_depth_pct / 100;
7023 7129  
7024 7130          VERIFY(spa_writeable(spa));
7025 7131  
7026 7132          /*
7027      -         * Wait for i/os issued in open context that need to complete
7028      -         * before this txg syncs.
7029      -         */
7030      -        VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK]));
7031      -        spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0);
7032      -
7033      -        /*
7034 7133           * Lock out configuration changes.
7035 7134           */
7036 7135          spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7037 7136  
7038 7137          spa->spa_syncing_txg = txg;
7039 7138          spa->spa_sync_pass = 0;
7040 7139  
7041      -        mutex_enter(&spa->spa_alloc_lock);
7042      -        VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7043      -        mutex_exit(&spa->spa_alloc_lock);
     7140 +        spa_check_alloc_trees(spa);
7044 7141  
7045 7142          /*
     7143 +         * Another pool management task might currently be prevented
     7144 +         * from starting, and this txg sync may have been invoked on its
     7145 +         * behalf, so be prepared to postpone autotrim processing.
     7146 +         */
     7147 +        if (mutex_tryenter(&spa->spa_auto_trim_lock)) {
     7148 +                if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
     7149 +                        spa_auto_trim(spa, txg);
     7150 +                mutex_exit(&spa->spa_auto_trim_lock);
     7151 +        }
     7152 +
     7153 +        /*
7046 7154           * If there are any pending vdev state changes, convert them
7047 7155           * into config changes that go out with this transaction group.
7048 7156           */
7049 7157          spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7050 7158          while (list_head(&spa->spa_state_dirty_list) != NULL) {
7051 7159                  /*
7052 7160                   * We need the write lock here because, for aux vdevs,
7053 7161                   * calling vdev_config_dirty() modifies sav_config.
7054 7162                   * This is ugly and will become unnecessary when we
7055 7163                   * eliminate the aux vdev wart by integrating all vdevs
7056 7164                   * into the root vdev tree.
7057 7165                   */
7058 7166                  spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7059 7167                  spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_WRITER);
7060 7168                  while ((vd = list_head(&spa->spa_state_dirty_list)) != NULL) {
7061 7169                          vdev_state_clean(vd);
7062 7170                          vdev_config_dirty(vd);
7063 7171                  }
7064 7172                  spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7065 7173                  spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
7066 7174          }
7067 7175          spa_config_exit(spa, SCL_STATE, FTAG);
7068 7176  
7069 7177          tx = dmu_tx_create_assigned(dp, txg);
7070 7178  
7071 7179          spa->spa_sync_starttime = gethrtime();
7072 7180          VERIFY(cyclic_reprogram(spa->spa_deadman_cycid,
7073 7181              spa->spa_sync_starttime + spa->spa_deadman_synctime));
7074 7182  
7075 7183          /*
7076 7184           * If we are upgrading to SPA_VERSION_RAIDZ_DEFLATE this txg,
7077 7185           * set spa_deflate if we have no raid-z vdevs.
7078 7186           */
7079 7187          if (spa->spa_ubsync.ub_version < SPA_VERSION_RAIDZ_DEFLATE &&
7080 7188              spa->spa_uberblock.ub_version >= SPA_VERSION_RAIDZ_DEFLATE) {
7081 7189                  int i;
7082 7190  
7083 7191                  for (i = 0; i < rvd->vdev_children; i++) {
7084 7192                          vd = rvd->vdev_child[i];
7085 7193                          if (vd->vdev_deflate_ratio != SPA_MINBLOCKSIZE)
7086 7194                                  break;
7087 7195                  }
7088 7196                  if (i == rvd->vdev_children) {
7089 7197                          spa->spa_deflate = TRUE;
7090 7198                          VERIFY(0 == zap_add(spa->spa_meta_objset,
7091 7199                              DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
7092 7200                              sizeof (uint64_t), 1, &spa->spa_deflate, tx));
7093 7201                  }
7094 7202          }
7095 7203  
7096 7204          /*
7097 7205           * Set the top-level vdev's max queue depth. Evaluate each
7098 7206           * top-level's async write queue depth in case it changed.
7099 7207           * The max queue depth will not change in the middle of syncing
7100 7208           * out this txg.
7101 7209           */
7102 7210          uint64_t queue_depth_total = 0;
7103 7211          for (int c = 0; c < rvd->vdev_children; c++) {
7104 7212                  vdev_t *tvd = rvd->vdev_child[c];
7105 7213                  metaslab_group_t *mg = tvd->vdev_mg;
7106 7214  
7107 7215                  if (mg == NULL || mg->mg_class != spa_normal_class(spa) ||
7108 7216                      !metaslab_group_initialized(mg))
7109 7217                          continue;
7110 7218  
7111 7219                  /*
7112 7220                   * It is safe to do a lock-free check here because only async
7113 7221                   * allocations look at mg_max_alloc_queue_depth, and async
7114 7222                   * allocations all happen from spa_sync().
7115 7223                   */
7116 7224                  ASSERT0(refcount_count(&mg->mg_alloc_queue_depth));
7117 7225                  mg->mg_max_alloc_queue_depth = max_queue_depth;
7118 7226                  queue_depth_total += mg->mg_max_alloc_queue_depth;
7119 7227          }
7120      -        metaslab_class_t *mc = spa_normal_class(spa);
7121      -        ASSERT0(refcount_count(&mc->mc_alloc_slots));
7122      -        mc->mc_alloc_max_slots = queue_depth_total;
7123      -        mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
7124 7228  
7125      -        ASSERT3U(mc->mc_alloc_max_slots, <=,
7126      -            max_queue_depth * rvd->vdev_children);
     7229 +        spa_initialize_alloc_trees(spa, max_queue_depth,
     7230 +            queue_depth_total);
7127 7231  
7128      -        for (int c = 0; c < rvd->vdev_children; c++) {
7129      -                vdev_t *vd = rvd->vdev_child[c];
7130      -                vdev_indirect_state_sync_verify(vd);
7131      -
7132      -                if (vdev_indirect_should_condense(vd)) {
7133      -                        spa_condense_indirect_start_sync(vd, tx);
7134      -                        break;
7135      -                }
7136      -        }
7137      -
7138 7232          /*
7139 7233           * Iterate to convergence.
7140 7234           */
     7235 +
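           +        /*
           +         * Clear the created/dirty flags on every autosnap zone
           +         * before the sync passes run.
           +         */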
     7236 +        zfs_autosnap_t *autosnap = spa_get_autosnap(dp->dp_spa);
     7237 +        mutex_enter(&autosnap->autosnap_lock);
     7238 +
     7239 +        autosnap_zone_t *zone = list_head(&autosnap->autosnap_zones);
     7240 +        while (zone != NULL) {
     7241 +                zone->created = B_FALSE;
     7242 +                zone->dirty = B_FALSE;
     7243 +                zone = list_next(&autosnap->autosnap_zones, zone);
     7244 +        }
     7245 +
     7246 +        mutex_exit(&autosnap->autosnap_lock);
     7247 +
7141 7248          do {
7142 7249                  int pass = ++spa->spa_sync_pass;
7143 7250  
7144 7251                  spa_sync_config_object(spa, tx);
7145 7252                  spa_sync_aux_dev(spa, &spa->spa_spares, tx,
7146 7253                      ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
7147 7254                  spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
7148 7255                      ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
7149 7256                  spa_errlog_sync(spa, txg);
7150 7257                  dsl_pool_sync(dp, txg);
7151 7258  
7152 7259                  if (pass < zfs_sync_pass_deferred_free) {
7153 7260                          spa_sync_frees(spa, free_bpl, tx);
7154 7261                  } else {
7155 7262                          /*
7156 7263                           * We can not defer frees in pass 1, because
7157 7264                           * we sync the deferred frees later in pass 1.
7158 7265                           */
7159 7266                          ASSERT3U(pass, >, 1);
7160 7267                          bplist_iterate(free_bpl, bpobj_enqueue_cb,
7161 7268                              &spa->spa_deferred_bpobj, tx);
7162 7269                  }
7163 7270  
7164 7271                  ddt_sync(spa, txg);
7165 7272                  dsl_scan_sync(dp, tx);
7166 7273  
7167      -                if (spa->spa_vdev_removal != NULL)
7168      -                        svr_sync(spa, tx);
7169      -
7170      -                while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7171      -                    != NULL)
     7274 +                while (vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7172 7275                          vdev_sync(vd, txg);
7173 7276  
7174 7277                  if (pass == 1) {
7175 7278                          spa_sync_upgrades(spa, tx);
7176 7279                          ASSERT3U(txg, >=,
7177 7280                              spa->spa_uberblock.ub_rootbp.blk_birth);
7178 7281                          /*
7179 7282                           * Note: We need to check if the MOS is dirty
7180 7283                           * because we could have marked the MOS dirty
7181 7284                           * without updating the uberblock (e.g. if we
7182 7285                           * have sync tasks but no dirty user data).  We
7183 7286                           * need to check the uberblock's rootbp because
7184 7287                           * it is updated if we have synced out dirty
7185 7288                           * data (though in this case the MOS will most
7186 7289                           * likely also be dirty due to second order
7187 7290                           * effects, we don't want to rely on that here).
7188 7291                           */
7189 7292                          if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7190 7293                              !dmu_objset_is_dirty(mos, txg)) {
7191 7294                                  /*
7192 7295                                   * Nothing changed on the first pass,
7193 7296                                   * therefore this TXG is a no-op.  Avoid
7194 7297                                   * syncing deferred frees, so that we
7195 7298                                   * can keep this TXG as a no-op.
7196 7299                                   */
7197 7300                                  ASSERT(txg_list_empty(&dp->dp_dirty_datasets,
7198 7301                                      txg));
7199 7302                                  ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7200 7303                                  ASSERT(txg_list_empty(&dp->dp_sync_tasks, txg));
7201 7304                                  break;
7202 7305                          }
7203 7306                          spa_sync_deferred_frees(spa, tx);
7204 7307                  }
7205 7308  
7206 7309          } while (dmu_objset_is_dirty(mos, txg));
7207 7310  
7208 7311          if (!list_is_empty(&spa->spa_config_dirty_list)) {
7209 7312                  /*
7210 7313                   * Make sure that the number of ZAPs for all the vdevs matches
7211 7314                   * the number of ZAPs in the per-vdev ZAP list. This only gets
7212 7315                   * called if the config is dirty; otherwise there may be
7213 7316                   * outstanding AVZ operations that weren't completed in
7214 7317                   * spa_sync_config_object.
7215 7318                   */
7216 7319                  uint64_t all_vdev_zap_entry_count;
7217 7320                  ASSERT0(zap_count(spa->spa_meta_objset,
7218 7321                      spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
7219 7322                  ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
7220 7323                      all_vdev_zap_entry_count);
7221 7324          }
7222 7325  
7223      -        if (spa->spa_vdev_removal != NULL) {
7224      -                ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]);
7225      -        }
7226      -
7227 7326          /*
7228 7327           * Rewrite the vdev configuration (which includes the uberblock)
7229 7328           * to commit the transaction group.
7230 7329           *
7231 7330           * If there are no dirty vdevs, we sync the uberblock to a few
7232 7331           * random top-level vdevs that are known to be visible in the
7233 7332           * config cache (see spa_vdev_add() for a complete description).
7234 7333           * If there *are* dirty vdevs, sync the uberblock to all vdevs.
7235 7334           */
7236 7335          for (;;) {
7237 7336                  /*
7238 7337                   * We hold SCL_STATE to prevent vdev open/close/etc.
7239 7338                   * while we're attempting to write the vdev labels.
7240 7339                   */
7241 7340                  spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7242 7341  
7243 7342                  if (list_is_empty(&spa->spa_config_dirty_list)) {
7244      -                        vdev_t *svd[SPA_SYNC_MIN_VDEVS];
     7343 +                        vdev_t *svd[SPA_DVAS_PER_BP];
7245 7344                          int svdcount = 0;
7246 7345                          int children = rvd->vdev_children;
7247 7346                          int c0 = spa_get_random(children);
7248 7347  
7249 7348                          for (int c = 0; c < children; c++) {
7250 7349                                  vd = rvd->vdev_child[(c0 + c) % children];
7251      -                                if (vd->vdev_ms_array == 0 || vd->vdev_islog ||
7252      -                                    !vdev_is_concrete(vd))
     7350 +                                if (vd->vdev_ms_array == 0 || vd->vdev_islog)
7253 7351                                          continue;
7254 7352                                  svd[svdcount++] = vd;
7255      -                                if (svdcount == SPA_SYNC_MIN_VDEVS)
     7353 +                                if (svdcount == SPA_DVAS_PER_BP)
7256 7354                                          break;
7257 7355                          }
7258 7356                          error = vdev_config_sync(svd, svdcount, txg);
7259 7357                  } else {
7260 7358                          error = vdev_config_sync(rvd->vdev_child,
7261 7359                              rvd->vdev_children, txg);
7262 7360                  }
7263 7361  
7264 7362                  if (error == 0)
7265 7363                          spa->spa_last_synced_guid = rvd->vdev_guid;
7266 7364  
7267 7365                  spa_config_exit(spa, SCL_STATE, FTAG);
7268 7366  
7269 7367                  if (error == 0)
7270 7368                          break;
7271 7369                  zio_suspend(spa, NULL);
7272 7370                  zio_resume_wait(spa);
7273 7371          }
7274 7372          dmu_tx_commit(tx);
7275 7373  
7276 7374          VERIFY(cyclic_reprogram(spa->spa_deadman_cycid, CY_INFINITY));
7277 7375  
7278 7376          /*
7279 7377           * Clear the dirty config list.
7280 7378           */
7281 7379          while ((vd = list_head(&spa->spa_config_dirty_list)) != NULL)
7282 7380                  vdev_config_clean(vd);
7283 7381  
7284 7382          /*
7285 7383           * Now that the new config has synced transactionally,
7286 7384           * let it become visible to the config cache.
7287 7385           */
7288 7386          if (spa->spa_config_syncing != NULL) {
7289 7387                  spa_config_set(spa, spa->spa_config_syncing);
7290 7388                  spa->spa_config_txg = txg;
7291 7389                  spa->spa_config_syncing = NULL;
7292 7390          }
7293 7391  
7294 7392          dsl_pool_sync_done(dp, txg);
7295 7393  
7296      -        mutex_enter(&spa->spa_alloc_lock);
7297      -        VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7298      -        mutex_exit(&spa->spa_alloc_lock);
     7394 +        spa_check_alloc_trees(spa);
7299 7395  
7300 7396          /*
7301 7397           * Update usable space statistics.
7302 7398           */
7303 7399          while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
7304 7400                  vdev_sync_done(vd, txg);
7305 7401  
7306 7402          spa_update_dspace(spa);
7307      -
     7403 +        spa_update_latency(spa);
7308 7404          /*
7309 7405           * It had better be the case that we didn't dirty anything
7310 7406           * since vdev_config_sync().
7311 7407           */
7312 7408          ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7313 7409          ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7314 7410          ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
7315 7411  
7316 7412          spa->spa_sync_pass = 0;
7317 7413  
     7414 +        spa_check_special(spa);
     7415 +
7318 7416          /*
7319 7417           * Update the last synced uberblock here. We want to do this at
7320 7418           * the end of spa_sync() so that consumers of spa_last_synced_txg()
7321 7419           * will be guaranteed that all the processing associated with
7322 7420           * that txg has been completed.
7323 7421           */
7324 7422          spa->spa_ubsync = spa->spa_uberblock;
7325 7423          spa_config_exit(spa, SCL_CONFIG, FTAG);
7326 7424  
7327 7425          spa_handle_ignored_writes(spa);
7328 7426  
7329 7427          /*
7330 7428           * If any async tasks have been requested, kick them off.
7331 7429           */
7332 7430          spa_async_dispatch(spa);
7333 7431  }
7334 7432  
7335 7433  /*
7336 7434   * Sync all pools.  We don't want to hold the namespace lock across these
7337 7435   * operations, so we take a reference on the spa_t and drop the lock during the
7338 7436   * sync.
7339 7437   */
7340 7438  void
7341 7439  spa_sync_allpools(void)
7342 7440  {
7343 7441          spa_t *spa = NULL;
7344 7442          mutex_enter(&spa_namespace_lock);
7345 7443          while ((spa = spa_next(spa)) != NULL) {
7346 7444                  if (spa_state(spa) != POOL_STATE_ACTIVE ||
7347 7445                      !spa_writeable(spa) || spa_suspended(spa))
7348 7446                          continue;
7349 7447                  spa_open_ref(spa, FTAG);
7350 7448                  mutex_exit(&spa_namespace_lock);
7351 7449                  txg_wait_synced(spa_get_dsl(spa), 0);
7352 7450                  mutex_enter(&spa_namespace_lock);
7353 7451                  spa_close(spa, FTAG);
7354 7452          }
7355 7453          mutex_exit(&spa_namespace_lock);
7356 7454  }
7357 7455  
7358 7456  /*
7359 7457   * ==========================================================================
7360 7458   * Miscellaneous routines
7361 7459   * ==========================================================================
7362 7460   */
7363 7461  
7364 7462  /*
7365 7463   * Remove all pools in the system.
7366 7464   */
7367 7465  void
7368 7466  spa_evict_all(void)
7369 7467  {
7370 7468          spa_t *spa;
7371 7469  
7372 7470          /*
7373 7471           * Remove all cached state.  All pools should be closed now,
7374 7472           * so every spa in the AVL tree should be unreferenced.
7375 7473           */
7376 7474          mutex_enter(&spa_namespace_lock);
7377 7475          while ((spa = spa_next(NULL)) != NULL) {
7378 7476                  /*
7379 7477                   * Stop async tasks.  The async thread may need to detach
7380 7478                   * a device that's been replaced, which requires grabbing
7381 7479                   * spa_namespace_lock, so we must drop it here.
7382 7480                   */
7383 7481                  spa_open_ref(spa, FTAG);
7384 7482                  mutex_exit(&spa_namespace_lock);
7385 7483                  spa_async_suspend(spa);
7386 7484                  mutex_enter(&spa_namespace_lock);
7387 7485                  spa_close(spa, FTAG);
7388 7486  
7389 7487                  if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
     7488 +                        wbc_deactivate(spa);
     7489 +
7390 7490                          spa_unload(spa);
7391 7491                          spa_deactivate(spa);
7392 7492                  }
     7493 +
7393 7494                  spa_remove(spa);
7394 7495          }
7395 7496          mutex_exit(&spa_namespace_lock);
7396 7497  }
7397 7498  
7398 7499  vdev_t *
7399 7500  spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
7400 7501  {
7401 7502          vdev_t *vd;
7402 7503          int i;
7403 7504  
7404 7505          if ((vd = vdev_lookup_by_guid(spa->spa_root_vdev, guid)) != NULL)
7405 7506                  return (vd);
7406 7507  
7407 7508          if (aux) {
7408 7509                  for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
7409 7510                          vd = spa->spa_l2cache.sav_vdevs[i];
7410 7511                          if (vd->vdev_guid == guid)
7411 7512                                  return (vd);
7412 7513                  }
7413 7514  
7414 7515                  for (i = 0; i < spa->spa_spares.sav_count; i++) {
7415 7516                          vd = spa->spa_spares.sav_vdevs[i];
7416 7517                          if (vd->vdev_guid == guid)
7417 7518                                  return (vd);
7418 7519                  }
7419 7520          }
7420 7521  
7421 7522          return (NULL);
7422 7523  }
7423 7524  
7424 7525  void
7425 7526  spa_upgrade(spa_t *spa, uint64_t version)
7426 7527  {
7427 7528          ASSERT(spa_writeable(spa));
7428 7529  
7429 7530          spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
7430 7531  
7431 7532          /*
7432 7533           * This should only be called for a non-faulted pool, and since a
7433 7534           * future version would result in an unopenable pool, this shouldn't be
7434 7535           * possible.
7435 7536           */
7436 7537          ASSERT(SPA_VERSION_IS_SUPPORTED(spa->spa_uberblock.ub_version));
7437 7538          ASSERT3U(version, >=, spa->spa_uberblock.ub_version);
7438 7539  
7439 7540          spa->spa_uberblock.ub_version = version;
7440 7541          vdev_config_dirty(spa->spa_root_vdev);
7441 7542  
7442 7543          spa_config_exit(spa, SCL_ALL, FTAG);
7443 7544  
7444 7545          txg_wait_synced(spa_get_dsl(spa), 0);
7445 7546  }
7446 7547  
7447 7548  boolean_t
7448 7549  spa_has_spare(spa_t *spa, uint64_t guid)
7449 7550  {
7450 7551          int i;
7451 7552          uint64_t spareguid;
7452 7553          spa_aux_vdev_t *sav = &spa->spa_spares;
7453 7554  
7454 7555          for (i = 0; i < sav->sav_count; i++)
7455 7556                  if (sav->sav_vdevs[i]->vdev_guid == guid)
7456 7557                          return (B_TRUE);
7457 7558  
7458 7559          for (i = 0; i < sav->sav_npending; i++) {
7459 7560                  if (nvlist_lookup_uint64(sav->sav_pending[i], ZPOOL_CONFIG_GUID,
7460 7561                      &spareguid) == 0 && spareguid == guid)
7461 7562                          return (B_TRUE);
7462 7563          }
7463 7564  
7464 7565          return (B_FALSE);
7465 7566  }
7466 7567  
7467 7568  /*
7468 7569   * Check if a pool has an active shared spare device.
7469 7570   * Note: reference count of an active spare is 2, as a spare and as a replace
7470 7571   */
7471 7572  static boolean_t
7472 7573  spa_has_active_shared_spare(spa_t *spa)
7473 7574  {
7474 7575          int i, refcnt;
7475 7576          uint64_t pool;
7476 7577          spa_aux_vdev_t *sav = &spa->spa_spares;
7477 7578  
7478 7579          for (i = 0; i < sav->sav_count; i++) {
7479 7580                  if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
7480 7581                      &refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
7481 7582                      refcnt > 2)
7482 7583                          return (B_TRUE);
7483 7584          }
7484 7585  
7485 7586          return (B_FALSE);
7486 7587  }
7487 7588  
7488      -sysevent_t *
     7589 +/*
     7590 + * Post a sysevent corresponding to the given event.  The 'name' must be one of
     7591 + * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
     7592 + * filled in from the spa and (optionally) the vdev.  This doesn't do anything
     7593 + * in the userland libzpool, as we don't want consumers to misinterpret ztest
     7594 + * or zdb as real changes.
     7595 + */
     7596 +static sysevent_t *
7489 7597  spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7490 7598  {
7491 7599          sysevent_t              *ev = NULL;
7492 7600  #ifdef _KERNEL
7493 7601          sysevent_attr_list_t    *attr = NULL;
7494 7602          sysevent_value_t        value;
7495 7603  
7496 7604          ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
7497 7605              SE_SLEEP);
7498 7606          ASSERT(ev != NULL);
7499 7607  
7500 7608          value.value_type = SE_DATA_TYPE_STRING;
7501 7609          value.value.sv_string = spa_name(spa);
7502 7610          if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
7503 7611                  goto done;
7504 7612  
7505 7613          value.value_type = SE_DATA_TYPE_UINT64;
7506 7614          value.value.sv_uint64 = spa_guid(spa);
7507 7615          if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
7508 7616                  goto done;
7509 7617  
7510      -        if (vd) {
     7618 +        if (vd != NULL) {
7511 7619                  value.value_type = SE_DATA_TYPE_UINT64;
7512 7620                  value.value.sv_uint64 = vd->vdev_guid;
7513 7621                  if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
7514 7622                      SE_SLEEP) != 0)
7515 7623                          goto done;
7516 7624  
7517 7625                  if (vd->vdev_path) {
7518 7626                          value.value_type = SE_DATA_TYPE_STRING;
7519 7627                          value.value.sv_string = vd->vdev_path;
7520 7628                          if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
7521 7629                              &value, SE_SLEEP) != 0)
7522 7630                                  goto done;
7523 7631                  }
7524 7632          }
7525 7633  
7526 7634          if (hist_nvl != NULL) {
7527 7635                  fnvlist_merge((nvlist_t *)attr, hist_nvl);
7528 7636          }
7529 7637  
7530 7638          if (sysevent_attach_attributes(ev, attr) != 0)
7531 7639                  goto done;
7532 7640          attr = NULL;
7533 7641  
7534 7642  done:
7535 7643          if (attr)
7536 7644                  sysevent_free_attr(attr);
7537 7645  
7538 7646  #endif
7539 7647          return (ev);
7540 7648  }
7541 7649  
7542      -void
7543      -spa_event_post(sysevent_t *ev)
     7650 +static void
     7651 +spa_event_post(void *arg)
7544 7652  {
7545 7653  #ifdef _KERNEL
     7654 +        sysevent_t *ev = (sysevent_t *)arg;
     7655 +
7546 7656          sysevent_id_t           eid;
7547 7657  
7548 7658          (void) log_sysevent(ev, SE_SLEEP, &eid);
7549 7659          sysevent_free(ev);
7550 7660  #endif
7551 7661  }
7552 7662  
     7663 +/*
     7664 + * Dispatch event notifications to the taskq such that the corresponding
     7665 + * sysevents are queued with no spa locks held.
     7666 + */
     7667 +taskq_t *spa_sysevent_taskq;
     7668 +
     7669 +static void
     7670 +spa_event_notify_impl(sysevent_t *ev)
     7671 +{
     7672 +        if (taskq_dispatch(spa_sysevent_taskq, spa_event_post,
     7673 +            ev, TQ_NOSLEEP) == NULL) {
     7674 +                /*
     7675 +                 * These are management sysevents; as much as it is
     7676 +                 * unpleasant to drop these due to syseventd not being able
     7677 +                 * to keep up, perhaps due to resource shortages, we are not
     7678 +                 * going to sleep here and risk locking up the pool sync
     7679 +                 * process; notify admin of problems
     7680 +                 */
     7681 +                cmn_err(CE_NOTE, "Could not dispatch sysevent notification "
     7682 +                    "for %s, please check state of syseventd\n",
     7683 +                    sysevent_get_subclass_name(ev));
     7684 +
     7685 +                sysevent_free(ev);
     7686 +
     7687 +                return;
     7688 +        }
     7689 +}
     7690 +
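/*
 * Illustrative sketch (not part of this diff): spa_sysevent_taskq must be
 * created before the first event is dispatched and drained/destroyed on
 * teardown.  The function names, taskq name and thread count below are
 * assumptions made for the example; the real setup belongs to the spa
 * init/fini path, which is not shown in this hunk.
 */
void
spa_sysevent_taskq_init_example(void)
{
        spa_sysevent_taskq = taskq_create("z_sysevent", 1, minclsyspri,
            1, INT_MAX, TASKQ_PREPOPULATE);
}

void
spa_sysevent_taskq_fini_example(void)
{
        taskq_wait(spa_sysevent_taskq);
        taskq_destroy(spa_sysevent_taskq);
        spa_sysevent_taskq = NULL;
}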
7553 7691  void
7554      -spa_event_discard(sysevent_t *ev)
     7692 +spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7555 7693  {
7556      -#ifdef _KERNEL
7557      -        sysevent_free(ev);
7558      -#endif
     7694 +        spa_event_notify_impl(spa_event_create(spa, vd, hist_nvl, name));
7559 7695  }
7560 7696  
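/*
 * Usage sketch (illustrative only): callers pass the spa, an optional vdev,
 * an optional nvlist of extra payload, and an EC_ZFS subclass from
 * sys/sysevent/eventdefs.h.  The wrapper function below is hypothetical.
 */
static void
spa_event_notify_example(spa_t *spa, vdev_t *vd)
{
        /* Pool-level event: no vdev and no extra payload. */
        spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);

        /* Vdev-level event: the vdev GUID and path are added to the payload. */
        spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
}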
7561 7697  /*
7562      - * Post a sysevent corresponding to the given event.  The 'name' must be one of
7563      - * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
7564      - * filled in from the spa and (optionally) the vdev and history nvl.  This
7565      - * doesn't do anything in the userland libzpool, as we don't want consumers to
7566      - * misinterpret ztest or zdb as real changes.
     7698 + * Dispatches auto-trim processing to all top-level vdevs. This is
     7699 + * called from spa_sync once every txg.
7567 7700   */
     7701 +static void
     7702 +spa_auto_trim(spa_t *spa, uint64_t txg)
     7703 +{
     7704 +        ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER) == SCL_CONFIG);
     7705 +        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
     7706 +        ASSERT(spa->spa_auto_trim_taskq != NULL);
     7707 +
     7708 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7709 +                vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
     7710 +                vti->vti_vdev = spa->spa_root_vdev->vdev_child[i];
     7711 +                vti->vti_txg = txg;
     7712 +                vti->vti_done_cb = (void (*)(void *))spa_vdev_auto_trim_done;
     7713 +                vti->vti_done_arg = spa;
     7714 +                (void) taskq_dispatch(spa->spa_auto_trim_taskq,
     7715 +                    (void (*)(void *))vdev_auto_trim, vti, TQ_SLEEP);
     7716 +                spa->spa_num_auto_trimming++;
     7717 +        }
     7718 +}
     7719 +
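/*
 * Caller sketch (assumption; the corresponding spa_sync() change is not part
 * of this hunk): spa_auto_trim() expects the config lock held as reader and
 * spa_auto_trim_lock held, and is meant to run once per txg while autotrim
 * is enabled (i.e. while spa_auto_trim_taskq exists).
 */
static void
spa_sync_auto_trim_example(spa_t *spa, uint64_t txg)
{
        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
        mutex_enter(&spa->spa_auto_trim_lock);
        if (spa->spa_auto_trim_taskq != NULL)
                spa_auto_trim(spa, txg);
        mutex_exit(&spa->spa_auto_trim_lock);
        spa_config_exit(spa, SCL_CONFIG, FTAG);
}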
     7720 +/*
     7721 + * Performs the sync update of the MOS pool directory's trim start/stop values.
     7722 + */
     7723 +static void
     7724 +spa_trim_update_time_sync(void *arg, dmu_tx_t *tx)
     7725 +{
     7726 +        spa_t *spa = arg;
     7727 +        VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     7728 +            DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
     7729 +            &spa->spa_man_trim_start_time, tx));
     7730 +        VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
     7731 +            DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
     7732 +            &spa->spa_man_trim_stop_time, tx));
     7733 +}
     7734 +
     7735 +/*
     7736 + * Updates the in-core and on-disk manual TRIM operation start/stop time.
     7737 + * Passing UINT64_MAX for either start_time or stop_time means that no
     7738 + * update to that value should be recorded.
     7739 + */
     7740 +static dmu_tx_t *
     7741 +spa_trim_update_time(spa_t *spa, uint64_t start_time, uint64_t stop_time)
     7742 +{
     7743 +        int err;
     7744 +        dmu_tx_t *tx;
     7745 +
     7746 +        ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
     7747 +        if (start_time != UINT64_MAX)
     7748 +                spa->spa_man_trim_start_time = start_time;
     7749 +        if (stop_time != UINT64_MAX)
     7750 +                spa->spa_man_trim_stop_time = stop_time;
     7751 +        tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
     7752 +        err = dmu_tx_assign(tx, TXG_WAIT);
     7753 +        if (err) {
     7754 +                dmu_tx_abort(tx);
     7755 +                return (NULL);
     7756 +        }
     7757 +        dsl_sync_task_nowait(spa_get_dsl(spa), spa_trim_update_time_sync,
     7758 +            spa, 1, ZFS_SPACE_CHECK_RESERVED, tx);
     7759 +
     7760 +        return (tx);
     7761 +}
     7762 +
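/*
 * Note on the contract above: the assigned tx is returned rather than
 * committed here so that callers can drop spa_man_trim_lock before calling
 * dmu_tx_commit(), avoiding a deadlock with syncing context (see the comment
 * in spa_man_trim() and the matching pattern in spa_vdev_man_trim_done()).
 */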
     7763 +/*
     7764 + * Initiates a manual TRIM of the whole pool. This kicks off individual
     7765 + * TRIM tasks for each top-level vdev, which then pass over all of the free
     7766 + * space in all of the vdev's metaslabs and issue TRIM commands for that
     7767 + * space to the underlying vdevs.
     7768 + */
     7769 +extern void
     7770 +spa_man_trim(spa_t *spa, uint64_t rate)
     7771 +{
     7772 +        dmu_tx_t *time_update_tx;
     7773 +
     7774 +        mutex_enter(&spa->spa_man_trim_lock);
     7775 +
     7776 +        if (rate != 0)
     7777 +                spa->spa_man_trim_rate = MAX(rate, spa_min_trim_rate(spa));
     7778 +        else
     7779 +                spa->spa_man_trim_rate = 0;
     7780 +
     7781 +        if (spa->spa_num_man_trimming) {
     7782 +                /*
     7783 +                 * TRIM is already ongoing. Wake up all sleeping vdev trim
     7784 +                 * threads because the trim rate might have changed above.
     7785 +                 */
     7786 +                cv_broadcast(&spa->spa_man_trim_update_cv);
     7787 +                mutex_exit(&spa->spa_man_trim_lock);
     7788 +                return;
     7789 +        }
     7790 +        spa_man_trim_taskq_create(spa);
     7791 +        spa->spa_man_trim_stop = B_FALSE;
     7792 +
     7793 +        spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);
     7794 +        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
     7795 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7796 +                vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
     7797 +                vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
     7798 +                vti->vti_vdev = vd;
     7799 +                vti->vti_done_cb = (void (*)(void *))spa_vdev_man_trim_done;
     7800 +                vti->vti_done_arg = spa;
     7801 +                spa->spa_num_man_trimming++;
     7802 +
     7803 +                vd->vdev_trim_prog = 0;
     7804 +                (void) taskq_dispatch(spa->spa_man_trim_taskq,
     7805 +                    (void (*)(void *))vdev_man_trim, vti, TQ_SLEEP);
     7806 +        }
     7807 +        spa_config_exit(spa, SCL_CONFIG, FTAG);
     7808 +        time_update_tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
     7809 +        mutex_exit(&spa->spa_man_trim_lock);
     7810 +        /* mustn't hold spa_man_trim_lock to prevent deadlock w/ syncing ctx */
     7811 +        if (time_update_tx != NULL)
     7812 +                dmu_tx_commit(time_update_tx);
     7813 +}
     7814 +
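/*
 * Caller sketch (assumption; the ioctl plumbing is not part of this hunk):
 * a "start manual trim" request would look up the pool, kick off the trim at
 * the requested rate (0 = as fast as possible) and return immediately;
 * progress is then polled via spa_get_trim_prog().
 */
static int
zfs_trim_request_example(const char *poolname, uint64_t rate)
{
        spa_t *spa;
        int error;

        if ((error = spa_open(poolname, &spa, FTAG)) != 0)
                return (error);
        spa_man_trim(spa, rate);
        spa_close(spa, FTAG);
        return (0);
}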
     7815 +/*
     7816 + * Orders a manual TRIM operation to stop and returns immediately.
     7817 + */
     7818 +extern void
     7819 +spa_man_trim_stop(spa_t *spa)
     7820 +{
     7821 +        boolean_t held = MUTEX_HELD(&spa->spa_man_trim_lock);
     7822 +        if (!held)
     7823 +                mutex_enter(&spa->spa_man_trim_lock);
     7824 +        spa->spa_man_trim_stop = B_TRUE;
     7825 +        cv_broadcast(&spa->spa_man_trim_update_cv);
     7826 +        if (!held)
     7827 +                mutex_exit(&spa->spa_man_trim_lock);
     7828 +}
     7829 +
     7830 +/*
     7831 + * Orders a manual TRIM operation to stop and waits for both manual and
     7832 + * automatic TRIM to complete. By holding both the spa_man_trim_lock and
     7833 + * the spa_auto_trim_lock, the caller can guarantee that after this
     7834 + * function returns, no new TRIM operations can be initiated in parallel.
     7835 + */
7568 7836  void
7569      -spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
     7837 +spa_trim_stop_wait(spa_t *spa)
7570 7838  {
7571      -        spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
     7839 +        ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
     7840 +        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
     7841 +        spa->spa_man_trim_stop = B_TRUE;
     7842 +        cv_broadcast(&spa->spa_man_trim_update_cv);
     7843 +        while (spa->spa_num_man_trimming > 0)
     7844 +                cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
     7845 +        while (spa->spa_num_auto_trimming > 0)
     7846 +                cv_wait(&spa->spa_auto_trim_done_cv, &spa->spa_auto_trim_lock);
     7847 +}
     7848 +
     7849 +/*
     7850 + * Returns manual TRIM progress. Progress is indicated by four return values:
     7851 + * 1) prog: the number of bytes of space on the pool in total that manual
     7852 + *      TRIM has already passed over (regardless of whether the space is allocated).
     7853 + *      Completion of the operation is indicated when either the returned value
     7854 + *      is zero, or when the returned value is equal to the sum of the sizes of
     7855 + *      all top-level vdevs.
     7856 + * 2) rate: the trim rate in bytes per second. A value of zero indicates that
     7857 + *      trim progresses as fast as possible.
     7858 + * 3) start_time: the UNIXTIME of when the last manual TRIM operation was
     7859 + *      started. If no manual trim was ever initiated on the pool, this is
     7860 + *      zero.
     7861 + * 4) stop_time: the UNIXTIME of when the last manual TRIM operation has
     7862 + *      stopped on the pool. If a trim was started (start_time != 0), but has
     7863 + *      not yet completed, stop_time will be zero. If a trim is NOT currently
     7864 + *      ongoing and start_time is non-zero, this indicates that the previously
     7865 + *      initiated TRIM operation was interrupted.
     7866 + */
     7867 +extern void
     7868 +spa_get_trim_prog(spa_t *spa, uint64_t *prog, uint64_t *rate,
     7869 +    uint64_t *start_time, uint64_t *stop_time)
     7870 +{
     7871 +        uint64_t total = 0;
     7872 +        vdev_t *root_vd = spa->spa_root_vdev;
     7873 +
     7874 +        ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
     7875 +        mutex_enter(&spa->spa_man_trim_lock);
     7876 +        if (spa->spa_num_man_trimming > 0) {
     7877 +                for (uint64_t i = 0; i < root_vd->vdev_children; i++) {
     7878 +                        total += root_vd->vdev_child[i]->vdev_trim_prog;
     7879 +                }
     7880 +        }
     7881 +        *prog = total;
     7882 +        *rate = spa->spa_man_trim_rate;
     7883 +        *start_time = spa->spa_man_trim_start_time;
     7884 +        *stop_time = spa->spa_man_trim_stop_time;
     7885 +        mutex_exit(&spa->spa_man_trim_lock);
     7886 +}
     7887 +
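/*
 * Consumer sketch (illustrative only): percent complete can be derived by
 * comparing 'prog' against the summed size of the top-level vdevs, per the
 * comment above.  Using vdev_asize as the per-vdev size is an assumption
 * made for this example.
 */
static uint64_t
spa_trim_pct_example(spa_t *spa)
{
        uint64_t prog, rate, start, stop, total = 0;

        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
        spa_get_trim_prog(spa, &prog, &rate, &start, &stop);
        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++)
                total += spa->spa_root_vdev->vdev_child[i]->vdev_asize;
        spa_config_exit(spa, SCL_CONFIG, FTAG);

        return (total == 0 ? 0 : (prog * 100) / total);
}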
     7888 +/*
     7889 + * Callback when a vdev_man_trim has finished on a single top-level vdev.
     7890 + */
     7891 +static void
     7892 +spa_vdev_man_trim_done(spa_t *spa)
     7893 +{
     7894 +        dmu_tx_t *time_update_tx = NULL;
     7895 +
     7896 +        mutex_enter(&spa->spa_man_trim_lock);
     7897 +        ASSERT(spa->spa_num_man_trimming > 0);
     7898 +        spa->spa_num_man_trimming--;
     7899 +        if (spa->spa_num_man_trimming == 0) {
     7900 +                /* if we were interrupted, leave stop_time at zero */
     7901 +                if (!spa->spa_man_trim_stop)
     7902 +                        time_update_tx = spa_trim_update_time(spa, UINT64_MAX,
     7903 +                            gethrestime_sec());
     7904 +                spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_FINISH);
     7905 +                spa_async_request(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);
     7906 +                cv_broadcast(&spa->spa_man_trim_done_cv);
     7907 +        }
     7908 +        mutex_exit(&spa->spa_man_trim_lock);
     7909 +
     7910 +        if (time_update_tx != NULL)
     7911 +                dmu_tx_commit(time_update_tx);
     7912 +}
     7913 +
     7914 +/*
     7915 + * Called from vdev_auto_trim when a vdev has completed its auto-trim
     7916 + * processing.
     7917 + */
     7918 +static void
     7919 +spa_vdev_auto_trim_done(spa_t *spa)
     7920 +{
     7921 +        mutex_enter(&spa->spa_auto_trim_lock);
     7922 +        ASSERT(spa->spa_num_auto_trimming > 0);
     7923 +        spa->spa_num_auto_trimming--;
     7924 +        if (spa->spa_num_auto_trimming == 0)
     7925 +                cv_broadcast(&spa->spa_auto_trim_done_cv);
     7926 +        mutex_exit(&spa->spa_auto_trim_lock);
     7927 +}
     7928 +
     7929 +/*
     7930 + * Determines the minimum sensible rate at which a manual TRIM can be
     7931 + * performed on a given spa and returns it. Since we perform TRIM in
     7932 + * metaslab-sized increments, we'll just let the longest step between
     7933 + * metaslab TRIMs be 100s (random number, really). Thus, on a typical
     7934 + * 200-metaslab vdev, the longest a TRIM should take is about 5.5 hours.
     7935 + * It *can* take longer if the device is really slow to respond to
     7936 + * zio_trim() commands or it contains more than 200 metaslabs, or
     7937 + * metaslab sizes vary widely between top-level vdevs.
     7938 + */
     7939 +static uint64_t
     7940 +spa_min_trim_rate(spa_t *spa)
     7941 +{
     7942 +        uint64_t smallest_ms_sz = UINT64_MAX;
     7943 +
     7944 +        /* find the smallest metaslab */
     7945 +        spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
     7946 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     7947 +                smallest_ms_sz = MIN(smallest_ms_sz,
     7948 +                    spa->spa_root_vdev->vdev_child[i]->vdev_ms[0]->ms_size);
     7949 +        }
     7950 +        spa_config_exit(spa, SCL_CONFIG, FTAG);
     7951 +        VERIFY(smallest_ms_sz != 0);
     7952 +
     7953 +        /* minimum TRIM rate is 1/100th of the smallest metaslab size */
     7954 +        return (smallest_ms_sz / 100);
7572 7955  }
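/*
 * Worked example of the bound above: with roughly 200 metaslabs per
 * top-level vdev and a 100s ceiling per metaslab, a full pass takes at most
 * about 200 * 100s = 20,000s, i.e. ~5.5 hours.  For a 2 GiB smallest
 * metaslab, the floor returned here is 2 GiB / 100 (about 21 MB/s), which
 * is exactly the rate that keeps each metaslab under the 100s step.
 */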
    