9700 ZFS resilvered mirror does not balance reads
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
NEX-17931 Getting panic: vfs_mountroot: cannot mount root after split mirror syspool
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9989 Changing volume names can result in double imports and data corruption
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-6855 System fails to boot up after a large number of datasets created
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-8711 backport illumos 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-7550 zpool remove mirrored slog or special vdev causes system panic due to a NULL pointer dereference in "zfs" module
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6884 KRRP: replication deadlock due to unavailable resources
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6000 zpool destroy/export with autotrim=on panics due to lock assertion
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5702 Special vdev cannot be removed if it was used as slog
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5637 enablespecial property should be disabled after special vdev removal
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5068 In-progress scrub can drastically increase zpool import times
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-5219 WBC: Add capability to delay migration
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5078 Want ability to see progress of freeing data and how much is left to free after large file delete patch
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5019 wrcache activation races vs. 'zpool create -O wrc_mode='
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4934 Add capability to remove special vdev
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4830 writecache=off leaks data on special vdev (the data will never migrate)
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4876 On-demand TRIM shouldn't use system_taskq and should queue jobs
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4679 Autotrim taskq doesn't get destroyed on pool export
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
NEX-4567 KRRP: L2L replication inside of one pool causes ARC-deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
6529 Properly handle updates of variably-sized SA entries.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6175 sdev can create bogus zvol directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6174 /dev/zvol does not show pool directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
6046 SPARC boot should support com.delphix:hole_birth
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6041 SPARC boot should support LZ4
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6044 SPARC zfs reader is using wrong size for objset_phys
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
backout 5997: breaks "zpool add"
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
5770 Add load_nvlist() error handling
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Revert "NEX-4476 WRC: Allow to use write back cache per tree of datasets"
This reverts commit fe97b74444278a6f36fec93179133641296312da.
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-4077 taskq_dispatch in on-demand TRIM can sometimes fail
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Revert "NEX-3965 System may panic on the importing of pool with WRC"
This reverts commit 45bc50222913cddafde94621d28b78d6efaea897.
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3817 'zpool add' of special devices causes system panic
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3213 need to load vdev props for all vdev including spares and l2arc vdevs
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-2112 `zdb -e <pool>` assertion failed for thread 0xfffffd7fff172a40
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-1228 Panic importing pool with active unsupported features
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Harold Shaw <harold.shaw@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-140 Duplicate entries in mantools and doctools manifests
NEX-1078 Replaced ASSERT with if-statement
NEX-521 Single threaded rpcbind is not scalable
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
NEX-1088 partially rolled back 641841bb
to fix regression that caused assert in read-only import.
OS-115 Heap leaks related to OS-114 and SUP-577
SUP-577 deadlock between zpool detach and syseventd
OS-103 handle CoS descriptor persistent references across vdev operations
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Make special vdev subtree topology the same as regular vdev subtree to simplify testcase setup
Fixup merge issues
Fix default properties' values after export/import
zfsxx issue #11: support for spare device groups
Issue #34: Add feature flag for the compound checksum - sha1crc32
           Contributors: Boris Protopopov
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
Issue #27: Auto best-effort dedup enable/disable - settable per pool
Issue #7: Reconcile L2ARC and "special" use by datasets
Issue #9: Support for persistent CoS/vdev attributes with feature flags
          Support for feature flags for special tier
          Contributors: Daniil Lunev, Boris Protopopov
Issue #2: optimize DDE lookup in DDT objects
Added an option to control the number of classes of DDEs in the DDT.
The new default is one, i.e. all DDEs are stored together
regardless of refcount.
Issue #3: Add support for parametrized number of copies for DDTs
Issue #25: Add a pool-level property that controls the number of copies of DDTs in the pool.
Fixup merge results
re #13850 Refactor ZFS config discovery IOCs to libzfs_core patterns
re 13748 added zpool export -c option
The zpool export -c command exports the specified pool while keeping its latest
configuration in the cache file for a subsequent zpool import -c.
re #13333 rb4362 - eliminated spa_update_iotime() to fix the stats
re #12684 rb4206 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint -courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties


   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 
  22 /*
  23  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24  * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
  25  * Copyright (c) 2015, Nexenta Systems, Inc.  All rights reserved.
  26  * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.

  27  * Copyright 2013 Saso Kiselkov. All rights reserved.
  28  * Copyright (c) 2014 Integros [integros.com]
  29  * Copyright 2016 Toomas Soome <tsoome@me.com>
  30  * Copyright 2017 Joyent, Inc.
  31  * Copyright (c) 2017 Datto Inc.
  32  * Copyright 2018 OmniOS Community Edition (OmniOSce) Association.
  33  */
  34 
  35 /*
  36  * SPA: Storage Pool Allocator
  37  *
  38  * This file contains all the routines used when modifying on-disk SPA state.
  39  * This includes opening, importing, destroying, exporting a pool, and syncing a
  40  * pool.
  41  */
  42 
  43 #include <sys/zfs_context.h>
  44 #include <sys/fm/fs/zfs.h>
  45 #include <sys/spa_impl.h>
  46 #include <sys/zio.h>
  47 #include <sys/zio_checksum.h>
  48 #include <sys/dmu.h>
  49 #include <sys/dmu_tx.h>
  50 #include <sys/zap.h>
  51 #include <sys/zil.h>
  52 #include <sys/ddt.h>
  53 #include <sys/vdev_impl.h>
  54 #include <sys/vdev_removal.h>
  55 #include <sys/vdev_indirect_mapping.h>
  56 #include <sys/vdev_indirect_births.h>
  57 #include <sys/metaslab.h>
  58 #include <sys/metaslab_impl.h>
  59 #include <sys/uberblock_impl.h>
  60 #include <sys/txg.h>
  61 #include <sys/avl.h>
  62 #include <sys/bpobj.h>
  63 #include <sys/dmu_traverse.h>
  64 #include <sys/dmu_objset.h>
  65 #include <sys/unique.h>
  66 #include <sys/dsl_pool.h>
  67 #include <sys/dsl_dataset.h>
  68 #include <sys/dsl_dir.h>
  69 #include <sys/dsl_prop.h>
  70 #include <sys/dsl_synctask.h>
  71 #include <sys/fs/zfs.h>
  72 #include <sys/arc.h>
  73 #include <sys/callb.h>
  74 #include <sys/systeminfo.h>
  75 #include <sys/spa_boot.h>
  76 #include <sys/zfs_ioctl.h>
  77 #include <sys/dsl_scan.h>
  78 #include <sys/zfeature.h>
  79 #include <sys/dsl_destroy.h>
  80 #include <sys/abd.h>
  81 
  82 #ifdef  _KERNEL
  83 #include <sys/bootprops.h>
  84 #include <sys/callb.h>
  85 #include <sys/cpupart.h>
  86 #include <sys/pool.h>
  87 #include <sys/sysdc.h>
  88 #include <sys/zone.h>
  89 #endif  /* _KERNEL */
  90 
  91 #include "zfs_prop.h"
  92 #include "zfs_comutil.h"
  93 
  94 /*
  95  * The interval, in seconds, at which failed configuration cache file writes
  96  * should be retried.
  97  */
  98 int zfs_ccw_retry_interval = 300;
  99 
 100 typedef enum zti_modes {
 101         ZTI_MODE_FIXED,                 /* value is # of threads (min 1) */
 102         ZTI_MODE_BATCH,                 /* cpu-intensive; value is ignored */
 103         ZTI_MODE_NULL,                  /* don't create a taskq */
 104         ZTI_NMODES
 105 } zti_modes_t;
 106 
 107 #define ZTI_P(n, q)     { ZTI_MODE_FIXED, (n), (q) }
 108 #define ZTI_BATCH       { ZTI_MODE_BATCH, 0, 1 }
 109 #define ZTI_NULL        { ZTI_MODE_NULL, 0, 0 }
 110 
 111 #define ZTI_N(n)        ZTI_P(n, 1)
 112 #define ZTI_ONE         ZTI_N(1)
 113 
 114 typedef struct zio_taskq_info {
 115         zti_modes_t zti_mode;
 116         uint_t zti_value;
 117         uint_t zti_count;
 118 } zio_taskq_info_t;


 131  * are so high frequency and short-lived that the taskq itself can become a
 132  * point of lock contention. The ZTI_P(#, #) macro indicates that we need an
 133  * additional degree of parallelism specified by the number of threads per-
 134  * taskq and the number of taskqs; when dispatching an event in this case, the
 135  * particular taskq is chosen at random.
 136  *
 137  * The different taskq priorities are to handle the different contexts (issue
 138  * and interrupt) and then to reserve threads for ZIO_PRIORITY_NOW I/Os that
 139  * need to be handled with minimum delay.
 140  */
 141 const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
 142         /* ISSUE        ISSUE_HIGH      INTR            INTR_HIGH */
 143         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* NULL */
 144         { ZTI_N(8),     ZTI_NULL,       ZTI_P(12, 8),   ZTI_NULL }, /* READ */
 145         { ZTI_BATCH,    ZTI_N(5),       ZTI_N(8),       ZTI_N(5) }, /* WRITE */
 146         { ZTI_P(12, 8), ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* FREE */
 147         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* CLAIM */
 148         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* IOCTL */
 149 };
 150 
 151 static void spa_sync_version(void *arg, dmu_tx_t *tx);
 152 static void spa_sync_props(void *arg, dmu_tx_t *tx);


 153 static boolean_t spa_has_active_shared_spare(spa_t *spa);
 154 static int spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
 155     boolean_t reloading);

 156 static void spa_vdev_resilver_done(spa_t *spa);
 157 
 158 uint_t          zio_taskq_batch_pct = 75;       /* 1 thread per cpu in pset */
 159 id_t            zio_taskq_psrset_bind = PS_NONE;
 160 boolean_t       zio_taskq_sysdc = B_TRUE;       /* use SDC scheduling class */
 161 uint_t          zio_taskq_basedc = 80;          /* base duty cycle */
 162 
 163 boolean_t       spa_create_process = B_TRUE;    /* no process ==> no sysdc */
 164 extern int      zfs_sync_pass_deferred_free;
 165 
 166 /*
 167  * Report any spa_load_verify errors found, but do not fail spa_load.
 168  * This is used by zdb to analyze non-idle pools.
 169  */
 170 boolean_t       spa_load_verify_dryrun = B_FALSE;
 171 
 172 /*
 173  * This (illegal) pool name is used when temporarily importing a spa_t in order
 174  * to get the vdev stats associated with the imported devices.
 175  */
 176 #define TRYIMPORT_NAME  "$import"
 177 
 178 /*
 179  * For debugging purposes: print out vdev tree during pool import.
 180  */
 181 boolean_t       spa_load_print_vdev_tree = B_FALSE;
 182 
 183 /*
 184  * A non-zero value for zfs_max_missing_tvds means that we allow importing
 185  * pools with missing top-level vdevs. This is strictly intended for advanced
 186  * pool recovery cases since missing data is almost inevitable. Pools with
 187  * missing devices can only be imported read-only for safety reasons, and their
 188  * fail-mode will be automatically set to "continue".
 189  *
 190  * With 1 missing vdev we should be able to import the pool and mount all
 191  * datasets. User data that was not modified after the missing device has been
 192  * added should be recoverable. This means that snapshots created prior to the
 193  * addition of that device should be completely intact.
 194  *
 195  * With 2 missing vdevs, some datasets may fail to mount since there are
 196  * dataset statistics that are stored as regular metadata. Some data might be
 197  * recoverable if those vdevs were added recently.
 198  *
 199  * With 3 or more missing vdevs, the pool is severely damaged and MOS entries
 200  * may be missing entirely. Chances of data recovery are very low. Note that
 201  * there are also risks of performing an inadvertent rewind as we might be
 202  * missing all the vdevs with the latest uberblocks.
 203  */
 204 uint64_t        zfs_max_missing_tvds = 0;
 205 
 206 /*
 207  * The parameters below are similar to zfs_max_missing_tvds but are only
 208  * intended for a preliminary open of the pool with an untrusted config which
 209  * might be incomplete or out-dated.
 210  *
 211  * We are more tolerant for pools opened from a cachefile since we could have
 212  * an out-dated cachefile where a device removal was not registered.
 213  * We could have set the limit arbitrarily high but in the case where devices
 214  * are really missing we would want to return the proper error codes; we chose
 215  * SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available
 216  * and we get a chance to retrieve the trusted config.
 217  */
 218 uint64_t        zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1;
 219 /*
 220  * In the case where config was assembled by scanning device paths (/dev/dsks
 221  * by default) we are less tolerant since all the existing devices should have
 222  * been detected and we want spa_load to return the right error codes.
 223  */
 224 uint64_t        zfs_max_missing_tvds_scan = 0;
 225 
 226 /*
 227  * ==========================================================================
 228  * SPA properties routines
 229  * ==========================================================================
 230  */
 231 
 232 /*
 233  * Add a (source=src, propname=propval) list to an nvlist.
 234  */
 235 static void
 236 spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, char *strval,
 237     uint64_t intval, zprop_source_t src)
 238 {
 239         const char *propname = zpool_prop_to_name(prop);
 240         nvlist_t *propval;
 241 
 242         VERIFY(nvlist_alloc(&propval, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 243         VERIFY(nvlist_add_uint64(propval, ZPROP_SOURCE, src) == 0);
 244 
 245         if (strval != NULL)
 246                 VERIFY(nvlist_add_string(propval, ZPROP_VALUE, strval) == 0);
 247         else
 248                 VERIFY(nvlist_add_uint64(propval, ZPROP_VALUE, intval) == 0);
 249 
 250         VERIFY(nvlist_add_nvlist(nvl, propname, propval) == 0);
 251         nvlist_free(propval);
 252 }
 253 
 254 /*
 255  * Get property values from the spa configuration.
 256  */
 257 static void
 258 spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
 259 {
 260         vdev_t *rvd = spa->spa_root_vdev;
 261         dsl_pool_t *pool = spa->spa_dsl_pool;

 262         uint64_t size, alloc, cap, version;
 263         zprop_source_t src = ZPROP_SRC_NONE;
 264         spa_config_dirent_t *dp;
 265         metaslab_class_t *mc = spa_normal_class(spa);
 266 
 267         ASSERT(MUTEX_HELD(&spa->spa_props_lock));
 268 
 269         if (rvd != NULL) {
 270                 alloc = metaslab_class_get_alloc(spa_normal_class(spa));
 271                 size = metaslab_class_get_space(spa_normal_class(spa));
 272                 spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
 273                 spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
 274                 spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
 275                 spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
 276                     size - alloc, src);
 277
 278                 spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
 279                     metaslab_class_fragmentation(mc), src);
 280                 spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
 281                     metaslab_class_expandable_space(mc), src);
 282                 spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
 283                     (spa_mode(spa) == FREAD), src);
 284 
 285                 cap = (size == 0) ? 0 : (alloc * 100 / size);
 286                 spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
 287 
 288                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
 289                     ddt_get_pool_dedup_ratio(spa), src);
 290 
 291                 spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
 292                     rvd->vdev_state, src);
 293 
 294                 version = spa_version(spa);
 295                 if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
 296                         src = ZPROP_SRC_DEFAULT;
 297                 else
 298                         src = ZPROP_SRC_LOCAL;
 299                 spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL, version, src);
 300         }
 301 
 302         if (pool != NULL) {
 303                 /*
 304                  * The $FREE directory was introduced in SPA_VERSION_DEADLISTS,
 305                  * when opening pools before this version freedir will be NULL.
 306                  */
 307                 if (pool->dp_free_dir != NULL) {
 308                         spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
 309                             dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes,

 310                             src);
 311                 } else {
 312                         spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
 313                             NULL, 0, src);
 314                 }
 315 
 316                 if (pool->dp_leak_dir != NULL) {
 317                         spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
 318                             dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
 319                             src);
 320                 } else {
 321                         spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
 322                             NULL, 0, src);
 323                 }
 324         }
 325 
 326         spa_prop_add_list(*nvp, ZPOOL_PROP_GUID, NULL, spa_guid(spa), src);
 327 
 328         if (spa->spa_comment != NULL) {
 329                 spa_prop_add_list(*nvp, ZPOOL_PROP_COMMENT, spa->spa_comment,
 330                     0, ZPROP_SRC_LOCAL);
 331         }
 332 
 333         if (spa->spa_root != NULL)


 373          */
 374         spa_prop_get_config(spa, nvp);
 375 
 376         /* If no pool property object, no more prop to get. */
 377         if (mos == NULL || spa->spa_pool_props_object == 0) {
 378                 mutex_exit(&spa->spa_props_lock);
 379                 return (0);
 380         }
 381 
 382         /*
 383          * Get properties from the MOS pool property object.
 384          */
 385         for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
 386             (err = zap_cursor_retrieve(&zc, &za)) == 0;
 387             zap_cursor_advance(&zc)) {
 388                 uint64_t intval = 0;
 389                 char *strval = NULL;
 390                 zprop_source_t src = ZPROP_SRC_DEFAULT;
 391                 zpool_prop_t prop;
 392 
 393                 if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
 394                         continue;
 395 
 396                 switch (za.za_integer_length) {
 397                 case 8:
 398                         /* integer property */
 399                         if (za.za_first_integer !=
 400                             zpool_prop_default_numeric(prop))
 401                                 src = ZPROP_SRC_LOCAL;
 402 
 403                         if (prop == ZPOOL_PROP_BOOTFS) {
 404                                 dsl_pool_t *dp;
 405                                 dsl_dataset_t *ds = NULL;
 406 
 407                                 dp = spa_get_dsl(spa);
 408                                 dsl_pool_config_enter(dp, FTAG);
 409                                 if (err = dsl_dataset_hold_obj(dp,
 410                                     za.za_first_integer, FTAG, &ds)) {
 411                                         dsl_pool_config_exit(dp, FTAG);
 412                                         break;
 413                                 }


 452         if (err && err != ENOENT) {
 453                 nvlist_free(*nvp);
 454                 *nvp = NULL;
 455                 return (err);
 456         }
 457 
 458         return (0);
 459 }
 460 
 461 /*
 462  * Validate the given pool properties nvlist and modify the list
 463  * for the property values to be set.
 464  */
 465 static int
 466 spa_prop_validate(spa_t *spa, nvlist_t *props)
 467 {
 468         nvpair_t *elem;
 469         int error = 0, reset_bootfs = 0;
 470         uint64_t objnum = 0;
 471         boolean_t has_feature = B_FALSE;


 472 
 473         elem = NULL;
 474         while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
 475                 uint64_t intval;
 476                 char *strval, *slash, *check, *fname;
 477                 const char *propname = nvpair_name(elem);
 478                 zpool_prop_t prop = zpool_name_to_prop(propname);

 479 
 480                 switch (prop) {
 481                 case ZPOOL_PROP_INVAL:
 482                         if (!zpool_prop_feature(propname)) {
 483                                 error = SET_ERROR(EINVAL);
 484                                 break;
 485                         }
 486 
 487                         /*
 488                          * Sanitize the input.
 489                          */
 490                         if (nvpair_type(elem) != DATA_TYPE_UINT64) {
 491                                 error = SET_ERROR(EINVAL);
 492                                 break;
 493                         }
 494 
 495                         if (nvpair_value_uint64(elem, &intval) != 0) {
 496                                 error = SET_ERROR(EINVAL);
 497                                 break;
 498                         }
 499 
 500                         if (intval != 0) {
 501                                 error = SET_ERROR(EINVAL);
 502                                 break;
 503                         }
 504 
 505                         fname = strchr(propname, '@') + 1;
 506                         if (zfeature_lookup_name(fname, NULL) != 0) {
 507                                 error = SET_ERROR(EINVAL);
 508                                 break;
 509                         }
 510 
 511                         has_feature = B_TRUE;
 512                         break;
 513 
 514                 case ZPOOL_PROP_VERSION:
 515                         error = nvpair_value_uint64(elem, &intval);
 516                         if (!error &&
 517                             (intval < spa_version(spa) ||
 518                             intval > SPA_VERSION_BEFORE_FEATURES ||
 519                             has_feature))
 520                                 error = SET_ERROR(EINVAL);
 521                         break;
 522 
 523                 case ZPOOL_PROP_DELEGATION:
 524                 case ZPOOL_PROP_AUTOREPLACE:
 525                 case ZPOOL_PROP_LISTSNAPS:
 526                 case ZPOOL_PROP_AUTOEXPAND:
 527                         error = nvpair_value_uint64(elem, &intval);
 528                         if (!error && intval > 1)
 529                                 error = SET_ERROR(EINVAL);
 530                         break;
 531 
 532                 case ZPOOL_PROP_BOOTFS:
 533                         /*
 534                          * If the pool version is less than SPA_VERSION_BOOTFS,
 535                          * or the pool is still being created (version == 0),
 536                          * the bootfs property cannot be set.
 537                          */
 538                         if (spa_version(spa) < SPA_VERSION_BOOTFS) {
 539                                 error = SET_ERROR(ENOTSUP);
 540                                 break;
 541                         }
 542 
 543                         /*
 544                          * Make sure the vdev config is bootable
 545                          */
 546                         if (!vdev_is_bootable(spa->spa_root_vdev)) {
 547                                 error = SET_ERROR(ENOTSUP);
 548                                 break;
 549                         }
 550 
 551                         reset_bootfs = 1;


 569                                  * Must be ZPL, and its property settings
 570                                  * must be supported by GRUB (compression
 571                                  * is not gzip, and large blocks are not used).
 572                                  */
 573 
 574                                 if (dmu_objset_type(os) != DMU_OST_ZFS) {
 575                                         error = SET_ERROR(ENOTSUP);
 576                                 } else if ((error =
 577                                     dsl_prop_get_int_ds(dmu_objset_ds(os),
 578                                     zfs_prop_to_name(ZFS_PROP_COMPRESSION),
 579                                     &propval)) == 0 &&
 580                                     !BOOTFS_COMPRESS_VALID(propval)) {
 581                                         error = SET_ERROR(ENOTSUP);
 582                                 } else {
 583                                         objnum = dmu_objset_id(os);
 584                                 }
 585                                 dmu_objset_rele(os, FTAG);
 586                         }
 587                         break;
 588 
 589                 case ZPOOL_PROP_FAILUREMODE:
 590                         error = nvpair_value_uint64(elem, &intval);
 591                         if (!error && (intval < ZIO_FAILURE_MODE_WAIT ||
 592                             intval > ZIO_FAILURE_MODE_PANIC))
 593                                 error = SET_ERROR(EINVAL);
 594 
 595                         /*
 596                          * This is a special case which only occurs when
 597                          * the pool has completely failed. This allows
 598                          * the user to change the in-core failmode property
 599                          * without syncing it out to disk (I/Os might
 600                          * currently be blocked). We do this by returning
 601                          * EIO to the caller (spa_prop_set) to trick it
 602                          * into thinking we encountered a property validation
 603                          * error.
 604                          */
 605                         if (!error && spa_suspended(spa)) {
 606                                 spa->spa_failmode = intval;
 607                                 error = SET_ERROR(EIO);
 608                         }


 630                             strcmp(slash, "/..") == 0)
 631                                 error = SET_ERROR(EINVAL);
 632                         break;
 633 
 634                 case ZPOOL_PROP_COMMENT:
 635                         if ((error = nvpair_value_string(elem, &strval)) != 0)
 636                                 break;
 637                         for (check = strval; *check != '\0'; check++) {
 638                                 /*
 639                                  * The kernel doesn't have an easy isprint()
 640                                  * check.  For this kernel check, we merely
 641                                  * check ASCII apart from DEL.  Fix this if
 642                                  * there is an easy-to-use kernel isprint().
 643                                  */
 644                                 if (*check >= 0x7f) {
 645                                         error = SET_ERROR(EINVAL);
 646                                         break;
 647                                 }
 648                         }
 649                         if (strlen(strval) > ZPROP_MAX_COMMENT)
 650                                 error = E2BIG;
 651                         break;
 652 
 653                 case ZPOOL_PROP_DEDUPDITTO:
 654                         if (spa_version(spa) < SPA_VERSION_DEDUP)
 655                                 error = SET_ERROR(ENOTSUP);
 656                         else
 657                                 error = nvpair_value_uint64(elem, &intval);
 658                         if (error == 0 &&
 659                             intval != 0 && intval < ZIO_DEDUPDITTO_MIN)
 660                                 error = SET_ERROR(EINVAL);
 661                         break;
 662                 }
 663 
 664                 if (error)
 665                         break;
 666         }
 667 
 668         if (!error && reset_bootfs) {
 669                 error = nvlist_remove(props,
 670                     zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
 671 
 672                 if (!error) {
 673                         error = nvlist_add_uint64(props,
 674                             zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
 675                 }
 676         }
 677 
 678         return (error);
 679 }
 680 
 681 void
 682 spa_configfile_set(spa_t *spa, nvlist_t *nvp, boolean_t need_sync)
 683 {
 684         char *cachefile;
 685         spa_config_dirent_t *dp;
 686 
 687         if (nvlist_lookup_string(nvp, zpool_prop_to_name(ZPOOL_PROP_CACHEFILE),


 704 }
 705 
 706 int
 707 spa_prop_set(spa_t *spa, nvlist_t *nvp)
 708 {
 709         int error;
 710         nvpair_t *elem = NULL;
 711         boolean_t need_sync = B_FALSE;
 712 
 713         if ((error = spa_prop_validate(spa, nvp)) != 0)
 714                 return (error);
 715 
 716         while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
 717                 zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
 718 
 719                 if (prop == ZPOOL_PROP_CACHEFILE ||
 720                     prop == ZPOOL_PROP_ALTROOT ||
 721                     prop == ZPOOL_PROP_READONLY)
 722                         continue;
 723 
 724                 if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
 725                         uint64_t ver;
 726 
 727                         if (prop == ZPOOL_PROP_VERSION) {
 728                                 VERIFY(nvpair_value_uint64(elem, &ver) == 0);
 729                         } else {
 730                                 ASSERT(zpool_prop_feature(nvpair_name(elem)));
 731                                 ver = SPA_VERSION_FEATURES;
 732                                 need_sync = B_TRUE;
 733                         }
 734 
 735                         /* Save time if the version is already set. */
 736                         if (ver == spa_version(spa))
 737                                 continue;
 738 
 739                         /*
 740                          * In addition to the pool directory object, we might
 741                          * create the pool properties object, the features for
 742                          * read object, the features for write object, or the
 743                          * feature descriptions object.
 744                          */


 823  * the root vdev's guid, our own pool guid, and then mark all of our
 824  * vdevs dirty.  Note that we must make sure that all our vdevs are
 825  * online when we do this, or else any vdevs that weren't present
 826  * would be orphaned from our pool.  We are also going to issue a
 827  * sysevent to update any watchers.
 828  */
 829 int
 830 spa_change_guid(spa_t *spa)
 831 {
 832         int error;
 833         uint64_t guid;
 834 
 835         mutex_enter(&spa->spa_vdev_top_lock);
 836         mutex_enter(&spa_namespace_lock);
 837         guid = spa_generate_guid(NULL);
 838 
 839         error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
 840             spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
 841 
 842         if (error == 0) {
 843                 spa_write_cachefile(spa, B_FALSE, B_TRUE);
 844                 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
 845         }
 846 
 847         mutex_exit(&spa_namespace_lock);
 848         mutex_exit(&spa->spa_vdev_top_lock);
 849 
 850         return (error);
 851 }
 852 
 853 /*
 854  * ==========================================================================
 855  * SPA state manipulation (open/create/destroy/import/export)
 856  * ==========================================================================
 857  */
 858 
 859 static int
 860 spa_error_entry_compare(const void *a, const void *b)
 861 {
 862         spa_error_entry_t *sa = (spa_error_entry_t *)a;
 863         spa_error_entry_t *sb = (spa_error_entry_t *)b;


1091         CALLB_CPR_EXIT(&cprinfo);   /* drops spa_proc_lock */
1092 
1093         mutex_enter(&curproc->p_lock);
1094         lwp_exit();
1095 }
1096 #endif
1097 
1098 /*
1099  * Activate an uninitialized pool.
1100  */
1101 static void
1102 spa_activate(spa_t *spa, int mode)
1103 {
1104         ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
1105 
1106         spa->spa_state = POOL_STATE_ACTIVE;
1107         spa->spa_mode = mode;
1108 
1109         spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
1110         spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);

1111 
1112         /* Try to create a covering process */
1113         mutex_enter(&spa->spa_proc_lock);
1114         ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
1115         ASSERT(spa->spa_proc == &p0);
1116         spa->spa_did = 0;
1117 
1118         /* Only create a process if we're going to be around a while. */
1119         if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
1120                 if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
1121                     NULL, 0) == 0) {
1122                         spa->spa_proc_state = SPA_PROC_CREATED;
1123                         while (spa->spa_proc_state == SPA_PROC_CREATED) {
1124                                 cv_wait(&spa->spa_proc_cv,
1125                                     &spa->spa_proc_lock);
1126                         }
1127                         ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1128                         ASSERT(spa->spa_proc != &p0);
1129                         ASSERT(spa->spa_did != 0);
1130                 } else {
1131 #ifdef _KERNEL
1132                         cmn_err(CE_WARN,
1133                             "Couldn't create process for zfs pool \"%s\"\n",
1134                             spa->spa_name);
1135 #endif
1136                 }
1137         }
1138         mutex_exit(&spa->spa_proc_lock);
1139 
1140         /* If we didn't create a process, we need to create our taskqs. */
1141         if (spa->spa_proc == &p0) {
1142                 spa_create_zio_taskqs(spa);
1143         }
1144 
1145         for (size_t i = 0; i < TXG_SIZE; i++)
1146                 spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0);
1147 
1148         list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
1149             offsetof(vdev_t, vdev_config_dirty_node));
1150         list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
1151             offsetof(objset_t, os_evicting_node));
1152         list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
1153             offsetof(vdev_t, vdev_state_dirty_node));
1154 
1155         txg_list_create(&spa->spa_vdev_txg_list, spa,
1156             offsetof(struct vdev, vdev_txg_node));
1157 
1158         avl_create(&spa->spa_errlist_scrub,
1159             spa_error_entry_compare, sizeof (spa_error_entry_t),
1160             offsetof(spa_error_entry_t, se_avl));
1161         avl_create(&spa->spa_errlist_last,
1162             spa_error_entry_compare, sizeof (spa_error_entry_t),
1163             offsetof(spa_error_entry_t, se_avl));
1164 }
1165 
1166 /*
1167  * Opposite of spa_activate().


1172         ASSERT(spa->spa_sync_on == B_FALSE);
1173         ASSERT(spa->spa_dsl_pool == NULL);
1174         ASSERT(spa->spa_root_vdev == NULL);
1175         ASSERT(spa->spa_async_zio_root == NULL);
1176         ASSERT(spa->spa_state != POOL_STATE_UNINITIALIZED);
1177 
1178         spa_evicting_os_wait(spa);
1179 
1180         txg_list_destroy(&spa->spa_vdev_txg_list);
1181 
1182         list_destroy(&spa->spa_config_dirty_list);
1183         list_destroy(&spa->spa_evicting_os_list);
1184         list_destroy(&spa->spa_state_dirty_list);
1185 
1186         for (int t = 0; t < ZIO_TYPES; t++) {
1187                 for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1188                         spa_taskqs_fini(spa, t, q);
1189                 }
1190         }
1191 
1192         for (size_t i = 0; i < TXG_SIZE; i++) {
1193                 ASSERT3P(spa->spa_txg_zio[i], !=, NULL);
1194                 VERIFY0(zio_wait(spa->spa_txg_zio[i]));
1195                 spa->spa_txg_zio[i] = NULL;
1196         }
1197 
1198         metaslab_class_destroy(spa->spa_normal_class);
1199         spa->spa_normal_class = NULL;
1200 
1201         metaslab_class_destroy(spa->spa_log_class);
1202         spa->spa_log_class = NULL;
1203 
1204         /*
1205          * If this was part of an import or the open otherwise failed, we may
1206          * still have errors left in the queues.  Empty them just in case.
1207          */
1208         spa_errlog_drain(spa);
1209 
1210         avl_destroy(&spa->spa_errlist_scrub);
1211         avl_destroy(&spa->spa_errlist_last);
1212 
1213         spa->spa_state = POOL_STATE_UNINITIALIZED;
1214 
1215         mutex_enter(&spa->spa_proc_lock);
1216         if (spa->spa_proc_state != SPA_PROC_NONE) {
1217                 ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1218                 spa->spa_proc_state = SPA_PROC_DEACTIVATE;
1219                 cv_broadcast(&spa->spa_proc_cv);
1220                 while (spa->spa_proc_state == SPA_PROC_DEACTIVATE) {
1221                         ASSERT(spa->spa_proc != &p0);
1222                         cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1223                 }


1278                         *vdp = NULL;
1279                         return (error);
1280                 }
1281         }
1282 
1283         ASSERT(*vdp != NULL);
1284 
1285         return (0);
1286 }
1287 
1288 /*
1289  * Opposite of spa_load().
1290  */
1291 static void
1292 spa_unload(spa_t *spa)
1293 {
1294         int i;
1295 
1296         ASSERT(MUTEX_HELD(&spa_namespace_lock));
1297 
1298         spa_load_note(spa, "UNLOADING");
1299 
1300         /*
1301          * Stop async tasks.
1302          */
1303         spa_async_suspend(spa);
1304 
1305         /*
1306          * Stop syncing.
1307          */
1308         if (spa->spa_sync_on) {
1309                 txg_sync_stop(spa->spa_dsl_pool);
1310                 spa->spa_sync_on = B_FALSE;
1311         }
1312 
1313         /*
1314          * Even though vdev_free() also calls vdev_metaslab_fini, we need
1315          * to call it earlier, before we wait for async i/o to complete.
1316          * This ensures that there is no async metaslab prefetching, by
1317          * calling taskq_wait(mg_taskq).
1318          */
1319         if (spa->spa_root_vdev != NULL) {
1320                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1321                 for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++)
1322                         vdev_metaslab_fini(spa->spa_root_vdev->vdev_child[c]);
1323                 spa_config_exit(spa, SCL_ALL, FTAG);
1324         }
1325 
1326         /*
1327          * Wait for any outstanding async I/O to complete.
1328          */
1329         if (spa->spa_async_zio_root != NULL) {
1330                 for (int i = 0; i < max_ncpus; i++)
1331                         (void) zio_wait(spa->spa_async_zio_root[i]);
1332                 kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
1333                 spa->spa_async_zio_root = NULL;
1334         }
1335 
1336         if (spa->spa_vdev_removal != NULL) {
1337                 spa_vdev_removal_destroy(spa->spa_vdev_removal);
1338                 spa->spa_vdev_removal = NULL;
1339         }
1340 
1341         if (spa->spa_condense_zthr != NULL) {
1342                 ASSERT(!zthr_isrunning(spa->spa_condense_zthr));
1343                 zthr_destroy(spa->spa_condense_zthr);
1344                 spa->spa_condense_zthr = NULL;
1345         }
1346 
1347         spa_condense_fini(spa);
1348 
1349         bpobj_close(&spa->spa_deferred_bpobj);
1350 
1351         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1352 
1353         /*
1354          * Close all vdevs.
1355          */
1356         if (spa->spa_root_vdev)
1357                 vdev_free(spa->spa_root_vdev);
1358         ASSERT(spa->spa_root_vdev == NULL);
1359 
1360         /*
1361          * Close the dsl pool.
1362          */
1363         if (spa->spa_dsl_pool) {
1364                 dsl_pool_close(spa->spa_dsl_pool);
1365                 spa->spa_dsl_pool = NULL;
1366                 spa->spa_meta_objset = NULL;
1367         }
1368 
1369         ddt_unload(spa);
1370 
1371         /*
1372          * Drop and purge level 2 cache
1373          */


1386         }
1387         spa->spa_spares.sav_count = 0;
1388 
1389         for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
1390                 vdev_clear_stats(spa->spa_l2cache.sav_vdevs[i]);
1391                 vdev_free(spa->spa_l2cache.sav_vdevs[i]);
1392         }
1393         if (spa->spa_l2cache.sav_vdevs) {
1394                 kmem_free(spa->spa_l2cache.sav_vdevs,
1395                     spa->spa_l2cache.sav_count * sizeof (void *));
1396                 spa->spa_l2cache.sav_vdevs = NULL;
1397         }
1398         if (spa->spa_l2cache.sav_config) {
1399                 nvlist_free(spa->spa_l2cache.sav_config);
1400                 spa->spa_l2cache.sav_config = NULL;
1401         }
1402         spa->spa_l2cache.sav_count = 0;
1403 
1404         spa->spa_async_suspended = 0;
1405 
1406         spa->spa_indirect_vdevs_loaded = B_FALSE;
1407 
1408         if (spa->spa_comment != NULL) {
1409                 spa_strfree(spa->spa_comment);
1410                 spa->spa_comment = NULL;
1411         }
1412 
1413         spa_config_exit(spa, SCL_ALL, FTAG);
1414 }
1415 
1416 /*
1417  * Load (or re-load) the current list of vdevs describing the active spares for
1418  * this pool.  When this is called, we have some form of basic information in
1419  * 'spa_spares.sav_config'.  We parse this into vdevs, try to open them, and
1420  * then re-generate a more complete list including status information.
1421  */
1422 void
1423 spa_load_spares(spa_t *spa)
1424 {
1425         nvlist_t **spares;
1426         uint_t nspares;
1427         int i;
1428         vdev_t *vd, *tvd;
1429 
1430         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1431 
1432         /*
1433          * First, close and free any existing spare vdevs.
1434          */
1435         for (i = 0; i < spa->spa_spares.sav_count; i++) {
1436                 vd = spa->spa_spares.sav_vdevs[i];
1437 
1438                 /* Undo the call to spa_activate() below */
1439                 if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1440                     B_FALSE)) != NULL && tvd->vdev_isspare)
1441                         spa_spare_remove(tvd);
1442                 vdev_close(vd);


1519         spares = kmem_alloc(spa->spa_spares.sav_count * sizeof (void *),
1520             KM_SLEEP);
1521         for (i = 0; i < spa->spa_spares.sav_count; i++)
1522                 spares[i] = vdev_config_generate(spa,
1523                     spa->spa_spares.sav_vdevs[i], B_TRUE, VDEV_CONFIG_SPARE);
1524         VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
1525             ZPOOL_CONFIG_SPARES, spares, spa->spa_spares.sav_count) == 0);
1526         for (i = 0; i < spa->spa_spares.sav_count; i++)
1527                 nvlist_free(spares[i]);
1528         kmem_free(spares, spa->spa_spares.sav_count * sizeof (void *));
1529 }
1530 
1531 /*
1532  * Load (or re-load) the current list of vdevs describing the active l2cache for
1533  * this pool.  When this is called, we have some form of basic information in
1534  * 'spa_l2cache.sav_config'.  We parse this into vdevs, try to open them, and
1535  * then re-generate a more complete list including status information.
1536  * Devices which are already active have their details maintained, and are
1537  * not re-opened.
1538  */
1539 void
1540 spa_load_l2cache(spa_t *spa)
1541 {
1542         nvlist_t **l2cache;
1543         uint_t nl2cache;
1544         int i, j, oldnvdevs;
1545         uint64_t guid;
1546         vdev_t *vd, **oldvdevs, **newvdevs;
1547         spa_aux_vdev_t *sav = &spa->spa_l2cache;
1548 
1549         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1550 
1551         if (sav->sav_config != NULL) {
1552                 VERIFY(nvlist_lookup_nvlist_array(sav->sav_config,
1553                     ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
1554                 newvdevs = kmem_alloc(nl2cache * sizeof (void *), KM_SLEEP);
1555         } else {
1556                 nl2cache = 0;
1557                 newvdevs = NULL;
1558         }
1559 


1590                             VDEV_ALLOC_L2CACHE) == 0);
1591                         ASSERT(vd != NULL);
1592                         newvdevs[i] = vd;
1593 
1594                         /*
1595                          * Commit this vdev as an l2cache device,
1596                          * even if it fails to open.
1597                          */
1598                         spa_l2cache_add(vd);
1599 
1600                         vd->vdev_top = vd;
1601                         vd->vdev_aux = sav;
1602 
1603                         spa_l2cache_activate(vd);
1604 
1605                         if (vdev_open(vd) != 0)
1606                                 continue;
1607 
1608                         (void) vdev_validate_aux(vd);
1609 
1610                         if (!vdev_is_dead(vd))
1611                                 l2arc_add_vdev(spa, vd);
1612                 }
1613         }

1614 
1615         /*
1616          * Purge vdevs that were dropped
1617          */
1618         for (i = 0; i < oldnvdevs; i++) {
1619                 uint64_t pool;
1620 
1621                 vd = oldvdevs[i];
1622                 if (vd != NULL) {
1623                         ASSERT(vd->vdev_isl2cache);
1624 
1625                         if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
1626                             pool != 0ULL && l2arc_vdev_present(vd))
1627                                 l2arc_remove_vdev(vd);
1628                         vdev_clear_stats(vd);
1629                         vdev_free(vd);
1630                 }
1631         }
1632 
1633         if (oldvdevs)


1669         *value = NULL;
1670 
1671         error = dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db);
1672         if (error != 0)
1673                 return (error);
1674 
1675         nvsize = *(uint64_t *)db->db_data;
1676         dmu_buf_rele(db, FTAG);
1677 
1678         packed = kmem_alloc(nvsize, KM_SLEEP);
1679         error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
1680             DMU_READ_PREFETCH);
1681         if (error == 0)
1682                 error = nvlist_unpack(packed, nvsize, value, 0);
1683         kmem_free(packed, nvsize);
1684 
1685         return (error);
1686 }
1687 
1688 /*
1689  * Concrete top-level vdevs that are not missing and are not logs. At every
1690  * spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds.
1691  */
1692 static uint64_t
1693 spa_healthy_core_tvds(spa_t *spa)
1694 {
1695         vdev_t *rvd = spa->spa_root_vdev;
1696         uint64_t tvds = 0;
1697 
1698         for (uint64_t i = 0; i < rvd->vdev_children; i++) {
1699                 vdev_t *vd = rvd->vdev_child[i];
1700                 if (vd->vdev_islog)
1701                         continue;
1702                 if (vdev_is_concrete(vd) && !vdev_is_dead(vd))
1703                         tvds++;
1704         }
1705 
1706         return (tvds);
1707 }
1708 
1709 /*
1710  * Checks to see if the given vdev could not be opened, in which case we post a
1711  * sysevent to notify the autoreplace code that the device has been removed.
1712  */
1713 static void
1714 spa_check_removed(vdev_t *vd)
1715 {
1716         for (uint64_t c = 0; c < vd->vdev_children; c++)
1717                 spa_check_removed(vd->vdev_child[c]);
1718 
1719         if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
1720             vdev_is_concrete(vd)) {
1721                 zfs_post_autoreplace(vd->vdev_spa, vd);
1722                 spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
1723         }
1724 }
1725 
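/*
 * Check for top-level log devices that failed to open.  On a normal import
 * we attach the list of missing log devices to spa_load_info and fail the
 * load; if the import explicitly allows missing logs (ZFS_IMPORT_MISSING_LOG),
 * we instead mark the log state for clearing, dropping the ZIL contents.
 */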
1726 static int
1727 spa_check_for_missing_logs(spa_t *spa)
1728 {
1729         vdev_t *rvd = spa->spa_root_vdev;
1730 
1731         /*
1732          * If we're doing a normal import, then build up any additional
1733          * diagnostic information about missing log devices.
1734          * We'll pass this up to the user for further processing.
1735          */
1736         if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
1737                 nvlist_t **child, *nv;
1738                 uint64_t idx = 0;
1739 
1740                 child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **),
1741                     KM_SLEEP);
1742                 VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
1743 
1744                 for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1745                         vdev_t *tvd = rvd->vdev_child[c];

1746 
1747                         /*
1748                          * We consider a device missing only if it failed
1749                          * to open (i.e. a device that is offline or faulted
1750                          * is not considered missing).
1751                          */
1752                         if (tvd->vdev_islog &&
1753                             tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1754                                 child[idx++] = vdev_config_generate(spa, tvd,
1755                                     B_FALSE, VDEV_CONFIG_MISSING);
1756                         }
1757                 }
1758 
1759                 if (idx > 0) {
1760                         fnvlist_add_nvlist_array(nv,
1761                             ZPOOL_CONFIG_CHILDREN, child, idx);
1762                         fnvlist_add_nvlist(spa->spa_load_info,
1763                             ZPOOL_CONFIG_MISSING_DEVICES, nv);
1764 
1765                         for (uint64_t i = 0; i < idx; i++)
1766                                 nvlist_free(child[i]);
1767                 }
1768                 nvlist_free(nv);
1769                 kmem_free(child, rvd->vdev_children * sizeof (nvlist_t **));
1770 
1771                 if (idx > 0) {
1772                         spa_load_failed(spa, "some log devices are missing");
1773                         return (SET_ERROR(ENXIO));
1774                 }
1775         } else {
1776                 for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1777                         vdev_t *tvd = rvd->vdev_child[c];

1778 
1779                         if (tvd->vdev_islog &&
1780                             tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1781                                 spa_set_log_state(spa, SPA_LOG_CLEAR);
1782                                 spa_load_note(spa, "some log devices are "
1783                                     "missing, ZIL is dropped.");
1784                                 break;
1785                         }
1786                 }
1787         }

1788 
1789         return (0);
1790 }
1791 
1792 /*
1793  * Check for missing log devices
1794  */
1795 static boolean_t
1796 spa_check_logs(spa_t *spa)
1797 {
1798         boolean_t rv = B_FALSE;
1799         dsl_pool_t *dp = spa_get_dsl(spa);
1800 
1801         switch (spa->spa_log_state) {
1802         case SPA_LOG_MISSING:
1803                 /* need to recheck in case slog has been restored */
1804         case SPA_LOG_UNKNOWN:
1805                 rv = (dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1806                     zil_check_log_chain, NULL, DS_FIND_CHILDREN) != 0);
1807                 if (rv)
1808                         spa_set_log_state(spa, SPA_LOG_MISSING);
1809                 break;


1835         return (slog_found);
1836 }
1837 
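/*
 * Re-activate the metaslab groups of all top-level log vdevs so that
 * allocations from the log class can resume.
 */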
1838 static void
1839 spa_activate_log(spa_t *spa)
1840 {
1841         vdev_t *rvd = spa->spa_root_vdev;
1842 
1843         ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1844 
1845         for (int c = 0; c < rvd->vdev_children; c++) {
1846                 vdev_t *tvd = rvd->vdev_child[c];
1847                 metaslab_group_t *mg = tvd->vdev_mg;
1848 
1849                 if (tvd->vdev_islog)
1850                         metaslab_group_activate(mg);
1851         }
1852 }
1853 
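/*
 * Reset the ZIL of every dataset in the pool, then sync out the current
 * txg so that zil_sync() can clean up the remaining log blocks.
 */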
1854 int
1855 spa_reset_logs(spa_t *spa)
1856 {
1857         int error;
1858 
1859         error = dmu_objset_find(spa_name(spa), zil_reset,
1860             NULL, DS_FIND_CHILDREN);
1861         if (error == 0) {
1862                 /*
1863                  * We successfully offlined the log device, sync out the
1864                  * current txg so that the "stubby" block can be removed
1865                  * by zil_sync().
1866                  */
1867                 txg_wait_synced(spa->spa_dsl_pool, 0);
1868         }
1869         return (error);
1870 }
1871 
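/*
 * Run spa_check_removed() on every auxiliary (spare or l2cache) vdev in
 * the given set.
 */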
1872 static void
1873 spa_aux_check_removed(spa_aux_vdev_t *sav)
1874 {
1875         for (int i = 0; i < sav->sav_count; i++)
1876                 spa_check_removed(sav->sav_vdevs[i]);
1877 }
1878 
1879 void


1889                 spa->spa_claim_max_txg = zio->io_bp->blk_birth;
1890         mutex_exit(&spa->spa_props_lock);
1891 }
1892 
1893 typedef struct spa_load_error {
1894         uint64_t        sle_meta_count;
1895         uint64_t        sle_data_count;
1896 } spa_load_error_t;
1897 
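/*
 * Completion callback for the verification zios issued by spa_load_verify().
 * Errors are tallied separately for metadata and data blocks, and the
 * in-flight count that throttles the verification reads is decremented.
 */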
1898 static void
1899 spa_load_verify_done(zio_t *zio)
1900 {
1901         blkptr_t *bp = zio->io_bp;
1902         spa_load_error_t *sle = zio->io_private;
1903         dmu_object_type_t type = BP_GET_TYPE(bp);
1904         int error = zio->io_error;
1905         spa_t *spa = zio->io_spa;
1906 
1907         abd_free(zio->io_abd);
1908         if (error) {
1909                 if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
1910                     type != DMU_OT_INTENT_LOG)
1911                         atomic_inc_64(&sle->sle_meta_count);
1912                 else
1913                         atomic_inc_64(&sle->sle_data_count);
1914         }
1915 
1916         mutex_enter(&spa->spa_scrub_lock);
1917         spa->spa_scrub_inflight--;
1918         cv_broadcast(&spa->spa_scrub_io_cv);
1919         mutex_exit(&spa->spa_scrub_lock);
1920 }
1921 
1922 /*
1923  * Maximum number of concurrent scrub i/os to create while verifying
1924  * a pool during import.
1925  */
1926 int spa_load_verify_maxinflight = 10000;
1927 boolean_t spa_load_verify_metadata = B_TRUE;
1928 boolean_t spa_load_verify_data = B_TRUE;
1929 
1930 /*ARGSUSED*/


1979         boolean_t verify_ok = B_FALSE;
1980         int error = 0;
1981 
1982         zpool_get_rewind_policy(spa->spa_config, &policy);
1983 
1984         if (policy.zrp_request & ZPOOL_NEVER_REWIND)
1985                 return (0);
1986 
1987         dsl_pool_config_enter(spa->spa_dsl_pool, FTAG);
1988         error = dmu_objset_find_dp(spa->spa_dsl_pool,
1989             spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
1990             DS_FIND_CHILDREN);
1991         dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
1992         if (error != 0)
1993                 return (error);
1994 
1995         rio = zio_root(spa, NULL, &sle,
1996             ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
1997 
1998         if (spa_load_verify_metadata) {
1999                 if (spa->spa_extreme_rewind) {
2000                         spa_load_note(spa, "performing a complete scan of the "
2001                             "pool since extreme rewind is on. This may take "
2002                             "a very long time.\n  (spa_load_verify_data=%u, "
2003                             "spa_load_verify_metadata=%u)",
2004                             spa_load_verify_data, spa_load_verify_metadata);
2005                 }
2006                 error = traverse_pool(spa, spa->spa_verify_min_txg,
2007                     TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA,
2008                     spa_load_verify_cb, rio);
2009         }
2010 
2011         (void) zio_wait(rio);
2012 
2013         spa->spa_load_meta_errors = sle.sle_meta_count;
2014         spa->spa_load_data_errors = sle.sle_data_count;
2015 
2016         if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) {
2017                 spa_load_note(spa, "spa_load_verify found %llu metadata errors "
2018                     "and %llu data errors", (u_longlong_t)sle.sle_meta_count,
2019                     (u_longlong_t)sle.sle_data_count);
2020         }
2021 
2022         if (spa_load_verify_dryrun ||
2023             (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2024             sle.sle_data_count <= policy.zrp_maxdata)) {
2025                 int64_t loss = 0;
2026 
2027                 verify_ok = B_TRUE;
2028                 spa->spa_load_txg = spa->spa_uberblock.ub_txg;
2029                 spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
2030 
2031                 loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
2032                 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2033                     ZPOOL_CONFIG_LOAD_TIME, spa->spa_load_txg_ts) == 0);
2034                 VERIFY(nvlist_add_int64(spa->spa_load_info,
2035                     ZPOOL_CONFIG_REWIND_TIME, loss) == 0);
2036                 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2037                     ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
2038         } else {
2039                 spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
2040         }
2041 
2042         if (spa_load_verify_dryrun)
2043                 return (0);
2044 
2045         if (error) {
2046                 if (error != ENXIO && error != EIO)
2047                         error = SET_ERROR(EIO);
2048                 return (error);
2049         }
2050 
2051         return (verify_ok ? 0 : EIO);
2052 }
2053 
2054 /*
2055  * Find a value in the pool props object.
2056  */
2057 static void
2058 spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
2059 {
2060         (void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
2061             zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
2062 }
2063 
2064 /*
2065  * Find a value in the pool directory object.
2066  */
2067 static int
2068 spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent)
2069 {
2070         int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2071             name, sizeof (uint64_t), 1, val);

2072 
2073         if (error != 0 && (error != ENOENT || log_enoent)) {
2074                 spa_load_failed(spa, "couldn't get '%s' value in MOS directory "
2075                     "[error=%d]", name, error);
2076         }
2077 
2078         return (error);
2079 }
2080 
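/*
 * Helper used throughout spa_load: mark the given vdev as unopenable with
 * the supplied aux state and return the error.
 */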
2081 static int
2082 spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
2083 {
2084         vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
2085         return (SET_ERROR(err));
2086 }
2087 
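/*
 * Start the auxiliary threads needed by a writeable pool; at the moment
 * this is only the indirect-vdev condensing thread.
 */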
2088 static void
2089 spa_spawn_aux_threads(spa_t *spa)
2090 {
2091         ASSERT(spa_writeable(spa));
2092 
2093         ASSERT(MUTEX_HELD(&spa_namespace_lock));
2094 
2095         spa_start_indirect_condensing_thread(spa);
2096 }
2097 
2098 /*
2099  * Fix up config after a partly-completed split.  This is done with the
2100  * ZPOOL_CONFIG_SPLIT nvlist.  Both the splitting pool and the split-off
2101  * pool have that entry in their config, but only the splitting one contains
2102  * a list of all the guids of the vdevs that are being split off.
2103  *
2104  * This function determines what to do with that list: either rejoin
2105  * all the disks to the pool, or complete the splitting process.  To attempt
2106  * the rejoin, each disk that is offlined is marked online again, and
2107  * we do a reopen() call.  If the vdev label for every disk that was
2108  * marked online indicates it was successfully split off (VDEV_AUX_SPLIT_POOL)
2109  * then we call vdev_split() on each disk, and complete the split.
2110  *
2111  * Otherwise we leave the config alone, with all the vdevs in place in
2112  * the original pool.
2113  */
2114 static void
2115 spa_try_repair(spa_t *spa, nvlist_t *config)
2116 {
2117         uint_t extracted;


2161                         ++extracted;
2162                 }
2163         }
2164 
2165         /*
2166          * If every disk has been moved to the new pool, or if we never
2167          * even attempted to look at them, then we split them off for
2168          * good.
2169          */
2170         if (!attempt_reopen || gcount == extracted) {
2171                 for (i = 0; i < gcount; i++)
2172                         if (vd[i] != NULL)
2173                                 vdev_split(vd[i]);
2174                 vdev_reopen(spa->spa_root_vdev);
2175         }
2176 
2177         kmem_free(vd, gcount * sizeof (vdev_t *));
2178 }
2179 
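/*
 * Load an existing storage pool: record the load state and timestamp,
 * delegate the real work to spa_load_impl(), and post an ereport if the
 * load fails.
 */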
2180 static int
2181 spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type)

2182 {

2183         char *ereport = FM_EREPORT_ZFS_POOL;

2184         int error;


2185 
2186         spa->spa_load_state = state;

2187 
2188         gethrestime(&spa->spa_loaded_ts);
2189         error = spa_load_impl(spa, type, &ereport, B_FALSE);


2190 
2191         /*
2192          * Don't count references from objsets that are already closed
2193          * and are making their way through the eviction process.
2194          */
2195         spa_evicting_os_wait(spa);
2196         spa->spa_minref = refcount_count(&spa->spa_refcount);
2197         if (error) {
2198                 if (error != EEXIST) {
2199                         spa->spa_loaded_ts.tv_sec = 0;
2200                         spa->spa_loaded_ts.tv_nsec = 0;
2201                 }
2202                 if (error != EBADF) {
2203                         zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
2204                 }
2205         }
2206         spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
2207         spa->spa_ena = 0;
2208 
2209         return (error);
2210 }
2211 
2212 /*
2213  * Count the number of per-vdev ZAPs associated with all of the vdevs in the
2214  * vdev tree rooted in the given vd, and ensure that each ZAP is present in the
2215  * spa's per-vdev ZAP list.
2216  */
2217 static uint64_t
2218 vdev_count_verify_zaps(vdev_t *vd)
2219 {
2220         spa_t *spa = vd->vdev_spa;
2221         uint64_t total = 0;
2222         if (vd->vdev_top_zap != 0) {
2223                 total++;
2224                 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2225                     spa->spa_all_vdev_zaps, vd->vdev_top_zap));
2226         }
2227         if (vd->vdev_leaf_zap != 0) {
2228                 total++;
2229                 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2230                     spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
2231         }
2232 
2233         for (uint64_t i = 0; i < vd->vdev_children; i++) {
2234                 total += vdev_count_verify_zaps(vd->vdev_child[i]);
2235         }
2236 
2237         return (total);
2238 }
2239 
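/*
 * Verify that this pool was last accessed by the local host: if the MOS
 * config records a different hostid, fail the load with EBADF.
 */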
2240 static int
2241 spa_verify_host(spa_t *spa, nvlist_t *mos_config)


2242 {
2243         uint64_t hostid;
2244         char *hostname;
2245         uint64_t myhostid = 0;
2246 
2247         if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config,
2248             ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2249                 hostname = fnvlist_lookup_string(mos_config,
2250                     ZPOOL_CONFIG_HOSTNAME);
2251 
2252                 myhostid = zone_get_hostid(NULL);
2253 
2254                 if (hostid != 0 && myhostid != 0 && hostid != myhostid) {
2255                         cmn_err(CE_WARN, "pool '%s' could not be "
2256                             "loaded as it was last accessed by "
2257                             "another system (host: %s hostid: 0x%llx). "
2258                             "See: http://illumos.org/msg/ZFS-8000-EY",
2259                             spa_name(spa), hostname, (u_longlong_t)hostid);
2260                         spa_load_failed(spa, "hostid verification failed: pool "
2261                             "last accessed by host: %s (hostid: 0x%llx)",
2262                             hostname, (u_longlong_t)hostid);
2263                         return (SET_ERROR(EBADF));
2264                 }
2265         }
2266 
2267         return (0);
2268 }
2269 
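/*
 * Parse the pool config provided by the caller into an in-core vdev tree
 * and perform basic sanity checks: the pool guid and vdev tree must be
 * present, and the guid must not collide with an already-imported pool.
 */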
2270 static int
2271 spa_ld_parse_config(spa_t *spa, spa_import_type_t type)
2272 {
2273         int error = 0;
2274         nvlist_t *nvtree, *nvl, *config = spa->spa_config;
2275         int parse;
2276         vdev_t *rvd;
2277         uint64_t pool_guid;
2278         char *comment;
2279 
2280         /*
2281          * Versioning wasn't explicitly added to the label until later, so if
2282          * it's not present treat it as the initial version.
2283          */
2284         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2285             &spa->spa_ubsync.ub_version) != 0)
2286                 spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
2287 
2288         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) {
2289                 spa_load_failed(spa, "invalid config provided: '%s' missing",
2290                     ZPOOL_CONFIG_POOL_GUID);
2291                 return (SET_ERROR(EINVAL));
2292         }
2293 
2294         if ((spa->spa_load_state == SPA_LOAD_IMPORT || spa->spa_load_state ==
2295             SPA_LOAD_TRYIMPORT) && spa_guid_exists(pool_guid, 0)) {
2296                 spa_load_failed(spa, "a pool with guid %llu is already open",
2297                     (u_longlong_t)pool_guid);
2298                 return (SET_ERROR(EEXIST));
2299         }
2300 
2301         spa->spa_config_guid = pool_guid;
2302 
2303         nvlist_free(spa->spa_load_info);
2304         spa->spa_load_info = fnvlist_alloc();
2305 
2306         ASSERT(spa->spa_comment == NULL);
2307         if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2308                 spa->spa_comment = spa_strdup(comment);
2309 
2310         (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2311             &spa->spa_config_txg);
2312 
2313         if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0)
2314                 spa->spa_config_splitting = fnvlist_dup(nvl);
2315 
2316         if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) {
2317                 spa_load_failed(spa, "invalid config provided: '%s' missing",
2318                     ZPOOL_CONFIG_VDEV_TREE);
2319                 return (SET_ERROR(EINVAL));
2320         }
2321 



2322         /*
2323          * Create "The Godfather" zio to hold all async IOs
2324          */
2325         spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
2326             KM_SLEEP);
2327         for (int i = 0; i < max_ncpus; i++) {
2328                 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
2329                     ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
2330                     ZIO_FLAG_GODFATHER);
2331         }
2332 
2333         /*
2334          * Parse the configuration into a vdev tree.  We explicitly set the
2335          * value that will be returned by spa_version() since parsing the
2336          * configuration requires knowing the version number.
2337          */
2338         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2339         parse = (type == SPA_IMPORT_EXISTING ?
2340             VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2341         error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse);
2342         spa_config_exit(spa, SCL_ALL, FTAG);
2343 
2344         if (error != 0) {
2345                 spa_load_failed(spa, "unable to parse config [error=%d]",
2346                     error);
2347                 return (error);
2348         }
2349 
2350         ASSERT(spa->spa_root_vdev == rvd);
2351         ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
2352         ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
2353 
2354         if (type != SPA_IMPORT_ASSEMBLE) {
2355                 ASSERT(spa_guid(spa) == pool_guid);
2356         }
2357 
2358         return (0);
2359 }
2360 
2361 /*
2362  * Recursively open all vdevs in the vdev tree. This function is called twice:
2363  * first with the untrusted config, then with the trusted config.
2364  */
2365 static int
2366 spa_ld_open_vdevs(spa_t *spa)
2367 {
2368         int error = 0;
2369 
2370         /*
2371          * spa_missing_tvds_allowed defines how many top-level vdevs can be
2372          * missing/unopenable for the root vdev to still be considered openable.
2373          */
2374         if (spa->spa_trust_config) {
2375                 spa->spa_missing_tvds_allowed = zfs_max_missing_tvds;
2376         } else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) {
2377                 spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile;
2378         } else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) {
2379                 spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan;
2380         } else {
2381                 spa->spa_missing_tvds_allowed = 0;
2382         }
2383 
2384         spa->spa_missing_tvds_allowed =
2385             MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed);
2386 
2387         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2388         error = vdev_open(spa->spa_root_vdev);
2389         spa_config_exit(spa, SCL_ALL, FTAG);
2390 
2391         if (spa->spa_missing_tvds != 0) {
2392                 spa_load_note(spa, "vdev tree has %lld missing top-level "
2393                     "vdevs.", (u_longlong_t)spa->spa_missing_tvds);
2394                 if (spa->spa_trust_config && (spa->spa_mode & FWRITE)) {
2395                         /*
2396                          * Although theoretically we could allow users to open
2397                          * incomplete pools in RW mode, we'd need to add a lot
2398                          * of extra logic (e.g. adjust pool space to account
2399                          * for missing vdevs).
2400                          * This limitation also prevents users from accidentally
2401                          * opening the pool in RW mode during data recovery and
2402                          * damaging it further.
2403                          */
2404                         spa_load_note(spa, "pools with missing top-level "
2405                             "vdevs can only be opened in read-only mode.");
2406                         error = SET_ERROR(ENXIO);
2407                 } else {
2408                         spa_load_note(spa, "current settings allow for maximum "
2409                             "%lld missing top-level vdevs at this stage.",
2410                             (u_longlong_t)spa->spa_missing_tvds_allowed);
2411                 }
2412         }
2413         if (error != 0) {
2414                 spa_load_failed(spa, "unable to open vdev tree [error=%d]",
2415                     error);
2416         }
2417         if (spa->spa_missing_tvds != 0 || error != 0)
2418                 vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2);
2419 
2420         return (error);
2421 }
2422 
2423 /*
2424  * We need to validate the vdev labels against the configuration that
2425  * we have in hand. This function is called twice: first with an untrusted
2426  * config, then with a trusted config. The validation is more strict when the
2427  * config is trusted.

2428  */
2429 static int
2430 spa_ld_validate_vdevs(spa_t *spa)
2431 {
2432         int error = 0;
2433         vdev_t *rvd = spa->spa_root_vdev;
2434 
2435         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2436         error = vdev_validate(rvd);
2437         spa_config_exit(spa, SCL_ALL, FTAG);
2438 
2439         if (error != 0) {
2440                 spa_load_failed(spa, "vdev_validate failed [error=%d]", error);
2441                 return (error);
2442         }
2443 
2444         if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) {
2445                 spa_load_failed(spa, "cannot open vdev tree after invalidating "
2446                     "some vdevs");
2447                 vdev_dbgmsg_print_tree(rvd, 2);
2448                 return (SET_ERROR(ENXIO));
2449         }
2450 
2451         return (0);
2452 }
2453 
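/*
 * Select the best uberblock from the vdev labels, make sure its version
 * and read-compatible features are supported, and initialize the in-core
 * txg bookkeeping from it.
 */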
2454 static int
2455 spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type)
2456 {
2457         vdev_t *rvd = spa->spa_root_vdev;
2458         nvlist_t *label;
2459         uberblock_t *ub = &spa->spa_uberblock;
2460 
2461         /*
2462          * Find the best uberblock.
2463          */
2464         vdev_uberblock_load(rvd, ub, &label);
2465 
2466         /*
2467          * If we weren't able to find a single valid uberblock, return failure.
2468          */
2469         if (ub->ub_txg == 0) {
2470                 nvlist_free(label);
2471                 spa_load_failed(spa, "no valid uberblock found");
2472                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
2473         }
2474 
2475         spa_load_note(spa, "using uberblock with txg=%llu",
2476             (u_longlong_t)ub->ub_txg);
2477 
2478         /*
2479          * If the pool has an unsupported version we can't open it.
2480          */
2481         if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
2482                 nvlist_free(label);
2483                 spa_load_failed(spa, "version %llu is not supported",
2484                     (u_longlong_t)ub->ub_version);
2485                 return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
2486         }
2487 
2488         if (ub->ub_version >= SPA_VERSION_FEATURES) {
2489                 nvlist_t *features;
2490 
2491                 /*
2492                  * If we weren't able to find what's necessary for reading the
2493                  * MOS in the label, return failure.
2494                  */
2495                 if (label == NULL) {
2496                         spa_load_failed(spa, "label config unavailable");
2497                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2498                             ENXIO));
2499                 }
2500 
2501                 if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ,
2502                     &features) != 0) {
2503                         nvlist_free(label);
2504                         spa_load_failed(spa, "invalid label: '%s' missing",
2505                             ZPOOL_CONFIG_FEATURES_FOR_READ);
2506                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2507                             ENXIO));
2508                 }
2509 
2510                 /*
2511                  * Update our in-core representation with the definitive values
2512                  * from the label.
2513                  */
2514                 nvlist_free(spa->spa_label_features);
2515                 VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0);
2516         }
2517 
2518         nvlist_free(label);
2519 
2520         /*
2521          * Look through entries in the label nvlist's features_for_read. If
2522          * there is a feature listed there which we don't understand then we
2523          * cannot open a pool.
2524          */
2525         if (ub->ub_version >= SPA_VERSION_FEATURES) {
2526                 nvlist_t *unsup_feat;
2527 
2528                 VERIFY(nvlist_alloc(&unsup_feat, NV_UNIQUE_NAME, KM_SLEEP) ==
2529                     0);
2530 
2531                 for (nvpair_t *nvp = nvlist_next_nvpair(spa->spa_label_features,
2532                     NULL); nvp != NULL;
2533                     nvp = nvlist_next_nvpair(spa->spa_label_features, nvp)) {
2534                         if (!zfeature_is_supported(nvpair_name(nvp))) {
2535                                 VERIFY(nvlist_add_string(unsup_feat,
2536                                     nvpair_name(nvp), "") == 0);
2537                         }
2538                 }
2539 
2540                 if (!nvlist_empty(unsup_feat)) {
2541                         VERIFY(nvlist_add_nvlist(spa->spa_load_info,
2542                             ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0);
2543                         nvlist_free(unsup_feat);
2544                         spa_load_failed(spa, "some features are unsupported");
2545                         return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2546                             ENOTSUP));
2547                 }
2548 
2549                 nvlist_free(unsup_feat);
2550         }
2551 
2552         if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
2553                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2554                 spa_try_repair(spa, spa->spa_config);
2555                 spa_config_exit(spa, SCL_ALL, FTAG);
2556                 nvlist_free(spa->spa_config_splitting);
2557                 spa->spa_config_splitting = NULL;
2558         }
2559 
2560         /*
2561          * Initialize internal SPA structures.
2562          */
2563         spa->spa_state = POOL_STATE_ACTIVE;
2564         spa->spa_ubsync = spa->spa_uberblock;
2565         spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
2566             TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
2567         spa->spa_first_txg = spa->spa_last_ubsync_txg ?
2568             spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
2569         spa->spa_claim_max_txg = spa->spa_first_txg;
2570         spa->spa_prev_software_version = ub->ub_software_version;
2571 
2572         return (0);
2573 }
2574 
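/*
 * Open the DSL pool rooted at the selected uberblock; this gives us
 * access to the MOS.
 */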
2575 static int
2576 spa_ld_open_rootbp(spa_t *spa)
2577 {
2578         int error = 0;
2579         vdev_t *rvd = spa->spa_root_vdev;
2580 
2581         error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
2582         if (error != 0) {
2583                 spa_load_failed(spa, "unable to open rootbp in dsl_pool_init "
2584                     "[error=%d]", error);
2585                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2586         }
2587         spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
2588 
2589         return (0);
2590 }
2591 
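/*
 * Retrieve the trusted copy of the pool config from the MOS, rebuild the
 * vdev tree from it, and re-open and validate the vdevs against this
 * trusted config.
 */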
2592 static int
2593 spa_ld_load_trusted_config(spa_t *spa, spa_import_type_t type,
2594     boolean_t reloading)
2595 {
2596         vdev_t *mrvd, *rvd = spa->spa_root_vdev;
2597         nvlist_t *nv, *mos_config, *policy;
2598         int error = 0, copy_error;
2599         uint64_t healthy_tvds, healthy_tvds_mos;
2600         uint64_t mos_config_txg;
2601 
2602         if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE)
2603             != 0)
2604                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2605 
2606         /*
2607          * If we're assembling a pool from a split, the config provided is
2608          * already trusted so there is nothing to do.
2609          */
2610         if (type == SPA_IMPORT_ASSEMBLE)
2611                 return (0);
2612 
2613         healthy_tvds = spa_healthy_core_tvds(spa);
2614 
2615         if (load_nvlist(spa, spa->spa_config_object, &mos_config)
2616             != 0) {
2617                 spa_load_failed(spa, "unable to retrieve MOS config");
2618                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2619         }
2620 
2621         /*
2622          * If we are doing an open, the pool owner hasn't been verified yet,
2623          * so do the verification here.
2624          */
2625         if (spa->spa_load_state == SPA_LOAD_OPEN) {
2626                 error = spa_verify_host(spa, mos_config);
2627                 if (error != 0) {
2628                         nvlist_free(mos_config);
2629                         return (error);
2630                 }
2631         }
2632 
2633         nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE);
2634 
2635         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2636 
2637         /*
2638          * Build a new vdev tree from the trusted config
2639          */
2640         VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
2641 
2642         /*
2643          * Vdev paths in the MOS may be obsolete. If the untrusted config was
2644          * obtained by scanning /dev/dsk, then it will have the right vdev
2645          * paths. We update the trusted MOS config with this information.
2646          * We first try to copy the paths with vdev_copy_path_strict, which
2647          * succeeds only when both configs have exactly the same vdev tree.
2648          * If that fails, we fall back to a more flexible method that has a
2649          * best effort policy.
2650          */
2651         copy_error = vdev_copy_path_strict(rvd, mrvd);
2652         if (copy_error != 0 || spa_load_print_vdev_tree) {
2653                 spa_load_note(spa, "provided vdev tree:");
2654                 vdev_dbgmsg_print_tree(rvd, 2);
2655                 spa_load_note(spa, "MOS vdev tree:");
2656                 vdev_dbgmsg_print_tree(mrvd, 2);
2657         }
2658         if (copy_error != 0) {
2659                 spa_load_note(spa, "vdev_copy_path_strict failed, falling "
2660                     "back to vdev_copy_path_relaxed");
2661                 vdev_copy_path_relaxed(rvd, mrvd);
2662         }
2663 
2664         vdev_close(rvd);
2665         vdev_free(rvd);
2666         spa->spa_root_vdev = mrvd;
2667         rvd = mrvd;
2668         spa_config_exit(spa, SCL_ALL, FTAG);
2669 
2670         /*
2671          * We will use spa_config if we decide to reload the spa or if spa_load
2672          * fails and we rewind. We must thus regenerate the config using the
2673          * MOS information with the updated paths. Rewind policy is an import
2674          * setting and is not in the MOS. We copy it over to our new, trusted
2675          * config.
2676          */
2677         mos_config_txg = fnvlist_lookup_uint64(mos_config,
2678             ZPOOL_CONFIG_POOL_TXG);
2679         nvlist_free(mos_config);
2680         mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE);
2681         if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_REWIND_POLICY,
2682             &policy) == 0)
2683                 fnvlist_add_nvlist(mos_config, ZPOOL_REWIND_POLICY, policy);
2684         spa_config_set(spa, mos_config);
2685         spa->spa_config_source = SPA_CONFIG_SRC_MOS;
2686 
2687         /*
2688          * Now that we got the config from the MOS, we should be more strict
2689          * in checking blkptrs and can make assumptions about the consistency
2690          * of the vdev tree. spa_trust_config must be set to true before opening
2691          * vdevs in order for them to be writeable.
2692          */
2693         spa->spa_trust_config = B_TRUE;
2694 
2695         /*
2696          * Open and validate the new vdev tree
2697          */
2698         error = spa_ld_open_vdevs(spa);
2699         if (error != 0)
2700                 return (error);
2701 
2702         error = spa_ld_validate_vdevs(spa);
2703         if (error != 0)
2704                 return (error);
2705 
2706         if (copy_error != 0 || spa_load_print_vdev_tree) {
2707                 spa_load_note(spa, "final vdev tree:");
2708                 vdev_dbgmsg_print_tree(rvd, 2);
2709         }
2710 
2711         if (spa->spa_load_state != SPA_LOAD_TRYIMPORT &&
2712             !spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) {
2713                 /*
2714                  * Sanity check to make sure that we are indeed loading the
2715                  * latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds
2716                  * in the config provided and they happened to be the only ones
2717                  * to have the latest uberblock, we could involuntarily perform
2718                  * an extreme rewind.
2719                  */
2720                 healthy_tvds_mos = spa_healthy_core_tvds(spa);
2721                 if (healthy_tvds_mos - healthy_tvds >=
2722                     SPA_SYNC_MIN_VDEVS) {
2723                         spa_load_note(spa, "config provided misses too many "
2724                             "top-level vdevs compared to MOS (%lld vs %lld). ",
2725                             (u_longlong_t)healthy_tvds,
2726                             (u_longlong_t)healthy_tvds_mos);
2727                         spa_load_note(spa, "vdev tree:");
2728                         vdev_dbgmsg_print_tree(rvd, 2);
2729                         if (reloading) {
2730                                 spa_load_failed(spa, "config was already "
2731                                     "provided from MOS. Aborting.");
2732                                 return (spa_vdev_err(rvd,
2733                                     VDEV_AUX_CORRUPT_DATA, EIO));
2734                         }
2735                         spa_load_note(spa, "spa must be reloaded using MOS "
2736                             "config");
2737                         return (SET_ERROR(EAGAIN));
2738                 }
2739         }
2740 
2741         error = spa_check_for_missing_logs(spa);
2742         if (error != 0)
2743                 return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2744 
2745         if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) {
2746                 spa_load_failed(spa, "uberblock guid sum doesn't match MOS "
2747                     "guid sum (%llu != %llu)",
2748                     (u_longlong_t)spa->spa_uberblock.ub_guid_sum,
2749                     (u_longlong_t)rvd->vdev_guid_sum);
2750                 return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
2751                     ENXIO));
2752         }
2753 
2754         return (0);
2755 }
2756 
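/*
 * Load the metadata needed for device removal: the indirect vdev mappings
 * and the state of any in-progress condense operation.
 */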
2757 static int
2758 spa_ld_open_indirect_vdev_metadata(spa_t *spa)
2759 {
2760         int error = 0;
2761         vdev_t *rvd = spa->spa_root_vdev;
2762 
2763         /*
2764          * Everything that we read before spa_remove_init() must be stored
2765          * on concrete vdevs.  Therefore we do this as early as possible.
2766          */
2767         error = spa_remove_init(spa);
2768         if (error != 0) {
2769                 spa_load_failed(spa, "spa_remove_init failed [error=%d]",
2770                     error);
2771                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2772         }
2773 
2774         /*
2775          * Retrieve information needed to condense indirect vdev mappings.
2776          */
2777         error = spa_condense_init(spa);
2778         if (error != 0) {
2779                 spa_load_failed(spa, "spa_condense_init failed [error=%d]",
2780                     error);
2781                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
2782         }
2783 
2784         return (0);
2785 }
2786 
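/*
 * Check the pool's feature flags against what this software supports,
 * record enabled and unsupported features in spa_load_info, and cache the
 * feature refcounts.
 */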
2787 static int
2788 spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep)
2789 {
2790         int error = 0;
2791         vdev_t *rvd = spa->spa_root_vdev;
2792 
2793         if (spa_version(spa) >= SPA_VERSION_FEATURES) {
2794                 boolean_t missing_feat_read = B_FALSE;
2795                 nvlist_t *unsup_feat, *enabled_feat;
2796 
2797                 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
2798                     &spa->spa_feat_for_read_obj, B_TRUE) != 0) {
2799                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2800                 }
2801 
2802                 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
2803                     &spa->spa_feat_for_write_obj, B_TRUE) != 0) {
2804                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2805                 }
2806 
2807                 if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
2808                     &spa->spa_feat_desc_obj, B_TRUE) != 0) {
2809                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2810                 }
2811 
2812                 enabled_feat = fnvlist_alloc();
2813                 unsup_feat = fnvlist_alloc();
2814 
2815                 if (!spa_features_check(spa, B_FALSE,
2816                     unsup_feat, enabled_feat))
2817                         missing_feat_read = B_TRUE;
2818 
2819                 if (spa_writeable(spa) ||
2820                     spa->spa_load_state == SPA_LOAD_TRYIMPORT) {
2821                         if (!spa_features_check(spa, B_TRUE,
2822                             unsup_feat, enabled_feat)) {
2823                                 *missing_feat_writep = B_TRUE;
2824                         }
2825                 }
2826 
2827                 fnvlist_add_nvlist(spa->spa_load_info,
2828                     ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
2829 
2830                 if (!nvlist_empty(unsup_feat)) {
2831                         fnvlist_add_nvlist(spa->spa_load_info,
2832                             ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
2833                 }
2834 
2835                 fnvlist_free(enabled_feat);
2836                 fnvlist_free(unsup_feat);
2837 
2838                 if (!missing_feat_read) {
2839                         fnvlist_add_boolean(spa->spa_load_info,
2840                             ZPOOL_CONFIG_CAN_RDONLY);
2841                 }
2842 
2843                 /*
2844                  * If the state is SPA_LOAD_TRYIMPORT, our objective is
2845                  * twofold: to determine whether the pool is available for
2846                  * import in read-write mode and (if it is not) whether the
2847                  * pool is available for import in read-only mode. If the pool
2848                  * is available for import in read-write mode, it is displayed
2849                  * as available in userland; if it is not available for import
2850                  * in read-only mode, it is displayed as unavailable in
2851                  * userland. If the pool is available for import in read-only
2852                  * mode but not read-write mode, it is displayed as unavailable
2853                  * in userland with a special note that the pool is actually
2854                  * available for open in read-only mode.
2855                  *
2856                  * As a result, if the state is SPA_LOAD_TRYIMPORT and we are
2857                  * missing a feature for write, we must first determine whether
2858                  * the pool can be opened read-only before returning to
2859                  * userland in order to know whether to display the
2860                  * abovementioned note.
2861                  */
2862                 if (missing_feat_read || (*missing_feat_writep &&
2863                     spa_writeable(spa))) {
2864                         spa_load_failed(spa, "pool uses unsupported features");
2865                         return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2866                             ENOTSUP));
2867                 }
2868 
2869                 /*
2870                  * Load refcounts for ZFS features from disk into an in-memory
2871                  * cache during SPA initialization.
2872                  */
2873                 for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
2874                         uint64_t refcount;
2875 
2876                         error = feature_get_refcount_from_disk(spa,
2877                             &spa_feature_table[i], &refcount);
2878                         if (error == 0) {
2879                                 spa->spa_feat_refcount_cache[i] = refcount;
2880                         } else if (error == ENOTSUP) {
2881                                 spa->spa_feat_refcount_cache[i] =
2882                                     SPA_FEATURE_DISABLED;
2883                         } else {
2884                                 spa_load_failed(spa, "error getting refcount "
2885                                     "for feature %s [error=%d]",
2886                                     spa_feature_table[i].fi_guid, error);
2887                                 return (spa_vdev_err(rvd,
2888                                     VDEV_AUX_CORRUPT_DATA, EIO));
2889                         }
2890                 }
2891         }
2892 
2893         if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
2894                 if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
2895                     &spa->spa_feat_enabled_txg_obj, B_TRUE) != 0)
2896                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2897         }
2898 
2899         return (0);
2900 }
2901 
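/*
 * Finish opening the DSL pool and its internal (special) directories.
 */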
2902 static int
2903 spa_ld_load_special_directories(spa_t *spa)
2904 {
2905         int error = 0;
2906         vdev_t *rvd = spa->spa_root_vdev;
2907 
2908         spa->spa_is_initializing = B_TRUE;
2909         error = dsl_pool_open(spa->spa_dsl_pool);
2910         spa->spa_is_initializing = B_FALSE;
2911         if (error != 0) {
2912                 spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error);
2913                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2914         }
2915 
2916         return (0);
2917 }


2918 
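/*
 * Load pool-wide state from the MOS: the checksum salt, deferred-free
 * bpobj, deflate flag, error logs, history object, per-vdev ZAP map and
 * the pool properties object.
 */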
2919 static int
2920 spa_ld_get_props(spa_t *spa)
2921 {
2922         int error = 0;
2923         uint64_t obj;
2924         vdev_t *rvd = spa->spa_root_vdev;
2925 
2926         /* Grab the secret checksum salt from the MOS. */
2927         error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2928             DMU_POOL_CHECKSUM_SALT, 1,
2929             sizeof (spa->spa_cksum_salt.zcs_bytes),
2930             spa->spa_cksum_salt.zcs_bytes);
2931         if (error == ENOENT) {
2932                 /* Generate a new salt for subsequent use */
2933                 (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
2934                     sizeof (spa->spa_cksum_salt.zcs_bytes));
2935         } else if (error != 0) {
2936                 spa_load_failed(spa, "unable to retrieve checksum salt from "
2937                     "MOS [error=%d]", error);
2938                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2939         }
2940 
2941         if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0)
2942                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2943         error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
2944         if (error != 0) {
2945                 spa_load_failed(spa, "error opening deferred-frees bpobj "
2946                     "[error=%d]", error);
2947                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2948         }
2949 
2950         /*
2951          * Load the bit that tells us to use the new accounting function
2952          * (raid-z deflation).  If we have an older pool, this will not
2953          * be present.
2954          */
2955         error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE);
2956         if (error != 0 && error != ENOENT)
2957                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2958 
2959         error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
2960             &spa->spa_creation_version, B_FALSE);
2961         if (error != 0 && error != ENOENT)
2962                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2963 
2964         /*
2965          * Load the persistent error log.  If we have an older pool, this will
2966          * not be present.
2967          */
2968         error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last,
2969             B_FALSE);
2970         if (error != 0 && error != ENOENT)
2971                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2972 
2973         error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
2974             &spa->spa_errlog_scrub, B_FALSE);
2975         if (error != 0 && error != ENOENT)
2976                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2977 
2978         /*
2979          * Load the history object.  If we have an older pool, this
2980          * will not be present.
2981          */
2982         error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE);
2983         if (error != 0 && error != ENOENT)
2984                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2985 
2986         /*
2987          * Load the per-vdev ZAP map. If we have an older pool, this will not
2988          * be present; in this case, defer its creation to a later time to
2989          * avoid dirtying the MOS this early, outside of sync context.  See
2990          * spa_sync_config_object.
2991          */
2992 
2993         /* The sentinel is only available in the MOS config. */
2994         nvlist_t *mos_config;
2995         if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) {
2996                 spa_load_failed(spa, "unable to retrieve MOS config");
2997                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2998         }
2999 
3000         error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
3001             &spa->spa_all_vdev_zaps, B_FALSE);
3002 
3003         if (error == ENOENT) {
3004                 VERIFY(!nvlist_exists(mos_config,
3005                     ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
3006                 spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
3007                 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3008         } else if (error != 0) {
3009                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3010         } else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
3011                 /*
3012                  * An older version of ZFS overwrote the sentinel value, so
3013                  * we have orphaned per-vdev ZAPs in the MOS. Defer their
3014                  * destruction to later; see spa_sync_config_object.
3015                  */
3016                 spa->spa_avz_action = AVZ_ACTION_DESTROY;
3017                 /*
3018                  * We're assuming that no vdevs have had their ZAPs created
3019                  * before this. Better be sure of it.
3020                  */
3021                 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3022         }
3023         nvlist_free(mos_config);
3024 
3025         spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
3026 
3027         error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object,
3028             B_FALSE);
3029         if (error && error != ENOENT)
3030                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3031 
3032         if (error == 0) {
3033                 uint64_t autoreplace;
3034 
3035                 spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
3036                 spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
3037                 spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
3038                 spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
3039                 spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
3040                 spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
3041                     &spa->spa_dedup_ditto);
3042 
3043                 spa->spa_autoreplace = (autoreplace != 0);
3044         }
3045 
3046         /*
3047          * If we are importing a pool with missing top-level vdevs,
3048          * we enforce that the pool doesn't panic or get suspended on
3049          * error since the likelihood of missing data is extremely high.
3050          */
3051         if (spa->spa_missing_tvds > 0 &&
3052             spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE &&
3053             spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3054                 spa_load_note(spa, "forcing failmode to 'continue' "
3055                     "as some top level vdevs are missing");
3056                 spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE;
3057         }
3058 
3059         return (0);
3060 }
3061 
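/*
 * Load and open the auxiliary vdevs (hot spares and l2cache devices)
 * recorded in the MOS.
 */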
3062 static int
3063 spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type)
3064 {
3065         int error = 0;
3066         vdev_t *rvd = spa->spa_root_vdev;
3067 
3068         /*
3069          * If we're assembling the pool from the split-off vdevs of
3070          * an existing pool, we don't want to attach the spares & cache
3071          * devices.
3072          */
3073 
3074         /*
3075          * Load any hot spares for this pool.
3076          */
3077         error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object,
3078             B_FALSE);
3079         if (error != 0 && error != ENOENT)
3080                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3081         if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3082                 ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
3083                 if (load_nvlist(spa, spa->spa_spares.sav_object,
3084                     &spa->spa_spares.sav_config) != 0) {
3085                         spa_load_failed(spa, "error loading spares nvlist");
3086                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3087                 }
3088 
3089                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3090                 spa_load_spares(spa);
3091                 spa_config_exit(spa, SCL_ALL, FTAG);
3092         } else if (error == 0) {
3093                 spa->spa_spares.sav_sync = B_TRUE;
3094         }
3095 
3096         /*
3097          * Load any level 2 ARC devices for this pool.
3098          */
3099         error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
3100             &spa->spa_l2cache.sav_object, B_FALSE);
3101         if (error != 0 && error != ENOENT)
3102                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3103         if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3104                 ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
3105                 if (load_nvlist(spa, spa->spa_l2cache.sav_object,
3106                     &spa->spa_l2cache.sav_config) != 0) {
3107                         spa_load_failed(spa, "error loading l2cache nvlist");
3108                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3109                 }
3110 
3111                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3112                 spa_load_l2cache(spa);
3113                 spa_config_exit(spa, SCL_ALL, FTAG);
3114         } else if (error == 0) {
3115                 spa->spa_l2cache.sav_sync = B_TRUE;
3116         }
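        /*
         * In the SPA_IMPORT_ASSEMBLE case above (a pool assembled from
         * split-off vdevs), the spares and l2cache objects may exist in
         * the MOS but are deliberately not attached; setting sav_sync
         * makes the next sync write out an empty aux list for the new
         * pool instead of inheriting the original pool's spares and
         * cache devices.
         */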
3117 
3118         return (0);
3119 }
3120 
3121 static int
3122 spa_ld_load_vdev_metadata(spa_t *spa)
3123 {
3124         int error = 0;
3125         vdev_t *rvd = spa->spa_root_vdev;
3126 
3127         /*
3128          * If the 'autoreplace' property is set, then post a resource notifying
3129          * the ZFS DE that it should not issue any faults for unopenable
3130          * devices.  We also iterate over the vdevs, and post a sysevent for any
3131          * unopenable vdevs so that the normal autoreplace handler can take
3132          * over.
3133          */
3134         if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3135                 spa_check_removed(spa->spa_root_vdev);
3136                 /*
3137                  * For the import case, this is done in spa_import(), because
3138                  * at this point we're using the spare definitions from
3139                  * the MOS config, not necessarily from the userland config.
3140                  */
3141                 if (spa->spa_load_state != SPA_LOAD_IMPORT) {
3142                         spa_aux_check_removed(&spa->spa_spares);
3143                         spa_aux_check_removed(&spa->spa_l2cache);
3144                 }
3145         }
3146 
3147         /*
3148          * Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc.
3149          */
3150         error = vdev_load(rvd);
3151         if (error != 0) {
3152                 spa_load_failed(spa, "vdev_load failed [error=%d]", error);
3153                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
3154         }
3155 
3156         /*
3157          * Propagate the leaf DTLs we just loaded all the way up the vdev tree.
3158          */
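        /*
         * Passing txg == 0 and scrub_txg == 0 (with scrub_done == B_FALSE)
         * credits no scrub completion; this pass only recomputes and
         * propagates the per-vdev DTLs that vdev_load() just read in.
         */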
3159         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3160         vdev_dtl_reassess(rvd, 0, 0, B_FALSE);
3161         spa_config_exit(spa, SCL_ALL, FTAG);
3162 
3163         return (0);
3164 }
3165 
3166 static int
3167 spa_ld_load_dedup_tables(spa_t *spa)
3168 {
3169         int error = 0;
3170         vdev_t *rvd = spa->spa_root_vdev;
3171 
3172         error = ddt_load(spa);
3173         if (error != 0) {
3174                 spa_load_failed(spa, "ddt_load failed [error=%d]", error);
3175                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3176         }
3177 
3178         return (0);
3179 }
3180 
3181 static int
3182 spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, char **ereport)
3183 {
3184         vdev_t *rvd = spa->spa_root_vdev;
3185 
3186         if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) {
3187                 boolean_t missing = spa_check_logs(spa);
3188                 if (missing) {
3189                         if (spa->spa_missing_tvds != 0) {
3190                                 spa_load_note(spa, "spa_check_logs failed "
3191                                     "so dropping the logs");
3192                         } else {
3193                                 *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3194                                 spa_load_failed(spa, "spa_check_logs failed");
3195                                 return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG,
3196                                     ENXIO));
3197                         }
3198                 }
3199         }
3200 
3201         return (0);
3202 }
3203 
3204 static int
3205 spa_ld_verify_pool_data(spa_t *spa)
3206 {
3207         int error = 0;
3208         vdev_t *rvd = spa->spa_root_vdev;
3209 
3210         /*
3211          * We've successfully opened the pool; verify that we're ready
3212          * to start pushing transactions.
3213          */
3214         if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3215                 error = spa_load_verify(spa);
3216                 if (error != 0) {
3217                         spa_load_failed(spa, "spa_load_verify failed "
3218                             "[error=%d]", error);
3219                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3220                             error));
3221                 }
3222         }
3223 
3224         return (0);
3225 }
3226 
3227 static void
3228 spa_ld_claim_log_blocks(spa_t *spa)
3229 {
3230         dmu_tx_t *tx;
3231         dsl_pool_t *dp = spa_get_dsl(spa);
3232 
3233         /*
3234          * Claim log blocks that haven't been committed yet.
3235          * This must all happen in a single txg.
3236          * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3237          * invoked from zil_claim_log_block()'s i/o done callback.
3238          * Price of rollback is that we abandon the log.
3239          */
3240         spa->spa_claiming = B_TRUE;
3241 
3242         tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3243         (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3244             zil_claim, tx, DS_FIND_CHILDREN);
3245         dmu_tx_commit(tx);
3246 
3247         spa->spa_claiming = B_FALSE;
3248 
3249         spa_set_log_state(spa, SPA_LOG_GOOD);
3250 }
3251 
3252 static void
3253 spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg,
3254     boolean_t reloading)
3255 {
3256         vdev_t *rvd = spa->spa_root_vdev;
3257         int need_update = B_FALSE;
3258 
3259         /*
3260          * If the config cache is stale, or we have uninitialized
3261          * metaslabs (see spa_vdev_add()), then update the config.
3262          *
3263          * If this is a verbatim import, trust the current
3264          * in-core spa_config and update the disk labels.
3265          */
3266         if (reloading || config_cache_txg != spa->spa_config_txg ||
3267             spa->spa_load_state == SPA_LOAD_IMPORT ||
3268             spa->spa_load_state == SPA_LOAD_RECOVER ||
3269             (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3270                 need_update = B_TRUE;
3271 
3272         for (int c = 0; c < rvd->vdev_children; c++)
3273                 if (rvd->vdev_child[c]->vdev_ms_array == 0)
3274                         need_update = B_TRUE;
3275 
3276         /*
3277          * Update the config cache asynchronously in case we're the
3278          * root pool, in which case the config cache isn't writable yet.
3279          */
3280         if (need_update)
3281                 spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3282 }
3283 
3284 static void
3285 spa_ld_prepare_for_reload(spa_t *spa)
3286 {
3287         int mode = spa->spa_mode;
3288         int async_suspended = spa->spa_async_suspended;
3289 
3290         spa_unload(spa);
3291         spa_deactivate(spa);
3292         spa_activate(spa, mode);
3293 
3294         /*
3295          * We save the value of spa_async_suspended as it gets reset to 0 by
3296          * spa_unload(). We want to restore it back to the original value before
3297          * returning as we might be calling spa_async_resume() later.
3298          */
3299         spa->spa_async_suspended = async_suspended;
3300 }
3301 
3302 /*
3303  * Load an existing storage pool, using the config provided. This config
3304  * describes which vdevs are part of the pool and is later validated against
3305  * partial configs present in each vdev's label and an entire copy of the
3306  * config stored in the MOS.
3307  */
3308 static int
3309 spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
3310     boolean_t reloading)
3311 {
3312         int error = 0;
3313         boolean_t missing_feat_write = B_FALSE;
3314 
3315         ASSERT(MUTEX_HELD(&spa_namespace_lock));
3316         ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
3317 
3318         /*
3319          * Never trust the config that is provided unless we are assembling
3320          * a pool following a split.
3321          * This means don't trust blkptrs and the vdev tree in general. This
3322          * also effectively puts the spa in read-only mode since
3323          * spa_writeable() checks for spa_trust_config to be true.
3324          * We will later load a trusted config from the MOS.
3325          */
3326         if (type != SPA_IMPORT_ASSEMBLE)
3327                 spa->spa_trust_config = B_FALSE;
3328 
3329         if (reloading)
3330                 spa_load_note(spa, "RELOADING");
3331         else
3332                 spa_load_note(spa, "LOADING");
3333 
3334         /*
3335          * Parse the config provided to create a vdev tree.
3336          */
3337         error = spa_ld_parse_config(spa, type);
3338         if (error != 0)
3339                 return (error);
3340 
3341         /*
3342          * Now that we have the vdev tree, try to open each vdev. This involves
3343          * opening the underlying physical device, retrieving its geometry and
3344          * probing the vdev with a dummy I/O. The state of each vdev will be set
3345          * based on the success of those operations. After this we'll be ready
3346          * to read from the vdevs.
3347          */
3348         error = spa_ld_open_vdevs(spa);
3349         if (error != 0)
3350                 return (error);
3351 
3352         /*
3353          * Read the label of each vdev and make sure that the GUIDs stored
3354          * there match the GUIDs in the config provided.
3355          * If we're assembling a new pool that's been split off from an
3356          * existing pool, the labels haven't yet been updated so we skip
3357          * validation for now.
3358          */
3359         if (type != SPA_IMPORT_ASSEMBLE) {
3360                 error = spa_ld_validate_vdevs(spa);
3361                 if (error != 0)
3362                         return (error);
3363         }
3364 
3365         /*
3366          * Read vdev labels to find the best uberblock (i.e. latest, unless
3367          * spa_load_max_txg is set) and store it in spa_uberblock. We get the
3368          * list of features required to read blkptrs in the MOS from the vdev
3369          * label with the best uberblock and verify that our version of zfs
3370          * supports them all.
3371          */
3372         error = spa_ld_select_uberblock(spa, type);
3373         if (error != 0)
3374                 return (error);
3375 
3376         /*
3377          * Pass that uberblock to the dsl_pool layer which will open the root
3378          * blkptr. This blkptr points to the latest version of the MOS and will
3379          * allow us to read its contents.
3380          */
3381         error = spa_ld_open_rootbp(spa);
3382         if (error != 0)
3383                 return (error);
3384 
3385         /*
3386          * Retrieve the trusted config stored in the MOS and use it to create
3387          * a new, exact version of the vdev tree, then reopen all vdevs.
3388          */
3389         error = spa_ld_load_trusted_config(spa, type, reloading);
3390         if (error == EAGAIN) {
3391                 VERIFY(!reloading);
3392                 /*
3393                  * Redo the loading process with the trusted config if it is
3394                  * too different from the untrusted config.
3395                  */
3396                 spa_ld_prepare_for_reload(spa);
3397                 return (spa_load_impl(spa, type, ereport, B_TRUE));
3398         } else if (error != 0) {
3399                 return (error);
3400         }
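        /*
         * The EAGAIN above comes from spa_ld_load_trusted_config() when the
         * trusted config found in the MOS differs too much from the
         * untrusted one we were given; the whole load is then redone with
         * reloading set, and VERIFY(!reloading) guarantees we restart at
         * most once.
         */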
3401 
3402         /*
3403          * Retrieve the mapping of indirect vdevs. Those vdevs were removed
3404          * from the pool and their contents were re-mapped to other vdevs. Note
3405          * that everything that we read before this step must have been
3406          * rewritten on concrete vdevs after the last device removal was
3407          * initiated. Otherwise we could be reading from indirect vdevs before
3408          * we have loaded their mappings.
3409          */
3410         error = spa_ld_open_indirect_vdev_metadata(spa);
3411         if (error != 0)
3412                 return (error);
3413 
3414         /*
3415          * Retrieve the full list of active features from the MOS and check if
3416          * they are all supported.
3417          */
3418         error = spa_ld_check_features(spa, &missing_feat_write);
3419         if (error != 0)
3420                 return (error);
3421 
3422         /*
3423          * Load several special directories from the MOS needed by the dsl_pool
3424          * layer.
3425          */
3426         error = spa_ld_load_special_directories(spa);
3427         if (error != 0)
3428                 return (error);
3429 
3430         /*
3431          * Retrieve pool properties from the MOS.
3432          */
3433         error = spa_ld_get_props(spa);
3434         if (error != 0)
3435                 return (error);
3436 
3437         /*
3438          * Retrieve the list of auxiliary devices - cache devices and spares -
3439          * and open them.
3440          */
3441         error = spa_ld_open_aux_vdevs(spa, type);
3442         if (error != 0)
3443                 return (error);
3444 
3445         /*
3446          * Load the metadata for all vdevs. Also check if unopenable devices
3447          * should be autoreplaced.
3448          */
3449         error = spa_ld_load_vdev_metadata(spa);
3450         if (error != 0)
3451                 return (error);
3452 
3453         error = spa_ld_load_dedup_tables(spa);
3454         if (error != 0)
3455                 return (error);
3456 
3457         /*
3458          * Verify the logs now to make sure we don't have any unexpected errors
3459          * when we claim log blocks later.
3460          */
3461         error = spa_ld_verify_logs(spa, type, ereport);
3462         if (error != 0)
3463                 return (error);
3464 
3465         if (missing_feat_write) {
3466                 ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT);
3467 
3468                 /*
3469                  * At this point, we know that we can open the pool in
3470                  * read-only mode but not read-write mode. We now have enough
3471                  * information and can return to userland.
3472                  */
3473                 return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT,
3474                     ENOTSUP));
3475         }
3476 
3477         /*
3478          * Traverse the last txgs to make sure the pool was left off in a safe
3479          * state. When performing an extreme rewind, we verify the whole pool,
3480          * which can take a very long time.
3481          */
3482         error = spa_ld_verify_pool_data(spa);
3483         if (error != 0)
3484                 return (error);
3485 
3486         /*
3487          * Calculate the deflated space for the pool. This must be done before
3488          * we write anything to the pool because we'd need to update the space
3489          * accounting using the deflated sizes.
3490          */
3491         spa_update_dspace(spa);
3492 
3493         /*
3494          * We have now retrieved all the information we needed to open the
3495          * pool. If we are importing the pool in read-write mode, a few
3496          * additional steps must be performed to finish the import.
3497          */
3498         if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER ||
3499             spa->spa_load_max_txg == UINT64_MAX)) {
3500                 uint64_t config_cache_txg = spa->spa_config_txg;
3501 
3502                 ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT);
3503 
3504                 /*
3505                  * Traverse the ZIL and claim all blocks.
3506                  */
3507                 spa_ld_claim_log_blocks(spa);
3508 
3509                 /*
3510                  * Kick-off the syncing thread.
3511                  */
3512                 spa->spa_sync_on = B_TRUE;
3513                 txg_sync_start(spa->spa_dsl_pool);
3514 
3515                 /*
3516                  * Wait for all claims to sync.  We sync up to the highest
3517                  * claimed log block birth time so that claimed log blocks
3518                  * don't appear to be from the future.  spa_claim_max_txg
3519                  * will have been set for us by ZIL traversal operations
3520                  * performed above.
3521                  */
3522                 txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
3523 
3524                 /*
3525                  * Check if we need to request an update of the config. On the
3526                  * next sync, we would update the config stored in vdev labels
3527                  * and the cachefile (by default /etc/zfs/zpool.cache).
3528                  */
3529                 spa_ld_check_for_config_update(spa, config_cache_txg,
3530                     reloading);
3531 
3532                 /*
3533                  * Check all DTLs to see if anything needs resilvering.
3534                  */
3535                 if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
3536                     vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
3537                         spa_async_request(spa, SPA_ASYNC_RESILVER);
3538 
3539                 /*
3540                  * Log the fact that we booted up (so that we can detect if
3541                  * we rebooted in the middle of an operation).
3542                  */
3543                 spa_history_log_version(spa, "open");
3544 
3545                 /*
3546                  * Delete any inconsistent datasets.
3547                  */
3548                 (void) dmu_objset_find(spa_name(spa),
3549                     dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN);
3550 
3551                 /*
3552                  * Clean up any stale temporary dataset userrefs.
3553                  */
3554                 dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
3555 
3556                 spa_restart_removal(spa);
3557 
3558                 spa_spawn_aux_threads(spa);
3559         }
3560 
3561         spa_load_note(spa, "LOADED");
3562 
3563         return (0);
3564 }
3565 
3566 static int
3567 spa_load_retry(spa_t *spa, spa_load_state_t state)
3568 {
3569         int mode = spa->spa_mode;
3570 
3571         spa_unload(spa);
3572         spa_deactivate(spa);
3573 
3574         spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
3575 
3576         spa_activate(spa, mode);
3577         spa_async_suspend(spa);
3578 
3579         spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
3580             (u_longlong_t)spa->spa_load_max_txg);
3581 
3582         return (spa_load(spa, state, SPA_IMPORT_EXISTING));
3583 }
3584 
3585 /*
3586  * If spa_load() fails this function will try loading prior txg's. If
3587  * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
3588  * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
3589  * function will not rewind the pool and will return the same error as
3590  * spa_load().
3591  */
3592 static int
3593 spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request,
3594     int rewind_flags)
3595 {
3596         nvlist_t *loadinfo = NULL;
3597         nvlist_t *config = NULL;
3598         int load_error, rewind_error;
3599         uint64_t safe_rewind_txg;
3600         uint64_t min_txg;
3601 
3602         if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
3603                 spa->spa_load_max_txg = spa->spa_load_txg;
3604                 spa_set_log_state(spa, SPA_LOG_CLEAR);
3605         } else {
3606                 spa->spa_load_max_txg = max_request;
3607                 if (max_request != UINT64_MAX)
3608                         spa->spa_extreme_rewind = B_TRUE;
3609         }
3610 
3611         load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING);
3612         if (load_error == 0)
3613                 return (0);
3614 
3615         if (spa->spa_root_vdev != NULL)
3616                 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3617 
3618         spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
3619         spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
3620 
3621         if (rewind_flags & ZPOOL_NEVER_REWIND) {
3622                 nvlist_free(config);
3623                 return (load_error);
3624         }
3625 
3626         if (state == SPA_LOAD_RECOVER) {
3627                 /* Price of rolling back is discarding txgs, including log */
3628                 spa_set_log_state(spa, SPA_LOG_CLEAR);
3629         } else {
3630                 /*
3631                  * If we aren't rolling back, save the load info from our first
3632                  * import attempt so that we can restore it after attempting
3633                  * to rewind.
3634                  */
3635                 loadinfo = spa->spa_load_info;
3636                 spa->spa_load_info = fnvlist_alloc();
3637         }
3638 
3639         spa->spa_load_max_txg = spa->spa_last_ubsync_txg;
3640         safe_rewind_txg = spa->spa_last_ubsync_txg - TXG_DEFER_SIZE;
3641         min_txg = (rewind_flags & ZPOOL_EXTREME_REWIND) ?
3642             TXG_INITIAL : safe_rewind_txg;
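        /*
         * For example, with a last-synced uberblock at txg 1000 and
         * TXG_DEFER_SIZE of 2, safe_rewind_txg is 998: a normal rewind will
         * not walk back past txg 998, while ZPOOL_EXTREME_REWIND allows
         * rewinding toward TXG_INITIAL and sets spa_extreme_rewind, which
         * makes spa_load_verify() traverse the whole pool.
         */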
3643 
3644         /*
3645          * Continue as long as we're finding errors, we're still within
3646          * the acceptable rewind range, and we're still finding uberblocks
3647          */
3648         while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
3649             spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
3650                 if (spa->spa_load_max_txg < safe_rewind_txg)
3651                         spa->spa_extreme_rewind = B_TRUE;
3652                 rewind_error = spa_load_retry(spa, state);
3653         }
3654 
3655         spa->spa_extreme_rewind = B_FALSE;
3656         spa->spa_load_max_txg = UINT64_MAX;
3657 
3658         if (config && (rewind_error || state != SPA_LOAD_RECOVER))
3659                 spa_config_set(spa, config);
3660         else
3661                 nvlist_free(config);
3662 
3663         if (state == SPA_LOAD_RECOVER) {
3664                 ASSERT3P(loadinfo, ==, NULL);
3665                 return (rewind_error);
3666         } else {
3667                 /* Store the rewind info as part of the initial load info */
3668                 fnvlist_add_nvlist(loadinfo, ZPOOL_CONFIG_REWIND_INFO,
3669                     spa->spa_load_info);
3670 
3671                 /* Restore the initial load info */
3672                 fnvlist_free(spa->spa_load_info);
3673                 spa->spa_load_info = loadinfo;
3674 
3675                 return (load_error);
3676         }
3677 }
3678 
3679 /*
3680  * Pool Open/Import
3681  *
3682  * The import case is identical to an open except that the configuration is sent
3683  * down from userland, instead of grabbed from the configuration cache.  For the
3684  * case of an open, the pool configuration will exist in the
3685  * POOL_STATE_UNINITIALIZED state.
3686  *
3687  * The stats information (gen/count/ustats) is used to gather vdev statistics at
3688  * the same time we open the pool, without having to keep the spa_t around in
3689  * some ambiguous state.
3690  */
3691 static int
3692 spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
3693     nvlist_t **config)
3694 {
3695         spa_t *spa;
3696         spa_load_state_t state = SPA_LOAD_OPEN;
3697         int error;
3698         int locked = B_FALSE;
3699 
3700         *spapp = NULL;
3701 
3702         /*
3703          * As disgusting as this is, we need to support recursive calls to this
3704          * function because dsl_dir_open() is called during spa_load(), and ends
3705          * up calling spa_open() again.  The real fix is to figure out how to
3706          * avoid dsl_dir_open() calling this in the first place.
3707          */
3708         if (mutex_owner(&spa_namespace_lock) != curthread) {
3709                 mutex_enter(&spa_namespace_lock);
3710                 locked = B_TRUE;
3711         }
3712 
3713         if ((spa = spa_lookup(pool)) == NULL) {
3714                 if (locked)
3715                         mutex_exit(&spa_namespace_lock);
3716                 return (SET_ERROR(ENOENT));
3717         }
3718 
3719         if (spa->spa_state == POOL_STATE_UNINITIALIZED) {
3720                 zpool_rewind_policy_t policy;
3721 
3722                 zpool_get_rewind_policy(nvpolicy ? nvpolicy : spa->spa_config,
3723                     &policy);
3724                 if (policy.zrp_request & ZPOOL_DO_REWIND)
3725                         state = SPA_LOAD_RECOVER;
3726 
3727                 spa_activate(spa, spa_mode_global);
3728 
3729                 if (state != SPA_LOAD_RECOVER)
3730                         spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
3731                 spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
3732 
3733                 zfs_dbgmsg("spa_open_common: opening %s", pool);
3734                 error = spa_load_best(spa, state, policy.zrp_txg,
3735                     policy.zrp_request);
3736 
3737                 if (error == EBADF) {
3738                         /*
3739                          * If vdev_validate() returns failure (indicated by
3740                          * EBADF), it means that one of the vdev labels indicates
3741                          * that the pool has been exported or destroyed.  If
3742                          * this is the case, the config cache is out of sync and
3743                          * we should remove the pool from the namespace.
3744                          */
3745                         spa_unload(spa);
3746                         spa_deactivate(spa);
3747                         spa_write_cachefile(spa, B_TRUE, B_TRUE);
3748                         spa_remove(spa);
3749                         if (locked)
3750                                 mutex_exit(&spa_namespace_lock);
3751                         return (SET_ERROR(ENOENT));
3752                 }
3753 
3754                 if (error) {
3755                         /*
3756                          * We can't open the pool, but we still have useful
3757                          * information: the state of each vdev after the
3758                          * attempted vdev_open().  Return this to the user.
3759                          */
3760                         if (config != NULL && spa->spa_config) {
3761                                 VERIFY(nvlist_dup(spa->spa_config, config,
3762                                     KM_SLEEP) == 0);
3763                                 VERIFY(nvlist_add_nvlist(*config,
3764                                     ZPOOL_CONFIG_LOAD_INFO,
3765                                     spa->spa_load_info) == 0);
3766                         }
3767                         spa_unload(spa);
3768                         spa_deactivate(spa);
3769                         spa->spa_last_open_failed = error;
3770                         if (locked)
3771                                 mutex_exit(&spa_namespace_lock);
3772                         *spapp = NULL;
3773                         return (error);
3774                 }
3775         }
3776 
3777         spa_open_ref(spa, tag);
3778 
3779         if (config != NULL)
3780                 *config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3781 
3782         /*
3783          * If we've recovered the pool, pass back any information we
3784          * gathered while doing the load.
3785          */
3786         if (state == SPA_LOAD_RECOVER) {
3787                 VERIFY(nvlist_add_nvlist(*config, ZPOOL_CONFIG_LOAD_INFO,
3788                     spa->spa_load_info) == 0);
3789         }
3790 
3791         if (locked) {
3792                 spa->spa_last_open_failed = 0;
3793                 spa->spa_last_ubsync_txg = 0;
3794                 spa->spa_load_txg = 0;
3795                 mutex_exit(&spa_namespace_lock);
3796         }
3797 
3798         *spapp = spa;
3799 
3800         return (0);
3801 }
3802 
3803 int
3804 spa_open_rewind(const char *name, spa_t **spapp, void *tag, nvlist_t *policy,
3805     nvlist_t **config)
3806 {
3807         return (spa_open_common(name, spapp, tag, policy, config));
3808 }
3809 
3810 int
3811 spa_open(const char *name, spa_t **spapp, void *tag)
3812 {
3813         return (spa_open_common(name, spapp, tag, NULL, NULL));
3814 }
3815 
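/*
 * Illustrative calling pattern for spa_open()/spa_close() (sketch only; the
 * tag is an arbitrary pointer used to match the hold, conventionally FTAG):
 *
 *	spa_t *spa;
 *	if (spa_open("tank", &spa, FTAG) == 0) {
 *		... use the pool ...
 *		spa_close(spa, FTAG);
 *	}
 */
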
3816 /*
3817  * Lookup the given spa_t, incrementing the inject count in the process,


4227         }
4228 }
4229 
4230 /*
4231  * Pool Creation
4232  */
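/*
 * The nvroot passed here is the vdev tree nvlist assembled by userland
 * (libzfs): a root vdev whose ZPOOL_CONFIG_CHILDREN array holds the
 * top-level vdevs, optionally accompanied by ZPOOL_CONFIG_SPARES and
 * ZPOOL_CONFIG_L2CACHE arrays; spa_config_parse() below turns it into the
 * in-core vdev tree.
 */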
4233 int
4234 spa_create(const char *pool, nvlist_t *nvroot, nvlist_t *props,
4235     nvlist_t *zplprops)
4236 {
4237         spa_t *spa;
4238         char *altroot = NULL;
4239         vdev_t *rvd;
4240         dsl_pool_t *dp;
4241         dmu_tx_t *tx;
4242         int error = 0;
4243         uint64_t txg = TXG_INITIAL;
4244         nvlist_t **spares, **l2cache;
4245         uint_t nspares, nl2cache;
4246         uint64_t version, obj;
4247         boolean_t has_features;
4248 
4249         /*
4250          * If this pool already exists, return failure.
4251          */
4252         mutex_enter(&spa_namespace_lock);
4253         if (spa_lookup(pool) != NULL) {
4254                 mutex_exit(&spa_namespace_lock);
4255                 return (SET_ERROR(EEXIST));
4256         }
4257 
4258         /*
4259          * Allocate a new spa_t structure.
4260          */
4261         (void) nvlist_lookup_string(props,
4262             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4263         spa = spa_add(pool, NULL, altroot);
4264         spa_activate(spa, spa_mode_global);
4265 
4266         if (props && (error = spa_prop_validate(spa, props))) {
4267                 spa_deactivate(spa);
4268                 spa_remove(spa);
4269                 mutex_exit(&spa_namespace_lock);
4270                 return (error);
4271         }
4272 
4273         has_features = B_FALSE;
4274         for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
4275             elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
4276                 if (zpool_prop_feature(nvpair_name(elem)))
4277                         has_features = B_TRUE;
4278         }
4279 
4280         if (has_features || nvlist_lookup_uint64(props,
4281             zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
4282                 version = SPA_VERSION;
4283         }
4284         ASSERT(SPA_VERSION_IS_SUPPORTED(version));
4285 
4286         spa->spa_first_txg = txg;
4287         spa->spa_uberblock.ub_txg = txg - 1;
4288         spa->spa_uberblock.ub_version = version;
4289         spa->spa_ubsync = spa->spa_uberblock;
4290         spa->spa_load_state = SPA_LOAD_CREATE;
4291         spa->spa_removing_phys.sr_state = DSS_NONE;
4292         spa->spa_removing_phys.sr_removing_vdev = -1;
4293         spa->spa_removing_phys.sr_prev_indirect_vdev = -1;
4294 
4295         /*
4296          * Create "The Godfather" zio to hold all async IOs
4297          */
4298         spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
4299             KM_SLEEP);
4300         for (int i = 0; i < max_ncpus; i++) {
4301                 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
4302                     ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
4303                     ZIO_FLAG_GODFATHER);
4304         }
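        /*
         * spa_async_zio_root is an array of root ("godfather") zios, one
         * per CPU; async I/Os hang off the per-CPU root so that a single
         * root zio's lock does not become a point of contention.
         */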
4305 
4306         /*
4307          * Create the root vdev.
4308          */
4309         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4310 
4311         error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, VDEV_ALLOC_ADD);
4312 
4313         ASSERT(error != 0 || rvd != NULL);


4417          * because sync-to-convergence takes longer if the blocksize
4418          * keeps changing.
4419          */
4420         obj = bpobj_alloc(spa->spa_meta_objset, 1 << 14, tx);
4421         dmu_object_set_compress(spa->spa_meta_objset, obj,
4422             ZIO_COMPRESS_OFF, tx);
4423         if (zap_add(spa->spa_meta_objset,
4424             DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SYNC_BPOBJ,
4425             sizeof (uint64_t), 1, &obj, tx) != 0) {
4426                 cmn_err(CE_PANIC, "failed to add bpobj");
4427         }
4428         VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
4429             spa->spa_meta_objset, obj));
4430 
4431         /*
4432          * Create the pool's history object.
4433          */
4434         if (version >= SPA_VERSION_ZPOOL_HISTORY)
4435                 spa_history_create_obj(spa, tx);
4436 
4437         /*
4438          * Generate some random noise for salted checksums to operate on.
4439          */
4440         (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
4441             sizeof (spa->spa_cksum_salt.zcs_bytes));
4442 
4443         /*
4444          * Set pool properties.
4445          */
4446         spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
4447         spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
4448         spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
4449         spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
4450 
4451         if (props != NULL) {
4452                 spa_configfile_set(spa, props, B_FALSE);
4453                 spa_sync_props(props, tx);
4454         }
4455 
4456         dmu_tx_commit(tx);
4457 
4458         spa->spa_sync_on = B_TRUE;
4459         txg_sync_start(spa->spa_dsl_pool);
4460 
4461         /*
4462          * We explicitly wait for the first transaction to complete so that our
4463          * bean counters are appropriately updated.
4464          */
4465         txg_wait_synced(spa->spa_dsl_pool, txg);
4466 
4467         spa_spawn_aux_threads(spa);
4468 
4469         spa_write_cachefile(spa, B_FALSE, B_TRUE);
4470         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
4471 
4472         spa_history_log_version(spa, "create");
4473 
4474         /*
4475          * Don't count references from objsets that are already closed
4476          * and are making their way through the eviction process.
4477          */
4478         spa_evicting_os_wait(spa);
4479         spa->spa_minref = refcount_count(&spa->spa_refcount);
4480         spa->spa_load_state = SPA_LOAD_NONE;
4481 
4482         mutex_exit(&spa_namespace_lock);
4483 
4484         return (0);
4485 }
4486 
4487 #ifdef _KERNEL
4488 /*
4489  * Get the root pool information from the root disk, then import the root pool
4490  * at system boot time.
4491  */
4492 extern int vdev_disk_read_rootlabel(char *, char *, nvlist_t **);
4493 
4494 static nvlist_t *
4495 spa_generate_rootconf(char *devpath, char *devid, uint64_t *guid)
4496 {
4497         nvlist_t *config;
4498         nvlist_t *nvtop, *nvroot;
4499         uint64_t pgid;
4500 
4501         if (vdev_disk_read_rootlabel(devpath, devid, &config) != 0)
4502                 return (NULL);
4503 
4504         /*
4505          * Add this top-level vdev to the child array.
4506          */


4592 #if defined(_OBP) && defined(_KERNEL)
4593         if (config == NULL) {
4594                 if (strstr(devpath, "/iscsi/ssd") != NULL) {
4595                         /* iscsi boot */
4596                         get_iscsi_bootpath_phy(devpath);
4597                         config = spa_generate_rootconf(devpath, devid, &guid);
4598                 }
4599         }
4600 #endif
4601         if (config == NULL) {
4602                 cmn_err(CE_NOTE, "Cannot read the pool label from '%s'",
4603                     devpath);
4604                 return (SET_ERROR(EIO));
4605         }
4606 
4607         VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
4608             &pname) == 0);
4609         VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
4610 
4611         mutex_enter(&spa_namespace_lock);
4612         if ((spa = spa_lookup(pname)) != NULL) {
4613                 /*
4614                  * Remove the existing root pool from the namespace so that we
4615                  * can replace it with the correct config we just read in.
4616                  */
4617                 spa_remove(spa);
4618         }
4619 
4620         spa = spa_add(pname, config, NULL);
4621         spa->spa_is_root = B_TRUE;
4622         spa->spa_import_flags = ZFS_IMPORT_VERBATIM;
4623         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
4624             &spa->spa_ubsync.ub_version) != 0)
4625                 spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
4626 
4627         /*
4628          * Build up a vdev tree based on the boot device's label config.
4629          */
4630         VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4631             &nvtop) == 0);
4632         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4633         error = spa_config_parse(spa, &rvd, nvtop, NULL, 0,
4634             VDEV_ALLOC_ROOTPOOL);
4635         spa_config_exit(spa, SCL_ALL, FTAG);
4636         if (error) {
4637                 mutex_exit(&spa_namespace_lock);
4638                 nvlist_free(config);
4639                 cmn_err(CE_NOTE, "Can not parse the config for pool '%s'",
4640                     pname);
4641                 return (error);
4642         }
4643 
4644         /*
4645          * Get the boot vdev.


4689 }
4690 
4691 #endif
4692 
4693 /*
4694  * Import a non-root pool into the system.
4695  */
4696 int
4697 spa_import(const char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
4698 {
4699         spa_t *spa;
4700         char *altroot = NULL;
4701         spa_load_state_t state = SPA_LOAD_IMPORT;
4702         zpool_rewind_policy_t policy;
4703         uint64_t mode = spa_mode_global;
4704         uint64_t readonly = B_FALSE;
4705         int error;
4706         nvlist_t *nvroot;
4707         nvlist_t **spares, **l2cache;
4708         uint_t nspares, nl2cache;
4709 
4710         /*
4711          * If a pool with this name exists, return failure.
4712          */
4713         mutex_enter(&spa_namespace_lock);
4714         if (spa_lookup(pool) != NULL) {
4715                 mutex_exit(&spa_namespace_lock);
4716                 return (SET_ERROR(EEXIST));
4717         }
4718 
4719         /*
4720          * Create and initialize the spa structure.
4721          */
4722         (void) nvlist_lookup_string(props,
4723             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4724         (void) nvlist_lookup_uint64(props,
4725             zpool_prop_to_name(ZPOOL_PROP_READONLY), &readonly);
4726         if (readonly)
4727                 mode = FREAD;
4728         spa = spa_add(pool, config, altroot);
4729         spa->spa_import_flags = flags;
4730 
4731         /*
4732          * Verbatim import - Take a pool and insert it into the namespace
4733          * as if it had been loaded at boot.
4734          */
4735         if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
4736                 if (props != NULL)
4737                         spa_configfile_set(spa, props, B_FALSE);
4738 
4739                 spa_write_cachefile(spa, B_FALSE, B_TRUE);
4740                 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4741                 zfs_dbgmsg("spa_import: verbatim import of %s", pool);
4742                 mutex_exit(&spa_namespace_lock);
4743                 return (0);
4744         }
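        /*
         * Note that in the verbatim case above spa_load() is never called:
         * the pool simply enters the namespace with the supplied config,
         * and its vdevs are not opened until the pool is first opened via
         * spa_open_common().
         */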
4745 
4746         spa_activate(spa, mode);
4747 
4748         /*
4749          * Don't start async tasks until we know everything is healthy.
4750          */
4751         spa_async_suspend(spa);
4752 
4753         zpool_get_rewind_policy(config, &policy);
4754         if (policy.zrp_request & ZPOOL_DO_REWIND)
4755                 state = SPA_LOAD_RECOVER;
4756 
4757         spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT;
4758 
4759         if (state != SPA_LOAD_RECOVER) {
4760                 spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
4761                 zfs_dbgmsg("spa_import: importing %s", pool);
4762         } else {
4763                 zfs_dbgmsg("spa_import: importing %s, max_txg=%lld "
4764                     "(RECOVERY MODE)", pool, (longlong_t)policy.zrp_txg);
4765         }
4766         error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request);
4767 
4768         /*
4769          * Propagate anything learned while loading the pool and pass it
4770          * back to caller (i.e. rewind info, missing devices, etc).
4771          */
4772         VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4773             spa->spa_load_info) == 0);
4774 
4775         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4776         /*
4777          * Toss any existing sparelist, as it doesn't have any validity
4778          * anymore, and conflicts with spa_has_spare().
4779          */
4780         if (spa->spa_spares.sav_config) {
4781                 nvlist_free(spa->spa_spares.sav_config);
4782                 spa->spa_spares.sav_config = NULL;
4783                 spa_load_spares(spa);
4784         }
4785         if (spa->spa_l2cache.sav_config) {
4786                 nvlist_free(spa->spa_l2cache.sav_config);
4787                 spa->spa_l2cache.sav_config = NULL;
4788                 spa_load_l2cache(spa);
4789         }
4790 
4791         VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4792             &nvroot) == 0);
4793         if (error == 0)
4794                 error = spa_validate_aux(spa, nvroot, -1ULL,
4795                     VDEV_ALLOC_SPARE);
4796         if (error == 0)
4797                 error = spa_validate_aux(spa, nvroot, -1ULL,
4798                     VDEV_ALLOC_L2CACHE);
4799         spa_config_exit(spa, SCL_ALL, FTAG);
4800 
4801         if (props != NULL)
4802                 spa_configfile_set(spa, props, B_FALSE);
4803 
4804         if (error != 0 || (props && spa_writeable(spa) &&
4805             (error = spa_prop_set(spa, props)))) {
4806                 spa_unload(spa);
4807                 spa_deactivate(spa);
4808                 spa_remove(spa);
4809                 mutex_exit(&spa_namespace_lock);
4810                 return (error);
4811         }
4812 
4813         spa_async_resume(spa);
4814 
4815         /*
4816          * Override any spares and level 2 cache devices as specified by
4817          * the user, as these may have correct device names/devids, etc.
4818          */
4819         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4820             &spares, &nspares) == 0) {
4821                 if (spa->spa_spares.sav_config)
4822                         VERIFY(nvlist_remove(spa->spa_spares.sav_config,
4823                             ZPOOL_CONFIG_SPARES, DATA_TYPE_NVLIST_ARRAY) == 0);
4824                 else
4825                         VERIFY(nvlist_alloc(&spa->spa_spares.sav_config,
4826                             NV_UNIQUE_NAME, KM_SLEEP) == 0);
4827                 VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4828                     ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4829                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4830                 spa_load_spares(spa);
4831                 spa_config_exit(spa, SCL_ALL, FTAG);
4832                 spa->spa_spares.sav_sync = B_TRUE;
4833         }
4834         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4835             &l2cache, &nl2cache) == 0) {
4836                 if (spa->spa_l2cache.sav_config)
4837                         VERIFY(nvlist_remove(spa->spa_l2cache.sav_config,
4838                             ZPOOL_CONFIG_L2CACHE, DATA_TYPE_NVLIST_ARRAY) == 0);
4839                 else
4840                         VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4841                             NV_UNIQUE_NAME, KM_SLEEP) == 0);
4842                 VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4843                     ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4844                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4845                 spa_load_l2cache(spa);
4846                 spa_config_exit(spa, SCL_ALL, FTAG);
4847                 spa->spa_l2cache.sav_sync = B_TRUE;
4848         }
4849 
4850         /*
4851          * Check for any removed devices.
4852          */
4853         if (spa->spa_autoreplace) {
4854                 spa_aux_check_removed(&spa->spa_spares);
4855                 spa_aux_check_removed(&spa->spa_l2cache);
4856         }
4857 
4858         if (spa_writeable(spa)) {
4859                 /*
4860                  * Update the config cache to include the newly-imported pool.
4861                  */
4862                 spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
4863         }
4864 
4865         /*
4866          * It's possible that the pool was expanded while it was exported.
4867          * We kick off an async task to handle this for us.
4868          */
4869         spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
4870 
4871         spa_history_log_version(spa, "import");
4872 
4873         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4874 
4875         mutex_exit(&spa_namespace_lock);
4876 
4877         return (0);
4878 }
4879 
4880 nvlist_t *
4881 spa_tryimport(nvlist_t *tryconfig)
4882 {
4883         nvlist_t *config = NULL;
4884         char *poolname, *cachefile;
4885         spa_t *spa;
4886         uint64_t state;
4887         int error;
4888         zpool_rewind_policy_t policy;
4889 
4890         if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
4891                 return (NULL);
4892 
4893         if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
4894                 return (NULL);
4895 
4896         /*
4897          * Create and initialize the spa structure.
4898          */
4899         mutex_enter(&spa_namespace_lock);
4900         spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
4901         spa_activate(spa, FREAD);
4902 
4903         /*
4904          * Rewind pool if a max txg was provided. Note that even though we
4905          * retrieve the complete rewind policy, only the rewind txg is relevant
4906          * for tryimport.
4907          */
4908         zpool_get_rewind_policy(spa->spa_config, &policy);
4909         if (policy.zrp_txg != UINT64_MAX) {
4910                 spa->spa_load_max_txg = policy.zrp_txg;
4911                 spa->spa_extreme_rewind = B_TRUE;
4912                 zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld",
4913                     poolname, (longlong_t)policy.zrp_txg);
4914         } else {
4915                 zfs_dbgmsg("spa_tryimport: importing %s", poolname);
4916         }
4917 
4918         if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile)
4919             == 0) {
4920                 zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile);
4921                 spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
4922         } else {
4923                 spa->spa_config_source = SPA_CONFIG_SRC_SCAN;
4924         }
4925 
4926         error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING);
4927 
4928         /*
4929          * If 'tryconfig' was at least parsable, return the current config.
4930          */
4931         if (spa->spa_root_vdev != NULL) {
4932                 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
4933                 VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
4934                     poolname) == 0);
4935                 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
4936                     state) == 0);
4937                 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
4938                     spa->spa_uberblock.ub_timestamp) == 0);
4939                 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4940                     spa->spa_load_info) == 0);
4941 
4942                 /*
4943                  * If the bootfs property exists on this pool then we
4944                  * copy it out so that external consumers can tell which
4945                  * pools are bootable.
4946                  */
4947                 if ((!error || error == EEXIST) && spa->spa_bootfs) {


4982 
4983         spa_unload(spa);
4984         spa_deactivate(spa);
4985         spa_remove(spa);
4986         mutex_exit(&spa_namespace_lock);
4987 
4988         return (config);
4989 }
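/*
 * Note that spa_tryimport() never writes anything: the candidate pool is
 * loaded under the reserved TRYIMPORT_NAME placeholder with the spa
 * activated read-only (FREAD), and the spa is torn down again before
 * returning, so only the generated config and load info reach userland.
 */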
4990 
4991 /*
4992  * Pool export/destroy
4993  *
4994  * The act of destroying or exporting a pool is very simple.  We make sure there
4995  * is no more pending I/O and any references to the pool are gone.  Then, we
4996  * update the pool state and sync all the labels to disk, removing the
4997  * configuration from the cache afterwards. If the 'hardforce' flag is set, then
4998  * we don't sync the labels or remove the configuration cache.
4999  */
5000 static int
5001 spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
5002     boolean_t force, boolean_t hardforce)
5003 {
5004         spa_t *spa;
5005 
5006         if (oldconfig)
5007                 *oldconfig = NULL;
5008 
5009         if (!(spa_mode_global & FWRITE))
5010                 return (SET_ERROR(EROFS));
5011 
5012         mutex_enter(&spa_namespace_lock);
5013         if ((spa = spa_lookup(pool)) == NULL) {
5014                 mutex_exit(&spa_namespace_lock);
5015                 return (SET_ERROR(ENOENT));
5016         }
5017 
5018         /*
5019          * Put a hold on the pool, drop the namespace lock, stop async tasks,
5020          * reacquire the namespace lock, and see if we can export.
5021          */
5022         spa_open_ref(spa, FTAG);
5023         mutex_exit(&spa_namespace_lock);
5024         spa_async_suspend(spa);
5025         mutex_enter(&spa_namespace_lock);
5026         spa_close(spa, FTAG);
5027 
5028         /*
5029          * The pool will be in core if it's openable,
5030          * in which case we can modify its state.
5031          */
5032         if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) {
5033                 /*
5034                  * Objsets may be open only because they're dirty, so we
5035                  * have to force it to sync before checking spa_refcnt.
5036                  */
5037                 txg_wait_synced(spa->spa_dsl_pool, 0);
5038                 spa_evicting_os_wait(spa);
5039 
5040                 /*
5041                  * A pool cannot be exported or destroyed if there are active
5042                  * references.  If we are resetting a pool, allow references by
5043                  * fault injection handlers.
5044                  */
5045                 if (!spa_refcount_zero(spa) ||
5046                     (spa->spa_inject_ref != 0 &&
5047                     new_state != POOL_STATE_UNINITIALIZED)) {
5048                         spa_async_resume(spa);
5049                         mutex_exit(&spa_namespace_lock);
5050                         return (SET_ERROR(EBUSY));
5051                 }
5052 
5053                 /*
5054                  * A pool cannot be exported if it has an active shared spare.
5055                  * This is to prevent other pools stealing the active spare
5056                  * from an exported pool. At the user's request, such a pool
5057                  * can be forcibly exported.
5058                  */
5059                 if (!force && new_state == POOL_STATE_EXPORTED &&
5060                     spa_has_active_shared_spare(spa)) {
5061                         spa_async_resume(spa);
5062                         mutex_exit(&spa_namespace_lock);
5063                         return (SET_ERROR(EXDEV));
5064                 }
5065 
5066                 /*
5067                  * We want this to be reflected on every label,
5068                  * so mark them all dirty.  spa_unload() will do the
5069                  * final sync that pushes these changes out.
5070                  */
5071                 if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
5072                         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
5073                         spa->spa_state = new_state;
5074                         spa->spa_final_txg = spa_last_synced_txg(spa) +
5075                             TXG_DEFER_SIZE + 1;
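                        /*
                         * spa_final_txg is set TXG_DEFER_SIZE + 1 txgs past
                         * the last synced txg, presumably to leave room for
                         * the still-pending deferred frees plus the final
                         * config sync that spa_unload() performs.
                         */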
5076                         vdev_config_dirty(spa->spa_root_vdev);
5077                         spa_config_exit(spa, SCL_ALL, FTAG);
5078                 }
5079         }
5080 
5081         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
5082 
5083         if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
5084                 spa_unload(spa);
5085                 spa_deactivate(spa);
5086         }
5087 
5088         if (oldconfig && spa->spa_config)
5089                 VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0);
5090 
5091         if (new_state != POOL_STATE_UNINITIALIZED) {
5092                 if (!hardforce)
5093                         spa_write_cachefile(spa, B_TRUE, B_TRUE);
5094                 spa_remove(spa);
5095         }
5096         mutex_exit(&spa_namespace_lock);
5097 
5098         return (0);
5099 }
5100 
5101 /*
5102  * Destroy a storage pool.
5103  */
5104 int
5105 spa_destroy(char *pool)
5106 {
5107         return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
5108             B_FALSE, B_FALSE));
5109 }
5110 
5111 /*
5112  * Export a storage pool.
5113  */
5114 int
5115 spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
5116     boolean_t hardforce)
5117 {
5118         return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
5119             force, hardforce));
5120 }
5121 
5122 /*
5123  * Similar to spa_export(), this unloads the spa_t without actually removing it
5124  * from the namespace in any way.
5125  */
5126 int
5127 spa_reset(char *pool)
5128 {
5129         return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
5130             B_FALSE, B_FALSE));
5131 }
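/*
 * spa_destroy(), spa_export() and spa_reset() differ only in the new_state
 * they hand to spa_export_common(): POOL_STATE_DESTROYED,
 * POOL_STATE_EXPORTED and POOL_STATE_UNINITIALIZED respectively.
 */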
5132 
5133 /*
5134  * ==========================================================================
5135  * Device manipulation
5136  * ==========================================================================
5137  */
5138 
5139 /*
5140  * Add a device to a storage pool.
5141  */
5142 int
5143 spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
5144 {
5145         uint64_t txg, id;
5146         int error;
5147         vdev_t *rvd = spa->spa_root_vdev;
5148         vdev_t *vd, *tvd;
5149         nvlist_t **spares, **l2cache;
5150         uint_t nspares, nl2cache;
5151 
5152         ASSERT(spa_writeable(spa));
5153 
5154         txg = spa_vdev_enter(spa);
5155 
5156         if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
5157             VDEV_ALLOC_ADD)) != 0)
5158                 return (spa_vdev_exit(spa, NULL, txg, error));
5159 
5160         spa->spa_pending_vdev = vd;  /* spa_vdev_exit() will clear this */
5161 
5162         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, &spares,
5163             &nspares) != 0)
5164                 nspares = 0;
5165 
5166         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE, &l2cache,
5167             &nl2cache) != 0)
5168                 nl2cache = 0;
5169 
5170         if (vd->vdev_children == 0 && nspares == 0 && nl2cache == 0)
5171                 return (spa_vdev_exit(spa, vd, txg, EINVAL));
5172 
5173         if (vd->vdev_children != 0 &&
5174             (error = vdev_create(vd, txg, B_FALSE)) != 0)
5175                 return (spa_vdev_exit(spa, vd, txg, error));
5176 
5177         /*
5178          * We must validate the spares and l2cache devices after checking the
5179          * children.  Otherwise, vdev_inuse() will blindly overwrite the spare.
5180          */
5181         if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
5182                 return (spa_vdev_exit(spa, vd, txg, error));
5183 
5184         /*
5185          * If we are in the middle of a device removal, we can only add
5186          * devices which match the existing devices in the pool.
5187          * If we are in the middle of a removal, or have some indirect
5188          * vdevs, we can not add raidz toplevels.
5189          */
5190         if (spa->spa_vdev_removal != NULL ||
5191             spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5192                 for (int c = 0; c < vd->vdev_children; c++) {
5193                         tvd = vd->vdev_child[c];
5194                         if (spa->spa_vdev_removal != NULL &&
5195                             tvd->vdev_ashift !=
5196                             spa->spa_vdev_removal->svr_vdev->vdev_ashift) {
5197                                 return (spa_vdev_exit(spa, vd, txg, EINVAL));
5198                         }
5199                         /* Fail if top level vdev is raidz */
5200                         if (tvd->vdev_ops == &vdev_raidz_ops) {
5201                                 return (spa_vdev_exit(spa, vd, txg, EINVAL));
5202                         }
5203                         /*
5204                          * Need the top level mirror to be
5205                          * a mirror of leaf vdevs only
5206                          */
5207                         if (tvd->vdev_ops == &vdev_mirror_ops) {
5208                                 for (uint64_t cid = 0;
5209                                     cid < tvd->vdev_children; cid++) {
5210                                         vdev_t *cvd = tvd->vdev_child[cid];
5211                                         if (!cvd->vdev_ops->vdev_op_leaf) {
5212                                                 return (spa_vdev_exit(spa, vd,
5213                                                     txg, EINVAL));
5214                                         }
5215                                 }
5216                         }
5217                 }
5218         }
5219 
5220         for (int c = 0; c < vd->vdev_children; c++) {
5221 
5222                 /*
5223                  * Set the vdev id to the first hole, if one exists.
5224                  */
5225                 for (id = 0; id < rvd->vdev_children; id++) {
5226                         if (rvd->vdev_child[id]->vdev_ishole) {
5227                                 vdev_free(rvd->vdev_child[id]);
5228                                 break;
5229                         }
5230                 }
5231                 tvd = vd->vdev_child[c];
5232                 vdev_remove_child(vd, tvd);
5233                 tvd->vdev_id = id;
5234                 vdev_add_child(rvd, tvd);
5235                 vdev_config_dirty(tvd);
5236         }
5237 
5238         if (nspares != 0) {
5239                 spa_set_aux_vdevs(&spa->spa_spares, spares, nspares,
5240                     ZPOOL_CONFIG_SPARES);
5241                 spa_load_spares(spa);


5252         /*
5253          * We have to be careful when adding new vdevs to an existing pool.
5254          * If other threads start allocating from these vdevs before we
5255          * sync the config cache, and we lose power, then upon reboot we may
5256          * fail to open the pool because there are DVAs that the config cache
5257          * can't translate.  Therefore, we first add the vdevs without
5258          * initializing metaslabs; sync the config cache (via spa_vdev_exit());
5259          * and then let spa_config_update() initialize the new metaslabs.
5260          *
5261          * spa_load() checks for added-but-not-initialized vdevs, so that
5262          * if we lose power at any point in this sequence, the remaining
5263          * steps will be completed the next time we load the pool.
5264          */
5265         (void) spa_vdev_exit(spa, vd, txg, 0);
5266 
5267         mutex_enter(&spa_namespace_lock);
5268         spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
5269         spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
5270         mutex_exit(&spa_namespace_lock);
5271 
5272         return (0);
5273 }
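/*
 * Illustrative usage (device names are examples only): both a new
 * top-level mirror and an additional spare arrive through
 * spa_vdev_add(), e.g.:
 *
 *     # zpool add tank mirror c2t0d0 c2t1d0
 *     # zpool add tank spare c3t0d0
 *
 * Note the two-step commit above: the vdevs are synced without
 * metaslabs first, and spa_config_update() initializes them afterwards.
 */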
5274 
5275 /*
5276  * Attach a device to a mirror.  The arguments are the path to any device
5277  * in the mirror, and the nvroot for the new device.  If the path specifies
5278  * a device that is not mirrored, we automatically insert the mirror vdev.
5279  *
5280  * If 'replacing' is specified, the new device is intended to replace the
5281  * existing device; in this case the two devices are made into their own
5282  * mirror using the 'replacing' vdev, which is functionally identical to
5283  * the mirror vdev (it actually reuses all the same ops) but has a few
5284  * extra rules: you can't attach to it after it's been created, and upon
5285  * completion of resilvering, the first disk (the one being replaced)
5286  * is automatically detached.
5287  */
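/*
 * Illustrative usage (device names are examples only):
 *
 *     # zpool attach tank c1t0d0 c1t1d0     (replacing == B_FALSE)
 *       c1t0d0 gains a mirror sibling; a mirror vdev is inserted if
 *       c1t0d0 was not already part of one.
 *
 *     # zpool replace tank c1t0d0 c1t1d0    (replacing == B_TRUE)
 *       the two disks form a temporary 'replacing' vdev and c1t0d0 is
 *       detached automatically once the resilver completes.
 */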
5288 int
5289 spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
5290 {
5291         uint64_t txg, dtl_max_txg;
5292         vdev_t *rvd = spa->spa_root_vdev;
5293         vdev_t *oldvd, *newvd, *newrootvd, *pvd, *tvd;
5294         vdev_ops_t *pvops;
5295         char *oldvdpath, *newvdpath;
5296         int newvd_isspare;
5297         int error;
5298 
5299         ASSERT(spa_writeable(spa));
5300 
5301         txg = spa_vdev_enter(spa);
5302 
5303         oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5304 
5305         if (spa->spa_vdev_removal != NULL ||
5306             spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5307                 return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5308         }
5309 
5310         if (oldvd == NULL)
5311                 return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5312 
5313         if (!oldvd->vdev_ops->vdev_op_leaf)
5314                 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5315 
5316         pvd = oldvd->vdev_parent;
5317 
5318         if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
5319             VDEV_ALLOC_ATTACH)) != 0)
5320                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5321 
5322         if (newrootvd->vdev_children != 1)
5323                 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5324 
5325         newvd = newrootvd->vdev_child[0];
5326 
5327         if (!newvd->vdev_ops->vdev_op_leaf)
5328                 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5329 


5455         newvd_isspare = newvd->vdev_isspare;
5456 
5457         /*
5458          * Mark newvd's DTL dirty in this txg.
5459          */
5460         vdev_dirty(tvd, VDD_DTL, newvd, txg);
5461 
5462         /*
5463          * Schedule the resilver to restart in the future. We do this to
5464          * ensure that dmu_sync-ed blocks have been stitched into the
5465          * respective datasets.
5466          */
5467         dsl_resilver_restart(spa->spa_dsl_pool, dtl_max_txg);
5468 
5469         if (spa->spa_bootfs)
5470                 spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
5471 
5472         spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
5473 
5474         /*
5475          * Commit the config
5476          */
5477         (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
5478 
5479         spa_history_log_internal(spa, "vdev attach", NULL,
5480             "%s vdev=%s %s vdev=%s",
5481             replacing && newvd_isspare ? "spare in" :
5482             replacing ? "replace" : "attach", newvdpath,
5483             replacing ? "for" : "to", oldvdpath);
5484 
5485         spa_strfree(oldvdpath);
5486         spa_strfree(newvdpath);
5487 
5488         return (0);
5489 }
5490 
5491 /*
5492  * Detach a device from a mirror or replacing vdev.
5493  *
5494  * If 'replace_done' is specified, only detach if the parent


5668                 vdev_reopen(tvd);
5669                 vdev_expand(tvd, txg);
5670         }
5671 
5672         vdev_config_dirty(tvd);
5673 
5674         /*
5675          * Mark vd's DTL as dirty in this txg.  vdev_dtl_sync() will see that
5676          * vd->vdev_detached is set and free vd's DTL object in syncing context.
5677          * But first make sure we're not on any *other* txg's DTL list, to
5678          * prevent vd from being accessed after it's freed.
5679          */
5680         vdpath = spa_strdup(vd->vdev_path);
5681         for (int t = 0; t < TXG_SIZE; t++)
5682                 (void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
5683         vd->vdev_detached = B_TRUE;
5684         vdev_dirty(tvd, VDD_DTL, vd, txg);
5685 
5686         spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
5687 
5688         /* hang on to the spa before we release the lock */
5689         spa_open_ref(spa, FTAG);
5690 
5691         error = spa_vdev_exit(spa, vd, txg, 0);
5692 
5693         spa_history_log_internal(spa, "detach", NULL,
5694             "vdev=%s", vdpath);
5695         spa_strfree(vdpath);
5696 
5697         /*
5698          * If this was the removal of the original device in a hot spare vdev,
5699          * then we want to go through and remove the device from the hot spare
5700          * list of every other pool.
5701          */
5702         if (unspare) {
5703                 spa_t *altspa = NULL;
5704 
5705                 mutex_enter(&spa_namespace_lock);
5706                 while ((altspa = spa_next(altspa)) != NULL) {
5707                         if (altspa->spa_state != POOL_STATE_ACTIVE ||


5730 
5731 /*
5732  * Split a set of devices from their mirrors, and create a new pool from them.
5733  */
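/*
 * Illustrative usage (names are examples only):
 *
 *     # zpool split tank newtank [c1t1d0 ...]
 *
 * One healthy child from each top-level mirror (the named devices, if
 * any are given) is detached and assembled into the new pool 'newtank'.
 */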
5734 int
5735 spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
5736     nvlist_t *props, boolean_t exp)
5737 {
5738         int error = 0;
5739         uint64_t txg, *glist;
5740         spa_t *newspa;
5741         uint_t c, children, lastlog;
5742         nvlist_t **child, *nvl, *tmp;
5743         dmu_tx_t *tx;
5744         char *altroot = NULL;
5745         vdev_t *rvd, **vml = NULL;                      /* vdev modify list */
5746         boolean_t activate_slog;
5747 
5748         ASSERT(spa_writeable(spa));
5749 
5750         txg = spa_vdev_enter(spa);
5751 
5752         /* clear the log and flush everything up to now */
5753         activate_slog = spa_passivate_log(spa);
5754         (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5755         error = spa_reset_logs(spa);
5756         txg = spa_vdev_config_enter(spa);
5757 
5758         if (activate_slog)
5759                 spa_activate_log(spa);
5760 
5761         if (error != 0)
5762                 return (spa_vdev_exit(spa, NULL, txg, error));
5763 
5764         /* check new spa name before going any further */
5765         if (spa_lookup(newname) != NULL)
5766                 return (spa_vdev_exit(spa, NULL, txg, EEXIST));
5767 
5768         /*
5769          * scan through all the children to ensure they're all mirrors
5770          */
5771         if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 ||
5772             nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN, &child,
5773             &children) != 0)
5774                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5775 
5776         /* first, check to ensure we've got the right child count */
5777         rvd = spa->spa_root_vdev;
5778         lastlog = 0;
5779         for (c = 0; c < rvd->vdev_children; c++) {
5780                 vdev_t *vd = rvd->vdev_child[c];
5781 
5782                 /* don't count the holes & logs as children */
5783                 if (vd->vdev_islog || !vdev_is_concrete(vd)) {
5784                         if (lastlog == 0)
5785                                 lastlog = c;
5786                         continue;
5787                 }
5788 
5789                 lastlog = 0;
5790         }
5791         if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
5792                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5793 
5794         /* next, ensure no spare or cache devices are part of the split */
5795         if (nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_SPARES, &tmp) == 0 ||
5796             nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_L2CACHE, &tmp) == 0)
5797                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5798 
5799         vml = kmem_zalloc(children * sizeof (vdev_t *), KM_SLEEP);
5800         glist = kmem_zalloc(children * sizeof (uint64_t), KM_SLEEP);
5801 
5802         /* then, loop over each vdev and validate it */
5803         for (c = 0; c < children; c++) {


5816                         }
5817                 }
5818 
5819                 /* which disk is going to be split? */
5820                 if (nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_GUID,
5821                     &glist[c]) != 0) {
5822                         error = SET_ERROR(EINVAL);
5823                         break;
5824                 }
5825 
5826                 /* look it up in the spa */
5827                 vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
5828                 if (vml[c] == NULL) {
5829                         error = SET_ERROR(ENODEV);
5830                         break;
5831                 }
5832 
5833                 /* make sure there's nothing stopping the split */
5834                 if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
5835                     vml[c]->vdev_islog ||
5836                     !vdev_is_concrete(vml[c]) ||
5837                     vml[c]->vdev_isspare ||
5838                     vml[c]->vdev_isl2cache ||
5839                     !vdev_writeable(vml[c]) ||
5840                     vml[c]->vdev_children != 0 ||
5841                     vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
5842                     c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
5843                         error = SET_ERROR(EINVAL);
5844                         break;
5845                 }
5846 
5847                 if (vdev_dtl_required(vml[c])) {
5848                         error = SET_ERROR(EBUSY);
5849                         break;
5850                 }
5851 
5852                 /* we need certain info from the top level */
5853                 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_ARRAY,
5854                     vml[c]->vdev_top->vdev_ms_array) == 0);
5855                 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_SHIFT,
5856                     vml[c]->vdev_top->vdev_ms_shift) == 0);


5911             spa_generate_guid(NULL)) == 0);
5912         VERIFY0(nvlist_add_boolean(config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
5913         (void) nvlist_lookup_string(props,
5914             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
5915 
5916         /* add the new pool to the namespace */
5917         newspa = spa_add(newname, config, altroot);
5918         newspa->spa_avz_action = AVZ_ACTION_REBUILD;
5919         newspa->spa_config_txg = spa->spa_config_txg;
5920         spa_set_log_state(newspa, SPA_LOG_CLEAR);
5921 
5922         /* release the spa config lock, retaining the namespace lock */
5923         spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5924 
5925         if (zio_injection_enabled)
5926                 zio_handle_panic_injection(spa, FTAG, 1);
5927 
5928         spa_activate(newspa, spa_mode_global);
5929         spa_async_suspend(newspa);
5930 
5931         newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT;
5932 
5933         /* create the new pool from the disks of the original pool */
5934         error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE);
5935         if (error)
5936                 goto out;
5937 
5938         /* if that worked, generate a real config for the new pool */
5939         if (newspa->spa_root_vdev != NULL) {
5940                 VERIFY(nvlist_alloc(&newspa->spa_config_splitting,
5941                     NV_UNIQUE_NAME, KM_SLEEP) == 0);
5942                 VERIFY(nvlist_add_uint64(newspa->spa_config_splitting,
5943                     ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa)) == 0);
5944                 spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
5945                     B_TRUE));
5946         }
5947 
5948         /* set the props */
5949         if (props != NULL) {
5950                 spa_configfile_set(newspa, props, B_FALSE);
5951                 error = spa_prop_set(newspa, props);
5952                 if (error)
5953                         goto out;
5954         }
5955 
5956         /* flush everything */
5957         txg = spa_vdev_config_enter(newspa);
5958         vdev_config_dirty(newspa->spa_root_vdev);
5959         (void) spa_vdev_config_exit(newspa, NULL, txg, 0, FTAG);
5960 
5961         if (zio_injection_enabled)
5962                 zio_handle_panic_injection(spa, FTAG, 2);
5963 
5964         spa_async_resume(newspa);
5965 
5966         /* finally, update the original pool's config */
5967         txg = spa_vdev_config_enter(spa);
5968         tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
5969         error = dmu_tx_assign(tx, TXG_WAIT);
5970         if (error != 0)
5971                 dmu_tx_abort(tx);
5972         for (c = 0; c < children; c++) {
5973                 if (vml[c] != NULL) {
5974                         vdev_split(vml[c]);
5975                         if (error == 0)
5976                                 spa_history_log_internal(spa, "detach", tx,
5977                                     "vdev=%s", vml[c]->vdev_path);
5978 
5979                         vdev_free(vml[c]);
5980                 }
5981         }
5982         spa->spa_avz_action = AVZ_ACTION_REBUILD;
5983         vdev_config_dirty(spa->spa_root_vdev);
5984         spa->spa_config_splitting = NULL;
5985         nvlist_free(nvl);
5986         if (error == 0)
5987                 dmu_tx_commit(tx);
5988         (void) spa_vdev_exit(spa, NULL, txg, 0);
5989 
5990         if (zio_injection_enabled)
5991                 zio_handle_panic_injection(spa, FTAG, 3);
5992 
5993         /* split is complete; log a history record */
5994         spa_history_log_internal(newspa, "split", NULL,
5995             "from pool %s", spa_name(spa));
5996 
5997         kmem_free(vml, children * sizeof (vdev_t *));
5998 
5999         /* if we're not going to mount the filesystems in userland, export */
6000         if (exp)
6001                 error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
6002                     B_FALSE, B_FALSE);
6003 
6004         return (error);
6005 
6006 out:
6007         spa_unload(newspa);
6008         spa_deactivate(newspa);
6009         spa_remove(newspa);
6010 
6011         txg = spa_vdev_config_enter(spa);
6012 
6013         /* re-online all offlined disks */
6014         for (c = 0; c < children; c++) {
6015                 if (vml[c] != NULL)
6016                         vml[c]->vdev_offline = B_FALSE;
6017         }
6018         vdev_reopen(spa->spa_root_vdev);
6019 
6020         nvlist_free(spa->spa_config_splitting);
6021         spa->spa_config_splitting = NULL;
6022         (void) spa_vdev_exit(spa, NULL, txg, error);
6023 
6024         kmem_free(vml, children * sizeof (vdev_t *));
6025         return (error);
6026 }
6027 
6028 /*
6029  * Find any device that's done replacing, or a vdev marked 'unspare' that's
6030  * currently spared, so we can detach it.
6031  */
6032 static vdev_t *
6033 spa_vdev_resilver_done_hunt(vdev_t *vd)
6034 {
6035         vdev_t *newvd, *oldvd;
6036 
6037         for (int c = 0; c < vd->vdev_children; c++) {
6038                 oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
6039                 if (oldvd != NULL)
6040                         return (oldvd);
6041         }
6042 
6043         /*
6044          * Check for a completed replacement.  We always consider the first
6045          * vdev in the list to be the oldest vdev, and the last one to be
6046          * the newest (see spa_vdev_attach() for how that works).  In
6047          * the case where the newest vdev is faulted, we will not automatically
6048          * remove it after a resilver completes.  This is OK as it will require
6049          * user intervention to determine which disk the admin wishes to keep.
6050          */
6051         if (vd->vdev_ops == &vdev_replacing_ops) {
6052                 ASSERT(vd->vdev_children > 1);
6053 
6054                 newvd = vd->vdev_child[vd->vdev_children - 1];
6055                 oldvd = vd->vdev_child[0];
6056 
6057                 if (vdev_dtl_empty(newvd, DTL_MISSING) &&
6058                     vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6059                     !vdev_dtl_required(oldvd))
6060                         return (oldvd);
6061         }
6062 
6063         /*
6064          * Check for a completed resilver with the 'unspare' flag set.
6065          */
6066         if (vd->vdev_ops == &vdev_spare_ops) {
6067                 vdev_t *first = vd->vdev_child[0];
6068                 vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
6069 
6070                 if (last->vdev_unspare) {
6071                         oldvd = first;
6072                         newvd = last;
6073                 } else if (first->vdev_unspare) {
6074                         oldvd = last;
6075                         newvd = first;
6076                 } else {
6077                         oldvd = NULL;
6078                 }
6079 
6080                 if (oldvd != NULL &&
6081                     vdev_dtl_empty(newvd, DTL_MISSING) &&
6082                     vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6083                     !vdev_dtl_required(oldvd))
6084                         return (oldvd);
6085 


6086                 /*
6087                  * If there are more than two spares attached to a disk,
6088                  * and those spares are not required, then we want to
6089                  * attempt to free them up now so that they can be used
6090                  * by other pools.  Once we're back down to a single
6091                  * disk+spare, we stop removing them.
6092                  */
6093                 if (vd->vdev_children > 2) {
6094                         newvd = vd->vdev_child[1];
6095 
6096                         if (newvd->vdev_isspare && last->vdev_isspare &&
6097                             vdev_dtl_empty(last, DTL_MISSING) &&
6098                             vdev_dtl_empty(last, DTL_OUTAGE) &&
6099                             !vdev_dtl_required(newvd))
6100                                 return (newvd);
6101                 }
6102         }
6103 
6104         return (NULL);
6105 }
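/*
 * For example (device names are hypothetical): after
 * "zpool replace tank c1t0d0 c1t1d0" the interior vdev is
 * replacing-N(c1t0d0, c1t1d0).  Once c1t1d0's DTLs are empty, the hunt
 * above returns c1t0d0 so that spa_vdev_resilver_done() can detach it.
 */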


6126                  */
6127                 if (ppvd->vdev_ops == &vdev_spare_ops && pvd->vdev_id == 0 &&
6128                     ppvd->vdev_children == 2) {
6129                         ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
6130                         sguid = ppvd->vdev_child[1]->vdev_guid;
6131                 }
6132                 ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));
6133 
6134                 spa_config_exit(spa, SCL_ALL, FTAG);
6135                 if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
6136                         return;
6137                 if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
6138                         return;
6139                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6140         }
6141 
6142         spa_config_exit(spa, SCL_ALL, FTAG);
6143 }
6144 
6145 /*
6146  * Update the stored path or FRU for this vdev.
6147  */
6148 int
6149 spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value,
6150     boolean_t ispath)
6151 {
6152         vdev_t *vd;
6153         boolean_t sync = B_FALSE;
6154 
6155         ASSERT(spa_writeable(spa));
6156 
6157         spa_vdev_state_enter(spa, SCL_ALL);
6158 
6159         if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
6160                 return (spa_vdev_state_exit(spa, NULL, ENOENT));
6161 
6162         if (!vd->vdev_ops->vdev_op_leaf)
6163                 return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
6164 
6165         if (ispath) {
6166                 if (strcmp(value, vd->vdev_path) != 0) {
6167                         spa_strfree(vd->vdev_path);
6168                         vd->vdev_path = spa_strdup(value);
6169                         sync = B_TRUE;
6170                 }
6171         } else {
6172                 if (vd->vdev_fru == NULL) {
6173                         vd->vdev_fru = spa_strdup(value);
6174                         sync = B_TRUE;
6175                 } else if (strcmp(value, vd->vdev_fru) != 0) {
6176                         spa_strfree(vd->vdev_fru);
6177                         vd->vdev_fru = spa_strdup(value);
6178                         sync = B_TRUE;
6179                 }
6180         }
6181 
6182         return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0));
6183 }
6184 
6185 int
6186 spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath)
6187 {
6188         return (spa_vdev_set_common(spa, guid, newpath, B_TRUE));
6189 }
6190 
6191 int
6192 spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru)
6193 {
6194         return (spa_vdev_set_common(spa, guid, newfru, B_FALSE));
6195 }
6196 
6197 /*
6198  * ==========================================================================
6199  * SPA Scanning
6200  * ==========================================================================
6201  */
6202 int
6203 spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
6204 {
6205         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6206 
6207         if (dsl_scan_resilvering(spa->spa_dsl_pool))
6208                 return (SET_ERROR(EBUSY));
6209 
6210         return (dsl_scrub_set_pause_resume(spa->spa_dsl_pool, cmd));
6211 }
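/*
 * Illustrative usage (pool name is an example only):
 *
 *     # zpool scrub -p tank      pause an in-progress scrub
 *     # zpool scrub tank         resume (or start) a scrub
 *
 * As checked above, pause/resume is refused with EBUSY while a
 * resilver is in progress.
 */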
6212 
6213 int
6214 spa_scan_stop(spa_t *spa)
6215 {
6216         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6217         if (dsl_scan_resilvering(spa->spa_dsl_pool))


6374          */
6375         if (tasks & SPA_ASYNC_PROBE) {
6376                 spa_vdev_state_enter(spa, SCL_NONE);
6377                 spa_async_probe(spa, spa->spa_root_vdev);
6378                 (void) spa_vdev_state_exit(spa, NULL, 0);
6379         }
6380 
6381         /*
6382          * If any devices are done replacing, detach them.
6383          */
6384         if (tasks & SPA_ASYNC_RESILVER_DONE)
6385                 spa_vdev_resilver_done(spa);
6386 
6387         /*
6388          * Kick off a resilver.
6389          */
6390         if (tasks & SPA_ASYNC_RESILVER)
6391                 dsl_resilver_restart(spa->spa_dsl_pool, 0);
6392 
6393         /*
6394          * Let the world know that we're done.
6395          */
6396         mutex_enter(&spa->spa_async_lock);
6397         spa->spa_async_thread = NULL;
6398         cv_broadcast(&spa->spa_async_cv);
6399         mutex_exit(&spa->spa_async_lock);
6400         thread_exit();
6401 }
6402 
6403 void
6404 spa_async_suspend(spa_t *spa)
6405 {
6406         mutex_enter(&spa->spa_async_lock);
6407         spa->spa_async_suspended++;
6408         while (spa->spa_async_thread != NULL)
6409                 cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
6410         mutex_exit(&spa->spa_async_lock);
6411 
6412         spa_vdev_remove_suspend(spa);
6413 
6414         zthr_t *condense_thread = spa->spa_condense_zthr;
6415         if (condense_thread != NULL && zthr_isrunning(condense_thread))
6416                 VERIFY0(zthr_cancel(condense_thread));
6417 }
6418 
6419 void
6420 spa_async_resume(spa_t *spa)
6421 {
6422         mutex_enter(&spa->spa_async_lock);
6423         ASSERT(spa->spa_async_suspended != 0);
6424         spa->spa_async_suspended--;
6425         mutex_exit(&spa->spa_async_lock);
6426         spa_restart_removal(spa);
6427 
6428         zthr_t *condense_thread = spa->spa_condense_zthr;
6429         if (condense_thread != NULL && !zthr_isrunning(condense_thread))
6430                 zthr_resume(condense_thread);
6431 }
6432 
6433 static boolean_t
6434 spa_async_tasks_pending(spa_t *spa)
6435 {
6436         uint_t non_config_tasks;
6437         uint_t config_task;
6438         boolean_t config_task_suspended;
6439 
6440         non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
6441         config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
6442         if (spa->spa_ccw_fail_time == 0) {
6443                 config_task_suspended = B_FALSE;
6444         } else {
6445                 config_task_suspended =
6446                     (gethrtime() - spa->spa_ccw_fail_time) <
6447                     (zfs_ccw_retry_interval * NANOSEC);
6448         }
6449 
6450         return (non_config_tasks || (config_task && !config_task_suspended));


6455 {
6456         mutex_enter(&spa->spa_async_lock);
6457         if (spa_async_tasks_pending(spa) &&
6458             !spa->spa_async_suspended &&
6459             spa->spa_async_thread == NULL &&
6460             rootdir != NULL)
6461                 spa->spa_async_thread = thread_create(NULL, 0,
6462                     spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
6463         mutex_exit(&spa->spa_async_lock);
6464 }
6465 
6466 void
6467 spa_async_request(spa_t *spa, int task)
6468 {
6469         zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
6470         mutex_enter(&spa->spa_async_lock);
6471         spa->spa_async_tasks |= task;
6472         mutex_exit(&spa->spa_async_lock);
6473 }
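/*
 * For example, spa_sync_props() later in this file calls
 * spa_async_request(spa, SPA_ASYNC_AUTOEXPAND) when the autoexpand
 * property changes; spa_async_dispatch() then spawns spa_async_thread()
 * to act on the accumulated task bits.
 */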
6474 
6475 /*
6476  * ==========================================================================
6477  * SPA syncing routines
6478  * ==========================================================================
6479  */
6480 
6481 static int
6482 bpobj_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6483 {
6484         bpobj_t *bpo = arg;
6485         bpobj_enqueue(bpo, bp, tx);
6486         return (0);
6487 }
6488 
6489 static int
6490 spa_free_sync_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6491 {
6492         zio_t *zio = arg;
6493 
6494         zio_nowait(zio_free_sync(zio, zio->io_spa, dmu_tx_get_txg(tx), bp,


6743          * Setting the version is special cased when first creating the pool.
6744          */
6745         ASSERT(tx->tx_txg != TXG_INITIAL);
6746 
6747         ASSERT(SPA_VERSION_IS_SUPPORTED(version));
6748         ASSERT(version >= spa_version(spa));
6749 
6750         spa->spa_uberblock.ub_version = version;
6751         vdev_config_dirty(spa->spa_root_vdev);
6752         spa_history_log_internal(spa, "set", tx, "version=%lld", version);
6753 }
6754 
6755 /*
6756  * Set zpool properties.
6757  */
6758 static void
6759 spa_sync_props(void *arg, dmu_tx_t *tx)
6760 {
6761         nvlist_t *nvp = arg;
6762         spa_t *spa = dmu_tx_pool(tx)->dp_spa;

6763         objset_t *mos = spa->spa_meta_objset;
6764         nvpair_t *elem = NULL;
6765 
6766         mutex_enter(&spa->spa_props_lock);
6767 
6768         while ((elem = nvlist_next_nvpair(nvp, elem))) {
6769                 uint64_t intval;
6770                 char *strval, *fname;
6771                 zpool_prop_t prop;
6772                 const char *propname;
6773                 zprop_type_t proptype;
6774                 spa_feature_t fid;
6775 
6776                 switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
6777                 case ZPOOL_PROP_INVAL:
6778                         /*
6779                          * We checked this earlier in spa_prop_validate().
6780                          */
6781                         ASSERT(zpool_prop_feature(nvpair_name(elem)));
6782 
6783                         fname = strchr(nvpair_name(elem), '@') + 1;
6784                         VERIFY0(zfeature_lookup_name(fname, &fid));
6785 
6786                         spa_feature_enable(spa, fid, tx);
6787                         spa_history_log_internal(spa, "set", tx,
6788                             "%s=enabled", nvpair_name(elem));
6789                         break;
6790 
6791                 case ZPOOL_PROP_VERSION:
6792                         intval = fnvpair_value_uint64(elem);
6793                         /*
6794                          * The version is synced separately before other
6795                          * properties and should be correct by now.
6796                          */
6797                         ASSERT3U(spa_version(spa), >=, intval);


6855                                 intval = fnvpair_value_uint64(elem);
6856 
6857                                 if (proptype == PROP_TYPE_INDEX) {
6858                                         const char *unused;
6859                                         VERIFY0(zpool_prop_index_to_string(
6860                                             prop, intval, &unused));
6861                                 }
6862                                 VERIFY0(zap_update(mos,
6863                                     spa->spa_pool_props_object, propname,
6864                                     8, 1, &intval, tx));
6865                                 spa_history_log_internal(spa, "set", tx,
6866                                     "%s=%lld", nvpair_name(elem), intval);
6867                         } else {
6868                                 ASSERT(0); /* not allowed */
6869                         }
6870 
6871                         switch (prop) {
6872                         case ZPOOL_PROP_DELEGATION:
6873                                 spa->spa_delegation = intval;
6874                                 break;
6875                         case ZPOOL_PROP_BOOTFS:
6876                                 spa->spa_bootfs = intval;
6877                                 break;
6878                         case ZPOOL_PROP_FAILUREMODE:
6879                                 spa->spa_failmode = intval;
6880                                 break;
6881                         case ZPOOL_PROP_AUTOEXPAND:
6882                                 spa->spa_autoexpand = intval;
6883                                 if (tx->tx_txg != TXG_INITIAL)
6884                                         spa_async_request(spa,
6885                                             SPA_ASYNC_AUTOEXPAND);
6886                                 break;
6887                         case ZPOOL_PROP_DEDUPDITTO:
6888                                 spa->spa_dedup_ditto = intval;
6889                                 break;
6890                         default:
6891                                 break;
6892                         }
6893                 }
6894 
6895         }
6896 
6897         mutex_exit(&spa->spa_props_lock);
6898 }
6899 
6900 /*
6901  * Perform one-time upgrade on-disk changes.  spa_version() does not
6902  * reflect the new version this txg, so there must be no changes this
6903  * txg to anything that the upgrade code depends on after it executes.
6904  * Therefore this must be called after dsl_pool_sync() does the sync
6905  * tasks.
6906  */
6907 static void
6908 spa_sync_upgrades(spa_t *spa, dmu_tx_t *tx)
6909 {


6955                         spa_feature_incr(spa, SPA_FEATURE_LZ4_COMPRESS, tx);
6956         }
6957 
6958         /*
6959          * If we haven't written the salt, do so now.  Note that the
6960          * feature may not be activated yet, but that's fine since
6961          * the presence of this ZAP entry is backwards compatible.
6962          */
6963         if (zap_contains(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
6964             DMU_POOL_CHECKSUM_SALT) == ENOENT) {
6965                 VERIFY0(zap_add(spa->spa_meta_objset,
6966                     DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
6967                     sizeof (spa->spa_cksum_salt.zcs_bytes),
6968                     spa->spa_cksum_salt.zcs_bytes, tx));
6969         }
6970 
6971         rrw_exit(&dp->dp_config_rwlock, FTAG);
6972 }
6973 
6974 static void
6975 vdev_indirect_state_sync_verify(vdev_t *vd)
6976 {
6977         vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
6978         vdev_indirect_births_t *vib = vd->vdev_indirect_births;
6979 
6980         if (vd->vdev_ops == &vdev_indirect_ops) {
6981                 ASSERT(vim != NULL);
6982                 ASSERT(vib != NULL);
6983         }
6984 
6985         if (vdev_obsolete_sm_object(vd) != 0) {
6986                 ASSERT(vd->vdev_obsolete_sm != NULL);
6987                 ASSERT(vd->vdev_removing ||
6988                     vd->vdev_ops == &vdev_indirect_ops);
6989                 ASSERT(vdev_indirect_mapping_num_entries(vim) > 0);
6990                 ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0);
6991 
6992                 ASSERT3U(vdev_obsolete_sm_object(vd), ==,
6993                     space_map_object(vd->vdev_obsolete_sm));
6994                 ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=,
6995                     space_map_allocated(vd->vdev_obsolete_sm));
6996         }
6997         ASSERT(vd->vdev_obsolete_segments != NULL);
6998 
6999         /*
7000          * Since frees / remaps to an indirect vdev can only
7001          * happen in syncing context, the obsolete segments
7002          * tree must be empty when we start syncing.
7003          */
7004         ASSERT0(range_tree_space(vd->vdev_obsolete_segments));
7005 }
7006 
7007 /*
7008  * Sync the specified transaction group.  New blocks may be dirtied as
7009  * part of the process, so we iterate until it converges.
7010  */
7011 void
7012 spa_sync(spa_t *spa, uint64_t txg)
7013 {
7014         dsl_pool_t *dp = spa->spa_dsl_pool;
7015         objset_t *mos = spa->spa_meta_objset;
7016         bplist_t *free_bpl = &spa->spa_free_bplist[txg & TXG_MASK];
7017         vdev_t *rvd = spa->spa_root_vdev;
7018         vdev_t *vd;
7019         dmu_tx_t *tx;
7020         int error;
7021         uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
7022             zfs_vdev_queue_depth_pct / 100;
7023 
7024         VERIFY(spa_writeable(spa));
7025 
7026         /*
7027          * Wait for i/os issued in open context that need to complete
7028          * before this txg syncs.
7029          */
7030         VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK]));
7031         spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0);
7032 
7033         /*
7034          * Lock out configuration changes.
7035          */
7036         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7037 
7038         spa->spa_syncing_txg = txg;
7039         spa->spa_sync_pass = 0;
7040 
7041         mutex_enter(&spa->spa_alloc_lock);
7042         VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7043         mutex_exit(&spa->spa_alloc_lock);
7044 
7045         /*
7046          * If there are any pending vdev state changes, convert them
7047          * into config changes that go out with this transaction group.
7048          */
7049         spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7050         while (list_head(&spa->spa_state_dirty_list) != NULL) {
7051                 /*
7052                  * We need the write lock here because, for aux vdevs,
7053                  * calling vdev_config_dirty() modifies sav_config.
7054                  * This is ugly and will become unnecessary when we
7055                  * eliminate the aux vdev wart by integrating all vdevs
7056                  * into the root vdev tree.
7057                  */
7058                 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7059                 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_WRITER);
7060                 while ((vd = list_head(&spa->spa_state_dirty_list)) != NULL) {
7061                         vdev_state_clean(vd);
7062                         vdev_config_dirty(vd);
7063                 }
7064                 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7065                 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);


7100          * out this txg.
7101          */
7102         uint64_t queue_depth_total = 0;
7103         for (int c = 0; c < rvd->vdev_children; c++) {
7104                 vdev_t *tvd = rvd->vdev_child[c];
7105                 metaslab_group_t *mg = tvd->vdev_mg;
7106 
7107                 if (mg == NULL || mg->mg_class != spa_normal_class(spa) ||
7108                     !metaslab_group_initialized(mg))
7109                         continue;
7110 
7111                 /*
7112                  * It is safe to do a lock-free check here because only async
7113                  * allocations look at mg_max_alloc_queue_depth, and async
7114                  * allocations all happen from spa_sync().
7115                  */
7116                 ASSERT0(refcount_count(&mg->mg_alloc_queue_depth));
7117                 mg->mg_max_alloc_queue_depth = max_queue_depth;
7118                 queue_depth_total += mg->mg_max_alloc_queue_depth;
7119         }
7120         metaslab_class_t *mc = spa_normal_class(spa);
7121         ASSERT0(refcount_count(&mc->mc_alloc_slots));
7122         mc->mc_alloc_max_slots = queue_depth_total;
7123         mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
7124 
7125         ASSERT3U(mc->mc_alloc_max_slots, <=,
7126             max_queue_depth * rvd->vdev_children);
7127 
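        /*
         * Worked example with illustrative tunable values: if
         * zfs_vdev_async_write_max_active were 10 and
         * zfs_vdev_queue_depth_pct were 1000, max_queue_depth would be
         * 10 * 1000 / 100 = 100, and a pool with 8 top-level vdevs in the
         * normal class would get mc_alloc_max_slots = 8 * 100 = 800.
         */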
7128         for (int c = 0; c < rvd->vdev_children; c++) {
7129                 vdev_t *vd = rvd->vdev_child[c];
7130                 vdev_indirect_state_sync_verify(vd);
7131 
7132                 if (vdev_indirect_should_condense(vd)) {
7133                         spa_condense_indirect_start_sync(vd, tx);
7134                         break;
7135                 }
7136         }
7137 
7138         /*
7139          * Iterate to convergence.
7140          */
7141         do {
7142                 int pass = ++spa->spa_sync_pass;
7143 
7144                 spa_sync_config_object(spa, tx);
7145                 spa_sync_aux_dev(spa, &spa->spa_spares, tx,
7146                     ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
7147                 spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
7148                     ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
7149                 spa_errlog_sync(spa, txg);
7150                 dsl_pool_sync(dp, txg);
7151 
7152                 if (pass < zfs_sync_pass_deferred_free) {
7153                         spa_sync_frees(spa, free_bpl, tx);
7154                 } else {
7155                         /*
7156                          * We can not defer frees in pass 1, because
7157                          * we sync the deferred frees later in pass 1.
7158                          */
7159                         ASSERT3U(pass, >, 1);
7160                         bplist_iterate(free_bpl, bpobj_enqueue_cb,
7161                             &spa->spa_deferred_bpobj, tx);
7162                 }
7163 
7164                 ddt_sync(spa, txg);
7165                 dsl_scan_sync(dp, tx);
7166 
7167                 if (spa->spa_vdev_removal != NULL)
7168                         svr_sync(spa, tx);
7169 
7170                 while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7171                     != NULL)
7172                         vdev_sync(vd, txg);
7173 
7174                 if (pass == 1) {
7175                         spa_sync_upgrades(spa, tx);
7176                         ASSERT3U(txg, >=,
7177                             spa->spa_uberblock.ub_rootbp.blk_birth);
7178                         /*
7179                          * Note: We need to check if the MOS is dirty
7180                          * because we could have marked the MOS dirty
7181                          * without updating the uberblock (e.g. if we
7182                          * have sync tasks but no dirty user data).  We
7183                          * need to check the uberblock's rootbp because
7184                          * it is updated if we have synced out dirty
7185                          * data (though in this case the MOS will most
7186                          * likely also be dirty due to second order
7187                          * effects, we don't want to rely on that here).
7188                          */
7189                         if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7190                             !dmu_objset_is_dirty(mos, txg)) {
7191                                 /*


7203                         spa_sync_deferred_frees(spa, tx);
7204                 }
7205 
7206         } while (dmu_objset_is_dirty(mos, txg));
7207 
7208         if (!list_is_empty(&spa->spa_config_dirty_list)) {
7209                 /*
7210                  * Make sure that the number of ZAPs for all the vdevs matches
7211                  * the number of ZAPs in the per-vdev ZAP list. This only gets
7212                  * called if the config is dirty; otherwise there may be
7213                  * outstanding AVZ operations that weren't completed in
7214                  * spa_sync_config_object.
7215                  */
7216                 uint64_t all_vdev_zap_entry_count;
7217                 ASSERT0(zap_count(spa->spa_meta_objset,
7218                     spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
7219                 ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
7220                     all_vdev_zap_entry_count);
7221         }
7222 
7223         if (spa->spa_vdev_removal != NULL) {
7224                 ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]);
7225         }
7226 
7227         /*
7228          * Rewrite the vdev configuration (which includes the uberblock)
7229          * to commit the transaction group.
7230          *
7231          * If there are no dirty vdevs, we sync the uberblock to a few
7232          * random top-level vdevs that are known to be visible in the
7233          * config cache (see spa_vdev_add() for a complete description).
7234          * If there *are* dirty vdevs, sync the uberblock to all vdevs.
7235          */
7236         for (;;) {
7237                 /*
7238                  * We hold SCL_STATE to prevent vdev open/close/etc.
7239                  * while we're attempting to write the vdev labels.
7240                  */
7241                 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7242 
7243                 if (list_is_empty(&spa->spa_config_dirty_list)) {
7244                         vdev_t *svd[SPA_SYNC_MIN_VDEVS];
7245                         int svdcount = 0;
7246                         int children = rvd->vdev_children;
7247                         int c0 = spa_get_random(children);
7248 
7249                         for (int c = 0; c < children; c++) {
7250                                 vd = rvd->vdev_child[(c0 + c) % children];
7251                                 if (vd->vdev_ms_array == 0 || vd->vdev_islog ||
7252                                     !vdev_is_concrete(vd))
7253                                         continue;
7254                                 svd[svdcount++] = vd;
7255                                 if (svdcount == SPA_SYNC_MIN_VDEVS)
7256                                         break;
7257                         }
7258                         error = vdev_config_sync(svd, svdcount, txg);
7259                 } else {
7260                         error = vdev_config_sync(rvd->vdev_child,
7261                             rvd->vdev_children, txg);
7262                 }
7263 
7264                 if (error == 0)
7265                         spa->spa_last_synced_guid = rvd->vdev_guid;
7266 
7267                 spa_config_exit(spa, SCL_STATE, FTAG);
7268 
7269                 if (error == 0)
7270                         break;
7271                 zio_suspend(spa, NULL);
7272                 zio_resume_wait(spa);
7273         }
7274         dmu_tx_commit(tx);
7275 
7276         VERIFY(cyclic_reprogram(spa->spa_deadman_cycid, CY_INFINITY));
7277 
7278         /*
7279          * Clear the dirty config list.
7280          */
7281         while ((vd = list_head(&spa->spa_config_dirty_list)) != NULL)
7282                 vdev_config_clean(vd);
7283 
7284         /*
7285          * Now that the new config has synced transactionally,
7286          * let it become visible to the config cache.
7287          */
7288         if (spa->spa_config_syncing != NULL) {
7289                 spa_config_set(spa, spa->spa_config_syncing);
7290                 spa->spa_config_txg = txg;
7291                 spa->spa_config_syncing = NULL;
7292         }
7293 
7294         dsl_pool_sync_done(dp, txg);
7295 
7296         mutex_enter(&spa->spa_alloc_lock);
7297         VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7298         mutex_exit(&spa->spa_alloc_lock);
7299 
7300         /*
7301          * Update usable space statistics.
7302          */
7303         while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
7304                 vdev_sync_done(vd, txg);
7305 
7306         spa_update_dspace(spa);
7307 
7308         /*
7309          * It had better be the case that we didn't dirty anything
7310          * since vdev_config_sync().
7311          */
7312         ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7313         ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7314         ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
7315 
7316         spa->spa_sync_pass = 0;
7317 


7318         /*
7319          * Update the last synced uberblock here. We want to do this at
7320          * the end of spa_sync() so that consumers of spa_last_synced_txg()
7321          * will be guaranteed that all the processing associated with
7322          * that txg has been completed.
7323          */
7324         spa->spa_ubsync = spa->spa_uberblock;
7325         spa_config_exit(spa, SCL_CONFIG, FTAG);
7326 
7327         spa_handle_ignored_writes(spa);
7328 
7329         /*
7330          * If any async tasks have been requested, kick them off.
7331          */
7332         spa_async_dispatch(spa);
7333 }
7334 
7335 /*
7336  * Sync all pools.  We don't want to hold the namespace lock across these
7337  * operations, so we take a reference on the spa_t and drop the lock during the


7370         spa_t *spa;
7371 
7372         /*
7373          * Remove all cached state.  All pools should be closed now,
7374          * so every spa in the AVL tree should be unreferenced.
7375          */
7376         mutex_enter(&spa_namespace_lock);
7377         while ((spa = spa_next(NULL)) != NULL) {
7378                 /*
7379                  * Stop async tasks.  The async thread may need to detach
7380                  * a device that's been replaced, which requires grabbing
7381                  * spa_namespace_lock, so we must drop it here.
7382                  */
7383                 spa_open_ref(spa, FTAG);
7384                 mutex_exit(&spa_namespace_lock);
7385                 spa_async_suspend(spa);
7386                 mutex_enter(&spa_namespace_lock);
7387                 spa_close(spa, FTAG);
7388 
7389                 if (spa->spa_state != POOL_STATE_UNINITIALIZED) {


7390                         spa_unload(spa);
7391                         spa_deactivate(spa);
7392                 }

7393                 spa_remove(spa);
7394         }
7395         mutex_exit(&spa_namespace_lock);
7396 }
7397 
7398 vdev_t *
7399 spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
7400 {
7401         vdev_t *vd;
7402         int i;
7403 
7404         if ((vd = vdev_lookup_by_guid(spa->spa_root_vdev, guid)) != NULL)
7405                 return (vd);
7406 
7407         if (aux) {
7408                 for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
7409                         vd = spa->spa_l2cache.sav_vdevs[i];
7410                         if (vd->vdev_guid == guid)
7411                                 return (vd);
7412                 }


7468  * Check if a pool has an active shared spare device.
7469  * Note: reference count of an active spare is 2, as a spare and as a replacement
7470  */
7471 static boolean_t
7472 spa_has_active_shared_spare(spa_t *spa)
7473 {
7474         int i, refcnt;
7475         uint64_t pool;
7476         spa_aux_vdev_t *sav = &spa->spa_spares;
7477 
7478         for (i = 0; i < sav->sav_count; i++) {
7479                 if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
7480                     &refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
7481                     refcnt > 2)
7482                         return (B_TRUE);
7483         }
7484 
7485         return (B_FALSE);
7486 }
7487 
7488 sysevent_t *
7489 spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7490 {
7491         sysevent_t              *ev = NULL;
7492 #ifdef _KERNEL
7493         sysevent_attr_list_t    *attr = NULL;
7494         sysevent_value_t        value;
7495 
7496         ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
7497             SE_SLEEP);
7498         ASSERT(ev != NULL);
7499 
7500         value.value_type = SE_DATA_TYPE_STRING;
7501         value.value.sv_string = spa_name(spa);
7502         if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
7503                 goto done;
7504 
7505         value.value_type = SE_DATA_TYPE_UINT64;
7506         value.value.sv_uint64 = spa_guid(spa);
7507         if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
7508                 goto done;
7509 
7510         if (vd) {
7511                 value.value_type = SE_DATA_TYPE_UINT64;
7512                 value.value.sv_uint64 = vd->vdev_guid;
7513                 if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
7514                     SE_SLEEP) != 0)
7515                         goto done;
7516 
7517                 if (vd->vdev_path) {
7518                         value.value_type = SE_DATA_TYPE_STRING;
7519                         value.value.sv_string = vd->vdev_path;
7520                         if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
7521                             &value, SE_SLEEP) != 0)
7522                                 goto done;
7523                 }
7524         }
7525 
7526         if (hist_nvl != NULL) {
7527                 fnvlist_merge((nvlist_t *)attr, hist_nvl);
7528         }
7529 
7530         if (sysevent_attach_attributes(ev, attr) != 0)
7531                 goto done;
7532         attr = NULL;
7533 
7534 done:
7535         if (attr)
7536                 sysevent_free_attr(attr);
7537 
7538 #endif
7539         return (ev);
7540 }
7541 
7542 void
7543 spa_event_post(sysevent_t *ev)
7544 {
7545 #ifdef _KERNEL


7546         sysevent_id_t           eid;
7547 
7548         (void) log_sysevent(ev, SE_SLEEP, &eid);
7549         sysevent_free(ev);
7550 #endif
7551 }
7552 
7553 void
7554 spa_event_discard(sysevent_t *ev)
7555 {
7556 #ifdef _KERNEL
7557         sysevent_free(ev);
7558 #endif
7559 }
7560 
7561 /*
7562  * Post a sysevent corresponding to the given event.  The 'name' must be one of
7563  * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
7564  * filled in from the spa and (optionally) the vdev and history nvl.  This
7565  * doesn't do anything in the userland libzpool, as we don't want consumers to
7566  * misinterpret ztest or zdb as real changes.
7567  */
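/*
 * For example, spa_vdev_attach() above calls
 * spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH), which posts
 * an EC_ZFS class event carrying the pool name/guid and the vdev
 * guid/path attributes assembled in spa_event_create().
 */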
7568 void
7569 spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7570 {
7571         spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
7572 }


   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 
  22 /*
  23  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24  * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25  * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
  26  * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  27  * Copyright 2013 Saso Kiselkov. All rights reserved.
  28  * Copyright (c) 2014 Integros [integros.com]
  29  * Copyright 2016 Toomas Soome <tsoome@me.com>
  30  * Copyright 2018 Joyent, Inc.
  31  * Copyright (c) 2017 Datto Inc.
  32  */
  33 
  34 /*
  35  * SPA: Storage Pool Allocator
  36  *
  37  * This file contains all the routines used when modifying on-disk SPA state.
  38  * This includes opening, importing, destroying, exporting a pool, and syncing a
  39  * pool.
  40  */
  41 
  42 #include <sys/zfs_context.h>
  43 #include <sys/fm/fs/zfs.h>
  44 #include <sys/spa_impl.h>
  45 #include <sys/zio.h>
  46 #include <sys/zio_checksum.h>
  47 #include <sys/dmu.h>
  48 #include <sys/dmu_tx.h>
  49 #include <sys/zap.h>
  50 #include <sys/zil.h>
  51 #include <sys/ddt.h>
  52 #include <sys/vdev_impl.h>
  53 #include <sys/metaslab.h>
  54 #include <sys/metaslab_impl.h>
  55 #include <sys/uberblock_impl.h>
  56 #include <sys/txg.h>
  57 #include <sys/avl.h>
  58 #include <sys/dmu_traverse.h>
  59 #include <sys/dmu_objset.h>
  60 #include <sys/unique.h>
  61 #include <sys/dsl_pool.h>
  62 #include <sys/dsl_dataset.h>
  63 #include <sys/dsl_dir.h>
  64 #include <sys/dsl_prop.h>
  65 #include <sys/dsl_synctask.h>
  66 #include <sys/fs/zfs.h>
  67 #include <sys/arc.h>
  68 #include <sys/callb.h>
  69 #include <sys/systeminfo.h>
  70 #include <sys/spa_boot.h>
  71 #include <sys/zfs_ioctl.h>
  72 #include <sys/dsl_scan.h>
  73 #include <sys/zfeature.h>
  74 #include <sys/dsl_destroy.h>
  75 #include <sys/cos.h>
  76 #include <sys/special.h>
  77 #include <sys/wbc.h>
  78 #include <sys/abd.h>
  79 
  80 #ifdef  _KERNEL
  81 #include <sys/bootprops.h>
  82 #include <sys/callb.h>
  83 #include <sys/cpupart.h>
  84 #include <sys/pool.h>
  85 #include <sys/sysdc.h>
  86 #include <sys/zone.h>
  87 #endif  /* _KERNEL */
  88 
  89 #include "zfs_prop.h"
  90 #include "zfs_comutil.h"
  91 
  92 /*
  93  * The interval, in seconds, at which failed configuration cache file writes
  94  * should be retried.
  95  */
  96 static int zfs_ccw_retry_interval = 300;
  97 
  98 typedef enum zti_modes {
  99         ZTI_MODE_FIXED,                 /* value is # of threads (min 1) */
 100         ZTI_MODE_BATCH,                 /* cpu-intensive; value is ignored */
 101         ZTI_MODE_NULL,                  /* don't create a taskq */
 102         ZTI_NMODES
 103 } zti_modes_t;
 104 
 105 #define ZTI_P(n, q)     { ZTI_MODE_FIXED, (n), (q) }
 106 #define ZTI_BATCH       { ZTI_MODE_BATCH, 0, 1 }
 107 #define ZTI_NULL        { ZTI_MODE_NULL, 0, 0 }
 108 
 109 #define ZTI_N(n)        ZTI_P(n, 1)
 110 #define ZTI_ONE         ZTI_N(1)
 111 
 112 typedef struct zio_taskq_info {
 113         zti_modes_t zti_mode;
 114         uint_t zti_value;
 115         uint_t zti_count;
 116 } zio_taskq_info_t;


 129  * are so high frequency and short-lived that the taskq itself can become a
 130  * point of lock contention. The ZTI_P(#, #) macro indicates that we need an
 131  * additional degree of parallelism specified by the number of threads per-
 132  * taskq and the number of taskqs; when dispatching an event in this case, the
 133  * particular taskq is chosen at random.
 134  *
 135  * The different taskq priorities are to handle the different contexts (issue
 136  * and interrupt) and then to reserve threads for ZIO_PRIORITY_NOW I/Os that
 137  * need to be handled with minimum delay.
 138  */
 139 const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
 140         /* ISSUE        ISSUE_HIGH      INTR            INTR_HIGH */
 141         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* NULL */
 142         { ZTI_N(8),     ZTI_NULL,       ZTI_P(12, 8),   ZTI_NULL }, /* READ */
 143         { ZTI_BATCH,    ZTI_N(5),       ZTI_N(8),       ZTI_N(5) }, /* WRITE */
 144         { ZTI_P(12, 8), ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* FREE */
 145         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* CLAIM */
 146         { ZTI_ONE,      ZTI_NULL,       ZTI_ONE,        ZTI_NULL }, /* IOCTL */
 147 };
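/*
 * Editor's note (illustrative reading of the table above, derived from the
 * macro definitions): the READ row's INTR entry ZTI_P(12, 8) requests 8
 * taskqs of 12 threads each for read interrupt handling, while ZTI_N(8) in
 * the ISSUE column is shorthand for ZTI_P(8, 1), i.e. a single 8-thread
 * taskq.
 */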
 148 
 149 static sysevent_t *spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl,
 150     const char *name);
 151 static void spa_event_notify_impl(sysevent_t *ev);
 152 static void spa_sync_version(void *arg, dmu_tx_t *tx);
 153 static void spa_sync_props(void *arg, dmu_tx_t *tx);
 154 static void spa_vdev_sync_props(void *arg, dmu_tx_t *tx);
 155 static int spa_vdev_prop_set_nosync(vdev_t *, nvlist_t *, boolean_t *);
 156 static boolean_t spa_has_active_shared_spare(spa_t *spa);
 157 static int spa_load_impl(spa_t *spa, uint64_t, nvlist_t *config,
 158     spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
 159     char **ereport);
 160 static void spa_vdev_resilver_done(spa_t *spa);
 161 static void spa_auto_trim(spa_t *spa, uint64_t txg);
 162 static void spa_vdev_man_trim_done(spa_t *spa);
 163 static void spa_vdev_auto_trim_done(spa_t *spa);
 164 static uint64_t spa_min_trim_rate(spa_t *spa);
 165 
 166 uint_t          zio_taskq_batch_pct = 75;       /* 1 thread per cpu in pset */
 167 id_t            zio_taskq_psrset_bind = PS_NONE;
 168 boolean_t       zio_taskq_sysdc = B_TRUE;       /* use SDC scheduling class */
 169 uint_t          zio_taskq_basedc = 80;          /* base duty cycle */
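/*
 * Editor's note (an assumption about how the percentage above is applied,
 * not taken from this source): with zio_taskq_batch_pct = 75, a ZTI_BATCH
 * taskq on a 32-CPU system would be sized to roughly 24 worker threads,
 * i.e. 75% of the available CPUs.
 */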
 170 
 171 boolean_t       spa_create_process = B_TRUE;    /* no process ==> no sysdc */
 172 extern int      zfs_sync_pass_deferred_free;
 173 
 174 /*
 175  * ==========================================================================
 176  * SPA properties routines
 177  * ==========================================================================
 178  */
 179 
 180 /*
 181  * Add a (source=src, propname=propval) list to an nvlist.
 182  */
 183 static void
 184 spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, char *strval,
 185     uint64_t intval, zprop_source_t src)
 186 {
 187         const char *propname = zpool_prop_to_name(prop);
 188         nvlist_t *propval;
 189 
 190         VERIFY(nvlist_alloc(&propval, NV_UNIQUE_NAME, KM_SLEEP) == 0);
 191         VERIFY(nvlist_add_uint64(propval, ZPROP_SOURCE, src) == 0);
 192 
 193         if (strval != NULL)
 194                 VERIFY(nvlist_add_string(propval, ZPROP_VALUE, strval) == 0);
 195         else
 196                 VERIFY(nvlist_add_uint64(propval, ZPROP_VALUE, intval) == 0);
 197 
 198         VERIFY(nvlist_add_nvlist(nvl, propname, propval) == 0);
 199         nvlist_free(propval);
 200 }
 201 
 202 /*
 203  * Get property values from the spa configuration.
 204  */
 205 static void
 206 spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
 207 {
 208         vdev_t *rvd = spa->spa_root_vdev;
 209         dsl_pool_t *pool = spa->spa_dsl_pool;
 210         spa_meta_placement_t *mp = &spa->spa_meta_policy;
 211         uint64_t size, alloc, cap, version;
 212         zprop_source_t src = ZPROP_SRC_NONE;
 213         spa_config_dirent_t *dp;
 214         metaslab_class_t *mc = spa_normal_class(spa);
 215 
 216         ASSERT(MUTEX_HELD(&spa->spa_props_lock));
 217 
 218         if (rvd != NULL) {
 219                 alloc = metaslab_class_get_alloc(spa_normal_class(spa));
 220                 size = metaslab_class_get_space(spa_normal_class(spa));
 221                 spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
 222                 spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
 223                 spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
 224                 spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
 225                     size - alloc, src);
 226                 spa_prop_add_list(*nvp, ZPOOL_PROP_ENABLESPECIAL, NULL,
 227                     (uint64_t)spa->spa_usesc, src);
 228                 spa_prop_add_list(*nvp, ZPOOL_PROP_MINWATERMARK, NULL,
 229                     spa->spa_minwat, src);
 230                 spa_prop_add_list(*nvp, ZPOOL_PROP_HIWATERMARK, NULL,
 231                     spa->spa_hiwat, src);
 232                 spa_prop_add_list(*nvp, ZPOOL_PROP_LOWATERMARK, NULL,
 233                     spa->spa_lowat, src);
 234                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPMETA_DITTO, NULL,
 235                     spa->spa_ddt_meta_copies, src);
 236 
 237                 spa_prop_add_list(*nvp, ZPOOL_PROP_META_PLACEMENT, NULL,
 238                     mp->spa_enable_meta_placement_selection, src);
 239                 spa_prop_add_list(*nvp, ZPOOL_PROP_SYNC_TO_SPECIAL, NULL,
 240                     mp->spa_sync_to_special, src);
 241                 spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_META_TO_METADEV, NULL,
 242                     mp->spa_ddt_meta_to_special, src);
 243                 spa_prop_add_list(*nvp, ZPOOL_PROP_ZFS_META_TO_METADEV,
 244                     NULL, mp->spa_zfs_meta_to_special, src);
 245                 spa_prop_add_list(*nvp, ZPOOL_PROP_SMALL_DATA_TO_METADEV, NULL,
 246                     mp->spa_small_data_to_special, src);
 247 
 248                 spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
 249                     metaslab_class_fragmentation(mc), src);
 250                 spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
 251                     metaslab_class_expandable_space(mc), src);
 252                 spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
 253                     (spa_mode(spa) == FREAD), src);
 254 
 255                 spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_DESEGREGATION, NULL,
 256                     (spa->spa_ddt_class_min == spa->spa_ddt_class_max), src);
 257 
 258                 cap = (size == 0) ? 0 : (alloc * 100 / size);
 259                 spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
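                /*
                 * Editor's note: cap is an integer percentage; e.g. with
                 * alloc == 400 and size == 1000, the reported capacity is
                 * 400 * 100 / 1000 == 40.
                 */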
 260 
 261                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_BEST_EFFORT, NULL,
 262                     spa->spa_dedup_best_effort, src);
 263 
 264                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, NULL,
 265                     spa->spa_dedup_lo_best_effort, src);
 266 
 267                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, NULL,
 268                     spa->spa_dedup_hi_best_effort, src);
 269 
 270                 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
 271                     ddt_get_pool_dedup_ratio(spa), src);
 272 
 273                 spa_prop_add_list(*nvp, ZPOOL_PROP_DDTCAPPED, NULL,
 274                     spa->spa_ddt_capped, src);
 275 
 276                 spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
 277                     rvd->vdev_state, src);
 278 
 279                 version = spa_version(spa);
 280                 if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
 281                         src = ZPROP_SRC_DEFAULT;
 282                 else
 283                         src = ZPROP_SRC_LOCAL;
 284                 spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL, version, src);
 285         }
 286 
 287         if (pool != NULL) {
 288                 /*
 289                  * The $FREE directory was introduced in SPA_VERSION_DEADLISTS;
 290                  * when opening pools that predate this version, freedir will be NULL.
 291                  */
 292                 if (pool->dp_free_dir != NULL) {
 293                         spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
 294                             dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes +
 295                             pool->dp_long_freeing_total,
 296                             src);
 297                 } else {
 298                         spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
 299                             NULL, pool->dp_long_freeing_total, src);
 300                 }
 301 
 302                 if (pool->dp_leak_dir != NULL) {
 303                         spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
 304                             dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
 305                             src);
 306                 } else {
 307                         spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
 308                             NULL, 0, src);
 309                 }
 310         }
 311 
 312         spa_prop_add_list(*nvp, ZPOOL_PROP_GUID, NULL, spa_guid(spa), src);
 313 
 314         if (spa->spa_comment != NULL) {
 315                 spa_prop_add_list(*nvp, ZPOOL_PROP_COMMENT, spa->spa_comment,
 316                     0, ZPROP_SRC_LOCAL);
 317         }
 318 
 319         if (spa->spa_root != NULL)


 359          */
 360         spa_prop_get_config(spa, nvp);
 361 
 362         /* If no pool property object, no more prop to get. */
 363         if (mos == NULL || spa->spa_pool_props_object == 0) {
 364                 mutex_exit(&spa->spa_props_lock);
 365                 return (0);
 366         }
 367 
 368         /*
 369          * Get properties from the MOS pool property object.
 370          */
 371         for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
 372             (err = zap_cursor_retrieve(&zc, &za)) == 0;
 373             zap_cursor_advance(&zc)) {
 374                 uint64_t intval = 0;
 375                 char *strval = NULL;
 376                 zprop_source_t src = ZPROP_SRC_DEFAULT;
 377                 zpool_prop_t prop;
 378 
 379                 if ((prop = zpool_name_to_prop(za.za_name)) == ZPROP_INVAL)
 380                         continue;
 381 
 382                 switch (za.za_integer_length) {
 383                 case 8:
 384                         /* integer property */
 385                         if (za.za_first_integer !=
 386                             zpool_prop_default_numeric(prop))
 387                                 src = ZPROP_SRC_LOCAL;
 388 
 389                         if (prop == ZPOOL_PROP_BOOTFS) {
 390                                 dsl_pool_t *dp;
 391                                 dsl_dataset_t *ds = NULL;
 392 
 393                                 dp = spa_get_dsl(spa);
 394                                 dsl_pool_config_enter(dp, FTAG);
 395                                 if (err = dsl_dataset_hold_obj(dp,
 396                                     za.za_first_integer, FTAG, &ds)) {
 397                                         dsl_pool_config_exit(dp, FTAG);
 398                                         break;
 399                                 }


 438         if (err && err != ENOENT) {
 439                 nvlist_free(*nvp);
 440                 *nvp = NULL;
 441                 return (err);
 442         }
 443 
 444         return (0);
 445 }
 446 
 447 /*
 448  * Validate the given pool properties nvlist and modify the list
 449  * for the property values to be set.
 450  */
 451 static int
 452 spa_prop_validate(spa_t *spa, nvlist_t *props)
 453 {
 454         nvpair_t *elem;
 455         int error = 0, reset_bootfs = 0;
 456         uint64_t objnum = 0;
 457         boolean_t has_feature = B_FALSE;
 458         uint64_t lowat = spa->spa_lowat, hiwat = spa->spa_hiwat,
 459             minwat = spa->spa_minwat;
 460 
 461         elem = NULL;
 462         while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
 463                 uint64_t intval;
 464                 char *strval, *slash, *check, *fname;
 465                 const char *propname = nvpair_name(elem);
 466                 zpool_prop_t prop = zpool_name_to_prop(propname);
 467                 spa_feature_t feature;
 468 
 469                 switch (prop) {
 470                 case ZPROP_INVAL:
 471                         if (!zpool_prop_feature(propname)) {
 472                                 error = SET_ERROR(EINVAL);
 473                                 break;
 474                         }
 475 
 476                         /*
 477                          * Sanitize the input.
 478                          */
 479                         if (nvpair_type(elem) != DATA_TYPE_UINT64) {
 480                                 error = SET_ERROR(EINVAL);
 481                                 break;
 482                         }
 483 
 484                         if (nvpair_value_uint64(elem, &intval) != 0) {
 485                                 error = SET_ERROR(EINVAL);
 486                                 break;
 487                         }
 488 
 489                         if (intval != 0) {
 490                                 error = SET_ERROR(EINVAL);
 491                                 break;
 492                         }
 493 
 494                         fname = strchr(propname, '@') + 1;
 495                         if (zfeature_lookup_name(fname, &feature) != 0) {
 496                                 error = SET_ERROR(EINVAL);
 497                                 break;
 498                         }
 499 
 500                         if (feature == SPA_FEATURE_WBC &&
 501                             !spa_has_special(spa)) {
 502                                 error = SET_ERROR(ENOTSUP);
 503                                 break;
 504                         }
 505 
 506                         has_feature = B_TRUE;
 507                         break;
 508 
 509                 case ZPOOL_PROP_VERSION:
 510                         error = nvpair_value_uint64(elem, &intval);
 511                         if (!error &&
 512                             (intval < spa_version(spa) ||
 513                             intval > SPA_VERSION_BEFORE_FEATURES ||
 514                             has_feature))
 515                                 error = SET_ERROR(EINVAL);
 516                         break;
 517 
 518                 case ZPOOL_PROP_DELEGATION:
 519                 case ZPOOL_PROP_AUTOREPLACE:
 520                 case ZPOOL_PROP_LISTSNAPS:
 521                 case ZPOOL_PROP_AUTOEXPAND:
 522                 case ZPOOL_PROP_DEDUP_BEST_EFFORT:
 523                 case ZPOOL_PROP_DDT_DESEGREGATION:
 524                 case ZPOOL_PROP_META_PLACEMENT:
 525                 case ZPOOL_PROP_FORCETRIM:
 526                 case ZPOOL_PROP_AUTOTRIM:
 527                         error = nvpair_value_uint64(elem, &intval);
 528                         if (!error && intval > 1)
 529                                 error = SET_ERROR(EINVAL);
 530                         break;
 531 
 532                 case ZPOOL_PROP_DDT_META_TO_METADEV:
 533                 case ZPOOL_PROP_ZFS_META_TO_METADEV:
 534                         error = nvpair_value_uint64(elem, &intval);
 535                         if (!error && intval > META_PLACEMENT_DUAL)
 536                                 error = SET_ERROR(EINVAL);
 537                         break;
 538 
 539                 case ZPOOL_PROP_SYNC_TO_SPECIAL:
 540                         error = nvpair_value_uint64(elem, &intval);
 541                         if (!error && intval > SYNC_TO_SPECIAL_ALWAYS)
 542                                 error = SET_ERROR(EINVAL);
 543                         break;
 544 
 545                 case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
 546                         error = nvpair_value_uint64(elem, &intval);
 547                         if (!error && intval > SPA_MAXBLOCKSIZE)
 548                                 error = SET_ERROR(EINVAL);
 549                         break;
 550 
 551                 case ZPOOL_PROP_BOOTFS:
 552                         /*
 553                          * If the pool version is less than SPA_VERSION_BOOTFS,
 554                          * or the pool is still being created (version == 0),
 555                          * the bootfs property cannot be set.
 556                          */
 557                         if (spa_version(spa) < SPA_VERSION_BOOTFS) {
 558                                 error = SET_ERROR(ENOTSUP);
 559                                 break;
 560                         }
 561 
 562                         /*
 563                          * Make sure the vdev config is bootable
 564                          */
 565                         if (!vdev_is_bootable(spa->spa_root_vdev)) {
 566                                 error = SET_ERROR(ENOTSUP);
 567                                 break;
 568                         }
 569 
 570                         reset_bootfs = 1;


 588                                  * Must be ZPL, and its property settings
 589                                  * must be supported by GRUB (compression
 590                                  * is not gzip, and large blocks are not used).
 591                                  */
 592 
 593                                 if (dmu_objset_type(os) != DMU_OST_ZFS) {
 594                                         error = SET_ERROR(ENOTSUP);
 595                                 } else if ((error =
 596                                     dsl_prop_get_int_ds(dmu_objset_ds(os),
 597                                     zfs_prop_to_name(ZFS_PROP_COMPRESSION),
 598                                     &propval)) == 0 &&
 599                                     !BOOTFS_COMPRESS_VALID(propval)) {
 600                                         error = SET_ERROR(ENOTSUP);
 601                                 } else {
 602                                         objnum = dmu_objset_id(os);
 603                                 }
 604                                 dmu_objset_rele(os, FTAG);
 605                         }
 606                         break;
 607 
 608                 case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
 609                         error = nvpair_value_uint64(elem, &intval);
 610                         if ((intval < 0) || (intval > 100) ||
 611                             (intval >= spa->spa_dedup_hi_best_effort))
 612                                 error = SET_ERROR(EINVAL);
 613                         break;
 614 
 615                 case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
 616                         error = nvpair_value_uint64(elem, &intval);
 617                         if ((intval < 0) || (intval > 100) ||
 618                             (intval <= spa->spa_dedup_lo_best_effort))
 619                                 error = SET_ERROR(EINVAL);
 620                         break;
 621 
 622                 case ZPOOL_PROP_FAILUREMODE:
 623                         error = nvpair_value_uint64(elem, &intval);
 624                         if (!error && (intval < ZIO_FAILURE_MODE_WAIT ||
 625                             intval > ZIO_FAILURE_MODE_PANIC))
 626                                 error = SET_ERROR(EINVAL);
 627 
 628                         /*
 629                          * This is a special case which only occurs when
 630                          * the pool has completely failed. This allows
 631                          * the user to change the in-core failmode property
 632                          * without syncing it out to disk (I/Os might
 633                          * currently be blocked). We do this by returning
 634                          * EIO to the caller (spa_prop_set) to trick it
 635                          * into thinking we encountered a property validation
 636                          * error.
 637                          */
 638                         if (!error && spa_suspended(spa)) {
 639                                 spa->spa_failmode = intval;
 640                                 error = SET_ERROR(EIO);
 641                         }


 663                             strcmp(slash, "/..") == 0)
 664                                 error = SET_ERROR(EINVAL);
 665                         break;
 666 
 667                 case ZPOOL_PROP_COMMENT:
 668                         if ((error = nvpair_value_string(elem, &strval)) != 0)
 669                                 break;
 670                         for (check = strval; *check != '\0'; check++) {
 671                                 /*
 672                                  * The kernel doesn't have an easy isprint()
 673                                  * check.  For this kernel check, we merely
 674                                  * check ASCII apart from DEL.  Fix this if
 675                                  * there is an easy-to-use kernel isprint().
 676                                  */
 677                                 if (*check >= 0x7f) {
 678                                         error = SET_ERROR(EINVAL);
 679                                         break;
 680                                 }
 681                         }
 682                         if (strlen(strval) > ZPROP_MAX_COMMENT)
 683                                 error = SET_ERROR(E2BIG);
 684                         break;
 685 
 686                 case ZPOOL_PROP_DEDUPDITTO:
 687                         if (spa_version(spa) < SPA_VERSION_DEDUP)
 688                                 error = SET_ERROR(ENOTSUP);
 689                         else
 690                                 error = nvpair_value_uint64(elem, &intval);
 691                         if (error == 0 &&
 692                             intval != 0 && intval < ZIO_DEDUPDITTO_MIN)
 693                                 error = SET_ERROR(EINVAL);
 694                         break;
 695 
 696                 case ZPOOL_PROP_MINWATERMARK:
 697                         error = nvpair_value_uint64(elem, &intval);
 698                         if (!error && (intval > 100))
 699                                 error = SET_ERROR(EINVAL);
 700                         minwat = intval;
 701                         break;
 702                 case ZPOOL_PROP_LOWATERMARK:
 703                         error = nvpair_value_uint64(elem, &intval);
 704                         if (!error && (intval > 100))
 705                                 error = SET_ERROR(EINVAL);
 706                         lowat = intval;
 707                         break;
 708                 case ZPOOL_PROP_HIWATERMARK:
 709                         error = nvpair_value_uint64(elem, &intval);
 710                         if (!error && (intval > 100))
 711                                 error = SET_ERROR(EINVAL);
 712                         hiwat = intval;
 713                         break;
 714                 case ZPOOL_PROP_DEDUPMETA_DITTO:
 715                         error = nvpair_value_uint64(elem, &intval);
 716                         if (!error && (intval > SPA_DVAS_PER_BP))
 717                                 error = SET_ERROR(EINVAL);
 718                         break;
 719                 case ZPOOL_PROP_SCRUB_PRIO:
 720                 case ZPOOL_PROP_RESILVER_PRIO:
 721                         error = nvpair_value_uint64(elem, &intval);
 722                         if (error || intval > 100)
 723                                 error = SET_ERROR(EINVAL);
 724                         break;
 725                 }
 726 
 727                 if (error)
 728                         break;
 729         }
 730 
 731         /* a nonzero low watermark must be strictly below the high watermark */
 732         if (lowat != 0 && lowat >= hiwat)
 733                 error = SET_ERROR(EINVAL);
 734 
 735         /* a nonzero min watermark must be strictly below the low watermark */
 736         if (minwat != 0 && minwat >= lowat)
 737                 error = SET_ERROR(EINVAL);
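        /*
         * Editor's note: the watermarks are percentages, so for example
         * minwat = 5, lowat = 20, hiwat = 80 passes both checks above,
         * while lowat = 80 with hiwat = 20 is rejected with EINVAL.  A
         * value of 0 leaves the corresponding check disabled.
         */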
 738 
 739         if (!error && reset_bootfs) {
 740                 error = nvlist_remove(props,
 741                     zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
 742 
 743                 if (!error) {
 744                         error = nvlist_add_uint64(props,
 745                             zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
 746                 }
 747         }
 748 
 749         return (error);
 750 }
 751 
 752 void
 753 spa_configfile_set(spa_t *spa, nvlist_t *nvp, boolean_t need_sync)
 754 {
 755         char *cachefile;
 756         spa_config_dirent_t *dp;
 757 
 758         if (nvlist_lookup_string(nvp, zpool_prop_to_name(ZPOOL_PROP_CACHEFILE),


 775 }
 776 
 777 int
 778 spa_prop_set(spa_t *spa, nvlist_t *nvp)
 779 {
 780         int error;
 781         nvpair_t *elem = NULL;
 782         boolean_t need_sync = B_FALSE;
 783 
 784         if ((error = spa_prop_validate(spa, nvp)) != 0)
 785                 return (error);
 786 
 787         while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
 788                 zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
 789 
 790                 if (prop == ZPOOL_PROP_CACHEFILE ||
 791                     prop == ZPOOL_PROP_ALTROOT ||
 792                     prop == ZPOOL_PROP_READONLY)
 793                         continue;
 794 
 795                 if (prop == ZPOOL_PROP_VERSION || prop == ZPROP_INVAL) {
 796                         uint64_t ver;
 797 
 798                         if (prop == ZPOOL_PROP_VERSION) {
 799                                 VERIFY(nvpair_value_uint64(elem, &ver) == 0);
 800                         } else {
 801                                 ASSERT(zpool_prop_feature(nvpair_name(elem)));
 802                                 ver = SPA_VERSION_FEATURES;
 803                                 need_sync = B_TRUE;
 804                         }
 805 
 806                         /* Save time if the version is already set. */
 807                         if (ver == spa_version(spa))
 808                                 continue;
 809 
 810                         /*
 811                          * In addition to the pool directory object, we might
 812                          * create the pool properties object, the features for
 813                          * read object, the features for write object, or the
 814                          * feature descriptions object.
 815                          */


 894  * the root vdev's guid, our own pool guid, and then mark all of our
 895  * vdevs dirty.  Note that we must make sure that all our vdevs are
 896  * online when we do this, or else any vdevs that weren't present
 897  * would be orphaned from our pool.  We are also going to issue a
 898  * sysevent to update any watchers.
 899  */
 900 int
 901 spa_change_guid(spa_t *spa)
 902 {
 903         int error;
 904         uint64_t guid;
 905 
 906         mutex_enter(&spa->spa_vdev_top_lock);
 907         mutex_enter(&spa_namespace_lock);
 908         guid = spa_generate_guid(NULL);
 909 
 910         error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
 911             spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
 912 
 913         if (error == 0) {
 914                 spa_config_sync(spa, B_FALSE, B_TRUE);
 915                 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
 916         }
 917 
 918         mutex_exit(&spa_namespace_lock);
 919         mutex_exit(&spa->spa_vdev_top_lock);
 920 
 921         return (error);
 922 }
 923 
 924 /*
 925  * ==========================================================================
 926  * SPA state manipulation (open/create/destroy/import/export)
 927  * ==========================================================================
 928  */
 929 
 930 static int
 931 spa_error_entry_compare(const void *a, const void *b)
 932 {
 933         spa_error_entry_t *sa = (spa_error_entry_t *)a;
 934         spa_error_entry_t *sb = (spa_error_entry_t *)b;


1162         CALLB_CPR_EXIT(&cprinfo);   /* drops spa_proc_lock */
1163 
1164         mutex_enter(&curproc->p_lock);
1165         lwp_exit();
1166 }
1167 #endif
1168 
1169 /*
1170  * Activate an uninitialized pool.
1171  */
1172 static void
1173 spa_activate(spa_t *spa, int mode)
1174 {
1175         ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
1176 
1177         spa->spa_state = POOL_STATE_ACTIVE;
1178         spa->spa_mode = mode;
1179 
1180         spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
1181         spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);
1182         spa->spa_special_class = metaslab_class_create(spa, zfs_metaslab_ops);
1183 
1184         /* Try to create a covering process */
1185         mutex_enter(&spa->spa_proc_lock);
1186         ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
1187         ASSERT(spa->spa_proc == &p0);
1188         spa->spa_did = 0;
1189 
1190         /* Only create a process if we're going to be around a while. */
1191         if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
1192                 if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
1193                     NULL, 0) == 0) {
1194                         spa->spa_proc_state = SPA_PROC_CREATED;
1195                         while (spa->spa_proc_state == SPA_PROC_CREATED) {
1196                                 cv_wait(&spa->spa_proc_cv,
1197                                     &spa->spa_proc_lock);
1198                         }
1199                         ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1200                         ASSERT(spa->spa_proc != &p0);
1201                         ASSERT(spa->spa_did != 0);
1202                 } else {
1203 #ifdef _KERNEL
1204                         cmn_err(CE_WARN,
1205                             "Couldn't create process for zfs pool \"%s\"\n",
1206                             spa->spa_name);
1207 #endif
1208                 }
1209         }
1210         mutex_exit(&spa->spa_proc_lock);
1211 
1212         /* If we didn't create a process, we need to create our taskqs. */
1213         if (spa->spa_proc == &p0) {
1214                 spa_create_zio_taskqs(spa);
1215         }
1216 
1217         list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
1218             offsetof(vdev_t, vdev_config_dirty_node));
1219         list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
1220             offsetof(objset_t, os_evicting_node));
1221         list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
1222             offsetof(vdev_t, vdev_state_dirty_node));
1223 
1224         txg_list_create(&spa->spa_vdev_txg_list, spa,
1225             offsetof(struct vdev, vdev_txg_node));
1226 
1227         avl_create(&spa->spa_errlist_scrub,
1228             spa_error_entry_compare, sizeof (spa_error_entry_t),
1229             offsetof(spa_error_entry_t, se_avl));
1230         avl_create(&spa->spa_errlist_last,
1231             spa_error_entry_compare, sizeof (spa_error_entry_t),
1232             offsetof(spa_error_entry_t, se_avl));
1233 }
1234 
1235 /*
1236  * Opposite of spa_activate().


1241         ASSERT(spa->spa_sync_on == B_FALSE);
1242         ASSERT(spa->spa_dsl_pool == NULL);
1243         ASSERT(spa->spa_root_vdev == NULL);
1244         ASSERT(spa->spa_async_zio_root == NULL);
1245         ASSERT(spa->spa_state != POOL_STATE_UNINITIALIZED);
1246 
1247         spa_evicting_os_wait(spa);
1248 
1249         txg_list_destroy(&spa->spa_vdev_txg_list);
1250 
1251         list_destroy(&spa->spa_config_dirty_list);
1252         list_destroy(&spa->spa_evicting_os_list);
1253         list_destroy(&spa->spa_state_dirty_list);
1254 
1255         for (int t = 0; t < ZIO_TYPES; t++) {
1256                 for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1257                         spa_taskqs_fini(spa, t, q);
1258                 }
1259         }
1260 
1261         metaslab_class_destroy(spa->spa_normal_class);
1262         spa->spa_normal_class = NULL;
1263 
1264         metaslab_class_destroy(spa->spa_log_class);
1265         spa->spa_log_class = NULL;
1266 
1267         metaslab_class_destroy(spa->spa_special_class);
1268         spa->spa_special_class = NULL;
1269 
1270         /*
1271          * If this was part of an import or the open otherwise failed, we may
1272          * still have errors left in the queues.  Empty them just in case.
1273          */
1274         spa_errlog_drain(spa);
1275 
1276         avl_destroy(&spa->spa_errlist_scrub);
1277         avl_destroy(&spa->spa_errlist_last);
1278 
1279         spa->spa_state = POOL_STATE_UNINITIALIZED;
1280 
1281         mutex_enter(&spa->spa_proc_lock);
1282         if (spa->spa_proc_state != SPA_PROC_NONE) {
1283                 ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1284                 spa->spa_proc_state = SPA_PROC_DEACTIVATE;
1285                 cv_broadcast(&spa->spa_proc_cv);
1286                 while (spa->spa_proc_state == SPA_PROC_DEACTIVATE) {
1287                         ASSERT(spa->spa_proc != &p0);
1288                         cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1289                 }


1344                         *vdp = NULL;
1345                         return (error);
1346                 }
1347         }
1348 
1349         ASSERT(*vdp != NULL);
1350 
1351         return (0);
1352 }
1353 
1354 /*
1355  * Opposite of spa_load().
1356  */
1357 static void
1358 spa_unload(spa_t *spa)
1359 {
1360         int i;
1361 
1362         ASSERT(MUTEX_HELD(&spa_namespace_lock));
1363 
1364         /*
1365          * Stop manual trim before stopping spa sync, because manual trim
1366          * needs to execute a synctask (trim timestamp sync) at the end.
1367          */
1368         mutex_enter(&spa->spa_auto_trim_lock);
1369         mutex_enter(&spa->spa_man_trim_lock);
1370         spa_trim_stop_wait(spa);
1371         mutex_exit(&spa->spa_man_trim_lock);
1372         mutex_exit(&spa->spa_auto_trim_lock);
1373 
1374         /*
1375          * Stop async tasks.
1376          */
1377         spa_async_suspend(spa);
1378 
1379         /*
1380          * Stop syncing.
1381          */
1382         if (spa->spa_sync_on) {
1383                 txg_sync_stop(spa->spa_dsl_pool);
1384                 spa->spa_sync_on = B_FALSE;
1385         }
1386 
1387         /*
1388          * Even though vdev_free() also calls vdev_metaslab_fini, we need
1389          * to call it earlier, before we wait for async i/o to complete.
1390          * This ensures that there is no async metaslab prefetching, by
1391          * calling taskq_wait(mg_taskq).
1392          */
1393         if (spa->spa_root_vdev != NULL) {
1394                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1395                 for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++)
1396                         vdev_metaslab_fini(spa->spa_root_vdev->vdev_child[c]);
1397                 spa_config_exit(spa, SCL_ALL, FTAG);
1398         }
1399 
1400         /*
1401          * Wait for any outstanding async I/O to complete.
1402          */
1403         if (spa->spa_async_zio_root != NULL) {
1404                 for (int i = 0; i < max_ncpus; i++)
1405                         (void) zio_wait(spa->spa_async_zio_root[i]);
1406                 kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
1407                 spa->spa_async_zio_root = NULL;
1408         }
1409 
1410         bpobj_close(&spa->spa_deferred_bpobj);
1411 
1412         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1413 
1414         /*
1415          * Stop autotrim tasks.
1416          */
1417         mutex_enter(&spa->spa_auto_trim_lock);
1418         if (spa->spa_auto_trim_taskq)
1419                 spa_auto_trim_taskq_destroy(spa);
1420         mutex_exit(&spa->spa_auto_trim_lock);
1421 
1422         /*
1423          * Close all vdevs.
1424          */
1425         if (spa->spa_root_vdev)
1426                 vdev_free(spa->spa_root_vdev);
1427         ASSERT(spa->spa_root_vdev == NULL);
1428 
1429         /*
1430          * Close the dsl pool.
1431          */
1432         if (spa->spa_dsl_pool) {
1433                 dsl_pool_close(spa->spa_dsl_pool);
1434                 spa->spa_dsl_pool = NULL;
1435                 spa->spa_meta_objset = NULL;
1436         }
1437 
1438         ddt_unload(spa);
1439 
1440         /*
1441          * Drop and purge level 2 cache
1442          */


1455         }
1456         spa->spa_spares.sav_count = 0;
1457 
1458         for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
1459                 vdev_clear_stats(spa->spa_l2cache.sav_vdevs[i]);
1460                 vdev_free(spa->spa_l2cache.sav_vdevs[i]);
1461         }
1462         if (spa->spa_l2cache.sav_vdevs) {
1463                 kmem_free(spa->spa_l2cache.sav_vdevs,
1464                     spa->spa_l2cache.sav_count * sizeof (void *));
1465                 spa->spa_l2cache.sav_vdevs = NULL;
1466         }
1467         if (spa->spa_l2cache.sav_config) {
1468                 nvlist_free(spa->spa_l2cache.sav_config);
1469                 spa->spa_l2cache.sav_config = NULL;
1470         }
1471         spa->spa_l2cache.sav_count = 0;
1472 
1473         spa->spa_async_suspended = 0;
1474 
1475         if (spa->spa_comment != NULL) {
1476                 spa_strfree(spa->spa_comment);
1477                 spa->spa_comment = NULL;
1478         }
1479 
1480         spa_config_exit(spa, SCL_ALL, FTAG);
1481 }
1482 
1483 /*
1484  * Load (or re-load) the current list of vdevs describing the active spares for
1485  * this pool.  When this is called, we have some form of basic information in
1486  * 'spa_spares.sav_config'.  We parse this into vdevs, try to open them, and
1487  * then re-generate a more complete list including status information.
1488  */
1489 static void
1490 spa_load_spares(spa_t *spa)
1491 {
1492         nvlist_t **spares;
1493         uint_t nspares;
1494         int i;
1495         vdev_t *vd, *tvd;
1496 
1497         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1498 
1499         /*
1500          * First, close and free any existing spare vdevs.
1501          */
1502         for (i = 0; i < spa->spa_spares.sav_count; i++) {
1503                 vd = spa->spa_spares.sav_vdevs[i];
1504 
1505                 /* Undo the call to spa_activate() below */
1506                 if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1507                     B_FALSE)) != NULL && tvd->vdev_isspare)
1508                         spa_spare_remove(tvd);
1509                 vdev_close(vd);


1586         spares = kmem_alloc(spa->spa_spares.sav_count * sizeof (void *),
1587             KM_SLEEP);
1588         for (i = 0; i < spa->spa_spares.sav_count; i++)
1589                 spares[i] = vdev_config_generate(spa,
1590                     spa->spa_spares.sav_vdevs[i], B_TRUE, VDEV_CONFIG_SPARE);
1591         VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
1592             ZPOOL_CONFIG_SPARES, spares, spa->spa_spares.sav_count) == 0);
1593         for (i = 0; i < spa->spa_spares.sav_count; i++)
1594                 nvlist_free(spares[i]);
1595         kmem_free(spares, spa->spa_spares.sav_count * sizeof (void *));
1596 }
1597 
1598 /*
1599  * Load (or re-load) the current list of vdevs describing the active l2cache for
1600  * this pool.  When this is called, we have some form of basic information in
1601  * 'spa_l2cache.sav_config'.  We parse this into vdevs, try to open them, and
1602  * then re-generate a more complete list including status information.
1603  * Devices which are already active have their details maintained, and are
1604  * not re-opened.
1605  */
1606 static void
1607 spa_load_l2cache(spa_t *spa)
1608 {
1609         nvlist_t **l2cache;
1610         uint_t nl2cache;
1611         int i, j, oldnvdevs;
1612         uint64_t guid;
1613         vdev_t *vd, **oldvdevs, **newvdevs;
1614         spa_aux_vdev_t *sav = &spa->spa_l2cache;
1615 
1616         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1617 
1618         if (sav->sav_config != NULL) {
1619                 VERIFY(nvlist_lookup_nvlist_array(sav->sav_config,
1620                     ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
1621                 newvdevs = kmem_alloc(nl2cache * sizeof (void *), KM_SLEEP);
1622         } else {
1623                 nl2cache = 0;
1624                 newvdevs = NULL;
1625         }
1626 


1657                             VDEV_ALLOC_L2CACHE) == 0);
1658                         ASSERT(vd != NULL);
1659                         newvdevs[i] = vd;
1660 
1661                         /*
1662                          * Commit this vdev as an l2cache device,
1663                          * even if it fails to open.
1664                          */
1665                         spa_l2cache_add(vd);
1666 
1667                         vd->vdev_top = vd;
1668                         vd->vdev_aux = sav;
1669 
1670                         spa_l2cache_activate(vd);
1671 
1672                         if (vdev_open(vd) != 0)
1673                                 continue;
1674 
1675                         (void) vdev_validate_aux(vd);
1676 
1677                         if (!vdev_is_dead(vd)) {
1678                                 boolean_t do_rebuild = B_FALSE;
1679 
1680                                 (void) nvlist_lookup_boolean_value(l2cache[i],
1681                                     ZPOOL_CONFIG_L2CACHE_PERSISTENT,
1682                                     &do_rebuild);
1683                                 l2arc_add_vdev(spa, vd, do_rebuild);
1684                         }
1685                 }
1686         }
1687 
1688         /*
1689          * Purge vdevs that were dropped
1690          */
1691         for (i = 0; i < oldnvdevs; i++) {
1692                 uint64_t pool;
1693 
1694                 vd = oldvdevs[i];
1695                 if (vd != NULL) {
1696                         ASSERT(vd->vdev_isl2cache);
1697 
1698                         if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
1699                             pool != 0ULL && l2arc_vdev_present(vd))
1700                                 l2arc_remove_vdev(vd);
1701                         vdev_clear_stats(vd);
1702                         vdev_free(vd);
1703                 }
1704         }
1705 
1706         if (oldvdevs)


1742         *value = NULL;
1743 
1744         error = dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db);
1745         if (error != 0)
1746                 return (error);
1747 
1748         nvsize = *(uint64_t *)db->db_data;
1749         dmu_buf_rele(db, FTAG);
1750 
1751         packed = kmem_alloc(nvsize, KM_SLEEP);
1752         error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
1753             DMU_READ_PREFETCH);
1754         if (error == 0)
1755                 error = nvlist_unpack(packed, nvsize, value, 0);
1756         kmem_free(packed, nvsize);
1757 
1758         return (error);
1759 }
1760 
1761 /*
1762  * Checks to see if the given vdev could not be opened, in which case we post a
1763  * sysevent to notify the autoreplace code that the device has been removed.
1764  */
1765 static void
1766 spa_check_removed(vdev_t *vd)
1767 {
1768         for (int c = 0; c < vd->vdev_children; c++)
1769                 spa_check_removed(vd->vdev_child[c]);
1770 
1771         if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
1772             !vd->vdev_ishole) {
1773                 zfs_post_autoreplace(vd->vdev_spa, vd);
1774                 spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
1775         }
1776 }
1777 
1778 static void
1779 spa_config_valid_zaps(vdev_t *vd, vdev_t *mvd)
1780 {
1781         ASSERT3U(vd->vdev_children, ==, mvd->vdev_children);
1782 
1783         vd->vdev_top_zap = mvd->vdev_top_zap;
1784         vd->vdev_leaf_zap = mvd->vdev_leaf_zap;
1785 
1786         for (uint64_t i = 0; i < vd->vdev_children; i++) {
1787                 spa_config_valid_zaps(vd->vdev_child[i], mvd->vdev_child[i]);
1788         }
1789 }
1790 
1791 /*
1792  * Validate the current config against the MOS config
1793  */
1794 static boolean_t
1795 spa_config_valid(spa_t *spa, nvlist_t *config)
1796 {
1797         vdev_t *mrvd, *rvd = spa->spa_root_vdev;
1798         nvlist_t *nv;
1799 
1800         VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nv) == 0);
1801 
1802         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1803         VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
1804 
1805         /*
1806          * One of the earliest signs of a stale config is a mismatch
1807          * in the number of child vdevs.
1808          */
1809         if (rvd->vdev_children != mrvd->vdev_children) {
1810                 vdev_free(mrvd);
1811                 spa_config_exit(spa, SCL_ALL, FTAG);
1812                 return (B_FALSE);
1813         }
1814         /*
1815          * If we're doing a normal import, then build up any additional
1816          * diagnostic information about missing devices in this config.
1817          * We'll pass this up to the user for further processing.
1818          */
1819         if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
1820                 nvlist_t **child, *nv;
1821                 uint64_t idx = 0;
1822 
1823                 child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **),
1824                     KM_SLEEP);
1825                 VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
1826 
1827                 for (int c = 0; c < rvd->vdev_children; c++) {
1828                         vdev_t *tvd = rvd->vdev_child[c];
1829                         vdev_t *mtvd  = mrvd->vdev_child[c];
1830 
1831                         if (tvd->vdev_ops == &vdev_missing_ops &&
1832                             mtvd->vdev_ops != &vdev_missing_ops &&
1833                             mtvd->vdev_islog)
1834                                 child[idx++] = vdev_config_generate(spa, mtvd,
1835                                     B_FALSE, 0);
1836                 }
1837 
1838                 if (idx) {
1839                         VERIFY(nvlist_add_nvlist_array(nv,
1840                             ZPOOL_CONFIG_CHILDREN, child, idx) == 0);
1841                         VERIFY(nvlist_add_nvlist(spa->spa_load_info,
1842                             ZPOOL_CONFIG_MISSING_DEVICES, nv) == 0);
1843 
1844                         for (int i = 0; i < idx; i++)
1845                                 nvlist_free(child[i]);
1846                 }
1847                 nvlist_free(nv);
1848                 kmem_free(child, rvd->vdev_children * sizeof (char **));
1849         }
1850 
1851         /*
1852          * Compare the root vdev tree with the information we have
1853          * from the MOS config (mrvd). Check each top-level vdev
1854          * with the corresponding MOS config top-level (mtvd).
1855          */
1856         for (int c = 0; c < rvd->vdev_children; c++) {
1857                 vdev_t *tvd = rvd->vdev_child[c];
1858                 vdev_t *mtvd  = mrvd->vdev_child[c];
1859 
1860                 /*
1861                  * Resolve any "missing" vdevs in the current configuration.
1862                  * If we find that the MOS config has more accurate information
1863                  * about the top-level vdev, then use that vdev instead.
1864                  */
1865                 if (tvd->vdev_ops == &vdev_missing_ops &&
1866                     mtvd->vdev_ops != &vdev_missing_ops) {
1867 
1868                         if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG))
1869                                 continue;
1870 
1871                         /*
1872                          * Device specific actions.
1873                          */
1874                         if (mtvd->vdev_islog) {
1875                                 spa_set_log_state(spa, SPA_LOG_CLEAR);
1876                         } else {
1877                                 /*
1878                                  * XXX - once we have 'readonly' pool
1879                                  * support we should be able to handle
1880                                  * missing data devices by transitioning
1881                                  * the pool to readonly.
1882                                  */
1883                                 continue;
1884                         }
1885 
1886                         /*
1887                          * Swap the missing vdev with the data we were
1888                          * able to obtain from the MOS config.
1889                          */
1890                         vdev_remove_child(rvd, tvd);
1891                         vdev_remove_child(mrvd, mtvd);
1892 
1893                         vdev_add_child(rvd, mtvd);
1894                         vdev_add_child(mrvd, tvd);
1895 
1896                         spa_config_exit(spa, SCL_ALL, FTAG);
1897                         vdev_load(mtvd);
1898                         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1899 
1900                         vdev_reopen(rvd);
1901                 } else {
1902                         if (mtvd->vdev_islog) {
1903                                 /*
1904                                  * Load the slog device's state from the MOS
1905                                  * config since it's possible that the label
1906                                  * does not contain the most up-to-date
1907                                  * information.
1908                                  */
1909                                 vdev_load_log_state(tvd, mtvd);
1910                                 vdev_reopen(tvd);
1911                         }
1912 
1913                         /*
1914                          * Per-vdev ZAP info is stored exclusively in the MOS.
1915                          */
1916                         spa_config_valid_zaps(tvd, mtvd);
1917                 }
1918         }
1919 
1920         vdev_free(mrvd);
1921         spa_config_exit(spa, SCL_ALL, FTAG);
1922 
1923         /*
1924          * Ensure we were able to validate the config.
1925          */
1926         return (rvd->vdev_guid_sum == spa->spa_uberblock.ub_guid_sum);
1927 }
1928 
1929 /*
1930  * Check for missing log devices
1931  */
1932 static boolean_t
1933 spa_check_logs(spa_t *spa)
1934 {
1935         boolean_t rv = B_FALSE;
1936         dsl_pool_t *dp = spa_get_dsl(spa);
1937 
1938         switch (spa->spa_log_state) {
1939         case SPA_LOG_MISSING:
1940                 /* need to recheck in case slog has been restored */
1941         case SPA_LOG_UNKNOWN:
1942                 rv = (dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1943                     zil_check_log_chain, NULL, DS_FIND_CHILDREN) != 0);
1944                 if (rv)
1945                         spa_set_log_state(spa, SPA_LOG_MISSING);
1946                 break;


1972         return (slog_found);
1973 }
1974 
1975 static void
1976 spa_activate_log(spa_t *spa)
1977 {
1978         vdev_t *rvd = spa->spa_root_vdev;
1979 
1980         ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1981 
1982         for (int c = 0; c < rvd->vdev_children; c++) {
1983                 vdev_t *tvd = rvd->vdev_child[c];
1984                 metaslab_group_t *mg = tvd->vdev_mg;
1985 
1986                 if (tvd->vdev_islog)
1987                         metaslab_group_activate(mg);
1988         }
1989 }
1990 
1991 int
1992 spa_offline_log(spa_t *spa)
1993 {
1994         int error;
1995 
1996         error = dmu_objset_find(spa_name(spa), zil_vdev_offline,
1997             NULL, DS_FIND_CHILDREN);
1998         if (error == 0) {
1999                 /*
2000                  * We successfully offlined the log device, sync out the
2001                  * current txg so that the "stubby" block can be removed
2002                  * by zil_sync().
2003                  */
2004                 txg_wait_synced(spa->spa_dsl_pool, 0);
2005         }
2006         return (error);
2007 }
2008 
2009 static void
2010 spa_aux_check_removed(spa_aux_vdev_t *sav)
2011 {
2012         for (int i = 0; i < sav->sav_count; i++)
2013                 spa_check_removed(sav->sav_vdevs[i]);
2014 }
2015 
2016 void


2026                 spa->spa_claim_max_txg = zio->io_bp->blk_birth;
2027         mutex_exit(&spa->spa_props_lock);
2028 }
2029 
2030 typedef struct spa_load_error {
2031         uint64_t        sle_meta_count;
2032         uint64_t        sle_data_count;
2033 } spa_load_error_t;
2034 
2035 static void
2036 spa_load_verify_done(zio_t *zio)
2037 {
2038         blkptr_t *bp = zio->io_bp;
2039         spa_load_error_t *sle = zio->io_private;
2040         dmu_object_type_t type = BP_GET_TYPE(bp);
2041         int error = zio->io_error;
2042         spa_t *spa = zio->io_spa;
2043 
2044         abd_free(zio->io_abd);
2045         if (error) {
2046                 if (BP_IS_METADATA(bp) && type != DMU_OT_INTENT_LOG)
2047                         atomic_inc_64(&sle->sle_meta_count);
2048                 else
2049                         atomic_inc_64(&sle->sle_data_count);
2050         }
2051 
2052         mutex_enter(&spa->spa_scrub_lock);
2053         spa->spa_scrub_inflight--;
2054         cv_broadcast(&spa->spa_scrub_io_cv);
2055         mutex_exit(&spa->spa_scrub_lock);
2056 }
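/*
 * Note: the issuing side of this verification pass throttles itself on
 * spa_scrub_inflight (bounded by spa_load_verify_maxinflight below); the
 * done callback above drops the in-flight count and signals
 * spa_scrub_io_cv so the issuer can continue.  The issuing callback,
 * spa_load_verify_cb, is not shown in this excerpt.
 */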
2057 
2058 /*
2059  * Maximum number of concurrent scrub i/os to create while verifying
2060  * a pool during import.
2061  */
2062 int spa_load_verify_maxinflight = 10000;
2063 boolean_t spa_load_verify_metadata = B_TRUE;
2064 boolean_t spa_load_verify_data = B_TRUE;
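/*
 * Tuning note: the three variables above are module-global tunables rather
 * than pool properties.  On illumos-derived systems they can typically be
 * adjusted through /etc/system; the stanza below is illustrative only and
 * assumes the standard "zfs" module name:
 *
 *	set zfs:spa_load_verify_maxinflight = 1000
 *	set zfs:spa_load_verify_data = 0
 */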
2065 
2066 /*ARGSUSED*/


2115         boolean_t verify_ok = B_FALSE;
2116         int error = 0;
2117 
2118         zpool_get_rewind_policy(spa->spa_config, &policy);
2119 
2120         if (policy.zrp_request & ZPOOL_NEVER_REWIND)
2121                 return (0);
2122 
2123         dsl_pool_config_enter(spa->spa_dsl_pool, FTAG);
2124         error = dmu_objset_find_dp(spa->spa_dsl_pool,
2125             spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
2126             DS_FIND_CHILDREN);
2127         dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
2128         if (error != 0)
2129                 return (error);
2130 
2131         rio = zio_root(spa, NULL, &sle,
2132             ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
2133 
2134         if (spa_load_verify_metadata) {
2135                 zbookmark_phys_t zb = { 0 };
2136                 error = traverse_pool(spa, spa->spa_verify_min_txg, UINT64_MAX,
2137                     TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA,
2138                     spa_load_verify_cb, rio, &zb);
2139         }
2140 
2141         (void) zio_wait(rio);
2142 
2143         spa->spa_load_meta_errors = sle.sle_meta_count;
2144         spa->spa_load_data_errors = sle.sle_data_count;
2145 
2146         if (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2147             sle.sle_data_count <= policy.zrp_maxdata) {
2148                 int64_t loss = 0;
2149 
2150                 verify_ok = B_TRUE;
2151                 spa->spa_load_txg = spa->spa_uberblock.ub_txg;
2152                 spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
2153 
2154                 loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
2155                 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2156                     ZPOOL_CONFIG_LOAD_TIME, spa->spa_load_txg_ts) == 0);
2157                 VERIFY(nvlist_add_int64(spa->spa_load_info,
2158                     ZPOOL_CONFIG_REWIND_TIME, loss) == 0);
2159                 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2160                     ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
2161         } else {
2162                 spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
2163         }
2164 
2165         if (error) {
2166                 if (error != ENXIO && error != EIO)
2167                         error = SET_ERROR(EIO);
2168                 return (error);
2169         }
2170 
2171         return (verify_ok ? 0 : EIO);
2172 }
2173 
2174 /*
2175  * Find a value in the pool props object.
2176  */
2177 static void
2178 spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
2179 {
2180         (void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
2181             zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
2182 }
2183 
2184 /*
2185  * Find a value in the pool directory object.
2186  */
2187 static int
2188 spa_dir_prop(spa_t *spa, const char *name, uint64_t *val)
2189 {
2190         return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2191             name, sizeof (uint64_t), 1, val));
2192 }
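/*
 * Note: callers below generally treat ENOENT from spa_dir_prop() as "entry
 * not present", which is expected for pools created before the corresponding
 * object existed; only other errors are fatal.  For example:
 *
 *	error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate);
 *	if (error != 0 && error != ENOENT)
 *		return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
 */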
2193 
2194 static void
2195 spa_set_ddt_classes(spa_t *spa, int desegregation)
2196 {
2197         /*
2198          * If desegregation is turned on, set up the ddt_class restrictions.
2199          */
2200         if (desegregation) {
2201                 spa->spa_ddt_class_min = DDT_CLASS_DUPLICATE;
2202                 spa->spa_ddt_class_max = DDT_CLASS_DUPLICATE;
2203         } else {
2204                 spa->spa_ddt_class_min = DDT_CLASS_DITTO;
2205                 spa->spa_ddt_class_max = DDT_CLASS_UNIQUE;
2206         }
2207 }
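/*
 * Note: with desegregation on, only the DUPLICATE DDT class is used
 * (min == max == DDT_CLASS_DUPLICATE); with it off, the full range from
 * DDT_CLASS_DITTO through DDT_CLASS_UNIQUE is permitted.
 */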
2208 
2209 static int
2210 spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
2211 {
2212         vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
2213         return (err);
2214 }
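/*
 * Note: spa_vdev_err() marks the root vdev VDEV_STATE_CANT_OPEN with the
 * supplied aux reason and passes the errno straight through, which lets the
 * load-time failure paths below be written as a single statement, e.g.:
 *
 *	return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
 */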
2215 
2216 /*
2217  * Fix up config after a partly-completed split.  This is done with the
2218  * ZPOOL_CONFIG_SPLIT nvlist.  Both the splitting pool and the split-off
2219  * pool have that entry in their config, but only the splitting one contains
2220  * a list of all the guids of the vdevs that are being split off.
2221  *
2222  * This function determines what to do with that list: either rejoin
2223  * all the disks to the pool, or complete the splitting process.  To attempt
2224  * the rejoin, each disk that is offlined is marked online again, and
2225  * we do a reopen() call.  If the vdev label for every disk that was
2226  * marked online indicates it was successfully split off (VDEV_AUX_SPLIT_POOL)
2227  * then we call vdev_split() on each disk, and complete the split.
2228  *
2229  * Otherwise we leave the config alone, with all the vdevs in place in
2230  * the original pool.
2231  */
2232 static void
2233 spa_try_repair(spa_t *spa, nvlist_t *config)
2234 {
2235         uint_t extracted;


2279                         ++extracted;
2280                 }
2281         }
2282 
2283         /*
2284          * If every disk has been moved to the new pool, or if we never
2285          * even attempted to look at them, then we split them off for
2286          * good.
2287          */
2288         if (!attempt_reopen || gcount == extracted) {
2289                 for (i = 0; i < gcount; i++)
2290                         if (vd[i] != NULL)
2291                                 vdev_split(vd[i]);
2292                 vdev_reopen(spa->spa_root_vdev);
2293         }
2294 
2295         kmem_free(vd, gcount * sizeof (vdev_t *));
2296 }
2297 
2298 static int
2299 spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type,
2300     boolean_t mosconfig)
2301 {
2302         nvlist_t *config = spa->spa_config;
2303         char *ereport = FM_EREPORT_ZFS_POOL;
2304         char *comment;
2305         int error;
2306         uint64_t pool_guid;
2307         nvlist_t *nvl;
2308 
2309         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid))
2310                 return (SET_ERROR(EINVAL));
2311 
2312         ASSERT(spa->spa_comment == NULL);
2313         if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2314                 spa->spa_comment = spa_strdup(comment);
2315 
2316         /*
2317          * Versioning wasn't explicitly added to the label until later, so if
2318          * it's not present treat it as the initial version.
2319          */
2320         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2321             &spa->spa_ubsync.ub_version) != 0)
2322                 spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
2323 
2324         (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2325             &spa->spa_config_txg);
2326 
2327         if ((state == SPA_LOAD_IMPORT || state == SPA_LOAD_TRYIMPORT) &&
2328             spa_guid_exists(pool_guid, 0)) {
2329                 error = SET_ERROR(EEXIST);
2330         } else {
2331                 spa->spa_config_guid = pool_guid;
2332 
2333                 if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT,
2334                     &nvl) == 0) {
2335                         VERIFY(nvlist_dup(nvl, &spa->spa_config_splitting,
2336                             KM_SLEEP) == 0);
2337                 }
2338 
2339                 nvlist_free(spa->spa_load_info);
2340                 spa->spa_load_info = fnvlist_alloc();
2341 
2342                 gethrestime(&spa->spa_loaded_ts);
2343                 error = spa_load_impl(spa, pool_guid, config, state, type,
2344                     mosconfig, &ereport);
2345         }
2346 
2347         /*
2348          * Don't count references from objsets that are already closed
2349          * and are making their way through the eviction process.
2350          */
2351         spa_evicting_os_wait(spa);
2352         spa->spa_minref = refcount_count(&spa->spa_refcount);
2353         if (error) {
2354                 if (error != EEXIST) {
2355                         spa->spa_loaded_ts.tv_sec = 0;
2356                         spa->spa_loaded_ts.tv_nsec = 0;
2357                 }
2358                 if (error != EBADF) {
2359                         zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
2360                 }
2361         }
2362         spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
2363         spa->spa_ena = 0;
2364         return (error);
2365 }
2366 
2367 /*
2368  * Count the number of per-vdev ZAPs associated with all of the vdevs in the
2369  * vdev tree rooted in the given vd, and ensure that each ZAP is present in the
2370  * spa's per-vdev ZAP list.
2371  */
2372 static uint64_t
2373 vdev_count_verify_zaps(vdev_t *vd)
2374 {
2375         spa_t *spa = vd->vdev_spa;
2376         uint64_t total = 0;
2377         if (vd->vdev_top_zap != 0) {
2378                 total++;
2379                 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2380                     spa->spa_all_vdev_zaps, vd->vdev_top_zap));
2381         }
2382         if (vd->vdev_leaf_zap != 0) {
2383                 total++;
2384                 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2385                     spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
2386         }
2387 
2388         for (uint64_t i = 0; i < vd->vdev_children; i++) {
2389                 total += vdev_count_verify_zaps(vd->vdev_child[i]);
2390         }
2391 
2392         return (total);
2393 }
2394 
2395 /*
2396  * Load an existing storage pool, using the pool's builtin spa_config as a
2397  * source of configuration information.
2398  */
2399 static int
2400 spa_load_impl(spa_t *spa, uint64_t pool_guid, nvlist_t *config,
2401     spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
2402     char **ereport)
2403 {
2404         int error = 0;
2405         nvlist_t *nvroot = NULL;
2406         nvlist_t *label;
2407         vdev_t *rvd;
2408         uberblock_t *ub = &spa->spa_uberblock;
2409         uint64_t children, config_cache_txg = spa->spa_config_txg;
2410         int orig_mode = spa->spa_mode;
2411         int parse;
2412         uint64_t obj;
2413         boolean_t missing_feat_write = B_FALSE;
2414         spa_meta_placement_t *mp;
2415 
2416         /*
2417          * If this is an untrusted config, access the pool in read-only mode.
2418          * This prevents things like resilvering recently removed devices.
2419          */
2420         if (!mosconfig)
2421                 spa->spa_mode = FREAD;
2422 
2423         ASSERT(MUTEX_HELD(&spa_namespace_lock));
2424 
2425         spa->spa_load_state = state;
2426 
2427         if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot))
2428                 return (SET_ERROR(EINVAL));
2429 
2430         parse = (type == SPA_IMPORT_EXISTING ?
2431             VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2432 
2433         /*
2434          * Create "The Godfather" zio to hold all async IOs
2435          */
2436         spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
2437             KM_SLEEP);
2438         for (int i = 0; i < max_ncpus; i++) {
2439                 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
2440                     ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
2441                     ZIO_FLAG_GODFATHER);
2442         }
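        /*
         * Note: an array of root zios, one slot per CPU, is used here so
         * that async I/Os attaching to a "godfather" parent are spread
         * across per-CPU parents instead of all contending on a single
         * root zio.
         */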
2443 
2444         /*
2445          * Parse the configuration into a vdev tree.  We explicitly set the
2446          * value that will be returned by spa_version() since parsing the
2447          * configuration requires knowing the version number.
2448          */
2449         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2450         error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, parse);
2451         spa_config_exit(spa, SCL_ALL, FTAG);
2452 
2453         if (error != 0)
2454                 return (error);
2455 
2456         ASSERT(spa->spa_root_vdev == rvd);
2457         ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
2458         ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
2459 
2460         if (type != SPA_IMPORT_ASSEMBLE) {
2461                 ASSERT(spa_guid(spa) == pool_guid);
2462         }
2463 
2464         /*
2465          * Try to open all vdevs, loading each label in the process.
2466          */
2467         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2468         error = vdev_open(rvd);
2469         spa_config_exit(spa, SCL_ALL, FTAG);
2470         if (error != 0)
2471                 return (error);
2472 
2473         /*
2474          * We need to validate the vdev labels against the configuration that
2475          * we have in hand, which is dependent on the setting of mosconfig. If
2476          * mosconfig is true then we're validating the vdev labels based on
2477          * that config.  Otherwise, we're validating against the cached config
2478          * (zpool.cache) that was read when we loaded the zfs module, and then
2479          * later we will recursively call spa_load() and validate against
2480          * the vdev config.
2481          *
2482          * If we're assembling a new pool that's been split off from an
2483          * existing pool, the labels haven't yet been updated so we skip
2484          * validation for now.
2485          */
2486         if (type != SPA_IMPORT_ASSEMBLE) {
2487                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2488                 error = vdev_validate(rvd, mosconfig);
2489                 spa_config_exit(spa, SCL_ALL, FTAG);
2490 
2491                 if (error != 0)
2492                         return (error);
2493 
2494                 if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
2495                         return (SET_ERROR(ENXIO));
2496         }
2497 
2498         /*
2499          * Find the best uberblock.
2500          */
2501         vdev_uberblock_load(rvd, ub, &label);
2502 
2503         /*
2504          * If we weren't able to find a single valid uberblock, return failure.
2505          */
2506         if (ub->ub_txg == 0) {
2507                 nvlist_free(label);
2508                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
2509         }
2510 
2511         /*
2512          * If the pool has an unsupported version we can't open it.
2513          */
2514         if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
2515                 nvlist_free(label);
2516                 return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
2517         }
2518 
2519         if (ub->ub_version >= SPA_VERSION_FEATURES) {
2520                 nvlist_t *features;
2521 
2522                 /*
2523                  * If we weren't able to find what's necessary for reading the
2524                  * MOS in the label, return failure.
2525                  */
2526                 if (label == NULL || nvlist_lookup_nvlist(label,
2527                     ZPOOL_CONFIG_FEATURES_FOR_READ, &features) != 0) {
2528                         nvlist_free(label);
2529                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2530                             ENXIO));
2531                 }
2532 
2533                 /*
2534                  * Update our in-core representation with the definitive values
2535                  * from the label.
2536                  */
2537                 nvlist_free(spa->spa_label_features);
2538                 VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0);
2539         }
2540 
2541         nvlist_free(label);
2542 
2543         /*
2544          * Look through entries in the label nvlist's features_for_read. If
2545          * there is a feature listed there which we don't understand then we
2546          * cannot open a pool.
2547          */
2548         if (ub->ub_version >= SPA_VERSION_FEATURES) {
2549                 nvlist_t *unsup_feat;
2550 
2551                 VERIFY(nvlist_alloc(&unsup_feat, NV_UNIQUE_NAME, KM_SLEEP) ==
2552                     0);
2553 
2554                 for (nvpair_t *nvp = nvlist_next_nvpair(spa->spa_label_features,
2555                     NULL); nvp != NULL;
2556                     nvp = nvlist_next_nvpair(spa->spa_label_features, nvp)) {
2557                         if (!zfeature_is_supported(nvpair_name(nvp))) {
2558                                 VERIFY(nvlist_add_string(unsup_feat,
2559                                     nvpair_name(nvp), "") == 0);
2560                         }
2561                 }
2562 
2563                 if (!nvlist_empty(unsup_feat)) {
2564                         VERIFY(nvlist_add_nvlist(spa->spa_load_info,
2565                             ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0);
2566                         nvlist_free(unsup_feat);
2567                         return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2568                             ENOTSUP));
2569                 }
2570 
2571                 nvlist_free(unsup_feat);
2572         }
2573 
2574         /*
2575          * If the vdev guid sum doesn't match the uberblock, we have an
2576          * incomplete configuration.  We first check to see if the pool
2577          * is aware of the complete config (i.e. ZPOOL_CONFIG_VDEV_CHILDREN).
2578          * If it is, defer the vdev_guid_sum check till later so we
2579          * can handle missing vdevs.
2580          */
2581         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
2582             &children) != 0 && mosconfig && type != SPA_IMPORT_ASSEMBLE &&
2583             rvd->vdev_guid_sum != ub->ub_guid_sum)
2584                 return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2585 
2586         if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
2587                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2588                 spa_try_repair(spa, config);
2589                 spa_config_exit(spa, SCL_ALL, FTAG);
2590                 nvlist_free(spa->spa_config_splitting);
2591                 spa->spa_config_splitting = NULL;
2592         }
2593 
2594         /*
2595          * Initialize internal SPA structures.
2596          */
2597         spa->spa_state = POOL_STATE_ACTIVE;
2598         spa->spa_ubsync = spa->spa_uberblock;
2599         spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
2600             TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
2601         spa->spa_first_txg = spa->spa_last_ubsync_txg ?
2602             spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
2603         spa->spa_claim_max_txg = spa->spa_first_txg;
2604         spa->spa_prev_software_version = ub->ub_software_version;
2605 
2606         error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
2607         if (error)
2608                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2609         spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
2610 
2611         if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object) != 0)
2612                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2613 
2614         if (spa_version(spa) >= SPA_VERSION_FEATURES) {
2615                 boolean_t missing_feat_read = B_FALSE;
2616                 nvlist_t *unsup_feat, *enabled_feat;
2617 
2618                 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
2619                     &spa->spa_feat_for_read_obj) != 0) {
2620                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2621                 }
2622 
2623                 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
2624                     &spa->spa_feat_for_write_obj) != 0) {
2625                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2626                 }
2627 
2628                 if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
2629                     &spa->spa_feat_desc_obj) != 0) {
2630                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2631                 }
2632 
2633                 enabled_feat = fnvlist_alloc();
2634                 unsup_feat = fnvlist_alloc();
2635 
2636                 if (!spa_features_check(spa, B_FALSE,
2637                     unsup_feat, enabled_feat))
2638                         missing_feat_read = B_TRUE;
2639 
2640                 if (spa_writeable(spa) || state == SPA_LOAD_TRYIMPORT) {
2641                         if (!spa_features_check(spa, B_TRUE,
2642                             unsup_feat, enabled_feat)) {
2643                                 missing_feat_write = B_TRUE;
2644                         }
2645                 }
2646 
2647                 fnvlist_add_nvlist(spa->spa_load_info,
2648                     ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
2649 
2650                 if (!nvlist_empty(unsup_feat)) {
2651                         fnvlist_add_nvlist(spa->spa_load_info,
2652                             ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
2653                 }
2654 
2655                 fnvlist_free(enabled_feat);
2656                 fnvlist_free(unsup_feat);
2657 
2658                 if (!missing_feat_read) {
2659                         fnvlist_add_boolean(spa->spa_load_info,
2660                             ZPOOL_CONFIG_CAN_RDONLY);
2661                 }
2662 
2663                 /*
2664                  * If the state is SPA_LOAD_TRYIMPORT, our objective is
2665                  * twofold: to determine whether the pool is available for
2666                  * import in read-write mode and (if it is not) whether the
2667                  * pool is available for import in read-only mode. If the pool
2668                  * is available for import in read-write mode, it is displayed
2669                  * as available in userland; if it is not available for import
2670                  * in read-only mode, it is displayed as unavailable in
2671                  * userland. If the pool is available for import in read-only
2672                  * mode but not read-write mode, it is displayed as unavailable
2673                  * in userland with a special note that the pool is actually
2674                  * available for open in read-only mode.
2675                  *
2676                  * As a result, if the state is SPA_LOAD_TRYIMPORT and we are
2677                  * missing a feature for write, we must first determine whether
2678                  * the pool can be opened read-only before returning to
2679                  * userland in order to know whether to display the
2680                  * abovementioned note.
2681                  */
2682                 if (missing_feat_read || (missing_feat_write &&
2683                     spa_writeable(spa))) {
2684                         return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2685                             ENOTSUP));
2686                 }
2687 
2688                 /*
2689                  * Load refcounts for ZFS features from disk into an in-memory
2690                  * cache during SPA initialization.
2691                  */
2692                 for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
2693                         uint64_t refcount;
2694 
2695                         error = feature_get_refcount_from_disk(spa,
2696                             &spa_feature_table[i], &refcount);
2697                         if (error == 0) {
2698                                 spa->spa_feat_refcount_cache[i] = refcount;
2699                         } else if (error == ENOTSUP) {
2700                                 spa->spa_feat_refcount_cache[i] =
2701                                     SPA_FEATURE_DISABLED;
2702                         } else {
2703                                 return (spa_vdev_err(rvd,
2704                                     VDEV_AUX_CORRUPT_DATA, EIO));
2705                         }
2706                 }
2707         }
2708 
2709         if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
2710                 if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
2711                     &spa->spa_feat_enabled_txg_obj) != 0)
2712                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2713         }
2714 
2715         spa->spa_is_initializing = B_TRUE;
2716         error = dsl_pool_open(spa->spa_dsl_pool);
2717         spa->spa_is_initializing = B_FALSE;
2718         if (error != 0)
2719                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2720 
2721         if (!mosconfig) {
2722                 uint64_t hostid;
2723                 nvlist_t *policy = NULL, *nvconfig;
2724 
2725                 if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
2726                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2727 
2728                 if (!spa_is_root(spa) && nvlist_lookup_uint64(nvconfig,
2729                     ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2730                         char *hostname;
2731                         unsigned long myhostid = 0;
2732 
2733                         VERIFY(nvlist_lookup_string(nvconfig,
2734                             ZPOOL_CONFIG_HOSTNAME, &hostname) == 0);
2735 
2736 #ifdef  _KERNEL
2737                         myhostid = zone_get_hostid(NULL);
2738 #else   /* _KERNEL */
2739                         /*
2740                          * We're emulating the system's hostid in userland, so
2741                          * we can't use zone_get_hostid().
2742                          */
2743                         (void) ddi_strtoul(hw_serial, NULL, 10, &myhostid);
2744 #endif  /* _KERNEL */
2745                         if (hostid != 0 && myhostid != 0 &&
2746                             hostid != myhostid) {
2747                                 nvlist_free(nvconfig);
2748                                 cmn_err(CE_WARN, "pool '%s' could not be "
2749                                     "loaded as it was last accessed by "
2750                                     "another system (host: %s hostid: 0x%lx). "
2751                                     "See: http://illumos.org/msg/ZFS-8000-EY",
2752                                     spa_name(spa), hostname,
2753                                     (unsigned long)hostid);
2754                                 return (SET_ERROR(EBADF));
2755                         }
2756                 }
2757                 if (nvlist_lookup_nvlist(spa->spa_config,
2758                     ZPOOL_REWIND_POLICY, &policy) == 0)
2759                         VERIFY(nvlist_add_nvlist(nvconfig,
2760                             ZPOOL_REWIND_POLICY, policy) == 0);
2761 
2762                 spa_config_set(spa, nvconfig);
2763                 spa_unload(spa);
2764                 spa_deactivate(spa);
2765                 spa_activate(spa, orig_mode);
2766 
2767                 return (spa_load(spa, state, SPA_IMPORT_EXISTING, B_TRUE));
2768         }
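        /*
         * Note: the pass above with mosconfig == B_FALSE runs against the
         * cached, untrusted config in read-only mode.  Once the config
         * stored in the MOS has been installed via spa_config_set(),
         * spa_load() is re-entered with mosconfig == B_TRUE so the rest of
         * the load runs against the trusted on-disk configuration.
         */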
2769 
2770         /* Grab the secret checksum salt from the MOS. */
2771         error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2772             DMU_POOL_CHECKSUM_SALT, 1,
2773             sizeof (spa->spa_cksum_salt.zcs_bytes),
2774             spa->spa_cksum_salt.zcs_bytes);
2775         if (error == ENOENT) {
2776                 /* Generate a new salt for subsequent use */
2777                 (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
2778                     sizeof (spa->spa_cksum_salt.zcs_bytes));
2779         } else if (error != 0) {
2780                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2781         }
2782 
2783         if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj) != 0)
2784                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2785         error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
2786         if (error != 0)
2787                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2788 
2789         /*
2790          * Load the bit that tells us to use the new accounting function
2791          * (raid-z deflation).  If we have an older pool, this will not
2792          * be present.
2793          */
2794         error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate);
2795         if (error != 0 && error != ENOENT)
2796                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2797 
2798         error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
2799             &spa->spa_creation_version);
2800         if (error != 0 && error != ENOENT)
2801                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2802 
2803         /*
2804          * Load the persistent error log.  If we have an older pool, this will
2805          * not be present.
2806          */
2807         error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last);
2808         if (error != 0 && error != ENOENT)
2809                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2810 
2811         error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
2812             &spa->spa_errlog_scrub);
2813         if (error != 0 && error != ENOENT)
2814                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2815 
2816         /*
2817          * Load the history object.  If we have an older pool, this
2818          * will not be present.
2819          */
2820         error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history);
2821         if (error != 0 && error != ENOENT)
2822                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2823 
2824         /*
2825          * Load the per-vdev ZAP map. If we have an older pool, this will not
2826          * be present; in this case, defer its creation to a later time to
2827          * avoid dirtying the MOS this early / out of sync context. See
2828          * spa_sync_config_object.
2829          */
2830 
2831         /* The sentinel is only available in the MOS config. */
2832         nvlist_t *mos_config;
2833         if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0)
2834                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2835 
2836         error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
2837             &spa->spa_all_vdev_zaps);
2838 
2839         if (error == ENOENT) {
2840                 VERIFY(!nvlist_exists(mos_config,
2841                     ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
2842                 spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
2843                 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
2844         } else if (error != 0) {
2845                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2846         } else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
2847                 /*
2848                  * An older version of ZFS overwrote the sentinel value, so
2849                  * we have orphaned per-vdev ZAPs in the MOS. Defer their
2850                  * destruction to later; see spa_sync_config_object.
2851                  */
2852                 spa->spa_avz_action = AVZ_ACTION_DESTROY;
2853                 /*
2854                  * We're assuming that no vdevs have had their ZAPs created
2855                  * before this. Better be sure of it.
2856                  */
2857                 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
2858         }
2859         nvlist_free(mos_config);
2860 
2861         /*
2862          * If we're assembling the pool from the split-off vdevs of
2863          * an existing pool, we don't want to attach the spares & cache
2864          * devices.
2865          */
2866 
2867         /*
2868          * Load any hot spares for this pool.
2869          */
2870         error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object);
2871         if (error != 0 && error != ENOENT)
2872                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2873         if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
2874                 ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
2875                 if (load_nvlist(spa, spa->spa_spares.sav_object,
2876                     &spa->spa_spares.sav_config) != 0)
2877                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2878 
2879                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2880                 spa_load_spares(spa);
2881                 spa_config_exit(spa, SCL_ALL, FTAG);
2882         } else if (error == 0) {
2883                 spa->spa_spares.sav_sync = B_TRUE;
2884         }
2885 
2886         /*
2887          * Load any level 2 ARC devices for this pool.
2888          */
2889         error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
2890             &spa->spa_l2cache.sav_object);
2891         if (error != 0 && error != ENOENT)
2892                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2893         if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
2894                 ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
2895                 if (load_nvlist(spa, spa->spa_l2cache.sav_object,
2896                     &spa->spa_l2cache.sav_config) != 0)
2897                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2898 
2899                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2900                 spa_load_l2cache(spa);
2901                 spa_config_exit(spa, SCL_ALL, FTAG);
2902         } else if (error == 0) {
2903                 spa->spa_l2cache.sav_sync = B_TRUE;
2904         }
2905 
2906         mp = &spa->spa_meta_policy;
2907 
2908         spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
2909         spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
2910         spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
2911         spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
2912         spa->spa_dedup_lo_best_effort =
2913             zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
2914         spa->spa_dedup_hi_best_effort =
2915             zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
2916 
2917         mp->spa_enable_meta_placement_selection =
2918             zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
2919         mp->spa_sync_to_special =
2920             zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
2921         mp->spa_ddt_meta_to_special =
2922             zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
2923         mp->spa_zfs_meta_to_special =
2924             zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
2925         mp->spa_small_data_to_special =
2926             zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
2927         spa_set_ddt_classes(spa,
2928             zpool_prop_default_numeric(ZPOOL_PROP_DDT_DESEGREGATION));
2929 
2930         spa->spa_resilver_prio =
2931             zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
2932         spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
2933 
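        /*
         * Note: the assignments above establish the zpool property defaults;
         * if the pool-props object exists, the spa_prop_find() calls below
         * overwrite them with the on-disk values.  Entries missing from the
         * props ZAP simply leave the defaults in place, since spa_prop_find()
         * ignores a failed zap_lookup().
         */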
2934         error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object);
2935         if (error && error != ENOENT)
2936                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2937 
2938         if (error == 0) {
2939                 uint64_t autoreplace;
2940                 uint64_t val = 0;
2941 
2942                 spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
2943                 spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
2944                 spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
2945                 spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
2946                 spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
2947                 spa_prop_find(spa, ZPOOL_PROP_BOOTSIZE, &spa->spa_bootsize);
2948                 spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
2949                     &spa->spa_dedup_ditto);
2950                 spa_prop_find(spa, ZPOOL_PROP_FORCETRIM, &spa->spa_force_trim);
2951 
2952                 mutex_enter(&spa->spa_auto_trim_lock);
2953                 spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_auto_trim);
2954                 if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
2955                         spa_auto_trim_taskq_create(spa);
2956                 mutex_exit(&spa->spa_auto_trim_lock);
2957 
2958                 spa_prop_find(spa, ZPOOL_PROP_HIWATERMARK, &spa->spa_hiwat);
2959                 spa_prop_find(spa, ZPOOL_PROP_LOWATERMARK, &spa->spa_lowat);
2960                 spa_prop_find(spa, ZPOOL_PROP_MINWATERMARK, &spa->spa_minwat);
2961                 spa_prop_find(spa, ZPOOL_PROP_DEDUPMETA_DITTO,
2962                     &spa->spa_ddt_meta_copies);
2963                 spa_prop_find(spa, ZPOOL_PROP_DDT_DESEGREGATION, &val);
2964                 spa_set_ddt_classes(spa, val);
2965 
2966                 spa_prop_find(spa, ZPOOL_PROP_RESILVER_PRIO,
2967                     &spa->spa_resilver_prio);
2968                 spa_prop_find(spa, ZPOOL_PROP_SCRUB_PRIO,
2969                     &spa->spa_scrub_prio);
2970 
2971                 spa_prop_find(spa, ZPOOL_PROP_DEDUP_BEST_EFFORT,
2972                     &spa->spa_dedup_best_effort);
2973                 spa_prop_find(spa, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT,
2974                     &spa->spa_dedup_lo_best_effort);
2975                 spa_prop_find(spa, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT,
2976                     &spa->spa_dedup_hi_best_effort);
2977 
2978                 spa_prop_find(spa, ZPOOL_PROP_META_PLACEMENT,
2979                     &mp->spa_enable_meta_placement_selection);
2980                 spa_prop_find(spa, ZPOOL_PROP_SYNC_TO_SPECIAL,
2981                     &mp->spa_sync_to_special);
2982                 spa_prop_find(spa, ZPOOL_PROP_DDT_META_TO_METADEV,
2983                     &mp->spa_ddt_meta_to_special);
2984                 spa_prop_find(spa, ZPOOL_PROP_ZFS_META_TO_METADEV,
2985                     &mp->spa_zfs_meta_to_special);
2986                 spa_prop_find(spa, ZPOOL_PROP_SMALL_DATA_TO_METADEV,
2987                     &mp->spa_small_data_to_special);
2988 
2989                 spa->spa_autoreplace = (autoreplace != 0);
2990         }
2991 
2992         error = spa_dir_prop(spa, DMU_POOL_COS_PROPS,
2993             &spa->spa_cos_props_object);
2994         if (error == 0)
2995                 (void) spa_load_cos_props(spa);
2996         error = spa_dir_prop(spa, DMU_POOL_VDEV_PROPS,
2997             &spa->spa_vdev_props_object);
2998         if (error == 0)
2999                 (void) spa_load_vdev_props(spa);
3000 
3001         (void) spa_dir_prop(spa, DMU_POOL_TRIM_START_TIME,
3002             &spa->spa_man_trim_start_time);
3003         (void) spa_dir_prop(spa, DMU_POOL_TRIM_STOP_TIME,
3004             &spa->spa_man_trim_stop_time);
3005 
3006         /*
3007          * If the 'autoreplace' property is set, then post a resource notifying
3008          * the ZFS DE that it should not issue any faults for unopenable
3009          * devices.  We also iterate over the vdevs, and post a sysevent for any
3010          * unopenable vdevs so that the normal autoreplace handler can take
3011          * over.
3012          */
3013         if (spa->spa_autoreplace && state != SPA_LOAD_TRYIMPORT) {
3014                 spa_check_removed(spa->spa_root_vdev);
3015                 /*
3016                  * For the import case, this is done in spa_import(), because
3017                  * at this point we're using the spare definitions from
3018                  * the MOS config, not necessarily from the userland config.
3019                  */
3020                 if (state != SPA_LOAD_IMPORT) {
3021                         spa_aux_check_removed(&spa->spa_spares);
3022                         spa_aux_check_removed(&spa->spa_l2cache);
3023                 }
3024         }
3025 
3026         /*
3027          * Load the vdev state for all toplevel vdevs.
3028          */
3029         vdev_load(rvd);
3030 
3031         /*
3032          * Propagate the leaf DTLs we just loaded all the way up the tree.
3033          */
3034         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3035         vdev_dtl_reassess(rvd, 0, 0, B_FALSE);
3036         spa_config_exit(spa, SCL_ALL, FTAG);
3037 
3038         /*
3039          * Load the DDTs (dedup tables).
3040          */
3041         error = ddt_load(spa);
3042         if (error != 0)
3043                 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3044 
3045         spa_update_dspace(spa);
3046 
3047         /*
3048          * Validate the config, using the MOS config to fill in any
3049          * information which might be missing.  If we fail to validate
3050          * the config then declare the pool unfit for use. If we're
3051          * assembling a pool from a split, the log is not transferred
3052          * over.
3053          */
3054         if (type != SPA_IMPORT_ASSEMBLE) {
3055                 nvlist_t *nvconfig;
3056 
3057                 if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
3058                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3059 
3060                 if (!spa_config_valid(spa, nvconfig)) {
3061                         nvlist_free(nvconfig);
3062                         return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
3063                             ENXIO));
3064                 }
3065                 nvlist_free(nvconfig);
3066 
3067                 /*
3068                  * Now that we've validated the config, check the state of the
3069                  * root vdev.  If it can't be opened, it indicates one or
3070                  * more toplevel vdevs are faulted.
3071                  */
3072                 if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
3073                         return (SET_ERROR(ENXIO));
3074 
3075                 if (spa_writeable(spa) && spa_check_logs(spa)) {
3076                         *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3077                         return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ENXIO));
3078                 }
3079         }
3080 
3081         if (missing_feat_write) {
3082                 ASSERT(state == SPA_LOAD_TRYIMPORT);
3083 
3084                 /*
3085                  * At this point, we know that we can open the pool in
3086                  * read-only mode but not read-write mode. We now have enough
3087                  * information and can return to userland.
3088                  */
3089                 return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP));
3090         }
3091 
3092         /*
3093          * We've successfully opened the pool, verify that we're ready
3094          * to start pushing transactions.
3095          */
3096         if (state != SPA_LOAD_TRYIMPORT) {
3097                 if (error = spa_load_verify(spa)) {
3098                         return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3099                             error));
3100                 }
3101         }
3102 
3103         if (spa_writeable(spa) && (state == SPA_LOAD_RECOVER ||
3104             spa->spa_load_max_txg == UINT64_MAX)) {
3105                 dmu_tx_t *tx;
3106                 int need_update = B_FALSE;
3107                 dsl_pool_t *dp = spa_get_dsl(spa);
3108 
3109                 ASSERT(state != SPA_LOAD_TRYIMPORT);
3110 
3111                 /*
3112                  * Claim log blocks that haven't been committed yet.
3113                  * This must all happen in a single txg.
3114                  * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3115                  * invoked from zil_claim_log_block()'s i/o done callback.
3116                  * Price of rollback is that we abandon the log.
3117                  */
3118                 spa->spa_claiming = B_TRUE;
3119 
3120                 tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3121                 (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3122                     zil_claim, tx, DS_FIND_CHILDREN);
3123                 dmu_tx_commit(tx);
3124 
3125                 spa->spa_claiming = B_FALSE;
3126 
3127                 spa_set_log_state(spa, SPA_LOG_GOOD);
3128                 spa->spa_sync_on = B_TRUE;
3129                 txg_sync_start(spa->spa_dsl_pool);
3130 
3131                 /*
3132                  * Wait for all claims to sync.  We sync up to the highest
3133                  * claimed log block birth time so that claimed log blocks
3134                  * don't appear to be from the future.  spa_claim_max_txg
3135                  * will have been set for us by either zil_check_log_chain()
3136                  * (invoked from spa_check_logs()) or zil_claim() above.
3137                  */
3138                 txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
3139 
3140                 /*
3141                  * If the config cache is stale, or we have uninitialized
3142                  * metaslabs (see spa_vdev_add()), then update the config.
3143                  *
3144                  * If this is a verbatim import, trust the current
3145                  * in-core spa_config and update the disk labels.
3146                  */
3147                 if (config_cache_txg != spa->spa_config_txg ||
3148                     state == SPA_LOAD_IMPORT ||
3149                     state == SPA_LOAD_RECOVER ||
3150                     (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3151                         need_update = B_TRUE;
3152 
3153                 for (int c = 0; c < rvd->vdev_children; c++)
3154                         if (rvd->vdev_child[c]->vdev_ms_array == 0)
3155                                 need_update = B_TRUE;
3156 
3157                 /*
3158                  * Update the config cache asynchronously in case we're the
3159                  * root pool, in which case the config cache isn't writable yet.
3160                  */
3161                 if (need_update)
3162                         spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3163 
3164                 /*
3165                  * Check all DTLs to see if anything needs resilvering.
3166                  */
3167                 if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
3168                     vdev_resilver_needed(rvd, NULL, NULL))
3169                         spa_async_request(spa, SPA_ASYNC_RESILVER);
3170 
3171                 /*
3172                  * Log the fact that we booted up (so that we can detect if
3173                  * we rebooted in the middle of an operation).
3174                  */
3175                 spa_history_log_version(spa, "open");
3176 
3177                 dsl_destroy_inconsistent(spa_get_dsl(spa));
3178 
3179                 /*
3180                  * Clean up any stale temporary dataset userrefs.
3181                  */
3182                 dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
3183         }
3184 
3185         spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD);
3186 
3187         return (0);
3188 }
3189 
3190 static int
3191 spa_load_retry(spa_t *spa, spa_load_state_t state, int mosconfig)
3192 {
3193         int mode = spa->spa_mode;
3194 
3195         spa_unload(spa);
3196         spa_deactivate(spa);
3197 
3198         spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
3199 
3200         spa_activate(spa, mode);
3201         spa_async_suspend(spa);
3202 
3203         return (spa_load(spa, state, SPA_IMPORT_EXISTING, mosconfig));
3204 }
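/*
 * Note: each retry caps spa_load_max_txg at one txg below the uberblock that
 * was just tried, so successive calls from spa_load_best() step backwards
 * through older txgs until a load succeeds or the rewind range is exhausted.
 */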
3205 
3206 /*
3207  * If spa_load() fails, this function will try loading prior txgs. If
3208  * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
3209  * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
3210  * function will not rewind the pool and will return the same error as
3211  * spa_load().
3212  */
3213 static int
3214 spa_load_best(spa_t *spa, spa_load_state_t state, int mosconfig,
3215     uint64_t max_request, int rewind_flags)
3216 {
3217         nvlist_t *loadinfo = NULL;
3218         nvlist_t *config = NULL;
3219         int load_error, rewind_error;
3220         uint64_t safe_rewind_txg;
3221         uint64_t min_txg;
3222 
3223         if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
3224                 spa->spa_load_max_txg = spa->spa_load_txg;
3225                 spa_set_log_state(spa, SPA_LOG_CLEAR);
3226         } else {
3227                 spa->spa_load_max_txg = max_request;
3228                 if (max_request != UINT64_MAX)
3229                         spa->spa_extreme_rewind = B_TRUE;
3230         }
3231 
3232         load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING,
3233             mosconfig);
3234         if (load_error == 0)
3235                 return (0);
3236 
3237         if (spa->spa_root_vdev != NULL)
3238                 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3239 
3240         spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
3241         spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
3242 
3243         if (rewind_flags & ZPOOL_NEVER_REWIND) {
3244                 nvlist_free(config);
3245                 return (load_error);
3246         }
3247 
3248         if (state == SPA_LOAD_RECOVER) {
3249                 /* Price of rolling back is discarding txgs, including log */
3250                 spa_set_log_state(spa, SPA_LOG_CLEAR);
3251         } else {
3252                 /*
3253                  * If we aren't rolling back save the load info from our first
3254                  * import attempt so that we can restore it after attempting
3255                  * to rewind.
3256                  */
3257                 loadinfo = spa->spa_load_info;
3258                 spa->spa_load_info = fnvlist_alloc();
3259         }
3260 
3261         spa->spa_load_max_txg = spa->spa_last_ubsync_txg;
3262         safe_rewind_txg = spa->spa_last_ubsync_txg - TXG_DEFER_SIZE;
3263         min_txg = (rewind_flags & ZPOOL_EXTREME_REWIND) ?
3264             TXG_INITIAL : safe_rewind_txg;
3265 
3266         /*
3267          * Continue as long as we're finding errors, we're still within
3268          * the acceptable rewind range, and we're still finding uberblocks
3269          */
3270         while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
3271             spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
3272                 if (spa->spa_load_max_txg < safe_rewind_txg)
3273                         spa->spa_extreme_rewind = B_TRUE;
3274                 rewind_error = spa_load_retry(spa, state, mosconfig);
3275         }
3276 
3277         spa->spa_extreme_rewind = B_FALSE;
3278         spa->spa_load_max_txg = UINT64_MAX;
3279 
3280         if (config && (rewind_error || state != SPA_LOAD_RECOVER))
3281                 spa_config_set(spa, config);
3282         else
3283                 nvlist_free(config);
3284 
3285         if (state == SPA_LOAD_RECOVER) {
3286                 ASSERT3P(loadinfo, ==, NULL);
3287                 return (rewind_error);
3288         } else {
3289                 /* Store the rewind info as part of the initial load info */
3290                 fnvlist_add_nvlist(loadinfo, ZPOOL_CONFIG_REWIND_INFO,
3291                     spa->spa_load_info);
3292 
3293                 /* Restore the initial load info */
3294                 fnvlist_free(spa->spa_load_info);


3301 /*
3302  * Pool Open/Import
3303  *
3304  * The import case is identical to an open except that the configuration is sent
3305  * down from userland, instead of grabbed from the configuration cache.  For the
3306  * case of an open, the pool configuration will exist in the
3307  * POOL_STATE_UNINITIALIZED state.
3308  *
3309  * The stats information (gen/count/ustats) is used to gather vdev statistics at
3310  * the same time as opening the pool, without having to keep the spa_t around in
3311  * some ambiguous state.
3312  */
3313 static int
3314 spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
3315     nvlist_t **config)
3316 {
3317         spa_t *spa;
3318         spa_load_state_t state = SPA_LOAD_OPEN;
3319         int error;
3320         int locked = B_FALSE;
3321         boolean_t open_with_activation = B_FALSE;
3322 
3323         *spapp = NULL;
3324 
3325         /*
3326          * As disgusting as this is, we need to support recursive calls to this
3327          * function because dsl_dir_open() is called during spa_load(), and ends
3328          * up calling spa_open() again.  The real fix is to figure out how to
3329          * avoid dsl_dir_open() calling this in the first place.
3330          */
3331         if (mutex_owner(&spa_namespace_lock) != curthread) {
3332                 mutex_enter(&spa_namespace_lock);
3333                 locked = B_TRUE;
3334         }
3335 
3336         if ((spa = spa_lookup(pool)) == NULL) {
3337                 if (locked)
3338                         mutex_exit(&spa_namespace_lock);
3339                 return (SET_ERROR(ENOENT));
3340         }
3341 
3342         if (spa->spa_state == POOL_STATE_UNINITIALIZED) {
3343                 zpool_rewind_policy_t policy;
3344 
3345                 zpool_get_rewind_policy(nvpolicy ? nvpolicy : spa->spa_config,
3346                     &policy);
3347                 if (policy.zrp_request & ZPOOL_DO_REWIND)
3348                         state = SPA_LOAD_RECOVER;
3349 
3350                 spa_activate(spa, spa_mode_global);
3351 
3352                 if (state != SPA_LOAD_RECOVER)
3353                         spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
3354 
3355                 error = spa_load_best(spa, state, B_FALSE, policy.zrp_txg,
3356                     policy.zrp_request);
3357 
3358                 if (error == EBADF) {
3359                         /*
3360                          * If vdev_validate() returns failure (indicated by
3361                          * EBADF), it means that one of the vdevs indicates
3362                          * that the pool has been exported or destroyed.  If
3363                          * this is the case, the config cache is out of sync and
3364                          * we should remove the pool from the namespace.
3365                          */
3366                         spa_unload(spa);
3367                         spa_deactivate(spa);
3368                         spa_config_sync(spa, B_TRUE, B_TRUE);
3369                         spa_remove(spa);
3370                         if (locked)
3371                                 mutex_exit(&spa_namespace_lock);
3372                         return (SET_ERROR(ENOENT));
3373                 }
3374 
3375                 if (error) {
3376                         /*
3377                          * We can't open the pool, but we still have useful
3378                          * information: the state of each vdev after the
3379                          * attempted vdev_open().  Return this to the user.
3380                          */
3381                         if (config != NULL && spa->spa_config) {
3382                                 VERIFY(nvlist_dup(spa->spa_config, config,
3383                                     KM_SLEEP) == 0);
3384                                 VERIFY(nvlist_add_nvlist(*config,
3385                                     ZPOOL_CONFIG_LOAD_INFO,
3386                                     spa->spa_load_info) == 0);
3387                         }
3388                         spa_unload(spa);
3389                         spa_deactivate(spa);
3390                         spa->spa_last_open_failed = error;
3391                         if (locked)
3392                                 mutex_exit(&spa_namespace_lock);
3393                         *spapp = NULL;
3394                         return (error);
3395                 }
3396 
3397                 open_with_activation = B_TRUE;
3398         }
3399 
3400         spa_open_ref(spa, tag);
3401 
3402         if (config != NULL)
3403                 *config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3404 
3405         /*
3406          * If we've recovered the pool, pass back any information we
3407          * gathered while doing the load.
3408          */
3409         if (state == SPA_LOAD_RECOVER) {
3410                 VERIFY(nvlist_add_nvlist(*config, ZPOOL_CONFIG_LOAD_INFO,
3411                     spa->spa_load_info) == 0);
3412         }
3413 
3414         if (locked) {
3415                 spa->spa_last_open_failed = 0;
3416                 spa->spa_last_ubsync_txg = 0;
3417                 spa->spa_load_txg = 0;
3418                 mutex_exit(&spa_namespace_lock);
3419         }
3420 
3421         if (open_with_activation)
3422                 wbc_activate(spa, B_FALSE);
3423 
3424         *spapp = spa;
3425 
3426         return (0);
3427 }
3428 
3429 int
3430 spa_open_rewind(const char *name, spa_t **spapp, void *tag, nvlist_t *policy,
3431     nvlist_t **config)
3432 {
3433         return (spa_open_common(name, spapp, tag, policy, config));
3434 }
3435 
3436 int
3437 spa_open(const char *name, spa_t **spapp, void *tag)
3438 {
3439         return (spa_open_common(name, spapp, tag, NULL, NULL));
3440 }
3441 
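/*
 * A minimal usage sketch (an illustration, not code from this file):
 * a consumer that opens a pool with spa_open() holds a reference on the
 * spa_t and is expected to drop it with spa_close() using the same tag.
 *
 *	spa_t *spa;
 *
 *	if (spa_open("tank", &spa, FTAG) == 0) {
 *		... use the pool ...
 *		spa_close(spa, FTAG);
 *	}
 */
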
3442 /*
3443  * Lookup the given spa_t, incrementing the inject count in the process,


3853         }
3854 }
3855 
3856 /*
3857  * Pool Creation
3858  */
3859 int
3860 spa_create(const char *pool, nvlist_t *nvroot, nvlist_t *props,
3861     nvlist_t *zplprops)
3862 {
3863         spa_t *spa;
3864         char *altroot = NULL;
3865         vdev_t *rvd;
3866         dsl_pool_t *dp;
3867         dmu_tx_t *tx;
3868         int error = 0;
3869         uint64_t txg = TXG_INITIAL;
3870         nvlist_t **spares, **l2cache;
3871         uint_t nspares, nl2cache;
3872         uint64_t version, obj;
3873         boolean_t has_features = B_FALSE, wbc_feature_exists = B_FALSE;
3874         spa_meta_placement_t *mp;
3875 
3876         /*
3877          * If this pool already exists, return failure.
3878          */
3879         mutex_enter(&spa_namespace_lock);
3880         if (spa_lookup(pool) != NULL) {
3881                 mutex_exit(&spa_namespace_lock);
3882                 return (SET_ERROR(EEXIST));
3883         }
3884 
3885         /*
3886          * Allocate a new spa_t structure.
3887          */
3888         (void) nvlist_lookup_string(props,
3889             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
3890         spa = spa_add(pool, NULL, altroot);
3891         spa_activate(spa, spa_mode_global);
3892 
3893         if (props != NULL) {
3894                 nvpair_t *wbc_feature_nvp = NULL;
3895 
3896                 for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
3897                     elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
3898                         const char *propname = nvpair_name(elem);
3899                         if (zpool_prop_feature(propname)) {
3900                                 spa_feature_t feature;
3901                                 int err;
3902                                 const char *fname = strchr(propname, '@') + 1;
3903 
3904                                 err = zfeature_lookup_name(fname, &feature);
3905                                 if (err == 0 && feature == SPA_FEATURE_WBC) {
3906                                         wbc_feature_nvp = elem;
3907                                         wbc_feature_exists = B_TRUE;
3908                                 }
3909 
3910                                 has_features = B_TRUE;
3911                         }
3912                 }
3913 
3914                 /*
3915                  * We do not want to enable feature@wbc if
3916                  * this pool does not have a special vdev.
3917                  * At this stage we remove the feature from the common list;
3918                  * later, once we have verified that a special vdev is
3919                  * present, the feature will be enabled.
3920                  */
3921                 if (wbc_feature_nvp != NULL)
3922                         fnvlist_remove_nvpair(props, wbc_feature_nvp);
3923 
3924                 if ((error = spa_prop_validate(spa, props)) != 0) {
3925                         spa_deactivate(spa);
3926                         spa_remove(spa);
3927                         mutex_exit(&spa_namespace_lock);
3928                         return (error);
3929                 }
3930         }
3931 
3932 
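             /*
              * If any feature@ properties were requested, or if no explicit
              * version property was supplied, create the pool at the current
              * SPA_VERSION (i.e. with feature flags enabled).
              */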
3933         if (has_features || nvlist_lookup_uint64(props,
3934             zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
3935                 version = SPA_VERSION;
3936         }
3937         ASSERT(SPA_VERSION_IS_SUPPORTED(version));
3938 
3939         spa->spa_first_txg = txg;
3940         spa->spa_uberblock.ub_txg = txg - 1;
3941         spa->spa_uberblock.ub_version = version;
3942         spa->spa_ubsync = spa->spa_uberblock;
3943         spa->spa_load_state = SPA_LOAD_CREATE;
3944 
3945         /*
3946          * Create "The Godfather" zio to hold all async IOs
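              * (one root zio is allocated per CPU so that async zios issued
              * concurrently on different CPUs do not all contend on a single
              * parent zio)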
3947          */
3948         spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
3949             KM_SLEEP);
3950         for (int i = 0; i < max_ncpus; i++) {
3951                 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
3952                     ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
3953                     ZIO_FLAG_GODFATHER);
3954         }
3955 
3956         /*
3957          * Create the root vdev.
3958          */
3959         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3960 
3961         error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, VDEV_ALLOC_ADD);
3962 
3963         ASSERT(error != 0 || rvd != NULL);


4067          * because sync-to-convergence takes longer if the blocksize
4068          * keeps changing.
4069          */
4070         obj = bpobj_alloc(spa->spa_meta_objset, 1 << 14, tx);
4071         dmu_object_set_compress(spa->spa_meta_objset, obj,
4072             ZIO_COMPRESS_OFF, tx);
4073         if (zap_add(spa->spa_meta_objset,
4074             DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SYNC_BPOBJ,
4075             sizeof (uint64_t), 1, &obj, tx) != 0) {
4076                 cmn_err(CE_PANIC, "failed to add bpobj");
4077         }
4078         VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
4079             spa->spa_meta_objset, obj));
4080 
4081         /*
4082          * Create the pool's history object.
4083          */
4084         if (version >= SPA_VERSION_ZPOOL_HISTORY)
4085                 spa_history_create_obj(spa, tx);
4086 
4087         mp = &spa->spa_meta_policy;
4088 
4089         /*
4090          * Generate some random noise for salted checksums to operate on.
4091          */
4092         (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
4093             sizeof (spa->spa_cksum_salt.zcs_bytes));
4094 
4095         /*
4096          * Set pool properties.
4097          */
4098         spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
4099         spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
4100         spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
4101         spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
4102         spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
4103         spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
4104         spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
4105         spa->spa_ddt_meta_copies =
4106             zpool_prop_default_numeric(ZPOOL_PROP_DEDUPMETA_DITTO);
4107         spa->spa_dedup_best_effort =
4108             zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_BEST_EFFORT);
4109         spa->spa_dedup_lo_best_effort =
4110             zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
4111         spa->spa_dedup_hi_best_effort =
4112             zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
4113         spa->spa_force_trim = zpool_prop_default_numeric(ZPOOL_PROP_FORCETRIM);
4114 
4115         spa->spa_resilver_prio =
4116             zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
4117         spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
4118 
4119         mutex_enter(&spa->spa_auto_trim_lock);
4120         spa->spa_auto_trim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM);
4121         if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
4122                 spa_auto_trim_taskq_create(spa);
4123         mutex_exit(&spa->spa_auto_trim_lock);
4124 
4125         mp->spa_enable_meta_placement_selection =
4126             zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
4127         mp->spa_sync_to_special =
4128             zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
4129         mp->spa_ddt_meta_to_special =
4130             zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
4131         mp->spa_zfs_meta_to_special =
4132             zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
4133         mp->spa_small_data_to_special =
4134             zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
4135 
4136         spa_set_ddt_classes(spa, 0);
4137 
4138         if (props != NULL) {
4139                 spa_configfile_set(spa, props, B_FALSE);
4140                 spa_sync_props(props, tx);
4141         }
4142 
4143         if (spa_has_special(spa)) {
4144                 spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
4145                 spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4146 
4147                 if (wbc_feature_exists)
4148                         spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
4149         }
4150 
4151         dmu_tx_commit(tx);
4152 
4153         spa->spa_sync_on = B_TRUE;
4154         txg_sync_start(spa->spa_dsl_pool);
4155 
4156         /*
4157          * We explicitly wait for the first transaction to complete so that our
4158          * bean counters are appropriately updated.
4159          */
4160         txg_wait_synced(spa->spa_dsl_pool, txg);
4161 
4162         spa_config_sync(spa, B_FALSE, B_TRUE);
4163         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
4164 
4165         spa_history_log_version(spa, "create");
4166 
4167         /*
4168          * Don't count references from objsets that are already closed
4169          * and are making their way through the eviction process.
4170          */
4171         spa_evicting_os_wait(spa);
4172         spa->spa_minref = refcount_count(&spa->spa_refcount);
4173         spa->spa_load_state = SPA_LOAD_NONE;
4174 
4175         mutex_exit(&spa_namespace_lock);
4176 
4177         wbc_activate(spa, B_TRUE);
4178 
4179         return (0);
4180 }
4181 
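/*
 * A sketch of how spa_create() is normally reached (this describes the
 * usual illumos ioctl path, which lives outside this file):
 *
 *	zfs_ioc_pool_create()				(zfs_ioctl.c)
 *	    -> spa_create(poolname, nvroot, props, zplprops)
 *
 * where 'nvroot' is the ZPOOL_CONFIG_VDEV_TREE nvlist that libzfs
 * assembled from the "zpool create" command line.
 */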
4182 
4183 /*
4184  * See if the pool has a special tier, and if so, enable/activate
4185  * the feature as needed. Activation is not reference counted.
4186  */
4187 static void
4188 spa_check_special_feature(spa_t *spa)
4189 {
4190         if (spa_has_special(spa)) {
4191                 nvlist_t *props = NULL;
4192 
4193                 if (!spa_feature_is_enabled(spa, SPA_FEATURE_META_DEVICES)) {
4194                         VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, 0) == 0);
4195                         VERIFY(nvlist_add_uint64(props,
4196                             FEATURE_META_DEVICES, 0) == 0);
4197                         VERIFY(spa_prop_set(spa, props) == 0);
4198                         nvlist_free(props);
4199                 }
4200 
4201                 if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
4202                         dmu_tx_t *tx =
4203                             dmu_tx_create_dd(spa->spa_dsl_pool->dp_mos_dir);
4204 
4205                         VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
4206                         spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4207                         dmu_tx_commit(tx);
4208                 }
4209         }
4210 }
4211 
4212 static void
4213 spa_special_feature_activate(void *arg, dmu_tx_t *tx)
4214 {
4215         spa_t *spa = (spa_t *)arg;
4216 
4217         if (spa_has_special(spa)) {
4218                 /* enable and activate as needed */
4219                 spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
4220                 if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
4221                         spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4222                 }
4223 
4224                 spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
4225         }
4226 }
4227 
4228 #ifdef _KERNEL
4229 /*
4230  * Get the root pool information from the root disk, then import the root pool
4231  * at system boot time.
4232  */
4233 extern int vdev_disk_read_rootlabel(char *, char *, nvlist_t **);
4234 
4235 static nvlist_t *
4236 spa_generate_rootconf(char *devpath, char *devid, uint64_t *guid)
4237 {
4238         nvlist_t *config;
4239         nvlist_t *nvtop, *nvroot;
4240         uint64_t pgid;
4241 
4242         if (vdev_disk_read_rootlabel(devpath, devid, &config) != 0)
4243                 return (NULL);
4244 
4245         /*
4246          * Add this top-level vdev to the child array.
4247          */


4333 #if defined(_OBP) && defined(_KERNEL)
4334         if (config == NULL) {
4335                 if (strstr(devpath, "/iscsi/ssd") != NULL) {
4336                         /* iscsi boot */
4337                         get_iscsi_bootpath_phy(devpath);
4338                         config = spa_generate_rootconf(devpath, devid, &guid);
4339                 }
4340         }
4341 #endif
4342         if (config == NULL) {
4343                 cmn_err(CE_NOTE, "Cannot read the pool label from '%s'",
4344                     devpath);
4345                 return (SET_ERROR(EIO));
4346         }
4347 
4348         VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
4349             &pname) == 0);
4350         VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
4351 
4352         mutex_enter(&spa_namespace_lock);
4353         if ((spa = spa_lookup(pname)) != NULL || spa_config_guid_exists(guid)) {
4354                 /*
4355                  * Remove the existing root pool from the namespace so that we
4356                  * can replace it with the correct config we just read in.
4357                  */
4358                 spa_remove(spa);
4359         }
4360 
4361         spa = spa_add(pname, config, NULL);
4362         spa->spa_is_root = B_TRUE;
4363         spa->spa_import_flags = ZFS_IMPORT_VERBATIM;
4364 
4365         /*
4366          * Build up a vdev tree based on the boot device's label config.
4367          */
4368         VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4369             &nvtop) == 0);
4370         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4371         error = spa_config_parse(spa, &rvd, nvtop, NULL, 0,
4372             VDEV_ALLOC_ROOTPOOL);
4373         spa_config_exit(spa, SCL_ALL, FTAG);
4374         if (error) {
4375                 mutex_exit(&spa_namespace_lock);
4376                 nvlist_free(config);
4377                 cmn_err(CE_NOTE, "Can not parse the config for pool '%s'",
4378                     pname);
4379                 return (error);
4380         }
4381 
4382         /*
4383          * Get the boot vdev.


4427 }
4428 
4429 #endif
4430 
4431 /*
4432  * Import a non-root pool into the system.
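 * (In the usual illumos layout this is reached from zfs_ioc_pool_import(),
 * which passes along the user-assembled config nvlist and any ZFS_IMPORT_*
 * flags; the ioctl path itself lives outside this file.)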
4433  */
4434 int
4435 spa_import(const char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
4436 {
4437         spa_t *spa;
4438         char *altroot = NULL;
4439         spa_load_state_t state = SPA_LOAD_IMPORT;
4440         zpool_rewind_policy_t policy;
4441         uint64_t mode = spa_mode_global;
4442         uint64_t readonly = B_FALSE;
4443         int error;
4444         nvlist_t *nvroot;
4445         nvlist_t **spares, **l2cache;
4446         uint_t nspares, nl2cache;
4447         uint64_t guid;
4448 
4449         if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid) != 0)
4450                 return (SET_ERROR(EINVAL));
4451 
4452         /*
4453          * If a pool with this name exists, return failure.
4454          */
4455         mutex_enter(&spa_namespace_lock);
4456         if (spa_lookup(pool) != NULL || spa_config_guid_exists(guid)) {
4457                 mutex_exit(&spa_namespace_lock);
4458                 return (SET_ERROR(EEXIST));
4459         }
4460 
4461         /*
4462          * Create and initialize the spa structure.
4463          */
4464         (void) nvlist_lookup_string(props,
4465             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4466         (void) nvlist_lookup_uint64(props,
4467             zpool_prop_to_name(ZPOOL_PROP_READONLY), &readonly);
4468         if (readonly)
4469                 mode = FREAD;
4470         spa = spa_add(pool, config, altroot);
4471         spa->spa_import_flags = flags;
4472 
4473         /*
4474          * Verbatim import - Take a pool and insert it into the namespace
4475          * as if it had been loaded at boot.
4476          */
4477         if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
4478                 if (props != NULL)
4479                         spa_configfile_set(spa, props, B_FALSE);
4480 
4481                 spa_config_sync(spa, B_FALSE, B_TRUE);
4482                 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4483 
4484                 mutex_exit(&spa_namespace_lock);
4485                 return (0);
4486         }
4487 
4488         spa_activate(spa, mode);
4489 
4490         /*
4491          * Don't start async tasks until we know everything is healthy.
4492          */
4493         spa_async_suspend(spa);
4494 
4495         zpool_get_rewind_policy(config, &policy);
4496         if (policy.zrp_request & ZPOOL_DO_REWIND)
4497                 state = SPA_LOAD_RECOVER;
4498 
4499         /*
4500          * Pass off the heavy lifting to spa_load().  Pass TRUE for mosconfig
4501          * because the user-supplied config is actually the one to trust when
4502          * doing an import.
4503          */
4504         if (state != SPA_LOAD_RECOVER)
4505                 spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
4506 
4507         error = spa_load_best(spa, state, B_TRUE, policy.zrp_txg,
4508             policy.zrp_request);
4509 
4510         /*
4511          * Propagate anything learned while loading the pool and pass it
4512          * back to the caller (i.e. rewind info, missing devices, etc).
4513          */
4514         VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4515             spa->spa_load_info) == 0);
4516 
4517         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4518         /*
4519          * Toss any existing sparelist, as it doesn't have any validity
4520          * anymore, and conflicts with spa_has_spare().
4521          */
4522         if (spa->spa_spares.sav_config) {
4523                 nvlist_free(spa->spa_spares.sav_config);
4524                 spa->spa_spares.sav_config = NULL;
4525                 spa_load_spares(spa);
4526         }
4527         if (spa->spa_l2cache.sav_config) {
4528                 nvlist_free(spa->spa_l2cache.sav_config);
4529                 spa->spa_l2cache.sav_config = NULL;


4535         if (error == 0)
4536                 error = spa_validate_aux(spa, nvroot, -1ULL,
4537                     VDEV_ALLOC_SPARE);
4538         if (error == 0)
4539                 error = spa_validate_aux(spa, nvroot, -1ULL,
4540                     VDEV_ALLOC_L2CACHE);
4541         spa_config_exit(spa, SCL_ALL, FTAG);
4542 
4543         if (props != NULL)
4544                 spa_configfile_set(spa, props, B_FALSE);
4545 
4546         if (error != 0 || (props && spa_writeable(spa) &&
4547             (error = spa_prop_set(spa, props)))) {
4548                 spa_unload(spa);
4549                 spa_deactivate(spa);
4550                 spa_remove(spa);
4551                 mutex_exit(&spa_namespace_lock);
4552                 return (error);
4553         }
4554 
4555         /*
4556          * Override any spares and level 2 cache devices as specified by
4557          * the user, as these may have correct device names/devids, etc.
4558          */
4559         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4560             &spares, &nspares) == 0) {
4561                 if (spa->spa_spares.sav_config)
4562                         VERIFY(nvlist_remove(spa->spa_spares.sav_config,
4563                             ZPOOL_CONFIG_SPARES, DATA_TYPE_NVLIST_ARRAY) == 0);
4564                 else
4565                         VERIFY(nvlist_alloc(&spa->spa_spares.sav_config,
4566                             NV_UNIQUE_NAME, KM_SLEEP) == 0);
4567                 VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4568                     ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4569                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4570                 spa_load_spares(spa);
4571                 spa_config_exit(spa, SCL_ALL, FTAG);
4572                 spa->spa_spares.sav_sync = B_TRUE;
4573         }
4574         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4575             &l2cache, &nl2cache) == 0) {
4576                 if (spa->spa_l2cache.sav_config)
4577                         VERIFY(nvlist_remove(spa->spa_l2cache.sav_config,
4578                             ZPOOL_CONFIG_L2CACHE, DATA_TYPE_NVLIST_ARRAY) == 0);
4579                 else
4580                         VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4581                             NV_UNIQUE_NAME, KM_SLEEP) == 0);
4582                 VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4583                     ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4584                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4585                 spa_load_l2cache(spa);
4586                 spa_config_exit(spa, SCL_ALL, FTAG);
4587                 spa->spa_l2cache.sav_sync = B_TRUE;
4588         }
4589 
4590         /* At this point, we can load spare props */
4591         (void) spa_load_vdev_props(spa);
4592 
4593         /*
4594          * Check for any removed devices.
4595          */
4596         if (spa->spa_autoreplace) {
4597                 spa_aux_check_removed(&spa->spa_spares);
4598                 spa_aux_check_removed(&spa->spa_l2cache);
4599         }
4600 
4601         if (spa_writeable(spa)) {
4602                 /*
4603                  * Update the config cache to include the newly-imported pool.
4604                  */
4605                 spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
4606         }
4607 
4608         /*
4609          * Resume async tasks as late as possible to reduce I/O activity
4610          * while importing a pool.  This lets any pending txgs (e.g. from a
4611          * scrub or resilver) complete quickly, thereby reducing import
4612          * times in such cases.
4613          */
4614         spa_async_resume(spa);
4615 
4616         /*
4617          * It's possible that the pool was expanded while it was exported.
4618          * We kick off an async task to handle this for us.
4619          */
4620         spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
4621 
4622         /* Set/activate meta feature as needed */
4623         if (!spa_writeable(spa))
4624                 spa_check_special_feature(spa);
4625         spa_history_log_version(spa, "import");
4626 
4627         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4628 
4629         mutex_exit(&spa_namespace_lock);
4630 
4631         if (!spa_writeable(spa))
4632                 return (0);
4633 
4634         wbc_activate(spa, B_FALSE);
4635 
4636         return (dsl_sync_task(spa->spa_name, NULL, spa_special_feature_activate,
4637             spa, 3, ZFS_SPACE_CHECK_RESERVED));
4638 }
4639 
4640 nvlist_t *
4641 spa_tryimport(nvlist_t *tryconfig)
4642 {
4643         nvlist_t *config = NULL;
4644         char *poolname;
4645         spa_t *spa;
4646         uint64_t state;
4647         int error;
4648 
4649         if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
4650                 return (NULL);
4651 
4652         if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
4653                 return (NULL);
4654 
4655         /*
4656          * Create and initialize the spa structure.
4657          */
4658         mutex_enter(&spa_namespace_lock);
4659         spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
4660         spa_activate(spa, FREAD);
4661 
4662         /*
4663          * Pass off the heavy lifting to spa_load().
4664          * Pass TRUE for mosconfig because the user-supplied config
4665          * is actually the one to trust when doing an import.
4666          */
4667         error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING, B_TRUE);
4668 
4669         /*
4670          * If 'tryconfig' was at least parsable, return the current config.
4671          */
4672         if (spa->spa_root_vdev != NULL) {
4673                 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
4674                 VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
4675                     poolname) == 0);
4676                 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
4677                     state) == 0);
4678                 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
4679                     spa->spa_uberblock.ub_timestamp) == 0);
4680                 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4681                     spa->spa_load_info) == 0);
4682 
4683                 /*
4684                  * If the bootfs property exists on this pool then we
4685                  * copy it out so that external consumers can tell which
4686                  * pools are bootable.
4687                  */
4688                 if ((!error || error == EEXIST) && spa->spa_bootfs) {


4723 
4724         spa_unload(spa);
4725         spa_deactivate(spa);
4726         spa_remove(spa);
4727         mutex_exit(&spa_namespace_lock);
4728 
4729         return (config);
4730 }
4731 
4732 /*
4733  * Pool export/destroy
4734  *
4735  * The act of destroying or exporting a pool is very simple.  We make sure there
4736  * is no more pending I/O and any references to the pool are gone.  Then, we
4737  * update the pool state and sync all the labels to disk, removing the
4738  * configuration from the cache afterwards. If the 'hardforce' flag is set, then
4739  * we don't sync the labels or remove the configuration cache.
4740  */
4741 static int
4742 spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
4743     boolean_t force, boolean_t hardforce, boolean_t saveconfig)
4744 {
4745         spa_t *spa;
4746         zfs_autosnap_t *autosnap;
4747         boolean_t wbcthr_stopped = B_FALSE;
4748 
4749         if (oldconfig)
4750                 *oldconfig = NULL;
4751 
4752         if (!(spa_mode_global & FWRITE))
4753                 return (SET_ERROR(EROFS));
4754 
4755         mutex_enter(&spa_namespace_lock);
4756         if ((spa = spa_lookup(pool)) == NULL) {
4757                 mutex_exit(&spa_namespace_lock);
4758                 return (SET_ERROR(ENOENT));
4759         }
4760 
4761         /*
4762          * Put a hold on the pool, drop the namespace lock, stop async tasks
4763          * and write cache thread, reacquire the namespace lock, and see
4764          * if we can export.
4765          */
4766         spa_open_ref(spa, FTAG);
4767         mutex_exit(&spa_namespace_lock);
4768 
4769         autosnap = spa_get_autosnap(spa);
4770         mutex_enter(&autosnap->autosnap_lock);
4771 
4772         if (autosnap_has_children_zone(autosnap,
4773             spa_name(spa), B_TRUE)) {
4774                 mutex_exit(&autosnap->autosnap_lock);
4775                 spa_close(spa, FTAG);
4776                 return (EBUSY);
4777         }
4778 
4779         mutex_exit(&autosnap->autosnap_lock);
4780 
4781         wbcthr_stopped = wbc_stop_thread(spa); /* stop write cache thread */
4782         autosnap_destroyer_thread_stop(spa);
4783         spa_async_suspend(spa);
4784         mutex_enter(&spa_namespace_lock);
4785         spa_close(spa, FTAG);
4786 
4787         /*
4788          * The pool will be in core if it's openable,
4789          * in which case we can modify its state.
4790          */
4791         if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) {
4792                 /*
4793                  * Objsets may be open only because they're dirty, so we
4794                  * have to force it to sync before checking spa_refcnt.
4795                  */
4796                 txg_wait_synced(spa->spa_dsl_pool, 0);
4797                 spa_evicting_os_wait(spa);
4798 
4799                 /*
4800                  * A pool cannot be exported or destroyed if there are active
4801                  * references.  If we are resetting a pool, allow references by
4802                  * fault injection handlers.
4803                  */
4804                 if (!spa_refcount_zero(spa) ||
4805                     (spa->spa_inject_ref != 0 &&
4806                     new_state != POOL_STATE_UNINITIALIZED)) {
4807                         spa_async_resume(spa);
4808                         mutex_exit(&spa_namespace_lock);
4809                         if (wbcthr_stopped)
4810                                 (void) wbc_start_thread(spa);
4811                         autosnap_destroyer_thread_start(spa);
4812                         return (SET_ERROR(EBUSY));
4813                 }
4814 
4815                 /*
4816                  * A pool cannot be exported if it has an active shared spare.
4817                  * This is to prevent other pools from stealing the active
4818                  * spare from an exported pool.  The user can still force the
4819                  * export if desired.
4820                  */
4821                 if (!force && new_state == POOL_STATE_EXPORTED &&
4822                     spa_has_active_shared_spare(spa)) {
4823                         spa_async_resume(spa);
4824                         mutex_exit(&spa_namespace_lock);
4825                         if (wbcthr_stopped)
4826                                 (void) wbc_start_thread(spa);
4827                         autosnap_destroyer_thread_start(spa);
4828                         return (SET_ERROR(EXDEV));
4829                 }
4830 
4831                 /*
4832                  * We want this to be reflected on every label,
4833                  * so mark them all dirty.  spa_unload() will do the
4834                  * final sync that pushes these changes out.
4835                  */
4836                 if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
4837                         spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4838                         spa->spa_state = new_state;
4839                         spa->spa_final_txg = spa_last_synced_txg(spa) +
4840                             TXG_DEFER_SIZE + 1;
4841                         vdev_config_dirty(spa->spa_root_vdev);
4842                         spa_config_exit(spa, SCL_ALL, FTAG);
4843                 }
4844         }
4845 
4846         spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
4847 
4848         if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
4849                 wbc_deactivate(spa);
4850 
4851                 spa_unload(spa);
4852                 spa_deactivate(spa);
4853         }
4854 
4855         if (oldconfig && spa->spa_config)
4856                 VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0);
4857 
4858         if (new_state != POOL_STATE_UNINITIALIZED) {
4859                 if (!hardforce)
4860                         spa_config_sync(spa, !saveconfig, B_TRUE);
4861 
4862                 spa_remove(spa);
4863         }
4864         mutex_exit(&spa_namespace_lock);
4865 
4866         return (0);
4867 }
4868 
4869 /*
4870  * Destroy a storage pool.
4871  */
4872 int
4873 spa_destroy(char *pool)
4874 {
4875         return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
4876             B_FALSE, B_FALSE, B_FALSE));
4877 }
4878 
4879 /*
4880  * Export a storage pool.
4881  */
4882 int
4883 spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
4884     boolean_t hardforce, boolean_t saveconfig)
4885 {
4886         return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
4887             force, hardforce, saveconfig));
4888 }
4889 
4890 /*
4891  * Similar to spa_export(), this unloads the spa_t without actually removing it
4892  * from the namespace in any way.
4893  */
4894 int
4895 spa_reset(char *pool)
4896 {
4897         return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
4898             B_FALSE, B_FALSE, B_FALSE));
4899 }
4900 
4901 /*
4902  * ==========================================================================
4903  * Device manipulation
4904  * ==========================================================================
4905  */
4906 
4907 /*
4908  * Add a device to a storage pool.
4909  */
4910 int
4911 spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
4912 {
4913         uint64_t txg, id;
4914         int error;
4915         vdev_t *rvd = spa->spa_root_vdev;
4916         vdev_t *vd, *tvd;
4917         nvlist_t **spares, **l2cache;
4918         uint_t nspares, nl2cache;
4919         dmu_tx_t *tx = NULL;
4920 
4921         ASSERT(spa_writeable(spa));
4922 
4923         txg = spa_vdev_enter(spa);
4924 
4925         if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
4926             VDEV_ALLOC_ADD)) != 0)
4927                 return (spa_vdev_exit(spa, NULL, txg, error));
4928 
4929         spa->spa_pending_vdev = vd;  /* spa_vdev_exit() will clear this */
4930 
4931         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, &spares,
4932             &nspares) != 0)
4933                 nspares = 0;
4934 
4935         if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE, &l2cache,
4936             &nl2cache) != 0)
4937                 nl2cache = 0;
4938 
4939         if (vd->vdev_children == 0 && nspares == 0 && nl2cache == 0)
4940                 return (spa_vdev_exit(spa, vd, txg, EINVAL));
4941 
4942         if (vd->vdev_children != 0 &&
4943             (error = vdev_create(vd, txg, B_FALSE)) != 0)
4944                 return (spa_vdev_exit(spa, vd, txg, error));
4945 
4946         /*
4947          * We must validate the spares and l2cache devices after checking the
4948          * children.  Otherwise, vdev_inuse() will blindly overwrite the spare.
4949          */
4950         if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
4951                 return (spa_vdev_exit(spa, vd, txg, error));
4952 
4953         /*
4954          * Transfer each new top-level vdev from vd to rvd.
4955          */
4956         for (int c = 0; c < vd->vdev_children; c++) {
4957 
4958                 /*
4959                  * Set the vdev id to the first hole, if one exists.
4960                  */
4961                 for (id = 0; id < rvd->vdev_children; id++) {
4962                         if (rvd->vdev_child[id]->vdev_ishole) {
4963                                 vdev_free(rvd->vdev_child[id]);
4964                                 break;
4965                         }
4966                 }
4967                 tvd = vd->vdev_child[c];
4968                 vdev_remove_child(vd, tvd);
4969                 tvd->vdev_id = id;
4970                 vdev_add_child(rvd, tvd);
4971                 vdev_config_dirty(tvd);
4972         }
4973 
4974         if (nspares != 0) {
4975                 spa_set_aux_vdevs(&spa->spa_spares, spares, nspares,
4976                     ZPOOL_CONFIG_SPARES);
4977                 spa_load_spares(spa);


4988         /*
4989          * We have to be careful when adding new vdevs to an existing pool.
4990          * If other threads start allocating from these vdevs before we
4991          * sync the config cache, and we lose power, then upon reboot we may
4992          * fail to open the pool because there are DVAs that the config cache
4993          * can't translate.  Therefore, we first add the vdevs without
4994          * initializing metaslabs; sync the config cache (via spa_vdev_exit());
4995          * and then let spa_config_update() initialize the new metaslabs.
4996          *
4997          * spa_load() checks for added-but-not-initialized vdevs, so that
4998          * if we lose power at any point in this sequence, the remaining
4999          * steps will be completed the next time we load the pool.
5000          */
5001         (void) spa_vdev_exit(spa, vd, txg, 0);
5002 
5003         mutex_enter(&spa_namespace_lock);
5004         spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
5005         spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
5006         mutex_exit(&spa_namespace_lock);
5007 
5008         /*
5009          * "spa_last_synced_txg(spa) + 1" is used because:
5010          *   - spa_vdev_exit() calls txg_wait_synced() for "txg"
5011          *   - spa_config_update() calls txg_wait_synced() for
5012          *     "spa_last_synced_txg(spa) + 1"
5013          */
5014         tx = dmu_tx_create_assigned(spa_get_dsl(spa),
5015             spa_last_synced_txg(spa) + 1);
5016         spa_special_feature_activate(spa, tx);
5017         dmu_tx_commit(tx);
5018 
5019         wbc_activate(spa, B_FALSE);
5020 
5021         return (0);
5022 }
5023 
5024 /*
5025  * Attach a device to a mirror.  The arguments are the path to any device
5026  * in the mirror, and the nvroot for the new device.  If the path specifies
5027  * a device that is not mirrored, we automatically insert the mirror vdev.
5028  *
5029  * If 'replacing' is specified, the new device is intended to replace the
5030  * existing device; in this case the two devices are made into their own
5031  * mirror using the 'replacing' vdev, which is functionally identical to
5032  * the mirror vdev (it actually reuses all the same ops) but has a few
5033  * extra rules: you can't attach to it after it's been created, and upon
5034  * completion of resilvering, the first disk (the one being replaced)
5035  * is automatically detached.
5036  */
5037 int
5038 spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
5039 {
5040         uint64_t txg, dtl_max_txg;
5041         vdev_t *rvd = spa->spa_root_vdev;
5042         vdev_t *oldvd, *newvd, *newrootvd, *pvd, *tvd;
5043         vdev_ops_t *pvops;
5044         char *oldvdpath, *newvdpath;
5045         int newvd_isspare;
5046         int error;
5047 
5048         ASSERT(spa_writeable(spa));
5049 
5050         txg = spa_vdev_enter(spa);
5051 
5052         oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5053         oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5054         if (oldvd == NULL)
5055                 return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5056 
5057         if (!oldvd->vdev_ops->vdev_op_leaf)
5058                 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5059 
5060         pvd = oldvd->vdev_parent;
5061 
5062         if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
5063             VDEV_ALLOC_ATTACH)) != 0)
5064                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5065 
5066         if (newrootvd->vdev_children != 1)
5067                 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5068 
5069         newvd = newrootvd->vdev_child[0];
5070 
5071         if (!newvd->vdev_ops->vdev_op_leaf)
5072                 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5073 


5199         newvd_isspare = newvd->vdev_isspare;
5200 
5201         /*
5202          * Mark newvd's DTL dirty in this txg.
5203          */
5204         vdev_dirty(tvd, VDD_DTL, newvd, txg);
5205 
5206         /*
5207          * Schedule the resilver to restart in the future. We do this to
5208          * ensure that dmu_sync-ed blocks have been stitched into the
5209          * respective datasets.
5210          */
5211         dsl_resilver_restart(spa->spa_dsl_pool, dtl_max_txg);
5212 
5213         if (spa->spa_bootfs)
5214                 spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
5215 
5216         spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
5217 
5218         /*
5219          * Check the CoS property of the old vdev; add a reference for the new vdev
5220          */
5221         if (oldvd->vdev_queue.vq_cos) {
5222                 cos_hold(oldvd->vdev_queue.vq_cos);
5223                 newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos;
5224         }
5225 
5226         /*
5227          * Commit the config
5228          */
5229         (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
5230 
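             /*
              * Log, for example, "attach vdev=<new path> to vdev=<old path>",
              * "replace vdev=<new path> for vdev=<old path>", or
              * "spare in vdev=<spare path> for vdev=<old path>".
              */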
5231         spa_history_log_internal(spa, "vdev attach", NULL,
5232             "%s vdev=%s %s vdev=%s",
5233             replacing && newvd_isspare ? "spare in" :
5234             replacing ? "replace" : "attach", newvdpath,
5235             replacing ? "for" : "to", oldvdpath);
5236 
5237         spa_strfree(oldvdpath);
5238         spa_strfree(newvdpath);
5239 
5240         return (0);
5241 }
5242 
5243 /*
5244  * Detach a device from a mirror or replacing vdev.
5245  *
5246  * If 'replace_done' is specified, only detach if the parent


5420                 vdev_reopen(tvd);
5421                 vdev_expand(tvd, txg);
5422         }
5423 
5424         vdev_config_dirty(tvd);
5425 
5426         /*
5427          * Mark vd's DTL as dirty in this txg.  vdev_dtl_sync() will see that
5428          * vd->vdev_detached is set and free vd's DTL object in syncing context.
5429          * But first make sure we're not on any *other* txg's DTL list, to
5430          * prevent vd from being accessed after it's freed.
5431          */
5432         vdpath = spa_strdup(vd->vdev_path);
5433         for (int t = 0; t < TXG_SIZE; t++)
5434                 (void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
5435         vd->vdev_detached = B_TRUE;
5436         vdev_dirty(tvd, VDD_DTL, vd, txg);
5437 
5438         spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
5439 
5440         /*
5441          * Release the references to CoS descriptors if any
5442          */
5443         if (vd->vdev_queue.vq_cos) {
5444                 cos_rele(vd->vdev_queue.vq_cos);
5445                 vd->vdev_queue.vq_cos = NULL;
5446         }
5447 
5448         /* hang on to the spa before we release the lock */
5449         spa_open_ref(spa, FTAG);
5450 
5451         error = spa_vdev_exit(spa, vd, txg, 0);
5452 
5453         spa_history_log_internal(spa, "detach", NULL,
5454             "vdev=%s", vdpath);
5455         spa_strfree(vdpath);
5456 
5457         /*
5458          * If this was the removal of the original device in a hot spare vdev,
5459          * then we want to go through and remove the device from the hot spare
5460          * list of every other pool.
5461          */
5462         if (unspare) {
5463                 spa_t *altspa = NULL;
5464 
5465                 mutex_enter(&spa_namespace_lock);
5466                 while ((altspa = spa_next(altspa)) != NULL) {
5467                         if (altspa->spa_state != POOL_STATE_ACTIVE ||


5490 
5491 /*
5492  * Split a set of devices from their mirrors, and create a new pool from them.
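 *
 * 'config' describes the pool to be created: its ZPOOL_CONFIG_VDEV_TREE
 * lists one leaf vdev to take from each top-level mirror of the original
 * pool.  'props' are properties to set on the new pool, and if 'exp' is
 * set the new pool is exported once the split completes.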
5493  */
5494 int
5495 spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
5496     nvlist_t *props, boolean_t exp)
5497 {
5498         int error = 0;
5499         uint64_t txg, *glist;
5500         spa_t *newspa;
5501         uint_t c, children, lastlog;
5502         nvlist_t **child, *nvl, *tmp;
5503         dmu_tx_t *tx;
5504         char *altroot = NULL;
5505         vdev_t *rvd, **vml = NULL;                      /* vdev modify list */
5506         boolean_t activate_slog;
5507 
5508         ASSERT(spa_writeable(spa));
5509 
5510         /*
5511          * Splitting a pool with activated WBC is not yet supported;
5512          * it will be implemented in the next release.
5513          */
5514         if (spa_feature_is_active(spa, SPA_FEATURE_WBC))
5515                 return (SET_ERROR(ENOTSUP));
5516 
5517         txg = spa_vdev_enter(spa);
5518 
5519         /* clear the log and flush everything up to now */
5520         activate_slog = spa_passivate_log(spa);
5521         (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5522         error = spa_offline_log(spa);
5523         txg = spa_vdev_config_enter(spa);
5524 
5525         if (activate_slog)
5526                 spa_activate_log(spa);
5527 
5528         if (error != 0)
5529                 return (spa_vdev_exit(spa, NULL, txg, error));
5530 
5531         /* check new spa name before going any further */
5532         if (spa_lookup(newname) != NULL)
5533                 return (spa_vdev_exit(spa, NULL, txg, EEXIST));
5534 
5535         /*
5536          * scan through all the children to ensure they're all mirrors
5537          */
5538         if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 ||
5539             nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN, &child,
5540             &children) != 0)
5541                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5542 
5543         /* first, check to ensure we've got the right child count */
5544         rvd = spa->spa_root_vdev;
5545         lastlog = 0;
5546         for (c = 0; c < rvd->vdev_children; c++) {
5547                 vdev_t *vd = rvd->vdev_child[c];
5548 
5549                 /* don't count the holes & logs as children */
5550                 if (vd->vdev_islog || vd->vdev_ishole) {
5551                         if (lastlog == 0)
5552                                 lastlog = c;
5553                         continue;
5554                 }
5555 
5556                 lastlog = 0;
5557         }
5558         if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
5559                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5560 
5561         /* next, ensure no spare or cache devices are part of the split */
5562         if (nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_SPARES, &tmp) == 0 ||
5563             nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_L2CACHE, &tmp) == 0)
5564                 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5565 
5566         vml = kmem_zalloc(children * sizeof (vdev_t *), KM_SLEEP);
5567         glist = kmem_zalloc(children * sizeof (uint64_t), KM_SLEEP);
5568 
5569         /* then, loop over each vdev and validate it */
5570         for (c = 0; c < children; c++) {


5583                         }
5584                 }
5585 
5586                 /* which disk is going to be split? */
5587                 if (nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_GUID,
5588                     &glist[c]) != 0) {
5589                         error = SET_ERROR(EINVAL);
5590                         break;
5591                 }
5592 
5593                 /* look it up in the spa */
5594                 vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
5595                 if (vml[c] == NULL) {
5596                         error = SET_ERROR(ENODEV);
5597                         break;
5598                 }
5599 
5600                 /* make sure there's nothing stopping the split */
5601                 if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
5602                     vml[c]->vdev_islog ||
5603                     vml[c]->vdev_ishole ||
5604                     vml[c]->vdev_isspare ||
5605                     vml[c]->vdev_isl2cache ||
5606                     !vdev_writeable(vml[c]) ||
5607                     vml[c]->vdev_children != 0 ||
5608                     vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
5609                     c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
5610                         error = SET_ERROR(EINVAL);
5611                         break;
5612                 }
5613 
5614                 if (vdev_dtl_required(vml[c])) {
5615                         error = SET_ERROR(EBUSY);
5616                         break;
5617                 }
5618 
5619                 /* we need certain info from the top level */
5620                 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_ARRAY,
5621                     vml[c]->vdev_top->vdev_ms_array) == 0);
5622                 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_SHIFT,
5623                     vml[c]->vdev_top->vdev_ms_shift) == 0);


5678             spa_generate_guid(NULL)) == 0);
5679         VERIFY0(nvlist_add_boolean(config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
5680         (void) nvlist_lookup_string(props,
5681             zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
5682 
5683         /* add the new pool to the namespace */
5684         newspa = spa_add(newname, config, altroot);
5685         newspa->spa_avz_action = AVZ_ACTION_REBUILD;
5686         newspa->spa_config_txg = spa->spa_config_txg;
5687         spa_set_log_state(newspa, SPA_LOG_CLEAR);
5688 
5689         /* release the spa config lock, retaining the namespace lock */
5690         spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5691 
5692         if (zio_injection_enabled)
5693                 zio_handle_panic_injection(spa, FTAG, 1);
5694 
5695         spa_activate(newspa, spa_mode_global);
5696         spa_async_suspend(newspa);
5697         spa_async_suspend(newspa);
5698         /* create the new pool from the disks of the original pool */
5699         error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE, B_TRUE);
5700         if (error)
5701                 goto out;
5702 
5703         /* if that worked, generate a real config for the new pool */
5704         if (newspa->spa_root_vdev != NULL) {
5705                 VERIFY(nvlist_alloc(&newspa->spa_config_splitting,
5706                     NV_UNIQUE_NAME, KM_SLEEP) == 0);
5707                 VERIFY(nvlist_add_uint64(newspa->spa_config_splitting,
5708                     ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa)) == 0);
5709                 spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
5710                     B_TRUE));
5711         }
5712 
5713         /* set the props */
5714         if (props != NULL) {
5715                 spa_configfile_set(newspa, props, B_FALSE);
5716                 error = spa_prop_set(newspa, props);
5717                 if (error)
5718                         goto out;
5719         }
5720 
5721         /* flush everything */
5722         txg = spa_vdev_config_enter(newspa);
5723         vdev_config_dirty(newspa->spa_root_vdev);
5724         (void) spa_vdev_config_exit(newspa, NULL, txg, 0, FTAG);
5725 
5726         if (zio_injection_enabled)
5727                 zio_handle_panic_injection(spa, FTAG, 2);
5728 
5729         spa_async_resume(newspa);
5730 
5731         /* finally, update the original pool's config */
5732         txg = spa_vdev_config_enter(spa);
5733         tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
5734         error = dmu_tx_assign(tx, TXG_WAIT);
5735         if (error != 0)
5736                 dmu_tx_abort(tx);
5737         for (c = 0; c < children; c++) {
5738                 if (vml[c] != NULL) {
5739                         vdev_t *tvd = vml[c]->vdev_top;
5740 
5741                         /*
5742                          * Need to be sure the detachable VDEV is not
5743                          * on any *other* txg's DTL list to prevent it
5744                          * from being accessed after it's freed.
5745                          */
5746                         for (int t = 0; t < TXG_SIZE; t++) {
5747                                 (void) txg_list_remove_this(
5748                                     &tvd->vdev_dtl_list, vml[c], t);
5749                         }
5750 
5751                         vdev_split(vml[c]);
5752                         if (error == 0)
5753                                 spa_history_log_internal(spa, "detach", tx,
5754                                     "vdev=%s", vml[c]->vdev_path);
5755 
5756                         vdev_free(vml[c]);
5757                 }
5758         }
5759         spa->spa_avz_action = AVZ_ACTION_REBUILD;
5760         vdev_config_dirty(spa->spa_root_vdev);
5761         spa->spa_config_splitting = NULL;
5762         nvlist_free(nvl);
5763         if (error == 0)
5764                 dmu_tx_commit(tx);
5765         (void) spa_vdev_exit(spa, NULL, txg, 0);
5766 
5767         if (zio_injection_enabled)
5768                 zio_handle_panic_injection(spa, FTAG, 3);
5769 
5770         /* split is complete; log a history record */
5771         spa_history_log_internal(newspa, "split", NULL,
5772             "from pool %s", spa_name(spa));
5773 
5774         kmem_free(vml, children * sizeof (vdev_t *));
5775 
5776         /* if we're not going to mount the filesystems in userland, export */
5777         if (exp)
5778                 error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
5779                     B_FALSE, B_FALSE, B_FALSE);
5780 
5781         return (error);
5782 
5783 out:
5784         spa_unload(newspa);
5785         spa_deactivate(newspa);
5786         spa_remove(newspa);
5787 
5788         txg = spa_vdev_config_enter(spa);
5789 
5790         /* re-online all offlined disks */
5791         for (c = 0; c < children; c++) {
5792                 if (vml[c] != NULL)
5793                         vml[c]->vdev_offline = B_FALSE;
5794         }
5795         vdev_reopen(spa->spa_root_vdev);
5796 
5797         nvlist_free(spa->spa_config_splitting);
5798         spa->spa_config_splitting = NULL;
5799         (void) spa_vdev_exit(spa, NULL, txg, error);
5800 
5801         kmem_free(vml, children * sizeof (vdev_t *));
5802         return (error);
5803 }
5804 
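/*
 * Return the element of the given nvlist array whose ZPOOL_CONFIG_GUID
 * matches 'target_guid', or NULL if there is none.
 */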
5805 static nvlist_t *
5806 spa_nvlist_lookup_by_guid(nvlist_t **nvpp, int count, uint64_t target_guid)
5807 {
5808         for (int i = 0; i < count; i++) {
5809                 uint64_t guid;
5810 
5811                 VERIFY(nvlist_lookup_uint64(nvpp[i], ZPOOL_CONFIG_GUID,
5812                     &guid) == 0);
5813 
5814                 if (guid == target_guid)
5815                         return (nvpp[i]);
5816         }
5817 
5818         return (NULL);
5819 }
5820 
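/*
 * Rewrite the aux device array 'name' (e.g. spares or l2cache) in 'config',
 * leaving out 'dev_to_remove'.
 */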
5821 static void
5822 spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count,
5823     nvlist_t *dev_to_remove)
5824 {
5825         nvlist_t **newdev = NULL;
5826 
5827         if (count > 1)
5828                 newdev = kmem_alloc((count - 1) * sizeof (void *), KM_SLEEP);
5829 
5830         for (int i = 0, j = 0; i < count; i++) {
5831                 if (dev[i] == dev_to_remove)
5832                         continue;
5833                 VERIFY(nvlist_dup(dev[i], &newdev[j++], KM_SLEEP) == 0);
5834         }
5835 
5836         VERIFY(nvlist_remove(config, name, DATA_TYPE_NVLIST_ARRAY) == 0);
5837         VERIFY(nvlist_add_nvlist_array(config, name, newdev, count - 1) == 0);
5838 
5839         for (int i = 0; i < count - 1; i++)
5840                 nvlist_free(newdev[i]);
5841 
5842         if (count > 1)
5843                 kmem_free(newdev, (count - 1) * sizeof (void *));
5844 }
5845 
5846 /*
5847  * Evacuate the device.
5848  */
5849 static int
5850 spa_vdev_remove_evacuate(spa_t *spa, vdev_t *vd)
5851 {
5852         uint64_t txg;
5853         int error = 0;
5854 
5855         ASSERT(MUTEX_HELD(&spa_namespace_lock));
5856         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
5857         ASSERT(vd == vd->vdev_top);
5858 
5859         /*
5860          * Evacuate the device.  We don't hold the config lock as writer
5861          * since we need to do I/O but we do keep the
5862          * spa_namespace_lock held.  Once this completes the device
5863          * should no longer have any blocks allocated on it.
5864          */
5865         if (vd->vdev_islog) {
5866                 if (vd->vdev_stat.vs_alloc != 0)
5867                         error = spa_offline_log(spa);
5868         } else {
5869                 error = SET_ERROR(ENOTSUP);
5870         }
5871 
5872         if (error)
5873                 return (error);
5874 
5875         /*
5876          * The evacuation succeeded.  Remove any remaining MOS metadata
5877          * associated with this vdev, and wait for these changes to sync.
5878          */
5879         ASSERT0(vd->vdev_stat.vs_alloc);
5880         txg = spa_vdev_config_enter(spa);
5881         vd->vdev_removing = B_TRUE;
5882         vdev_dirty_leaves(vd, VDD_DTL, txg);
5883         vdev_config_dirty(vd);
5884         spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5885 
5886         return (0);
5887 }
5888 
5889 /*
5890  * Complete the removal by cleaning up the namespace.
5891  */
5892 static void
5893 spa_vdev_remove_from_namespace(spa_t *spa, vdev_t *vd)
5894 {
5895         vdev_t *rvd = spa->spa_root_vdev;
5896         uint64_t id = vd->vdev_id;
5897         boolean_t last_vdev = (id == (rvd->vdev_children - 1));
5898 
5899         ASSERT(MUTEX_HELD(&spa_namespace_lock));
5900         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
5901         ASSERT(vd == vd->vdev_top);
5902 
5903         /*
5904          * Only remove any devices which are empty.
5905          */
5906         if (vd->vdev_stat.vs_alloc != 0)
5907                 return;
5908 
5909         (void) vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
5910 
5911         if (list_link_active(&vd->vdev_state_dirty_node))
5912                 vdev_state_clean(vd);
5913         if (list_link_active(&vd->vdev_config_dirty_node))
5914                 vdev_config_clean(vd);
5915 
5916         vdev_free(vd);
5917 
5918         if (last_vdev) {
5919                 vdev_compact_children(rvd);
5920         } else {
5921                 vd = vdev_alloc_common(spa, id, 0, &vdev_hole_ops);
5922                 vdev_add_child(rvd, vd);
5923         }
5924         vdev_config_dirty(rvd);
5925 
5926         /*
5927          * Reassess the health of our root vdev.
5928          */
5929         vdev_reopen(rvd);
5930 }
5931 
5932 /*
5933  * Remove a device from the pool -
5934  *
5935  * Removing a device from the vdev namespace requires several steps
5936  * and can take a significant amount of time.  As a result we use
5937  * the spa_vdev_config_[enter/exit] functions which allow us to
5938  * grab and release the spa_config_lock while still holding the namespace
5939  * lock.  During each step the configuration is synced out.
5940  *
5941  * Currently, this supports removing only hot spares, slogs, level 2 ARC
5942  * and special devices.
5943  */
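/*
 * For example, "zpool remove <pool> <device>" for a cache or spare device
 * typically reaches this function via the ZFS_IOC_VDEV_REMOVE ioctl with
 * the device's guid.
 */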
5944 int
5945 spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare)
5946 {
5947         vdev_t *vd;
5948         sysevent_t *ev = NULL;
5949         metaslab_group_t *mg;
5950         nvlist_t **spares, **l2cache, *nv;
5951         uint64_t txg = 0;
5952         uint_t nspares, nl2cache;
5953         int error = 0;
5954         boolean_t locked = MUTEX_HELD(&spa_namespace_lock);
5955 
5956         ASSERT(spa_writeable(spa));
5957 
5958         if (!locked)
5959                 txg = spa_vdev_enter(spa);
5960 
5961         vd = spa_lookup_by_guid(spa, guid, B_FALSE);
5962 
5963         if (spa->spa_spares.sav_vdevs != NULL &&
5964             nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
5965             ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0 &&
5966             (nv = spa_nvlist_lookup_by_guid(spares, nspares, guid)) != NULL) {
5967                 /*
5968                  * Only remove the hot spare if it's not currently in use
5969                  * in this pool.
5970                  */
5971                 if (vd == NULL || unspare) {
5972                         if (vd == NULL)
5973                                 vd = spa_lookup_by_guid(spa, guid, B_TRUE);
5974 
5975                         /*
5976                          * Release the references to CoS descriptors if any
5977                          */
5978                         if (vd != NULL && vd->vdev_queue.vq_cos) {
5979                                 cos_rele(vd->vdev_queue.vq_cos);
5980                                 vd->vdev_queue.vq_cos = NULL;
5981                         }
5982 
5983                         ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
5984                         spa_vdev_remove_aux(spa->spa_spares.sav_config,
5985                             ZPOOL_CONFIG_SPARES, spares, nspares, nv);
5986                         spa_load_spares(spa);
5987                         spa->spa_spares.sav_sync = B_TRUE;
5988                 } else {
5989                         error = SET_ERROR(EBUSY);
5990                 }
5991         } else if (spa->spa_l2cache.sav_vdevs != NULL &&
5992             nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
5993             ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0 &&
5994             (nv = spa_nvlist_lookup_by_guid(l2cache, nl2cache, guid)) != NULL) {
5995                 /*
5996                  * Cache devices can always be removed.
5997                  */
5998                 if (vd == NULL)
5999                         vd = spa_lookup_by_guid(spa, guid, B_TRUE);
6000                 /*
6001                  * Release the references to CoS descriptors if any
6002                  */
6003                 if (vd != NULL && vd->vdev_queue.vq_cos) {
6004                         cos_rele(vd->vdev_queue.vq_cos);
6005                         vd->vdev_queue.vq_cos = NULL;
6006                 }
6007 
6008                 ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
6009                 spa_vdev_remove_aux(spa->spa_l2cache.sav_config,
6010                     ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache, nv);
6011                 spa_load_l2cache(spa);
6012                 spa->spa_l2cache.sav_sync = B_TRUE;
6013         } else if (vd != NULL && vd->vdev_islog) {
6014                 ASSERT(!locked);
6015 
6016                 if (vd != vd->vdev_top)
6017                         return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
6018 
6019                 mg = vd->vdev_mg;
6020 
6021                 /*
6022                  * Stop allocating from this vdev.
6023                  */
6024                 metaslab_group_passivate(mg);
6025 
6026                 /*
6027                  * Wait for the youngest allocations and frees to sync,
6028                  * and then wait for the deferral of those frees to finish.
6029                  */
6030                 spa_vdev_config_exit(spa, NULL,
6031                     txg + TXG_CONCURRENT_STATES + TXG_DEFER_SIZE, 0, FTAG);
6032 
6033                 /*
6034                  * Attempt to evacuate the vdev.
6035                  */
6036                 error = spa_vdev_remove_evacuate(spa, vd);
6037 
6038                 txg = spa_vdev_config_enter(spa);
6039 
6040                 /*
6041                  * If we couldn't evacuate the vdev, unwind.
6042                  */
6043                 if (error) {
6044                         metaslab_group_activate(mg);
6045                         return (spa_vdev_exit(spa, NULL, txg, error));
6046                 }
6047 
6048                 /*
6049                  * Release the references to CoS descriptors if any
6050                  */
6051                 if (vd->vdev_queue.vq_cos) {
6052                         cos_rele(vd->vdev_queue.vq_cos);
6053                         vd->vdev_queue.vq_cos = NULL;
6054                 }
6055 
6056                 ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
6057 
6058                 /*
6059                  * Clean up the vdev namespace.
6060                  */
6062                 spa_vdev_remove_from_namespace(spa, vd);
6063 
6064         } else if (vd != NULL && vdev_is_special(vd)) {
6065                 ASSERT(!locked);
6066 
6067                 if (vd != vd->vdev_top)
6068                         return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
6069 
6070                 error = spa_special_vdev_remove(spa, vd, &txg);
6071                 if (error == 0) {
6072                         ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
6073                         spa_vdev_remove_from_namespace(spa, vd);
6074 
6075                         /*
6076                          * The user sees this field as the 'enablespecial'
6077                          * pool-level property
6078                          */
6079                         spa->spa_usesc = B_FALSE;
6080                 }
6081         } else if (vd != NULL) {
6082                 /*
6083                  * Normal vdevs cannot be removed (yet).
6084                  */
6085                 error = SET_ERROR(ENOTSUP);
6086         } else {
6087                 /*
6088                  * There is no vdev of any kind with the specified guid.
6089                  */
6090                 error = SET_ERROR(ENOENT);
6091         }
6092 
6093         if (!locked)
6094                 error = spa_vdev_exit(spa, NULL, txg, error);
6095 
6096         if (ev)
6097                 spa_event_notify_impl(ev);
6098 
6099         return (error);
6100 }
6101 
6102 /*
6103  * Find any device that's done replacing, or a vdev marked 'unspare' that's
6104  * currently spared, so we can detach it.
6105  */
6106 static vdev_t *
6107 spa_vdev_resilver_done_hunt(vdev_t *vd)
6108 {
6109         vdev_t *newvd, *oldvd;
6110 
6111         for (int c = 0; c < vd->vdev_children; c++) {
6112                 oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
6113                 if (oldvd != NULL)
6114                         return (oldvd);
6115         }
6116 
6117         /*
6118          * Check for a completed replacement.  We always consider the first
6119          * vdev in the list to be the oldest vdev, and the last one to be
6120          * the newest (see spa_vdev_attach() for how that works).  In
6121          * the case where the newest vdev is faulted, we will not automatically
6122          * remove it after a resilver completes.  This is OK as it will require
6123          * user intervention to determine which disk the admin wishes to keep.
6124          */
6125         if (vd->vdev_ops == &vdev_replacing_ops) {
6126                 ASSERT(vd->vdev_children > 1);
6127 
6128                 newvd = vd->vdev_child[vd->vdev_children - 1];
6129                 oldvd = vd->vdev_child[0];
6130 
6131                 if (vdev_dtl_empty(newvd, DTL_MISSING) &&
6132                     vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6133                     !vdev_dtl_required(oldvd))
6134                         return (oldvd);
6135         }
6136 
6137         /*
6138          * Check for a completed resilver with the 'unspare' flag set.
6139          * Also potentially update faulted state.
6140          */
6141         if (vd->vdev_ops == &vdev_spare_ops) {
6142                 vdev_t *first = vd->vdev_child[0];
6143                 vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
6144 
6145                 if (last->vdev_unspare) {
6146                         oldvd = first;
6147                         newvd = last;
6148                 } else if (first->vdev_unspare) {
6149                         oldvd = last;
6150                         newvd = first;
6151                 } else {
6152                         oldvd = NULL;
6153                 }
6154 
6155                 if (oldvd != NULL &&
6156                     vdev_dtl_empty(newvd, DTL_MISSING) &&
6157                     vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6158                     !vdev_dtl_required(oldvd))
6159                         return (oldvd);
6160 
6161                 vdev_propagate_state(vd);
6162 
6163                 /*
6164                  * If there are more than two spares attached to a disk,
6165                  * and those spares are not required, then we want to
6166                  * attempt to free them up now so that they can be used
6167                  * by other pools.  Once we're back down to a single
6168                  * disk+spare, we stop removing them.
6169                  */
6170                 if (vd->vdev_children > 2) {
6171                         newvd = vd->vdev_child[1];
6172 
6173                         if (newvd->vdev_isspare && last->vdev_isspare &&
6174                             vdev_dtl_empty(last, DTL_MISSING) &&
6175                             vdev_dtl_empty(last, DTL_OUTAGE) &&
6176                             !vdev_dtl_required(newvd))
6177                                 return (newvd);
6178                 }
6179         }
6180 
6181         return (NULL);
6182 }


6203                  */
6204                 if (ppvd->vdev_ops == &vdev_spare_ops && pvd->vdev_id == 0 &&
6205                     ppvd->vdev_children == 2) {
6206                         ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
6207                         sguid = ppvd->vdev_child[1]->vdev_guid;
6208                 }
6209                 ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));
6210 
6211                 spa_config_exit(spa, SCL_ALL, FTAG);
6212                 if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
6213                         return;
6214                 if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
6215                         return;
6216                 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6217         }
6218 
6219         spa_config_exit(spa, SCL_ALL, FTAG);
6220 }
6221 
6222 /*
6223  * ==========================================================================
6224  * SPA Scanning
6225  * ==========================================================================
6226  */
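/*
 * Pause or resume an active scrub.  This is refused (EBUSY) while a
 * resilver is in progress.
 */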
6227 int
6228 spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
6229 {
6230         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6231 
6232         if (dsl_scan_resilvering(spa->spa_dsl_pool))
6233                 return (SET_ERROR(EBUSY));
6234 
6235         return (dsl_scrub_set_pause_resume(spa->spa_dsl_pool, cmd));
6236 }
6237 
6238 int
6239 spa_scan_stop(spa_t *spa)
6240 {
6241         ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6242         if (dsl_scan_resilvering(spa->spa_dsl_pool))


6399          */
6400         if (tasks & SPA_ASYNC_PROBE) {
6401                 spa_vdev_state_enter(spa, SCL_NONE);
6402                 spa_async_probe(spa, spa->spa_root_vdev);
6403                 (void) spa_vdev_state_exit(spa, NULL, 0);
6404         }
6405 
6406         /*
6407          * If any devices are done replacing, detach them.
6408          */
6409         if (tasks & SPA_ASYNC_RESILVER_DONE)
6410                 spa_vdev_resilver_done(spa);
6411 
6412         /*
6413          * Kick off a resilver.
6414          */
6415         if (tasks & SPA_ASYNC_RESILVER)
6416                 dsl_resilver_restart(spa->spa_dsl_pool, 0);
6417 
6418         /*
6419          * Kick off L2 cache rebuilding.
6420          */
6421         if (tasks & SPA_ASYNC_L2CACHE_REBUILD)
6422                 l2arc_spa_rebuild_start(spa);
6423 
6424         if (tasks & SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY) {
6425                 mutex_enter(&spa->spa_man_trim_lock);
6426                 spa_man_trim_taskq_destroy(spa);
6427                 mutex_exit(&spa->spa_man_trim_lock);
6428         }
6429 
6430         /*
6431          * Let the world know that we're done.
6432          */
6433         mutex_enter(&spa->spa_async_lock);
6434         spa->spa_async_thread = NULL;
6435         cv_broadcast(&spa->spa_async_cv);
6436         mutex_exit(&spa->spa_async_lock);
6437         thread_exit();
6438 }
6439 
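/*
 * Suspend async task processing: bump the suspend count and wait for any
 * currently running async thread to exit.
 */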
6440 void
6441 spa_async_suspend(spa_t *spa)
6442 {
6443         mutex_enter(&spa->spa_async_lock);
6444         spa->spa_async_suspended++;
6445         while (spa->spa_async_thread != NULL)
6446                 cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
6447         mutex_exit(&spa->spa_async_lock);
6448 }
6449 
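/*
 * Re-enable async task processing by dropping the suspend count taken in
 * spa_async_suspend().
 */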
6450 void
6451 spa_async_resume(spa_t *spa)
6452 {
6453         mutex_enter(&spa->spa_async_lock);
6454         ASSERT(spa->spa_async_suspended != 0);
6455         spa->spa_async_suspended--;
6456         mutex_exit(&spa->spa_async_lock);
6457 }
6458 
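/*
 * Determine whether any requested async tasks are ready to run.  A pending
 * config-cache update does not count while the most recent cache-file write
 * failure is still within the retry interval.
 */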
6459 static boolean_t
6460 spa_async_tasks_pending(spa_t *spa)
6461 {
6462         uint_t non_config_tasks;
6463         uint_t config_task;
6464         boolean_t config_task_suspended;
6465 
6466         non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
6467         config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
6468         if (spa->spa_ccw_fail_time == 0) {
6469                 config_task_suspended = B_FALSE;
6470         } else {
6471                 config_task_suspended =
6472                     (gethrtime() - spa->spa_ccw_fail_time) <
6473                     (zfs_ccw_retry_interval * NANOSEC);
6474         }
6475 
6476         return (non_config_tasks || (config_task && !config_task_suspended));
6477 }
6478 
6479 static void
6480 spa_async_dispatch(spa_t *spa)
6481 {
6482         mutex_enter(&spa->spa_async_lock);
6483         if (spa_async_tasks_pending(spa) &&
6484             !spa->spa_async_suspended &&
6485             spa->spa_async_thread == NULL &&
6486             rootdir != NULL)
6487                 spa->spa_async_thread = thread_create(NULL, 0,
6488                     spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
6489         mutex_exit(&spa->spa_async_lock);
6490 }
6491 
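/*
 * Record a request for the given async task(s); the work itself is picked
 * up later by spa_async_dispatch().
 */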
6492 void
6493 spa_async_request(spa_t *spa, int task)
6494 {
6495         zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
6496         mutex_enter(&spa->spa_async_lock);
6497         spa->spa_async_tasks |= task;
6498         mutex_exit(&spa->spa_async_lock);
6499 }
6500 
6501 void
6502 spa_async_unrequest(spa_t *spa, int task)
6503 {
6504         zfs_dbgmsg("spa=%s async unrequest task=%u", spa->spa_name, task);
6505         mutex_enter(&spa->spa_async_lock);
6506         spa->spa_async_tasks &= ~task;
6507         mutex_exit(&spa->spa_async_lock);
6508 }
6509 
6510 /*
6511  * ==========================================================================
6512  * SPA syncing routines
6513  * ==========================================================================
6514  */
6515 
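/*
 * bplist_iterate() callback that moves each block pointer into the bpobj
 * passed as 'arg' (e.g. the pool's deferred-free bpobj).
 */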
6516 static int
6517 bpobj_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6518 {
6519         bpobj_t *bpo = arg;
6520         bpobj_enqueue(bpo, bp, tx);
6521         return (0);
6522 }
6523 
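/*
 * Callback that frees each block pointer as a child of the zio passed as
 * 'arg', so the frees for this txg can be issued asynchronously.
 */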
6524 static int
6525 spa_free_sync_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6526 {
6527         zio_t *zio = arg;
6528 
6529         zio_nowait(zio_free_sync(zio, zio->io_spa, dmu_tx_get_txg(tx), bp,


6778          * Setting the version is special cased when first creating the pool.
6779          */
6780         ASSERT(tx->tx_txg != TXG_INITIAL);
6781 
6782         ASSERT(SPA_VERSION_IS_SUPPORTED(version));
6783         ASSERT(version >= spa_version(spa));
6784 
6785         spa->spa_uberblock.ub_version = version;
6786         vdev_config_dirty(spa->spa_root_vdev);
6787         spa_history_log_internal(spa, "set", tx, "version=%lld", version);
6788 }
6789 
6790 /*
6791  * Set zpool properties.
6792  */
6793 static void
6794 spa_sync_props(void *arg, dmu_tx_t *tx)
6795 {
6796         nvlist_t *nvp = arg;
6797         spa_t *spa = dmu_tx_pool(tx)->dp_spa;
6798         spa_meta_placement_t *mp = &spa->spa_meta_policy;
6799         objset_t *mos = spa->spa_meta_objset;
6800         nvpair_t *elem = NULL;
6801 
6802         mutex_enter(&spa->spa_props_lock);
6803 
6804         while ((elem = nvlist_next_nvpair(nvp, elem))) {
6805                 uint64_t intval;
6806                 char *strval, *fname;
6807                 zpool_prop_t prop;
6808                 const char *propname;
6809                 zprop_type_t proptype;
6810                 spa_feature_t fid;
6811 
6812                 switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
6813                 case ZPROP_INVAL:
6814                         /*
6815                          * We checked this earlier in spa_prop_validate().
6816                          */
6817                         ASSERT(zpool_prop_feature(nvpair_name(elem)));
6818 
6819                         fname = strchr(nvpair_name(elem), '@') + 1;
6820                         VERIFY0(zfeature_lookup_name(fname, &fid));
6821 
6822                         spa_feature_enable(spa, fid, tx);
6823                         spa_history_log_internal(spa, "set", tx,
6824                             "%s=enabled", nvpair_name(elem));
6825                         break;
6826 
6827                 case ZPOOL_PROP_VERSION:
6828                         intval = fnvpair_value_uint64(elem);
6829                         /*
6830                          * The version is synced separately before other
6831                          * properties and should be correct by now.
6832                          */
6833                         ASSERT3U(spa_version(spa), >=, intval);


6891                                 intval = fnvpair_value_uint64(elem);
6892 
6893                                 if (proptype == PROP_TYPE_INDEX) {
6894                                         const char *unused;
6895                                         VERIFY0(zpool_prop_index_to_string(
6896                                             prop, intval, &unused));
6897                                 }
6898                                 VERIFY0(zap_update(mos,
6899                                     spa->spa_pool_props_object, propname,
6900                                     8, 1, &intval, tx));
6901                                 spa_history_log_internal(spa, "set", tx,
6902                                     "%s=%lld", nvpair_name(elem), intval);
6903                         } else {
6904                                 ASSERT(0); /* not allowed */
6905                         }
6906 
6907                         switch (prop) {
6908                         case ZPOOL_PROP_DELEGATION:
6909                                 spa->spa_delegation = intval;
6910                                 break;
6911                         case ZPOOL_PROP_DDT_DESEGREGATION:
6912                                 spa_set_ddt_classes(spa, intval);
6913                                 break;
6914                         case ZPOOL_PROP_DEDUP_BEST_EFFORT:
6915                                 spa->spa_dedup_best_effort = intval;
6916                                 break;
6917                         case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
6918                                 spa->spa_dedup_lo_best_effort = intval;
6919                                 break;
6920                         case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
6921                                 spa->spa_dedup_hi_best_effort = intval;
6922                                 break;
6923                         case ZPOOL_PROP_BOOTFS:
6924                                 spa->spa_bootfs = intval;
6925                                 break;
6926                         case ZPOOL_PROP_FAILUREMODE:
6927                                 spa->spa_failmode = intval;
6928                                 break;
6929                         case ZPOOL_PROP_FORCETRIM:
6930                                 spa->spa_force_trim = intval;
6931                                 break;
6932                         case ZPOOL_PROP_AUTOTRIM:
6933                                 mutex_enter(&spa->spa_auto_trim_lock);
6934                                 if (intval != spa->spa_auto_trim) {
6935                                         spa->spa_auto_trim = intval;
6936                                         if (intval != 0)
6937                                                 spa_auto_trim_taskq_create(spa);
6938                                         else
6939                                                 spa_auto_trim_taskq_destroy(
6940                                                     spa);
6941                                 }
6942                                 mutex_exit(&spa->spa_auto_trim_lock);
6943                                 break;
6944                         case ZPOOL_PROP_AUTOEXPAND:
6945                                 spa->spa_autoexpand = intval;
6946                                 if (tx->tx_txg != TXG_INITIAL)
6947                                         spa_async_request(spa,
6948                                             SPA_ASYNC_AUTOEXPAND);
6949                                 break;
6950                         case ZPOOL_PROP_DEDUPDITTO:
6951                                 spa->spa_dedup_ditto = intval;
6952                                 break;
6953                         case ZPOOL_PROP_MINWATERMARK:
6954                                 spa->spa_minwat = intval;
6955                                 break;
6956                         case ZPOOL_PROP_LOWATERMARK:
6957                                 spa->spa_lowat = intval;
6958                                 break;
6959                         case ZPOOL_PROP_HIWATERMARK:
6960                                 spa->spa_hiwat = intval;
6961                                 break;
6962                         case ZPOOL_PROP_DEDUPMETA_DITTO:
6963                                 spa->spa_ddt_meta_copies = intval;
6964                                 break;
6965                         case ZPOOL_PROP_META_PLACEMENT:
6966                                 mp->spa_enable_meta_placement_selection =
6967                                     intval;
6968                                 break;
6969                         case ZPOOL_PROP_SYNC_TO_SPECIAL:
6970                                 mp->spa_sync_to_special = intval;
6971                                 break;
6972                         case ZPOOL_PROP_DDT_META_TO_METADEV:
6973                                 mp->spa_ddt_meta_to_special = intval;
6974                                 break;
6975                         case ZPOOL_PROP_ZFS_META_TO_METADEV:
6976                                 mp->spa_zfs_meta_to_special = intval;
6977                                 break;
6978                         case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
6979                                 mp->spa_small_data_to_special = intval;
6980                                 break;
6981                         case ZPOOL_PROP_RESILVER_PRIO:
6982                                 spa->spa_resilver_prio = intval;
6983                                 break;
6984                         case ZPOOL_PROP_SCRUB_PRIO:
6985                                 spa->spa_scrub_prio = intval;
6986                                 break;
6987                         default:
6988                                 break;
6989                         }
6990                 }
6991 
6992         }
6993 
6994         mutex_exit(&spa->spa_props_lock);
6995 }
6996 
6997 /*
6998  * Perform one-time upgrade on-disk changes.  spa_version() does not
6999  * reflect the new version this txg, so there must be no changes this
7000  * txg to anything that the upgrade code depends on after it executes.
7001  * Therefore this must be called after dsl_pool_sync() does the sync
7002  * tasks.
7003  */
7004 static void
7005 spa_sync_upgrades(spa_t *spa, dmu_tx_t *tx)
7006 {


7052                         spa_feature_incr(spa, SPA_FEATURE_LZ4_COMPRESS, tx);
7053         }
7054 
7055         /*
7056          * If we haven't written the salt, do so now.  Note that the
7057          * feature may not be activated yet, but that's fine since
7058          * the presence of this ZAP entry is backwards compatible.
7059          */
7060         if (zap_contains(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
7061             DMU_POOL_CHECKSUM_SALT) == ENOENT) {
7062                 VERIFY0(zap_add(spa->spa_meta_objset,
7063                     DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
7064                     sizeof (spa->spa_cksum_salt.zcs_bytes),
7065                     spa->spa_cksum_salt.zcs_bytes, tx));
7066         }
7067 
7068         rrw_exit(&dp->dp_config_rwlock, FTAG);
7069 }
7070 
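/*
 * Set up allocation throttling for the normal and special metaslab
 * classes: each class is given queue_depth_total allocation slots and
 * inherits the global DVA-throttle setting.
 */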
7071 static void
7072 spa_initialize_alloc_trees(spa_t *spa, uint32_t max_queue_depth,
7073     uint64_t queue_depth_total)
7074 {
7075         vdev_t *rvd = spa->spa_root_vdev;
7076         boolean_t dva_throttle_enabled = zio_dva_throttle_enabled;
7077         metaslab_class_t *mcs[2] = {
7078                 spa_normal_class(spa),
7079                 spa_special_class(spa)
7080         };
7081         size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
7082 
7083         for (size_t i = 0; i < mcs_len; i++) {
7084                 metaslab_class_t *mc = mcs[i];
7085 
7086                 ASSERT0(refcount_count(&mc->mc_alloc_slots));
7087                 mc->mc_alloc_max_slots = queue_depth_total;
7088                 mc->mc_alloc_throttle_enabled = dva_throttle_enabled;
7089 
7090                 ASSERT3U(mc->mc_alloc_max_slots, <=,
7091                     max_queue_depth * rvd->vdev_children);
7092         }
7093 }
7094 
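/*
 * Verify that the per-class allocation trees are empty.  Called at the
 * start and end of spa_sync() to ensure no allocations are left queued.
 */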
7095 static void
7096 spa_check_alloc_trees(spa_t *spa)
7097 {
7098         metaslab_class_t *mcs[2] = {
7099                 spa_normal_class(spa),
7100                 spa_special_class(spa)
7101         };
7102         size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
7103 
7104         for (size_t i = 0; i < mcs_len; i++) {
7105                 metaslab_class_t *mc = mcs[i];
7106 
7107                 mutex_enter(&mc->mc_alloc_lock);
7108                 VERIFY0(avl_numnodes(&mc->mc_alloc_tree));
7109                 mutex_exit(&mc->mc_alloc_lock);
7110         }
7111 }
7112 
7113 /*
7114  * Sync the specified transaction group.  New blocks may be dirtied as
7115  * part of the process, so we iterate until it converges.
7116  */
7117 void
7118 spa_sync(spa_t *spa, uint64_t txg)
7119 {
7120         dsl_pool_t *dp = spa->spa_dsl_pool;
7121         objset_t *mos = spa->spa_meta_objset;
7122         bplist_t *free_bpl = &spa->spa_free_bplist[txg & TXG_MASK];
7123         vdev_t *rvd = spa->spa_root_vdev;
7124         vdev_t *vd;
7125         dmu_tx_t *tx;
7126         int error;
7127         uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
7128             zfs_vdev_queue_depth_pct / 100;
7129 
7130         VERIFY(spa_writeable(spa));
7131 
7132         /*
7133          * Lock out configuration changes.
7134          */
7135         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7136 
7137         spa->spa_syncing_txg = txg;
7138         spa->spa_sync_pass = 0;
7139 
7140         spa_check_alloc_trees(spa);
7141 
7142         /*
7143          * Another pool management task might currently be holding
7144          * spa_auto_trim_lock while waiting for this txg to sync on its
7145          * behalf, so be prepared to postpone autotrim processing.
7146          */
7147         if (mutex_tryenter(&spa->spa_auto_trim_lock)) {
7148                 if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
7149                         spa_auto_trim(spa, txg);
7150                 mutex_exit(&spa->spa_auto_trim_lock);
7151         }
7152 
7153         /*
7154          * If there are any pending vdev state changes, convert them
7155          * into config changes that go out with this transaction group.
7156          */
7157         spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7158         while (list_head(&spa->spa_state_dirty_list) != NULL) {
7159                 /*
7160                  * We need the write lock here because, for aux vdevs,
7161                  * calling vdev_config_dirty() modifies sav_config.
7162                  * This is ugly and will become unnecessary when we
7163                  * eliminate the aux vdev wart by integrating all vdevs
7164                  * into the root vdev tree.
7165                  */
7166                 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7167                 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_WRITER);
7168                 while ((vd = list_head(&spa->spa_state_dirty_list)) != NULL) {
7169                         vdev_state_clean(vd);
7170                         vdev_config_dirty(vd);
7171                 }
7172                 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7173                 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);


7208          * out this txg.
7209          */
7210         uint64_t queue_depth_total = 0;
7211         for (int c = 0; c < rvd->vdev_children; c++) {
7212                 vdev_t *tvd = rvd->vdev_child[c];
7213                 metaslab_group_t *mg = tvd->vdev_mg;
7214 
7215                 if (mg == NULL || mg->mg_class != spa_normal_class(spa) ||
7216                     !metaslab_group_initialized(mg))
7217                         continue;
7218 
7219                 /*
7220                  * It is safe to do a lock-free check here because only async
7221                  * allocations look at mg_max_alloc_queue_depth, and async
7222                  * allocations all happen from spa_sync().
7223                  */
7224                 ASSERT0(refcount_count(&mg->mg_alloc_queue_depth));
7225                 mg->mg_max_alloc_queue_depth = max_queue_depth;
7226                 queue_depth_total += mg->mg_max_alloc_queue_depth;
7227         }
7228 
7229         spa_initialize_alloc_trees(spa, max_queue_depth,
7230             queue_depth_total);
7231 
7232         /*
7233          * Iterate to convergence.
7234          */
7235 
7236         zfs_autosnap_t *autosnap = spa_get_autosnap(dp->dp_spa);
7237         mutex_enter(&autosnap->autosnap_lock);
7238 
7239         autosnap_zone_t *zone = list_head(&autosnap->autosnap_zones);
7240         while (zone != NULL) {
7241                 zone->created = B_FALSE;
7242                 zone->dirty = B_FALSE;
7243                 zone = list_next(&autosnap->autosnap_zones, zone);
7244         }
7245 
7246         mutex_exit(&autosnap->autosnap_lock);
7247 
7248         do {
7249                 int pass = ++spa->spa_sync_pass;
7250 
7251                 spa_sync_config_object(spa, tx);
7252                 spa_sync_aux_dev(spa, &spa->spa_spares, tx,
7253                     ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
7254                 spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
7255                     ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
7256                 spa_errlog_sync(spa, txg);
7257                 dsl_pool_sync(dp, txg);
7258 
7259                 if (pass < zfs_sync_pass_deferred_free) {
7260                         spa_sync_frees(spa, free_bpl, tx);
7261                 } else {
7262                         /*
7263                          * We can not defer frees in pass 1, because
7264                          * we sync the deferred frees later in pass 1.
7265                          */
7266                         ASSERT3U(pass, >, 1);
7267                         bplist_iterate(free_bpl, bpobj_enqueue_cb,
7268                             &spa->spa_deferred_bpobj, tx);
7269                 }
7270 
7271                 ddt_sync(spa, txg);
7272                 dsl_scan_sync(dp, tx);
7273 
7274                 while (vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7275                         vdev_sync(vd, txg);
7276 
7277                 if (pass == 1) {
7278                         spa_sync_upgrades(spa, tx);
7279                         ASSERT3U(txg, >=,
7280                             spa->spa_uberblock.ub_rootbp.blk_birth);
7281                         /*
7282                          * Note: We need to check if the MOS is dirty
7283                          * because we could have marked the MOS dirty
7284                          * without updating the uberblock (e.g. if we
7285                          * have sync tasks but no dirty user data).  We
7286                          * need to check the uberblock's rootbp because
7287                          * it is updated if we have synced out dirty
7288                          * data (though in this case the MOS will most
7289                          * likely also be dirty due to second order
7290                          * effects, we don't want to rely on that here).
7291                          */
7292                         if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7293                             !dmu_objset_is_dirty(mos, txg)) {
7294                                 /*


7306                         spa_sync_deferred_frees(spa, tx);
7307                 }
7308 
7309         } while (dmu_objset_is_dirty(mos, txg));
7310 
7311         if (!list_is_empty(&spa->spa_config_dirty_list)) {
7312                 /*
7313                  * Make sure that the number of ZAPs for all the vdevs matches
7314                  * the number of ZAPs in the per-vdev ZAP list. This only gets
7315                  * called if the config is dirty; otherwise there may be
7316                  * outstanding AVZ operations that weren't completed in
7317                  * spa_sync_config_object.
7318                  */
7319                 uint64_t all_vdev_zap_entry_count;
7320                 ASSERT0(zap_count(spa->spa_meta_objset,
7321                     spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
7322                 ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
7323                     all_vdev_zap_entry_count);
7324         }
7325 
7326         /*
7327          * Rewrite the vdev configuration (which includes the uberblock)
7328          * to commit the transaction group.
7329          *
7330          * If there are no dirty vdevs, we sync the uberblock to a few
7331          * random top-level vdevs that are known to be visible in the
7332          * config cache (see spa_vdev_add() for a complete description).
7333          * If there *are* dirty vdevs, sync the uberblock to all vdevs.
7334          */
7335         for (;;) {
7336                 /*
7337                  * We hold SCL_STATE to prevent vdev open/close/etc.
7338                  * while we're attempting to write the vdev labels.
7339                  */
7340                 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7341 
7342                 if (list_is_empty(&spa->spa_config_dirty_list)) {
7343                         vdev_t *svd[SPA_DVAS_PER_BP];
7344                         int svdcount = 0;
7345                         int children = rvd->vdev_children;
7346                         int c0 = spa_get_random(children);
7347 
7348                         for (int c = 0; c < children; c++) {
7349                                 vd = rvd->vdev_child[(c0 + c) % children];
7350                                 if (vd->vdev_ms_array == 0 || vd->vdev_islog)
7351                                         continue;
7352                                 svd[svdcount++] = vd;
7353                                 if (svdcount == SPA_DVAS_PER_BP)
7354                                         break;
7355                         }
7356                         error = vdev_config_sync(svd, svdcount, txg);
7357                 } else {
7358                         error = vdev_config_sync(rvd->vdev_child,
7359                             rvd->vdev_children, txg);
7360                 }
7361 
7362                 if (error == 0)
7363                         spa->spa_last_synced_guid = rvd->vdev_guid;
7364 
7365                 spa_config_exit(spa, SCL_STATE, FTAG);
7366 
7367                 if (error == 0)
7368                         break;
7369                 zio_suspend(spa, NULL);
7370                 zio_resume_wait(spa);
7371         }
7372         dmu_tx_commit(tx);
7373 
7374         VERIFY(cyclic_reprogram(spa->spa_deadman_cycid, CY_INFINITY));
7375 
7376         /*
7377          * Clear the dirty config list.
7378          */
7379         while ((vd = list_head(&spa->spa_config_dirty_list)) != NULL)
7380                 vdev_config_clean(vd);
7381 
7382         /*
7383          * Now that the new config has synced transactionally,
7384          * let it become visible to the config cache.
7385          */
7386         if (spa->spa_config_syncing != NULL) {
7387                 spa_config_set(spa, spa->spa_config_syncing);
7388                 spa->spa_config_txg = txg;
7389                 spa->spa_config_syncing = NULL;
7390         }
7391 
7392         dsl_pool_sync_done(dp, txg);
7393 
7394         spa_check_alloc_trees(spa);
7395 
7396         /*
7397          * Update usable space statistics.
7398          */
7399         while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
7400                 vdev_sync_done(vd, txg);
7401 
7402         spa_update_dspace(spa);
7403         spa_update_latency(spa);
7404         /*
7405          * It had better be the case that we didn't dirty anything
7406          * since vdev_config_sync().
7407          */
7408         ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7409         ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7410         ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
7411 
7412         spa->spa_sync_pass = 0;
7413 
7414         spa_check_special(spa);
7415 
7416         /*
7417          * Update the last synced uberblock here. We want to do this at
7418          * the end of spa_sync() so that consumers of spa_last_synced_txg()
7419          * will be guaranteed that all the processing associated with
7420          * that txg has been completed.
7421          */
7422         spa->spa_ubsync = spa->spa_uberblock;
7423         spa_config_exit(spa, SCL_CONFIG, FTAG);
7424 
7425         spa_handle_ignored_writes(spa);
7426 
7427         /*
7428          * If any async tasks have been requested, kick them off.
7429          */
7430         spa_async_dispatch(spa);
7431 }
7432 
7433 /*
7434  * Sync all pools.  We don't want to hold the namespace lock across these
7435  * operations, so we take a reference on the spa_t and drop the lock during the


7468         spa_t *spa;
7469 
7470         /*
7471          * Remove all cached state.  All pools should be closed now,
7472          * so every spa in the AVL tree should be unreferenced.
7473          */
7474         mutex_enter(&spa_namespace_lock);
7475         while ((spa = spa_next(NULL)) != NULL) {
7476                 /*
7477                  * Stop async tasks.  The async thread may need to detach
7478                  * a device that's been replaced, which requires grabbing
7479                  * spa_namespace_lock, so we must drop it here.
7480                  */
7481                 spa_open_ref(spa, FTAG);
7482                 mutex_exit(&spa_namespace_lock);
7483                 spa_async_suspend(spa);
7484                 mutex_enter(&spa_namespace_lock);
7485                 spa_close(spa, FTAG);
7486 
7487                 if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
7488                         wbc_deactivate(spa);
7489 
7490                         spa_unload(spa);
7491                         spa_deactivate(spa);
7492                 }
7493 
7494                 spa_remove(spa);
7495         }
7496         mutex_exit(&spa_namespace_lock);
7497 }
7498 
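/*
 * Look up a vdev by guid anywhere in the pool; if 'aux' is set, the
 * auxiliary (cache and spare) vdev lists are searched as well.
 */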
7499 vdev_t *
7500 spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
7501 {
7502         vdev_t *vd;
7503         int i;
7504 
7505         if ((vd = vdev_lookup_by_guid(spa->spa_root_vdev, guid)) != NULL)
7506                 return (vd);
7507 
7508         if (aux) {
7509                 for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
7510                         vd = spa->spa_l2cache.sav_vdevs[i];
7511                         if (vd->vdev_guid == guid)
7512                                 return (vd);
7513                 }


7569  * Check if a pool has an active shared spare device.
7570  * Note: reference count of an active spare is 2, as a spare and as a replacement
7571  */
7572 static boolean_t
7573 spa_has_active_shared_spare(spa_t *spa)
7574 {
7575         int i, refcnt;
7576         uint64_t pool;
7577         spa_aux_vdev_t *sav = &spa->spa_spares;
7578 
7579         for (i = 0; i < sav->sav_count; i++) {
7580                 if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
7581                     &refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
7582                     refcnt > 2)
7583                         return (B_TRUE);
7584         }
7585 
7586         return (B_FALSE);
7587 }
7588 
7589 /*
7590  * Post a sysevent corresponding to the given event.  The 'name' must be one of
7591  * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
7592  * filled in from the spa and (optionally) the vdev.  This doesn't do anything
7593  * in the userland libzpool, as we don't want consumers to misinterpret ztest
7594  * or zdb as real changes.
7595  */
7596 static sysevent_t *
7597 spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7598 {
7599         sysevent_t              *ev = NULL;
7600 #ifdef _KERNEL
7601         sysevent_attr_list_t    *attr = NULL;
7602         sysevent_value_t        value;
7603 
7604         ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
7605             SE_SLEEP);
7606         ASSERT(ev != NULL);
7607 
7608         value.value_type = SE_DATA_TYPE_STRING;
7609         value.value.sv_string = spa_name(spa);
7610         if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
7611                 goto done;
7612 
7613         value.value_type = SE_DATA_TYPE_UINT64;
7614         value.value.sv_uint64 = spa_guid(spa);
7615         if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
7616                 goto done;
7617 
7618         if (vd != NULL) {
7619                 value.value_type = SE_DATA_TYPE_UINT64;
7620                 value.value.sv_uint64 = vd->vdev_guid;
7621                 if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
7622                     SE_SLEEP) != 0)
7623                         goto done;
7624 
7625                 if (vd->vdev_path) {
7626                         value.value_type = SE_DATA_TYPE_STRING;
7627                         value.value.sv_string = vd->vdev_path;
7628                         if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
7629                             &value, SE_SLEEP) != 0)
7630                                 goto done;
7631                 }
7632         }
7633 
7634         if (hist_nvl != NULL) {
7635                 fnvlist_merge((nvlist_t *)attr, hist_nvl);
7636         }
7637 
7638         if (sysevent_attach_attributes(ev, attr) != 0)
7639                 goto done;
7640         attr = NULL;
7641 
7642 done:
7643         if (attr)
7644                 sysevent_free_attr(attr);
7645 
7646 #endif
7647         return (ev);
7648 }
7649 
7650 static void
7651 spa_event_post(void *arg)
7652 {
7653 #ifdef _KERNEL
7654         sysevent_t *ev = (sysevent_t *)arg;
7655 
7656         sysevent_id_t           eid;
7657 
7658         (void) log_sysevent(ev, SE_SLEEP, &eid);
7659         sysevent_free(ev);
7660 #endif
7661 }
7662 
7663 /*
7664  * Dispatch event notifications to the taskq such that the corresponding
7665  * sysevents are queued with no spa locks held
7666  */
7667 taskq_t *spa_sysevent_taskq;
7668 
7669 static void
7670 spa_event_notify_impl(sysevent_t *ev)
7671 {
7672         if (taskq_dispatch(spa_sysevent_taskq, spa_event_post,
7673             ev, TQ_NOSLEEP) == NULL) {
7674                 /*
7675                  * These are management sysevents; as much as it is
7676                  * unpleasant to drop these due to syseventd not being able
7677                  * to keep up, perhaps due to resource shortages, we are not
7678                  * going to sleep here and risk locking up the pool sync
7679                  * process; notify admin of problems
7680                  */
7681                 cmn_err(CE_NOTE, "Could not dispatch sysevent notification "
7682                     "for %s, please check state of syseventd\n",
7683                     sysevent_get_subclass_name(ev));
7684 
7685                 sysevent_free(ev);
7686 
7687                 return;
7688         }
7689 }
7690 
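/*
 * Create a sysevent for the given pool/vdev and hand it off to the
 * notification taskq.
 */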
7691 void
7692 spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7693 {
7694         spa_event_notify_impl(spa_event_create(spa, vd, hist_nvl, name));
7695 }
7696 
7697 /*
7698  * Dispatches all auto-trim processing to all top-level vdevs. This is
7699  * called from spa_sync once every txg.
7700  */
7701 static void
7702 spa_auto_trim(spa_t *spa, uint64_t txg)
7703 {
7704         ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER) == SCL_CONFIG);
7705         ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
7706         ASSERT(spa->spa_auto_trim_taskq != NULL);
7707 
7708         for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7709                 vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
7710                 vti->vti_vdev = spa->spa_root_vdev->vdev_child[i];
7711                 vti->vti_txg = txg;
7712                 vti->vti_done_cb = (void (*)(void *))spa_vdev_auto_trim_done;
7713                 vti->vti_done_arg = spa;
7714                 (void) taskq_dispatch(spa->spa_auto_trim_taskq,
7715                     (void (*)(void *))vdev_auto_trim, vti, TQ_SLEEP);
7716                 spa->spa_num_auto_trimming++;
7717         }
7718 }
7719 
7720 /*
7721  * Performs the sync update of the MOS pool directory's trim start/stop values.
7722  */
7723 static void
7724 spa_trim_update_time_sync(void *arg, dmu_tx_t *tx)
7725 {
7726         spa_t *spa = arg;
7727         VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
7728             DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
7729             &spa->spa_man_trim_start_time, tx));
7730         VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
7731             DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
7732             &spa->spa_man_trim_stop_time, tx));
7733 }
7734 
7735 /*
7736  * Updates the in-core and on-disk manual TRIM operation start/stop time.
7737  * Passing UINT64_MAX for either start_time or stop_time means that no
7738  * update to that value should be recorded.
7739  */
7740 static dmu_tx_t *
7741 spa_trim_update_time(spa_t *spa, uint64_t start_time, uint64_t stop_time)
7742 {
7743         int err;
7744         dmu_tx_t *tx;
7745 
7746         ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
7747         if (start_time != UINT64_MAX)
7748                 spa->spa_man_trim_start_time = start_time;
7749         if (stop_time != UINT64_MAX)
7750                 spa->spa_man_trim_stop_time = stop_time;
7751         tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
7752         err = dmu_tx_assign(tx, TXG_WAIT);
7753         if (err) {
7754                 dmu_tx_abort(tx);
7755                 return (NULL);
7756         }
7757         dsl_sync_task_nowait(spa_get_dsl(spa), spa_trim_update_time_sync,
7758             spa, 1, ZFS_SPACE_CHECK_RESERVED, tx);
7759 
7760         return (tx);
7761 }
7762 
7763 /*
7764  * Initiates a manual TRIM of the whole pool. This kicks off individual
7765  * TRIM tasks for each top-level vdev, which then pass over all of the free
7766  * space in all of the vdev's metaslabs and issues TRIM commands for that
7767  * space to the underlying vdevs.
7768  */
7769 extern void
7770 spa_man_trim(spa_t *spa, uint64_t rate)
7771 {
7772         dmu_tx_t *time_update_tx;
7773 
7774         mutex_enter(&spa->spa_man_trim_lock);
7775 
7776         if (rate != 0)
7777                 spa->spa_man_trim_rate = MAX(rate, spa_min_trim_rate(spa));
7778         else
7779                 spa->spa_man_trim_rate = 0;
7780 
7781         if (spa->spa_num_man_trimming) {
7782                 /*
7783                  * TRIM is already ongoing. Wake up all sleeping vdev trim
7784                  * threads because the trim rate might have changed above.
7785                  */
7786                 cv_broadcast(&spa->spa_man_trim_update_cv);
7787                 mutex_exit(&spa->spa_man_trim_lock);
7788                 return;
7789         }
7790         spa_man_trim_taskq_create(spa);
7791         spa->spa_man_trim_stop = B_FALSE;
7792 
7793         spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);
7794         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7795         for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7796                 vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
7797                 vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
7798                 vti->vti_vdev = vd;
7799                 vti->vti_done_cb = (void (*)(void *))spa_vdev_man_trim_done;
7800                 vti->vti_done_arg = spa;
7801                 spa->spa_num_man_trimming++;
7802 
7803                 vd->vdev_trim_prog = 0;
7804                 (void) taskq_dispatch(spa->spa_man_trim_taskq,
7805                     (void (*)(void *))vdev_man_trim, vti, TQ_SLEEP);
7806         }
7807         spa_config_exit(spa, SCL_CONFIG, FTAG);
7808         time_update_tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
7809         mutex_exit(&spa->spa_man_trim_lock);
7810         /* mustn't hold spa_man_trim_lock, to prevent deadlock with syncing ctx */
7811         if (time_update_tx != NULL)
7812                 dmu_tx_commit(time_update_tx);
7813 }
7814 
7815 /*
7816  * Orders a manual TRIM operation to stop and returns immediately.
7817  */
7818 extern void
7819 spa_man_trim_stop(spa_t *spa)
7820 {
7821         boolean_t held = MUTEX_HELD(&spa->spa_man_trim_lock);
7822         if (!held)
7823                 mutex_enter(&spa->spa_man_trim_lock);
7824         spa->spa_man_trim_stop = B_TRUE;
7825         cv_broadcast(&spa->spa_man_trim_update_cv);
7826         if (!held)
7827                 mutex_exit(&spa->spa_man_trim_lock);
7828 }
7829 
7830 /*
7831  * Orders a manual TRIM operation to stop and waits for both manual and
7832  * automatic TRIM to complete. By holding both the spa_man_trim_lock and
7833  * the spa_auto_trim_lock, the caller can guarantee that after this
7834  * function returns, no new TRIM operations can be initiated in parallel.
7835  */
7836 void
7837 spa_trim_stop_wait(spa_t *spa)
7838 {
7839         ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
7840         ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
7841         spa->spa_man_trim_stop = B_TRUE;
7842         cv_broadcast(&spa->spa_man_trim_update_cv);
7843         while (spa->spa_num_man_trimming > 0)
7844                 cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
7845         while (spa->spa_num_auto_trimming > 0)
7846                 cv_wait(&spa->spa_auto_trim_done_cv, &spa->spa_auto_trim_lock);
7847 }
7848 
7849 /*
7850  * Returns manual TRIM progress. Progress is indicated by four return values:
7851  * 1) prog: the number of bytes of space on the pool in total that manual
7852  *      TRIM has already passed (regardless if the space is allocated or not).
7853  *      Completion of the operation is indicated when either the returned value
7854  *      is zero, or when the returned value is equal to the sum of the sizes of
7855  *      all top-level vdevs.
7856  * 2) rate: the trim rate in bytes per second. A value of zero indicates that
7857  *      trim progresses as fast as possible.
7858  * 3) start_time: the UNIXTIME of when the last manual TRIM operation was
7859  *      started. If no manual trim was ever initiated on the pool, this is
7860  *      zero.
7861  * 4) stop_time: the UNIXTIME of when the last manual TRIM operation has
7862  *      stopped on the pool. If a trim was started (start_time != 0), but has
7863  *      not yet completed, stop_time will be zero. If a trim is NOT currently
7864  *      ongoing and start_time is non-zero, this indicates that the previously
7865  *      initiated TRIM operation was interrupted.
7866  */
7867 extern void
7868 spa_get_trim_prog(spa_t *spa, uint64_t *prog, uint64_t *rate,
7869     uint64_t *start_time, uint64_t *stop_time)
7870 {
7871         uint64_t total = 0;
7872         vdev_t *root_vd = spa->spa_root_vdev;
7873 
7874         ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
7875         mutex_enter(&spa->spa_man_trim_lock);
7876         if (spa->spa_num_man_trimming > 0) {
7877                 for (uint64_t i = 0; i < root_vd->vdev_children; i++) {
7878                         total += root_vd->vdev_child[i]->vdev_trim_prog;
7879                 }
7880         }
7881         *prog = total;
7882         *rate = spa->spa_man_trim_rate;
7883         *start_time = spa->spa_man_trim_start_time;
7884         *stop_time = spa->spa_man_trim_stop_time;
7885         mutex_exit(&spa->spa_man_trim_lock);
7886 }
7887 
7888 /*
7889  * Callback when a vdev_man_trim has finished on a single top-level vdev.
7890  */
7891 static void
7892 spa_vdev_man_trim_done(spa_t *spa)
7893 {
7894         dmu_tx_t *time_update_tx = NULL;
7895 
7896         mutex_enter(&spa->spa_man_trim_lock);
7897         ASSERT(spa->spa_num_man_trimming > 0);
7898         spa->spa_num_man_trimming--;
7899         if (spa->spa_num_man_trimming == 0) {
7900                 /* if we were interrupted, leave stop_time at zero */
7901                 if (!spa->spa_man_trim_stop)
7902                         time_update_tx = spa_trim_update_time(spa, UINT64_MAX,
7903                             gethrestime_sec());
7904                 spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_FINISH);
7905                 spa_async_request(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);
7906                 cv_broadcast(&spa->spa_man_trim_done_cv);
7907         }
7908         mutex_exit(&spa->spa_man_trim_lock);
7909 
7910         if (time_update_tx != NULL)
7911                 dmu_tx_commit(time_update_tx);
7912 }
7913 
7914 /*
7915  * Called from vdev_auto_trim when a vdev has completed its auto-trim
7916  * processing.
7917  */
7918 static void
7919 spa_vdev_auto_trim_done(spa_t *spa)
7920 {
7921         mutex_enter(&spa->spa_auto_trim_lock);
7922         ASSERT(spa->spa_num_auto_trimming > 0);
7923         spa->spa_num_auto_trimming--;
7924         if (spa->spa_num_auto_trimming == 0)
7925                 cv_broadcast(&spa->spa_auto_trim_done_cv);
7926         mutex_exit(&spa->spa_auto_trim_lock);
7927 }
7928 
7929 /*
7930  * Determines the minimum sensible rate at which a manual TRIM can be
7931  * performed on a given spa and returns it. Since we perform TRIM in
7932  * metaslab-sized increments, we'll just let the longest step between
7933  * metaslab TRIMs be 100s (random number, really). Thus, on a typical
7934  * 200-metaslab vdev, the longest a TRIM should take is about 5.5 hours.
7935  * It *can* take longer if the device is really slow to respond to
7936  * zio_trim() commands or it contains more than 200 metaslabs, or
7937  * metaslab sizes vary widely between top-level vdevs.
7938  */
7939 static uint64_t
7940 spa_min_trim_rate(spa_t *spa)
7941 {
7942         uint64_t smallest_ms_sz = UINT64_MAX;
7943 
7944         /* find the smallest metaslab */
7945         spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7946         for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7947                 smallest_ms_sz = MIN(smallest_ms_sz,
7948                     spa->spa_root_vdev->vdev_child[i]->vdev_ms[0]->ms_size);
7949         }
7950         spa_config_exit(spa, SCL_CONFIG, FTAG);
7951         VERIFY(smallest_ms_sz != 0);
7952 
7953         /* minimum TRIM rate is 1/100th of the smallest metaslab size */
7954         return (smallest_ms_sz / 100);
7955 }