9700 ZFS resilvered mirror does not balance reads
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
NEX-17931 Getting panic: vfs_mountroot: cannot mount root after split mirror syspool
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9989 Changing volume names can result in double imports and data corruption
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-6855 System fails to boot up after a large number of datasets created
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-8711 backport illumos 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-7550 zpool remove mirrored slog or special vdev causes system panic due to a NULL pointer dereference in "zfs" module
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6884 KRRP: replication deadlock due to unavailable resources
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6000 zpool destroy/export with autotrim=on panics due to lock assertion
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5702 Special vdev cannot be removed if it was used as slog
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5637 enablespecial property should be disabled after special vdev removal
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5068 In-progress scrub can drastically increase zpool import times
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-5219 WBC: Add capability to delay migration
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5078 Want ability to see progress of freeing data and how much is left to free after large file delete patch
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5019 wrcache activation races vs. 'zpool create -O wrc_mode='
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4934 Add capability to remove special vdev
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4830 writecache=off leaks data on special vdev (the data will never migrate)
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4876 On-demand TRIM shouldn't use system_taskq and should queue jobs
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4679 Autotrim taskq doesn't get destroyed on pool export
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
NEX-4567 KRRP: L2L replication inside of one pool causes ARC-deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
6529 Properly handle updates of variably-sized SA entries.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6175 sdev can create bogus zvol directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6174 /dev/zvol does not show pool directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
6046 SPARC boot should support com.delphix:hole_birth
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6041 SPARC boot should support LZ4
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6044 SPARC zfs reader is using wrong size for objset_phys
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
backout 5997: breaks "zpool add"
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
5770 Add load_nvlist() error handling
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Revert "NEX-4476 WRC: Allow to use write back cache per tree of datasets"
This reverts commit fe97b74444278a6f36fec93179133641296312da.
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-4077 taskq_dispatch in on-demand TRIM can sometimes fail
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Revert "NEX-3965 System may panic on the importing of pool with WRC"
This reverts commit 45bc50222913cddafde94621d28b78d6efaea897.
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3817 'zpool add' of special devices causes system panic
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
usr/src/uts/common/io/scsi/targets/sd.c
usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3213 need to load vdev props for all vdev including spares and l2arc vdevs
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-2112 `zdb -e <pool>` assertion failed for thread 0xfffffd7fff172a40
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-1228 Panic importing pool with active unsupported features
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Harold Shaw <harold.shaw@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-140 Duplicate entries in mantools and doctools manifests
NEX-1078 Replaced ASSERT with if-statement
NEX-521 Single threaded rpcbind is not scalable
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
NEX-1088 partially rolled back 641841bb
to fix a regression that caused an assert in read-only import.
OS-115 Heap leaks related to OS-114 and SUP-577
SUP-577 deadlock between zpool detach and syseventd
OS-103 handle CoS descriptor persistent references across vdev operations
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Make special vdev subtree topology the same as regular vdev subtree to simplify testcase setup
Fixup merge issues
Fix default properties' values after export/import
zfsxx issue #11: support for spare device groups
Issue #34: Add feature flag for the compound checksum - sha1crc32
Contributors: Boris Protopopov
Issue #7: add cacheability to the properties
Contributors: Boris Protopopov
Issue #27: Auto best-effort dedup enable/disable - settable per pool
Issue #7: Reconcile L2ARC and "special" use by datasets
Issue #9: Support for persistent CoS/vdev attributes with feature flags
Support for feature flags for special tier
Contributors: Daniil Lunev, Boris Protopopov
Issue #2: optimize DDE lookup in DDT objects
Added an option to control the number of classes of DDEs in the DDT.
The new default is one, i.e., all DDEs are stored together regardless of refcount.
Issue #3: Add support for parametrized number of copies for DDTs
Issue #25: Add a pool-level property that controls the number of copies of DDTs in the pool.
Fixup merge results
re #13850 Refactor ZFS config discovery IOCs to libzfs_core patterns
re 13748 added zpool export -c option
The zpool export -c command exports the specified pool while keeping its latest
configuration in the cache file for a subsequent zpool import -c.
re #13333 rb4362 - eliminated spa_update_iotime() to fix the stats
re #12684 rb4206 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
--- old/usr/src/uts/common/fs/zfs/spa.c
+++ new/usr/src/uts/common/fs/zfs/spa.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21
22 22 /*
23 23 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
24 - * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
25 - * Copyright (c) 2015, Nexenta Systems, Inc. All rights reserved.
24 + * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
26 25 * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
26 + * Copyright 2018 Nexenta Systems, Inc. All rights reserved.
27 27 * Copyright 2013 Saso Kiselkov. All rights reserved.
28 28 * Copyright (c) 2014 Integros [integros.com]
29 29 * Copyright 2016 Toomas Soome <tsoome@me.com>
30 - * Copyright 2017 Joyent, Inc.
30 + * Copyright 2018 Joyent, Inc.
31 31 * Copyright (c) 2017 Datto Inc.
32 - * Copyright 2018 OmniOS Community Edition (OmniOSce) Association.
33 32 */
34 33
35 34 /*
36 35 * SPA: Storage Pool Allocator
37 36 *
38 37 * This file contains all the routines used when modifying on-disk SPA state.
39 38 * This includes opening, importing, destroying, exporting a pool, and syncing a
40 39 * pool.
41 40 */
42 41
43 42 #include <sys/zfs_context.h>
44 43 #include <sys/fm/fs/zfs.h>
45 44 #include <sys/spa_impl.h>
46 45 #include <sys/zio.h>
47 46 #include <sys/zio_checksum.h>
48 47 #include <sys/dmu.h>
49 48 #include <sys/dmu_tx.h>
50 49 #include <sys/zap.h>
51 50 #include <sys/zil.h>
52 51 #include <sys/ddt.h>
53 52 #include <sys/vdev_impl.h>
54 -#include <sys/vdev_removal.h>
55 -#include <sys/vdev_indirect_mapping.h>
56 -#include <sys/vdev_indirect_births.h>
57 53 #include <sys/metaslab.h>
58 54 #include <sys/metaslab_impl.h>
59 55 #include <sys/uberblock_impl.h>
60 56 #include <sys/txg.h>
61 57 #include <sys/avl.h>
62 -#include <sys/bpobj.h>
63 58 #include <sys/dmu_traverse.h>
64 59 #include <sys/dmu_objset.h>
65 60 #include <sys/unique.h>
66 61 #include <sys/dsl_pool.h>
67 62 #include <sys/dsl_dataset.h>
68 63 #include <sys/dsl_dir.h>
69 64 #include <sys/dsl_prop.h>
70 65 #include <sys/dsl_synctask.h>
71 66 #include <sys/fs/zfs.h>
72 67 #include <sys/arc.h>
73 68 #include <sys/callb.h>
74 69 #include <sys/systeminfo.h>
75 70 #include <sys/spa_boot.h>
76 71 #include <sys/zfs_ioctl.h>
77 72 #include <sys/dsl_scan.h>
78 73 #include <sys/zfeature.h>
79 74 #include <sys/dsl_destroy.h>
75 +#include <sys/cos.h>
76 +#include <sys/special.h>
77 +#include <sys/wbc.h>
80 78 #include <sys/abd.h>
81 79
82 80 #ifdef _KERNEL
83 81 #include <sys/bootprops.h>
84 82 #include <sys/callb.h>
85 83 #include <sys/cpupart.h>
86 84 #include <sys/pool.h>
87 85 #include <sys/sysdc.h>
88 86 #include <sys/zone.h>
89 87 #endif /* _KERNEL */
90 88
91 89 #include "zfs_prop.h"
92 90 #include "zfs_comutil.h"
93 91
94 92 /*
95 93 * The interval, in seconds, at which failed configuration cache file writes
96 94 * should be retried.
97 95 */
98 -int zfs_ccw_retry_interval = 300;
96 +static int zfs_ccw_retry_interval = 300;
99 97
100 98 typedef enum zti_modes {
101 99 ZTI_MODE_FIXED, /* value is # of threads (min 1) */
102 100 ZTI_MODE_BATCH, /* cpu-intensive; value is ignored */
103 101 ZTI_MODE_NULL, /* don't create a taskq */
104 102 ZTI_NMODES
105 103 } zti_modes_t;
106 104
107 105 #define ZTI_P(n, q) { ZTI_MODE_FIXED, (n), (q) }
108 106 #define ZTI_BATCH { ZTI_MODE_BATCH, 0, 1 }
109 107 #define ZTI_NULL { ZTI_MODE_NULL, 0, 0 }
110 108
111 109 #define ZTI_N(n) ZTI_P(n, 1)
112 110 #define ZTI_ONE ZTI_N(1)
113 111
114 112 typedef struct zio_taskq_info {
115 113 zti_modes_t zti_mode;
116 114 uint_t zti_value;
117 115 uint_t zti_count;
118 116 } zio_taskq_info_t;
119 117
120 118 static const char *const zio_taskq_types[ZIO_TASKQ_TYPES] = {
121 119 "issue", "issue_high", "intr", "intr_high"
122 120 };
123 121
124 122 /*
125 123 * This table defines the taskq settings for each ZFS I/O type. When
126 124 * initializing a pool, we use this table to create an appropriately sized
127 125 * taskq. Some operations are low volume and therefore have a small, static
128 126 * number of threads assigned to their taskqs using the ZTI_N(#) or ZTI_ONE
129 127 * macros. Other operations process a large amount of data; the ZTI_BATCH
130 128 * macro causes us to create a taskq oriented for throughput. Some operations
131 129 * are so high frequency and short-lived that the taskq itself can become a a
132 130 * point of lock contention. The ZTI_P(#, #) macro indicates that we need an
133 131 * additional degree of parallelism specified by the number of threads per-
134 132 * taskq and the number of taskqs; when dispatching an event in this case, the
135 133 * particular taskq is chosen at random.
136 134 *
137 135 * The different taskq priorities are to handle the different contexts (issue
138 136 * and interrupt) and then to reserve threads for ZIO_PRIORITY_NOW I/Os that
139 137 * need to be handled with minimum delay.
140 138 */
141 139 const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
142 140 /* ISSUE ISSUE_HIGH INTR INTR_HIGH */
143 141 { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* NULL */
144 142 { ZTI_N(8), ZTI_NULL, ZTI_P(12, 8), ZTI_NULL }, /* READ */
145 143 { ZTI_BATCH, ZTI_N(5), ZTI_N(8), ZTI_N(5) }, /* WRITE */
146 144 { ZTI_P(12, 8), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* FREE */
147 145 { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* CLAIM */
148 146 { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* IOCTL */
149 147 };
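/*
 * Editor's note: the block below is an illustrative, standalone user-space
 * sketch and is NOT part of spa.c or this webrev.  It mirrors the ZTI_*
 * macros and the READ row of the table above to show how a row expands into
 * "number of taskqs x threads per taskq", and how a row with more than one
 * taskq is picked pseudo-randomly (spa_taskq_dispatch_ent() further down
 * uses gethrtime() % count for this).  All demo_* names are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef enum { DEMO_MODE_FIXED, DEMO_MODE_BATCH, DEMO_MODE_NULL } demo_mode_t;

typedef struct {
	demo_mode_t	zti_mode;
	unsigned	zti_value;	/* threads per taskq */
	unsigned	zti_count;	/* number of taskqs */
} demo_zti_t;

#define	DEMO_ZTI_P(n, q)	{ DEMO_MODE_FIXED, (n), (q) }
#define	DEMO_ZTI_N(n)		DEMO_ZTI_P(n, 1)

int
main(void)
{
	/* READ row above: issue = ZTI_N(8), intr = ZTI_P(12, 8) */
	demo_zti_t read_issue = DEMO_ZTI_N(8);
	demo_zti_t read_intr = DEMO_ZTI_P(12, 8);

	(void) printf("READ issue: %u taskq(s) of %u thread(s)\n",
	    read_issue.zti_count, read_issue.zti_value);
	(void) printf("READ intr:  %u taskq(s) of %u thread(s)\n",
	    read_intr.zti_count, read_intr.zti_value);

	/* With count > 1, dispatch picks one of the taskqs pseudo-randomly. */
	srand((unsigned)time(NULL));
	(void) printf("dispatch -> intr taskq #%u of %u\n",
	    (unsigned)(rand() % read_intr.zti_count), read_intr.zti_count);
	return (0);
}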
150 148
149 +static sysevent_t *spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl,
150 + const char *name);
151 +static void spa_event_notify_impl(sysevent_t *ev);
151 152 static void spa_sync_version(void *arg, dmu_tx_t *tx);
152 153 static void spa_sync_props(void *arg, dmu_tx_t *tx);
154 +static void spa_vdev_sync_props(void *arg, dmu_tx_t *tx);
155 +static int spa_vdev_prop_set_nosync(vdev_t *, nvlist_t *, boolean_t *);
153 156 static boolean_t spa_has_active_shared_spare(spa_t *spa);
154 -static int spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
155 - boolean_t reloading);
157 +static int spa_load_impl(spa_t *spa, uint64_t, nvlist_t *config,
158 + spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
159 + char **ereport);
156 160 static void spa_vdev_resilver_done(spa_t *spa);
161 +static void spa_auto_trim(spa_t *spa, uint64_t txg);
162 +static void spa_vdev_man_trim_done(spa_t *spa);
163 +static void spa_vdev_auto_trim_done(spa_t *spa);
164 +static uint64_t spa_min_trim_rate(spa_t *spa);
157 165
158 166 uint_t zio_taskq_batch_pct = 75; /* 1 thread per cpu in pset */
159 167 id_t zio_taskq_psrset_bind = PS_NONE;
160 168 boolean_t zio_taskq_sysdc = B_TRUE; /* use SDC scheduling class */
161 169 uint_t zio_taskq_basedc = 80; /* base duty cycle */
162 170
163 171 boolean_t spa_create_process = B_TRUE; /* no process ==> no sysdc */
164 172 extern int zfs_sync_pass_deferred_free;
165 173
166 174 /*
167 - * Report any spa_load_verify errors found, but do not fail spa_load.
168 - * This is used by zdb to analyze non-idle pools.
169 - */
170 -boolean_t spa_load_verify_dryrun = B_FALSE;
171 -
172 -/*
173 - * This (illegal) pool name is used when temporarily importing a spa_t in order
174 - * to get the vdev stats associated with the imported devices.
175 - */
176 -#define TRYIMPORT_NAME "$import"
177 -
178 -/*
179 - * For debugging purposes: print out vdev tree during pool import.
180 - */
181 -boolean_t spa_load_print_vdev_tree = B_FALSE;
182 -
183 -/*
184 - * A non-zero value for zfs_max_missing_tvds means that we allow importing
185 - * pools with missing top-level vdevs. This is strictly intended for advanced
186 - * pool recovery cases since missing data is almost inevitable. Pools with
187 - * missing devices can only be imported read-only for safety reasons, and their
188 - * fail-mode will be automatically set to "continue".
189 - *
190 - * With 1 missing vdev we should be able to import the pool and mount all
191 - * datasets. User data that was not modified after the missing device has been
192 - * added should be recoverable. This means that snapshots created prior to the
193 - * addition of that device should be completely intact.
194 - *
195 - * With 2 missing vdevs, some datasets may fail to mount since there are
196 - * dataset statistics that are stored as regular metadata. Some data might be
197 - * recoverable if those vdevs were added recently.
198 - *
199 - * With 3 or more missing vdevs, the pool is severely damaged and MOS entries
200 - * may be missing entirely. Chances of data recovery are very low. Note that
201 - * there are also risks of performing an inadvertent rewind as we might be
202 - * missing all the vdevs with the latest uberblocks.
203 - */
204 -uint64_t zfs_max_missing_tvds = 0;
205 -
206 -/*
207 - * The parameters below are similar to zfs_max_missing_tvds but are only
208 - * intended for a preliminary open of the pool with an untrusted config which
209 - * might be incomplete or out-dated.
210 - *
211 - * We are more tolerant for pools opened from a cachefile since we could have
212 - * an out-dated cachefile where a device removal was not registered.
213 - * We could have set the limit arbitrarily high but in the case where devices
214 - * are really missing we would want to return the proper error codes; we chose
215 - * SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available
216 - * and we get a chance to retrieve the trusted config.
217 - */
218 -uint64_t zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1;
219 -/*
220 - * In the case where config was assembled by scanning device paths (/dev/dsks
221 - * by default) we are less tolerant since all the existing devices should have
222 - * been detected and we want spa_load to return the right error codes.
223 - */
224 -uint64_t zfs_max_missing_tvds_scan = 0;
225 -
226 -/*
227 175 * ==========================================================================
228 176 * SPA properties routines
229 177 * ==========================================================================
230 178 */
231 179
232 180 /*
233 181 * Add a (source=src, propname=propval) list to an nvlist.
234 182 */
235 183 static void
236 184 spa_prop_add_list(nvlist_t *nvl, zpool_prop_t prop, char *strval,
237 185 uint64_t intval, zprop_source_t src)
238 186 {
239 187 const char *propname = zpool_prop_to_name(prop);
240 188 nvlist_t *propval;
241 189
242 190 VERIFY(nvlist_alloc(&propval, NV_UNIQUE_NAME, KM_SLEEP) == 0);
243 191 VERIFY(nvlist_add_uint64(propval, ZPROP_SOURCE, src) == 0);
244 192
245 193 if (strval != NULL)
246 194 VERIFY(nvlist_add_string(propval, ZPROP_VALUE, strval) == 0);
247 195 else
248 196 VERIFY(nvlist_add_uint64(propval, ZPROP_VALUE, intval) == 0);
249 197
250 198 VERIFY(nvlist_add_nvlist(nvl, propname, propval) == 0);
251 199 nvlist_free(propval);
252 200 }
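/*
 * Editor's note: the block below is an illustrative, standalone user-space
 * sketch and is NOT part of spa.c or this webrev.  It shows the nested
 * nvlist layout that spa_prop_add_list() above builds for each property:
 * nvl -> { <propname> -> { "source": <src>, "value": <val> } }, where the
 * literal "source"/"value" keys stand in for ZPROP_SOURCE/ZPROP_VALUE.
 * The demo_* names are hypothetical; link with -lnvpair on illumos.
 */
#include <libnvpair.h>
#include <stdio.h>
#include <stdlib.h>

#define	DEMO_VERIFY0(x)	do { if ((x) != 0) abort(); } while (0)

static void
demo_prop_add(nvlist_t *nvl, const char *propname, uint64_t intval,
    uint64_t src)
{
	nvlist_t *propval;

	DEMO_VERIFY0(nvlist_alloc(&propval, NV_UNIQUE_NAME, 0));
	DEMO_VERIFY0(nvlist_add_uint64(propval, "source", src));
	DEMO_VERIFY0(nvlist_add_uint64(propval, "value", intval));
	DEMO_VERIFY0(nvlist_add_nvlist(nvl, propname, propval));
	nvlist_free(propval);	/* nvlist_add_nvlist() stores a copy */
}

int
main(void)
{
	nvlist_t *nvl;

	DEMO_VERIFY0(nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0));
	demo_prop_add(nvl, "size", 1ULL << 30, 0);
	nvlist_print(stdout, nvl);
	nvlist_free(nvl);
	return (0);
}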
253 201
254 202 /*
255 203 * Get property values from the spa configuration.
256 204 */
257 205 static void
258 206 spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
259 207 {
260 208 vdev_t *rvd = spa->spa_root_vdev;
261 209 dsl_pool_t *pool = spa->spa_dsl_pool;
210 + spa_meta_placement_t *mp = &spa->spa_meta_policy;
262 211 uint64_t size, alloc, cap, version;
263 212 zprop_source_t src = ZPROP_SRC_NONE;
264 213 spa_config_dirent_t *dp;
265 214 metaslab_class_t *mc = spa_normal_class(spa);
266 215
267 216 ASSERT(MUTEX_HELD(&spa->spa_props_lock));
268 217
269 218 if (rvd != NULL) {
270 219 alloc = metaslab_class_get_alloc(spa_normal_class(spa));
271 220 size = metaslab_class_get_space(spa_normal_class(spa));
272 221 spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src);
273 222 spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src);
274 223 spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src);
275 224 spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL,
276 225 size - alloc, src);
226 + spa_prop_add_list(*nvp, ZPOOL_PROP_ENABLESPECIAL, NULL,
227 + (uint64_t)spa->spa_usesc, src);
228 + spa_prop_add_list(*nvp, ZPOOL_PROP_MINWATERMARK, NULL,
229 + spa->spa_minwat, src);
230 + spa_prop_add_list(*nvp, ZPOOL_PROP_HIWATERMARK, NULL,
231 + spa->spa_hiwat, src);
232 + spa_prop_add_list(*nvp, ZPOOL_PROP_LOWATERMARK, NULL,
233 + spa->spa_lowat, src);
234 + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPMETA_DITTO, NULL,
235 + spa->spa_ddt_meta_copies, src);
277 236
237 + spa_prop_add_list(*nvp, ZPOOL_PROP_META_PLACEMENT, NULL,
238 + mp->spa_enable_meta_placement_selection, src);
239 + spa_prop_add_list(*nvp, ZPOOL_PROP_SYNC_TO_SPECIAL, NULL,
240 + mp->spa_sync_to_special, src);
241 + spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_META_TO_METADEV, NULL,
242 + mp->spa_ddt_meta_to_special, src);
243 + spa_prop_add_list(*nvp, ZPOOL_PROP_ZFS_META_TO_METADEV,
244 + NULL, mp->spa_zfs_meta_to_special, src);
245 + spa_prop_add_list(*nvp, ZPOOL_PROP_SMALL_DATA_TO_METADEV, NULL,
246 + mp->spa_small_data_to_special, src);
247 +
278 248 spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL,
279 249 metaslab_class_fragmentation(mc), src);
280 250 spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL,
281 251 metaslab_class_expandable_space(mc), src);
282 252 spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL,
283 253 (spa_mode(spa) == FREAD), src);
284 254
255 + spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_DESEGREGATION, NULL,
256 + (spa->spa_ddt_class_min == spa->spa_ddt_class_max), src);
257 +
285 258 cap = (size == 0) ? 0 : (alloc * 100 / size);
286 259 spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src);
287 260
261 + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_BEST_EFFORT, NULL,
262 + spa->spa_dedup_best_effort, src);
263 +
264 + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, NULL,
265 + spa->spa_dedup_lo_best_effort, src);
266 +
267 + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, NULL,
268 + spa->spa_dedup_hi_best_effort, src);
269 +
288 270 spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL,
289 271 ddt_get_pool_dedup_ratio(spa), src);
290 272
273 + spa_prop_add_list(*nvp, ZPOOL_PROP_DDTCAPPED, NULL,
274 + spa->spa_ddt_capped, src);
275 +
291 276 spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL,
292 277 rvd->vdev_state, src);
293 278
294 279 version = spa_version(spa);
295 280 if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
296 281 src = ZPROP_SRC_DEFAULT;
297 282 else
298 283 src = ZPROP_SRC_LOCAL;
299 284 spa_prop_add_list(*nvp, ZPOOL_PROP_VERSION, NULL, version, src);
300 285 }
301 286
302 287 if (pool != NULL) {
303 288 /*
304 289 * The $FREE directory was introduced in SPA_VERSION_DEADLISTS,
305 290 * when opening pools before this version freedir will be NULL.
306 291 */
307 292 if (pool->dp_free_dir != NULL) {
308 293 spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL,
309 - dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes,
294 + dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes +
295 + pool->dp_long_freeing_total,
310 296 src);
311 297 } else {
312 298 spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING,
313 - NULL, 0, src);
299 + NULL, pool->dp_long_freeing_total, src);
314 300 }
315 301
316 302 if (pool->dp_leak_dir != NULL) {
317 303 spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL,
318 304 dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
319 305 src);
320 306 } else {
321 307 spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED,
322 308 NULL, 0, src);
323 309 }
324 310 }
325 311
326 312 spa_prop_add_list(*nvp, ZPOOL_PROP_GUID, NULL, spa_guid(spa), src);
327 313
328 314 if (spa->spa_comment != NULL) {
329 315 spa_prop_add_list(*nvp, ZPOOL_PROP_COMMENT, spa->spa_comment,
330 316 0, ZPROP_SRC_LOCAL);
331 317 }
332 318
333 319 if (spa->spa_root != NULL)
334 320 spa_prop_add_list(*nvp, ZPOOL_PROP_ALTROOT, spa->spa_root,
335 321 0, ZPROP_SRC_LOCAL);
336 322
337 323 if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS)) {
338 324 spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
339 325 MIN(zfs_max_recordsize, SPA_MAXBLOCKSIZE), ZPROP_SRC_NONE);
340 326 } else {
341 327 spa_prop_add_list(*nvp, ZPOOL_PROP_MAXBLOCKSIZE, NULL,
342 328 SPA_OLD_MAXBLOCKSIZE, ZPROP_SRC_NONE);
343 329 }
344 330
345 331 if ((dp = list_head(&spa->spa_config_list)) != NULL) {
346 332 if (dp->scd_path == NULL) {
347 333 spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
348 334 "none", 0, ZPROP_SRC_LOCAL);
349 335 } else if (strcmp(dp->scd_path, spa_config_path) != 0) {
350 336 spa_prop_add_list(*nvp, ZPOOL_PROP_CACHEFILE,
351 337 dp->scd_path, 0, ZPROP_SRC_LOCAL);
352 338 }
353 339 }
354 340 }
355 341
356 342 /*
357 343 * Get zpool property values.
358 344 */
359 345 int
360 346 spa_prop_get(spa_t *spa, nvlist_t **nvp)
361 347 {
362 348 objset_t *mos = spa->spa_meta_objset;
363 349 zap_cursor_t zc;
364 350 zap_attribute_t za;
365 351 int err;
366 352
367 353 VERIFY(nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP) == 0);
368 354
369 355 mutex_enter(&spa->spa_props_lock);
370 356
371 357 /*
372 358 * Get properties from the spa config.
373 359 */
374 360 spa_prop_get_config(spa, nvp);
375 361
376 362 /* If no pool property object, no more prop to get. */
377 363 if (mos == NULL || spa->spa_pool_props_object == 0) {
378 364 mutex_exit(&spa->spa_props_lock);
379 365 return (0);
380 366 }
381 367
382 368 /*
383 369 * Get properties from the MOS pool property object.
384 370 */
385 371 for (zap_cursor_init(&zc, mos, spa->spa_pool_props_object);
386 372 (err = zap_cursor_retrieve(&zc, &za)) == 0;
387 373 zap_cursor_advance(&zc)) {
388 374 uint64_t intval = 0;
389 375 char *strval = NULL;
390 376 zprop_source_t src = ZPROP_SRC_DEFAULT;
391 377 zpool_prop_t prop;
392 378
393 - if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
379 + if ((prop = zpool_name_to_prop(za.za_name)) == ZPROP_INVAL)
394 380 continue;
395 381
396 382 switch (za.za_integer_length) {
397 383 case 8:
398 384 /* integer property */
399 385 if (za.za_first_integer !=
400 386 zpool_prop_default_numeric(prop))
401 387 src = ZPROP_SRC_LOCAL;
402 388
403 389 if (prop == ZPOOL_PROP_BOOTFS) {
404 390 dsl_pool_t *dp;
405 391 dsl_dataset_t *ds = NULL;
406 392
407 393 dp = spa_get_dsl(spa);
408 394 dsl_pool_config_enter(dp, FTAG);
409 395 if (err = dsl_dataset_hold_obj(dp,
410 396 za.za_first_integer, FTAG, &ds)) {
411 397 dsl_pool_config_exit(dp, FTAG);
412 398 break;
413 399 }
414 400
415 401 strval = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN,
416 402 KM_SLEEP);
417 403 dsl_dataset_name(ds, strval);
418 404 dsl_dataset_rele(ds, FTAG);
419 405 dsl_pool_config_exit(dp, FTAG);
420 406 } else {
421 407 strval = NULL;
422 408 intval = za.za_first_integer;
423 409 }
424 410
425 411 spa_prop_add_list(*nvp, prop, strval, intval, src);
426 412
427 413 if (strval != NULL)
428 414 kmem_free(strval, ZFS_MAX_DATASET_NAME_LEN);
429 415
430 416 break;
431 417
432 418 case 1:
433 419 /* string property */
434 420 strval = kmem_alloc(za.za_num_integers, KM_SLEEP);
435 421 err = zap_lookup(mos, spa->spa_pool_props_object,
436 422 za.za_name, 1, za.za_num_integers, strval);
437 423 if (err) {
438 424 kmem_free(strval, za.za_num_integers);
439 425 break;
440 426 }
441 427 spa_prop_add_list(*nvp, prop, strval, 0, src);
442 428 kmem_free(strval, za.za_num_integers);
443 429 break;
444 430
445 431 default:
446 432 break;
447 433 }
448 434 }
449 435 zap_cursor_fini(&zc);
450 436 mutex_exit(&spa->spa_props_lock);
451 437 out:
452 438 if (err && err != ENOENT) {
453 439 nvlist_free(*nvp);
454 440 *nvp = NULL;
455 441 return (err);
456 442 }
457 443
458 444 return (0);
459 445 }
460 446
461 447 /*
462 448 * Validate the given pool properties nvlist and modify the list
463 449 * for the property values to be set.
464 450 */
465 451 static int
466 452 spa_prop_validate(spa_t *spa, nvlist_t *props)
467 453 {
468 454 nvpair_t *elem;
469 455 int error = 0, reset_bootfs = 0;
470 456 uint64_t objnum = 0;
471 457 boolean_t has_feature = B_FALSE;
458 + uint64_t lowat = spa->spa_lowat, hiwat = spa->spa_hiwat,
459 + minwat = spa->spa_minwat;
472 460
473 461 elem = NULL;
474 462 while ((elem = nvlist_next_nvpair(props, elem)) != NULL) {
475 463 uint64_t intval;
476 464 char *strval, *slash, *check, *fname;
477 465 const char *propname = nvpair_name(elem);
478 466 zpool_prop_t prop = zpool_name_to_prop(propname);
467 + spa_feature_t feature;
479 468
480 469 switch (prop) {
481 - case ZPOOL_PROP_INVAL:
470 + case ZPROP_INVAL:
482 471 if (!zpool_prop_feature(propname)) {
483 472 error = SET_ERROR(EINVAL);
484 473 break;
485 474 }
486 475
487 476 /*
488 477 * Sanitize the input.
489 478 */
490 479 if (nvpair_type(elem) != DATA_TYPE_UINT64) {
491 480 error = SET_ERROR(EINVAL);
492 481 break;
493 482 }
494 483
495 484 if (nvpair_value_uint64(elem, &intval) != 0) {
496 485 error = SET_ERROR(EINVAL);
497 486 break;
498 487 }
499 488
500 489 if (intval != 0) {
501 490 error = SET_ERROR(EINVAL);
502 491 break;
503 492 }
504 493
505 494 fname = strchr(propname, '@') + 1;
506 - if (zfeature_lookup_name(fname, NULL) != 0) {
495 + if (zfeature_lookup_name(fname, &feature) != 0) {
507 496 error = SET_ERROR(EINVAL);
508 497 break;
509 498 }
510 499
500 + if (feature == SPA_FEATURE_WBC &&
501 + !spa_has_special(spa)) {
502 + error = SET_ERROR(ENOTSUP);
503 + break;
504 + }
505 +
511 506 has_feature = B_TRUE;
512 507 break;
513 508
514 509 case ZPOOL_PROP_VERSION:
515 510 error = nvpair_value_uint64(elem, &intval);
516 511 if (!error &&
517 512 (intval < spa_version(spa) ||
518 513 intval > SPA_VERSION_BEFORE_FEATURES ||
519 514 has_feature))
520 515 error = SET_ERROR(EINVAL);
521 516 break;
522 517
523 518 case ZPOOL_PROP_DELEGATION:
524 519 case ZPOOL_PROP_AUTOREPLACE:
525 520 case ZPOOL_PROP_LISTSNAPS:
526 521 case ZPOOL_PROP_AUTOEXPAND:
522 + case ZPOOL_PROP_DEDUP_BEST_EFFORT:
523 + case ZPOOL_PROP_DDT_DESEGREGATION:
524 + case ZPOOL_PROP_META_PLACEMENT:
525 + case ZPOOL_PROP_FORCETRIM:
526 + case ZPOOL_PROP_AUTOTRIM:
527 527 error = nvpair_value_uint64(elem, &intval);
528 528 if (!error && intval > 1)
529 529 error = SET_ERROR(EINVAL);
530 530 break;
531 531
532 + case ZPOOL_PROP_DDT_META_TO_METADEV:
533 + case ZPOOL_PROP_ZFS_META_TO_METADEV:
534 + error = nvpair_value_uint64(elem, &intval);
535 + if (!error && intval > META_PLACEMENT_DUAL)
536 + error = SET_ERROR(EINVAL);
537 + break;
538 +
539 + case ZPOOL_PROP_SYNC_TO_SPECIAL:
540 + error = nvpair_value_uint64(elem, &intval);
541 + if (!error && intval > SYNC_TO_SPECIAL_ALWAYS)
542 + error = SET_ERROR(EINVAL);
543 + break;
544 +
545 + case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
546 + error = nvpair_value_uint64(elem, &intval);
547 + if (!error && intval > SPA_MAXBLOCKSIZE)
548 + error = SET_ERROR(EINVAL);
549 + break;
550 +
532 551 case ZPOOL_PROP_BOOTFS:
533 552 /*
534 553 * If the pool version is less than SPA_VERSION_BOOTFS,
535 554 * or the pool is still being created (version == 0),
536 555 * the bootfs property cannot be set.
537 556 */
538 557 if (spa_version(spa) < SPA_VERSION_BOOTFS) {
539 558 error = SET_ERROR(ENOTSUP);
540 559 break;
541 560 }
542 561
543 562 /*
544 563 * Make sure the vdev config is bootable
545 564 */
546 565 if (!vdev_is_bootable(spa->spa_root_vdev)) {
547 566 error = SET_ERROR(ENOTSUP);
548 567 break;
549 568 }
550 569
551 570 reset_bootfs = 1;
552 571
553 572 error = nvpair_value_string(elem, &strval);
554 573
555 574 if (!error) {
556 575 objset_t *os;
557 576 uint64_t propval;
558 577
559 578 if (strval == NULL || strval[0] == '\0') {
560 579 objnum = zpool_prop_default_numeric(
561 580 ZPOOL_PROP_BOOTFS);
562 581 break;
563 582 }
564 583
565 584 if (error = dmu_objset_hold(strval, FTAG, &os))
566 585 break;
567 586
568 587 /*
569 588 * Must be ZPL, and its property settings
570 589 * must be supported by GRUB (compression
571 590 * is not gzip, and large blocks are not used).
572 591 */
573 592
574 593 if (dmu_objset_type(os) != DMU_OST_ZFS) {
575 594 error = SET_ERROR(ENOTSUP);
576 595 } else if ((error =
577 596 dsl_prop_get_int_ds(dmu_objset_ds(os),
578 597 zfs_prop_to_name(ZFS_PROP_COMPRESSION),
|
↓ open down ↓ |
37 lines elided |
↑ open up ↑ |
579 598 &propval)) == 0 &&
580 599 !BOOTFS_COMPRESS_VALID(propval)) {
581 600 error = SET_ERROR(ENOTSUP);
582 601 } else {
583 602 objnum = dmu_objset_id(os);
584 603 }
585 604 dmu_objset_rele(os, FTAG);
586 605 }
587 606 break;
588 607
608 + case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
609 + error = nvpair_value_uint64(elem, &intval);
610 + if ((intval < 0) || (intval > 100) ||
611 + (intval >= spa->spa_dedup_hi_best_effort))
612 + error = SET_ERROR(EINVAL);
613 + break;
614 +
615 + case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
616 + error = nvpair_value_uint64(elem, &intval);
617 + if ((intval < 0) || (intval > 100) ||
618 + (intval <= spa->spa_dedup_lo_best_effort))
619 + error = SET_ERROR(EINVAL);
620 + break;
621 +
589 622 case ZPOOL_PROP_FAILUREMODE:
590 623 error = nvpair_value_uint64(elem, &intval);
591 624 if (!error && (intval < ZIO_FAILURE_MODE_WAIT ||
592 625 intval > ZIO_FAILURE_MODE_PANIC))
593 626 error = SET_ERROR(EINVAL);
594 627
595 628 /*
596 629 * This is a special case which only occurs when
597 630 * the pool has completely failed. This allows
598 631 * the user to change the in-core failmode property
599 632 * without syncing it out to disk (I/Os might
600 633 * currently be blocked). We do this by returning
601 634 * EIO to the caller (spa_prop_set) to trick it
602 635 * into thinking we encountered a property validation
603 636 * error.
604 637 */
605 638 if (!error && spa_suspended(spa)) {
606 639 spa->spa_failmode = intval;
607 640 error = SET_ERROR(EIO);
608 641 }
609 642 break;
610 643
611 644 case ZPOOL_PROP_CACHEFILE:
612 645 if ((error = nvpair_value_string(elem, &strval)) != 0)
613 646 break;
614 647
615 648 if (strval[0] == '\0')
616 649 break;
617 650
618 651 if (strcmp(strval, "none") == 0)
619 652 break;
620 653
621 654 if (strval[0] != '/') {
622 655 error = SET_ERROR(EINVAL);
623 656 break;
624 657 }
625 658
626 659 slash = strrchr(strval, '/');
627 660 ASSERT(slash != NULL);
628 661
629 662 if (slash[1] == '\0' || strcmp(slash, "/.") == 0 ||
630 663 strcmp(slash, "/..") == 0)
631 664 error = SET_ERROR(EINVAL);
632 665 break;
633 666
634 667 case ZPOOL_PROP_COMMENT:
635 668 if ((error = nvpair_value_string(elem, &strval)) != 0)
636 669 break;
637 670 for (check = strval; *check != '\0'; check++) {
638 671 /*
639 672 * The kernel doesn't have an easy isprint()
640 673 * check. For this kernel check, we merely
641 674 * check ASCII apart from DEL. Fix this if
642 675 * there is an easy-to-use kernel isprint().
643 676 */
644 677 if (*check >= 0x7f) {
645 678 error = SET_ERROR(EINVAL);
646 679 break;
647 680 }
648 681 }
649 682 if (strlen(strval) > ZPROP_MAX_COMMENT)
650 - error = E2BIG;
683 + error = SET_ERROR(E2BIG);
651 684 break;
652 685
653 686 case ZPOOL_PROP_DEDUPDITTO:
654 687 if (spa_version(spa) < SPA_VERSION_DEDUP)
655 688 error = SET_ERROR(ENOTSUP);
656 689 else
657 690 error = nvpair_value_uint64(elem, &intval);
658 691 if (error == 0 &&
659 692 intval != 0 && intval < ZIO_DEDUPDITTO_MIN)
660 693 error = SET_ERROR(EINVAL);
661 694 break;
695 +
696 + case ZPOOL_PROP_MINWATERMARK:
697 + error = nvpair_value_uint64(elem, &intval);
698 + if (!error && (intval > 100))
699 + error = SET_ERROR(EINVAL);
700 + minwat = intval;
701 + break;
702 + case ZPOOL_PROP_LOWATERMARK:
703 + error = nvpair_value_uint64(elem, &intval);
704 + if (!error && (intval > 100))
705 + error = SET_ERROR(EINVAL);
706 + lowat = intval;
707 + break;
708 + case ZPOOL_PROP_HIWATERMARK:
709 + error = nvpair_value_uint64(elem, &intval);
710 + if (!error && (intval > 100))
711 + error = SET_ERROR(EINVAL);
712 + hiwat = intval;
713 + break;
714 + case ZPOOL_PROP_DEDUPMETA_DITTO:
715 + error = nvpair_value_uint64(elem, &intval);
716 + if (!error && (intval > SPA_DVAS_PER_BP))
717 + error = SET_ERROR(EINVAL);
718 + break;
719 + case ZPOOL_PROP_SCRUB_PRIO:
720 + case ZPOOL_PROP_RESILVER_PRIO:
721 + error = nvpair_value_uint64(elem, &intval);
722 + if (error || intval > 100)
723 + error = SET_ERROR(EINVAL);
724 + break;
662 725 }
663 726
664 727 if (error)
665 728 break;
666 729 }
667 730
731 + /* check if low watermark is less than high watermark */
732 + if (lowat != 0 && lowat >= hiwat)
733 + error = SET_ERROR(EINVAL);
734 +
735 + /* check if min watermark is less than low watermark */
736 + if (minwat != 0 && minwat >= lowat)
737 + error = SET_ERROR(EINVAL);
738 +
668 739 if (!error && reset_bootfs) {
669 740 error = nvlist_remove(props,
670 741 zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING);
671 742
672 743 if (!error) {
673 744 error = nvlist_add_uint64(props,
674 745 zpool_prop_to_name(ZPOOL_PROP_BOOTFS), objnum);
675 746 }
676 747 }
677 748
678 749 return (error);
679 750 }
680 751
681 752 void
682 753 spa_configfile_set(spa_t *spa, nvlist_t *nvp, boolean_t need_sync)
683 754 {
684 755 char *cachefile;
685 756 spa_config_dirent_t *dp;
686 757
687 758 if (nvlist_lookup_string(nvp, zpool_prop_to_name(ZPOOL_PROP_CACHEFILE),
688 759 &cachefile) != 0)
689 760 return;
690 761
691 762 dp = kmem_alloc(sizeof (spa_config_dirent_t),
692 763 KM_SLEEP);
693 764
694 765 if (cachefile[0] == '\0')
695 766 dp->scd_path = spa_strdup(spa_config_path);
696 767 else if (strcmp(cachefile, "none") == 0)
697 768 dp->scd_path = NULL;
698 769 else
699 770 dp->scd_path = spa_strdup(cachefile);
700 771
701 772 list_insert_head(&spa->spa_config_list, dp);
702 773 if (need_sync)
703 774 spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
704 775 }
705 776
706 777 int
707 778 spa_prop_set(spa_t *spa, nvlist_t *nvp)
708 779 {
709 780 int error;
710 781 nvpair_t *elem = NULL;
711 782 boolean_t need_sync = B_FALSE;
712 783
713 784 if ((error = spa_prop_validate(spa, nvp)) != 0)
714 785 return (error);
715 786
716 787 while ((elem = nvlist_next_nvpair(nvp, elem)) != NULL) {
717 788 zpool_prop_t prop = zpool_name_to_prop(nvpair_name(elem));
718 789
719 790 if (prop == ZPOOL_PROP_CACHEFILE ||
720 791 prop == ZPOOL_PROP_ALTROOT ||
721 792 prop == ZPOOL_PROP_READONLY)
722 793 continue;
723 794
724 - if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
795 + if (prop == ZPOOL_PROP_VERSION || prop == ZPROP_INVAL) {
725 796 uint64_t ver;
726 797
727 798 if (prop == ZPOOL_PROP_VERSION) {
728 799 VERIFY(nvpair_value_uint64(elem, &ver) == 0);
729 800 } else {
730 801 ASSERT(zpool_prop_feature(nvpair_name(elem)));
731 802 ver = SPA_VERSION_FEATURES;
732 803 need_sync = B_TRUE;
733 804 }
734 805
735 806 /* Save time if the version is already set. */
736 807 if (ver == spa_version(spa))
737 808 continue;
738 809
739 810 /*
740 811 * In addition to the pool directory object, we might
741 812 * create the pool properties object, the features for
742 813 * read object, the features for write object, or the
743 814 * feature descriptions object.
744 815 */
745 816 error = dsl_sync_task(spa->spa_name, NULL,
746 817 spa_sync_version, &ver,
747 818 6, ZFS_SPACE_CHECK_RESERVED);
748 819 if (error)
749 820 return (error);
750 821 continue;
751 822 }
752 823
753 824 need_sync = B_TRUE;
754 825 break;
755 826 }
756 827
757 828 if (need_sync) {
758 829 return (dsl_sync_task(spa->spa_name, NULL, spa_sync_props,
759 830 nvp, 6, ZFS_SPACE_CHECK_RESERVED));
760 831 }
761 832
762 833 return (0);
763 834 }
764 835
765 836 /*
766 837 * If the bootfs property value is dsobj, clear it.
767 838 */
768 839 void
769 840 spa_prop_clear_bootfs(spa_t *spa, uint64_t dsobj, dmu_tx_t *tx)
770 841 {
771 842 if (spa->spa_bootfs == dsobj && spa->spa_pool_props_object != 0) {
772 843 VERIFY(zap_remove(spa->spa_meta_objset,
773 844 spa->spa_pool_props_object,
774 845 zpool_prop_to_name(ZPOOL_PROP_BOOTFS), tx) == 0);
775 846 spa->spa_bootfs = 0;
776 847 }
777 848 }
778 849
779 850 /*ARGSUSED*/
780 851 static int
781 852 spa_change_guid_check(void *arg, dmu_tx_t *tx)
782 853 {
783 854 uint64_t *newguid = arg;
784 855 spa_t *spa = dmu_tx_pool(tx)->dp_spa;
785 856 vdev_t *rvd = spa->spa_root_vdev;
786 857 uint64_t vdev_state;
787 858
788 859 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
789 860 vdev_state = rvd->vdev_state;
790 861 spa_config_exit(spa, SCL_STATE, FTAG);
791 862
792 863 if (vdev_state != VDEV_STATE_HEALTHY)
793 864 return (SET_ERROR(ENXIO));
794 865
795 866 ASSERT3U(spa_guid(spa), !=, *newguid);
796 867
797 868 return (0);
798 869 }
799 870
800 871 static void
801 872 spa_change_guid_sync(void *arg, dmu_tx_t *tx)
802 873 {
803 874 uint64_t *newguid = arg;
804 875 spa_t *spa = dmu_tx_pool(tx)->dp_spa;
805 876 uint64_t oldguid;
806 877 vdev_t *rvd = spa->spa_root_vdev;
807 878
808 879 oldguid = spa_guid(spa);
809 880
810 881 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
811 882 rvd->vdev_guid = *newguid;
812 883 rvd->vdev_guid_sum += (*newguid - oldguid);
813 884 vdev_config_dirty(rvd);
814 885 spa_config_exit(spa, SCL_STATE, FTAG);
815 886
816 887 spa_history_log_internal(spa, "guid change", tx, "old=%llu new=%llu",
817 888 oldguid, *newguid);
818 889 }
819 890
820 891 /*
821 892 * Change the GUID for the pool. This is done so that we can later
822 893 * re-import a pool built from a clone of our own vdevs. We will modify
823 894 * the root vdev's guid, our own pool guid, and then mark all of our
824 895 * vdevs dirty. Note that we must make sure that all our vdevs are
825 896 * online when we do this, or else any vdevs that weren't present
826 897 * would be orphaned from our pool. We are also going to issue a
827 898 * sysevent to update any watchers.
828 899 */
829 900 int
830 901 spa_change_guid(spa_t *spa)
831 902 {
832 903 int error;
833 904 uint64_t guid;
834 905
835 906 mutex_enter(&spa->spa_vdev_top_lock);
836 907 mutex_enter(&spa_namespace_lock);
837 908 guid = spa_generate_guid(NULL);
838 909
839 910 error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
840 911 spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);
841 912
842 913 if (error == 0) {
843 - spa_write_cachefile(spa, B_FALSE, B_TRUE);
914 + spa_config_sync(spa, B_FALSE, B_TRUE);
844 915 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
845 916 }
846 917
847 918 mutex_exit(&spa_namespace_lock);
848 919 mutex_exit(&spa->spa_vdev_top_lock);
849 920
850 921 return (error);
851 922 }
852 923
853 924 /*
854 925 * ==========================================================================
855 926 * SPA state manipulation (open/create/destroy/import/export)
856 927 * ==========================================================================
857 928 */
858 929
859 930 static int
860 931 spa_error_entry_compare(const void *a, const void *b)
861 932 {
862 933 spa_error_entry_t *sa = (spa_error_entry_t *)a;
863 934 spa_error_entry_t *sb = (spa_error_entry_t *)b;
864 935 int ret;
865 936
866 937 ret = bcmp(&sa->se_bookmark, &sb->se_bookmark,
867 938 sizeof (zbookmark_phys_t));
868 939
869 940 if (ret < 0)
870 941 return (-1);
871 942 else if (ret > 0)
872 943 return (1);
873 944 else
874 945 return (0);
875 946 }
876 947
877 948 /*
878 949 * Utility function which retrieves copies of the current logs and
879 950 * re-initializes them in the process.
880 951 */
881 952 void
882 953 spa_get_errlists(spa_t *spa, avl_tree_t *last, avl_tree_t *scrub)
883 954 {
884 955 ASSERT(MUTEX_HELD(&spa->spa_errlist_lock));
885 956
886 957 bcopy(&spa->spa_errlist_last, last, sizeof (avl_tree_t));
887 958 bcopy(&spa->spa_errlist_scrub, scrub, sizeof (avl_tree_t));
888 959
889 960 avl_create(&spa->spa_errlist_scrub,
890 961 spa_error_entry_compare, sizeof (spa_error_entry_t),
891 962 offsetof(spa_error_entry_t, se_avl));
892 963 avl_create(&spa->spa_errlist_last,
893 964 spa_error_entry_compare, sizeof (spa_error_entry_t),
894 965 offsetof(spa_error_entry_t, se_avl));
895 966 }
896 967
897 968 static void
898 969 spa_taskqs_init(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
899 970 {
900 971 const zio_taskq_info_t *ztip = &zio_taskqs[t][q];
901 972 enum zti_modes mode = ztip->zti_mode;
902 973 uint_t value = ztip->zti_value;
903 974 uint_t count = ztip->zti_count;
904 975 spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
905 976 char name[32];
906 977 uint_t flags = 0;
907 978 boolean_t batch = B_FALSE;
908 979
909 980 if (mode == ZTI_MODE_NULL) {
910 981 tqs->stqs_count = 0;
911 982 tqs->stqs_taskq = NULL;
912 983 return;
913 984 }
914 985
915 986 ASSERT3U(count, >, 0);
916 987
917 988 tqs->stqs_count = count;
918 989 tqs->stqs_taskq = kmem_alloc(count * sizeof (taskq_t *), KM_SLEEP);
919 990
920 991 switch (mode) {
921 992 case ZTI_MODE_FIXED:
922 993 ASSERT3U(value, >=, 1);
923 994 value = MAX(value, 1);
924 995 break;
925 996
926 997 case ZTI_MODE_BATCH:
927 998 batch = B_TRUE;
928 999 flags |= TASKQ_THREADS_CPU_PCT;
929 1000 value = zio_taskq_batch_pct;
930 1001 break;
931 1002
932 1003 default:
933 1004 panic("unrecognized mode for %s_%s taskq (%u:%u) in "
934 1005 "spa_activate()",
935 1006 zio_type_name[t], zio_taskq_types[q], mode, value);
936 1007 break;
937 1008 }
938 1009
939 1010 for (uint_t i = 0; i < count; i++) {
940 1011 taskq_t *tq;
941 1012
942 1013 if (count > 1) {
943 1014 (void) snprintf(name, sizeof (name), "%s_%s_%u",
944 1015 zio_type_name[t], zio_taskq_types[q], i);
945 1016 } else {
946 1017 (void) snprintf(name, sizeof (name), "%s_%s",
947 1018 zio_type_name[t], zio_taskq_types[q]);
948 1019 }
949 1020
950 1021 if (zio_taskq_sysdc && spa->spa_proc != &p0) {
951 1022 if (batch)
952 1023 flags |= TASKQ_DC_BATCH;
953 1024
954 1025 tq = taskq_create_sysdc(name, value, 50, INT_MAX,
955 1026 spa->spa_proc, zio_taskq_basedc, flags);
956 1027 } else {
957 1028 pri_t pri = maxclsyspri;
958 1029 /*
959 1030 * The write issue taskq can be extremely CPU
960 1031 * intensive. Run it at slightly lower priority
961 1032 * than the other taskqs.
962 1033 */
963 1034 if (t == ZIO_TYPE_WRITE && q == ZIO_TASKQ_ISSUE)
964 1035 pri--;
965 1036
966 1037 tq = taskq_create_proc(name, value, pri, 50,
967 1038 INT_MAX, spa->spa_proc, flags);
968 1039 }
969 1040
970 1041 tqs->stqs_taskq[i] = tq;
971 1042 }
972 1043 }
973 1044
974 1045 static void
975 1046 spa_taskqs_fini(spa_t *spa, zio_type_t t, zio_taskq_type_t q)
976 1047 {
977 1048 spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
978 1049
979 1050 if (tqs->stqs_taskq == NULL) {
980 1051 ASSERT0(tqs->stqs_count);
981 1052 return;
982 1053 }
983 1054
984 1055 for (uint_t i = 0; i < tqs->stqs_count; i++) {
985 1056 ASSERT3P(tqs->stqs_taskq[i], !=, NULL);
986 1057 taskq_destroy(tqs->stqs_taskq[i]);
987 1058 }
988 1059
989 1060 kmem_free(tqs->stqs_taskq, tqs->stqs_count * sizeof (taskq_t *));
990 1061 tqs->stqs_taskq = NULL;
991 1062 }
992 1063
993 1064 /*
994 1065 * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
995 1066 * Note that a type may have multiple discrete taskqs to avoid lock contention
996 1067 * on the taskq itself. In that case we choose which taskq at random by using
997 1068 * the low bits of gethrtime().
998 1069 */
999 1070 void
1000 1071 spa_taskq_dispatch_ent(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
1001 1072 task_func_t *func, void *arg, uint_t flags, taskq_ent_t *ent)
1002 1073 {
1003 1074 spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
1004 1075 taskq_t *tq;
1005 1076
1006 1077 ASSERT3P(tqs->stqs_taskq, !=, NULL);
1007 1078 ASSERT3U(tqs->stqs_count, !=, 0);
1008 1079
1009 1080 if (tqs->stqs_count == 1) {
1010 1081 tq = tqs->stqs_taskq[0];
1011 1082 } else {
1012 1083 tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
1013 1084 }
1014 1085
1015 1086 taskq_dispatch_ent(tq, func, arg, flags, ent);
1016 1087 }
1017 1088
1018 1089 static void
1019 1090 spa_create_zio_taskqs(spa_t *spa)
1020 1091 {
1021 1092 for (int t = 0; t < ZIO_TYPES; t++) {
1022 1093 for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1023 1094 spa_taskqs_init(spa, t, q);
1024 1095 }
1025 1096 }
1026 1097 }
1027 1098
1028 1099 #ifdef _KERNEL
1029 1100 static void
1030 1101 spa_thread(void *arg)
1031 1102 {
1032 1103 callb_cpr_t cprinfo;
1033 1104
1034 1105 spa_t *spa = arg;
1035 1106 user_t *pu = PTOU(curproc);
1036 1107
1037 1108 CALLB_CPR_INIT(&cprinfo, &spa->spa_proc_lock, callb_generic_cpr,
1038 1109 spa->spa_name);
1039 1110
1040 1111 ASSERT(curproc != &p0);
1041 1112 (void) snprintf(pu->u_psargs, sizeof (pu->u_psargs),
1042 1113 "zpool-%s", spa->spa_name);
1043 1114 (void) strlcpy(pu->u_comm, pu->u_psargs, sizeof (pu->u_comm));
1044 1115
1045 1116 /* bind this thread to the requested psrset */
1046 1117 if (zio_taskq_psrset_bind != PS_NONE) {
1047 1118 pool_lock();
1048 1119 mutex_enter(&cpu_lock);
1049 1120 mutex_enter(&pidlock);
1050 1121 mutex_enter(&curproc->p_lock);
1051 1122
1052 1123 if (cpupart_bind_thread(curthread, zio_taskq_psrset_bind,
1053 1124 0, NULL, NULL) == 0) {
1054 1125 curthread->t_bind_pset = zio_taskq_psrset_bind;
1055 1126 } else {
1056 1127 cmn_err(CE_WARN,
1057 1128 "Couldn't bind process for zfs pool \"%s\" to "
1058 1129 "pset %d\n", spa->spa_name, zio_taskq_psrset_bind);
1059 1130 }
1060 1131
1061 1132 mutex_exit(&curproc->p_lock);
1062 1133 mutex_exit(&pidlock);
1063 1134 mutex_exit(&cpu_lock);
1064 1135 pool_unlock();
1065 1136 }
1066 1137
1067 1138 if (zio_taskq_sysdc) {
1068 1139 sysdc_thread_enter(curthread, 100, 0);
1069 1140 }
1070 1141
1071 1142 spa->spa_proc = curproc;
1072 1143 spa->spa_did = curthread->t_did;
1073 1144
1074 1145 spa_create_zio_taskqs(spa);
1075 1146
1076 1147 mutex_enter(&spa->spa_proc_lock);
1077 1148 ASSERT(spa->spa_proc_state == SPA_PROC_CREATED);
1078 1149
1079 1150 spa->spa_proc_state = SPA_PROC_ACTIVE;
1080 1151 cv_broadcast(&spa->spa_proc_cv);
1081 1152
1082 1153 CALLB_CPR_SAFE_BEGIN(&cprinfo);
1083 1154 while (spa->spa_proc_state == SPA_PROC_ACTIVE)
1084 1155 cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1085 1156 CALLB_CPR_SAFE_END(&cprinfo, &spa->spa_proc_lock);
1086 1157
1087 1158 ASSERT(spa->spa_proc_state == SPA_PROC_DEACTIVATE);
1088 1159 spa->spa_proc_state = SPA_PROC_GONE;
1089 1160 spa->spa_proc = &p0;
1090 1161 cv_broadcast(&spa->spa_proc_cv);
1091 1162 CALLB_CPR_EXIT(&cprinfo); /* drops spa_proc_lock */
1092 1163
1093 1164 mutex_enter(&curproc->p_lock);
1094 1165 lwp_exit();
1095 1166 }
1096 1167 #endif
1097 1168
1098 1169 /*
1099 1170 * Activate an uninitialized pool.
1100 1171 */
1101 1172 static void
1102 1173 spa_activate(spa_t *spa, int mode)
1103 1174 {
1104 1175 ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
1105 1176
1106 1177 spa->spa_state = POOL_STATE_ACTIVE;
1107 1178 spa->spa_mode = mode;
1108 1179
1109 1180 spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
1110 1181 spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);
1182 + spa->spa_special_class = metaslab_class_create(spa, zfs_metaslab_ops);
1111 1183
1112 1184 /* Try to create a covering process */
1113 1185 mutex_enter(&spa->spa_proc_lock);
1114 1186 ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
1115 1187 ASSERT(spa->spa_proc == &p0);
1116 1188 spa->spa_did = 0;
1117 1189
1118 1190 /* Only create a process if we're going to be around a while. */
1119 1191 if (spa_create_process && strcmp(spa->spa_name, TRYIMPORT_NAME) != 0) {
1120 1192 if (newproc(spa_thread, (caddr_t)spa, syscid, maxclsyspri,
1121 1193 NULL, 0) == 0) {
1122 1194 spa->spa_proc_state = SPA_PROC_CREATED;
1123 1195 while (spa->spa_proc_state == SPA_PROC_CREATED) {
1124 1196 cv_wait(&spa->spa_proc_cv,
1125 1197 &spa->spa_proc_lock);
1126 1198 }
1127 1199 ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1128 1200 ASSERT(spa->spa_proc != &p0);
1129 1201 ASSERT(spa->spa_did != 0);
1130 1202 } else {
1131 1203 #ifdef _KERNEL
1132 1204 cmn_err(CE_WARN,
1133 1205 "Couldn't create process for zfs pool \"%s\"\n",
1134 1206 spa->spa_name);
1135 1207 #endif
1136 1208 }
1137 1209 }
1138 1210 mutex_exit(&spa->spa_proc_lock);
1139 1211
1140 1212 /* If we didn't create a process, we need to create our taskqs. */
1141 1213 if (spa->spa_proc == &p0) {
1142 1214 spa_create_zio_taskqs(spa);
1143 1215 }
1144 1216
1145 - for (size_t i = 0; i < TXG_SIZE; i++)
1146 - spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0);
1147 -
1148 1217 list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
1149 1218 offsetof(vdev_t, vdev_config_dirty_node));
1150 1219 list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
1151 1220 offsetof(objset_t, os_evicting_node));
1152 1221 list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
1153 1222 offsetof(vdev_t, vdev_state_dirty_node));
1154 1223
1155 1224 txg_list_create(&spa->spa_vdev_txg_list, spa,
1156 1225 offsetof(struct vdev, vdev_txg_node));
1157 1226
1158 1227 avl_create(&spa->spa_errlist_scrub,
1159 1228 spa_error_entry_compare, sizeof (spa_error_entry_t),
1160 1229 offsetof(spa_error_entry_t, se_avl));
1161 1230 avl_create(&spa->spa_errlist_last,
1162 1231 spa_error_entry_compare, sizeof (spa_error_entry_t),
1163 1232 offsetof(spa_error_entry_t, se_avl));
1164 1233 }
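
spa_activate() and spa_thread() above cooperate through a small handshake on
spa_proc_state: the creator marks the state CREATED and sleeps on the condition
variable until the new thread finishes its setup and advances the state to
ACTIVE. A minimal user-space sketch of that handshake with POSIX threads in
place of newproc() and the kernel cv_* primitives; all names here are invented
for the example:

#include <pthread.h>
#include <stdio.h>

typedef enum { PROC_NONE, PROC_CREATED, PROC_ACTIVE } proc_state_t;

static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t proc_cv = PTHREAD_COND_INITIALIZER;
static proc_state_t proc_state = PROC_NONE;

/* Plays the role of spa_thread(): finish setup, then announce ACTIVE. */
static void *
covering_thread(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&proc_lock);
	proc_state = PROC_ACTIVE;
	pthread_cond_broadcast(&proc_cv);
	pthread_mutex_unlock(&proc_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

	pthread_mutex_lock(&proc_lock);
	proc_state = PROC_CREATED;
	(void) pthread_create(&tid, NULL, covering_thread, NULL);
	while (proc_state == PROC_CREATED)	/* wait for the handshake */
		pthread_cond_wait(&proc_cv, &proc_lock);
	pthread_mutex_unlock(&proc_lock);
	(void) pthread_join(tid, NULL);
	printf("covering thread is active\n");
	return (0);
}
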
1165 1234
1166 1235 /*
1167 1236 * Opposite of spa_activate().
1168 1237 */
1169 1238 static void
1170 1239 spa_deactivate(spa_t *spa)
1171 1240 {
1172 1241 ASSERT(spa->spa_sync_on == B_FALSE);
1173 1242 ASSERT(spa->spa_dsl_pool == NULL);
1174 1243 ASSERT(spa->spa_root_vdev == NULL);
1175 1244 ASSERT(spa->spa_async_zio_root == NULL);
1176 1245 ASSERT(spa->spa_state != POOL_STATE_UNINITIALIZED);
1177 1246
1178 1247 spa_evicting_os_wait(spa);
1179 1248
1180 1249 txg_list_destroy(&spa->spa_vdev_txg_list);
1181 1250
1182 1251 list_destroy(&spa->spa_config_dirty_list);
1183 1252 list_destroy(&spa->spa_evicting_os_list);
1184 1253 list_destroy(&spa->spa_state_dirty_list);
1185 1254
1186 1255 for (int t = 0; t < ZIO_TYPES; t++) {
1187 1256 for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
1188 1257 spa_taskqs_fini(spa, t, q);
1189 1258 }
1190 1259 }
1191 1260
1192 - for (size_t i = 0; i < TXG_SIZE; i++) {
1193 - ASSERT3P(spa->spa_txg_zio[i], !=, NULL);
1194 - VERIFY0(zio_wait(spa->spa_txg_zio[i]));
1195 - spa->spa_txg_zio[i] = NULL;
1196 - }
1197 -
1198 1261 metaslab_class_destroy(spa->spa_normal_class);
1199 1262 spa->spa_normal_class = NULL;
1200 1263
1201 1264 metaslab_class_destroy(spa->spa_log_class);
1202 1265 spa->spa_log_class = NULL;
1203 1266
1267 + metaslab_class_destroy(spa->spa_special_class);
1268 + spa->spa_special_class = NULL;
1269 +
1204 1270 /*
1205 1271 * If this was part of an import or the open otherwise failed, we may
1206 1272 * still have errors left in the queues. Empty them just in case.
1207 1273 */
1208 1274 spa_errlog_drain(spa);
1209 1275
1210 1276 avl_destroy(&spa->spa_errlist_scrub);
1211 1277 avl_destroy(&spa->spa_errlist_last);
1212 1278
1213 1279 spa->spa_state = POOL_STATE_UNINITIALIZED;
1214 1280
1215 1281 mutex_enter(&spa->spa_proc_lock);
1216 1282 if (spa->spa_proc_state != SPA_PROC_NONE) {
1217 1283 ASSERT(spa->spa_proc_state == SPA_PROC_ACTIVE);
1218 1284 spa->spa_proc_state = SPA_PROC_DEACTIVATE;
1219 1285 cv_broadcast(&spa->spa_proc_cv);
1220 1286 while (spa->spa_proc_state == SPA_PROC_DEACTIVATE) {
1221 1287 ASSERT(spa->spa_proc != &p0);
1222 1288 cv_wait(&spa->spa_proc_cv, &spa->spa_proc_lock);
1223 1289 }
1224 1290 ASSERT(spa->spa_proc_state == SPA_PROC_GONE);
1225 1291 spa->spa_proc_state = SPA_PROC_NONE;
1226 1292 }
1227 1293 ASSERT(spa->spa_proc == &p0);
1228 1294 mutex_exit(&spa->spa_proc_lock);
1229 1295
1230 1296 /*
1231 1297 * We want to make sure spa_thread() has actually exited the ZFS
1232 1298 * module, so that the module can't be unloaded out from underneath
1233 1299 * it.
1234 1300 */
1235 1301 if (spa->spa_did != 0) {
1236 1302 thread_join(spa->spa_did);
1237 1303 spa->spa_did = 0;
1238 1304 }
1239 1305 }
1240 1306
1241 1307 /*
1242 1308 * Verify a pool configuration, and construct the vdev tree appropriately. This
1243 1309 * will create all the necessary vdevs in the appropriate layout, with each vdev
1244 1310 * in the CLOSED state. This will prep the pool before open/creation/import.
1245 1311 * All vdev validation is done by the vdev_alloc() routine.
1246 1312 */
1247 1313 static int
1248 1314 spa_config_parse(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent,
1249 1315 uint_t id, int atype)
1250 1316 {
1251 1317 nvlist_t **child;
1252 1318 uint_t children;
1253 1319 int error;
1254 1320
1255 1321 if ((error = vdev_alloc(spa, vdp, nv, parent, id, atype)) != 0)
1256 1322 return (error);
1257 1323
1258 1324 if ((*vdp)->vdev_ops->vdev_op_leaf)
1259 1325 return (0);
1260 1326
1261 1327 error = nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
1262 1328 &child, &children);
1263 1329
1264 1330 if (error == ENOENT)
1265 1331 return (0);
1266 1332
1267 1333 if (error) {
1268 1334 vdev_free(*vdp);
1269 1335 *vdp = NULL;
1270 1336 return (SET_ERROR(EINVAL));
1271 1337 }
1272 1338
1273 1339 for (int c = 0; c < children; c++) {
1274 1340 vdev_t *vd;
1275 1341 if ((error = spa_config_parse(spa, &vd, child[c], *vdp, c,
1276 1342 atype)) != 0) {
1277 1343 vdev_free(*vdp);
1278 1344 *vdp = NULL;
1279 1345 return (error);
1280 1346 }
1281 1347 }
1282 1348
1283 1349 ASSERT(*vdp != NULL);
1284 1350
1285 1351 return (0);
1286 1352 }
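
spa_config_parse() builds the tree depth-first and, on any child failure,
frees the partially constructed subtree through vdev_free(*vdp) before
returning the error. A reduced user-space sketch of that build-and-unwind
shape, using an invented node_t in place of vdev_t:

#include <stdlib.h>

#define	MAX_CHILDREN	16

typedef struct node {
	struct node *children[MAX_CHILDREN];
	int nchildren;
} node_t;

static void
free_tree(node_t *n)
{
	for (int c = 0; c < n->nchildren; c++)
		if (n->children[c] != NULL)
			free_tree(n->children[c]);
	free(n);
}

/* Build a tree of the given depth/fanout; unwind the subtree on failure. */
static int
build_tree(node_t **out, int depth, int fanout)
{
	node_t *n = calloc(1, sizeof (*n));

	if (n == NULL)
		return (-1);
	*out = n;
	if (depth == 0)			/* leaf, nothing more to allocate */
		return (0);
	n->nchildren = fanout;
	for (int c = 0; c < fanout; c++) {
		if (build_tree(&n->children[c], depth - 1, fanout) != 0) {
			free_tree(n);
			*out = NULL;
			return (-1);
		}
	}
	return (0);
}

int
main(void)
{
	node_t *root;

	if (build_tree(&root, 2, 2) == 0)
		free_tree(root);
	return (0);
}
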
1287 1353
1288 1354 /*
1289 1355 * Opposite of spa_load().
1290 1356 */
1291 1357 static void
1292 1358 spa_unload(spa_t *spa)
1293 1359 {
1294 1360 int i;
1295 1361
1296 1362 ASSERT(MUTEX_HELD(&spa_namespace_lock));
1297 1363
1298 - spa_load_note(spa, "UNLOADING");
1364 + /*
1365 + * Stop manual trim before stopping spa sync, because manual trim
1366 + * needs to execute a synctask (trim timestamp sync) at the end.
1367 + */
1368 + mutex_enter(&spa->spa_auto_trim_lock);
1369 + mutex_enter(&spa->spa_man_trim_lock);
1370 + spa_trim_stop_wait(spa);
1371 + mutex_exit(&spa->spa_man_trim_lock);
1372 + mutex_exit(&spa->spa_auto_trim_lock);
1299 1373
1300 1374 /*
1301 1375 * Stop async tasks.
1302 1376 */
1303 1377 spa_async_suspend(spa);
1304 1378
1305 1379 /*
1306 1380 * Stop syncing.
1307 1381 */
1308 1382 if (spa->spa_sync_on) {
1309 1383 txg_sync_stop(spa->spa_dsl_pool);
1310 1384 spa->spa_sync_on = B_FALSE;
1311 1385 }
1312 1386
1313 1387 /*
1314 1388 * Even though vdev_free() also calls vdev_metaslab_fini, we need
1315 1389 * to call it earlier, before we wait for async i/o to complete.
1316 1390 * This ensures that there is no async metaslab prefetching, by
1317 1391 * calling taskq_wait(mg_taskq).
1318 1392 */
1319 1393 if (spa->spa_root_vdev != NULL) {
1320 1394 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1321 1395 for (int c = 0; c < spa->spa_root_vdev->vdev_children; c++)
1322 1396 vdev_metaslab_fini(spa->spa_root_vdev->vdev_child[c]);
1323 1397 spa_config_exit(spa, SCL_ALL, FTAG);
1324 1398 }
1325 1399
1326 1400 /*
1327 1401 * Wait for any outstanding async I/O to complete.
1328 1402 */
1329 1403 if (spa->spa_async_zio_root != NULL) {
1330 1404 for (int i = 0; i < max_ncpus; i++)
1331 1405 (void) zio_wait(spa->spa_async_zio_root[i]);
1332 1406 kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *));
1333 1407 spa->spa_async_zio_root = NULL;
1334 1408 }
1335 1409
1336 - if (spa->spa_vdev_removal != NULL) {
1337 - spa_vdev_removal_destroy(spa->spa_vdev_removal);
1338 - spa->spa_vdev_removal = NULL;
1339 - }
1340 -
1341 - if (spa->spa_condense_zthr != NULL) {
1342 - ASSERT(!zthr_isrunning(spa->spa_condense_zthr));
1343 - zthr_destroy(spa->spa_condense_zthr);
1344 - spa->spa_condense_zthr = NULL;
1345 - }
1346 -
1347 - spa_condense_fini(spa);
1348 -
1349 1410 bpobj_close(&spa->spa_deferred_bpobj);
1350 1411
1351 1412 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1352 1413
1353 1414 /*
1415 + * Stop autotrim tasks.
1416 + */
1417 + mutex_enter(&spa->spa_auto_trim_lock);
1418 + if (spa->spa_auto_trim_taskq)
1419 + spa_auto_trim_taskq_destroy(spa);
1420 + mutex_exit(&spa->spa_auto_trim_lock);
1421 +
1422 + /*
1354 1423 * Close all vdevs.
1355 1424 */
1356 1425 if (spa->spa_root_vdev)
1357 1426 vdev_free(spa->spa_root_vdev);
1358 1427 ASSERT(spa->spa_root_vdev == NULL);
1359 1428
1360 1429 /*
1361 1430 * Close the dsl pool.
1362 1431 */
1363 1432 if (spa->spa_dsl_pool) {
1364 1433 dsl_pool_close(spa->spa_dsl_pool);
1365 1434 spa->spa_dsl_pool = NULL;
1366 1435 spa->spa_meta_objset = NULL;
1367 1436 }
1368 1437
1369 1438 ddt_unload(spa);
1370 1439
1371 1440 /*
1372 1441 * Drop and purge level 2 cache
1373 1442 */
1374 1443 spa_l2cache_drop(spa);
1375 1444
1376 1445 for (i = 0; i < spa->spa_spares.sav_count; i++)
1377 1446 vdev_free(spa->spa_spares.sav_vdevs[i]);
1378 1447 if (spa->spa_spares.sav_vdevs) {
1379 1448 kmem_free(spa->spa_spares.sav_vdevs,
1380 1449 spa->spa_spares.sav_count * sizeof (void *));
1381 1450 spa->spa_spares.sav_vdevs = NULL;
1382 1451 }
1383 1452 if (spa->spa_spares.sav_config) {
1384 1453 nvlist_free(spa->spa_spares.sav_config);
1385 1454 spa->spa_spares.sav_config = NULL;
1386 1455 }
1387 1456 spa->spa_spares.sav_count = 0;
1388 1457
1389 1458 for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
1390 1459 vdev_clear_stats(spa->spa_l2cache.sav_vdevs[i]);
1391 1460 vdev_free(spa->spa_l2cache.sav_vdevs[i]);
1392 1461 }
1393 1462 if (spa->spa_l2cache.sav_vdevs) {
1394 1463 kmem_free(spa->spa_l2cache.sav_vdevs,
1395 1464 spa->spa_l2cache.sav_count * sizeof (void *));
1396 1465 spa->spa_l2cache.sav_vdevs = NULL;
1397 1466 }
1398 1467 if (spa->spa_l2cache.sav_config) {
1399 1468 nvlist_free(spa->spa_l2cache.sav_config);
1400 1469 spa->spa_l2cache.sav_config = NULL;
1401 1470 }
1402 1471 spa->spa_l2cache.sav_count = 0;
1403 1472
1404 1473 spa->spa_async_suspended = 0;
1405 1474
1406 - spa->spa_indirect_vdevs_loaded = B_FALSE;
1407 -
1408 1475 if (spa->spa_comment != NULL) {
1409 1476 spa_strfree(spa->spa_comment);
1410 1477 spa->spa_comment = NULL;
1411 1478 }
1412 1479
1413 1480 spa_config_exit(spa, SCL_ALL, FTAG);
1414 1481 }
1415 1482
1416 1483 /*
1417 1484 * Load (or re-load) the current list of vdevs describing the active spares for
1418 1485 * this pool. When this is called, we have some form of basic information in
1419 1486 * 'spa_spares.sav_config'. We parse this into vdevs, try to open them, and
1420 1487 * then re-generate a more complete list including status information.
1421 1488 */
1422 -void
1489 +static void
1423 1490 spa_load_spares(spa_t *spa)
1424 1491 {
1425 1492 nvlist_t **spares;
1426 1493 uint_t nspares;
1427 1494 int i;
1428 1495 vdev_t *vd, *tvd;
1429 1496
1430 1497 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1431 1498
1432 1499 /*
1433 1500 * First, close and free any existing spare vdevs.
1434 1501 */
1435 1502 for (i = 0; i < spa->spa_spares.sav_count; i++) {
1436 1503 vd = spa->spa_spares.sav_vdevs[i];
1437 1504
1438 1505 /* Undo the call to spa_activate() below */
1439 1506 if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1440 1507 B_FALSE)) != NULL && tvd->vdev_isspare)
1441 1508 spa_spare_remove(tvd);
1442 1509 vdev_close(vd);
1443 1510 vdev_free(vd);
1444 1511 }
1445 1512
1446 1513 if (spa->spa_spares.sav_vdevs)
1447 1514 kmem_free(spa->spa_spares.sav_vdevs,
1448 1515 spa->spa_spares.sav_count * sizeof (void *));
1449 1516
1450 1517 if (spa->spa_spares.sav_config == NULL)
1451 1518 nspares = 0;
1452 1519 else
1453 1520 VERIFY(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
1454 1521 ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
1455 1522
1456 1523 spa->spa_spares.sav_count = (int)nspares;
1457 1524 spa->spa_spares.sav_vdevs = NULL;
1458 1525
1459 1526 if (nspares == 0)
1460 1527 return;
1461 1528
1462 1529 /*
1463 1530 * Construct the array of vdevs, opening them to get status in the
1464 1531 * process. For each spare, there is potentially two different vdev_t
1465 1532 * structures associated with it: one in the list of spares (used only
1466 1533 * for basic validation purposes) and one in the active vdev
1467 1534 * configuration (if it's spared in). During this phase we open and
1468 1535 * validate each vdev on the spare list. If the vdev also exists in the
1469 1536 * active configuration, then we also mark this vdev as an active spare.
1470 1537 */
1471 1538 spa->spa_spares.sav_vdevs = kmem_alloc(nspares * sizeof (void *),
1472 1539 KM_SLEEP);
1473 1540 for (i = 0; i < spa->spa_spares.sav_count; i++) {
1474 1541 VERIFY(spa_config_parse(spa, &vd, spares[i], NULL, 0,
1475 1542 VDEV_ALLOC_SPARE) == 0);
1476 1543 ASSERT(vd != NULL);
1477 1544
1478 1545 spa->spa_spares.sav_vdevs[i] = vd;
1479 1546
1480 1547 if ((tvd = spa_lookup_by_guid(spa, vd->vdev_guid,
1481 1548 B_FALSE)) != NULL) {
1482 1549 if (!tvd->vdev_isspare)
1483 1550 spa_spare_add(tvd);
1484 1551
1485 1552 /*
1486 1553 * We only mark the spare active if we were successfully
1487 1554 * able to load the vdev. Otherwise, importing a pool
1488 1555 * with a bad active spare would result in strange
1489 1556 * behavior, because multiple pools would think the spare
1490 1557 * is actively in use.
1491 1558 *
1492 1559 * There is a vulnerability here to an equally bizarre
1493 1560 * circumstance, where a dead active spare is later
1494 1561 * brought back to life (onlined or otherwise). Given
1495 1562 * the rarity of this scenario, and the extra complexity
1496 1563 * it adds, we ignore the possibility.
1497 1564 */
1498 1565 if (!vdev_is_dead(tvd))
1499 1566 spa_spare_activate(tvd);
1500 1567 }
1501 1568
1502 1569 vd->vdev_top = vd;
1503 1570 vd->vdev_aux = &spa->spa_spares;
1504 1571
1505 1572 if (vdev_open(vd) != 0)
1506 1573 continue;
1507 1574
1508 1575 if (vdev_validate_aux(vd) == 0)
1509 1576 spa_spare_add(vd);
1510 1577 }
1511 1578
1512 1579 /*
1513 1580 * Recompute the stashed list of spares, with status information
1514 1581 * this time.
1515 1582 */
1516 1583 VERIFY(nvlist_remove(spa->spa_spares.sav_config, ZPOOL_CONFIG_SPARES,
1517 1584 DATA_TYPE_NVLIST_ARRAY) == 0);
1518 1585
1519 1586 spares = kmem_alloc(spa->spa_spares.sav_count * sizeof (void *),
1520 1587 KM_SLEEP);
1521 1588 for (i = 0; i < spa->spa_spares.sav_count; i++)
1522 1589 spares[i] = vdev_config_generate(spa,
1523 1590 spa->spa_spares.sav_vdevs[i], B_TRUE, VDEV_CONFIG_SPARE);
1524 1591 VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
1525 1592 ZPOOL_CONFIG_SPARES, spares, spa->spa_spares.sav_count) == 0);
1526 1593 for (i = 0; i < spa->spa_spares.sav_count; i++)
1527 1594 nvlist_free(spares[i]);
1528 1595 kmem_free(spares, spa->spa_spares.sav_count * sizeof (void *));
1529 1596 }
1530 1597
1531 1598 /*
1532 1599 * Load (or re-load) the current list of vdevs describing the active l2cache for
1533 1600 * this pool. When this is called, we have some form of basic information in
1534 1601 * 'spa_l2cache.sav_config'. We parse this into vdevs, try to open them, and
1535 1602 * then re-generate a more complete list including status information.
1536 1603 * Devices which are already active have their details maintained, and are
1537 1604 * not re-opened.
1538 1605 */
1539 -void
1606 +static void
1540 1607 spa_load_l2cache(spa_t *spa)
1541 1608 {
1542 1609 nvlist_t **l2cache;
1543 1610 uint_t nl2cache;
1544 1611 int i, j, oldnvdevs;
1545 1612 uint64_t guid;
1546 1613 vdev_t *vd, **oldvdevs, **newvdevs;
1547 1614 spa_aux_vdev_t *sav = &spa->spa_l2cache;
1548 1615
1549 1616 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
1550 1617
1551 1618 if (sav->sav_config != NULL) {
1552 1619 VERIFY(nvlist_lookup_nvlist_array(sav->sav_config,
1553 1620 ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
1554 1621 newvdevs = kmem_alloc(nl2cache * sizeof (void *), KM_SLEEP);
1555 1622 } else {
1556 1623 nl2cache = 0;
1557 1624 newvdevs = NULL;
1558 1625 }
1559 1626
1560 1627 oldvdevs = sav->sav_vdevs;
1561 1628 oldnvdevs = sav->sav_count;
1562 1629 sav->sav_vdevs = NULL;
1563 1630 sav->sav_count = 0;
1564 1631
1565 1632 /*
1566 1633 * Process new nvlist of vdevs.
1567 1634 */
1568 1635 for (i = 0; i < nl2cache; i++) {
1569 1636 VERIFY(nvlist_lookup_uint64(l2cache[i], ZPOOL_CONFIG_GUID,
1570 1637 &guid) == 0);
1571 1638
1572 1639 newvdevs[i] = NULL;
1573 1640 for (j = 0; j < oldnvdevs; j++) {
1574 1641 vd = oldvdevs[j];
1575 1642 if (vd != NULL && guid == vd->vdev_guid) {
1576 1643 /*
1577 1644 * Retain previous vdev for add/remove ops.
1578 1645 */
1579 1646 newvdevs[i] = vd;
1580 1647 oldvdevs[j] = NULL;
1581 1648 break;
1582 1649 }
1583 1650 }
1584 1651
1585 1652 if (newvdevs[i] == NULL) {
1586 1653 /*
1587 1654 * Create new vdev
1588 1655 */
1589 1656 VERIFY(spa_config_parse(spa, &vd, l2cache[i], NULL, 0,
1590 1657 VDEV_ALLOC_L2CACHE) == 0);
1591 1658 ASSERT(vd != NULL);
1592 1659 newvdevs[i] = vd;
1593 1660
1594 1661 /*
1595 1662 * Commit this vdev as an l2cache device,
1596 1663 * even if it fails to open.
1597 1664 */
1598 1665 spa_l2cache_add(vd);
1599 1666
1600 1667 vd->vdev_top = vd;
1601 1668 vd->vdev_aux = sav;
1602 1669
1603 1670 spa_l2cache_activate(vd);
1604 1671
1605 1672 if (vdev_open(vd) != 0)
1606 1673 continue;
1607 1674
1608 1675 (void) vdev_validate_aux(vd);
1609 1676
1610 - if (!vdev_is_dead(vd))
1611 - l2arc_add_vdev(spa, vd);
1677 + if (!vdev_is_dead(vd)) {
1678 + boolean_t do_rebuild = B_FALSE;
1679 +
1680 + (void) nvlist_lookup_boolean_value(l2cache[i],
1681 + ZPOOL_CONFIG_L2CACHE_PERSISTENT,
1682 + &do_rebuild);
1683 + l2arc_add_vdev(spa, vd, do_rebuild);
1684 + }
1612 1685 }
1613 1686 }
1614 1687
1615 1688 /*
1616 1689 * Purge vdevs that were dropped
1617 1690 */
1618 1691 for (i = 0; i < oldnvdevs; i++) {
1619 1692 uint64_t pool;
1620 1693
1621 1694 vd = oldvdevs[i];
1622 1695 if (vd != NULL) {
1623 1696 ASSERT(vd->vdev_isl2cache);
1624 1697
1625 1698 if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
1626 1699 pool != 0ULL && l2arc_vdev_present(vd))
1627 1700 l2arc_remove_vdev(vd);
1628 1701 vdev_clear_stats(vd);
1629 1702 vdev_free(vd);
1630 1703 }
1631 1704 }
1632 1705
1633 1706 if (oldvdevs)
1634 1707 kmem_free(oldvdevs, oldnvdevs * sizeof (void *));
1635 1708
1636 1709 if (sav->sav_config == NULL)
1637 1710 goto out;
1638 1711
1639 1712 sav->sav_vdevs = newvdevs;
1640 1713 sav->sav_count = (int)nl2cache;
1641 1714
1642 1715 /*
1643 1716 * Recompute the stashed list of l2cache devices, with status
1644 1717 * information this time.
1645 1718 */
1646 1719 VERIFY(nvlist_remove(sav->sav_config, ZPOOL_CONFIG_L2CACHE,
1647 1720 DATA_TYPE_NVLIST_ARRAY) == 0);
1648 1721
1649 1722 l2cache = kmem_alloc(sav->sav_count * sizeof (void *), KM_SLEEP);
1650 1723 for (i = 0; i < sav->sav_count; i++)
1651 1724 l2cache[i] = vdev_config_generate(spa,
1652 1725 sav->sav_vdevs[i], B_TRUE, VDEV_CONFIG_L2CACHE);
1653 1726 VERIFY(nvlist_add_nvlist_array(sav->sav_config,
1654 1727 ZPOOL_CONFIG_L2CACHE, l2cache, sav->sav_count) == 0);
1655 1728 out:
1656 1729 for (i = 0; i < sav->sav_count; i++)
1657 1730 nvlist_free(l2cache[i]);
1658 1731 if (sav->sav_count)
1659 1732 kmem_free(l2cache, sav->sav_count * sizeof (void *));
1660 1733 }
1661 1734
1662 1735 static int
1663 1736 load_nvlist(spa_t *spa, uint64_t obj, nvlist_t **value)
1664 1737 {
1665 1738 dmu_buf_t *db;
1666 1739 char *packed = NULL;
1667 1740 size_t nvsize = 0;
1668 1741 int error;
1669 1742 *value = NULL;
1670 1743
1671 1744 error = dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db);
1672 1745 if (error != 0)
1673 1746 return (error);
1674 1747
1675 1748 nvsize = *(uint64_t *)db->db_data;
1676 1749 dmu_buf_rele(db, FTAG);
1677 1750
1678 1751 packed = kmem_alloc(nvsize, KM_SLEEP);
1679 1752 error = dmu_read(spa->spa_meta_objset, obj, 0, nvsize, packed,
1680 1753 DMU_READ_PREFETCH);
1681 1754 if (error == 0)
1682 1755 error = nvlist_unpack(packed, nvsize, value, 0);
1683 1756 kmem_free(packed, nvsize);
1684 1757
1685 1758 return (error);
1686 1759 }
1687 1760
1688 1761 /*
1689 - * Concrete top-level vdevs that are not missing and are not logs. At every
1690 - * spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds.
1691 - */
1692 -static uint64_t
1693 -spa_healthy_core_tvds(spa_t *spa)
1694 -{
1695 - vdev_t *rvd = spa->spa_root_vdev;
1696 - uint64_t tvds = 0;
1697 -
1698 - for (uint64_t i = 0; i < rvd->vdev_children; i++) {
1699 - vdev_t *vd = rvd->vdev_child[i];
1700 - if (vd->vdev_islog)
1701 - continue;
1702 - if (vdev_is_concrete(vd) && !vdev_is_dead(vd))
1703 - tvds++;
1704 - }
1705 -
1706 - return (tvds);
1707 -}
1708 -
1709 -/*
1710 1762 * Checks to see if the given vdev could not be opened, in which case we post a
1711 1763 * sysevent to notify the autoreplace code that the device has been removed.
1712 1764 */
1713 1765 static void
1714 1766 spa_check_removed(vdev_t *vd)
1715 1767 {
1716 - for (uint64_t c = 0; c < vd->vdev_children; c++)
1768 + for (int c = 0; c < vd->vdev_children; c++)
1717 1769 spa_check_removed(vd->vdev_child[c]);
1718 1770
1719 1771 if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) &&
1720 - vdev_is_concrete(vd)) {
1772 + !vd->vdev_ishole) {
1721 1773 zfs_post_autoreplace(vd->vdev_spa, vd);
1722 1774 spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK);
1723 1775 }
1724 1776 }
1725 1777
1726 -static int
1727 -spa_check_for_missing_logs(spa_t *spa)
1778 +static void
1779 +spa_config_valid_zaps(vdev_t *vd, vdev_t *mvd)
1728 1780 {
1729 - vdev_t *rvd = spa->spa_root_vdev;
1781 + ASSERT3U(vd->vdev_children, ==, mvd->vdev_children);
1730 1782
1783 + vd->vdev_top_zap = mvd->vdev_top_zap;
1784 + vd->vdev_leaf_zap = mvd->vdev_leaf_zap;
1785 +
1786 + for (uint64_t i = 0; i < vd->vdev_children; i++) {
1787 + spa_config_valid_zaps(vd->vdev_child[i], mvd->vdev_child[i]);
1788 + }
1789 +}
1790 +
1791 +/*
1792 + * Validate the current config against the MOS config
1793 + */
1794 +static boolean_t
1795 +spa_config_valid(spa_t *spa, nvlist_t *config)
1796 +{
1797 + vdev_t *mrvd, *rvd = spa->spa_root_vdev;
1798 + nvlist_t *nv;
1799 +
1800 + VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nv) == 0);
1801 +
1802 + spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1803 + VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
1804 +
1731 1805 /*
1806 + * One of the earliest signs of a stale config is a mismatch
1807 + * in the number of child vdevs.
1808 + */
1809 + if (rvd->vdev_children != mrvd->vdev_children) {
1810 + vdev_free(mrvd);
1811 + spa_config_exit(spa, SCL_ALL, FTAG);
1812 + return (B_FALSE);
1813 + }
1814 + /*
1732 1815 * If we're doing a normal import, then build up any additional
1733 - * diagnostic information about missing log devices.
1816 + * diagnostic information about missing devices in this config.
1734 1817 * We'll pass this up to the user for further processing.
1735 1818 */
1736 1819 if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) {
1737 1820 nvlist_t **child, *nv;
1738 1821 uint64_t idx = 0;
1739 1822
1740 1823 child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **),
1741 1824 KM_SLEEP);
1742 1825 VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);
1743 1826
1744 - for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1827 + for (int c = 0; c < rvd->vdev_children; c++) {
1745 1828 vdev_t *tvd = rvd->vdev_child[c];
1829 + vdev_t *mtvd = mrvd->vdev_child[c];
1746 1830
1747 - /*
1748 - * We consider a device as missing only if it failed
1749 - * to open (i.e. offline or faulted is not considered
1750 - * as missing).
1751 - */
1752 - if (tvd->vdev_islog &&
1753 - tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1754 - child[idx++] = vdev_config_generate(spa, tvd,
1755 - B_FALSE, VDEV_CONFIG_MISSING);
1756 - }
1831 + if (tvd->vdev_ops == &vdev_missing_ops &&
1832 + mtvd->vdev_ops != &vdev_missing_ops &&
1833 + mtvd->vdev_islog)
1834 + child[idx++] = vdev_config_generate(spa, mtvd,
1835 + B_FALSE, 0);
1757 1836 }
1758 1837
1759 - if (idx > 0) {
1760 - fnvlist_add_nvlist_array(nv,
1761 - ZPOOL_CONFIG_CHILDREN, child, idx);
1762 - fnvlist_add_nvlist(spa->spa_load_info,
1763 - ZPOOL_CONFIG_MISSING_DEVICES, nv);
1838 + if (idx) {
1839 + VERIFY(nvlist_add_nvlist_array(nv,
1840 + ZPOOL_CONFIG_CHILDREN, child, idx) == 0);
1841 + VERIFY(nvlist_add_nvlist(spa->spa_load_info,
1842 + ZPOOL_CONFIG_MISSING_DEVICES, nv) == 0);
1764 1843
1765 - for (uint64_t i = 0; i < idx; i++)
1844 + for (int i = 0; i < idx; i++)
1766 1845 nvlist_free(child[i]);
1767 1846 }
1768 1847 nvlist_free(nv);
1769 1848 kmem_free(child, rvd->vdev_children * sizeof (char **));
1849 + }
1770 1850
1771 - if (idx > 0) {
1772 - spa_load_failed(spa, "some log devices are missing");
1773 - return (SET_ERROR(ENXIO));
1774 - }
1775 - } else {
1776 - for (uint64_t c = 0; c < rvd->vdev_children; c++) {
1777 - vdev_t *tvd = rvd->vdev_child[c];
1851 + /*
1852 + * Compare the root vdev tree with the information we have
1853 + * from the MOS config (mrvd). Check each top-level vdev
1854 + * with the corresponding MOS config top-level (mtvd).
1855 + */
1856 + for (int c = 0; c < rvd->vdev_children; c++) {
1857 + vdev_t *tvd = rvd->vdev_child[c];
1858 + vdev_t *mtvd = mrvd->vdev_child[c];
1778 1859
1779 - if (tvd->vdev_islog &&
1780 - tvd->vdev_state == VDEV_STATE_CANT_OPEN) {
1860 + /*
1861 + * Resolve any "missing" vdevs in the current configuration.
1862 + * If we find that the MOS config has more accurate information
1863 + * about the top-level vdev then use that vdev instead.
1864 + */
1865 + if (tvd->vdev_ops == &vdev_missing_ops &&
1866 + mtvd->vdev_ops != &vdev_missing_ops) {
1867 +
1868 + if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG))
1869 + continue;
1870 +
1871 + /*
1872 + * Device specific actions.
1873 + */
1874 + if (mtvd->vdev_islog) {
1781 1875 spa_set_log_state(spa, SPA_LOG_CLEAR);
1782 - spa_load_note(spa, "some log devices are "
1783 - "missing, ZIL is dropped.");
1784 - break;
1876 + } else {
1877 + /*
1878 + * XXX - once we have 'readonly' pool
1879 + * support we should be able to handle
1880 + * missing data devices by transitioning
1881 + * the pool to readonly.
1882 + */
1883 + continue;
1785 1884 }
1885 +
1886 + /*
1887 + * Swap the missing vdev with the data we were
1888 + * able to obtain from the MOS config.
1889 + */
1890 + vdev_remove_child(rvd, tvd);
1891 + vdev_remove_child(mrvd, mtvd);
1892 +
1893 + vdev_add_child(rvd, mtvd);
1894 + vdev_add_child(mrvd, tvd);
1895 +
1896 + spa_config_exit(spa, SCL_ALL, FTAG);
1897 + vdev_load(mtvd);
1898 + spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
1899 +
1900 + vdev_reopen(rvd);
1901 + } else {
1902 + if (mtvd->vdev_islog) {
1903 + /*
1904 + * Load the slog device's state from the MOS
1905 + * config since it's possible that the label
1906 + * does not contain the most up-to-date
1907 + * information.
1908 + */
1909 + vdev_load_log_state(tvd, mtvd);
1910 + vdev_reopen(tvd);
1911 + }
1912 +
1913 + /*
1914 + * Per-vdev ZAP info is stored exclusively in the MOS.
1915 + */
1916 + spa_config_valid_zaps(tvd, mtvd);
1786 1917 }
1787 1918 }
1788 1919
1789 - return (0);
1920 + vdev_free(mrvd);
1921 + spa_config_exit(spa, SCL_ALL, FTAG);
1922 +
1923 + /*
1924 + * Ensure we were able to validate the config.
1925 + */
1926 + return (rvd->vdev_guid_sum == spa->spa_uberblock.ub_guid_sum);
1790 1927 }
1791 1928
1792 1929 /*
1793 1930 * Check for missing log devices
1794 1931 */
1795 1932 static boolean_t
1796 1933 spa_check_logs(spa_t *spa)
1797 1934 {
1798 1935 boolean_t rv = B_FALSE;
1799 1936 dsl_pool_t *dp = spa_get_dsl(spa);
1800 1937
1801 1938 switch (spa->spa_log_state) {
1802 1939 case SPA_LOG_MISSING:
1803 1940 /* need to recheck in case slog has been restored */
1804 1941 case SPA_LOG_UNKNOWN:
1805 1942 rv = (dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1806 1943 zil_check_log_chain, NULL, DS_FIND_CHILDREN) != 0);
1807 1944 if (rv)
1808 1945 spa_set_log_state(spa, SPA_LOG_MISSING);
1809 1946 break;
1810 1947 }
1811 1948 return (rv);
1812 1949 }
1813 1950
1814 1951 static boolean_t
1815 1952 spa_passivate_log(spa_t *spa)
1816 1953 {
1817 1954 vdev_t *rvd = spa->spa_root_vdev;
1818 1955 boolean_t slog_found = B_FALSE;
1819 1956
1820 1957 ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1821 1958
1822 1959 if (!spa_has_slogs(spa))
1823 1960 return (B_FALSE);
1824 1961
1825 1962 for (int c = 0; c < rvd->vdev_children; c++) {
1826 1963 vdev_t *tvd = rvd->vdev_child[c];
1827 1964 metaslab_group_t *mg = tvd->vdev_mg;
1828 1965
1829 1966 if (tvd->vdev_islog) {
1830 1967 metaslab_group_passivate(mg);
1831 1968 slog_found = B_TRUE;
1832 1969 }
1833 1970 }
1834 1971
1835 1972 return (slog_found);
1836 1973 }
1837 1974
1838 1975 static void
1839 1976 spa_activate_log(spa_t *spa)
1840 1977 {
1841 1978 vdev_t *rvd = spa->spa_root_vdev;
1842 1979
1843 1980 ASSERT(spa_config_held(spa, SCL_ALLOC, RW_WRITER));
1844 1981
1845 1982 for (int c = 0; c < rvd->vdev_children; c++) {
1846 1983 vdev_t *tvd = rvd->vdev_child[c];
1847 1984 metaslab_group_t *mg = tvd->vdev_mg;
1848 1985
1849 1986 if (tvd->vdev_islog)
1850 1987 metaslab_group_activate(mg);
1851 1988 }
1852 1989 }
1853 1990
1854 1991 int
1855 -spa_reset_logs(spa_t *spa)
1992 +spa_offline_log(spa_t *spa)
1856 1993 {
1857 1994 int error;
1858 1995
1859 - error = dmu_objset_find(spa_name(spa), zil_reset,
1996 + error = dmu_objset_find(spa_name(spa), zil_vdev_offline,
1860 1997 NULL, DS_FIND_CHILDREN);
1861 1998 if (error == 0) {
1862 1999 /*
1863 2000 * We successfully offlined the log device, sync out the
1864 2001 * current txg so that the "stubby" block can be removed
1865 2002 * by zil_sync().
1866 2003 */
1867 2004 txg_wait_synced(spa->spa_dsl_pool, 0);
1868 2005 }
1869 2006 return (error);
1870 2007 }
1871 2008
1872 2009 static void
1873 2010 spa_aux_check_removed(spa_aux_vdev_t *sav)
1874 2011 {
1875 2012 for (int i = 0; i < sav->sav_count; i++)
1876 2013 spa_check_removed(sav->sav_vdevs[i]);
1877 2014 }
1878 2015
1879 2016 void
1880 2017 spa_claim_notify(zio_t *zio)
1881 2018 {
1882 2019 spa_t *spa = zio->io_spa;
1883 2020
1884 2021 if (zio->io_error)
1885 2022 return;
1886 2023
1887 2024 mutex_enter(&spa->spa_props_lock); /* any mutex will do */
1888 2025 if (spa->spa_claim_max_txg < zio->io_bp->blk_birth)
1889 2026 spa->spa_claim_max_txg = zio->io_bp->blk_birth;
1890 2027 mutex_exit(&spa->spa_props_lock);
1891 2028 }
1892 2029
1893 2030 typedef struct spa_load_error {
1894 2031 uint64_t sle_meta_count;
1895 2032 uint64_t sle_data_count;
1896 2033 } spa_load_error_t;
1897 2034
1898 2035 static void
1899 2036 spa_load_verify_done(zio_t *zio)
1900 2037 {
1901 2038 blkptr_t *bp = zio->io_bp;
1902 2039 spa_load_error_t *sle = zio->io_private;
1903 2040 dmu_object_type_t type = BP_GET_TYPE(bp);
1904 2041 int error = zio->io_error;
1905 2042 spa_t *spa = zio->io_spa;
1906 2043
1907 2044 abd_free(zio->io_abd);
1908 2045 if (error) {
1909 - if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
1910 - type != DMU_OT_INTENT_LOG)
2046 + if (BP_IS_METADATA(bp) && type != DMU_OT_INTENT_LOG)
1911 2047 atomic_inc_64(&sle->sle_meta_count);
1912 2048 else
1913 2049 atomic_inc_64(&sle->sle_data_count);
1914 2050 }
1915 2051
1916 2052 mutex_enter(&spa->spa_scrub_lock);
1917 2053 spa->spa_scrub_inflight--;
1918 2054 cv_broadcast(&spa->spa_scrub_io_cv);
1919 2055 mutex_exit(&spa->spa_scrub_lock);
1920 2056 }
1921 2057
1922 2058 /*
1923 2059 * Maximum number of concurrent scrub i/os to create while verifying
1924 2060 * a pool while importing it.
1925 2061 */
1926 2062 int spa_load_verify_maxinflight = 10000;
1927 2063 boolean_t spa_load_verify_metadata = B_TRUE;
1928 2064 boolean_t spa_load_verify_data = B_TRUE;
1929 2065
1930 2066 /*ARGSUSED*/
1931 2067 static int
1932 2068 spa_load_verify_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
1933 2069 const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
1934 2070 {
1935 2071 if (bp == NULL || BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp))
1936 2072 return (0);
1937 2073 /*
1938 2074 * Note: normally this routine will not be called if
1939 2075 * spa_load_verify_metadata is not set. However, it may be useful
1940 2076 * to manually set the flag after the traversal has begun.
1941 2077 */
1942 2078 if (!spa_load_verify_metadata)
1943 2079 return (0);
1944 2080 if (!BP_IS_METADATA(bp) && !spa_load_verify_data)
1945 2081 return (0);
1946 2082
1947 2083 zio_t *rio = arg;
1948 2084 size_t size = BP_GET_PSIZE(bp);
1949 2085
1950 2086 mutex_enter(&spa->spa_scrub_lock);
1951 2087 while (spa->spa_scrub_inflight >= spa_load_verify_maxinflight)
1952 2088 cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
1953 2089 spa->spa_scrub_inflight++;
1954 2090 mutex_exit(&spa->spa_scrub_lock);
1955 2091
1956 2092 zio_nowait(zio_read(rio, spa, bp, abd_alloc_for_io(size, B_FALSE), size,
1957 2093 spa_load_verify_done, rio->io_private, ZIO_PRIORITY_SCRUB,
1958 2094 ZIO_FLAG_SPECULATIVE | ZIO_FLAG_CANFAIL |
1959 2095 ZIO_FLAG_SCRUB | ZIO_FLAG_RAW, zb));
1960 2096 return (0);
1961 2097 }
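
Together with spa_load_verify_done() above, this callback implements a counting
throttle: the issuing thread sleeps while spa_scrub_inflight has reached the
limit, and each completion decrements the count and wakes a waiter. A
user-space sketch of that pattern with POSIX primitives; throttle_enter() and
throttle_exit() are invented names:

#include <pthread.h>

static pthread_mutex_t scrub_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t scrub_cv = PTHREAD_COND_INITIALIZER;
static int inflight;
static int max_inflight = 10000;	/* cf. spa_load_verify_maxinflight */

/* Called before issuing an asynchronous verify I/O. */
static void
throttle_enter(void)
{
	pthread_mutex_lock(&scrub_lock);
	while (inflight >= max_inflight)
		pthread_cond_wait(&scrub_cv, &scrub_lock);
	inflight++;
	pthread_mutex_unlock(&scrub_lock);
}

/* Called from the completion path, cf. spa_load_verify_done(). */
static void
throttle_exit(void)
{
	pthread_mutex_lock(&scrub_lock);
	inflight--;
	pthread_cond_broadcast(&scrub_cv);
	pthread_mutex_unlock(&scrub_lock);
}

int
main(void)
{
	throttle_enter();	/* would normally precede the async read */
	throttle_exit();	/* would normally run in the done callback */
	return (0);
}
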
1962 2098
1963 2099 /* ARGSUSED */
1964 2100 int
1965 2101 verify_dataset_name_len(dsl_pool_t *dp, dsl_dataset_t *ds, void *arg)
1966 2102 {
1967 2103 if (dsl_dataset_namelen(ds) >= ZFS_MAX_DATASET_NAME_LEN)
1968 2104 return (SET_ERROR(ENAMETOOLONG));
1969 2105
1970 2106 return (0);
1971 2107 }
1972 2108
1973 2109 static int
1974 2110 spa_load_verify(spa_t *spa)
1975 2111 {
1976 2112 zio_t *rio;
1977 2113 spa_load_error_t sle = { 0 };
1978 2114 zpool_rewind_policy_t policy;
1979 2115 boolean_t verify_ok = B_FALSE;
1980 2116 int error = 0;
1981 2117
1982 2118 zpool_get_rewind_policy(spa->spa_config, &policy);
1983 2119
1984 2120 if (policy.zrp_request & ZPOOL_NEVER_REWIND)
1985 2121 return (0);
1986 2122
1987 2123 dsl_pool_config_enter(spa->spa_dsl_pool, FTAG);
1988 2124 error = dmu_objset_find_dp(spa->spa_dsl_pool,
1989 2125 spa->spa_dsl_pool->dp_root_dir_obj, verify_dataset_name_len, NULL,
1990 2126 DS_FIND_CHILDREN);
1991 2127 dsl_pool_config_exit(spa->spa_dsl_pool, FTAG);
1992 2128 if (error != 0)
1993 2129 return (error);
1994 2130
1995 2131 rio = zio_root(spa, NULL, &sle,
1996 2132 ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE);
1997 2133
1998 2134 if (spa_load_verify_metadata) {
1999 - if (spa->spa_extreme_rewind) {
2000 - spa_load_note(spa, "performing a complete scan of the "
2001 - "pool since extreme rewind is on. This may take "
2002 - "a very long time.\n (spa_load_verify_data=%u, "
2003 - "spa_load_verify_metadata=%u)",
2004 - spa_load_verify_data, spa_load_verify_metadata);
2005 - }
2006 - error = traverse_pool(spa, spa->spa_verify_min_txg,
2135 + zbookmark_phys_t zb = { 0 };
2136 + error = traverse_pool(spa, spa->spa_verify_min_txg, UINT64_MAX,
2007 2137 TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA,
2008 - spa_load_verify_cb, rio);
2138 + spa_load_verify_cb, rio, &zb);
2009 2139 }
2010 2140
2011 2141 (void) zio_wait(rio);
2012 2142
2013 2143 spa->spa_load_meta_errors = sle.sle_meta_count;
2014 2144 spa->spa_load_data_errors = sle.sle_data_count;
2015 2145
2016 - if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) {
2017 - spa_load_note(spa, "spa_load_verify found %llu metadata errors "
2018 - "and %llu data errors", (u_longlong_t)sle.sle_meta_count,
2019 - (u_longlong_t)sle.sle_data_count);
2020 - }
2021 -
2022 - if (spa_load_verify_dryrun ||
2023 - (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2024 - sle.sle_data_count <= policy.zrp_maxdata)) {
2146 + if (!error && sle.sle_meta_count <= policy.zrp_maxmeta &&
2147 + sle.sle_data_count <= policy.zrp_maxdata) {
2025 2148 int64_t loss = 0;
2026 2149
2027 2150 verify_ok = B_TRUE;
2028 2151 spa->spa_load_txg = spa->spa_uberblock.ub_txg;
2029 2152 spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
2030 2153
2031 2154 loss = spa->spa_last_ubsync_txg_ts - spa->spa_load_txg_ts;
2032 2155 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2033 2156 ZPOOL_CONFIG_LOAD_TIME, spa->spa_load_txg_ts) == 0);
2034 2157 VERIFY(nvlist_add_int64(spa->spa_load_info,
2035 2158 ZPOOL_CONFIG_REWIND_TIME, loss) == 0);
2036 2159 VERIFY(nvlist_add_uint64(spa->spa_load_info,
2037 2160 ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
2038 2161 } else {
2039 2162 spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
2040 2163 }
2041 2164
2042 - if (spa_load_verify_dryrun)
2043 - return (0);
2044 -
2045 2165 if (error) {
2046 2166 if (error != ENXIO && error != EIO)
2047 2167 error = SET_ERROR(EIO);
2048 2168 return (error);
2049 2169 }
2050 2170
2051 2171 return (verify_ok ? 0 : EIO);
2052 2172 }
2053 2173
2054 2174 /*
2055 2175 * Find a value in the pool props object.
2056 2176 */
2057 2177 static void
2058 2178 spa_prop_find(spa_t *spa, zpool_prop_t prop, uint64_t *val)
2059 2179 {
2060 2180 (void) zap_lookup(spa->spa_meta_objset, spa->spa_pool_props_object,
2061 2181 zpool_prop_to_name(prop), sizeof (uint64_t), 1, val);
2062 2182 }
2063 2183
2064 2184 /*
2065 2185 * Find a value in the pool directory object.
2066 2186 */
2067 2187 static int
2068 -spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent)
2188 +spa_dir_prop(spa_t *spa, const char *name, uint64_t *val)
2069 2189 {
2070 - int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2071 - name, sizeof (uint64_t), 1, val);
2190 + return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2191 + name, sizeof (uint64_t), 1, val));
2192 +}
2072 2193
2073 - if (error != 0 && (error != ENOENT || log_enoent)) {
2074 - spa_load_failed(spa, "couldn't get '%s' value in MOS directory "
2075 - "[error=%d]", name, error);
2194 +static void
2195 +spa_set_ddt_classes(spa_t *spa, int desegregation)
2196 +{
2197 + /*
2198 + * If desegregation is turned on, set up the ddt_class restrictions.
2199 + */
2200 + if (desegregation) {
2201 + spa->spa_ddt_class_min = DDT_CLASS_DUPLICATE;
2202 + spa->spa_ddt_class_max = DDT_CLASS_DUPLICATE;
2203 + } else {
2204 + spa->spa_ddt_class_min = DDT_CLASS_DITTO;
2205 + spa->spa_ddt_class_max = DDT_CLASS_UNIQUE;
2076 2206 }
2077 -
2078 - return (error);
2079 2207 }
2080 2208
2081 2209 static int
2082 2210 spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err)
2083 2211 {
2084 2212 vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux);
2085 - return (SET_ERROR(err));
2213 + return (err);
2086 2214 }
2087 2215
2088 -static void
2089 -spa_spawn_aux_threads(spa_t *spa)
2090 -{
2091 - ASSERT(spa_writeable(spa));
2092 -
2093 - ASSERT(MUTEX_HELD(&spa_namespace_lock));
2094 -
2095 - spa_start_indirect_condensing_thread(spa);
2096 -}
2097 -
2098 2216 /*
2099 2217 * Fix up config after a partly-completed split. This is done with the
2100 2218 * ZPOOL_CONFIG_SPLIT nvlist. Both the splitting pool and the split-off
2101 2219 * pool have that entry in their config, but only the splitting one contains
2102 2220 * a list of all the guids of the vdevs that are being split off.
2103 2221 *
2104 2222 * This function determines what to do with that list: either rejoin
2105 2223 * all the disks to the pool, or complete the splitting process. To attempt
2106 2224 * the rejoin, each disk that is offlined is marked online again, and
2107 2225 * we do a reopen() call. If the vdev label for every disk that was
2108 2226 * marked online indicates it was successfully split off (VDEV_AUX_SPLIT_POOL)
2109 2227 * then we call vdev_split() on each disk, and complete the split.
2110 2228 *
2111 2229 * Otherwise we leave the config alone, with all the vdevs in place in
2112 2230 * the original pool.
2113 2231 */
2114 2232 static void
2115 2233 spa_try_repair(spa_t *spa, nvlist_t *config)
2116 2234 {
2117 2235 uint_t extracted;
2118 2236 uint64_t *glist;
2119 2237 uint_t i, gcount;
2120 2238 nvlist_t *nvl;
2121 2239 vdev_t **vd;
2122 2240 boolean_t attempt_reopen;
2123 2241
2124 2242 if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) != 0)
2125 2243 return;
2126 2244
2127 2245 /* check that the config is complete */
2128 2246 if (nvlist_lookup_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST,
2129 2247 &glist, &gcount) != 0)
2130 2248 return;
2131 2249
2132 2250 vd = kmem_zalloc(gcount * sizeof (vdev_t *), KM_SLEEP);
2133 2251
2134 2252 /* attempt to online all the vdevs & validate */
2135 2253 attempt_reopen = B_TRUE;
2136 2254 for (i = 0; i < gcount; i++) {
2137 2255 if (glist[i] == 0) /* vdev is hole */
2138 2256 continue;
2139 2257
2140 2258 vd[i] = spa_lookup_by_guid(spa, glist[i], B_FALSE);
2141 2259 if (vd[i] == NULL) {
2142 2260 /*
2143 2261 * Don't bother attempting to reopen the disks;
2144 2262 * just do the split.
2145 2263 */
2146 2264 attempt_reopen = B_FALSE;
2147 2265 } else {
2148 2266 /* attempt to re-online it */
2149 2267 vd[i]->vdev_offline = B_FALSE;
2150 2268 }
2151 2269 }
2152 2270
2153 2271 if (attempt_reopen) {
2154 2272 vdev_reopen(spa->spa_root_vdev);
2155 2273
2156 2274 /* check each device to see what state it's in */
2157 2275 for (extracted = 0, i = 0; i < gcount; i++) {
2158 2276 if (vd[i] != NULL &&
2159 2277 vd[i]->vdev_stat.vs_aux != VDEV_AUX_SPLIT_POOL)
2160 2278 break;
2161 2279 ++extracted;
2162 2280 }
2163 2281 }
2164 2282
2165 2283 /*
2166 2284 * If every disk has been moved to the new pool, or if we never
2167 2285 * even attempted to look at them, then we split them off for
2168 2286 * good.
2169 2287 */
2170 2288 if (!attempt_reopen || gcount == extracted) {
2171 2289 for (i = 0; i < gcount; i++)
2172 2290 if (vd[i] != NULL)
2173 2291 vdev_split(vd[i]);
2174 2292 vdev_reopen(spa->spa_root_vdev);
2175 2293 }
2176 2294
2177 2295 kmem_free(vd, gcount * sizeof (vdev_t *));
2178 2296 }
2179 2297
2180 2298 static int
2181 -spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type)
2299 +spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type,
2300 + boolean_t mosconfig)
2182 2301 {
2302 + nvlist_t *config = spa->spa_config;
2183 2303 char *ereport = FM_EREPORT_ZFS_POOL;
2304 + char *comment;
2184 2305 int error;
2306 + uint64_t pool_guid;
2307 + nvlist_t *nvl;
2185 2308
2186 - spa->spa_load_state = state;
2309 + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid))
2310 + return (SET_ERROR(EINVAL));
2187 2311
2188 - gethrestime(&spa->spa_loaded_ts);
2189 - error = spa_load_impl(spa, type, &ereport, B_FALSE);
2312 + ASSERT(spa->spa_comment == NULL);
2313 + if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2314 + spa->spa_comment = spa_strdup(comment);
2190 2315
2191 2316 /*
2317 + * Versioning wasn't explicitly added to the label until later, so if
2318 + * it's not present treat it as the initial version.
2319 + */
2320 + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2321 + &spa->spa_ubsync.ub_version) != 0)
2322 + spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
2323 +
2324 + (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2325 + &spa->spa_config_txg);
2326 +
2327 + if ((state == SPA_LOAD_IMPORT || state == SPA_LOAD_TRYIMPORT) &&
2328 + spa_guid_exists(pool_guid, 0)) {
2329 + error = SET_ERROR(EEXIST);
2330 + } else {
2331 + spa->spa_config_guid = pool_guid;
2332 +
2333 + if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT,
2334 + &nvl) == 0) {
2335 + VERIFY(nvlist_dup(nvl, &spa->spa_config_splitting,
2336 + KM_SLEEP) == 0);
2337 + }
2338 +
2339 + nvlist_free(spa->spa_load_info);
2340 + spa->spa_load_info = fnvlist_alloc();
2341 +
2342 + gethrestime(&spa->spa_loaded_ts);
2343 + error = spa_load_impl(spa, pool_guid, config, state, type,
2344 + mosconfig, &ereport);
2345 + }
2346 +
2347 + /*
2192 2348 * Don't count references from objsets that are already closed
2193 2349 * and are making their way through the eviction process.
2194 2350 */
2195 2351 spa_evicting_os_wait(spa);
2196 2352 spa->spa_minref = refcount_count(&spa->spa_refcount);
2197 2353 if (error) {
2198 2354 if (error != EEXIST) {
2199 2355 spa->spa_loaded_ts.tv_sec = 0;
2200 2356 spa->spa_loaded_ts.tv_nsec = 0;
2201 2357 }
2202 2358 if (error != EBADF) {
2203 2359 zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
2204 2360 }
2205 2361 }
2206 2362 spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
2207 2363 spa->spa_ena = 0;
2208 -
2209 2364 return (error);
2210 2365 }
2211 2366
2212 2367 /*
2213 2368 * Count the number of per-vdev ZAPs associated with all of the vdevs in the
2214 2369 * vdev tree rooted in the given vd, and ensure that each ZAP is present in the
2215 2370 * spa's per-vdev ZAP list.
2216 2371 */
2217 2372 static uint64_t
2218 2373 vdev_count_verify_zaps(vdev_t *vd)
2219 2374 {
2220 2375 spa_t *spa = vd->vdev_spa;
2221 2376 uint64_t total = 0;
2222 2377 if (vd->vdev_top_zap != 0) {
2223 2378 total++;
2224 2379 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2225 2380 spa->spa_all_vdev_zaps, vd->vdev_top_zap));
2226 2381 }
2227 2382 if (vd->vdev_leaf_zap != 0) {
2228 2383 total++;
2229 2384 ASSERT0(zap_lookup_int(spa->spa_meta_objset,
2230 2385 spa->spa_all_vdev_zaps, vd->vdev_leaf_zap));
2231 2386 }
2232 2387
2233 2388 for (uint64_t i = 0; i < vd->vdev_children; i++) {
2234 2389 total += vdev_count_verify_zaps(vd->vdev_child[i]);
2235 2390 }
2236 2391
2237 2392 return (total);
2238 2393 }
2239 2394
2395 +/*
2396 + * Load an existing storage pool, using the pool's builtin spa_config as a
2397 + * source of configuration information.
2398 + */
2240 2399 static int
2241 -spa_verify_host(spa_t *spa, nvlist_t *mos_config)
2400 +spa_load_impl(spa_t *spa, uint64_t pool_guid, nvlist_t *config,
2401 + spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig,
2402 + char **ereport)
2242 2403 {
2243 - uint64_t hostid;
2244 - char *hostname;
2245 - uint64_t myhostid = 0;
2246 -
2247 - if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config,
2248 - ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2249 - hostname = fnvlist_lookup_string(mos_config,
2250 - ZPOOL_CONFIG_HOSTNAME);
2251 -
2252 - myhostid = zone_get_hostid(NULL);
2253 -
2254 - if (hostid != 0 && myhostid != 0 && hostid != myhostid) {
2255 - cmn_err(CE_WARN, "pool '%s' could not be "
2256 - "loaded as it was last accessed by "
2257 - "another system (host: %s hostid: 0x%llx). "
2258 - "See: http://illumos.org/msg/ZFS-8000-EY",
2259 - spa_name(spa), hostname, (u_longlong_t)hostid);
2260 - spa_load_failed(spa, "hostid verification failed: pool "
2261 - "last accessed by host: %s (hostid: 0x%llx)",
2262 - hostname, (u_longlong_t)hostid);
2263 - return (SET_ERROR(EBADF));
2264 - }
2265 - }
2266 -
2267 - return (0);
2268 -}
2269 -
2270 -static int
2271 -spa_ld_parse_config(spa_t *spa, spa_import_type_t type)
2272 -{
2273 2404 int error = 0;
2274 - nvlist_t *nvtree, *nvl, *config = spa->spa_config;
2275 - int parse;
2405 + nvlist_t *nvroot = NULL;
2406 + nvlist_t *label;
2276 2407 vdev_t *rvd;
2277 - uint64_t pool_guid;
2278 - char *comment;
2408 + uberblock_t *ub = &spa->spa_uberblock;
2409 + uint64_t children, config_cache_txg = spa->spa_config_txg;
2410 + int orig_mode = spa->spa_mode;
2411 + int parse;
2412 + uint64_t obj;
2413 + boolean_t missing_feat_write = B_FALSE;
2414 + spa_meta_placement_t *mp;
2279 2415
2280 2416 /*
2281 - * Versioning wasn't explicitly added to the label until later, so if
2282 - * it's not present treat it as the initial version.
2417 + * If this is an untrusted config, access the pool in read-only mode.
2418 + * This prevents things like resilvering recently removed devices.
2283 2419 */
2284 - if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
2285 - &spa->spa_ubsync.ub_version) != 0)
2286 - spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
2420 + if (!mosconfig)
2421 + spa->spa_mode = FREAD;
2287 2422
2288 - if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) {
2289 - spa_load_failed(spa, "invalid config provided: '%s' missing",
2290 - ZPOOL_CONFIG_POOL_GUID);
2291 - return (SET_ERROR(EINVAL));
2292 - }
2423 + ASSERT(MUTEX_HELD(&spa_namespace_lock));
2293 2424
2294 - if ((spa->spa_load_state == SPA_LOAD_IMPORT || spa->spa_load_state ==
2295 - SPA_LOAD_TRYIMPORT) && spa_guid_exists(pool_guid, 0)) {
2296 - spa_load_failed(spa, "a pool with guid %llu is already open",
2297 - (u_longlong_t)pool_guid);
2298 - return (SET_ERROR(EEXIST));
2299 - }
2425 + spa->spa_load_state = state;
2300 2426
2301 - spa->spa_config_guid = pool_guid;
2302 -
2303 - nvlist_free(spa->spa_load_info);
2304 - spa->spa_load_info = fnvlist_alloc();
2305 -
2306 - ASSERT(spa->spa_comment == NULL);
2307 - if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0)
2308 - spa->spa_comment = spa_strdup(comment);
2309 -
2310 - (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG,
2311 - &spa->spa_config_txg);
2312 -
2313 - if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0)
2314 - spa->spa_config_splitting = fnvlist_dup(nvl);
2315 -
2316 - if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) {
2317 - spa_load_failed(spa, "invalid config provided: '%s' missing",
2318 - ZPOOL_CONFIG_VDEV_TREE);
2427 + if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot))
2319 2428 return (SET_ERROR(EINVAL));
2320 - }
2321 2429
2430 + parse = (type == SPA_IMPORT_EXISTING ?
2431 + VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2432 +
2322 2433 /*
2323 2434 * Create "The Godfather" zio to hold all async IOs
2324 2435 */
2325 2436 spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
2326 2437 KM_SLEEP);
2327 2438 for (int i = 0; i < max_ncpus; i++) {
2328 2439 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
2329 2440 ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
2330 2441 ZIO_FLAG_GODFATHER);
2331 2442 }
2332 2443
2333 2444 /*
2334 2445 * Parse the configuration into a vdev tree. We explicitly set the
2335 2446 * value that will be returned by spa_version() since parsing the
2336 2447 * configuration requires knowing the version number.
2337 2448 */
2338 2449 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2339 - parse = (type == SPA_IMPORT_EXISTING ?
2340 - VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT);
2341 - error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse);
2450 + error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, parse);
2342 2451 spa_config_exit(spa, SCL_ALL, FTAG);
2343 2452
2344 - if (error != 0) {
2345 - spa_load_failed(spa, "unable to parse config [error=%d]",
2346 - error);
2453 + if (error != 0)
2347 2454 return (error);
2348 - }
2349 2455
2350 2456 ASSERT(spa->spa_root_vdev == rvd);
2351 2457 ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
2352 2458 ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT);
2353 2459
2354 2460 if (type != SPA_IMPORT_ASSEMBLE) {
2355 2461 ASSERT(spa_guid(spa) == pool_guid);
2356 2462 }
2357 2463
2358 - return (0);
2359 -}
2360 -
2361 -/*
2362 - * Recursively open all vdevs in the vdev tree. This function is called twice:
2363 - * first with the untrusted config, then with the trusted config.
2364 - */
2365 -static int
2366 -spa_ld_open_vdevs(spa_t *spa)
2367 -{
2368 - int error = 0;
2369 -
2370 2464 /*
2371 - * spa_missing_tvds_allowed defines how many top-level vdevs can be
2372 - * missing/unopenable for the root vdev to be still considered openable.
2465 + * Try to open all vdevs, loading each label in the process.
2373 2466 */
2374 - if (spa->spa_trust_config) {
2375 - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds;
2376 - } else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) {
2377 - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile;
2378 - } else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) {
2379 - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan;
2380 - } else {
2381 - spa->spa_missing_tvds_allowed = 0;
2382 - }
2383 -
2384 - spa->spa_missing_tvds_allowed =
2385 - MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed);
2386 -
2387 2467 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2388 - error = vdev_open(spa->spa_root_vdev);
2468 + error = vdev_open(rvd);
2389 2469 spa_config_exit(spa, SCL_ALL, FTAG);
2470 + if (error != 0)
2471 + return (error);
2390 2472
2391 - if (spa->spa_missing_tvds != 0) {
2392 - spa_load_note(spa, "vdev tree has %lld missing top-level "
2393 - "vdevs.", (u_longlong_t)spa->spa_missing_tvds);
2394 - if (spa->spa_trust_config && (spa->spa_mode & FWRITE)) {
2395 - /*
2396 - * Although theoretically we could allow users to open
2397 - * incomplete pools in RW mode, we'd need to add a lot
2398 - * of extra logic (e.g. adjust pool space to account
2399 - * for missing vdevs).
2400 - * This limitation also prevents users from accidentally
2401 - * opening the pool in RW mode during data recovery and
2402 - * damaging it further.
2403 - */
2404 - spa_load_note(spa, "pools with missing top-level "
2405 - "vdevs can only be opened in read-only mode.");
2406 - error = SET_ERROR(ENXIO);
2407 - } else {
2408 - spa_load_note(spa, "current settings allow for maximum "
2409 - "%lld missing top-level vdevs at this stage.",
2410 - (u_longlong_t)spa->spa_missing_tvds_allowed);
2411 - }
2412 - }
2413 - if (error != 0) {
2414 - spa_load_failed(spa, "unable to open vdev tree [error=%d]",
2415 - error);
2416 - }
2417 - if (spa->spa_missing_tvds != 0 || error != 0)
2418 - vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2);
2473 + /*
2474 + * We need to validate the vdev labels against the configuration that
2475 + * we have in hand, which is dependent on the setting of mosconfig. If
2476 + * mosconfig is true then we're validating the vdev labels based on
2477 + * that config. Otherwise, we're validating against the cached config
2478 + * (zpool.cache) that was read when we loaded the zfs module, and then
2479 + * later we will recursively call spa_load() and validate against
2480 + * the vdev config.
2481 + *
2482 + * If we're assembling a new pool that's been split off from an
2483 + * existing pool, the labels haven't yet been updated so we skip
2484 + * validation for now.
2485 + */
2486 + if (type != SPA_IMPORT_ASSEMBLE) {
2487 + spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2488 + error = vdev_validate(rvd, mosconfig);
2489 + spa_config_exit(spa, SCL_ALL, FTAG);
2419 2490
2420 - return (error);
2421 -}
2491 + if (error != 0)
2492 + return (error);
2422 2493
2423 -/*
2424 - * We need to validate the vdev labels against the configuration that
2425 - * we have in hand. This function is called twice: first with an untrusted
2426 - * config, then with a trusted config. The validation is more strict when the
2427 - * config is trusted.
2428 - */
2429 -static int
2430 -spa_ld_validate_vdevs(spa_t *spa)
2431 -{
2432 - int error = 0;
2433 - vdev_t *rvd = spa->spa_root_vdev;
2434 -
2435 - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2436 - error = vdev_validate(rvd);
2437 - spa_config_exit(spa, SCL_ALL, FTAG);
2438 -
2439 - if (error != 0) {
2440 - spa_load_failed(spa, "vdev_validate failed [error=%d]", error);
2441 - return (error);
2494 + if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
2495 + return (SET_ERROR(ENXIO));
2442 2496 }
2443 2497
2444 - if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) {
2445 - spa_load_failed(spa, "cannot open vdev tree after invalidating "
2446 - "some vdevs");
2447 - vdev_dbgmsg_print_tree(rvd, 2);
2448 - return (SET_ERROR(ENXIO));
2449 - }
2450 -
2451 - return (0);
2452 -}
2453 -
2454 -static int
2455 -spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type)
2456 -{
2457 - vdev_t *rvd = spa->spa_root_vdev;
2458 - nvlist_t *label;
2459 - uberblock_t *ub = &spa->spa_uberblock;
2460 -
2461 2498 /*
2462 2499 * Find the best uberblock.
2463 2500 */
2464 2501 vdev_uberblock_load(rvd, ub, &label);
2465 2502
2466 2503 /*
2467 2504 * If we weren't able to find a single valid uberblock, return failure.
2468 2505 */
2469 2506 if (ub->ub_txg == 0) {
2470 2507 nvlist_free(label);
2471 - spa_load_failed(spa, "no valid uberblock found");
2472 2508 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO));
2473 2509 }
2474 2510
2475 - spa_load_note(spa, "using uberblock with txg=%llu",
2476 - (u_longlong_t)ub->ub_txg);
2477 -
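The "best" uberblock selection above boils down to keeping, across all label copies, the candidate with the highest txg, with the newer timestamp breaking ties. A minimal standalone sketch of that ordering, using a simplified two-field stand-in for the real uberblock_t:

#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for uberblock_t: only the fields the ordering needs. */
struct ub {
        uint64_t ub_txg;        /* transaction group the uberblock was written in */
        uint64_t ub_timestamp;  /* wall-clock time of that txg, used as tiebreaker */
};

/*
 * Pick the "best" uberblock among the label copies: highest txg wins,
 * newer timestamp breaks ties.  Returns NULL when no candidate is valid
 * (every ub_txg is 0), the case the caller above turns into
 * VDEV_AUX_CORRUPT_DATA / ENXIO.
 */
static const struct ub *
select_best_uberblock(const struct ub *cand, size_t n)
{
        const struct ub *best = NULL;

        for (size_t i = 0; i < n; i++) {
                if (cand[i].ub_txg == 0)
                        continue;       /* never written or failed its checksum */
                if (best == NULL ||
                    cand[i].ub_txg > best->ub_txg ||
                    (cand[i].ub_txg == best->ub_txg &&
                    cand[i].ub_timestamp > best->ub_timestamp))
                        best = &cand[i];
        }
        return (best);
}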
2478 2511 /*
2479 2512 * If the pool has an unsupported version we can't open it.
2480 2513 */
2481 2514 if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) {
2482 2515 nvlist_free(label);
2483 - spa_load_failed(spa, "version %llu is not supported",
2484 - (u_longlong_t)ub->ub_version);
2485 2516 return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP));
2486 2517 }
2487 2518
2488 2519 if (ub->ub_version >= SPA_VERSION_FEATURES) {
2489 2520 nvlist_t *features;
2490 2521
2491 2522 /*
2492 2523 * If we weren't able to find what's necessary for reading the
2493 2524 * MOS in the label, return failure.
2494 2525 */
2495 - if (label == NULL) {
2496 - spa_load_failed(spa, "label config unavailable");
2497 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2498 - ENXIO));
2499 - }
2500 -
2501 - if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ,
2502 - &features) != 0) {
2526 + if (label == NULL || nvlist_lookup_nvlist(label,
2527 + ZPOOL_CONFIG_FEATURES_FOR_READ, &features) != 0) {
2503 2528 nvlist_free(label);
2504 - spa_load_failed(spa, "invalid label: '%s' missing",
2505 - ZPOOL_CONFIG_FEATURES_FOR_READ);
2506 2529 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
2507 2530 ENXIO));
2508 2531 }
2509 2532
2510 2533 /*
2511 2534 * Update our in-core representation with the definitive values
2512 2535 * from the label.
2513 2536 */
2514 2537 nvlist_free(spa->spa_label_features);
2515 2538 VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0);
2516 2539 }
2517 2540
2518 2541 nvlist_free(label);
2519 2542
2520 2543 /*
2521 2544 * Look through entries in the label nvlist's features_for_read. If
2522 2545 * there is a feature listed there which we don't understand then we
2523 2546 * cannot open a pool.
2524 2547 */
2525 2548 if (ub->ub_version >= SPA_VERSION_FEATURES) {
2526 2549 nvlist_t *unsup_feat;
2527 2550
2528 2551 VERIFY(nvlist_alloc(&unsup_feat, NV_UNIQUE_NAME, KM_SLEEP) ==
2529 2552 0);
2530 2553
2531 2554 for (nvpair_t *nvp = nvlist_next_nvpair(spa->spa_label_features,
2532 2555 NULL); nvp != NULL;
2533 2556 nvp = nvlist_next_nvpair(spa->spa_label_features, nvp)) {
2534 2557 if (!zfeature_is_supported(nvpair_name(nvp))) {
2535 2558 VERIFY(nvlist_add_string(unsup_feat,
2536 2559 nvpair_name(nvp), "") == 0);
2537 2560 }
2538 2561 }
2539 2562
2540 2563 if (!nvlist_empty(unsup_feat)) {
2541 2564 VERIFY(nvlist_add_nvlist(spa->spa_load_info,
2542 2565 ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0);
2543 2566 nvlist_free(unsup_feat);
2544 - spa_load_failed(spa, "some features are unsupported");
2545 2567 return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2546 2568 ENOTSUP));
2547 2569 }
2548 2570
2549 2571 nvlist_free(unsup_feat);
2550 2572 }
2551 2573
2574 + /*
2575 + * If the vdev guid sum doesn't match the uberblock, we have an
2576 + * incomplete configuration. We first check to see if the pool
2577 + * is aware of the complete config (i.e., ZPOOL_CONFIG_VDEV_CHILDREN).
2578 + * If it is, defer the vdev_guid_sum check till later so we
2579 + * can handle missing vdevs.
2580 + */
2581 + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
2582 + &children) != 0 && mosconfig && type != SPA_IMPORT_ASSEMBLE &&
2583 + rvd->vdev_guid_sum != ub->ub_guid_sum)
2584 + return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2585 +
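The guid-sum test just added is an arithmetic fingerprint of the vdev tree: the guids of the devices that were actually opened are summed with ordinary 64-bit wraparound and compared against ub_guid_sum recorded at the last sync. A simplified sketch over a hypothetical flat guid array (the real sum is accumulated up the vdev tree as the vdevs are opened):

#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical flat-array version of the fingerprint: the guids of every
 * vdev that was actually opened are summed (64-bit wraparound is fine and
 * expected) and must equal the sum the uberblock recorded at the last sync.
 * A mismatch means the config in hand describes a different set of devices.
 */
static int
guid_sum_matches(const uint64_t *opened_guids, size_t n, uint64_t ub_guid_sum)
{
        uint64_t sum = 0;

        for (size_t i = 0; i < n; i++)
                sum += opened_guids[i];

        return (sum == ub_guid_sum);
}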
2552 2586 if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) {
2553 2587 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2554 - spa_try_repair(spa, spa->spa_config);
2588 + spa_try_repair(spa, config);
2555 2589 spa_config_exit(spa, SCL_ALL, FTAG);
2556 2590 nvlist_free(spa->spa_config_splitting);
2557 2591 spa->spa_config_splitting = NULL;
2558 2592 }
2559 2593
2560 2594 /*
2561 2595 * Initialize internal SPA structures.
2562 2596 */
2563 2597 spa->spa_state = POOL_STATE_ACTIVE;
2564 2598 spa->spa_ubsync = spa->spa_uberblock;
2565 2599 spa->spa_verify_min_txg = spa->spa_extreme_rewind ?
2566 2600 TXG_INITIAL - 1 : spa_last_synced_txg(spa) - TXG_DEFER_SIZE - 1;
2567 2601 spa->spa_first_txg = spa->spa_last_ubsync_txg ?
2568 2602 spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1;
2569 2603 spa->spa_claim_max_txg = spa->spa_first_txg;
2570 2604 spa->spa_prev_software_version = ub->ub_software_version;
2571 2605
2572 - return (0);
2573 -}
2574 -
2575 -static int
2576 -spa_ld_open_rootbp(spa_t *spa)
2577 -{
2578 - int error = 0;
2579 - vdev_t *rvd = spa->spa_root_vdev;
2580 -
2581 2606 error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool);
2582 - if (error != 0) {
2583 - spa_load_failed(spa, "unable to open rootbp in dsl_pool_init "
2584 - "[error=%d]", error);
2607 + if (error)
2585 2608 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2586 - }
2587 2609 spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset;
2588 2610
2589 - return (0);
2590 -}
2591 -
2592 -static int
2593 -spa_ld_load_trusted_config(spa_t *spa, spa_import_type_t type,
2594 - boolean_t reloading)
2595 -{
2596 - vdev_t *mrvd, *rvd = spa->spa_root_vdev;
2597 - nvlist_t *nv, *mos_config, *policy;
2598 - int error = 0, copy_error;
2599 - uint64_t healthy_tvds, healthy_tvds_mos;
2600 - uint64_t mos_config_txg;
2601 -
2602 - if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE)
2603 - != 0)
2611 + if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object) != 0)
2604 2612 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2605 2613
2606 - /*
2607 - * If we're assembling a pool from a split, the config provided is
2608 - * already trusted so there is nothing to do.
2609 - */
2610 - if (type == SPA_IMPORT_ASSEMBLE)
2611 - return (0);
2612 -
2613 - healthy_tvds = spa_healthy_core_tvds(spa);
2614 -
2615 - if (load_nvlist(spa, spa->spa_config_object, &mos_config)
2616 - != 0) {
2617 - spa_load_failed(spa, "unable to retrieve MOS config");
2618 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2619 - }
2620 -
2621 - /*
2622 - * If we are doing an open, pool owner wasn't verified yet, thus do
2623 - * the verification here.
2624 - */
2625 - if (spa->spa_load_state == SPA_LOAD_OPEN) {
2626 - error = spa_verify_host(spa, mos_config);
2627 - if (error != 0) {
2628 - nvlist_free(mos_config);
2629 - return (error);
2630 - }
2631 - }
2632 -
2633 - nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE);
2634 -
2635 - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
2636 -
2637 - /*
2638 - * Build a new vdev tree from the trusted config
2639 - */
2640 - VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0);
2641 -
2642 - /*
2643 - * Vdev paths in the MOS may be obsolete. If the untrusted config was
2644 - * obtained by scanning /dev/dsk, then it will have the right vdev
2645 - * paths. We update the trusted MOS config with this information.
2646 - * We first try to copy the paths with vdev_copy_path_strict, which
2647 - * succeeds only when both configs have exactly the same vdev tree.
2648 - * If that fails, we fall back to a more flexible method that has a
2649 - * best effort policy.
2650 - */
2651 - copy_error = vdev_copy_path_strict(rvd, mrvd);
2652 - if (copy_error != 0 || spa_load_print_vdev_tree) {
2653 - spa_load_note(spa, "provided vdev tree:");
2654 - vdev_dbgmsg_print_tree(rvd, 2);
2655 - spa_load_note(spa, "MOS vdev tree:");
2656 - vdev_dbgmsg_print_tree(mrvd, 2);
2657 - }
2658 - if (copy_error != 0) {
2659 - spa_load_note(spa, "vdev_copy_path_strict failed, falling "
2660 - "back to vdev_copy_path_relaxed");
2661 - vdev_copy_path_relaxed(rvd, mrvd);
2662 - }
2663 -
2664 - vdev_close(rvd);
2665 - vdev_free(rvd);
2666 - spa->spa_root_vdev = mrvd;
2667 - rvd = mrvd;
2668 - spa_config_exit(spa, SCL_ALL, FTAG);
2669 -
2670 - /*
2671 - * We will use spa_config if we decide to reload the spa or if spa_load
2672 - * fails and we rewind. We must thus regenerate the config using the
2673 - * MOS information with the updated paths. Rewind policy is an import
2674 - * setting and is not in the MOS. We copy it over to our new, trusted
2675 - * config.
2676 - */
2677 - mos_config_txg = fnvlist_lookup_uint64(mos_config,
2678 - ZPOOL_CONFIG_POOL_TXG);
2679 - nvlist_free(mos_config);
2680 - mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE);
2681 - if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_REWIND_POLICY,
2682 - &policy) == 0)
2683 - fnvlist_add_nvlist(mos_config, ZPOOL_REWIND_POLICY, policy);
2684 - spa_config_set(spa, mos_config);
2685 - spa->spa_config_source = SPA_CONFIG_SRC_MOS;
2686 -
2687 - /*
2688 - * Now that we got the config from the MOS, we should be more strict
2689 - * in checking blkptrs and can make assumptions about the consistency
2690 - * of the vdev tree. spa_trust_config must be set to true before opening
2691 - * vdevs in order for them to be writeable.
2692 - */
2693 - spa->spa_trust_config = B_TRUE;
2694 -
2695 - /*
2696 - * Open and validate the new vdev tree
2697 - */
2698 - error = spa_ld_open_vdevs(spa);
2699 - if (error != 0)
2700 - return (error);
2701 -
2702 - error = spa_ld_validate_vdevs(spa);
2703 - if (error != 0)
2704 - return (error);
2705 -
2706 - if (copy_error != 0 || spa_load_print_vdev_tree) {
2707 - spa_load_note(spa, "final vdev tree:");
2708 - vdev_dbgmsg_print_tree(rvd, 2);
2709 - }
2710 -
2711 - if (spa->spa_load_state != SPA_LOAD_TRYIMPORT &&
2712 - !spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) {
2713 - /*
2714 - * Sanity check to make sure that we are indeed loading the
2715 - * latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds
2716 - * in the config provided and they happened to be the only ones
2717 - * to have the latest uberblock, we could involuntarily perform
2718 - * an extreme rewind.
2719 - */
2720 - healthy_tvds_mos = spa_healthy_core_tvds(spa);
2721 - if (healthy_tvds_mos - healthy_tvds >=
2722 - SPA_SYNC_MIN_VDEVS) {
2723 - spa_load_note(spa, "config provided misses too many "
2724 - "top-level vdevs compared to MOS (%lld vs %lld). ",
2725 - (u_longlong_t)healthy_tvds,
2726 - (u_longlong_t)healthy_tvds_mos);
2727 - spa_load_note(spa, "vdev tree:");
2728 - vdev_dbgmsg_print_tree(rvd, 2);
2729 - if (reloading) {
2730 - spa_load_failed(spa, "config was already "
2731 - "provided from MOS. Aborting.");
2732 - return (spa_vdev_err(rvd,
2733 - VDEV_AUX_CORRUPT_DATA, EIO));
2734 - }
2735 - spa_load_note(spa, "spa must be reloaded using MOS "
2736 - "config");
2737 - return (SET_ERROR(EAGAIN));
2738 - }
2739 - }
2740 -
2741 - error = spa_check_for_missing_logs(spa);
2742 - if (error != 0)
2743 - return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO));
2744 -
2745 - if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) {
2746 - spa_load_failed(spa, "uberblock guid sum doesn't match MOS "
2747 - "guid sum (%llu != %llu)",
2748 - (u_longlong_t)spa->spa_uberblock.ub_guid_sum,
2749 - (u_longlong_t)rvd->vdev_guid_sum);
2750 - return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
2751 - ENXIO));
2752 - }
2753 -
2754 - return (0);
2755 -}
2756 -
2757 -static int
2758 -spa_ld_open_indirect_vdev_metadata(spa_t *spa)
2759 -{
2760 - int error = 0;
2761 - vdev_t *rvd = spa->spa_root_vdev;
2762 -
2763 - /*
2764 - * Everything that we read before spa_remove_init() must be stored
2765 - * on concreted vdevs. Therefore we do this as early as possible.
2766 - */
2767 - error = spa_remove_init(spa);
2768 - if (error != 0) {
2769 - spa_load_failed(spa, "spa_remove_init failed [error=%d]",
2770 - error);
2771 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2772 - }
2773 -
2774 - /*
2775 - * Retrieve information needed to condense indirect vdev mappings.
2776 - */
2777 - error = spa_condense_init(spa);
2778 - if (error != 0) {
2779 - spa_load_failed(spa, "spa_condense_init failed [error=%d]",
2780 - error);
2781 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
2782 - }
2783 -
2784 - return (0);
2785 -}
2786 -
2787 -static int
2788 -spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep)
2789 -{
2790 - int error = 0;
2791 - vdev_t *rvd = spa->spa_root_vdev;
2792 -
2793 2614 if (spa_version(spa) >= SPA_VERSION_FEATURES) {
2794 2615 boolean_t missing_feat_read = B_FALSE;
2795 2616 nvlist_t *unsup_feat, *enabled_feat;
2796 2617
2797 2618 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ,
2798 - &spa->spa_feat_for_read_obj, B_TRUE) != 0) {
2619 + &spa->spa_feat_for_read_obj) != 0) {
2799 2620 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2800 2621 }
2801 2622
2802 2623 if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE,
2803 - &spa->spa_feat_for_write_obj, B_TRUE) != 0) {
2624 + &spa->spa_feat_for_write_obj) != 0) {
2804 2625 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2805 2626 }
2806 2627
2807 2628 if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS,
2808 - &spa->spa_feat_desc_obj, B_TRUE) != 0) {
2629 + &spa->spa_feat_desc_obj) != 0) {
2809 2630 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2810 2631 }
2811 2632
2812 2633 enabled_feat = fnvlist_alloc();
2813 2634 unsup_feat = fnvlist_alloc();
2814 2635
2815 2636 if (!spa_features_check(spa, B_FALSE,
2816 2637 unsup_feat, enabled_feat))
2817 2638 missing_feat_read = B_TRUE;
2818 2639
2819 - if (spa_writeable(spa) ||
2820 - spa->spa_load_state == SPA_LOAD_TRYIMPORT) {
2640 + if (spa_writeable(spa) || state == SPA_LOAD_TRYIMPORT) {
2821 2641 if (!spa_features_check(spa, B_TRUE,
2822 2642 unsup_feat, enabled_feat)) {
2823 - *missing_feat_writep = B_TRUE;
2643 + missing_feat_write = B_TRUE;
2824 2644 }
2825 2645 }
2826 2646
2827 2647 fnvlist_add_nvlist(spa->spa_load_info,
2828 2648 ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
2829 2649
2830 2650 if (!nvlist_empty(unsup_feat)) {
2831 2651 fnvlist_add_nvlist(spa->spa_load_info,
2832 2652 ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat);
2833 2653 }
2834 2654
2835 2655 fnvlist_free(enabled_feat);
2836 2656 fnvlist_free(unsup_feat);
2837 2657
2838 2658 if (!missing_feat_read) {
2839 2659 fnvlist_add_boolean(spa->spa_load_info,
2840 2660 ZPOOL_CONFIG_CAN_RDONLY);
2841 2661 }
2842 2662
2843 2663 /*
2844 2664 * If the state is SPA_LOAD_TRYIMPORT, our objective is
2845 2665 * twofold: to determine whether the pool is available for
2846 2666 * import in read-write mode and (if it is not) whether the
2847 2667 * pool is available for import in read-only mode. If the pool
2848 2668 * is available for import in read-write mode, it is displayed
2849 2669 * as available in userland; if it is not available for import
2850 2670 * in read-only mode, it is displayed as unavailable in
2851 2671 * userland. If the pool is available for import in read-only
2852 2672 * mode but not read-write mode, it is displayed as unavailable
2853 2673 * in userland with a special note that the pool is actually
2854 2674 * available for open in read-only mode.
2855 2675 *
2856 2676 * As a result, if the state is SPA_LOAD_TRYIMPORT and we are
2857 2677 * missing a feature for write, we must first determine whether
2858 2678 * the pool can be opened read-only before returning to
2859 2679 * userland in order to know whether to display the
2860 2680 * abovementioned note.
2861 2681 */
2862 - if (missing_feat_read || (*missing_feat_writep &&
2682 + if (missing_feat_read || (missing_feat_write &&
2863 2683 spa_writeable(spa))) {
2864 - spa_load_failed(spa, "pool uses unsupported features");
2865 2684 return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT,
2866 2685 ENOTSUP));
2867 2686 }
2868 2687
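The gate above reduces to a small predicate on the two spa_features_check() results; a sketch of just that decision:

#include <stdbool.h>

/*
 * Unreadable features always block the load; unsupported write features only
 * block it when the pool is actually being opened writable, so a TRYIMPORT
 * can continue and report that a read-only import would still work.
 */
static bool
feature_gate_blocks_open(bool missing_feat_read, bool missing_feat_write,
    bool opening_writable)
{
        return (missing_feat_read || (missing_feat_write && opening_writable));
}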
2869 2688 /*
2870 2689 * Load refcounts for ZFS features from disk into an in-memory
2871 2690 * cache during SPA initialization.
2872 2691 */
2873 2692 for (spa_feature_t i = 0; i < SPA_FEATURES; i++) {
2874 2693 uint64_t refcount;
2875 2694
2876 2695 error = feature_get_refcount_from_disk(spa,
2877 2696 &spa_feature_table[i], &refcount);
2878 2697 if (error == 0) {
2879 2698 spa->spa_feat_refcount_cache[i] = refcount;
2880 2699 } else if (error == ENOTSUP) {
2881 2700 spa->spa_feat_refcount_cache[i] =
2882 2701 SPA_FEATURE_DISABLED;
2883 2702 } else {
2884 - spa_load_failed(spa, "error getting refcount "
2885 - "for feature %s [error=%d]",
2886 - spa_feature_table[i].fi_guid, error);
2887 2703 return (spa_vdev_err(rvd,
2888 2704 VDEV_AUX_CORRUPT_DATA, EIO));
2889 2705 }
2890 2706 }
2891 2707 }
2892 2708
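The refcount loop follows a three-way pattern: a successful read fills the in-core cache, ENOTSUP means the feature was never enabled on this pool, and anything else fails the load. The same control flow, reduced to a sketch with a hypothetical reader callback standing in for feature_get_refcount_from_disk():

#include <stdint.h>
#include <stddef.h>
#include <errno.h>

#define FEATURE_DISABLED        (~0ULL) /* stand-in for SPA_FEATURE_DISABLED */

/* Hypothetical per-feature refcount reader: 0 on success, an errno otherwise. */
typedef int (*refcount_read_fn)(size_t feature, uint64_t *refcount);

/*
 * Fill an in-core refcount cache for nfeatures features.  ENOTSUP is not an
 * error -- the feature simply has no refcount object on this pool -- while
 * any other failure aborts the load (the caller maps it to EIO).
 */
static int
load_feature_refcounts(refcount_read_fn read_refcount, uint64_t *cache,
    size_t nfeatures)
{
        for (size_t i = 0; i < nfeatures; i++) {
                uint64_t refcount;
                int error = read_refcount(i, &refcount);

                if (error == 0)
                        cache[i] = refcount;
                else if (error == ENOTSUP)
                        cache[i] = FEATURE_DISABLED;
                else
                        return (error);
        }
        return (0);
}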
2893 2709 if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) {
2894 2710 if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG,
2895 - &spa->spa_feat_enabled_txg_obj, B_TRUE) != 0)
2711 + &spa->spa_feat_enabled_txg_obj) != 0)
2896 2712 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2897 2713 }
2898 2714
2899 - return (0);
2900 -}
2901 -
2902 -static int
2903 -spa_ld_load_special_directories(spa_t *spa)
2904 -{
2905 - int error = 0;
2906 - vdev_t *rvd = spa->spa_root_vdev;
2907 -
2908 2715 spa->spa_is_initializing = B_TRUE;
2909 2716 error = dsl_pool_open(spa->spa_dsl_pool);
2910 2717 spa->spa_is_initializing = B_FALSE;
2911 - if (error != 0) {
2912 - spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error);
2718 + if (error != 0)
2913 2719 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2914 - }
2915 2720
2916 - return (0);
2917 -}
2721 + if (!mosconfig) {
2722 + uint64_t hostid;
2723 + nvlist_t *policy = NULL, *nvconfig;
2918 2724
2919 -static int
2920 -spa_ld_get_props(spa_t *spa)
2921 -{
2922 - int error = 0;
2923 - uint64_t obj;
2924 - vdev_t *rvd = spa->spa_root_vdev;
2725 + if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
2726 + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2925 2727
2728 + if (!spa_is_root(spa) && nvlist_lookup_uint64(nvconfig,
2729 + ZPOOL_CONFIG_HOSTID, &hostid) == 0) {
2730 + char *hostname;
2731 + unsigned long myhostid = 0;
2732 +
2733 + VERIFY(nvlist_lookup_string(nvconfig,
2734 + ZPOOL_CONFIG_HOSTNAME, &hostname) == 0);
2735 +
2736 +#ifdef _KERNEL
2737 + myhostid = zone_get_hostid(NULL);
2738 +#else /* _KERNEL */
2739 + /*
2740 + * We're emulating the system's hostid in userland, so
2741 + * we can't use zone_get_hostid().
2742 + */
2743 + (void) ddi_strtoul(hw_serial, NULL, 10, &myhostid);
2744 +#endif /* _KERNEL */
2745 + if (hostid != 0 && myhostid != 0 &&
2746 + hostid != myhostid) {
2747 + nvlist_free(nvconfig);
2748 + cmn_err(CE_WARN, "pool '%s' could not be "
2749 + "loaded as it was last accessed by "
2750 + "another system (host: %s hostid: 0x%lx). "
2751 + "See: http://illumos.org/msg/ZFS-8000-EY",
2752 + spa_name(spa), hostname,
2753 + (unsigned long)hostid);
2754 + return (SET_ERROR(EBADF));
2755 + }
2756 + }
2757 + if (nvlist_lookup_nvlist(spa->spa_config,
2758 + ZPOOL_REWIND_POLICY, &policy) == 0)
2759 + VERIFY(nvlist_add_nvlist(nvconfig,
2760 + ZPOOL_REWIND_POLICY, policy) == 0);
2761 +
2762 + spa_config_set(spa, nvconfig);
2763 + spa_unload(spa);
2764 + spa_deactivate(spa);
2765 + spa_activate(spa, orig_mode);
2766 +
2767 + return (spa_load(spa, state, SPA_IMPORT_EXISTING, B_TRUE));
2768 + }
2769 +
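The hostid guard in the !mosconfig path refuses the import with EBADF only when the hostid stored in the config and the local hostid are both nonzero and differ; zero on either side means "unknown" and never conflicts. A standalone sketch of just that predicate (the real code also reports the offending hostname from the config):

/*
 * Return nonzero when a pool whose label records pool_hostid may be imported
 * on a system whose own hostid is my_hostid.  Zero on either side means
 * "unknown" and never conflicts; only two different nonzero values indicate
 * the pool was last accessed by another system.
 */
static int
hostid_allows_import(unsigned long pool_hostid, unsigned long my_hostid)
{
        if (pool_hostid == 0 || my_hostid == 0)
                return (1);
        return (pool_hostid == my_hostid);
}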
2926 2770 /* Grab the secret checksum salt from the MOS. */
2927 2771 error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
2928 2772 DMU_POOL_CHECKSUM_SALT, 1,
2929 2773 sizeof (spa->spa_cksum_salt.zcs_bytes),
2930 2774 spa->spa_cksum_salt.zcs_bytes);
2931 2775 if (error == ENOENT) {
2932 2776 /* Generate a new salt for subsequent use */
2933 2777 (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
2934 2778 sizeof (spa->spa_cksum_salt.zcs_bytes));
2935 2779 } else if (error != 0) {
2936 - spa_load_failed(spa, "unable to retrieve checksum salt from "
2937 - "MOS [error=%d]", error);
2938 2780 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2939 2781 }
2940 2782
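The salt handling is a lookup-or-mint pattern: reuse the salt persisted in the MOS, generate a fresh random one when the pool predates the entry (ENOENT), and fail the load on any other error. A sketch with hypothetical callbacks standing in for zap_lookup() and random_get_pseudo_bytes():

#include <stdint.h>
#include <stddef.h>
#include <errno.h>

#define SALT_LEN        32      /* stand-in for sizeof (spa_cksum_salt.zcs_bytes) */

/* Hypothetical callbacks standing in for zap_lookup() and the kernel PRNG. */
typedef int (*salt_lookup_fn)(uint8_t *buf, size_t len);
typedef void (*salt_generate_fn)(uint8_t *buf, size_t len);

/*
 * Reuse the salt persisted in the MOS when it exists, mint a fresh random one
 * when the pool predates the entry (ENOENT), and fail the load otherwise.
 */
static int
get_checksum_salt(salt_lookup_fn lookup, salt_generate_fn generate,
    uint8_t *salt)
{
        int error = lookup(salt, SALT_LEN);

        if (error == ENOENT) {
                generate(salt, SALT_LEN);       /* first use on an older pool */
                return (0);
        }
        return (error);         /* 0 on success, anything else is fatal */
}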
2941 - if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0)
2783 + if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj) != 0)
2942 2784 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2943 2785 error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj);
2944 - if (error != 0) {
2945 - spa_load_failed(spa, "error opening deferred-frees bpobj "
2946 - "[error=%d]", error);
2786 + if (error != 0)
2947 2787 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2948 - }
2949 2788
2950 2789 /*
2951 2790 * Load the bit that tells us to use the new accounting function
2952 2791 * (raid-z deflation). If we have an older pool, this will not
2953 2792 * be present.
2954 2793 */
2955 - error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE);
2794 + error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate);
2956 2795 if (error != 0 && error != ENOENT)
2957 2796 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2958 2797
2959 2798 error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION,
2960 - &spa->spa_creation_version, B_FALSE);
2799 + &spa->spa_creation_version);
2961 2800 if (error != 0 && error != ENOENT)
2962 2801 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2963 2802
2964 2803 /*
2965 2804 * Load the persistent error log. If we have an older pool, this will
2966 2805 * not be present.
2967 2806 */
2968 - error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last,
2969 - B_FALSE);
2807 + error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last);
2970 2808 if (error != 0 && error != ENOENT)
2971 2809 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2972 2810
2973 2811 error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB,
2974 - &spa->spa_errlog_scrub, B_FALSE);
2812 + &spa->spa_errlog_scrub);
2975 2813 if (error != 0 && error != ENOENT)
2976 2814 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2977 2815
2978 2816 /*
2979 2817 * Load the history object. If we have an older pool, this
2980 2818 * will not be present.
2981 2819 */
2982 - error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE);
2820 + error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history);
2983 2821 if (error != 0 && error != ENOENT)
2984 2822 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2985 2823
2986 2824 /*
2987 2825 * Load the per-vdev ZAP map. If we have an older pool, this will not
2988 2826 * be present; in this case, defer its creation to a later time to
2989 2827 * avoid dirtying the MOS this early / out of sync context. See
2990 2828 * spa_sync_config_object.
2991 2829 */
2992 2830
2993 2831 /* The sentinel is only available in the MOS config. */
2994 2832 nvlist_t *mos_config;
2995 - if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) {
2996 - spa_load_failed(spa, "unable to retrieve MOS config");
2833 + if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0)
2997 2834 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2998 - }
2999 2835
3000 2836 error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP,
3001 - &spa->spa_all_vdev_zaps, B_FALSE);
2837 + &spa->spa_all_vdev_zaps);
3002 2838
3003 2839 if (error == ENOENT) {
3004 2840 VERIFY(!nvlist_exists(mos_config,
3005 2841 ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
3006 2842 spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
3007 2843 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3008 2844 } else if (error != 0) {
3009 2845 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3010 2846 } else if (!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)) {
3011 2847 /*
3012 2848 * An older version of ZFS overwrote the sentinel value, so
3013 2849 * we have orphaned per-vdev ZAPs in the MOS. Defer their
3014 2850 * destruction to later; see spa_sync_config_object.
3015 2851 */
3016 2852 spa->spa_avz_action = AVZ_ACTION_DESTROY;
3017 2853 /*
3018 2854 * We're assuming that no vdevs have had their ZAPs created
3019 2855 * before this. Better be sure of it.
3020 2856 */
3021 2857 ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev));
3022 2858 }
3023 2859 nvlist_free(mos_config);
3024 2860
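The per-vdev ZAP handling pairs two observations -- whether the directory lookup of the ZAP map succeeded and whether the MOS config carries the HAS_PER_VDEV_ZAPS sentinel -- and the chosen spa_avz_action falls out of that pair. A reduced sketch of the decision:

#include <errno.h>

typedef enum {
        AVZ_NONE,               /* map and sentinel agree; nothing to do */
        AVZ_INITIALIZE,         /* no per-vdev ZAP map yet; create it later */
        AVZ_DESTROY,            /* orphaned ZAPs from an older zfs; rebuild */
        AVZ_ERROR               /* unexpected lookup failure; treat as EIO */
} avz_action_t;

/*
 * lookup_error is the result of looking up the per-vdev ZAP map in the pool
 * directory; config_has_sentinel says whether the MOS config carries
 * HAS_PER_VDEV_ZAPS.  The two must agree; when the map exists but the
 * sentinel was lost, the ZAPs are stale and scheduled for destruction.
 */
static avz_action_t
decide_avz_action(int lookup_error, int config_has_sentinel)
{
        if (lookup_error == ENOENT)
                return (AVZ_INITIALIZE);        /* older pool: defer creation */
        if (lookup_error != 0)
                return (AVZ_ERROR);
        if (!config_has_sentinel)
                return (AVZ_DESTROY);
        return (AVZ_NONE);
}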
3025 - spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
3026 -
3027 - error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object,
3028 - B_FALSE);
3029 - if (error && error != ENOENT)
3030 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3031 -
3032 - if (error == 0) {
3033 - uint64_t autoreplace;
3034 -
3035 - spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
3036 - spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
3037 - spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
3038 - spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
3039 - spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
3040 - spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
3041 - &spa->spa_dedup_ditto);
3042 -
3043 - spa->spa_autoreplace = (autoreplace != 0);
3044 - }
3045 -
3046 2861 /*
3047 - * If we are importing a pool with missing top-level vdevs,
3048 - * we enforce that the pool doesn't panic or get suspended on
3049 - * error since the likelihood of missing data is extremely high.
3050 - */
3051 - if (spa->spa_missing_tvds > 0 &&
3052 - spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE &&
3053 - spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3054 - spa_load_note(spa, "forcing failmode to 'continue' "
3055 - "as some top level vdevs are missing");
3056 - spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE;
3057 - }
3058 -
3059 - return (0);
3060 -}
3061 -
3062 -static int
3063 -spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type)
3064 -{
3065 - int error = 0;
3066 - vdev_t *rvd = spa->spa_root_vdev;
3067 -
3068 - /*
3069 2862 * If we're assembling the pool from the split-off vdevs of
3070 2863 * an existing pool, we don't want to attach the spares & cache
3071 2864 * devices.
3072 2865 */
3073 2866
3074 2867 /*
3075 2868 * Load any hot spares for this pool.
3076 2869 */
3077 - error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object,
3078 - B_FALSE);
2870 + error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object);
3079 2871 if (error != 0 && error != ENOENT)
3080 2872 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3081 2873 if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3082 2874 ASSERT(spa_version(spa) >= SPA_VERSION_SPARES);
3083 2875 if (load_nvlist(spa, spa->spa_spares.sav_object,
3084 - &spa->spa_spares.sav_config) != 0) {
3085 - spa_load_failed(spa, "error loading spares nvlist");
2876 + &spa->spa_spares.sav_config) != 0)
3086 2877 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3087 - }
3088 2878
3089 2879 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3090 2880 spa_load_spares(spa);
3091 2881 spa_config_exit(spa, SCL_ALL, FTAG);
3092 2882 } else if (error == 0) {
3093 2883 spa->spa_spares.sav_sync = B_TRUE;
3094 2884 }
3095 2885
3096 2886 /*
3097 2887 * Load any level 2 ARC devices for this pool.
3098 2888 */
3099 2889 error = spa_dir_prop(spa, DMU_POOL_L2CACHE,
3100 - &spa->spa_l2cache.sav_object, B_FALSE);
2890 + &spa->spa_l2cache.sav_object);
3101 2891 if (error != 0 && error != ENOENT)
3102 2892 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3103 2893 if (error == 0 && type != SPA_IMPORT_ASSEMBLE) {
3104 2894 ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE);
3105 2895 if (load_nvlist(spa, spa->spa_l2cache.sav_object,
3106 - &spa->spa_l2cache.sav_config) != 0) {
3107 - spa_load_failed(spa, "error loading l2cache nvlist");
2896 + &spa->spa_l2cache.sav_config) != 0)
3108 2897 return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3109 - }
3110 2898
3111 2899 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3112 2900 spa_load_l2cache(spa);
3113 2901 spa_config_exit(spa, SCL_ALL, FTAG);
3114 2902 } else if (error == 0) {
3115 2903 spa->spa_l2cache.sav_sync = B_TRUE;
3116 2904 }
3117 2905
3118 - return (0);
3119 -}
2906 + mp = &spa->spa_meta_policy;
3120 2907
3121 -static int
3122 -spa_ld_load_vdev_metadata(spa_t *spa)
3123 -{
3124 - int error = 0;
3125 - vdev_t *rvd = spa->spa_root_vdev;
2908 + spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
2909 + spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
2910 + spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
2911 + spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
2912 + spa->spa_dedup_lo_best_effort =
2913 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
2914 + spa->spa_dedup_hi_best_effort =
2915 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
3126 2916
2917 + mp->spa_enable_meta_placement_selection =
2918 + zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
2919 + mp->spa_sync_to_special =
2920 + zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
2921 + mp->spa_ddt_meta_to_special =
2922 + zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
2923 + mp->spa_zfs_meta_to_special =
2924 + zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
2925 + mp->spa_small_data_to_special =
2926 + zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
2927 + spa_set_ddt_classes(spa,
2928 + zpool_prop_default_numeric(ZPOOL_PROP_DDT_DESEGREGATION));
2929 +
2930 + spa->spa_resilver_prio =
2931 + zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
2932 + spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
2933 +
2934 + error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object);
2935 + if (error && error != ENOENT)
2936 + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
2937 +
2938 + if (error == 0) {
2939 + uint64_t autoreplace;
2940 + uint64_t val = 0;
2941 +
2942 + spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs);
2943 + spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace);
2944 + spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation);
2945 + spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode);
2946 + spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand);
2947 + spa_prop_find(spa, ZPOOL_PROP_BOOTSIZE, &spa->spa_bootsize);
2948 + spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO,
2949 + &spa->spa_dedup_ditto);
2950 + spa_prop_find(spa, ZPOOL_PROP_FORCETRIM, &spa->spa_force_trim);
2951 +
2952 + mutex_enter(&spa->spa_auto_trim_lock);
2953 + spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_auto_trim);
2954 + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
2955 + spa_auto_trim_taskq_create(spa);
2956 + mutex_exit(&spa->spa_auto_trim_lock);
2957 +
2958 + spa_prop_find(spa, ZPOOL_PROP_HIWATERMARK, &spa->spa_hiwat);
2959 + spa_prop_find(spa, ZPOOL_PROP_LOWATERMARK, &spa->spa_lowat);
2960 + spa_prop_find(spa, ZPOOL_PROP_MINWATERMARK, &spa->spa_minwat);
2961 + spa_prop_find(spa, ZPOOL_PROP_DEDUPMETA_DITTO,
2962 + &spa->spa_ddt_meta_copies);
2963 + spa_prop_find(spa, ZPOOL_PROP_DDT_DESEGREGATION, &val);
2964 + spa_set_ddt_classes(spa, val);
2965 +
2966 + spa_prop_find(spa, ZPOOL_PROP_RESILVER_PRIO,
2967 + &spa->spa_resilver_prio);
2968 + spa_prop_find(spa, ZPOOL_PROP_SCRUB_PRIO,
2969 + &spa->spa_scrub_prio);
2970 +
2971 + spa_prop_find(spa, ZPOOL_PROP_DEDUP_BEST_EFFORT,
2972 + &spa->spa_dedup_best_effort);
2973 + spa_prop_find(spa, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT,
2974 + &spa->spa_dedup_lo_best_effort);
2975 + spa_prop_find(spa, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT,
2976 + &spa->spa_dedup_hi_best_effort);
2977 +
2978 + spa_prop_find(spa, ZPOOL_PROP_META_PLACEMENT,
2979 + &mp->spa_enable_meta_placement_selection);
2980 + spa_prop_find(spa, ZPOOL_PROP_SYNC_TO_SPECIAL,
2981 + &mp->spa_sync_to_special);
2982 + spa_prop_find(spa, ZPOOL_PROP_DDT_META_TO_METADEV,
2983 + &mp->spa_ddt_meta_to_special);
2984 + spa_prop_find(spa, ZPOOL_PROP_ZFS_META_TO_METADEV,
2985 + &mp->spa_zfs_meta_to_special);
2986 + spa_prop_find(spa, ZPOOL_PROP_SMALL_DATA_TO_METADEV,
2987 + &mp->spa_small_data_to_special);
2988 +
2989 + spa->spa_autoreplace = (autoreplace != 0);
2990 + }
2991 +
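The property block added above follows a default-then-override pattern: every tunable is first seeded from zpool_prop_default_numeric() so pools that predate the property still behave sensibly, and the stored value replaces the default only when the props object exists and contains the entry. A sketch of that pattern with a hypothetical lookup callback:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical lookup of one numeric property in an on-disk props object. */
typedef int (*prop_lookup_fn)(void *props_obj, const char *name, uint64_t *val);

/*
 * Seed the tunable with its compile-time default first, then let a stored
 * value replace it only when the props object exists and contains the entry.
 */
static void
load_numeric_prop(prop_lookup_fn lookup, void *props_obj, const char *name,
    uint64_t default_val, uint64_t *out)
{
        uint64_t val;

        *out = default_val;             /* correct even with no props object */
        if (props_obj != NULL && lookup(props_obj, name, &val) == 0)
                *out = val;             /* stored value wins when present */
}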
2992 + error = spa_dir_prop(spa, DMU_POOL_COS_PROPS,
2993 + &spa->spa_cos_props_object);
2994 + if (error == 0)
2995 + (void) spa_load_cos_props(spa);
2996 + error = spa_dir_prop(spa, DMU_POOL_VDEV_PROPS,
2997 + &spa->spa_vdev_props_object);
2998 + if (error == 0)
2999 + (void) spa_load_vdev_props(spa);
3000 +
3001 + (void) spa_dir_prop(spa, DMU_POOL_TRIM_START_TIME,
3002 + &spa->spa_man_trim_start_time);
3003 + (void) spa_dir_prop(spa, DMU_POOL_TRIM_STOP_TIME,
3004 + &spa->spa_man_trim_stop_time);
3005 +
3127 3006 /*
3128 3007 * If the 'autoreplace' property is set, then post a resource notifying
3129 3008 * the ZFS DE that it should not issue any faults for unopenable
3130 3009 * devices. We also iterate over the vdevs, and post a sysevent for any
3131 3010 * unopenable vdevs so that the normal autoreplace handler can take
3132 3011 * over.
3133 3012 */
3134 - if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3013 + if (spa->spa_autoreplace && state != SPA_LOAD_TRYIMPORT) {
3135 3014 spa_check_removed(spa->spa_root_vdev);
3136 3015 /*
3137 3016 * For the import case, this is done in spa_import(), because
3138 3017 * at this point we're using the spare definitions from
3139 3018 * the MOS config, not necessarily from the userland config.
3140 3019 */
3141 - if (spa->spa_load_state != SPA_LOAD_IMPORT) {
3020 + if (state != SPA_LOAD_IMPORT) {
3142 3021 spa_aux_check_removed(&spa->spa_spares);
3143 3022 spa_aux_check_removed(&spa->spa_l2cache);
3144 3023 }
3145 3024 }
3146 3025
3147 3026 /*
3148 - * Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc.
3027 + * Load the vdev state for all toplevel vdevs.
3149 3028 */
3150 - error = vdev_load(rvd);
3151 - if (error != 0) {
3152 - spa_load_failed(spa, "vdev_load failed [error=%d]", error);
3153 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error));
3154 - }
3029 + vdev_load(rvd);
3155 3030
3156 3031 /*
3157 - * Propagate the leaf DTLs we just loaded all the way up the vdev tree.
3032 + * Propagate the leaf DTLs we just loaded all the way up the tree.
3158 3033 */
3159 3034 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
3160 3035 vdev_dtl_reassess(rvd, 0, 0, B_FALSE);
3161 3036 spa_config_exit(spa, SCL_ALL, FTAG);
3162 3037
3163 - return (0);
3164 -}
3165 -
3166 -static int
3167 -spa_ld_load_dedup_tables(spa_t *spa)
3168 -{
3169 - int error = 0;
3170 - vdev_t *rvd = spa->spa_root_vdev;
3171 -
3172 - error = ddt_load(spa);
3173 - if (error != 0) {
3174 - spa_load_failed(spa, "ddt_load failed [error=%d]", error);
3175 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3176 - }
3177 -
3178 - return (0);
3179 -}
3180 -
3181 -static int
3182 -spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, char **ereport)
3183 -{
3184 - vdev_t *rvd = spa->spa_root_vdev;
3185 -
3186 - if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) {
3187 - boolean_t missing = spa_check_logs(spa);
3188 - if (missing) {
3189 - if (spa->spa_missing_tvds != 0) {
3190 - spa_load_note(spa, "spa_check_logs failed "
3191 - "so dropping the logs");
3192 - } else {
3193 - *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3194 - spa_load_failed(spa, "spa_check_logs failed");
3195 - return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG,
3196 - ENXIO));
3197 - }
3198 - }
3199 - }
3200 -
3201 - return (0);
3202 -}
3203 -
3204 -static int
3205 -spa_ld_verify_pool_data(spa_t *spa)
3206 -{
3207 - int error = 0;
3208 - vdev_t *rvd = spa->spa_root_vdev;
3209 -
3210 3038 /*
3211 - * We've successfully opened the pool, verify that we're ready
3212 - * to start pushing transactions.
3039 + * Load the DDTs (dedup tables).
3213 3040 */
3214 - if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) {
3215 - error = spa_load_verify(spa);
3216 - if (error != 0) {
3217 - spa_load_failed(spa, "spa_load_verify failed "
3218 - "[error=%d]", error);
3219 - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3220 - error));
3221 - }
3222 - }
3223 -
3224 - return (0);
3225 -}
3226 -
3227 -static void
3228 -spa_ld_claim_log_blocks(spa_t *spa)
3229 -{
3230 - dmu_tx_t *tx;
3231 - dsl_pool_t *dp = spa_get_dsl(spa);
3232 -
3233 - /*
3234 - * Claim log blocks that haven't been committed yet.
3235 - * This must all happen in a single txg.
3236 - * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3237 - * invoked from zil_claim_log_block()'s i/o done callback.
3238 - * Price of rollback is that we abandon the log.
3239 - */
3240 - spa->spa_claiming = B_TRUE;
3241 -
3242 - tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3243 - (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3244 - zil_claim, tx, DS_FIND_CHILDREN);
3245 - dmu_tx_commit(tx);
3246 -
3247 - spa->spa_claiming = B_FALSE;
3248 -
3249 - spa_set_log_state(spa, SPA_LOG_GOOD);
3250 -}
3251 -
3252 -static void
3253 -spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg,
3254 - boolean_t reloading)
3255 -{
3256 - vdev_t *rvd = spa->spa_root_vdev;
3257 - int need_update = B_FALSE;
3258 -
3259 - /*
3260 - * If the config cache is stale, or we have uninitialized
3261 - * metaslabs (see spa_vdev_add()), then update the config.
3262 - *
3263 - * If this is a verbatim import, trust the current
3264 - * in-core spa_config and update the disk labels.
3265 - */
3266 - if (reloading || config_cache_txg != spa->spa_config_txg ||
3267 - spa->spa_load_state == SPA_LOAD_IMPORT ||
3268 - spa->spa_load_state == SPA_LOAD_RECOVER ||
3269 - (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3270 - need_update = B_TRUE;
3271 -
3272 - for (int c = 0; c < rvd->vdev_children; c++)
3273 - if (rvd->vdev_child[c]->vdev_ms_array == 0)
3274 - need_update = B_TRUE;
3275 -
3276 - /*
3277 - * Update the config cache asynchronously in case we're the
3278 - * root pool, in which case the config cache isn't writable yet.
3279 - */
3280 - if (need_update)
3281 - spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3282 -}
3283 -
3284 -static void
3285 -spa_ld_prepare_for_reload(spa_t *spa)
3286 -{
3287 - int mode = spa->spa_mode;
3288 - int async_suspended = spa->spa_async_suspended;
3289 -
3290 - spa_unload(spa);
3291 - spa_deactivate(spa);
3292 - spa_activate(spa, mode);
3293 -
3294 - /*
3295 - * We save the value of spa_async_suspended as it gets reset to 0 by
3296 - * spa_unload(). We want to restore it back to the original value before
3297 - * returning as we might be calling spa_async_resume() later.
3298 - */
3299 - spa->spa_async_suspended = async_suspended;
3300 -}
3301 -
3302 -/*
3303 - * Load an existing storage pool, using the config provided. This config
3304 - * describes which vdevs are part of the pool and is later validated against
3305 - * partial configs present in each vdev's label and an entire copy of the
3306 - * config stored in the MOS.
3307 - */
3308 -static int
3309 -spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport,
3310 - boolean_t reloading)
3311 -{
3312 - int error = 0;
3313 - boolean_t missing_feat_write = B_FALSE;
3314 -
3315 - ASSERT(MUTEX_HELD(&spa_namespace_lock));
3316 - ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE);
3317 -
3318 - /*
3319 - * Never trust the config that is provided unless we are assembling
3320 - * a pool following a split.
3321 - * This means don't trust blkptrs and the vdev tree in general. This
3322 - * also effectively puts the spa in read-only mode since
3323 - * spa_writeable() checks for spa_trust_config to be true.
3324 - * We will later load a trusted config from the MOS.
3325 - */
3326 - if (type != SPA_IMPORT_ASSEMBLE)
3327 - spa->spa_trust_config = B_FALSE;
3328 -
3329 - if (reloading)
3330 - spa_load_note(spa, "RELOADING");
3331 - else
3332 - spa_load_note(spa, "LOADING");
3333 -
3334 - /*
3335 - * Parse the config provided to create a vdev tree.
3336 - */
3337 - error = spa_ld_parse_config(spa, type);
3041 + error = ddt_load(spa);
3338 3042 if (error != 0)
3339 - return (error);
3043 + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3340 3044
3341 - /*
3342 - * Now that we have the vdev tree, try to open each vdev. This involves
3343 - * opening the underlying physical device, retrieving its geometry and
3344 - * probing the vdev with a dummy I/O. The state of each vdev will be set
3345 - * based on the success of those operations. After this we'll be ready
3346 - * to read from the vdevs.
3347 - */
3348 - error = spa_ld_open_vdevs(spa);
3349 - if (error != 0)
3350 - return (error);
3045 + spa_update_dspace(spa);
3351 3046
3352 3047 /*
3353 - * Read the label of each vdev and make sure that the GUIDs stored
3354 - * there match the GUIDs in the config provided.
3355 - * If we're assembling a new pool that's been split off from an
3356 - * existing pool, the labels haven't yet been updated so we skip
3357 - * validation for now.
3048 + * Validate the config, using the MOS config to fill in any
3049 + * information which might be missing. If we fail to validate
3050 + * the config then declare the pool unfit for use. If we're
3051 + * assembling a pool from a split, the log is not transferred
3052 + * over.
3358 3053 */
3359 3054 if (type != SPA_IMPORT_ASSEMBLE) {
3360 - error = spa_ld_validate_vdevs(spa);
3361 - if (error != 0)
3362 - return (error);
3363 - }
3055 + nvlist_t *nvconfig;
3364 3056
3365 - /*
3366 - * Read vdev labels to find the best uberblock (i.e. latest, unless
3367 - * spa_load_max_txg is set) and store it in spa_uberblock. We get the
3368 - * list of features required to read blkptrs in the MOS from the vdev
3369 - * label with the best uberblock and verify that our version of zfs
3370 - * supports them all.
3371 - */
3372 - error = spa_ld_select_uberblock(spa, type);
3373 - if (error != 0)
3374 - return (error);
3057 + if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0)
3058 + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
3375 3059
3376 - /*
3377 - * Pass that uberblock to the dsl_pool layer which will open the root
3378 - * blkptr. This blkptr points to the latest version of the MOS and will
3379 - * allow us to read its contents.
3380 - */
3381 - error = spa_ld_open_rootbp(spa);
3382 - if (error != 0)
3383 - return (error);
3060 + if (!spa_config_valid(spa, nvconfig)) {
3061 + nvlist_free(nvconfig);
3062 + return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM,
3063 + ENXIO));
3064 + }
3065 + nvlist_free(nvconfig);
3384 3066
3385 - /*
3386 - * Retrieve the trusted config stored in the MOS and use it to create
3387 - * a new, exact version of the vdev tree, then reopen all vdevs.
3388 - */
3389 - error = spa_ld_load_trusted_config(spa, type, reloading);
3390 - if (error == EAGAIN) {
3391 - VERIFY(!reloading);
3392 3067 /*
3393 - * Redo the loading process with the trusted config if it is
3394 - * too different from the untrusted config.
3068 + * Now that we've validated the config, check the state of the
3069 + * root vdev. If it can't be opened, it indicates one or
3070 + * more toplevel vdevs are faulted.
3395 3071 */
3396 - spa_ld_prepare_for_reload(spa);
3397 - return (spa_load_impl(spa, type, ereport, B_TRUE));
3398 - } else if (error != 0) {
3399 - return (error);
3072 + if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN)
3073 + return (SET_ERROR(ENXIO));
3074 +
3075 + if (spa_writeable(spa) && spa_check_logs(spa)) {
3076 + *ereport = FM_EREPORT_ZFS_LOG_REPLAY;
3077 + return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ENXIO));
3078 + }
3400 3079 }
3401 3080
3402 - /*
3403 - * Retrieve the mapping of indirect vdevs. Those vdevs were removed
3404 - * from the pool and their contents were re-mapped to other vdevs. Note
3405 - * that everything that we read before this step must have been
3406 - * rewritten on concrete vdevs after the last device removal was
3407 - * initiated. Otherwise we could be reading from indirect vdevs before
3408 - * we have loaded their mappings.
3409 - */
3410 - error = spa_ld_open_indirect_vdev_metadata(spa);
3411 - if (error != 0)
3412 - return (error);
3413 -
3414 - /*
3415 - * Retrieve the full list of active features from the MOS and check if
3416 - * they are all supported.
3417 - */
3418 - error = spa_ld_check_features(spa, &missing_feat_write);
3419 - if (error != 0)
3420 - return (error);
3421 -
3422 - /*
3423 - * Load several special directories from the MOS needed by the dsl_pool
3424 - * layer.
3425 - */
3426 - error = spa_ld_load_special_directories(spa);
3427 - if (error != 0)
3428 - return (error);
3429 -
3430 - /*
3431 - * Retrieve pool properties from the MOS.
3432 - */
3433 - error = spa_ld_get_props(spa);
3434 - if (error != 0)
3435 - return (error);
3436 -
3437 - /*
3438 - * Retrieve the list of auxiliary devices - cache devices and spares -
3439 - * and open them.
3440 - */
3441 - error = spa_ld_open_aux_vdevs(spa, type);
3442 - if (error != 0)
3443 - return (error);
3444 -
3445 - /*
3446 - * Load the metadata for all vdevs. Also check if unopenable devices
3447 - * should be autoreplaced.
3448 - */
3449 - error = spa_ld_load_vdev_metadata(spa);
3450 - if (error != 0)
3451 - return (error);
3452 -
3453 - error = spa_ld_load_dedup_tables(spa);
3454 - if (error != 0)
3455 - return (error);
3456 -
3457 - /*
3458 - * Verify the logs now to make sure we don't have any unexpected errors
3459 - * when we claim log blocks later.
3460 - */
3461 - error = spa_ld_verify_logs(spa, type, ereport);
3462 - if (error != 0)
3463 - return (error);
3464 -
3465 3081 if (missing_feat_write) {
3466 - ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT);
3082 + ASSERT(state == SPA_LOAD_TRYIMPORT);
3467 3083
3468 3084 /*
3469 3085 * At this point, we know that we can open the pool in
3470 3086 * read-only mode but not read-write mode. We now have enough
3471 3087 * information and can return to userland.
3472 3088 */
3473 - return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT,
3474 - ENOTSUP));
3089 + return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP));
3475 3090 }
3476 3091
3477 3092 /*
3478 - * Traverse the last txgs to make sure the pool was left off in a safe
3479 - * state. When performing an extreme rewind, we verify the whole pool,
3480 - * which can take a very long time.
3093 + * We've successfully opened the pool, verify that we're ready
3094 + * to start pushing transactions.
3481 3095 */
3482 - error = spa_ld_verify_pool_data(spa);
3483 - if (error != 0)
3484 - return (error);
3096 + if (state != SPA_LOAD_TRYIMPORT) {
3097 + if (error = spa_load_verify(spa)) {
3098 + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA,
3099 + error));
3100 + }
3101 + }
3485 3102
3486 - /*
3487 - * Calculate the deflated space for the pool. This must be done before
3488 - * we write anything to the pool because we'd need to update the space
3489 - * accounting using the deflated sizes.
3490 - */
3491 - spa_update_dspace(spa);
3492 -
3493 - /*
3494 - * We have now retrieved all the information we needed to open the
3495 - * pool. If we are importing the pool in read-write mode, a few
3496 - * additional steps must be performed to finish the import.
3497 - */
3498 - if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER ||
3103 + if (spa_writeable(spa) && (state == SPA_LOAD_RECOVER ||
3499 3104 spa->spa_load_max_txg == UINT64_MAX)) {
3500 - uint64_t config_cache_txg = spa->spa_config_txg;
3105 + dmu_tx_t *tx;
3106 + int need_update = B_FALSE;
3107 + dsl_pool_t *dp = spa_get_dsl(spa);
3501 3108
3502 - ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT);
3109 + ASSERT(state != SPA_LOAD_TRYIMPORT);
3503 3110
3504 3111 /*
3505 - * Traverse the ZIL and claim all blocks.
3112 + * Claim log blocks that haven't been committed yet.
3113 + * This must all happen in a single txg.
3114 + * Note: spa_claim_max_txg is updated by spa_claim_notify(),
3115 + * invoked from zil_claim_log_block()'s i/o done callback.
3116 + * Price of rollback is that we abandon the log.
3506 3117 */
3507 - spa_ld_claim_log_blocks(spa);
3118 + spa->spa_claiming = B_TRUE;
3508 3119
3509 - /*
3510 - * Kick-off the syncing thread.
3511 - */
3120 + tx = dmu_tx_create_assigned(dp, spa_first_txg(spa));
3121 + (void) dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
3122 + zil_claim, tx, DS_FIND_CHILDREN);
3123 + dmu_tx_commit(tx);
3124 +
3125 + spa->spa_claiming = B_FALSE;
3126 +
3127 + spa_set_log_state(spa, SPA_LOG_GOOD);
3512 3128 spa->spa_sync_on = B_TRUE;
3513 3129 txg_sync_start(spa->spa_dsl_pool);
3514 3130
3515 3131 /*
3516 3132 * Wait for all claims to sync. We sync up to the highest
3517 3133 * claimed log block birth time so that claimed log blocks
3518 3134 * don't appear to be from the future. spa_claim_max_txg
3519 - * will have been set for us by ZIL traversal operations
3520 - * performed above.
3135 + * will have been set for us by either zil_check_log_chain()
3136 + * (invoked from spa_check_logs()) or zil_claim() above.
3521 3137 */
3522 3138 txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg);
3523 3139
3524 3140 /*
3525 - * Check if we need to request an update of the config. On the
3526 - * next sync, we would update the config stored in vdev labels
3527 - * and the cachefile (by default /etc/zfs/zpool.cache).
3141 + * If the config cache is stale, or we have uninitialized
3142 + * metaslabs (see spa_vdev_add()), then update the config.
3143 + *
3144 + * If this is a verbatim import, trust the current
3145 + * in-core spa_config and update the disk labels.
3528 3146 */
3529 - spa_ld_check_for_config_update(spa, config_cache_txg,
3530 - reloading);
3147 + if (config_cache_txg != spa->spa_config_txg ||
3148 + state == SPA_LOAD_IMPORT ||
3149 + state == SPA_LOAD_RECOVER ||
3150 + (spa->spa_import_flags & ZFS_IMPORT_VERBATIM))
3151 + need_update = B_TRUE;
3531 3152
3153 + for (int c = 0; c < rvd->vdev_children; c++)
3154 + if (rvd->vdev_child[c]->vdev_ms_array == 0)
3155 + need_update = B_TRUE;
3156 +
3532 3157 /*
3158 + * Update the config cache asynchronously in case we're the
3159 + * root pool, in which case the config cache isn't writable yet.
3160 + */
3161 + if (need_update)
3162 + spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE);
3163 +
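The need_update logic condenses to: rewrite the cachefile when the cached txg is stale, when this is an import/recovery or verbatim import, or when any top-level vdev still has an uninitialized metaslab array. A simplified sketch of that test:

#include <stdint.h>

typedef enum { LOAD_OPEN, LOAD_IMPORT, LOAD_RECOVER } load_state_t;

/*
 * Rewrite the cachefile when the cached txg is stale, when this is an
 * import/recovery (or verbatim import), or when any top-level vdev still has
 * no metaslab array object (it was added but never initialized on disk).
 */
static int
config_needs_update(uint64_t cached_txg, uint64_t pool_txg, load_state_t state,
    int verbatim_import, const uint64_t *tvd_ms_array, int ntvds)
{
        if (cached_txg != pool_txg || state == LOAD_IMPORT ||
            state == LOAD_RECOVER || verbatim_import)
                return (1);

        for (int c = 0; c < ntvds; c++)
                if (tvd_ms_array[c] == 0)
                        return (1);

        return (0);
}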
3164 + /*
3533 3165 * Check all DTLs to see if anything needs resilvering.
3534 3166 */
3535 3167 if (!dsl_scan_resilvering(spa->spa_dsl_pool) &&
3536 - vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
3168 + vdev_resilver_needed(rvd, NULL, NULL))
3537 3169 spa_async_request(spa, SPA_ASYNC_RESILVER);
3538 3170
3539 3171 /*
3540 3172 * Log the fact that we booted up (so that we can detect if
3541 3173 * we rebooted in the middle of an operation).
3542 3174 */
3543 3175 spa_history_log_version(spa, "open");
3544 3176
3545 - /*
3546 - * Delete any inconsistent datasets.
3547 - */
3548 - (void) dmu_objset_find(spa_name(spa),
3549 - dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN);
3177 + dsl_destroy_inconsistent(spa_get_dsl(spa));
3550 3178
3551 3179 /*
3552 3180 * Clean up any stale temporary dataset userrefs.
3553 3181 */
3554 3182 dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool);
3555 -
3556 - spa_restart_removal(spa);
3557 -
3558 - spa_spawn_aux_threads(spa);
3559 3183 }
3560 3184
3561 - spa_load_note(spa, "LOADED");
3185 + spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD);
3562 3186
3563 3187 return (0);
3564 3188 }
3565 3189
3566 3190 static int
3567 -spa_load_retry(spa_t *spa, spa_load_state_t state)
3191 +spa_load_retry(spa_t *spa, spa_load_state_t state, int mosconfig)
3568 3192 {
3569 3193 int mode = spa->spa_mode;
3570 3194
3571 3195 spa_unload(spa);
3572 3196 spa_deactivate(spa);
3573 3197
3574 3198 spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
3575 3199
3576 3200 spa_activate(spa, mode);
3577 3201 spa_async_suspend(spa);
3578 3202
3579 - spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
3580 - (u_longlong_t)spa->spa_load_max_txg);
3581 -
3582 - return (spa_load(spa, state, SPA_IMPORT_EXISTING));
3203 + return (spa_load(spa, state, SPA_IMPORT_EXISTING, mosconfig));
3583 3204 }
3584 3205
3585 3206 /*
3586 3207 * If spa_load() fails this function will try loading prior txg's. If
3587 3208 * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
3588 3209 * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this
3589 3210 * function will not rewind the pool and will return the same error as
3590 3211 * spa_load().
3591 3212 */
3592 3213 static int
3593 -spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request,
3594 - int rewind_flags)
3214 +spa_load_best(spa_t *spa, spa_load_state_t state, int mosconfig,
3215 + uint64_t max_request, int rewind_flags)
3595 3216 {
3596 3217 nvlist_t *loadinfo = NULL;
3597 3218 nvlist_t *config = NULL;
3598 3219 int load_error, rewind_error;
3599 3220 uint64_t safe_rewind_txg;
3600 3221 uint64_t min_txg;
3601 3222
3602 3223 if (spa->spa_load_txg && state == SPA_LOAD_RECOVER) {
3603 3224 spa->spa_load_max_txg = spa->spa_load_txg;
3604 3225 spa_set_log_state(spa, SPA_LOG_CLEAR);
3605 3226 } else {
3606 3227 spa->spa_load_max_txg = max_request;
3607 3228 if (max_request != UINT64_MAX)
3608 3229 spa->spa_extreme_rewind = B_TRUE;
3609 3230 }
3610 3231
3611 - load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING);
3232 + load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING,
3233 + mosconfig);
3612 3234 if (load_error == 0)
3613 3235 return (0);
3614 3236
3615 3237 if (spa->spa_root_vdev != NULL)
3616 3238 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3617 3239
3618 3240 spa->spa_last_ubsync_txg = spa->spa_uberblock.ub_txg;
3619 3241 spa->spa_last_ubsync_txg_ts = spa->spa_uberblock.ub_timestamp;
3620 3242
3621 3243 if (rewind_flags & ZPOOL_NEVER_REWIND) {
3622 3244 nvlist_free(config);
3623 3245 return (load_error);
3624 3246 }
3625 3247
3626 3248 if (state == SPA_LOAD_RECOVER) {
3627 3249 /* Price of rolling back is discarding txgs, including log */
3628 3250 spa_set_log_state(spa, SPA_LOG_CLEAR);
3629 3251 } else {
3630 3252 /*
3631 3253 * If we aren't rolling back save the load info from our first
3632 3254 * import attempt so that we can restore it after attempting
3633 3255 * to rewind.
3634 3256 */
3635 3257 loadinfo = spa->spa_load_info;
3636 3258 spa->spa_load_info = fnvlist_alloc();
3637 3259 }
3638 3260
3639 3261 spa->spa_load_max_txg = spa->spa_last_ubsync_txg;
3640 3262 safe_rewind_txg = spa->spa_last_ubsync_txg - TXG_DEFER_SIZE;
3641 3263 min_txg = (rewind_flags & ZPOOL_EXTREME_REWIND) ?
3642 3264 TXG_INITIAL : safe_rewind_txg;
3643 3265
3644 3266 /*
3645 3267 * Continue as long as we're finding errors, we're still within
3646 3268 * the acceptable rewind range, and we're still finding uberblocks
3647 3269 */
3648 3270 while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
3649 3271 spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
3650 3272 if (spa->spa_load_max_txg < safe_rewind_txg)
3651 3273 spa->spa_extreme_rewind = B_TRUE;
3652 - rewind_error = spa_load_retry(spa, state);
3274 + rewind_error = spa_load_retry(spa, state, mosconfig);
3653 3275 }
3654 3276
3655 3277 spa->spa_extreme_rewind = B_FALSE;
3656 3278 spa->spa_load_max_txg = UINT64_MAX;
3657 3279
3658 3280 if (config && (rewind_error || state != SPA_LOAD_RECOVER))
3659 3281 spa_config_set(spa, config);
3660 3282 else
3661 3283 nvlist_free(config);
3662 3284
3663 3285 if (state == SPA_LOAD_RECOVER) {
3664 3286 ASSERT3P(loadinfo, ==, NULL);
3665 3287 return (rewind_error);
3666 3288 } else {
3667 3289 /* Store the rewind info as part of the initial load info */
3668 3290 fnvlist_add_nvlist(loadinfo, ZPOOL_CONFIG_REWIND_INFO,
3669 3291 spa->spa_load_info);
3670 3292
3671 3293 /* Restore the initial load info */
3672 3294 fnvlist_free(spa->spa_load_info);
3673 3295 spa->spa_load_info = loadinfo;
3674 3296
3675 3297 return (load_error);
3676 3298 }
3677 3299 }
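
A minimal sketch of the rewind window described above, assuming only the constants already referenced in spa_load_best(); the helper name rewind_min_txg() is hypothetical and not part of this change:

/*
 * Illustrative only: the lowest txg spa_load_best() is willing to
 * rewind to.  Extreme rewind may go all the way back to TXG_INITIAL;
 * otherwise the rewind stays within TXG_DEFER_SIZE txgs of the last
 * synced uberblock.
 */
static uint64_t
rewind_min_txg(uint64_t last_ubsync_txg, int rewind_flags)
{
	uint64_t safe_rewind_txg = last_ubsync_txg - TXG_DEFER_SIZE;

	return ((rewind_flags & ZPOOL_EXTREME_REWIND) ?
	    TXG_INITIAL : safe_rewind_txg);
}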
3678 3300
3679 3301 /*
3680 3302 * Pool Open/Import
3681 3303 *
3682 3304 * The import case is identical to an open except that the configuration is sent
3683 3305 * down from userland, instead of grabbed from the configuration cache. For the
3684 3306 * case of an open, the pool configuration will exist in the
3685 3307 * POOL_STATE_UNINITIALIZED state.
3686 3308 *
3687 3309 * The stats information (gen/count/ustats) is used to gather vdev statistics at
3688 3310  * the same time we open the pool, without having to keep around the spa_t in some
3689 3311 * ambiguous state.
3690 3312 */
3691 3313 static int
3692 3314 spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
3693 3315 nvlist_t **config)
3694 3316 {
3695 3317 spa_t *spa;
3696 3318 spa_load_state_t state = SPA_LOAD_OPEN;
3697 3319 int error;
3698 3320 int locked = B_FALSE;
3321 + boolean_t open_with_activation = B_FALSE;
3699 3322
3700 3323 *spapp = NULL;
3701 3324
3702 3325 /*
3703 3326 * As disgusting as this is, we need to support recursive calls to this
3704 3327 * function because dsl_dir_open() is called during spa_load(), and ends
3705 3328 * up calling spa_open() again. The real fix is to figure out how to
3706 3329 * avoid dsl_dir_open() calling this in the first place.
3707 3330 */
3708 3331 if (mutex_owner(&spa_namespace_lock) != curthread) {
3709 3332 mutex_enter(&spa_namespace_lock);
3710 3333 locked = B_TRUE;
3711 3334 }
3712 3335
3713 3336 if ((spa = spa_lookup(pool)) == NULL) {
3714 3337 if (locked)
3715 3338 mutex_exit(&spa_namespace_lock);
3716 3339 return (SET_ERROR(ENOENT));
3717 3340 }
3718 3341
3719 3342 if (spa->spa_state == POOL_STATE_UNINITIALIZED) {
3720 3343 zpool_rewind_policy_t policy;
3721 3344
3722 3345 zpool_get_rewind_policy(nvpolicy ? nvpolicy : spa->spa_config,
3723 3346 &policy);
3724 3347 if (policy.zrp_request & ZPOOL_DO_REWIND)
3725 3348 state = SPA_LOAD_RECOVER;
3726 3349
3727 3350 spa_activate(spa, spa_mode_global);
3728 3351
3729 3352 if (state != SPA_LOAD_RECOVER)
3730 3353 spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
3731 - spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
3732 3354
3733 - zfs_dbgmsg("spa_open_common: opening %s", pool);
3734 - error = spa_load_best(spa, state, policy.zrp_txg,
3355 + error = spa_load_best(spa, state, B_FALSE, policy.zrp_txg,
3735 3356 policy.zrp_request);
3736 3357
3737 3358 if (error == EBADF) {
3738 3359 /*
3739 3360 * If vdev_validate() returns failure (indicated by
3740 3361 * EBADF), it indicates that one of the vdevs indicates
3741 3362 * that the pool has been exported or destroyed. If
3742 3363 * this is the case, the config cache is out of sync and
3743 3364 * we should remove the pool from the namespace.
3744 3365 */
3745 3366 spa_unload(spa);
3746 3367 spa_deactivate(spa);
3747 - spa_write_cachefile(spa, B_TRUE, B_TRUE);
3368 + spa_config_sync(spa, B_TRUE, B_TRUE);
3748 3369 spa_remove(spa);
3749 3370 if (locked)
3750 3371 mutex_exit(&spa_namespace_lock);
3751 3372 return (SET_ERROR(ENOENT));
3752 3373 }
3753 3374
3754 3375 if (error) {
3755 3376 /*
3756 3377 * We can't open the pool, but we still have useful
3757 3378 * information: the state of each vdev after the
3758 3379 * attempted vdev_open(). Return this to the user.
3759 3380 */
3760 3381 if (config != NULL && spa->spa_config) {
3761 3382 VERIFY(nvlist_dup(spa->spa_config, config,
3762 3383 KM_SLEEP) == 0);
3763 3384 VERIFY(nvlist_add_nvlist(*config,
3764 3385 ZPOOL_CONFIG_LOAD_INFO,
3765 3386 spa->spa_load_info) == 0);
3766 3387 }
3767 3388 spa_unload(spa);
3768 3389 spa_deactivate(spa);
3769 3390 spa->spa_last_open_failed = error;
3770 3391 if (locked)
3771 3392 mutex_exit(&spa_namespace_lock);
3772 3393 *spapp = NULL;
3773 3394 return (error);
3774 3395 }
3396 +
3397 + open_with_activation = B_TRUE;
3775 3398 }
3776 3399
3777 3400 spa_open_ref(spa, tag);
3778 3401
3779 3402 if (config != NULL)
3780 3403 *config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
3781 3404
3782 3405 /*
3783 3406 * If we've recovered the pool, pass back any information we
3784 3407 * gathered while doing the load.
3785 3408 */
3786 3409 if (state == SPA_LOAD_RECOVER) {
3787 3410 VERIFY(nvlist_add_nvlist(*config, ZPOOL_CONFIG_LOAD_INFO,
3788 3411 spa->spa_load_info) == 0);
3789 3412 }
3790 3413
3791 3414 if (locked) {
3792 3415 spa->spa_last_open_failed = 0;
3793 3416 spa->spa_last_ubsync_txg = 0;
3794 3417 spa->spa_load_txg = 0;
3795 3418 mutex_exit(&spa_namespace_lock);
3796 3419 }
3797 3420
3421 + if (open_with_activation)
3422 + wbc_activate(spa, B_FALSE);
3423 +
3798 3424 *spapp = spa;
3799 3425
3800 3426 return (0);
3801 3427 }
3802 3428
3803 3429 int
3804 3430 spa_open_rewind(const char *name, spa_t **spapp, void *tag, nvlist_t *policy,
3805 3431 nvlist_t **config)
3806 3432 {
3807 3433 return (spa_open_common(name, spapp, tag, policy, config));
3808 3434 }
3809 3435
3810 3436 int
3811 3437 spa_open(const char *name, spa_t **spapp, void *tag)
3812 3438 {
3813 3439 return (spa_open_common(name, spapp, tag, NULL, NULL));
3814 3440 }
3815 3441
3816 3442 /*
3817 3443 * Lookup the given spa_t, incrementing the inject count in the process,
3818 3444 * preventing it from being exported or destroyed.
3819 3445 */
3820 3446 spa_t *
3821 3447 spa_inject_addref(char *name)
3822 3448 {
3823 3449 spa_t *spa;
3824 3450
3825 3451 mutex_enter(&spa_namespace_lock);
3826 3452 if ((spa = spa_lookup(name)) == NULL) {
3827 3453 mutex_exit(&spa_namespace_lock);
3828 3454 return (NULL);
3829 3455 }
3830 3456 spa->spa_inject_ref++;
3831 3457 mutex_exit(&spa_namespace_lock);
3832 3458
3833 3459 return (spa);
3834 3460 }
3835 3461
3836 3462 void
3837 3463 spa_inject_delref(spa_t *spa)
3838 3464 {
3839 3465 mutex_enter(&spa_namespace_lock);
3840 3466 spa->spa_inject_ref--;
3841 3467 mutex_exit(&spa_namespace_lock);
3842 3468 }
3843 3469
3844 3470 /*
3845 3471 * Add spares device information to the nvlist.
3846 3472 */
3847 3473 static void
3848 3474 spa_add_spares(spa_t *spa, nvlist_t *config)
3849 3475 {
3850 3476 nvlist_t **spares;
3851 3477 uint_t i, nspares;
3852 3478 nvlist_t *nvroot;
3853 3479 uint64_t guid;
3854 3480 vdev_stat_t *vs;
3855 3481 uint_t vsc;
3856 3482 uint64_t pool;
3857 3483
3858 3484 ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3859 3485
3860 3486 if (spa->spa_spares.sav_count == 0)
3861 3487 return;
3862 3488
3863 3489 VERIFY(nvlist_lookup_nvlist(config,
3864 3490 ZPOOL_CONFIG_VDEV_TREE, &nvroot) == 0);
3865 3491 VERIFY(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
3866 3492 ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
3867 3493 if (nspares != 0) {
3868 3494 VERIFY(nvlist_add_nvlist_array(nvroot,
3869 3495 ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
3870 3496 VERIFY(nvlist_lookup_nvlist_array(nvroot,
3871 3497 ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
3872 3498
3873 3499 /*
3874 3500 * Go through and find any spares which have since been
3875 3501 * repurposed as an active spare. If this is the case, update
3876 3502 * their status appropriately.
3877 3503 */
3878 3504 for (i = 0; i < nspares; i++) {
3879 3505 VERIFY(nvlist_lookup_uint64(spares[i],
3880 3506 ZPOOL_CONFIG_GUID, &guid) == 0);
3881 3507 if (spa_spare_exists(guid, &pool, NULL) &&
3882 3508 pool != 0ULL) {
3883 3509 VERIFY(nvlist_lookup_uint64_array(
3884 3510 spares[i], ZPOOL_CONFIG_VDEV_STATS,
3885 3511 (uint64_t **)&vs, &vsc) == 0);
3886 3512 vs->vs_state = VDEV_STATE_CANT_OPEN;
3887 3513 vs->vs_aux = VDEV_AUX_SPARED;
3888 3514 }
3889 3515 }
3890 3516 }
3891 3517 }
3892 3518
3893 3519 /*
3894 3520 * Add l2cache device information to the nvlist, including vdev stats.
3895 3521 */
3896 3522 static void
3897 3523 spa_add_l2cache(spa_t *spa, nvlist_t *config)
3898 3524 {
3899 3525 nvlist_t **l2cache;
3900 3526 uint_t i, j, nl2cache;
3901 3527 nvlist_t *nvroot;
3902 3528 uint64_t guid;
3903 3529 vdev_t *vd;
3904 3530 vdev_stat_t *vs;
3905 3531 uint_t vsc;
3906 3532
3907 3533 ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3908 3534
3909 3535 if (spa->spa_l2cache.sav_count == 0)
3910 3536 return;
3911 3537
3912 3538 VERIFY(nvlist_lookup_nvlist(config,
3913 3539 ZPOOL_CONFIG_VDEV_TREE, &nvroot) == 0);
3914 3540 VERIFY(nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
3915 3541 ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
3916 3542 if (nl2cache != 0) {
3917 3543 VERIFY(nvlist_add_nvlist_array(nvroot,
3918 3544 ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
3919 3545 VERIFY(nvlist_lookup_nvlist_array(nvroot,
3920 3546 ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0);
3921 3547
3922 3548 /*
3923 3549 * Update level 2 cache device stats.
3924 3550 */
3925 3551
3926 3552 for (i = 0; i < nl2cache; i++) {
3927 3553 VERIFY(nvlist_lookup_uint64(l2cache[i],
3928 3554 ZPOOL_CONFIG_GUID, &guid) == 0);
3929 3555
3930 3556 vd = NULL;
3931 3557 for (j = 0; j < spa->spa_l2cache.sav_count; j++) {
3932 3558 if (guid ==
3933 3559 spa->spa_l2cache.sav_vdevs[j]->vdev_guid) {
3934 3560 vd = spa->spa_l2cache.sav_vdevs[j];
3935 3561 break;
3936 3562 }
3937 3563 }
3938 3564 ASSERT(vd != NULL);
3939 3565
3940 3566 VERIFY(nvlist_lookup_uint64_array(l2cache[i],
3941 3567 ZPOOL_CONFIG_VDEV_STATS, (uint64_t **)&vs, &vsc)
3942 3568 == 0);
3943 3569 vdev_get_stats(vd, vs);
3944 3570 }
3945 3571 }
3946 3572 }
3947 3573
3948 3574 static void
3949 3575 spa_add_feature_stats(spa_t *spa, nvlist_t *config)
3950 3576 {
3951 3577 nvlist_t *features;
3952 3578 zap_cursor_t zc;
3953 3579 zap_attribute_t za;
3954 3580
3955 3581 ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3956 3582 VERIFY(nvlist_alloc(&features, NV_UNIQUE_NAME, KM_SLEEP) == 0);
3957 3583
3958 3584 if (spa->spa_feat_for_read_obj != 0) {
3959 3585 for (zap_cursor_init(&zc, spa->spa_meta_objset,
3960 3586 spa->spa_feat_for_read_obj);
3961 3587 zap_cursor_retrieve(&zc, &za) == 0;
3962 3588 zap_cursor_advance(&zc)) {
3963 3589 ASSERT(za.za_integer_length == sizeof (uint64_t) &&
3964 3590 za.za_num_integers == 1);
3965 3591 VERIFY3U(0, ==, nvlist_add_uint64(features, za.za_name,
3966 3592 za.za_first_integer));
3967 3593 }
3968 3594 zap_cursor_fini(&zc);
3969 3595 }
3970 3596
3971 3597 if (spa->spa_feat_for_write_obj != 0) {
3972 3598 for (zap_cursor_init(&zc, spa->spa_meta_objset,
3973 3599 spa->spa_feat_for_write_obj);
3974 3600 zap_cursor_retrieve(&zc, &za) == 0;
3975 3601 zap_cursor_advance(&zc)) {
3976 3602 ASSERT(za.za_integer_length == sizeof (uint64_t) &&
3977 3603 za.za_num_integers == 1);
3978 3604 VERIFY3U(0, ==, nvlist_add_uint64(features, za.za_name,
3979 3605 za.za_first_integer));
3980 3606 }
3981 3607 zap_cursor_fini(&zc);
3982 3608 }
3983 3609
3984 3610 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_FEATURE_STATS,
3985 3611 features) == 0);
3986 3612 nvlist_free(features);
3987 3613 }
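
A minimal sketch of how a consumer might read the feature stats generated above; the function name and the logging are illustrative assumptions:

/* Illustrative only: walk the per-feature refcounts in a config. */
static void
example_dump_feature_stats(nvlist_t *config)
{
	nvlist_t *features;

	if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_FEATURE_STATS,
	    &features) != 0)
		return;

	for (nvpair_t *nvp = nvlist_next_nvpair(features, NULL);
	    nvp != NULL; nvp = nvlist_next_nvpair(features, nvp)) {
		cmn_err(CE_NOTE, "feature %s refcount %llu",
		    nvpair_name(nvp),
		    (u_longlong_t)fnvpair_value_uint64(nvp));
	}
}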
3988 3614
3989 3615 int
3990 3616 spa_get_stats(const char *name, nvlist_t **config,
3991 3617 char *altroot, size_t buflen)
3992 3618 {
3993 3619 int error;
3994 3620 spa_t *spa;
3995 3621
3996 3622 *config = NULL;
3997 3623 error = spa_open_common(name, &spa, FTAG, NULL, config);
3998 3624
3999 3625 if (spa != NULL) {
4000 3626 /*
4001 3627 * This still leaves a window of inconsistency where the spares
4002 3628 * or l2cache devices could change and the config would be
4003 3629 * self-inconsistent.
4004 3630 */
4005 3631 spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
4006 3632
4007 3633 if (*config != NULL) {
4008 3634 uint64_t loadtimes[2];
4009 3635
4010 3636 loadtimes[0] = spa->spa_loaded_ts.tv_sec;
4011 3637 loadtimes[1] = spa->spa_loaded_ts.tv_nsec;
4012 3638 VERIFY(nvlist_add_uint64_array(*config,
4013 3639 ZPOOL_CONFIG_LOADED_TIME, loadtimes, 2) == 0);
4014 3640
4015 3641 VERIFY(nvlist_add_uint64(*config,
4016 3642 ZPOOL_CONFIG_ERRCOUNT,
4017 3643 spa_get_errlog_size(spa)) == 0);
4018 3644
4019 3645 if (spa_suspended(spa))
4020 3646 VERIFY(nvlist_add_uint64(*config,
4021 3647 ZPOOL_CONFIG_SUSPENDED,
4022 3648 spa->spa_failmode) == 0);
4023 3649
4024 3650 spa_add_spares(spa, *config);
4025 3651 spa_add_l2cache(spa, *config);
4026 3652 spa_add_feature_stats(spa, *config);
4027 3653 }
4028 3654 }
4029 3655
4030 3656 /*
4031 3657 * We want to get the alternate root even for faulted pools, so we cheat
4032 3658 * and call spa_lookup() directly.
4033 3659 */
4034 3660 if (altroot) {
4035 3661 if (spa == NULL) {
4036 3662 mutex_enter(&spa_namespace_lock);
4037 3663 spa = spa_lookup(name);
4038 3664 if (spa)
4039 3665 spa_altroot(spa, altroot, buflen);
4040 3666 else
4041 3667 altroot[0] = '\0';
4042 3668 spa = NULL;
4043 3669 mutex_exit(&spa_namespace_lock);
4044 3670 } else {
4045 3671 spa_altroot(spa, altroot, buflen);
4046 3672 }
4047 3673 }
4048 3674
4049 3675 if (spa != NULL) {
4050 3676 spa_config_exit(spa, SCL_CONFIG, FTAG);
4051 3677 spa_close(spa, FTAG);
4052 3678 }
4053 3679
4054 3680 return (error);
4055 3681 }
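
A minimal sketch of a spa_get_stats() caller; the function name and the use of the error count are illustrative assumptions:

/*
 * Illustrative only: the config must be checked and freed regardless
 * of the return value, since it may be set even for faulted pools.
 */
static void
example_print_errcount(const char *name)
{
	nvlist_t *config = NULL;
	char altroot[MAXPATHLEN];
	uint64_t errcount;

	(void) spa_get_stats(name, &config, altroot, sizeof (altroot));

	if (config != NULL) {
		if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_ERRCOUNT,
		    &errcount) == 0) {
			cmn_err(CE_NOTE, "pool %s: %llu persistent errors",
			    name, (u_longlong_t)errcount);
		}
		nvlist_free(config);
	}
}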
4056 3682
4057 3683 /*
4058 3684 * Validate that the auxiliary device array is well formed. We must have an
4059 3685 * array of nvlists, each which describes a valid leaf vdev. If this is an
4060 3686 * import (mode is VDEV_ALLOC_SPARE), then we allow corrupted spares to be
4061 3687 * specified, as long as they are well-formed.
4062 3688 */
4063 3689 static int
4064 3690 spa_validate_aux_devs(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode,
4065 3691 spa_aux_vdev_t *sav, const char *config, uint64_t version,
4066 3692 vdev_labeltype_t label)
4067 3693 {
4068 3694 nvlist_t **dev;
4069 3695 uint_t i, ndev;
4070 3696 vdev_t *vd;
4071 3697 int error;
4072 3698
4073 3699 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
4074 3700
4075 3701 /*
4076 3702 * It's acceptable to have no devs specified.
4077 3703 */
4078 3704 if (nvlist_lookup_nvlist_array(nvroot, config, &dev, &ndev) != 0)
4079 3705 return (0);
4080 3706
4081 3707 if (ndev == 0)
4082 3708 return (SET_ERROR(EINVAL));
4083 3709
4084 3710 /*
4085 3711 * Make sure the pool is formatted with a version that supports this
4086 3712 * device type.
4087 3713 */
4088 3714 if (spa_version(spa) < version)
4089 3715 return (SET_ERROR(ENOTSUP));
4090 3716
4091 3717 /*
4092 3718 * Set the pending device list so we correctly handle device in-use
4093 3719 * checking.
4094 3720 */
4095 3721 sav->sav_pending = dev;
4096 3722 sav->sav_npending = ndev;
4097 3723
4098 3724 for (i = 0; i < ndev; i++) {
4099 3725 if ((error = spa_config_parse(spa, &vd, dev[i], NULL, 0,
4100 3726 mode)) != 0)
4101 3727 goto out;
4102 3728
4103 3729 if (!vd->vdev_ops->vdev_op_leaf) {
4104 3730 vdev_free(vd);
4105 3731 error = SET_ERROR(EINVAL);
4106 3732 goto out;
4107 3733 }
4108 3734
4109 3735 /*
4110 3736 * The L2ARC currently only supports disk devices in
4111 3737 * kernel context. For user-level testing, we allow it.
4112 3738 */
4113 3739 #ifdef _KERNEL
4114 3740 if ((strcmp(config, ZPOOL_CONFIG_L2CACHE) == 0) &&
4115 3741 strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) != 0) {
4116 3742 error = SET_ERROR(ENOTBLK);
4117 3743 vdev_free(vd);
4118 3744 goto out;
4119 3745 }
4120 3746 #endif
4121 3747 vd->vdev_top = vd;
4122 3748
4123 3749 if ((error = vdev_open(vd)) == 0 &&
4124 3750 (error = vdev_label_init(vd, crtxg, label)) == 0) {
4125 3751 VERIFY(nvlist_add_uint64(dev[i], ZPOOL_CONFIG_GUID,
4126 3752 vd->vdev_guid) == 0);
4127 3753 }
4128 3754
4129 3755 vdev_free(vd);
4130 3756
4131 3757 if (error &&
4132 3758 (mode != VDEV_ALLOC_SPARE && mode != VDEV_ALLOC_L2CACHE))
4133 3759 goto out;
4134 3760 else
4135 3761 error = 0;
4136 3762 }
4137 3763
4138 3764 out:
4139 3765 sav->sav_pending = NULL;
4140 3766 sav->sav_npending = 0;
4141 3767 return (error);
4142 3768 }
4143 3769
4144 3770 static int
4145 3771 spa_validate_aux(spa_t *spa, nvlist_t *nvroot, uint64_t crtxg, int mode)
4146 3772 {
4147 3773 int error;
4148 3774
4149 3775 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
4150 3776
4151 3777 if ((error = spa_validate_aux_devs(spa, nvroot, crtxg, mode,
4152 3778 &spa->spa_spares, ZPOOL_CONFIG_SPARES, SPA_VERSION_SPARES,
4153 3779 VDEV_LABEL_SPARE)) != 0) {
4154 3780 return (error);
4155 3781 }
4156 3782
4157 3783 return (spa_validate_aux_devs(spa, nvroot, crtxg, mode,
4158 3784 &spa->spa_l2cache, ZPOOL_CONFIG_L2CACHE, SPA_VERSION_L2CACHE,
4159 3785 VDEV_LABEL_L2CACHE));
4160 3786 }
4161 3787
4162 3788 static void
4163 3789 spa_set_aux_vdevs(spa_aux_vdev_t *sav, nvlist_t **devs, int ndevs,
4164 3790 const char *config)
4165 3791 {
4166 3792 int i;
4167 3793
4168 3794 if (sav->sav_config != NULL) {
4169 3795 nvlist_t **olddevs;
4170 3796 uint_t oldndevs;
4171 3797 nvlist_t **newdevs;
4172 3798
4173 3799 /*
4174 3800 		 * Generate a new dev list by concatenating with the
4175 3801 * current dev list.
4176 3802 */
4177 3803 VERIFY(nvlist_lookup_nvlist_array(sav->sav_config, config,
4178 3804 &olddevs, &oldndevs) == 0);
4179 3805
4180 3806 newdevs = kmem_alloc(sizeof (void *) *
4181 3807 (ndevs + oldndevs), KM_SLEEP);
4182 3808 for (i = 0; i < oldndevs; i++)
4183 3809 VERIFY(nvlist_dup(olddevs[i], &newdevs[i],
4184 3810 KM_SLEEP) == 0);
4185 3811 for (i = 0; i < ndevs; i++)
4186 3812 VERIFY(nvlist_dup(devs[i], &newdevs[i + oldndevs],
4187 3813 KM_SLEEP) == 0);
4188 3814
4189 3815 VERIFY(nvlist_remove(sav->sav_config, config,
4190 3816 DATA_TYPE_NVLIST_ARRAY) == 0);
4191 3817
4192 3818 VERIFY(nvlist_add_nvlist_array(sav->sav_config,
4193 3819 config, newdevs, ndevs + oldndevs) == 0);
4194 3820 for (i = 0; i < oldndevs + ndevs; i++)
4195 3821 nvlist_free(newdevs[i]);
4196 3822 kmem_free(newdevs, (oldndevs + ndevs) * sizeof (void *));
4197 3823 } else {
4198 3824 /*
4199 3825 * Generate a new dev list.
4200 3826 */
4201 3827 VERIFY(nvlist_alloc(&sav->sav_config, NV_UNIQUE_NAME,
4202 3828 KM_SLEEP) == 0);
4203 3829 VERIFY(nvlist_add_nvlist_array(sav->sav_config, config,
4204 3830 devs, ndevs) == 0);
4205 3831 }
4206 3832 }
4207 3833
4208 3834 /*
4209 3835 * Stop and drop level 2 ARC devices
4210 3836 */
4211 3837 void
4212 3838 spa_l2cache_drop(spa_t *spa)
4213 3839 {
4214 3840 vdev_t *vd;
4215 3841 int i;
4216 3842 spa_aux_vdev_t *sav = &spa->spa_l2cache;
4217 3843
4218 3844 for (i = 0; i < sav->sav_count; i++) {
4219 3845 uint64_t pool;
4220 3846
4221 3847 vd = sav->sav_vdevs[i];
4222 3848 ASSERT(vd != NULL);
4223 3849
4224 3850 if (spa_l2cache_exists(vd->vdev_guid, &pool) &&
4225 3851 pool != 0ULL && l2arc_vdev_present(vd))
4226 3852 l2arc_remove_vdev(vd);
4227 3853 }
4228 3854 }
4229 3855
4230 3856 /*
4231 3857 * Pool Creation
4232 3858 */
4233 3859 int
4234 3860 spa_create(const char *pool, nvlist_t *nvroot, nvlist_t *props,
4235 3861 nvlist_t *zplprops)
4236 3862 {
4237 3863 spa_t *spa;
4238 3864 char *altroot = NULL;
4239 3865 vdev_t *rvd;
4240 3866 dsl_pool_t *dp;
4241 3867 dmu_tx_t *tx;
4242 3868 int error = 0;
4243 3869 uint64_t txg = TXG_INITIAL;
4244 3870 nvlist_t **spares, **l2cache;
4245 3871 uint_t nspares, nl2cache;
4246 3872 uint64_t version, obj;
4247 - boolean_t has_features;
3873 + boolean_t has_features = B_FALSE, wbc_feature_exists = B_FALSE;
3874 + spa_meta_placement_t *mp;
4248 3875
4249 3876 /*
4250 3877 * If this pool already exists, return failure.
4251 3878 */
4252 3879 mutex_enter(&spa_namespace_lock);
4253 3880 if (spa_lookup(pool) != NULL) {
4254 3881 mutex_exit(&spa_namespace_lock);
4255 3882 return (SET_ERROR(EEXIST));
4256 3883 }
4257 3884
4258 3885 /*
4259 3886 * Allocate a new spa_t structure.
4260 3887 */
4261 3888 (void) nvlist_lookup_string(props,
4262 3889 zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4263 3890 spa = spa_add(pool, NULL, altroot);
4264 3891 spa_activate(spa, spa_mode_global);
4265 3892
4266 - if (props && (error = spa_prop_validate(spa, props))) {
4267 - spa_deactivate(spa);
4268 - spa_remove(spa);
4269 - mutex_exit(&spa_namespace_lock);
4270 - return (error);
4271 - }
3893 + if (props != NULL) {
3894 + nvpair_t *wbc_feature_nvp = NULL;
4272 3895
4273 - has_features = B_FALSE;
4274 - for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
4275 - elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
4276 - if (zpool_prop_feature(nvpair_name(elem)))
4277 - has_features = B_TRUE;
3896 + for (nvpair_t *elem = nvlist_next_nvpair(props, NULL);
3897 + elem != NULL; elem = nvlist_next_nvpair(props, elem)) {
3898 + const char *propname = nvpair_name(elem);
3899 + if (zpool_prop_feature(propname)) {
3900 + spa_feature_t feature;
3901 + int err;
3902 + const char *fname = strchr(propname, '@') + 1;
3903 +
3904 + err = zfeature_lookup_name(fname, &feature);
3905 + if (err == 0 && feature == SPA_FEATURE_WBC) {
3906 + wbc_feature_nvp = elem;
3907 + wbc_feature_exists = B_TRUE;
3908 + }
3909 +
3910 + has_features = B_TRUE;
3911 + }
3912 + }
3913 +
3914 + /*
3915 +	 * We do not want to enable feature@wbc if this pool does not
3916 +	 * have a special vdev.  At this stage we remove the feature
3917 +	 * from the common props list; later, once we have verified
3918 +	 * that a special vdev is present, the feature will be
3919 +	 * enabled.
3920 + */
3921 + if (wbc_feature_nvp != NULL)
3922 + fnvlist_remove_nvpair(props, wbc_feature_nvp);
3923 +
3924 + if ((error = spa_prop_validate(spa, props)) != 0) {
3925 + spa_deactivate(spa);
3926 + spa_remove(spa);
3927 + mutex_exit(&spa_namespace_lock);
3928 + return (error);
3929 + }
4278 3930 }
4279 3931
3932 +
4280 3933 if (has_features || nvlist_lookup_uint64(props,
4281 3934 zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) {
4282 3935 version = SPA_VERSION;
4283 3936 }
4284 3937 ASSERT(SPA_VERSION_IS_SUPPORTED(version));
4285 3938
4286 3939 spa->spa_first_txg = txg;
4287 3940 spa->spa_uberblock.ub_txg = txg - 1;
4288 3941 spa->spa_uberblock.ub_version = version;
4289 3942 spa->spa_ubsync = spa->spa_uberblock;
4290 3943 spa->spa_load_state = SPA_LOAD_CREATE;
4291 - spa->spa_removing_phys.sr_state = DSS_NONE;
4292 - spa->spa_removing_phys.sr_removing_vdev = -1;
4293 - spa->spa_removing_phys.sr_prev_indirect_vdev = -1;
4294 3944
4295 3945 /*
4296 3946 * Create "The Godfather" zio to hold all async IOs
4297 3947 */
4298 3948 spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *),
4299 3949 KM_SLEEP);
4300 3950 for (int i = 0; i < max_ncpus; i++) {
4301 3951 spa->spa_async_zio_root[i] = zio_root(spa, NULL, NULL,
4302 3952 ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
4303 3953 ZIO_FLAG_GODFATHER);
4304 3954 }
4305 3955
4306 3956 /*
4307 3957 * Create the root vdev.
4308 3958 */
4309 3959 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4310 3960
4311 3961 error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, VDEV_ALLOC_ADD);
4312 3962
4313 3963 ASSERT(error != 0 || rvd != NULL);
4314 3964 ASSERT(error != 0 || spa->spa_root_vdev == rvd);
4315 3965
4316 3966 if (error == 0 && !zfs_allocatable_devs(nvroot))
4317 3967 error = SET_ERROR(EINVAL);
4318 3968
4319 3969 if (error == 0 &&
4320 3970 (error = vdev_create(rvd, txg, B_FALSE)) == 0 &&
4321 3971 (error = spa_validate_aux(spa, nvroot, txg,
4322 3972 VDEV_ALLOC_ADD)) == 0) {
4323 3973 for (int c = 0; c < rvd->vdev_children; c++) {
4324 3974 vdev_metaslab_set_size(rvd->vdev_child[c]);
4325 3975 vdev_expand(rvd->vdev_child[c], txg);
4326 3976 }
4327 3977 }
4328 3978
4329 3979 spa_config_exit(spa, SCL_ALL, FTAG);
4330 3980
4331 3981 if (error != 0) {
4332 3982 spa_unload(spa);
4333 3983 spa_deactivate(spa);
4334 3984 spa_remove(spa);
4335 3985 mutex_exit(&spa_namespace_lock);
4336 3986 return (error);
4337 3987 }
4338 3988
4339 3989 /*
4340 3990 * Get the list of spares, if specified.
4341 3991 */
4342 3992 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4343 3993 &spares, &nspares) == 0) {
4344 3994 VERIFY(nvlist_alloc(&spa->spa_spares.sav_config, NV_UNIQUE_NAME,
4345 3995 KM_SLEEP) == 0);
4346 3996 VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4347 3997 ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4348 3998 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4349 3999 spa_load_spares(spa);
4350 4000 spa_config_exit(spa, SCL_ALL, FTAG);
4351 4001 spa->spa_spares.sav_sync = B_TRUE;
4352 4002 }
4353 4003
4354 4004 /*
4355 4005 * Get the list of level 2 cache devices, if specified.
4356 4006 */
4357 4007 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4358 4008 &l2cache, &nl2cache) == 0) {
4359 4009 VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4360 4010 NV_UNIQUE_NAME, KM_SLEEP) == 0);
4361 4011 VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4362 4012 ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4363 4013 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4364 4014 spa_load_l2cache(spa);
4365 4015 spa_config_exit(spa, SCL_ALL, FTAG);
4366 4016 spa->spa_l2cache.sav_sync = B_TRUE;
4367 4017 }
4368 4018
4369 4019 spa->spa_is_initializing = B_TRUE;
4370 4020 spa->spa_dsl_pool = dp = dsl_pool_create(spa, zplprops, txg);
4371 4021 spa->spa_meta_objset = dp->dp_meta_objset;
4372 4022 spa->spa_is_initializing = B_FALSE;
4373 4023
4374 4024 /*
4375 4025 * Create DDTs (dedup tables).
4376 4026 */
4377 4027 ddt_create(spa);
4378 4028
4379 4029 spa_update_dspace(spa);
4380 4030
4381 4031 tx = dmu_tx_create_assigned(dp, txg);
4382 4032
4383 4033 /*
4384 4034 * Create the pool config object.
4385 4035 */
4386 4036 spa->spa_config_object = dmu_object_alloc(spa->spa_meta_objset,
4387 4037 DMU_OT_PACKED_NVLIST, SPA_CONFIG_BLOCKSIZE,
4388 4038 DMU_OT_PACKED_NVLIST_SIZE, sizeof (uint64_t), tx);
4389 4039
4390 4040 if (zap_add(spa->spa_meta_objset,
4391 4041 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CONFIG,
4392 4042 sizeof (uint64_t), 1, &spa->spa_config_object, tx) != 0) {
4393 4043 cmn_err(CE_PANIC, "failed to add pool config");
4394 4044 }
4395 4045
4396 4046 if (spa_version(spa) >= SPA_VERSION_FEATURES)
4397 4047 spa_feature_create_zap_objects(spa, tx);
4398 4048
4399 4049 if (zap_add(spa->spa_meta_objset,
4400 4050 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CREATION_VERSION,
4401 4051 sizeof (uint64_t), 1, &version, tx) != 0) {
4402 4052 cmn_err(CE_PANIC, "failed to add pool version");
4403 4053 }
4404 4054
4405 4055 /* Newly created pools with the right version are always deflated. */
4406 4056 if (version >= SPA_VERSION_RAIDZ_DEFLATE) {
4407 4057 spa->spa_deflate = TRUE;
4408 4058 if (zap_add(spa->spa_meta_objset,
4409 4059 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
4410 4060 sizeof (uint64_t), 1, &spa->spa_deflate, tx) != 0) {
4411 4061 cmn_err(CE_PANIC, "failed to add deflate");
4412 4062 }
4413 4063 }
4414 4064
4415 4065 /*
4416 4066 * Create the deferred-free bpobj. Turn off compression
4417 4067 * because sync-to-convergence takes longer if the blocksize
4418 4068 * keeps changing.
4419 4069 */
4420 4070 obj = bpobj_alloc(spa->spa_meta_objset, 1 << 14, tx);
4421 4071 dmu_object_set_compress(spa->spa_meta_objset, obj,
4422 4072 ZIO_COMPRESS_OFF, tx);
4423 4073 if (zap_add(spa->spa_meta_objset,
4424 4074 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SYNC_BPOBJ,
4425 4075 sizeof (uint64_t), 1, &obj, tx) != 0) {
4426 4076 cmn_err(CE_PANIC, "failed to add bpobj");
4427 4077 }
4428 4078 VERIFY3U(0, ==, bpobj_open(&spa->spa_deferred_bpobj,
4429 4079 spa->spa_meta_objset, obj));
4430 4080
4431 4081 /*
4432 4082 * Create the pool's history object.
4433 4083 */
4434 4084 if (version >= SPA_VERSION_ZPOOL_HISTORY)
4435 4085 spa_history_create_obj(spa, tx);
4436 4086
4087 + mp = &spa->spa_meta_policy;
4088 +
4437 4089 /*
4438 4090 * Generate some random noise for salted checksums to operate on.
4439 4091 */
4440 4092 (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes,
4441 4093 sizeof (spa->spa_cksum_salt.zcs_bytes));
4442 4094
4443 4095 /*
4444 4096 * Set pool properties.
4445 4097 */
4446 4098 spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS);
4447 4099 spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION);
4448 4100 spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE);
4449 4101 spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND);
4102 + spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK);
4103 + spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK);
4104 + spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK);
4105 + spa->spa_ddt_meta_copies =
4106 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUPMETA_DITTO);
4107 + spa->spa_dedup_best_effort =
4108 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_BEST_EFFORT);
4109 + spa->spa_dedup_lo_best_effort =
4110 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT);
4111 + spa->spa_dedup_hi_best_effort =
4112 + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT);
4113 + spa->spa_force_trim = zpool_prop_default_numeric(ZPOOL_PROP_FORCETRIM);
4450 4114
4115 + spa->spa_resilver_prio =
4116 + zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO);
4117 + spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO);
4118 +
4119 + mutex_enter(&spa->spa_auto_trim_lock);
4120 + spa->spa_auto_trim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM);
4121 + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
4122 + spa_auto_trim_taskq_create(spa);
4123 + mutex_exit(&spa->spa_auto_trim_lock);
4124 +
4125 + mp->spa_enable_meta_placement_selection =
4126 + zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT);
4127 + mp->spa_sync_to_special =
4128 + zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL);
4129 + mp->spa_ddt_meta_to_special =
4130 + zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV);
4131 + mp->spa_zfs_meta_to_special =
4132 + zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV);
4133 + mp->spa_small_data_to_special =
4134 + zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV);
4135 +
4136 + spa_set_ddt_classes(spa, 0);
4137 +
4451 4138 if (props != NULL) {
4452 4139 spa_configfile_set(spa, props, B_FALSE);
4453 4140 spa_sync_props(props, tx);
4454 4141 }
4455 4142
4143 + if (spa_has_special(spa)) {
4144 + spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
4145 + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4146 +
4147 + if (wbc_feature_exists)
4148 + spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
4149 + }
4150 +
4456 4151 dmu_tx_commit(tx);
4457 4152
4458 4153 spa->spa_sync_on = B_TRUE;
4459 4154 txg_sync_start(spa->spa_dsl_pool);
4460 4155
4461 4156 /*
4462 4157 * We explicitly wait for the first transaction to complete so that our
4463 4158 * bean counters are appropriately updated.
4464 4159 */
4465 4160 txg_wait_synced(spa->spa_dsl_pool, txg);
4466 4161
4467 - spa_spawn_aux_threads(spa);
4468 -
4469 - spa_write_cachefile(spa, B_FALSE, B_TRUE);
4162 + spa_config_sync(spa, B_FALSE, B_TRUE);
4470 4163 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE);
4471 4164
4472 4165 spa_history_log_version(spa, "create");
4473 4166
4474 4167 /*
4475 4168 * Don't count references from objsets that are already closed
4476 4169 * and are making their way through the eviction process.
4477 4170 */
4478 4171 spa_evicting_os_wait(spa);
4479 4172 spa->spa_minref = refcount_count(&spa->spa_refcount);
4480 4173 spa->spa_load_state = SPA_LOAD_NONE;
4481 4174
4482 4175 mutex_exit(&spa_namespace_lock);
4483 4176
4177 + wbc_activate(spa, B_TRUE);
4178 +
4484 4179 return (0);
4485 4180 }
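
A minimal sketch of the feature-property detection performed in spa_create() above; the helper name is hypothetical:

/*
 * Illustrative only: feature properties arrive in 'props' as
 * "feature@<name>" nvpairs (e.g. "feature@wbc" set to "enabled").
 */
static boolean_t
example_prop_is_wbc_feature(const char *propname)
{
	spa_feature_t feature;

	if (!zpool_prop_feature(propname))
		return (B_FALSE);

	return (zfeature_lookup_name(strchr(propname, '@') + 1,
	    &feature) == 0 && feature == SPA_FEATURE_WBC);
}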
4486 4181
4182 +
4183 +/*
4184 + * See if the pool has a special tier, and if so, enable/activate
4185 + * the feature as needed. Activation is not reference counted.
4186 + */
4187 +static void
4188 +spa_check_special_feature(spa_t *spa)
4189 +{
4190 + if (spa_has_special(spa)) {
4191 + nvlist_t *props = NULL;
4192 +
4193 + if (!spa_feature_is_enabled(spa, SPA_FEATURE_META_DEVICES)) {
4194 + VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, 0) == 0);
4195 + VERIFY(nvlist_add_uint64(props,
4196 + FEATURE_META_DEVICES, 0) == 0);
4197 + VERIFY(spa_prop_set(spa, props) == 0);
4198 + nvlist_free(props);
4199 + }
4200 +
4201 + if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
4202 + dmu_tx_t *tx =
4203 + dmu_tx_create_dd(spa->spa_dsl_pool->dp_mos_dir);
4204 +
4205 + VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0);
4206 + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4207 + dmu_tx_commit(tx);
4208 + }
4209 + }
4210 +}
4211 +
4212 +static void
4213 +spa_special_feature_activate(void *arg, dmu_tx_t *tx)
4214 +{
4215 + spa_t *spa = (spa_t *)arg;
4216 +
4217 + if (spa_has_special(spa)) {
4218 + /* enable and activate as needed */
4219 + spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx);
4220 + if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) {
4221 + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx);
4222 + }
4223 +
4224 + spa_feature_enable(spa, SPA_FEATURE_WBC, tx);
4225 + }
4226 +}
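
A minimal sketch of how the activation checkpoint above is dispatched so that it runs in syncing context; it mirrors the dsl_sync_task() call made at the end of spa_import() below, and the wrapper name is hypothetical:

/* Illustrative only: 3 is the blocks-modified estimate for the task. */
static int
example_activate_special(spa_t *spa)
{
	return (dsl_sync_task(spa->spa_name, NULL /* no check callback */,
	    spa_special_feature_activate, spa, 3, ZFS_SPACE_CHECK_RESERVED));
}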
4227 +
4487 4228 #ifdef _KERNEL
4488 4229 /*
4489 4230 * Get the root pool information from the root disk, then import the root pool
4490 4231 * during the system boot up time.
4491 4232 */
4492 4233 extern int vdev_disk_read_rootlabel(char *, char *, nvlist_t **);
4493 4234
4494 4235 static nvlist_t *
4495 4236 spa_generate_rootconf(char *devpath, char *devid, uint64_t *guid)
4496 4237 {
4497 4238 nvlist_t *config;
4498 4239 nvlist_t *nvtop, *nvroot;
4499 4240 uint64_t pgid;
4500 4241
4501 4242 if (vdev_disk_read_rootlabel(devpath, devid, &config) != 0)
4502 4243 return (NULL);
4503 4244
4504 4245 /*
4505 4246 * Add this top-level vdev to the child array.
4506 4247 */
4507 4248 VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4508 4249 &nvtop) == 0);
4509 4250 VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID,
4510 4251 &pgid) == 0);
4511 4252 VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_GUID, guid) == 0);
4512 4253
4513 4254 /*
4514 4255 * Put this pool's top-level vdevs into a root vdev.
4515 4256 */
4516 4257 VERIFY(nvlist_alloc(&nvroot, NV_UNIQUE_NAME, KM_SLEEP) == 0);
4517 4258 VERIFY(nvlist_add_string(nvroot, ZPOOL_CONFIG_TYPE,
4518 4259 VDEV_TYPE_ROOT) == 0);
4519 4260 VERIFY(nvlist_add_uint64(nvroot, ZPOOL_CONFIG_ID, 0ULL) == 0);
4520 4261 VERIFY(nvlist_add_uint64(nvroot, ZPOOL_CONFIG_GUID, pgid) == 0);
4521 4262 VERIFY(nvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN,
4522 4263 &nvtop, 1) == 0);
4523 4264
4524 4265 /*
4525 4266 * Replace the existing vdev_tree with the new root vdev in
4526 4267 * this pool's configuration (remove the old, add the new).
4527 4268 */
4528 4269 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, nvroot) == 0);
4529 4270 nvlist_free(nvroot);
4530 4271 return (config);
4531 4272 }
4532 4273
4533 4274 /*
4534 4275 * Walk the vdev tree and see if we can find a device with "better"
4535 4276 * configuration. A configuration is "better" if the label on that
4536 4277 * device has a more recent txg.
4537 4278 */
4538 4279 static void
4539 4280 spa_alt_rootvdev(vdev_t *vd, vdev_t **avd, uint64_t *txg)
4540 4281 {
4541 4282 for (int c = 0; c < vd->vdev_children; c++)
4542 4283 spa_alt_rootvdev(vd->vdev_child[c], avd, txg);
4543 4284
4544 4285 if (vd->vdev_ops->vdev_op_leaf) {
4545 4286 nvlist_t *label;
4546 4287 uint64_t label_txg;
4547 4288
4548 4289 if (vdev_disk_read_rootlabel(vd->vdev_physpath, vd->vdev_devid,
4549 4290 &label) != 0)
4550 4291 return;
4551 4292
4552 4293 VERIFY(nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
4553 4294 &label_txg) == 0);
4554 4295
4555 4296 /*
4556 4297 * Do we have a better boot device?
4557 4298 */
4558 4299 if (label_txg > *txg) {
4559 4300 *txg = label_txg;
4560 4301 *avd = vd;
4561 4302 }
4562 4303 nvlist_free(label);
4563 4304 }
4564 4305 }
4565 4306
4566 4307 /*
4567 4308 * Import a root pool.
4568 4309 *
4569 4310 * For x86. devpath_list will consist of devid and/or physpath name of
4570 4311 * the vdev (e.g. "id1,sd@SSEAGATE..." or "/pci@1f,0/ide@d/disk@0,0:a").
4571 4312 * The GRUB "findroot" command will return the vdev we should boot.
4572 4313 *
4573 4314  * For Sparc, devpath_list consists of the physpath name of the booting device,
4574 4315  * no matter whether the rootpool is a single-device pool or a mirrored pool.
4575 4316 * e.g.
4576 4317 * "/pci@1f,0/ide@d/disk@0,0:a"
4577 4318 */
4578 4319 int
4579 4320 spa_import_rootpool(char *devpath, char *devid)
4580 4321 {
4581 4322 spa_t *spa;
4582 4323 vdev_t *rvd, *bvd, *avd = NULL;
4583 4324 nvlist_t *config, *nvtop;
4584 4325 uint64_t guid, txg;
4585 4326 char *pname;
4586 4327 int error;
4587 4328
4588 4329 /*
4589 4330 * Read the label from the boot device and generate a configuration.
4590 4331 */
4591 4332 config = spa_generate_rootconf(devpath, devid, &guid);
4592 4333 #if defined(_OBP) && defined(_KERNEL)
4593 4334 if (config == NULL) {
4594 4335 if (strstr(devpath, "/iscsi/ssd") != NULL) {
4595 4336 /* iscsi boot */
4596 4337 get_iscsi_bootpath_phy(devpath);
4597 4338 config = spa_generate_rootconf(devpath, devid, &guid);
4598 4339 }
4599 4340 }
4600 4341 #endif
4601 4342 if (config == NULL) {
4602 4343 cmn_err(CE_NOTE, "Cannot read the pool label from '%s'",
4603 4344 devpath);
4604 4345 return (SET_ERROR(EIO));
4605 4346 }
4606 4347
4607 4348 VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME,
4608 4349 &pname) == 0);
4609 4350 VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
4610 4351
4611 4352 mutex_enter(&spa_namespace_lock);
4612 - if ((spa = spa_lookup(pname)) != NULL) {
4353 + if ((spa = spa_lookup(pname)) != NULL || spa_config_guid_exists(guid)) {
4613 4354 /*
4614 4355 * Remove the existing root pool from the namespace so that we
4615 4356 * can replace it with the correct config we just read in.
4616 4357 */
4617 4358 spa_remove(spa);
4618 4359 }
4619 4360
4620 4361 spa = spa_add(pname, config, NULL);
4621 4362 spa->spa_is_root = B_TRUE;
4622 4363 spa->spa_import_flags = ZFS_IMPORT_VERBATIM;
4623 - if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION,
4624 - &spa->spa_ubsync.ub_version) != 0)
4625 - spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL;
4626 4364
4627 4365 /*
4628 4366 * Build up a vdev tree based on the boot device's label config.
4629 4367 */
4630 4368 VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4631 4369 &nvtop) == 0);
4632 4370 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4633 4371 error = spa_config_parse(spa, &rvd, nvtop, NULL, 0,
4634 4372 VDEV_ALLOC_ROOTPOOL);
4635 4373 spa_config_exit(spa, SCL_ALL, FTAG);
4636 4374 if (error) {
4637 4375 mutex_exit(&spa_namespace_lock);
4638 4376 nvlist_free(config);
4639 4377 cmn_err(CE_NOTE, "Can not parse the config for pool '%s'",
4640 4378 pname);
4641 4379 return (error);
4642 4380 }
4643 4381
4644 4382 /*
4645 4383 * Get the boot vdev.
4646 4384 */
4647 4385 if ((bvd = vdev_lookup_by_guid(rvd, guid)) == NULL) {
4648 4386 cmn_err(CE_NOTE, "Can not find the boot vdev for guid %llu",
4649 4387 (u_longlong_t)guid);
4650 4388 error = SET_ERROR(ENOENT);
4651 4389 goto out;
4652 4390 }
4653 4391
4654 4392 /*
4655 4393 * Determine if there is a better boot device.
4656 4394 */
4657 4395 avd = bvd;
4658 4396 spa_alt_rootvdev(rvd, &avd, &txg);
4659 4397 if (avd != bvd) {
4660 4398 cmn_err(CE_NOTE, "The boot device is 'degraded'. Please "
4661 4399 "try booting from '%s'", avd->vdev_path);
4662 4400 error = SET_ERROR(EINVAL);
4663 4401 goto out;
4664 4402 }
4665 4403
4666 4404 /*
4667 4405 * If the boot device is part of a spare vdev then ensure that
4668 4406 * we're booting off the active spare.
4669 4407 */
4670 4408 if (bvd->vdev_parent->vdev_ops == &vdev_spare_ops &&
4671 4409 !bvd->vdev_isspare) {
4672 4410 cmn_err(CE_NOTE, "The boot device is currently spared. Please "
4673 4411 "try booting from '%s'",
4674 4412 bvd->vdev_parent->
4675 4413 vdev_child[bvd->vdev_parent->vdev_children - 1]->vdev_path);
4676 4414 error = SET_ERROR(EINVAL);
4677 4415 goto out;
4678 4416 }
4679 4417
4680 4418 error = 0;
4681 4419 out:
4682 4420 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4683 4421 vdev_free(rvd);
4684 4422 spa_config_exit(spa, SCL_ALL, FTAG);
4685 4423 mutex_exit(&spa_namespace_lock);
4686 4424
4687 4425 nvlist_free(config);
4688 4426 return (error);
4689 4427 }
4690 4428
4691 4429 #endif
4692 4430
4693 4431 /*
4694 4432 * Import a non-root pool into the system.
4695 4433 */
4696 4434 int
4697 4435 spa_import(const char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
4698 4436 {
4699 4437 spa_t *spa;
4700 4438 char *altroot = NULL;
4701 4439 spa_load_state_t state = SPA_LOAD_IMPORT;
4702 4440 zpool_rewind_policy_t policy;
4703 4441 uint64_t mode = spa_mode_global;
4704 4442 uint64_t readonly = B_FALSE;
4705 4443 int error;
4706 4444 nvlist_t *nvroot;
4707 4445 nvlist_t **spares, **l2cache;
4708 4446 uint_t nspares, nl2cache;
4447 + uint64_t guid;
4709 4448
4449 + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid) != 0)
4450 + return (SET_ERROR(EINVAL));
4451 +
4710 4452 /*
4711 4453 * If a pool with this name exists, return failure.
4712 4454 */
4713 4455 mutex_enter(&spa_namespace_lock);
4714 - if (spa_lookup(pool) != NULL) {
4456 + if (spa_lookup(pool) != NULL || spa_config_guid_exists(guid)) {
4715 4457 mutex_exit(&spa_namespace_lock);
4716 4458 return (SET_ERROR(EEXIST));
4717 4459 }
4718 4460
4719 4461 /*
4720 4462 * Create and initialize the spa structure.
4721 4463 */
4722 4464 (void) nvlist_lookup_string(props,
4723 4465 zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
4724 4466 (void) nvlist_lookup_uint64(props,
4725 4467 zpool_prop_to_name(ZPOOL_PROP_READONLY), &readonly);
4726 4468 if (readonly)
4727 4469 mode = FREAD;
4728 4470 spa = spa_add(pool, config, altroot);
4729 4471 spa->spa_import_flags = flags;
4730 4472
4731 4473 /*
4732 4474 * Verbatim import - Take a pool and insert it into the namespace
4733 4475 * as if it had been loaded at boot.
4734 4476 */
4735 4477 if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) {
4736 4478 if (props != NULL)
4737 4479 spa_configfile_set(spa, props, B_FALSE);
4738 4480
4739 - spa_write_cachefile(spa, B_FALSE, B_TRUE);
4481 + spa_config_sync(spa, B_FALSE, B_TRUE);
4740 4482 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4741 - zfs_dbgmsg("spa_import: verbatim import of %s", pool);
4483 +
4742 4484 mutex_exit(&spa_namespace_lock);
4743 4485 return (0);
4744 4486 }
4745 4487
4746 4488 spa_activate(spa, mode);
4747 4489
4748 4490 /*
4749 4491 * Don't start async tasks until we know everything is healthy.
4750 4492 */
4751 4493 spa_async_suspend(spa);
4752 4494
4753 4495 zpool_get_rewind_policy(config, &policy);
4754 4496 if (policy.zrp_request & ZPOOL_DO_REWIND)
4755 4497 state = SPA_LOAD_RECOVER;
4756 4498
4757 - spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT;
4758 -
4759 - if (state != SPA_LOAD_RECOVER) {
4499 + /*
4500 + * Pass off the heavy lifting to spa_load(). Pass TRUE for mosconfig
4501 + * because the user-supplied config is actually the one to trust when
4502 + * doing an import.
4503 + */
4504 + if (state != SPA_LOAD_RECOVER)
4760 4505 spa->spa_last_ubsync_txg = spa->spa_load_txg = 0;
4761 - zfs_dbgmsg("spa_import: importing %s", pool);
4762 - } else {
4763 - zfs_dbgmsg("spa_import: importing %s, max_txg=%lld "
4764 - "(RECOVERY MODE)", pool, (longlong_t)policy.zrp_txg);
4765 - }
4766 - error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request);
4767 4506
4507 + error = spa_load_best(spa, state, B_TRUE, policy.zrp_txg,
4508 + policy.zrp_request);
4509 +
4768 4510 /*
4769 4511 * Propagate anything learned while loading the pool and pass it
4770 4512 * back to caller (i.e. rewind info, missing devices, etc).
4771 4513 */
4772 4514 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4773 4515 spa->spa_load_info) == 0);
4774 4516
4775 4517 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4776 4518 /*
4777 4519 * Toss any existing sparelist, as it doesn't have any validity
4778 4520 * anymore, and conflicts with spa_has_spare().
4779 4521 */
4780 4522 if (spa->spa_spares.sav_config) {
4781 4523 nvlist_free(spa->spa_spares.sav_config);
4782 4524 spa->spa_spares.sav_config = NULL;
4783 4525 spa_load_spares(spa);
4784 4526 }
4785 4527 if (spa->spa_l2cache.sav_config) {
4786 4528 nvlist_free(spa->spa_l2cache.sav_config);
4787 4529 spa->spa_l2cache.sav_config = NULL;
4788 4530 spa_load_l2cache(spa);
4789 4531 }
4790 4532
4791 4533 VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
4792 4534 &nvroot) == 0);
4793 4535 if (error == 0)
4794 4536 error = spa_validate_aux(spa, nvroot, -1ULL,
4795 4537 VDEV_ALLOC_SPARE);
4796 4538 if (error == 0)
4797 4539 error = spa_validate_aux(spa, nvroot, -1ULL,
4798 4540 VDEV_ALLOC_L2CACHE);
4799 4541 spa_config_exit(spa, SCL_ALL, FTAG);
4800 4542
4801 4543 if (props != NULL)
4802 4544 spa_configfile_set(spa, props, B_FALSE);
|
↓ open down ↓ |
25 lines elided |
↑ open up ↑ |
4803 4545
4804 4546 if (error != 0 || (props && spa_writeable(spa) &&
4805 4547 (error = spa_prop_set(spa, props)))) {
4806 4548 spa_unload(spa);
4807 4549 spa_deactivate(spa);
4808 4550 spa_remove(spa);
4809 4551 mutex_exit(&spa_namespace_lock);
4810 4552 return (error);
4811 4553 }
4812 4554
4813 - spa_async_resume(spa);
4814 -
4815 4555 /*
4816 4556 * Override any spares and level 2 cache devices as specified by
4817 4557 * the user, as these may have correct device names/devids, etc.
4818 4558 */
4819 4559 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
4820 4560 &spares, &nspares) == 0) {
4821 4561 if (spa->spa_spares.sav_config)
4822 4562 VERIFY(nvlist_remove(spa->spa_spares.sav_config,
4823 4563 ZPOOL_CONFIG_SPARES, DATA_TYPE_NVLIST_ARRAY) == 0);
4824 4564 else
4825 4565 VERIFY(nvlist_alloc(&spa->spa_spares.sav_config,
4826 4566 NV_UNIQUE_NAME, KM_SLEEP) == 0);
4827 4567 VERIFY(nvlist_add_nvlist_array(spa->spa_spares.sav_config,
4828 4568 ZPOOL_CONFIG_SPARES, spares, nspares) == 0);
4829 4569 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4830 4570 spa_load_spares(spa);
4831 4571 spa_config_exit(spa, SCL_ALL, FTAG);
4832 4572 spa->spa_spares.sav_sync = B_TRUE;
4833 4573 }
4834 4574 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE,
4835 4575 &l2cache, &nl2cache) == 0) {
4836 4576 if (spa->spa_l2cache.sav_config)
4837 4577 VERIFY(nvlist_remove(spa->spa_l2cache.sav_config,
4838 4578 ZPOOL_CONFIG_L2CACHE, DATA_TYPE_NVLIST_ARRAY) == 0);
4839 4579 else
4840 4580 VERIFY(nvlist_alloc(&spa->spa_l2cache.sav_config,
4841 4581 NV_UNIQUE_NAME, KM_SLEEP) == 0);
4842 4582 VERIFY(nvlist_add_nvlist_array(spa->spa_l2cache.sav_config,
4843 4583 ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache) == 0);
4844 4584 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
4845 4585 spa_load_l2cache(spa);
4846 4586 spa_config_exit(spa, SCL_ALL, FTAG);
4847 4587 spa->spa_l2cache.sav_sync = B_TRUE;
4848 4588 }
4849 4589
4590 + /* At this point, we can load spare props */
4591 + (void) spa_load_vdev_props(spa);
4592 +
4850 4593 /*
4851 4594 * Check for any removed devices.
4852 4595 */
4853 4596 if (spa->spa_autoreplace) {
4854 4597 spa_aux_check_removed(&spa->spa_spares);
4855 4598 spa_aux_check_removed(&spa->spa_l2cache);
4856 4599 }
4857 4600
4858 4601 if (spa_writeable(spa)) {
4859 4602 /*
4860 4603 * Update the config cache to include the newly-imported pool.
4861 4604 */
4862 4605 spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
4863 4606 }
4864 4607
4865 4608 /*
4609 + * Start async resume as late as possible to reduce I/O activity when
4610 +	 * importing a pool.  This lets any pending txgs (e.g. from scrub
4611 +	 * or resilver) complete quickly, thereby reducing import times in
4612 +	 * such cases.
4613 + */
4614 + spa_async_resume(spa);
4615 +
4616 + /*
4866 4617 * It's possible that the pool was expanded while it was exported.
4867 4618 * We kick off an async task to handle this for us.
4868 4619 */
4869 4620 spa_async_request(spa, SPA_ASYNC_AUTOEXPAND);
4870 4621
4622 + /* Set/activate meta feature as needed */
4623 + if (!spa_writeable(spa))
4624 + spa_check_special_feature(spa);
4871 4625 spa_history_log_version(spa, "import");
4872 4626
4873 4627 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT);
4874 4628
4875 4629 mutex_exit(&spa_namespace_lock);
4876 4630
4877 - return (0);
4631 + if (!spa_writeable(spa))
4632 + return (0);
4633 +
4634 + wbc_activate(spa, B_FALSE);
4635 +
4636 + return (dsl_sync_task(spa->spa_name, NULL, spa_special_feature_activate,
4637 + spa, 3, ZFS_SPACE_CHECK_RESERVED));
4878 4638 }
4879 4639
4880 4640 nvlist_t *
4881 4641 spa_tryimport(nvlist_t *tryconfig)
4882 4642 {
4883 4643 nvlist_t *config = NULL;
4884 - char *poolname, *cachefile;
4644 + char *poolname;
4885 4645 spa_t *spa;
4886 4646 uint64_t state;
4887 4647 int error;
4888 - zpool_rewind_policy_t policy;
4889 4648
4890 4649 if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname))
4891 4650 return (NULL);
4892 4651
4893 4652 if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
4894 4653 return (NULL);
4895 4654
4896 4655 /*
4897 4656 * Create and initialize the spa structure.
4898 4657 */
4899 4658 mutex_enter(&spa_namespace_lock);
4900 4659 spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL);
4901 4660 spa_activate(spa, FREAD);
4902 4661
4903 4662 /*
4904 - * Rewind pool if a max txg was provided. Note that even though we
4905 - * retrieve the complete rewind policy, only the rewind txg is relevant
4906 - * for tryimport.
4663 + * Pass off the heavy lifting to spa_load().
4664 + * Pass TRUE for mosconfig because the user-supplied config
4665 + * is actually the one to trust when doing an import.
4907 4666 */
4908 - zpool_get_rewind_policy(spa->spa_config, &policy);
4909 - if (policy.zrp_txg != UINT64_MAX) {
4910 - spa->spa_load_max_txg = policy.zrp_txg;
4911 - spa->spa_extreme_rewind = B_TRUE;
4912 - zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld",
4913 - poolname, (longlong_t)policy.zrp_txg);
4914 - } else {
4915 - zfs_dbgmsg("spa_tryimport: importing %s", poolname);
4916 - }
4667 + error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING, B_TRUE);
4917 4668
4918 - if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile)
4919 - == 0) {
4920 - zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile);
4921 - spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE;
4922 - } else {
4923 - spa->spa_config_source = SPA_CONFIG_SRC_SCAN;
4924 - }
4925 -
4926 - error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING);
4927 -
4928 4669 /*
4929 4670 * If 'tryconfig' was at least parsable, return the current config.
4930 4671 */
4931 4672 if (spa->spa_root_vdev != NULL) {
4932 4673 config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
4933 4674 VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
4934 4675 poolname) == 0);
4935 4676 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
4936 4677 state) == 0);
4937 4678 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TIMESTAMP,
4938 4679 spa->spa_uberblock.ub_timestamp) == 0);
4939 4680 VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
4940 4681 spa->spa_load_info) == 0);
4941 4682
4942 4683 /*
4943 4684 * If the bootfs property exists on this pool then we
4944 4685 * copy it out so that external consumers can tell which
4945 4686 * pools are bootable.
4946 4687 */
4947 4688 if ((!error || error == EEXIST) && spa->spa_bootfs) {
4948 4689 char *tmpname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
4949 4690
4950 4691 /*
4951 4692 * We have to play games with the name since the
4952 4693 * pool was opened as TRYIMPORT_NAME.
4953 4694 */
4954 4695 if (dsl_dsobj_to_dsname(spa_name(spa),
4955 4696 spa->spa_bootfs, tmpname) == 0) {
4956 4697 char *cp;
4957 4698 char *dsname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
4958 4699
4959 4700 cp = strchr(tmpname, '/');
4960 4701 if (cp == NULL) {
4961 4702 (void) strlcpy(dsname, tmpname,
4962 4703 MAXPATHLEN);
4963 4704 } else {
4964 4705 (void) snprintf(dsname, MAXPATHLEN,
4965 4706 "%s/%s", poolname, ++cp);
4966 4707 }
4967 4708 VERIFY(nvlist_add_string(config,
4968 4709 ZPOOL_CONFIG_BOOTFS, dsname) == 0);
4969 4710 kmem_free(dsname, MAXPATHLEN);
4970 4711 }
4971 4712 kmem_free(tmpname, MAXPATHLEN);
4972 4713 }
4973 4714
4974 4715 /*
4975 4716 * Add the list of hot spares and level 2 cache devices.
4976 4717 */
4977 4718 spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
4978 4719 spa_add_spares(spa, config);
4979 4720 spa_add_l2cache(spa, config);
4980 4721 spa_config_exit(spa, SCL_CONFIG, FTAG);
4981 4722 }
4982 4723
4983 4724 spa_unload(spa);
4984 4725 spa_deactivate(spa);
4985 4726 spa_remove(spa);
4986 4727 mutex_exit(&spa_namespace_lock);
4987 4728
4988 4729 return (config);
4989 4730 }
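
A minimal sketch of a spa_tryimport() caller; the function name and the logging are illustrative assumptions:

/*
 * Illustrative only: tryimport never leaves the pool loaded; the
 * caller owns (and must free) the returned config.
 */
static void
example_try_import(nvlist_t *tryconfig)
{
	nvlist_t *config;
	char *name;

	if ((config = spa_tryimport(tryconfig)) == NULL)
		return;		/* tryconfig was not even parsable */

	if (nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME, &name) == 0)
		cmn_err(CE_NOTE, "importable pool: %s", name);

	nvlist_free(config);
}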
4990 4731
4991 4732 /*
4992 4733 * Pool export/destroy
4993 4734 *
4994 4735 * The act of destroying or exporting a pool is very simple. We make sure there
4995 4736 * is no more pending I/O and any references to the pool are gone. Then, we
4996 4737 * update the pool state and sync all the labels to disk, removing the
4997 4738 * configuration from the cache afterwards. If the 'hardforce' flag is set, then
4998 4739 * we don't sync the labels or remove the configuration cache.
4999 4740 */
5000 4741 static int
5001 4742 spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
5002 - boolean_t force, boolean_t hardforce)
4743 + boolean_t force, boolean_t hardforce, boolean_t saveconfig)
5003 4744 {
5004 4745 spa_t *spa;
4746 + zfs_autosnap_t *autosnap;
4747 + boolean_t wbcthr_stopped = B_FALSE;
5005 4748
5006 4749 if (oldconfig)
5007 4750 *oldconfig = NULL;
5008 4751
5009 4752 if (!(spa_mode_global & FWRITE))
5010 4753 return (SET_ERROR(EROFS));
5011 4754
5012 4755 mutex_enter(&spa_namespace_lock);
5013 4756 if ((spa = spa_lookup(pool)) == NULL) {
5014 4757 mutex_exit(&spa_namespace_lock);
5015 4758 return (SET_ERROR(ENOENT));
5016 4759 }
5017 4760
5018 4761 /*
5019 - * Put a hold on the pool, drop the namespace lock, stop async tasks,
5020 - * reacquire the namespace lock, and see if we can export.
4762 + * Put a hold on the pool, drop the namespace lock, stop async tasks
4763 +	 * and the write cache thread, reacquire the namespace lock, and see
4764 + * if we can export.
5021 4765 */
5022 4766 spa_open_ref(spa, FTAG);
5023 4767 mutex_exit(&spa_namespace_lock);
4768 +
4769 + autosnap = spa_get_autosnap(spa);
4770 + mutex_enter(&autosnap->autosnap_lock);
4771 +
4772 + if (autosnap_has_children_zone(autosnap,
4773 + spa_name(spa), B_TRUE)) {
4774 + mutex_exit(&autosnap->autosnap_lock);
4775 + spa_close(spa, FTAG);
4776 +		return (SET_ERROR(EBUSY));
4777 + }
4778 +
4779 + mutex_exit(&autosnap->autosnap_lock);
4780 +
4781 + wbcthr_stopped = wbc_stop_thread(spa); /* stop write cache thread */
4782 + autosnap_destroyer_thread_stop(spa);
5024 4783 spa_async_suspend(spa);
5025 4784 mutex_enter(&spa_namespace_lock);
5026 4785 spa_close(spa, FTAG);
5027 4786
5028 4787 /*
5029 4788 * The pool will be in core if it's openable,
5030 4789 * in which case we can modify its state.
5031 4790 */
5032 4791 if (spa->spa_state != POOL_STATE_UNINITIALIZED && spa->spa_sync_on) {
5033 4792 /*
5034 4793 * Objsets may be open only because they're dirty, so we
5035 4794 * have to force it to sync before checking spa_refcnt.
5036 4795 */
5037 4796 txg_wait_synced(spa->spa_dsl_pool, 0);
5038 4797 spa_evicting_os_wait(spa);
5039 4798
5040 4799 /*
5041 4800 * A pool cannot be exported or destroyed if there are active
5042 4801 * references. If we are resetting a pool, allow references by
5043 4802 * fault injection handlers.
5044 4803 */
5045 4804 if (!spa_refcount_zero(spa) ||
5046 4805 (spa->spa_inject_ref != 0 &&
5047 4806 new_state != POOL_STATE_UNINITIALIZED)) {
5048 4807 spa_async_resume(spa);
5049 4808 mutex_exit(&spa_namespace_lock);
4809 + if (wbcthr_stopped)
4810 + (void) wbc_start_thread(spa);
4811 + autosnap_destroyer_thread_start(spa);
5050 4812 return (SET_ERROR(EBUSY));
5051 4813 }
5052 4814
5053 4815 /*
5054 4816 * A pool cannot be exported if it has an active shared spare.
5055 4817 * This is to prevent other pools stealing the active spare
5056 4818 	 * from an exported pool. At the user's discretion, such a
5057 4819 	 * pool can be forcibly exported.
5058 4820 */
5059 4821 if (!force && new_state == POOL_STATE_EXPORTED &&
5060 4822 spa_has_active_shared_spare(spa)) {
5061 4823 spa_async_resume(spa);
5062 4824 mutex_exit(&spa_namespace_lock);
4825 + if (wbcthr_stopped)
4826 + (void) wbc_start_thread(spa);
4827 + autosnap_destroyer_thread_start(spa);
5063 4828 return (SET_ERROR(EXDEV));
5064 4829 }
5065 4830
5066 4831 /*
5067 4832 * We want this to be reflected on every label,
5068 4833 * so mark them all dirty. spa_unload() will do the
5069 4834 * final sync that pushes these changes out.
5070 4835 */
5071 4836 if (new_state != POOL_STATE_UNINITIALIZED && !hardforce) {
5072 4837 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
5073 4838 spa->spa_state = new_state;
[1 line elided]
5074 4839 spa->spa_final_txg = spa_last_synced_txg(spa) +
5075 4840 TXG_DEFER_SIZE + 1;
5076 4841 vdev_config_dirty(spa->spa_root_vdev);
5077 4842 spa_config_exit(spa, SCL_ALL, FTAG);
5078 4843 }
5079 4844 }
5080 4845
5081 4846 spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY);
5082 4847
5083 4848 if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
4849 + wbc_deactivate(spa);
4850 +
5084 4851 spa_unload(spa);
5085 4852 spa_deactivate(spa);
5086 4853 }
5087 4854
5088 4855 if (oldconfig && spa->spa_config)
5089 4856 VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0);
5090 4857
5091 4858 if (new_state != POOL_STATE_UNINITIALIZED) {
5092 4859 if (!hardforce)
5093 - spa_write_cachefile(spa, B_TRUE, B_TRUE);
4860 + spa_config_sync(spa, !saveconfig, B_TRUE);
4861 +
5094 4862 spa_remove(spa);
5095 4863 }
5096 4864 mutex_exit(&spa_namespace_lock);
5097 4865
5098 4866 return (0);
5099 4867 }
5100 4868
5101 4869 /*
5102 4870 * Destroy a storage pool.
5103 4871 */
5104 4872 int
5105 4873 spa_destroy(char *pool)
5106 4874 {
5107 4875 return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL,
5108 - B_FALSE, B_FALSE));
4876 + B_FALSE, B_FALSE, B_FALSE));
5109 4877 }
5110 4878
5111 4879 /*
5112 4880 * Export a storage pool.
5113 4881 */
5114 4882 int
5115 4883 spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
5116 - boolean_t hardforce)
4884 + boolean_t hardforce, boolean_t saveconfig)
5117 4885 {
5118 4886 return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig,
5119 - force, hardforce));
4887 + force, hardforce, saveconfig));
5120 4888 }
5121 4889
5122 4890 /*
5123 4891 * Similar to spa_export(), this unloads the spa_t without actually removing it
5124 4892 * from the namespace in any way.
5125 4893 */
5126 4894 int
5127 4895 spa_reset(char *pool)
5128 4896 {
5129 4897 return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL,
5130 - B_FALSE, B_FALSE));
4898 + B_FALSE, B_FALSE, B_FALSE));
5131 4899 }
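
All three entry points above funnel into spa_export_common(); the new saveconfig argument is threaded through to spa_config_sync() as !saveconfig, so saveconfig == B_TRUE keeps the pool's cachefile entry on a plain (non-hardforce) export. A minimal kernel-context sketch; the pool name is hypothetical and the ioctl plumbing is not shown:

    char poolname[] = "tank";		/* hypothetical pool name */
    nvlist_t *oldconfig = NULL;
    int err;

    err = spa_export(poolname, &oldconfig,
        B_FALSE,	/* force */
        B_FALSE,	/* hardforce */
        B_TRUE);	/* saveconfig: keep the cachefile entry */
    if (err == EBUSY) {
    	/* active references or autosnap children (see above) */
    } else if (err == EXDEV) {
    	/* an active shared spare; retry with force if intended */
    }
    if (oldconfig != NULL)
    	nvlist_free(oldconfig);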
5132 4900
5133 4901 /*
5134 4902 * ==========================================================================
5135 4903 * Device manipulation
5136 4904 * ==========================================================================
5137 4905 */
5138 4906
5139 4907 /*
5140 4908 * Add a device to a storage pool.
5141 4909 */
5142 4910 int
5143 4911 spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
5144 4912 {
5145 4913 uint64_t txg, id;
5146 4914 int error;
5147 4915 vdev_t *rvd = spa->spa_root_vdev;
5148 4916 vdev_t *vd, *tvd;
5149 4917 nvlist_t **spares, **l2cache;
5150 4918 uint_t nspares, nl2cache;
4919 + dmu_tx_t *tx = NULL;
5151 4920
5152 4921 ASSERT(spa_writeable(spa));
5153 4922
5154 4923 txg = spa_vdev_enter(spa);
5155 4924
5156 4925 if ((error = spa_config_parse(spa, &vd, nvroot, NULL, 0,
5157 4926 VDEV_ALLOC_ADD)) != 0)
5158 4927 return (spa_vdev_exit(spa, NULL, txg, error));
5159 4928
5160 4929 spa->spa_pending_vdev = vd; /* spa_vdev_exit() will clear this */
5161 4930
5162 4931 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, &spares,
5163 4932 &nspares) != 0)
5164 4933 nspares = 0;
5165 4934
5166 4935 if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_L2CACHE, &l2cache,
5167 4936 &nl2cache) != 0)
5168 4937 nl2cache = 0;
5169 4938
5170 4939 if (vd->vdev_children == 0 && nspares == 0 && nl2cache == 0)
5171 4940 return (spa_vdev_exit(spa, vd, txg, EINVAL));
5172 4941
5173 4942 if (vd->vdev_children != 0 &&
5174 4943 (error = vdev_create(vd, txg, B_FALSE)) != 0)
[14 lines elided]
5175 4944 return (spa_vdev_exit(spa, vd, txg, error));
5176 4945
5177 4946 /*
5178 4947 * We must validate the spares and l2cache devices after checking the
5179 4948 * children. Otherwise, vdev_inuse() will blindly overwrite the spare.
5180 4949 */
5181 4950 if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0)
5182 4951 return (spa_vdev_exit(spa, vd, txg, error));
5183 4952
5184 4953 /*
5185 - * If we are in the middle of a device removal, we can only add
5186 - * devices which match the existing devices in the pool.
5187 - * If we are in the middle of a removal, or have some indirect
5188 - * vdevs, we can not add raidz toplevels.
4954 + * Transfer each new top-level vdev from vd to rvd.
5189 4955 */
5190 - if (spa->spa_vdev_removal != NULL ||
5191 - spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5192 - for (int c = 0; c < vd->vdev_children; c++) {
5193 - tvd = vd->vdev_child[c];
5194 - if (spa->spa_vdev_removal != NULL &&
5195 - tvd->vdev_ashift !=
5196 - spa->spa_vdev_removal->svr_vdev->vdev_ashift) {
5197 - return (spa_vdev_exit(spa, vd, txg, EINVAL));
5198 - }
5199 - /* Fail if top level vdev is raidz */
5200 - if (tvd->vdev_ops == &vdev_raidz_ops) {
5201 - return (spa_vdev_exit(spa, vd, txg, EINVAL));
5202 - }
5203 - /*
5204 - * Need the top level mirror to be
5205 - * a mirror of leaf vdevs only
5206 - */
5207 - if (tvd->vdev_ops == &vdev_mirror_ops) {
5208 - for (uint64_t cid = 0;
5209 - cid < tvd->vdev_children; cid++) {
5210 - vdev_t *cvd = tvd->vdev_child[cid];
5211 - if (!cvd->vdev_ops->vdev_op_leaf) {
5212 - return (spa_vdev_exit(spa, vd,
5213 - txg, EINVAL));
5214 - }
5215 - }
5216 - }
5217 - }
5218 - }
5219 -
5220 4956 for (int c = 0; c < vd->vdev_children; c++) {
5221 4957
5222 4958 /*
5223 4959 * Set the vdev id to the first hole, if one exists.
5224 4960 */
5225 4961 for (id = 0; id < rvd->vdev_children; id++) {
5226 4962 if (rvd->vdev_child[id]->vdev_ishole) {
5227 4963 vdev_free(rvd->vdev_child[id]);
5228 4964 break;
5229 4965 }
5230 4966 }
5231 4967 tvd = vd->vdev_child[c];
5232 4968 vdev_remove_child(vd, tvd);
5233 4969 tvd->vdev_id = id;
5234 4970 vdev_add_child(rvd, tvd);
5235 4971 vdev_config_dirty(tvd);
5236 4972 }
5237 4973
5238 4974 if (nspares != 0) {
5239 4975 spa_set_aux_vdevs(&spa->spa_spares, spares, nspares,
5240 4976 ZPOOL_CONFIG_SPARES);
5241 4977 spa_load_spares(spa);
5242 4978 spa->spa_spares.sav_sync = B_TRUE;
5243 4979 }
5244 4980
5245 4981 if (nl2cache != 0) {
5246 4982 spa_set_aux_vdevs(&spa->spa_l2cache, l2cache, nl2cache,
5247 4983 ZPOOL_CONFIG_L2CACHE);
5248 4984 spa_load_l2cache(spa);
5249 4985 spa->spa_l2cache.sav_sync = B_TRUE;
5250 4986 }
5251 4987
5252 4988 /*
5253 4989 * We have to be careful when adding new vdevs to an existing pool.
5254 4990 * If other threads start allocating from these vdevs before we
5255 4991 * sync the config cache, and we lose power, then upon reboot we may
5256 4992 * fail to open the pool because there are DVAs that the config cache
5257 4993 * can't translate. Therefore, we first add the vdevs without
5258 4994 * initializing metaslabs; sync the config cache (via spa_vdev_exit());
5259 4995 * and then let spa_config_update() initialize the new metaslabs.
5260 4996 *
5261 4997 * spa_load() checks for added-but-not-initialized vdevs, so that
[32 lines elided]
5262 4998 * if we lose power at any point in this sequence, the remaining
5263 4999 * steps will be completed the next time we load the pool.
5264 5000 */
5265 5001 (void) spa_vdev_exit(spa, vd, txg, 0);
5266 5002
5267 5003 mutex_enter(&spa_namespace_lock);
5268 5004 spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
5269 5005 spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD);
5270 5006 mutex_exit(&spa_namespace_lock);
5271 5007
5008 + /*
5009 + * "spa_last_synced_txg(spa) + 1" is used because:
5010 + * - spa_vdev_exit() calls txg_wait_synced() for "txg"
5011 + * - spa_config_update() calls txg_wait_synced() for
5012 + * "spa_last_synced_txg(spa) + 1"
5013 + */
5014 + tx = dmu_tx_create_assigned(spa_get_dsl(spa),
5015 + spa_last_synced_txg(spa) + 1);
5016 + spa_special_feature_activate(spa, tx);
5017 + dmu_tx_commit(tx);
5018 +
5019 + wbc_activate(spa, B_FALSE);
5020 +
5272 5021 return (0);
5273 5022 }
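
For reference, a sketch of the nvroot shape spa_vdev_add() parses: a "root" nvlist whose ZPOOL_CONFIG_CHILDREN are the new top-level vdevs, with optional ZPOOL_CONFIG_SPARES / ZPOOL_CONFIG_L2CACHE arrays alongside. The device path is a placeholder, and spa/error are assumed to be in scope; in practice zpool(1M)/libzfs builds this tree (labels, ashift, and so on) before the ioctl reaches this function:

    nvlist_t *disk = fnvlist_alloc();
    fnvlist_add_string(disk, ZPOOL_CONFIG_TYPE, VDEV_TYPE_DISK);
    fnvlist_add_string(disk, ZPOOL_CONFIG_PATH, "/dev/dsk/c0t1d0s0");

    nvlist_t *nvroot = fnvlist_alloc();
    fnvlist_add_string(nvroot, ZPOOL_CONFIG_TYPE, VDEV_TYPE_ROOT);
    fnvlist_add_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN, &disk, 1);

    error = spa_vdev_add(spa, nvroot);

    fnvlist_free(disk);		/* the arrays were copied into nvroot */
    fnvlist_free(nvroot);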
5274 5023
5275 5024 /*
5276 5025 * Attach a device to a mirror. The arguments are the path to any device
5277 5026 * in the mirror, and the nvroot for the new device. If the path specifies
5278 5027 * a device that is not mirrored, we automatically insert the mirror vdev.
5279 5028 *
5280 5029 * If 'replacing' is specified, the new device is intended to replace the
5281 5030 * existing device; in this case the two devices are made into their own
5282 5031 * mirror using the 'replacing' vdev, which is functionally identical to
5283 5032 * the mirror vdev (it actually reuses all the same ops) but has a few
5284 5033 * extra rules: you can't attach to it after it's been created, and upon
5285 5034 * completion of resilvering, the first disk (the one being replaced)
5286 5035 * is automatically detached.
5287 5036 */
5288 5037 int
5289 5038 spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
5290 5039 {
5291 5040 uint64_t txg, dtl_max_txg;
5292 5041 vdev_t *rvd = spa->spa_root_vdev;
5293 5042 vdev_t *oldvd, *newvd, *newrootvd, *pvd, *tvd;
5294 5043 vdev_ops_t *pvops;
[13 lines elided]
5295 5044 char *oldvdpath, *newvdpath;
5296 5045 int newvd_isspare;
5297 5046 int error;
5298 5047
5299 5048 ASSERT(spa_writeable(spa));
5300 5049
5301 5050 txg = spa_vdev_enter(spa);
5302 5051
5303 5052 oldvd = spa_lookup_by_guid(spa, guid, B_FALSE);
5304 5053
5305 - if (spa->spa_vdev_removal != NULL ||
5306 - spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
5307 - return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5308 - }
5309 -
5310 5054 if (oldvd == NULL)
5311 5055 return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5312 5056
5313 5057 if (!oldvd->vdev_ops->vdev_op_leaf)
5314 5058 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5315 5059
5316 5060 pvd = oldvd->vdev_parent;
5317 5061
5318 5062 if ((error = spa_config_parse(spa, &newrootvd, nvroot, NULL, 0,
5319 5063 VDEV_ALLOC_ATTACH)) != 0)
5320 5064 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5321 5065
5322 5066 if (newrootvd->vdev_children != 1)
5323 5067 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5324 5068
5325 5069 newvd = newrootvd->vdev_child[0];
5326 5070
5327 5071 if (!newvd->vdev_ops->vdev_op_leaf)
5328 5072 return (spa_vdev_exit(spa, newrootvd, txg, EINVAL));
5329 5073
5330 5074 if ((error = vdev_create(newrootvd, txg, replacing)) != 0)
5331 5075 return (spa_vdev_exit(spa, newrootvd, txg, error));
5332 5076
5333 5077 /*
5334 5078 * Spares can't replace logs
5335 5079 */
5336 5080 if (oldvd->vdev_top->vdev_islog && newvd->vdev_isspare)
5337 5081 return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5338 5082
5339 5083 if (!replacing) {
5340 5084 /*
5341 5085 * For attach, the only allowable parent is a mirror or the root
5342 5086 * vdev.
5343 5087 */
5344 5088 if (pvd->vdev_ops != &vdev_mirror_ops &&
5345 5089 pvd->vdev_ops != &vdev_root_ops)
5346 5090 return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5347 5091
5348 5092 pvops = &vdev_mirror_ops;
5349 5093 } else {
5350 5094 /*
5351 5095 * Active hot spares can only be replaced by inactive hot
5352 5096 * spares.
5353 5097 */
5354 5098 if (pvd->vdev_ops == &vdev_spare_ops &&
5355 5099 oldvd->vdev_isspare &&
5356 5100 !spa_has_spare(spa, newvd->vdev_guid))
5357 5101 return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5358 5102
5359 5103 /*
5360 5104 * If the source is a hot spare, and the parent isn't already a
5361 5105 * spare, then we want to create a new hot spare. Otherwise, we
5362 5106 * want to create a replacing vdev. The user is not allowed to
5363 5107 * attach to a spared vdev child unless the 'isspare' state is
5364 5108 * the same (spare replaces spare, non-spare replaces
5365 5109 * non-spare).
5366 5110 */
5367 5111 if (pvd->vdev_ops == &vdev_replacing_ops &&
5368 5112 spa_version(spa) < SPA_VERSION_MULTI_REPLACE) {
5369 5113 return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5370 5114 } else if (pvd->vdev_ops == &vdev_spare_ops &&
5371 5115 newvd->vdev_isspare != oldvd->vdev_isspare) {
5372 5116 return (spa_vdev_exit(spa, newrootvd, txg, ENOTSUP));
5373 5117 }
5374 5118
5375 5119 if (newvd->vdev_isspare)
5376 5120 pvops = &vdev_spare_ops;
5377 5121 else
5378 5122 pvops = &vdev_replacing_ops;
5379 5123 }
5380 5124
5381 5125 /*
5382 5126 * Make sure the new device is big enough.
5383 5127 */
5384 5128 if (newvd->vdev_asize < vdev_get_min_asize(oldvd))
5385 5129 return (spa_vdev_exit(spa, newrootvd, txg, EOVERFLOW));
5386 5130
5387 5131 /*
5388 5132 * The new device cannot have a higher alignment requirement
5389 5133 * than the top-level vdev.
5390 5134 */
5391 5135 if (newvd->vdev_ashift > oldvd->vdev_top->vdev_ashift)
5392 5136 return (spa_vdev_exit(spa, newrootvd, txg, EDOM));
5393 5137
5394 5138 /*
5395 5139 * If this is an in-place replacement, update oldvd's path and devid
5396 5140 * to make it distinguishable from newvd, and unopenable from now on.
5397 5141 */
5398 5142 if (strcmp(oldvd->vdev_path, newvd->vdev_path) == 0) {
5399 5143 spa_strfree(oldvd->vdev_path);
5400 5144 oldvd->vdev_path = kmem_alloc(strlen(newvd->vdev_path) + 5,
5401 5145 KM_SLEEP);
5402 5146 (void) sprintf(oldvd->vdev_path, "%s/%s",
5403 5147 newvd->vdev_path, "old");
5404 5148 if (oldvd->vdev_devid != NULL) {
5405 5149 spa_strfree(oldvd->vdev_devid);
5406 5150 oldvd->vdev_devid = NULL;
5407 5151 }
5408 5152 }
5409 5153
5410 5154 /* mark the device being resilvered */
5411 5155 newvd->vdev_resilver_txg = txg;
5412 5156
5413 5157 /*
5414 5158 * If the parent is not a mirror, or if we're replacing, insert the new
5415 5159 * mirror/replacing/spare vdev above oldvd.
5416 5160 */
5417 5161 if (pvd->vdev_ops != pvops)
5418 5162 pvd = vdev_add_parent(oldvd, pvops);
5419 5163
5420 5164 ASSERT(pvd->vdev_top->vdev_parent == rvd);
5421 5165 ASSERT(pvd->vdev_ops == pvops);
5422 5166 ASSERT(oldvd->vdev_parent == pvd);
5423 5167
5424 5168 /*
5425 5169 * Extract the new device from its root and add it to pvd.
5426 5170 */
5427 5171 vdev_remove_child(newrootvd, newvd);
5428 5172 newvd->vdev_id = pvd->vdev_children;
5429 5173 newvd->vdev_crtxg = oldvd->vdev_crtxg;
5430 5174 vdev_add_child(pvd, newvd);
5431 5175
5432 5176 tvd = newvd->vdev_top;
5433 5177 ASSERT(pvd->vdev_top == tvd);
5434 5178 ASSERT(tvd->vdev_parent == rvd);
5435 5179
5436 5180 vdev_config_dirty(tvd);
5437 5181
5438 5182 /*
5439 5183 * Set newvd's DTL to [TXG_INITIAL, dtl_max_txg) so that we account
5440 5184 * for any dmu_sync-ed blocks. It will propagate upward when
5441 5185 * spa_vdev_exit() calls vdev_dtl_reassess().
5442 5186 */
5443 5187 dtl_max_txg = txg + TXG_CONCURRENT_STATES;
5444 5188
5445 5189 vdev_dtl_dirty(newvd, DTL_MISSING, TXG_INITIAL,
5446 5190 dtl_max_txg - TXG_INITIAL);
5447 5191
5448 5192 if (newvd->vdev_isspare) {
5449 5193 spa_spare_activate(newvd);
5450 5194 spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_SPARE);
5451 5195 }
5452 5196
5453 5197 oldvdpath = spa_strdup(oldvd->vdev_path);
5454 5198 newvdpath = spa_strdup(newvd->vdev_path);
5455 5199 newvd_isspare = newvd->vdev_isspare;
5456 5200
5457 5201 /*
5458 5202 * Mark newvd's DTL dirty in this txg.
5459 5203 */
5460 5204 vdev_dirty(tvd, VDD_DTL, newvd, txg);
5461 5205
5462 5206 /*
5463 5207 * Schedule the resilver to restart in the future. We do this to
5464 5208 * ensure that dmu_sync-ed blocks have been stitched into the
[145 lines elided]
5465 5209 * respective datasets.
5466 5210 */
5467 5211 dsl_resilver_restart(spa->spa_dsl_pool, dtl_max_txg);
5468 5212
5469 5213 if (spa->spa_bootfs)
5470 5214 spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH);
5471 5215
5472 5216 spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH);
5473 5217
5474 5218 /*
 5219 +	 * Check the old vdev's CoS property; have the new vdev reference it
5220 + */
5221 + if (oldvd->vdev_queue.vq_cos) {
5222 + cos_hold(oldvd->vdev_queue.vq_cos);
5223 + newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos;
5224 + }
5225 +
5226 + /*
5475 5227 * Commit the config
5476 5228 */
5477 5229 (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0);
5478 5230
5479 5231 spa_history_log_internal(spa, "vdev attach", NULL,
5480 5232 "%s vdev=%s %s vdev=%s",
5481 5233 replacing && newvd_isspare ? "spare in" :
5482 5234 replacing ? "replace" : "attach", newvdpath,
5483 5235 replacing ? "for" : "to", oldvdpath);
5484 5236
5485 5237 spa_strfree(oldvdpath);
5486 5238 spa_strfree(newvdpath);
5487 5239
5488 5240 return (0);
5489 5241 }
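
The replace case of spa_vdev_attach() is driven the same way as the add sketch earlier: guid names the leaf being replaced and nvroot is a one-child "root" tree describing the new device. A sketch only; oldvd_guid and nvroot are assumed to be in scope:

    /* replace the leaf identified by oldvd_guid with the device in nvroot */
    error = spa_vdev_attach(spa, oldvd_guid, nvroot, 1 /* replacing */);
    if (error == EOVERFLOW) {
    	/* the new device is smaller than the required minimum asize */
    } else if (error == EDOM) {
    	/* the new device's ashift exceeds the top-level vdev's */
    }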
5490 5242
5491 5243 /*
5492 5244 * Detach a device from a mirror or replacing vdev.
5493 5245 *
5494 5246 * If 'replace_done' is specified, only detach if the parent
5495 5247 * is a replacing vdev.
5496 5248 */
5497 5249 int
5498 5250 spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pguid, int replace_done)
5499 5251 {
5500 5252 uint64_t txg;
5501 5253 int error;
5502 5254 vdev_t *rvd = spa->spa_root_vdev;
5503 5255 vdev_t *vd, *pvd, *cvd, *tvd;
5504 5256 boolean_t unspare = B_FALSE;
5505 5257 uint64_t unspare_guid = 0;
5506 5258 char *vdpath;
5507 5259
5508 5260 ASSERT(spa_writeable(spa));
5509 5261
5510 5262 txg = spa_vdev_enter(spa);
5511 5263
5512 5264 vd = spa_lookup_by_guid(spa, guid, B_FALSE);
5513 5265
5514 5266 if (vd == NULL)
5515 5267 return (spa_vdev_exit(spa, NULL, txg, ENODEV));
5516 5268
5517 5269 if (!vd->vdev_ops->vdev_op_leaf)
5518 5270 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5519 5271
5520 5272 pvd = vd->vdev_parent;
5521 5273
5522 5274 /*
5523 5275 * If the parent/child relationship is not as expected, don't do it.
5524 5276 * Consider M(A,R(B,C)) -- that is, a mirror of A with a replacing
5525 5277 * vdev that's replacing B with C. The user's intent in replacing
5526 5278 * is to go from M(A,B) to M(A,C). If the user decides to cancel
5527 5279 * the replace by detaching C, the expected behavior is to end up
5528 5280 * M(A,B). But suppose that right after deciding to detach C,
5529 5281 * the replacement of B completes. We would have M(A,C), and then
5530 5282 * ask to detach C, which would leave us with just A -- not what
5531 5283 * the user wanted. To prevent this, we make sure that the
5532 5284 * parent/child relationship hasn't changed -- in this example,
5533 5285 * that C's parent is still the replacing vdev R.
5534 5286 */
5535 5287 if (pvd->vdev_guid != pguid && pguid != 0)
5536 5288 return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5537 5289
5538 5290 /*
5539 5291 * Only 'replacing' or 'spare' vdevs can be replaced.
5540 5292 */
5541 5293 if (replace_done && pvd->vdev_ops != &vdev_replacing_ops &&
5542 5294 pvd->vdev_ops != &vdev_spare_ops)
5543 5295 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5544 5296
5545 5297 ASSERT(pvd->vdev_ops != &vdev_spare_ops ||
5546 5298 spa_version(spa) >= SPA_VERSION_SPARES);
5547 5299
5548 5300 /*
5549 5301 * Only mirror, replacing, and spare vdevs support detach.
5550 5302 */
5551 5303 if (pvd->vdev_ops != &vdev_replacing_ops &&
5552 5304 pvd->vdev_ops != &vdev_mirror_ops &&
5553 5305 pvd->vdev_ops != &vdev_spare_ops)
5554 5306 return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));
5555 5307
5556 5308 /*
5557 5309 * If this device has the only valid copy of some data,
5558 5310 * we cannot safely detach it.
5559 5311 */
5560 5312 if (vdev_dtl_required(vd))
5561 5313 return (spa_vdev_exit(spa, NULL, txg, EBUSY));
5562 5314
5563 5315 ASSERT(pvd->vdev_children >= 2);
5564 5316
5565 5317 /*
5566 5318 * If we are detaching the second disk from a replacing vdev, then
5567 5319 * check to see if we changed the original vdev's path to have "/old"
5568 5320 * at the end in spa_vdev_attach(). If so, undo that change now.
5569 5321 */
5570 5322 if (pvd->vdev_ops == &vdev_replacing_ops && vd->vdev_id > 0 &&
5571 5323 vd->vdev_path != NULL) {
5572 5324 size_t len = strlen(vd->vdev_path);
5573 5325
5574 5326 for (int c = 0; c < pvd->vdev_children; c++) {
5575 5327 cvd = pvd->vdev_child[c];
5576 5328
5577 5329 if (cvd == vd || cvd->vdev_path == NULL)
5578 5330 continue;
5579 5331
5580 5332 if (strncmp(cvd->vdev_path, vd->vdev_path, len) == 0 &&
5581 5333 strcmp(cvd->vdev_path + len, "/old") == 0) {
5582 5334 spa_strfree(cvd->vdev_path);
5583 5335 cvd->vdev_path = spa_strdup(vd->vdev_path);
5584 5336 break;
5585 5337 }
5586 5338 }
5587 5339 }
5588 5340
5589 5341 /*
5590 5342 * If we are detaching the original disk from a spare, then it implies
5591 5343 * that the spare should become a real disk, and be removed from the
5592 5344 * active spare list for the pool.
5593 5345 */
5594 5346 if (pvd->vdev_ops == &vdev_spare_ops &&
5595 5347 vd->vdev_id == 0 &&
5596 5348 pvd->vdev_child[pvd->vdev_children - 1]->vdev_isspare)
5597 5349 unspare = B_TRUE;
5598 5350
5599 5351 /*
5600 5352 * Erase the disk labels so the disk can be used for other things.
5601 5353 * This must be done after all other error cases are handled,
5602 5354 * but before we disembowel vd (so we can still do I/O to it).
5603 5355 * But if we can't do it, don't treat the error as fatal --
5604 5356 * it may be that the unwritability of the disk is the reason
5605 5357 * it's being detached!
5606 5358 */
5607 5359 error = vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
5608 5360
5609 5361 /*
5610 5362 * Remove vd from its parent and compact the parent's children.
5611 5363 */
5612 5364 vdev_remove_child(pvd, vd);
5613 5365 vdev_compact_children(pvd);
5614 5366
5615 5367 /*
5616 5368 * Remember one of the remaining children so we can get tvd below.
5617 5369 */
5618 5370 cvd = pvd->vdev_child[pvd->vdev_children - 1];
5619 5371
5620 5372 /*
5621 5373 * If we need to remove the remaining child from the list of hot spares,
5622 5374 * do it now, marking the vdev as no longer a spare in the process.
5623 5375 * We must do this before vdev_remove_parent(), because that can
5624 5376 * change the GUID if it creates a new toplevel GUID. For a similar
5625 5377 * reason, we must remove the spare now, in the same txg as the detach;
5626 5378 * otherwise someone could attach a new sibling, change the GUID, and
5627 5379 * the subsequent attempt to spa_vdev_remove(unspare_guid) would fail.
5628 5380 */
5629 5381 if (unspare) {
5630 5382 ASSERT(cvd->vdev_isspare);
5631 5383 spa_spare_remove(cvd);
5632 5384 unspare_guid = cvd->vdev_guid;
5633 5385 (void) spa_vdev_remove(spa, unspare_guid, B_TRUE);
5634 5386 cvd->vdev_unspare = B_TRUE;
5635 5387 }
5636 5388
5637 5389 /*
5638 5390 * If the parent mirror/replacing vdev only has one child,
5639 5391 * the parent is no longer needed. Remove it from the tree.
5640 5392 */
5641 5393 if (pvd->vdev_children == 1) {
5642 5394 if (pvd->vdev_ops == &vdev_spare_ops)
5643 5395 cvd->vdev_unspare = B_FALSE;
5644 5396 vdev_remove_parent(cvd);
5645 5397 }
5646 5398
5647 5399
5648 5400 /*
5649 5401 * We don't set tvd until now because the parent we just removed
5650 5402 * may have been the previous top-level vdev.
5651 5403 */
5652 5404 tvd = cvd->vdev_top;
5653 5405 ASSERT(tvd->vdev_parent == rvd);
5654 5406
5655 5407 /*
5656 5408 * Reevaluate the parent vdev state.
5657 5409 */
5658 5410 vdev_propagate_state(cvd);
5659 5411
5660 5412 /*
5661 5413 * If the 'autoexpand' property is set on the pool then automatically
5662 5414 * try to expand the size of the pool. For example if the device we
5663 5415 * just detached was smaller than the others, it may be possible to
5664 5416 * add metaslabs (i.e. grow the pool). We need to reopen the vdev
5665 5417 * first so that we can obtain the updated sizes of the leaf vdevs.
5666 5418 */
5667 5419 if (spa->spa_autoexpand) {
5668 5420 vdev_reopen(tvd);
5669 5421 vdev_expand(tvd, txg);
5670 5422 }
5671 5423
5672 5424 vdev_config_dirty(tvd);
5673 5425
5674 5426 /*
5675 5427 * Mark vd's DTL as dirty in this txg. vdev_dtl_sync() will see that
5676 5428 * vd->vdev_detached is set and free vd's DTL object in syncing context.
5677 5429 * But first make sure we're not on any *other* txg's DTL list, to
[193 lines elided]
5678 5430 * prevent vd from being accessed after it's freed.
5679 5431 */
5680 5432 vdpath = spa_strdup(vd->vdev_path);
5681 5433 for (int t = 0; t < TXG_SIZE; t++)
5682 5434 (void) txg_list_remove_this(&tvd->vdev_dtl_list, vd, t);
5683 5435 vd->vdev_detached = B_TRUE;
5684 5436 vdev_dirty(tvd, VDD_DTL, vd, txg);
5685 5437
5686 5438 spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE);
5687 5439
5440 + /*
5441 + * Release the references to CoS descriptors if any
5442 + */
5443 + if (vd->vdev_queue.vq_cos) {
5444 + cos_rele(vd->vdev_queue.vq_cos);
5445 + vd->vdev_queue.vq_cos = NULL;
5446 + }
5447 +
5688 5448 /* hang on to the spa before we release the lock */
5689 5449 spa_open_ref(spa, FTAG);
5690 5450
5691 5451 error = spa_vdev_exit(spa, vd, txg, 0);
5692 5452
5693 5453 spa_history_log_internal(spa, "detach", NULL,
5694 5454 "vdev=%s", vdpath);
5695 5455 spa_strfree(vdpath);
5696 5456
5697 5457 /*
5698 5458 * If this was the removal of the original device in a hot spare vdev,
5699 5459 * then we want to go through and remove the device from the hot spare
5700 5460 * list of every other pool.
5701 5461 */
5702 5462 if (unspare) {
5703 5463 spa_t *altspa = NULL;
5704 5464
5705 5465 mutex_enter(&spa_namespace_lock);
5706 5466 while ((altspa = spa_next(altspa)) != NULL) {
5707 5467 if (altspa->spa_state != POOL_STATE_ACTIVE ||
5708 5468 altspa == spa)
5709 5469 continue;
5710 5470
5711 5471 spa_open_ref(altspa, FTAG);
5712 5472 mutex_exit(&spa_namespace_lock);
5713 5473 (void) spa_vdev_remove(altspa, unspare_guid, B_TRUE);
5714 5474 mutex_enter(&spa_namespace_lock);
5715 5475 spa_close(altspa, FTAG);
5716 5476 }
5717 5477 mutex_exit(&spa_namespace_lock);
5718 5478
5719 5479 /* search the rest of the vdevs for spares to remove */
5720 5480 spa_vdev_resilver_done(spa);
5721 5481 }
5722 5482
5723 5483 /* all done with the spa; OK to release */
5724 5484 mutex_enter(&spa_namespace_lock);
5725 5485 spa_close(spa, FTAG);
5726 5486 mutex_exit(&spa_namespace_lock);
5727 5487
5728 5488 return (error);
5729 5489 }
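
The CoS hunks in the attach and detach paths above form a hold/release pair: whenever a vq_cos descriptor pointer is copied to a new child, a hold is taken, and every path that drops a vdev releases the reference and clears the pointer (the removal paths below do the same). Condensed, with cos_hold()/cos_rele() used exactly as in this file:

    /* attach: the new child inherits the old vdev's CoS descriptor */
    if (oldvd->vdev_queue.vq_cos != NULL) {
    	cos_hold(oldvd->vdev_queue.vq_cos);
    	newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos;
    }

    /* detach/remove: drop the reference before the vdev goes away */
    if (vd->vdev_queue.vq_cos != NULL) {
    	cos_rele(vd->vdev_queue.vq_cos);
    	vd->vdev_queue.vq_cos = NULL;
    }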
5730 5490
5731 5491 /*
5732 5492 * Split a set of devices from their mirrors, and create a new pool from them.
5733 5493 */
5734 5494 int
5735 5495 spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
5736 5496 nvlist_t *props, boolean_t exp)
5737 5497 {
5738 5498 int error = 0;
5739 5499 uint64_t txg, *glist;
[42 lines elided]
5740 5500 spa_t *newspa;
5741 5501 uint_t c, children, lastlog;
5742 5502 nvlist_t **child, *nvl, *tmp;
5743 5503 dmu_tx_t *tx;
5744 5504 char *altroot = NULL;
5745 5505 vdev_t *rvd, **vml = NULL; /* vdev modify list */
5746 5506 boolean_t activate_slog;
5747 5507
5748 5508 ASSERT(spa_writeable(spa));
5749 5509
5510 + /*
 5511 +	 * Splitting a pool with an active WBC is not supported yet;
 5512 +	 * it will be implemented in the next release.
5513 + */
5514 + if (spa_feature_is_active(spa, SPA_FEATURE_WBC))
5515 + return (SET_ERROR(ENOTSUP));
5516 +
5750 5517 txg = spa_vdev_enter(spa);
5751 5518
5752 5519 /* clear the log and flush everything up to now */
5753 5520 activate_slog = spa_passivate_log(spa);
5754 5521 (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5755 - error = spa_reset_logs(spa);
5522 + error = spa_offline_log(spa);
5756 5523 txg = spa_vdev_config_enter(spa);
5757 5524
5758 5525 if (activate_slog)
5759 5526 spa_activate_log(spa);
5760 5527
5761 5528 if (error != 0)
5762 5529 return (spa_vdev_exit(spa, NULL, txg, error));
5763 5530
5764 5531 /* check new spa name before going any further */
5765 5532 if (spa_lookup(newname) != NULL)
5766 5533 return (spa_vdev_exit(spa, NULL, txg, EEXIST));
5767 5534
5768 5535 /*
5769 5536 * scan through all the children to ensure they're all mirrors
5770 5537 */
5771 5538 if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 ||
5772 5539 nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN, &child,
[7 lines elided]
5773 5540 &children) != 0)
5774 5541 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5775 5542
5776 5543 /* first, check to ensure we've got the right child count */
5777 5544 rvd = spa->spa_root_vdev;
5778 5545 lastlog = 0;
5779 5546 for (c = 0; c < rvd->vdev_children; c++) {
5780 5547 vdev_t *vd = rvd->vdev_child[c];
5781 5548
5782 5549 /* don't count the holes & logs as children */
5783 - if (vd->vdev_islog || !vdev_is_concrete(vd)) {
5550 + if (vd->vdev_islog || vd->vdev_ishole) {
5784 5551 if (lastlog == 0)
5785 5552 lastlog = c;
5786 5553 continue;
5787 5554 }
5788 5555
5789 5556 lastlog = 0;
5790 5557 }
5791 5558 if (children != (lastlog != 0 ? lastlog : rvd->vdev_children))
5792 5559 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5793 5560
5794 5561 /* next, ensure no spare or cache devices are part of the split */
5795 5562 if (nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_SPARES, &tmp) == 0 ||
5796 5563 nvlist_lookup_nvlist(nvl, ZPOOL_CONFIG_L2CACHE, &tmp) == 0)
5797 5564 return (spa_vdev_exit(spa, NULL, txg, EINVAL));
5798 5565
5799 5566 vml = kmem_zalloc(children * sizeof (vdev_t *), KM_SLEEP);
5800 5567 glist = kmem_zalloc(children * sizeof (uint64_t), KM_SLEEP);
5801 5568
5802 5569 /* then, loop over each vdev and validate it */
5803 5570 for (c = 0; c < children; c++) {
5804 5571 uint64_t is_hole = 0;
5805 5572
5806 5573 (void) nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_IS_HOLE,
5807 5574 &is_hole);
5808 5575
5809 5576 if (is_hole != 0) {
5810 5577 if (spa->spa_root_vdev->vdev_child[c]->vdev_ishole ||
5811 5578 spa->spa_root_vdev->vdev_child[c]->vdev_islog) {
5812 5579 continue;
5813 5580 } else {
5814 5581 error = SET_ERROR(EINVAL);
5815 5582 break;
5816 5583 }
5817 5584 }
5818 5585
5819 5586 /* which disk is going to be split? */
5820 5587 if (nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_GUID,
5821 5588 &glist[c]) != 0) {
5822 5589 error = SET_ERROR(EINVAL);
5823 5590 break;
5824 5591 }
5825 5592
[32 lines elided]
5826 5593 /* look it up in the spa */
5827 5594 vml[c] = spa_lookup_by_guid(spa, glist[c], B_FALSE);
5828 5595 if (vml[c] == NULL) {
5829 5596 error = SET_ERROR(ENODEV);
5830 5597 break;
5831 5598 }
5832 5599
5833 5600 /* make sure there's nothing stopping the split */
5834 5601 if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops ||
5835 5602 vml[c]->vdev_islog ||
5836 - !vdev_is_concrete(vml[c]) ||
5603 + vml[c]->vdev_ishole ||
5837 5604 vml[c]->vdev_isspare ||
5838 5605 vml[c]->vdev_isl2cache ||
5839 5606 !vdev_writeable(vml[c]) ||
5840 5607 vml[c]->vdev_children != 0 ||
5841 5608 vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
5842 5609 c != spa->spa_root_vdev->vdev_child[c]->vdev_id) {
5843 5610 error = SET_ERROR(EINVAL);
5844 5611 break;
5845 5612 }
5846 5613
5847 5614 if (vdev_dtl_required(vml[c])) {
5848 5615 error = SET_ERROR(EBUSY);
5849 5616 break;
5850 5617 }
5851 5618
5852 5619 /* we need certain info from the top level */
5853 5620 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_ARRAY,
5854 5621 vml[c]->vdev_top->vdev_ms_array) == 0);
5855 5622 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_METASLAB_SHIFT,
5856 5623 vml[c]->vdev_top->vdev_ms_shift) == 0);
5857 5624 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_ASIZE,
5858 5625 vml[c]->vdev_top->vdev_asize) == 0);
5859 5626 VERIFY(nvlist_add_uint64(child[c], ZPOOL_CONFIG_ASHIFT,
5860 5627 vml[c]->vdev_top->vdev_ashift) == 0);
5861 5628
5862 5629 /* transfer per-vdev ZAPs */
5863 5630 ASSERT3U(vml[c]->vdev_leaf_zap, !=, 0);
5864 5631 VERIFY0(nvlist_add_uint64(child[c],
5865 5632 ZPOOL_CONFIG_VDEV_LEAF_ZAP, vml[c]->vdev_leaf_zap));
5866 5633
5867 5634 ASSERT3U(vml[c]->vdev_top->vdev_top_zap, !=, 0);
5868 5635 VERIFY0(nvlist_add_uint64(child[c],
5869 5636 ZPOOL_CONFIG_VDEV_TOP_ZAP,
5870 5637 vml[c]->vdev_parent->vdev_top_zap));
5871 5638 }
5872 5639
5873 5640 if (error != 0) {
5874 5641 kmem_free(vml, children * sizeof (vdev_t *));
5875 5642 kmem_free(glist, children * sizeof (uint64_t));
5876 5643 return (spa_vdev_exit(spa, NULL, txg, error));
5877 5644 }
5878 5645
5879 5646 /* stop writers from using the disks */
5880 5647 for (c = 0; c < children; c++) {
5881 5648 if (vml[c] != NULL)
5882 5649 vml[c]->vdev_offline = B_TRUE;
5883 5650 }
5884 5651 vdev_reopen(spa->spa_root_vdev);
5885 5652
5886 5653 /*
5887 5654 * Temporarily record the splitting vdevs in the spa config. This
5888 5655 * will disappear once the config is regenerated.
5889 5656 */
5890 5657 VERIFY(nvlist_alloc(&nvl, NV_UNIQUE_NAME, KM_SLEEP) == 0);
5891 5658 VERIFY(nvlist_add_uint64_array(nvl, ZPOOL_CONFIG_SPLIT_LIST,
5892 5659 glist, children) == 0);
5893 5660 kmem_free(glist, children * sizeof (uint64_t));
5894 5661
5895 5662 mutex_enter(&spa->spa_props_lock);
5896 5663 VERIFY(nvlist_add_nvlist(spa->spa_config, ZPOOL_CONFIG_SPLIT,
5897 5664 nvl) == 0);
5898 5665 mutex_exit(&spa->spa_props_lock);
5899 5666 spa->spa_config_splitting = nvl;
5900 5667 vdev_config_dirty(spa->spa_root_vdev);
5901 5668
5902 5669 /* configure and create the new pool */
5903 5670 VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME, newname) == 0);
5904 5671 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
5905 5672 exp ? POOL_STATE_EXPORTED : POOL_STATE_ACTIVE) == 0);
5906 5673 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
5907 5674 spa_version(spa)) == 0);
5908 5675 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG,
5909 5676 spa->spa_config_txg) == 0);
5910 5677 VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_GUID,
5911 5678 spa_generate_guid(NULL)) == 0);
5912 5679 VERIFY0(nvlist_add_boolean(config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS));
5913 5680 (void) nvlist_lookup_string(props,
5914 5681 zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot);
5915 5682
5916 5683 /* add the new pool to the namespace */
5917 5684 newspa = spa_add(newname, config, altroot);
5918 5685 newspa->spa_avz_action = AVZ_ACTION_REBUILD;
5919 5686 newspa->spa_config_txg = spa->spa_config_txg;
5920 5687 spa_set_log_state(newspa, SPA_LOG_CLEAR);
[74 lines elided]
5921 5688
5922 5689 /* release the spa config lock, retaining the namespace lock */
5923 5690 spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5924 5691
5925 5692 if (zio_injection_enabled)
5926 5693 zio_handle_panic_injection(spa, FTAG, 1);
5927 5694
5928 5695 spa_activate(newspa, spa_mode_global);
5929 5696 spa_async_suspend(newspa);
5930 5697
5931 - newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT;
5932 -
5933 5698 /* create the new pool from the disks of the original pool */
5934 - error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE);
5699 + error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE, B_TRUE);
5935 5700 if (error)
5936 5701 goto out;
5937 5702
5938 5703 /* if that worked, generate a real config for the new pool */
5939 5704 if (newspa->spa_root_vdev != NULL) {
5940 5705 VERIFY(nvlist_alloc(&newspa->spa_config_splitting,
5941 5706 NV_UNIQUE_NAME, KM_SLEEP) == 0);
5942 5707 VERIFY(nvlist_add_uint64(newspa->spa_config_splitting,
5943 5708 ZPOOL_CONFIG_SPLIT_GUID, spa_guid(spa)) == 0);
5944 5709 spa_config_set(newspa, spa_config_generate(newspa, NULL, -1ULL,
5945 5710 B_TRUE));
5946 5711 }
5947 5712
5948 5713 /* set the props */
5949 5714 if (props != NULL) {
5950 5715 spa_configfile_set(newspa, props, B_FALSE);
5951 5716 error = spa_prop_set(newspa, props);
5952 5717 if (error)
5953 5718 goto out;
5954 5719 }
5955 5720
5956 5721 /* flush everything */
5957 5722 txg = spa_vdev_config_enter(newspa);
5958 5723 vdev_config_dirty(newspa->spa_root_vdev);
5959 5724 (void) spa_vdev_config_exit(newspa, NULL, txg, 0, FTAG);
5960 5725
5961 5726 if (zio_injection_enabled)
5962 5727 zio_handle_panic_injection(spa, FTAG, 2);
5963 5728
[19 lines elided]
5964 5729 spa_async_resume(newspa);
5965 5730
5966 5731 /* finally, update the original pool's config */
5967 5732 txg = spa_vdev_config_enter(spa);
5968 5733 tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
5969 5734 error = dmu_tx_assign(tx, TXG_WAIT);
5970 5735 if (error != 0)
5971 5736 dmu_tx_abort(tx);
5972 5737 for (c = 0; c < children; c++) {
5973 5738 if (vml[c] != NULL) {
5739 + vdev_t *tvd = vml[c]->vdev_top;
5740 +
5741 + /*
5742 + * Need to be sure the detachable VDEV is not
5743 + * on any *other* txg's DTL list to prevent it
5744 + * from being accessed after it's freed.
5745 + */
5746 + for (int t = 0; t < TXG_SIZE; t++) {
5747 + (void) txg_list_remove_this(
5748 + &tvd->vdev_dtl_list, vml[c], t);
5749 + }
5750 +
5974 5751 vdev_split(vml[c]);
5975 5752 if (error == 0)
5976 5753 spa_history_log_internal(spa, "detach", tx,
5977 5754 "vdev=%s", vml[c]->vdev_path);
5978 5755
5979 5756 vdev_free(vml[c]);
5980 5757 }
5981 5758 }
5982 5759 spa->spa_avz_action = AVZ_ACTION_REBUILD;
5983 5760 vdev_config_dirty(spa->spa_root_vdev);
5984 5761 spa->spa_config_splitting = NULL;
5985 5762 nvlist_free(nvl);
5986 5763 if (error == 0)
5987 5764 dmu_tx_commit(tx);
5988 5765 (void) spa_vdev_exit(spa, NULL, txg, 0);
5989 5766
5990 5767 if (zio_injection_enabled)
5991 5768 zio_handle_panic_injection(spa, FTAG, 3);
[8 lines elided]
5992 5769
5993 5770 /* split is complete; log a history record */
5994 5771 spa_history_log_internal(newspa, "split", NULL,
5995 5772 "from pool %s", spa_name(spa));
5996 5773
5997 5774 kmem_free(vml, children * sizeof (vdev_t *));
5998 5775
5999 5776 /* if we're not going to mount the filesystems in userland, export */
6000 5777 if (exp)
6001 5778 error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL,
6002 - B_FALSE, B_FALSE);
5779 + B_FALSE, B_FALSE, B_FALSE);
6003 5780
6004 5781 return (error);
6005 5782
6006 5783 out:
6007 5784 spa_unload(newspa);
6008 5785 spa_deactivate(newspa);
6009 5786 spa_remove(newspa);
6010 5787
6011 5788 txg = spa_vdev_config_enter(spa);
6012 5789
6013 5790 /* re-online all offlined disks */
6014 5791 for (c = 0; c < children; c++) {
6015 5792 if (vml[c] != NULL)
6016 5793 vml[c]->vdev_offline = B_FALSE;
6017 5794 }
[5 lines elided]
6018 5795 vdev_reopen(spa->spa_root_vdev);
6019 5796
6020 5797 nvlist_free(spa->spa_config_splitting);
6021 5798 spa->spa_config_splitting = NULL;
6022 5799 (void) spa_vdev_exit(spa, NULL, txg, error);
6023 5800
6024 5801 kmem_free(vml, children * sizeof (vdev_t *));
6025 5802 return (error);
6026 5803 }
6027 5804
5805 +static nvlist_t *
5806 +spa_nvlist_lookup_by_guid(nvlist_t **nvpp, int count, uint64_t target_guid)
5807 +{
5808 + for (int i = 0; i < count; i++) {
5809 + uint64_t guid;
5810 +
5811 + VERIFY(nvlist_lookup_uint64(nvpp[i], ZPOOL_CONFIG_GUID,
5812 + &guid) == 0);
5813 +
5814 + if (guid == target_guid)
5815 + return (nvpp[i]);
5816 + }
5817 +
5818 + return (NULL);
5819 +}
5820 +
5821 +static void
5822 +spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count,
5823 + nvlist_t *dev_to_remove)
5824 +{
5825 + nvlist_t **newdev = NULL;
5826 +
5827 + if (count > 1)
5828 + newdev = kmem_alloc((count - 1) * sizeof (void *), KM_SLEEP);
5829 +
5830 + for (int i = 0, j = 0; i < count; i++) {
5831 + if (dev[i] == dev_to_remove)
5832 + continue;
5833 + VERIFY(nvlist_dup(dev[i], &newdev[j++], KM_SLEEP) == 0);
5834 + }
5835 +
5836 + VERIFY(nvlist_remove(config, name, DATA_TYPE_NVLIST_ARRAY) == 0);
5837 + VERIFY(nvlist_add_nvlist_array(config, name, newdev, count - 1) == 0);
5838 +
5839 + for (int i = 0; i < count - 1; i++)
5840 + nvlist_free(newdev[i]);
5841 +
5842 + if (count > 1)
5843 + kmem_free(newdev, (count - 1) * sizeof (void *));
5844 +}
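
These two helpers are combined by the aux-device branches of spa_vdev_remove() below. Condensed, the spare case looks like this (both helpers and spa_load_spares() are static to this file; spa and guid are assumed to be in scope):

    nvlist_t **spares, *nv;
    uint_t nspares;

    VERIFY(nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
        ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0);
    nv = spa_nvlist_lookup_by_guid(spares, nspares, guid);
    if (nv != NULL) {
    	spa_vdev_remove_aux(spa->spa_spares.sav_config,
    	    ZPOOL_CONFIG_SPARES, spares, nspares, nv);
    	spa_load_spares(spa);		/* refresh the in-core spares */
    	spa->spa_spares.sav_sync = B_TRUE;
    }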
5845 +
6028 5846 /*
5847 + * Evacuate the device.
5848 + */
5849 +static int
5850 +spa_vdev_remove_evacuate(spa_t *spa, vdev_t *vd)
5851 +{
5852 + uint64_t txg;
5853 + int error = 0;
5854 +
5855 + ASSERT(MUTEX_HELD(&spa_namespace_lock));
5856 + ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
5857 + ASSERT(vd == vd->vdev_top);
5858 +
5859 + /*
5860 + * Evacuate the device. We don't hold the config lock as writer
5861 + * since we need to do I/O but we do keep the
5862 + * spa_namespace_lock held. Once this completes the device
5863 + * should no longer have any blocks allocated on it.
5864 + */
5865 + if (vd->vdev_islog) {
5866 + if (vd->vdev_stat.vs_alloc != 0)
5867 + error = spa_offline_log(spa);
5868 + } else {
5869 + error = SET_ERROR(ENOTSUP);
5870 + }
5871 +
5872 + if (error)
5873 + return (error);
5874 +
5875 + /*
5876 + * The evacuation succeeded. Remove any remaining MOS metadata
5877 + * associated with this vdev, and wait for these changes to sync.
5878 + */
5879 + ASSERT0(vd->vdev_stat.vs_alloc);
5880 + txg = spa_vdev_config_enter(spa);
5881 + vd->vdev_removing = B_TRUE;
5882 + vdev_dirty_leaves(vd, VDD_DTL, txg);
5883 + vdev_config_dirty(vd);
5884 + spa_vdev_config_exit(spa, NULL, txg, 0, FTAG);
5885 +
5886 + return (0);
5887 +}
5888 +
5889 +/*
5890 + * Complete the removal by cleaning up the namespace.
5891 + */
5892 +static void
5893 +spa_vdev_remove_from_namespace(spa_t *spa, vdev_t *vd)
5894 +{
5895 + vdev_t *rvd = spa->spa_root_vdev;
5896 + uint64_t id = vd->vdev_id;
5897 + boolean_t last_vdev = (id == (rvd->vdev_children - 1));
5898 +
5899 + ASSERT(MUTEX_HELD(&spa_namespace_lock));
5900 + ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
5901 + ASSERT(vd == vd->vdev_top);
5902 +
5903 + /*
5904 + * Only remove any devices which are empty.
5905 + */
5906 + if (vd->vdev_stat.vs_alloc != 0)
5907 + return;
5908 +
5909 + (void) vdev_label_init(vd, 0, VDEV_LABEL_REMOVE);
5910 +
5911 + if (list_link_active(&vd->vdev_state_dirty_node))
5912 + vdev_state_clean(vd);
5913 + if (list_link_active(&vd->vdev_config_dirty_node))
5914 + vdev_config_clean(vd);
5915 +
5916 + vdev_free(vd);
5917 +
5918 + if (last_vdev) {
5919 + vdev_compact_children(rvd);
5920 + } else {
5921 + vd = vdev_alloc_common(spa, id, 0, &vdev_hole_ops);
5922 + vdev_add_child(rvd, vd);
5923 + }
5924 + vdev_config_dirty(rvd);
5925 +
5926 + /*
5927 + * Reassess the health of our root vdev.
5928 + */
5929 + vdev_reopen(rvd);
5930 +}
5931 +
5932 +/*
5933 + * Remove a device from the pool -
5934 + *
5935 + * Removing a device from the vdev namespace requires several steps
5936 + * and can take a significant amount of time. As a result we use
5937 + * the spa_vdev_config_[enter/exit] functions which allow us to
5938 + * grab and release the spa_config_lock while still holding the namespace
5939 + * lock. During each step the configuration is synced out.
5940 + *
5941 + * Currently, this supports removing only hot spares, slogs, level 2 ARC
5942 + * and special devices.
5943 + */
5944 +int
5945 +spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare)
5946 +{
5947 + vdev_t *vd;
5948 + sysevent_t *ev = NULL;
5949 + metaslab_group_t *mg;
5950 + nvlist_t **spares, **l2cache, *nv;
5951 + uint64_t txg = 0;
5952 + uint_t nspares, nl2cache;
5953 + int error = 0;
5954 + boolean_t locked = MUTEX_HELD(&spa_namespace_lock);
5955 +
5956 + ASSERT(spa_writeable(spa));
5957 +
5958 + if (!locked)
5959 + txg = spa_vdev_enter(spa);
5960 +
5961 + vd = spa_lookup_by_guid(spa, guid, B_FALSE);
5962 +
5963 + if (spa->spa_spares.sav_vdevs != NULL &&
5964 + nvlist_lookup_nvlist_array(spa->spa_spares.sav_config,
5965 + ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0 &&
5966 + (nv = spa_nvlist_lookup_by_guid(spares, nspares, guid)) != NULL) {
5967 + /*
5968 + * Only remove the hot spare if it's not currently in use
5969 + * in this pool.
5970 + */
5971 + if (vd == NULL || unspare) {
5972 + if (vd == NULL)
5973 + vd = spa_lookup_by_guid(spa, guid, B_TRUE);
5974 +
5975 + /*
5976 + * Release the references to CoS descriptors if any
5977 + */
5978 + if (vd != NULL && vd->vdev_queue.vq_cos) {
5979 + cos_rele(vd->vdev_queue.vq_cos);
5980 + vd->vdev_queue.vq_cos = NULL;
5981 + }
5982 +
5983 + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
5984 + spa_vdev_remove_aux(spa->spa_spares.sav_config,
5985 + ZPOOL_CONFIG_SPARES, spares, nspares, nv);
5986 + spa_load_spares(spa);
5987 + spa->spa_spares.sav_sync = B_TRUE;
5988 + } else {
5989 + error = SET_ERROR(EBUSY);
5990 + }
5991 + } else if (spa->spa_l2cache.sav_vdevs != NULL &&
5992 + nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config,
5993 + ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0 &&
5994 + (nv = spa_nvlist_lookup_by_guid(l2cache, nl2cache, guid)) != NULL) {
5995 + /*
5996 + * Cache devices can always be removed.
5997 + */
5998 + if (vd == NULL)
5999 + vd = spa_lookup_by_guid(spa, guid, B_TRUE);
6000 + /*
6001 + * Release the references to CoS descriptors if any
6002 + */
6003 + if (vd != NULL && vd->vdev_queue.vq_cos) {
6004 + cos_rele(vd->vdev_queue.vq_cos);
6005 + vd->vdev_queue.vq_cos = NULL;
6006 + }
6007 +
6008 + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX);
6009 + spa_vdev_remove_aux(spa->spa_l2cache.sav_config,
6010 + ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache, nv);
6011 + spa_load_l2cache(spa);
6012 + spa->spa_l2cache.sav_sync = B_TRUE;
6013 + } else if (vd != NULL && vd->vdev_islog) {
6014 + ASSERT(!locked);
6015 +
6016 + if (vd != vd->vdev_top)
6017 + return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
6018 +
6019 + mg = vd->vdev_mg;
6020 +
6021 + /*
6022 + * Stop allocating from this vdev.
6023 + */
6024 + metaslab_group_passivate(mg);
6025 +
6026 + /*
6027 + * Wait for the youngest allocations and frees to sync,
6028 + * and then wait for the deferral of those frees to finish.
6029 + */
6030 + spa_vdev_config_exit(spa, NULL,
6031 + txg + TXG_CONCURRENT_STATES + TXG_DEFER_SIZE, 0, FTAG);
6032 +
6033 + /*
6034 + * Attempt to evacuate the vdev.
6035 + */
6036 + error = spa_vdev_remove_evacuate(spa, vd);
6037 +
6038 + txg = spa_vdev_config_enter(spa);
6039 +
6040 + /*
6041 + * If we couldn't evacuate the vdev, unwind.
6042 + */
6043 + if (error) {
6044 + metaslab_group_activate(mg);
6045 + return (spa_vdev_exit(spa, NULL, txg, error));
6046 + }
6047 +
6048 + /*
6049 + * Release the references to CoS descriptors if any
6050 + */
6051 + if (vd->vdev_queue.vq_cos) {
6052 + cos_rele(vd->vdev_queue.vq_cos);
6053 + vd->vdev_queue.vq_cos = NULL;
6054 + }
6055 +
6058 + /*
6059 + * Clean up the vdev namespace.
6060 + */
6061 + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
6062 + spa_vdev_remove_from_namespace(spa, vd);
6063 +
6064 + } else if (vd != NULL && vdev_is_special(vd)) {
6065 + ASSERT(!locked);
6066 +
6067 + if (vd != vd->vdev_top)
6068 + return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP)));
6069 +
6070 + error = spa_special_vdev_remove(spa, vd, &txg);
6071 + if (error == 0) {
6072 + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV);
6073 + spa_vdev_remove_from_namespace(spa, vd);
6074 +
6075 + /*
 6076 +			 * The user sees this field as the
 6077 +			 * 'enablespecial' pool-level property.
6078 + */
6079 + spa->spa_usesc = B_FALSE;
6080 + }
6081 + } else if (vd != NULL) {
6082 + /*
6083 + * Normal vdevs cannot be removed (yet).
6084 + */
6085 + error = SET_ERROR(ENOTSUP);
6086 + } else {
6087 + /*
6088 + * There is no vdev of any kind with the specified guid.
6089 + */
6090 + error = SET_ERROR(ENOENT);
6091 + }
6092 +
6093 + if (!locked)
6094 + error = spa_vdev_exit(spa, NULL, txg, error);
6095 +
6096 + if (ev)
6097 + spa_event_notify_impl(ev);
6098 +
6099 + return (error);
6100 +}
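
A sketch of the caller-visible contract of spa_vdev_remove() as implemented above; other errors can also surface from the config enter/exit path:

    error = spa_vdev_remove(spa, guid, B_FALSE);
    switch (error) {
    case 0:		/* spare, l2cache, slog or special vdev removed */
    	break;
    case EBUSY:		/* the hot spare is currently in use in this pool */
    	break;
    case ENOTSUP:	/* normal (data) vdevs cannot be removed yet */
    	break;
    case ENOENT:	/* no vdev of any kind with this guid */
    	break;
    }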
6101 +
6102 +/*
6029 6103 * Find any device that's done replacing, or a vdev marked 'unspare' that's
6030 6104 * currently spared, so we can detach it.
6031 6105 */
6032 6106 static vdev_t *
6033 6107 spa_vdev_resilver_done_hunt(vdev_t *vd)
6034 6108 {
6035 6109 vdev_t *newvd, *oldvd;
6036 6110
6037 6111 for (int c = 0; c < vd->vdev_children; c++) {
6038 6112 oldvd = spa_vdev_resilver_done_hunt(vd->vdev_child[c]);
6039 6113 if (oldvd != NULL)
6040 6114 return (oldvd);
6041 6115 }
6042 6116
6043 6117 /*
6044 6118 * Check for a completed replacement. We always consider the first
6045 6119 * vdev in the list to be the oldest vdev, and the last one to be
6046 6120 * the newest (see spa_vdev_attach() for how that works). In
6047 6121 * the case where the newest vdev is faulted, we will not automatically
6048 6122 * remove it after a resilver completes. This is OK as it will require
6049 6123 * user intervention to determine which disk the admin wishes to keep.
6050 6124 */
6051 6125 if (vd->vdev_ops == &vdev_replacing_ops) {
6052 6126 ASSERT(vd->vdev_children > 1);
6053 6127
6054 6128 newvd = vd->vdev_child[vd->vdev_children - 1];
[16 lines elided]
6055 6129 oldvd = vd->vdev_child[0];
6056 6130
6057 6131 if (vdev_dtl_empty(newvd, DTL_MISSING) &&
6058 6132 vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6059 6133 !vdev_dtl_required(oldvd))
6060 6134 return (oldvd);
6061 6135 }
6062 6136
6063 6137 /*
6064 6138 * Check for a completed resilver with the 'unspare' flag set.
6139 + * Also potentially update faulted state.
6065 6140 */
6066 6141 if (vd->vdev_ops == &vdev_spare_ops) {
6067 6142 vdev_t *first = vd->vdev_child[0];
6068 6143 vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
6069 6144
6070 6145 if (last->vdev_unspare) {
6071 6146 oldvd = first;
6072 6147 newvd = last;
6073 6148 } else if (first->vdev_unspare) {
6074 6149 oldvd = last;
6075 6150 newvd = first;
[1 line elided]
6076 6151 } else {
6077 6152 oldvd = NULL;
6078 6153 }
6079 6154
6080 6155 if (oldvd != NULL &&
6081 6156 vdev_dtl_empty(newvd, DTL_MISSING) &&
6082 6157 vdev_dtl_empty(newvd, DTL_OUTAGE) &&
6083 6158 !vdev_dtl_required(oldvd))
6084 6159 return (oldvd);
6085 6160
6161 + vdev_propagate_state(vd);
6162 +
6086 6163 /*
6087 6164 * If there are more than two spares attached to a disk,
6088 6165 * and those spares are not required, then we want to
6089 6166 * attempt to free them up now so that they can be used
6090 6167 * by other pools. Once we're back down to a single
6091 6168 * disk+spare, we stop removing them.
6092 6169 */
6093 6170 if (vd->vdev_children > 2) {
6094 6171 newvd = vd->vdev_child[1];
6095 6172
6096 6173 if (newvd->vdev_isspare && last->vdev_isspare &&
6097 6174 vdev_dtl_empty(last, DTL_MISSING) &&
6098 6175 vdev_dtl_empty(last, DTL_OUTAGE) &&
6099 6176 !vdev_dtl_required(newvd))
6100 6177 return (newvd);
6101 6178 }
6102 6179 }
6103 6180
6104 6181 return (NULL);
6105 6182 }
6106 6183
6107 6184 static void
6108 6185 spa_vdev_resilver_done(spa_t *spa)
6109 6186 {
6110 6187 vdev_t *vd, *pvd, *ppvd;
6111 6188 uint64_t guid, sguid, pguid, ppguid;
6112 6189
6113 6190 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6114 6191
6115 6192 while ((vd = spa_vdev_resilver_done_hunt(spa->spa_root_vdev)) != NULL) {
6116 6193 pvd = vd->vdev_parent;
6117 6194 ppvd = pvd->vdev_parent;
6118 6195 guid = vd->vdev_guid;
6119 6196 pguid = pvd->vdev_guid;
6120 6197 ppguid = ppvd->vdev_guid;
6121 6198 sguid = 0;
6122 6199 /*
6123 6200 * If we have just finished replacing a hot spared device, then
6124 6201 * we need to detach the parent's first child (the original hot
6125 6202 * spare) as well.
6126 6203 */
6127 6204 if (ppvd->vdev_ops == &vdev_spare_ops && pvd->vdev_id == 0 &&
6128 6205 ppvd->vdev_children == 2) {
6129 6206 ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
6130 6207 sguid = ppvd->vdev_child[1]->vdev_guid;
6131 6208 }
6132 6209 ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));
6133 6210
6134 6211 spa_config_exit(spa, SCL_ALL, FTAG);
6135 6212 if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
[40 lines elided]
6136 6213 return;
6137 6214 if (sguid && spa_vdev_detach(spa, sguid, ppguid, B_TRUE) != 0)
6138 6215 return;
6139 6216 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
6140 6217 }
6141 6218
6142 6219 spa_config_exit(spa, SCL_ALL, FTAG);
6143 6220 }
6144 6221
6145 6222 /*
6146 - * Update the stored path or FRU for this vdev.
6147 - */
6148 -int
6149 -spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value,
6150 - boolean_t ispath)
6151 -{
6152 - vdev_t *vd;
6153 - boolean_t sync = B_FALSE;
6154 -
6155 - ASSERT(spa_writeable(spa));
6156 -
6157 - spa_vdev_state_enter(spa, SCL_ALL);
6158 -
6159 - if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL)
6160 - return (spa_vdev_state_exit(spa, NULL, ENOENT));
6161 -
6162 - if (!vd->vdev_ops->vdev_op_leaf)
6163 - return (spa_vdev_state_exit(spa, NULL, ENOTSUP));
6164 -
6165 - if (ispath) {
6166 - if (strcmp(value, vd->vdev_path) != 0) {
6167 - spa_strfree(vd->vdev_path);
6168 - vd->vdev_path = spa_strdup(value);
6169 - sync = B_TRUE;
6170 - }
6171 - } else {
6172 - if (vd->vdev_fru == NULL) {
6173 - vd->vdev_fru = spa_strdup(value);
6174 - sync = B_TRUE;
6175 - } else if (strcmp(value, vd->vdev_fru) != 0) {
6176 - spa_strfree(vd->vdev_fru);
6177 - vd->vdev_fru = spa_strdup(value);
6178 - sync = B_TRUE;
6179 - }
6180 - }
6181 -
6182 - return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0));
6183 -}
6184 -
6185 -int
6186 -spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath)
6187 -{
6188 - return (spa_vdev_set_common(spa, guid, newpath, B_TRUE));
6189 -}
6190 -
6191 -int
6192 -spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru)
6193 -{
6194 - return (spa_vdev_set_common(spa, guid, newfru, B_FALSE));
6195 -}
6196 -
6197 -/*
6198 6223 * ==========================================================================
6199 6224 * SPA Scanning
6200 6225 * ==========================================================================
6201 6226 */
6202 6227 int
6203 6228 spa_scrub_pause_resume(spa_t *spa, pool_scrub_cmd_t cmd)
6204 6229 {
6205 6230 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6206 6231
6207 6232 if (dsl_scan_resilvering(spa->spa_dsl_pool))
6208 6233 return (SET_ERROR(EBUSY));
6209 6234
6210 6235 return (dsl_scrub_set_pause_resume(spa->spa_dsl_pool, cmd));
6211 6236 }
6212 6237
6213 6238 int
6214 6239 spa_scan_stop(spa_t *spa)
6215 6240 {
6216 6241 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6217 6242 if (dsl_scan_resilvering(spa->spa_dsl_pool))
6218 6243 return (SET_ERROR(EBUSY));
6219 6244 return (dsl_scan_cancel(spa->spa_dsl_pool));
6220 6245 }
6221 6246
6222 6247 int
6223 6248 spa_scan(spa_t *spa, pool_scan_func_t func)
6224 6249 {
6225 6250 ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0);
6226 6251
6227 6252 if (func >= POOL_SCAN_FUNCS || func == POOL_SCAN_NONE)
6228 6253 return (SET_ERROR(ENOTSUP));
6229 6254
6230 6255 /*
6231 6256 * If a resilver was requested, but there is no DTL on a
6232 6257 * writeable leaf device, we have nothing to do.
6233 6258 */
6234 6259 if (func == POOL_SCAN_RESILVER &&
6235 6260 !vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) {
6236 6261 spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
6237 6262 return (0);
6238 6263 }
6239 6264
6240 6265 return (dsl_scan(spa->spa_dsl_pool, func));
6241 6266 }
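
Illustrative usage of the scan entry points above (a sketch; spa_scrub_pause_resume() and spa_scan_stop() return EBUSY while a resilver is in progress, matching the dsl_scan_resilvering() checks):

    error = spa_scan(spa, POOL_SCAN_SCRUB);	/* start a scrub */

    /* later, from another request: */
    error = spa_scrub_pause_resume(spa, POOL_SCRUB_PAUSE);
    if (error == EBUSY) {
    	/* a resilver is running; pausing is refused */
    }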
6242 6267
6243 6268 /*
6244 6269 * ==========================================================================
6245 6270 * SPA async task processing
6246 6271 * ==========================================================================
6247 6272 */
6248 6273
6249 6274 static void
6250 6275 spa_async_remove(spa_t *spa, vdev_t *vd)
6251 6276 {
6252 6277 if (vd->vdev_remove_wanted) {
6253 6278 vd->vdev_remove_wanted = B_FALSE;
6254 6279 vd->vdev_delayed_close = B_FALSE;
6255 6280 vdev_set_state(vd, B_FALSE, VDEV_STATE_REMOVED, VDEV_AUX_NONE);
6256 6281
6257 6282 /*
6258 6283 * We want to clear the stats, but we don't want to do a full
6259 6284 * vdev_clear() as that will cause us to throw away
6260 6285 * degraded/faulted state as well as attempt to reopen the
6261 6286 * device, all of which is a waste.
6262 6287 */
6263 6288 vd->vdev_stat.vs_read_errors = 0;
6264 6289 vd->vdev_stat.vs_write_errors = 0;
6265 6290 vd->vdev_stat.vs_checksum_errors = 0;
6266 6291
6267 6292 vdev_state_dirty(vd->vdev_top);
6268 6293 }
6269 6294
6270 6295 for (int c = 0; c < vd->vdev_children; c++)
6271 6296 spa_async_remove(spa, vd->vdev_child[c]);
6272 6297 }
6273 6298
6274 6299 static void
6275 6300 spa_async_probe(spa_t *spa, vdev_t *vd)
6276 6301 {
6277 6302 if (vd->vdev_probe_wanted) {
6278 6303 vd->vdev_probe_wanted = B_FALSE;
6279 6304 vdev_reopen(vd); /* vdev_open() does the actual probe */
6280 6305 }
6281 6306
6282 6307 for (int c = 0; c < vd->vdev_children; c++)
6283 6308 spa_async_probe(spa, vd->vdev_child[c]);
6284 6309 }
6285 6310
6286 6311 static void
6287 6312 spa_async_autoexpand(spa_t *spa, vdev_t *vd)
6288 6313 {
6289 6314 sysevent_id_t eid;
6290 6315 nvlist_t *attr;
6291 6316 char *physpath;
6292 6317
6293 6318 if (!spa->spa_autoexpand)
6294 6319 return;
6295 6320
6296 6321 for (int c = 0; c < vd->vdev_children; c++) {
6297 6322 vdev_t *cvd = vd->vdev_child[c];
6298 6323 spa_async_autoexpand(spa, cvd);
6299 6324 }
6300 6325
6301 6326 if (!vd->vdev_ops->vdev_op_leaf || vd->vdev_physpath == NULL)
6302 6327 return;
6303 6328
6304 6329 physpath = kmem_zalloc(MAXPATHLEN, KM_SLEEP);
6305 6330 (void) snprintf(physpath, MAXPATHLEN, "/devices%s", vd->vdev_physpath);
6306 6331
6307 6332 VERIFY(nvlist_alloc(&attr, NV_UNIQUE_NAME, KM_SLEEP) == 0);
6308 6333 VERIFY(nvlist_add_string(attr, DEV_PHYS_PATH, physpath) == 0);
6309 6334
6310 6335 (void) ddi_log_sysevent(zfs_dip, SUNW_VENDOR, EC_DEV_STATUS,
6311 6336 ESC_DEV_DLE, attr, &eid, DDI_SLEEP);
6312 6337
6313 6338 nvlist_free(attr);
6314 6339 kmem_free(physpath, MAXPATHLEN);
6315 6340 }
6316 6341
6317 6342 static void
6318 6343 spa_async_thread(void *arg)
6319 6344 {
6320 6345 spa_t *spa = (spa_t *)arg;
6321 6346 int tasks;
6322 6347
6323 6348 ASSERT(spa->spa_sync_on);
6324 6349
6325 6350 mutex_enter(&spa->spa_async_lock);
6326 6351 tasks = spa->spa_async_tasks;
6327 6352 spa->spa_async_tasks = 0;
6328 6353 mutex_exit(&spa->spa_async_lock);
6329 6354
6330 6355 /*
6331 6356 * See if the config needs to be updated.
6332 6357 */
6333 6358 if (tasks & SPA_ASYNC_CONFIG_UPDATE) {
6334 6359 uint64_t old_space, new_space;
6335 6360
6336 6361 mutex_enter(&spa_namespace_lock);
6337 6362 old_space = metaslab_class_get_space(spa_normal_class(spa));
6338 6363 spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
6339 6364 new_space = metaslab_class_get_space(spa_normal_class(spa));
6340 6365 mutex_exit(&spa_namespace_lock);
6341 6366
6342 6367 /*
6343 6368 * If the pool grew as a result of the config update,
6344 6369 * then log an internal history event.
6345 6370 */
6346 6371 if (new_space != old_space) {
6347 6372 spa_history_log_internal(spa, "vdev online", NULL,
6348 6373 "pool '%s' size: %llu(+%llu)",
6349 6374 spa_name(spa), new_space, new_space - old_space);
6350 6375 }
6351 6376 }
6352 6377
6353 6378 /*
6354 6379 * See if any devices need to be marked REMOVED.
6355 6380 */
6356 6381 if (tasks & SPA_ASYNC_REMOVE) {
6357 6382 spa_vdev_state_enter(spa, SCL_NONE);
6358 6383 spa_async_remove(spa, spa->spa_root_vdev);
6359 6384 for (int i = 0; i < spa->spa_l2cache.sav_count; i++)
6360 6385 spa_async_remove(spa, spa->spa_l2cache.sav_vdevs[i]);
6361 6386 for (int i = 0; i < spa->spa_spares.sav_count; i++)
6362 6387 spa_async_remove(spa, spa->spa_spares.sav_vdevs[i]);
6363 6388 (void) spa_vdev_state_exit(spa, NULL, 0);
6364 6389 }
6365 6390
6366 6391 if ((tasks & SPA_ASYNC_AUTOEXPAND) && !spa_suspended(spa)) {
6367 6392 spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
6368 6393 spa_async_autoexpand(spa, spa->spa_root_vdev);
6369 6394 spa_config_exit(spa, SCL_CONFIG, FTAG);
6370 6395 }
6371 6396
6372 6397 /*
6373 6398 * See if any devices need to be probed.
6374 6399 */
6375 6400 if (tasks & SPA_ASYNC_PROBE) {
6376 6401 spa_vdev_state_enter(spa, SCL_NONE);
6377 6402 spa_async_probe(spa, spa->spa_root_vdev);
6378 6403 (void) spa_vdev_state_exit(spa, NULL, 0);
6379 6404 }
6380 6405
6381 6406 /*
6382 6407 * If any devices are done replacing, detach them.
6383 6408 */
(176 lines elided)
6384 6409 if (tasks & SPA_ASYNC_RESILVER_DONE)
6385 6410 spa_vdev_resilver_done(spa);
6386 6411
6387 6412 /*
6388 6413 * Kick off a resilver.
6389 6414 */
6390 6415 if (tasks & SPA_ASYNC_RESILVER)
6391 6416 dsl_resilver_restart(spa->spa_dsl_pool, 0);
6392 6417
6393 6418 /*
6419 + * Kick off L2 cache rebuilding.
6420 + */
6421 + if (tasks & SPA_ASYNC_L2CACHE_REBUILD)
6422 + l2arc_spa_rebuild_start(spa);
6423 +
6424 + if (tasks & SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY) {
6425 + mutex_enter(&spa->spa_man_trim_lock);
6426 + spa_man_trim_taskq_destroy(spa);
6427 + mutex_exit(&spa->spa_man_trim_lock);
6428 + }
6429 +
6430 + /*
6394 6431 * Let the world know that we're done.
6395 6432 */
6396 6433 mutex_enter(&spa->spa_async_lock);
6397 6434 spa->spa_async_thread = NULL;
6398 6435 cv_broadcast(&spa->spa_async_cv);
6399 6436 mutex_exit(&spa->spa_async_lock);
6400 6437 thread_exit();
6401 6438 }
6402 6439
6403 6440 void
6404 6441 spa_async_suspend(spa_t *spa)
6405 6442 {
6406 6443 mutex_enter(&spa->spa_async_lock);
6407 6444 spa->spa_async_suspended++;
6408 6445 while (spa->spa_async_thread != NULL)
6409 6446 cv_wait(&spa->spa_async_cv, &spa->spa_async_lock);
6410 6447 mutex_exit(&spa->spa_async_lock);
6411 -
6412 - spa_vdev_remove_suspend(spa);
6413 -
6414 - zthr_t *condense_thread = spa->spa_condense_zthr;
6415 - if (condense_thread != NULL && zthr_isrunning(condense_thread))
6416 - VERIFY0(zthr_cancel(condense_thread));
6417 6448 }
6418 6449
6419 6450 void
6420 6451 spa_async_resume(spa_t *spa)
6421 6452 {
6422 6453 mutex_enter(&spa->spa_async_lock);
6423 6454 ASSERT(spa->spa_async_suspended != 0);
6424 6455 spa->spa_async_suspended--;
6425 6456 mutex_exit(&spa->spa_async_lock);
6426 - spa_restart_removal(spa);
6427 -
6428 - zthr_t *condense_thread = spa->spa_condense_zthr;
6429 - if (condense_thread != NULL && !zthr_isrunning(condense_thread))
6430 - zthr_resume(condense_thread);
6431 6457 }
6432 6458
6433 6459 static boolean_t
6434 6460 spa_async_tasks_pending(spa_t *spa)
6435 6461 {
6436 6462 uint_t non_config_tasks;
6437 6463 uint_t config_task;
6438 6464 boolean_t config_task_suspended;
6439 6465
6440 6466 non_config_tasks = spa->spa_async_tasks & ~SPA_ASYNC_CONFIG_UPDATE;
6441 6467 config_task = spa->spa_async_tasks & SPA_ASYNC_CONFIG_UPDATE;
6442 6468 if (spa->spa_ccw_fail_time == 0) {
6443 6469 config_task_suspended = B_FALSE;
6444 6470 } else {
6445 6471 config_task_suspended =
6446 6472 (gethrtime() - spa->spa_ccw_fail_time) <
6447 6473 (zfs_ccw_retry_interval * NANOSEC);
6448 6474 }
6449 6475
6450 6476 return (non_config_tasks || (config_task && !config_task_suspended));
6451 6477 }
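For readers following the throttling logic: the config-update backoff in spa_async_tasks_pending() is a simple elapsed-time test against zfs_ccw_retry_interval. A standalone illustration of that test, not part of the patch; the 300-second interval used here is only an assumption for the example, the real default is the zfs_ccw_retry_interval tunable in this file:

#include <stdio.h>
#include <stdint.h>

#define	NANOSEC	1000000000LL

/*
 * Same shape as the spa_async_tasks_pending() check: a failed config
 * cache write is retried only after the interval has elapsed.
 */
static int
config_task_suspended(int64_t now, int64_t ccw_fail_time,
    int64_t retry_interval_sec)
{
	if (ccw_fail_time == 0)
		return (0);		/* never failed: not suspended */
	return ((now - ccw_fail_time) < retry_interval_sec * NANOSEC);
}

int
main(void)
{
	int64_t fail = 100LL * NANOSEC;		/* failed at t = 100 s */

	printf("at 150 s: %s\n", config_task_suspended(150LL * NANOSEC,
	    fail, 300) ? "suspended" : "runnable");
	printf("at 500 s: %s\n", config_task_suspended(500LL * NANOSEC,
	    fail, 300) ? "suspended" : "runnable");
	return (0);
}

With a 300 s interval the retry is still suspended at 150 s (only 50 s elapsed) and becomes runnable again at 500 s.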
6452 6478
6453 6479 static void
6454 6480 spa_async_dispatch(spa_t *spa)
6455 6481 {
6456 6482 mutex_enter(&spa->spa_async_lock);
6457 6483 if (spa_async_tasks_pending(spa) &&
6458 6484 !spa->spa_async_suspended &&
6459 6485 spa->spa_async_thread == NULL &&
6460 6486 rootdir != NULL)
6461 6487 spa->spa_async_thread = thread_create(NULL, 0,
6462 6488 spa_async_thread, spa, 0, &p0, TS_RUN, maxclsyspri);
6463 6489 mutex_exit(&spa->spa_async_lock);
6464 6490 }
(24 lines elided)
6465 6491
6466 6492 void
6467 6493 spa_async_request(spa_t *spa, int task)
6468 6494 {
6469 6495 zfs_dbgmsg("spa=%s async request task=%u", spa->spa_name, task);
6470 6496 mutex_enter(&spa->spa_async_lock);
6471 6497 spa->spa_async_tasks |= task;
6472 6498 mutex_exit(&spa->spa_async_lock);
6473 6499 }
6474 6500
6501 +void
6502 +spa_async_unrequest(spa_t *spa, int task)
6503 +{
6504 + zfs_dbgmsg("spa=%s async unrequest task=%u", spa->spa_name, task);
6505 + mutex_enter(&spa->spa_async_lock);
6506 + spa->spa_async_tasks &= ~task;
6507 + mutex_exit(&spa->spa_async_lock);
6508 +}
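For context on the new spa_async_unrequest(): spa_async_tasks is just a bitmask of pending work protected by spa_async_lock; spa_async_request() sets bits, spa_async_unrequest() withdraws them, and the dispatcher starts a single worker thread only when bits are pending and the machinery is not suspended. A minimal user-level sketch of the same pattern (POSIX threads; all names here are illustrative and not part of this patch):

#include <pthread.h>
#include <stdio.h>

#define	TASK_CONFIG_UPDATE	0x01
#define	TASK_REMOVE		0x02
#define	TASK_RESILVER		0x04

static pthread_mutex_t async_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned async_tasks;		/* pending-work bitmask */
static int async_suspended;		/* suspend/resume refcount */

static void
async_request(unsigned task)
{
	pthread_mutex_lock(&async_lock);
	async_tasks |= task;		/* mark work as pending */
	pthread_mutex_unlock(&async_lock);
}

static void
async_unrequest(unsigned task)
{
	pthread_mutex_lock(&async_lock);
	async_tasks &= ~task;		/* withdraw previously requested work */
	pthread_mutex_unlock(&async_lock);
}

static void *
async_thread(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&async_lock);
	unsigned tasks = async_tasks;	/* snapshot the bitmask, then clear it */
	async_tasks = 0;
	pthread_mutex_unlock(&async_lock);

	if (tasks & TASK_RESILVER)
		printf("would kick off a resilver here\n");
	return (NULL);
}

int
main(void)
{
	async_request(TASK_RESILVER | TASK_REMOVE);
	async_unrequest(TASK_REMOVE);	/* changed our mind about one task */

	pthread_t tid;
	pthread_mutex_lock(&async_lock);
	int pending = (async_tasks != 0 && async_suspended == 0);
	pthread_mutex_unlock(&async_lock);
	if (pending) {
		pthread_create(&tid, NULL, async_thread, NULL);
		pthread_join(tid, NULL);
	}
	return (0);
}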
6509 +
6475 6510 /*
6476 6511 * ==========================================================================
6477 6512 * SPA syncing routines
6478 6513 * ==========================================================================
6479 6514 */
6480 6515
6481 6516 static int
6482 6517 bpobj_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6483 6518 {
6484 6519 bpobj_t *bpo = arg;
6485 6520 bpobj_enqueue(bpo, bp, tx);
6486 6521 return (0);
6487 6522 }
6488 6523
6489 6524 static int
6490 6525 spa_free_sync_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
6491 6526 {
6492 6527 zio_t *zio = arg;
6493 6528
6494 6529 zio_nowait(zio_free_sync(zio, zio->io_spa, dmu_tx_get_txg(tx), bp,
6495 6530 zio->io_flags));
6496 6531 return (0);
6497 6532 }
6498 6533
6499 6534 /*
6500 6535 * Note: this simple function is not inlined to make it easier to dtrace the
6501 6536 * amount of time spent syncing frees.
6502 6537 */
6503 6538 static void
6504 6539 spa_sync_frees(spa_t *spa, bplist_t *bpl, dmu_tx_t *tx)
6505 6540 {
6506 6541 zio_t *zio = zio_root(spa, NULL, NULL, 0);
6507 6542 bplist_iterate(bpl, spa_free_sync_cb, zio, tx);
6508 6543 VERIFY(zio_wait(zio) == 0);
6509 6544 }
6510 6545
6511 6546 /*
6512 6547 * Note: this simple function is not inlined to make it easier to dtrace the
6513 6548 * amount of time spent syncing deferred frees.
6514 6549 */
6515 6550 static void
6516 6551 spa_sync_deferred_frees(spa_t *spa, dmu_tx_t *tx)
6517 6552 {
6518 6553 zio_t *zio = zio_root(spa, NULL, NULL, 0);
6519 6554 VERIFY3U(bpobj_iterate(&spa->spa_deferred_bpobj,
6520 6555 spa_free_sync_cb, zio, tx), ==, 0);
6521 6556 VERIFY0(zio_wait(zio));
6522 6557 }
6523 6558
6524 6559
6525 6560 static void
6526 6561 spa_sync_nvlist(spa_t *spa, uint64_t obj, nvlist_t *nv, dmu_tx_t *tx)
6527 6562 {
6528 6563 char *packed = NULL;
6529 6564 size_t bufsize;
6530 6565 size_t nvsize = 0;
6531 6566 dmu_buf_t *db;
6532 6567
6533 6568 VERIFY(nvlist_size(nv, &nvsize, NV_ENCODE_XDR) == 0);
6534 6569
6535 6570 /*
6536 6571 * Write full (SPA_CONFIG_BLOCKSIZE) blocks of configuration
6537 6572 * information. This avoids the dmu_buf_will_dirty() path and
6538 6573 * saves us a pre-read to get data we don't actually care about.
6539 6574 */
6540 6575 bufsize = P2ROUNDUP((uint64_t)nvsize, SPA_CONFIG_BLOCKSIZE);
6541 6576 packed = kmem_alloc(bufsize, KM_SLEEP);
6542 6577
6543 6578 VERIFY(nvlist_pack(nv, &packed, &nvsize, NV_ENCODE_XDR,
6544 6579 KM_SLEEP) == 0);
6545 6580 bzero(packed + nvsize, bufsize - nvsize);
6546 6581
6547 6582 dmu_write(spa->spa_meta_objset, obj, 0, bufsize, packed, tx);
6548 6583
6549 6584 kmem_free(packed, bufsize);
6550 6585
6551 6586 VERIFY(0 == dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db));
6552 6587 dmu_buf_will_dirty(db, tx);
6553 6588 *(uint64_t *)db->db_data = nvsize;
6554 6589 dmu_buf_rele(db, FTAG);
6555 6590 }
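The rounding in spa_sync_nvlist() pads the packed nvlist out to whole SPA_CONFIG_BLOCKSIZE blocks so dmu_write() never has to read-modify-write a partial block. A standalone illustration of the arithmetic (the macro has the same shape as the kernel's P2ROUNDUP(); the block size is a stand-in value, the real definitions live in the kernel headers):

#include <stdio.h>
#include <stdint.h>

/* Same shape as the kernel's P2ROUNDUP(); 'align' must be a power of two. */
#define	P2ROUNDUP(x, align)	(-(-(x) & -(align)))

int
main(void)
{
	uint64_t blocksize = 1ULL << 14;	/* stand-in for SPA_CONFIG_BLOCKSIZE */
	uint64_t nvsize = 37000;		/* packed nvlist size in bytes */
	uint64_t bufsize = P2ROUNDUP(nvsize, blocksize);

	/* 37000 rounds up to 49152 (3 x 16 KiB); the tail is zero-filled. */
	printf("nvsize=%llu bufsize=%llu pad=%llu\n",
	    (unsigned long long)nvsize, (unsigned long long)bufsize,
	    (unsigned long long)(bufsize - nvsize));
	return (0);
}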
6556 6591
6557 6592 static void
6558 6593 spa_sync_aux_dev(spa_t *spa, spa_aux_vdev_t *sav, dmu_tx_t *tx,
6559 6594 const char *config, const char *entry)
6560 6595 {
6561 6596 nvlist_t *nvroot;
6562 6597 nvlist_t **list;
6563 6598 int i;
6564 6599
6565 6600 if (!sav->sav_sync)
6566 6601 return;
6567 6602
6568 6603 /*
6569 6604 * Update the MOS nvlist describing the list of available devices.
6570 6605 * spa_validate_aux() will have already made sure this nvlist is
6571 6606 * valid and the vdevs are labeled appropriately.
6572 6607 */
6573 6608 if (sav->sav_object == 0) {
6574 6609 sav->sav_object = dmu_object_alloc(spa->spa_meta_objset,
6575 6610 DMU_OT_PACKED_NVLIST, 1 << 14, DMU_OT_PACKED_NVLIST_SIZE,
6576 6611 sizeof (uint64_t), tx);
6577 6612 VERIFY(zap_update(spa->spa_meta_objset,
6578 6613 DMU_POOL_DIRECTORY_OBJECT, entry, sizeof (uint64_t), 1,
6579 6614 &sav->sav_object, tx) == 0);
6580 6615 }
6581 6616
6582 6617 VERIFY(nvlist_alloc(&nvroot, NV_UNIQUE_NAME, KM_SLEEP) == 0);
6583 6618 if (sav->sav_count == 0) {
6584 6619 VERIFY(nvlist_add_nvlist_array(nvroot, config, NULL, 0) == 0);
6585 6620 } else {
6586 6621 list = kmem_alloc(sav->sav_count * sizeof (void *), KM_SLEEP);
6587 6622 for (i = 0; i < sav->sav_count; i++)
6588 6623 list[i] = vdev_config_generate(spa, sav->sav_vdevs[i],
6589 6624 B_FALSE, VDEV_CONFIG_L2CACHE);
6590 6625 VERIFY(nvlist_add_nvlist_array(nvroot, config, list,
6591 6626 sav->sav_count) == 0);
6592 6627 for (i = 0; i < sav->sav_count; i++)
6593 6628 nvlist_free(list[i]);
6594 6629 kmem_free(list, sav->sav_count * sizeof (void *));
6595 6630 }
6596 6631
6597 6632 spa_sync_nvlist(spa, sav->sav_object, nvroot, tx);
6598 6633 nvlist_free(nvroot);
6599 6634
6600 6635 sav->sav_sync = B_FALSE;
6601 6636 }
6602 6637
6603 6638 /*
6604 6639 * Rebuild spa's all-vdev ZAP from the vdev ZAPs indicated in each vdev_t.
6605 6640 * The all-vdev ZAP must be empty.
6606 6641 */
6607 6642 static void
6608 6643 spa_avz_build(vdev_t *vd, uint64_t avz, dmu_tx_t *tx)
6609 6644 {
6610 6645 spa_t *spa = vd->vdev_spa;
6611 6646 if (vd->vdev_top_zap != 0) {
6612 6647 VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
6613 6648 vd->vdev_top_zap, tx));
6614 6649 }
6615 6650 if (vd->vdev_leaf_zap != 0) {
6616 6651 VERIFY0(zap_add_int(spa->spa_meta_objset, avz,
6617 6652 vd->vdev_leaf_zap, tx));
6618 6653 }
6619 6654 for (uint64_t i = 0; i < vd->vdev_children; i++) {
6620 6655 spa_avz_build(vd->vdev_child[i], avz, tx);
6621 6656 }
6622 6657 }
6623 6658
6624 6659 static void
6625 6660 spa_sync_config_object(spa_t *spa, dmu_tx_t *tx)
6626 6661 {
6627 6662 nvlist_t *config;
6628 6663
6629 6664 /*
6630 6665 * If the pool is being imported from a pre-per-vdev-ZAP version of ZFS,
6631 6666 * its config may not be dirty but we still need to build per-vdev ZAPs.
6632 6667 * Similarly, if the pool is being assembled (e.g. after a split), we
6633 6668 * need to rebuild the AVZ although the config may not be dirty.
6634 6669 */
6635 6670 if (list_is_empty(&spa->spa_config_dirty_list) &&
6636 6671 spa->spa_avz_action == AVZ_ACTION_NONE)
6637 6672 return;
6638 6673
6639 6674 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
6640 6675
6641 6676 ASSERT(spa->spa_avz_action == AVZ_ACTION_NONE ||
6642 6677 spa->spa_avz_action == AVZ_ACTION_INITIALIZE ||
6643 6678 spa->spa_all_vdev_zaps != 0);
6644 6679
6645 6680 if (spa->spa_avz_action == AVZ_ACTION_REBUILD) {
6646 6681 /* Make and build the new AVZ */
6647 6682 uint64_t new_avz = zap_create(spa->spa_meta_objset,
6648 6683 DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx);
6649 6684 spa_avz_build(spa->spa_root_vdev, new_avz, tx);
6650 6685
6651 6686 /* Diff old AVZ with new one */
6652 6687 zap_cursor_t zc;
6653 6688 zap_attribute_t za;
6654 6689
6655 6690 for (zap_cursor_init(&zc, spa->spa_meta_objset,
6656 6691 spa->spa_all_vdev_zaps);
6657 6692 zap_cursor_retrieve(&zc, &za) == 0;
6658 6693 zap_cursor_advance(&zc)) {
6659 6694 uint64_t vdzap = za.za_first_integer;
6660 6695 if (zap_lookup_int(spa->spa_meta_objset, new_avz,
6661 6696 vdzap) == ENOENT) {
6662 6697 /*
6663 6698 * ZAP is listed in old AVZ but not in new one;
6664 6699 * destroy it
6665 6700 */
6666 6701 VERIFY0(zap_destroy(spa->spa_meta_objset, vdzap,
6667 6702 tx));
6668 6703 }
6669 6704 }
6670 6705
6671 6706 zap_cursor_fini(&zc);
6672 6707
6673 6708 /* Destroy the old AVZ */
6674 6709 VERIFY0(zap_destroy(spa->spa_meta_objset,
6675 6710 spa->spa_all_vdev_zaps, tx));
6676 6711
6677 6712 /* Replace the old AVZ in the dir obj with the new one */
6678 6713 VERIFY0(zap_update(spa->spa_meta_objset,
6679 6714 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP,
6680 6715 sizeof (new_avz), 1, &new_avz, tx));
6681 6716
6682 6717 spa->spa_all_vdev_zaps = new_avz;
6683 6718 } else if (spa->spa_avz_action == AVZ_ACTION_DESTROY) {
6684 6719 zap_cursor_t zc;
6685 6720 zap_attribute_t za;
6686 6721
6687 6722 /* Walk through the AVZ and destroy all listed ZAPs */
6688 6723 for (zap_cursor_init(&zc, spa->spa_meta_objset,
6689 6724 spa->spa_all_vdev_zaps);
6690 6725 zap_cursor_retrieve(&zc, &za) == 0;
6691 6726 zap_cursor_advance(&zc)) {
6692 6727 uint64_t zap = za.za_first_integer;
6693 6728 VERIFY0(zap_destroy(spa->spa_meta_objset, zap, tx));
6694 6729 }
6695 6730
6696 6731 zap_cursor_fini(&zc);
6697 6732
6698 6733 /* Destroy and unlink the AVZ itself */
6699 6734 VERIFY0(zap_destroy(spa->spa_meta_objset,
6700 6735 spa->spa_all_vdev_zaps, tx));
6701 6736 VERIFY0(zap_remove(spa->spa_meta_objset,
6702 6737 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_VDEV_ZAP_MAP, tx));
6703 6738 spa->spa_all_vdev_zaps = 0;
6704 6739 }
6705 6740
6706 6741 if (spa->spa_all_vdev_zaps == 0) {
6707 6742 spa->spa_all_vdev_zaps = zap_create_link(spa->spa_meta_objset,
6708 6743 DMU_OTN_ZAP_METADATA, DMU_POOL_DIRECTORY_OBJECT,
6709 6744 DMU_POOL_VDEV_ZAP_MAP, tx);
6710 6745 }
6711 6746 spa->spa_avz_action = AVZ_ACTION_NONE;
6712 6747
6713 6748 /* Create ZAPs for vdevs that don't have them. */
6714 6749 vdev_construct_zaps(spa->spa_root_vdev, tx);
6715 6750
6716 6751 config = spa_config_generate(spa, spa->spa_root_vdev,
6717 6752 dmu_tx_get_txg(tx), B_FALSE);
6718 6753
6719 6754 /*
6720 6755 * If we're upgrading the spa version then make sure that
6721 6756 * the config object gets updated with the correct version.
6722 6757 */
6723 6758 if (spa->spa_ubsync.ub_version < spa->spa_uberblock.ub_version)
6724 6759 fnvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
6725 6760 spa->spa_uberblock.ub_version);
6726 6761
6727 6762 spa_config_exit(spa, SCL_STATE, FTAG);
6728 6763
6729 6764 nvlist_free(spa->spa_config_syncing);
6730 6765 spa->spa_config_syncing = config;
6731 6766
6732 6767 spa_sync_nvlist(spa, spa->spa_config_object, config, tx);
6733 6768 }
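The AVZ_ACTION_REBUILD branch above is essentially a set difference: every per-vdev ZAP listed in the old all-vdev ZAP but absent from the freshly built one is destroyed. A standalone sketch of that diff step over plain arrays (illustrative only; the real code walks ZAP cursors and calls zap_destroy()):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x < y ? -1 : x > y ? 1 : 0);
}

int
main(void)
{
	/* Object IDs listed in the old and newly rebuilt all-vdev ZAPs. */
	uint64_t old_avz[] = { 101, 102, 103, 104 };
	uint64_t new_avz[] = { 101, 103 };	/* 102 and 104 went away */
	size_t nold = sizeof (old_avz) / sizeof (old_avz[0]);
	size_t nnew = sizeof (new_avz) / sizeof (new_avz[0]);

	qsort(new_avz, nnew, sizeof (uint64_t), cmp_u64);

	for (size_t i = 0; i < nold; i++) {
		if (bsearch(&old_avz[i], new_avz, nnew, sizeof (uint64_t),
		    cmp_u64) == NULL) {
			/* In the real code: VERIFY0(zap_destroy(...)). */
			printf("stale per-vdev ZAP %llu would be destroyed\n",
			    (unsigned long long)old_avz[i]);
		}
	}
	return (0);
}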
6734 6769
6735 6770 static void
6736 6771 spa_sync_version(void *arg, dmu_tx_t *tx)
6737 6772 {
6738 6773 uint64_t *versionp = arg;
6739 6774 uint64_t version = *versionp;
6740 6775 spa_t *spa = dmu_tx_pool(tx)->dp_spa;
6741 6776
6742 6777 /*
6743 6778 * Setting the version is special cased when first creating the pool.
6744 6779 */
6745 6780 ASSERT(tx->tx_txg != TXG_INITIAL);
6746 6781
6747 6782 ASSERT(SPA_VERSION_IS_SUPPORTED(version));
6748 6783 ASSERT(version >= spa_version(spa));
6749 6784
6750 6785 spa->spa_uberblock.ub_version = version;
6751 6786 vdev_config_dirty(spa->spa_root_vdev);
6752 6787 spa_history_log_internal(spa, "set", tx, "version=%lld", version);
(268 lines elided)
6753 6788 }
6754 6789
6755 6790 /*
6756 6791 * Set zpool properties.
6757 6792 */
6758 6793 static void
6759 6794 spa_sync_props(void *arg, dmu_tx_t *tx)
6760 6795 {
6761 6796 nvlist_t *nvp = arg;
6762 6797 spa_t *spa = dmu_tx_pool(tx)->dp_spa;
6798 + spa_meta_placement_t *mp = &spa->spa_meta_policy;
6763 6799 objset_t *mos = spa->spa_meta_objset;
6764 6800 nvpair_t *elem = NULL;
6765 6801
6766 6802 mutex_enter(&spa->spa_props_lock);
6767 6803
6768 6804 while ((elem = nvlist_next_nvpair(nvp, elem))) {
6769 6805 uint64_t intval;
6770 6806 char *strval, *fname;
6771 6807 zpool_prop_t prop;
6772 6808 const char *propname;
6773 6809 zprop_type_t proptype;
6774 6810 spa_feature_t fid;
6775 6811
6776 6812 switch (prop = zpool_name_to_prop(nvpair_name(elem))) {
6777 - case ZPOOL_PROP_INVAL:
6813 + case ZPROP_INVAL:
6778 6814 /*
6779 6815 * We checked this earlier in spa_prop_validate().
6780 6816 */
6781 6817 ASSERT(zpool_prop_feature(nvpair_name(elem)));
6782 6818
6783 6819 fname = strchr(nvpair_name(elem), '@') + 1;
6784 6820 VERIFY0(zfeature_lookup_name(fname, &fid));
6785 6821
6786 6822 spa_feature_enable(spa, fid, tx);
6787 6823 spa_history_log_internal(spa, "set", tx,
6788 6824 "%s=enabled", nvpair_name(elem));
6789 6825 break;
6790 6826
6791 6827 case ZPOOL_PROP_VERSION:
6792 6828 intval = fnvpair_value_uint64(elem);
6793 6829 /*
6794 6830 			 * The version is synced separately before other
6795 6831 * properties and should be correct by now.
6796 6832 */
6797 6833 ASSERT3U(spa_version(spa), >=, intval);
6798 6834 break;
6799 6835
6800 6836 case ZPOOL_PROP_ALTROOT:
6801 6837 /*
6802 6838 * 'altroot' is a non-persistent property. It should
6803 6839 * have been set temporarily at creation or import time.
6804 6840 */
6805 6841 ASSERT(spa->spa_root != NULL);
6806 6842 break;
6807 6843
6808 6844 case ZPOOL_PROP_READONLY:
6809 6845 case ZPOOL_PROP_CACHEFILE:
6810 6846 /*
6811 6847 			 * 'readonly' and 'cachefile' are also non-persistent
6812 6848 * properties.
6813 6849 */
6814 6850 break;
6815 6851 case ZPOOL_PROP_COMMENT:
6816 6852 strval = fnvpair_value_string(elem);
6817 6853 if (spa->spa_comment != NULL)
6818 6854 spa_strfree(spa->spa_comment);
6819 6855 spa->spa_comment = spa_strdup(strval);
6820 6856 /*
6821 6857 * We need to dirty the configuration on all the vdevs
6822 6858 * so that their labels get updated. It's unnecessary
6823 6859 * to do this for pool creation since the vdev's
6824 6860 			 * configuration has already been dirtied.
6825 6861 */
6826 6862 if (tx->tx_txg != TXG_INITIAL)
6827 6863 vdev_config_dirty(spa->spa_root_vdev);
6828 6864 spa_history_log_internal(spa, "set", tx,
6829 6865 "%s=%s", nvpair_name(elem), strval);
6830 6866 break;
6831 6867 default:
6832 6868 /*
6833 6869 * Set pool property values in the poolprops mos object.
6834 6870 */
6835 6871 if (spa->spa_pool_props_object == 0) {
6836 6872 spa->spa_pool_props_object =
6837 6873 zap_create_link(mos, DMU_OT_POOL_PROPS,
6838 6874 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_PROPS,
6839 6875 tx);
6840 6876 }
6841 6877
6842 6878 /* normalize the property name */
6843 6879 propname = zpool_prop_to_name(prop);
6844 6880 proptype = zpool_prop_get_type(prop);
6845 6881
6846 6882 if (nvpair_type(elem) == DATA_TYPE_STRING) {
6847 6883 ASSERT(proptype == PROP_TYPE_STRING);
6848 6884 strval = fnvpair_value_string(elem);
6849 6885 VERIFY0(zap_update(mos,
6850 6886 spa->spa_pool_props_object, propname,
6851 6887 1, strlen(strval) + 1, strval, tx));
6852 6888 spa_history_log_internal(spa, "set", tx,
6853 6889 "%s=%s", nvpair_name(elem), strval);
6854 6890 } else if (nvpair_type(elem) == DATA_TYPE_UINT64) {
6855 6891 intval = fnvpair_value_uint64(elem);
6856 6892
6857 6893 if (proptype == PROP_TYPE_INDEX) {
6858 6894 const char *unused;
6859 6895 VERIFY0(zpool_prop_index_to_string(
6860 6896 prop, intval, &unused));
6861 6897 }
6862 6898 VERIFY0(zap_update(mos,
6863 6899 spa->spa_pool_props_object, propname,
6864 6900 8, 1, &intval, tx));
(77 lines elided)
6865 6901 spa_history_log_internal(spa, "set", tx,
6866 6902 "%s=%lld", nvpair_name(elem), intval);
6867 6903 } else {
6868 6904 ASSERT(0); /* not allowed */
6869 6905 }
6870 6906
6871 6907 switch (prop) {
6872 6908 case ZPOOL_PROP_DELEGATION:
6873 6909 spa->spa_delegation = intval;
6874 6910 break;
6911 + case ZPOOL_PROP_DDT_DESEGREGATION:
6912 + spa_set_ddt_classes(spa, intval);
6913 + break;
6914 + case ZPOOL_PROP_DEDUP_BEST_EFFORT:
6915 + spa->spa_dedup_best_effort = intval;
6916 + break;
6917 + case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT:
6918 + spa->spa_dedup_lo_best_effort = intval;
6919 + break;
6920 + case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT:
6921 + spa->spa_dedup_hi_best_effort = intval;
6922 + break;
6875 6923 case ZPOOL_PROP_BOOTFS:
6876 6924 spa->spa_bootfs = intval;
6877 6925 break;
6878 6926 case ZPOOL_PROP_FAILUREMODE:
6879 6927 spa->spa_failmode = intval;
6880 6928 break;
6929 + case ZPOOL_PROP_FORCETRIM:
6930 + spa->spa_force_trim = intval;
6931 + break;
6932 + case ZPOOL_PROP_AUTOTRIM:
6933 + mutex_enter(&spa->spa_auto_trim_lock);
6934 + if (intval != spa->spa_auto_trim) {
6935 + spa->spa_auto_trim = intval;
6936 + if (intval != 0)
6937 + spa_auto_trim_taskq_create(spa);
6938 + else
6939 + spa_auto_trim_taskq_destroy(
6940 + spa);
6941 + }
6942 + mutex_exit(&spa->spa_auto_trim_lock);
6943 + break;
6881 6944 case ZPOOL_PROP_AUTOEXPAND:
6882 6945 spa->spa_autoexpand = intval;
6883 6946 if (tx->tx_txg != TXG_INITIAL)
6884 6947 spa_async_request(spa,
6885 6948 SPA_ASYNC_AUTOEXPAND);
6886 6949 break;
6887 6950 case ZPOOL_PROP_DEDUPDITTO:
6888 6951 spa->spa_dedup_ditto = intval;
6889 6952 break;
6953 + case ZPOOL_PROP_MINWATERMARK:
6954 + spa->spa_minwat = intval;
6955 + break;
6956 + case ZPOOL_PROP_LOWATERMARK:
6957 + spa->spa_lowat = intval;
6958 + break;
6959 + case ZPOOL_PROP_HIWATERMARK:
6960 + spa->spa_hiwat = intval;
6961 + break;
6962 + case ZPOOL_PROP_DEDUPMETA_DITTO:
6963 + spa->spa_ddt_meta_copies = intval;
6964 + break;
6965 + case ZPOOL_PROP_META_PLACEMENT:
6966 + mp->spa_enable_meta_placement_selection =
6967 + intval;
6968 + break;
6969 + case ZPOOL_PROP_SYNC_TO_SPECIAL:
6970 + mp->spa_sync_to_special = intval;
6971 + break;
6972 + case ZPOOL_PROP_DDT_META_TO_METADEV:
6973 + mp->spa_ddt_meta_to_special = intval;
6974 + break;
6975 + case ZPOOL_PROP_ZFS_META_TO_METADEV:
6976 + mp->spa_zfs_meta_to_special = intval;
6977 + break;
6978 + case ZPOOL_PROP_SMALL_DATA_TO_METADEV:
6979 + mp->spa_small_data_to_special = intval;
6980 + break;
6981 + case ZPOOL_PROP_RESILVER_PRIO:
6982 + spa->spa_resilver_prio = intval;
6983 + break;
6984 + case ZPOOL_PROP_SCRUB_PRIO:
6985 + spa->spa_scrub_prio = intval;
6986 + break;
6890 6987 default:
6891 6988 break;
6892 6989 }
6893 6990 }
6894 6991
6895 6992 }
6896 6993
6897 6994 mutex_exit(&spa->spa_props_lock);
6898 6995 }
6899 6996
6900 6997 /*
6901 6998 * Perform one-time upgrade on-disk changes. spa_version() does not
6902 6999 * reflect the new version this txg, so there must be no changes this
6903 7000 * txg to anything that the upgrade code depends on after it executes.
6904 7001 * Therefore this must be called after dsl_pool_sync() does the sync
6905 7002 * tasks.
6906 7003 */
6907 7004 static void
6908 7005 spa_sync_upgrades(spa_t *spa, dmu_tx_t *tx)
6909 7006 {
6910 7007 dsl_pool_t *dp = spa->spa_dsl_pool;
6911 7008
6912 7009 ASSERT(spa->spa_sync_pass == 1);
6913 7010
6914 7011 rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);
6915 7012
6916 7013 if (spa->spa_ubsync.ub_version < SPA_VERSION_ORIGIN &&
6917 7014 spa->spa_uberblock.ub_version >= SPA_VERSION_ORIGIN) {
6918 7015 dsl_pool_create_origin(dp, tx);
6919 7016
6920 7017 /* Keeping the origin open increases spa_minref */
6921 7018 spa->spa_minref += 3;
6922 7019 }
6923 7020
6924 7021 if (spa->spa_ubsync.ub_version < SPA_VERSION_NEXT_CLONES &&
6925 7022 spa->spa_uberblock.ub_version >= SPA_VERSION_NEXT_CLONES) {
6926 7023 dsl_pool_upgrade_clones(dp, tx);
6927 7024 }
6928 7025
6929 7026 if (spa->spa_ubsync.ub_version < SPA_VERSION_DIR_CLONES &&
6930 7027 spa->spa_uberblock.ub_version >= SPA_VERSION_DIR_CLONES) {
6931 7028 dsl_pool_upgrade_dir_clones(dp, tx);
6932 7029
6933 7030 /* Keeping the freedir open increases spa_minref */
6934 7031 spa->spa_minref += 3;
6935 7032 }
6936 7033
6937 7034 if (spa->spa_ubsync.ub_version < SPA_VERSION_FEATURES &&
6938 7035 spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
6939 7036 spa_feature_create_zap_objects(spa, tx);
6940 7037 }
6941 7038
6942 7039 /*
6943 7040 * LZ4_COMPRESS feature's behaviour was changed to activate_on_enable
6944 7041 	 * when the ability to use lz4 compression for metadata was added.
6945 7042 	 * Old pools that have this feature enabled must be upgraded to have
6946 7043 	 * this feature active.
6947 7044 */
6948 7045 if (spa->spa_uberblock.ub_version >= SPA_VERSION_FEATURES) {
6949 7046 boolean_t lz4_en = spa_feature_is_enabled(spa,
6950 7047 SPA_FEATURE_LZ4_COMPRESS);
6951 7048 boolean_t lz4_ac = spa_feature_is_active(spa,
6952 7049 SPA_FEATURE_LZ4_COMPRESS);
6953 7050
6954 7051 if (lz4_en && !lz4_ac)
6955 7052 spa_feature_incr(spa, SPA_FEATURE_LZ4_COMPRESS, tx);
6956 7053 }
6957 7054
6958 7055 /*
6959 7056 * If we haven't written the salt, do so now. Note that the
6960 7057 * feature may not be activated yet, but that's fine since
6961 7058 * the presence of this ZAP entry is backwards compatible.
6962 7059 */
6963 7060 if (zap_contains(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
6964 7061 DMU_POOL_CHECKSUM_SALT) == ENOENT) {
(65 lines elided)
6965 7062 VERIFY0(zap_add(spa->spa_meta_objset,
6966 7063 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1,
6967 7064 sizeof (spa->spa_cksum_salt.zcs_bytes),
6968 7065 spa->spa_cksum_salt.zcs_bytes, tx));
6969 7066 }
6970 7067
6971 7068 rrw_exit(&dp->dp_config_rwlock, FTAG);
6972 7069 }
6973 7070
6974 7071 static void
6975 -vdev_indirect_state_sync_verify(vdev_t *vd)
7072 +spa_initialize_alloc_trees(spa_t *spa, uint32_t max_queue_depth,
7073 + uint64_t queue_depth_total)
6976 7074 {
6977 - vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
6978 - vdev_indirect_births_t *vib = vd->vdev_indirect_births;
7075 + vdev_t *rvd = spa->spa_root_vdev;
7076 + boolean_t dva_throttle_enabled = zio_dva_throttle_enabled;
7077 + metaslab_class_t *mcs[2] = {
7078 + spa_normal_class(spa),
7079 + spa_special_class(spa)
7080 + };
7081 + size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
6979 7082
6980 - if (vd->vdev_ops == &vdev_indirect_ops) {
6981 - ASSERT(vim != NULL);
6982 - ASSERT(vib != NULL);
6983 - }
7083 + for (size_t i = 0; i < mcs_len; i++) {
7084 + metaslab_class_t *mc = mcs[i];
6984 7085
6985 - if (vdev_obsolete_sm_object(vd) != 0) {
6986 - ASSERT(vd->vdev_obsolete_sm != NULL);
6987 - ASSERT(vd->vdev_removing ||
6988 - vd->vdev_ops == &vdev_indirect_ops);
6989 - ASSERT(vdev_indirect_mapping_num_entries(vim) > 0);
6990 - ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0);
7086 + ASSERT0(refcount_count(&mc->mc_alloc_slots));
7087 + mc->mc_alloc_max_slots = queue_depth_total;
7088 + mc->mc_alloc_throttle_enabled = dva_throttle_enabled;
6991 7089
6992 - ASSERT3U(vdev_obsolete_sm_object(vd), ==,
6993 - space_map_object(vd->vdev_obsolete_sm));
6994 - ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=,
6995 - space_map_allocated(vd->vdev_obsolete_sm));
7090 + ASSERT3U(mc->mc_alloc_max_slots, <=,
7091 + max_queue_depth * rvd->vdev_children);
6996 7092 }
6997 - ASSERT(vd->vdev_obsolete_segments != NULL);
7093 +}
6998 7094
6999 - /*
7000 - * Since frees / remaps to an indirect vdev can only
7001 - * happen in syncing context, the obsolete segments
7002 - * tree must be empty when we start syncing.
7003 - */
7004 - ASSERT0(range_tree_space(vd->vdev_obsolete_segments));
7095 +static void
7096 +spa_check_alloc_trees(spa_t *spa)
7097 +{
7098 + metaslab_class_t *mcs[2] = {
7099 + spa_normal_class(spa),
7100 + spa_special_class(spa)
7101 + };
7102 + size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *);
7103 +
7104 + for (size_t i = 0; i < mcs_len; i++) {
7105 + metaslab_class_t *mc = mcs[i];
7106 +
7107 + mutex_enter(&mc->mc_alloc_lock);
7108 + VERIFY0(avl_numnodes(&mc->mc_alloc_tree));
7109 + mutex_exit(&mc->mc_alloc_lock);
7110 + }
7005 7111 }
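spa_initialize_alloc_trees() above simply applies the queue-depth budget that spa_sync() computes below: each top-level vdev contributes zfs_vdev_async_write_max_active * zfs_vdev_queue_depth_pct / 100 slots, and the per-class limit is the sum over the children. A worked standalone example; the tunable values are assumptions for illustration only, verify the real defaults in vdev_queue.c:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Illustrative values in the ballpark of the stock tunables. */
	uint32_t async_write_max_active = 10;	/* zfs_vdev_async_write_max_active */
	uint32_t queue_depth_pct = 1000;	/* zfs_vdev_queue_depth_pct */
	int top_level_vdevs = 4;

	uint32_t max_queue_depth =
	    async_write_max_active * queue_depth_pct / 100;
	uint64_t queue_depth_total =
	    (uint64_t)max_queue_depth * top_level_vdevs;

	/* 10 * 1000 / 100 = 100 slots per top-level vdev, 400 class-wide. */
	printf("per-vdev depth %u, class-wide alloc slots %llu\n",
	    max_queue_depth, (unsigned long long)queue_depth_total);
	return (0);
}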
7006 7112
7007 7113 /*
7008 7114 * Sync the specified transaction group. New blocks may be dirtied as
7009 7115 * part of the process, so we iterate until it converges.
7010 7116 */
7011 7117 void
7012 7118 spa_sync(spa_t *spa, uint64_t txg)
7013 7119 {
7014 7120 dsl_pool_t *dp = spa->spa_dsl_pool;
7015 7121 objset_t *mos = spa->spa_meta_objset;
7016 7122 bplist_t *free_bpl = &spa->spa_free_bplist[txg & TXG_MASK];
(2 lines elided)
7017 7123 vdev_t *rvd = spa->spa_root_vdev;
7018 7124 vdev_t *vd;
7019 7125 dmu_tx_t *tx;
7020 7126 int error;
7021 7127 uint32_t max_queue_depth = zfs_vdev_async_write_max_active *
7022 7128 zfs_vdev_queue_depth_pct / 100;
7023 7129
7024 7130 VERIFY(spa_writeable(spa));
7025 7131
7026 7132 /*
7027 - * Wait for i/os issued in open context that need to complete
7028 - * before this txg syncs.
7029 - */
7030 - VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK]));
7031 - spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0);
7032 -
7033 - /*
7034 7133 * Lock out configuration changes.
7035 7134 */
7036 7135 spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7037 7136
7038 7137 spa->spa_syncing_txg = txg;
7039 7138 spa->spa_sync_pass = 0;
7040 7139
7041 - mutex_enter(&spa->spa_alloc_lock);
7042 - VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7043 - mutex_exit(&spa->spa_alloc_lock);
7140 + spa_check_alloc_trees(spa);
7044 7141
7045 7142 /*
7143 +	 * Another pool management task might currently be preventing
7144 +	 * autotrim from starting, and this txg sync may have been invoked
7145 +	 * on that task's behalf, so be prepared to postpone autotrim processing.
7146 + */
7147 + if (mutex_tryenter(&spa->spa_auto_trim_lock)) {
7148 + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON)
7149 + spa_auto_trim(spa, txg);
7150 + mutex_exit(&spa->spa_auto_trim_lock);
7151 + }
7152 +
7153 + /*
7046 7154 * If there are any pending vdev state changes, convert them
7047 7155 * into config changes that go out with this transaction group.
7048 7156 */
7049 7157 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7050 7158 while (list_head(&spa->spa_state_dirty_list) != NULL) {
7051 7159 /*
7052 7160 * We need the write lock here because, for aux vdevs,
7053 7161 * calling vdev_config_dirty() modifies sav_config.
7054 7162 * This is ugly and will become unnecessary when we
7055 7163 * eliminate the aux vdev wart by integrating all vdevs
7056 7164 * into the root vdev tree.
7057 7165 */
7058 7166 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7059 7167 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_WRITER);
7060 7168 while ((vd = list_head(&spa->spa_state_dirty_list)) != NULL) {
7061 7169 vdev_state_clean(vd);
7062 7170 vdev_config_dirty(vd);
7063 7171 }
7064 7172 spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
7065 7173 spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
7066 7174 }
7067 7175 spa_config_exit(spa, SCL_STATE, FTAG);
7068 7176
7069 7177 tx = dmu_tx_create_assigned(dp, txg);
7070 7178
7071 7179 spa->spa_sync_starttime = gethrtime();
7072 7180 VERIFY(cyclic_reprogram(spa->spa_deadman_cycid,
7073 7181 spa->spa_sync_starttime + spa->spa_deadman_synctime));
7074 7182
7075 7183 /*
7076 7184 * If we are upgrading to SPA_VERSION_RAIDZ_DEFLATE this txg,
7077 7185 * set spa_deflate if we have no raid-z vdevs.
7078 7186 */
7079 7187 if (spa->spa_ubsync.ub_version < SPA_VERSION_RAIDZ_DEFLATE &&
7080 7188 spa->spa_uberblock.ub_version >= SPA_VERSION_RAIDZ_DEFLATE) {
7081 7189 int i;
7082 7190
7083 7191 for (i = 0; i < rvd->vdev_children; i++) {
7084 7192 vd = rvd->vdev_child[i];
7085 7193 if (vd->vdev_deflate_ratio != SPA_MINBLOCKSIZE)
7086 7194 break;
7087 7195 }
7088 7196 if (i == rvd->vdev_children) {
7089 7197 spa->spa_deflate = TRUE;
7090 7198 VERIFY(0 == zap_add(spa->spa_meta_objset,
7091 7199 DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_DEFLATE,
7092 7200 sizeof (uint64_t), 1, &spa->spa_deflate, tx));
7093 7201 }
7094 7202 }
7095 7203
7096 7204 /*
7097 7205 * Set the top-level vdev's max queue depth. Evaluate each
7098 7206 * top-level's async write queue depth in case it changed.
7099 7207 * The max queue depth will not change in the middle of syncing
7100 7208 * out this txg.
7101 7209 */
7102 7210 uint64_t queue_depth_total = 0;
7103 7211 for (int c = 0; c < rvd->vdev_children; c++) {
7104 7212 vdev_t *tvd = rvd->vdev_child[c];
7105 7213 metaslab_group_t *mg = tvd->vdev_mg;
7106 7214
7107 7215 if (mg == NULL || mg->mg_class != spa_normal_class(spa) ||
7108 7216 !metaslab_group_initialized(mg))
7109 7217 continue;
(54 lines elided)
7110 7218
7111 7219 /*
7112 7220 * It is safe to do a lock-free check here because only async
7113 7221 * allocations look at mg_max_alloc_queue_depth, and async
7114 7222 * allocations all happen from spa_sync().
7115 7223 */
7116 7224 ASSERT0(refcount_count(&mg->mg_alloc_queue_depth));
7117 7225 mg->mg_max_alloc_queue_depth = max_queue_depth;
7118 7226 queue_depth_total += mg->mg_max_alloc_queue_depth;
7119 7227 }
7120 - metaslab_class_t *mc = spa_normal_class(spa);
7121 - ASSERT0(refcount_count(&mc->mc_alloc_slots));
7122 - mc->mc_alloc_max_slots = queue_depth_total;
7123 - mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled;
7124 7228
7125 - ASSERT3U(mc->mc_alloc_max_slots, <=,
7126 - max_queue_depth * rvd->vdev_children);
7229 + spa_initialize_alloc_trees(spa, max_queue_depth,
7230 + queue_depth_total);
7127 7231
7128 - for (int c = 0; c < rvd->vdev_children; c++) {
7129 - vdev_t *vd = rvd->vdev_child[c];
7130 - vdev_indirect_state_sync_verify(vd);
7131 -
7132 - if (vdev_indirect_should_condense(vd)) {
7133 - spa_condense_indirect_start_sync(vd, tx);
7134 - break;
7135 - }
7136 - }
7137 -
7138 7232 /*
7139 7233 * Iterate to convergence.
7140 7234 */
7235 +
7236 + zfs_autosnap_t *autosnap = spa_get_autosnap(dp->dp_spa);
7237 + mutex_enter(&autosnap->autosnap_lock);
7238 +
7239 + autosnap_zone_t *zone = list_head(&autosnap->autosnap_zones);
7240 + while (zone != NULL) {
7241 + zone->created = B_FALSE;
7242 + zone->dirty = B_FALSE;
7243 + zone = list_next(&autosnap->autosnap_zones, zone);
7244 + }
7245 +
7246 + mutex_exit(&autosnap->autosnap_lock);
7247 +
7141 7248 do {
7142 7249 int pass = ++spa->spa_sync_pass;
7143 7250
7144 7251 spa_sync_config_object(spa, tx);
7145 7252 spa_sync_aux_dev(spa, &spa->spa_spares, tx,
7146 7253 ZPOOL_CONFIG_SPARES, DMU_POOL_SPARES);
7147 7254 spa_sync_aux_dev(spa, &spa->spa_l2cache, tx,
7148 7255 ZPOOL_CONFIG_L2CACHE, DMU_POOL_L2CACHE);
7149 7256 spa_errlog_sync(spa, txg);
7150 7257 dsl_pool_sync(dp, txg);
7151 7258
7152 7259 if (pass < zfs_sync_pass_deferred_free) {
7153 7260 spa_sync_frees(spa, free_bpl, tx);
7154 7261 } else {
7155 7262 /*
7156 7263 * We can not defer frees in pass 1, because
(6 lines elided)
7157 7264 * we sync the deferred frees later in pass 1.
7158 7265 */
7159 7266 ASSERT3U(pass, >, 1);
7160 7267 bplist_iterate(free_bpl, bpobj_enqueue_cb,
7161 7268 &spa->spa_deferred_bpobj, tx);
7162 7269 }
7163 7270
7164 7271 ddt_sync(spa, txg);
7165 7272 dsl_scan_sync(dp, tx);
7166 7273
7167 - if (spa->spa_vdev_removal != NULL)
7168 - svr_sync(spa, tx);
7169 -
7170 - while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7171 - != NULL)
7274 + while (vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
7172 7275 vdev_sync(vd, txg);
7173 7276
7174 7277 if (pass == 1) {
7175 7278 spa_sync_upgrades(spa, tx);
7176 7279 ASSERT3U(txg, >=,
7177 7280 spa->spa_uberblock.ub_rootbp.blk_birth);
7178 7281 /*
7179 7282 * Note: We need to check if the MOS is dirty
7180 7283 * because we could have marked the MOS dirty
7181 7284 * without updating the uberblock (e.g. if we
7182 7285 * have sync tasks but no dirty user data). We
7183 7286 * need to check the uberblock's rootbp because
7184 7287 * it is updated if we have synced out dirty
7185 7288 * data (though in this case the MOS will most
7186 7289 * likely also be dirty due to second order
7187 7290 * effects, we don't want to rely on that here).
7188 7291 */
7189 7292 if (spa->spa_uberblock.ub_rootbp.blk_birth < txg &&
7190 7293 !dmu_objset_is_dirty(mos, txg)) {
7191 7294 /*
7192 7295 * Nothing changed on the first pass,
7193 7296 * therefore this TXG is a no-op. Avoid
7194 7297 * syncing deferred frees, so that we
7195 7298 * can keep this TXG as a no-op.
7196 7299 */
7197 7300 ASSERT(txg_list_empty(&dp->dp_dirty_datasets,
7198 7301 txg));
7199 7302 ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7200 7303 ASSERT(txg_list_empty(&dp->dp_sync_tasks, txg));
7201 7304 break;
7202 7305 }
7203 7306 spa_sync_deferred_frees(spa, tx);
7204 7307 }
7205 7308
7206 7309 } while (dmu_objset_is_dirty(mos, txg));
7207 7310
7208 7311 if (!list_is_empty(&spa->spa_config_dirty_list)) {
7209 7312 /*
7210 7313 * Make sure that the number of ZAPs for all the vdevs matches
7211 7314 * the number of ZAPs in the per-vdev ZAP list. This only gets
7212 7315 * called if the config is dirty; otherwise there may be
(31 lines elided)
7213 7316 * outstanding AVZ operations that weren't completed in
7214 7317 * spa_sync_config_object.
7215 7318 */
7216 7319 uint64_t all_vdev_zap_entry_count;
7217 7320 ASSERT0(zap_count(spa->spa_meta_objset,
7218 7321 spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count));
7219 7322 ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==,
7220 7323 all_vdev_zap_entry_count);
7221 7324 }
7222 7325
7223 - if (spa->spa_vdev_removal != NULL) {
7224 - ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]);
7225 - }
7226 -
7227 7326 /*
7228 7327 * Rewrite the vdev configuration (which includes the uberblock)
7229 7328 * to commit the transaction group.
7230 7329 *
7231 7330 * If there are no dirty vdevs, we sync the uberblock to a few
7232 7331 * random top-level vdevs that are known to be visible in the
7233 7332 * config cache (see spa_vdev_add() for a complete description).
7234 7333 * If there *are* dirty vdevs, sync the uberblock to all vdevs.
7235 7334 */
7236 7335 for (;;) {
7237 7336 /*
7238 7337 * We hold SCL_STATE to prevent vdev open/close/etc.
7239 7338 * while we're attempting to write the vdev labels.
7240 7339 */
7241 7340 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
7242 7341
7243 7342 if (list_is_empty(&spa->spa_config_dirty_list)) {
7244 - vdev_t *svd[SPA_SYNC_MIN_VDEVS];
7343 + vdev_t *svd[SPA_DVAS_PER_BP];
7245 7344 int svdcount = 0;
7246 7345 int children = rvd->vdev_children;
7247 7346 int c0 = spa_get_random(children);
7248 7347
7249 7348 for (int c = 0; c < children; c++) {
7250 7349 vd = rvd->vdev_child[(c0 + c) % children];
7251 - if (vd->vdev_ms_array == 0 || vd->vdev_islog ||
7252 - !vdev_is_concrete(vd))
7350 + if (vd->vdev_ms_array == 0 || vd->vdev_islog)
7253 7351 continue;
7254 7352 svd[svdcount++] = vd;
7255 - if (svdcount == SPA_SYNC_MIN_VDEVS)
7353 + if (svdcount == SPA_DVAS_PER_BP)
7256 7354 break;
7257 7355 }
7258 7356 error = vdev_config_sync(svd, svdcount, txg);
7259 7357 } else {
7260 7358 error = vdev_config_sync(rvd->vdev_child,
7261 7359 rvd->vdev_children, txg);
7262 7360 }
7263 7361
7264 7362 if (error == 0)
7265 7363 spa->spa_last_synced_guid = rvd->vdev_guid;
7266 7364
7267 7365 spa_config_exit(spa, SCL_STATE, FTAG);
7268 7366
7269 7367 if (error == 0)
7270 7368 break;
7271 7369 zio_suspend(spa, NULL);
7272 7370 zio_resume_wait(spa);
7273 7371 }
7274 7372 dmu_tx_commit(tx);
7275 7373
7276 7374 VERIFY(cyclic_reprogram(spa->spa_deadman_cycid, CY_INFINITY));
7277 7375
7278 7376 /*
7279 7377 * Clear the dirty config list.
7280 7378 */
7281 7379 while ((vd = list_head(&spa->spa_config_dirty_list)) != NULL)
7282 7380 vdev_config_clean(vd);
7283 7381
7284 7382 /*
7285 7383 * Now that the new config has synced transactionally,
(20 lines elided)
7286 7384 * let it become visible to the config cache.
7287 7385 */
7288 7386 if (spa->spa_config_syncing != NULL) {
7289 7387 spa_config_set(spa, spa->spa_config_syncing);
7290 7388 spa->spa_config_txg = txg;
7291 7389 spa->spa_config_syncing = NULL;
7292 7390 }
7293 7391
7294 7392 dsl_pool_sync_done(dp, txg);
7295 7393
7296 - mutex_enter(&spa->spa_alloc_lock);
7297 - VERIFY0(avl_numnodes(&spa->spa_alloc_tree));
7298 - mutex_exit(&spa->spa_alloc_lock);
7394 + spa_check_alloc_trees(spa);
7299 7395
7300 7396 /*
7301 7397 * Update usable space statistics.
7302 7398 */
7303 7399 while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)))
7304 7400 vdev_sync_done(vd, txg);
7305 7401
7306 7402 spa_update_dspace(spa);
7307 -
7403 + spa_update_latency(spa);
7308 7404 /*
7309 7405 * It had better be the case that we didn't dirty anything
7310 7406 * since vdev_config_sync().
7311 7407 */
7312 7408 ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
7313 7409 ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg));
7314 7410 ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg));
7315 7411
7316 7412 spa->spa_sync_pass = 0;
7317 7413
7414 + spa_check_special(spa);
7415 +
7318 7416 /*
7319 7417 * Update the last synced uberblock here. We want to do this at
7320 7418 * the end of spa_sync() so that consumers of spa_last_synced_txg()
7321 7419 * will be guaranteed that all the processing associated with
7322 7420 * that txg has been completed.
7323 7421 */
7324 7422 spa->spa_ubsync = spa->spa_uberblock;
7325 7423 spa_config_exit(spa, SCL_CONFIG, FTAG);
7326 7424
7327 7425 spa_handle_ignored_writes(spa);
7328 7426
7329 7427 /*
7330 7428 * If any async tasks have been requested, kick them off.
7331 7429 */
7332 7430 spa_async_dispatch(spa);
7333 7431 }
7334 7432
7335 7433 /*
7336 7434 * Sync all pools. We don't want to hold the namespace lock across these
7337 7435 * operations, so we take a reference on the spa_t and drop the lock during the
7338 7436 * sync.
7339 7437 */
7340 7438 void
7341 7439 spa_sync_allpools(void)
7342 7440 {
7343 7441 spa_t *spa = NULL;
7344 7442 mutex_enter(&spa_namespace_lock);
7345 7443 while ((spa = spa_next(spa)) != NULL) {
7346 7444 if (spa_state(spa) != POOL_STATE_ACTIVE ||
7347 7445 !spa_writeable(spa) || spa_suspended(spa))
7348 7446 continue;
7349 7447 spa_open_ref(spa, FTAG);
7350 7448 mutex_exit(&spa_namespace_lock);
7351 7449 txg_wait_synced(spa_get_dsl(spa), 0);
7352 7450 mutex_enter(&spa_namespace_lock);
7353 7451 spa_close(spa, FTAG);
7354 7452 }
7355 7453 mutex_exit(&spa_namespace_lock);
7356 7454 }
7357 7455
7358 7456 /*
7359 7457 * ==========================================================================
7360 7458 * Miscellaneous routines
7361 7459 * ==========================================================================
7362 7460 */
7363 7461
7364 7462 /*
7365 7463 * Remove all pools in the system.
7366 7464 */
7367 7465 void
7368 7466 spa_evict_all(void)
7369 7467 {
7370 7468 spa_t *spa;
7371 7469
7372 7470 /*
7373 7471 * Remove all cached state. All pools should be closed now,
7374 7472 * so every spa in the AVL tree should be unreferenced.
7375 7473 */
7376 7474 mutex_enter(&spa_namespace_lock);
7377 7475 while ((spa = spa_next(NULL)) != NULL) {
7378 7476 /*
7379 7477 * Stop async tasks. The async thread may need to detach
(52 lines elided)
7380 7478 * a device that's been replaced, which requires grabbing
7381 7479 * spa_namespace_lock, so we must drop it here.
7382 7480 */
7383 7481 spa_open_ref(spa, FTAG);
7384 7482 mutex_exit(&spa_namespace_lock);
7385 7483 spa_async_suspend(spa);
7386 7484 mutex_enter(&spa_namespace_lock);
7387 7485 spa_close(spa, FTAG);
7388 7486
7389 7487 if (spa->spa_state != POOL_STATE_UNINITIALIZED) {
7488 + wbc_deactivate(spa);
7489 +
7390 7490 spa_unload(spa);
7391 7491 spa_deactivate(spa);
7392 7492 }
7493 +
7393 7494 spa_remove(spa);
7394 7495 }
7395 7496 mutex_exit(&spa_namespace_lock);
7396 7497 }
7397 7498
7398 7499 vdev_t *
7399 7500 spa_lookup_by_guid(spa_t *spa, uint64_t guid, boolean_t aux)
7400 7501 {
7401 7502 vdev_t *vd;
7402 7503 int i;
7403 7504
7404 7505 if ((vd = vdev_lookup_by_guid(spa->spa_root_vdev, guid)) != NULL)
7405 7506 return (vd);
7406 7507
7407 7508 if (aux) {
7408 7509 for (i = 0; i < spa->spa_l2cache.sav_count; i++) {
7409 7510 vd = spa->spa_l2cache.sav_vdevs[i];
7410 7511 if (vd->vdev_guid == guid)
7411 7512 return (vd);
7412 7513 }
7413 7514
7414 7515 for (i = 0; i < spa->spa_spares.sav_count; i++) {
7415 7516 vd = spa->spa_spares.sav_vdevs[i];
7416 7517 if (vd->vdev_guid == guid)
7417 7518 return (vd);
7418 7519 }
7419 7520 }
7420 7521
7421 7522 return (NULL);
7422 7523 }
7423 7524
7424 7525 void
7425 7526 spa_upgrade(spa_t *spa, uint64_t version)
7426 7527 {
7427 7528 ASSERT(spa_writeable(spa));
7428 7529
7429 7530 spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
7430 7531
7431 7532 /*
7432 7533 * This should only be called for a non-faulted pool, and since a
7433 7534 * future version would result in an unopenable pool, this shouldn't be
7434 7535 * possible.
7435 7536 */
7436 7537 ASSERT(SPA_VERSION_IS_SUPPORTED(spa->spa_uberblock.ub_version));
7437 7538 ASSERT3U(version, >=, spa->spa_uberblock.ub_version);
7438 7539
7439 7540 spa->spa_uberblock.ub_version = version;
7440 7541 vdev_config_dirty(spa->spa_root_vdev);
7441 7542
7442 7543 spa_config_exit(spa, SCL_ALL, FTAG);
7443 7544
7444 7545 txg_wait_synced(spa_get_dsl(spa), 0);
7445 7546 }
7446 7547
7447 7548 boolean_t
7448 7549 spa_has_spare(spa_t *spa, uint64_t guid)
7449 7550 {
7450 7551 int i;
7451 7552 uint64_t spareguid;
7452 7553 spa_aux_vdev_t *sav = &spa->spa_spares;
7453 7554
7454 7555 for (i = 0; i < sav->sav_count; i++)
7455 7556 if (sav->sav_vdevs[i]->vdev_guid == guid)
7456 7557 return (B_TRUE);
7457 7558
7458 7559 for (i = 0; i < sav->sav_npending; i++) {
7459 7560 if (nvlist_lookup_uint64(sav->sav_pending[i], ZPOOL_CONFIG_GUID,
7460 7561 &spareguid) == 0 && spareguid == guid)
7461 7562 return (B_TRUE);
7462 7563 }
7463 7564
7464 7565 return (B_FALSE);
7465 7566 }
7466 7567
7467 7568 /*
7468 7569 * Check if a pool has an active shared spare device.
7469 7570 * Note: reference count of an active spare is 2, as a spare and as a replace
7470 7571 */
7471 7572 static boolean_t
7472 7573 spa_has_active_shared_spare(spa_t *spa)
7473 7574 {
7474 7575 int i, refcnt;
7475 7576 uint64_t pool;
7476 7577 spa_aux_vdev_t *sav = &spa->spa_spares;
7477 7578
(75 lines elided)
7478 7579 for (i = 0; i < sav->sav_count; i++) {
7479 7580 if (spa_spare_exists(sav->sav_vdevs[i]->vdev_guid, &pool,
7480 7581 &refcnt) && pool != 0ULL && pool == spa_guid(spa) &&
7481 7582 refcnt > 2)
7482 7583 return (B_TRUE);
7483 7584 }
7484 7585
7485 7586 return (B_FALSE);
7486 7587 }
7487 7588
7488 -sysevent_t *
7589 +/*
7590 + * Post a sysevent corresponding to the given event. The 'name' must be one of
7591 + * the event definitions in sys/sysevent/eventdefs.h. The payload will be
7592 + * filled in from the spa and (optionally) the vdev. This doesn't do anything
7593 + * in the userland libzpool, as we don't want consumers to misinterpret ztest
7594 + * or zdb as real changes.
7595 + */
7596 +static sysevent_t *
7489 7597 spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7490 7598 {
7491 7599 sysevent_t *ev = NULL;
7492 7600 #ifdef _KERNEL
7493 7601 sysevent_attr_list_t *attr = NULL;
7494 7602 sysevent_value_t value;
7495 7603
7496 7604 ev = sysevent_alloc(EC_ZFS, (char *)name, SUNW_KERN_PUB "zfs",
7497 7605 SE_SLEEP);
7498 7606 ASSERT(ev != NULL);
7499 7607
(1 line elided)
7500 7608 value.value_type = SE_DATA_TYPE_STRING;
7501 7609 value.value.sv_string = spa_name(spa);
7502 7610 if (sysevent_add_attr(&attr, ZFS_EV_POOL_NAME, &value, SE_SLEEP) != 0)
7503 7611 goto done;
7504 7612
7505 7613 value.value_type = SE_DATA_TYPE_UINT64;
7506 7614 value.value.sv_uint64 = spa_guid(spa);
7507 7615 if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0)
7508 7616 goto done;
7509 7617
7510 - if (vd) {
7618 + if (vd != NULL) {
7511 7619 value.value_type = SE_DATA_TYPE_UINT64;
7512 7620 value.value.sv_uint64 = vd->vdev_guid;
7513 7621 if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value,
7514 7622 SE_SLEEP) != 0)
7515 7623 goto done;
7516 7624
7517 7625 if (vd->vdev_path) {
7518 7626 value.value_type = SE_DATA_TYPE_STRING;
7519 7627 value.value.sv_string = vd->vdev_path;
7520 7628 if (sysevent_add_attr(&attr, ZFS_EV_VDEV_PATH,
7521 7629 &value, SE_SLEEP) != 0)
7522 7630 goto done;
7523 7631 }
7524 7632 }
7525 7633
7526 7634 if (hist_nvl != NULL) {
7527 7635 fnvlist_merge((nvlist_t *)attr, hist_nvl);
7528 7636 }
7529 7637
7530 7638 if (sysevent_attach_attributes(ev, attr) != 0)
7531 7639 goto done;
(11 lines elided)
7532 7640 attr = NULL;
7533 7641
7534 7642 done:
7535 7643 if (attr)
7536 7644 sysevent_free_attr(attr);
7537 7645
7538 7646 #endif
7539 7647 return (ev);
7540 7648 }
7541 7649
7542 -void
7543 -spa_event_post(sysevent_t *ev)
7650 +static void
7651 +spa_event_post(void *arg)
7544 7652 {
7545 7653 #ifdef _KERNEL
7654 + sysevent_t *ev = (sysevent_t *)arg;
7655 +
7546 7656 sysevent_id_t eid;
7547 7657
7548 7658 (void) log_sysevent(ev, SE_SLEEP, &eid);
7549 7659 sysevent_free(ev);
7550 7660 #endif
7551 7661 }
7552 7662
7663 +/*
7664 + * Dispatch event notifications to the taskq such that the corresponding
7665 + * sysevents are queued with no spa locks held
7666 + */
7667 +taskq_t *spa_sysevent_taskq;
7668 +
7669 +static void
7670 +spa_event_notify_impl(sysevent_t *ev)
7671 +{
7672 + if (taskq_dispatch(spa_sysevent_taskq, spa_event_post,
7673 + ev, TQ_NOSLEEP) == NULL) {
7674 + /*
7675 + * These are management sysevents; as much as it is
7676 + * unpleasant to drop these due to syseventd not being able
7677 + * to keep up, perhaps due to resource shortages, we are not
7678 + * going to sleep here and risk locking up the pool sync
7679 + * process; notify admin of problems
7680 + */
7681 +		cmn_err(CE_NOTE, "Could not dispatch sysevent notification "
7682 + "for %s, please check state of syseventd\n",
7683 + sysevent_get_subclass_name(ev));
7684 +
7685 + sysevent_free(ev);
7686 +
7687 + return;
7688 + }
7689 +}
7690 +
7553 7691 void
7554 -spa_event_discard(sysevent_t *ev)
7692 +spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7555 7693 {
7556 -#ifdef _KERNEL
7557 - sysevent_free(ev);
7558 -#endif
7694 + spa_event_notify_impl(spa_event_create(spa, vd, hist_nvl, name));
7559 7695 }
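spa_event_notify_impl() deliberately dispatches with TQ_NOSLEEP so that a backed-up syseventd can never stall the pool sync path; the event is dropped and logged instead of blocking. A standalone model of that drop-rather-than-block decision (a toy fixed-size queue stands in for the taskq; names are illustrative, not part of this patch):

#include <stdio.h>
#include <stdlib.h>

#define	QUEUE_SLOTS	4

typedef struct event {
	const char *subclass;
} event_t;

static event_t *queue[QUEUE_SLOTS];
static int queued;

/* Non-blocking enqueue: returns 0 on success, -1 when the queue is full. */
static int
try_dispatch(event_t *ev)
{
	if (queued == QUEUE_SLOTS)
		return (-1);
	queue[queued++] = ev;
	return (0);
}

static void
notify(event_t *ev)
{
	if (try_dispatch(ev) != 0) {
		/* Never block the caller; drop the event and tell the admin. */
		fprintf(stderr, "could not dispatch sysevent notification "
		    "for %s\n", ev->subclass);
		free(ev);
	}
}

int
main(void)
{
	for (int i = 0; i < 6; i++) {
		event_t *ev = malloc(sizeof (*ev));
		ev->subclass = "ESC_ZFS_TRIM_START";
		notify(ev);	/* the last two are dropped, not blocked on */
	}
	while (queued > 0)
		free(queue[--queued]);
	return (0);
}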
7560 7696
7561 7697 /*
7562 - * Post a sysevent corresponding to the given event. The 'name' must be one of
7563 - * the event definitions in sys/sysevent/eventdefs.h. The payload will be
7564 - * filled in from the spa and (optionally) the vdev and history nvl. This
7565 - * doesn't do anything in the userland libzpool, as we don't want consumers to
7566 - * misinterpret ztest or zdb as real changes.
7698 + * Dispatches all auto-trim processing to all top-level vdevs. This is
7699 + * called from spa_sync once every txg.
7567 7700 */
7701 +static void
7702 +spa_auto_trim(spa_t *spa, uint64_t txg)
7703 +{
7704 + ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER) == SCL_CONFIG);
7705 + ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
7706 + ASSERT(spa->spa_auto_trim_taskq != NULL);
7707 +
7708 + for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7709 + vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
7710 + vti->vti_vdev = spa->spa_root_vdev->vdev_child[i];
7711 + vti->vti_txg = txg;
7712 + vti->vti_done_cb = (void (*)(void *))spa_vdev_auto_trim_done;
7713 + vti->vti_done_arg = spa;
7714 + (void) taskq_dispatch(spa->spa_auto_trim_taskq,
7715 + (void (*)(void *))vdev_auto_trim, vti, TQ_SLEEP);
7716 + spa->spa_num_auto_trimming++;
7717 + }
7718 +}
7719 +
7720 +/*
7721 + * Performs the sync update of the MOS pool directory's trim start/stop values.
7722 + */
7723 +static void
7724 +spa_trim_update_time_sync(void *arg, dmu_tx_t *tx)
7725 +{
7726 + spa_t *spa = arg;
7727 + VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
7728 + DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
7729 + &spa->spa_man_trim_start_time, tx));
7730 + VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
7731 + DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
7732 + &spa->spa_man_trim_stop_time, tx));
7733 +}
7734 +
7735 +/*
7736 + * Updates the in-core and on-disk manual TRIM operation start/stop time.
7737 + * Passing UINT64_MAX for either start_time or stop_time means that no
7738 + * update to that value should be recorded.
7739 + */
7740 +static dmu_tx_t *
7741 +spa_trim_update_time(spa_t *spa, uint64_t start_time, uint64_t stop_time)
7742 +{
7743 + int err;
7744 + dmu_tx_t *tx;
7745 +
7746 + ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
7747 + if (start_time != UINT64_MAX)
7748 + spa->spa_man_trim_start_time = start_time;
7749 + if (stop_time != UINT64_MAX)
7750 + spa->spa_man_trim_stop_time = stop_time;
7751 + tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
7752 + err = dmu_tx_assign(tx, TXG_WAIT);
7753 + if (err) {
7754 + dmu_tx_abort(tx);
7755 + return (NULL);
7756 + }
7757 + dsl_sync_task_nowait(spa_get_dsl(spa), spa_trim_update_time_sync,
7758 + spa, 1, ZFS_SPACE_CHECK_RESERVED, tx);
7759 +
7760 + return (tx);
7761 +}
7762 +
7763 +/*
7764 + * Initiates a manual TRIM of the whole pool. This kicks off individual
7765 + * TRIM tasks for each top-level vdev, which then pass over all of the free
7766 + * space in all of the vdev's metaslabs and issue TRIM commands for that
7767 + * space to the underlying vdevs.
7768 + */
7769 +extern void
7770 +spa_man_trim(spa_t *spa, uint64_t rate)
7771 +{
7772 + dmu_tx_t *time_update_tx;
7773 +
7774 + mutex_enter(&spa->spa_man_trim_lock);
7775 +
7776 + if (rate != 0)
7777 + spa->spa_man_trim_rate = MAX(rate, spa_min_trim_rate(spa));
7778 + else
7779 + spa->spa_man_trim_rate = 0;
7780 +
7781 + if (spa->spa_num_man_trimming) {
7782 + /*
7783 + * TRIM is already ongoing. Wake up all sleeping vdev trim
7784 + * threads because the trim rate might have changed above.
7785 + */
7786 + cv_broadcast(&spa->spa_man_trim_update_cv);
7787 + mutex_exit(&spa->spa_man_trim_lock);
7788 + return;
7789 + }
7790 + spa_man_trim_taskq_create(spa);
7791 + spa->spa_man_trim_stop = B_FALSE;
7792 +
7793 + spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);
7794 + spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7795 + for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7796 + vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
7797 + vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
7798 + vti->vti_vdev = vd;
7799 + vti->vti_done_cb = (void (*)(void *))spa_vdev_man_trim_done;
7800 + vti->vti_done_arg = spa;
7801 + spa->spa_num_man_trimming++;
7802 +
7803 + vd->vdev_trim_prog = 0;
7804 + (void) taskq_dispatch(spa->spa_man_trim_taskq,
7805 + (void (*)(void *))vdev_man_trim, vti, TQ_SLEEP);
7806 + }
7807 + spa_config_exit(spa, SCL_CONFIG, FTAG);
7808 + time_update_tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
7809 + mutex_exit(&spa->spa_man_trim_lock);
 7810 +	/* mustn't hold spa_man_trim_lock to prevent deadlock w/ syncing ctx */
7811 + if (time_update_tx != NULL)
7812 + dmu_tx_commit(time_update_tx);
7813 +}
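A hypothetical caller, to make the rate handling concrete (the actual ioctl plumbing is elsewhere in the patch): a nonzero rate is clamped up to spa_min_trim_rate(), zero means full speed, and calling spa_man_trim() again while a TRIM is already running only updates the rate:

static void
pool_trim_request_sketch(spa_t *spa)
{
	/* start at roughly 100 MiB/s (clamped up to the minimum rate) */
	spa_man_trim(spa, 100ULL << 20);

	/* later: let the already-running TRIM proceed at full speed */
	spa_man_trim(spa, 0);
}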
7814 +
7815 +/*
7816 + * Orders a manual TRIM operation to stop and returns immediately.
7817 + */
7818 +extern void
7819 +spa_man_trim_stop(spa_t *spa)
7820 +{
7821 + boolean_t held = MUTEX_HELD(&spa->spa_man_trim_lock);
7822 + if (!held)
7823 + mutex_enter(&spa->spa_man_trim_lock);
7824 + spa->spa_man_trim_stop = B_TRUE;
7825 + cv_broadcast(&spa->spa_man_trim_update_cv);
7826 + if (!held)
7827 + mutex_exit(&spa->spa_man_trim_lock);
7828 +}
7829 +
7830 +/*
7831 + * Orders a manual TRIM operation to stop and waits for both manual and
7832 + * automatic TRIM to complete. By holding both the spa_man_trim_lock and
7833 + * the spa_auto_trim_lock, the caller can guarantee that after this
7834 + * function returns, no new TRIM operations can be initiated in parallel.
7835 + */
7568 7836 void
7569 -spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
7837 +spa_trim_stop_wait(spa_t *spa)
7570 7838 {
7571 - spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
7839 + ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
7840 + ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
7841 + spa->spa_man_trim_stop = B_TRUE;
7842 + cv_broadcast(&spa->spa_man_trim_update_cv);
7843 + while (spa->spa_num_man_trimming > 0)
7844 + cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
7845 + while (spa->spa_num_auto_trimming > 0)
7846 + cv_wait(&spa->spa_auto_trim_done_cv, &spa->spa_auto_trim_lock);
7847 +}
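A hypothetical teardown-path sketch of how a caller satisfies the locking contract asserted above (the lock order shown is illustrative; the real callers are in the export/destroy and vdev-removal paths of this patch):

static void
spa_trim_teardown_sketch(spa_t *spa)
{
	mutex_enter(&spa->spa_man_trim_lock);
	mutex_enter(&spa->spa_auto_trim_lock);
	spa_trim_stop_wait(spa);
	/* no new manual or automatic TRIM can start until both are dropped */
	mutex_exit(&spa->spa_auto_trim_lock);
	mutex_exit(&spa->spa_man_trim_lock);
}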
7848 +
7849 +/*
7850 + * Returns manual TRIM progress. Progress is indicated by four return values:
 7851 + * 1) prog: the total number of bytes of pool space that manual TRIM has
 7852 + *	already passed over (regardless of whether the space is allocated).
 7853 + *	Completion of the operation is indicated either when the returned
 7854 + *	value is zero, or when it equals the sum of the sizes of all
 7855 + *	top-level vdevs.
7856 + * 2) rate: the trim rate in bytes per second. A value of zero indicates that
7857 + * trim progresses as fast as possible.
7858 + * 3) start_time: the UNIXTIME of when the last manual TRIM operation was
7859 + * started. If no manual trim was ever initiated on the pool, this is
7860 + * zero.
 7861 + * 4) stop_time: the UNIXTIME of when the last manual TRIM operation
 7862 + *	stopped on the pool. If a trim was started (start_time != 0) but has
 7863 + *	not yet completed, stop_time will be zero. If no trim is currently
 7864 + *	ongoing, start_time is non-zero and stop_time is zero, the previously
 7865 + *	initiated TRIM operation was interrupted.
7866 + */
7867 +extern void
7868 +spa_get_trim_prog(spa_t *spa, uint64_t *prog, uint64_t *rate,
7869 + uint64_t *start_time, uint64_t *stop_time)
7870 +{
7871 + uint64_t total = 0;
7872 + vdev_t *root_vd = spa->spa_root_vdev;
7873 +
7874 + ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
7875 + mutex_enter(&spa->spa_man_trim_lock);
7876 + if (spa->spa_num_man_trimming > 0) {
7877 + for (uint64_t i = 0; i < root_vd->vdev_children; i++) {
7878 + total += root_vd->vdev_child[i]->vdev_trim_prog;
7879 + }
7880 + }
7881 + *prog = total;
7882 + *rate = spa->spa_man_trim_rate;
7883 + *start_time = spa->spa_man_trim_start_time;
7884 + *stop_time = spa->spa_man_trim_stop_time;
7885 + mutex_exit(&spa->spa_man_trim_lock);
7886 +}
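An illustrative consumer of these four values (for example, status-reporting code; not part of this hunk), computing percent-complete as described in the comment above, with "total" assumed to be the summed size of all top-level vdevs:

static uint64_t
trim_pct_done_sketch(uint64_t prog, uint64_t total)
{
	if (prog == 0 || total == 0)
		return (0);	/* no manual TRIM in progress */
	if (prog >= total)
		return (100);	/* the pass over every top-level vdev is done */
	return ((prog * 100) / total);
}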
7887 +
7888 +/*
7889 + * Callback when a vdev_man_trim has finished on a single top-level vdev.
7890 + */
7891 +static void
7892 +spa_vdev_man_trim_done(spa_t *spa)
7893 +{
7894 + dmu_tx_t *time_update_tx = NULL;
7895 +
7896 + mutex_enter(&spa->spa_man_trim_lock);
7897 + ASSERT(spa->spa_num_man_trimming > 0);
7898 + spa->spa_num_man_trimming--;
7899 + if (spa->spa_num_man_trimming == 0) {
7900 + /* if we were interrupted, leave stop_time at zero */
7901 + if (!spa->spa_man_trim_stop)
7902 + time_update_tx = spa_trim_update_time(spa, UINT64_MAX,
7903 + gethrestime_sec());
7904 + spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_FINISH);
7905 + spa_async_request(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);
7906 + cv_broadcast(&spa->spa_man_trim_done_cv);
7907 + }
7908 + mutex_exit(&spa->spa_man_trim_lock);
7909 +
7910 + if (time_update_tx != NULL)
7911 + dmu_tx_commit(time_update_tx);
7912 +}
7913 +
7914 +/*
7915 + * Called from vdev_auto_trim when a vdev has completed its auto-trim
7916 + * processing.
7917 + */
7918 +static void
7919 +spa_vdev_auto_trim_done(spa_t *spa)
7920 +{
7921 + mutex_enter(&spa->spa_auto_trim_lock);
7922 + ASSERT(spa->spa_num_auto_trimming > 0);
7923 + spa->spa_num_auto_trimming--;
7924 + if (spa->spa_num_auto_trimming == 0)
7925 + cv_broadcast(&spa->spa_auto_trim_done_cv);
7926 + mutex_exit(&spa->spa_auto_trim_lock);
7927 +}
7928 +
7929 +/*
7930 + * Determines the minimum sensible rate at which a manual TRIM can be
7931 + * performed on a given spa and returns it. Since we perform TRIM in
7932 + * metaslab-sized increments, we'll just let the longest step between
7933 + * metaslab TRIMs be 100s (random number, really). Thus, on a typical
 7934 + * 200-metaslab vdev, the longest a TRIM should take is about 5.5 hours.
 7935 + * It *can* take longer if the device is really slow to respond to
 7936 + * zio_trim() commands, or it contains more than 200 metaslabs, or
7937 + * metaslab sizes vary widely between top-level vdevs.
7938 + */
7939 +static uint64_t
7940 +spa_min_trim_rate(spa_t *spa)
7941 +{
7942 + uint64_t smallest_ms_sz = UINT64_MAX;
7943 +
7944 + /* find the smallest metaslab */
7945 + spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
7946 + for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
7947 + smallest_ms_sz = MIN(smallest_ms_sz,
7948 + spa->spa_root_vdev->vdev_child[i]->vdev_ms[0]->ms_size);
7949 + }
7950 + spa_config_exit(spa, SCL_CONFIG, FTAG);
7951 + VERIFY(smallest_ms_sz != 0);
7952 +
7953 + /* minimum TRIM rate is 1/100th of the smallest metaslab size */
7954 + return (smallest_ms_sz / 100);
7572 7955 }
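As a standalone, user-space arithmetic check of the bound described in the comment above (assuming a 1 GiB smallest metaslab): the minimum rate of smallest_ms_sz / 100 bytes per second caps each metaslab at 100 seconds, so 200 metaslabs take about 20,000 seconds, i.e. roughly 5.5 hours.

#include <stdio.h>
#include <inttypes.h>

int
main(void)
{
	uint64_t ms_sz = UINT64_C(1) << 30;		/* 1 GiB metaslab */
	uint64_t min_rate = ms_sz / 100;		/* bytes per second */
	uint64_t secs_per_ms = ms_sz / min_rate;	/* == 100 seconds */
	uint64_t total_secs = 200 * secs_per_ms;	/* 200 metaslabs */

	(void) printf("min rate %" PRIu64 " B/s, %" PRIu64 " s/metaslab, "
	    "%.2f hours for 200 metaslabs\n",
	    min_rate, secs_per_ms, total_secs / 3600.0);
	return (0);
}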