9700 ZFS resilvered mirror does not balance reads
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
NEX-17931 Getting panic: vfs_mountroot: cannot mount root after split mirror syspool
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9989 Changing volume names can result in double imports and data corruption
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-6855 System fails to boot up after a large number of datasets created
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-8711 backport illumos 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-7550 zpool remove mirrored slog or special vdev causes system panic due to a NULL pointer dereference in "zfs" module
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6884 KRRP: replication deadlock due to unavailable resources
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6000 zpool destroy/export with autotrim=on panics due to lock assertion
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5702 Special vdev cannot be removed if it was used as slog
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5637 enablespecial property should be disabled after special vdev removal
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5068 In-progress scrub can drastically increase zpool import times
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-5219 WBC: Add capability to delay migration
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5078 Want ability to see progress of freeing data and how much is left to free after large file delete patch
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5019 wrcache activation races vs. 'zpool create -O wrc_mode='
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4934 Add capability to remove special vdev
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4830 writecache=off leaks data on special vdev (the data will never migrate)
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4876 On-demand TRIM shouldn't use system_taskq and should queue jobs
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4679 Autotrim taskq doesn't get destroyed on pool export
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
NEX-4567 KRRP: L2L replication inside of one pool causes ARC-deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
6529 Properly handle updates of variably-sized SA entries.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6175 sdev can create bogus zvol directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6174 /dev/zvol does not show pool directories
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
6046 SPARC boot should support com.delphix:hole_birth
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6041 SPARC boot should support LZ4
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6044 SPARC zfs reader is using wrong size for objset_phys
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
backout 5997: breaks "zpool add"
5997 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
5770 Add load_nvlist() error handling
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Revert "NEX-4476 WRC: Allow to use write back cache per tree of datasets"
This reverts commit fe97b74444278a6f36fec93179133641296312da.
NEX-4476 WRC: Allow to use write back cache per tree of datasets
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-4077 taskq_dispatch in on-demand TRIM can sometimes fail
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Revert "NEX-3965 System may panic on the importing of pool with WRC"
This reverts commit 45bc50222913cddafde94621d28b78d6efaea897.
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3965 System may panic on the importing of pool with WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3817 'zpool add' of special devices causes system panic
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3213 need to load vdev props for all vdev including spares and l2arc vdevs
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-2112 `zdb -e <pool>` assertion failed for thread 0xfffffd7fff172a40
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-1228 Panic importing pool with active unsupported features
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Harold Shaw <harold.shaw@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-140 Duplicate entries in mantools and doctools manifests
NEX-1078 Replaced ASSERT with if-statement
NEX-521 Single threaded rpcbind is not scalable
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
NEX-1088 partially rolled back 641841bb
to fix regression that caused assert in read-only import.
OS-115 Heap leaks related to OS-114 and SUP-577
SUP-577 deadlock between zpool detach and syseventd
OS-103 handle CoS descriptor persistent references across vdev operations
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Make special vdev subtree topology the same as regular vdev subtree to simplify testcase setup
Fixup merge issues
Fix default properties' values after export/import
zfsxx issue #11: support for spare device groups
Issue #34: Add feature flag for the compound checksum - sha1crc32
           Contributors: Boris Protopopov
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
Issue #27: Auto best-effort dedup enable/disable - settable per pool
Issue #7: Reconcile L2ARC and "special" use by datasets
Issue #9: Support for persistent CoS/vdev attributes with feature flags
          Support for feature flags for special tier
          Contributors: Daniil Lunev, Boris Protopopov
Issue #2: optimize DDE lookup in DDT objects
Added option to control number of classes of DDE's in DDT.
New default is one, that is all DDE's are stored together
regardless of refcount.
Issue #3: Add support for parametrized number of copies for DDTs
Issue #25: Add a pool-level property that controls the number of copies of DDTs in the pool.
Fixup merge results
re #13850 Refactor ZFS config discovery IOCs to libzfs_core patterns
re 13748 added zpool export -c option
zpool export -c command exports specified pool while keeping its latest
configuration in the cache file for subsequent zpool import -c.
re #13333 rb4362 - eliminated spa_update_iotime() to fix the stats
re #12684 rb4206 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint -courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

*** 19,37 ****
   * CDDL HEADER END
   */

  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
!  * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
!  * Copyright (c) 2015, Nexenta Systems, Inc. All rights reserved.
   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
   * Copyright 2013 Saso Kiselkov. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
   * Copyright 2016 Toomas Soome <tsoome@me.com>
!  * Copyright 2017 Joyent, Inc.
   * Copyright (c) 2017 Datto Inc.
-  * Copyright 2018 OmniOS Community Edition (OmniOSce) Association.
   */

  /*
   * SPA: Storage Pool Allocator
   *
--- 19,36 ----
   * CDDL HEADER END
   */

  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
!  * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
+  * Copyright 2018 Nexenta Systems, Inc. All rights reserved.
   * Copyright 2013 Saso Kiselkov. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
   * Copyright 2016 Toomas Soome <tsoome@me.com>
!  * Copyright 2018 Joyent, Inc.
   * Copyright (c) 2017 Datto Inc.
   */

  /*
   * SPA: Storage Pool Allocator
   *
*** 49,67 ****
  #include <sys/dmu_tx.h>
  #include <sys/zap.h>
  #include <sys/zil.h>
  #include <sys/ddt.h>
  #include <sys/vdev_impl.h>
- #include <sys/vdev_removal.h>
- #include <sys/vdev_indirect_mapping.h>
- #include <sys/vdev_indirect_births.h>
  #include <sys/metaslab.h>
  #include <sys/metaslab_impl.h>
  #include <sys/uberblock_impl.h>
  #include <sys/txg.h>
  #include <sys/avl.h>
- #include <sys/bpobj.h>
  #include <sys/dmu_traverse.h>
  #include <sys/dmu_objset.h>
  #include <sys/unique.h>
  #include <sys/dsl_pool.h>
  #include <sys/dsl_dataset.h>
--- 48,62 ----
*** 75,84 ****
--- 70,82 ----
  #include <sys/spa_boot.h>
  #include <sys/zfs_ioctl.h>
  #include <sys/dsl_scan.h>
  #include <sys/zfeature.h>
  #include <sys/dsl_destroy.h>
+ #include <sys/cos.h>
+ #include <sys/special.h>
+ #include <sys/wbc.h>
  #include <sys/abd.h>

  #ifdef _KERNEL
  #include <sys/bootprops.h>
  #include <sys/callb.h>
*** 93,103 ****
  /*
   * The interval, in seconds, at which failed configuration cache file writes
   * should be retried.
   */
! int zfs_ccw_retry_interval = 300;

  typedef enum zti_modes {
      ZTI_MODE_FIXED,     /* value is # of threads (min 1) */
      ZTI_MODE_BATCH,     /* cpu-intensive; value is ignored */
      ZTI_MODE_NULL,      /* don't create a taskq */
--- 91,101 ----
  /*
   * The interval, in seconds, at which failed configuration cache file writes
   * should be retried.
   */
! static int zfs_ccw_retry_interval = 300;

  typedef enum zti_modes {
      ZTI_MODE_FIXED,     /* value is # of threads (min 1) */
      ZTI_MODE_BATCH,     /* cpu-intensive; value is ignored */
      ZTI_MODE_NULL,      /* don't create a taskq */
*** 146,161 **** { ZTI_P(12, 8), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* FREE */ { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* CLAIM */ { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* IOCTL */ }; static void spa_sync_version(void *arg, dmu_tx_t *tx); static void spa_sync_props(void *arg, dmu_tx_t *tx); static boolean_t spa_has_active_shared_spare(spa_t *spa); ! static int spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport, ! boolean_t reloading); static void spa_vdev_resilver_done(spa_t *spa); uint_t zio_taskq_batch_pct = 75; /* 1 thread per cpu in pset */ id_t zio_taskq_psrset_bind = PS_NONE; boolean_t zio_taskq_sysdc = B_TRUE; /* use SDC scheduling class */ uint_t zio_taskq_basedc = 80; /* base duty cycle */ --- 144,169 ---- { ZTI_P(12, 8), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* FREE */ { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* CLAIM */ { ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* IOCTL */ }; + static sysevent_t *spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, + const char *name); + static void spa_event_notify_impl(sysevent_t *ev); static void spa_sync_version(void *arg, dmu_tx_t *tx); static void spa_sync_props(void *arg, dmu_tx_t *tx); + static void spa_vdev_sync_props(void *arg, dmu_tx_t *tx); + static int spa_vdev_prop_set_nosync(vdev_t *, nvlist_t *, boolean_t *); static boolean_t spa_has_active_shared_spare(spa_t *spa); ! static int spa_load_impl(spa_t *spa, uint64_t, nvlist_t *config, ! spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig, ! char **ereport); static void spa_vdev_resilver_done(spa_t *spa); + static void spa_auto_trim(spa_t *spa, uint64_t txg); + static void spa_vdev_man_trim_done(spa_t *spa); + static void spa_vdev_auto_trim_done(spa_t *spa); + static uint64_t spa_min_trim_rate(spa_t *spa); uint_t zio_taskq_batch_pct = 75; /* 1 thread per cpu in pset */ id_t zio_taskq_psrset_bind = PS_NONE; boolean_t zio_taskq_sysdc = B_TRUE; /* use SDC scheduling class */ uint_t zio_taskq_basedc = 80; /* base duty cycle */
*** 162,231 **** boolean_t spa_create_process = B_TRUE; /* no process ==> no sysdc */ extern int zfs_sync_pass_deferred_free; /* - * Report any spa_load_verify errors found, but do not fail spa_load. - * This is used by zdb to analyze non-idle pools. - */ - boolean_t spa_load_verify_dryrun = B_FALSE; - - /* - * This (illegal) pool name is used when temporarily importing a spa_t in order - * to get the vdev stats associated with the imported devices. - */ - #define TRYIMPORT_NAME "$import" - - /* - * For debugging purposes: print out vdev tree during pool import. - */ - boolean_t spa_load_print_vdev_tree = B_FALSE; - - /* - * A non-zero value for zfs_max_missing_tvds means that we allow importing - * pools with missing top-level vdevs. This is strictly intended for advanced - * pool recovery cases since missing data is almost inevitable. Pools with - * missing devices can only be imported read-only for safety reasons, and their - * fail-mode will be automatically set to "continue". - * - * With 1 missing vdev we should be able to import the pool and mount all - * datasets. User data that was not modified after the missing device has been - * added should be recoverable. This means that snapshots created prior to the - * addition of that device should be completely intact. - * - * With 2 missing vdevs, some datasets may fail to mount since there are - * dataset statistics that are stored as regular metadata. Some data might be - * recoverable if those vdevs were added recently. - * - * With 3 or more missing vdevs, the pool is severely damaged and MOS entries - * may be missing entirely. Chances of data recovery are very low. Note that - * there are also risks of performing an inadvertent rewind as we might be - * missing all the vdevs with the latest uberblocks. - */ - uint64_t zfs_max_missing_tvds = 0; - - /* - * The parameters below are similar to zfs_max_missing_tvds but are only - * intended for a preliminary open of the pool with an untrusted config which - * might be incomplete or out-dated. - * - * We are more tolerant for pools opened from a cachefile since we could have - * an out-dated cachefile where a device removal was not registered. - * We could have set the limit arbitrarily high but in the case where devices - * are really missing we would want to return the proper error codes; we chose - * SPA_DVAS_PER_BP - 1 so that some copies of the MOS would still be available - * and we get a chance to retrieve the trusted config. - */ - uint64_t zfs_max_missing_tvds_cachefile = SPA_DVAS_PER_BP - 1; - /* - * In the case where config was assembled by scanning device paths (/dev/dsks - * by default) we are less tolerant since all the existing devices should have - * been detected and we want spa_load to return the right error codes. - */ - uint64_t zfs_max_missing_tvds_scan = 0; - - /* * ========================================================================== * SPA properties routines * ========================================================================== */ --- 170,179 ----
*** 257,266 ****
--- 205,215 ----
  static void
  spa_prop_get_config(spa_t *spa, nvlist_t **nvp)
  {
      vdev_t *rvd = spa->spa_root_vdev;
      dsl_pool_t *pool = spa->spa_dsl_pool;
+     spa_meta_placement_t *mp = &spa->spa_meta_policy;
      uint64_t size, alloc, cap, version;
      zprop_source_t src = ZPROP_SRC_NONE;
      spa_config_dirent_t *dp;
      metaslab_class_t *mc = spa_normal_class(spa);
*** 272,295 **** --- 221,280 ---- spa_prop_add_list(*nvp, ZPOOL_PROP_NAME, spa_name(spa), 0, src); spa_prop_add_list(*nvp, ZPOOL_PROP_SIZE, NULL, size, src); spa_prop_add_list(*nvp, ZPOOL_PROP_ALLOCATED, NULL, alloc, src); spa_prop_add_list(*nvp, ZPOOL_PROP_FREE, NULL, size - alloc, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_ENABLESPECIAL, NULL, + (uint64_t)spa->spa_usesc, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_MINWATERMARK, NULL, + spa->spa_minwat, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_HIWATERMARK, NULL, + spa->spa_hiwat, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_LOWATERMARK, NULL, + spa->spa_lowat, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPMETA_DITTO, NULL, + spa->spa_ddt_meta_copies, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_META_PLACEMENT, NULL, + mp->spa_enable_meta_placement_selection, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_SYNC_TO_SPECIAL, NULL, + mp->spa_sync_to_special, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_META_TO_METADEV, NULL, + mp->spa_ddt_meta_to_special, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_ZFS_META_TO_METADEV, + NULL, mp->spa_zfs_meta_to_special, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_SMALL_DATA_TO_METADEV, NULL, + mp->spa_small_data_to_special, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_FRAGMENTATION, NULL, metaslab_class_fragmentation(mc), src); spa_prop_add_list(*nvp, ZPOOL_PROP_EXPANDSZ, NULL, metaslab_class_expandable_space(mc), src); spa_prop_add_list(*nvp, ZPOOL_PROP_READONLY, NULL, (spa_mode(spa) == FREAD), src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DDT_DESEGREGATION, NULL, + (spa->spa_ddt_class_min == spa->spa_ddt_class_max), src); + cap = (size == 0) ? 0 : (alloc * 100 / size); spa_prop_add_list(*nvp, ZPOOL_PROP_CAPACITY, NULL, cap, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_BEST_EFFORT, NULL, + spa->spa_dedup_best_effort, src); + + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, NULL, + spa->spa_dedup_lo_best_effort, src); + + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, NULL, + spa->spa_dedup_hi_best_effort, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DEDUPRATIO, NULL, ddt_get_pool_dedup_ratio(spa), src); + spa_prop_add_list(*nvp, ZPOOL_PROP_DDTCAPPED, NULL, + spa->spa_ddt_capped, src); + spa_prop_add_list(*nvp, ZPOOL_PROP_HEALTH, NULL, rvd->vdev_state, src); version = spa_version(spa); if (version == zpool_prop_default_numeric(ZPOOL_PROP_VERSION))
*** 304,318 **** * The $FREE directory was introduced in SPA_VERSION_DEADLISTS, * when opening pools before this version freedir will be NULL. */ if (pool->dp_free_dir != NULL) { spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL, ! dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes, src); } else { spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, ! NULL, 0, src); } if (pool->dp_leak_dir != NULL) { spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL, dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes, --- 289,304 ---- * The $FREE directory was introduced in SPA_VERSION_DEADLISTS, * when opening pools before this version freedir will be NULL. */ if (pool->dp_free_dir != NULL) { spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, NULL, ! dsl_dir_phys(pool->dp_free_dir)->dd_used_bytes + ! pool->dp_long_freeing_total, src); } else { spa_prop_add_list(*nvp, ZPOOL_PROP_FREEING, ! NULL, pool->dp_long_freeing_total, src); } if (pool->dp_leak_dir != NULL) { spa_prop_add_list(*nvp, ZPOOL_PROP_LEAKED, NULL, dsl_dir_phys(pool->dp_leak_dir)->dd_used_bytes,
*** 388,398 ****
          uint64_t intval = 0;
          char *strval = NULL;
          zprop_source_t src = ZPROP_SRC_DEFAULT;
          zpool_prop_t prop;

!         if ((prop = zpool_name_to_prop(za.za_name)) == ZPOOL_PROP_INVAL)
              continue;

          switch (za.za_integer_length) {
          case 8:
              /* integer property */
--- 374,384 ----
          uint64_t intval = 0;
          char *strval = NULL;
          zprop_source_t src = ZPROP_SRC_DEFAULT;
          zpool_prop_t prop;

!         if ((prop = zpool_name_to_prop(za.za_name)) == ZPROP_INVAL)
              continue;

          switch (za.za_integer_length) {
          case 8:
              /* integer property */
*** 467,486 **** { nvpair_t *elem; int error = 0, reset_bootfs = 0; uint64_t objnum = 0; boolean_t has_feature = B_FALSE; elem = NULL; while ((elem = nvlist_next_nvpair(props, elem)) != NULL) { uint64_t intval; char *strval, *slash, *check, *fname; const char *propname = nvpair_name(elem); zpool_prop_t prop = zpool_name_to_prop(propname); switch (prop) { ! case ZPOOL_PROP_INVAL: if (!zpool_prop_feature(propname)) { error = SET_ERROR(EINVAL); break; } --- 453,475 ---- { nvpair_t *elem; int error = 0, reset_bootfs = 0; uint64_t objnum = 0; boolean_t has_feature = B_FALSE; + uint64_t lowat = spa->spa_lowat, hiwat = spa->spa_hiwat, + minwat = spa->spa_minwat; elem = NULL; while ((elem = nvlist_next_nvpair(props, elem)) != NULL) { uint64_t intval; char *strval, *slash, *check, *fname; const char *propname = nvpair_name(elem); zpool_prop_t prop = zpool_name_to_prop(propname); + spa_feature_t feature; switch (prop) { ! case ZPROP_INVAL: if (!zpool_prop_feature(propname)) { error = SET_ERROR(EINVAL); break; }
*** 501,515 **** error = SET_ERROR(EINVAL); break; } fname = strchr(propname, '@') + 1; ! if (zfeature_lookup_name(fname, NULL) != 0) { error = SET_ERROR(EINVAL); break; } has_feature = B_TRUE; break; case ZPOOL_PROP_VERSION: error = nvpair_value_uint64(elem, &intval); --- 490,510 ---- error = SET_ERROR(EINVAL); break; } fname = strchr(propname, '@') + 1; ! if (zfeature_lookup_name(fname, &feature) != 0) { error = SET_ERROR(EINVAL); break; } + if (feature == SPA_FEATURE_WBC && + !spa_has_special(spa)) { + error = SET_ERROR(ENOTSUP); + break; + } + has_feature = B_TRUE; break; case ZPOOL_PROP_VERSION: error = nvpair_value_uint64(elem, &intval);
*** 522,536 **** --- 517,555 ---- case ZPOOL_PROP_DELEGATION: case ZPOOL_PROP_AUTOREPLACE: case ZPOOL_PROP_LISTSNAPS: case ZPOOL_PROP_AUTOEXPAND: + case ZPOOL_PROP_DEDUP_BEST_EFFORT: + case ZPOOL_PROP_DDT_DESEGREGATION: + case ZPOOL_PROP_META_PLACEMENT: + case ZPOOL_PROP_FORCETRIM: + case ZPOOL_PROP_AUTOTRIM: error = nvpair_value_uint64(elem, &intval); if (!error && intval > 1) error = SET_ERROR(EINVAL); break; + case ZPOOL_PROP_DDT_META_TO_METADEV: + case ZPOOL_PROP_ZFS_META_TO_METADEV: + error = nvpair_value_uint64(elem, &intval); + if (!error && intval > META_PLACEMENT_DUAL) + error = SET_ERROR(EINVAL); + break; + + case ZPOOL_PROP_SYNC_TO_SPECIAL: + error = nvpair_value_uint64(elem, &intval); + if (!error && intval > SYNC_TO_SPECIAL_ALWAYS) + error = SET_ERROR(EINVAL); + break; + + case ZPOOL_PROP_SMALL_DATA_TO_METADEV: + error = nvpair_value_uint64(elem, &intval); + if (!error && intval > SPA_MAXBLOCKSIZE) + error = SET_ERROR(EINVAL); + break; + case ZPOOL_PROP_BOOTFS: /* * If the pool version is less than SPA_VERSION_BOOTFS, * or the pool is still being created (version == 0), * the bootfs property cannot be set.
*** 584,593 **** --- 603,626 ---- } dmu_objset_rele(os, FTAG); } break; + case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT: + error = nvpair_value_uint64(elem, &intval); + if ((intval < 0) || (intval > 100) || + (intval >= spa->spa_dedup_hi_best_effort)) + error = SET_ERROR(EINVAL); + break; + + case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT: + error = nvpair_value_uint64(elem, &intval); + if ((intval < 0) || (intval > 100) || + (intval <= spa->spa_dedup_lo_best_effort)) + error = SET_ERROR(EINVAL); + break; + case ZPOOL_PROP_FAILUREMODE: error = nvpair_value_uint64(elem, &intval); if (!error && (intval < ZIO_FAILURE_MODE_WAIT || intval > ZIO_FAILURE_MODE_PANIC)) error = SET_ERROR(EINVAL);
*** 645,655 ****
                      error = SET_ERROR(EINVAL);
                      break;
                  }
              }
              if (strlen(strval) > ZPROP_MAX_COMMENT)
!                 error = E2BIG;
              break;

          case ZPOOL_PROP_DEDUPDITTO:
              if (spa_version(spa) < SPA_VERSION_DEDUP)
                  error = SET_ERROR(ENOTSUP);
--- 678,688 ----
                      error = SET_ERROR(EINVAL);
                      break;
                  }
              }
              if (strlen(strval) > ZPROP_MAX_COMMENT)
!                 error = SET_ERROR(E2BIG);
              break;

          case ZPOOL_PROP_DEDUPDITTO:
              if (spa_version(spa) < SPA_VERSION_DEDUP)
                  error = SET_ERROR(ENOTSUP);
*** 657,672 **** --- 690,743 ---- error = nvpair_value_uint64(elem, &intval); if (error == 0 && intval != 0 && intval < ZIO_DEDUPDITTO_MIN) error = SET_ERROR(EINVAL); break; + + case ZPOOL_PROP_MINWATERMARK: + error = nvpair_value_uint64(elem, &intval); + if (!error && (intval > 100)) + error = SET_ERROR(EINVAL); + minwat = intval; + break; + case ZPOOL_PROP_LOWATERMARK: + error = nvpair_value_uint64(elem, &intval); + if (!error && (intval > 100)) + error = SET_ERROR(EINVAL); + lowat = intval; + break; + case ZPOOL_PROP_HIWATERMARK: + error = nvpair_value_uint64(elem, &intval); + if (!error && (intval > 100)) + error = SET_ERROR(EINVAL); + hiwat = intval; + break; + case ZPOOL_PROP_DEDUPMETA_DITTO: + error = nvpair_value_uint64(elem, &intval); + if (!error && (intval > SPA_DVAS_PER_BP)) + error = SET_ERROR(EINVAL); + break; + case ZPOOL_PROP_SCRUB_PRIO: + case ZPOOL_PROP_RESILVER_PRIO: + error = nvpair_value_uint64(elem, &intval); + if (error || intval > 100) + error = SET_ERROR(EINVAL); + break; } if (error) break; } + /* check if low watermark is less than high watermark */ + if (lowat != 0 && lowat >= hiwat) + error = SET_ERROR(EINVAL); + + /* check if min watermark is less than low watermark */ + if (minwat != 0 && minwat >= lowat) + error = SET_ERROR(EINVAL); + if (!error && reset_bootfs) { error = nvlist_remove(props, zpool_prop_to_name(ZPOOL_PROP_BOOTFS), DATA_TYPE_STRING); if (!error) {
*** 719,729 ****
          if (prop == ZPOOL_PROP_CACHEFILE ||
              prop == ZPOOL_PROP_ALTROOT ||
              prop == ZPOOL_PROP_READONLY)
              continue;

!         if (prop == ZPOOL_PROP_VERSION || prop == ZPOOL_PROP_INVAL) {
              uint64_t ver;

              if (prop == ZPOOL_PROP_VERSION) {
                  VERIFY(nvpair_value_uint64(elem, &ver) == 0);
              } else {
--- 790,800 ----
          if (prop == ZPOOL_PROP_CACHEFILE ||
              prop == ZPOOL_PROP_ALTROOT ||
              prop == ZPOOL_PROP_READONLY)
              continue;

!         if (prop == ZPOOL_PROP_VERSION || prop == ZPROP_INVAL) {
              uint64_t ver;

              if (prop == ZPOOL_PROP_VERSION) {
                  VERIFY(nvpair_value_uint64(elem, &ver) == 0);
              } else {
*** 838,848 ****
      error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
          spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);

      if (error == 0) {
!         spa_write_cachefile(spa, B_FALSE, B_TRUE);
          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
      }

      mutex_exit(&spa_namespace_lock);
      mutex_exit(&spa->spa_vdev_top_lock);
--- 909,919 ----
      error = dsl_sync_task(spa->spa_name, spa_change_guid_check,
          spa_change_guid_sync, &guid, 5, ZFS_SPACE_CHECK_RESERVED);

      if (error == 0) {
!         spa_config_sync(spa, B_FALSE, B_TRUE);
          spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_REGUID);
      }

      mutex_exit(&spa_namespace_lock);
      mutex_exit(&spa->spa_vdev_top_lock);
*** 1106,1115 ****
--- 1177,1187 ----
      spa->spa_state = POOL_STATE_ACTIVE;
      spa->spa_mode = mode;

      spa->spa_normal_class = metaslab_class_create(spa, zfs_metaslab_ops);
      spa->spa_log_class = metaslab_class_create(spa, zfs_metaslab_ops);
+     spa->spa_special_class = metaslab_class_create(spa, zfs_metaslab_ops);

      /* Try to create a covering process */
      mutex_enter(&spa->spa_proc_lock);
      ASSERT(spa->spa_proc_state == SPA_PROC_NONE);
      ASSERT(spa->spa_proc == &p0);
*** 1140,1152 ****
      /* If we didn't create a process, we need to create our taskqs. */
      if (spa->spa_proc == &p0) {
          spa_create_zio_taskqs(spa);
      }

-     for (size_t i = 0; i < TXG_SIZE; i++)
-         spa->spa_txg_zio[i] = zio_root(spa, NULL, NULL, 0);
-
      list_create(&spa->spa_config_dirty_list, sizeof (vdev_t),
          offsetof(vdev_t, vdev_config_dirty_node));
      list_create(&spa->spa_evicting_os_list, sizeof (objset_t),
          offsetof(objset_t, os_evicting_node));
      list_create(&spa->spa_state_dirty_list, sizeof (vdev_t),
--- 1212,1221 ----
*** 1187,1208 **** for (int q = 0; q < ZIO_TASKQ_TYPES; q++) { spa_taskqs_fini(spa, t, q); } } - for (size_t i = 0; i < TXG_SIZE; i++) { - ASSERT3P(spa->spa_txg_zio[i], !=, NULL); - VERIFY0(zio_wait(spa->spa_txg_zio[i])); - spa->spa_txg_zio[i] = NULL; - } - metaslab_class_destroy(spa->spa_normal_class); spa->spa_normal_class = NULL; metaslab_class_destroy(spa->spa_log_class); spa->spa_log_class = NULL; /* * If this was part of an import or the open otherwise failed, we may * still have errors left in the queues. Empty them just in case. */ spa_errlog_drain(spa); --- 1256,1274 ---- for (int q = 0; q < ZIO_TASKQ_TYPES; q++) { spa_taskqs_fini(spa, t, q); } } metaslab_class_destroy(spa->spa_normal_class); spa->spa_normal_class = NULL; metaslab_class_destroy(spa->spa_log_class); spa->spa_log_class = NULL; + metaslab_class_destroy(spa->spa_special_class); + spa->spa_special_class = NULL; + /* * If this was part of an import or the open otherwise failed, we may * still have errors left in the queues. Empty them just in case. */ spa_errlog_drain(spa);
*** 1293,1303 **** { int i; ASSERT(MUTEX_HELD(&spa_namespace_lock)); ! spa_load_note(spa, "UNLOADING"); /* * Stop async tasks. */ spa_async_suspend(spa); --- 1359,1377 ---- { int i; ASSERT(MUTEX_HELD(&spa_namespace_lock)); ! /* ! * Stop manual trim before stopping spa sync, because manual trim ! * needs to execute a synctask (trim timestamp sync) at the end. ! */ ! mutex_enter(&spa->spa_auto_trim_lock); ! mutex_enter(&spa->spa_man_trim_lock); ! spa_trim_stop_wait(spa); ! mutex_exit(&spa->spa_man_trim_lock); ! mutex_exit(&spa->spa_auto_trim_lock); /* * Stop async tasks. */ spa_async_suspend(spa);
*** 1331,1358 **** (void) zio_wait(spa->spa_async_zio_root[i]); kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *)); spa->spa_async_zio_root = NULL; } - if (spa->spa_vdev_removal != NULL) { - spa_vdev_removal_destroy(spa->spa_vdev_removal); - spa->spa_vdev_removal = NULL; - } - - if (spa->spa_condense_zthr != NULL) { - ASSERT(!zthr_isrunning(spa->spa_condense_zthr)); - zthr_destroy(spa->spa_condense_zthr); - spa->spa_condense_zthr = NULL; - } - - spa_condense_fini(spa); - bpobj_close(&spa->spa_deferred_bpobj); spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); /* * Close all vdevs. */ if (spa->spa_root_vdev) vdev_free(spa->spa_root_vdev); ASSERT(spa->spa_root_vdev == NULL); --- 1405,1427 ---- (void) zio_wait(spa->spa_async_zio_root[i]); kmem_free(spa->spa_async_zio_root, max_ncpus * sizeof (void *)); spa->spa_async_zio_root = NULL; } bpobj_close(&spa->spa_deferred_bpobj); spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); /* + * Stop autotrim tasks. + */ + mutex_enter(&spa->spa_auto_trim_lock); + if (spa->spa_auto_trim_taskq) + spa_auto_trim_taskq_destroy(spa); + mutex_exit(&spa->spa_auto_trim_lock); + + /* * Close all vdevs. */ if (spa->spa_root_vdev) vdev_free(spa->spa_root_vdev); ASSERT(spa->spa_root_vdev == NULL);
*** 1401,1412 ****
      }
      spa->spa_l2cache.sav_count = 0;

      spa->spa_async_suspended = 0;

-     spa->spa_indirect_vdevs_loaded = B_FALSE;
-
      if (spa->spa_comment != NULL) {
          spa_strfree(spa->spa_comment);
          spa->spa_comment = NULL;
      }
--- 1470,1479 ----
*** 1417,1427 ****
   * Load (or re-load) the current list of vdevs describing the active spares for
   * this pool. When this is called, we have some form of basic information in
   * 'spa_spares.sav_config'. We parse this into vdevs, try to open them, and
   * then re-generate a more complete list including status information.
   */
! void
  spa_load_spares(spa_t *spa)
  {
      nvlist_t **spares;
      uint_t nspares;
      int i;
--- 1484,1494 ----
   * Load (or re-load) the current list of vdevs describing the active spares for
   * this pool. When this is called, we have some form of basic information in
   * 'spa_spares.sav_config'. We parse this into vdevs, try to open them, and
   * then re-generate a more complete list including status information.
   */
! static void
  spa_load_spares(spa_t *spa)
  {
      nvlist_t **spares;
      uint_t nspares;
      int i;
*** 1534,1544 ****
   * 'spa_l2cache.sav_config'. We parse this into vdevs, try to open them, and
   * then re-generate a more complete list including status information.
   * Devices which are already active have their details maintained, and are
   * not re-opened.
   */
! void
  spa_load_l2cache(spa_t *spa)
  {
      nvlist_t **l2cache;
      uint_t nl2cache;
      int i, j, oldnvdevs;
--- 1601,1611 ----
   * 'spa_l2cache.sav_config'. We parse this into vdevs, try to open them, and
   * then re-generate a more complete list including status information.
   * Devices which are already active have their details maintained, and are
   * not re-opened.
   */
! static void
  spa_load_l2cache(spa_t *spa)
  {
      nvlist_t **l2cache;
      uint_t nl2cache;
      int i, j, oldnvdevs;
*** 1605,1618 **** if (vdev_open(vd) != 0) continue; (void) vdev_validate_aux(vd); ! if (!vdev_is_dead(vd)) ! l2arc_add_vdev(spa, vd); } } /* * Purge vdevs that were dropped */ for (i = 0; i < oldnvdevs; i++) { --- 1672,1691 ---- if (vdev_open(vd) != 0) continue; (void) vdev_validate_aux(vd); ! if (!vdev_is_dead(vd)) { ! boolean_t do_rebuild = B_FALSE; ! ! (void) nvlist_lookup_boolean_value(l2cache[i], ! ZPOOL_CONFIG_L2CACHE_PERSISTENT, ! &do_rebuild); ! l2arc_add_vdev(spa, vd, do_rebuild); } } + } /* * Purge vdevs that were dropped */ for (i = 0; i < oldnvdevs; i++) {
*** 1684,1738 **** return (error); } /* - * Concrete top-level vdevs that are not missing and are not logs. At every - * spa_sync we write new uberblocks to at least SPA_SYNC_MIN_VDEVS core tvds. - */ - static uint64_t - spa_healthy_core_tvds(spa_t *spa) - { - vdev_t *rvd = spa->spa_root_vdev; - uint64_t tvds = 0; - - for (uint64_t i = 0; i < rvd->vdev_children; i++) { - vdev_t *vd = rvd->vdev_child[i]; - if (vd->vdev_islog) - continue; - if (vdev_is_concrete(vd) && !vdev_is_dead(vd)) - tvds++; - } - - return (tvds); - } - - /* * Checks to see if the given vdev could not be opened, in which case we post a * sysevent to notify the autoreplace code that the device has been removed. */ static void spa_check_removed(vdev_t *vd) { ! for (uint64_t c = 0; c < vd->vdev_children; c++) spa_check_removed(vd->vdev_child[c]); if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) && ! vdev_is_concrete(vd)) { zfs_post_autoreplace(vd->vdev_spa, vd); spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK); } } ! static int ! spa_check_for_missing_logs(spa_t *spa) { ! vdev_t *rvd = spa->spa_root_vdev; /* * If we're doing a normal import, then build up any additional ! * diagnostic information about missing log devices. * We'll pass this up to the user for further processing. */ if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) { nvlist_t **child, *nv; uint64_t idx = 0; --- 1757,1821 ---- return (error); } /* * Checks to see if the given vdev could not be opened, in which case we post a * sysevent to notify the autoreplace code that the device has been removed. */ static void spa_check_removed(vdev_t *vd) { ! for (int c = 0; c < vd->vdev_children; c++) spa_check_removed(vd->vdev_child[c]); if (vd->vdev_ops->vdev_op_leaf && vdev_is_dead(vd) && ! !vd->vdev_ishole) { zfs_post_autoreplace(vd->vdev_spa, vd); spa_event_notify(vd->vdev_spa, vd, NULL, ESC_ZFS_VDEV_CHECK); } } ! static void ! spa_config_valid_zaps(vdev_t *vd, vdev_t *mvd) { ! ASSERT3U(vd->vdev_children, ==, mvd->vdev_children); + vd->vdev_top_zap = mvd->vdev_top_zap; + vd->vdev_leaf_zap = mvd->vdev_leaf_zap; + + for (uint64_t i = 0; i < vd->vdev_children; i++) { + spa_config_valid_zaps(vd->vdev_child[i], mvd->vdev_child[i]); + } + } + + /* + * Validate the current config against the MOS config + */ + static boolean_t + spa_config_valid(spa_t *spa, nvlist_t *config) + { + vdev_t *mrvd, *rvd = spa->spa_root_vdev; + nvlist_t *nv; + + VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nv) == 0); + + spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); + VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0); + /* + * One of the earliest signs of a stale config is a mismatch + * in the numbers of children vdev's + */ + if (rvd->vdev_children != mrvd->vdev_children) { + vdev_free(mrvd); + spa_config_exit(spa, SCL_ALL, FTAG); + return (B_FALSE); + } + /* * If we're doing a normal import, then build up any additional ! * diagnostic information about missing devices in this config. * We'll pass this up to the user for further processing. */ if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) { nvlist_t **child, *nv; uint64_t idx = 0;
*** 1739,1794 **** child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **), KM_SLEEP); VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0); ! for (uint64_t c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; ! /* ! * We consider a device as missing only if it failed ! * to open (i.e. offline or faulted is not considered ! * as missing). ! */ ! if (tvd->vdev_islog && ! tvd->vdev_state == VDEV_STATE_CANT_OPEN) { ! child[idx++] = vdev_config_generate(spa, tvd, ! B_FALSE, VDEV_CONFIG_MISSING); } - } ! if (idx > 0) { ! fnvlist_add_nvlist_array(nv, ! ZPOOL_CONFIG_CHILDREN, child, idx); ! fnvlist_add_nvlist(spa->spa_load_info, ! ZPOOL_CONFIG_MISSING_DEVICES, nv); ! for (uint64_t i = 0; i < idx; i++) nvlist_free(child[i]); } nvlist_free(nv); kmem_free(child, rvd->vdev_children * sizeof (char **)); - - if (idx > 0) { - spa_load_failed(spa, "some log devices are missing"); - return (SET_ERROR(ENXIO)); } ! } else { ! for (uint64_t c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; ! if (tvd->vdev_islog && ! tvd->vdev_state == VDEV_STATE_CANT_OPEN) { spa_set_log_state(spa, SPA_LOG_CLEAR); ! spa_load_note(spa, "some log devices are " ! "missing, ZIL is dropped."); ! break; } } } ! return (0); } /* * Check for missing log devices */ --- 1822,1931 ---- child = kmem_alloc(rvd->vdev_children * sizeof (nvlist_t **), KM_SLEEP); VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0); ! for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; + vdev_t *mtvd = mrvd->vdev_child[c]; ! if (tvd->vdev_ops == &vdev_missing_ops && ! mtvd->vdev_ops != &vdev_missing_ops && ! mtvd->vdev_islog) ! child[idx++] = vdev_config_generate(spa, mtvd, ! B_FALSE, 0); } ! if (idx) { ! VERIFY(nvlist_add_nvlist_array(nv, ! ZPOOL_CONFIG_CHILDREN, child, idx) == 0); ! VERIFY(nvlist_add_nvlist(spa->spa_load_info, ! ZPOOL_CONFIG_MISSING_DEVICES, nv) == 0); ! for (int i = 0; i < idx; i++) nvlist_free(child[i]); } nvlist_free(nv); kmem_free(child, rvd->vdev_children * sizeof (char **)); } ! ! /* ! * Compare the root vdev tree with the information we have ! * from the MOS config (mrvd). Check each top-level vdev ! * with the corresponding MOS config top-level (mtvd). ! */ ! for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; + vdev_t *mtvd = mrvd->vdev_child[c]; ! /* ! * Resolve any "missing" vdevs in the current configuration. ! * If we find that the MOS config has more accurate information ! * about the top-level vdev then use that vdev instead. ! */ ! if (tvd->vdev_ops == &vdev_missing_ops && ! mtvd->vdev_ops != &vdev_missing_ops) { ! ! if (!(spa->spa_import_flags & ZFS_IMPORT_MISSING_LOG)) ! continue; ! ! /* ! * Device specific actions. ! */ ! if (mtvd->vdev_islog) { spa_set_log_state(spa, SPA_LOG_CLEAR); ! } else { ! /* ! * XXX - once we have 'readonly' pool ! * support we should be able to handle ! * missing data devices by transitioning ! * the pool to readonly. ! */ ! continue; } + + /* + * Swap the missing vdev with the data we were + * able to obtain from the MOS config. + */ + vdev_remove_child(rvd, tvd); + vdev_remove_child(mrvd, mtvd); + + vdev_add_child(rvd, mtvd); + vdev_add_child(mrvd, tvd); + + spa_config_exit(spa, SCL_ALL, FTAG); + vdev_load(mtvd); + spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); + + vdev_reopen(rvd); + } else { + if (mtvd->vdev_islog) { + /* + * Load the slog device's state from the MOS + * config since it's possible that the label + * does not contain the most up-to-date + * information. 
+ */ + vdev_load_log_state(tvd, mtvd); + vdev_reopen(tvd); } + + /* + * Per-vdev ZAP info is stored exclusively in the MOS. + */ + spa_config_valid_zaps(tvd, mtvd); } + } ! vdev_free(mrvd); ! spa_config_exit(spa, SCL_ALL, FTAG); ! ! /* ! * Ensure we were able to validate the config. ! */ ! return (rvd->vdev_guid_sum == spa->spa_uberblock.ub_guid_sum); } /* * Check for missing log devices */
*** 1850,1864 ****
              metaslab_group_activate(mg);
      }
  }

  int
! spa_reset_logs(spa_t *spa)
  {
      int error;

!     error = dmu_objset_find(spa_name(spa), zil_reset, NULL, DS_FIND_CHILDREN);
      if (error == 0) {
          /*
           * We successfully offlined the log device, sync out the
           * current txg so that the "stubby" block can be removed
--- 1987,2001 ----
              metaslab_group_activate(mg);
      }
  }

  int
! spa_offline_log(spa_t *spa)
  {
      int error;

!     error = dmu_objset_find(spa_name(spa), zil_vdev_offline, NULL, DS_FIND_CHILDREN);
      if (error == 0) {
          /*
           * We successfully offlined the log device, sync out the
           * current txg so that the "stubby" block can be removed
*** 1904,1915 ****
      int error = zio->io_error;
      spa_t *spa = zio->io_spa;

      abd_free(zio->io_abd);

      if (error) {
!         if ((BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type)) &&
!             type != DMU_OT_INTENT_LOG)
              atomic_inc_64(&sle->sle_meta_count);
          else
              atomic_inc_64(&sle->sle_data_count);
      }
--- 2041,2051 ----
      int error = zio->io_error;
      spa_t *spa = zio->io_spa;

      abd_free(zio->io_abd);

      if (error) {
!         if (BP_IS_METADATA(bp) && type != DMU_OT_INTENT_LOG)
              atomic_inc_64(&sle->sle_meta_count);
          else
              atomic_inc_64(&sle->sle_data_count);
      }
*** 1994,2029 **** rio = zio_root(spa, NULL, &sle, ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE); if (spa_load_verify_metadata) { ! if (spa->spa_extreme_rewind) { ! spa_load_note(spa, "performing a complete scan of the " ! "pool since extreme rewind is on. This may take " ! "a very long time.\n (spa_load_verify_data=%u, " ! "spa_load_verify_metadata=%u)", ! spa_load_verify_data, spa_load_verify_metadata); ! } ! error = traverse_pool(spa, spa->spa_verify_min_txg, TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, ! spa_load_verify_cb, rio); } (void) zio_wait(rio); spa->spa_load_meta_errors = sle.sle_meta_count; spa->spa_load_data_errors = sle.sle_data_count; ! if (sle.sle_meta_count != 0 || sle.sle_data_count != 0) { ! spa_load_note(spa, "spa_load_verify found %llu metadata errors " ! "and %llu data errors", (u_longlong_t)sle.sle_meta_count, ! (u_longlong_t)sle.sle_data_count); ! } ! ! if (spa_load_verify_dryrun || ! (!error && sle.sle_meta_count <= policy.zrp_maxmeta && ! sle.sle_data_count <= policy.zrp_maxdata)) { int64_t loss = 0; verify_ok = B_TRUE; spa->spa_load_txg = spa->spa_uberblock.ub_txg; spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp; --- 2130,2152 ---- rio = zio_root(spa, NULL, &sle, ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE); if (spa_load_verify_metadata) { ! zbookmark_phys_t zb = { 0 }; ! error = traverse_pool(spa, spa->spa_verify_min_txg, UINT64_MAX, TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, ! spa_load_verify_cb, rio, &zb); } (void) zio_wait(rio); spa->spa_load_meta_errors = sle.sle_meta_count; spa->spa_load_data_errors = sle.sle_data_count; ! if (!error && sle.sle_meta_count <= policy.zrp_maxmeta && ! sle.sle_data_count <= policy.zrp_maxdata) { int64_t loss = 0; verify_ok = B_TRUE; spa->spa_load_txg = spa->spa_uberblock.ub_txg; spa->spa_load_txg_ts = spa->spa_uberblock.ub_timestamp;
*** 2037,2049 ****
              ZPOOL_CONFIG_LOAD_DATA_ERRORS, sle.sle_data_count) == 0);
      } else {
          spa->spa_load_max_txg = spa->spa_uberblock.ub_txg;
      }

-     if (spa_load_verify_dryrun)
-         return (0);
-
      if (error) {
          if (error != ENXIO && error != EIO)
              error = SET_ERROR(EIO);
          return (error);
      }
--- 2160,2169 ----
*** 2063,2102 **** /* * Find a value in the pool directory object. */ static int ! spa_dir_prop(spa_t *spa, const char *name, uint64_t *val, boolean_t log_enoent) { ! int error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, ! name, sizeof (uint64_t), 1, val); ! if (error != 0 && (error != ENOENT || log_enoent)) { ! spa_load_failed(spa, "couldn't get '%s' value in MOS directory " ! "[error=%d]", name, error); } - - return (error); } static int spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err) { vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux); ! return (SET_ERROR(err)); } - static void - spa_spawn_aux_threads(spa_t *spa) - { - ASSERT(spa_writeable(spa)); - - ASSERT(MUTEX_HELD(&spa_namespace_lock)); - - spa_start_indirect_condensing_thread(spa); - } - /* * Fix up config after a partly-completed split. This is done with the * ZPOOL_CONFIG_SPLIT nvlist. Both the splitting pool and the split-off * pool have that entry in their config, but only the splitting one contains * a list of all the guids of the vdevs that are being split off. --- 2183,2220 ---- /* * Find a value in the pool directory object. */ static int ! spa_dir_prop(spa_t *spa, const char *name, uint64_t *val) { ! return (zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, ! name, sizeof (uint64_t), 1, val)); ! } ! static void ! spa_set_ddt_classes(spa_t *spa, int desegregation) ! { ! /* ! * if desegregation is turned on then set up ddt_class restrictions ! */ ! if (desegregation) { ! spa->spa_ddt_class_min = DDT_CLASS_DUPLICATE; ! spa->spa_ddt_class_max = DDT_CLASS_DUPLICATE; ! } else { ! spa->spa_ddt_class_min = DDT_CLASS_DITTO; ! spa->spa_ddt_class_max = DDT_CLASS_UNIQUE; } } static int spa_vdev_err(vdev_t *vdev, vdev_aux_t aux, int err) { vdev_set_state(vdev, B_TRUE, VDEV_STATE_CANT_OPEN, aux); ! return (err); } /* * Fix up config after a partly-completed split. This is done with the * ZPOOL_CONFIG_SPLIT nvlist. Both the splitting pool and the split-off * pool have that entry in their config, but only the splitting one contains * a list of all the guids of the vdevs that are being split off.
*** 2176,2194 **** kmem_free(vd, gcount * sizeof (vdev_t *)); } static int ! spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type) { char *ereport = FM_EREPORT_ZFS_POOL; int error; ! spa->spa_load_state = state; gethrestime(&spa->spa_loaded_ts); ! error = spa_load_impl(spa, type, &ereport, B_FALSE); /* * Don't count references from objsets that are already closed * and are making their way through the eviction process. */ --- 2294,2350 ---- kmem_free(vd, gcount * sizeof (vdev_t *)); } static int ! spa_load(spa_t *spa, spa_load_state_t state, spa_import_type_t type, ! boolean_t mosconfig) { + nvlist_t *config = spa->spa_config; char *ereport = FM_EREPORT_ZFS_POOL; + char *comment; int error; + uint64_t pool_guid; + nvlist_t *nvl; ! if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) ! return (SET_ERROR(EINVAL)); + ASSERT(spa->spa_comment == NULL); + if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0) + spa->spa_comment = spa_strdup(comment); + + /* + * Versioning wasn't explicitly added to the label until later, so if + * it's not present treat it as the initial version. + */ + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION, + &spa->spa_ubsync.ub_version) != 0) + spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL; + + (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, + &spa->spa_config_txg); + + if ((state == SPA_LOAD_IMPORT || state == SPA_LOAD_TRYIMPORT) && + spa_guid_exists(pool_guid, 0)) { + error = SET_ERROR(EEXIST); + } else { + spa->spa_config_guid = pool_guid; + + if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, + &nvl) == 0) { + VERIFY(nvlist_dup(nvl, &spa->spa_config_splitting, + KM_SLEEP) == 0); + } + + nvlist_free(spa->spa_load_info); + spa->spa_load_info = fnvlist_alloc(); + gethrestime(&spa->spa_loaded_ts); ! error = spa_load_impl(spa, pool_guid, config, state, type, ! mosconfig, &ereport); ! } /* * Don't count references from objsets that are already closed * and are making their way through the eviction process. */
*** 2203,2213 ****
              zfs_ereport_post(ereport, spa, NULL, NULL, 0, 0);
          }
      }
      spa->spa_load_state = error ? SPA_LOAD_ERROR : SPA_LOAD_NONE;
      spa->spa_ena = 0;
-
      return (error);
  }

  /*
   * Count the number of per-vdev ZAPs associated with all of the vdevs in the
--- 2359,2368 ----
*** 2235,2326 **** } return (total); } static int ! spa_verify_host(spa_t *spa, nvlist_t *mos_config) { - uint64_t hostid; - char *hostname; - uint64_t myhostid = 0; - - if (!spa_is_root(spa) && nvlist_lookup_uint64(mos_config, - ZPOOL_CONFIG_HOSTID, &hostid) == 0) { - hostname = fnvlist_lookup_string(mos_config, - ZPOOL_CONFIG_HOSTNAME); - - myhostid = zone_get_hostid(NULL); - - if (hostid != 0 && myhostid != 0 && hostid != myhostid) { - cmn_err(CE_WARN, "pool '%s' could not be " - "loaded as it was last accessed by " - "another system (host: %s hostid: 0x%llx). " - "See: http://illumos.org/msg/ZFS-8000-EY", - spa_name(spa), hostname, (u_longlong_t)hostid); - spa_load_failed(spa, "hostid verification failed: pool " - "last accessed by host: %s (hostid: 0x%llx)", - hostname, (u_longlong_t)hostid); - return (SET_ERROR(EBADF)); - } - } - - return (0); - } - - static int - spa_ld_parse_config(spa_t *spa, spa_import_type_t type) - { int error = 0; ! nvlist_t *nvtree, *nvl, *config = spa->spa_config; ! int parse; vdev_t *rvd; ! uint64_t pool_guid; ! char *comment; /* ! * Versioning wasn't explicitly added to the label until later, so if ! * it's not present treat it as the initial version. */ ! if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION, ! &spa->spa_ubsync.ub_version) != 0) ! spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL; ! if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid)) { ! spa_load_failed(spa, "invalid config provided: '%s' missing", ! ZPOOL_CONFIG_POOL_GUID); ! return (SET_ERROR(EINVAL)); ! } ! if ((spa->spa_load_state == SPA_LOAD_IMPORT || spa->spa_load_state == ! SPA_LOAD_TRYIMPORT) && spa_guid_exists(pool_guid, 0)) { ! spa_load_failed(spa, "a pool with guid %llu is already open", ! (u_longlong_t)pool_guid); ! return (SET_ERROR(EEXIST)); ! } ! spa->spa_config_guid = pool_guid; ! ! nvlist_free(spa->spa_load_info); ! spa->spa_load_info = fnvlist_alloc(); ! ! ASSERT(spa->spa_comment == NULL); ! if (nvlist_lookup_string(config, ZPOOL_CONFIG_COMMENT, &comment) == 0) ! spa->spa_comment = spa_strdup(comment); ! ! (void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, ! &spa->spa_config_txg); ! ! if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_SPLIT, &nvl) == 0) ! spa->spa_config_splitting = fnvlist_dup(nvl); ! ! if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvtree)) { ! spa_load_failed(spa, "invalid config provided: '%s' missing", ! ZPOOL_CONFIG_VDEV_TREE); return (SET_ERROR(EINVAL)); - } /* * Create "The Godfather" zio to hold all async IOs */ spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *), KM_SLEEP); --- 2390,2437 ---- } return (total); } + /* + * Load an existing storage pool, using the pool's builtin spa_config as a + * source of configuration information. + */ static int ! spa_load_impl(spa_t *spa, uint64_t pool_guid, nvlist_t *config, ! spa_load_state_t state, spa_import_type_t type, boolean_t mosconfig, ! char **ereport) { int error = 0; ! nvlist_t *nvroot = NULL; ! nvlist_t *label; vdev_t *rvd; ! uberblock_t *ub = &spa->spa_uberblock; ! uint64_t children, config_cache_txg = spa->spa_config_txg; ! int orig_mode = spa->spa_mode; ! int parse; ! uint64_t obj; ! boolean_t missing_feat_write = B_FALSE; ! spa_meta_placement_t *mp; /* ! * If this is an untrusted config, access the pool in read-only mode. ! * This prevents things like resilvering recently removed devices. */ ! if (!mosconfig) ! spa->spa_mode = FREAD; ! ASSERT(MUTEX_HELD(&spa_namespace_lock)); ! spa->spa_load_state = state; ! 
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot)) return (SET_ERROR(EINVAL)); + parse = (type == SPA_IMPORT_EXISTING ? + VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT); + /* * Create "The Godfather" zio to hold all async IOs */ spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *), KM_SLEEP);
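The spa_verify_host() side of this hunk refuses to load a pool whose label records a different host, so a shared device cannot silently be imported by two systems at once. Below is a minimal userland sketch of that comparison; pool_owner_t and the caller-supplied my_hostid are simplified stand-ins for the nvlist lookups and zone_get_hostid() used in the real code, so treat it as illustrative only.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the owner information kept in the pool config. */
typedef struct pool_owner {
	uint64_t	po_hostid;	/* hostid recorded at last access */
	const char	*po_hostname;	/* hostname recorded at last access */
} pool_owner_t;

/*
 * Return 0 if this host may load the pool, EBADF otherwise.  A hostid of
 * zero on either side disables the check, matching the diff above.
 */
int
verify_host(const pool_owner_t *po, uint64_t my_hostid, const char *pool)
{
	if (po->po_hostid != 0 && my_hostid != 0 &&
	    po->po_hostid != my_hostid) {
		(void) fprintf(stderr, "pool '%s' last accessed by %s "
		    "(hostid: 0x%llx)\n", pool, po->po_hostname,
		    (unsigned long long)po->po_hostid);
		return (EBADF);
	}
	return (0);
}

int
main(void)
{
	pool_owner_t po = { 0x1234, "otherhost" };

	/* Mismatched hostids: the load must be refused. */
	return (verify_host(&po, 0x5678, "tank") == EBADF ? 0 : 1);
}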
*** 2334,2465 **** * Parse the configuration into a vdev tree. We explicitly set the * value that will be returned by spa_version() since parsing the * configuration requires knowing the version number. */ spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! parse = (type == SPA_IMPORT_EXISTING ? ! VDEV_ALLOC_LOAD : VDEV_ALLOC_SPLIT); ! error = spa_config_parse(spa, &rvd, nvtree, NULL, 0, parse); spa_config_exit(spa, SCL_ALL, FTAG); ! if (error != 0) { ! spa_load_failed(spa, "unable to parse config [error=%d]", ! error); return (error); - } ASSERT(spa->spa_root_vdev == rvd); ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT); ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT); if (type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_guid(spa) == pool_guid); } - return (0); - } - - /* - * Recursively open all vdevs in the vdev tree. This function is called twice: - * first with the untrusted config, then with the trusted config. - */ - static int - spa_ld_open_vdevs(spa_t *spa) - { - int error = 0; - /* ! * spa_missing_tvds_allowed defines how many top-level vdevs can be ! * missing/unopenable for the root vdev to be still considered openable. */ - if (spa->spa_trust_config) { - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds; - } else if (spa->spa_config_source == SPA_CONFIG_SRC_CACHEFILE) { - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_cachefile; - } else if (spa->spa_config_source == SPA_CONFIG_SRC_SCAN) { - spa->spa_missing_tvds_allowed = zfs_max_missing_tvds_scan; - } else { - spa->spa_missing_tvds_allowed = 0; - } - - spa->spa_missing_tvds_allowed = - MAX(zfs_max_missing_tvds, spa->spa_missing_tvds_allowed); - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! error = vdev_open(spa->spa_root_vdev); spa_config_exit(spa, SCL_ALL, FTAG); ! ! if (spa->spa_missing_tvds != 0) { ! spa_load_note(spa, "vdev tree has %lld missing top-level " ! "vdevs.", (u_longlong_t)spa->spa_missing_tvds); ! if (spa->spa_trust_config && (spa->spa_mode & FWRITE)) { ! /* ! * Although theoretically we could allow users to open ! * incomplete pools in RW mode, we'd need to add a lot ! * of extra logic (e.g. adjust pool space to account ! * for missing vdevs). ! * This limitation also prevents users from accidentally ! * opening the pool in RW mode during data recovery and ! * damaging it further. ! */ ! spa_load_note(spa, "pools with missing top-level " ! "vdevs can only be opened in read-only mode."); ! error = SET_ERROR(ENXIO); ! } else { ! spa_load_note(spa, "current settings allow for maximum " ! "%lld missing top-level vdevs at this stage.", ! (u_longlong_t)spa->spa_missing_tvds_allowed); ! } ! } ! if (error != 0) { ! spa_load_failed(spa, "unable to open vdev tree [error=%d]", ! error); ! } ! if (spa->spa_missing_tvds != 0 || error != 0) ! vdev_dbgmsg_print_tree(spa->spa_root_vdev, 2); ! return (error); - } ! /* * We need to validate the vdev labels against the configuration that ! * we have in hand. This function is called twice: first with an untrusted ! * config, then with a trusted config. The validation is more strict when the ! * config is trusted. */ ! static int ! spa_ld_validate_vdevs(spa_t *spa) ! { ! int error = 0; ! vdev_t *rvd = spa->spa_root_vdev; ! spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! error = vdev_validate(rvd); spa_config_exit(spa, SCL_ALL, FTAG); ! if (error != 0) { ! spa_load_failed(spa, "vdev_validate failed [error=%d]", error); return (error); - } ! if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) { ! spa_load_failed(spa, "cannot open vdev tree after invalidating " ! 
"some vdevs"); ! vdev_dbgmsg_print_tree(rvd, 2); return (SET_ERROR(ENXIO)); } - return (0); - } - - static int - spa_ld_select_uberblock(spa_t *spa, spa_import_type_t type) - { - vdev_t *rvd = spa->spa_root_vdev; - nvlist_t *label; - uberblock_t *ub = &spa->spa_uberblock; - /* * Find the best uberblock. */ vdev_uberblock_load(rvd, ub, &label); --- 2445,2502 ---- * Parse the configuration into a vdev tree. We explicitly set the * value that will be returned by spa_version() since parsing the * configuration requires knowing the version number. */ spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! error = spa_config_parse(spa, &rvd, nvroot, NULL, 0, parse); spa_config_exit(spa, SCL_ALL, FTAG); ! if (error != 0) return (error); ASSERT(spa->spa_root_vdev == rvd); ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT); ASSERT3U(spa->spa_max_ashift, <=, SPA_MAXBLOCKSHIFT); if (type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_guid(spa) == pool_guid); } /* ! * Try to open all vdevs, loading each label in the process. */ spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! error = vdev_open(rvd); spa_config_exit(spa, SCL_ALL, FTAG); ! if (error != 0) return (error); ! /* * We need to validate the vdev labels against the configuration that ! * we have in hand, which is dependent on the setting of mosconfig. If ! * mosconfig is true then we're validating the vdev labels based on ! * that config. Otherwise, we're validating against the cached config ! * (zpool.cache) that was read when we loaded the zfs module, and then ! * later we will recursively call spa_load() and validate against ! * the vdev config. ! * ! * If we're assembling a new pool that's been split off from an ! * existing pool, the labels haven't yet been updated so we skip ! * validation for now. */ ! if (type != SPA_IMPORT_ASSEMBLE) { spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! error = vdev_validate(rvd, mosconfig); spa_config_exit(spa, SCL_ALL, FTAG); ! if (error != 0) return (error); ! if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) return (SET_ERROR(ENXIO)); } /* * Find the best uberblock. */ vdev_uberblock_load(rvd, ub, &label);
*** 2466,2489 **** /* * If we weren't able to find a single valid uberblock, return failure. */ if (ub->ub_txg == 0) { nvlist_free(label); - spa_load_failed(spa, "no valid uberblock found"); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO)); } - spa_load_note(spa, "using uberblock with txg=%llu", - (u_longlong_t)ub->ub_txg); - /* * If the pool has an unsupported version we can't open it. */ if (!SPA_VERSION_IS_SUPPORTED(ub->ub_version)) { nvlist_free(label); - spa_load_failed(spa, "version %llu is not supported", - (u_longlong_t)ub->ub_version); return (spa_vdev_err(rvd, VDEV_AUX_VERSION_NEWER, ENOTSUP)); } if (ub->ub_version >= SPA_VERSION_FEATURES) { nvlist_t *features; --- 2503,2520 ----
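vdev_uberblock_load() walks every label and keeps the "best" uberblock; a txg of zero afterwards means no valid uberblock was found, which is the failure case handled in this hunk. A compact model of the selection rule follows, assuming the usual ZFS ordering of highest txg first and highest timestamp as the tiebreak; ub_t is a reduced stand-in for uberblock_t.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Reduced stand-in for uberblock_t: just the fields the selection uses. */
typedef struct ub {
	uint64_t	ub_txg;		/* transaction group of this uberblock */
	uint64_t	ub_timestamp;	/* time it was written */
} ub_t;

/* Return >0 if a is better than b, <0 if worse, 0 if equivalent. */
static int
ub_compare(const ub_t *a, const ub_t *b)
{
	if (a->ub_txg != b->ub_txg)
		return (a->ub_txg > b->ub_txg ? 1 : -1);
	if (a->ub_timestamp != b->ub_timestamp)
		return (a->ub_timestamp > b->ub_timestamp ? 1 : -1);
	return (0);
}

/* Scan all candidate uberblocks (one per label copy) and keep the best. */
const ub_t *
select_best(const ub_t *cand, size_t n)
{
	const ub_t *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (cand[i].ub_txg == 0)
			continue;	/* empty or invalid slot */
		if (best == NULL || ub_compare(&cand[i], best) > 0)
			best = &cand[i];
	}
	return (best);	/* NULL means "no valid uberblock found" */
}

int
main(void)
{
	ub_t c[] = { { 10, 100 }, { 12, 90 }, { 12, 95 }, { 0, 0 } };
	const ub_t *b = select_best(c, sizeof (c) / sizeof (c[0]));

	(void) printf("best txg=%llu\n", (unsigned long long)b->ub_txg);
	return (0);
}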
*** 2490,2510 **** /* * If we weren't able to find what's necessary for reading the * MOS in the label, return failure. */ ! if (label == NULL) { ! spa_load_failed(spa, "label config unavailable"); ! return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ! ENXIO)); ! } ! ! if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_FEATURES_FOR_READ, ! &features) != 0) { nvlist_free(label); - spa_load_failed(spa, "invalid label: '%s' missing", - ZPOOL_CONFIG_FEATURES_FOR_READ); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO)); } /* --- 2521,2533 ---- /* * If we weren't able to find what's necessary for reading the * MOS in the label, return failure. */ ! if (label == NULL || nvlist_lookup_nvlist(label, ! ZPOOL_CONFIG_FEATURES_FOR_READ, &features) != 0) { nvlist_free(label); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, ENXIO)); } /*
*** 2539,2559 **** if (!nvlist_empty(unsup_feat)) { VERIFY(nvlist_add_nvlist(spa->spa_load_info, ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0); nvlist_free(unsup_feat); - spa_load_failed(spa, "some features are unsupported"); return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP)); } nvlist_free(unsup_feat); } if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) { spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! spa_try_repair(spa, spa->spa_config); spa_config_exit(spa, SCL_ALL, FTAG); nvlist_free(spa->spa_config_splitting); spa->spa_config_splitting = NULL; } --- 2562,2593 ---- if (!nvlist_empty(unsup_feat)) { VERIFY(nvlist_add_nvlist(spa->spa_load_info, ZPOOL_CONFIG_UNSUP_FEAT, unsup_feat) == 0); nvlist_free(unsup_feat); return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP)); } nvlist_free(unsup_feat); } + /* + * If the vdev guid sum doesn't match the uberblock, we have an + * incomplete configuration. We first check to see if the pool + * is aware of the complete config (i.e ZPOOL_CONFIG_VDEV_CHILDREN). + * If it is, defer the vdev_guid_sum check till later so we + * can handle missing vdevs. + */ + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN, + &children) != 0 && mosconfig && type != SPA_IMPORT_ASSEMBLE && + rvd->vdev_guid_sum != ub->ub_guid_sum) + return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO)); + if (type != SPA_IMPORT_ASSEMBLE && spa->spa_config_splitting) { spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); ! spa_try_repair(spa, config); spa_config_exit(spa, SCL_ALL, FTAG); nvlist_free(spa->spa_config_splitting); spa->spa_config_splitting = NULL; }
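The added block defers the vdev guid-sum check when the config carries ZPOOL_CONFIG_VDEV_CHILDREN, so pools that know their complete shape can cope with missing vdevs later instead of failing here. The decision reduces to the small predicate below; field names are simplified and the nvlist lookup is modeled as a boolean.

#include <stdint.h>
#include <stdio.h>

/*
 * Decide whether a guid-sum mismatch must fail the load right away.
 * If the config records its expected child count (have_children), the
 * check is deferred so missing vdevs can be handled later; otherwise a
 * mismatch between the vdev guid sum and the uberblock guid sum means
 * the supplied config is incomplete.
 */
int
guid_sum_fails_now(int have_children, int trusted_config, int assembling,
    uint64_t vdev_guid_sum, uint64_t ub_guid_sum)
{
	if (have_children)
		return (0);	/* defer: handled after the vdevs are open */
	if (!trusted_config || assembling)
		return (0);	/* only enforced on a trusted, non-split load */
	return (vdev_guid_sum != ub_guid_sum);
}

int
main(void)
{
	/* Mismatch with no child count recorded: fail immediately. */
	(void) printf("%d\n", guid_sum_fails_now(0, 1, 0, 100, 90));
	/* Same mismatch, but the pool knows its child count: defer. */
	(void) printf("%d\n", guid_sum_fails_now(1, 1, 0, 100, 90));
	return (0);
}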
*** 2567,2813 **** spa->spa_first_txg = spa->spa_last_ubsync_txg ? spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1; spa->spa_claim_max_txg = spa->spa_first_txg; spa->spa_prev_software_version = ub->ub_software_version; - return (0); - } - - static int - spa_ld_open_rootbp(spa_t *spa) - { - int error = 0; - vdev_t *rvd = spa->spa_root_vdev; - error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool); ! if (error != 0) { ! spa_load_failed(spa, "unable to open rootbp in dsl_pool_init " ! "[error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset; ! return (0); ! } ! ! static int ! spa_ld_load_trusted_config(spa_t *spa, spa_import_type_t type, ! boolean_t reloading) ! { ! vdev_t *mrvd, *rvd = spa->spa_root_vdev; ! nvlist_t *nv, *mos_config, *policy; ! int error = 0, copy_error; ! uint64_t healthy_tvds, healthy_tvds_mos; ! uint64_t mos_config_txg; ! ! if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object, B_TRUE) ! != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - /* - * If we're assembling a pool from a split, the config provided is - * already trusted so there is nothing to do. - */ - if (type == SPA_IMPORT_ASSEMBLE) - return (0); - - healthy_tvds = spa_healthy_core_tvds(spa); - - if (load_nvlist(spa, spa->spa_config_object, &mos_config) - != 0) { - spa_load_failed(spa, "unable to retrieve MOS config"); - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } - - /* - * If we are doing an open, pool owner wasn't verified yet, thus do - * the verification here. - */ - if (spa->spa_load_state == SPA_LOAD_OPEN) { - error = spa_verify_host(spa, mos_config); - if (error != 0) { - nvlist_free(mos_config); - return (error); - } - } - - nv = fnvlist_lookup_nvlist(mos_config, ZPOOL_CONFIG_VDEV_TREE); - - spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); - - /* - * Build a new vdev tree from the trusted config - */ - VERIFY(spa_config_parse(spa, &mrvd, nv, NULL, 0, VDEV_ALLOC_LOAD) == 0); - - /* - * Vdev paths in the MOS may be obsolete. If the untrusted config was - * obtained by scanning /dev/dsk, then it will have the right vdev - * paths. We update the trusted MOS config with this information. - * We first try to copy the paths with vdev_copy_path_strict, which - * succeeds only when both configs have exactly the same vdev tree. - * If that fails, we fall back to a more flexible method that has a - * best effort policy. - */ - copy_error = vdev_copy_path_strict(rvd, mrvd); - if (copy_error != 0 || spa_load_print_vdev_tree) { - spa_load_note(spa, "provided vdev tree:"); - vdev_dbgmsg_print_tree(rvd, 2); - spa_load_note(spa, "MOS vdev tree:"); - vdev_dbgmsg_print_tree(mrvd, 2); - } - if (copy_error != 0) { - spa_load_note(spa, "vdev_copy_path_strict failed, falling " - "back to vdev_copy_path_relaxed"); - vdev_copy_path_relaxed(rvd, mrvd); - } - - vdev_close(rvd); - vdev_free(rvd); - spa->spa_root_vdev = mrvd; - rvd = mrvd; - spa_config_exit(spa, SCL_ALL, FTAG); - - /* - * We will use spa_config if we decide to reload the spa or if spa_load - * fails and we rewind. We must thus regenerate the config using the - * MOS information with the updated paths. Rewind policy is an import - * setting and is not in the MOS. We copy it over to our new, trusted - * config. 
- */ - mos_config_txg = fnvlist_lookup_uint64(mos_config, - ZPOOL_CONFIG_POOL_TXG); - nvlist_free(mos_config); - mos_config = spa_config_generate(spa, NULL, mos_config_txg, B_FALSE); - if (nvlist_lookup_nvlist(spa->spa_config, ZPOOL_REWIND_POLICY, - &policy) == 0) - fnvlist_add_nvlist(mos_config, ZPOOL_REWIND_POLICY, policy); - spa_config_set(spa, mos_config); - spa->spa_config_source = SPA_CONFIG_SRC_MOS; - - /* - * Now that we got the config from the MOS, we should be more strict - * in checking blkptrs and can make assumptions about the consistency - * of the vdev tree. spa_trust_config must be set to true before opening - * vdevs in order for them to be writeable. - */ - spa->spa_trust_config = B_TRUE; - - /* - * Open and validate the new vdev tree - */ - error = spa_ld_open_vdevs(spa); - if (error != 0) - return (error); - - error = spa_ld_validate_vdevs(spa); - if (error != 0) - return (error); - - if (copy_error != 0 || spa_load_print_vdev_tree) { - spa_load_note(spa, "final vdev tree:"); - vdev_dbgmsg_print_tree(rvd, 2); - } - - if (spa->spa_load_state != SPA_LOAD_TRYIMPORT && - !spa->spa_extreme_rewind && zfs_max_missing_tvds == 0) { - /* - * Sanity check to make sure that we are indeed loading the - * latest uberblock. If we missed SPA_SYNC_MIN_VDEVS tvds - * in the config provided and they happened to be the only ones - * to have the latest uberblock, we could involuntarily perform - * an extreme rewind. - */ - healthy_tvds_mos = spa_healthy_core_tvds(spa); - if (healthy_tvds_mos - healthy_tvds >= - SPA_SYNC_MIN_VDEVS) { - spa_load_note(spa, "config provided misses too many " - "top-level vdevs compared to MOS (%lld vs %lld). ", - (u_longlong_t)healthy_tvds, - (u_longlong_t)healthy_tvds_mos); - spa_load_note(spa, "vdev tree:"); - vdev_dbgmsg_print_tree(rvd, 2); - if (reloading) { - spa_load_failed(spa, "config was already " - "provided from MOS. Aborting."); - return (spa_vdev_err(rvd, - VDEV_AUX_CORRUPT_DATA, EIO)); - } - spa_load_note(spa, "spa must be reloaded using MOS " - "config"); - return (SET_ERROR(EAGAIN)); - } - } - - error = spa_check_for_missing_logs(spa); - if (error != 0) - return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, ENXIO)); - - if (rvd->vdev_guid_sum != spa->spa_uberblock.ub_guid_sum) { - spa_load_failed(spa, "uberblock guid sum doesn't match MOS " - "guid sum (%llu != %llu)", - (u_longlong_t)spa->spa_uberblock.ub_guid_sum, - (u_longlong_t)rvd->vdev_guid_sum); - return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, - ENXIO)); - } - - return (0); - } - - static int - spa_ld_open_indirect_vdev_metadata(spa_t *spa) - { - int error = 0; - vdev_t *rvd = spa->spa_root_vdev; - - /* - * Everything that we read before spa_remove_init() must be stored - * on concreted vdevs. Therefore we do this as early as possible. - */ - error = spa_remove_init(spa); - if (error != 0) { - spa_load_failed(spa, "spa_remove_init failed [error=%d]", - error); - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } - - /* - * Retrieve information needed to condense indirect vdev mappings. 
- */ - error = spa_condense_init(spa); - if (error != 0) { - spa_load_failed(spa, "spa_condense_init failed [error=%d]", - error); - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error)); - } - - return (0); - } - - static int - spa_ld_check_features(spa_t *spa, boolean_t *missing_feat_writep) - { - int error = 0; - vdev_t *rvd = spa->spa_root_vdev; - if (spa_version(spa) >= SPA_VERSION_FEATURES) { boolean_t missing_feat_read = B_FALSE; nvlist_t *unsup_feat, *enabled_feat; if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ, ! &spa->spa_feat_for_read_obj, B_TRUE) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE, ! &spa->spa_feat_for_write_obj, B_TRUE) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS, ! &spa->spa_feat_desc_obj, B_TRUE) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } enabled_feat = fnvlist_alloc(); unsup_feat = fnvlist_alloc(); --- 2601,2634 ---- spa->spa_first_txg = spa->spa_last_ubsync_txg ? spa->spa_last_ubsync_txg : spa_last_synced_txg(spa) + 1; spa->spa_claim_max_txg = spa->spa_first_txg; spa->spa_prev_software_version = ub->ub_software_version; error = dsl_pool_init(spa, spa->spa_first_txg, &spa->spa_dsl_pool); ! if (error) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); spa->spa_meta_objset = spa->spa_dsl_pool->dp_meta_objset; ! if (spa_dir_prop(spa, DMU_POOL_CONFIG, &spa->spa_config_object) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); if (spa_version(spa) >= SPA_VERSION_FEATURES) { boolean_t missing_feat_read = B_FALSE; nvlist_t *unsup_feat, *enabled_feat; if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_READ, ! &spa->spa_feat_for_read_obj) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } if (spa_dir_prop(spa, DMU_POOL_FEATURES_FOR_WRITE, ! &spa->spa_feat_for_write_obj) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } if (spa_dir_prop(spa, DMU_POOL_FEATURE_DESCRIPTIONS, ! &spa->spa_feat_desc_obj) != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } enabled_feat = fnvlist_alloc(); unsup_feat = fnvlist_alloc();
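The removed spa_ld_load_trusted_config() explains how vdev paths from a scanned config are copied onto the MOS vdev tree: a strict copy is attempted first and a relaxed, best-effort copy is the fallback. The sketch below captures only that try-strict-then-relax control flow; copy_path_strict() and copy_path_relaxed() are hypothetical no-op helpers standing in for vdev_copy_path_strict() and vdev_copy_path_relaxed().

#include <stdio.h>

typedef struct vtree vtree_t;	/* opaque stand-in for a vdev tree */

/*
 * Hypothetical helpers: the strict copy succeeds only when both trees
 * have exactly the same shape; the relaxed copy transfers what it can.
 */
static int
copy_path_strict(const vtree_t *src, vtree_t *dst)
{
	(void) src; (void) dst;
	return (-1);		/* pretend the trees differ */
}

static void
copy_path_relaxed(const vtree_t *src, vtree_t *dst)
{
	(void) src; (void) dst;	/* best effort, never fails */
}

/*
 * Update the trusted (MOS) tree with the device paths observed by the
 * untrusted config, falling back to a best-effort copy when the trees
 * do not line up exactly.
 */
void
update_paths(const vtree_t *untrusted, vtree_t *trusted)
{
	if (copy_path_strict(untrusted, trusted) != 0) {
		(void) fprintf(stderr,
		    "strict path copy failed, falling back to relaxed\n");
		copy_path_relaxed(untrusted, trusted);
	}
}

int
main(void)
{
	update_paths(NULL, NULL);	/* helpers here are no-ops */
	return (0);
}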
*** 2814,2828 **** if (!spa_features_check(spa, B_FALSE, unsup_feat, enabled_feat)) missing_feat_read = B_TRUE; ! if (spa_writeable(spa) || ! spa->spa_load_state == SPA_LOAD_TRYIMPORT) { if (!spa_features_check(spa, B_TRUE, unsup_feat, enabled_feat)) { ! *missing_feat_writep = B_TRUE; } } fnvlist_add_nvlist(spa->spa_load_info, ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat); --- 2635,2648 ---- if (!spa_features_check(spa, B_FALSE, unsup_feat, enabled_feat)) missing_feat_read = B_TRUE; ! if (spa_writeable(spa) || state == SPA_LOAD_TRYIMPORT) { if (!spa_features_check(spa, B_TRUE, unsup_feat, enabled_feat)) { ! missing_feat_write = B_TRUE; } } fnvlist_add_nvlist(spa->spa_load_info, ZPOOL_CONFIG_ENABLED_FEAT, enabled_feat);
*** 2857,2869 **** * missing a feature for write, we must first determine whether * the pool can be opened read-only before returning to * userland in order to know whether to display the * abovementioned note. */ ! if (missing_feat_read || (*missing_feat_writep && spa_writeable(spa))) { - spa_load_failed(spa, "pool uses unsupported features"); return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP)); } /* --- 2677,2688 ---- * missing a feature for write, we must first determine whether * the pool can be opened read-only before returning to * userland in order to know whether to display the * abovementioned note. */ ! if (missing_feat_read || (missing_feat_write && spa_writeable(spa))) { return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP)); } /*
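This hunk decides whether the pool may be opened at all: unreadable features always fail, while unwritable features only fail a writable open, which is why a try-import is allowed to continue and report what is missing. A tiny decision helper capturing that rule under simplified names is shown below.

#include <stdio.h>

/*
 * Returns 1 if the open must be rejected with ENOTSUP.  Missing read
 * features always reject; missing write features reject only when the
 * pool is being opened writable, so a read-only open can still succeed.
 */
int
open_rejected(int missing_feat_read, int missing_feat_write, int writeable)
{
	if (missing_feat_read)
		return (1);
	if (missing_feat_write && writeable)
		return (1);
	return (0);
}

int
main(void)
{
	/* Write-only feature gap: read-only open still succeeds. */
	(void) printf("rw open rejected: %d\n", open_rejected(0, 1, 1));
	(void) printf("ro open rejected: %d\n", open_rejected(0, 1, 0));
	return (0);
}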
*** 2879,2929 **** spa->spa_feat_refcount_cache[i] = refcount; } else if (error == ENOTSUP) { spa->spa_feat_refcount_cache[i] = SPA_FEATURE_DISABLED; } else { - spa_load_failed(spa, "error getting refcount " - "for feature %s [error=%d]", - spa_feature_table[i].fi_guid, error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } } } if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) { if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG, ! &spa->spa_feat_enabled_txg_obj, B_TRUE) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } - return (0); - } - - static int - spa_ld_load_special_directories(spa_t *spa) - { - int error = 0; - vdev_t *rvd = spa->spa_root_vdev; - spa->spa_is_initializing = B_TRUE; error = dsl_pool_open(spa->spa_dsl_pool); spa->spa_is_initializing = B_FALSE; ! if (error != 0) { ! spa_load_failed(spa, "dsl_pool_open failed [error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } ! return (0); ! } ! static int ! spa_ld_get_props(spa_t *spa) ! { ! int error = 0; ! uint64_t obj; ! vdev_t *rvd = spa->spa_root_vdev; /* Grab the secret checksum salt from the MOS. */ error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1, sizeof (spa->spa_cksum_salt.zcs_bytes), --- 2698,2773 ---- spa->spa_feat_refcount_cache[i] = refcount; } else if (error == ENOTSUP) { spa->spa_feat_refcount_cache[i] = SPA_FEATURE_DISABLED; } else { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } } } if (spa_feature_is_active(spa, SPA_FEATURE_ENABLED_TXG)) { if (spa_dir_prop(spa, DMU_POOL_FEATURE_ENABLED_TXG, ! &spa->spa_feat_enabled_txg_obj) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } spa->spa_is_initializing = B_TRUE; error = dsl_pool_open(spa->spa_dsl_pool); spa->spa_is_initializing = B_FALSE; ! if (error != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); + + if (!mosconfig) { + uint64_t hostid; + nvlist_t *policy = NULL, *nvconfig; + + if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0) + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); + + if (!spa_is_root(spa) && nvlist_lookup_uint64(nvconfig, + ZPOOL_CONFIG_HOSTID, &hostid) == 0) { + char *hostname; + unsigned long myhostid = 0; + + VERIFY(nvlist_lookup_string(nvconfig, + ZPOOL_CONFIG_HOSTNAME, &hostname) == 0); + + #ifdef _KERNEL + myhostid = zone_get_hostid(NULL); + #else /* _KERNEL */ + /* + * We're emulating the system's hostid in userland, so + * we can't use zone_get_hostid(). + */ + (void) ddi_strtoul(hw_serial, NULL, 10, &myhostid); + #endif /* _KERNEL */ + if (hostid != 0 && myhostid != 0 && + hostid != myhostid) { + nvlist_free(nvconfig); + cmn_err(CE_WARN, "pool '%s' could not be " + "loaded as it was last accessed by " + "another system (host: %s hostid: 0x%lx). " + "See: http://illumos.org/msg/ZFS-8000-EY", + spa_name(spa), hostname, + (unsigned long)hostid); + return (SET_ERROR(EBADF)); } + } + if (nvlist_lookup_nvlist(spa->spa_config, + ZPOOL_REWIND_POLICY, &policy) == 0) + VERIFY(nvlist_add_nvlist(nvconfig, + ZPOOL_REWIND_POLICY, policy) == 0); ! spa_config_set(spa, nvconfig); ! spa_unload(spa); ! spa_deactivate(spa); ! spa_activate(spa, orig_mode); ! return (spa_load(spa, state, SPA_IMPORT_EXISTING, B_TRUE)); ! } /* Grab the secret checksum salt from the MOS. */ error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_CHECKSUM_SALT, 1, sizeof (spa->spa_cksum_salt.zcs_bytes),
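When the pool was loaded from an untrusted (cached) config, the added code fetches the config stored in the MOS, installs it, tears the in-core state down, and restarts the whole load with mosconfig set. The sketch below shows that reload-once pattern in isolation; load_pool(), fetch_mos_config() and install_config() are hypothetical names used only to illustrate the shape of the recursion.

#include <stdio.h>

typedef struct pool {
	int	p_has_mos_config;	/* config already came from the MOS? */
	int	p_reloads;		/* how many times the load restarted */
} pool_t;

static int fetch_mos_config(pool_t *p) { (void) p; return (0); }
static void install_config(pool_t *p) { p->p_has_mos_config = 1; }

/*
 * Load the pool.  If the caller handed us an untrusted config, read the
 * MOS copy, install it, and run the load again -- this time trusting
 * the config.  The unload/deactivate/activate dance is elided here.
 */
int
load_pool(pool_t *p, int trusted)
{
	if (!trusted) {
		if (fetch_mos_config(p) != 0)
			return (-1);
		install_config(p);
		p->p_reloads++;
		return (load_pool(p, 1));
	}
	return (0);	/* trusted load continues past this point */
}

int
main(void)
{
	pool_t p = { 0, 0 };

	(void) load_pool(&p, 0);
	(void) printf("reloads: %d\n", p.p_reloads);	/* prints 1 */
	return (0);
}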
*** 2931,2987 **** if (error == ENOENT) { /* Generate a new salt for subsequent use */ (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes, sizeof (spa->spa_cksum_salt.zcs_bytes)); } else if (error != 0) { - spa_load_failed(spa, "unable to retrieve checksum salt from " - "MOS [error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } ! if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj, B_TRUE) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj); ! if (error != 0) { ! spa_load_failed(spa, "error opening deferred-frees bpobj " ! "[error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } /* * Load the bit that tells us to use the new accounting function * (raid-z deflation). If we have an older pool, this will not * be present. */ ! error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate, B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION, ! &spa->spa_creation_version, B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the persistent error log. If we have an older pool, this will * not be present. */ ! error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last, ! B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB, ! &spa->spa_errlog_scrub, B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the history object. If we have an older pool, this * will not be present. */ ! error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history, B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the per-vdev ZAP map. If we have an older pool, this will not --- 2775,2825 ---- if (error == ENOENT) { /* Generate a new salt for subsequent use */ (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes, sizeof (spa->spa_cksum_salt.zcs_bytes)); } else if (error != 0) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } ! if (spa_dir_prop(spa, DMU_POOL_SYNC_BPOBJ, &obj) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = bpobj_open(&spa->spa_deferred_bpobj, spa->spa_meta_objset, obj); ! if (error != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the bit that tells us to use the new accounting function * (raid-z deflation). If we have an older pool, this will not * be present. */ ! error = spa_dir_prop(spa, DMU_POOL_DEFLATE, &spa->spa_deflate); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = spa_dir_prop(spa, DMU_POOL_CREATION_VERSION, ! &spa->spa_creation_version); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the persistent error log. If we have an older pool, this will * not be present. */ ! error = spa_dir_prop(spa, DMU_POOL_ERRLOG_LAST, &spa->spa_errlog_last); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = spa_dir_prop(spa, DMU_POOL_ERRLOG_SCRUB, ! &spa->spa_errlog_scrub); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the history object. If we have an older pool, this * will not be present. */ ! 
error = spa_dir_prop(spa, DMU_POOL_HISTORY, &spa->spa_history); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); /* * Load the per-vdev ZAP map. If we have an older pool, this will not
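Several of the objects loaded here (deflate, error logs, history) are simply absent on pools created before the corresponding version existed, so ENOENT is tolerated while any other error is treated as corruption. A generic wrapper for that "optional property" pattern might look like the sketch below; dir_lookup() is a hypothetical stand-in for the ZAP lookup behind spa_dir_prop().

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the MOS directory lookup. */
static int
dir_lookup(const char *name, uint64_t *valp)
{
	(void) name;
	(void) valp;
	return (ENOENT);	/* pretend the pool predates this object */
}

/*
 * Load an object number that may legitimately be missing on older pools:
 * ENOENT leaves *objp untouched and is not an error; anything else is
 * passed back so the caller can fail the load.
 */
static int
load_optional(const char *name, uint64_t *objp)
{
	int err = dir_lookup(name, objp);

	if (err != 0 && err != ENOENT)
		return (err);
	return (0);
}

int
main(void)
{
	uint64_t history_obj = 0;

	if (load_optional("history", &history_obj) != 0)
		return (1);
	(void) printf("history object: %llu\n",
	    (unsigned long long)history_obj);
	return (0);
}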
*** 2990,3006 **** * spa_sync_config_object. */ /* The sentinel is only available in the MOS config. */ nvlist_t *mos_config; ! if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) { ! spa_load_failed(spa, "unable to retrieve MOS config"); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP, ! &spa->spa_all_vdev_zaps, B_FALSE); if (error == ENOENT) { VERIFY(!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)); spa->spa_avz_action = AVZ_ACTION_INITIALIZE; --- 2828,2842 ---- * spa_sync_config_object. */ /* The sentinel is only available in the MOS config. */ nvlist_t *mos_config; ! if (load_nvlist(spa, spa->spa_config_object, &mos_config) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); error = spa_dir_prop(spa, DMU_POOL_VDEV_ZAP_MAP, ! &spa->spa_all_vdev_zaps); if (error == ENOENT) { VERIFY(!nvlist_exists(mos_config, ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS)); spa->spa_avz_action = AVZ_ACTION_INITIALIZE;
*** 3020,3092 **** */ ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev)); } nvlist_free(mos_config); - spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION); - - error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object, - B_FALSE); - if (error && error != ENOENT) - return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - - if (error == 0) { - uint64_t autoreplace; - - spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs); - spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace); - spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation); - spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode); - spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand); - spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO, - &spa->spa_dedup_ditto); - - spa->spa_autoreplace = (autoreplace != 0); - } - /* - * If we are importing a pool with missing top-level vdevs, - * we enforce that the pool doesn't panic or get suspended on - * error since the likelihood of missing data is extremely high. - */ - if (spa->spa_missing_tvds > 0 && - spa->spa_failmode != ZIO_FAILURE_MODE_CONTINUE && - spa->spa_load_state != SPA_LOAD_TRYIMPORT) { - spa_load_note(spa, "forcing failmode to 'continue' " - "as some top level vdevs are missing"); - spa->spa_failmode = ZIO_FAILURE_MODE_CONTINUE; - } - - return (0); - } - - static int - spa_ld_open_aux_vdevs(spa_t *spa, spa_import_type_t type) - { - int error = 0; - vdev_t *rvd = spa->spa_root_vdev; - - /* * If we're assembling the pool from the split-off vdevs of * an existing pool, we don't want to attach the spares & cache * devices. */ /* * Load any hot spares for this pool. */ ! error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object, ! B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); if (error == 0 && type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_version(spa) >= SPA_VERSION_SPARES); if (load_nvlist(spa, spa->spa_spares.sav_object, ! &spa->spa_spares.sav_config) != 0) { ! spa_load_failed(spa, "error loading spares nvlist"); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); spa_load_spares(spa); spa_config_exit(spa, SCL_ALL, FTAG); } else if (error == 0) { --- 2856,2882 ---- */ ASSERT0(vdev_count_verify_zaps(spa->spa_root_vdev)); } nvlist_free(mos_config); /* * If we're assembling the pool from the split-off vdevs of * an existing pool, we don't want to attach the spares & cache * devices. */ /* * Load any hot spares for this pool. */ ! error = spa_dir_prop(spa, DMU_POOL_SPARES, &spa->spa_spares.sav_object); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); if (error == 0 && type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_version(spa) >= SPA_VERSION_SPARES); if (load_nvlist(spa, spa->spa_spares.sav_object, ! &spa->spa_spares.sav_config) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); spa_load_spares(spa); spa_config_exit(spa, SCL_ALL, FTAG); } else if (error == 0) {
*** 3095,3237 **** /* * Load any level 2 ARC devices for this pool. */ error = spa_dir_prop(spa, DMU_POOL_L2CACHE, ! &spa->spa_l2cache.sav_object, B_FALSE); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); if (error == 0 && type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE); if (load_nvlist(spa, spa->spa_l2cache.sav_object, ! &spa->spa_l2cache.sav_config) != 0) { ! spa_load_failed(spa, "error loading l2cache nvlist"); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); - } spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); spa_load_l2cache(spa); spa_config_exit(spa, SCL_ALL, FTAG); } else if (error == 0) { spa->spa_l2cache.sav_sync = B_TRUE; } ! return (0); ! } ! static int ! spa_ld_load_vdev_metadata(spa_t *spa) ! { ! int error = 0; ! vdev_t *rvd = spa->spa_root_vdev; /* * If the 'autoreplace' property is set, then post a resource notifying * the ZFS DE that it should not issue any faults for unopenable * devices. We also iterate over the vdevs, and post a sysevent for any * unopenable vdevs so that the normal autoreplace handler can take * over. */ ! if (spa->spa_autoreplace && spa->spa_load_state != SPA_LOAD_TRYIMPORT) { spa_check_removed(spa->spa_root_vdev); /* * For the import case, this is done in spa_import(), because * at this point we're using the spare definitions from * the MOS config, not necessarily from the userland config. */ ! if (spa->spa_load_state != SPA_LOAD_IMPORT) { spa_aux_check_removed(&spa->spa_spares); spa_aux_check_removed(&spa->spa_l2cache); } } /* ! * Load the vdev metadata such as metaslabs, DTLs, spacemap object, etc. */ ! error = vdev_load(rvd); ! if (error != 0) { ! spa_load_failed(spa, "vdev_load failed [error=%d]", error); ! return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error)); ! } /* ! * Propagate the leaf DTLs we just loaded all the way up the vdev tree. */ spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); vdev_dtl_reassess(rvd, 0, 0, B_FALSE); spa_config_exit(spa, SCL_ALL, FTAG); ! return (0); ! } ! static int ! spa_ld_load_dedup_tables(spa_t *spa) ! { ! int error = 0; ! vdev_t *rvd = spa->spa_root_vdev; ! error = ddt_load(spa); ! if (error != 0) { ! spa_load_failed(spa, "ddt_load failed [error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); } ! return (0); ! } ! static int ! spa_ld_verify_logs(spa_t *spa, spa_import_type_t type, char **ereport) ! { ! vdev_t *rvd = spa->spa_root_vdev; ! ! if (type != SPA_IMPORT_ASSEMBLE && spa_writeable(spa)) { ! boolean_t missing = spa_check_logs(spa); ! if (missing) { ! if (spa->spa_missing_tvds != 0) { ! spa_load_note(spa, "spa_check_logs failed " ! "so dropping the logs"); ! } else { *ereport = FM_EREPORT_ZFS_LOG_REPLAY; ! spa_load_failed(spa, "spa_check_logs failed"); ! return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ! ENXIO)); } } - } ! return (0); ! } ! static int ! spa_ld_verify_pool_data(spa_t *spa) ! { ! int error = 0; ! vdev_t *rvd = spa->spa_root_vdev; /* * We've successfully opened the pool, verify that we're ready * to start pushing transactions. */ ! if (spa->spa_load_state != SPA_LOAD_TRYIMPORT) { ! error = spa_load_verify(spa); ! if (error != 0) { ! spa_load_failed(spa, "spa_load_verify failed " ! "[error=%d]", error); return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error)); } } ! return (0); ! } ! ! static void ! spa_ld_claim_log_blocks(spa_t *spa) ! { dmu_tx_t *tx; dsl_pool_t *dp = spa_get_dsl(spa); /* * Claim log blocks that haven't been committed yet. * This must all happen in a single txg. 
* Note: spa_claim_max_txg is updated by spa_claim_notify(), * invoked from zil_claim_log_block()'s i/o done callback. --- 2885,3115 ---- /* * Load any level 2 ARC devices for this pool. */ error = spa_dir_prop(spa, DMU_POOL_L2CACHE, ! &spa->spa_l2cache.sav_object); if (error != 0 && error != ENOENT) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); if (error == 0 && type != SPA_IMPORT_ASSEMBLE) { ASSERT(spa_version(spa) >= SPA_VERSION_L2CACHE); if (load_nvlist(spa, spa->spa_l2cache.sav_object, ! &spa->spa_l2cache.sav_config) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); spa_load_l2cache(spa); spa_config_exit(spa, SCL_ALL, FTAG); } else if (error == 0) { spa->spa_l2cache.sav_sync = B_TRUE; } ! mp = &spa->spa_meta_policy; ! spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION); ! spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK); ! spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK); ! spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK); ! spa->spa_dedup_lo_best_effort = ! zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT); ! spa->spa_dedup_hi_best_effort = ! zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT); + mp->spa_enable_meta_placement_selection = + zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT); + mp->spa_sync_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL); + mp->spa_ddt_meta_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV); + mp->spa_zfs_meta_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV); + mp->spa_small_data_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV); + spa_set_ddt_classes(spa, + zpool_prop_default_numeric(ZPOOL_PROP_DDT_DESEGREGATION)); + + spa->spa_resilver_prio = + zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO); + spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO); + + error = spa_dir_prop(spa, DMU_POOL_PROPS, &spa->spa_pool_props_object); + if (error && error != ENOENT) + return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); + + if (error == 0) { + uint64_t autoreplace; + uint64_t val = 0; + + spa_prop_find(spa, ZPOOL_PROP_BOOTFS, &spa->spa_bootfs); + spa_prop_find(spa, ZPOOL_PROP_AUTOREPLACE, &autoreplace); + spa_prop_find(spa, ZPOOL_PROP_DELEGATION, &spa->spa_delegation); + spa_prop_find(spa, ZPOOL_PROP_FAILUREMODE, &spa->spa_failmode); + spa_prop_find(spa, ZPOOL_PROP_AUTOEXPAND, &spa->spa_autoexpand); + spa_prop_find(spa, ZPOOL_PROP_BOOTSIZE, &spa->spa_bootsize); + spa_prop_find(spa, ZPOOL_PROP_DEDUPDITTO, + &spa->spa_dedup_ditto); + spa_prop_find(spa, ZPOOL_PROP_FORCETRIM, &spa->spa_force_trim); + + mutex_enter(&spa->spa_auto_trim_lock); + spa_prop_find(spa, ZPOOL_PROP_AUTOTRIM, &spa->spa_auto_trim); + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON) + spa_auto_trim_taskq_create(spa); + mutex_exit(&spa->spa_auto_trim_lock); + + spa_prop_find(spa, ZPOOL_PROP_HIWATERMARK, &spa->spa_hiwat); + spa_prop_find(spa, ZPOOL_PROP_LOWATERMARK, &spa->spa_lowat); + spa_prop_find(spa, ZPOOL_PROP_MINWATERMARK, &spa->spa_minwat); + spa_prop_find(spa, ZPOOL_PROP_DEDUPMETA_DITTO, + &spa->spa_ddt_meta_copies); + spa_prop_find(spa, ZPOOL_PROP_DDT_DESEGREGATION, &val); + spa_set_ddt_classes(spa, val); + + spa_prop_find(spa, ZPOOL_PROP_RESILVER_PRIO, + &spa->spa_resilver_prio); + spa_prop_find(spa, ZPOOL_PROP_SCRUB_PRIO, + &spa->spa_scrub_prio); + + spa_prop_find(spa, ZPOOL_PROP_DEDUP_BEST_EFFORT, + 
&spa->spa_dedup_best_effort); + spa_prop_find(spa, ZPOOL_PROP_DEDUP_LO_BEST_EFFORT, + &spa->spa_dedup_lo_best_effort); + spa_prop_find(spa, ZPOOL_PROP_DEDUP_HI_BEST_EFFORT, + &spa->spa_dedup_hi_best_effort); + + spa_prop_find(spa, ZPOOL_PROP_META_PLACEMENT, + &mp->spa_enable_meta_placement_selection); + spa_prop_find(spa, ZPOOL_PROP_SYNC_TO_SPECIAL, + &mp->spa_sync_to_special); + spa_prop_find(spa, ZPOOL_PROP_DDT_META_TO_METADEV, + &mp->spa_ddt_meta_to_special); + spa_prop_find(spa, ZPOOL_PROP_ZFS_META_TO_METADEV, + &mp->spa_zfs_meta_to_special); + spa_prop_find(spa, ZPOOL_PROP_SMALL_DATA_TO_METADEV, + &mp->spa_small_data_to_special); + + spa->spa_autoreplace = (autoreplace != 0); + } + + error = spa_dir_prop(spa, DMU_POOL_COS_PROPS, + &spa->spa_cos_props_object); + if (error == 0) + (void) spa_load_cos_props(spa); + error = spa_dir_prop(spa, DMU_POOL_VDEV_PROPS, + &spa->spa_vdev_props_object); + if (error == 0) + (void) spa_load_vdev_props(spa); + + (void) spa_dir_prop(spa, DMU_POOL_TRIM_START_TIME, + &spa->spa_man_trim_start_time); + (void) spa_dir_prop(spa, DMU_POOL_TRIM_STOP_TIME, + &spa->spa_man_trim_stop_time); + /* * If the 'autoreplace' property is set, then post a resource notifying * the ZFS DE that it should not issue any faults for unopenable * devices. We also iterate over the vdevs, and post a sysevent for any * unopenable vdevs so that the normal autoreplace handler can take * over. */ ! if (spa->spa_autoreplace && state != SPA_LOAD_TRYIMPORT) { spa_check_removed(spa->spa_root_vdev); /* * For the import case, this is done in spa_import(), because * at this point we're using the spare definitions from * the MOS config, not necessarily from the userland config. */ ! if (state != SPA_LOAD_IMPORT) { spa_aux_check_removed(&spa->spa_spares); spa_aux_check_removed(&spa->spa_l2cache); } } /* ! * Load the vdev state for all toplevel vdevs. */ ! vdev_load(rvd); /* ! * Propagate the leaf DTLs we just loaded all the way up the tree. */ spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); vdev_dtl_reassess(rvd, 0, 0, B_FALSE); spa_config_exit(spa, SCL_ALL, FTAG); ! /* ! * Load the DDTs (dedup tables). ! */ ! error = ddt_load(spa); ! if (error != 0) ! return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); ! spa_update_dspace(spa); ! /* ! * Validate the config, using the MOS config to fill in any ! * information which might be missing. If we fail to validate ! * the config then declare the pool unfit for use. If we're ! * assembling a pool from a split, the log is not transferred ! * over. ! */ ! if (type != SPA_IMPORT_ASSEMBLE) { ! nvlist_t *nvconfig; ! ! if (load_nvlist(spa, spa->spa_config_object, &nvconfig) != 0) return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO)); + + if (!spa_config_valid(spa, nvconfig)) { + nvlist_free(nvconfig); + return (spa_vdev_err(rvd, VDEV_AUX_BAD_GUID_SUM, + ENXIO)); } + nvlist_free(nvconfig); ! /* ! * Now that we've validated the config, check the state of the ! * root vdev. If it can't be opened, it indicates one or ! * more toplevel vdevs are faulted. ! */ ! if (rvd->vdev_state <= VDEV_STATE_CANT_OPEN) ! return (SET_ERROR(ENXIO)); ! if (spa_writeable(spa) && spa_check_logs(spa)) { *ereport = FM_EREPORT_ZFS_LOG_REPLAY; ! return (spa_vdev_err(rvd, VDEV_AUX_BAD_LOG, ENXIO)); } } ! if (missing_feat_write) { ! ASSERT(state == SPA_LOAD_TRYIMPORT); ! /* ! * At this point, we know that we can open the pool in ! * read-only mode but not read-write mode. We now have enough ! * information and can return to userland. ! */ ! 
return (spa_vdev_err(rvd, VDEV_AUX_UNSUP_FEAT, ENOTSUP)); ! } /* * We've successfully opened the pool, verify that we're ready * to start pushing transactions. */ ! if (state != SPA_LOAD_TRYIMPORT) { ! if (error = spa_load_verify(spa)) { return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, error)); } } ! if (spa_writeable(spa) && (state == SPA_LOAD_RECOVER || ! spa->spa_load_max_txg == UINT64_MAX)) { dmu_tx_t *tx; + int need_update = B_FALSE; dsl_pool_t *dp = spa_get_dsl(spa); + ASSERT(state != SPA_LOAD_TRYIMPORT); + /* * Claim log blocks that haven't been committed yet. * This must all happen in a single txg. * Note: spa_claim_max_txg is updated by spa_claim_notify(), * invoked from zil_claim_log_block()'s i/o done callback.
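The added property-loading block first seeds every pool and metadata-placement setting with its zpool_prop_default_numeric() value and only then overrides whatever the MOS props object actually contains, so a pool that never set a property still gets sane behavior. A condensed model of that default-then-override sequence follows; the property table and find_prop() helper are illustrative, not the real spa_prop_find() interface.

#include <stdint.h>
#include <stdio.h>

typedef struct prop {
	const char	*pr_name;
	uint64_t	pr_default;	/* compiled-in default */
	int		pr_stored;	/* does the MOS have a value? */
	uint64_t	pr_stored_val;
} prop_t;

/* Illustrative lookup: returns 0 and fills *valp when a value is stored. */
static int
find_prop(const prop_t *p, uint64_t *valp)
{
	if (!p->pr_stored)
		return (-1);
	*valp = p->pr_stored_val;
	return (0);
}

int
main(void)
{
	prop_t props[] = {
		{ "failmode",   1, 0, 0 },	/* never set: keep default */
		{ "autoexpand", 0, 1, 1 },	/* stored in the MOS: override */
	};

	for (unsigned i = 0; i < sizeof (props) / sizeof (props[0]); i++) {
		uint64_t val = props[i].pr_default;	/* seed the default */

		(void) find_prop(&props[i], &val);	/* override if stored */
		(void) printf("%s = %llu\n", props[i].pr_name,
		    (unsigned long long)val);
	}
	return (0);
}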
*** 3245,3273 **** dmu_tx_commit(tx); spa->spa_claiming = B_FALSE; spa_set_log_state(spa, SPA_LOG_GOOD); ! } ! static void ! spa_ld_check_for_config_update(spa_t *spa, uint64_t config_cache_txg, ! boolean_t reloading) ! { ! vdev_t *rvd = spa->spa_root_vdev; ! int need_update = B_FALSE; /* * If the config cache is stale, or we have uninitialized * metaslabs (see spa_vdev_add()), then update the config. * * If this is a verbatim import, trust the current * in-core spa_config and update the disk labels. */ ! if (reloading || config_cache_txg != spa->spa_config_txg || ! spa->spa_load_state == SPA_LOAD_IMPORT || ! spa->spa_load_state == SPA_LOAD_RECOVER || (spa->spa_import_flags & ZFS_IMPORT_VERBATIM)) need_update = B_TRUE; for (int c = 0; c < rvd->vdev_children; c++) if (rvd->vdev_child[c]->vdev_ms_array == 0) --- 3123,3154 ---- dmu_tx_commit(tx); spa->spa_claiming = B_FALSE; spa_set_log_state(spa, SPA_LOG_GOOD); ! spa->spa_sync_on = B_TRUE; ! txg_sync_start(spa->spa_dsl_pool); ! /* ! * Wait for all claims to sync. We sync up to the highest ! * claimed log block birth time so that claimed log blocks ! * don't appear to be from the future. spa_claim_max_txg ! * will have been set for us by either zil_check_log_chain() ! * (invoked from spa_check_logs()) or zil_claim() above. ! */ ! txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg); /* * If the config cache is stale, or we have uninitialized * metaslabs (see spa_vdev_add()), then update the config. * * If this is a verbatim import, trust the current * in-core spa_config and update the disk labels. */ ! if (config_cache_txg != spa->spa_config_txg || ! state == SPA_LOAD_IMPORT || ! state == SPA_LOAD_RECOVER || (spa->spa_import_flags & ZFS_IMPORT_VERBATIM)) need_update = B_TRUE; for (int c = 0; c < rvd->vdev_children; c++) if (rvd->vdev_child[c]->vdev_ms_array == 0)
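After the claims have synced, the code decides whether the on-disk config cache must be refreshed. The conditions are easy to lose in the diff, so here they are restated as one boolean expression; the names are shortened but the logic mirrors the hunk above (stale cached txg, import or recover loads, verbatim imports, or uninitialized metaslab arrays).

#include <stdint.h>
#include <stdio.h>

typedef enum { LOAD_OPEN, LOAD_IMPORT, LOAD_RECOVER } load_state_t;

/*
 * The config cache (and vdev labels) need a rewrite when the cached txg
 * is stale, when the pool is being imported or recovered, when the
 * import is verbatim, or when any top-level vdev has no metaslab array
 * allocated yet (see spa_vdev_add()).
 */
int
config_needs_update(uint64_t cache_txg, uint64_t config_txg,
    load_state_t state, int verbatim_import, int uninit_metaslabs)
{
	return (cache_txg != config_txg ||
	    state == LOAD_IMPORT || state == LOAD_RECOVER ||
	    verbatim_import || uninit_metaslabs);
}

int
main(void)
{
	(void) printf("%d\n", config_needs_update(5, 5, LOAD_OPEN, 0, 0));
	(void) printf("%d\n", config_needs_update(4, 5, LOAD_OPEN, 0, 0));
	return (0);
}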
*** 3277,3572 **** * Update the config cache asychronously in case we're the * root pool, in which case the config cache isn't writable yet. */ if (need_update) spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE); - } - static void - spa_ld_prepare_for_reload(spa_t *spa) - { - int mode = spa->spa_mode; - int async_suspended = spa->spa_async_suspended; - - spa_unload(spa); - spa_deactivate(spa); - spa_activate(spa, mode); - /* - * We save the value of spa_async_suspended as it gets reset to 0 by - * spa_unload(). We want to restore it back to the original value before - * returning as we might be calling spa_async_resume() later. - */ - spa->spa_async_suspended = async_suspended; - } - - /* - * Load an existing storage pool, using the config provided. This config - * describes which vdevs are part of the pool and is later validated against - * partial configs present in each vdev's label and an entire copy of the - * config stored in the MOS. - */ - static int - spa_load_impl(spa_t *spa, spa_import_type_t type, char **ereport, - boolean_t reloading) - { - int error = 0; - boolean_t missing_feat_write = B_FALSE; - - ASSERT(MUTEX_HELD(&spa_namespace_lock)); - ASSERT(spa->spa_config_source != SPA_CONFIG_SRC_NONE); - - /* - * Never trust the config that is provided unless we are assembling - * a pool following a split. - * This means don't trust blkptrs and the vdev tree in general. This - * also effectively puts the spa in read-only mode since - * spa_writeable() checks for spa_trust_config to be true. - * We will later load a trusted config from the MOS. - */ - if (type != SPA_IMPORT_ASSEMBLE) - spa->spa_trust_config = B_FALSE; - - if (reloading) - spa_load_note(spa, "RELOADING"); - else - spa_load_note(spa, "LOADING"); - - /* - * Parse the config provided to create a vdev tree. - */ - error = spa_ld_parse_config(spa, type); - if (error != 0) - return (error); - - /* - * Now that we have the vdev tree, try to open each vdev. This involves - * opening the underlying physical device, retrieving its geometry and - * probing the vdev with a dummy I/O. The state of each vdev will be set - * based on the success of those operations. After this we'll be ready - * to read from the vdevs. - */ - error = spa_ld_open_vdevs(spa); - if (error != 0) - return (error); - - /* - * Read the label of each vdev and make sure that the GUIDs stored - * there match the GUIDs in the config provided. - * If we're assembling a new pool that's been split off from an - * existing pool, the labels haven't yet been updated so we skip - * validation for now. - */ - if (type != SPA_IMPORT_ASSEMBLE) { - error = spa_ld_validate_vdevs(spa); - if (error != 0) - return (error); - } - - /* - * Read vdev labels to find the best uberblock (i.e. latest, unless - * spa_load_max_txg is set) and store it in spa_uberblock. We get the - * list of features required to read blkptrs in the MOS from the vdev - * label with the best uberblock and verify that our version of zfs - * supports them all. - */ - error = spa_ld_select_uberblock(spa, type); - if (error != 0) - return (error); - - /* - * Pass that uberblock to the dsl_pool layer which will open the root - * blkptr. This blkptr points to the latest version of the MOS and will - * allow us to read its contents. - */ - error = spa_ld_open_rootbp(spa); - if (error != 0) - return (error); - - /* - * Retrieve the trusted config stored in the MOS and use it to create - * a new, exact version of the vdev tree, then reopen all vdevs. 
- */ - error = spa_ld_load_trusted_config(spa, type, reloading); - if (error == EAGAIN) { - VERIFY(!reloading); - /* - * Redo the loading process with the trusted config if it is - * too different from the untrusted config. - */ - spa_ld_prepare_for_reload(spa); - return (spa_load_impl(spa, type, ereport, B_TRUE)); - } else if (error != 0) { - return (error); - } - - /* - * Retrieve the mapping of indirect vdevs. Those vdevs were removed - * from the pool and their contents were re-mapped to other vdevs. Note - * that everything that we read before this step must have been - * rewritten on concrete vdevs after the last device removal was - * initiated. Otherwise we could be reading from indirect vdevs before - * we have loaded their mappings. - */ - error = spa_ld_open_indirect_vdev_metadata(spa); - if (error != 0) - return (error); - - /* - * Retrieve the full list of active features from the MOS and check if - * they are all supported. - */ - error = spa_ld_check_features(spa, &missing_feat_write); - if (error != 0) - return (error); - - /* - * Load several special directories from the MOS needed by the dsl_pool - * layer. - */ - error = spa_ld_load_special_directories(spa); - if (error != 0) - return (error); - - /* - * Retrieve pool properties from the MOS. - */ - error = spa_ld_get_props(spa); - if (error != 0) - return (error); - - /* - * Retrieve the list of auxiliary devices - cache devices and spares - - * and open them. - */ - error = spa_ld_open_aux_vdevs(spa, type); - if (error != 0) - return (error); - - /* - * Load the metadata for all vdevs. Also check if unopenable devices - * should be autoreplaced. - */ - error = spa_ld_load_vdev_metadata(spa); - if (error != 0) - return (error); - - error = spa_ld_load_dedup_tables(spa); - if (error != 0) - return (error); - - /* - * Verify the logs now to make sure we don't have any unexpected errors - * when we claim log blocks later. - */ - error = spa_ld_verify_logs(spa, type, ereport); - if (error != 0) - return (error); - - if (missing_feat_write) { - ASSERT(spa->spa_load_state == SPA_LOAD_TRYIMPORT); - - /* - * At this point, we know that we can open the pool in - * read-only mode but not read-write mode. We now have enough - * information and can return to userland. - */ - return (spa_vdev_err(spa->spa_root_vdev, VDEV_AUX_UNSUP_FEAT, - ENOTSUP)); - } - - /* - * Traverse the last txgs to make sure the pool was left off in a safe - * state. When performing an extreme rewind, we verify the whole pool, - * which can take a very long time. - */ - error = spa_ld_verify_pool_data(spa); - if (error != 0) - return (error); - - /* - * Calculate the deflated space for the pool. This must be done before - * we write anything to the pool because we'd need to update the space - * accounting using the deflated sizes. - */ - spa_update_dspace(spa); - - /* - * We have now retrieved all the information we needed to open the - * pool. If we are importing the pool in read-write mode, a few - * additional steps must be performed to finish the import. - */ - if (spa_writeable(spa) && (spa->spa_load_state == SPA_LOAD_RECOVER || - spa->spa_load_max_txg == UINT64_MAX)) { - uint64_t config_cache_txg = spa->spa_config_txg; - - ASSERT(spa->spa_load_state != SPA_LOAD_TRYIMPORT); - - /* - * Traverse the ZIL and claim all blocks. - */ - spa_ld_claim_log_blocks(spa); - - /* - * Kick-off the syncing thread. - */ - spa->spa_sync_on = B_TRUE; - txg_sync_start(spa->spa_dsl_pool); - - /* - * Wait for all claims to sync. 
We sync up to the highest - * claimed log block birth time so that claimed log blocks - * don't appear to be from the future. spa_claim_max_txg - * will have been set for us by ZIL traversal operations - * performed above. - */ - txg_wait_synced(spa->spa_dsl_pool, spa->spa_claim_max_txg); - - /* - * Check if we need to request an update of the config. On the - * next sync, we would update the config stored in vdev labels - * and the cachefile (by default /etc/zfs/zpool.cache). - */ - spa_ld_check_for_config_update(spa, config_cache_txg, - reloading); - - /* * Check all DTLs to see if anything needs resilvering. */ if (!dsl_scan_resilvering(spa->spa_dsl_pool) && ! vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) spa_async_request(spa, SPA_ASYNC_RESILVER); /* * Log the fact that we booted up (so that we can detect if * we rebooted in the middle of an operation). */ spa_history_log_version(spa, "open"); ! /* ! * Delete any inconsistent datasets. ! */ ! (void) dmu_objset_find(spa_name(spa), ! dsl_destroy_inconsistent, NULL, DS_FIND_CHILDREN); /* * Clean up any stale temporary dataset userrefs. */ dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool); - - spa_restart_removal(spa); - - spa_spawn_aux_threads(spa); } ! spa_load_note(spa, "LOADED"); return (0); } static int ! spa_load_retry(spa_t *spa, spa_load_state_t state) { int mode = spa->spa_mode; spa_unload(spa); spa_deactivate(spa); --- 3158,3196 ---- * Update the config cache asychronously in case we're the * root pool, in which case the config cache isn't writable yet. */ if (need_update) spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE); /* * Check all DTLs to see if anything needs resilvering. */ if (!dsl_scan_resilvering(spa->spa_dsl_pool) && ! vdev_resilver_needed(rvd, NULL, NULL)) spa_async_request(spa, SPA_ASYNC_RESILVER); /* * Log the fact that we booted up (so that we can detect if * we rebooted in the middle of an operation). */ spa_history_log_version(spa, "open"); ! dsl_destroy_inconsistent(spa_get_dsl(spa)); /* * Clean up any stale temporary dataset userrefs. */ dsl_pool_clean_tmp_userrefs(spa->spa_dsl_pool); } ! spa_async_request(spa, SPA_ASYNC_L2CACHE_REBUILD); return (0); } static int ! spa_load_retry(spa_t *spa, spa_load_state_t state, int mosconfig) { int mode = spa->spa_mode; spa_unload(spa); spa_deactivate(spa);
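The tail of this hunk queues follow-up work (resilver, config update, L2ARC rebuild) through spa_async_request() rather than doing it inline during load. The usual shape of such a request mechanism, a flag word protected by a lock plus a worker that drains it, is sketched below with pthreads standing in for the kernel primitives; this is a pattern illustration, not the actual spa_async implementation.

#include <pthread.h>
#include <stdio.h>

#define	ASYNC_RESILVER		0x01
#define	ASYNC_CONFIG_UPDATE	0x02
#define	ASYNC_L2CACHE_REBUILD	0x04

static pthread_mutex_t	async_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned		async_tasks;

/* Record that a background task should run; cheap to call during load. */
void
async_request(unsigned task)
{
	(void) pthread_mutex_lock(&async_lock);
	async_tasks |= task;
	(void) pthread_mutex_unlock(&async_lock);
}

/* Worker side: atomically take the pending set and act on it. */
void
async_drain(void)
{
	unsigned tasks;

	(void) pthread_mutex_lock(&async_lock);
	tasks = async_tasks;
	async_tasks = 0;
	(void) pthread_mutex_unlock(&async_lock);

	if (tasks & ASYNC_RESILVER)
		(void) printf("starting resilver\n");
	if (tasks & ASYNC_L2CACHE_REBUILD)
		(void) printf("rebuilding l2arc\n");
}

int
main(void)
{
	async_request(ASYNC_RESILVER);
	async_request(ASYNC_L2CACHE_REBUILD);
	async_drain();
	return (0);
}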
*** 3574,3587 ****
  	spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
  	spa_activate(spa, mode);
  	spa_async_suspend(spa);
  
! 	spa_load_note(spa, "spa_load_retry: rewind, max txg: %llu",
! 	    (u_longlong_t)spa->spa_load_max_txg);
! 
! 	return (spa_load(spa, state, SPA_IMPORT_EXISTING));
  }
  
  /*
   * If spa_load() fails this function will try loading prior txg's. If
   * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
--- 3198,3208 ----
  	spa->spa_load_max_txg = spa->spa_uberblock.ub_txg - 1;
  	spa_activate(spa, mode);
  	spa_async_suspend(spa);
  
! 	return (spa_load(spa, state, SPA_IMPORT_EXISTING, mosconfig));
  }
  
  /*
   * If spa_load() fails this function will try loading prior txg's. If
   * 'state' is SPA_LOAD_RECOVER and one of these loads succeeds the pool
*** 3588,3599 **** * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this * function will not rewind the pool and will return the same error as * spa_load(). */ static int ! spa_load_best(spa_t *spa, spa_load_state_t state, uint64_t max_request, ! int rewind_flags) { nvlist_t *loadinfo = NULL; nvlist_t *config = NULL; int load_error, rewind_error; uint64_t safe_rewind_txg; --- 3209,3220 ---- * will be rewound to that txg. If 'state' is not SPA_LOAD_RECOVER this * function will not rewind the pool and will return the same error as * spa_load(). */ static int ! spa_load_best(spa_t *spa, spa_load_state_t state, int mosconfig, ! uint64_t max_request, int rewind_flags) { nvlist_t *loadinfo = NULL; nvlist_t *config = NULL; int load_error, rewind_error; uint64_t safe_rewind_txg;
*** 3606,3616 **** spa->spa_load_max_txg = max_request; if (max_request != UINT64_MAX) spa->spa_extreme_rewind = B_TRUE; } ! load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING); if (load_error == 0) return (0); if (spa->spa_root_vdev != NULL) config = spa_config_generate(spa, NULL, -1ULL, B_TRUE); --- 3227,3238 ---- spa->spa_load_max_txg = max_request; if (max_request != UINT64_MAX) spa->spa_extreme_rewind = B_TRUE; } ! load_error = rewind_error = spa_load(spa, state, SPA_IMPORT_EXISTING, ! mosconfig); if (load_error == 0) return (0); if (spa->spa_root_vdev != NULL) config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
*** 3647,3657 ****
  	 */
  	while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
  	    spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
  		if (spa->spa_load_max_txg < safe_rewind_txg)
  			spa->spa_extreme_rewind = B_TRUE;
! 		rewind_error = spa_load_retry(spa, state);
  	}
  
  	spa->spa_extreme_rewind = B_FALSE;
  	spa->spa_load_max_txg = UINT64_MAX;
--- 3269,3279 ----
  	 */
  	while (rewind_error && spa->spa_uberblock.ub_txg >= min_txg &&
  	    spa->spa_uberblock.ub_txg <= spa->spa_load_max_txg) {
  		if (spa->spa_load_max_txg < safe_rewind_txg)
  			spa->spa_extreme_rewind = B_TRUE;
! 		rewind_error = spa_load_retry(spa, state, mosconfig);
  	}
  
  	spa->spa_extreme_rewind = B_FALSE;
  	spa->spa_load_max_txg = UINT64_MAX;
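spa_load_best() keeps retrying the load at ever-earlier txgs until it succeeds or leaves the rewind window. A standalone model of that loop is below; try_load() is a hypothetical probe that succeeds once the txg is old enough, the retreat is a single txg per iteration rather than the real ub_txg-driven step, and the extreme-rewind switch mirrors the safe_rewind_txg check in the diff.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical probe: loading succeeds only at or below txg 95. */
static int
try_load(uint64_t max_txg)
{
	return (max_txg <= 95 ? 0 : -1);
}

/*
 * Walk backwards while the load keeps failing.  Once max_txg drops below
 * the safe rewind point, switch to "extreme rewind" (full verification),
 * just as spa_load_best() does.
 */
int
load_with_rewind(uint64_t start_txg, uint64_t min_txg,
    uint64_t safe_rewind_txg)
{
	uint64_t max_txg = start_txg;
	int extreme = 0;
	int err = try_load(max_txg);

	while (err != 0 && max_txg > min_txg) {
		max_txg--;
		if (max_txg < safe_rewind_txg)
			extreme = 1;
		err = try_load(max_txg);
	}
	if (err == 0)
		(void) printf("loaded at txg %llu%s\n",
		    (unsigned long long)max_txg,
		    extreme ? " (extreme rewind)" : "");
	return (err);
}

int
main(void)
{
	return (load_with_rewind(100, 90, 97));
}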
*** 3694,3703 ****
--- 3316,3326 ----
  {
  	spa_t *spa;
  	spa_load_state_t state = SPA_LOAD_OPEN;
  	int error;
  	int locked = B_FALSE;
+ 	boolean_t open_with_activation = B_FALSE;
  
  	*spapp = NULL;
  
  	/*
  	 * As disgusting as this is, we need to support recursive calls to this
*** 3726,3739 **** spa_activate(spa, spa_mode_global); if (state != SPA_LOAD_RECOVER) spa->spa_last_ubsync_txg = spa->spa_load_txg = 0; - spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE; ! zfs_dbgmsg("spa_open_common: opening %s", pool); ! error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request); if (error == EBADF) { /* * If vdev_validate() returns failure (indicated by --- 3349,3360 ---- spa_activate(spa, spa_mode_global); if (state != SPA_LOAD_RECOVER) spa->spa_last_ubsync_txg = spa->spa_load_txg = 0; ! error = spa_load_best(spa, state, B_FALSE, policy.zrp_txg, policy.zrp_request); if (error == EBADF) { /* * If vdev_validate() returns failure (indicated by
*** 3742,3752 **** * this is the case, the config cache is out of sync and * we should remove the pool from the namespace. */ spa_unload(spa); spa_deactivate(spa); ! spa_write_cachefile(spa, B_TRUE, B_TRUE); spa_remove(spa); if (locked) mutex_exit(&spa_namespace_lock); return (SET_ERROR(ENOENT)); } --- 3363,3373 ---- * this is the case, the config cache is out of sync and * we should remove the pool from the namespace. */ spa_unload(spa); spa_deactivate(spa); ! spa_config_sync(spa, B_TRUE, B_TRUE); spa_remove(spa); if (locked) mutex_exit(&spa_namespace_lock); return (SET_ERROR(ENOENT)); }
*** 3770,3779 **** --- 3391,3402 ---- if (locked) mutex_exit(&spa_namespace_lock); *spapp = NULL; return (error); } + + open_with_activation = B_TRUE; } spa_open_ref(spa, tag); if (config != NULL)
*** 3793,3802 **** --- 3416,3428 ---- spa->spa_last_ubsync_txg = 0; spa->spa_load_txg = 0; mutex_exit(&spa_namespace_lock); } + if (open_with_activation) + wbc_activate(spa, B_FALSE); + *spapp = spa; return (0); }
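Note: these spa_open_common() hunks record the need for write-back-cache activation in open_with_activation while the pool is being set up, and only call wbc_activate() near the end of the function, after the namespace lock has been released on the locked path. A small pthread sketch of that defer-until-unlock pattern; the helper names are illustrative, not the kernel API:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for wbc_activate(): should not run under namespace_lock. */
    static void activate_cache(void) { puts("activating write-back cache"); }

    static int open_pool(bool needs_load)
    {
        bool activate_after_unlock = false;

        pthread_mutex_lock(&namespace_lock);
        if (needs_load) {
            /* ... load and set up the pool ... */
            activate_after_unlock = true;   /* only remember the decision here */
        }
        pthread_mutex_unlock(&namespace_lock);

        /* Heavier work runs only after the lock is dropped. */
        if (activate_after_unlock)
            activate_cache();
        return 0;
    }

    int main(void) { return open_pool(true); }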
*** 4242,4252 **** int error = 0; uint64_t txg = TXG_INITIAL; nvlist_t **spares, **l2cache; uint_t nspares, nl2cache; uint64_t version, obj; ! boolean_t has_features; /* * If this pool already exists, return failure. */ mutex_enter(&spa_namespace_lock); --- 3868,3879 ---- int error = 0; uint64_t txg = TXG_INITIAL; nvlist_t **spares, **l2cache; uint_t nspares, nl2cache; uint64_t version, obj; ! boolean_t has_features = B_FALSE, wbc_feature_exists = B_FALSE; ! spa_meta_placement_t *mp; /* * If this pool already exists, return failure. */ mutex_enter(&spa_namespace_lock);
*** 4261,4284 **** (void) nvlist_lookup_string(props, zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot); spa = spa_add(pool, NULL, altroot); spa_activate(spa, spa_mode_global); ! if (props && (error = spa_prop_validate(spa, props))) { spa_deactivate(spa); spa_remove(spa); mutex_exit(&spa_namespace_lock); return (error); } - - has_features = B_FALSE; - for (nvpair_t *elem = nvlist_next_nvpair(props, NULL); - elem != NULL; elem = nvlist_next_nvpair(props, elem)) { - if (zpool_prop_feature(nvpair_name(elem))) - has_features = B_TRUE; } if (has_features || nvlist_lookup_uint64(props, zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) { version = SPA_VERSION; } ASSERT(SPA_VERSION_IS_SUPPORTED(version)); --- 3888,3937 ---- (void) nvlist_lookup_string(props, zpool_prop_to_name(ZPOOL_PROP_ALTROOT), &altroot); spa = spa_add(pool, NULL, altroot); spa_activate(spa, spa_mode_global); ! if (props != NULL) { ! nvpair_t *wbc_feature_nvp = NULL; ! ! for (nvpair_t *elem = nvlist_next_nvpair(props, NULL); ! elem != NULL; elem = nvlist_next_nvpair(props, elem)) { ! const char *propname = nvpair_name(elem); ! if (zpool_prop_feature(propname)) { ! spa_feature_t feature; ! int err; ! const char *fname = strchr(propname, '@') + 1; ! ! err = zfeature_lookup_name(fname, &feature); ! if (err == 0 && feature == SPA_FEATURE_WBC) { ! wbc_feature_nvp = elem; ! wbc_feature_exists = B_TRUE; ! } ! ! has_features = B_TRUE; ! } ! } ! ! /* ! * We do not want to enabled feature@wbc if ! * this pool does not have special vdev. ! * At this stage we remove this feature from common list, ! * but later after check that special vdev available this ! * feature will be enabled ! */ ! if (wbc_feature_nvp != NULL) ! fnvlist_remove_nvpair(props, wbc_feature_nvp); ! ! if ((error = spa_prop_validate(spa, props)) != 0) { spa_deactivate(spa); spa_remove(spa); mutex_exit(&spa_namespace_lock); return (error); } } + if (has_features || nvlist_lookup_uint64(props, zpool_prop_to_name(ZPOOL_PROP_VERSION), &version) != 0) { version = SPA_VERSION; } ASSERT(SPA_VERSION_IS_SUPPORTED(version));
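Note: the creation path above walks the user-supplied property list, remembers whether feature@wbc was requested, and pulls that one pair out before spa_prop_validate() so the feature is only enabled later, once a special vdev is known to exist. A stand-alone sketch of that scan-and-withhold step, using a plain string array in place of the nvlist; every name and value here is invented:

    #include <stdio.h>
    #include <string.h>

    #define WBC_FEATURE "feature@wbc"

    int main(void)
    {
        /* Stand-in for the props list handed to pool creation. */
        const char *props[] = { "feature@meta_devices", "feature@wbc", "ashift" };
        int nprops = 3;
        int wbc_requested = 0;

        /* Scan once: note the request, drop the pair from the list. */
        for (int i = 0; i < nprops; i++) {
            if (strncmp(props[i], "feature@", 8) != 0)
                continue;                       /* not a feature property */
            if (strcmp(props[i], WBC_FEATURE) == 0) {
                wbc_requested = 1;
                props[i] = props[nprops - 1];   /* remove by swapping with last */
                nprops--;
                i--;
            }
        }

        /* validation would run here, with feature@wbc withheld from the list */
        printf("wbc requested: %d, remaining props: %d\n", wbc_requested, nprops);
        return 0;
    }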
*** 4286,4298 **** spa->spa_first_txg = txg; spa->spa_uberblock.ub_txg = txg - 1; spa->spa_uberblock.ub_version = version; spa->spa_ubsync = spa->spa_uberblock; spa->spa_load_state = SPA_LOAD_CREATE; - spa->spa_removing_phys.sr_state = DSS_NONE; - spa->spa_removing_phys.sr_removing_vdev = -1; - spa->spa_removing_phys.sr_prev_indirect_vdev = -1; /* * Create "The Godfather" zio to hold all async IOs */ spa->spa_async_zio_root = kmem_alloc(max_ncpus * sizeof (void *), --- 3939,3948 ----
*** 4432,4441 **** --- 4082,4093 ---- * Create the pool's history object. */ if (version >= SPA_VERSION_ZPOOL_HISTORY) spa_history_create_obj(spa, tx); + mp = &spa->spa_meta_policy; + /* * Generate some random noise for salted checksums to operate on. */ (void) random_get_pseudo_bytes(spa->spa_cksum_salt.zcs_bytes, sizeof (spa->spa_cksum_salt.zcs_bytes));
*** 4445,4460 **** --- 4097,4155 ---- */ spa->spa_bootfs = zpool_prop_default_numeric(ZPOOL_PROP_BOOTFS); spa->spa_delegation = zpool_prop_default_numeric(ZPOOL_PROP_DELEGATION); spa->spa_failmode = zpool_prop_default_numeric(ZPOOL_PROP_FAILUREMODE); spa->spa_autoexpand = zpool_prop_default_numeric(ZPOOL_PROP_AUTOEXPAND); + spa->spa_minwat = zpool_prop_default_numeric(ZPOOL_PROP_MINWATERMARK); + spa->spa_hiwat = zpool_prop_default_numeric(ZPOOL_PROP_HIWATERMARK); + spa->spa_lowat = zpool_prop_default_numeric(ZPOOL_PROP_LOWATERMARK); + spa->spa_ddt_meta_copies = + zpool_prop_default_numeric(ZPOOL_PROP_DEDUPMETA_DITTO); + spa->spa_dedup_best_effort = + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_BEST_EFFORT); + spa->spa_dedup_lo_best_effort = + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_LO_BEST_EFFORT); + spa->spa_dedup_hi_best_effort = + zpool_prop_default_numeric(ZPOOL_PROP_DEDUP_HI_BEST_EFFORT); + spa->spa_force_trim = zpool_prop_default_numeric(ZPOOL_PROP_FORCETRIM); + spa->spa_resilver_prio = + zpool_prop_default_numeric(ZPOOL_PROP_RESILVER_PRIO); + spa->spa_scrub_prio = zpool_prop_default_numeric(ZPOOL_PROP_SCRUB_PRIO); + + mutex_enter(&spa->spa_auto_trim_lock); + spa->spa_auto_trim = zpool_prop_default_numeric(ZPOOL_PROP_AUTOTRIM); + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON) + spa_auto_trim_taskq_create(spa); + mutex_exit(&spa->spa_auto_trim_lock); + + mp->spa_enable_meta_placement_selection = + zpool_prop_default_numeric(ZPOOL_PROP_META_PLACEMENT); + mp->spa_sync_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_SYNC_TO_SPECIAL); + mp->spa_ddt_meta_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_DDT_META_TO_METADEV); + mp->spa_zfs_meta_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_ZFS_META_TO_METADEV); + mp->spa_small_data_to_special = + zpool_prop_default_numeric(ZPOOL_PROP_SMALL_DATA_TO_METADEV); + + spa_set_ddt_classes(spa, 0); + if (props != NULL) { spa_configfile_set(spa, props, B_FALSE); spa_sync_props(props, tx); } + if (spa_has_special(spa)) { + spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx); + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx); + + if (wbc_feature_exists) + spa_feature_enable(spa, SPA_FEATURE_WBC, tx); + } + dmu_tx_commit(tx); spa->spa_sync_on = B_TRUE; txg_sync_start(spa->spa_dsl_pool);
*** 4462,4474 **** * We explicitly wait for the first transaction to complete so that our * bean counters are appropriately updated. */ txg_wait_synced(spa->spa_dsl_pool, txg); ! spa_spawn_aux_threads(spa); ! ! spa_write_cachefile(spa, B_FALSE, B_TRUE); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE); spa_history_log_version(spa, "create"); /* --- 4157,4167 ---- * We explicitly wait for the first transaction to complete so that our * bean counters are appropriately updated. */ txg_wait_synced(spa->spa_dsl_pool, txg); ! spa_config_sync(spa, B_FALSE, B_TRUE); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_CREATE); spa_history_log_version(spa, "create"); /*
*** 4479,4491 **** --- 4172,4232 ---- spa->spa_minref = refcount_count(&spa->spa_refcount); spa->spa_load_state = SPA_LOAD_NONE; mutex_exit(&spa_namespace_lock); + wbc_activate(spa, B_TRUE); + return (0); } + + /* + * See if the pool has special tier, and if so, enable/activate + * the feature as needed. Activation is not reference counted. + */ + static void + spa_check_special_feature(spa_t *spa) + { + if (spa_has_special(spa)) { + nvlist_t *props = NULL; + + if (!spa_feature_is_enabled(spa, SPA_FEATURE_META_DEVICES)) { + VERIFY(nvlist_alloc(&props, NV_UNIQUE_NAME, 0) == 0); + VERIFY(nvlist_add_uint64(props, + FEATURE_META_DEVICES, 0) == 0); + VERIFY(spa_prop_set(spa, props) == 0); + nvlist_free(props); + } + + if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) { + dmu_tx_t *tx = + dmu_tx_create_dd(spa->spa_dsl_pool->dp_mos_dir); + + VERIFY(dmu_tx_assign(tx, TXG_WAIT) == 0); + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx); + dmu_tx_commit(tx); + } + } + } + + static void + spa_special_feature_activate(void *arg, dmu_tx_t *tx) + { + spa_t *spa = (spa_t *)arg; + + if (spa_has_special(spa)) { + /* enable and activate as needed */ + spa_feature_enable(spa, SPA_FEATURE_META_DEVICES, tx); + if (!spa_feature_is_active(spa, SPA_FEATURE_META_DEVICES)) { + spa_feature_incr(spa, SPA_FEATURE_META_DEVICES, tx); + } + + spa_feature_enable(spa, SPA_FEATURE_WBC, tx); + } + } + #ifdef _KERNEL /* * Get the root pool information from the root disk, then import the root pool * during the system boot up time. */
*** 4607,4617 **** VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME, &pname) == 0); VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0); mutex_enter(&spa_namespace_lock); ! if ((spa = spa_lookup(pname)) != NULL) { /* * Remove the existing root pool from the namespace so that we * can replace it with the correct config we just read in. */ spa_remove(spa); --- 4348,4358 ---- VERIFY(nvlist_lookup_string(config, ZPOOL_CONFIG_POOL_NAME, &pname) == 0); VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0); mutex_enter(&spa_namespace_lock); ! if ((spa = spa_lookup(pname)) != NULL || spa_config_guid_exists(guid)) { /* * Remove the existing root pool from the namespace so that we * can replace it with the correct config we just read in. */ spa_remove(spa);
*** 4618,4630 **** } spa = spa_add(pname, config, NULL); spa->spa_is_root = B_TRUE; spa->spa_import_flags = ZFS_IMPORT_VERBATIM; - if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_VERSION, - &spa->spa_ubsync.ub_version) != 0) - spa->spa_ubsync.ub_version = SPA_VERSION_INITIAL; /* * Build up a vdev tree based on the boot device's label config. */ VERIFY(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, --- 4359,4368 ----
*** 4704,4719 **** uint64_t readonly = B_FALSE; int error; nvlist_t *nvroot; nvlist_t **spares, **l2cache; uint_t nspares, nl2cache; /* * If a pool with this name exists, return failure. */ mutex_enter(&spa_namespace_lock); ! if (spa_lookup(pool) != NULL) { mutex_exit(&spa_namespace_lock); return (SET_ERROR(EEXIST)); } /* --- 4442,4461 ---- uint64_t readonly = B_FALSE; int error; nvlist_t *nvroot; nvlist_t **spares, **l2cache; uint_t nspares, nl2cache; + uint64_t guid; + if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid) != 0) + return (SET_ERROR(EINVAL)); + /* * If a pool with this name exists, return failure. */ mutex_enter(&spa_namespace_lock); ! if (spa_lookup(pool) != NULL || spa_config_guid_exists(guid)) { mutex_exit(&spa_namespace_lock); return (SET_ERROR(EEXIST)); } /*
*** 4734,4746 **** */ if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) { if (props != NULL) spa_configfile_set(spa, props, B_FALSE); ! spa_write_cachefile(spa, B_FALSE, B_TRUE); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT); ! zfs_dbgmsg("spa_import: verbatim import of %s", pool); mutex_exit(&spa_namespace_lock); return (0); } spa_activate(spa, mode); --- 4476,4488 ---- */ if (spa->spa_import_flags & ZFS_IMPORT_VERBATIM) { if (props != NULL) spa_configfile_set(spa, props, B_FALSE); ! spa_config_sync(spa, B_FALSE, B_TRUE); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT); ! mutex_exit(&spa_namespace_lock); return (0); } spa_activate(spa, mode);
*** 4752,4772 **** zpool_get_rewind_policy(config, &policy); if (policy.zrp_request & ZPOOL_DO_REWIND) state = SPA_LOAD_RECOVER; ! spa->spa_config_source = SPA_CONFIG_SRC_TRYIMPORT; ! ! if (state != SPA_LOAD_RECOVER) { spa->spa_last_ubsync_txg = spa->spa_load_txg = 0; - zfs_dbgmsg("spa_import: importing %s", pool); - } else { - zfs_dbgmsg("spa_import: importing %s, max_txg=%lld " - "(RECOVERY MODE)", pool, (longlong_t)policy.zrp_txg); - } - error = spa_load_best(spa, state, policy.zrp_txg, policy.zrp_request); /* * Propagate anything learned while loading the pool and pass it * back to caller (i.e. rewind info, missing devices, etc). */ VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO, --- 4494,4514 ---- zpool_get_rewind_policy(config, &policy); if (policy.zrp_request & ZPOOL_DO_REWIND) state = SPA_LOAD_RECOVER; ! /* ! * Pass off the heavy lifting to spa_load(). Pass TRUE for mosconfig ! * because the user-supplied config is actually the one to trust when ! * doing an import. ! */ ! if (state != SPA_LOAD_RECOVER) spa->spa_last_ubsync_txg = spa->spa_load_txg = 0; + error = spa_load_best(spa, state, B_TRUE, policy.zrp_txg, + policy.zrp_request); + /* * Propagate anything learned while loading the pool and pass it * back to caller (i.e. rewind info, missing devices, etc). */ VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_LOAD_INFO,
*** 4808,4819 **** spa_remove(spa); mutex_exit(&spa_namespace_lock); return (error); } - spa_async_resume(spa); - /* * Override any spares and level 2 cache devices as specified by * the user, as these may have correct device names/devids, etc. */ if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES, --- 4550,4559 ----
*** 4845,4854 **** --- 4585,4597 ---- spa_load_l2cache(spa); spa_config_exit(spa, SCL_ALL, FTAG); spa->spa_l2cache.sav_sync = B_TRUE; } + /* At this point, we can load spare props */ + (void) spa_load_vdev_props(spa); + /* * Check for any removed devices. */ if (spa->spa_autoreplace) { spa_aux_check_removed(&spa->spa_spares);
*** 4861,4893 **** */ spa_config_update(spa, SPA_CONFIG_UPDATE_POOL); } /* * It's possible that the pool was expanded while it was exported. * We kick off an async task to handle this for us. */ spa_async_request(spa, SPA_ASYNC_AUTOEXPAND); spa_history_log_version(spa, "import"); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT); mutex_exit(&spa_namespace_lock); return (0); } nvlist_t * spa_tryimport(nvlist_t *tryconfig) { nvlist_t *config = NULL; ! char *poolname, *cachefile; spa_t *spa; uint64_t state; int error; - zpool_rewind_policy_t policy; if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname)) return (NULL); if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state)) --- 4604,4652 ---- */ spa_config_update(spa, SPA_CONFIG_UPDATE_POOL); } /* + * Start async resume as late as possible to reduce I/O activity when + * importing a pool. This will let any pending txgs (e.g. from scrub + * or resilver) to complete quickly thereby reducing import times in + * such cases. + */ + spa_async_resume(spa); + + /* * It's possible that the pool was expanded while it was exported. * We kick off an async task to handle this for us. */ spa_async_request(spa, SPA_ASYNC_AUTOEXPAND); + /* Set/activate meta feature as needed */ + if (!spa_writeable(spa)) + spa_check_special_feature(spa); spa_history_log_version(spa, "import"); spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_IMPORT); mutex_exit(&spa_namespace_lock); + if (!spa_writeable(spa)) return (0); + + wbc_activate(spa, B_FALSE); + + return (dsl_sync_task(spa->spa_name, NULL, spa_special_feature_activate, + spa, 3, ZFS_SPACE_CHECK_RESERVED)); } nvlist_t * spa_tryimport(nvlist_t *tryconfig) { nvlist_t *config = NULL; ! char *poolname; spa_t *spa; uint64_t state; int error; if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_POOL_NAME, &poolname)) return (NULL); if (nvlist_lookup_uint64(tryconfig, ZPOOL_CONFIG_POOL_STATE, &state))
*** 4899,4932 **** mutex_enter(&spa_namespace_lock); spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL); spa_activate(spa, FREAD); /* ! * Rewind pool if a max txg was provided. Note that even though we ! * retrieve the complete rewind policy, only the rewind txg is relevant ! * for tryimport. */ ! zpool_get_rewind_policy(spa->spa_config, &policy); ! if (policy.zrp_txg != UINT64_MAX) { ! spa->spa_load_max_txg = policy.zrp_txg; ! spa->spa_extreme_rewind = B_TRUE; ! zfs_dbgmsg("spa_tryimport: importing %s, max_txg=%lld", ! poolname, (longlong_t)policy.zrp_txg); ! } else { ! zfs_dbgmsg("spa_tryimport: importing %s", poolname); ! } - if (nvlist_lookup_string(tryconfig, ZPOOL_CONFIG_CACHEFILE, &cachefile) - == 0) { - zfs_dbgmsg("spa_tryimport: using cachefile '%s'", cachefile); - spa->spa_config_source = SPA_CONFIG_SRC_CACHEFILE; - } else { - spa->spa_config_source = SPA_CONFIG_SRC_SCAN; - } - - error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING); - /* * If 'tryconfig' was at least parsable, return the current config. */ if (spa->spa_root_vdev != NULL) { config = spa_config_generate(spa, NULL, -1ULL, B_TRUE); --- 4658,4673 ---- mutex_enter(&spa_namespace_lock); spa = spa_add(TRYIMPORT_NAME, tryconfig, NULL); spa_activate(spa, FREAD); /* ! * Pass off the heavy lifting to spa_load(). ! * Pass TRUE for mosconfig because the user-supplied config ! * is actually the one to trust when doing an import. */ ! error = spa_load(spa, SPA_LOAD_TRYIMPORT, SPA_IMPORT_EXISTING, B_TRUE); /* * If 'tryconfig' was at least parsable, return the current config. */ if (spa->spa_root_vdev != NULL) { config = spa_config_generate(spa, NULL, -1ULL, B_TRUE);
*** 4997,5009 **** * configuration from the cache afterwards. If the 'hardforce' flag is set, then * we don't sync the labels or remove the configuration cache. */ static int spa_export_common(char *pool, int new_state, nvlist_t **oldconfig, ! boolean_t force, boolean_t hardforce) { spa_t *spa; if (oldconfig) *oldconfig = NULL; if (!(spa_mode_global & FWRITE)) --- 4738,4752 ---- * configuration from the cache afterwards. If the 'hardforce' flag is set, then * we don't sync the labels or remove the configuration cache. */ static int spa_export_common(char *pool, int new_state, nvlist_t **oldconfig, ! boolean_t force, boolean_t hardforce, boolean_t saveconfig) { spa_t *spa; + zfs_autosnap_t *autosnap; + boolean_t wbcthr_stopped = B_FALSE; if (oldconfig) *oldconfig = NULL; if (!(spa_mode_global & FWRITE))
*** 5014,5028 **** mutex_exit(&spa_namespace_lock); return (SET_ERROR(ENOENT)); } /* ! * Put a hold on the pool, drop the namespace lock, stop async tasks, ! * reacquire the namespace lock, and see if we can export. */ spa_open_ref(spa, FTAG); mutex_exit(&spa_namespace_lock); spa_async_suspend(spa); mutex_enter(&spa_namespace_lock); spa_close(spa, FTAG); /* --- 4757,4787 ---- mutex_exit(&spa_namespace_lock); return (SET_ERROR(ENOENT)); } /* ! * Put a hold on the pool, drop the namespace lock, stop async tasks ! * and write cache thread, reacquire the namespace lock, and see ! * if we can export. */ spa_open_ref(spa, FTAG); mutex_exit(&spa_namespace_lock); + + autosnap = spa_get_autosnap(spa); + mutex_enter(&autosnap->autosnap_lock); + + if (autosnap_has_children_zone(autosnap, + spa_name(spa), B_TRUE)) { + mutex_exit(&autosnap->autosnap_lock); + spa_close(spa, FTAG); + return (EBUSY); + } + + mutex_exit(&autosnap->autosnap_lock); + + wbcthr_stopped = wbc_stop_thread(spa); /* stop write cache thread */ + autosnap_destroyer_thread_stop(spa); spa_async_suspend(spa); mutex_enter(&spa_namespace_lock); spa_close(spa, FTAG); /*
*** 5045,5054 **** --- 4804,4816 ---- if (!spa_refcount_zero(spa) || (spa->spa_inject_ref != 0 && new_state != POOL_STATE_UNINITIALIZED)) { spa_async_resume(spa); mutex_exit(&spa_namespace_lock); + if (wbcthr_stopped) + (void) wbc_start_thread(spa); + autosnap_destroyer_thread_start(spa); return (SET_ERROR(EBUSY)); } /* * A pool cannot be exported if it has an active shared spare.
*** 5058,5067 **** --- 4820,4832 ---- */ if (!force && new_state == POOL_STATE_EXPORTED && spa_has_active_shared_spare(spa)) { spa_async_resume(spa); mutex_exit(&spa_namespace_lock); + if (wbcthr_stopped) + (void) wbc_start_thread(spa); + autosnap_destroyer_thread_start(spa); return (SET_ERROR(EXDEV)); } /* * We want this to be reflected on every label,
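Note: taken together, the export hunks stop the write-cache and autosnap-destroyer threads before attempting the export and restart them whenever the export is refused (open references, active shared spare). A rough sketch of that stop/attempt/undo shape, with invented helpers standing in for wbc_stop_thread() and friends:

    #include <stdbool.h>
    #include <stdio.h>
    #include <errno.h>

    /* Invented stand-ins for the background workers tied to a pool. */
    static bool stop_cache_thread(void)  { puts("cache thread stopped");  return true; }
    static void start_cache_thread(void) { puts("cache thread restarted"); }
    static void stop_destroyer(void)     { puts("destroyer stopped"); }
    static void start_destroyer(void)    { puts("destroyer restarted"); }

    static int export_pool(bool busy)
    {
        bool cache_was_running = stop_cache_thread();
        stop_destroyer();

        if (busy) {
            /* Export refused: put every worker back the way it was. */
            if (cache_was_running)
                start_cache_thread();
            start_destroyer();
            return (EBUSY);
        }

        /* ... proceed with unload, label sync, config removal ... */
        return (0);
    }

    int main(void)
    {
        printf("first try: %d\n", export_pool(true));
        printf("second try: %d\n", export_pool(false));
        return 0;
    }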
*** 5079,5098 **** } spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY); if (spa->spa_state != POOL_STATE_UNINITIALIZED) { spa_unload(spa); spa_deactivate(spa); } if (oldconfig && spa->spa_config) VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0); if (new_state != POOL_STATE_UNINITIALIZED) { if (!hardforce) ! spa_write_cachefile(spa, B_TRUE, B_TRUE); spa_remove(spa); } mutex_exit(&spa_namespace_lock); return (0); --- 4844,4866 ---- } spa_event_notify(spa, NULL, NULL, ESC_ZFS_POOL_DESTROY); if (spa->spa_state != POOL_STATE_UNINITIALIZED) { + wbc_deactivate(spa); + spa_unload(spa); spa_deactivate(spa); } if (oldconfig && spa->spa_config) VERIFY(nvlist_dup(spa->spa_config, oldconfig, 0) == 0); if (new_state != POOL_STATE_UNINITIALIZED) { if (!hardforce) ! spa_config_sync(spa, !saveconfig, B_TRUE); ! spa_remove(spa); } mutex_exit(&spa_namespace_lock); return (0);
*** 5103,5124 **** */ int spa_destroy(char *pool) { return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL, ! B_FALSE, B_FALSE)); } /* * Export a storage pool. */ int spa_export(char *pool, nvlist_t **oldconfig, boolean_t force, ! boolean_t hardforce) { return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig, ! force, hardforce)); } /* * Similar to spa_export(), this unloads the spa_t without actually removing it * from the namespace in any way. --- 4871,4892 ---- */ int spa_destroy(char *pool) { return (spa_export_common(pool, POOL_STATE_DESTROYED, NULL, ! B_FALSE, B_FALSE, B_FALSE)); } /* * Export a storage pool. */ int spa_export(char *pool, nvlist_t **oldconfig, boolean_t force, ! boolean_t hardforce, boolean_t saveconfig) { return (spa_export_common(pool, POOL_STATE_EXPORTED, oldconfig, ! force, hardforce, saveconfig)); } /* * Similar to spa_export(), this unloads the spa_t without actually removing it * from the namespace in any way.
*** 5125,5135 **** */ int spa_reset(char *pool) { return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL, ! B_FALSE, B_FALSE)); } /* * ========================================================================== * Device manipulation --- 4893,4903 ---- */ int spa_reset(char *pool) { return (spa_export_common(pool, POOL_STATE_UNINITIALIZED, NULL, ! B_FALSE, B_FALSE, B_FALSE)); } /* * ========================================================================== * Device manipulation
*** 5146,5155 **** --- 4914,4924 ---- int error; vdev_t *rvd = spa->spa_root_vdev; vdev_t *vd, *tvd; nvlist_t **spares, **l2cache; uint_t nspares, nl2cache; + dmu_tx_t *tx = NULL; ASSERT(spa_writeable(spa)); txg = spa_vdev_enter(spa);
*** 5180,5226 **** */ if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0) return (spa_vdev_exit(spa, vd, txg, error)); /* ! * If we are in the middle of a device removal, we can only add ! * devices which match the existing devices in the pool. ! * If we are in the middle of a removal, or have some indirect ! * vdevs, we can not add raidz toplevels. */ - if (spa->spa_vdev_removal != NULL || - spa->spa_removing_phys.sr_prev_indirect_vdev != -1) { for (int c = 0; c < vd->vdev_children; c++) { - tvd = vd->vdev_child[c]; - if (spa->spa_vdev_removal != NULL && - tvd->vdev_ashift != - spa->spa_vdev_removal->svr_vdev->vdev_ashift) { - return (spa_vdev_exit(spa, vd, txg, EINVAL)); - } - /* Fail if top level vdev is raidz */ - if (tvd->vdev_ops == &vdev_raidz_ops) { - return (spa_vdev_exit(spa, vd, txg, EINVAL)); - } - /* - * Need the top level mirror to be - * a mirror of leaf vdevs only - */ - if (tvd->vdev_ops == &vdev_mirror_ops) { - for (uint64_t cid = 0; - cid < tvd->vdev_children; cid++) { - vdev_t *cvd = tvd->vdev_child[cid]; - if (!cvd->vdev_ops->vdev_op_leaf) { - return (spa_vdev_exit(spa, vd, - txg, EINVAL)); - } - } - } - } - } - for (int c = 0; c < vd->vdev_children; c++) { - /* * Set the vdev id to the first hole, if one exists. */ for (id = 0; id < rvd->vdev_children; id++) { if (rvd->vdev_child[id]->vdev_ishole) { --- 4949,4962 ---- */ if ((error = spa_validate_aux(spa, nvroot, txg, VDEV_ALLOC_ADD)) != 0) return (spa_vdev_exit(spa, vd, txg, error)); /* ! * Transfer each new top-level vdev from vd to rvd. */ for (int c = 0; c < vd->vdev_children; c++) { /* * Set the vdev id to the first hole, if one exists. */ for (id = 0; id < rvd->vdev_children; id++) { if (rvd->vdev_child[id]->vdev_ishole) {
*** 5267,5276 **** --- 5003,5025 ---- mutex_enter(&spa_namespace_lock); spa_config_update(spa, SPA_CONFIG_UPDATE_POOL); spa_event_notify(spa, NULL, NULL, ESC_ZFS_VDEV_ADD); mutex_exit(&spa_namespace_lock); + /* + * "spa_last_synced_txg(spa) + 1" is used because: + * - spa_vdev_exit() calls txg_wait_synced() for "txg" + * - spa_config_update() calls txg_wait_synced() for + * "spa_last_synced_txg(spa) + 1" + */ + tx = dmu_tx_create_assigned(spa_get_dsl(spa), + spa_last_synced_txg(spa) + 1); + spa_special_feature_activate(spa, tx); + dmu_tx_commit(tx); + + wbc_activate(spa, B_FALSE); + return (0); } /* * Attach a device to a mirror. The arguments are the path to any device
*** 5300,5314 **** txg = spa_vdev_enter(spa); oldvd = spa_lookup_by_guid(spa, guid, B_FALSE); - if (spa->spa_vdev_removal != NULL || - spa->spa_removing_phys.sr_prev_indirect_vdev != -1) { - return (spa_vdev_exit(spa, NULL, txg, EBUSY)); - } - if (oldvd == NULL) return (spa_vdev_exit(spa, NULL, txg, ENODEV)); if (!oldvd->vdev_ops->vdev_op_leaf) return (spa_vdev_exit(spa, NULL, txg, ENOTSUP)); --- 5049,5058 ----
*** 5470,5479 **** --- 5214,5231 ---- spa_event_notify(spa, newvd, NULL, ESC_ZFS_BOOTFS_VDEV_ATTACH); spa_event_notify(spa, newvd, NULL, ESC_ZFS_VDEV_ATTACH); /* + * Check CoS property of the old vdev, add reference by new vdev + */ + if (oldvd->vdev_queue.vq_cos) { + cos_hold(oldvd->vdev_queue.vq_cos); + newvd->vdev_queue.vq_cos = oldvd->vdev_queue.vq_cos; + } + + /* * Commit the config */ (void) spa_vdev_exit(spa, newrootvd, dtl_max_txg, 0); spa_history_log_internal(spa, "vdev attach", NULL,
*** 5683,5692 **** --- 5435,5452 ---- vd->vdev_detached = B_TRUE; vdev_dirty(tvd, VDD_DTL, vd, txg); spa_event_notify(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE); + /* + * Release the references to CoS descriptors if any + */ + if (vd->vdev_queue.vq_cos) { + cos_rele(vd->vdev_queue.vq_cos); + vd->vdev_queue.vq_cos = NULL; + } + /* hang on to the spa before we release the lock */ spa_open_ref(spa, FTAG); error = spa_vdev_exit(spa, vd, txg, 0);
*** 5745,5760 **** vdev_t *rvd, **vml = NULL; /* vdev modify list */ boolean_t activate_slog; ASSERT(spa_writeable(spa)); txg = spa_vdev_enter(spa); /* clear the log and flush everything up to now */ activate_slog = spa_passivate_log(spa); (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG); ! error = spa_reset_logs(spa); txg = spa_vdev_config_enter(spa); if (activate_slog) spa_activate_log(spa); --- 5505,5527 ---- vdev_t *rvd, **vml = NULL; /* vdev modify list */ boolean_t activate_slog; ASSERT(spa_writeable(spa)); + /* + * split for pools with activated WBC + * will be implemented in the next release + */ + if (spa_feature_is_active(spa, SPA_FEATURE_WBC)) + return (SET_ERROR(ENOTSUP)); + txg = spa_vdev_enter(spa); /* clear the log and flush everything up to now */ activate_slog = spa_passivate_log(spa); (void) spa_vdev_config_exit(spa, NULL, txg, 0, FTAG); ! error = spa_offline_log(spa); txg = spa_vdev_config_enter(spa); if (activate_slog) spa_activate_log(spa);
*** 5778,5788 **** lastlog = 0; for (c = 0; c < rvd->vdev_children; c++) { vdev_t *vd = rvd->vdev_child[c]; /* don't count the holes & logs as children */ ! if (vd->vdev_islog || !vdev_is_concrete(vd)) { if (lastlog == 0) lastlog = c; continue; } --- 5545,5555 ---- lastlog = 0; for (c = 0; c < rvd->vdev_children; c++) { vdev_t *vd = rvd->vdev_child[c]; /* don't count the holes & logs as children */ ! if (vd->vdev_islog || vd->vdev_ishole) { if (lastlog == 0) lastlog = c; continue; }
*** 5831,5841 **** } /* make sure there's nothing stopping the split */ if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops || vml[c]->vdev_islog || ! !vdev_is_concrete(vml[c]) || vml[c]->vdev_isspare || vml[c]->vdev_isl2cache || !vdev_writeable(vml[c]) || vml[c]->vdev_children != 0 || vml[c]->vdev_state != VDEV_STATE_HEALTHY || --- 5598,5608 ---- } /* make sure there's nothing stopping the split */ if (vml[c]->vdev_parent->vdev_ops != &vdev_mirror_ops || vml[c]->vdev_islog || ! vml[c]->vdev_ishole || vml[c]->vdev_isspare || vml[c]->vdev_isl2cache || !vdev_writeable(vml[c]) || vml[c]->vdev_children != 0 || vml[c]->vdev_state != VDEV_STATE_HEALTHY ||
*** 5926,5939 **** zio_handle_panic_injection(spa, FTAG, 1); spa_activate(newspa, spa_mode_global); spa_async_suspend(newspa); - newspa->spa_config_source = SPA_CONFIG_SRC_SPLIT; - /* create the new pool from the disks of the original pool */ ! error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE); if (error) goto out; /* if that worked, generate a real config for the new pool */ if (newspa->spa_root_vdev != NULL) { --- 5693,5704 ---- zio_handle_panic_injection(spa, FTAG, 1); spa_activate(newspa, spa_mode_global); spa_async_suspend(newspa); /* create the new pool from the disks of the original pool */ ! error = spa_load(newspa, SPA_LOAD_IMPORT, SPA_IMPORT_ASSEMBLE, B_TRUE); if (error) goto out; /* if that worked, generate a real config for the new pool */ if (newspa->spa_root_vdev != NULL) {
*** 5969,5978 **** --- 5734,5755 ---- error = dmu_tx_assign(tx, TXG_WAIT); if (error != 0) dmu_tx_abort(tx); for (c = 0; c < children; c++) { if (vml[c] != NULL) { + vdev_t *tvd = vml[c]->vdev_top; + + /* + * Need to be sure the detachable VDEV is not + * on any *other* txg's DTL list to prevent it + * from being accessed after it's freed. + */ + for (int t = 0; t < TXG_SIZE; t++) { + (void) txg_list_remove_this( + &tvd->vdev_dtl_list, vml[c], t); + } + vdev_split(vml[c]); if (error == 0) spa_history_log_internal(spa, "detach", tx, "vdev=%s", vml[c]->vdev_path);
*** 5997,6007 **** kmem_free(vml, children * sizeof (vdev_t *)); /* if we're not going to mount the filesystems in userland, export */ if (exp) error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL, ! B_FALSE, B_FALSE); return (error); out: spa_unload(newspa); --- 5774,5784 ---- kmem_free(vml, children * sizeof (vdev_t *)); /* if we're not going to mount the filesystems in userland, export */ if (exp) error = spa_export_common(newname, POOL_STATE_EXPORTED, NULL, ! B_FALSE, B_FALSE, B_FALSE); return (error); out: spa_unload(newspa);
*** 6023,6033 **** --- 5800,6107 ---- kmem_free(vml, children * sizeof (vdev_t *)); return (error); } + static nvlist_t * + spa_nvlist_lookup_by_guid(nvlist_t **nvpp, int count, uint64_t target_guid) + { + for (int i = 0; i < count; i++) { + uint64_t guid; + + VERIFY(nvlist_lookup_uint64(nvpp[i], ZPOOL_CONFIG_GUID, + &guid) == 0); + + if (guid == target_guid) + return (nvpp[i]); + } + + return (NULL); + } + + static void + spa_vdev_remove_aux(nvlist_t *config, char *name, nvlist_t **dev, int count, + nvlist_t *dev_to_remove) + { + nvlist_t **newdev = NULL; + + if (count > 1) + newdev = kmem_alloc((count - 1) * sizeof (void *), KM_SLEEP); + + for (int i = 0, j = 0; i < count; i++) { + if (dev[i] == dev_to_remove) + continue; + VERIFY(nvlist_dup(dev[i], &newdev[j++], KM_SLEEP) == 0); + } + + VERIFY(nvlist_remove(config, name, DATA_TYPE_NVLIST_ARRAY) == 0); + VERIFY(nvlist_add_nvlist_array(config, name, newdev, count - 1) == 0); + + for (int i = 0; i < count - 1; i++) + nvlist_free(newdev[i]); + + if (count > 1) + kmem_free(newdev, (count - 1) * sizeof (void *)); + } + /* + * Evacuate the device. + */ + static int + spa_vdev_remove_evacuate(spa_t *spa, vdev_t *vd) + { + uint64_t txg; + int error = 0; + + ASSERT(MUTEX_HELD(&spa_namespace_lock)); + ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == 0); + ASSERT(vd == vd->vdev_top); + + /* + * Evacuate the device. We don't hold the config lock as writer + * since we need to do I/O but we do keep the + * spa_namespace_lock held. Once this completes the device + * should no longer have any blocks allocated on it. + */ + if (vd->vdev_islog) { + if (vd->vdev_stat.vs_alloc != 0) + error = spa_offline_log(spa); + } else { + error = SET_ERROR(ENOTSUP); + } + + if (error) + return (error); + + /* + * The evacuation succeeded. Remove any remaining MOS metadata + * associated with this vdev, and wait for these changes to sync. + */ + ASSERT0(vd->vdev_stat.vs_alloc); + txg = spa_vdev_config_enter(spa); + vd->vdev_removing = B_TRUE; + vdev_dirty_leaves(vd, VDD_DTL, txg); + vdev_config_dirty(vd); + spa_vdev_config_exit(spa, NULL, txg, 0, FTAG); + + return (0); + } + + /* + * Complete the removal by cleaning up the namespace. + */ + static void + spa_vdev_remove_from_namespace(spa_t *spa, vdev_t *vd) + { + vdev_t *rvd = spa->spa_root_vdev; + uint64_t id = vd->vdev_id; + boolean_t last_vdev = (id == (rvd->vdev_children - 1)); + + ASSERT(MUTEX_HELD(&spa_namespace_lock)); + ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL); + ASSERT(vd == vd->vdev_top); + + /* + * Only remove any devices which are empty. + */ + if (vd->vdev_stat.vs_alloc != 0) + return; + + (void) vdev_label_init(vd, 0, VDEV_LABEL_REMOVE); + + if (list_link_active(&vd->vdev_state_dirty_node)) + vdev_state_clean(vd); + if (list_link_active(&vd->vdev_config_dirty_node)) + vdev_config_clean(vd); + + vdev_free(vd); + + if (last_vdev) { + vdev_compact_children(rvd); + } else { + vd = vdev_alloc_common(spa, id, 0, &vdev_hole_ops); + vdev_add_child(rvd, vd); + } + vdev_config_dirty(rvd); + + /* + * Reassess the health of our root vdev. + */ + vdev_reopen(rvd); + } + + /* + * Remove a device from the pool - + * + * Removing a device from the vdev namespace requires several steps + * and can take a significant amount of time. As a result we use + * the spa_vdev_config_[enter/exit] functions which allow us to + * grab and release the spa_config_lock while still holding the namespace + * lock. During each step the configuration is synced out. 
+ * + * Currently, this supports removing only hot spares, slogs, level 2 ARC + * and special devices. + */ + int + spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare) + { + vdev_t *vd; + sysevent_t *ev = NULL; + metaslab_group_t *mg; + nvlist_t **spares, **l2cache, *nv; + uint64_t txg = 0; + uint_t nspares, nl2cache; + int error = 0; + boolean_t locked = MUTEX_HELD(&spa_namespace_lock); + + ASSERT(spa_writeable(spa)); + + if (!locked) + txg = spa_vdev_enter(spa); + + vd = spa_lookup_by_guid(spa, guid, B_FALSE); + + if (spa->spa_spares.sav_vdevs != NULL && + nvlist_lookup_nvlist_array(spa->spa_spares.sav_config, + ZPOOL_CONFIG_SPARES, &spares, &nspares) == 0 && + (nv = spa_nvlist_lookup_by_guid(spares, nspares, guid)) != NULL) { + /* + * Only remove the hot spare if it's not currently in use + * in this pool. + */ + if (vd == NULL || unspare) { + if (vd == NULL) + vd = spa_lookup_by_guid(spa, guid, B_TRUE); + + /* + * Release the references to CoS descriptors if any + */ + if (vd != NULL && vd->vdev_queue.vq_cos) { + cos_rele(vd->vdev_queue.vq_cos); + vd->vdev_queue.vq_cos = NULL; + } + + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX); + spa_vdev_remove_aux(spa->spa_spares.sav_config, + ZPOOL_CONFIG_SPARES, spares, nspares, nv); + spa_load_spares(spa); + spa->spa_spares.sav_sync = B_TRUE; + } else { + error = SET_ERROR(EBUSY); + } + } else if (spa->spa_l2cache.sav_vdevs != NULL && + nvlist_lookup_nvlist_array(spa->spa_l2cache.sav_config, + ZPOOL_CONFIG_L2CACHE, &l2cache, &nl2cache) == 0 && + (nv = spa_nvlist_lookup_by_guid(l2cache, nl2cache, guid)) != NULL) { + /* + * Cache devices can always be removed. + */ + if (vd == NULL) + vd = spa_lookup_by_guid(spa, guid, B_TRUE); + /* + * Release the references to CoS descriptors if any + */ + if (vd != NULL && vd->vdev_queue.vq_cos) { + cos_rele(vd->vdev_queue.vq_cos); + vd->vdev_queue.vq_cos = NULL; + } + + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_AUX); + spa_vdev_remove_aux(spa->spa_l2cache.sav_config, + ZPOOL_CONFIG_L2CACHE, l2cache, nl2cache, nv); + spa_load_l2cache(spa); + spa->spa_l2cache.sav_sync = B_TRUE; + } else if (vd != NULL && vd->vdev_islog) { + ASSERT(!locked); + + if (vd != vd->vdev_top) + return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP))); + + mg = vd->vdev_mg; + + /* + * Stop allocating from this vdev. + */ + metaslab_group_passivate(mg); + + /* + * Wait for the youngest allocations and frees to sync, + * and then wait for the deferral of those frees to finish. + */ + spa_vdev_config_exit(spa, NULL, + txg + TXG_CONCURRENT_STATES + TXG_DEFER_SIZE, 0, FTAG); + + /* + * Attempt to evacuate the vdev. + */ + error = spa_vdev_remove_evacuate(spa, vd); + + txg = spa_vdev_config_enter(spa); + + /* + * If we couldn't evacuate the vdev, unwind. + */ + if (error) { + metaslab_group_activate(mg); + return (spa_vdev_exit(spa, NULL, txg, error)); + } + + /* + * Release the references to CoS descriptors if any + */ + if (vd->vdev_queue.vq_cos) { + cos_rele(vd->vdev_queue.vq_cos); + vd->vdev_queue.vq_cos = NULL; + } + + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV); + + /* + * Clean up the vdev namespace. 
+ */ + spa_vdev_remove_from_namespace(spa, vd); + + } else if (vd != NULL && vdev_is_special(vd)) { + ASSERT(!locked); + + if (vd != vd->vdev_top) + return (spa_vdev_exit(spa, NULL, txg, SET_ERROR(ENOTSUP))); + + error = spa_special_vdev_remove(spa, vd, &txg); + if (error == 0) { + ev = spa_event_create(spa, vd, NULL, ESC_ZFS_VDEV_REMOVE_DEV); + spa_vdev_remove_from_namespace(spa, vd); + + /* + * User sees this field as 'enablespecial' + * pool-level property + */ + spa->spa_usesc = B_FALSE; + } + } else if (vd != NULL) { + /* + * Normal vdevs cannot be removed (yet). + */ + error = SET_ERROR(ENOTSUP); + } else { + /* + * There is no vdev of any kind with the specified guid. + */ + error = SET_ERROR(ENOENT); + } + + if (!locked) + error = spa_vdev_exit(spa, NULL, txg, error); + + if (ev) + spa_event_notify_impl(ev); + + return (error); + } + + /* * Find any device that's done replacing, or a vdev marked 'unspare' that's * currently spared, so we can detach it. */ static vdev_t * spa_vdev_resilver_done_hunt(vdev_t *vd)
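Note: spa_vdev_remove_aux() in this hunk rebuilds the spare/l2cache nvlist array by copying every entry except the one being removed into a freshly allocated (count - 1)-element array and swapping it into the config. The same copy-all-but-one idea in plain C, with integers standing in for the nvlists:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int devs[] = { 11, 22, 33, 44 };      /* stand-ins for aux-device nvlists */
        int count = 4;
        int remove_idx = 2;                   /* the device being removed */

        /* Allocate count - 1 slots and copy everything except the victim. */
        int *newdev = malloc((count - 1) * sizeof (int));
        if (newdev == NULL)
            return 1;
        for (int i = 0, j = 0; i < count; i++) {
            if (i == remove_idx)
                continue;
            newdev[j++] = devs[i];
        }

        for (int i = 0; i < count - 1; i++)
            printf("remaining[%d] = %d\n", i, newdev[i]);
        free(newdev);
        return 0;
    }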
*** 6060,6069 **** --- 6134,6144 ---- return (oldvd); } /* * Check for a completed resilver with the 'unspare' flag set. + * Also potentially update faulted state. */ if (vd->vdev_ops == &vdev_spare_ops) { vdev_t *first = vd->vdev_child[0]; vdev_t *last = vd->vdev_child[vd->vdev_children - 1];
*** 6081,6090 **** --- 6156,6167 ---- vdev_dtl_empty(newvd, DTL_MISSING) && vdev_dtl_empty(newvd, DTL_OUTAGE) && !vdev_dtl_required(oldvd)) return (oldvd); + vdev_propagate_state(vd); + /* * If there are more than two spares attached to a disk, * and those spares are not required, then we want to * attempt to free them up now so that they can be used * by other pools. Once we're back down to a single
*** 6141,6202 **** spa_config_exit(spa, SCL_ALL, FTAG); } /* - * Update the stored path or FRU for this vdev. - */ - int - spa_vdev_set_common(spa_t *spa, uint64_t guid, const char *value, - boolean_t ispath) - { - vdev_t *vd; - boolean_t sync = B_FALSE; - - ASSERT(spa_writeable(spa)); - - spa_vdev_state_enter(spa, SCL_ALL); - - if ((vd = spa_lookup_by_guid(spa, guid, B_TRUE)) == NULL) - return (spa_vdev_state_exit(spa, NULL, ENOENT)); - - if (!vd->vdev_ops->vdev_op_leaf) - return (spa_vdev_state_exit(spa, NULL, ENOTSUP)); - - if (ispath) { - if (strcmp(value, vd->vdev_path) != 0) { - spa_strfree(vd->vdev_path); - vd->vdev_path = spa_strdup(value); - sync = B_TRUE; - } - } else { - if (vd->vdev_fru == NULL) { - vd->vdev_fru = spa_strdup(value); - sync = B_TRUE; - } else if (strcmp(value, vd->vdev_fru) != 0) { - spa_strfree(vd->vdev_fru); - vd->vdev_fru = spa_strdup(value); - sync = B_TRUE; - } - } - - return (spa_vdev_state_exit(spa, sync ? vd : NULL, 0)); - } - - int - spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath) - { - return (spa_vdev_set_common(spa, guid, newpath, B_TRUE)); - } - - int - spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru) - { - return (spa_vdev_set_common(spa, guid, newfru, B_FALSE)); - } - - /* * ========================================================================== * SPA Scanning * ========================================================================== */ int --- 6218,6227 ----
*** 6389,6398 **** --- 6414,6435 ---- */ if (tasks & SPA_ASYNC_RESILVER) dsl_resilver_restart(spa->spa_dsl_pool, 0); /* + * Kick off L2 cache rebuilding. + */ + if (tasks & SPA_ASYNC_L2CACHE_REBUILD) + l2arc_spa_rebuild_start(spa); + + if (tasks & SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY) { + mutex_enter(&spa->spa_man_trim_lock); + spa_man_trim_taskq_destroy(spa); + mutex_exit(&spa->spa_man_trim_lock); + } + + /* * Let the world know that we're done. */ mutex_enter(&spa->spa_async_lock); spa->spa_async_thread = NULL; cv_broadcast(&spa->spa_async_cv);
*** 6406,6435 **** mutex_enter(&spa->spa_async_lock); spa->spa_async_suspended++; while (spa->spa_async_thread != NULL) cv_wait(&spa->spa_async_cv, &spa->spa_async_lock); mutex_exit(&spa->spa_async_lock); - - spa_vdev_remove_suspend(spa); - - zthr_t *condense_thread = spa->spa_condense_zthr; - if (condense_thread != NULL && zthr_isrunning(condense_thread)) - VERIFY0(zthr_cancel(condense_thread)); } void spa_async_resume(spa_t *spa) { mutex_enter(&spa->spa_async_lock); ASSERT(spa->spa_async_suspended != 0); spa->spa_async_suspended--; mutex_exit(&spa->spa_async_lock); - spa_restart_removal(spa); - - zthr_t *condense_thread = spa->spa_condense_zthr; - if (condense_thread != NULL && !zthr_isrunning(condense_thread)) - zthr_resume(condense_thread); } static boolean_t spa_async_tasks_pending(spa_t *spa) { --- 6443,6461 ----
*** 6470,6479 **** --- 6496,6514 ---- mutex_enter(&spa->spa_async_lock); spa->spa_async_tasks |= task; mutex_exit(&spa->spa_async_lock); } + void + spa_async_unrequest(spa_t *spa, int task) + { + zfs_dbgmsg("spa=%s async unrequest task=%u", spa->spa_name, task); + mutex_enter(&spa->spa_async_lock); + spa->spa_async_tasks &= ~task; + mutex_exit(&spa->spa_async_lock); + } + /* * ========================================================================== * SPA syncing routines * ========================================================================== */
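Note: spa_async_unrequest() added here is simply the counterpart of spa_async_request(); both flip bits in spa_async_tasks under spa_async_lock. A minimal illustration of that request/unrequest bitmask; the task bit values are invented:

    #include <pthread.h>
    #include <stdio.h>

    #define TASK_AUTOEXPAND          0x01   /* invented task bits */
    #define TASK_TRIM_TASKQ_DESTROY  0x02

    static pthread_mutex_t async_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned async_tasks;

    static void async_request(unsigned task)
    {
        pthread_mutex_lock(&async_lock);
        async_tasks |= task;                /* mark the task as pending */
        pthread_mutex_unlock(&async_lock);
    }

    static void async_unrequest(unsigned task)
    {
        pthread_mutex_lock(&async_lock);
        async_tasks &= ~task;               /* withdraw a pending task */
        pthread_mutex_unlock(&async_lock);
    }

    int main(void)
    {
        async_request(TASK_AUTOEXPAND | TASK_TRIM_TASKQ_DESTROY);
        async_unrequest(TASK_TRIM_TASKQ_DESTROY);
        printf("pending tasks: 0x%x\n", async_tasks);   /* prints 0x1 */
        return 0;
    }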
*** 6758,6767 **** --- 6793,6803 ---- static void spa_sync_props(void *arg, dmu_tx_t *tx) { nvlist_t *nvp = arg; spa_t *spa = dmu_tx_pool(tx)->dp_spa; + spa_meta_placement_t *mp = &spa->spa_meta_policy; objset_t *mos = spa->spa_meta_objset; nvpair_t *elem = NULL; mutex_enter(&spa->spa_props_lock);
*** 6772,6782 **** const char *propname; zprop_type_t proptype; spa_feature_t fid; switch (prop = zpool_name_to_prop(nvpair_name(elem))) { ! case ZPOOL_PROP_INVAL: /* * We checked this earlier in spa_prop_validate(). */ ASSERT(zpool_prop_feature(nvpair_name(elem))); --- 6808,6818 ---- const char *propname; zprop_type_t proptype; spa_feature_t fid; switch (prop = zpool_name_to_prop(nvpair_name(elem))) { ! case ZPROP_INVAL: /* * We checked this earlier in spa_prop_validate(). */ ASSERT(zpool_prop_feature(nvpair_name(elem)));
*** 6870,6894 **** --- 6906,6991 ---- switch (prop) { case ZPOOL_PROP_DELEGATION: spa->spa_delegation = intval; break; + case ZPOOL_PROP_DDT_DESEGREGATION: + spa_set_ddt_classes(spa, intval); + break; + case ZPOOL_PROP_DEDUP_BEST_EFFORT: + spa->spa_dedup_best_effort = intval; + break; + case ZPOOL_PROP_DEDUP_LO_BEST_EFFORT: + spa->spa_dedup_lo_best_effort = intval; + break; + case ZPOOL_PROP_DEDUP_HI_BEST_EFFORT: + spa->spa_dedup_hi_best_effort = intval; + break; case ZPOOL_PROP_BOOTFS: spa->spa_bootfs = intval; break; case ZPOOL_PROP_FAILUREMODE: spa->spa_failmode = intval; break; + case ZPOOL_PROP_FORCETRIM: + spa->spa_force_trim = intval; + break; + case ZPOOL_PROP_AUTOTRIM: + mutex_enter(&spa->spa_auto_trim_lock); + if (intval != spa->spa_auto_trim) { + spa->spa_auto_trim = intval; + if (intval != 0) + spa_auto_trim_taskq_create(spa); + else + spa_auto_trim_taskq_destroy( + spa); + } + mutex_exit(&spa->spa_auto_trim_lock); + break; case ZPOOL_PROP_AUTOEXPAND: spa->spa_autoexpand = intval; if (tx->tx_txg != TXG_INITIAL) spa_async_request(spa, SPA_ASYNC_AUTOEXPAND); break; case ZPOOL_PROP_DEDUPDITTO: spa->spa_dedup_ditto = intval; break; + case ZPOOL_PROP_MINWATERMARK: + spa->spa_minwat = intval; + break; + case ZPOOL_PROP_LOWATERMARK: + spa->spa_lowat = intval; + break; + case ZPOOL_PROP_HIWATERMARK: + spa->spa_hiwat = intval; + break; + case ZPOOL_PROP_DEDUPMETA_DITTO: + spa->spa_ddt_meta_copies = intval; + break; + case ZPOOL_PROP_META_PLACEMENT: + mp->spa_enable_meta_placement_selection = + intval; + break; + case ZPOOL_PROP_SYNC_TO_SPECIAL: + mp->spa_sync_to_special = intval; + break; + case ZPOOL_PROP_DDT_META_TO_METADEV: + mp->spa_ddt_meta_to_special = intval; + break; + case ZPOOL_PROP_ZFS_META_TO_METADEV: + mp->spa_zfs_meta_to_special = intval; + break; + case ZPOOL_PROP_SMALL_DATA_TO_METADEV: + mp->spa_small_data_to_special = intval; + break; + case ZPOOL_PROP_RESILVER_PRIO: + spa->spa_resilver_prio = intval; + break; + case ZPOOL_PROP_SCRUB_PRIO: + spa->spa_scrub_prio = intval; + break; default: break; } }
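Note: among the new property cases, ZPOOL_PROP_AUTOTRIM is the one with side effects: the taskq is created or destroyed only when the stored value actually changes, all under spa_auto_trim_lock. A compact sketch of an edge-triggered toggle like that, locking omitted and helper names invented:

    #include <stdbool.h>
    #include <stdio.h>

    static bool auto_trim_on;      /* current in-core setting */

    static void taskq_create(void)  { puts("auto-trim taskq created"); }
    static void taskq_destroy(void) { puts("auto-trim taskq destroyed"); }

    /* Act only on a real transition; rewriting the same value is a no-op. */
    static void set_auto_trim(bool newval)
    {
        if (newval == auto_trim_on)
            return;
        auto_trim_on = newval;
        if (newval)
            taskq_create();
        else
            taskq_destroy();
    }

    int main(void)
    {
        set_auto_trim(true);    /* creates the taskq */
        set_auto_trim(true);    /* no-op */
        set_auto_trim(false);   /* destroys it */
        return 0;
    }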
*** 6970,7009 **** rrw_exit(&dp->dp_config_rwlock, FTAG); } static void ! vdev_indirect_state_sync_verify(vdev_t *vd) { ! vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping; ! vdev_indirect_births_t *vib = vd->vdev_indirect_births; ! if (vd->vdev_ops == &vdev_indirect_ops) { ! ASSERT(vim != NULL); ! ASSERT(vib != NULL); ! } ! if (vdev_obsolete_sm_object(vd) != 0) { ! ASSERT(vd->vdev_obsolete_sm != NULL); ! ASSERT(vd->vdev_removing || ! vd->vdev_ops == &vdev_indirect_ops); ! ASSERT(vdev_indirect_mapping_num_entries(vim) > 0); ! ASSERT(vdev_indirect_mapping_bytes_mapped(vim) > 0); ! ASSERT3U(vdev_obsolete_sm_object(vd), ==, ! space_map_object(vd->vdev_obsolete_sm)); ! ASSERT3U(vdev_indirect_mapping_bytes_mapped(vim), >=, ! space_map_allocated(vd->vdev_obsolete_sm)); } ! ASSERT(vd->vdev_obsolete_segments != NULL); ! /* ! * Since frees / remaps to an indirect vdev can only ! * happen in syncing context, the obsolete segments ! * tree must be empty when we start syncing. ! */ ! ASSERT0(range_tree_space(vd->vdev_obsolete_segments)); } /* * Sync the specified transaction group. New blocks may be dirtied as * part of the process, so we iterate until it converges. --- 7067,7115 ---- rrw_exit(&dp->dp_config_rwlock, FTAG); } static void ! spa_initialize_alloc_trees(spa_t *spa, uint32_t max_queue_depth, ! uint64_t queue_depth_total) { ! vdev_t *rvd = spa->spa_root_vdev; ! boolean_t dva_throttle_enabled = zio_dva_throttle_enabled; ! metaslab_class_t *mcs[2] = { ! spa_normal_class(spa), ! spa_special_class(spa) ! }; ! size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *); ! for (size_t i = 0; i < mcs_len; i++) { ! metaslab_class_t *mc = mcs[i]; ! ASSERT0(refcount_count(&mc->mc_alloc_slots)); ! mc->mc_alloc_max_slots = queue_depth_total; ! mc->mc_alloc_throttle_enabled = dva_throttle_enabled; ! ASSERT3U(mc->mc_alloc_max_slots, <=, ! max_queue_depth * rvd->vdev_children); } ! } ! static void ! spa_check_alloc_trees(spa_t *spa) ! { ! metaslab_class_t *mcs[2] = { ! spa_normal_class(spa), ! spa_special_class(spa) ! }; ! size_t mcs_len = sizeof (mcs) / sizeof (metaslab_class_t *); ! ! for (size_t i = 0; i < mcs_len; i++) { ! metaslab_class_t *mc = mcs[i]; ! ! mutex_enter(&mc->mc_alloc_lock); ! VERIFY0(avl_numnodes(&mc->mc_alloc_tree)); ! mutex_exit(&mc->mc_alloc_lock); ! } } /* * Sync the specified transaction group. New blocks may be dirtied as * part of the process, so we iterate until it converges.
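Note: spa_initialize_alloc_trees() and spa_check_alloc_trees() apply identical setup and verification to both the normal and the special metaslab class by walking a two-element array of class pointers instead of duplicating the code. The same shape in miniature, with a made-up struct and figure:

    #include <stdio.h>

    struct alloc_class {
        const char *name;
        unsigned long max_slots;
        int throttle_enabled;
    };

    int main(void)
    {
        struct alloc_class normal  = { "normal",  0, 0 };
        struct alloc_class special = { "special", 0, 0 };
        struct alloc_class *classes[2] = { &normal, &special };
        unsigned long queue_depth_total = 1000;   /* invented figure */

        /* One loop configures every class the same way. */
        for (size_t i = 0; i < sizeof (classes) / sizeof (classes[0]); i++) {
            classes[i]->max_slots = queue_depth_total;
            classes[i]->throttle_enabled = 1;
        }

        for (size_t i = 0; i < 2; i++)
            printf("%s: max_slots=%lu throttle=%d\n", classes[i]->name,
                classes[i]->max_slots, classes[i]->throttle_enabled);
        return 0;
    }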
*** 7022,7050 **** zfs_vdev_queue_depth_pct / 100; VERIFY(spa_writeable(spa)); /* - * Wait for i/os issued in open context that need to complete - * before this txg syncs. - */ - VERIFY0(zio_wait(spa->spa_txg_zio[txg & TXG_MASK])); - spa->spa_txg_zio[txg & TXG_MASK] = zio_root(spa, NULL, NULL, 0); - - /* * Lock out configuration changes. */ spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); spa->spa_syncing_txg = txg; spa->spa_sync_pass = 0; ! mutex_enter(&spa->spa_alloc_lock); ! VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); ! mutex_exit(&spa->spa_alloc_lock); /* * If there are any pending vdev state changes, convert them * into config changes that go out with this transaction group. */ spa_config_enter(spa, SCL_STATE, FTAG, RW_READER); while (list_head(&spa->spa_state_dirty_list) != NULL) { --- 7128,7158 ---- zfs_vdev_queue_depth_pct / 100; VERIFY(spa_writeable(spa)); /* * Lock out configuration changes. */ spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); spa->spa_syncing_txg = txg; spa->spa_sync_pass = 0; ! spa_check_alloc_trees(spa); /* + * Another pool management task might be currently preventing + * from starting and the current txg sync was invoked on its behalf, + * so be prepared to postpone autotrim processing. + */ + if (mutex_tryenter(&spa->spa_auto_trim_lock)) { + if (spa->spa_auto_trim == SPA_AUTO_TRIM_ON) + spa_auto_trim(spa, txg); + mutex_exit(&spa->spa_auto_trim_lock); + } + + /* * If there are any pending vdev state changes, convert them * into config changes that go out with this transaction group. */ spa_config_enter(spa, SCL_STATE, FTAG, RW_READER); while (list_head(&spa->spa_state_dirty_list) != NULL) {
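Note: the spa_sync() hunk uses mutex_tryenter() so that auto-trim processing is simply skipped for this txg if another management task already holds spa_auto_trim_lock, rather than stalling the sync thread. The same opportunistic try-lock pattern with pthreads; the names are illustrative only:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t auto_trim_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool auto_trim_on = true;

    static void dispatch_auto_trim(unsigned long txg)
    {
        printf("auto-trim dispatched for txg %lu\n", txg);
    }

    static void sync_one_txg(unsigned long txg)
    {
        /* Never block the sync path: trim only if the lock is free right now. */
        if (pthread_mutex_trylock(&auto_trim_lock) == 0) {
            if (auto_trim_on)
                dispatch_auto_trim(txg);
            pthread_mutex_unlock(&auto_trim_lock);
        }
        /* ... rest of the txg sync work ... */
    }

    int main(void)
    {
        sync_one_txg(42);
        return 0;
    }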
*** 7115,7145 **** */ ASSERT0(refcount_count(&mg->mg_alloc_queue_depth)); mg->mg_max_alloc_queue_depth = max_queue_depth; queue_depth_total += mg->mg_max_alloc_queue_depth; } - metaslab_class_t *mc = spa_normal_class(spa); - ASSERT0(refcount_count(&mc->mc_alloc_slots)); - mc->mc_alloc_max_slots = queue_depth_total; - mc->mc_alloc_throttle_enabled = zio_dva_throttle_enabled; ! ASSERT3U(mc->mc_alloc_max_slots, <=, ! max_queue_depth * rvd->vdev_children); - for (int c = 0; c < rvd->vdev_children; c++) { - vdev_t *vd = rvd->vdev_child[c]; - vdev_indirect_state_sync_verify(vd); - - if (vdev_indirect_should_condense(vd)) { - spa_condense_indirect_start_sync(vd, tx); - break; - } - } - /* * Iterate to convergence. */ do { int pass = ++spa->spa_sync_pass; spa_sync_config_object(spa, tx); spa_sync_aux_dev(spa, &spa->spa_spares, tx, --- 7223,7252 ---- */ ASSERT0(refcount_count(&mg->mg_alloc_queue_depth)); mg->mg_max_alloc_queue_depth = max_queue_depth; queue_depth_total += mg->mg_max_alloc_queue_depth; } ! spa_initialize_alloc_trees(spa, max_queue_depth, ! queue_depth_total); /* * Iterate to convergence. */ + + zfs_autosnap_t *autosnap = spa_get_autosnap(dp->dp_spa); + mutex_enter(&autosnap->autosnap_lock); + + autosnap_zone_t *zone = list_head(&autosnap->autosnap_zones); + while (zone != NULL) { + zone->created = B_FALSE; + zone->dirty = B_FALSE; + zone = list_next(&autosnap->autosnap_zones, zone); + } + + mutex_exit(&autosnap->autosnap_lock); + do { int pass = ++spa->spa_sync_pass; spa_sync_config_object(spa, tx); spa_sync_aux_dev(spa, &spa->spa_spares, tx,
*** 7162,7176 **** } ddt_sync(spa, txg); dsl_scan_sync(dp, tx); ! if (spa->spa_vdev_removal != NULL) ! svr_sync(spa, tx); ! ! while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg)) ! != NULL) vdev_sync(vd, txg); if (pass == 1) { spa_sync_upgrades(spa, tx); ASSERT3U(txg, >=, --- 7269,7279 ---- } ddt_sync(spa, txg); dsl_scan_sync(dp, tx); ! while (vd = txg_list_remove(&spa->spa_vdev_txg_list, txg)) vdev_sync(vd, txg); if (pass == 1) { spa_sync_upgrades(spa, tx); ASSERT3U(txg, >=,
*** 7218,7231 **** spa->spa_all_vdev_zaps, &all_vdev_zap_entry_count)); ASSERT3U(vdev_count_verify_zaps(spa->spa_root_vdev), ==, all_vdev_zap_entry_count); } - if (spa->spa_vdev_removal != NULL) { - ASSERT0(spa->spa_vdev_removal->svr_bytes_done[txg & TXG_MASK]); - } - /* * Rewrite the vdev configuration (which includes the uberblock) * to commit the transaction group. * * If there are no dirty vdevs, we sync the uberblock to a few --- 7321,7330 ----
*** 7239,7260 **** * while we're attempting to write the vdev labels. */ spa_config_enter(spa, SCL_STATE, FTAG, RW_READER); if (list_is_empty(&spa->spa_config_dirty_list)) { ! vdev_t *svd[SPA_SYNC_MIN_VDEVS]; int svdcount = 0; int children = rvd->vdev_children; int c0 = spa_get_random(children); for (int c = 0; c < children; c++) { vd = rvd->vdev_child[(c0 + c) % children]; ! if (vd->vdev_ms_array == 0 || vd->vdev_islog || ! !vdev_is_concrete(vd)) continue; svd[svdcount++] = vd; ! if (svdcount == SPA_SYNC_MIN_VDEVS) break; } error = vdev_config_sync(svd, svdcount, txg); } else { error = vdev_config_sync(rvd->vdev_child, --- 7338,7358 ---- * while we're attempting to write the vdev labels. */ spa_config_enter(spa, SCL_STATE, FTAG, RW_READER); if (list_is_empty(&spa->spa_config_dirty_list)) { ! vdev_t *svd[SPA_DVAS_PER_BP]; int svdcount = 0; int children = rvd->vdev_children; int c0 = spa_get_random(children); for (int c = 0; c < children; c++) { vd = rvd->vdev_child[(c0 + c) % children]; ! if (vd->vdev_ms_array == 0 || vd->vdev_islog) continue; svd[svdcount++] = vd; ! if (svdcount == SPA_DVAS_PER_BP) break; } error = vdev_config_sync(svd, svdcount, txg); } else { error = vdev_config_sync(rvd->vdev_child,
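Note: when nothing in the config is dirty, the label-sync hunk picks at most SPA_DVAS_PER_BP eligible children, starting from a random child index and wrapping around so the same few vdevs are not chosen every txg. A user-space sketch of that pick-a-few-from-a-random-offset loop; the eligibility array and limit are invented:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MAX_PICK 3      /* stands in for SPA_DVAS_PER_BP */

    int main(void)
    {
        int eligible[] = { 1, 0, 1, 1, 0, 1, 1 };   /* 1 = vdev usable for labels */
        int children = 7;
        int picked[MAX_PICK];
        int npicked = 0;

        srand((unsigned)time(NULL));
        int c0 = rand() % children;                 /* random starting child */

        for (int c = 0; c < children && npicked < MAX_PICK; c++) {
            int idx = (c0 + c) % children;          /* wrap around the child list */
            if (!eligible[idx])
                continue;                           /* skip logs, holes, etc. */
            picked[npicked++] = idx;
        }

        for (int i = 0; i < npicked; i++)
            printf("label sync target: child %d\n", picked[i]);
        return 0;
    }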
*** 7291,7312 **** spa->spa_config_syncing = NULL; } dsl_pool_sync_done(dp, txg); ! mutex_enter(&spa->spa_alloc_lock); ! VERIFY0(avl_numnodes(&spa->spa_alloc_tree)); ! mutex_exit(&spa->spa_alloc_lock); /* * Update usable space statistics. */ while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg))) vdev_sync_done(vd, txg); spa_update_dspace(spa); ! /* * It had better be the case that we didn't dirty anything * since vdev_config_sync(). */ ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg)); --- 7389,7408 ---- spa->spa_config_syncing = NULL; } dsl_pool_sync_done(dp, txg); ! spa_check_alloc_trees(spa); /* * Update usable space statistics. */ while (vd = txg_list_remove(&spa->spa_vdev_txg_list, TXG_CLEAN(txg))) vdev_sync_done(vd, txg); spa_update_dspace(spa); ! spa_update_latency(spa); /* * It had better be the case that we didn't dirty anything * since vdev_config_sync(). */ ASSERT(txg_list_empty(&dp->dp_dirty_datasets, txg));
*** 7313,7322 **** --- 7409,7420 ---- ASSERT(txg_list_empty(&dp->dp_dirty_dirs, txg)); ASSERT(txg_list_empty(&spa->spa_vdev_txg_list, txg)); spa->spa_sync_pass = 0; + spa_check_special(spa); + /* * Update the last synced uberblock here. We want to do this at * the end of spa_sync() so that consumers of spa_last_synced_txg() * will be guaranteed that all the processing associated with * that txg has been completed.
*** 7385,7397 **** --- 7483,7498 ---- spa_async_suspend(spa); mutex_enter(&spa_namespace_lock); spa_close(spa, FTAG); if (spa->spa_state != POOL_STATE_UNINITIALIZED) { + wbc_deactivate(spa); + spa_unload(spa); spa_deactivate(spa); } + spa_remove(spa); } mutex_exit(&spa_namespace_lock); }
*** 7483,7493 **** } return (B_FALSE); } ! sysevent_t * spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name) { sysevent_t *ev = NULL; #ifdef _KERNEL sysevent_attr_list_t *attr = NULL; --- 7584,7601 ---- } return (B_FALSE); } ! /* ! * Post a sysevent corresponding to the given event. The 'name' must be one of ! * the event definitions in sys/sysevent/eventdefs.h. The payload will be ! * filled in from the spa and (optionally) the vdev. This doesn't do anything ! * in the userland libzpool, as we don't want consumers to misinterpret ztest ! * or zdb as real changes. ! */ ! static sysevent_t * spa_event_create(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name) { sysevent_t *ev = NULL; #ifdef _KERNEL sysevent_attr_list_t *attr = NULL;
*** 7505,7515 **** value.value_type = SE_DATA_TYPE_UINT64; value.value.sv_uint64 = spa_guid(spa); if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0) goto done; ! if (vd) { value.value_type = SE_DATA_TYPE_UINT64; value.value.sv_uint64 = vd->vdev_guid; if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value, SE_SLEEP) != 0) goto done; --- 7613,7623 ---- value.value_type = SE_DATA_TYPE_UINT64; value.value.sv_uint64 = spa_guid(spa); if (sysevent_add_attr(&attr, ZFS_EV_POOL_GUID, &value, SE_SLEEP) != 0) goto done; ! if (vd != NULL) { value.value_type = SE_DATA_TYPE_UINT64; value.value.sv_uint64 = vd->vdev_guid; if (sysevent_add_attr(&attr, ZFS_EV_VDEV_GUID, &value, SE_SLEEP) != 0) goto done;
*** 7537,7572 ****
  #endif
  	return (ev);
  }
  
! void
! spa_event_post(sysevent_t *ev)
  {
  #ifdef _KERNEL
  	sysevent_id_t eid;
  
  	(void) log_sysevent(ev, SE_SLEEP, &eid);
  	sysevent_free(ev);
  #endif
  }
  
! void
! spa_event_discard(sysevent_t *ev)
  {
! #ifdef _KERNEL
  	sysevent_free(ev);
! #endif
  }
  
  /*
!  * Post a sysevent corresponding to the given event.  The 'name' must be one of
!  * the event definitions in sys/sysevent/eventdefs.h.  The payload will be
!  * filled in from the spa and (optionally) the vdev and history nvl.  This
!  * doesn't do anything in the userland libzpool, as we don't want consumers to
!  * misinterpret ztest or zdb as real changes.
   */
  void
! spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
  {
! 	spa_event_post(spa_event_create(spa, vd, hist_nvl, name));
  }
--- 7645,7955 ----
  #endif
  	return (ev);
  }
  
! static void
! spa_event_post(void *arg)
  {
  #ifdef _KERNEL
+ 	sysevent_t *ev = (sysevent_t *)arg;
+ 
  	sysevent_id_t eid;
  
  	(void) log_sysevent(ev, SE_SLEEP, &eid);
  	sysevent_free(ev);
  #endif
  }
  
! /*
!  * Dispatch event notifications to the taskq such that the corresponding
!  * sysevents are queued with no spa locks held.
!  */
! taskq_t *spa_sysevent_taskq;
! 
! static void
! spa_event_notify_impl(sysevent_t *ev)
  {
! 	if (taskq_dispatch(spa_sysevent_taskq, spa_event_post,
! 	    ev, TQ_NOSLEEP) == NULL) {
! 		/*
! 		 * These are management sysevents; as much as it is
! 		 * unpleasant to drop these due to syseventd not being able
! 		 * to keep up, perhaps due to resource shortages, we are not
! 		 * going to sleep here and risk locking up the pool sync
! 		 * process; notify admin of problems
! 		 */
! 		cmn_err(CE_NOTE, "Could not dispatch sysevent notification "
! 		    "for %s, please check state of syseventd\n",
! 		    sysevent_get_subclass_name(ev));
! 		sysevent_free(ev);
! 
! 		return;
! 	}
  }
  
+ void
+ spa_event_notify(spa_t *spa, vdev_t *vd, nvlist_t *hist_nvl, const char *name)
+ {
+ 	spa_event_notify_impl(spa_event_create(spa, vd, hist_nvl, name));
+ }
+ 
  /*
!  * Dispatches auto-trim processing to all top-level vdevs.  This is
!  * called from spa_sync once every txg.
   */
+ static void
+ spa_auto_trim(spa_t *spa, uint64_t txg)
+ {
+ 	ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER) == SCL_CONFIG);
+ 	ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
+ 	ASSERT(spa->spa_auto_trim_taskq != NULL);
+ 
+ 	for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
+ 		vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
+ 		vti->vti_vdev = spa->spa_root_vdev->vdev_child[i];
+ 		vti->vti_txg = txg;
+ 		vti->vti_done_cb = (void (*)(void *))spa_vdev_auto_trim_done;
+ 		vti->vti_done_arg = spa;
+ 		(void) taskq_dispatch(spa->spa_auto_trim_taskq,
+ 		    (void (*)(void *))vdev_auto_trim, vti, TQ_SLEEP);
+ 		spa->spa_num_auto_trimming++;
+ 	}
+ }
+ 
+ /*
+  * Performs the sync update of the MOS pool directory's trim start/stop values.
+  */
+ static void
+ spa_trim_update_time_sync(void *arg, dmu_tx_t *tx)
+ {
+ 	spa_t *spa = arg;
+ 	VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+ 	    DMU_POOL_TRIM_START_TIME, sizeof (uint64_t), 1,
+ 	    &spa->spa_man_trim_start_time, tx));
+ 	VERIFY0(zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+ 	    DMU_POOL_TRIM_STOP_TIME, sizeof (uint64_t), 1,
+ 	    &spa->spa_man_trim_stop_time, tx));
+ }
+ 
+ /*
+  * Updates the in-core and on-disk manual TRIM operation start/stop time.
+  * Passing UINT64_MAX for either start_time or stop_time means that no
+  * update to that value should be recorded.
+  */
+ static dmu_tx_t *
+ spa_trim_update_time(spa_t *spa, uint64_t start_time, uint64_t stop_time)
+ {
+ 	int err;
+ 	dmu_tx_t *tx;
+ 
+ 	ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
+ 	if (start_time != UINT64_MAX)
+ 		spa->spa_man_trim_start_time = start_time;
+ 	if (stop_time != UINT64_MAX)
+ 		spa->spa_man_trim_stop_time = stop_time;
+ 	tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
+ 	err = dmu_tx_assign(tx, TXG_WAIT);
+ 	if (err) {
+ 		dmu_tx_abort(tx);
+ 		return (NULL);
+ 	}
+ 	dsl_sync_task_nowait(spa_get_dsl(spa), spa_trim_update_time_sync,
+ 	    spa, 1, ZFS_SPACE_CHECK_RESERVED, tx);
+ 
+ 	return (tx);
+ }
+ 
+ /*
+  * Initiates a manual TRIM of the whole pool.  This kicks off individual
+  * TRIM tasks for each top-level vdev, which then pass over all of the free
+  * space in all of the vdev's metaslabs and issue TRIM commands for that
+  * space to the underlying vdevs.
+  */
+ extern void
+ spa_man_trim(spa_t *spa, uint64_t rate)
+ {
+ 	dmu_tx_t *time_update_tx;
+ 
+ 	mutex_enter(&spa->spa_man_trim_lock);
+ 
+ 	if (rate != 0)
+ 		spa->spa_man_trim_rate = MAX(rate, spa_min_trim_rate(spa));
+ 	else
+ 		spa->spa_man_trim_rate = 0;
+ 
+ 	if (spa->spa_num_man_trimming) {
+ 		/*
+ 		 * TRIM is already ongoing.  Wake up all sleeping vdev trim
+ 		 * threads because the trim rate might have changed above.
+ 		 */
+ 		cv_broadcast(&spa->spa_man_trim_update_cv);
+ 		mutex_exit(&spa->spa_man_trim_lock);
+ 		return;
+ 	}
+ 	spa_man_trim_taskq_create(spa);
+ 	spa->spa_man_trim_stop = B_FALSE;
+ 
+ 	spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_START);
+ 	spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
+ 	for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
+ 		vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
+ 		vdev_trim_info_t *vti = kmem_zalloc(sizeof (*vti), KM_SLEEP);
+ 		vti->vti_vdev = vd;
+ 		vti->vti_done_cb = (void (*)(void *))spa_vdev_man_trim_done;
+ 		vti->vti_done_arg = spa;
+ 		spa->spa_num_man_trimming++;
+ 
+ 		vd->vdev_trim_prog = 0;
+ 		(void) taskq_dispatch(spa->spa_man_trim_taskq,
+ 		    (void (*)(void *))vdev_man_trim, vti, TQ_SLEEP);
+ 	}
+ 	spa_config_exit(spa, SCL_CONFIG, FTAG);
+ 	time_update_tx = spa_trim_update_time(spa, gethrestime_sec(), 0);
+ 	mutex_exit(&spa->spa_man_trim_lock);
+ 	/* mustn't hold spa_man_trim_lock to prevent deadlock w/ syncing ctx */
+ 	if (time_update_tx != NULL)
+ 		dmu_tx_commit(time_update_tx);
+ }
+ 
+ /*
+  * Orders a manual TRIM operation to stop and returns immediately.
+  */
+ extern void
+ spa_man_trim_stop(spa_t *spa)
+ {
+ 	boolean_t held = MUTEX_HELD(&spa->spa_man_trim_lock);
+ 	if (!held)
+ 		mutex_enter(&spa->spa_man_trim_lock);
+ 	spa->spa_man_trim_stop = B_TRUE;
+ 	cv_broadcast(&spa->spa_man_trim_update_cv);
+ 	if (!held)
+ 		mutex_exit(&spa->spa_man_trim_lock);
+ }
+ 
+ /*
+  * Orders a manual TRIM operation to stop and waits for both manual and
+  * automatic TRIM to complete.  By holding both the spa_man_trim_lock and
+  * the spa_auto_trim_lock, the caller can guarantee that after this
+  * function returns, no new TRIM operations can be initiated in parallel.
+  */
  void
! spa_trim_stop_wait(spa_t *spa)
  {
! 	ASSERT(MUTEX_HELD(&spa->spa_man_trim_lock));
! 	ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
! 	spa->spa_man_trim_stop = B_TRUE;
! 	cv_broadcast(&spa->spa_man_trim_update_cv);
! 	while (spa->spa_num_man_trimming > 0)
! 		cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
! 	while (spa->spa_num_auto_trimming > 0)
! 		cv_wait(&spa->spa_auto_trim_done_cv, &spa->spa_auto_trim_lock);
! }
! 
! /*
!  * Returns manual TRIM progress.  Progress is indicated by four return values:
!  * 1) prog: the number of bytes of space on the pool in total that manual
!  *    TRIM has already passed (regardless of whether the space is allocated
!  *    or not).  Completion of the operation is indicated when either the
!  *    returned value is zero, or when the returned value is equal to the
!  *    sum of the sizes of all top-level vdevs.
!  * 2) rate: the trim rate in bytes per second.  A value of zero indicates
!  *    that trim progresses as fast as possible.
!  * 3) start_time: the UNIXTIME of when the last manual TRIM operation was
!  *    started.  If no manual trim was ever initiated on the pool, this is
!  *    zero.
!  * 4) stop_time: the UNIXTIME of when the last manual TRIM operation
!  *    stopped on the pool.  If a trim was started (start_time != 0), but has
!  *    not yet completed, stop_time will be zero.  If a trim is NOT currently
!  *    ongoing and start_time is non-zero, this indicates that the previously
!  *    initiated TRIM operation was interrupted.
!  */
! extern void
! spa_get_trim_prog(spa_t *spa, uint64_t *prog, uint64_t *rate,
!     uint64_t *start_time, uint64_t *stop_time)
! {
! 	uint64_t total = 0;
! 	vdev_t *root_vd = spa->spa_root_vdev;
! 
! 	ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
! 	mutex_enter(&spa->spa_man_trim_lock);
! 	if (spa->spa_num_man_trimming > 0) {
! 		for (uint64_t i = 0; i < root_vd->vdev_children; i++) {
! 			total += root_vd->vdev_child[i]->vdev_trim_prog;
! 		}
! 	}
! 	*prog = total;
! 	*rate = spa->spa_man_trim_rate;
! 	*start_time = spa->spa_man_trim_start_time;
! 	*stop_time = spa->spa_man_trim_stop_time;
! 	mutex_exit(&spa->spa_man_trim_lock);
! }
! 
! /*
!  * Callback when a vdev_man_trim has finished on a single top-level vdev.
!  */
! static void
! spa_vdev_man_trim_done(spa_t *spa)
! {
! 	dmu_tx_t *time_update_tx = NULL;
! 
! 	mutex_enter(&spa->spa_man_trim_lock);
! 	ASSERT(spa->spa_num_man_trimming > 0);
! 	spa->spa_num_man_trimming--;
! 	if (spa->spa_num_man_trimming == 0) {
! 		/* if we were interrupted, leave stop_time at zero */
! 		if (!spa->spa_man_trim_stop)
! 			time_update_tx = spa_trim_update_time(spa, UINT64_MAX,
! 			    gethrestime_sec());
! 		spa_event_notify(spa, NULL, NULL, ESC_ZFS_TRIM_FINISH);
! 		spa_async_request(spa, SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY);
! 		cv_broadcast(&spa->spa_man_trim_done_cv);
! 	}
! 	mutex_exit(&spa->spa_man_trim_lock);
! 
! 	if (time_update_tx != NULL)
! 		dmu_tx_commit(time_update_tx);
! }
! 
! /*
!  * Called from vdev_auto_trim when a vdev has completed its auto-trim
!  * processing.
!  */
! static void
! spa_vdev_auto_trim_done(spa_t *spa)
! {
! 	mutex_enter(&spa->spa_auto_trim_lock);
! 	ASSERT(spa->spa_num_auto_trimming > 0);
! 	spa->spa_num_auto_trimming--;
! 	if (spa->spa_num_auto_trimming == 0)
! 		cv_broadcast(&spa->spa_auto_trim_done_cv);
! 	mutex_exit(&spa->spa_auto_trim_lock);
! }
! 
! /*
!  * Determines the minimum sensible rate at which a manual TRIM can be
!  * performed on a given spa and returns it.  Since we perform TRIM in
!  * metaslab-sized increments, we'll just let the longest step between
!  * metaslab TRIMs be 100s (random number, really).  Thus, on a typical
!  * 200-metaslab vdev, the longest a TRIM should take is about 5.5 hours.
!  * It *can* take longer if the device is really slow to respond to
!  * zio_trim() commands, if it contains more than 200 metaslabs, or if
!  * metaslab sizes vary widely between top-level vdevs.
!  */
! static uint64_t
! spa_min_trim_rate(spa_t *spa)
! {
! 	uint64_t smallest_ms_sz = UINT64_MAX;
! 
! 	/* find the smallest metaslab */
! 	spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
! 	for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
! 		smallest_ms_sz = MIN(smallest_ms_sz,
! 		    spa->spa_root_vdev->vdev_child[i]->vdev_ms[0]->ms_size);
! 	}
! 	spa_config_exit(spa, SCL_CONFIG, FTAG);
! 	VERIFY(smallest_ms_sz != 0);
! 
! 	/* minimum TRIM rate is 1/100th of the smallest metaslab size */
! 	return (smallest_ms_sz / 100);
  }
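
The 100-second-per-metaslab floor picked in spa_min_trim_rate() is what yields
the 5.5-hour figure in its comment: at a rate of smallest_ms_sz / 100 bytes per
second, even the smallest metaslab needs roughly 100 seconds, so a pass over a
200-metaslab top-level vdev is bounded by about 200 * 100 s = 20,000 s, i.e.
roughly 5.6 hours. The small userland sketch below reproduces that arithmetic
only; the 8 GiB metaslab size and the 200-metaslab count are illustrative
assumptions, not values taken from this patch.

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative only: mirrors the arithmetic behind spa_min_trim_rate() and
 * the worst-case duration estimate in its comment.
 */
int
main(void)
{
	uint64_t ms_size = 8ULL << 30;	/* assumed metaslab size: 8 GiB */
	uint64_t ms_count = 200;	/* the "typical" vdev from the comment */

	/* minimum TRIM rate: 1/100th of the smallest metaslab per second */
	uint64_t min_rate = ms_size / 100;

	/* at that rate a metaslab of this size takes about 100 seconds */
	uint64_t secs_per_ms = ms_size / min_rate;

	/* a full pass over the vdev is bounded by ms_count * ~100 s */
	uint64_t total_secs = secs_per_ms * ms_count;

	printf("minimum rate: %llu bytes/s\n", (unsigned long long)min_rate);
	printf("worst-case pass: %llu s (~%.1f hours)\n",
	    (unsigned long long)total_secs, total_secs / 3600.0);
	return (0);
}

With these assumed inputs the sketch prints a rate floor of about 86 MB/s and
a worst-case pass of 20,000 seconds (about 5.6 hours), consistent with the
estimate in the comment above.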