NEX-15270 pool clear does not "repair" cache devices
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dmitry Savitsky <dmitry.savitsky@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9940 Appliance requires a reboot after JBOD power failure or disconnecting all SAS cables
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5736 implement autoreplace matching based on FRU slot number
NEX-6200 hot spares are not reactivated after reinserting into enclosure
NEX-9403 need to update FRU for spare and l2cache devices
NEX-9404 remove lofi autoreplace support from syseventd
NEX-9409 hotsparing doesn't work for vdevs without FRU
NEX-9424 zfs`vdev_online() needs better notification about state changes
Portions contributed by: Alek Pinchuk <alek@nexenta.com>
Portions contributed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8206 dtrace helpers leak when cfork() fails
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8507 erroneous check in vdev_type_is_ddt()
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4584 System panic when adding special vdev to a pool that does not support feature flags
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-2846 Enable Automatic/Intelligent Hot Sparing capability
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3729 KRRP changes mess up iostat(1M)
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
NEX-4204 Removing vdev while on-demand trim is ongoing locks up pool
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3212 remove vdev prop object type from dmu.h, p2
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3025 support root pools on EFI labeled disks
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-1142 move rwlock to vdev to protect vdev_tsd, not just the ldi handle
This way we serialize open/close, yet still allow parallel I/O.
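As a rough sketch of the locking pattern this describes (the diff below adds an rw_init of vd->vdev_tsd_lock in vdev_alloc_common), open/close paths would take the vdev's rwlock as a writer, while I/O paths only need it as a reader to safely dereference vdev_tsd. The function bodies below are illustrative only, not the actual change:

/*
 * Illustrative only: open/close serialize on the vdev's rwlock as writers,
 * while concurrent I/O paths take it as readers and run in parallel.
 */
static void
example_vdev_close(vdev_t *vd)
{
	rw_enter(&vd->vdev_tsd_lock, RW_WRITER);
	/* ... tear down and free vd->vdev_tsd ... */
	vd->vdev_tsd = NULL;
	rw_exit(&vd->vdev_tsd_lock);
}

static void
example_vdev_io_start(vdev_t *vd)
{
	rw_enter(&vd->vdev_tsd_lock, RW_READER);
	if (vd->vdev_tsd != NULL) {
		/* ... issue the I/O using vd->vdev_tsd ... */
	}
	rw_exit(&vd->vdev_tsd_lock);
}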
NEX-801 If a block pointer is corrupt, read or write may crash
If a block pointer is corrupt in such a way that the vdev id of one of the
ditto blocks is wrong (out of range), zio_vdev_io_start or zio_vdev_io_done
may trip over it and crash.
This changeset takes care of this by treating an invalid vdev as
neither readable nor writeable.
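The resulting guards appear in the vdev_readable()/vdev_writeable() hunk at the end of the diff below; in essence, a NULL vdev pointer (as produced when an out-of-range vdev id fails to resolve) now simply reports as neither readable nor writeable:

boolean_t
vdev_readable(vdev_t *vd)
{
	/* A vdev that could not be looked up is treated as unreadable. */
	return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read);
}

boolean_t
vdev_writeable(vdev_t *vd)
{
	return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write);
}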
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

*** 19,43 **** * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. ! * Copyright (c) 2011, 2018 by Delphix. All rights reserved. ! * Copyright 2017 Nexenta Systems, Inc. * Copyright (c) 2014 Integros [integros.com] * Copyright 2016 Toomas Soome <tsoome@me.com> * Copyright 2017 Joyent, Inc. */ #include <sys/zfs_context.h> #include <sys/fm/fs/zfs.h> #include <sys/spa.h> #include <sys/spa_impl.h> - #include <sys/bpobj.h> #include <sys/dmu.h> #include <sys/dmu_tx.h> - #include <sys/dsl_dir.h> #include <sys/vdev_impl.h> #include <sys/uberblock_impl.h> #include <sys/metaslab.h> #include <sys/metaslab_impl.h> #include <sys/space_map.h> --- 19,41 ---- * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. ! * Copyright (c) 2011, 2015 by Delphix. All rights reserved. ! * Copyright 2018 Nexenta Systems, Inc. * Copyright (c) 2014 Integros [integros.com] * Copyright 2016 Toomas Soome <tsoome@me.com> * Copyright 2017 Joyent, Inc. */ #include <sys/zfs_context.h> #include <sys/fm/fs/zfs.h> #include <sys/spa.h> #include <sys/spa_impl.h> #include <sys/dmu.h> #include <sys/dmu_tx.h> #include <sys/vdev_impl.h> #include <sys/uberblock_impl.h> #include <sys/metaslab.h> #include <sys/metaslab_impl.h> #include <sys/space_map.h>
*** 62,159 **** &vdev_spare_ops, &vdev_disk_ops, &vdev_file_ops, &vdev_missing_ops, &vdev_hole_ops, - &vdev_indirect_ops, NULL }; /* maximum scrub/resilver I/O queue per leaf vdev */ int zfs_scrub_limit = 10; /* * When a vdev is added, it will be divided into approximately (but no * more than) this number of metaslabs. */ int metaslabs_per_vdev = 200; - boolean_t vdev_validate_skip = B_FALSE; - - /*PRINTFLIKE2*/ - void - vdev_dbgmsg(vdev_t *vd, const char *fmt, ...) - { - va_list adx; - char buf[256]; - - va_start(adx, fmt); - (void) vsnprintf(buf, sizeof (buf), fmt, adx); - va_end(adx); - - if (vd->vdev_path != NULL) { - zfs_dbgmsg("%s vdev '%s': %s", vd->vdev_ops->vdev_op_type, - vd->vdev_path, buf); - } else { - zfs_dbgmsg("%s-%llu vdev (guid %llu): %s", - vd->vdev_ops->vdev_op_type, - (u_longlong_t)vd->vdev_id, - (u_longlong_t)vd->vdev_guid, buf); - } - } - - void - vdev_dbgmsg_print_tree(vdev_t *vd, int indent) - { - char state[20]; - - if (vd->vdev_ishole || vd->vdev_ops == &vdev_missing_ops) { - zfs_dbgmsg("%*svdev %u: %s", indent, "", vd->vdev_id, - vd->vdev_ops->vdev_op_type); - return; - } - - switch (vd->vdev_state) { - case VDEV_STATE_UNKNOWN: - (void) snprintf(state, sizeof (state), "unknown"); - break; - case VDEV_STATE_CLOSED: - (void) snprintf(state, sizeof (state), "closed"); - break; - case VDEV_STATE_OFFLINE: - (void) snprintf(state, sizeof (state), "offline"); - break; - case VDEV_STATE_REMOVED: - (void) snprintf(state, sizeof (state), "removed"); - break; - case VDEV_STATE_CANT_OPEN: - (void) snprintf(state, sizeof (state), "can't open"); - break; - case VDEV_STATE_FAULTED: - (void) snprintf(state, sizeof (state), "faulted"); - break; - case VDEV_STATE_DEGRADED: - (void) snprintf(state, sizeof (state), "degraded"); - break; - case VDEV_STATE_HEALTHY: - (void) snprintf(state, sizeof (state), "healthy"); - break; - default: - (void) snprintf(state, sizeof (state), "<state %u>", - (uint_t)vd->vdev_state); - } - - zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent, - "", vd->vdev_id, vd->vdev_ops->vdev_op_type, - vd->vdev_islog ? " (log)" : "", - (u_longlong_t)vd->vdev_guid, - vd->vdev_path ? vd->vdev_path : "N/A", state); - - for (uint64_t i = 0; i < vd->vdev_children; i++) - vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2); - } - /* * Given a vdev type, return the appropriate ops vector. */ static vdev_ops_t * vdev_getops(const char *type) --- 60,86 ---- &vdev_spare_ops, &vdev_disk_ops, &vdev_file_ops, &vdev_missing_ops, &vdev_hole_ops, NULL }; /* maximum scrub/resilver I/O queue per leaf vdev */ int zfs_scrub_limit = 10; /* + * alpha for exponential moving average of I/O latency (in 1/10th of a percent) + */ + int zfs_vs_latency_alpha = 100; + + /* * When a vdev is added, it will be divided into approximately (but no * more than) this number of metaslabs. */ int metaslabs_per_vdev = 200; /* * Given a vdev type, return the appropriate ops vector. */ static vdev_ops_t * vdev_getops(const char *type)
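The hunk above introduces the zfs_vs_latency_alpha tunable, described as an EMA alpha in tenths of a percent. As a hypothetical illustration only (the helper and its names below are not part of this diff), such an alpha would typically be folded into a latency average as:

/*
 * Hypothetical sketch: fold a new latency sample into an exponential
 * moving average where 'alpha' is expressed in 1/10ths of a percent
 * (0..1000), so alpha = 100 weights each new sample at 10%.
 */
static uint64_t
example_latency_ema(uint64_t ema, uint64_t sample, int alpha)
{
	return ((alpha * sample + (1000 - alpha) * ema) / 1000);
}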
*** 165,174 ****
--- 92,107 ----
  		break;
  
  	return (ops);
  }
  
+ boolean_t
+ vdev_is_special(vdev_t *vd)
+ {
+ 	return (vd ? vd->vdev_isspecial : B_FALSE);
+ }
+ 
  /*
   * Default asize function: return the MAX of psize with the asize of
   * all children.  This is what's used by anything other than RAID-Z.
   */
  uint64_t
*** 310,319 ****
--- 243,255 ----
  	}
  
  	pvd->vdev_child = newchild;
  	pvd->vdev_child[id] = cvd;
+ 	cvd->vdev_isspecial_child =
+ 	    (pvd->vdev_isspecial || pvd->vdev_isspecial_child);
+ 
  	cvd->vdev_top = (pvd->vdev_top ? pvd->vdev_top: cvd);
  
  	ASSERT(cvd->vdev_top->vdev_parent->vdev_parent == NULL);
  
  	/*
  	 * Walk up all ancestors to update guid sum.
*** 391,404 **** */ vdev_t * vdev_alloc_common(spa_t *spa, uint_t id, uint64_t guid, vdev_ops_t *ops) { vdev_t *vd; - vdev_indirect_config_t *vic; vd = kmem_zalloc(sizeof (vdev_t), KM_SLEEP); - vic = &vd->vdev_indirect_config; if (spa->spa_root_vdev == NULL) { ASSERT(ops == &vdev_root_ops); spa->spa_root_vdev = vd; spa->spa_load_guid = spa_generate_guid(NULL); --- 327,338 ----
*** 425,446 **** vd->vdev_guid = guid; vd->vdev_guid_sum = guid; vd->vdev_ops = ops; vd->vdev_state = VDEV_STATE_CLOSED; vd->vdev_ishole = (ops == &vdev_hole_ops); - vic->vic_prev_indirect_vdev = UINT64_MAX; - rw_init(&vd->vdev_indirect_rwlock, NULL, RW_DEFAULT, NULL); - mutex_init(&vd->vdev_obsolete_lock, NULL, MUTEX_DEFAULT, NULL); - vd->vdev_obsolete_segments = range_tree_create(NULL, NULL); - mutex_init(&vd->vdev_dtl_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&vd->vdev_stat_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&vd->vdev_probe_lock, NULL, MUTEX_DEFAULT, NULL); ! mutex_init(&vd->vdev_queue_lock, NULL, MUTEX_DEFAULT, NULL); for (int t = 0; t < DTL_TYPES; t++) { ! vd->vdev_dtl[t] = range_tree_create(NULL, NULL); } txg_list_create(&vd->vdev_ms_list, spa, offsetof(struct metaslab, ms_txg_node)); txg_list_create(&vd->vdev_dtl_list, spa, offsetof(struct vdev, vdev_dtl_node)); --- 359,377 ---- vd->vdev_guid = guid; vd->vdev_guid_sum = guid; vd->vdev_ops = ops; vd->vdev_state = VDEV_STATE_CLOSED; vd->vdev_ishole = (ops == &vdev_hole_ops); mutex_init(&vd->vdev_dtl_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&vd->vdev_stat_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&vd->vdev_probe_lock, NULL, MUTEX_DEFAULT, NULL); ! mutex_init(&vd->vdev_scan_io_queue_lock, NULL, MUTEX_DEFAULT, NULL); ! rw_init(&vd->vdev_tsd_lock, NULL, RW_DEFAULT, NULL); for (int t = 0; t < DTL_TYPES; t++) { ! vd->vdev_dtl[t] = range_tree_create(NULL, NULL, ! &vd->vdev_dtl_lock); } txg_list_create(&vd->vdev_ms_list, spa, offsetof(struct metaslab, ms_txg_node)); txg_list_create(&vd->vdev_dtl_list, spa, offsetof(struct vdev, vdev_dtl_node));
*** 460,472 **** vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id, int alloctype) { vdev_ops_t *ops; char *type; ! uint64_t guid = 0, islog, nparity; vdev_t *vd; - vdev_indirect_config_t *vic; ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0) return (SET_ERROR(EINVAL)); --- 391,403 ---- vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id, int alloctype) { vdev_ops_t *ops; char *type; ! uint64_t guid = 0, nparity; ! uint64_t isspecial = 0, islog = 0; vdev_t *vd; ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0) return (SET_ERROR(EINVAL));
*** 505,519 **** return (SET_ERROR(EINVAL)); /* * Determine whether we're a log vdev. */ - islog = 0; (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_LOG, &islog); if (islog && spa_version(spa) < SPA_VERSION_SLOGS) return (SET_ERROR(ENOTSUP)); if (ops == &vdev_hole_ops && spa_version(spa) < SPA_VERSION_HOLES) return (SET_ERROR(ENOTSUP)); /* * Set the nparity property for RAID-Z vdevs. --- 436,456 ---- return (SET_ERROR(EINVAL)); /* * Determine whether we're a log vdev. */ (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_LOG, &islog); if (islog && spa_version(spa) < SPA_VERSION_SLOGS) return (SET_ERROR(ENOTSUP)); + /* + * Determine whether we're a special vdev. + */ + (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_SPECIAL, &isspecial); + if (isspecial && spa_version(spa) < SPA_VERSION_FEATURES) + return (SET_ERROR(ENOTSUP)); + if (ops == &vdev_hole_ops && spa_version(spa) < SPA_VERSION_HOLES) return (SET_ERROR(ENOTSUP)); /* * Set the nparity property for RAID-Z vdevs.
*** 550,563 **** nparity = 0; } ASSERT(nparity != -1ULL); vd = vdev_alloc_common(spa, id, guid, ops); - vic = &vd->vdev_indirect_config; vd->vdev_islog = islog; vd->vdev_nparity = nparity; if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &vd->vdev_path) == 0) vd->vdev_path = spa_strdup(vd->vdev_path); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_DEVID, &vd->vdev_devid) == 0) vd->vdev_devid = spa_strdup(vd->vdev_devid); --- 487,502 ---- nparity = 0; } ASSERT(nparity != -1ULL); vd = vdev_alloc_common(spa, id, guid, ops); vd->vdev_islog = islog; + vd->vdev_isspecial = isspecial; vd->vdev_nparity = nparity; + vd->vdev_isspecial_child = (parent != NULL && + (parent->vdev_isspecial || parent->vdev_isspecial_child)); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &vd->vdev_path) == 0) vd->vdev_path = spa_strdup(vd->vdev_path); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_DEVID, &vd->vdev_devid) == 0) vd->vdev_devid = spa_strdup(vd->vdev_devid);
*** 565,591 **** &vd->vdev_physpath) == 0) vd->vdev_physpath = spa_strdup(vd->vdev_physpath); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &vd->vdev_fru) == 0) vd->vdev_fru = spa_strdup(vd->vdev_fru); /* * Set the whole_disk property. If it's not specified, leave the value * as -1. */ if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK, &vd->vdev_wholedisk) != 0) vd->vdev_wholedisk = -1ULL; ! ASSERT0(vic->vic_mapping_object); ! (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT, ! &vic->vic_mapping_object); ! ASSERT0(vic->vic_births_object); ! (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS, ! &vic->vic_births_object); ! ASSERT3U(vic->vic_prev_indirect_vdev, ==, UINT64_MAX); ! (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV, ! &vic->vic_prev_indirect_vdev); /* * Look for the 'not present' flag. This will only be set if the device * was not present at the time of import. */ --- 504,558 ---- &vd->vdev_physpath) == 0) vd->vdev_physpath = spa_strdup(vd->vdev_physpath); if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &vd->vdev_fru) == 0) vd->vdev_fru = spa_strdup(vd->vdev_fru); + #ifdef _KERNEL + if (vd->vdev_path) { + char dev_path[MAXPATHLEN]; + char *last_slash = NULL; + kstat_t *exist = NULL; + + if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) == 0) + last_slash = strrchr(vd->vdev_path, '/'); + + (void) sprintf(dev_path, "%s:%s", spa->spa_name, + last_slash != NULL ? last_slash + 1 : vd->vdev_path); + + exist = kstat_hold_byname("zfs", 0, dev_path, ALL_ZONES); + + if (!exist) { + vd->vdev_iokstat = kstat_create("zfs", 0, dev_path, + "zfs", KSTAT_TYPE_IO, 1, 0); + + if (vd->vdev_iokstat) { + vd->vdev_iokstat->ks_lock = + &spa->spa_iokstat_lock; + kstat_install(vd->vdev_iokstat); + } + } else { + kstat_rele(exist); + } + } + #endif + /* * Set the whole_disk property. If it's not specified, leave the value * as -1. */ if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK, &vd->vdev_wholedisk) != 0) vd->vdev_wholedisk = -1ULL; ! /* ! * Set the is_ssd property. If it's not specified it means the media ! * is not SSD or the request failed and we assume it's not. ! */ ! if (nvlist_lookup_boolean(nv, ZPOOL_CONFIG_IS_SSD) == 0) ! vd->vdev_is_ssd = B_TRUE; ! else ! vd->vdev_is_ssd = B_FALSE; /* * Look for the 'not present' flag. This will only be set if the device * was not present at the time of import. */
*** 621,636 **** } else { ASSERT0(vd->vdev_top_zap); } if (parent && !parent->vdev_parent && alloctype != VDEV_ALLOC_ATTACH) { ASSERT(alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_ADD || alloctype == VDEV_ALLOC_SPLIT || alloctype == VDEV_ALLOC_ROOTPOOL); ! vd->vdev_mg = metaslab_group_create(islog ? ! spa_log_class(spa) : spa_normal_class(spa), vd); } if (vd->vdev_ops->vdev_op_leaf && (alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_SPLIT)) { (void) nvlist_lookup_uint64(nv, --- 588,606 ---- } else { ASSERT0(vd->vdev_top_zap); } if (parent && !parent->vdev_parent && alloctype != VDEV_ALLOC_ATTACH) { + metaslab_class_t *mc = isspecial ? spa_special_class(spa) : + (islog ? spa_log_class(spa) : spa_normal_class(spa)); + ASSERT(alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_ADD || alloctype == VDEV_ALLOC_SPLIT || alloctype == VDEV_ALLOC_ROOTPOOL); ! ! vd->vdev_mg = metaslab_group_create(mc, vd); } if (vd->vdev_ops->vdev_op_leaf && (alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_SPLIT)) { (void) nvlist_lookup_uint64(nv,
*** 708,717 **** --- 678,697 ---- vdev_free(vdev_t *vd) { spa_t *spa = vd->vdev_spa; /* + * Scan queues are normally destroyed at the end of a scan. If the + * queue exists here, that implies the vdev is being removed while + * the scan is still running. + */ + if (vd->vdev_scan_io_queue != NULL) { + dsl_scan_io_queue_destroy(vd->vdev_scan_io_queue); + vd->vdev_scan_io_queue = NULL; + } + + /* * vdev_free() implies closing the vdev first. This is simpler than * trying to ensure complicated semantics for all callers. */ vdev_close(vd);
*** 775,809 **** range_tree_vacate(vd->vdev_dtl[t], NULL, NULL); range_tree_destroy(vd->vdev_dtl[t]); } mutex_exit(&vd->vdev_dtl_lock); ! EQUIV(vd->vdev_indirect_births != NULL, ! vd->vdev_indirect_mapping != NULL); ! if (vd->vdev_indirect_births != NULL) { ! vdev_indirect_mapping_close(vd->vdev_indirect_mapping); ! vdev_indirect_births_close(vd->vdev_indirect_births); } - - if (vd->vdev_obsolete_sm != NULL) { - ASSERT(vd->vdev_removing || - vd->vdev_ops == &vdev_indirect_ops); - space_map_close(vd->vdev_obsolete_sm); - vd->vdev_obsolete_sm = NULL; - } - range_tree_destroy(vd->vdev_obsolete_segments); - rw_destroy(&vd->vdev_indirect_rwlock); - mutex_destroy(&vd->vdev_obsolete_lock); - - mutex_destroy(&vd->vdev_queue_lock); mutex_destroy(&vd->vdev_dtl_lock); mutex_destroy(&vd->vdev_stat_lock); mutex_destroy(&vd->vdev_probe_lock); if (vd == spa->spa_root_vdev) spa->spa_root_vdev = NULL; kmem_free(vd, sizeof (vdev_t)); } /* * Transfer top-level vdev state from svd to tvd. --- 755,779 ---- range_tree_vacate(vd->vdev_dtl[t], NULL, NULL); range_tree_destroy(vd->vdev_dtl[t]); } mutex_exit(&vd->vdev_dtl_lock); ! if (vd->vdev_iokstat) { ! kstat_delete(vd->vdev_iokstat); ! vd->vdev_iokstat = NULL; } mutex_destroy(&vd->vdev_dtl_lock); mutex_destroy(&vd->vdev_stat_lock); mutex_destroy(&vd->vdev_probe_lock); + mutex_destroy(&vd->vdev_scan_io_queue_lock); + rw_destroy(&vd->vdev_tsd_lock); if (vd == spa->spa_root_vdev) spa->spa_root_vdev = NULL; + ASSERT3P(vd->vdev_scan_io_queue, ==, NULL); + kmem_free(vd, sizeof (vdev_t)); } /* * Transfer top-level vdev state from svd to tvd.
*** 869,878 **** --- 839,854 ---- tvd->vdev_deflate_ratio = svd->vdev_deflate_ratio; svd->vdev_deflate_ratio = 0; tvd->vdev_islog = svd->vdev_islog; svd->vdev_islog = 0; + + tvd->vdev_isspecial = svd->vdev_isspecial; + svd->vdev_isspecial = 0; + svd->vdev_isspecial_child = tvd->vdev_isspecial; + + dsl_scan_io_queue_vdev_xfer(svd, tvd); } static void vdev_top_update(vdev_t *tvd, vdev_t *vd) {
*** 900,910 ****
  	mvd = vdev_alloc_common(spa, cvd->vdev_id, 0, ops);
  
  	mvd->vdev_asize = cvd->vdev_asize;
  	mvd->vdev_min_asize = cvd->vdev_min_asize;
  	mvd->vdev_max_asize = cvd->vdev_max_asize;
- 	mvd->vdev_psize = cvd->vdev_psize;
  	mvd->vdev_ashift = cvd->vdev_ashift;
  	mvd->vdev_state = cvd->vdev_state;
  	mvd->vdev_crtxg = cvd->vdev_crtxg;
  
  	vdev_remove_child(pvd, cvd);
--- 876,885 ----
*** 981,990 **** --- 956,974 ---- if (vd->vdev_ms_shift == 0) return (0); ASSERT(!vd->vdev_ishole); + /* + * Compute the raidz-deflation ratio. Note, we hard-code + * in 128k (1 << 17) because it is the "typical" blocksize. + * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change, + * otherwise it would inconsistently account for existing bp's. + */ + vd->vdev_deflate_ratio = (1 << 17) / + (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT); + ASSERT(oldc <= newc); mspp = kmem_zalloc(newc * sizeof (*mspp), KM_SLEEP); if (oldc != 0) {
*** 996,1029 **** vd->vdev_ms_count = newc; for (m = oldc; m < newc; m++) { uint64_t object = 0; ! /* ! * vdev_ms_array may be 0 if we are creating the "fake" ! * metaslabs for an indirect vdev for zdb's leak detection. ! * See zdb_leak_init(). ! */ ! if (txg == 0 && vd->vdev_ms_array != 0) { error = dmu_read(mos, vd->vdev_ms_array, m * sizeof (uint64_t), sizeof (uint64_t), &object, DMU_READ_PREFETCH); ! if (error != 0) { ! vdev_dbgmsg(vd, "unable to read the metaslab " ! "array [error=%d]", error); return (error); } - } error = metaslab_init(vd->vdev_mg, m, object, txg, &(vd->vdev_ms[m])); ! if (error != 0) { ! vdev_dbgmsg(vd, "metaslab_init failed [error=%d]", ! error); return (error); } - } if (txg == 0) spa_config_enter(spa, SCL_ALLOC, FTAG, RW_WRITER); /* --- 980,1002 ---- vd->vdev_ms_count = newc; for (m = oldc; m < newc; m++) { uint64_t object = 0; ! if (txg == 0) { error = dmu_read(mos, vd->vdev_ms_array, m * sizeof (uint64_t), sizeof (uint64_t), &object, DMU_READ_PREFETCH); ! if (error) return (error); } error = metaslab_init(vd->vdev_mg, m, object, txg, &(vd->vdev_ms[m])); ! if (error) return (error); } if (txg == 0) spa_config_enter(spa, SCL_ALLOC, FTAG, RW_WRITER); /*
*** 1041,1066 **** } void vdev_metaslab_fini(vdev_t *vd) { ! if (vd->vdev_ms != NULL) { uint64_t count = vd->vdev_ms_count; metaslab_group_passivate(vd->vdev_mg); ! for (uint64_t m = 0; m < count; m++) { metaslab_t *msp = vd->vdev_ms[m]; if (msp != NULL) metaslab_fini(msp); } kmem_free(vd->vdev_ms, count * sizeof (metaslab_t *)); vd->vdev_ms = NULL; - - vd->vdev_ms_count = 0; } - ASSERT0(vd->vdev_ms_count); } typedef struct vdev_probe_stats { boolean_t vps_readable; boolean_t vps_writeable; --- 1014,1037 ---- } void vdev_metaslab_fini(vdev_t *vd) { ! uint64_t m; uint64_t count = vd->vdev_ms_count; + if (vd->vdev_ms != NULL) { metaslab_group_passivate(vd->vdev_mg); ! for (m = 0; m < count; m++) { metaslab_t *msp = vd->vdev_ms[m]; if (msp != NULL) metaslab_fini(msp); } kmem_free(vd->vdev_ms, count * sizeof (metaslab_t *)); vd->vdev_ms = NULL; } } typedef struct vdev_probe_stats { boolean_t vps_readable; boolean_t vps_writeable;
*** 1100,1110 ****
  
  	if (vdev_readable(vd) &&
  	    (vdev_writeable(vd) || !spa_writeable(spa))) {
  		zio->io_error = 0;
  	} else {
  		ASSERT(zio->io_error != 0);
- 		vdev_dbgmsg(vd, "failed probe");
  		zfs_ereport_post(FM_EREPORT_ZFS_PROBE_FAILURE,
  		    spa, vd, NULL, 0, 0);
  		zio->io_error = SET_ERROR(ENXIO);
  	}
--- 1071,1080 ----
*** 1268,1292 **** taskq_destroy(tq); } /* - * Compute the raidz-deflation ratio. Note, we hard-code - * in 128k (1 << 17) because it is the "typical" blocksize. - * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change, - * otherwise it would inconsistently account for existing bp's. - */ - static void - vdev_set_deflate_ratio(vdev_t *vd) - { - if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) { - vd->vdev_deflate_ratio = (1 << 17) / - (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT); - } - } - - /* * Prepare a virtual device for access. */ int vdev_open(vdev_t *vd) { --- 1238,1247 ----
*** 1307,1320 **** vd->vdev_cant_read = B_FALSE; vd->vdev_cant_write = B_FALSE; vd->vdev_min_asize = vdev_get_min_asize(vd); /* ! * If this vdev is not removed, check its fault status. If it's ! * faulted, bail out of the open. */ ! if (!vd->vdev_removed && vd->vdev_faulted) { ASSERT(vd->vdev_children == 0); ASSERT(vd->vdev_label_aux == VDEV_AUX_ERR_EXCEEDED || vd->vdev_label_aux == VDEV_AUX_EXTERNAL); vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED, vd->vdev_label_aux); --- 1262,1276 ---- vd->vdev_cant_read = B_FALSE; vd->vdev_cant_write = B_FALSE; vd->vdev_min_asize = vdev_get_min_asize(vd); /* ! * If vdev isn't removed and is faulted for reasons other than failed ! * open, or if it's offline - bail out. */ ! if (!vd->vdev_removed && vd->vdev_faulted && ! vd->vdev_label_aux != VDEV_AUX_OPEN_FAILED) { ASSERT(vd->vdev_children == 0); ASSERT(vd->vdev_label_aux == VDEV_AUX_ERR_EXCEEDED || vd->vdev_label_aux == VDEV_AUX_EXTERNAL); vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED, vd->vdev_label_aux);
*** 1338,1354 **** if (error) { if (vd->vdev_removed && vd->vdev_stat.vs_aux != VDEV_AUX_OPEN_FAILED) vd->vdev_removed = B_FALSE; - if (vd->vdev_stat.vs_aux == VDEV_AUX_CHILDREN_OFFLINE) { - vdev_set_state(vd, B_TRUE, VDEV_STATE_OFFLINE, - vd->vdev_stat.vs_aux); - } else { vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN, vd->vdev_stat.vs_aux); - } return (error); } vd->vdev_removed = B_FALSE; --- 1294,1305 ----
*** 1504,1556 **** /* * Called once the vdevs are all opened, this routine validates the label * contents. This needs to be done before vdev_load() so that we don't * inadvertently do repair I/Os to the wrong device. * * This function will only return failure if one of the vdevs indicates that it * has since been destroyed or exported. This is only possible if * /etc/zfs/zpool.cache was readonly at the time. Otherwise, the vdev state * will be updated but the function will return 0. */ int ! vdev_validate(vdev_t *vd) { spa_t *spa = vd->vdev_spa; nvlist_t *label; ! uint64_t guid = 0, aux_guid = 0, top_guid; uint64_t state; - nvlist_t *nvl; - uint64_t txg; ! if (vdev_validate_skip) ! return (0); ! ! for (uint64_t c = 0; c < vd->vdev_children; c++) ! if (vdev_validate(vd->vdev_child[c]) != 0) return (SET_ERROR(EBADF)); /* * If the device has already failed, or was marked offline, don't do * any further validation. Otherwise, label I/O will fail and we will * overwrite the previous state. */ ! if (!vd->vdev_ops->vdev_op_leaf || !vdev_readable(vd)) ! return (0); - /* - * If we are performing an extreme rewind, we allow for a label that - * was modified at a point after the current txg. - */ - if (spa->spa_extreme_rewind || spa_last_synced_txg(spa) == 0) - txg = UINT64_MAX; - else - txg = spa_last_synced_txg(spa); - if ((label = vdev_label_read_config(vd, txg)) == NULL) { vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN, VDEV_AUX_BAD_LABEL); - vdev_dbgmsg(vd, "vdev_validate: failed reading config"); return (0); } /* * Determine if this vdev has been split off into another --- 1455,1500 ---- /* * Called once the vdevs are all opened, this routine validates the label * contents. This needs to be done before vdev_load() so that we don't * inadvertently do repair I/Os to the wrong device. * + * If 'strict' is false ignore the spa guid check. This is necessary because + * if the machine crashed during a re-guid the new guid might have been written + * to all of the vdev labels, but not the cached config. The strict check + * will be performed when the pool is opened again using the mos config. + * * This function will only return failure if one of the vdevs indicates that it * has since been destroyed or exported. This is only possible if * /etc/zfs/zpool.cache was readonly at the time. Otherwise, the vdev state * will be updated but the function will return 0. */ int ! vdev_validate(vdev_t *vd, boolean_t strict) { spa_t *spa = vd->vdev_spa; nvlist_t *label; ! uint64_t guid = 0, top_guid; uint64_t state; ! for (int c = 0; c < vd->vdev_children; c++) ! if (vdev_validate(vd->vdev_child[c], strict) != 0) return (SET_ERROR(EBADF)); /* * If the device has already failed, or was marked offline, don't do * any further validation. Otherwise, label I/O will fail and we will * overwrite the previous state. */ ! if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) { ! uint64_t aux_guid = 0; ! nvlist_t *nvl; ! uint64_t txg = spa_last_synced_txg(spa) != 0 ? ! spa_last_synced_txg(spa) : -1ULL; if ((label = vdev_label_read_config(vd, txg)) == NULL) { vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN, VDEV_AUX_BAD_LABEL); return (0); } /* * Determine if this vdev has been split off into another
*** 1559,1671 **** if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID, &aux_guid) == 0 && aux_guid == spa_guid(spa)) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_SPLIT_POOL); nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: vdev split into other pool"); return (0); } ! if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID, &guid) != 0) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label", - ZPOOL_CONFIG_POOL_GUID); return (0); } - /* - * If config is not trusted then ignore the spa guid check. This is - * necessary because if the machine crashed during a re-guid the new - * guid might have been written to all of the vdev labels, but not the - * cached config. The check will be performed again once we have the - * trusted config from the MOS. - */ - if (spa->spa_trust_config && guid != spa_guid(spa)) { - vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, - VDEV_AUX_CORRUPT_DATA); - nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: vdev label pool_guid doesn't " - "match config (%llu != %llu)", (u_longlong_t)guid, - (u_longlong_t)spa_guid(spa)); - return (0); - } - if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID, &aux_guid) != 0) aux_guid = 0; - if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID, &guid) != 0) { - vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, - VDEV_AUX_CORRUPT_DATA); - nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label", - ZPOOL_CONFIG_GUID); - return (0); - } - - if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID, &top_guid) - != 0) { - vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, - VDEV_AUX_CORRUPT_DATA); - nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label", - ZPOOL_CONFIG_TOP_GUID); - return (0); - } - /* ! * If this vdev just became a top-level vdev because its sibling was ! * detached, it will have adopted the parent's vdev guid -- but the ! * label may or may not be on disk yet. Fortunately, either version ! * of the label will have the same top guid, so if we're a top-level ! * vdev, we can safely compare to that instead. ! * However, if the config comes from a cachefile that failed to update ! * after the detach, a top-level vdev will appear as a non top-level ! * vdev in the config. Also relax the constraints if we perform an ! * extreme rewind. * * If we split this vdev off instead, then we also check the * original pool's guid. We don't want to consider the vdev * corrupt if it is partway through a split operation. */ ! if (vd->vdev_guid != guid && vd->vdev_guid != aux_guid) { ! boolean_t mismatch = B_FALSE; ! if (spa->spa_trust_config && !spa->spa_extreme_rewind) { ! if (vd != vd->vdev_top || vd->vdev_guid != top_guid) ! mismatch = B_TRUE; ! } else { ! if (vd->vdev_guid != top_guid && ! vd->vdev_top->vdev_guid != guid) ! mismatch = B_TRUE; ! } ! ! 
if (mismatch) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: config guid " - "doesn't match label guid"); - vdev_dbgmsg(vd, "CONFIG: guid %llu, top_guid %llu", - (u_longlong_t)vd->vdev_guid, - (u_longlong_t)vd->vdev_top->vdev_guid); - vdev_dbgmsg(vd, "LABEL: guid %llu, top_guid %llu, " - "aux_guid %llu", (u_longlong_t)guid, - (u_longlong_t)top_guid, (u_longlong_t)aux_guid); return (0); } - } if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE, &state) != 0) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); - vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label", - ZPOOL_CONFIG_POOL_STATE); return (0); } nvlist_free(label); --- 1503,1558 ---- if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID, &aux_guid) == 0 && aux_guid == spa_guid(spa)) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_SPLIT_POOL); nvlist_free(label); return (0); } ! if (strict && (nvlist_lookup_uint64(label, ! ZPOOL_CONFIG_POOL_GUID, &guid) != 0 || ! guid != spa_guid(spa))) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); return (0); } if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl) != 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID, &aux_guid) != 0) aux_guid = 0; /* ! * If this vdev just became a top-level vdev because its ! * sibling was detached, it will have adopted the parent's ! * vdev guid -- but the label may or may not be on disk yet. ! * Fortunately, either version of the label will have the ! * same top guid, so if we're a top-level vdev, we can ! * safely compare to that instead. * * If we split this vdev off instead, then we also check the * original pool's guid. We don't want to consider the vdev * corrupt if it is partway through a split operation. */ ! if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID, ! &guid) != 0 || ! nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID, ! &top_guid) != 0 || ! ((vd->vdev_guid != guid && vd->vdev_guid != aux_guid) && ! (vd->vdev_guid != top_guid || vd != vd->vdev_top))) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); return (0); } if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE, &state) != 0) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); nvlist_free(label); return (0); } nvlist_free(label);
*** 1673,1811 **** * If this is a verbatim import, no need to check the * state of the pool. */ if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) && spa_load_state(spa) == SPA_LOAD_OPEN && ! state != POOL_STATE_ACTIVE) { ! vdev_dbgmsg(vd, "vdev_validate: invalid pool state (%llu) " ! "for spa %s", (u_longlong_t)state, spa->spa_name); return (SET_ERROR(EBADF)); - } /* * If we were able to open and validate a vdev that was * previously marked permanently unavailable, clear that state * now. */ if (vd->vdev_not_present) vd->vdev_not_present = 0; - - return (0); - } - - static void - vdev_copy_path_impl(vdev_t *svd, vdev_t *dvd) - { - if (svd->vdev_path != NULL && dvd->vdev_path != NULL) { - if (strcmp(svd->vdev_path, dvd->vdev_path) != 0) { - zfs_dbgmsg("vdev_copy_path: vdev %llu: path changed " - "from '%s' to '%s'", (u_longlong_t)dvd->vdev_guid, - dvd->vdev_path, svd->vdev_path); - spa_strfree(dvd->vdev_path); - dvd->vdev_path = spa_strdup(svd->vdev_path); } - } else if (svd->vdev_path != NULL) { - dvd->vdev_path = spa_strdup(svd->vdev_path); - zfs_dbgmsg("vdev_copy_path: vdev %llu: path set to '%s'", - (u_longlong_t)dvd->vdev_guid, dvd->vdev_path); - } - } - /* - * Recursively copy vdev paths from one vdev to another. Source and destination - * vdev trees must have same geometry otherwise return error. Intended to copy - * paths from userland config into MOS config. - */ - int - vdev_copy_path_strict(vdev_t *svd, vdev_t *dvd) - { - if ((svd->vdev_ops == &vdev_missing_ops) || - (svd->vdev_ishole && dvd->vdev_ishole) || - (dvd->vdev_ops == &vdev_indirect_ops)) return (0); - - if (svd->vdev_ops != dvd->vdev_ops) { - vdev_dbgmsg(svd, "vdev_copy_path: vdev type mismatch: %s != %s", - svd->vdev_ops->vdev_op_type, dvd->vdev_ops->vdev_op_type); - return (SET_ERROR(EINVAL)); - } - - if (svd->vdev_guid != dvd->vdev_guid) { - vdev_dbgmsg(svd, "vdev_copy_path: guids mismatch (%llu != " - "%llu)", (u_longlong_t)svd->vdev_guid, - (u_longlong_t)dvd->vdev_guid); - return (SET_ERROR(EINVAL)); - } - - if (svd->vdev_children != dvd->vdev_children) { - vdev_dbgmsg(svd, "vdev_copy_path: children count mismatch: " - "%llu != %llu", (u_longlong_t)svd->vdev_children, - (u_longlong_t)dvd->vdev_children); - return (SET_ERROR(EINVAL)); - } - - for (uint64_t i = 0; i < svd->vdev_children; i++) { - int error = vdev_copy_path_strict(svd->vdev_child[i], - dvd->vdev_child[i]); - if (error != 0) - return (error); - } - - if (svd->vdev_ops->vdev_op_leaf) - vdev_copy_path_impl(svd, dvd); - - return (0); } - static void - vdev_copy_path_search(vdev_t *stvd, vdev_t *dvd) - { - ASSERT(stvd->vdev_top == stvd); - ASSERT3U(stvd->vdev_id, ==, dvd->vdev_top->vdev_id); - - for (uint64_t i = 0; i < dvd->vdev_children; i++) { - vdev_copy_path_search(stvd, dvd->vdev_child[i]); - } - - if (!dvd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(dvd)) - return; - - /* - * The idea here is that while a vdev can shift positions within - * a top vdev (when replacing, attaching mirror, etc.) it cannot - * step outside of it. - */ - vdev_t *vd = vdev_lookup_by_guid(stvd, dvd->vdev_guid); - - if (vd == NULL || vd->vdev_ops != dvd->vdev_ops) - return; - - ASSERT(vd->vdev_ops->vdev_op_leaf); - - vdev_copy_path_impl(vd, dvd); - } - /* - * Recursively copy vdev paths from one root vdev to another. Source and - * destination vdev trees may differ in geometry. For each destination leaf - * vdev, search a vdev with the same guid and top vdev id in the source. - * Intended to copy paths from userland config into MOS config. 
- */ - void - vdev_copy_path_relaxed(vdev_t *srvd, vdev_t *drvd) - { - uint64_t children = MIN(srvd->vdev_children, drvd->vdev_children); - ASSERT(srvd->vdev_ops == &vdev_root_ops); - ASSERT(drvd->vdev_ops == &vdev_root_ops); - - for (uint64_t i = 0; i < children; i++) { - vdev_copy_path_search(srvd->vdev_child[i], - drvd->vdev_child[i]); - } - } - - /* * Close a virtual device. */ void vdev_close(vdev_t *vd) { --- 1560,1585 ---- * If this is a verbatim import, no need to check the * state of the pool. */ if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) && spa_load_state(spa) == SPA_LOAD_OPEN && ! state != POOL_STATE_ACTIVE) return (SET_ERROR(EBADF)); /* * If we were able to open and validate a vdev that was * previously marked permanently unavailable, clear that state * now. */ if (vd->vdev_not_present) vd->vdev_not_present = 0; } return (0); } /* * Close a virtual device. */ void vdev_close(vdev_t *vd) {
*** 1893,1906 **** */ if (vd->vdev_aux) { (void) vdev_validate_aux(vd); if (vdev_readable(vd) && vdev_writeable(vd) && vd->vdev_aux == &spa->spa_l2cache && ! !l2arc_vdev_present(vd)) ! l2arc_add_vdev(spa, vd); } else { ! (void) vdev_validate(vd); } /* * Reassess parent vdev's health. */ --- 1667,1686 ---- */ if (vd->vdev_aux) { (void) vdev_validate_aux(vd); if (vdev_readable(vd) && vdev_writeable(vd) && vd->vdev_aux == &spa->spa_l2cache && ! !l2arc_vdev_present(vd)) { ! /* ! * When reopening we can assume persistent L2ARC is ! * supported, since we've already opened the device ! * in the past and prepended an L2ARC uberblock. ! */ ! l2arc_add_vdev(spa, vd, B_TRUE); ! } } else { ! (void) vdev_validate(vd, B_TRUE); } /* * Reassess parent vdev's health. */
*** 1949,1960 **** void vdev_dirty(vdev_t *vd, int flags, void *arg, uint64_t txg) { ASSERT(vd == vd->vdev_top); ! /* indirect vdevs don't have metaslabs or dtls */ ! ASSERT(vdev_is_concrete(vd) || flags == 0); ASSERT(ISP2(flags)); ASSERT(spa_writeable(vd->vdev_spa)); if (flags & VDD_METASLAB) (void) txg_list_add(&vd->vdev_ms_list, arg, txg); --- 1729,1739 ---- void vdev_dirty(vdev_t *vd, int flags, void *arg, uint64_t txg) { ASSERT(vd == vd->vdev_top); ! ASSERT(!vd->vdev_ishole); ASSERT(ISP2(flags)); ASSERT(spa_writeable(vd->vdev_spa)); if (flags & VDD_METASLAB) (void) txg_list_add(&vd->vdev_ms_list, arg, txg);
*** 2020,2033 **** ASSERT(t < DTL_TYPES); ASSERT(vd != vd->vdev_spa->spa_root_vdev); ASSERT(spa_writeable(vd->vdev_spa)); ! mutex_enter(&vd->vdev_dtl_lock); if (!range_tree_contains(rt, txg, size)) range_tree_add(rt, txg, size); ! mutex_exit(&vd->vdev_dtl_lock); } boolean_t vdev_dtl_contains(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size) { --- 1799,1812 ---- ASSERT(t < DTL_TYPES); ASSERT(vd != vd->vdev_spa->spa_root_vdev); ASSERT(spa_writeable(vd->vdev_spa)); ! mutex_enter(rt->rt_lock); if (!range_tree_contains(rt, txg, size)) range_tree_add(rt, txg, size); ! mutex_exit(rt->rt_lock); } boolean_t vdev_dtl_contains(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size) {
*** 2035,2059 **** boolean_t dirty = B_FALSE; ASSERT(t < DTL_TYPES); ASSERT(vd != vd->vdev_spa->spa_root_vdev); ! /* ! * While we are loading the pool, the DTLs have not been loaded yet. ! * Ignore the DTLs and try all devices. This avoids a recursive ! * mutex enter on the vdev_dtl_lock, and also makes us try hard ! * when loading the pool (relying on the checksum to ensure that ! * we get the right data -- note that we while loading, we are ! * only reading the MOS, which is always checksummed). ! */ ! if (vd->vdev_spa->spa_load_state != SPA_LOAD_NONE) ! return (B_FALSE); ! ! mutex_enter(&vd->vdev_dtl_lock); if (range_tree_space(rt) != 0) dirty = range_tree_contains(rt, txg, size); ! mutex_exit(&vd->vdev_dtl_lock); return (dirty); } boolean_t --- 1814,1827 ---- boolean_t dirty = B_FALSE; ASSERT(t < DTL_TYPES); ASSERT(vd != vd->vdev_spa->spa_root_vdev); ! mutex_enter(rt->rt_lock); if (range_tree_space(rt) != 0) dirty = range_tree_contains(rt, txg, size); ! mutex_exit(rt->rt_lock); return (dirty); } boolean_t
*** 2060,2072 **** vdev_dtl_empty(vdev_t *vd, vdev_dtl_type_t t) { range_tree_t *rt = vd->vdev_dtl[t]; boolean_t empty; ! mutex_enter(&vd->vdev_dtl_lock); empty = (range_tree_space(rt) == 0); ! mutex_exit(&vd->vdev_dtl_lock); return (empty); } /* --- 1828,1840 ---- vdev_dtl_empty(vdev_t *vd, vdev_dtl_type_t t) { range_tree_t *rt = vd->vdev_dtl[t]; boolean_t empty; ! mutex_enter(rt->rt_lock); empty = (range_tree_space(rt) == 0); ! mutex_exit(rt->rt_lock); return (empty); } /*
*** 2155,2165 **** for (int c = 0; c < vd->vdev_children; c++) vdev_dtl_reassess(vd->vdev_child[c], txg, scrub_txg, scrub_done); ! if (vd == spa->spa_root_vdev || !vdev_is_concrete(vd) || vd->vdev_aux) return; if (vd->vdev_ops->vdev_op_leaf) { dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan; --- 1923,1933 ---- for (int c = 0; c < vd->vdev_children; c++) vdev_dtl_reassess(vd->vdev_child[c], txg, scrub_txg, scrub_done); ! if (vd == spa->spa_root_vdev || vd->vdev_ishole || vd->vdev_aux) return; if (vd->vdev_ops->vdev_op_leaf) { dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
*** 2261,2274 **** spa_t *spa = vd->vdev_spa; objset_t *mos = spa->spa_meta_objset; int error = 0; if (vd->vdev_ops->vdev_op_leaf && vd->vdev_dtl_object != 0) { ! ASSERT(vdev_is_concrete(vd)); error = space_map_open(&vd->vdev_dtl_sm, mos, ! vd->vdev_dtl_object, 0, -1ULL, 0); if (error) return (error); ASSERT(vd->vdev_dtl_sm != NULL); mutex_enter(&vd->vdev_dtl_lock); --- 2029,2042 ---- spa_t *spa = vd->vdev_spa; objset_t *mos = spa->spa_meta_objset; int error = 0; if (vd->vdev_ops->vdev_op_leaf && vd->vdev_dtl_object != 0) { ! ASSERT(!vd->vdev_ishole); error = space_map_open(&vd->vdev_dtl_sm, mos, ! vd->vdev_dtl_object, 0, -1ULL, 0, &vd->vdev_dtl_lock); if (error) return (error); ASSERT(vd->vdev_dtl_sm != NULL); mutex_enter(&vd->vdev_dtl_lock);
*** 2343,2356 **** { spa_t *spa = vd->vdev_spa; range_tree_t *rt = vd->vdev_dtl[DTL_MISSING]; objset_t *mos = spa->spa_meta_objset; range_tree_t *rtsync; dmu_tx_t *tx; uint64_t object = space_map_object(vd->vdev_dtl_sm); ! ASSERT(vdev_is_concrete(vd)); ASSERT(vd->vdev_ops->vdev_op_leaf); tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); if (vd->vdev_detached || vd->vdev_top->vdev_removing) { --- 2111,2125 ---- { spa_t *spa = vd->vdev_spa; range_tree_t *rt = vd->vdev_dtl[DTL_MISSING]; objset_t *mos = spa->spa_meta_objset; range_tree_t *rtsync; + kmutex_t rtlock; dmu_tx_t *tx; uint64_t object = space_map_object(vd->vdev_dtl_sm); ! ASSERT(!vd->vdev_ishole); ASSERT(vd->vdev_ops->vdev_op_leaf); tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); if (vd->vdev_detached || vd->vdev_top->vdev_removing) {
*** 2364,2374 **** * We only destroy the leaf ZAP for detached leaves or for * removed log devices. Removed data devices handle leaf ZAP * cleanup later, once cancellation is no longer possible. */ if (vd->vdev_leaf_zap != 0 && (vd->vdev_detached || ! vd->vdev_top->vdev_islog)) { vdev_destroy_unlink_zap(vd, vd->vdev_leaf_zap, tx); vd->vdev_leaf_zap = 0; } dmu_tx_commit(tx); --- 2133,2143 ---- * We only destroy the leaf ZAP for detached leaves or for * removed log devices. Removed data devices handle leaf ZAP * cleanup later, once cancellation is no longer possible. */ if (vd->vdev_leaf_zap != 0 && (vd->vdev_detached || ! vd->vdev_top->vdev_islog || vd->vdev_top->vdev_isspecial)) { vdev_destroy_unlink_zap(vd, vd->vdev_leaf_zap, tx); vd->vdev_leaf_zap = 0; } dmu_tx_commit(tx);
*** 2380,2395 **** new_object = space_map_alloc(mos, tx); VERIFY3U(new_object, !=, 0); VERIFY0(space_map_open(&vd->vdev_dtl_sm, mos, new_object, ! 0, -1ULL, 0)); ASSERT(vd->vdev_dtl_sm != NULL); } ! rtsync = range_tree_create(NULL, NULL); mutex_enter(&vd->vdev_dtl_lock); range_tree_walk(rt, range_tree_add, rtsync); mutex_exit(&vd->vdev_dtl_lock); space_map_truncate(vd->vdev_dtl_sm, tx); --- 2149,2168 ---- new_object = space_map_alloc(mos, tx); VERIFY3U(new_object, !=, 0); VERIFY0(space_map_open(&vd->vdev_dtl_sm, mos, new_object, ! 0, -1ULL, 0, &vd->vdev_dtl_lock)); ASSERT(vd->vdev_dtl_sm != NULL); } ! mutex_init(&rtlock, NULL, MUTEX_DEFAULT, NULL); + rtsync = range_tree_create(NULL, NULL, &rtlock); + + mutex_enter(&rtlock); + mutex_enter(&vd->vdev_dtl_lock); range_tree_walk(rt, range_tree_add, rtsync); mutex_exit(&vd->vdev_dtl_lock); space_map_truncate(vd->vdev_dtl_sm, tx);
*** 2396,2414 **** space_map_write(vd->vdev_dtl_sm, rtsync, SM_ALLOC, tx); range_tree_vacate(rtsync, NULL, NULL); range_tree_destroy(rtsync); /* * If the object for the space map has changed then dirty * the top level so that we update the config. */ if (object != space_map_object(vd->vdev_dtl_sm)) { ! vdev_dbgmsg(vd, "txg %llu, spa %s, DTL old object %llu, " ! "new object %llu", (u_longlong_t)txg, spa_name(spa), ! (u_longlong_t)object, ! (u_longlong_t)space_map_object(vd->vdev_dtl_sm)); vdev_config_dirty(vd->vdev_top); } dmu_tx_commit(tx); --- 2169,2189 ---- space_map_write(vd->vdev_dtl_sm, rtsync, SM_ALLOC, tx); range_tree_vacate(rtsync, NULL, NULL); range_tree_destroy(rtsync); + mutex_exit(&rtlock); + mutex_destroy(&rtlock); + /* * If the object for the space map has changed then dirty * the top level so that we update the config. */ if (object != space_map_object(vd->vdev_dtl_sm)) { ! zfs_dbgmsg("txg %llu, spa %s, DTL old object %llu, " ! "new object %llu", txg, spa_name(spa), object, ! space_map_object(vd->vdev_dtl_sm)); vdev_config_dirty(vd->vdev_top); } dmu_tx_commit(tx);
*** 2489,2564 **** *maxp = thismax; } return (needed); } ! int vdev_load(vdev_t *vd) { - int error = 0; /* * Recursively load all children. */ ! for (int c = 0; c < vd->vdev_children; c++) { ! error = vdev_load(vd->vdev_child[c]); ! if (error != 0) { ! return (error); ! } ! } - vdev_set_deflate_ratio(vd); - /* * If this is a top-level vdev, initialize its metaslabs. */ ! if (vd == vd->vdev_top && vdev_is_concrete(vd)) { ! if (vd->vdev_ashift == 0 || vd->vdev_asize == 0) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); - vdev_dbgmsg(vd, "vdev_load: invalid size. ashift=%llu, " - "asize=%llu", (u_longlong_t)vd->vdev_ashift, - (u_longlong_t)vd->vdev_asize); - return (SET_ERROR(ENXIO)); - } else if ((error = vdev_metaslab_init(vd, 0)) != 0) { - vdev_dbgmsg(vd, "vdev_load: metaslab_init failed " - "[error=%d]", error); - vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, - VDEV_AUX_CORRUPT_DATA); - return (error); - } - } /* * If this is a leaf vdev, load its DTL. */ ! if (vd->vdev_ops->vdev_op_leaf && (error = vdev_dtl_load(vd)) != 0) { vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); - vdev_dbgmsg(vd, "vdev_load: vdev_dtl_load failed " - "[error=%d]", error); - return (error); - } - - uint64_t obsolete_sm_object = vdev_obsolete_sm_object(vd); - if (obsolete_sm_object != 0) { - objset_t *mos = vd->vdev_spa->spa_meta_objset; - ASSERT(vd->vdev_asize != 0); - ASSERT(vd->vdev_obsolete_sm == NULL); - - if ((error = space_map_open(&vd->vdev_obsolete_sm, mos, - obsolete_sm_object, 0, vd->vdev_asize, 0))) { - vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, - VDEV_AUX_CORRUPT_DATA); - vdev_dbgmsg(vd, "vdev_load: space_map_open failed for " - "obsolete spacemap (obj %llu) [error=%d]", - (u_longlong_t)obsolete_sm_object, error); - return (error); - } - space_map_update(vd->vdev_obsolete_sm); - } - - return (0); } /* * The special vdev case is used for hot spares and l2cache devices. Its * sole purpose it to set the vdev state for the associated vdev. To do this, --- 2264,2297 ---- *maxp = thismax; } return (needed); } ! void vdev_load(vdev_t *vd) { /* * Recursively load all children. */ ! for (int c = 0; c < vd->vdev_children; c++) ! vdev_load(vd->vdev_child[c]); /* * If this is a top-level vdev, initialize its metaslabs. */ ! if (vd == vd->vdev_top && !vd->vdev_ishole && ! (vd->vdev_ashift == 0 || vd->vdev_asize == 0 || ! vdev_metaslab_init(vd, 0) != 0)) vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); /* * If this is a leaf vdev, load its DTL. */ ! if (vd->vdev_ops->vdev_op_leaf && vdev_dtl_load(vd) != 0) vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN, VDEV_AUX_CORRUPT_DATA); } /* * The special vdev case is used for hot spares and l2cache devices. Its * sole purpose it to set the vdev state for the associated vdev. To do this,
*** 2599,2644 **** */ nvlist_free(label); return (0); } - /* - * Free the objects used to store this vdev's spacemaps, and the array - * that points to them. - */ void ! vdev_destroy_spacemaps(vdev_t *vd, dmu_tx_t *tx) { - if (vd->vdev_ms_array == 0) - return; - - objset_t *mos = vd->vdev_spa->spa_meta_objset; - uint64_t array_count = vd->vdev_asize >> vd->vdev_ms_shift; - size_t array_bytes = array_count * sizeof (uint64_t); - uint64_t *smobj_array = kmem_alloc(array_bytes, KM_SLEEP); - VERIFY0(dmu_read(mos, vd->vdev_ms_array, 0, - array_bytes, smobj_array, 0)); - - for (uint64_t i = 0; i < array_count; i++) { - uint64_t smobj = smobj_array[i]; - if (smobj == 0) - continue; - - space_map_free_obj(mos, smobj, tx); - } - - kmem_free(smobj_array, array_bytes); - VERIFY0(dmu_object_free(mos, vd->vdev_ms_array, tx)); - vd->vdev_ms_array = 0; - } - - static void - vdev_remove_empty(vdev_t *vd, uint64_t txg) - { spa_t *spa = vd->vdev_spa; dmu_tx_t *tx; ASSERT(vd == vd->vdev_top); ASSERT3U(txg, ==, spa_syncing_txg(spa)); if (vd->vdev_ms != NULL) { metaslab_group_t *mg = vd->vdev_mg; --- 2332,2349 ---- */ nvlist_free(label); return (0); } void ! vdev_remove(vdev_t *vd, uint64_t txg) { spa_t *spa = vd->vdev_spa; + objset_t *mos = spa->spa_meta_objset; dmu_tx_t *tx; + tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg); ASSERT(vd == vd->vdev_top); ASSERT3U(txg, ==, spa_syncing_txg(spa)); if (vd->vdev_ms != NULL) { metaslab_group_t *mg = vd->vdev_mg;
*** 2661,2685 **** * and metaslab class are up-to-date. */ metaslab_group_histogram_remove(mg, msp); VERIFY0(space_map_allocated(msp->ms_sm)); space_map_close(msp->ms_sm); msp->ms_sm = NULL; mutex_exit(&msp->ms_lock); } metaslab_group_histogram_verify(mg); metaslab_class_histogram_verify(mg->mg_class); for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) ASSERT0(mg->mg_histogram[i]); } ! tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg); ! vdev_destroy_spacemaps(vd, tx); ! if (vd->vdev_islog && vd->vdev_top_zap != 0) { vdev_destroy_unlink_zap(vd, vd->vdev_top_zap, tx); vd->vdev_top_zap = 0; } dmu_tx_commit(tx); } --- 2366,2395 ---- * and metaslab class are up-to-date. */ metaslab_group_histogram_remove(mg, msp); VERIFY0(space_map_allocated(msp->ms_sm)); + space_map_free(msp->ms_sm, tx); space_map_close(msp->ms_sm); msp->ms_sm = NULL; mutex_exit(&msp->ms_lock); } metaslab_group_histogram_verify(mg); metaslab_class_histogram_verify(mg->mg_class); for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) ASSERT0(mg->mg_histogram[i]); + } ! if (vd->vdev_ms_array) { ! (void) dmu_object_free(mos, vd->vdev_ms_array, tx); ! vd->vdev_ms_array = 0; ! } ! if ((vd->vdev_islog || vd->vdev_isspecial) && ! vd->vdev_top_zap != 0) { vdev_destroy_unlink_zap(vd, vd->vdev_top_zap, tx); vd->vdev_top_zap = 0; } dmu_tx_commit(tx); }
*** 2688,2698 **** vdev_sync_done(vdev_t *vd, uint64_t txg) { metaslab_t *msp; boolean_t reassess = !txg_list_empty(&vd->vdev_ms_list, TXG_CLEAN(txg)); ! ASSERT(vdev_is_concrete(vd)); while (msp = txg_list_remove(&vd->vdev_ms_list, TXG_CLEAN(txg))) metaslab_sync_done(msp, txg); if (reassess) --- 2398,2408 ---- vdev_sync_done(vdev_t *vd, uint64_t txg) { metaslab_t *msp; boolean_t reassess = !txg_list_empty(&vd->vdev_ms_list, TXG_CLEAN(txg)); ! ASSERT(!vd->vdev_ishole); while (msp = txg_list_remove(&vd->vdev_ms_list, TXG_CLEAN(txg))) metaslab_sync_done(msp, txg); if (reassess)
*** 2705,2767 **** spa_t *spa = vd->vdev_spa; vdev_t *lvd; metaslab_t *msp; dmu_tx_t *tx; ! if (range_tree_space(vd->vdev_obsolete_segments) > 0) { ! dmu_tx_t *tx; ! ASSERT(vd->vdev_removing || ! vd->vdev_ops == &vdev_indirect_ops); ! ! tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); ! vdev_indirect_sync_obsolete(vd, tx); ! dmu_tx_commit(tx); ! ! /* ! * If the vdev is indirect, it can't have dirty ! * metaslabs or DTLs. ! */ ! if (vd->vdev_ops == &vdev_indirect_ops) { ! ASSERT(txg_list_empty(&vd->vdev_ms_list, txg)); ! ASSERT(txg_list_empty(&vd->vdev_dtl_list, txg)); ! return; ! } ! } ! ! ASSERT(vdev_is_concrete(vd)); ! ! if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0 && ! !vd->vdev_removing) { ASSERT(vd == vd->vdev_top); - ASSERT0(vd->vdev_indirect_config.vic_mapping_object); tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); vd->vdev_ms_array = dmu_object_alloc(spa->spa_meta_objset, DMU_OT_OBJECT_ARRAY, 0, DMU_OT_NONE, 0, tx); ASSERT(vd->vdev_ms_array != 0); vdev_config_dirty(vd); dmu_tx_commit(tx); } while ((msp = txg_list_remove(&vd->vdev_ms_list, txg)) != NULL) { metaslab_sync(msp, txg); (void) txg_list_add(&vd->vdev_ms_list, msp, TXG_CLEAN(txg)); } while ((lvd = txg_list_remove(&vd->vdev_dtl_list, txg)) != NULL) vdev_dtl_sync(lvd, txg); - /* - * Remove the metadata associated with this vdev once it's empty. - * Note that this is typically used for log/cache device removal; - * we don't empty toplevel vdevs when removing them. But if - * a toplevel happens to be emptied, this is not harmful. - */ - if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing) { - vdev_remove_empty(vd, txg); - } - (void) txg_list_add(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)); } uint64_t vdev_psize_to_asize(vdev_t *vd, uint64_t psize) --- 2415,2450 ---- spa_t *spa = vd->vdev_spa; vdev_t *lvd; metaslab_t *msp; dmu_tx_t *tx; ! ASSERT(!vd->vdev_ishole); ! if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0) { ASSERT(vd == vd->vdev_top); tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); vd->vdev_ms_array = dmu_object_alloc(spa->spa_meta_objset, DMU_OT_OBJECT_ARRAY, 0, DMU_OT_NONE, 0, tx); ASSERT(vd->vdev_ms_array != 0); vdev_config_dirty(vd); dmu_tx_commit(tx); } + /* + * Remove the metadata associated with this vdev once it's empty. + */ + if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing) + vdev_remove(vd, txg); + while ((msp = txg_list_remove(&vd->vdev_ms_list, txg)) != NULL) { metaslab_sync(msp, txg); (void) txg_list_add(&vd->vdev_ms_list, msp, TXG_CLEAN(txg)); } while ((lvd = txg_list_remove(&vd->vdev_dtl_list, txg)) != NULL) vdev_dtl_sync(lvd, txg); (void) txg_list_add(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)); } uint64_t vdev_psize_to_asize(vdev_t *vd, uint64_t psize)
*** 2881,2892 **** wasoffline = (vd->vdev_offline || vd->vdev_tmpoffline); oldstate = vd->vdev_state; tvd = vd->vdev_top; ! vd->vdev_offline = B_FALSE; ! vd->vdev_tmpoffline = B_FALSE; vd->vdev_checkremove = !!(flags & ZFS_ONLINE_CHECKREMOVE); vd->vdev_forcefault = !!(flags & ZFS_ONLINE_FORCEFAULT); /* XXX - L2ARC 1.0 does not support expansion */ if (!vd->vdev_aux) { --- 2564,2575 ---- wasoffline = (vd->vdev_offline || vd->vdev_tmpoffline); oldstate = vd->vdev_state; tvd = vd->vdev_top; ! vd->vdev_offline = 0ULL; ! vd->vdev_tmpoffline = 0ULL; vd->vdev_checkremove = !!(flags & ZFS_ONLINE_CHECKREMOVE); vd->vdev_forcefault = !!(flags & ZFS_ONLINE_FORCEFAULT); /* XXX - L2ARC 1.0 does not support expansion */ if (!vd->vdev_aux) {
*** 2971,2981 **** * Prevent any future allocations. */ metaslab_group_passivate(mg); (void) spa_vdev_state_exit(spa, vd, 0); ! error = spa_reset_logs(spa); spa_vdev_state_enter(spa, SCL_ALLOC); /* * Check to see if the config has changed. --- 2654,2664 ---- * Prevent any future allocations. */ metaslab_group_passivate(mg); (void) spa_vdev_state_exit(spa, vd, 0); ! error = spa_offline_log(spa); spa_vdev_state_enter(spa, SCL_ALLOC); /* * Check to see if the config has changed.
*** 3038,3067 **** * children. If 'vd' is NULL, then the user wants to clear all vdevs. */ void vdev_clear(spa_t *spa, vdev_t *vd) { vdev_t *rvd = spa->spa_root_vdev; ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL); ! if (vd == NULL) vd = rvd; vd->vdev_stat.vs_read_errors = 0; vd->vdev_stat.vs_write_errors = 0; vd->vdev_stat.vs_checksum_errors = 0; - for (int c = 0; c < vd->vdev_children; c++) - vdev_clear(spa, vd->vdev_child[c]); - /* ! * It makes no sense to "clear" an indirect vdev. */ ! if (!vdev_is_concrete(vd)) ! return; /* * If we're in the FAULTED state or have experienced failed I/O, then * clear the persistent state and attempt to reopen the device. We * also mark the vdev config dirty, so that the new faulted state is * written out to disk. --- 2721,2765 ---- * children. If 'vd' is NULL, then the user wants to clear all vdevs. */ void vdev_clear(spa_t *spa, vdev_t *vd) { + int c; vdev_t *rvd = spa->spa_root_vdev; ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL); ! if (vd == NULL) { vd = rvd; + /* Go through spare and l2cache vdevs */ + for (c = 0; c < spa->spa_spares.sav_count; c++) + vdev_clear(spa, spa->spa_spares.sav_vdevs[c]); + for (c = 0; c < spa->spa_l2cache.sav_count; c++) + vdev_clear(spa, spa->spa_l2cache.sav_vdevs[c]); + } + vd->vdev_stat.vs_read_errors = 0; vd->vdev_stat.vs_write_errors = 0; vd->vdev_stat.vs_checksum_errors = 0; /* ! * If all disk vdevs failed at the same time (e.g. due to a ! * disconnected cable), that suspends I/O activity to the pool, ! * which stalls spa_sync if there happened to be any dirty data. ! * As a consequence, this flag might not be cleared, because it ! * is only lowered by spa_async_remove (which cannot run). This ! * then prevents zio_resume from succeeding even if vdev reopen ! * succeeds, leading to an indefinitely suspended pool. So we ! * lower the flag here to allow zio_resume to succeed, provided ! * reopening of the vdevs succeeds. */ ! vd->vdev_remove_wanted = B_FALSE; + for (c = 0; c < vd->vdev_children; c++) + vdev_clear(spa, vd->vdev_child[c]); + /* * If we're in the FAULTED state or have experienced failed I/O, then * clear the persistent state and attempt to reopen the device. We * also mark the vdev config dirty, so that the new faulted state is * written out to disk.
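The hunk above changes vdev_clear() so that a pool-wide clear also walks the spare and l2cache auxiliary vdevs, and lowers vdev_remove_wanted before recursing so that a subsequent reopen and zio_resume can un-suspend the pool. As a rough illustration of that traversal order only, here is a minimal user-space sketch; node_t, pool_t and their field names are hypothetical stand-ins, not the kernel structures.

/*
 * Simplified sketch of the clearing order added in the hunk above: when
 * asked to clear the whole tree, first visit the auxiliary (spare/l2cache)
 * devices, then reset the node's own error counters and pending-removal
 * flag, then recurse into the children.
 */
#include <stddef.h>

typedef struct node {
	int		read_errors;
	int		write_errors;
	int		checksum_errors;
	int		remove_wanted;	/* analogous to vdev_remove_wanted */
	size_t		nchildren;
	struct node	**children;
} node_t;

typedef struct pool {
	node_t	*root;
	node_t	**spares;
	size_t	nspares;
	node_t	**l2cache;
	size_t	nl2cache;
} pool_t;

void
clear_errors(pool_t *p, node_t *n)
{
	if (n == NULL) {	/* clear everything, aux devices included */
		n = p->root;
		for (size_t i = 0; i < p->nspares; i++)
			clear_errors(p, p->spares[i]);
		for (size_t i = 0; i < p->nl2cache; i++)
			clear_errors(p, p->l2cache[i]);
	}

	n->read_errors = n->write_errors = n->checksum_errors = 0;

	/*
	 * Drop the pending-removal flag before descending, mirroring how
	 * the hunk lowers vdev_remove_wanted so a later reopen/resume can
	 * succeed.
	 */
	n->remove_wanted = 0;

	for (size_t i = 0; i < n->nchildren; i++)
		clear_errors(p, n->children[i]);
}

int
main(void)
{
	node_t leaf = { .read_errors = 3, .remove_wanted = 1 };
	node_t *kids[] = { &leaf };
	node_t root = { .write_errors = 1, .nchildren = 1, .children = kids };
	pool_t p = { .root = &root };

	clear_errors(&p, NULL);	/* clears root, leaf and (empty) aux lists */
	return (leaf.read_errors + root.write_errors + leaf.remove_wanted);
}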
*** 3112,3137 **** * This simplifies the code since we don't have to check for * these types of devices in the various code paths. * Instead we rely on the fact that we skip over dead devices * before issuing I/O to them. */ ! return (vd->vdev_state < VDEV_STATE_DEGRADED || ! vd->vdev_ops == &vdev_hole_ops || vd->vdev_ops == &vdev_missing_ops); } boolean_t vdev_readable(vdev_t *vd) { ! return (!vdev_is_dead(vd) && !vd->vdev_cant_read); } boolean_t vdev_writeable(vdev_t *vd) { ! return (!vdev_is_dead(vd) && !vd->vdev_cant_write && ! vdev_is_concrete(vd)); } boolean_t vdev_allocatable(vdev_t *vd) { --- 2810,2833 ---- * This simplifies the code since we don't have to check for * these types of devices in the various code paths. * Instead we rely on the fact that we skip over dead devices * before issuing I/O to them. */ ! return (vd->vdev_state < VDEV_STATE_DEGRADED || vd->vdev_ishole || vd->vdev_ops == &vdev_missing_ops); } boolean_t vdev_readable(vdev_t *vd) { ! return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read); } boolean_t vdev_writeable(vdev_t *vd) { ! return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write); } boolean_t vdev_allocatable(vdev_t *vd) {
*** 3144,3154 **** * the proper locks. Note that we have to get the vdev state * in a local variable because although it changes atomically, * we're asking two separate questions about it. */ return (!(state < VDEV_STATE_DEGRADED && state != VDEV_STATE_CLOSED) && ! !vd->vdev_cant_write && vdev_is_concrete(vd) && vd->vdev_mg->mg_initialized); } boolean_t vdev_accessible(vdev_t *vd, zio_t *zio) --- 2840,2850 ---- * the proper locks. Note that we have to get the vdev state * in a local variable because although it changes atomically, * we're asking two separate questions about it. */ return (!(state < VDEV_STATE_DEGRADED && state != VDEV_STATE_CLOSED) && ! !vd->vdev_cant_write && !vd->vdev_ishole && vd->vdev_mg->mg_initialized); } boolean_t vdev_accessible(vdev_t *vd, zio_t *zio)
*** 3193,3204 **** */ if (vd->vdev_aux == NULL && tvd != NULL) { vs->vs_esize = P2ALIGN(vd->vdev_max_asize - vd->vdev_asize - spa->spa_bootsize, 1ULL << tvd->vdev_ms_shift); } ! if (vd->vdev_aux == NULL && vd == vd->vdev_top && ! vdev_is_concrete(vd)) { vs->vs_fragmentation = vd->vdev_mg->mg_fragmentation; } /* * If we're getting stats on the root vdev, aggregate the I/O counts --- 2889,2899 ---- */ if (vd->vdev_aux == NULL && tvd != NULL) { vs->vs_esize = P2ALIGN(vd->vdev_max_asize - vd->vdev_asize - spa->spa_bootsize, 1ULL << tvd->vdev_ms_shift); } ! if (vd->vdev_aux == NULL && vd == vd->vdev_top && !vd->vdev_ishole) { vs->vs_fragmentation = vd->vdev_mg->mg_fragmentation; } /* * If we're getting stats on the root vdev, aggregate the I/O counts
*** 3210,3219 **** --- 2905,2916 ---- vdev_stat_t *cvs = &cvd->vdev_stat; for (int t = 0; t < ZIO_TYPES; t++) { vs->vs_ops[t] += cvs->vs_ops[t]; vs->vs_bytes[t] += cvs->vs_bytes[t]; + vs->vs_iotime[t] += cvs->vs_iotime[t]; + vs->vs_latency[t] += cvs->vs_latency[t]; } cvs->vs_scan_removing = cvd->vdev_removing; } } mutex_exit(&vd->vdev_stat_lock);
*** 3302,3311 **** --- 2999,3022 ---- } vs->vs_ops[type]++; vs->vs_bytes[type] += psize; + /* + * While measuring each delta in nanoseconds, we should keep + * cumulative iotime in microseconds so it doesn't overflow on + * a busy system. + */ + vs->vs_iotime[type] += (zio->io_vd_timestamp) / 1000; + + /* + * Latency is an exponential moving average of iotime deltas + * with tuneable alpha measured in 1/10th of percent. + */ + vs->vs_latency[type] += ((int64_t)zio->io_vd_timestamp - + vs->vs_latency[type]) * zfs_vs_latency_alpha / 1000; + mutex_exit(&vd->vdev_stat_lock); return; } if (flags & ZIO_FLAG_SPECULATIVE)
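Two conventions are introduced in the hunk above: cumulative vs_iotime is kept in microseconds (each nanosecond delta is divided by 1000) to avoid overflow, and vs_latency is an exponential moving average that moves toward each new device-time sample by zfs_vs_latency_alpha tenths of a percent. A stand-alone sketch of that moving-average update follows; the default value of zfs_vs_latency_alpha is not shown in the hunk, so alpha is a parameter here and the sample values are illustrative.

#include <stdint.h>
#include <stdio.h>

/*
 * Minimal sketch of the moving-average update used for vs_latency[] above.
 * `alpha` is in tenths of a percent (1000 would weight the new sample at
 * 100%), matching how zfs_vs_latency_alpha is divided by 1000 in the hunk.
 */
static int64_t
latency_ema(int64_t current, int64_t sample_ns, int64_t alpha)
{
	return (current + (sample_ns - current) * alpha / 1000);
}

int
main(void)
{
	int64_t lat = 0;
	const int64_t alpha = 100;	/* hypothetical: 10% weight per sample */
	const int64_t samples[] = { 200000, 180000, 900000, 210000 };

	for (size_t i = 0; i < sizeof (samples) / sizeof (samples[0]); i++) {
		lat = latency_ema(lat, samples[i], alpha);
		printf("after sample %zu: %lld ns\n", i, (long long)lat);
	}
	return (0);
}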
*** 3338,3349 **** } if (type == ZIO_TYPE_WRITE && !vdev_is_dead(vd)) vs->vs_write_errors++; mutex_exit(&vd->vdev_stat_lock); ! if (spa->spa_load_state == SPA_LOAD_NONE && ! type == ZIO_TYPE_WRITE && txg != 0 && (!(flags & ZIO_FLAG_IO_REPAIR) || (flags & ZIO_FLAG_SCAN_THREAD) || spa->spa_claiming)) { /* * This is either a normal write (not a repair), or it's --- 3049,3070 ---- } if (type == ZIO_TYPE_WRITE && !vdev_is_dead(vd)) vs->vs_write_errors++; mutex_exit(&vd->vdev_stat_lock); ! if ((vd->vdev_isspecial || vd->vdev_isspecial_child) && ! (vs->vs_checksum_errors != 0 || vs->vs_read_errors != 0 || ! vs->vs_write_errors != 0 || !vdev_readable(vd) || ! !vdev_writeable(vd)) && !spa->spa_special_has_errors) { ! /* all new writes will be placed on normal */ ! cmn_err(CE_WARN, "New writes to special vdev [%s] " ! "will be stopped", (vd->vdev_path != NULL) ? ! vd->vdev_path : "undefined"); ! spa->spa_special_has_errors = B_TRUE; ! } ! ! if (type == ZIO_TYPE_WRITE && txg != 0 && (!(flags & ZIO_FLAG_IO_REPAIR) || (flags & ZIO_FLAG_SCAN_THREAD) || spa->spa_claiming)) { /* * This is either a normal write (not a repair), or it's
*** 3414,3424 **** vd->vdev_stat.vs_alloc += alloc_delta; vd->vdev_stat.vs_space += space_delta; vd->vdev_stat.vs_dspace += dspace_delta; mutex_exit(&vd->vdev_stat_lock); ! if (mc == spa_normal_class(spa)) { mutex_enter(&rvd->vdev_stat_lock); rvd->vdev_stat.vs_alloc += alloc_delta; rvd->vdev_stat.vs_space += space_delta; rvd->vdev_stat.vs_dspace += dspace_delta; mutex_exit(&rvd->vdev_stat_lock); --- 3135,3145 ---- vd->vdev_stat.vs_alloc += alloc_delta; vd->vdev_stat.vs_space += space_delta; vd->vdev_stat.vs_dspace += dspace_delta; mutex_exit(&vd->vdev_stat_lock); ! if (mc == spa_normal_class(spa) || mc == spa_special_class(spa)) { mutex_enter(&rvd->vdev_stat_lock); rvd->vdev_stat.vs_alloc += alloc_delta; rvd->vdev_stat.vs_space += space_delta; rvd->vdev_stat.vs_dspace += dspace_delta; mutex_exit(&rvd->vdev_stat_lock);
*** 3504,3517 **** vdev_config_dirty(rvd->vdev_child[c]); } else { ASSERT(vd == vd->vdev_top); if (!list_link_active(&vd->vdev_config_dirty_node) && ! vdev_is_concrete(vd)) { list_insert_head(&spa->spa_config_dirty_list, vd); } - } } void vdev_config_clean(vdev_t *vd) { --- 3225,3237 ---- vdev_config_dirty(rvd->vdev_child[c]); } else { ASSERT(vd == vd->vdev_top); if (!list_link_active(&vd->vdev_config_dirty_node) && ! !vd->vdev_ishole) list_insert_head(&spa->spa_config_dirty_list, vd); } } void vdev_config_clean(vdev_t *vd) {
*** 3547,3558 **** */ ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) || (dsl_pool_sync_context(spa_get_dsl(spa)) && spa_config_held(spa, SCL_STATE, RW_READER))); ! if (!list_link_active(&vd->vdev_state_dirty_node) && ! vdev_is_concrete(vd)) list_insert_head(&spa->spa_state_dirty_list, vd); } void vdev_state_clean(vdev_t *vd) --- 3267,3277 ---- */ ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) || (dsl_pool_sync_context(spa_get_dsl(spa)) && spa_config_held(spa, SCL_STATE, RW_READER))); ! if (!list_link_active(&vd->vdev_state_dirty_node) && !vd->vdev_ishole) list_insert_head(&spa->spa_state_dirty_list, vd); } void vdev_state_clean(vdev_t *vd)
*** 3582,3595 **** if (vd->vdev_children > 0) { for (int c = 0; c < vd->vdev_children; c++) { child = vd->vdev_child[c]; /* ! * Don't factor holes or indirect vdevs into the ! * decision. */ ! if (!vdev_is_concrete(child)) continue; if (!vdev_readable(child) || (!vdev_writeable(child) && spa_writeable(spa))) { /* --- 3301,3313 ---- if (vd->vdev_children > 0) { for (int c = 0; c < vd->vdev_children; c++) { child = vd->vdev_child[c]; /* ! * Don't factor holes into the decision. */ ! if (child->vdev_ishole) continue; if (!vdev_readable(child) || (!vdev_writeable(child) && spa_writeable(spa))) { /*
*** 3760,3782 **** if (!isopen && vd->vdev_parent) vdev_propagate_state(vd->vdev_parent); } - boolean_t - vdev_children_are_offline(vdev_t *vd) - { - ASSERT(!vd->vdev_ops->vdev_op_leaf); - - for (uint64_t i = 0; i < vd->vdev_children; i++) { - if (vd->vdev_child[i]->vdev_state != VDEV_STATE_OFFLINE) - return (B_FALSE); - } - - return (B_TRUE); - } - /* * Check the vdev configuration to ensure that it's capable of supporting * a root pool. We do not support partial configuration. * In addition, only a single top-level vdev is allowed. */ --- 3478,3487 ----
*** 3787,3798 **** char *vdev_type = vd->vdev_ops->vdev_op_type; if (strcmp(vdev_type, VDEV_TYPE_ROOT) == 0 && vd->vdev_children > 1) { return (B_FALSE); ! } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0 || ! strcmp(vdev_type, VDEV_TYPE_INDIRECT) == 0) { return (B_FALSE); } } for (int c = 0; c < vd->vdev_children; c++) { --- 3492,3502 ---- char *vdev_type = vd->vdev_ops->vdev_op_type; if (strcmp(vdev_type, VDEV_TYPE_ROOT) == 0 && vd->vdev_children > 1) { return (B_FALSE); ! } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0) { return (B_FALSE); } } for (int c = 0; c < vd->vdev_children; c++) {
*** 3800,3818 **** return (B_FALSE); } return (B_TRUE); } ! boolean_t ! vdev_is_concrete(vdev_t *vd) { ! vdev_ops_t *ops = vd->vdev_ops; ! if (ops == &vdev_indirect_ops || ops == &vdev_hole_ops || ! ops == &vdev_missing_ops || ops == &vdev_root_ops) { ! return (B_FALSE); ! } else { ! return (B_TRUE); } } /* * Determine if a log device has valid content. If the vdev was --- 3504,3539 ---- return (B_FALSE); } return (B_TRUE); } ! /* ! * Load the state from the original vdev tree (ovd) which ! * we've retrieved from the MOS config object. If the original ! * vdev was offline or faulted then we transfer that state to the ! * device in the current vdev tree (nvd). ! */ ! void ! vdev_load_log_state(vdev_t *nvd, vdev_t *ovd) { ! spa_t *spa = nvd->vdev_spa; ! ! ASSERT(nvd->vdev_top->vdev_islog); ! ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL); ! ASSERT3U(nvd->vdev_guid, ==, ovd->vdev_guid); ! ! for (int c = 0; c < nvd->vdev_children; c++) ! vdev_load_log_state(nvd->vdev_child[c], ovd->vdev_child[c]); ! ! if (nvd->vdev_ops->vdev_op_leaf) { ! /* ! * Restore the persistent vdev state ! */ ! nvd->vdev_offline = ovd->vdev_offline; ! nvd->vdev_faulted = ovd->vdev_faulted; ! nvd->vdev_degraded = ovd->vdev_degraded; ! nvd->vdev_removed = ovd->vdev_removed; } } /* * Determine if a log device has valid content. If the vdev was
*** 3840,3853 **** vdev_expand(vdev_t *vd, uint64_t txg) { ASSERT(vd->vdev_top == vd); ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL); ! vdev_set_deflate_ratio(vd); ! ! if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count && ! vdev_is_concrete(vd)) { VERIFY(vdev_metaslab_init(vd, txg) == 0); vdev_config_dirty(vd); } } --- 3561,3571 ---- vdev_expand(vdev_t *vd, uint64_t txg) { ASSERT(vd->vdev_top == vd); ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL); ! if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) { VERIFY(vdev_metaslab_init(vd, txg) == 0); vdev_config_dirty(vd); } }
*** 3894,3909 **** * the spa_deadman_synctime we panic the system. */ fio = avl_first(&vq->vq_active_tree); delta = gethrtime() - fio->io_timestamp; if (delta > spa_deadman_synctime(spa)) { ! vdev_dbgmsg(vd, "SLOW IO: zio timestamp " ! "%lluns, delta %lluns, last io %lluns", ! fio->io_timestamp, (u_longlong_t)delta, vq->vq_io_complete_ts); fm_panic("I/O to pool '%s' appears to be " "hung.", spa_name(spa)); } } mutex_exit(&vq->vq_lock); } } --- 3612,3752 ---- * the spa_deadman_synctime we panic the system. */ fio = avl_first(&vq->vq_active_tree); delta = gethrtime() - fio->io_timestamp; if (delta > spa_deadman_synctime(spa)) { ! zfs_dbgmsg("SLOW IO: zio timestamp %lluns, " ! "delta %lluns, last io %lluns", ! fio->io_timestamp, delta, vq->vq_io_complete_ts); fm_panic("I/O to pool '%s' appears to be " "hung.", spa_name(spa)); } } mutex_exit(&vq->vq_lock); } + } + + boolean_t + vdev_type_is_ddt(vdev_t *vd) + { + uint64_t pool; + + if (vd->vdev_l2ad_ddt == 1 && + zfs_ddt_limit_type == DDT_LIMIT_TO_L2ARC) { + ASSERT(spa_l2cache_exists(vd->vdev_guid, &pool)); + ASSERT(vd->vdev_isl2cache); + return (B_TRUE); + } + return (B_FALSE); + } + + /* count leaf vdev(s) under the given vdev */ + uint_t + vdev_count_leaf_vdevs(vdev_t *vd) + { + uint_t cnt = 0; + + if (vd->vdev_ops->vdev_op_leaf) + return (1); + + /* if this is not a leaf vdev - visit children */ + for (int c = 0; c < vd->vdev_children; c++) + cnt += vdev_count_leaf_vdevs(vd->vdev_child[c]); + + return (cnt); + } + + /* + * Implements the per-vdev portion of manual TRIM. The function passes over + * all metaslabs on this vdev and performs a metaslab_trim_all on them. It's + * also responsible for rate-control if spa_man_trim_rate is non-zero. + */ + void + vdev_man_trim(vdev_trim_info_t *vti) + { + clock_t t = ddi_get_lbolt(); + spa_t *spa = vti->vti_vdev->vdev_spa; + vdev_t *vd = vti->vti_vdev; + + vd->vdev_man_trimming = B_TRUE; + vd->vdev_trim_prog = 0; + + spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER); + for (uint64_t i = 0; i < vti->vti_vdev->vdev_ms_count && + !spa->spa_man_trim_stop; i++) { + uint64_t delta; + metaslab_t *msp = vd->vdev_ms[i]; + zio_t *trim_io = metaslab_trim_all(msp, &delta); + + atomic_add_64(&vd->vdev_trim_prog, msp->ms_size); + spa_config_exit(spa, SCL_STATE_ALL, FTAG); + + (void) zio_wait(trim_io); + + /* delay loop to handle fixed-rate trimming */ + for (;;) { + uint64_t rate = spa->spa_man_trim_rate; + uint64_t sleep_delay; + + if (rate == 0) { + /* No delay, just update 't' and move on. */ + t = ddi_get_lbolt(); + break; + } + + sleep_delay = (delta * hz) / rate; + mutex_enter(&spa->spa_man_trim_lock); + (void) cv_timedwait(&spa->spa_man_trim_update_cv, + &spa->spa_man_trim_lock, t); + mutex_exit(&spa->spa_man_trim_lock); + + /* If interrupted, don't try to relock, get out */ + if (spa->spa_man_trim_stop) + goto out; + + /* Timeout passed, move on to the next metaslab. */ + if (ddi_get_lbolt() >= t + sleep_delay) { + t += sleep_delay; + break; + } + } + spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER); + } + spa_config_exit(spa, SCL_STATE_ALL, FTAG); + out: + vd->vdev_man_trimming = B_FALSE; + /* + * Ensure we're marked as "completed" even if we've had to stop + * before processing all metaslabs. + */ + vd->vdev_trim_prog = vd->vdev_asize; + + ASSERT(vti->vti_done_cb != NULL); + vti->vti_done_cb(vti->vti_done_arg); + + kmem_free(vti, sizeof (*vti)); + } + + /* + * Runs through all metaslabs on the vdev and does their autotrim processing. 
+ */ + void + vdev_auto_trim(vdev_trim_info_t *vti) + { + vdev_t *vd = vti->vti_vdev; + spa_t *spa = vd->vdev_spa; + uint64_t txg = vti->vti_txg; + + if (vd->vdev_man_trimming) + goto out; + + spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER); + for (uint64_t i = 0; i < vd->vdev_ms_count; i++) + metaslab_auto_trim(vd->vdev_ms[i], txg); + spa_config_exit(spa, SCL_STATE_ALL, FTAG); + out: + ASSERT(vti->vti_done_cb != NULL); + vti->vti_done_cb(vti->vti_done_arg); + + kmem_free(vti, sizeof (*vti)); }
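vdev_man_trim() above throttles manual TRIM by pausing between metaslabs: with a target rate in bytes per second, trimming delta bytes should take delta * hz / rate clock ticks, and the delay loop waits until that many ticks have elapsed before moving on (a rate of 0 disables throttling). Below is a small user-space sketch of that pacing arithmetic only; the hz and byte values are illustrative and not taken from the hunk.

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the fixed-rate pacing used by vdev_man_trim() above: after
 * trimming `delta_bytes`, the thread should not start the next metaslab
 * until delta_bytes * hz / rate clock ticks have passed.  A rate of 0
 * means no throttling.
 */
static uint64_t
trim_delay_ticks(uint64_t delta_bytes, uint64_t rate_bytes_per_sec,
    uint64_t hz)
{
	if (rate_bytes_per_sec == 0)
		return (0);
	return (delta_bytes * hz / rate_bytes_per_sec);
}

int
main(void)
{
	const uint64_t hz = 100;		/* clock ticks per second */
	const uint64_t rate = 64ULL << 20;	/* 64 MiB/s target rate */
	const uint64_t delta = 512ULL << 20;	/* bytes trimmed in one pass */
	uint64_t ticks = trim_delay_ticks(delta, rate, hz);

	printf("sleep for %llu ticks (~%llu s) before the next metaslab\n",
	    (unsigned long long)ticks, (unsigned long long)(ticks / hz));
	return (0);
}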