NEX-15270 pool clear does not "repair" cache devices
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dmitry Savitsky <dmitry.savitsky@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9940 Appliance requires a reboot after JBOD power failure or disconnecting all SAS cables
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5736 implement autoreplace matching based on FRU slot number
NEX-6200 hot spares are not reactivated after reinserting into enclosure
NEX-9403 need to update FRU for spare and l2cache devices
NEX-9404 remove lofi autoreplace support from syseventd
NEX-9409 hotsparing doesn't work for vdevs without FRU
NEX-9424 zfs`vdev_online() needs better notification about state changes
Portions contributed by: Alek Pinchuk <alek@nexenta.com>
Portions contributed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8206 dtrace helpers leak when cfork() fails
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8507 erroneous check in vdev_type_is_ddt()
Reviewed by: Alex Deiter <alex.deiter@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4584 System panic when adding special vdev to a pool that does not support feature flags
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-2846 Enable Automatic/Intelligent Hot Sparing capability
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5064 On-demand trim should store operation start and stop time
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3729 KRRP changes mess up iostat(1M)
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
NEX-4204 Removing vdev while on-demand trim is ongoing locks up pool
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3474 CLONE - Port NEX-2591 FRU field not set during pool creation and never updated
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3212 remove vdev prop object type from dmu.h, p2
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3025 support root pools on EFI labeled disks
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-1142 move rwlock to vdev to protect vdev_tsd, not just the ldi handle.
This way we serialize open/close, yet allow parallel I/O.
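A minimal sketch of the reader side of this pattern (the function name is
hypothetical; the real changes are in the vdev_disk open/close and I/O paths,
using the vdev_tsd_lock that this diff adds to vdev_alloc_common() below):

/*
 * Sketch only: I/O paths take vdev_tsd_lock as reader, so many I/Os
 * can proceed in parallel, while open/close take it as writer and are
 * therefore serialized against each other and against in-flight I/O.
 */
static int
vdev_disk_io_sketch(vdev_t *vd)
{
        vdev_disk_t *dvd;
        int err = 0;

        rw_enter(&vd->vdev_tsd_lock, RW_READER);
        dvd = vd->vdev_tsd;
        if (dvd == NULL || dvd->vd_lh == NULL)
                err = SET_ERROR(ENXIO); /* device closed underneath us */
        /* ... issue I/O through dvd->vd_lh while the lock is held ... */
        rw_exit(&vd->vdev_tsd_lock);
        return (err);
}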
NEX-801 If a block pointer is corrupt, read or write may crash
If a block pointer is corrupt in such a way that the vdev id of one of the
ditto blocks is wrong (out of range), zio_vdev_io_start() or
zio_vdev_io_done() may trip over it and crash.
This changeset takes care of that by treating an invalid vdev as
neither readable nor writable.
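The resulting guards, as they appear in vdev_readable()/vdev_writeable()
later in this diff — a NULL (i.e. invalid) vdev now simply reports as
neither readable nor writable, so the zio paths bail out cleanly instead of
dereferencing an out-of-range child:

boolean_t
vdev_readable(vdev_t *vd)
{
        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read);
}

boolean_t
vdev_writeable(vdev_t *vd)
{
        return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write);
}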
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
@@ -19,25 +19,23 @@
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
- * Copyright 2017 Nexenta Systems, Inc.
+ * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
+ * Copyright 2018 Nexenta Systems, Inc.
* Copyright (c) 2014 Integros [integros.com]
* Copyright 2016 Toomas Soome <tsoome@me.com>
* Copyright 2017 Joyent, Inc.
*/
#include <sys/zfs_context.h>
#include <sys/fm/fs/zfs.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
-#include <sys/bpobj.h>
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
-#include <sys/dsl_dir.h>
#include <sys/vdev_impl.h>
#include <sys/uberblock_impl.h>
#include <sys/metaslab.h>
#include <sys/metaslab_impl.h>
#include <sys/space_map.h>
@@ -62,98 +60,27 @@
&vdev_spare_ops,
&vdev_disk_ops,
&vdev_file_ops,
&vdev_missing_ops,
&vdev_hole_ops,
- &vdev_indirect_ops,
NULL
};
/* maximum scrub/resilver I/O queue per leaf vdev */
int zfs_scrub_limit = 10;
/*
+ * alpha for exponential moving average of I/O latency (in 1/10th of a percent)
+ */
+int zfs_vs_latency_alpha = 100;
+
+/*
* When a vdev is added, it will be divided into approximately (but no
* more than) this number of metaslabs.
*/
int metaslabs_per_vdev = 200;
-boolean_t vdev_validate_skip = B_FALSE;
-
-/*PRINTFLIKE2*/
-void
-vdev_dbgmsg(vdev_t *vd, const char *fmt, ...)
-{
- va_list adx;
- char buf[256];
-
- va_start(adx, fmt);
- (void) vsnprintf(buf, sizeof (buf), fmt, adx);
- va_end(adx);
-
- if (vd->vdev_path != NULL) {
- zfs_dbgmsg("%s vdev '%s': %s", vd->vdev_ops->vdev_op_type,
- vd->vdev_path, buf);
- } else {
- zfs_dbgmsg("%s-%llu vdev (guid %llu): %s",
- vd->vdev_ops->vdev_op_type,
- (u_longlong_t)vd->vdev_id,
- (u_longlong_t)vd->vdev_guid, buf);
- }
-}
-
-void
-vdev_dbgmsg_print_tree(vdev_t *vd, int indent)
-{
- char state[20];
-
- if (vd->vdev_ishole || vd->vdev_ops == &vdev_missing_ops) {
- zfs_dbgmsg("%*svdev %u: %s", indent, "", vd->vdev_id,
- vd->vdev_ops->vdev_op_type);
- return;
- }
-
- switch (vd->vdev_state) {
- case VDEV_STATE_UNKNOWN:
- (void) snprintf(state, sizeof (state), "unknown");
- break;
- case VDEV_STATE_CLOSED:
- (void) snprintf(state, sizeof (state), "closed");
- break;
- case VDEV_STATE_OFFLINE:
- (void) snprintf(state, sizeof (state), "offline");
- break;
- case VDEV_STATE_REMOVED:
- (void) snprintf(state, sizeof (state), "removed");
- break;
- case VDEV_STATE_CANT_OPEN:
- (void) snprintf(state, sizeof (state), "can't open");
- break;
- case VDEV_STATE_FAULTED:
- (void) snprintf(state, sizeof (state), "faulted");
- break;
- case VDEV_STATE_DEGRADED:
- (void) snprintf(state, sizeof (state), "degraded");
- break;
- case VDEV_STATE_HEALTHY:
- (void) snprintf(state, sizeof (state), "healthy");
- break;
- default:
- (void) snprintf(state, sizeof (state), "<state %u>",
- (uint_t)vd->vdev_state);
- }
-
- zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent,
- "", vd->vdev_id, vd->vdev_ops->vdev_op_type,
- vd->vdev_islog ? " (log)" : "",
- (u_longlong_t)vd->vdev_guid,
- vd->vdev_path ? vd->vdev_path : "N/A", state);
-
- for (uint64_t i = 0; i < vd->vdev_children; i++)
- vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2);
-}
-
/*
* Given a vdev type, return the appropriate ops vector.
*/
static vdev_ops_t *
vdev_getops(const char *type)
@@ -165,10 +92,16 @@
break;
return (ops);
}
+boolean_t
+vdev_is_special(vdev_t *vd)
+{
+ return (vd ? vd->vdev_isspecial : B_FALSE);
+}
+
/*
* Default asize function: return the MAX of psize with the asize of
* all children. This is what's used by anything other than RAID-Z.
*/
uint64_t
@@ -310,10 +243,13 @@
}
pvd->vdev_child = newchild;
pvd->vdev_child[id] = cvd;
+ cvd->vdev_isspecial_child =
+ (pvd->vdev_isspecial || pvd->vdev_isspecial_child);
+
cvd->vdev_top = (pvd->vdev_top ? pvd->vdev_top: cvd);
ASSERT(cvd->vdev_top->vdev_parent->vdev_parent == NULL);
/*
* Walk up all ancestors to update guid sum.
@@ -391,14 +327,12 @@
*/
vdev_t *
vdev_alloc_common(spa_t *spa, uint_t id, uint64_t guid, vdev_ops_t *ops)
{
vdev_t *vd;
- vdev_indirect_config_t *vic;
vd = kmem_zalloc(sizeof (vdev_t), KM_SLEEP);
- vic = &vd->vdev_indirect_config;
if (spa->spa_root_vdev == NULL) {
ASSERT(ops == &vdev_root_ops);
spa->spa_root_vdev = vd;
spa->spa_load_guid = spa_generate_guid(NULL);
@@ -425,22 +359,19 @@
vd->vdev_guid = guid;
vd->vdev_guid_sum = guid;
vd->vdev_ops = ops;
vd->vdev_state = VDEV_STATE_CLOSED;
vd->vdev_ishole = (ops == &vdev_hole_ops);
- vic->vic_prev_indirect_vdev = UINT64_MAX;
- rw_init(&vd->vdev_indirect_rwlock, NULL, RW_DEFAULT, NULL);
- mutex_init(&vd->vdev_obsolete_lock, NULL, MUTEX_DEFAULT, NULL);
- vd->vdev_obsolete_segments = range_tree_create(NULL, NULL);
-
mutex_init(&vd->vdev_dtl_lock, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&vd->vdev_stat_lock, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&vd->vdev_probe_lock, NULL, MUTEX_DEFAULT, NULL);
- mutex_init(&vd->vdev_queue_lock, NULL, MUTEX_DEFAULT, NULL);
+ mutex_init(&vd->vdev_scan_io_queue_lock, NULL, MUTEX_DEFAULT, NULL);
+ rw_init(&vd->vdev_tsd_lock, NULL, RW_DEFAULT, NULL);
for (int t = 0; t < DTL_TYPES; t++) {
- vd->vdev_dtl[t] = range_tree_create(NULL, NULL);
+ vd->vdev_dtl[t] = range_tree_create(NULL, NULL,
+ &vd->vdev_dtl_lock);
}
txg_list_create(&vd->vdev_ms_list, spa,
offsetof(struct metaslab, ms_txg_node));
txg_list_create(&vd->vdev_dtl_list, spa,
offsetof(struct vdev, vdev_dtl_node));
@@ -460,13 +391,13 @@
vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id,
int alloctype)
{
vdev_ops_t *ops;
char *type;
- uint64_t guid = 0, islog, nparity;
+ uint64_t guid = 0, nparity;
+ uint64_t isspecial = 0, islog = 0;
vdev_t *vd;
- vdev_indirect_config_t *vic;
ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);
if (nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) != 0)
return (SET_ERROR(EINVAL));
@@ -505,15 +436,21 @@
return (SET_ERROR(EINVAL));
/*
* Determine whether we're a log vdev.
*/
- islog = 0;
(void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_LOG, &islog);
if (islog && spa_version(spa) < SPA_VERSION_SLOGS)
return (SET_ERROR(ENOTSUP));
+ /*
+ * Determine whether we're a special vdev.
+ */
+ (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_IS_SPECIAL, &isspecial);
+ if (isspecial && spa_version(spa) < SPA_VERSION_FEATURES)
+ return (SET_ERROR(ENOTSUP));
+
if (ops == &vdev_hole_ops && spa_version(spa) < SPA_VERSION_HOLES)
return (SET_ERROR(ENOTSUP));
/*
* Set the nparity property for RAID-Z vdevs.
@@ -550,14 +487,16 @@
nparity = 0;
}
ASSERT(nparity != -1ULL);
vd = vdev_alloc_common(spa, id, guid, ops);
- vic = &vd->vdev_indirect_config;
vd->vdev_islog = islog;
+ vd->vdev_isspecial = isspecial;
vd->vdev_nparity = nparity;
+ vd->vdev_isspecial_child = (parent != NULL &&
+ (parent->vdev_isspecial || parent->vdev_isspecial_child));
if (nvlist_lookup_string(nv, ZPOOL_CONFIG_PATH, &vd->vdev_path) == 0)
vd->vdev_path = spa_strdup(vd->vdev_path);
if (nvlist_lookup_string(nv, ZPOOL_CONFIG_DEVID, &vd->vdev_devid) == 0)
vd->vdev_devid = spa_strdup(vd->vdev_devid);
@@ -565,27 +504,55 @@
&vd->vdev_physpath) == 0)
vd->vdev_physpath = spa_strdup(vd->vdev_physpath);
if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &vd->vdev_fru) == 0)
vd->vdev_fru = spa_strdup(vd->vdev_fru);
+#ifdef _KERNEL
+ if (vd->vdev_path) {
+ char dev_path[MAXPATHLEN];
+ char *last_slash = NULL;
+ kstat_t *exist = NULL;
+
+ if (strcmp(vd->vdev_ops->vdev_op_type, VDEV_TYPE_DISK) == 0)
+ last_slash = strrchr(vd->vdev_path, '/');
+
+ (void) sprintf(dev_path, "%s:%s", spa->spa_name,
+ last_slash != NULL ? last_slash + 1 : vd->vdev_path);
+
+ exist = kstat_hold_byname("zfs", 0, dev_path, ALL_ZONES);
+
+ if (!exist) {
+ vd->vdev_iokstat = kstat_create("zfs", 0, dev_path,
+ "zfs", KSTAT_TYPE_IO, 1, 0);
+
+ if (vd->vdev_iokstat) {
+ vd->vdev_iokstat->ks_lock =
+ &spa->spa_iokstat_lock;
+ kstat_install(vd->vdev_iokstat);
+ }
+ } else {
+ kstat_rele(exist);
+ }
+ }
+#endif
+
/*
* Set the whole_disk property. If it's not specified, leave the value
* as -1.
*/
if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
&vd->vdev_wholedisk) != 0)
vd->vdev_wholedisk = -1ULL;
- ASSERT0(vic->vic_mapping_object);
- (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT,
- &vic->vic_mapping_object);
- ASSERT0(vic->vic_births_object);
- (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS,
- &vic->vic_births_object);
- ASSERT3U(vic->vic_prev_indirect_vdev, ==, UINT64_MAX);
- (void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV,
- &vic->vic_prev_indirect_vdev);
+ /*
+ * Set the is_ssd property. If it's not specified, either the media
+ * is not SSD or the request failed; in both cases we assume it's not.
+ */
+ if (nvlist_lookup_boolean(nv, ZPOOL_CONFIG_IS_SSD) == 0)
+ vd->vdev_is_ssd = B_TRUE;
+ else
+ vd->vdev_is_ssd = B_FALSE;
/*
* Look for the 'not present' flag. This will only be set if the device
* was not present at the time of import.
*/
@@ -621,16 +588,19 @@
} else {
ASSERT0(vd->vdev_top_zap);
}
if (parent && !parent->vdev_parent && alloctype != VDEV_ALLOC_ATTACH) {
+ metaslab_class_t *mc = isspecial ? spa_special_class(spa) :
+ (islog ? spa_log_class(spa) : spa_normal_class(spa));
+
ASSERT(alloctype == VDEV_ALLOC_LOAD ||
alloctype == VDEV_ALLOC_ADD ||
alloctype == VDEV_ALLOC_SPLIT ||
alloctype == VDEV_ALLOC_ROOTPOOL);
- vd->vdev_mg = metaslab_group_create(islog ?
- spa_log_class(spa) : spa_normal_class(spa), vd);
+
+ vd->vdev_mg = metaslab_group_create(mc, vd);
}
if (vd->vdev_ops->vdev_op_leaf &&
(alloctype == VDEV_ALLOC_LOAD || alloctype == VDEV_ALLOC_SPLIT)) {
(void) nvlist_lookup_uint64(nv,
@@ -708,10 +678,20 @@
vdev_free(vdev_t *vd)
{
spa_t *spa = vd->vdev_spa;
/*
+ * Scan queues are normally destroyed at the end of a scan. If the
+ * queue exists here, that implies the vdev is being removed while
+ * the scan is still running.
+ */
+ if (vd->vdev_scan_io_queue != NULL) {
+ dsl_scan_io_queue_destroy(vd->vdev_scan_io_queue);
+ vd->vdev_scan_io_queue = NULL;
+ }
+
+ /*
* vdev_free() implies closing the vdev first. This is simpler than
* trying to ensure complicated semantics for all callers.
*/
vdev_close(vd);
@@ -775,35 +755,25 @@
range_tree_vacate(vd->vdev_dtl[t], NULL, NULL);
range_tree_destroy(vd->vdev_dtl[t]);
}
mutex_exit(&vd->vdev_dtl_lock);
- EQUIV(vd->vdev_indirect_births != NULL,
- vd->vdev_indirect_mapping != NULL);
- if (vd->vdev_indirect_births != NULL) {
- vdev_indirect_mapping_close(vd->vdev_indirect_mapping);
- vdev_indirect_births_close(vd->vdev_indirect_births);
+ if (vd->vdev_iokstat) {
+ kstat_delete(vd->vdev_iokstat);
+ vd->vdev_iokstat = NULL;
}
-
- if (vd->vdev_obsolete_sm != NULL) {
- ASSERT(vd->vdev_removing ||
- vd->vdev_ops == &vdev_indirect_ops);
- space_map_close(vd->vdev_obsolete_sm);
- vd->vdev_obsolete_sm = NULL;
- }
- range_tree_destroy(vd->vdev_obsolete_segments);
- rw_destroy(&vd->vdev_indirect_rwlock);
- mutex_destroy(&vd->vdev_obsolete_lock);
-
- mutex_destroy(&vd->vdev_queue_lock);
mutex_destroy(&vd->vdev_dtl_lock);
mutex_destroy(&vd->vdev_stat_lock);
mutex_destroy(&vd->vdev_probe_lock);
+ mutex_destroy(&vd->vdev_scan_io_queue_lock);
+ rw_destroy(&vd->vdev_tsd_lock);
if (vd == spa->spa_root_vdev)
spa->spa_root_vdev = NULL;
+ ASSERT3P(vd->vdev_scan_io_queue, ==, NULL);
+
kmem_free(vd, sizeof (vdev_t));
}
/*
* Transfer top-level vdev state from svd to tvd.
@@ -869,10 +839,16 @@
tvd->vdev_deflate_ratio = svd->vdev_deflate_ratio;
svd->vdev_deflate_ratio = 0;
tvd->vdev_islog = svd->vdev_islog;
svd->vdev_islog = 0;
+
+ tvd->vdev_isspecial = svd->vdev_isspecial;
+ svd->vdev_isspecial = 0;
+ svd->vdev_isspecial_child = tvd->vdev_isspecial;
+
+ dsl_scan_io_queue_vdev_xfer(svd, tvd);
}
static void
vdev_top_update(vdev_t *tvd, vdev_t *vd)
{
@@ -900,11 +876,10 @@
mvd = vdev_alloc_common(spa, cvd->vdev_id, 0, ops);
mvd->vdev_asize = cvd->vdev_asize;
mvd->vdev_min_asize = cvd->vdev_min_asize;
mvd->vdev_max_asize = cvd->vdev_max_asize;
- mvd->vdev_psize = cvd->vdev_psize;
mvd->vdev_ashift = cvd->vdev_ashift;
mvd->vdev_state = cvd->vdev_state;
mvd->vdev_crtxg = cvd->vdev_crtxg;
vdev_remove_child(pvd, cvd);
@@ -981,10 +956,19 @@
if (vd->vdev_ms_shift == 0)
return (0);
ASSERT(!vd->vdev_ishole);
+ /*
+ * Compute the raidz-deflation ratio. Note, we hard-code
+ * in 128k (1 << 17) because it is the "typical" blocksize.
+ * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
+ * otherwise it would inconsistently account for existing bp's.
+ */
+ vd->vdev_deflate_ratio = (1 << 17) /
+ (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
+
ASSERT(oldc <= newc);
mspp = kmem_zalloc(newc * sizeof (*mspp), KM_SLEEP);
if (oldc != 0) {
@@ -996,34 +980,23 @@
vd->vdev_ms_count = newc;
for (m = oldc; m < newc; m++) {
uint64_t object = 0;
- /*
- * vdev_ms_array may be 0 if we are creating the "fake"
- * metaslabs for an indirect vdev for zdb's leak detection.
- * See zdb_leak_init().
- */
- if (txg == 0 && vd->vdev_ms_array != 0) {
+ if (txg == 0) {
error = dmu_read(mos, vd->vdev_ms_array,
m * sizeof (uint64_t), sizeof (uint64_t), &object,
DMU_READ_PREFETCH);
- if (error != 0) {
- vdev_dbgmsg(vd, "unable to read the metaslab "
- "array [error=%d]", error);
+ if (error)
return (error);
}
- }
error = metaslab_init(vd->vdev_mg, m, object, txg,
&(vd->vdev_ms[m]));
- if (error != 0) {
- vdev_dbgmsg(vd, "metaslab_init failed [error=%d]",
- error);
+ if (error)
return (error);
}
- }
if (txg == 0)
spa_config_enter(spa, SCL_ALLOC, FTAG, RW_WRITER);
/*
@@ -1041,26 +1014,24 @@
}
void
vdev_metaslab_fini(vdev_t *vd)
{
- if (vd->vdev_ms != NULL) {
+ uint64_t m;
uint64_t count = vd->vdev_ms_count;
+ if (vd->vdev_ms != NULL) {
metaslab_group_passivate(vd->vdev_mg);
- for (uint64_t m = 0; m < count; m++) {
+ for (m = 0; m < count; m++) {
metaslab_t *msp = vd->vdev_ms[m];
if (msp != NULL)
metaslab_fini(msp);
}
kmem_free(vd->vdev_ms, count * sizeof (metaslab_t *));
vd->vdev_ms = NULL;
-
- vd->vdev_ms_count = 0;
}
- ASSERT0(vd->vdev_ms_count);
}
typedef struct vdev_probe_stats {
boolean_t vps_readable;
boolean_t vps_writeable;
@@ -1100,11 +1071,10 @@
if (vdev_readable(vd) &&
(vdev_writeable(vd) || !spa_writeable(spa))) {
zio->io_error = 0;
} else {
ASSERT(zio->io_error != 0);
- vdev_dbgmsg(vd, "failed probe");
zfs_ereport_post(FM_EREPORT_ZFS_PROBE_FAILURE,
spa, vd, NULL, 0, 0);
zio->io_error = SET_ERROR(ENXIO);
}
@@ -1268,25 +1238,10 @@
taskq_destroy(tq);
}
/*
- * Compute the raidz-deflation ratio. Note, we hard-code
- * in 128k (1 << 17) because it is the "typical" blocksize.
- * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
- * otherwise it would inconsistently account for existing bp's.
- */
-static void
-vdev_set_deflate_ratio(vdev_t *vd)
-{
- if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
- vd->vdev_deflate_ratio = (1 << 17) /
- (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
- }
-}
-
-/*
* Prepare a virtual device for access.
*/
int
vdev_open(vdev_t *vd)
{
@@ -1307,14 +1262,15 @@
vd->vdev_cant_read = B_FALSE;
vd->vdev_cant_write = B_FALSE;
vd->vdev_min_asize = vdev_get_min_asize(vd);
/*
- * If this vdev is not removed, check its fault status. If it's
- * faulted, bail out of the open.
+ * If vdev isn't removed and is faulted for reasons other than failed
+ * open, or if it's offline - bail out.
*/
- if (!vd->vdev_removed && vd->vdev_faulted) {
+ if (!vd->vdev_removed && vd->vdev_faulted &&
+ vd->vdev_label_aux != VDEV_AUX_OPEN_FAILED) {
ASSERT(vd->vdev_children == 0);
ASSERT(vd->vdev_label_aux == VDEV_AUX_ERR_EXCEEDED ||
vd->vdev_label_aux == VDEV_AUX_EXTERNAL);
vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED,
vd->vdev_label_aux);
@@ -1338,17 +1294,12 @@
if (error) {
if (vd->vdev_removed &&
vd->vdev_stat.vs_aux != VDEV_AUX_OPEN_FAILED)
vd->vdev_removed = B_FALSE;
- if (vd->vdev_stat.vs_aux == VDEV_AUX_CHILDREN_OFFLINE) {
- vdev_set_state(vd, B_TRUE, VDEV_STATE_OFFLINE,
- vd->vdev_stat.vs_aux);
- } else {
vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
vd->vdev_stat.vs_aux);
- }
return (error);
}
vd->vdev_removed = B_FALSE;
@@ -1504,53 +1455,46 @@
/*
* Called once the vdevs are all opened, this routine validates the label
* contents. This needs to be done before vdev_load() so that we don't
* inadvertently do repair I/Os to the wrong device.
*
+ * If 'strict' is false ignore the spa guid check. This is necessary because
+ * if the machine crashed during a re-guid the new guid might have been written
+ * to all of the vdev labels, but not the cached config. The strict check
+ * will be performed when the pool is opened again using the mos config.
+ *
* This function will only return failure if one of the vdevs indicates that it
* has since been destroyed or exported. This is only possible if
* /etc/zfs/zpool.cache was readonly at the time. Otherwise, the vdev state
* will be updated but the function will return 0.
*/
int
-vdev_validate(vdev_t *vd)
+vdev_validate(vdev_t *vd, boolean_t strict)
{
spa_t *spa = vd->vdev_spa;
nvlist_t *label;
- uint64_t guid = 0, aux_guid = 0, top_guid;
+ uint64_t guid = 0, top_guid;
uint64_t state;
- nvlist_t *nvl;
- uint64_t txg;
- if (vdev_validate_skip)
- return (0);
-
- for (uint64_t c = 0; c < vd->vdev_children; c++)
- if (vdev_validate(vd->vdev_child[c]) != 0)
+ for (int c = 0; c < vd->vdev_children; c++)
+ if (vdev_validate(vd->vdev_child[c], strict) != 0)
return (SET_ERROR(EBADF));
/*
* If the device has already failed, or was marked offline, don't do
* any further validation. Otherwise, label I/O will fail and we will
* overwrite the previous state.
*/
- if (!vd->vdev_ops->vdev_op_leaf || !vdev_readable(vd))
- return (0);
+ if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
+ uint64_t aux_guid = 0;
+ nvlist_t *nvl;
+ uint64_t txg = spa_last_synced_txg(spa) != 0 ?
+ spa_last_synced_txg(spa) : -1ULL;
- /*
- * If we are performing an extreme rewind, we allow for a label that
- * was modified at a point after the current txg.
- */
- if (spa->spa_extreme_rewind || spa_last_synced_txg(spa) == 0)
- txg = UINT64_MAX;
- else
- txg = spa_last_synced_txg(spa);
-
if ((label = vdev_label_read_config(vd, txg)) == NULL) {
vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_BAD_LABEL);
- vdev_dbgmsg(vd, "vdev_validate: failed reading config");
return (0);
}
/*
* Determine if this vdev has been split off into another
@@ -1559,113 +1503,56 @@
if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_SPLIT_GUID,
&aux_guid) == 0 && aux_guid == spa_guid(spa)) {
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_SPLIT_POOL);
nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: vdev split into other pool");
return (0);
}
- if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID, &guid) != 0) {
+ if (strict && (nvlist_lookup_uint64(label,
+ ZPOOL_CONFIG_POOL_GUID, &guid) != 0 ||
+ guid != spa_guid(spa))) {
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_CORRUPT_DATA);
nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
- ZPOOL_CONFIG_POOL_GUID);
return (0);
}
- /*
- * If config is not trusted then ignore the spa guid check. This is
- * necessary because if the machine crashed during a re-guid the new
- * guid might have been written to all of the vdev labels, but not the
- * cached config. The check will be performed again once we have the
- * trusted config from the MOS.
- */
- if (spa->spa_trust_config && guid != spa_guid(spa)) {
- vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
- VDEV_AUX_CORRUPT_DATA);
- nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: vdev label pool_guid doesn't "
- "match config (%llu != %llu)", (u_longlong_t)guid,
- (u_longlong_t)spa_guid(spa));
- return (0);
- }
-
if (nvlist_lookup_nvlist(label, ZPOOL_CONFIG_VDEV_TREE, &nvl)
!= 0 || nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_ORIG_GUID,
&aux_guid) != 0)
aux_guid = 0;
- if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID, &guid) != 0) {
- vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
- VDEV_AUX_CORRUPT_DATA);
- nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
- ZPOOL_CONFIG_GUID);
- return (0);
- }
-
- if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID, &top_guid)
- != 0) {
- vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
- VDEV_AUX_CORRUPT_DATA);
- nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
- ZPOOL_CONFIG_TOP_GUID);
- return (0);
- }
-
/*
- * If this vdev just became a top-level vdev because its sibling was
- * detached, it will have adopted the parent's vdev guid -- but the
- * label may or may not be on disk yet. Fortunately, either version
- * of the label will have the same top guid, so if we're a top-level
- * vdev, we can safely compare to that instead.
- * However, if the config comes from a cachefile that failed to update
- * after the detach, a top-level vdev will appear as a non top-level
- * vdev in the config. Also relax the constraints if we perform an
- * extreme rewind.
+ * If this vdev just became a top-level vdev because its
+ * sibling was detached, it will have adopted the parent's
+ * vdev guid -- but the label may or may not be on disk yet.
+ * Fortunately, either version of the label will have the
+ * same top guid, so if we're a top-level vdev, we can
+ * safely compare to that instead.
*
* If we split this vdev off instead, then we also check the
* original pool's guid. We don't want to consider the vdev
* corrupt if it is partway through a split operation.
*/
- if (vd->vdev_guid != guid && vd->vdev_guid != aux_guid) {
- boolean_t mismatch = B_FALSE;
- if (spa->spa_trust_config && !spa->spa_extreme_rewind) {
- if (vd != vd->vdev_top || vd->vdev_guid != top_guid)
- mismatch = B_TRUE;
- } else {
- if (vd->vdev_guid != top_guid &&
- vd->vdev_top->vdev_guid != guid)
- mismatch = B_TRUE;
- }
-
- if (mismatch) {
+ if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
+ &guid) != 0 ||
+ nvlist_lookup_uint64(label, ZPOOL_CONFIG_TOP_GUID,
+ &top_guid) != 0 ||
+ ((vd->vdev_guid != guid && vd->vdev_guid != aux_guid) &&
+ (vd->vdev_guid != top_guid || vd != vd->vdev_top))) {
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_CORRUPT_DATA);
nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: config guid "
- "doesn't match label guid");
- vdev_dbgmsg(vd, "CONFIG: guid %llu, top_guid %llu",
- (u_longlong_t)vd->vdev_guid,
- (u_longlong_t)vd->vdev_top->vdev_guid);
- vdev_dbgmsg(vd, "LABEL: guid %llu, top_guid %llu, "
- "aux_guid %llu", (u_longlong_t)guid,
- (u_longlong_t)top_guid, (u_longlong_t)aux_guid);
return (0);
}
- }
if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
&state) != 0) {
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_CORRUPT_DATA);
nvlist_free(label);
- vdev_dbgmsg(vd, "vdev_validate: '%s' missing from label",
- ZPOOL_CONFIG_POOL_STATE);
return (0);
}
nvlist_free(label);
@@ -1673,139 +1560,26 @@
* If this is a verbatim import, no need to check the
* state of the pool.
*/
if (!(spa->spa_import_flags & ZFS_IMPORT_VERBATIM) &&
spa_load_state(spa) == SPA_LOAD_OPEN &&
- state != POOL_STATE_ACTIVE) {
- vdev_dbgmsg(vd, "vdev_validate: invalid pool state (%llu) "
- "for spa %s", (u_longlong_t)state, spa->spa_name);
+ state != POOL_STATE_ACTIVE)
return (SET_ERROR(EBADF));
- }
/*
* If we were able to open and validate a vdev that was
* previously marked permanently unavailable, clear that state
* now.
*/
if (vd->vdev_not_present)
vd->vdev_not_present = 0;
-
- return (0);
-}
-
-static void
-vdev_copy_path_impl(vdev_t *svd, vdev_t *dvd)
-{
- if (svd->vdev_path != NULL && dvd->vdev_path != NULL) {
- if (strcmp(svd->vdev_path, dvd->vdev_path) != 0) {
- zfs_dbgmsg("vdev_copy_path: vdev %llu: path changed "
- "from '%s' to '%s'", (u_longlong_t)dvd->vdev_guid,
- dvd->vdev_path, svd->vdev_path);
- spa_strfree(dvd->vdev_path);
- dvd->vdev_path = spa_strdup(svd->vdev_path);
}
- } else if (svd->vdev_path != NULL) {
- dvd->vdev_path = spa_strdup(svd->vdev_path);
- zfs_dbgmsg("vdev_copy_path: vdev %llu: path set to '%s'",
- (u_longlong_t)dvd->vdev_guid, dvd->vdev_path);
- }
-}
-/*
- * Recursively copy vdev paths from one vdev to another. Source and destination
- * vdev trees must have same geometry otherwise return error. Intended to copy
- * paths from userland config into MOS config.
- */
-int
-vdev_copy_path_strict(vdev_t *svd, vdev_t *dvd)
-{
- if ((svd->vdev_ops == &vdev_missing_ops) ||
- (svd->vdev_ishole && dvd->vdev_ishole) ||
- (dvd->vdev_ops == &vdev_indirect_ops))
return (0);
-
- if (svd->vdev_ops != dvd->vdev_ops) {
- vdev_dbgmsg(svd, "vdev_copy_path: vdev type mismatch: %s != %s",
- svd->vdev_ops->vdev_op_type, dvd->vdev_ops->vdev_op_type);
- return (SET_ERROR(EINVAL));
- }
-
- if (svd->vdev_guid != dvd->vdev_guid) {
- vdev_dbgmsg(svd, "vdev_copy_path: guids mismatch (%llu != "
- "%llu)", (u_longlong_t)svd->vdev_guid,
- (u_longlong_t)dvd->vdev_guid);
- return (SET_ERROR(EINVAL));
- }
-
- if (svd->vdev_children != dvd->vdev_children) {
- vdev_dbgmsg(svd, "vdev_copy_path: children count mismatch: "
- "%llu != %llu", (u_longlong_t)svd->vdev_children,
- (u_longlong_t)dvd->vdev_children);
- return (SET_ERROR(EINVAL));
- }
-
- for (uint64_t i = 0; i < svd->vdev_children; i++) {
- int error = vdev_copy_path_strict(svd->vdev_child[i],
- dvd->vdev_child[i]);
- if (error != 0)
- return (error);
- }
-
- if (svd->vdev_ops->vdev_op_leaf)
- vdev_copy_path_impl(svd, dvd);
-
- return (0);
}
-static void
-vdev_copy_path_search(vdev_t *stvd, vdev_t *dvd)
-{
- ASSERT(stvd->vdev_top == stvd);
- ASSERT3U(stvd->vdev_id, ==, dvd->vdev_top->vdev_id);
-
- for (uint64_t i = 0; i < dvd->vdev_children; i++) {
- vdev_copy_path_search(stvd, dvd->vdev_child[i]);
- }
-
- if (!dvd->vdev_ops->vdev_op_leaf || !vdev_is_concrete(dvd))
- return;
-
- /*
- * The idea here is that while a vdev can shift positions within
- * a top vdev (when replacing, attaching mirror, etc.) it cannot
- * step outside of it.
- */
- vdev_t *vd = vdev_lookup_by_guid(stvd, dvd->vdev_guid);
-
- if (vd == NULL || vd->vdev_ops != dvd->vdev_ops)
- return;
-
- ASSERT(vd->vdev_ops->vdev_op_leaf);
-
- vdev_copy_path_impl(vd, dvd);
-}
-
/*
- * Recursively copy vdev paths from one root vdev to another. Source and
- * destination vdev trees may differ in geometry. For each destination leaf
- * vdev, search a vdev with the same guid and top vdev id in the source.
- * Intended to copy paths from userland config into MOS config.
- */
-void
-vdev_copy_path_relaxed(vdev_t *srvd, vdev_t *drvd)
-{
- uint64_t children = MIN(srvd->vdev_children, drvd->vdev_children);
- ASSERT(srvd->vdev_ops == &vdev_root_ops);
- ASSERT(drvd->vdev_ops == &vdev_root_ops);
-
- for (uint64_t i = 0; i < children; i++) {
- vdev_copy_path_search(srvd->vdev_child[i],
- drvd->vdev_child[i]);
- }
-}
-
-/*
* Close a virtual device.
*/
void
vdev_close(vdev_t *vd)
{
@@ -1893,14 +1667,20 @@
*/
if (vd->vdev_aux) {
(void) vdev_validate_aux(vd);
if (vdev_readable(vd) && vdev_writeable(vd) &&
vd->vdev_aux == &spa->spa_l2cache &&
- !l2arc_vdev_present(vd))
- l2arc_add_vdev(spa, vd);
+ !l2arc_vdev_present(vd)) {
+ /*
+ * When reopening we can assume persistent L2ARC is
+ * supported, since we've already opened the device
+ * in the past and prepended an L2ARC uberblock.
+ */
+ l2arc_add_vdev(spa, vd, B_TRUE);
+ }
} else {
- (void) vdev_validate(vd);
+ (void) vdev_validate(vd, B_TRUE);
}
/*
* Reassess parent vdev's health.
*/
@@ -1949,12 +1729,11 @@
void
vdev_dirty(vdev_t *vd, int flags, void *arg, uint64_t txg)
{
ASSERT(vd == vd->vdev_top);
- /* indirect vdevs don't have metaslabs or dtls */
- ASSERT(vdev_is_concrete(vd) || flags == 0);
+ ASSERT(!vd->vdev_ishole);
ASSERT(ISP2(flags));
ASSERT(spa_writeable(vd->vdev_spa));
if (flags & VDD_METASLAB)
(void) txg_list_add(&vd->vdev_ms_list, arg, txg);
@@ -2020,14 +1799,14 @@
ASSERT(t < DTL_TYPES);
ASSERT(vd != vd->vdev_spa->spa_root_vdev);
ASSERT(spa_writeable(vd->vdev_spa));
- mutex_enter(&vd->vdev_dtl_lock);
+ mutex_enter(rt->rt_lock);
if (!range_tree_contains(rt, txg, size))
range_tree_add(rt, txg, size);
- mutex_exit(&vd->vdev_dtl_lock);
+ mutex_exit(rt->rt_lock);
}
boolean_t
vdev_dtl_contains(vdev_t *vd, vdev_dtl_type_t t, uint64_t txg, uint64_t size)
{
@@ -2035,25 +1814,14 @@
boolean_t dirty = B_FALSE;
ASSERT(t < DTL_TYPES);
ASSERT(vd != vd->vdev_spa->spa_root_vdev);
- /*
- * While we are loading the pool, the DTLs have not been loaded yet.
- * Ignore the DTLs and try all devices. This avoids a recursive
- * mutex enter on the vdev_dtl_lock, and also makes us try hard
- * when loading the pool (relying on the checksum to ensure that
- * we get the right data -- note that we while loading, we are
- * only reading the MOS, which is always checksummed).
- */
- if (vd->vdev_spa->spa_load_state != SPA_LOAD_NONE)
- return (B_FALSE);
-
- mutex_enter(&vd->vdev_dtl_lock);
+ mutex_enter(rt->rt_lock);
if (range_tree_space(rt) != 0)
dirty = range_tree_contains(rt, txg, size);
- mutex_exit(&vd->vdev_dtl_lock);
+ mutex_exit(rt->rt_lock);
return (dirty);
}
boolean_t
@@ -2060,13 +1828,13 @@
vdev_dtl_empty(vdev_t *vd, vdev_dtl_type_t t)
{
range_tree_t *rt = vd->vdev_dtl[t];
boolean_t empty;
- mutex_enter(&vd->vdev_dtl_lock);
+ mutex_enter(rt->rt_lock);
empty = (range_tree_space(rt) == 0);
- mutex_exit(&vd->vdev_dtl_lock);
+ mutex_exit(rt->rt_lock);
return (empty);
}
/*
@@ -2155,11 +1923,11 @@
for (int c = 0; c < vd->vdev_children; c++)
vdev_dtl_reassess(vd->vdev_child[c], txg,
scrub_txg, scrub_done);
- if (vd == spa->spa_root_vdev || !vdev_is_concrete(vd) || vd->vdev_aux)
+ if (vd == spa->spa_root_vdev || vd->vdev_ishole || vd->vdev_aux)
return;
if (vd->vdev_ops->vdev_op_leaf) {
dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
@@ -2261,14 +2029,14 @@
spa_t *spa = vd->vdev_spa;
objset_t *mos = spa->spa_meta_objset;
int error = 0;
if (vd->vdev_ops->vdev_op_leaf && vd->vdev_dtl_object != 0) {
- ASSERT(vdev_is_concrete(vd));
+ ASSERT(!vd->vdev_ishole);
error = space_map_open(&vd->vdev_dtl_sm, mos,
- vd->vdev_dtl_object, 0, -1ULL, 0);
+ vd->vdev_dtl_object, 0, -1ULL, 0, &vd->vdev_dtl_lock);
if (error)
return (error);
ASSERT(vd->vdev_dtl_sm != NULL);
mutex_enter(&vd->vdev_dtl_lock);
@@ -2343,14 +2111,15 @@
{
spa_t *spa = vd->vdev_spa;
range_tree_t *rt = vd->vdev_dtl[DTL_MISSING];
objset_t *mos = spa->spa_meta_objset;
range_tree_t *rtsync;
+ kmutex_t rtlock;
dmu_tx_t *tx;
uint64_t object = space_map_object(vd->vdev_dtl_sm);
- ASSERT(vdev_is_concrete(vd));
+ ASSERT(!vd->vdev_ishole);
ASSERT(vd->vdev_ops->vdev_op_leaf);
tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
if (vd->vdev_detached || vd->vdev_top->vdev_removing) {
@@ -2364,11 +2133,11 @@
* We only destroy the leaf ZAP for detached leaves or for
* removed log devices. Removed data devices handle leaf ZAP
* cleanup later, once cancellation is no longer possible.
*/
if (vd->vdev_leaf_zap != 0 && (vd->vdev_detached ||
- vd->vdev_top->vdev_islog)) {
+ vd->vdev_top->vdev_islog || vd->vdev_top->vdev_isspecial)) {
vdev_destroy_unlink_zap(vd, vd->vdev_leaf_zap, tx);
vd->vdev_leaf_zap = 0;
}
dmu_tx_commit(tx);
@@ -2380,16 +2149,20 @@
new_object = space_map_alloc(mos, tx);
VERIFY3U(new_object, !=, 0);
VERIFY0(space_map_open(&vd->vdev_dtl_sm, mos, new_object,
- 0, -1ULL, 0));
+ 0, -1ULL, 0, &vd->vdev_dtl_lock));
ASSERT(vd->vdev_dtl_sm != NULL);
}
- rtsync = range_tree_create(NULL, NULL);
+ mutex_init(&rtlock, NULL, MUTEX_DEFAULT, NULL);
+ rtsync = range_tree_create(NULL, NULL, &rtlock);
+
+ mutex_enter(&rtlock);
+
mutex_enter(&vd->vdev_dtl_lock);
range_tree_walk(rt, range_tree_add, rtsync);
mutex_exit(&vd->vdev_dtl_lock);
space_map_truncate(vd->vdev_dtl_sm, tx);
@@ -2396,19 +2169,21 @@
space_map_write(vd->vdev_dtl_sm, rtsync, SM_ALLOC, tx);
range_tree_vacate(rtsync, NULL, NULL);
range_tree_destroy(rtsync);
+ mutex_exit(&rtlock);
+ mutex_destroy(&rtlock);
+
/*
* If the object for the space map has changed then dirty
* the top level so that we update the config.
*/
if (object != space_map_object(vd->vdev_dtl_sm)) {
- vdev_dbgmsg(vd, "txg %llu, spa %s, DTL old object %llu, "
- "new object %llu", (u_longlong_t)txg, spa_name(spa),
- (u_longlong_t)object,
- (u_longlong_t)space_map_object(vd->vdev_dtl_sm));
+ zfs_dbgmsg("txg %llu, spa %s, DTL old object %llu, "
+ "new object %llu", txg, spa_name(spa), object,
+ space_map_object(vd->vdev_dtl_sm));
vdev_config_dirty(vd->vdev_top);
}
dmu_tx_commit(tx);
@@ -2489,76 +2264,34 @@
*maxp = thismax;
}
return (needed);
}
-int
+void
vdev_load(vdev_t *vd)
{
- int error = 0;
/*
* Recursively load all children.
*/
- for (int c = 0; c < vd->vdev_children; c++) {
- error = vdev_load(vd->vdev_child[c]);
- if (error != 0) {
- return (error);
- }
- }
+ for (int c = 0; c < vd->vdev_children; c++)
+ vdev_load(vd->vdev_child[c]);
- vdev_set_deflate_ratio(vd);
-
/*
* If this is a top-level vdev, initialize its metaslabs.
*/
- if (vd == vd->vdev_top && vdev_is_concrete(vd)) {
- if (vd->vdev_ashift == 0 || vd->vdev_asize == 0) {
+ if (vd == vd->vdev_top && !vd->vdev_ishole &&
+ (vd->vdev_ashift == 0 || vd->vdev_asize == 0 ||
+ vdev_metaslab_init(vd, 0) != 0))
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_CORRUPT_DATA);
- vdev_dbgmsg(vd, "vdev_load: invalid size. ashift=%llu, "
- "asize=%llu", (u_longlong_t)vd->vdev_ashift,
- (u_longlong_t)vd->vdev_asize);
- return (SET_ERROR(ENXIO));
- } else if ((error = vdev_metaslab_init(vd, 0)) != 0) {
- vdev_dbgmsg(vd, "vdev_load: metaslab_init failed "
- "[error=%d]", error);
- vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
- VDEV_AUX_CORRUPT_DATA);
- return (error);
- }
- }
/*
* If this is a leaf vdev, load its DTL.
*/
- if (vd->vdev_ops->vdev_op_leaf && (error = vdev_dtl_load(vd)) != 0) {
+ if (vd->vdev_ops->vdev_op_leaf && vdev_dtl_load(vd) != 0)
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_CORRUPT_DATA);
- vdev_dbgmsg(vd, "vdev_load: vdev_dtl_load failed "
- "[error=%d]", error);
- return (error);
- }
-
- uint64_t obsolete_sm_object = vdev_obsolete_sm_object(vd);
- if (obsolete_sm_object != 0) {
- objset_t *mos = vd->vdev_spa->spa_meta_objset;
- ASSERT(vd->vdev_asize != 0);
- ASSERT(vd->vdev_obsolete_sm == NULL);
-
- if ((error = space_map_open(&vd->vdev_obsolete_sm, mos,
- obsolete_sm_object, 0, vd->vdev_asize, 0))) {
- vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
- VDEV_AUX_CORRUPT_DATA);
- vdev_dbgmsg(vd, "vdev_load: space_map_open failed for "
- "obsolete spacemap (obj %llu) [error=%d]",
- (u_longlong_t)obsolete_sm_object, error);
- return (error);
- }
- space_map_update(vd->vdev_obsolete_sm);
- }
-
- return (0);
}
/*
* The special vdev case is used for hot spares and l2cache devices. Its
 * sole purpose is to set the vdev state for the associated vdev. To do this,
@@ -2599,46 +2332,18 @@
*/
nvlist_free(label);
return (0);
}
-/*
- * Free the objects used to store this vdev's spacemaps, and the array
- * that points to them.
- */
void
-vdev_destroy_spacemaps(vdev_t *vd, dmu_tx_t *tx)
+vdev_remove(vdev_t *vd, uint64_t txg)
{
- if (vd->vdev_ms_array == 0)
- return;
-
- objset_t *mos = vd->vdev_spa->spa_meta_objset;
- uint64_t array_count = vd->vdev_asize >> vd->vdev_ms_shift;
- size_t array_bytes = array_count * sizeof (uint64_t);
- uint64_t *smobj_array = kmem_alloc(array_bytes, KM_SLEEP);
- VERIFY0(dmu_read(mos, vd->vdev_ms_array, 0,
- array_bytes, smobj_array, 0));
-
- for (uint64_t i = 0; i < array_count; i++) {
- uint64_t smobj = smobj_array[i];
- if (smobj == 0)
- continue;
-
- space_map_free_obj(mos, smobj, tx);
- }
-
- kmem_free(smobj_array, array_bytes);
- VERIFY0(dmu_object_free(mos, vd->vdev_ms_array, tx));
- vd->vdev_ms_array = 0;
-}
-
-static void
-vdev_remove_empty(vdev_t *vd, uint64_t txg)
-{
spa_t *spa = vd->vdev_spa;
+ objset_t *mos = spa->spa_meta_objset;
dmu_tx_t *tx;
+ tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
ASSERT(vd == vd->vdev_top);
ASSERT3U(txg, ==, spa_syncing_txg(spa));
if (vd->vdev_ms != NULL) {
metaslab_group_t *mg = vd->vdev_mg;
@@ -2661,25 +2366,30 @@
* and metaslab class are up-to-date.
*/
metaslab_group_histogram_remove(mg, msp);
VERIFY0(space_map_allocated(msp->ms_sm));
+ space_map_free(msp->ms_sm, tx);
space_map_close(msp->ms_sm);
msp->ms_sm = NULL;
mutex_exit(&msp->ms_lock);
}
metaslab_group_histogram_verify(mg);
metaslab_class_histogram_verify(mg->mg_class);
for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
ASSERT0(mg->mg_histogram[i]);
+
}
- tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
- vdev_destroy_spacemaps(vd, tx);
+ if (vd->vdev_ms_array) {
+ (void) dmu_object_free(mos, vd->vdev_ms_array, tx);
+ vd->vdev_ms_array = 0;
+ }
- if (vd->vdev_islog && vd->vdev_top_zap != 0) {
+ if ((vd->vdev_islog || vd->vdev_isspecial) &&
+ vd->vdev_top_zap != 0) {
vdev_destroy_unlink_zap(vd, vd->vdev_top_zap, tx);
vd->vdev_top_zap = 0;
}
dmu_tx_commit(tx);
}
@@ -2688,11 +2398,11 @@
vdev_sync_done(vdev_t *vd, uint64_t txg)
{
metaslab_t *msp;
boolean_t reassess = !txg_list_empty(&vd->vdev_ms_list, TXG_CLEAN(txg));
- ASSERT(vdev_is_concrete(vd));
+ ASSERT(!vd->vdev_ishole);
while (msp = txg_list_remove(&vd->vdev_ms_list, TXG_CLEAN(txg)))
metaslab_sync_done(msp, txg);
if (reassess)
@@ -2705,63 +2415,36 @@
spa_t *spa = vd->vdev_spa;
vdev_t *lvd;
metaslab_t *msp;
dmu_tx_t *tx;
- if (range_tree_space(vd->vdev_obsolete_segments) > 0) {
- dmu_tx_t *tx;
+ ASSERT(!vd->vdev_ishole);
- ASSERT(vd->vdev_removing ||
- vd->vdev_ops == &vdev_indirect_ops);
-
- tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
- vdev_indirect_sync_obsolete(vd, tx);
- dmu_tx_commit(tx);
-
- /*
- * If the vdev is indirect, it can't have dirty
- * metaslabs or DTLs.
- */
- if (vd->vdev_ops == &vdev_indirect_ops) {
- ASSERT(txg_list_empty(&vd->vdev_ms_list, txg));
- ASSERT(txg_list_empty(&vd->vdev_dtl_list, txg));
- return;
- }
- }
-
- ASSERT(vdev_is_concrete(vd));
-
- if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0 &&
- !vd->vdev_removing) {
+ if (vd->vdev_ms_array == 0 && vd->vdev_ms_shift != 0) {
ASSERT(vd == vd->vdev_top);
- ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
vd->vdev_ms_array = dmu_object_alloc(spa->spa_meta_objset,
DMU_OT_OBJECT_ARRAY, 0, DMU_OT_NONE, 0, tx);
ASSERT(vd->vdev_ms_array != 0);
vdev_config_dirty(vd);
dmu_tx_commit(tx);
}
+ /*
+ * Remove the metadata associated with this vdev once it's empty.
+ */
+ if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing)
+ vdev_remove(vd, txg);
+
while ((msp = txg_list_remove(&vd->vdev_ms_list, txg)) != NULL) {
metaslab_sync(msp, txg);
(void) txg_list_add(&vd->vdev_ms_list, msp, TXG_CLEAN(txg));
}
while ((lvd = txg_list_remove(&vd->vdev_dtl_list, txg)) != NULL)
vdev_dtl_sync(lvd, txg);
- /*
- * Remove the metadata associated with this vdev once it's empty.
- * Note that this is typically used for log/cache device removal;
- * we don't empty toplevel vdevs when removing them. But if
- * a toplevel happens to be emptied, this is not harmful.
- */
- if (vd->vdev_stat.vs_alloc == 0 && vd->vdev_removing) {
- vdev_remove_empty(vd, txg);
- }
-
(void) txg_list_add(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg));
}
uint64_t
vdev_psize_to_asize(vdev_t *vd, uint64_t psize)
@@ -2881,12 +2564,12 @@
wasoffline = (vd->vdev_offline || vd->vdev_tmpoffline);
oldstate = vd->vdev_state;
tvd = vd->vdev_top;
- vd->vdev_offline = B_FALSE;
- vd->vdev_tmpoffline = B_FALSE;
+ vd->vdev_offline = 0ULL;
+ vd->vdev_tmpoffline = 0ULL;
vd->vdev_checkremove = !!(flags & ZFS_ONLINE_CHECKREMOVE);
vd->vdev_forcefault = !!(flags & ZFS_ONLINE_FORCEFAULT);
/* XXX - L2ARC 1.0 does not support expansion */
if (!vd->vdev_aux) {
@@ -2971,11 +2654,11 @@
* Prevent any future allocations.
*/
metaslab_group_passivate(mg);
(void) spa_vdev_state_exit(spa, vd, 0);
- error = spa_reset_logs(spa);
+ error = spa_offline_log(spa);
spa_vdev_state_enter(spa, SCL_ALLOC);
/*
* Check to see if the config has changed.
@@ -3038,30 +2721,45 @@
* children. If 'vd' is NULL, then the user wants to clear all vdevs.
*/
void
vdev_clear(spa_t *spa, vdev_t *vd)
{
+ int c;
vdev_t *rvd = spa->spa_root_vdev;
ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
- if (vd == NULL)
+ if (vd == NULL) {
vd = rvd;
+ /* Go through spare and l2cache vdevs */
+ for (c = 0; c < spa->spa_spares.sav_count; c++)
+ vdev_clear(spa, spa->spa_spares.sav_vdevs[c]);
+ for (c = 0; c < spa->spa_l2cache.sav_count; c++)
+ vdev_clear(spa, spa->spa_l2cache.sav_vdevs[c]);
+ }
+
vd->vdev_stat.vs_read_errors = 0;
vd->vdev_stat.vs_write_errors = 0;
vd->vdev_stat.vs_checksum_errors = 0;
- for (int c = 0; c < vd->vdev_children; c++)
- vdev_clear(spa, vd->vdev_child[c]);
-
/*
- * It makes no sense to "clear" an indirect vdev.
+ * If all disk vdevs failed at the same time (e.g. due to a
+ * disconnected cable), that suspends I/O activity to the pool,
+ * which stalls spa_sync if there happened to be any dirty data.
+ * As a consequence, this flag might not be cleared, because it
+ * is only lowered by spa_async_remove (which cannot run). This
+ * then prevents zio_resume from succeeding even if vdev reopen
+ * succeeds, leading to an indefinitely suspended pool. So we
+ * lower the flag here to allow zio_resume to succeed, provided
+ * reopening of the vdevs succeeds.
*/
- if (!vdev_is_concrete(vd))
- return;
+ vd->vdev_remove_wanted = B_FALSE;
+ for (c = 0; c < vd->vdev_children; c++)
+ vdev_clear(spa, vd->vdev_child[c]);
+
/*
* If we're in the FAULTED state or have experienced failed I/O, then
* clear the persistent state and attempt to reopen the device. We
* also mark the vdev config dirty, so that the new faulted state is
* written out to disk.
@@ -3112,26 +2810,24 @@
* This simplifies the code since we don't have to check for
* these types of devices in the various code paths.
* Instead we rely on the fact that we skip over dead devices
* before issuing I/O to them.
*/
- return (vd->vdev_state < VDEV_STATE_DEGRADED ||
- vd->vdev_ops == &vdev_hole_ops ||
+ return (vd->vdev_state < VDEV_STATE_DEGRADED || vd->vdev_ishole ||
vd->vdev_ops == &vdev_missing_ops);
}
boolean_t
vdev_readable(vdev_t *vd)
{
- return (!vdev_is_dead(vd) && !vd->vdev_cant_read);
+ return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_read);
}
boolean_t
vdev_writeable(vdev_t *vd)
{
- return (!vdev_is_dead(vd) && !vd->vdev_cant_write &&
- vdev_is_concrete(vd));
+ return (vd != NULL && !vdev_is_dead(vd) && !vd->vdev_cant_write);
}
boolean_t
vdev_allocatable(vdev_t *vd)
{
@@ -3144,11 +2840,11 @@
* the proper locks. Note that we have to get the vdev state
* in a local variable because although it changes atomically,
* we're asking two separate questions about it.
*/
return (!(state < VDEV_STATE_DEGRADED && state != VDEV_STATE_CLOSED) &&
- !vd->vdev_cant_write && vdev_is_concrete(vd) &&
+ !vd->vdev_cant_write && !vd->vdev_ishole &&
vd->vdev_mg->mg_initialized);
}
boolean_t
vdev_accessible(vdev_t *vd, zio_t *zio)
@@ -3193,12 +2889,11 @@
*/
if (vd->vdev_aux == NULL && tvd != NULL) {
vs->vs_esize = P2ALIGN(vd->vdev_max_asize - vd->vdev_asize -
spa->spa_bootsize, 1ULL << tvd->vdev_ms_shift);
}
- if (vd->vdev_aux == NULL && vd == vd->vdev_top &&
- vdev_is_concrete(vd)) {
+ if (vd->vdev_aux == NULL && vd == vd->vdev_top && !vd->vdev_ishole) {
vs->vs_fragmentation = vd->vdev_mg->mg_fragmentation;
}
/*
* If we're getting stats on the root vdev, aggregate the I/O counts
@@ -3210,10 +2905,12 @@
vdev_stat_t *cvs = &cvd->vdev_stat;
for (int t = 0; t < ZIO_TYPES; t++) {
vs->vs_ops[t] += cvs->vs_ops[t];
vs->vs_bytes[t] += cvs->vs_bytes[t];
+ vs->vs_iotime[t] += cvs->vs_iotime[t];
+ vs->vs_latency[t] += cvs->vs_latency[t];
}
cvs->vs_scan_removing = cvd->vdev_removing;
}
}
mutex_exit(&vd->vdev_stat_lock);
@@ -3302,10 +2999,24 @@
}
vs->vs_ops[type]++;
vs->vs_bytes[type] += psize;
+ /*
+ * While measuring each delta in nanoseconds, we should keep
+ * cumulative iotime in microseconds so it doesn't overflow on
+ * a busy system.
+ */
+ vs->vs_iotime[type] += (zio->io_vd_timestamp) / 1000;
+
+ /*
+ * Latency is an exponential moving average of iotime deltas
+ * with tuneable alpha measured in 1/10th of a percent.
+ */
+ vs->vs_latency[type] += ((int64_t)zio->io_vd_timestamp -
+ vs->vs_latency[type]) * zfs_vs_latency_alpha / 1000;
+
mutex_exit(&vd->vdev_stat_lock);
return;
}
if (flags & ZIO_FLAG_SPECULATIVE)
@@ -3338,12 +3049,22 @@
}
if (type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
vs->vs_write_errors++;
mutex_exit(&vd->vdev_stat_lock);
- if (spa->spa_load_state == SPA_LOAD_NONE &&
- type == ZIO_TYPE_WRITE && txg != 0 &&
+ if ((vd->vdev_isspecial || vd->vdev_isspecial_child) &&
+ (vs->vs_checksum_errors != 0 || vs->vs_read_errors != 0 ||
+ vs->vs_write_errors != 0 || !vdev_readable(vd) ||
+ !vdev_writeable(vd)) && !spa->spa_special_has_errors) {
+ /* all new writes will be placed on normal */
+ cmn_err(CE_WARN, "New writes to special vdev [%s] "
+ "will be stopped", (vd->vdev_path != NULL) ?
+ vd->vdev_path : "undefined");
+ spa->spa_special_has_errors = B_TRUE;
+ }
+
+ if (type == ZIO_TYPE_WRITE && txg != 0 &&
(!(flags & ZIO_FLAG_IO_REPAIR) ||
(flags & ZIO_FLAG_SCAN_THREAD) ||
spa->spa_claiming)) {
/*
* This is either a normal write (not a repair), or it's
@@ -3414,11 +3135,11 @@
vd->vdev_stat.vs_alloc += alloc_delta;
vd->vdev_stat.vs_space += space_delta;
vd->vdev_stat.vs_dspace += dspace_delta;
mutex_exit(&vd->vdev_stat_lock);
- if (mc == spa_normal_class(spa)) {
+ if (mc == spa_normal_class(spa) || mc == spa_special_class(spa)) {
mutex_enter(&rvd->vdev_stat_lock);
rvd->vdev_stat.vs_alloc += alloc_delta;
rvd->vdev_stat.vs_space += space_delta;
rvd->vdev_stat.vs_dspace += dspace_delta;
mutex_exit(&rvd->vdev_stat_lock);
@@ -3504,14 +3225,13 @@
vdev_config_dirty(rvd->vdev_child[c]);
} else {
ASSERT(vd == vd->vdev_top);
if (!list_link_active(&vd->vdev_config_dirty_node) &&
- vdev_is_concrete(vd)) {
+ !vd->vdev_ishole)
list_insert_head(&spa->spa_config_dirty_list, vd);
}
- }
}
void
vdev_config_clean(vdev_t *vd)
{
@@ -3547,12 +3267,11 @@
*/
ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) ||
(dsl_pool_sync_context(spa_get_dsl(spa)) &&
spa_config_held(spa, SCL_STATE, RW_READER)));
- if (!list_link_active(&vd->vdev_state_dirty_node) &&
- vdev_is_concrete(vd))
+ if (!list_link_active(&vd->vdev_state_dirty_node) && !vd->vdev_ishole)
list_insert_head(&spa->spa_state_dirty_list, vd);
}
void
vdev_state_clean(vdev_t *vd)
@@ -3582,14 +3301,13 @@
if (vd->vdev_children > 0) {
for (int c = 0; c < vd->vdev_children; c++) {
child = vd->vdev_child[c];
/*
- * Don't factor holes or indirect vdevs into the
- * decision.
+ * Don't factor holes into the decision.
*/
- if (!vdev_is_concrete(child))
+ if (child->vdev_ishole)
continue;
if (!vdev_readable(child) ||
(!vdev_writeable(child) && spa_writeable(spa))) {
/*
@@ -3760,23 +3478,10 @@
if (!isopen && vd->vdev_parent)
vdev_propagate_state(vd->vdev_parent);
}
-boolean_t
-vdev_children_are_offline(vdev_t *vd)
-{
- ASSERT(!vd->vdev_ops->vdev_op_leaf);
-
- for (uint64_t i = 0; i < vd->vdev_children; i++) {
- if (vd->vdev_child[i]->vdev_state != VDEV_STATE_OFFLINE)
- return (B_FALSE);
- }
-
- return (B_TRUE);
-}
-
/*
* Check the vdev configuration to ensure that it's capable of supporting
* a root pool. We do not support partial configuration.
* In addition, only a single top-level vdev is allowed.
*/
@@ -3787,12 +3492,11 @@
char *vdev_type = vd->vdev_ops->vdev_op_type;
if (strcmp(vdev_type, VDEV_TYPE_ROOT) == 0 &&
vd->vdev_children > 1) {
return (B_FALSE);
- } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0 ||
- strcmp(vdev_type, VDEV_TYPE_INDIRECT) == 0) {
+ } else if (strcmp(vdev_type, VDEV_TYPE_MISSING) == 0) {
return (B_FALSE);
}
}
for (int c = 0; c < vd->vdev_children; c++) {
@@ -3800,19 +3504,36 @@
return (B_FALSE);
}
return (B_TRUE);
}
-boolean_t
-vdev_is_concrete(vdev_t *vd)
+/*
+ * Load the state from the original vdev tree (ovd) which
+ * we've retrieved from the MOS config object. If the original
+ * vdev was offline or faulted then we transfer that state to the
+ * device in the current vdev tree (nvd).
+ */
+void
+vdev_load_log_state(vdev_t *nvd, vdev_t *ovd)
{
- vdev_ops_t *ops = vd->vdev_ops;
- if (ops == &vdev_indirect_ops || ops == &vdev_hole_ops ||
- ops == &vdev_missing_ops || ops == &vdev_root_ops) {
- return (B_FALSE);
- } else {
- return (B_TRUE);
+ spa_t *spa = nvd->vdev_spa;
+
+ ASSERT(nvd->vdev_top->vdev_islog);
+ ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);
+ ASSERT3U(nvd->vdev_guid, ==, ovd->vdev_guid);
+
+ for (int c = 0; c < nvd->vdev_children; c++)
+ vdev_load_log_state(nvd->vdev_child[c], ovd->vdev_child[c]);
+
+ if (nvd->vdev_ops->vdev_op_leaf) {
+ /*
+ * Restore the persistent vdev state
+ */
+ nvd->vdev_offline = ovd->vdev_offline;
+ nvd->vdev_faulted = ovd->vdev_faulted;
+ nvd->vdev_degraded = ovd->vdev_degraded;
+ nvd->vdev_removed = ovd->vdev_removed;
}
}
/*
* Determine if a log device has valid content. If the vdev was
@@ -3840,14 +3561,11 @@
vdev_expand(vdev_t *vd, uint64_t txg)
{
ASSERT(vd->vdev_top == vd);
ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL);
- vdev_set_deflate_ratio(vd);
-
- if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count &&
- vdev_is_concrete(vd)) {
+ if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) {
VERIFY(vdev_metaslab_init(vd, txg) == 0);
vdev_config_dirty(vd);
}
}
@@ -3894,16 +3612,141 @@
* the spa_deadman_synctime we panic the system.
*/
fio = avl_first(&vq->vq_active_tree);
delta = gethrtime() - fio->io_timestamp;
if (delta > spa_deadman_synctime(spa)) {
- vdev_dbgmsg(vd, "SLOW IO: zio timestamp "
- "%lluns, delta %lluns, last io %lluns",
- fio->io_timestamp, (u_longlong_t)delta,
+ zfs_dbgmsg("SLOW IO: zio timestamp %lluns, "
+ "delta %lluns, last io %lluns",
+ fio->io_timestamp, delta,
vq->vq_io_complete_ts);
fm_panic("I/O to pool '%s' appears to be "
"hung.", spa_name(spa));
}
}
mutex_exit(&vq->vq_lock);
}
+}
+
+boolean_t
+vdev_type_is_ddt(vdev_t *vd)
+{
+ uint64_t pool;
+
+ if (vd->vdev_l2ad_ddt == 1 &&
+ zfs_ddt_limit_type == DDT_LIMIT_TO_L2ARC) {
+ ASSERT(spa_l2cache_exists(vd->vdev_guid, &pool));
+ ASSERT(vd->vdev_isl2cache);
+ return (B_TRUE);
+ }
+ return (B_FALSE);
+}
+
+/* count leaf vdev(s) under the given vdev */
+uint_t
+vdev_count_leaf_vdevs(vdev_t *vd)
+{
+ uint_t cnt = 0;
+
+ if (vd->vdev_ops->vdev_op_leaf)
+ return (1);
+
+ /* if this is not a leaf vdev - visit children */
+ for (int c = 0; c < vd->vdev_children; c++)
+ cnt += vdev_count_leaf_vdevs(vd->vdev_child[c]);
+
+ return (cnt);
+}
+
+/*
+ * Implements the per-vdev portion of manual TRIM. The function passes over
+ * all metaslabs on this vdev and performs a metaslab_trim_all on them. It's
+ * also responsible for rate-control if spa_man_trim_rate is non-zero.
+ */
+void
+vdev_man_trim(vdev_trim_info_t *vti)
+{
+ clock_t t = ddi_get_lbolt();
+ spa_t *spa = vti->vti_vdev->vdev_spa;
+ vdev_t *vd = vti->vti_vdev;
+
+ vd->vdev_man_trimming = B_TRUE;
+ vd->vdev_trim_prog = 0;
+
+ spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+ for (uint64_t i = 0; i < vti->vti_vdev->vdev_ms_count &&
+ !spa->spa_man_trim_stop; i++) {
+ uint64_t delta;
+ metaslab_t *msp = vd->vdev_ms[i];
+ zio_t *trim_io = metaslab_trim_all(msp, &delta);
+
+ atomic_add_64(&vd->vdev_trim_prog, msp->ms_size);
+ spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+
+ (void) zio_wait(trim_io);
+
+ /* delay loop to handle fixed-rate trimming */
+ for (;;) {
+ uint64_t rate = spa->spa_man_trim_rate;
+ uint64_t sleep_delay;
+
+ if (rate == 0) {
+ /* No delay, just update 't' and move on. */
+ t = ddi_get_lbolt();
+ break;
+ }
+
+ sleep_delay = (delta * hz) / rate;
+ mutex_enter(&spa->spa_man_trim_lock);
+ (void) cv_timedwait(&spa->spa_man_trim_update_cv,
+ &spa->spa_man_trim_lock, t);
+ mutex_exit(&spa->spa_man_trim_lock);
+
+ /* If interrupted, don't try to relock, get out */
+ if (spa->spa_man_trim_stop)
+ goto out;
+
+ /* Timeout passed, move on to the next metaslab. */
+ if (ddi_get_lbolt() >= t + sleep_delay) {
+ t += sleep_delay;
+ break;
+ }
+ }
+ spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+ }
+ spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+out:
+ vd->vdev_man_trimming = B_FALSE;
+ /*
+ * Ensure we're marked as "completed" even if we've had to stop
+ * before processing all metaslabs.
+ */
+ vd->vdev_trim_prog = vd->vdev_asize;
+
+ ASSERT(vti->vti_done_cb != NULL);
+ vti->vti_done_cb(vti->vti_done_arg);
+
+ kmem_free(vti, sizeof (*vti));
+}
+
+/*
+ * Runs through all metaslabs on the vdev and does their autotrim processing.
+ */
+void
+vdev_auto_trim(vdev_trim_info_t *vti)
+{
+ vdev_t *vd = vti->vti_vdev;
+ spa_t *spa = vd->vdev_spa;
+ uint64_t txg = vti->vti_txg;
+
+ if (vd->vdev_man_trimming)
+ goto out;
+
+ spa_config_enter(spa, SCL_STATE_ALL, FTAG, RW_READER);
+ for (uint64_t i = 0; i < vd->vdev_ms_count; i++)
+ metaslab_auto_trim(vd->vdev_ms[i], txg);
+ spa_config_exit(spa, SCL_STATE_ALL, FTAG);
+out:
+ ASSERT(vti->vti_done_cb != NULL);
+ vti->vti_done_cb(vti->vti_done_arg);
+
+ kmem_free(vti, sizeof (*vti));
}