NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrectly
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
* refactored WRC move-logic to use zio kmem_caches
* replaced the size and compression fields with a single blk_prop field
(the same layout idea as in blkptr_t) to slightly reduce the size of
wrc_block_t, and added blkptr_t-style macros to get PSIZE, LSIZE
and COMPRESSION (see the sketch after this entry)
* reduced the number of atomic calls to lower CPU overhead
* removed unused code
* fixed naming of variables
* fixed a possible system panic after restarting the system
with WRC enabled
* fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
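The blk_prop change above packs PSIZE, LSIZE and COMPRESSION into one 64-bit word, the way blkptr_t packs them into blk_prop, so wrc_block_t stores a single field instead of three. The program below is a minimal, self-contained sketch of that style of bit-field packing; the WB_* macro names, field widths and size encoding are invented for illustration and do not reflect the real wrc_block_t layout or the real BF64_* helpers.

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins for blk_prop-style bit-field accessors. */
#define	BF64_GET(x, low, len)	(((x) >> (low)) & ((1ULL << (len)) - 1))
#define	BF64_SET(x, low, len, val) \
	((x) = ((x) & ~(((1ULL << (len)) - 1) << (low))) | \
	    (((uint64_t)(val) & ((1ULL << (len)) - 1)) << (low)))

/* Hypothetical layout: LSIZE bits 0..15, PSIZE 16..31, COMPRESS 32..39. */
#define	WB_GET_LSIZE(p)		(BF64_GET((p), 0, 16) << 9)
#define	WB_GET_PSIZE(p)		(BF64_GET((p), 16, 16) << 9)
#define	WB_GET_COMPRESS(p)	BF64_GET((p), 32, 8)
#define	WB_SET_LSIZE(p, v)	BF64_SET((p), 0, 16, (v) >> 9)
#define	WB_SET_PSIZE(p, v)	BF64_SET((p), 16, 16, (v) >> 9)
#define	WB_SET_COMPRESS(p, v)	BF64_SET((p), 32, 8, (v))

int
main(void)
{
	uint64_t prop = 0;		/* one word instead of three fields */

	WB_SET_LSIZE(prop, 128 * 1024);	/* sizes kept in 512-byte units */
	WB_SET_PSIZE(prop, 12 * 1024);
	WB_SET_COMPRESS(prop, 15);	/* some compression algorithm id */

	(void) printf("lsize=%llu psize=%llu comp=%llu\n",
	    (unsigned long long)WB_GET_LSIZE(prop),
	    (unsigned long long)WB_GET_PSIZE(prop),
	    (unsigned long long)WB_GET_COMPRESS(prop));
	return (0);
}
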
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
usr/src/uts/common/io/scsi/targets/sd.c
usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures
*** 21,30 ****
--- 21,31 ----
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2011, 2015 by Delphix. All rights reserved.
* Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
* Copyright (c) 2014 Integros [integros.com]
+ * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
*/
#include <sys/zfs_context.h>
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
*** 32,42 ****
#include <sys/metaslab_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zio.h>
#include <sys/spa_impl.h>
#include <sys/zfeature.h>
! #include <sys/vdev_indirect_mapping.h>
#define GANG_ALLOCATION(flags) \
((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
uint64_t metaslab_aliquot = 512ULL << 10;
--- 33,43 ----
#include <sys/metaslab_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zio.h>
#include <sys/spa_impl.h>
#include <sys/zfeature.h>
! #include <sys/wbc.h>
#define GANG_ALLOCATION(flags) \
((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
uint64_t metaslab_aliquot = 512ULL << 10;
*** 165,179 ****
* Enable/disable metaslab group biasing.
*/
boolean_t metaslab_bias_enabled = B_TRUE;
/*
- * Enable/disable remapping of indirect DVAs to their concrete vdevs.
- */
- boolean_t zfs_remap_blkptr_enable = B_TRUE;
-
- /*
* Enable/disable segment-based metaslab selection.
*/
boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
/*
--- 166,175 ----
*** 199,214 ****
*/
uint64_t metaslab_trace_max_entries = 5000;
static uint64_t metaslab_weight(metaslab_t *);
static void metaslab_set_fragmentation(metaslab_t *);
- static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
- static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
kmem_cache_t *metaslab_alloc_trace_cache;
/*
* ==========================================================================
* Metaslab classes
* ==========================================================================
*/
metaslab_class_t *
--- 195,244 ----
*/
uint64_t metaslab_trace_max_entries = 5000;
static uint64_t metaslab_weight(metaslab_t *);
static void metaslab_set_fragmentation(metaslab_t *);
kmem_cache_t *metaslab_alloc_trace_cache;
/*
+ * Toggle between space-based DVA allocator 0, latency-based 1 or hybrid 2.
+ * A value other than 0, 1 or 2 will be considered 0 (default).
+ */
+ int metaslab_alloc_dva_algorithm = 0;
+
+ /*
+ * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+ * issued to the underlying vdev. We keep two range trees of extents
+ * (called "trim sets") to be trimmed per metaslab, the `current' and
+ * the `previous' TS. New free's are added to the current TS. Then,
+ * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
+ * TS becomes the `previous' TS and a new, blank TS is created to be
+ * the new `current', which will then start accumulating any new frees.
+ * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
+ * extents are trimmed, the TS is destroyed and the current TS again
+ * becomes the previous TS.
+ * This serves to fulfill two functions: aggregate many small frees
+ * into fewer larger trim operations (which should help with devices
+ * which do not take so kindly to them) and to allow for disaster
+ * recovery (extents won't get trimmed immediately, but instead only
+ * after passing this rather long timeout, thus not preserving
+ * 'zfs import -F' functionality).
+ */
+ unsigned int zfs_txgs_per_trim = 32;
+
+ static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
+ static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
+
+ static zio_t *metaslab_exec_trim(metaslab_t *msp);
+
+ static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
+ static void metaslab_free_trimset(metaslab_trimset_t *ts);
+ static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
+ uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
+
+ /*
* ==========================================================================
* Metaslab classes
* ==========================================================================
*/
metaslab_class_t *
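The block comment above on zfs_txgs_per_trim describes a two-trim-set scheme: new frees land in the `current' set, after zfs_txgs_per_trim TXGs that set is demoted to `previous', and only after a further interval are its extents actually trimmed and the set destroyed. The stand-alone program below sketches just that rotation policy under stated assumptions: the trimset fields and the rotate helper are invented for the example, while the real metaslab_trimset_t wraps a range tree and is driven from the metaslab sync path.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

unsigned int txgs_per_trim = 32;	/* mirrors zfs_txgs_per_trim */

typedef struct trimset {
	uint64_t ts_birth;	/* txg in which this set began filling */
	uint64_t ts_nsegs;	/* stand-in for the free extents it holds */
} trimset_t;

static trimset_t *
new_trimset(uint64_t txg)
{
	trimset_t *ts = calloc(1, sizeof (*ts));

	ts->ts_birth = txg;
	return (ts);
}

/* Called once per synced txg for one metaslab. */
static void
trimset_rotate(trimset_t **curp, trimset_t **prevp, uint64_t txg)
{
	if (txg < (*curp)->ts_birth + txgs_per_trim)
		return;				/* keep accumulating frees */

	if (*prevp != NULL) {
		/* The previous set has aged a full interval: issue TRIM. */
		(void) printf("txg %llu: trim %llu extents queued since "
		    "txg %llu\n", (unsigned long long)txg,
		    (unsigned long long)(*prevp)->ts_nsegs,
		    (unsigned long long)(*prevp)->ts_birth);
		free(*prevp);
	}
	*prevp = *curp;			/* current becomes previous */
	*curp = new_trimset(txg);	/* fresh set collects new frees */
}

int
main(void)
{
	trimset_t *cur = new_trimset(0), *prev = NULL;

	for (uint64_t txg = 1; txg <= 128; txg++) {
		cur->ts_nsegs++;		/* pretend one free per txg */
		trimset_rotate(&cur, &prev, txg);
	}
	free(prev);
	free(cur);
	return (0);
}
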
*** 216,225 ****
--- 246,259 ----
{
metaslab_class_t *mc;
mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
+ mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
+ avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
+ sizeof (zio_t), offsetof(zio_t, io_alloc_node));
+
mc->mc_spa = spa;
mc->mc_rotor = NULL;
mc->mc_ops = ops;
mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
refcount_create_tracked(&mc->mc_alloc_slots);
*** 234,243 ****
--- 268,280 ----
ASSERT(mc->mc_alloc == 0);
ASSERT(mc->mc_deferred == 0);
ASSERT(mc->mc_space == 0);
ASSERT(mc->mc_dspace == 0);
+ avl_destroy(&mc->mc_alloc_tree);
+ mutex_destroy(&mc->mc_alloc_lock);
+
refcount_destroy(&mc->mc_alloc_slots);
mutex_destroy(&mc->mc_lock);
kmem_free(mc, sizeof (metaslab_class_t));
}
*** 320,330 ****
/*
* Skip any holes, uninitialized top-levels, or
* vdevs that are not in this metaslab class.
*/
! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
--- 357,367 ----
/*
* Skip any holes, uninitialized top-levels, or
* vdevs that are not in this metaslab class.
*/
! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
*** 355,368 ****
for (int c = 0; c < rvd->vdev_children; c++) {
vdev_t *tvd = rvd->vdev_child[c];
metaslab_group_t *mg = tvd->vdev_mg;
/*
! * Skip any holes, uninitialized top-levels,
! * or vdevs that are not in this metaslab class.
*/
! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
/*
--- 392,405 ----
for (int c = 0; c < rvd->vdev_children; c++) {
vdev_t *tvd = rvd->vdev_child[c];
metaslab_group_t *mg = tvd->vdev_mg;
/*
! * Skip any holes, uninitialized top-levels, or
! * vdevs that are not in this metaslab class.
*/
! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
/*
*** 404,414 ****
for (int c = 0; c < rvd->vdev_children; c++) {
uint64_t tspace;
vdev_t *tvd = rvd->vdev_child[c];
metaslab_group_t *mg = tvd->vdev_mg;
! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
/*
--- 441,451 ----
for (int c = 0; c < rvd->vdev_children; c++) {
uint64_t tspace;
vdev_t *tvd = rvd->vdev_child[c];
metaslab_group_t *mg = tvd->vdev_mg;
! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
mg->mg_class != mc) {
continue;
}
/*
*** 516,527 ****
vdev_stat_t *vs = &vd->vdev_stat;
boolean_t was_allocatable;
boolean_t was_initialized;
ASSERT(vd == vd->vdev_top);
- ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
- SCL_ALLOC);
mutex_enter(&mg->mg_lock);
was_allocatable = mg->mg_allocatable;
was_initialized = mg->mg_initialized;
--- 553,562 ----
*** 615,624 ****
--- 650,660 ----
* either because we never activated in the first place or
* because we're done, and possibly removing the vdev.
*/
ASSERT(mg->mg_activation_count <= 0);
+ if (mg->mg_taskq)
taskq_destroy(mg->mg_taskq);
avl_destroy(&mg->mg_metaslab_tree);
mutex_destroy(&mg->mg_lock);
refcount_destroy(&mg->mg_alloc_queue_depth);
kmem_free(mg, sizeof (metaslab_group_t));
*** 628,638 ****
metaslab_group_activate(metaslab_group_t *mg)
{
metaslab_class_t *mc = mg->mg_class;
metaslab_group_t *mgprev, *mgnext;
! ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
ASSERT(mc->mc_rotor != mg);
ASSERT(mg->mg_prev == NULL);
ASSERT(mg->mg_next == NULL);
ASSERT(mg->mg_activation_count <= 0);
--- 664,674 ----
metaslab_group_activate(metaslab_group_t *mg)
{
metaslab_class_t *mc = mg->mg_class;
metaslab_group_t *mgprev, *mgnext;
! ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
ASSERT(mc->mc_rotor != mg);
ASSERT(mg->mg_prev == NULL);
ASSERT(mg->mg_next == NULL);
ASSERT(mg->mg_activation_count <= 0);
*** 654,705 ****
mgnext->mg_prev = mg;
}
mc->mc_rotor = mg;
}
- /*
- * Passivate a metaslab group and remove it from the allocation rotor.
- * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
- * a metaslab group. This function will momentarily drop spa_config_locks
- * that are lower than the SCL_ALLOC lock (see comment below).
- */
void
metaslab_group_passivate(metaslab_group_t *mg)
{
metaslab_class_t *mc = mg->mg_class;
- spa_t *spa = mc->mc_spa;
metaslab_group_t *mgprev, *mgnext;
- int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
! ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
! (SCL_ALLOC | SCL_ZIO));
if (--mg->mg_activation_count != 0) {
ASSERT(mc->mc_rotor != mg);
ASSERT(mg->mg_prev == NULL);
ASSERT(mg->mg_next == NULL);
ASSERT(mg->mg_activation_count < 0);
return;
}
- /*
- * The spa_config_lock is an array of rwlocks, ordered as
- * follows (from highest to lowest):
- * SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
- * SCL_ZIO > SCL_FREE > SCL_VDEV
- * (For more information about the spa_config_lock see spa_misc.c)
- * The higher the lock, the broader its coverage. When we passivate
- * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
- * config locks. However, the metaslab group's taskq might be trying
- * to preload metaslabs so we must drop the SCL_ZIO lock and any
- * lower locks to allow the I/O to complete. At a minimum,
- * we continue to hold the SCL_ALLOC lock, which prevents any future
- * allocations from taking place and any changes to the vdev tree.
- */
- spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
taskq_wait(mg->mg_taskq);
- spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
metaslab_group_alloc_update(mg);
mgprev = mg->mg_prev;
mgnext = mg->mg_next;
--- 690,716 ----
mgnext->mg_prev = mg;
}
mc->mc_rotor = mg;
}
void
metaslab_group_passivate(metaslab_group_t *mg)
{
metaslab_class_t *mc = mg->mg_class;
metaslab_group_t *mgprev, *mgnext;
! ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
if (--mg->mg_activation_count != 0) {
ASSERT(mc->mc_rotor != mg);
ASSERT(mg->mg_prev == NULL);
ASSERT(mg->mg_next == NULL);
ASSERT(mg->mg_activation_count < 0);
return;
}
taskq_wait(mg->mg_taskq);
metaslab_group_alloc_update(mg);
mgprev = mg->mg_prev;
mgnext = mg->mg_next;
*** 1139,1161 ****
* This is a helper function that can be used by the allocator to find
* a suitable block to allocate. This will search the specified AVL
* tree looking for a block that matches the specified criteria.
*/
static uint64_t
! metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
! uint64_t align)
{
range_seg_t *rs = metaslab_block_find(t, *cursor, size);
! while (rs != NULL) {
uint64_t offset = P2ROUNDUP(rs->rs_start, align);
! if (offset + size <= rs->rs_end) {
*cursor = offset + size;
return (offset);
}
- rs = AVL_NEXT(t, rs);
}
/*
* If we know we've searched the whole map (*cursor == 0), give up.
* Otherwise, reset the cursor to the beginning and try again.
--- 1150,1173 ----
* This is a helper function that can be used by the allocator to find
* a suitable block to allocate. This will search the specified AVL
* tree looking for a block that matches the specified criteria.
*/
static uint64_t
! metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
! uint64_t size, uint64_t align)
{
range_seg_t *rs = metaslab_block_find(t, *cursor, size);
! for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
uint64_t offset = P2ROUNDUP(rs->rs_start, align);
! if (offset + size <= rs->rs_end &&
! !metaslab_check_trim_conflict(msp, &offset, size, align,
! rs->rs_end)) {
*cursor = offset + size;
return (offset);
}
}
/*
* If we know we've searched the whole map (*cursor == 0), give up.
* Otherwise, reset the cursor to the beginning and try again.
*** 1162,1172 ****
*/
if (*cursor == 0)
return (-1ULL);
*cursor = 0;
! return (metaslab_block_picker(t, cursor, size, align));
}
/*
* ==========================================================================
* The first-fit block allocator
--- 1174,1184 ----
*/
if (*cursor == 0)
return (-1ULL);
*cursor = 0;
! return (metaslab_block_picker(msp, t, cursor, size, align));
}
/*
* ==========================================================================
* The first-fit block allocator
*** 1184,1194 ****
*/
uint64_t align = size & -size;
uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
avl_tree_t *t = &msp->ms_tree->rt_root;
! return (metaslab_block_picker(t, cursor, size, align));
}
static metaslab_ops_t metaslab_ff_ops = {
metaslab_ff_alloc
};
--- 1196,1206 ----
*/
uint64_t align = size & -size;
uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
avl_tree_t *t = &msp->ms_tree->rt_root;
! return (metaslab_block_picker(msp, t, cursor, size, align));
}
static metaslab_ops_t metaslab_ff_ops = {
metaslab_ff_alloc
};
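Each of the pickers above now receives the metaslab itself so it can call metaslab_check_trim_conflict() before returning an offset, ensuring allocations never land inside an extent that is currently being trimmed. The snippet below is only a model of that contract as the call sites use it (nudge *offset forward past a conflicting extent while keeping the alignment, and report B_TRUE if the candidate no longer fits below the limit); the real routine consults the metaslab's in-flight trim set rather than a sorted array, so the implementation here is an assumption.

#include <stdio.h>
#include <stdint.h>

typedef int boolean_t;
#define	B_FALSE	0
#define	B_TRUE	1
#define	P2ROUNDUP(x, a)	((((x) - 1) | ((a) - 1)) + 1)

typedef struct extent {
	uint64_t e_start;
	uint64_t e_end;
} extent_t;

/* Model of the conflict check: trimming[] plays the role of the trim set. */
static boolean_t
check_trim_conflict(const extent_t *trimming, int ntrims, uint64_t *offset,
    uint64_t size, uint64_t align, uint64_t limit)
{
	for (int i = 0; i < ntrims; i++) {
		if (*offset < trimming[i].e_end &&
		    *offset + size > trimming[i].e_start) {
			/* Overlap: restart just past the trimmed extent. */
			*offset = P2ROUNDUP(trimming[i].e_end, align);
		}
	}
	return (*offset + size > limit);	/* B_TRUE: still conflicting */
}

int
main(void)
{
	extent_t trims[] = { { 4096, 8192 } };	/* one extent being trimmed */
	uint64_t off = 4096;

	if (!check_trim_conflict(trims, 1, &off, 4096, 4096, 65536))
		(void) printf("allocate at %llu\n", (unsigned long long)off);
	return (0);
}
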
*** 1232,1242 ****
free_pct < metaslab_df_free_pct) {
t = &msp->ms_size_tree;
*cursor = 0;
}
! return (metaslab_block_picker(t, cursor, size, 1ULL));
}
static metaslab_ops_t metaslab_df_ops = {
metaslab_df_alloc
};
--- 1244,1254 ----
free_pct < metaslab_df_free_pct) {
t = &msp->ms_size_tree;
*cursor = 0;
}
! return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
}
static metaslab_ops_t metaslab_df_ops = {
metaslab_df_alloc
};
*** 1264,1281 ****
ASSERT3U(*cursor_end, >=, *cursor);
if ((*cursor + size) > *cursor_end) {
range_seg_t *rs;
!
! rs = avl_last(&msp->ms_size_tree);
! if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
! return (-1ULL);
!
*cursor = rs->rs_start;
*cursor_end = rs->rs_end;
}
offset = *cursor;
*cursor += size;
return (offset);
--- 1276,1299 ----
ASSERT3U(*cursor_end, >=, *cursor);
if ((*cursor + size) > *cursor_end) {
range_seg_t *rs;
! for (rs = avl_last(&msp->ms_size_tree);
! rs != NULL && rs->rs_end - rs->rs_start >= size;
! rs = AVL_PREV(&msp->ms_size_tree, rs)) {
*cursor = rs->rs_start;
*cursor_end = rs->rs_end;
+ if (!metaslab_check_trim_conflict(msp, cursor, size,
+ 1, *cursor_end)) {
+ /* segment appears to be acceptable */
+ break;
}
+ }
+ if (rs == NULL || rs->rs_end - rs->rs_start < size)
+ return (-1ULL);
+ }
offset = *cursor;
*cursor += size;
return (offset);
*** 1307,1316 ****
--- 1325,1336 ----
avl_index_t where;
range_seg_t *rs, rsearch;
uint64_t hbit = highbit64(size);
uint64_t *cursor = &msp->ms_lbas[hbit - 1];
uint64_t max_size = metaslab_block_maxsize(msp);
+ /* mutable copy for adjustment by metaslab_check_trim_conflict */
+ uint64_t adjustable_start;
ASSERT(MUTEX_HELD(&msp->ms_lock));
ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
if (max_size < size)
*** 1318,1344 ****
rsearch.rs_start = *cursor;
rsearch.rs_end = *cursor + size;
rs = avl_find(t, &rsearch, &where);
! if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
t = &msp->ms_size_tree;
rsearch.rs_start = 0;
rsearch.rs_end = MIN(max_size,
1ULL << (hbit + metaslab_ndf_clump_shift));
rs = avl_find(t, &rsearch, &where);
if (rs == NULL)
rs = avl_nearest(t, where, AVL_AFTER);
ASSERT(rs != NULL);
}
-
- if ((rs->rs_end - rs->rs_start) >= size) {
- *cursor = rs->rs_start + size;
- return (rs->rs_start);
}
! return (-1ULL);
}
static metaslab_ops_t metaslab_ndf_ops = {
metaslab_ndf_alloc
};
--- 1338,1373 ----
rsearch.rs_start = *cursor;
rsearch.rs_end = *cursor + size;
rs = avl_find(t, &rsearch, &where);
! if (rs != NULL)
! adjustable_start = rs->rs_start;
! if (rs == NULL || rs->rs_end - adjustable_start < size ||
! metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
! rs->rs_end)) {
! /* segment not usable, try the largest remaining one */
t = &msp->ms_size_tree;
rsearch.rs_start = 0;
rsearch.rs_end = MIN(max_size,
1ULL << (hbit + metaslab_ndf_clump_shift));
rs = avl_find(t, &rsearch, &where);
if (rs == NULL)
rs = avl_nearest(t, where, AVL_AFTER);
ASSERT(rs != NULL);
+ adjustable_start = rs->rs_start;
+ if (rs->rs_end - adjustable_start < size ||
+ metaslab_check_trim_conflict(msp, &adjustable_start,
+ size, 1, rs->rs_end)) {
+ /* even largest remaining segment not usable */
+ return (-1ULL);
}
}
!
! *cursor = adjustable_start + size;
! return (*cursor);
}
static metaslab_ops_t metaslab_ndf_ops = {
metaslab_ndf_alloc
};
*** 1374,1389 ****
ASSERT(MUTEX_HELD(&msp->ms_lock));
ASSERT(!msp->ms_loaded);
ASSERT(!msp->ms_loading);
msp->ms_loading = B_TRUE;
- /*
- * Nobody else can manipulate a loading metaslab, so it's now safe
- * to drop the lock. This way we don't have to hold the lock while
- * reading the spacemap from disk.
- */
- mutex_exit(&msp->ms_lock);
/*
* If the space map has not been allocated yet, then treat
* all the space in the metaslab as free and add it to the
* ms_tree.
--- 1403,1412 ----
*** 1392,1412 ****
error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
else
range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
success = (error == 0);
-
- mutex_enter(&msp->ms_lock);
msp->ms_loading = B_FALSE;
if (success) {
ASSERT3P(msp->ms_group, !=, NULL);
msp->ms_loaded = B_TRUE;
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
range_tree_walk(msp->ms_defertree[t],
range_tree_remove, msp->ms_tree);
}
msp->ms_max_size = metaslab_block_maxsize(msp);
}
cv_broadcast(&msp->ms_load_cv);
return (error);
--- 1415,1435 ----
error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
else
range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
success = (error == 0);
msp->ms_loading = B_FALSE;
if (success) {
ASSERT3P(msp->ms_group, !=, NULL);
msp->ms_loaded = B_TRUE;
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
range_tree_walk(msp->ms_defertree[t],
range_tree_remove, msp->ms_tree);
+ range_tree_walk(msp->ms_defertree[t],
+ metaslab_trim_remove, msp);
}
msp->ms_max_size = metaslab_block_maxsize(msp);
}
cv_broadcast(&msp->ms_load_cv);
return (error);
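metaslab_load() above now walks each defer tree twice: once to hand the deferred segments back to ms_tree and once, through metaslab_trim_remove(), to pull them out of the pending trim set. Both callbacks fit the same (arg, offset, size) shape that range_tree_walk() expects, which is why they can be swapped in freely. The snippet below only illustrates that callback convention; the walker runs over a flat array and the trim accounting is made up for the example.

#include <stdio.h>
#include <stdint.h>

typedef void walk_cb_t(void *arg, uint64_t offset, uint64_t size);

typedef struct seg {
	uint64_t s_offset;
	uint64_t s_size;
} seg_t;

/* Toy walker standing in for range_tree_walk(). */
static void
segs_walk(const seg_t *segs, int nsegs, walk_cb_t *func, void *arg)
{
	for (int i = 0; i < nsegs; i++)
		func(arg, segs[i].s_offset, segs[i].s_size);
}

/* Callback in the style of metaslab_trim_remove(): shrink pending trim. */
static void
trim_remove_cb(void *arg, uint64_t offset, uint64_t size)
{
	uint64_t *pending = arg;

	*pending -= size;
	(void) printf("un-trim [%llu, %llu)\n", (unsigned long long)offset,
	    (unsigned long long)(offset + size));
}

int
main(void)
{
	seg_t defer[] = { { 0, 512 }, { 4096, 1024 } };
	uint64_t pending_trim = 512 + 1024;

	segs_walk(defer, 2, trim_remove_cb, &pending_trim);
	(void) printf("pending trim bytes: %llu\n",
	    (unsigned long long)pending_trim);
	return (0);
}
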
*** 1431,1442 ****
metaslab_t *ms;
int error;
ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
- mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
ms->ms_id = id;
ms->ms_start = id << vd->vdev_ms_shift;
ms->ms_size = 1ULL << vd->vdev_ms_shift;
/*
--- 1454,1465 ----
metaslab_t *ms;
int error;
ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
+ cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
ms->ms_id = id;
ms->ms_start = id << vd->vdev_ms_shift;
ms->ms_size = 1ULL << vd->vdev_ms_shift;
/*
*** 1443,1470 ****
* We only open space map objects that already exist. All others
* will be opened when we finally allocate an object for it.
*/
if (object != 0) {
error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
! ms->ms_size, vd->vdev_ashift);
if (error != 0) {
kmem_free(ms, sizeof (metaslab_t));
return (error);
}
ASSERT(ms->ms_sm != NULL);
}
/*
* We create the main range tree here, but we don't create the
* other range trees until metaslab_sync_done(). This serves
* two purposes: it allows metaslab_sync_done() to detect the
* addition of new space; and for debugging, it ensures that we'd
* data fault on any attempt to use this metaslab before it's ready.
*/
! ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
metaslab_group_add(mg, ms);
metaslab_set_fragmentation(ms);
/*
--- 1466,1495 ----
* We only open space map objects that already exist. All others
* will be opened when we finally allocate an object for it.
*/
if (object != 0) {
error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
! ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
if (error != 0) {
kmem_free(ms, sizeof (metaslab_t));
return (error);
}
ASSERT(ms->ms_sm != NULL);
}
+ ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
+
/*
* We create the main range tree here, but we don't create the
* other range trees until metaslab_sync_done(). This serves
* two purposes: it allows metaslab_sync_done() to detect the
* addition of new space; and for debugging, it ensures that we'd
* data fault on any attempt to use this metaslab before it's ready.
*/
! ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
metaslab_group_add(mg, ms);
metaslab_set_fragmentation(ms);
/*
*** 1524,1539 ****
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
range_tree_destroy(msp->ms_defertree[t]);
}
ASSERT0(msp->ms_deferspace);
mutex_exit(&msp->ms_lock);
cv_destroy(&msp->ms_load_cv);
mutex_destroy(&msp->ms_lock);
- mutex_destroy(&msp->ms_sync_lock);
kmem_free(msp, sizeof (metaslab_t));
}
#define FRAGMENTATION_TABLE_SIZE 17
--- 1549,1569 ----
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
range_tree_destroy(msp->ms_defertree[t]);
}
+ metaslab_free_trimset(msp->ms_cur_ts);
+ if (msp->ms_prev_ts)
+ metaslab_free_trimset(msp->ms_prev_ts);
+ ASSERT3P(msp->ms_trimming_ts, ==, NULL);
+
ASSERT0(msp->ms_deferspace);
mutex_exit(&msp->ms_lock);
cv_destroy(&msp->ms_load_cv);
+ cv_destroy(&msp->ms_trim_cv);
mutex_destroy(&msp->ms_lock);
kmem_free(msp, sizeof (metaslab_t));
}
#define FRAGMENTATION_TABLE_SIZE 17
*** 1895,1909 ****
uint64_t weight;
ASSERT(MUTEX_HELD(&msp->ms_lock));
/*
! * If this vdev is in the process of being removed, there is nothing
* for us to do here.
*/
! if (vd->vdev_removing)
return (0);
metaslab_set_fragmentation(msp);
/*
* Update the maximum size if the metaslab is loaded. This will
--- 1925,1942 ----
uint64_t weight;
ASSERT(MUTEX_HELD(&msp->ms_lock));
/*
! * This vdev is in the process of being removed so there is nothing
* for us to do here.
*/
! if (vd->vdev_removing) {
! ASSERT0(space_map_allocated(msp->ms_sm));
! ASSERT0(vd->vdev_ms_shift);
return (0);
+ }
metaslab_set_fragmentation(msp);
/*
* Update the maximum size if the metaslab is loaded. This will
*** 2031,2047 ****
taskq_wait(mg->mg_taskq);
return;
}
mutex_enter(&mg->mg_lock);
-
/*
* Load the next potential metaslabs
*/
for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
- ASSERT3P(msp->ms_group, ==, mg);
-
/*
* We preload only the maximum number of metaslabs specified
* by metaslab_preload_limit. If a metaslab is being forced
* to condense then we preload it too. This will ensure
* that force condensing happens in the next txg.
--- 2064,2077 ----
*** 2064,2074 ****
* 1. The size of the space map object should not dramatically increase as a
* result of writing out the free space range tree.
*
* 2. The minimal on-disk space map representation is zfs_condense_pct/100
* times the size than the free space range tree representation
! * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
*
* 3. The on-disk size of the space map should actually decrease.
*
* Checking the first condition is tricky since we don't want to walk
* the entire AVL tree calculating the estimated on-disk size. Instead we
--- 2094,2104 ----
* 1. The size of the space map object should not dramatically increase as a
* result of writing out the free space range tree.
*
* 2. The minimal on-disk space map representation is zfs_condense_pct/100
* times the size than the free space range tree representation
! * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
*
* 3. The on-disk size of the space map should actually decrease.
*
* Checking the first condition is tricky since we don't want to walk
* the entire AVL tree calculating the estimated on-disk size. Instead we
*** 2161,2171 ****
* that have been freed in this txg, any deferred frees that exist,
* and any allocation in the future. Removing segments should be
* a relatively inexpensive operation since we expect these trees to
* have a small number of nodes.
*/
! condense_tree = range_tree_create(NULL, NULL);
range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
/*
* Remove what's been freed in this txg from the condense_tree.
* Since we're in sync_pass 1, we know that all the frees from
--- 2191,2201 ----
* that have been freed in this txg, any deferred frees that exist,
* and any allocation in the future. Removing segments should be
* a relatively inexpensive operation since we expect these trees to
* have a small number of nodes.
*/
! condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
/*
* Remove what's been freed in this txg from the condense_tree.
* Since we're in sync_pass 1, we know that all the frees from
*** 2194,2203 ****
--- 2224,2234 ----
*/
msp->ms_condensing = B_TRUE;
mutex_exit(&msp->ms_lock);
space_map_truncate(sm, tx);
+ mutex_enter(&msp->ms_lock);
/*
* While we would ideally like to create a space map representation
* that consists only of allocation records, doing so can be
* prohibitively expensive because the in-core free tree can be
*** 2210,2220 ****
space_map_write(sm, condense_tree, SM_ALLOC, tx);
range_tree_vacate(condense_tree, NULL, NULL);
range_tree_destroy(condense_tree);
space_map_write(sm, msp->ms_tree, SM_FREE, tx);
- mutex_enter(&msp->ms_lock);
msp->ms_condensing = B_FALSE;
}
/*
* Write a metaslab to disk in the context of the specified transaction group.
--- 2241,2250 ----
*** 2230,2244 ****
--- 2260,2277 ----
dmu_tx_t *tx;
uint64_t object = space_map_object(msp->ms_sm);
ASSERT(!vd->vdev_ishole);
+ mutex_enter(&msp->ms_lock);
+
/*
* This metaslab has just been added so there's no work to do now.
*/
if (msp->ms_freeingtree == NULL) {
ASSERT3P(alloctree, ==, NULL);
+ mutex_exit(&msp->ms_lock);
return;
}
ASSERT3P(alloctree, !=, NULL);
ASSERT3P(msp->ms_freeingtree, !=, NULL);
*** 2250,2277 ****
* is being forced to condense and it's loaded, we need to let it
* through.
*/
if (range_tree_space(alloctree) == 0 &&
range_tree_space(msp->ms_freeingtree) == 0 &&
! !(msp->ms_loaded && msp->ms_condense_wanted))
return;
VERIFY(txg <= spa_final_dirty_txg(spa));
/*
* The only state that can actually be changing concurrently with
* metaslab_sync() is the metaslab's ms_tree. No other thread can
* be modifying this txg's alloctree, freeingtree, freedtree, or
! * space_map_phys_t. We drop ms_lock whenever we could call
! * into the DMU, because the DMU can call down to us
! * (e.g. via zio_free()) at any time.
! *
! * The spa_vdev_remove_thread() can be reading metaslab state
! * concurrently, and it is locked out by the ms_sync_lock. Note
! * that the ms_lock is insufficient for this, because it is dropped
! * by space_map_write().
*/
tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
if (msp->ms_sm == NULL) {
--- 2283,2308 ----
* is being forced to condense and it's loaded, we need to let it
* through.
*/
if (range_tree_space(alloctree) == 0 &&
range_tree_space(msp->ms_freeingtree) == 0 &&
! !(msp->ms_loaded && msp->ms_condense_wanted)) {
! mutex_exit(&msp->ms_lock);
return;
+ }
VERIFY(txg <= spa_final_dirty_txg(spa));
/*
* The only state that can actually be changing concurrently with
* metaslab_sync() is the metaslab's ms_tree. No other thread can
* be modifying this txg's alloctree, freeingtree, freedtree, or
! * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
! * space map ASSERTs. We drop it whenever we call into the DMU,
! * because the DMU can call down to us (e.g. via zio_free()) at
! * any time.
*/
tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
if (msp->ms_sm == NULL) {
*** 2279,2295 ****
new_object = space_map_alloc(mos, tx);
VERIFY3U(new_object, !=, 0);
VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
! msp->ms_start, msp->ms_size, vd->vdev_ashift));
ASSERT(msp->ms_sm != NULL);
}
- mutex_enter(&msp->ms_sync_lock);
- mutex_enter(&msp->ms_lock);
-
/*
* Note: metaslab_condense() clears the space map's histogram.
* Therefore we must verify and remove this histogram before
* condensing.
*/
--- 2310,2324 ----
new_object = space_map_alloc(mos, tx);
VERIFY3U(new_object, !=, 0);
VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
! msp->ms_start, msp->ms_size, vd->vdev_ashift,
! &msp->ms_lock));
ASSERT(msp->ms_sm != NULL);
}
/*
* Note: metaslab_condense() clears the space map's histogram.
* Therefore we must verify and remove this histogram before
* condensing.
*/
*** 2299,2317 ****
if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
metaslab_should_condense(msp)) {
metaslab_condense(msp, txg, tx);
} else {
- mutex_exit(&msp->ms_lock);
space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
- mutex_enter(&msp->ms_lock);
}
if (msp->ms_loaded) {
/*
! * When the space map is loaded, we have an accurate
* histogram in the range tree. This gives us an opportunity
* to bring the space map's histogram up-to-date so we clear
* it first before updating it.
*/
space_map_histogram_clear(msp->ms_sm);
--- 2328,2344 ----
if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
metaslab_should_condense(msp)) {
metaslab_condense(msp, txg, tx);
} else {
space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
}
if (msp->ms_loaded) {
/*
! * When the space map is loaded, we have an accurate
* histogram in the range tree. This gives us an opportunity
* to bring the space map's histogram up-to-date so we clear
* it first before updating it.
*/
space_map_histogram_clear(msp->ms_sm);
*** 2375,2385 ****
if (object != space_map_object(msp->ms_sm)) {
object = space_map_object(msp->ms_sm);
dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
msp->ms_id, sizeof (uint64_t), &object, tx);
}
- mutex_exit(&msp->ms_sync_lock);
dmu_tx_commit(tx);
}
/*
* Called after a transaction group has completely synced to mark
--- 2402,2411 ----
*** 2405,2437 ****
*/
if (msp->ms_freedtree == NULL) {
for (int t = 0; t < TXG_SIZE; t++) {
ASSERT(msp->ms_alloctree[t] == NULL);
! msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
}
ASSERT3P(msp->ms_freeingtree, ==, NULL);
! msp->ms_freeingtree = range_tree_create(NULL, NULL);
ASSERT3P(msp->ms_freedtree, ==, NULL);
! msp->ms_freedtree = range_tree_create(NULL, NULL);
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
ASSERT(msp->ms_defertree[t] == NULL);
! msp->ms_defertree[t] = range_tree_create(NULL, NULL);
}
vdev_space_update(vd, 0, 0, msp->ms_size);
}
defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
metaslab_class_get_alloc(spa_normal_class(spa));
! if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
defer_allowed = B_FALSE;
}
defer_delta = 0;
alloc_delta = space_map_alloc_delta(msp->ms_sm);
--- 2431,2467 ----
*/
if (msp->ms_freedtree == NULL) {
for (int t = 0; t < TXG_SIZE; t++) {
ASSERT(msp->ms_alloctree[t] == NULL);
! msp->ms_alloctree[t] = range_tree_create(NULL, msp,
! &msp->ms_lock);
}
ASSERT3P(msp->ms_freeingtree, ==, NULL);
! msp->ms_freeingtree = range_tree_create(NULL, msp,
! &msp->ms_lock);
ASSERT3P(msp->ms_freedtree, ==, NULL);
! msp->ms_freedtree = range_tree_create(NULL, msp,
! &msp->ms_lock);
for (int t = 0; t < TXG_DEFER_SIZE; t++) {
ASSERT(msp->ms_defertree[t] == NULL);
! msp->ms_defertree[t] = range_tree_create(NULL, msp,
! &msp->ms_lock);
}
vdev_space_update(vd, 0, 0, msp->ms_size);
}
defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
metaslab_class_get_alloc(spa_normal_class(spa));
! if (free_space <= spa_get_slop_space(spa)) {
defer_allowed = B_FALSE;
}
defer_delta = 0;
alloc_delta = space_map_alloc_delta(msp->ms_sm);
*** 2454,2463 ****
--- 2484,2501 ----
* Move the frees from the defer_tree back to the free
* range tree (if it's loaded). Swap the freed_tree and the
* defer_tree -- this is safe to do because we've just emptied out
* the defer_tree.
*/
+ if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+ !vd->vdev_man_trimming) {
+ range_tree_walk(*defer_tree, metaslab_trim_add, msp);
+ if (!defer_allowed) {
+ range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
+ msp);
+ }
+ }
range_tree_vacate(*defer_tree,
msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
if (defer_allowed) {
range_tree_swap(&msp->ms_freedtree, defer_tree);
} else {
*** 2497,2533 ****
if (!metaslab_debug_unload)
metaslab_unload(msp);
}
- ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
- ASSERT0(range_tree_space(msp->ms_freeingtree));
- ASSERT0(range_tree_space(msp->ms_freedtree));
-
mutex_exit(&msp->ms_lock);
}
void
metaslab_sync_reassess(metaslab_group_t *mg)
{
- spa_t *spa = mg->mg_class->mc_spa;
-
- spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
metaslab_group_alloc_update(mg);
mg->mg_fragmentation = metaslab_group_fragmentation(mg);
/*
! * Preload the next potential metaslabs but only on active
! * metaslab groups. We can get into a state where the metaslab
! * is no longer active since we dirty metaslabs as we remove a
! * device, thus potentially making the metaslab group eligible
! * for preloading.
*/
- if (mg->mg_activation_count > 0) {
metaslab_group_preload(mg);
- }
- spa_config_exit(spa, SCL_ALLOC, FTAG);
}
static uint64_t
metaslab_distance(metaslab_t *msp, dva_t *dva)
{
--- 2535,2557 ----
if (!metaslab_debug_unload)
metaslab_unload(msp);
}
mutex_exit(&msp->ms_lock);
}
void
metaslab_sync_reassess(metaslab_group_t *mg)
{
metaslab_group_alloc_update(mg);
mg->mg_fragmentation = metaslab_group_fragmentation(mg);
/*
! * Preload the next potential metaslabs
*/
metaslab_group_preload(mg);
}
static uint64_t
metaslab_distance(metaslab_t *msp, dva_t *dva)
{
*** 2717,2726 ****
--- 2741,2751 ----
VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
range_tree_remove(rt, start, size);
+ metaslab_trim_remove(msp, start, size);
if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
*** 2738,2748 ****
return (start);
}
static uint64_t
metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
{
metaslab_t *msp = NULL;
uint64_t offset = -1ULL;
uint64_t activation_weight;
uint64_t target_distance;
--- 2763,2774 ----
return (start);
}
static uint64_t
metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
! int flags)
{
metaslab_t *msp = NULL;
uint64_t offset = -1ULL;
uint64_t activation_weight;
uint64_t target_distance;
*** 2759,2768 ****
--- 2785,2795 ----
metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
search->ms_weight = UINT64_MAX;
search->ms_start = 0;
for (;;) {
boolean_t was_active;
+ boolean_t pass_primary = B_TRUE;
avl_tree_t *t = &mg->mg_metaslab_tree;
avl_index_t idx;
mutex_enter(&mg->mg_lock);
*** 2796,2820 ****
*/
if (msp->ms_condensing)
continue;
was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
! if (activation_weight == METASLAB_WEIGHT_PRIMARY)
break;
target_distance = min_distance +
(space_map_allocated(msp->ms_sm) != 0 ? 0 :
min_distance >> 1);
! for (i = 0; i < d; i++) {
if (metaslab_distance(msp, &dva[i]) <
target_distance)
break;
- }
if (i == d)
break;
}
mutex_exit(&mg->mg_lock);
if (msp == NULL) {
kmem_free(search, sizeof (*search));
return (-1ULL);
}
--- 2823,2858 ----
*/
if (msp->ms_condensing)
continue;
was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
! if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
! if (!pass_primary) {
! DTRACE_PROBE(metaslab_use_secondary);
! activation_weight =
! METASLAB_WEIGHT_SECONDARY;
break;
+ }
+ pass_primary = B_FALSE;
+ } else {
+ if (activation_weight ==
+ METASLAB_WEIGHT_PRIMARY)
+ break;
+
target_distance = min_distance +
(space_map_allocated(msp->ms_sm) != 0 ? 0 :
min_distance >> 1);
! for (i = 0; i < d; i++)
if (metaslab_distance(msp, &dva[i]) <
target_distance)
break;
if (i == d)
break;
}
+ }
mutex_exit(&mg->mg_lock);
if (msp == NULL) {
kmem_free(search, sizeof (*search));
return (-1ULL);
}
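With METASLAB_USE_WEIGHT_SECONDARY set, the selection loop above deliberately passes over the first acceptable metaslab and activates the next one with a secondary weight, presumably leaving the heaviest metaslab free for primary allocations. The sketch below models only that selection rule over a weight-sorted array; it omits the activation, distance and retry machinery of metaslab_group_alloc_normal().

#include <stdio.h>
#include <stdint.h>

typedef int boolean_t;
#define	B_FALSE	0
#define	B_TRUE	1

/* Pick an index from weights[] (sorted heaviest first); -1 if none fit. */
static int
pick_metaslab(const uint64_t *weights, int n, boolean_t use_secondary)
{
	boolean_t pass_primary = B_TRUE;

	for (int i = 0; i < n; i++) {
		if (weights[i] == 0)
			continue;	/* e.g. condensing or unusable */
		if (!use_secondary)
			return (i);	/* primary: first acceptable wins */
		if (!pass_primary)
			return (i);	/* secondary: take the runner-up */
		pass_primary = B_FALSE;
	}
	return (-1);
}

int
main(void)
{
	uint64_t w[] = { 900, 700, 0, 300 };

	(void) printf("primary pick:   %d\n", pick_metaslab(w, 4, B_FALSE));
	(void) printf("secondary pick: %d\n", pick_metaslab(w, 4, B_TRUE));
	return (0);
}
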
*** 2931,2947 ****
return (offset);
}
static uint64_t
metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
{
uint64_t offset;
ASSERT(mg->mg_initialized);
offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
! min_distance, dva, d);
mutex_enter(&mg->mg_lock);
if (offset == -1ULL) {
mg->mg_failed_allocations++;
metaslab_trace_add(zal, mg, NULL, asize, d,
--- 2969,2986 ----
return (offset);
}
static uint64_t
metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
! int d, int flags)
{
uint64_t offset;
ASSERT(mg->mg_initialized);
offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
! min_distance, dva, d, flags);
mutex_enter(&mg->mg_lock);
if (offset == -1ULL) {
mg->mg_failed_allocations++;
metaslab_trace_add(zal, mg, NULL, asize, d,
*** 2975,2985 ****
int ditto_same_vdev_distance_shift = 3;
/*
* Allocate a block for the specified i/o.
*/
! int
metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
zio_alloc_list_t *zal)
{
metaslab_group_t *mg, *rotor;
--- 3014,3024 ----
int ditto_same_vdev_distance_shift = 3;
/*
* Allocate a block for the specified i/o.
*/
! static int
metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
zio_alloc_list_t *zal)
{
metaslab_group_t *mg, *rotor;
*** 3021,3035 ****
if (hintdva) {
vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
/*
* It's possible the vdev we're using as the hint no
! * longer exists or its mg has been closed (e.g. by
! * device removal). Consult the rotor when
* all else fails.
*/
! if (vd != NULL && vd->vdev_mg != NULL) {
mg = vd->vdev_mg;
if (flags & METASLAB_HINTBP_AVOID &&
mg->mg_next != NULL)
mg = mg->mg_next;
--- 3060,3073 ----
if (hintdva) {
vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
/*
* It's possible the vdev we're using as the hint no
! * longer exists (i.e. removed). Consult the rotor when
* all else fails.
*/
! if (vd != NULL) {
mg = vd->vdev_mg;
if (flags & METASLAB_HINTBP_AVOID &&
mg->mg_next != NULL)
mg = mg->mg_next;
*** 3120,3130 ****
uint64_t asize = vdev_psize_to_asize(vd, psize);
ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
! distance, dva, d);
if (offset != -1ULL) {
/*
* If we've just selected this metaslab group,
* figure out whether the corresponding vdev is
--- 3158,3168 ----
uint64_t asize = vdev_psize_to_asize(vd, psize);
ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
! distance, dva, d, flags);
if (offset != -1ULL) {
/*
* If we've just selected this metaslab group,
* figure out whether the corresponding vdev is
*** 3131,3144 ****
* over- or under-used relative to the pool,
* and set an allocation bias to even it out.
*/
if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
vdev_stat_t *vs = &vd->vdev_stat;
! int64_t vu, cu;
vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
/*
* Calculate how much more or less we should
* try to allocate from this device during
* this iteration around the rotor.
--- 3169,3187 ----
* over- or under-used relative to the pool,
* and set an allocation bias to even it out.
*/
if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
vdev_stat_t *vs = &vd->vdev_stat;
! vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
! int64_t vu, cu, vu_io;
vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
+ vu_io =
+ (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
+ (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
+ (vd->vdev_parent->vdev_children)) - 100;
/*
* Calculate how much more or less we should
* try to allocate from this device during
* this iteration around the rotor.
*** 3151,3160 ****
--- 3194,3222 ----
* This reduces allocations by 307K for this
* iteration.
*/
mg->mg_bias = ((cu - vu) *
(int64_t)mg->mg_aliquot) / 100;
+
+ /*
+ * Experiment: space-based DVA allocator 0,
+ * latency-based 1 or hybrid 2.
+ */
+ switch (metaslab_alloc_dva_algorithm) {
+ case 1:
+ mg->mg_bias =
+ (vu_io * (int64_t)mg->mg_aliquot) /
+ 100;
+ break;
+ case 2:
+ mg->mg_bias =
+ ((((cu - vu) + vu_io) / 2) *
+ (int64_t)mg->mg_aliquot) / 100;
+ break;
+ default:
+ break;
+ }
} else if (!metaslab_bias_enabled) {
mg->mg_bias = 0;
}
if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
*** 3165,3174 ****
--- 3227,3238 ----
DVA_SET_VDEV(&dva[d], vd->vdev_id);
DVA_SET_OFFSET(&dva[d], offset);
DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
DVA_SET_ASIZE(&dva[d], asize);
+ DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
+ uint64_t, offset, uint64_t, psize);
return (0);
}
next:
mc->mc_rotor = mg->mg_next;
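The metaslab_alloc_dva_algorithm switch above chooses between the existing space-based bias and two variants built on per-vdev write iotime: vu_io compares this vdev's share of the parent's accumulated write time against an even 1/children split, and the hybrid mode averages that with the space delta. The worked example below simply runs the three formulas on made-up numbers; positive bias effectively raises the group's aliquot target (more data before the rotor advances) and negative bias lowers it, assuming the stock rotor-advance check.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int64_t aliquot = 512 << 10;		/* metaslab_aliquot: 512K */
	int64_t vs_alloc = 40, vs_space = 100;	/* this vdev is 40% full */
	int64_t mc_alloc = 50, mc_space = 100;	/* the pool is 50% full */
	int64_t vd_iotime = 30;			/* 30% of parent write time */
	int64_t parent_iotime = 100, children = 4;

	int64_t vu = (vs_alloc * 100) / (vs_space + 1);		/* ~39 */
	int64_t cu = (mc_alloc * 100) / (mc_space + 1);		/* ~49 */
	int64_t vu_io = ((vd_iotime * 100) / (parent_iotime + 1)) *
	    children - 100;					/* ~16 */

	(void) printf("space-based bias (0): %lld\n",
	    (long long)(((cu - vu) * aliquot) / 100));
	(void) printf("latency-based bias (1): %lld\n",
	    (long long)((vu_io * aliquot) / 100));
	(void) printf("hybrid bias (2): %lld\n",
	    (long long)(((((cu - vu) + vu_io) / 2) * aliquot) / 100));
	return (0);
}
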
*** 3187,3418 ****
metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
return (SET_ERROR(ENOSPC));
}
- void
- metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
- uint64_t txg)
- {
- metaslab_t *msp;
- spa_t *spa = vd->vdev_spa;
-
- ASSERT3U(txg, ==, spa->spa_syncing_txg);
- ASSERT(vdev_is_concrete(vd));
- ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
- ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
-
- msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
- VERIFY(!msp->ms_condensing);
- VERIFY3U(offset, >=, msp->ms_start);
- VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
- VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
- VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
-
- metaslab_check_free_impl(vd, offset, asize);
- mutex_enter(&msp->ms_lock);
- if (range_tree_space(msp->ms_freeingtree) == 0) {
- vdev_dirty(vd, VDD_METASLAB, msp, txg);
- }
- range_tree_add(msp->ms_freeingtree, offset, asize);
- mutex_exit(&msp->ms_lock);
- }
-
- /* ARGSUSED */
- void
- metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
- uint64_t size, void *arg)
- {
- uint64_t *txgp = arg;
-
- if (vd->vdev_ops->vdev_op_remap != NULL)
- vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
- else
- metaslab_free_impl(vd, offset, size, *txgp);
- }
-
- static void
- metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
- uint64_t txg)
- {
- spa_t *spa = vd->vdev_spa;
-
- ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
-
- if (txg > spa_freeze_txg(spa))
- return;
-
- if (spa->spa_vdev_removal != NULL &&
- spa->spa_vdev_removal->svr_vdev == vd &&
- vdev_is_concrete(vd)) {
- /*
- * Note: we check if the vdev is concrete because when
- * we complete the removal, we first change the vdev to be
- * an indirect vdev (in open context), and then (in syncing
- * context) clear spa_vdev_removal.
- */
- free_from_removing_vdev(vd, offset, size, txg);
- } else if (vd->vdev_ops->vdev_op_remap != NULL) {
- vdev_indirect_mark_obsolete(vd, offset, size, txg);
- vd->vdev_ops->vdev_op_remap(vd, offset, size,
- metaslab_free_impl_cb, &txg);
- } else {
- metaslab_free_concrete(vd, offset, size, txg);
- }
- }
-
- typedef struct remap_blkptr_cb_arg {
- blkptr_t *rbca_bp;
- spa_remap_cb_t rbca_cb;
- vdev_t *rbca_remap_vd;
- uint64_t rbca_remap_offset;
- void *rbca_cb_arg;
- } remap_blkptr_cb_arg_t;
-
- void
- remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
- uint64_t size, void *arg)
- {
- remap_blkptr_cb_arg_t *rbca = arg;
- blkptr_t *bp = rbca->rbca_bp;
-
- /* We can not remap split blocks. */
- if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
- return;
- ASSERT0(inner_offset);
-
- if (rbca->rbca_cb != NULL) {
- /*
- * At this point we know that we are not handling split
- * blocks and we invoke the callback on the previous
- * vdev which must be indirect.
- */
- ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
-
- rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
- rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
-
- /* set up remap_blkptr_cb_arg for the next call */
- rbca->rbca_remap_vd = vd;
- rbca->rbca_remap_offset = offset;
- }
-
- /*
- * The phys birth time is that of dva[0]. This ensures that we know
- * when each dva was written, so that resilver can determine which
- * blocks need to be scrubbed (i.e. those written during the time
- * the vdev was offline). It also ensures that the key used in
- * the ARC hash table is unique (i.e. dva[0] + phys_birth). If
- * we didn't change the phys_birth, a lookup in the ARC for a
- * remapped BP could find the data that was previously stored at
- * this vdev + offset.
- */
- vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
- DVA_GET_VDEV(&bp->blk_dva[0]));
- vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
- bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
- DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
-
- DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
- DVA_SET_OFFSET(&bp->blk_dva[0], offset);
- }
-
/*
! * If the block pointer contains any indirect DVAs, modify them to refer to
! * concrete DVAs. Note that this will sometimes not be possible, leaving
! * the indirect DVA in place. This happens if the indirect DVA spans multiple
! * segments in the mapping (i.e. it is a "split block").
! *
! * If the BP was remapped, calls the callback on the original dva (note the
! * callback can be called multiple times if the original indirect DVA refers
! * to another indirect DVA, etc).
! *
! * Returns TRUE if the BP was remapped.
*/
- boolean_t
- spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
- {
- remap_blkptr_cb_arg_t rbca;
-
- if (!zfs_remap_blkptr_enable)
- return (B_FALSE);
-
- if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
- return (B_FALSE);
-
- /*
- * Dedup BP's can not be remapped, because ddt_phys_select() depends
- * on DVA[0] being the same in the BP as in the DDT (dedup table).
- */
- if (BP_GET_DEDUP(bp))
- return (B_FALSE);
-
- /*
- * Gang blocks can not be remapped, because
- * zio_checksum_gang_verifier() depends on the DVA[0] that's in
- * the BP used to read the gang block header (GBH) being the same
- * as the DVA[0] that we allocated for the GBH.
- */
- if (BP_IS_GANG(bp))
- return (B_FALSE);
-
- /*
- * Embedded BP's have no DVA to remap.
- */
- if (BP_GET_NDVAS(bp) < 1)
- return (B_FALSE);
-
- /*
- * Note: we only remap dva[0]. If we remapped other dvas, we
- * would no longer know what their phys birth txg is.
- */
- dva_t *dva = &bp->blk_dva[0];
-
- uint64_t offset = DVA_GET_OFFSET(dva);
- uint64_t size = DVA_GET_ASIZE(dva);
- vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
-
- if (vd->vdev_ops->vdev_op_remap == NULL)
- return (B_FALSE);
-
- rbca.rbca_bp = bp;
- rbca.rbca_cb = callback;
- rbca.rbca_remap_vd = vd;
- rbca.rbca_remap_offset = offset;
- rbca.rbca_cb_arg = arg;
-
- /*
- * remap_blkptr_cb() will be called in order for each level of
- * indirection, until a concrete vdev is reached or a split block is
- * encountered. old_vd and old_offset are updated within the callback
- * as we go from the one indirect vdev to the next one (either concrete
- * or indirect again) in that order.
- */
- vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
-
- /* Check if the DVA wasn't remapped because it is a split block */
- if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
- return (B_FALSE);
-
- return (B_TRUE);
- }
-
- /*
- * Undo the allocation of a DVA which happened in the given transaction group.
- */
void
! metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
{
- metaslab_t *msp;
- vdev_t *vd;
uint64_t vdev = DVA_GET_VDEV(dva);
uint64_t offset = DVA_GET_OFFSET(dva);
uint64_t size = DVA_GET_ASIZE(dva);
ASSERT(DVA_IS_VALID(dva));
- ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
if (txg > spa_freeze_txg(spa))
return;
if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
--- 3251,3277 ----
metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
return (SET_ERROR(ENOSPC));
}
/*
! * Free the block represented by DVA in the context of the specified
! * transaction group.
*/
void
! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
{
uint64_t vdev = DVA_GET_VDEV(dva);
uint64_t offset = DVA_GET_OFFSET(dva);
uint64_t size = DVA_GET_ASIZE(dva);
+ vdev_t *vd;
+ metaslab_t *msp;
+ DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
+ uint64_t, offset, uint64_t, size);
+
ASSERT(DVA_IS_VALID(dva));
if (txg > spa_freeze_txg(spa))
return;
if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
*** 3421,3441 ****
(u_longlong_t)vdev, (u_longlong_t)offset);
ASSERT(0);
return;
}
! ASSERT(!vd->vdev_removing);
! ASSERT(vdev_is_concrete(vd));
! ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
! ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
if (DVA_GET_GANG(dva))
size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
- msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
mutex_enter(&msp->ms_lock);
range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
offset, size);
VERIFY(!msp->ms_condensing);
VERIFY3U(offset, >=, msp->ms_start);
--- 3280,3297 ----
(u_longlong_t)vdev, (u_longlong_t)offset);
ASSERT(0);
return;
}
! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
if (DVA_GET_GANG(dva))
size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
mutex_enter(&msp->ms_lock);
+
+ if (now) {
range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
offset, size);
VERIFY(!msp->ms_condensing);
VERIFY3U(offset, >=, msp->ms_start);
*** 3443,3475 ****
VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
msp->ms_size);
VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
range_tree_add(msp->ms_tree, offset, size);
mutex_exit(&msp->ms_lock);
}
/*
! * Free the block represented by DVA in the context of the specified
! * transaction group.
*/
! void
! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
{
uint64_t vdev = DVA_GET_VDEV(dva);
uint64_t offset = DVA_GET_OFFSET(dva);
uint64_t size = DVA_GET_ASIZE(dva);
! vdev_t *vd = vdev_lookup_top(spa, vdev);
ASSERT(DVA_IS_VALID(dva));
- ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
! if (DVA_GET_GANG(dva)) {
size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
}
! metaslab_free_impl(vd, offset, size, txg);
}
/*
* Reserve some allocation slots. The reservation system must be called
* before we call into the allocator. If there aren't any available slots
--- 3299,3378 ----
VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
msp->ms_size);
VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
range_tree_add(msp->ms_tree, offset, size);
+ if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+ !vd->vdev_man_trimming)
+ metaslab_trim_add(msp, offset, size);
+ msp->ms_max_size = metaslab_block_maxsize(msp);
+ } else {
+ VERIFY3U(txg, ==, spa->spa_syncing_txg);
+ if (range_tree_space(msp->ms_freeingtree) == 0)
+ vdev_dirty(vd, VDD_METASLAB, msp, txg);
+ range_tree_add(msp->ms_freeingtree, offset, size);
+ }
+
mutex_exit(&msp->ms_lock);
}
/*
! * Intent log support: upon opening the pool after a crash, notify the SPA
! * of blocks that the intent log has allocated for immediate write, but
! * which are still considered free by the SPA because the last transaction
! * group didn't commit yet.
*/
! static int
! metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
{
uint64_t vdev = DVA_GET_VDEV(dva);
uint64_t offset = DVA_GET_OFFSET(dva);
uint64_t size = DVA_GET_ASIZE(dva);
! vdev_t *vd;
! metaslab_t *msp;
! int error = 0;
ASSERT(DVA_IS_VALID(dva));
! if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
! (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
! return (SET_ERROR(ENXIO));
!
! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
!
! if (DVA_GET_GANG(dva))
size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
+
+ mutex_enter(&msp->ms_lock);
+
+ if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
+ error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
+
+ if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
+ error = SET_ERROR(ENOENT);
+
+ if (error || txg == 0) { /* txg == 0 indicates dry run */
+ mutex_exit(&msp->ms_lock);
+ return (error);
}
! VERIFY(!msp->ms_condensing);
! VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
! VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
! VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
! range_tree_remove(msp->ms_tree, offset, size);
! metaslab_trim_remove(msp, offset, size);
!
! if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */
! if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
! vdev_dirty(vd, VDD_METASLAB, msp, txg);
! range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
! }
!
! mutex_exit(&msp->ms_lock);
!
! return (0);
}
/*
* Reserve some allocation slots. The reservation system must be called
* before we call into the allocator. If there aren't any available slots
*** 3516,3642 ****
(void) refcount_remove(&mc->mc_alloc_slots, zio);
}
mutex_exit(&mc->mc_lock);
}
- static int
- metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
- uint64_t txg)
- {
- metaslab_t *msp;
- spa_t *spa = vd->vdev_spa;
- int error = 0;
-
- if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
- return (ENXIO);
-
- ASSERT3P(vd->vdev_ms, !=, NULL);
- msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
- mutex_enter(&msp->ms_lock);
-
- if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
- error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
-
- if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
- error = SET_ERROR(ENOENT);
-
- if (error || txg == 0) { /* txg == 0 indicates dry run */
- mutex_exit(&msp->ms_lock);
- return (error);
- }
-
- VERIFY(!msp->ms_condensing);
- VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
- VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
- VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
- range_tree_remove(msp->ms_tree, offset, size);
-
- if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */
- if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
- vdev_dirty(vd, VDD_METASLAB, msp, txg);
- range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
- }
-
- mutex_exit(&msp->ms_lock);
-
- return (0);
- }
-
- typedef struct metaslab_claim_cb_arg_t {
- uint64_t mcca_txg;
- int mcca_error;
- } metaslab_claim_cb_arg_t;
-
- /* ARGSUSED */
- static void
- metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
- uint64_t size, void *arg)
- {
- metaslab_claim_cb_arg_t *mcca_arg = arg;
-
- if (mcca_arg->mcca_error == 0) {
- mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
- size, mcca_arg->mcca_txg);
- }
- }
-
int
- metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
- {
- if (vd->vdev_ops->vdev_op_remap != NULL) {
- metaslab_claim_cb_arg_t arg;
-
- /*
- * Only zdb(1M) can claim on indirect vdevs. This is used
- * to detect leaks of mapped space (that are not accounted
- * for in the obsolete counts, spacemap, or bpobj).
- */
- ASSERT(!spa_writeable(vd->vdev_spa));
- arg.mcca_error = 0;
- arg.mcca_txg = txg;
-
- vd->vdev_ops->vdev_op_remap(vd, offset, size,
- metaslab_claim_impl_cb, &arg);
-
- if (arg.mcca_error == 0) {
- arg.mcca_error = metaslab_claim_concrete(vd,
- offset, size, txg);
- }
- return (arg.mcca_error);
- } else {
- return (metaslab_claim_concrete(vd, offset, size, txg));
- }
- }
-
- /*
- * Intent log support: upon opening the pool after a crash, notify the SPA
- * of blocks that the intent log has allocated for immediate write, but
- * which are still considered free by the SPA because the last transaction
- * group didn't commit yet.
- */
- static int
- metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
- {
- uint64_t vdev = DVA_GET_VDEV(dva);
- uint64_t offset = DVA_GET_OFFSET(dva);
- uint64_t size = DVA_GET_ASIZE(dva);
- vdev_t *vd;
-
- if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
- return (SET_ERROR(ENXIO));
- }
-
- ASSERT(DVA_IS_VALID(dva));
-
- if (DVA_GET_GANG(dva))
- size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
-
- return (metaslab_claim_impl(vd, offset, size, txg));
- }
-
- int
metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
zio_alloc_list_t *zal, zio_t *zio)
{
dva_t *dva = bp->blk_dva;
--- 3419,3429 ----
*** 3656,3671 ****
ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
ASSERT(BP_GET_NDVAS(bp) == 0);
ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
ASSERT3P(zal, !=, NULL);
for (int d = 0; d < ndvas; d++) {
! error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
! txg, flags, zal);
if (error != 0) {
for (d--; d >= 0; d--) {
! metaslab_unalloc_dva(spa, &dva[d], txg);
metaslab_group_alloc_decrement(spa,
DVA_GET_VDEV(&dva[d]), zio, flags);
bzero(&dva[d], sizeof (dva_t));
}
spa_config_exit(spa, SCL_ALLOC, FTAG);
--- 3443,3502 ----
ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
ASSERT(BP_GET_NDVAS(bp) == 0);
ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
ASSERT3P(zal, !=, NULL);
+ if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
+ !(flags & (METASLAB_GANG_HEADER)) &&
+ !(spa->spa_meta_policy.spa_small_data_to_special &&
+ psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
+ error = metaslab_alloc_dva(spa, spa_normal_class(spa),
+ psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
+ flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
+ if (error == 0) {
+ error = metaslab_alloc_dva(spa, mc, psize,
+ &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
+ if (error != 0) {
+ error = 0;
+ /*
+ * Move the NORMAL copy into the SPECIAL slot and
+ * clear the second DVA. After that this BP is
+ * just a regular BP with one DVA.
+ *
+ * This operation is valid only if:
+ * WBC_SPECIAL_DVA is dva[0]
+ * WBC_NORMAL_DVA is dva[1]
+ *
+ * see wbc.h
+ */
+ bcopy(&dva[WBC_NORMAL_DVA],
+ &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
+ bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
+
+ /*
+ * Allocation of the special DVA failed, so this
+ * BP becomes a regular BP and we need to update
+ * the metaslab group's queue depth based on the
+ * newly allocated DVA.
+ */
+ metaslab_group_alloc_increment(spa,
+ DVA_GET_VDEV(&dva[0]), zio, flags);
+ } else {
+ BP_SET_SPECIAL(bp, 1);
+ }
+ } else {
+ spa_config_exit(spa, SCL_ALLOC, FTAG);
+ return (error);
+ }
+ } else {
for (int d = 0; d < ndvas; d++) {
! error = metaslab_alloc_dva(spa, mc, psize, dva, d,
! hintdva, txg, flags, zal);
if (error != 0) {
for (d--; d >= 0; d--) {
! metaslab_free_dva(spa, &dva[d],
! txg, B_TRUE);
metaslab_group_alloc_decrement(spa,
DVA_GET_VDEV(&dva[d]), zio, flags);
bzero(&dva[d], sizeof (dva_t));
}
spa_config_exit(spa, SCL_ALLOC, FTAG);
*** 3676,3689 ****
* based on the newly allocated dva.
*/
metaslab_group_alloc_increment(spa,
DVA_GET_VDEV(&dva[d]), zio, flags);
}
-
}
- ASSERT(error == 0);
ASSERT(BP_GET_NDVAS(bp) == ndvas);
spa_config_exit(spa, SCL_ALLOC, FTAG);
BP_SET_BIRTH(bp, txg, txg);
--- 3507,3520 ----
* based on the newly allocated dva.
*/
metaslab_group_alloc_increment(spa,
DVA_GET_VDEV(&dva[d]), zio, flags);
}
}
ASSERT(BP_GET_NDVAS(bp) == ndvas);
+ }
+ ASSERT(error == 0);
spa_config_exit(spa, SCL_ALLOC, FTAG);
BP_SET_BIRTH(bp, txg, txg);
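
For readability, the DVA layout produced by the special-class branch above can be summarized as follows (slot numbers are the ones assumed by the comment above and wbc.h):

	/*
	 * Special (write-back cached) BP, both allocations succeeded:
	 *
	 *   dva[WBC_SPECIAL_DVA] (dva[0]) - copy on the special class
	 *   dva[WBC_NORMAL_DVA]  (dva[1]) - copy on the normal class
	 *   BP_IS_SPECIAL(bp) != 0
	 *
	 * If the special-class allocation fails, dva[1] is copied into
	 * dva[0] and then zeroed, leaving an ordinary single-copy BP
	 * with BP_IS_SPECIAL(bp) == 0.
	 */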
*** 3699,3715 ****
ASSERT(!BP_IS_HOLE(bp));
ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
! for (int d = 0; d < ndvas; d++) {
! if (now) {
! metaslab_unalloc_dva(spa, &dva[d], txg);
} else {
! metaslab_free_dva(spa, &dva[d], txg);
}
- }
spa_config_exit(spa, SCL_FREE, FTAG);
}
int
--- 3530,3561 ----
ASSERT(!BP_IS_HOLE(bp));
ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
! if (BP_IS_SPECIAL(bp)) {
! int start_dva;
! wbc_data_t *wbc_data = spa_get_wbc_data(spa);
!
! mutex_enter(&wbc_data->wbc_lock);
! start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
! mutex_exit(&wbc_data->wbc_lock);
!
! /*
! * The actual freeing need not be done under the lock:
! * the block is already excluded from the WBC trees
! * and thus will not be moved.
! */
! metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
! if (start_dva == 0) {
! metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
! txg, now);
! }
} else {
! for (int d = 0; d < ndvas; d++)
! metaslab_free_dva(spa, &dva[d], txg, now);
}
spa_config_exit(spa, SCL_FREE, FTAG);
}
int
*** 3730,3810 ****
return (error);
}
spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
for (int d = 0; d < ndvas; d++)
! if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
break;
spa_config_exit(spa, SCL_ALLOC, FTAG);
ASSERT(error == 0 || txg == 0);
return (error);
}
! /* ARGSUSED */
! static void
! metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
! uint64_t size, void *arg)
{
- if (vd->vdev_ops == &vdev_indirect_ops)
- return;
-
- metaslab_check_free_impl(vd, offset, size);
- }
-
- static void
- metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
- {
- metaslab_t *msp;
- spa_t *spa = vd->vdev_spa;
-
if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
return;
! if (vd->vdev_ops->vdev_op_remap != NULL) {
! vd->vdev_ops->vdev_op_remap(vd, offset, size,
! metaslab_check_free_impl_cb, NULL);
return;
}
! ASSERT(vdev_is_concrete(vd));
! ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
! ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
!
! mutex_enter(&msp->ms_lock);
! if (msp->ms_loaded)
range_tree_verify(msp->ms_tree, offset, size);
range_tree_verify(msp->ms_freeingtree, offset, size);
range_tree_verify(msp->ms_freedtree, offset, size);
for (int j = 0; j < TXG_DEFER_SIZE; j++)
range_tree_verify(msp->ms_defertree[j], offset, size);
mutex_exit(&msp->ms_lock);
}
void
! metaslab_check_free(spa_t *spa, const blkptr_t *bp)
{
! if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
! return;
! spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
! for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
! uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
! vdev_t *vd = vdev_lookup_top(spa, vdev);
! uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
! uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
! if (DVA_GET_GANG(&bp->blk_dva[i]))
! size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
! ASSERT3P(vd, !=, NULL);
! metaslab_check_free_impl(vd, offset, size);
}
! spa_config_exit(spa, SCL_VDEV, FTAG);
}
--- 3576,3921 ----
return (error);
}
spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
+ if (BP_IS_SPECIAL(bp)) {
+ int start_dva;
+ wbc_data_t *wbc_data = spa_get_wbc_data(spa);
+
+ mutex_enter(&wbc_data->wbc_lock);
+ start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
+
+ /*
+ * Claiming of WBC blocks must be done under the lock to
+ * ensure that zdb will not fail. The only other user of
+ * claiming is the ZIL, whose blocks cannot be WBC ones,
+ * so the lock is never held for them.
+ */
+ error = metaslab_claim_dva(spa,
+ &dva[WBC_NORMAL_DVA], txg);
+ if (error == 0 && start_dva == 0) {
+ error = metaslab_claim_dva(spa,
+ &dva[WBC_SPECIAL_DVA], txg);
+ }
+
+ mutex_exit(&wbc_data->wbc_lock);
+ } else {
for (int d = 0; d < ndvas; d++)
! if ((error = metaslab_claim_dva(spa,
! &dva[d], txg)) != 0)
break;
+ }
spa_config_exit(spa, SCL_ALLOC, FTAG);
ASSERT(error == 0 || txg == 0);
return (error);
}
! void
! metaslab_check_free(spa_t *spa, const blkptr_t *bp)
{
if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
return;
! if (BP_IS_SPECIAL(bp)) {
! /* Do not check frees for WBC blocks */
return;
}
! spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
! for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
! uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
! vdev_t *vd = vdev_lookup_top(spa, vdev);
! uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
! uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
! metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
! if (msp->ms_loaded) {
range_tree_verify(msp->ms_tree, offset, size);
+ range_tree_verify(msp->ms_cur_ts->ts_tree,
+ offset, size);
+ if (msp->ms_prev_ts != NULL) {
+ range_tree_verify(msp->ms_prev_ts->ts_tree,
+ offset, size);
+ }
+ }
range_tree_verify(msp->ms_freeingtree, offset, size);
range_tree_verify(msp->ms_freedtree, offset, size);
for (int j = 0; j < TXG_DEFER_SIZE; j++)
range_tree_verify(msp->ms_defertree[j], offset, size);
+ }
+ spa_config_exit(spa, SCL_VDEV, FTAG);
+ }
+
+ /*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (which the
+ * caller should zio_wait() for) and, in the `delta' return argument, the
+ * amount of space in the metaslab that has been scheduled for trimming.
+ * Returns NULL if the metaslab could not be loaded.
+ */
+ zio_t *
+ metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
+ {
+ boolean_t was_loaded;
+ uint64_t trimmed_space;
+ zio_t *trim_io;
+
+ ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
+
+ mutex_enter(&msp->ms_lock);
+
+ while (msp->ms_loading)
+ metaslab_load_wait(msp);
+ /* If we loaded the metaslab, unload it when we're done. */
+ was_loaded = msp->ms_loaded;
+ if (!was_loaded) {
+ if (metaslab_load(msp) != 0) {
mutex_exit(&msp->ms_lock);
+ return (NULL);
+ }
+ }
+ /* Flush out any scheduled extents and add everything in ms_tree. */
+ range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
+ range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
+
+ /* Force this trim to take place ASAP. */
+ if (msp->ms_prev_ts != NULL)
+ metaslab_free_trimset(msp->ms_prev_ts);
+ msp->ms_prev_ts = msp->ms_cur_ts;
+ msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
+ trimmed_space = range_tree_space(msp->ms_tree);
+ if (!was_loaded)
+ metaslab_unload(msp);
+
+ trim_io = metaslab_exec_trim(msp);
+ mutex_exit(&msp->ms_lock);
+ *delta = trimmed_space;
+
+ return (trim_io);
}
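
A minimal usage sketch for a manual-trim caller (hypothetical; the real consumers live in the vdev TRIM code), relying on the NULL return when the metaslab cannot be loaded:

	uint64_t delta = 0;
	zio_t *zio = metaslab_trim_all(msp, &delta);

	if (zio != NULL) {
		(void) zio_wait(zio);	/* wait for the whole TRIM tree */
		/* `delta' bytes of free space were queued for trimming */
	}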
+ /*
+ * Notifies the trimsets in a metaslab that an extent has been allocated.
+ * This removes the segment from the queues of extents awaiting trim.
+ */
+ static void
+ metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
+ {
+ metaslab_t *msp = arg;
+
+ range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
+ if (msp->ms_prev_ts != NULL) {
+ range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
+ size);
+ }
+ }
+
+ /*
+ * Notifies the trimsets in a metaslab that an extent has been freed.
+ * This adds the segment to the currently open queue of extents
+ * awaiting trim.
+ */
+ static void
+ metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
+ {
+ metaslab_t *msp = arg;
+ ASSERT(msp->ms_cur_ts != NULL);
+ range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
+ }
+
+ /*
+ * Performs a metaslab's automatic TRIM processing. This must be called
+ * from metaslab_sync with the currently syncing txg number. This function
+ * issues TRIMs in intervals dictated by the zfs_txgs_per_trim tunable.
+ */
void
! metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
{
! /* read the tunable once so its value stays consistent within this call */
! uint64_t txgs_per_trim = zfs_txgs_per_trim;
! ASSERT(!MUTEX_HELD(&msp->ms_lock));
! mutex_enter(&msp->ms_lock);
! /*
! * Since we typically have hundreds of metaslabs per vdev, but we only
! * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
! * could sequence the TRIM commands from all metaslabs so that they
! * don't all always pound the device in the same txg. We do so by
! * artificially inflating the birth txg of the first trim set by a
! * sequence number derived from the metaslab's starting offset
! * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
! * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
! *
! * If we detect that the txg has advanced too far ahead of ts_birth,
! * it means our birth txg is out of lockstep. Recompute it by
! * rounding down to the nearest zfs_txgs_per_trim multiple and adding
! * our metaslab id modulo zfs_txgs_per_trim.
! */
! if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
! msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
! txgs_per_trim + (msp->ms_id % txgs_per_trim);
! }
! /* Time to swap out the current and previous trimsets */
! if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
! if (msp->ms_prev_ts != NULL) {
! if (msp->ms_trimming_ts != NULL) {
! spa_t *spa = msp->ms_group->mg_class->mc_spa;
! /*
! * The previous trim run is still ongoing, so
! * the device is reacting slowly to our trim
! * requests. Drop this trimset so as not to
! * back the device up any further.
! */
! spa_trimstats_auto_slow_incr(spa);
! metaslab_free_trimset(msp->ms_prev_ts);
! } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
! /*
! * If a manual trim is ongoing, we want to
! * inhibit autotrim temporarily so it doesn't
! * slow down the manual trim.
! */
! metaslab_free_trimset(msp->ms_prev_ts);
! } else {
! /*
! * Trim out aged extents on the vdevs - these
! * are safe to be destroyed now. We'll keep
! * the trimset around to deny allocations from
! * these regions while the trims are ongoing.
! */
! zio_nowait(metaslab_exec_trim(msp));
! }
! }
! msp->ms_prev_ts = msp->ms_cur_ts;
! msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
! }
! mutex_exit(&msp->ms_lock);
! }
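
To make the staggering arithmetic above concrete (illustrative numbers only, assuming the default zfs_txgs_per_trim of 32):

	/*
	 * txg = 1000, ms_id = 7:
	 *   resynced ts_birth = (1000 / 32) * 32 + (7 % 32) = 992 + 7 = 999
	 *   next trimset swap at txg 999 + 32 = 1031
	 * The metaslab with ms_id = 8 swaps at txg 1032, ms_id = 9 at 1033,
	 * and so on, so roughly 1/32 of the metaslabs issue TRIMs per txg.
	 */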
! static void
! metaslab_trim_done(zio_t *zio)
! {
! metaslab_t *msp = zio->io_private;
! boolean_t held;
!
! ASSERT(msp != NULL);
! ASSERT(msp->ms_trimming_ts != NULL);
! held = MUTEX_HELD(&msp->ms_lock);
! if (!held)
! mutex_enter(&msp->ms_lock);
! metaslab_free_trimset(msp->ms_trimming_ts);
! msp->ms_trimming_ts = NULL;
! cv_signal(&msp->ms_trim_cv);
! if (!held)
! mutex_exit(&msp->ms_lock);
! }
!
! /*
! * Executes a zio_trim on a range tree holding freed extents in the metaslab.
! */
! static zio_t *
! metaslab_exec_trim(metaslab_t *msp)
! {
! metaslab_group_t *mg = msp->ms_group;
! spa_t *spa = mg->mg_class->mc_spa;
! vdev_t *vd = mg->mg_vd;
! range_tree_t *trim_tree;
! zio_t *zio;
!
! ASSERT(MUTEX_HELD(&msp->ms_lock));
!
! /* wait for a preceding trim to finish */
! while (msp->ms_trimming_ts != NULL)
! cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
! msp->ms_trimming_ts = msp->ms_prev_ts;
! msp->ms_prev_ts = NULL;
! trim_tree = msp->ms_trimming_ts->ts_tree;
! #ifdef DEBUG
! if (msp->ms_loaded) {
! for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
! rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
! if (!range_tree_contains(msp->ms_tree,
! rs->rs_start, rs->rs_end - rs->rs_start)) {
! panic("trimming allocated region; mss=%p",
! (void*)rs);
}
! }
! }
! #endif
!
! /* Nothing to trim */
! if (range_tree_space(trim_tree) == 0) {
! metaslab_free_trimset(msp->ms_trimming_ts);
! msp->ms_trimming_ts = NULL;
! return (zio_root(spa, NULL, NULL, 0));
! }
! zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
! ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
! ZIO_FLAG_CONFIG_WRITER, msp);
!
! return (zio);
! }
!
! /*
! * Allocates and initializes a new trimset structure. The `txg' argument
! * indicates when this trimset was born and `lock' indicates the lock to
! * link to the range tree.
! */
! static metaslab_trimset_t *
! metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
! {
! metaslab_trimset_t *ts;
!
! ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
! ts->ts_birth = txg;
! ts->ts_tree = range_tree_create(NULL, NULL, lock);
!
! return (ts);
! }
!
! /*
! * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
! */
! static void
! metaslab_free_trimset(metaslab_trimset_t *ts)
! {
! range_tree_vacate(ts->ts_tree, NULL, NULL);
! range_tree_destroy(ts->ts_tree);
! kmem_free(ts, sizeof (*ts));
! }
!
! /*
! * Checks whether an allocation conflicts with an ongoing trim operation in
! * the given metaslab. This function takes a segment starting at `*offset'
! * of `size' and checks whether it hits any region in the metaslab currently
! * being trimmed. If yes, it tries to adjust the allocation to the end of
! * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
! * `limit' (no part of the allocation is allowed to go past this point).
! *
! * Returns B_FALSE if either the original allocation wasn't in conflict, or
! * the conflict could be resolved by adjusting the value stored in `offset'
! * such that the whole allocation still fits below `limit'. Returns B_TRUE
! * if the allocation conflict couldn't be resolved.
! */
! static boolean_t
! metaslab_check_trim_conflict(metaslab_t *msp, uint64_t *offset,
! uint64_t size, uint64_t align, uint64_t limit)
! {
! uint64_t new_offset;
!
! if (msp->ms_trimming_ts == NULL)
! /* no trim conflict, original offset is OK */
! return (B_FALSE);
!
! new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
! *offset, size), align);
! if (new_offset != *offset && new_offset + size > limit)
! /* trim conflict and adjustment not possible */
! return (B_TRUE);
!
! /* trim conflict, but adjusted offset still within limit */
! *offset = new_offset;
! return (B_FALSE);
}
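
As a worked example of the adjustment above (hypothetical numbers, and assuming range_tree_find_gap() returns the start of the first gap of at least `size' bytes at or beyond the requested offset):

	/*
	 *   region being trimmed : [0x10000, 0x30000)
	 *   *offset = 0x18000, size = 0x2000, align = 0x1000, limit = 0x40000
	 *
	 *   new_offset = P2ROUNDUP(0x30000, 0x1000) = 0x30000
	 *   0x30000 + 0x2000 <= 0x40000, so *offset becomes 0x30000 and
	 *   B_FALSE is returned (conflict resolved by sliding forward);
	 *   with limit = 0x31000 the function would return B_TRUE instead.
	 */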