NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrect
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
 * refactored the WRC move logic to use zio kmem caches
 * replaced the size and compression fields with a single blk_prop field
   (as in blkptr_t) to slightly reduce the size of wrc_block_t,
   and use blkptr_t-style macros to get PSIZE, LSIZE
   and COMPRESSION
 * reduced atomic calls to ease CPU load
 * removed unused code
 * fixed variable naming
 * fixed a possible system panic after restarting a system
   with WRC enabled
 * fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexetna.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittens@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures

*** 21,30 **** --- 21,31 ---- /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011, 2015 by Delphix. All rights reserved. * Copyright (c) 2013 by Saso Kiselkov. All rights reserved. * Copyright (c) 2014 Integros [integros.com] + * Copyright 2017 Nexenta Systems, Inc. All rights reserved. */ #include <sys/zfs_context.h> #include <sys/dmu.h> #include <sys/dmu_tx.h>
*** 32,42 **** #include <sys/metaslab_impl.h> #include <sys/vdev_impl.h> #include <sys/zio.h> #include <sys/spa_impl.h> #include <sys/zfeature.h> ! #include <sys/vdev_indirect_mapping.h> #define GANG_ALLOCATION(flags) \ ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER)) uint64_t metaslab_aliquot = 512ULL << 10; --- 33,43 ---- #include <sys/metaslab_impl.h> #include <sys/vdev_impl.h> #include <sys/zio.h> #include <sys/spa_impl.h> #include <sys/zfeature.h> ! #include <sys/wbc.h> #define GANG_ALLOCATION(flags) \ ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER)) uint64_t metaslab_aliquot = 512ULL << 10;
*** 165,179 **** * Enable/disable metaslab group biasing. */ boolean_t metaslab_bias_enabled = B_TRUE; /* - * Enable/disable remapping of indirect DVAs to their concrete vdevs. - */ - boolean_t zfs_remap_blkptr_enable = B_TRUE; - - /* * Enable/disable segment-based metaslab selection. */ boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE; /* --- 166,175 ----
*** 199,214 **** */ uint64_t metaslab_trace_max_entries = 5000; static uint64_t metaslab_weight(metaslab_t *); static void metaslab_set_fragmentation(metaslab_t *); - static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t); - static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t); kmem_cache_t *metaslab_alloc_trace_cache; /* * ========================================================================== * Metaslab classes * ========================================================================== */ metaslab_class_t * --- 195,244 ---- */ uint64_t metaslab_trace_max_entries = 5000; static uint64_t metaslab_weight(metaslab_t *); static void metaslab_set_fragmentation(metaslab_t *); kmem_cache_t *metaslab_alloc_trace_cache; /* + * Toggle between space-based DVA allocator 0, latency-based 1 or hybrid 2. + * A value other than 0, 1 or 2 will be considered 0 (default). + */ + int metaslab_alloc_dva_algorithm = 0; + + /* + * How many TXG's worth of updates should be aggregated per TRIM/UNMAP + * issued to the underlying vdev. We keep two range trees of extents + * (called "trim sets") to be trimmed per metaslab, the `current' and + * the `previous' TS. New free's are added to the current TS. Then, + * once `zfs_txgs_per_trim' transactions have elapsed, the `current' + * TS becomes the `previous' TS and a new, blank TS is created to be + * the new `current', which will then start accumulating any new frees. + * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's + * extents are trimmed, the TS is destroyed and the current TS again + * becomes the previous TS. + * This serves to fulfill two functions: aggregate many small frees + * into fewer larger trim operations (which should help with devices + * which do not take so kindly to them) and to allow for disaster + * recovery (extents won't get trimmed immediately, but instead only + * after passing this rather long timeout, thus not preserving + * 'zfs import -F' functionality). + */ + unsigned int zfs_txgs_per_trim = 32; + + static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size); + static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size); + + static zio_t *metaslab_exec_trim(metaslab_t *msp); + + static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock); + static void metaslab_free_trimset(metaslab_trimset_t *ts); + static boolean_t metaslab_check_trim_conflict(metaslab_t *msp, + uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit); + + /* * ========================================================================== * Metaslab classes * ========================================================================== */ metaslab_class_t *
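
The block comment above describes the two trim sets kept per metaslab: frees land in the `current' set, and only after they have aged through the `previous' set (zfs_txgs_per_trim TXGs per stage) are they issued to the device. Below is a minimal standalone sketch of that rotation; trimset_t, trimset_advance() and the per-txg counters are simplified stand-ins invented for this example, not the Nexenta data structures.

/* Minimal model of the current/previous trim-set rotation described above. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct trimset {
	uint64_t ts_birth;	/* txg in which this trim set was created */
	uint64_t ts_nsegs;	/* stand-in for the range tree of extents */
} trimset_t;

static unsigned int txgs_per_trim = 32;	/* mirrors zfs_txgs_per_trim */

static trimset_t *
trimset_new(uint64_t txg)
{
	trimset_t *ts = calloc(1, sizeof (*ts));

	ts->ts_birth = txg;
	return (ts);
}

/*
 * Called once per synced txg: frees keep accumulating in `cur' for
 * zfs_txgs_per_trim txgs; then `cur' becomes `prev' and the old `prev',
 * whose extents have aged at least another zfs_txgs_per_trim txgs, is
 * issued as one aggregated TRIM and destroyed.
 */
static void
trimset_advance(trimset_t **cur, trimset_t **prev, uint64_t txg)
{
	if (txg - (*cur)->ts_birth < txgs_per_trim)
		return;
	if (*prev != NULL) {
		printf("txg %llu: TRIM %llu extents queued since txg %llu\n",
		    (unsigned long long)txg,
		    (unsigned long long)(*prev)->ts_nsegs,
		    (unsigned long long)(*prev)->ts_birth);
		free(*prev);
	}
	*prev = *cur;
	*cur = trimset_new(txg);
}

int
main(void)
{
	trimset_t *cur = trimset_new(0), *prev = NULL;

	for (uint64_t txg = 1; txg <= 128; txg++) {
		cur->ts_nsegs += 3;	/* pretend three frees this txg */
		trimset_advance(&cur, &prev, txg);
	}
	return (0);
}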
*** 216,225 **** --- 246,259 ---- { metaslab_class_t *mc; mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP); + mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL); + avl_create(&mc->mc_alloc_tree, zio_bookmark_compare, + sizeof (zio_t), offsetof(zio_t, io_alloc_node)); + mc->mc_spa = spa; mc->mc_rotor = NULL; mc->mc_ops = ops; mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL); refcount_create_tracked(&mc->mc_alloc_slots);
*** 234,243 **** --- 268,280 ---- ASSERT(mc->mc_alloc == 0); ASSERT(mc->mc_deferred == 0); ASSERT(mc->mc_space == 0); ASSERT(mc->mc_dspace == 0); + avl_destroy(&mc->mc_alloc_tree); + mutex_destroy(&mc->mc_alloc_lock); + refcount_destroy(&mc->mc_alloc_slots); mutex_destroy(&mc->mc_lock); kmem_free(mc, sizeof (metaslab_class_t)); }
*** 320,330 **** /* * Skip any holes, uninitialized top-levels, or * vdevs that are not in this metalab class. */ ! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) --- 357,367 ---- /* * Skip any holes, uninitialized top-levels, or * vdevs that are not in this metalab class. */ ! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
*** 355,368 **** for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; metaslab_group_t *mg = tvd->vdev_mg; /* ! * Skip any holes, uninitialized top-levels, ! * or vdevs that are not in this metalab class. */ ! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } /* --- 392,405 ---- for (int c = 0; c < rvd->vdev_children; c++) { vdev_t *tvd = rvd->vdev_child[c]; metaslab_group_t *mg = tvd->vdev_mg; /* ! * Skip any holes, uninitialized top-levels, or ! * vdevs that are not in this metalab class. */ ! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } /*
*** 404,414 **** for (int c = 0; c < rvd->vdev_children; c++) { uint64_t tspace; vdev_t *tvd = rvd->vdev_child[c]; metaslab_group_t *mg = tvd->vdev_mg; ! if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } /* --- 441,451 ---- for (int c = 0; c < rvd->vdev_children; c++) { uint64_t tspace; vdev_t *tvd = rvd->vdev_child[c]; metaslab_group_t *mg = tvd->vdev_mg; ! if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 || mg->mg_class != mc) { continue; } /*
*** 516,527 **** vdev_stat_t *vs = &vd->vdev_stat; boolean_t was_allocatable; boolean_t was_initialized; ASSERT(vd == vd->vdev_top); - ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==, - SCL_ALLOC); mutex_enter(&mg->mg_lock); was_allocatable = mg->mg_allocatable; was_initialized = mg->mg_initialized; --- 553,562 ----
*** 615,624 **** --- 650,660 ---- * either because we never activated in the first place or * because we're done, and possibly removing the vdev. */ ASSERT(mg->mg_activation_count <= 0); + if (mg->mg_taskq) taskq_destroy(mg->mg_taskq); avl_destroy(&mg->mg_metaslab_tree); mutex_destroy(&mg->mg_lock); refcount_destroy(&mg->mg_alloc_queue_depth); kmem_free(mg, sizeof (metaslab_group_t));
*** 628,638 **** metaslab_group_activate(metaslab_group_t *mg) { metaslab_class_t *mc = mg->mg_class; metaslab_group_t *mgprev, *mgnext; ! ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0); ASSERT(mc->mc_rotor != mg); ASSERT(mg->mg_prev == NULL); ASSERT(mg->mg_next == NULL); ASSERT(mg->mg_activation_count <= 0); --- 664,674 ---- metaslab_group_activate(metaslab_group_t *mg) { metaslab_class_t *mc = mg->mg_class; metaslab_group_t *mgprev, *mgnext; ! ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER)); ASSERT(mc->mc_rotor != mg); ASSERT(mg->mg_prev == NULL); ASSERT(mg->mg_next == NULL); ASSERT(mg->mg_activation_count <= 0);
*** 654,705 **** mgnext->mg_prev = mg; } mc->mc_rotor = mg; } - /* - * Passivate a metaslab group and remove it from the allocation rotor. - * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating - * a metaslab group. This function will momentarily drop spa_config_locks - * that are lower than the SCL_ALLOC lock (see comment below). - */ void metaslab_group_passivate(metaslab_group_t *mg) { metaslab_class_t *mc = mg->mg_class; - spa_t *spa = mc->mc_spa; metaslab_group_t *mgprev, *mgnext; - int locks = spa_config_held(spa, SCL_ALL, RW_WRITER); ! ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==, ! (SCL_ALLOC | SCL_ZIO)); if (--mg->mg_activation_count != 0) { ASSERT(mc->mc_rotor != mg); ASSERT(mg->mg_prev == NULL); ASSERT(mg->mg_next == NULL); ASSERT(mg->mg_activation_count < 0); return; } - /* - * The spa_config_lock is an array of rwlocks, ordered as - * follows (from highest to lowest): - * SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC > - * SCL_ZIO > SCL_FREE > SCL_VDEV - * (For more information about the spa_config_lock see spa_misc.c) - * The higher the lock, the broader its coverage. When we passivate - * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO - * config locks. However, the metaslab group's taskq might be trying - * to preload metaslabs so we must drop the SCL_ZIO lock and any - * lower locks to allow the I/O to complete. At a minimum, - * we continue to hold the SCL_ALLOC lock, which prevents any future - * allocations from taking place and any changes to the vdev tree. - */ - spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa); taskq_wait(mg->mg_taskq); - spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER); metaslab_group_alloc_update(mg); mgprev = mg->mg_prev; mgnext = mg->mg_next; --- 690,716 ---- mgnext->mg_prev = mg; } mc->mc_rotor = mg; } void metaslab_group_passivate(metaslab_group_t *mg) { metaslab_class_t *mc = mg->mg_class; metaslab_group_t *mgprev, *mgnext; ! ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER)); if (--mg->mg_activation_count != 0) { ASSERT(mc->mc_rotor != mg); ASSERT(mg->mg_prev == NULL); ASSERT(mg->mg_next == NULL); ASSERT(mg->mg_activation_count < 0); return; } taskq_wait(mg->mg_taskq); metaslab_group_alloc_update(mg); mgprev = mg->mg_prev; mgnext = mg->mg_next;
*** 1139,1161 **** * This is a helper function that can be used by the allocator to find * a suitable block to allocate. This will search the specified AVL * tree looking for a block that matches the specified criteria. */ static uint64_t ! metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size, ! uint64_t align) { range_seg_t *rs = metaslab_block_find(t, *cursor, size); ! while (rs != NULL) { uint64_t offset = P2ROUNDUP(rs->rs_start, align); ! if (offset + size <= rs->rs_end) { *cursor = offset + size; return (offset); } - rs = AVL_NEXT(t, rs); } /* * If we know we've searched the whole map (*cursor == 0), give up. * Otherwise, reset the cursor to the beginning and try again. --- 1150,1173 ---- * This is a helper function that can be used by the allocator to find * a suitable block to allocate. This will search the specified AVL * tree looking for a block that matches the specified criteria. */ static uint64_t ! metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor, ! uint64_t size, uint64_t align) { range_seg_t *rs = metaslab_block_find(t, *cursor, size); ! for (; rs != NULL; rs = AVL_NEXT(t, rs)) { uint64_t offset = P2ROUNDUP(rs->rs_start, align); ! if (offset + size <= rs->rs_end && ! !metaslab_check_trim_conflict(msp, &offset, size, align, ! rs->rs_end)) { *cursor = offset + size; return (offset); } } /* * If we know we've searched the whole map (*cursor == 0), give up. * Otherwise, reset the cursor to the beginning and try again.
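
metaslab_block_picker() now rejects a candidate segment when metaslab_check_trim_conflict() says it overlaps an extent that is being trimmed. The check itself lives elsewhere in this patch; the sketch below is only a guess at its contract (advance *offset past the busy region if the allocation still fits under `limit', otherwise report a conflict), with a hypothetical trimming_extent_t standing in for the metaslab's in-flight trim set.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the extent a metaslab is currently trimming. */
typedef struct trimming_extent {
	uint64_t te_start;
	uint64_t te_end;	/* exclusive */
	bool te_active;
} trimming_extent_t;

/*
 * Return true if [*offset, *offset + size) cannot be used. When the
 * conflicting region ends early enough, advance *offset past it
 * (keeping `align', assumed to be a power of two) so the caller can
 * retry inside the same free segment, which ends at `limit'.
 */
static bool
check_trim_conflict(const trimming_extent_t *te, uint64_t *offset,
    uint64_t size, uint64_t align, uint64_t limit)
{
	if (!te->te_active ||
	    *offset + size <= te->te_start || *offset >= te->te_end)
		return (false);		/* no overlap with the TRIM */

	uint64_t moved = (te->te_end + align - 1) & ~(align - 1);

	if (moved + size <= limit) {
		*offset = moved;	/* retry just past the TRIM */
		return (false);
	}
	return (true);			/* segment unusable for now */
}

int
main(void)
{
	trimming_extent_t te = {
		.te_start = 0x2000, .te_end = 0x3000, .te_active = true
	};
	uint64_t off = 0x1800;

	if (!check_trim_conflict(&te, &off, 0x1000, 0x800, 0x6000))
		printf("allocate at 0x%llx\n", (unsigned long long)off);
	return (0);
}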
*** 1162,1172 **** */ if (*cursor == 0) return (-1ULL); *cursor = 0; ! return (metaslab_block_picker(t, cursor, size, align)); } /* * ========================================================================== * The first-fit block allocator --- 1174,1184 ---- */ if (*cursor == 0) return (-1ULL); *cursor = 0; ! return (metaslab_block_picker(msp, t, cursor, size, align)); } /* * ========================================================================== * The first-fit block allocator
*** 1184,1194 **** */ uint64_t align = size & -size; uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1]; avl_tree_t *t = &msp->ms_tree->rt_root; ! return (metaslab_block_picker(t, cursor, size, align)); } static metaslab_ops_t metaslab_ff_ops = { metaslab_ff_alloc }; --- 1196,1206 ---- */ uint64_t align = size & -size; uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1]; avl_tree_t *t = &msp->ms_tree->rt_root; ! return (metaslab_block_picker(msp, t, cursor, size, align)); } static metaslab_ops_t metaslab_ff_ops = { metaslab_ff_alloc };
*** 1232,1242 **** free_pct < metaslab_df_free_pct) { t = &msp->ms_size_tree; *cursor = 0; } ! return (metaslab_block_picker(t, cursor, size, 1ULL)); } static metaslab_ops_t metaslab_df_ops = { metaslab_df_alloc }; --- 1244,1254 ---- free_pct < metaslab_df_free_pct) { t = &msp->ms_size_tree; *cursor = 0; } ! return (metaslab_block_picker(msp, t, cursor, size, 1ULL)); } static metaslab_ops_t metaslab_df_ops = { metaslab_df_alloc };
*** 1264,1281 **** ASSERT3U(*cursor_end, >=, *cursor); if ((*cursor + size) > *cursor_end) { range_seg_t *rs; ! ! rs = avl_last(&msp->ms_size_tree); ! if (rs == NULL || (rs->rs_end - rs->rs_start) < size) ! return (-1ULL); ! *cursor = rs->rs_start; *cursor_end = rs->rs_end; } offset = *cursor; *cursor += size; return (offset); --- 1276,1299 ---- ASSERT3U(*cursor_end, >=, *cursor); if ((*cursor + size) > *cursor_end) { range_seg_t *rs; ! for (rs = avl_last(&msp->ms_size_tree); ! rs != NULL && rs->rs_end - rs->rs_start >= size; ! rs = AVL_PREV(&msp->ms_size_tree, rs)) { *cursor = rs->rs_start; *cursor_end = rs->rs_end; + if (!metaslab_check_trim_conflict(msp, cursor, size, + 1, *cursor_end)) { + /* segment appears to be acceptable */ + break; } + } + if (rs == NULL || rs->rs_end - rs->rs_start < size) + return (-1ULL); + } offset = *cursor; *cursor += size; return (offset);
*** 1307,1316 **** --- 1325,1336 ---- avl_index_t where; range_seg_t *rs, rsearch; uint64_t hbit = highbit64(size); uint64_t *cursor = &msp->ms_lbas[hbit - 1]; uint64_t max_size = metaslab_block_maxsize(msp); + /* mutable copy for adjustment by metaslab_check_trim_conflict */ + uint64_t adjustable_start; ASSERT(MUTEX_HELD(&msp->ms_lock)); ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree)); if (max_size < size)
*** 1318,1344 **** rsearch.rs_start = *cursor; rsearch.rs_end = *cursor + size; rs = avl_find(t, &rsearch, &where); ! if (rs == NULL || (rs->rs_end - rs->rs_start) < size) { t = &msp->ms_size_tree; rsearch.rs_start = 0; rsearch.rs_end = MIN(max_size, 1ULL << (hbit + metaslab_ndf_clump_shift)); rs = avl_find(t, &rsearch, &where); if (rs == NULL) rs = avl_nearest(t, where, AVL_AFTER); ASSERT(rs != NULL); } - - if ((rs->rs_end - rs->rs_start) >= size) { - *cursor = rs->rs_start + size; - return (rs->rs_start); } ! return (-1ULL); } static metaslab_ops_t metaslab_ndf_ops = { metaslab_ndf_alloc }; --- 1338,1373 ---- rsearch.rs_start = *cursor; rsearch.rs_end = *cursor + size; rs = avl_find(t, &rsearch, &where); ! if (rs != NULL) ! adjustable_start = rs->rs_start; ! if (rs == NULL || rs->rs_end - adjustable_start < size || ! metaslab_check_trim_conflict(msp, &adjustable_start, size, 1, ! rs->rs_end)) { ! /* segment not usable, try the largest remaining one */ t = &msp->ms_size_tree; rsearch.rs_start = 0; rsearch.rs_end = MIN(max_size, 1ULL << (hbit + metaslab_ndf_clump_shift)); rs = avl_find(t, &rsearch, &where); if (rs == NULL) rs = avl_nearest(t, where, AVL_AFTER); ASSERT(rs != NULL); + adjustable_start = rs->rs_start; + if (rs->rs_end - adjustable_start < size || + metaslab_check_trim_conflict(msp, &adjustable_start, + size, 1, rs->rs_end)) { + /* even largest remaining segment not usable */ + return (-1ULL); } } ! ! *cursor = adjustable_start + size; ! return (*cursor); } static metaslab_ops_t metaslab_ndf_ops = { metaslab_ndf_alloc };
*** 1374,1389 **** ASSERT(MUTEX_HELD(&msp->ms_lock)); ASSERT(!msp->ms_loaded); ASSERT(!msp->ms_loading); msp->ms_loading = B_TRUE; - /* - * Nobody else can manipulate a loading metaslab, so it's now safe - * to drop the lock. This way we don't have to hold the lock while - * reading the spacemap from disk. - */ - mutex_exit(&msp->ms_lock); /* * If the space map has not been allocated yet, then treat * all the space in the metaslab as free and add it to the * ms_tree. --- 1403,1412 ----
*** 1392,1412 **** error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE); else range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size); success = (error == 0); - - mutex_enter(&msp->ms_lock); msp->ms_loading = B_FALSE; if (success) { ASSERT3P(msp->ms_group, !=, NULL); msp->ms_loaded = B_TRUE; for (int t = 0; t < TXG_DEFER_SIZE; t++) { range_tree_walk(msp->ms_defertree[t], range_tree_remove, msp->ms_tree); } msp->ms_max_size = metaslab_block_maxsize(msp); } cv_broadcast(&msp->ms_load_cv); return (error); --- 1415,1435 ---- error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE); else range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size); success = (error == 0); msp->ms_loading = B_FALSE; if (success) { ASSERT3P(msp->ms_group, !=, NULL); msp->ms_loaded = B_TRUE; for (int t = 0; t < TXG_DEFER_SIZE; t++) { range_tree_walk(msp->ms_defertree[t], range_tree_remove, msp->ms_tree); + range_tree_walk(msp->ms_defertree[t], + metaslab_trim_remove, msp); } msp->ms_max_size = metaslab_block_maxsize(msp); } cv_broadcast(&msp->ms_load_cv); return (error);
*** 1431,1442 **** metaslab_t *ms; int error; ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP); mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL); - mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL); ms->ms_id = id; ms->ms_start = id << vd->vdev_ms_shift; ms->ms_size = 1ULL << vd->vdev_ms_shift; /* --- 1454,1465 ---- metaslab_t *ms; int error; ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP); mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL); + cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL); ms->ms_id = id; ms->ms_start = id << vd->vdev_ms_shift; ms->ms_size = 1ULL << vd->vdev_ms_shift; /*
*** 1443,1470 **** * We only open space map objects that already exist. All others * will be opened when we finally allocate an object for it. */ if (object != 0) { error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start, ! ms->ms_size, vd->vdev_ashift); if (error != 0) { kmem_free(ms, sizeof (metaslab_t)); return (error); } ASSERT(ms->ms_sm != NULL); } /* * We create the main range tree here, but we don't create the * other range trees until metaslab_sync_done(). This serves * two purposes: it allows metaslab_sync_done() to detect the * addition of new space; and for debugging, it ensures that we'd * data fault on any attempt to use this metaslab before it's ready. */ ! ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms); metaslab_group_add(mg, ms); metaslab_set_fragmentation(ms); /* --- 1466,1495 ---- * We only open space map objects that already exist. All others * will be opened when we finally allocate an object for it. */ if (object != 0) { error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start, ! ms->ms_size, vd->vdev_ashift, &ms->ms_lock); if (error != 0) { kmem_free(ms, sizeof (metaslab_t)); return (error); } ASSERT(ms->ms_sm != NULL); } + ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock); + /* * We create the main range tree here, but we don't create the * other range trees until metaslab_sync_done(). This serves * two purposes: it allows metaslab_sync_done() to detect the * addition of new space; and for debugging, it ensures that we'd * data fault on any attempt to use this metaslab before it's ready. */ ! ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock); metaslab_group_add(mg, ms); metaslab_set_fragmentation(ms); /*
*** 1524,1539 **** for (int t = 0; t < TXG_DEFER_SIZE; t++) { range_tree_destroy(msp->ms_defertree[t]); } ASSERT0(msp->ms_deferspace); mutex_exit(&msp->ms_lock); cv_destroy(&msp->ms_load_cv); mutex_destroy(&msp->ms_lock); - mutex_destroy(&msp->ms_sync_lock); kmem_free(msp, sizeof (metaslab_t)); } #define FRAGMENTATION_TABLE_SIZE 17 --- 1549,1569 ---- for (int t = 0; t < TXG_DEFER_SIZE; t++) { range_tree_destroy(msp->ms_defertree[t]); } + metaslab_free_trimset(msp->ms_cur_ts); + if (msp->ms_prev_ts) + metaslab_free_trimset(msp->ms_prev_ts); + ASSERT3P(msp->ms_trimming_ts, ==, NULL); + ASSERT0(msp->ms_deferspace); mutex_exit(&msp->ms_lock); cv_destroy(&msp->ms_load_cv); + cv_destroy(&msp->ms_trim_cv); mutex_destroy(&msp->ms_lock); kmem_free(msp, sizeof (metaslab_t)); } #define FRAGMENTATION_TABLE_SIZE 17
*** 1895,1909 **** uint64_t weight; ASSERT(MUTEX_HELD(&msp->ms_lock)); /* ! * If this vdev is in the process of being removed, there is nothing * for us to do here. */ ! if (vd->vdev_removing) return (0); metaslab_set_fragmentation(msp); /* * Update the maximum size if the metaslab is loaded. This will --- 1925,1942 ---- uint64_t weight; ASSERT(MUTEX_HELD(&msp->ms_lock)); /* ! * This vdev is in the process of being removed so there is nothing * for us to do here. */ ! if (vd->vdev_removing) { ! ASSERT0(space_map_allocated(msp->ms_sm)); ! ASSERT0(vd->vdev_ms_shift); return (0); + } metaslab_set_fragmentation(msp); /* * Update the maximum size if the metaslab is loaded. This will
*** 2031,2047 **** taskq_wait(mg->mg_taskq); return; } mutex_enter(&mg->mg_lock); - /* * Load the next potential metaslabs */ for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) { - ASSERT3P(msp->ms_group, ==, mg); - /* * We preload only the maximum number of metaslabs specified * by metaslab_preload_limit. If a metaslab is being forced * to condense then we preload it too. This will ensure * that force condensing happens in the next txg. --- 2064,2077 ----
*** 2064,2074 **** * 1. The size of the space map object should not dramatically increase as a * result of writing out the free space range tree. * * 2. The minimal on-disk space map representation is zfs_condense_pct/100 * times the size than the free space range tree representation ! * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB). * * 3. The on-disk size of the space map should actually decrease. * * Checking the first condition is tricky since we don't want to walk * the entire AVL tree calculating the estimated on-disk size. Instead we --- 2094,2104 ---- * 1. The size of the space map object should not dramatically increase as a * result of writing out the free space range tree. * * 2. The minimal on-disk space map representation is zfs_condense_pct/100 * times the size than the free space range tree representation ! * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1.MB). * * 3. The on-disk size of the space map should actually decrease. * * Checking the first condition is tricky since we don't want to walk * the entire AVL tree calculating the estimated on-disk size. Instead we
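
Condition 2 above is easy to misread; in plain numbers it says: condense only once the on-disk space map has grown to at least zfs_condense_pct/100 times the size a fresh write of the in-core free range tree would take. The sketch below is just that arithmetic with illustrative byte counts, not values taken from a real pool or from metaslab_should_condense().

#include <stdint.h>
#include <stdio.h>

/* Condition 2 from the comment above, expressed as plain arithmetic. */
static int
worth_condensing(uint64_t ondisk_bytes, uint64_t incore_bytes,
    uint64_t condense_pct)
{
	return (ondisk_bytes >= incore_bytes * condense_pct / 100);
}

int
main(void)
{
	uint64_t incore = 1ULL << 20;	/* in-core representation: 1 MiB */
	uint64_t samples[] = { 1100000, 1258291 };

	/* threshold with zfs_condense_pct = 110 is ~1.1 times 1 MiB */
	for (int i = 0; i < 2; i++) {
		printf("%llu bytes on disk: %s\n",
		    (unsigned long long)samples[i],
		    worth_condensing(samples[i], incore, 110) ?
		    "condense" : "keep");
	}
	return (0);
}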
*** 2161,2171 **** * that have been freed in this txg, any deferred frees that exist, * and any allocation in the future. Removing segments should be * a relatively inexpensive operation since we expect these trees to * have a small number of nodes. */ ! condense_tree = range_tree_create(NULL, NULL); range_tree_add(condense_tree, msp->ms_start, msp->ms_size); /* * Remove what's been freed in this txg from the condense_tree. * Since we're in sync_pass 1, we know that all the frees from --- 2191,2201 ---- * that have been freed in this txg, any deferred frees that exist, * and any allocation in the future. Removing segments should be * a relatively inexpensive operation since we expect these trees to * have a small number of nodes. */ ! condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock); range_tree_add(condense_tree, msp->ms_start, msp->ms_size); /* * Remove what's been freed in this txg from the condense_tree. * Since we're in sync_pass 1, we know that all the frees from
*** 2194,2203 **** --- 2224,2234 ---- */ msp->ms_condensing = B_TRUE; mutex_exit(&msp->ms_lock); space_map_truncate(sm, tx); + mutex_enter(&msp->ms_lock); /* * While we would ideally like to create a space map representation * that consists only of allocation records, doing so can be * prohibitively expensive because the in-core free tree can be
*** 2210,2220 **** space_map_write(sm, condense_tree, SM_ALLOC, tx); range_tree_vacate(condense_tree, NULL, NULL); range_tree_destroy(condense_tree); space_map_write(sm, msp->ms_tree, SM_FREE, tx); - mutex_enter(&msp->ms_lock); msp->ms_condensing = B_FALSE; } /* * Write a metaslab to disk in the context of the specified transaction group. --- 2241,2250 ----
*** 2230,2244 **** --- 2260,2277 ---- dmu_tx_t *tx; uint64_t object = space_map_object(msp->ms_sm); ASSERT(!vd->vdev_ishole); + mutex_enter(&msp->ms_lock); + /* * This metaslab has just been added so there's no work to do now. */ if (msp->ms_freeingtree == NULL) { ASSERT3P(alloctree, ==, NULL); + mutex_exit(&msp->ms_lock); return; } ASSERT3P(alloctree, !=, NULL); ASSERT3P(msp->ms_freeingtree, !=, NULL);
*** 2250,2277 **** * is being forced to condense and it's loaded, we need to let it * through. */ if (range_tree_space(alloctree) == 0 && range_tree_space(msp->ms_freeingtree) == 0 && ! !(msp->ms_loaded && msp->ms_condense_wanted)) return; VERIFY(txg <= spa_final_dirty_txg(spa)); /* * The only state that can actually be changing concurrently with * metaslab_sync() is the metaslab's ms_tree. No other thread can * be modifying this txg's alloctree, freeingtree, freedtree, or ! * space_map_phys_t. We drop ms_lock whenever we could call ! * into the DMU, because the DMU can call down to us ! * (e.g. via zio_free()) at any time. ! * ! * The spa_vdev_remove_thread() can be reading metaslab state ! * concurrently, and it is locked out by the ms_sync_lock. Note ! * that the ms_lock is insufficient for this, because it is dropped ! * by space_map_write(). */ tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg); if (msp->ms_sm == NULL) { --- 2283,2308 ---- * is being forced to condense and it's loaded, we need to let it * through. */ if (range_tree_space(alloctree) == 0 && range_tree_space(msp->ms_freeingtree) == 0 && ! !(msp->ms_loaded && msp->ms_condense_wanted)) { ! mutex_exit(&msp->ms_lock); return; + } VERIFY(txg <= spa_final_dirty_txg(spa)); /* * The only state that can actually be changing concurrently with * metaslab_sync() is the metaslab's ms_tree. No other thread can * be modifying this txg's alloctree, freeingtree, freedtree, or ! * space_map_phys_t. Therefore, we only hold ms_lock to satify ! * space map ASSERTs. We drop it whenever we call into the DMU, ! * because the DMU can call down to us (e.g. via zio_free()) at ! * any time. */ tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg); if (msp->ms_sm == NULL) {
*** 2279,2295 **** new_object = space_map_alloc(mos, tx); VERIFY3U(new_object, !=, 0); VERIFY0(space_map_open(&msp->ms_sm, mos, new_object, ! msp->ms_start, msp->ms_size, vd->vdev_ashift)); ASSERT(msp->ms_sm != NULL); } - mutex_enter(&msp->ms_sync_lock); - mutex_enter(&msp->ms_lock); - /* * Note: metaslab_condense() clears the space map's histogram. * Therefore we must verify and remove this histogram before * condensing. */ --- 2310,2324 ---- new_object = space_map_alloc(mos, tx); VERIFY3U(new_object, !=, 0); VERIFY0(space_map_open(&msp->ms_sm, mos, new_object, ! msp->ms_start, msp->ms_size, vd->vdev_ashift, ! &msp->ms_lock)); ASSERT(msp->ms_sm != NULL); } /* * Note: metaslab_condense() clears the space map's histogram. * Therefore we must verify and remove this histogram before * condensing. */
*** 2299,2317 **** if (msp->ms_loaded && spa_sync_pass(spa) == 1 && metaslab_should_condense(msp)) { metaslab_condense(msp, txg, tx); } else { - mutex_exit(&msp->ms_lock); space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx); space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx); - mutex_enter(&msp->ms_lock); } if (msp->ms_loaded) { /* ! * When the space map is loaded, we have an accurate * histogram in the range tree. This gives us an opportunity * to bring the space map's histogram up-to-date so we clear * it first before updating it. */ space_map_histogram_clear(msp->ms_sm); --- 2328,2344 ---- if (msp->ms_loaded && spa_sync_pass(spa) == 1 && metaslab_should_condense(msp)) { metaslab_condense(msp, txg, tx); } else { space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx); space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx); } if (msp->ms_loaded) { /* ! * When the space map is loaded, we have an accruate * histogram in the range tree. This gives us an opportunity * to bring the space map's histogram up-to-date so we clear * it first before updating it. */ space_map_histogram_clear(msp->ms_sm);
*** 2375,2385 **** if (object != space_map_object(msp->ms_sm)) { object = space_map_object(msp->ms_sm); dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) * msp->ms_id, sizeof (uint64_t), &object, tx); } - mutex_exit(&msp->ms_sync_lock); dmu_tx_commit(tx); } /* * Called after a transaction group has completely synced to mark --- 2402,2411 ----
*** 2405,2437 **** */ if (msp->ms_freedtree == NULL) { for (int t = 0; t < TXG_SIZE; t++) { ASSERT(msp->ms_alloctree[t] == NULL); ! msp->ms_alloctree[t] = range_tree_create(NULL, NULL); } ASSERT3P(msp->ms_freeingtree, ==, NULL); ! msp->ms_freeingtree = range_tree_create(NULL, NULL); ASSERT3P(msp->ms_freedtree, ==, NULL); ! msp->ms_freedtree = range_tree_create(NULL, NULL); for (int t = 0; t < TXG_DEFER_SIZE; t++) { ASSERT(msp->ms_defertree[t] == NULL); ! msp->ms_defertree[t] = range_tree_create(NULL, NULL); } vdev_space_update(vd, 0, 0, msp->ms_size); } defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE]; uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) - metaslab_class_get_alloc(spa_normal_class(spa)); ! if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) { defer_allowed = B_FALSE; } defer_delta = 0; alloc_delta = space_map_alloc_delta(msp->ms_sm); --- 2431,2467 ---- */ if (msp->ms_freedtree == NULL) { for (int t = 0; t < TXG_SIZE; t++) { ASSERT(msp->ms_alloctree[t] == NULL); ! msp->ms_alloctree[t] = range_tree_create(NULL, msp, ! &msp->ms_lock); } ASSERT3P(msp->ms_freeingtree, ==, NULL); ! msp->ms_freeingtree = range_tree_create(NULL, msp, ! &msp->ms_lock); ASSERT3P(msp->ms_freedtree, ==, NULL); ! msp->ms_freedtree = range_tree_create(NULL, msp, ! &msp->ms_lock); for (int t = 0; t < TXG_DEFER_SIZE; t++) { ASSERT(msp->ms_defertree[t] == NULL); ! msp->ms_defertree[t] = range_tree_create(NULL, msp, ! &msp->ms_lock); } vdev_space_update(vd, 0, 0, msp->ms_size); } defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE]; uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) - metaslab_class_get_alloc(spa_normal_class(spa)); ! if (free_space <= spa_get_slop_space(spa)) { defer_allowed = B_FALSE; } defer_delta = 0; alloc_delta = space_map_alloc_delta(msp->ms_sm);
*** 2454,2463 **** --- 2484,2501 ---- * Move the frees from the defer_tree back to the free * range tree (if it's loaded). Swap the freed_tree and the * defer_tree -- this is safe to do because we've just emptied out * the defer_tree. */ + if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON && + !vd->vdev_man_trimming) { + range_tree_walk(*defer_tree, metaslab_trim_add, msp); + if (!defer_allowed) { + range_tree_walk(msp->ms_freedtree, metaslab_trim_add, + msp); + } + } range_tree_vacate(*defer_tree, msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree); if (defer_allowed) { range_tree_swap(&msp->ms_freedtree, defer_tree); } else {
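
With autotrim enabled and no manual TRIM running, metaslab_sync_done() now walks the outgoing defer tree (and the freed tree when deferral is disallowed) and queues every extent on the metaslab's trim set before the tree is vacated. The toy below only illustrates the walk-with-callback pattern used there; range_tree_walk_toy() and trim_add_toy() are invented stand-ins for range_tree_walk() and metaslab_trim_add().

#include <stdint.h>
#include <stdio.h>

typedef struct extent {
	uint64_t start;
	uint64_t size;
} extent_t;

typedef void range_cb_t(void *arg, uint64_t start, uint64_t size);

/* Visit every segment in a (toy) range tree, like range_tree_walk(). */
static void
range_tree_walk_toy(const extent_t *segs, int nsegs, range_cb_t *cb, void *arg)
{
	for (int i = 0; i < nsegs; i++)
		cb(arg, segs[i].start, segs[i].size);
}

/* Queue a segment for TRIM, like metaslab_trim_add() adding to ms_cur_ts. */
static void
trim_add_toy(void *arg, uint64_t start, uint64_t size)
{
	uint64_t *queued = arg;

	*queued += size;
}

int
main(void)
{
	/* extents leaving the defer tree this txg (made-up offsets) */
	extent_t defer[] = { { 0x1000, 0x2000 }, { 0x9000, 0x800 } };
	uint64_t queued = 0;

	range_tree_walk_toy(defer, 2, trim_add_toy, &queued);
	printf("queued %llu bytes for TRIM\n", (unsigned long long)queued);
	return (0);
}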
*** 2497,2533 **** if (!metaslab_debug_unload) metaslab_unload(msp); } - ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK])); - ASSERT0(range_tree_space(msp->ms_freeingtree)); - ASSERT0(range_tree_space(msp->ms_freedtree)); - mutex_exit(&msp->ms_lock); } void metaslab_sync_reassess(metaslab_group_t *mg) { - spa_t *spa = mg->mg_class->mc_spa; - - spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER); metaslab_group_alloc_update(mg); mg->mg_fragmentation = metaslab_group_fragmentation(mg); /* ! * Preload the next potential metaslabs but only on active ! * metaslab groups. We can get into a state where the metaslab ! * is no longer active since we dirty metaslabs as we remove a ! * a device, thus potentially making the metaslab group eligible ! * for preloading. */ - if (mg->mg_activation_count > 0) { metaslab_group_preload(mg); - } - spa_config_exit(spa, SCL_ALLOC, FTAG); } static uint64_t metaslab_distance(metaslab_t *msp, dva_t *dva) { --- 2535,2557 ---- if (!metaslab_debug_unload) metaslab_unload(msp); } mutex_exit(&msp->ms_lock); } void metaslab_sync_reassess(metaslab_group_t *mg) { metaslab_group_alloc_update(mg); mg->mg_fragmentation = metaslab_group_fragmentation(mg); /* ! * Preload the next potential metaslabs */ metaslab_group_preload(mg); } static uint64_t metaslab_distance(metaslab_t *msp, dva_t *dva) {
*** 2717,2726 **** --- 2741,2751 ---- VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift)); VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift)); VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size); range_tree_remove(rt, start, size); + metaslab_trim_remove(msp, start, size); if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0) vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg); range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
*** 2738,2748 **** return (start); } static uint64_t metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal, ! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d) { metaslab_t *msp = NULL; uint64_t offset = -1ULL; uint64_t activation_weight; uint64_t target_distance; --- 2763,2774 ---- return (start); } static uint64_t metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal, ! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d, ! int flags) { metaslab_t *msp = NULL; uint64_t offset = -1ULL; uint64_t activation_weight; uint64_t target_distance;
*** 2759,2768 **** --- 2785,2795 ---- metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP); search->ms_weight = UINT64_MAX; search->ms_start = 0; for (;;) { boolean_t was_active; + boolean_t pass_primary = B_TRUE; avl_tree_t *t = &mg->mg_metaslab_tree; avl_index_t idx; mutex_enter(&mg->mg_lock);
*** 2796,2820 **** */ if (msp->ms_condensing) continue; was_active = msp->ms_weight & METASLAB_ACTIVE_MASK; ! if (activation_weight == METASLAB_WEIGHT_PRIMARY) break; target_distance = min_distance + (space_map_allocated(msp->ms_sm) != 0 ? 0 : min_distance >> 1); ! for (i = 0; i < d; i++) { if (metaslab_distance(msp, &dva[i]) < target_distance) break; - } if (i == d) break; } mutex_exit(&mg->mg_lock); if (msp == NULL) { kmem_free(search, sizeof (*search)); return (-1ULL); } --- 2823,2858 ---- */ if (msp->ms_condensing) continue; was_active = msp->ms_weight & METASLAB_ACTIVE_MASK; ! if (flags & METASLAB_USE_WEIGHT_SECONDARY) { ! if (!pass_primary) { ! DTRACE_PROBE(metaslab_use_secondary); ! activation_weight = ! METASLAB_WEIGHT_SECONDARY; break; + } + pass_primary = B_FALSE; + } else { + if (activation_weight == + METASLAB_WEIGHT_PRIMARY) + break; + target_distance = min_distance + (space_map_allocated(msp->ms_sm) != 0 ? 0 : min_distance >> 1); ! for (i = 0; i < d; i++) if (metaslab_distance(msp, &dva[i]) < target_distance) break; if (i == d) break; } + } mutex_exit(&mg->mg_lock); if (msp == NULL) { kmem_free(search, sizeof (*search)); return (-1ULL); }
*** 2931,2947 **** return (offset); } static uint64_t metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal, ! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d) { uint64_t offset; ASSERT(mg->mg_initialized); offset = metaslab_group_alloc_normal(mg, zal, asize, txg, ! min_distance, dva, d); mutex_enter(&mg->mg_lock); if (offset == -1ULL) { mg->mg_failed_allocations++; metaslab_trace_add(zal, mg, NULL, asize, d, --- 2969,2986 ---- return (offset); } static uint64_t metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal, ! uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, ! int d, int flags) { uint64_t offset; ASSERT(mg->mg_initialized); offset = metaslab_group_alloc_normal(mg, zal, asize, txg, ! min_distance, dva, d, flags); mutex_enter(&mg->mg_lock); if (offset == -1ULL) { mg->mg_failed_allocations++; metaslab_trace_add(zal, mg, NULL, asize, d,
*** 2975,2985 **** int ditto_same_vdev_distance_shift = 3; /* * Allocate a block for the specified i/o. */ ! int metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize, dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags, zio_alloc_list_t *zal) { metaslab_group_t *mg, *rotor; --- 3014,3024 ---- int ditto_same_vdev_distance_shift = 3; /* * Allocate a block for the specified i/o. */ ! static int metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize, dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags, zio_alloc_list_t *zal) { metaslab_group_t *mg, *rotor;
*** 3021,3035 **** if (hintdva) { vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d])); /* * It's possible the vdev we're using as the hint no ! * longer exists or its mg has been closed (e.g. by ! * device removal). Consult the rotor when * all else fails. */ ! if (vd != NULL && vd->vdev_mg != NULL) { mg = vd->vdev_mg; if (flags & METASLAB_HINTBP_AVOID && mg->mg_next != NULL) mg = mg->mg_next; --- 3060,3073 ---- if (hintdva) { vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d])); /* * It's possible the vdev we're using as the hint no ! * longer exists (i.e. removed). Consult the rotor when * all else fails. */ ! if (vd != NULL) { mg = vd->vdev_mg; if (flags & METASLAB_HINTBP_AVOID && mg->mg_next != NULL) mg = mg->mg_next;
*** 3120,3130 **** uint64_t asize = vdev_psize_to_asize(vd, psize); ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0); uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg, ! distance, dva, d); if (offset != -1ULL) { /* * If we've just selected this metaslab group, * figure out whether the corresponding vdev is --- 3158,3168 ---- uint64_t asize = vdev_psize_to_asize(vd, psize); ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0); uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg, ! distance, dva, d, flags); if (offset != -1ULL) { /* * If we've just selected this metaslab group, * figure out whether the corresponding vdev is
*** 3131,3144 **** * over- or under-used relative to the pool, * and set an allocation bias to even it out. */ if (mc->mc_aliquot == 0 && metaslab_bias_enabled) { vdev_stat_t *vs = &vd->vdev_stat; ! int64_t vu, cu; vu = (vs->vs_alloc * 100) / (vs->vs_space + 1); cu = (mc->mc_alloc * 100) / (mc->mc_space + 1); /* * Calculate how much more or less we should * try to allocate from this device during * this iteration around the rotor. --- 3169,3187 ---- * over- or under-used relative to the pool, * and set an allocation bias to even it out. */ if (mc->mc_aliquot == 0 && metaslab_bias_enabled) { vdev_stat_t *vs = &vd->vdev_stat; ! vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat; ! int64_t vu, cu, vu_io; vu = (vs->vs_alloc * 100) / (vs->vs_space + 1); cu = (mc->mc_alloc * 100) / (mc->mc_space + 1); + vu_io = + (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) / + (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) * + (vd->vdev_parent->vdev_children)) - 100; /* * Calculate how much more or less we should * try to allocate from this device during * this iteration around the rotor.
*** 3151,3160 **** --- 3194,3222 ---- * This reduces allocations by 307K for this * iteration. */ mg->mg_bias = ((cu - vu) * (int64_t)mg->mg_aliquot) / 100; + + /* + * Experiment: space-based DVA allocator 0, + * latency-based 1 or hybrid 2. + */ + switch (metaslab_alloc_dva_algorithm) { + case 1: + mg->mg_bias = + (vu_io * (int64_t)mg->mg_aliquot) / + 100; + break; + case 2: + mg->mg_bias = + ((((cu - vu) + vu_io) / 2) * + (int64_t)mg->mg_aliquot) / 100; + break; + default: + break; + } } else if (!metaslab_bias_enabled) { mg->mg_bias = 0; } if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
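
The new vu_io term compares this vdev's cumulative write iotime against its parent's, scaled by the number of children, so 0 means the vdev consumed exactly its fair share. The program below reruns the three bias formulas (space-based 0, latency-based 1, hybrid 2) with made-up numbers purely to show the arithmetic; it is not lifted from the driver.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t aliquot = 512 << 10;	/* metaslab_aliquot, 512 KiB */
	int64_t vu = 30;		/* % of this vdev allocated */
	int64_t cu = 50;		/* % of the class allocated */
	int64_t children = 4;
	int64_t vs_iotime = 35;		/* write iotime of this vdev */
	int64_t pvs_iotime = 100;	/* write iotime of the parent */

	/* 0 when the vdev used exactly 1/children of the parent's iotime */
	int64_t vu_io = (vs_iotime * 100 / (pvs_iotime + 1)) * children - 100;

	int64_t space_bias = (cu - vu) * aliquot / 100;
	int64_t latency_bias = vu_io * aliquot / 100;
	int64_t hybrid_bias = (((cu - vu) + vu_io) / 2) * aliquot / 100;

	printf("vu_io=%lld space=%lld latency=%lld hybrid=%lld\n",
	    (long long)vu_io, (long long)space_bias,
	    (long long)latency_bias, (long long)hybrid_bias);
	return (0);
}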
*** 3165,3174 **** --- 3227,3238 ---- DVA_SET_VDEV(&dva[d], vd->vdev_id); DVA_SET_OFFSET(&dva[d], offset); DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER)); DVA_SET_ASIZE(&dva[d], asize); + DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id, + uint64_t, offset, uint64_t, psize); return (0); } next: mc->mc_rotor = mg->mg_next;
*** 3187,3418 **** metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC); return (SET_ERROR(ENOSPC)); } - void - metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize, - uint64_t txg) - { - metaslab_t *msp; - spa_t *spa = vd->vdev_spa; - - ASSERT3U(txg, ==, spa->spa_syncing_txg); - ASSERT(vdev_is_concrete(vd)); - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0); - ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count); - - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; - - VERIFY(!msp->ms_condensing); - VERIFY3U(offset, >=, msp->ms_start); - VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size); - VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift)); - VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift)); - - metaslab_check_free_impl(vd, offset, asize); - mutex_enter(&msp->ms_lock); - if (range_tree_space(msp->ms_freeingtree) == 0) { - vdev_dirty(vd, VDD_METASLAB, msp, txg); - } - range_tree_add(msp->ms_freeingtree, offset, asize); - mutex_exit(&msp->ms_lock); - } - - /* ARGSUSED */ - void - metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset, - uint64_t size, void *arg) - { - uint64_t *txgp = arg; - - if (vd->vdev_ops->vdev_op_remap != NULL) - vdev_indirect_mark_obsolete(vd, offset, size, *txgp); - else - metaslab_free_impl(vd, offset, size, *txgp); - } - - static void - metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size, - uint64_t txg) - { - spa_t *spa = vd->vdev_spa; - - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0); - - if (txg > spa_freeze_txg(spa)) - return; - - if (spa->spa_vdev_removal != NULL && - spa->spa_vdev_removal->svr_vdev == vd && - vdev_is_concrete(vd)) { - /* - * Note: we check if the vdev is concrete because when - * we complete the removal, we first change the vdev to be - * an indirect vdev (in open context), and then (in syncing - * context) clear spa_vdev_removal. - */ - free_from_removing_vdev(vd, offset, size, txg); - } else if (vd->vdev_ops->vdev_op_remap != NULL) { - vdev_indirect_mark_obsolete(vd, offset, size, txg); - vd->vdev_ops->vdev_op_remap(vd, offset, size, - metaslab_free_impl_cb, &txg); - } else { - metaslab_free_concrete(vd, offset, size, txg); - } - } - - typedef struct remap_blkptr_cb_arg { - blkptr_t *rbca_bp; - spa_remap_cb_t rbca_cb; - vdev_t *rbca_remap_vd; - uint64_t rbca_remap_offset; - void *rbca_cb_arg; - } remap_blkptr_cb_arg_t; - - void - remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset, - uint64_t size, void *arg) - { - remap_blkptr_cb_arg_t *rbca = arg; - blkptr_t *bp = rbca->rbca_bp; - - /* We can not remap split blocks. */ - if (size != DVA_GET_ASIZE(&bp->blk_dva[0])) - return; - ASSERT0(inner_offset); - - if (rbca->rbca_cb != NULL) { - /* - * At this point we know that we are not handling split - * blocks and we invoke the callback on the previous - * vdev which must be indirect. - */ - ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops); - - rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id, - rbca->rbca_remap_offset, size, rbca->rbca_cb_arg); - - /* set up remap_blkptr_cb_arg for the next call */ - rbca->rbca_remap_vd = vd; - rbca->rbca_remap_offset = offset; - } - - /* - * The phys birth time is that of dva[0]. This ensures that we know - * when each dva was written, so that resilver can determine which - * blocks need to be scrubbed (i.e. those written during the time - * the vdev was offline). It also ensures that the key used in - * the ARC hash table is unique (i.e. dva[0] + phys_birth). 
If - * we didn't change the phys_birth, a lookup in the ARC for a - * remapped BP could find the data that was previously stored at - * this vdev + offset. - */ - vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa, - DVA_GET_VDEV(&bp->blk_dva[0])); - vdev_indirect_births_t *vib = oldvd->vdev_indirect_births; - bp->blk_phys_birth = vdev_indirect_births_physbirth(vib, - DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0])); - - DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id); - DVA_SET_OFFSET(&bp->blk_dva[0], offset); - } - /* ! * If the block pointer contains any indirect DVAs, modify them to refer to ! * concrete DVAs. Note that this will sometimes not be possible, leaving ! * the indirect DVA in place. This happens if the indirect DVA spans multiple ! * segments in the mapping (i.e. it is a "split block"). ! * ! * If the BP was remapped, calls the callback on the original dva (note the ! * callback can be called multiple times if the original indirect DVA refers ! * to another indirect DVA, etc). ! * ! * Returns TRUE if the BP was remapped. */ - boolean_t - spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg) - { - remap_blkptr_cb_arg_t rbca; - - if (!zfs_remap_blkptr_enable) - return (B_FALSE); - - if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS)) - return (B_FALSE); - - /* - * Dedup BP's can not be remapped, because ddt_phys_select() depends - * on DVA[0] being the same in the BP as in the DDT (dedup table). - */ - if (BP_GET_DEDUP(bp)) - return (B_FALSE); - - /* - * Gang blocks can not be remapped, because - * zio_checksum_gang_verifier() depends on the DVA[0] that's in - * the BP used to read the gang block header (GBH) being the same - * as the DVA[0] that we allocated for the GBH. - */ - if (BP_IS_GANG(bp)) - return (B_FALSE); - - /* - * Embedded BP's have no DVA to remap. - */ - if (BP_GET_NDVAS(bp) < 1) - return (B_FALSE); - - /* - * Note: we only remap dva[0]. If we remapped other dvas, we - * would no longer know what their phys birth txg is. - */ - dva_t *dva = &bp->blk_dva[0]; - - uint64_t offset = DVA_GET_OFFSET(dva); - uint64_t size = DVA_GET_ASIZE(dva); - vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva)); - - if (vd->vdev_ops->vdev_op_remap == NULL) - return (B_FALSE); - - rbca.rbca_bp = bp; - rbca.rbca_cb = callback; - rbca.rbca_remap_vd = vd; - rbca.rbca_remap_offset = offset; - rbca.rbca_cb_arg = arg; - - /* - * remap_blkptr_cb() will be called in order for each level of - * indirection, until a concrete vdev is reached or a split block is - * encountered. old_vd and old_offset are updated within the callback - * as we go from the one indirect vdev to the next one (either concrete - * or indirect again) in that order. - */ - vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca); - - /* Check if the DVA wasn't remapped because it is a split block */ - if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id) - return (B_FALSE); - - return (B_TRUE); - } - - /* - * Undo the allocation of a DVA which happened in the given transaction group. - */ void ! 
metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg) { - metaslab_t *msp; - vdev_t *vd; uint64_t vdev = DVA_GET_VDEV(dva); uint64_t offset = DVA_GET_OFFSET(dva); uint64_t size = DVA_GET_ASIZE(dva); ASSERT(DVA_IS_VALID(dva)); - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0); if (txg > spa_freeze_txg(spa)) return; if ((vd = vdev_lookup_top(spa, vdev)) == NULL || --- 3251,3277 ---- metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC); return (SET_ERROR(ENOSPC)); } /* ! * Free the block represented by DVA in the context of the specified ! * transaction group. */ void ! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now) { uint64_t vdev = DVA_GET_VDEV(dva); uint64_t offset = DVA_GET_OFFSET(dva); uint64_t size = DVA_GET_ASIZE(dva); + vdev_t *vd; + metaslab_t *msp; + DTRACE_PROBE3(free_dva_probe, uint64_t, vdev, + uint64_t, offset, uint64_t, size); + ASSERT(DVA_IS_VALID(dva)); if (txg > spa_freeze_txg(spa)) return; if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
*** 3421,3441 **** (u_longlong_t)vdev, (u_longlong_t)offset); ASSERT(0); return; } ! ASSERT(!vd->vdev_removing); ! ASSERT(vdev_is_concrete(vd)); ! ASSERT0(vd->vdev_indirect_config.vic_mapping_object); ! ASSERT3P(vd->vdev_indirect_mapping, ==, NULL); if (DVA_GET_GANG(dva)) size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; - mutex_enter(&msp->ms_lock); range_tree_remove(msp->ms_alloctree[txg & TXG_MASK], offset, size); VERIFY(!msp->ms_condensing); VERIFY3U(offset, >=, msp->ms_start); --- 3280,3297 ---- (u_longlong_t)vdev, (u_longlong_t)offset); ASSERT(0); return; } ! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; if (DVA_GET_GANG(dva)) size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); mutex_enter(&msp->ms_lock); + + if (now) { range_tree_remove(msp->ms_alloctree[txg & TXG_MASK], offset, size); VERIFY(!msp->ms_condensing); VERIFY3U(offset, >=, msp->ms_start);
*** 3443,3475 **** VERIFY3U(range_tree_space(msp->ms_tree) + size, <=, msp->ms_size); VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift)); VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift)); range_tree_add(msp->ms_tree, offset, size); mutex_exit(&msp->ms_lock); } /* ! * Free the block represented by DVA in the context of the specified ! * transaction group. */ ! void ! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg) { uint64_t vdev = DVA_GET_VDEV(dva); uint64_t offset = DVA_GET_OFFSET(dva); uint64_t size = DVA_GET_ASIZE(dva); ! vdev_t *vd = vdev_lookup_top(spa, vdev); ASSERT(DVA_IS_VALID(dva)); - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0); ! if (DVA_GET_GANG(dva)) { size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); } ! metaslab_free_impl(vd, offset, size, txg); } /* * Reserve some allocation slots. The reservation system must be called * before we call into the allocator. If there aren't any available slots --- 3299,3378 ---- VERIFY3U(range_tree_space(msp->ms_tree) + size, <=, msp->ms_size); VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift)); VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift)); range_tree_add(msp->ms_tree, offset, size); + if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON && + !vd->vdev_man_trimming) + metaslab_trim_add(msp, offset, size); + msp->ms_max_size = metaslab_block_maxsize(msp); + } else { + VERIFY3U(txg, ==, spa->spa_syncing_txg); + if (range_tree_space(msp->ms_freeingtree) == 0) + vdev_dirty(vd, VDD_METASLAB, msp, txg); + range_tree_add(msp->ms_freeingtree, offset, size); + } + mutex_exit(&msp->ms_lock); } /* ! * Intent log support: upon opening the pool after a crash, notify the SPA ! * of blocks that the intent log has allocated for immediate write, but ! * which are still considered free by the SPA because the last transaction ! * group didn't commit yet. */ ! static int ! metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg) { uint64_t vdev = DVA_GET_VDEV(dva); uint64_t offset = DVA_GET_OFFSET(dva); uint64_t size = DVA_GET_ASIZE(dva); ! vdev_t *vd; ! metaslab_t *msp; ! int error = 0; ASSERT(DVA_IS_VALID(dva)); ! if ((vd = vdev_lookup_top(spa, vdev)) == NULL || ! (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count) ! return (SET_ERROR(ENXIO)); ! ! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; ! ! if (DVA_GET_GANG(dva)) size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); + + mutex_enter(&msp->ms_lock); + + if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded) + error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY); + + if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size)) + error = SET_ERROR(ENOENT); + + if (error || txg == 0) { /* txg == 0 indicates dry run */ + mutex_exit(&msp->ms_lock); + return (error); } ! VERIFY(!msp->ms_condensing); ! VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift)); ! VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift)); ! VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size); ! range_tree_remove(msp->ms_tree, offset, size); ! metaslab_trim_remove(msp, offset, size); ! ! if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */ ! if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0) ! vdev_dirty(vd, VDD_METASLAB, msp, txg); ! range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size); ! } ! ! mutex_exit(&msp->ms_lock); ! ! return (0); } /* * Reserve some allocation slots. The reservation system must be called * before we call into the allocator. If there aren't any available slots
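
metaslab_claim_dva() is called while replaying the intent log: a block the ZIL allocated just before the crash must be pulled back out of the free tree, and txg == 0 means a dry-run check only. The toy below models just that decision with a single free segment; claim_toy() and its crude bookkeeping are invented for illustration.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

typedef struct toy_metaslab {
	uint64_t free_start;	/* stand-in for the ms_tree contents: */
	uint64_t free_end;	/* one free segment [free_start, free_end) */
} toy_metaslab_t;

/*
 * txg == 0: dry run, verify the block is still free but change nothing.
 * Otherwise "move" it out of the free tree (crudely, by dropping
 * everything up to the end of the claimed block).
 */
static int
claim_toy(toy_metaslab_t *ms, uint64_t offset, uint64_t size, uint64_t txg)
{
	if (offset < ms->free_start || offset + size > ms->free_end)
		return (ENOENT);	/* no longer free: cannot claim */
	if (txg == 0)
		return (0);
	ms->free_start = offset + size;
	return (0);
}

int
main(void)
{
	toy_metaslab_t ms = { .free_start = 0x10000, .free_end = 0x40000 };

	printf("dry run: %d\n", claim_toy(&ms, 0x12000, 0x1000, 0));
	printf("claim:   %d\n", claim_toy(&ms, 0x12000, 0x1000, 4));
	printf("reclaim: %d\n", claim_toy(&ms, 0x10000, 0x1000, 4));
	return (0);
}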
*** 3516,3642 **** (void) refcount_remove(&mc->mc_alloc_slots, zio); } mutex_exit(&mc->mc_lock); } - static int - metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size, - uint64_t txg) - { - metaslab_t *msp; - spa_t *spa = vd->vdev_spa; - int error = 0; - - if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count) - return (ENXIO); - - ASSERT3P(vd->vdev_ms, !=, NULL); - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; - - mutex_enter(&msp->ms_lock); - - if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded) - error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY); - - if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size)) - error = SET_ERROR(ENOENT); - - if (error || txg == 0) { /* txg == 0 indicates dry run */ - mutex_exit(&msp->ms_lock); - return (error); - } - - VERIFY(!msp->ms_condensing); - VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift)); - VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift)); - VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size); - range_tree_remove(msp->ms_tree, offset, size); - - if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */ - if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0) - vdev_dirty(vd, VDD_METASLAB, msp, txg); - range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size); - } - - mutex_exit(&msp->ms_lock); - - return (0); - } - - typedef struct metaslab_claim_cb_arg_t { - uint64_t mcca_txg; - int mcca_error; - } metaslab_claim_cb_arg_t; - - /* ARGSUSED */ - static void - metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset, - uint64_t size, void *arg) - { - metaslab_claim_cb_arg_t *mcca_arg = arg; - - if (mcca_arg->mcca_error == 0) { - mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset, - size, mcca_arg->mcca_txg); - } - } - int - metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg) - { - if (vd->vdev_ops->vdev_op_remap != NULL) { - metaslab_claim_cb_arg_t arg; - - /* - * Only zdb(1M) can claim on indirect vdevs. This is used - * to detect leaks of mapped space (that are not accounted - * for in the obsolete counts, spacemap, or bpobj). - */ - ASSERT(!spa_writeable(vd->vdev_spa)); - arg.mcca_error = 0; - arg.mcca_txg = txg; - - vd->vdev_ops->vdev_op_remap(vd, offset, size, - metaslab_claim_impl_cb, &arg); - - if (arg.mcca_error == 0) { - arg.mcca_error = metaslab_claim_concrete(vd, - offset, size, txg); - } - return (arg.mcca_error); - } else { - return (metaslab_claim_concrete(vd, offset, size, txg)); - } - } - - /* - * Intent log support: upon opening the pool after a crash, notify the SPA - * of blocks that the intent log has allocated for immediate write, but - * which are still considered free by the SPA because the last transaction - * group didn't commit yet. - */ - static int - metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg) - { - uint64_t vdev = DVA_GET_VDEV(dva); - uint64_t offset = DVA_GET_OFFSET(dva); - uint64_t size = DVA_GET_ASIZE(dva); - vdev_t *vd; - - if ((vd = vdev_lookup_top(spa, vdev)) == NULL) { - return (SET_ERROR(ENXIO)); - } - - ASSERT(DVA_IS_VALID(dva)); - - if (DVA_GET_GANG(dva)) - size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); - - return (metaslab_claim_impl(vd, offset, size, txg)); - } - - int metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp, int ndvas, uint64_t txg, blkptr_t *hintbp, int flags, zio_alloc_list_t *zal, zio_t *zio) { dva_t *dva = bp->blk_dva; --- 3419,3429 ----
*** 3656,3671 **** ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa)); ASSERT(BP_GET_NDVAS(bp) == 0); ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp)); ASSERT3P(zal, !=, NULL); for (int d = 0; d < ndvas; d++) { ! error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva, ! txg, flags, zal); if (error != 0) { for (d--; d >= 0; d--) { ! metaslab_unalloc_dva(spa, &dva[d], txg); metaslab_group_alloc_decrement(spa, DVA_GET_VDEV(&dva[d]), zio, flags); bzero(&dva[d], sizeof (dva_t)); } spa_config_exit(spa, SCL_ALLOC, FTAG); --- 3443,3502 ---- ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa)); ASSERT(BP_GET_NDVAS(bp) == 0); ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp)); ASSERT3P(zal, !=, NULL); + if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) && + !(flags & (METASLAB_GANG_HEADER)) && + !(spa->spa_meta_policy.spa_small_data_to_special && + psize <= spa->spa_meta_policy.spa_small_data_to_special)) { + error = metaslab_alloc_dva(spa, spa_normal_class(spa), + psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg, + flags | METASLAB_USE_WEIGHT_SECONDARY, zal); + if (error == 0) { + error = metaslab_alloc_dva(spa, mc, psize, + &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal); + if (error != 0) { + error = 0; + /* + * Change the place of NORMAL and cleanup the + * second DVA. After that this BP is just a + * regular BP with one DVA + * + * This operation is valid only if: + * WBC_SPECIAL_DVA is dva[0] + * WBC_NORMAL_DVA is dva[1] + * + * see wbc.h + */ + bcopy(&dva[WBC_NORMAL_DVA], + &dva[WBC_SPECIAL_DVA], sizeof (dva_t)); + bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t)); + + /* + * Allocation of special DVA has failed, + * so this BP will be a regular BP and need + * to update the metaslab group's queue depth + * based on the newly allocated dva. + */ + metaslab_group_alloc_increment(spa, + DVA_GET_VDEV(&dva[0]), zio, flags); + } else { + BP_SET_SPECIAL(bp, 1); + } + } else { + spa_config_exit(spa, SCL_ALLOC, FTAG); + return (error); + } + } else { for (int d = 0; d < ndvas; d++) { ! error = metaslab_alloc_dva(spa, mc, psize, dva, d, ! hintdva, txg, flags, zal); if (error != 0) { for (d--; d >= 0; d--) { ! metaslab_free_dva(spa, &dva[d], ! txg, B_TRUE); metaslab_group_alloc_decrement(spa, DVA_GET_VDEV(&dva[d]), zio, flags); bzero(&dva[d], sizeof (dva_t)); } spa_config_exit(spa, SCL_ALLOC, FTAG);
*** 3676,3689 **** * based on the newly allocated dva. */ metaslab_group_alloc_increment(spa, DVA_GET_VDEV(&dva[d]), zio, flags); } - } - ASSERT(error == 0); ASSERT(BP_GET_NDVAS(bp) == ndvas); spa_config_exit(spa, SCL_ALLOC, FTAG); BP_SET_BIRTH(bp, txg, txg); --- 3507,3520 ---- * based on the newly allocated dva. */ metaslab_group_alloc_increment(spa, DVA_GET_VDEV(&dva[d]), zio, flags); } } ASSERT(BP_GET_NDVAS(bp) == ndvas); + } + ASSERT(error == 0); spa_config_exit(spa, SCL_ALLOC, FTAG); BP_SET_BIRTH(bp, txg, txg);
*** 3699,3715 **** ASSERT(!BP_IS_HOLE(bp)); ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa)); spa_config_enter(spa, SCL_FREE, FTAG, RW_READER); ! for (int d = 0; d < ndvas; d++) { ! if (now) { ! metaslab_unalloc_dva(spa, &dva[d], txg); } else { ! metaslab_free_dva(spa, &dva[d], txg); } - } spa_config_exit(spa, SCL_FREE, FTAG); } int --- 3530,3561 ---- ASSERT(!BP_IS_HOLE(bp)); ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa)); spa_config_enter(spa, SCL_FREE, FTAG, RW_READER); ! if (BP_IS_SPECIAL(bp)) { ! int start_dva; ! wbc_data_t *wbc_data = spa_get_wbc_data(spa); ! ! mutex_enter(&wbc_data->wbc_lock); ! start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE); ! mutex_exit(&wbc_data->wbc_lock); ! ! /* ! * Actual freeing should not be locked as ! * the block is already exempted from WBC ! * trees, and thus will not be moved ! */ ! metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now); ! if (start_dva == 0) { ! metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA], ! txg, now); ! } } else { ! for (int d = 0; d < ndvas; d++) ! metaslab_free_dva(spa, &dva[d], txg, now); } spa_config_exit(spa, SCL_FREE, FTAG); } int
*** 3730,3810 **** return (error); } spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER); for (int d = 0; d < ndvas; d++) ! if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0) break; spa_config_exit(spa, SCL_ALLOC, FTAG); ASSERT(error == 0 || txg == 0); return (error); } ! /* ARGSUSED */ ! static void ! metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset, ! uint64_t size, void *arg) { - if (vd->vdev_ops == &vdev_indirect_ops) - return; - - metaslab_check_free_impl(vd, offset, size); - } - - static void - metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size) - { - metaslab_t *msp; - spa_t *spa = vd->vdev_spa; - if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0) return; ! if (vd->vdev_ops->vdev_op_remap != NULL) { ! vd->vdev_ops->vdev_op_remap(vd, offset, size, ! metaslab_check_free_impl_cb, NULL); return; } ! ASSERT(vdev_is_concrete(vd)); ! ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count); ! ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0); ! msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; ! ! mutex_enter(&msp->ms_lock); ! if (msp->ms_loaded) range_tree_verify(msp->ms_tree, offset, size); range_tree_verify(msp->ms_freeingtree, offset, size); range_tree_verify(msp->ms_freedtree, offset, size); for (int j = 0; j < TXG_DEFER_SIZE; j++) range_tree_verify(msp->ms_defertree[j], offset, size); mutex_exit(&msp->ms_lock); } void ! metaslab_check_free(spa_t *spa, const blkptr_t *bp) { ! if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0) ! return; ! spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); ! for (int i = 0; i < BP_GET_NDVAS(bp); i++) { ! uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]); ! vdev_t *vd = vdev_lookup_top(spa, vdev); ! uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]); ! uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]); ! if (DVA_GET_GANG(&bp->blk_dva[i])) ! size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE); ! ASSERT3P(vd, !=, NULL); ! metaslab_check_free_impl(vd, offset, size); } ! spa_config_exit(spa, SCL_VDEV, FTAG); } --- 3576,3921 ---- return (error); } spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER); + if (BP_IS_SPECIAL(bp)) { + int start_dva; + wbc_data_t *wbc_data = spa_get_wbc_data(spa); + + mutex_enter(&wbc_data->wbc_lock); + start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE); + + /* + * Actual claiming should be under lock for WBC blocks. It must + * be done to ensure zdb will not fail. The only other user of + * the claiming is ZIL whose blocks can not be WBC ones, and + * thus the lock will not be held for them. + */ + error = metaslab_claim_dva(spa, + &dva[WBC_NORMAL_DVA], txg); + if (error == 0 && start_dva == 0) { + error = metaslab_claim_dva(spa, + &dva[WBC_SPECIAL_DVA], txg); + } + + mutex_exit(&wbc_data->wbc_lock); + } else { for (int d = 0; d < ndvas; d++) ! if ((error = metaslab_claim_dva(spa, ! &dva[d], txg)) != 0) break; + } spa_config_exit(spa, SCL_ALLOC, FTAG); ASSERT(error == 0 || txg == 0); return (error); } ! void ! metaslab_check_free(spa_t *spa, const blkptr_t *bp) { if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0) return; ! if (BP_IS_SPECIAL(bp)) { ! /* Do not check frees for WBC blocks */ return; } ! spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); ! for (int i = 0; i < BP_GET_NDVAS(bp); i++) { ! uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]); ! vdev_t *vd = vdev_lookup_top(spa, vdev); ! uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]); ! uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]); ! metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift]; ! 
if (msp->ms_loaded) { range_tree_verify(msp->ms_tree, offset, size); + range_tree_verify(msp->ms_cur_ts->ts_tree, + offset, size); + if (msp->ms_prev_ts != NULL) { + range_tree_verify(msp->ms_prev_ts->ts_tree, + offset, size); + } + } range_tree_verify(msp->ms_freeingtree, offset, size); range_tree_verify(msp->ms_freedtree, offset, size); for (int j = 0; j < TXG_DEFER_SIZE; j++) range_tree_verify(msp->ms_defertree[j], offset, size); + } + spa_config_exit(spa, SCL_VDEV, FTAG); + } + + /* + * Trims all free space in the metaslab. Returns the root TRIM zio (that the + * caller should zio_wait() for) and the amount of space in the metaslab that + * has been scheduled for trimming in the `delta' return argument. + */ + zio_t * + metaslab_trim_all(metaslab_t *msp, uint64_t *delta) + { + boolean_t was_loaded; + uint64_t trimmed_space; + zio_t *trim_io; + + ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock)); + + mutex_enter(&msp->ms_lock); + + while (msp->ms_loading) + metaslab_load_wait(msp); + /* If we loaded the metaslab, unload it when we're done. */ + was_loaded = msp->ms_loaded; + if (!was_loaded) { + if (metaslab_load(msp) != 0) { mutex_exit(&msp->ms_lock); + return (0); + } + } + /* Flush out any scheduled extents and add everything in ms_tree. */ + range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL); + range_tree_walk(msp->ms_tree, metaslab_trim_add, msp); + + /* Force this trim to take place ASAP. */ + if (msp->ms_prev_ts != NULL) + metaslab_free_trimset(msp->ms_prev_ts); + msp->ms_prev_ts = msp->ms_cur_ts; + msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock); + trimmed_space = range_tree_space(msp->ms_tree); + if (!was_loaded) + metaslab_unload(msp); + + trim_io = metaslab_exec_trim(msp); + mutex_exit(&msp->ms_lock); + *delta = trimmed_space; + + return (trim_io); } + /* + * Notifies the trimsets in a metaslab that an extent has been allocated. + * This removes the segment from the queues of extents awaiting to be trimmed. + */ + static void + metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size) + { + metaslab_t *msp = arg; + + range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size); + if (msp->ms_prev_ts != NULL) { + range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset, + size); + } + } + + /* + * Notifies the trimsets in a metaslab that an extent has been freed. + * This adds the segment to the currently open queue of extents awaiting + * to be trimmed. + */ + static void + metaslab_trim_add(void *arg, uint64_t offset, uint64_t size) + { + metaslab_t *msp = arg; + ASSERT(msp->ms_cur_ts != NULL); + range_tree_add(msp->ms_cur_ts->ts_tree, offset, size); + } + + /* + * Does a metaslab's automatic trim operation processing. This must be + * called from metaslab_sync, with the txg number of the txg. This function + * issues trims in intervals as dictated by the zfs_txgs_per_trim tunable. + */ void ! metaslab_auto_trim(metaslab_t *msp, uint64_t txg) { ! /* for atomicity */ ! uint64_t txgs_per_trim = zfs_txgs_per_trim; ! ASSERT(!MUTEX_HELD(&msp->ms_lock)); ! mutex_enter(&msp->ms_lock); ! /* ! * Since we typically have hundreds of metaslabs per vdev, but we only ! * trim them once every zfs_txgs_per_trim txgs, it'd be best if we ! * could sequence the TRIM commands from all metaslabs so that they ! * don't all always pound the device in the same txg. We do so by ! * artificially inflating the birth txg of the first trim set by a ! * sequence number derived from the metaslab's starting offset ! * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and ! 
* 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg. ! * ! * If we detect that the txg has advanced too far ahead of ts_birth, ! * it means our birth txg is out of lockstep. Recompute it by ! * rounding down to the nearest zfs_txgs_per_trim multiple and adding ! * our metaslab id modulo zfs_txgs_per_trim. ! */ ! if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) { ! msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) * ! txgs_per_trim + (msp->ms_id % txgs_per_trim); ! } ! /* Time to swap out the current and previous trimsets */ ! if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) { ! if (msp->ms_prev_ts != NULL) { ! if (msp->ms_trimming_ts != NULL) { ! spa_t *spa = msp->ms_group->mg_class->mc_spa; ! /* ! * The previous trim run is still ongoing, so ! * the device is reacting slowly to our trim ! * requests. Drop this trimset, so as not to ! * back the device up with trim requests. ! */ ! spa_trimstats_auto_slow_incr(spa); ! metaslab_free_trimset(msp->ms_prev_ts); ! } else if (msp->ms_group->mg_vd->vdev_man_trimming) { ! /* ! * If a manual trim is ongoing, we want to ! * inhibit autotrim temporarily so it doesn't ! * slow down the manual trim. ! */ ! metaslab_free_trimset(msp->ms_prev_ts); ! } else { ! /* ! * Trim out aged extents on the vdevs - these ! * are safe to be destroyed now. We'll keep ! * the trimset around to deny allocations from ! * these regions while the trims are ongoing. ! */ ! zio_nowait(metaslab_exec_trim(msp)); ! } ! } ! msp->ms_prev_ts = msp->ms_cur_ts; ! msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock); ! } ! mutex_exit(&msp->ms_lock); ! } ! static void ! metaslab_trim_done(zio_t *zio) ! { ! metaslab_t *msp = zio->io_private; ! boolean_t held; ! ! ASSERT(msp != NULL); ! ASSERT(msp->ms_trimming_ts != NULL); ! held = MUTEX_HELD(&msp->ms_lock); ! if (!held) ! mutex_enter(&msp->ms_lock); ! metaslab_free_trimset(msp->ms_trimming_ts); ! msp->ms_trimming_ts = NULL; ! cv_signal(&msp->ms_trim_cv); ! if (!held) ! mutex_exit(&msp->ms_lock); ! } ! ! /* ! * Executes a zio_trim on a range tree holding freed extents in the metaslab. ! */ ! static zio_t * ! metaslab_exec_trim(metaslab_t *msp) ! { ! metaslab_group_t *mg = msp->ms_group; ! spa_t *spa = mg->mg_class->mc_spa; ! vdev_t *vd = mg->mg_vd; ! range_tree_t *trim_tree; ! zio_t *zio; ! ! ASSERT(MUTEX_HELD(&msp->ms_lock)); ! ! /* wait for a preceding trim to finish */ ! while (msp->ms_trimming_ts != NULL) ! cv_wait(&msp->ms_trim_cv, &msp->ms_lock); ! msp->ms_trimming_ts = msp->ms_prev_ts; ! msp->ms_prev_ts = NULL; ! trim_tree = msp->ms_trimming_ts->ts_tree; ! #ifdef DEBUG ! if (msp->ms_loaded) { ! for (range_seg_t *rs = avl_first(&trim_tree->rt_root); ! rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) { ! if (!range_tree_contains(msp->ms_tree, ! rs->rs_start, rs->rs_end - rs->rs_start)) { ! panic("trimming allocated region; mss=%p", ! (void*)rs); } ! } ! } ! #endif ! ! /* Nothing to trim */ ! if (range_tree_space(trim_tree) == 0) { ! metaslab_free_trimset(msp->ms_trimming_ts); ! msp->ms_trimming_ts = 0; ! return (zio_root(spa, NULL, NULL, 0)); ! } ! zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0, ! ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY | ! ZIO_FLAG_CONFIG_WRITER, msp); ! ! return (zio); ! } ! ! /* ! * Allocates and initializes a new trimset structure. The `txg' argument ! * indicates when this trimset was born and `lock' indicates the lock to ! * link to the range tree. ! */ ! static metaslab_trimset_t * ! metaslab_new_trimset(uint64_t txg, kmutex_t *lock) ! { ! 
metaslab_trimset_t *ts; ! ! ts = kmem_zalloc(sizeof (*ts), KM_SLEEP); ! ts->ts_birth = txg; ! ts->ts_tree = range_tree_create(NULL, NULL, lock); ! ! return (ts); ! } ! ! /* ! * Destroys and frees a trim set previously allocated by metaslab_new_trimset. ! */ ! static void ! metaslab_free_trimset(metaslab_trimset_t *ts) ! { ! range_tree_vacate(ts->ts_tree, NULL, NULL); ! range_tree_destroy(ts->ts_tree); ! kmem_free(ts, sizeof (*ts)); ! } ! ! /* ! * Checks whether an allocation conflicts with an ongoing trim operation in ! * the given metaslab. This function takes a segment starting at `*offset' ! * of `size' and checks whether it hits any region in the metaslab currently ! * being trimmed. If yes, it tries to adjust the allocation to the end of ! * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to ! * `limit' (no part of the allocation is allowed to go past this point). ! * ! * Returns B_FALSE if either the original allocation wasn't in conflict, or ! * the conflict could be resolved by adjusting the value stored in `offset' ! * such that the whole allocation still fits below `limit'. Returns B_TRUE ! * if the allocation conflict couldn't be resolved. ! */ ! static boolean_t metaslab_check_trim_conflict(metaslab_t *msp, ! uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit) ! { ! uint64_t new_offset; ! ! if (msp->ms_trimming_ts == NULL) ! /* no trim conflict, original offset is OK */ ! return (B_FALSE); ! ! new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree, ! *offset, size), align); ! if (new_offset != *offset && new_offset + size > limit) ! /* trim conflict and adjustment not possible */ ! return (B_TRUE); ! ! /* trim conflict, but adjusted offset still within limit */ ! *offset = new_offset; ! return (B_FALSE); }