NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrect
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
 * refactored WRC move-logic to use zio kmem_caches
 * replaced the size and compression fields with a blk_prop field
   (the same as in blkptr_t) to slightly reduce the size of wrc_block_t
   and allow blkptr_t-style macros to extract PSIZE, LSIZE
   and COMPRESSION
 * reduced atomic calls to lower CPU overhead
 * removed unused code
 * fixed inconsistent variable naming
 * fixed a possible system panic after restarting a system
   with WRC enabled
 * fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures

@@ -21,10 +21,11 @@
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
  * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  * Copyright (c) 2014 Integros [integros.com]
+ * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
  */
 
 #include <sys/zfs_context.h>
 #include <sys/dmu.h>
 #include <sys/dmu_tx.h>

@@ -32,11 +33,11 @@
 #include <sys/metaslab_impl.h>
 #include <sys/vdev_impl.h>
 #include <sys/zio.h>
 #include <sys/spa_impl.h>
 #include <sys/zfeature.h>
-#include <sys/vdev_indirect_mapping.h>
+#include <sys/wbc.h>
 
 #define GANG_ALLOCATION(flags) \
         ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
 
 uint64_t metaslab_aliquot = 512ULL << 10;

@@ -165,15 +166,10 @@
  * Enable/disable metaslab group biasing.
  */
 boolean_t metaslab_bias_enabled = B_TRUE;
 
 /*
- * Enable/disable remapping of indirect DVAs to their concrete vdevs.
- */
-boolean_t zfs_remap_blkptr_enable = B_TRUE;
-
-/*
  * Enable/disable segment-based metaslab selection.
  */
 boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
 
 /*

@@ -199,16 +195,50 @@
  */
 uint64_t metaslab_trace_max_entries = 5000;
 
 static uint64_t metaslab_weight(metaslab_t *);
 static void metaslab_set_fragmentation(metaslab_t *);
-static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
-static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
 
 kmem_cache_t *metaslab_alloc_trace_cache;
 
 /*
+ * Toggle between space-based DVA allocator 0, latency-based 1 or hybrid 2.
+ * A value other than 0, 1 or 2 will be considered 0 (default).
+ */
+int metaslab_alloc_dva_algorithm = 0;
+
+/*
+ * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+ * issued to the underlying vdev. We keep two range trees of extents
+ * (called "trim sets") to be trimmed per metaslab, the `current' and
+ * the `previous' TS. New free's are added to the current TS. Then,
+ * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
+ * TS becomes the `previous' TS and a new, blank TS is created to be
+ * the new `current', which will then start accumulating any new frees.
+ * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
+ * extents are trimmed, the TS is destroyed and the current TS again
+ * becomes the previous TS.
+ * This serves to fulfill two functions: aggregate many small frees
+ * into fewer larger trim operations (which should help with devices
+ * which do not take so kindly to them) and to allow for disaster
+ * recovery (extents won't get trimmed immediately, but instead only
+ * after passing this rather long timeout, thus not preserving
+ * 'zfs import -F' functionality).
+ */
+unsigned int zfs_txgs_per_trim = 32;
+
+static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
+static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
+
+static zio_t *metaslab_exec_trim(metaslab_t *msp);
+
+static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
+static void metaslab_free_trimset(metaslab_trimset_t *ts);
+static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
+    uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
+
+/*
  * ==========================================================================
  * Metaslab classes
  * ==========================================================================
  */
 metaslab_class_t *

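The comment added above describes a two-stage rotation keyed off zfs_txgs_per_trim: new frees accumulate in a `current' trimset, which is demoted to `previous' every zfs_txgs_per_trim TXGs, and a `previous' trimset is only issued (and destroyed) at the following rotation. Below is a minimal, stand-alone C sketch of that rotation; it is not part of the patch, and trimset_t, its fields, and the simulation loop are hypothetical stand-ins for the real per-metaslab range trees.

/*
 * Illustrative sketch only: models the current/previous trimset
 * rotation with a counter instead of a range tree of extents.
 */
#include <stdio.h>
#include <stdint.h>

#define TXGS_PER_TRIM   32      /* mirrors the zfs_txgs_per_trim default */

typedef struct {
        uint64_t birth;         /* txg in which this trimset was started */
        uint64_t nextents;      /* stand-in for accumulated frees */
} trimset_t;

int
main(void)
{
        trimset_t cur = { 0, 0 }, prev = { 0, 0 };
        int have_prev = 0;

        for (uint64_t txg = 1; txg <= 128; txg++) {
                cur.nextents++;         /* new frees land in the current TS */

                if (txg - cur.birth >= TXGS_PER_TRIM) {
                        if (have_prev) {
                                /* the previous TS is now old enough to issue */
                                printf("txg %llu: trim %llu extents "
                                    "(accumulated since txg %llu)\n",
                                    (unsigned long long)txg,
                                    (unsigned long long)prev.nextents,
                                    (unsigned long long)prev.birth);
                        }
                        /* current becomes previous; start a fresh current */
                        prev = cur;
                        have_prev = 1;
                        cur = (trimset_t){ .birth = txg, .nextents = 0 };
                }
        }
        return (0);
}

With the default of 32, an extent freed in txg N is therefore not issued to the device before roughly txg N + 64, which is what preserves a window for 'zfs import -F' style recovery.
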
@@ -216,10 +246,14 @@
 {
         metaslab_class_t *mc;
 
         mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
 
+        mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
+        avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
+            sizeof (zio_t), offsetof(zio_t, io_alloc_node));
+
         mc->mc_spa = spa;
         mc->mc_rotor = NULL;
         mc->mc_ops = ops;
         mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
         refcount_create_tracked(&mc->mc_alloc_slots);

@@ -234,10 +268,13 @@
         ASSERT(mc->mc_alloc == 0);
         ASSERT(mc->mc_deferred == 0);
         ASSERT(mc->mc_space == 0);
         ASSERT(mc->mc_dspace == 0);
 
+        avl_destroy(&mc->mc_alloc_tree);
+        mutex_destroy(&mc->mc_alloc_lock);
+
         refcount_destroy(&mc->mc_alloc_slots);
         mutex_destroy(&mc->mc_lock);
         kmem_free(mc, sizeof (metaslab_class_t));
 }
 

@@ -320,11 +357,11 @@
 
                 /*
                  * Skip any holes, uninitialized top-levels, or
                  * vdevs that are not in this metalab class.
                  */
-                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
+                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                     mg->mg_class != mc) {
                         continue;
                 }
 
                 for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)

@@ -355,14 +392,14 @@
         for (int c = 0; c < rvd->vdev_children; c++) {
                 vdev_t *tvd = rvd->vdev_child[c];
                 metaslab_group_t *mg = tvd->vdev_mg;
 
                 /*
-                 * Skip any holes, uninitialized top-levels,
-                 * or vdevs that are not in this metalab class.
+                 * Skip any holes, uninitialized top-levels, or
+                 * vdevs that are not in this metalab class.
                  */
-                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
+                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                     mg->mg_class != mc) {
                         continue;
                 }
 
                 /*

@@ -404,11 +441,11 @@
         for (int c = 0; c < rvd->vdev_children; c++) {
                 uint64_t tspace;
                 vdev_t *tvd = rvd->vdev_child[c];
                 metaslab_group_t *mg = tvd->vdev_mg;
 
-                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
+                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                     mg->mg_class != mc) {
                         continue;
                 }
 
                 /*

@@ -516,12 +553,10 @@
         vdev_stat_t *vs = &vd->vdev_stat;
         boolean_t was_allocatable;
         boolean_t was_initialized;
 
         ASSERT(vd == vd->vdev_top);
-        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
-            SCL_ALLOC);
 
         mutex_enter(&mg->mg_lock);
         was_allocatable = mg->mg_allocatable;
         was_initialized = mg->mg_initialized;
 

@@ -615,10 +650,11 @@
          * either because we never activated in the first place or
          * because we're done, and possibly removing the vdev.
          */
         ASSERT(mg->mg_activation_count <= 0);
 
+        if (mg->mg_taskq)
         taskq_destroy(mg->mg_taskq);
         avl_destroy(&mg->mg_metaslab_tree);
         mutex_destroy(&mg->mg_lock);
         refcount_destroy(&mg->mg_alloc_queue_depth);
         kmem_free(mg, sizeof (metaslab_group_t));

@@ -628,11 +664,11 @@
 metaslab_group_activate(metaslab_group_t *mg)
 {
         metaslab_class_t *mc = mg->mg_class;
         metaslab_group_t *mgprev, *mgnext;
 
-        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
+        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 
         ASSERT(mc->mc_rotor != mg);
         ASSERT(mg->mg_prev == NULL);
         ASSERT(mg->mg_next == NULL);
         ASSERT(mg->mg_activation_count <= 0);

@@ -654,52 +690,27 @@
                 mgnext->mg_prev = mg;
         }
         mc->mc_rotor = mg;
 }
 
-/*
- * Passivate a metaslab group and remove it from the allocation rotor.
- * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
- * a metaslab group. This function will momentarily drop spa_config_locks
- * that are lower than the SCL_ALLOC lock (see comment below).
- */
 void
 metaslab_group_passivate(metaslab_group_t *mg)
 {
         metaslab_class_t *mc = mg->mg_class;
-        spa_t *spa = mc->mc_spa;
         metaslab_group_t *mgprev, *mgnext;
-        int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
 
-        ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
-            (SCL_ALLOC | SCL_ZIO));
+        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 
         if (--mg->mg_activation_count != 0) {
                 ASSERT(mc->mc_rotor != mg);
                 ASSERT(mg->mg_prev == NULL);
                 ASSERT(mg->mg_next == NULL);
                 ASSERT(mg->mg_activation_count < 0);
                 return;
         }
 
-        /*
-         * The spa_config_lock is an array of rwlocks, ordered as
-         * follows (from highest to lowest):
-         *      SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
-         *      SCL_ZIO > SCL_FREE > SCL_VDEV
-         * (For more information about the spa_config_lock see spa_misc.c)
-         * The higher the lock, the broader its coverage. When we passivate
-         * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
-         * config locks. However, the metaslab group's taskq might be trying
-         * to preload metaslabs so we must drop the SCL_ZIO lock and any
-         * lower locks to allow the I/O to complete. At a minimum,
-         * we continue to hold the SCL_ALLOC lock, which prevents any future
-         * allocations from taking place and any changes to the vdev tree.
-         */
-        spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
         taskq_wait(mg->mg_taskq);
-        spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
         metaslab_group_alloc_update(mg);
 
         mgprev = mg->mg_prev;
         mgnext = mg->mg_next;
 

@@ -1139,23 +1150,24 @@
  * This is a helper function that can be used by the allocator to find
  * a suitable block to allocate. This will search the specified AVL
  * tree looking for a block that matches the specified criteria.
  */
 static uint64_t
-metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
-    uint64_t align)
+metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
+    uint64_t size, uint64_t align)
 {
         range_seg_t *rs = metaslab_block_find(t, *cursor, size);
 
-        while (rs != NULL) {
+        for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
                 uint64_t offset = P2ROUNDUP(rs->rs_start, align);
 
-                if (offset + size <= rs->rs_end) {
+                if (offset + size <= rs->rs_end &&
+                    !metaslab_check_trim_conflict(msp, &offset, size, align,
+                    rs->rs_end)) {
                         *cursor = offset + size;
                         return (offset);
                 }
-                rs = AVL_NEXT(t, rs);
         }
 
         /*
          * If we know we've searched the whole map (*cursor == 0), give up.
          * Otherwise, reset the cursor to the beginning and try again.

@@ -1162,11 +1174,11 @@
          */
         if (*cursor == 0)
                 return (-1ULL);
 
         *cursor = 0;
-        return (metaslab_block_picker(t, cursor, size, align));
+        return (metaslab_block_picker(msp, t, cursor, size, align));
 }
 
 /*
  * ==========================================================================
  * The first-fit block allocator

@@ -1184,11 +1196,11 @@
          */
         uint64_t align = size & -size;
         uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
         avl_tree_t *t = &msp->ms_tree->rt_root;
 
-        return (metaslab_block_picker(t, cursor, size, align));
+        return (metaslab_block_picker(msp, t, cursor, size, align));
 }
 
 static metaslab_ops_t metaslab_ff_ops = {
         metaslab_ff_alloc
 };

@@ -1232,11 +1244,11 @@
             free_pct < metaslab_df_free_pct) {
                 t = &msp->ms_size_tree;
                 *cursor = 0;
         }
 
-        return (metaslab_block_picker(t, cursor, size, 1ULL));
+        return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
 }
 
 static metaslab_ops_t metaslab_df_ops = {
         metaslab_df_alloc
 };

@@ -1264,18 +1276,24 @@
 
         ASSERT3U(*cursor_end, >=, *cursor);
 
         if ((*cursor + size) > *cursor_end) {
                 range_seg_t *rs;
-
-                rs = avl_last(&msp->ms_size_tree);
-                if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
-                        return (-1ULL);
-
+                for (rs = avl_last(&msp->ms_size_tree);
+                    rs != NULL && rs->rs_end - rs->rs_start >= size;
+                    rs = AVL_PREV(&msp->ms_size_tree, rs)) {
                 *cursor = rs->rs_start;
                 *cursor_end = rs->rs_end;
+                        if (!metaslab_check_trim_conflict(msp, cursor, size,
+                            1, *cursor_end)) {
+                                /* segment appears to be acceptable */
+                                break;
         }
+                }
+                if (rs == NULL || rs->rs_end - rs->rs_start < size)
+                        return (-1ULL);
+        }
 
         offset = *cursor;
         *cursor += size;
 
         return (offset);

@@ -1307,10 +1325,12 @@
         avl_index_t where;
         range_seg_t *rs, rsearch;
         uint64_t hbit = highbit64(size);
         uint64_t *cursor = &msp->ms_lbas[hbit - 1];
         uint64_t max_size = metaslab_block_maxsize(msp);
+        /* mutable copy for adjustment by metaslab_check_trim_conflict */
+        uint64_t adjustable_start;
 
         ASSERT(MUTEX_HELD(&msp->ms_lock));
         ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
 
         if (max_size < size)

@@ -1318,27 +1338,36 @@
 
         rsearch.rs_start = *cursor;
         rsearch.rs_end = *cursor + size;
 
         rs = avl_find(t, &rsearch, &where);
-        if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
+        if (rs != NULL)
+                adjustable_start = rs->rs_start;
+        if (rs == NULL || rs->rs_end - adjustable_start < size ||
+            metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
+            rs->rs_end)) {
+                /* segment not usable, try the largest remaining one */
                 t = &msp->ms_size_tree;
 
                 rsearch.rs_start = 0;
                 rsearch.rs_end = MIN(max_size,
                     1ULL << (hbit + metaslab_ndf_clump_shift));
                 rs = avl_find(t, &rsearch, &where);
                 if (rs == NULL)
                         rs = avl_nearest(t, where, AVL_AFTER);
                 ASSERT(rs != NULL);
+                adjustable_start = rs->rs_start;
+                if (rs->rs_end - adjustable_start < size ||
+                    metaslab_check_trim_conflict(msp, &adjustable_start,
+                    size, 1, rs->rs_end)) {
+                        /* even largest remaining segment not usable */
+                        return (-1ULL);
         }
-
-        if ((rs->rs_end - rs->rs_start) >= size) {
-                *cursor = rs->rs_start + size;
-                return (rs->rs_start);
         }
-        return (-1ULL);
+
+        *cursor = adjustable_start + size;
+        return (*cursor);
 }
 
 static metaslab_ops_t metaslab_ndf_ops = {
         metaslab_ndf_alloc
 };

@@ -1374,16 +1403,10 @@
         ASSERT(MUTEX_HELD(&msp->ms_lock));
         ASSERT(!msp->ms_loaded);
         ASSERT(!msp->ms_loading);
 
         msp->ms_loading = B_TRUE;
-        /*
-         * Nobody else can manipulate a loading metaslab, so it's now safe
-         * to drop the lock.  This way we don't have to hold the lock while
-         * reading the spacemap from disk.
-         */
-        mutex_exit(&msp->ms_lock);
 
         /*
          * If the space map has not been allocated yet, then treat
          * all the space in the metaslab as free and add it to the
          * ms_tree.

@@ -1392,21 +1415,21 @@
                 error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
         else
                 range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
 
         success = (error == 0);
-
-        mutex_enter(&msp->ms_lock);
         msp->ms_loading = B_FALSE;
 
         if (success) {
                 ASSERT3P(msp->ms_group, !=, NULL);
                 msp->ms_loaded = B_TRUE;
 
                 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                         range_tree_walk(msp->ms_defertree[t],
                             range_tree_remove, msp->ms_tree);
+                        range_tree_walk(msp->ms_defertree[t],
+                            metaslab_trim_remove, msp);
                 }
                 msp->ms_max_size = metaslab_block_maxsize(msp);
         }
         cv_broadcast(&msp->ms_load_cv);
         return (error);

@@ -1431,12 +1454,12 @@
         metaslab_t *ms;
         int error;
 
         ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
         mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
-        mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
         cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
+        cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
         ms->ms_id = id;
         ms->ms_start = id << vd->vdev_ms_shift;
         ms->ms_size = 1ULL << vd->vdev_ms_shift;
 
         /*

@@ -1443,28 +1466,30 @@
          * We only open space map objects that already exist. All others
          * will be opened when we finally allocate an object for it.
          */
         if (object != 0) {
                 error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
-                    ms->ms_size, vd->vdev_ashift);
+                    ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
 
                 if (error != 0) {
                         kmem_free(ms, sizeof (metaslab_t));
                         return (error);
                 }
 
                 ASSERT(ms->ms_sm != NULL);
         }
 
+        ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
+
         /*
          * We create the main range tree here, but we don't create the
          * other range trees until metaslab_sync_done().  This serves
          * two purposes: it allows metaslab_sync_done() to detect the
          * addition of new space; and for debugging, it ensures that we'd
          * data fault on any attempt to use this metaslab before it's ready.
          */
-        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
+        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
         metaslab_group_add(mg, ms);
 
         metaslab_set_fragmentation(ms);
 
         /*

@@ -1524,16 +1549,21 @@
 
         for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                 range_tree_destroy(msp->ms_defertree[t]);
         }
 
+        metaslab_free_trimset(msp->ms_cur_ts);
+        if (msp->ms_prev_ts)
+                metaslab_free_trimset(msp->ms_prev_ts);
+        ASSERT3P(msp->ms_trimming_ts, ==, NULL);
+
         ASSERT0(msp->ms_deferspace);
 
         mutex_exit(&msp->ms_lock);
         cv_destroy(&msp->ms_load_cv);
+        cv_destroy(&msp->ms_trim_cv);
         mutex_destroy(&msp->ms_lock);
-        mutex_destroy(&msp->ms_sync_lock);
 
         kmem_free(msp, sizeof (metaslab_t));
 }
 
 #define FRAGMENTATION_TABLE_SIZE        17

@@ -1895,15 +1925,18 @@
         uint64_t weight;
 
         ASSERT(MUTEX_HELD(&msp->ms_lock));
 
         /*
-         * If this vdev is in the process of being removed, there is nothing
+         * This vdev is in the process of being removed so there is nothing
          * for us to do here.
          */
-        if (vd->vdev_removing)
+        if (vd->vdev_removing) {
+                ASSERT0(space_map_allocated(msp->ms_sm));
+                ASSERT0(vd->vdev_ms_shift);
                 return (0);
+        }
 
         metaslab_set_fragmentation(msp);
 
         /*
          * Update the maximum size if the metaslab is loaded. This will

@@ -2031,17 +2064,14 @@
                 taskq_wait(mg->mg_taskq);
                 return;
         }
 
         mutex_enter(&mg->mg_lock);
-
         /*
          * Load the next potential metaslabs
          */
         for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
-                ASSERT3P(msp->ms_group, ==, mg);
-
                 /*
                  * We preload only the maximum number of metaslabs specified
                  * by metaslab_preload_limit. If a metaslab is being forced
                  * to condense then we preload it too. This will ensure
                  * that force condensing happens in the next txg.

@@ -2064,11 +2094,11 @@
  * 1. The size of the space map object should not dramatically increase as a
  * result of writing out the free space range tree.
  *
  * 2. The minimal on-disk space map representation is zfs_condense_pct/100
  * times the size than the free space range tree representation
- * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
+ * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1.MB).
  *
  * 3. The on-disk size of the space map should actually decrease.
  *
  * Checking the first condition is tricky since we don't want to walk
  * the entire AVL tree calculating the estimated on-disk size. Instead we

@@ -2161,11 +2191,11 @@
          * that have been freed in this txg, any deferred frees that exist,
          * and any allocation in the future. Removing segments should be
          * a relatively inexpensive operation since we expect these trees to
          * have a small number of nodes.
          */
-        condense_tree = range_tree_create(NULL, NULL);
+        condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
         range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
 
         /*
          * Remove what's been freed in this txg from the condense_tree.
          * Since we're in sync_pass 1, we know that all the frees from

@@ -2194,10 +2224,11 @@
          */
         msp->ms_condensing = B_TRUE;
 
         mutex_exit(&msp->ms_lock);
         space_map_truncate(sm, tx);
+        mutex_enter(&msp->ms_lock);
 
         /*
          * While we would ideally like to create a space map representation
          * that consists only of allocation records, doing so can be
          * prohibitively expensive because the in-core free tree can be

@@ -2210,11 +2241,10 @@
         space_map_write(sm, condense_tree, SM_ALLOC, tx);
         range_tree_vacate(condense_tree, NULL, NULL);
         range_tree_destroy(condense_tree);
 
         space_map_write(sm, msp->ms_tree, SM_FREE, tx);
-        mutex_enter(&msp->ms_lock);
         msp->ms_condensing = B_FALSE;
 }
 
 /*
  * Write a metaslab to disk in the context of the specified transaction group.

@@ -2230,15 +2260,18 @@
         dmu_tx_t *tx;
         uint64_t object = space_map_object(msp->ms_sm);
 
         ASSERT(!vd->vdev_ishole);
 
+        mutex_enter(&msp->ms_lock);
+
         /*
          * This metaslab has just been added so there's no work to do now.
          */
         if (msp->ms_freeingtree == NULL) {
                 ASSERT3P(alloctree, ==, NULL);
+                mutex_exit(&msp->ms_lock);
                 return;
         }
 
         ASSERT3P(alloctree, !=, NULL);
         ASSERT3P(msp->ms_freeingtree, !=, NULL);

@@ -2250,28 +2283,26 @@
          * is being forced to condense and it's loaded, we need to let it
          * through.
          */
         if (range_tree_space(alloctree) == 0 &&
             range_tree_space(msp->ms_freeingtree) == 0 &&
-            !(msp->ms_loaded && msp->ms_condense_wanted))
+            !(msp->ms_loaded && msp->ms_condense_wanted)) {
+                mutex_exit(&msp->ms_lock);
                 return;
+        }
 
 
         VERIFY(txg <= spa_final_dirty_txg(spa));
 
         /*
          * The only state that can actually be changing concurrently with
          * metaslab_sync() is the metaslab's ms_tree.  No other thread can
          * be modifying this txg's alloctree, freeingtree, freedtree, or
-         * space_map_phys_t.  We drop ms_lock whenever we could call
-         * into the DMU, because the DMU can call down to us
-         * (e.g. via zio_free()) at any time.
-         *
-         * The spa_vdev_remove_thread() can be reading metaslab state
-         * concurrently, and it is locked out by the ms_sync_lock.  Note
-         * that the ms_lock is insufficient for this, because it is dropped
-         * by space_map_write().
+         * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
+         * space map ASSERTs. We drop it whenever we call into the DMU,
+         * because the DMU can call down to us (e.g. via zio_free()) at
+         * any time.
          */
 
         tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
 
         if (msp->ms_sm == NULL) {

@@ -2279,17 +2310,15 @@
 
                 new_object = space_map_alloc(mos, tx);
                 VERIFY3U(new_object, !=, 0);
 
                 VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
-                    msp->ms_start, msp->ms_size, vd->vdev_ashift));
+                    msp->ms_start, msp->ms_size, vd->vdev_ashift,
+                    &msp->ms_lock));
                 ASSERT(msp->ms_sm != NULL);
         }
 
-        mutex_enter(&msp->ms_sync_lock);
-        mutex_enter(&msp->ms_lock);
-
         /*
          * Note: metaslab_condense() clears the space map's histogram.
          * Therefore we must verify and remove this histogram before
          * condensing.
          */

@@ -2299,19 +2328,17 @@
 
         if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
             metaslab_should_condense(msp)) {
                 metaslab_condense(msp, txg, tx);
         } else {
-                mutex_exit(&msp->ms_lock);
                 space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
                 space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
-                mutex_enter(&msp->ms_lock);
         }
 
         if (msp->ms_loaded) {
                 /*
-                 * When the space map is loaded, we have an accurate
+                 * When the space map is loaded, we have an accruate
                  * histogram in the range tree. This gives us an opportunity
                  * to bring the space map's histogram up-to-date so we clear
                  * it first before updating it.
                  */
                 space_map_histogram_clear(msp->ms_sm);

@@ -2375,11 +2402,10 @@
         if (object != space_map_object(msp->ms_sm)) {
                 object = space_map_object(msp->ms_sm);
                 dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
                     msp->ms_id, sizeof (uint64_t), &object, tx);
         }
-        mutex_exit(&msp->ms_sync_lock);
         dmu_tx_commit(tx);
 }
 
 /*
  * Called after a transaction group has completely synced to mark

@@ -2405,33 +2431,37 @@
          */
         if (msp->ms_freedtree == NULL) {
                 for (int t = 0; t < TXG_SIZE; t++) {
                         ASSERT(msp->ms_alloctree[t] == NULL);
 
-                        msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
+                        msp->ms_alloctree[t] = range_tree_create(NULL, msp,
+                            &msp->ms_lock);
                 }
 
                 ASSERT3P(msp->ms_freeingtree, ==, NULL);
-                msp->ms_freeingtree = range_tree_create(NULL, NULL);
+                msp->ms_freeingtree = range_tree_create(NULL, msp,
+                    &msp->ms_lock);
 
                 ASSERT3P(msp->ms_freedtree, ==, NULL);
-                msp->ms_freedtree = range_tree_create(NULL, NULL);
+                msp->ms_freedtree = range_tree_create(NULL, msp,
+                    &msp->ms_lock);
 
                 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                         ASSERT(msp->ms_defertree[t] == NULL);
 
-                        msp->ms_defertree[t] = range_tree_create(NULL, NULL);
+                        msp->ms_defertree[t] = range_tree_create(NULL, msp,
+                            &msp->ms_lock);
                 }
 
                 vdev_space_update(vd, 0, 0, msp->ms_size);
         }
 
         defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
 
         uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
             metaslab_class_get_alloc(spa_normal_class(spa));
-        if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
+        if (free_space <= spa_get_slop_space(spa)) {
                 defer_allowed = B_FALSE;
         }
 
         defer_delta = 0;
         alloc_delta = space_map_alloc_delta(msp->ms_sm);

@@ -2454,10 +2484,18 @@
          * Move the frees from the defer_tree back to the free
          * range tree (if it's loaded). Swap the freed_tree and the
          * defer_tree -- this is safe to do because we've just emptied out
          * the defer_tree.
          */
+        if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+            !vd->vdev_man_trimming) {
+                range_tree_walk(*defer_tree, metaslab_trim_add, msp);
+                if (!defer_allowed) {
+                        range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
+                            msp);
+                }
+        }
         range_tree_vacate(*defer_tree,
             msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
         if (defer_allowed) {
                 range_tree_swap(&msp->ms_freedtree, defer_tree);
         } else {

@@ -2497,37 +2535,23 @@
 
                 if (!metaslab_debug_unload)
                         metaslab_unload(msp);
         }
 
-        ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
-        ASSERT0(range_tree_space(msp->ms_freeingtree));
-        ASSERT0(range_tree_space(msp->ms_freedtree));
-
         mutex_exit(&msp->ms_lock);
 }
 
 void
 metaslab_sync_reassess(metaslab_group_t *mg)
 {
-        spa_t *spa = mg->mg_class->mc_spa;
-
-        spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
         metaslab_group_alloc_update(mg);
         mg->mg_fragmentation = metaslab_group_fragmentation(mg);
 
         /*
-         * Preload the next potential metaslabs but only on active
-         * metaslab groups. We can get into a state where the metaslab
-         * is no longer active since we dirty metaslabs as we remove a
-         * a device, thus potentially making the metaslab group eligible
-         * for preloading.
+         * Preload the next potential metaslabs
          */
-        if (mg->mg_activation_count > 0) {
                 metaslab_group_preload(mg);
-        }
-        spa_config_exit(spa, SCL_ALLOC, FTAG);
 }
 
 static uint64_t
 metaslab_distance(metaslab_t *msp, dva_t *dva)
 {

@@ -2717,10 +2741,11 @@
 
                 VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
                 VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
                 VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
                 range_tree_remove(rt, start, size);
+                metaslab_trim_remove(msp, start, size);
 
                 if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
                         vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
 
                 range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);

@@ -2738,11 +2763,12 @@
         return (start);
 }
 
 static uint64_t
 metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
-    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
+    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
+    int flags)
 {
         metaslab_t *msp = NULL;
         uint64_t offset = -1ULL;
         uint64_t activation_weight;
         uint64_t target_distance;

@@ -2759,10 +2785,11 @@
         metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
         search->ms_weight = UINT64_MAX;
         search->ms_start = 0;
         for (;;) {
                 boolean_t was_active;
+                boolean_t pass_primary = B_TRUE;
                 avl_tree_t *t = &mg->mg_metaslab_tree;
                 avl_index_t idx;
 
                 mutex_enter(&mg->mg_lock);
 

@@ -2796,25 +2823,36 @@
                          */
                         if (msp->ms_condensing)
                                 continue;
 
                         was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
-                        if (activation_weight == METASLAB_WEIGHT_PRIMARY)
+                        if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
+                                if (!pass_primary) {
+                                        DTRACE_PROBE(metaslab_use_secondary);
+                                        activation_weight =
+                                            METASLAB_WEIGHT_SECONDARY;
                                 break;
+                                }
 
+                                pass_primary = B_FALSE;
+                        } else {
+                                if (activation_weight ==
+                                    METASLAB_WEIGHT_PRIMARY)
+                                        break;
+
                         target_distance = min_distance +
                             (space_map_allocated(msp->ms_sm) != 0 ? 0 :
                             min_distance >> 1);
 
-                        for (i = 0; i < d; i++) {
+                                for (i = 0; i < d; i++)
                                 if (metaslab_distance(msp, &dva[i]) <
                                     target_distance)
                                         break;
-                        }
                         if (i == d)
                                 break;
                 }
+                }
                 mutex_exit(&mg->mg_lock);
                 if (msp == NULL) {
                         kmem_free(search, sizeof (*search));
                         return (-1ULL);
                 }

@@ -2931,17 +2969,18 @@
         return (offset);
 }
 
 static uint64_t
 metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
-    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
+    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
+    int d, int flags)
 {
         uint64_t offset;
         ASSERT(mg->mg_initialized);
 
         offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
-            min_distance, dva, d);
+            min_distance, dva, d, flags);
 
         mutex_enter(&mg->mg_lock);
         if (offset == -1ULL) {
                 mg->mg_failed_allocations++;
                 metaslab_trace_add(zal, mg, NULL, asize, d,

@@ -2975,11 +3014,11 @@
 int ditto_same_vdev_distance_shift = 3;
 
 /*
  * Allocate a block for the specified i/o.
  */
-int
+static int
 metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
     dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
     zio_alloc_list_t *zal)
 {
         metaslab_group_t *mg, *rotor;

@@ -3021,15 +3060,14 @@
         if (hintdva) {
                 vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
 
                 /*
                  * It's possible the vdev we're using as the hint no
-                 * longer exists or its mg has been closed (e.g. by
-                 * device removal).  Consult the rotor when
+                 * longer exists (i.e. removed). Consult the rotor when
                  * all else fails.
                  */
-                if (vd != NULL && vd->vdev_mg != NULL) {
+                if (vd != NULL) {
                         mg = vd->vdev_mg;
 
                         if (flags & METASLAB_HINTBP_AVOID &&
                             mg->mg_next != NULL)
                                 mg = mg->mg_next;

@@ -3120,11 +3158,11 @@
 
                 uint64_t asize = vdev_psize_to_asize(vd, psize);
                 ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
 
                 uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
-                    distance, dva, d);
+                    distance, dva, d, flags);
 
                 if (offset != -1ULL) {
                         /*
                          * If we've just selected this metaslab group,
                          * figure out whether the corresponding vdev is

@@ -3131,14 +3169,19 @@
                          * over- or under-used relative to the pool,
                          * and set an allocation bias to even it out.
                          */
                         if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
                                 vdev_stat_t *vs = &vd->vdev_stat;
-                                int64_t vu, cu;
+                                vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
+                                int64_t vu, cu, vu_io;
 
                                 vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
                                 cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
+                                vu_io =
+                                    (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
+                                    (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
+                                    (vd->vdev_parent->vdev_children)) - 100;
 
                                 /*
                                  * Calculate how much more or less we should
                                  * try to allocate from this device during
                                  * this iteration around the rotor.

@@ -3151,10 +3194,29 @@
                                  * This reduces allocations by 307K for this
                                  * iteration.
                                  */
                                 mg->mg_bias = ((cu - vu) *
                                     (int64_t)mg->mg_aliquot) / 100;
+
+                                /*
+                                 * Experiment: space-based DVA allocator 0,
+                                 * latency-based 1 or hybrid 2.
+                                 */
+                                switch (metaslab_alloc_dva_algorithm) {
+                                case 1:
+                                        mg->mg_bias =
+                                            (vu_io * (int64_t)mg->mg_aliquot) /
+                                            100;
+                                        break;
+                                case 2:
+                                        mg->mg_bias =
+                                            ((((cu - vu) + vu_io) / 2) *
+                                            (int64_t)mg->mg_aliquot) / 100;
+                                        break;
+                                default:
+                                        break;
+                                }
                         } else if (!metaslab_bias_enabled) {
                                 mg->mg_bias = 0;
                         }
 
                         if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=

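The hunk above derives vu (per-vdev space utilization), cu (class-wide utilization), and vu_io (the vdev's share of its parent's write I/O time relative to an even split across children), then selects one of three bias formulas based on metaslab_alloc_dva_algorithm. Below is a stand-alone sketch of just that arithmetic; example_bias() and the sample values in main() are illustrative, with aliquot mirroring mg->mg_aliquot.

/* Illustrative sketch only: the three bias formulas, all values in percent. */
#include <stdint.h>
#include <stdio.h>

static int64_t
example_bias(int64_t vu, int64_t cu, int64_t vu_io, int64_t aliquot,
    int algorithm)
{
        switch (algorithm) {
        case 1:         /* latency-based */
                return ((vu_io * aliquot) / 100);
        case 2:         /* hybrid: average of space and latency deltas */
                return (((((cu - vu) + vu_io) / 2) * aliquot) / 100);
        default:        /* 0 or anything else: space-based (cu - vu) */
                return (((cu - vu) * aliquot) / 100);
        }
}

int
main(void)
{
        int64_t aliquot = 512 << 10;    /* metaslab_aliquot default */

        for (int alg = 0; alg <= 2; alg++) {
                printf("algorithm %d: bias = %lld\n", alg,
                    (long long)example_bias(40, 55, -10, aliquot, alg));
        }
        return (0);
}
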
@@ -3165,10 +3227,12 @@
 
                         DVA_SET_VDEV(&dva[d], vd->vdev_id);
                         DVA_SET_OFFSET(&dva[d], offset);
                         DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
                         DVA_SET_ASIZE(&dva[d], asize);
+                        DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
+                            uint64_t, offset, uint64_t, psize);
 
                         return (0);
                 }
 next:
                 mc->mc_rotor = mg->mg_next;

@@ -3187,232 +3251,27 @@
 
         metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
         return (SET_ERROR(ENOSPC));
 }
 
-void
-metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
-    uint64_t txg)
-{
-        metaslab_t *msp;
-        spa_t *spa = vd->vdev_spa;
-
-        ASSERT3U(txg, ==, spa->spa_syncing_txg);
-        ASSERT(vdev_is_concrete(vd));
-        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
-        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
-
-        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
-        VERIFY(!msp->ms_condensing);
-        VERIFY3U(offset, >=, msp->ms_start);
-        VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
-        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
-        VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
-
-        metaslab_check_free_impl(vd, offset, asize);
-        mutex_enter(&msp->ms_lock);
-        if (range_tree_space(msp->ms_freeingtree) == 0) {
-                vdev_dirty(vd, VDD_METASLAB, msp, txg);
-        }
-        range_tree_add(msp->ms_freeingtree, offset, asize);
-        mutex_exit(&msp->ms_lock);
-}
-
-/* ARGSUSED */
-void
-metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-    uint64_t size, void *arg)
-{
-        uint64_t *txgp = arg;
-
-        if (vd->vdev_ops->vdev_op_remap != NULL)
-                vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
-        else
-                metaslab_free_impl(vd, offset, size, *txgp);
-}
-
-static void
-metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
-    uint64_t txg)
-{
-        spa_t *spa = vd->vdev_spa;
-
-        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
-
-        if (txg > spa_freeze_txg(spa))
-                return;
-
-        if (spa->spa_vdev_removal != NULL &&
-            spa->spa_vdev_removal->svr_vdev == vd &&
-            vdev_is_concrete(vd)) {
-                /*
-                 * Note: we check if the vdev is concrete because when
-                 * we complete the removal, we first change the vdev to be
-                 * an indirect vdev (in open context), and then (in syncing
-                 * context) clear spa_vdev_removal.
-                 */
-                free_from_removing_vdev(vd, offset, size, txg);
-        } else if (vd->vdev_ops->vdev_op_remap != NULL) {
-                vdev_indirect_mark_obsolete(vd, offset, size, txg);
-                vd->vdev_ops->vdev_op_remap(vd, offset, size,
-                    metaslab_free_impl_cb, &txg);
-        } else {
-                metaslab_free_concrete(vd, offset, size, txg);
-        }
-}
-
-typedef struct remap_blkptr_cb_arg {
-        blkptr_t *rbca_bp;
-        spa_remap_cb_t rbca_cb;
-        vdev_t *rbca_remap_vd;
-        uint64_t rbca_remap_offset;
-        void *rbca_cb_arg;
-} remap_blkptr_cb_arg_t;
-
-void
-remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-    uint64_t size, void *arg)
-{
-        remap_blkptr_cb_arg_t *rbca = arg;
-        blkptr_t *bp = rbca->rbca_bp;
-
-        /* We can not remap split blocks. */
-        if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
-                return;
-        ASSERT0(inner_offset);
-
-        if (rbca->rbca_cb != NULL) {
-                /*
-                 * At this point we know that we are not handling split
-                 * blocks and we invoke the callback on the previous
-                 * vdev which must be indirect.
-                 */
-                ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
-
-                rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
-                    rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
-
-                /* set up remap_blkptr_cb_arg for the next call */
-                rbca->rbca_remap_vd = vd;
-                rbca->rbca_remap_offset = offset;
-        }
-
-        /*
-         * The phys birth time is that of dva[0].  This ensures that we know
-         * when each dva was written, so that resilver can determine which
-         * blocks need to be scrubbed (i.e. those written during the time
-         * the vdev was offline).  It also ensures that the key used in
-         * the ARC hash table is unique (i.e. dva[0] + phys_birth).  If
-         * we didn't change the phys_birth, a lookup in the ARC for a
-         * remapped BP could find the data that was previously stored at
-         * this vdev + offset.
-         */
-        vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
-            DVA_GET_VDEV(&bp->blk_dva[0]));
-        vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
-        bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
-            DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
-
-        DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
-        DVA_SET_OFFSET(&bp->blk_dva[0], offset);
-}
-
 /*
- * If the block pointer contains any indirect DVAs, modify them to refer to
- * concrete DVAs.  Note that this will sometimes not be possible, leaving
- * the indirect DVA in place.  This happens if the indirect DVA spans multiple
- * segments in the mapping (i.e. it is a "split block").
- *
- * If the BP was remapped, calls the callback on the original dva (note the
- * callback can be called multiple times if the original indirect DVA refers
- * to another indirect DVA, etc).
- *
- * Returns TRUE if the BP was remapped.
+ * Free the block represented by DVA in the context of the specified
+ * transaction group.
  */
-boolean_t
-spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
-{
-        remap_blkptr_cb_arg_t rbca;
-
-        if (!zfs_remap_blkptr_enable)
-                return (B_FALSE);
-
-        if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
-                return (B_FALSE);
-
-        /*
-         * Dedup BP's can not be remapped, because ddt_phys_select() depends
-         * on DVA[0] being the same in the BP as in the DDT (dedup table).
-         */
-        if (BP_GET_DEDUP(bp))
-                return (B_FALSE);
-
-        /*
-         * Gang blocks can not be remapped, because
-         * zio_checksum_gang_verifier() depends on the DVA[0] that's in
-         * the BP used to read the gang block header (GBH) being the same
-         * as the DVA[0] that we allocated for the GBH.
-         */
-        if (BP_IS_GANG(bp))
-                return (B_FALSE);
-
-        /*
-         * Embedded BP's have no DVA to remap.
-         */
-        if (BP_GET_NDVAS(bp) < 1)
-                return (B_FALSE);
-
-        /*
-         * Note: we only remap dva[0].  If we remapped other dvas, we
-         * would no longer know what their phys birth txg is.
-         */
-        dva_t *dva = &bp->blk_dva[0];
-
-        uint64_t offset = DVA_GET_OFFSET(dva);
-        uint64_t size = DVA_GET_ASIZE(dva);
-        vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
-
-        if (vd->vdev_ops->vdev_op_remap == NULL)
-                return (B_FALSE);
-
-        rbca.rbca_bp = bp;
-        rbca.rbca_cb = callback;
-        rbca.rbca_remap_vd = vd;
-        rbca.rbca_remap_offset = offset;
-        rbca.rbca_cb_arg = arg;
-
-        /*
-         * remap_blkptr_cb() will be called in order for each level of
-         * indirection, until a concrete vdev is reached or a split block is
-         * encountered. old_vd and old_offset are updated within the callback
-         * as we go from the one indirect vdev to the next one (either concrete
-         * or indirect again) in that order.
-         */
-        vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
-
-        /* Check if the DVA wasn't remapped because it is a split block */
-        if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
-                return (B_FALSE);
-
-        return (B_TRUE);
-}
-
-/*
- * Undo the allocation of a DVA which happened in the given transaction group.
- */
 void
-metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
+metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
 {
-        metaslab_t *msp;
-        vdev_t *vd;
         uint64_t vdev = DVA_GET_VDEV(dva);
         uint64_t offset = DVA_GET_OFFSET(dva);
         uint64_t size = DVA_GET_ASIZE(dva);
+        vdev_t *vd;
+        metaslab_t *msp;
 
+        DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
+            uint64_t, offset, uint64_t, size);
+
         ASSERT(DVA_IS_VALID(dva));
-        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
 
         if (txg > spa_freeze_txg(spa))
                 return;
 
         if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||

@@ -3421,21 +3280,18 @@
                     (u_longlong_t)vdev, (u_longlong_t)offset);
                 ASSERT(0);
                 return;
         }
 
-        ASSERT(!vd->vdev_removing);
-        ASSERT(vdev_is_concrete(vd));
-        ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
-        ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
+        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
 
         if (DVA_GET_GANG(dva))
                 size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
 
-        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
         mutex_enter(&msp->ms_lock);
+
+        if (now) {
         range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
             offset, size);
 
         VERIFY(!msp->ms_condensing);
         VERIFY3U(offset, >=, msp->ms_start);

@@ -3443,33 +3299,80 @@
         VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
             msp->ms_size);
         VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
         VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
         range_tree_add(msp->ms_tree, offset, size);
+                if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+                    !vd->vdev_man_trimming)
+                        metaslab_trim_add(msp, offset, size);
+                msp->ms_max_size = metaslab_block_maxsize(msp);
+        } else {
+                VERIFY3U(txg, ==, spa->spa_syncing_txg);
+                if (range_tree_space(msp->ms_freeingtree) == 0)
+                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
+                range_tree_add(msp->ms_freeingtree, offset, size);
+        }
+
         mutex_exit(&msp->ms_lock);
 }
 
 /*
- * Free the block represented by DVA in the context of the specified
- * transaction group.
+ * Intent log support: upon opening the pool after a crash, notify the SPA
+ * of blocks that the intent log has allocated for immediate write, but
+ * which are still considered free by the SPA because the last transaction
+ * group didn't commit yet.
  */
-void
-metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
+static int
+metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
 {
         uint64_t vdev = DVA_GET_VDEV(dva);
         uint64_t offset = DVA_GET_OFFSET(dva);
         uint64_t size = DVA_GET_ASIZE(dva);
-        vdev_t *vd = vdev_lookup_top(spa, vdev);
+        vdev_t *vd;
+        metaslab_t *msp;
+        int error = 0;
 
         ASSERT(DVA_IS_VALID(dva));
-        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
 
-        if (DVA_GET_GANG(dva)) {
+        if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
+            (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
+                return (SET_ERROR(ENXIO));
+
+        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
+
+        if (DVA_GET_GANG(dva))
                 size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
+
+        mutex_enter(&msp->ms_lock);
+
+        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
+                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
+
+        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
+                error = SET_ERROR(ENOENT);
+
+        if (error || txg == 0) {        /* txg == 0 indicates dry run */
+                mutex_exit(&msp->ms_lock);
+                return (error);
         }
 
-        metaslab_free_impl(vd, offset, size, txg);
+        VERIFY(!msp->ms_condensing);
+        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
+        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
+        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
+        range_tree_remove(msp->ms_tree, offset, size);
+        metaslab_trim_remove(msp, offset, size);
+
+        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
+                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
+                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
+                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
+        }
+
+        mutex_exit(&msp->ms_lock);
+
+        return (0);
 }
 
 /*
  * Reserve some allocation slots. The reservation system must be called
  * before we call into the allocator. If there aren't any available slots

@@ -3516,127 +3419,11 @@
                 (void) refcount_remove(&mc->mc_alloc_slots, zio);
         }
         mutex_exit(&mc->mc_lock);
 }
 
-static int
-metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
-    uint64_t txg)
-{
-        metaslab_t *msp;
-        spa_t *spa = vd->vdev_spa;
-        int error = 0;
-
-        if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
-                return (ENXIO);
-
-        ASSERT3P(vd->vdev_ms, !=, NULL);
-        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
-        mutex_enter(&msp->ms_lock);
-
-        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
-                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
-
-        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
-                error = SET_ERROR(ENOENT);
-
-        if (error || txg == 0) {        /* txg == 0 indicates dry run */
-                mutex_exit(&msp->ms_lock);
-                return (error);
-        }
-
-        VERIFY(!msp->ms_condensing);
-        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
-        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
-        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
-        range_tree_remove(msp->ms_tree, offset, size);
-
-        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
-                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
-                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
-                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
-        }
-
-        mutex_exit(&msp->ms_lock);
-
-        return (0);
-}
-
-typedef struct metaslab_claim_cb_arg_t {
-        uint64_t        mcca_txg;
-        int             mcca_error;
-} metaslab_claim_cb_arg_t;
-
-/* ARGSUSED */
-static void
-metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-    uint64_t size, void *arg)
-{
-        metaslab_claim_cb_arg_t *mcca_arg = arg;
-
-        if (mcca_arg->mcca_error == 0) {
-                mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
-                    size, mcca_arg->mcca_txg);
-        }
-}
-
 int
-metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
-{
-        if (vd->vdev_ops->vdev_op_remap != NULL) {
-                metaslab_claim_cb_arg_t arg;
-
-                /*
-                 * Only zdb(1M) can claim on indirect vdevs.  This is used
-                 * to detect leaks of mapped space (that are not accounted
-                 * for in the obsolete counts, spacemap, or bpobj).
-                 */
-                ASSERT(!spa_writeable(vd->vdev_spa));
-                arg.mcca_error = 0;
-                arg.mcca_txg = txg;
-
-                vd->vdev_ops->vdev_op_remap(vd, offset, size,
-                    metaslab_claim_impl_cb, &arg);
-
-                if (arg.mcca_error == 0) {
-                        arg.mcca_error = metaslab_claim_concrete(vd,
-                            offset, size, txg);
-                }
-                return (arg.mcca_error);
-        } else {
-                return (metaslab_claim_concrete(vd, offset, size, txg));
-        }
-}
-
-/*
- * Intent log support: upon opening the pool after a crash, notify the SPA
- * of blocks that the intent log has allocated for immediate write, but
- * which are still considered free by the SPA because the last transaction
- * group didn't commit yet.
- */
-static int
-metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
-{
-        uint64_t vdev = DVA_GET_VDEV(dva);
-        uint64_t offset = DVA_GET_OFFSET(dva);
-        uint64_t size = DVA_GET_ASIZE(dva);
-        vdev_t *vd;
-
-        if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
-                return (SET_ERROR(ENXIO));
-        }
-
-        ASSERT(DVA_IS_VALID(dva));
-
-        if (DVA_GET_GANG(dva))
-                size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
-
-        return (metaslab_claim_impl(vd, offset, size, txg));
-}
-
-int
 metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
     int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
     zio_alloc_list_t *zal, zio_t *zio)
 {
         dva_t *dva = bp->blk_dva;

@@ -3656,16 +3443,60 @@
         ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
         ASSERT(BP_GET_NDVAS(bp) == 0);
         ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
         ASSERT3P(zal, !=, NULL);
 
+        if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
+            !(flags & (METASLAB_GANG_HEADER)) &&
+            !(spa->spa_meta_policy.spa_small_data_to_special &&
+            psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
+                error = metaslab_alloc_dva(spa, spa_normal_class(spa),
+                    psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
+                    flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
+                if (error == 0) {
+                        error = metaslab_alloc_dva(spa, mc, psize,
+                            &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
+                        if (error != 0) {
+                                error = 0;
+                                /*
+                                 * Move the NORMAL DVA into the SPECIAL
+                                 * slot and clear the second DVA. After
+                                 * that, this BP is just a regular BP
+                                 * with one DVA.
+                                 *
+                                 * This operation is valid only if:
+                                 * WBC_SPECIAL_DVA is dva[0]
+                                 * WBC_NORMAL_DVA is dva[1]
+                                 *
+                                 * see wbc.h
+                                 */
+                                bcopy(&dva[WBC_NORMAL_DVA],
+                                    &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
+                                bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
+
+                                /*
+                                 * Allocation of the special DVA has
+                                 * failed, so this BP will be a regular
+                                 * BP and we need to update the metaslab
+                                 * group's queue depth based on the newly
+                                 * allocated DVA.
+                                 */
+                                metaslab_group_alloc_increment(spa,
+                                    DVA_GET_VDEV(&dva[0]), zio, flags);
+                        } else {
+                                BP_SET_SPECIAL(bp, 1);
+                        }
+                } else {
+                        spa_config_exit(spa, SCL_ALLOC, FTAG);
+                        return (error);
+                }
+        } else {
         for (int d = 0; d < ndvas; d++) {
-                error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
-                    txg, flags, zal);
+                        error = metaslab_alloc_dva(spa, mc, psize, dva, d,
+                            hintdva, txg, flags, zal);
                 if (error != 0) {
                         for (d--; d >= 0; d--) {
-                                metaslab_unalloc_dva(spa, &dva[d], txg);
+                                        metaslab_free_dva(spa, &dva[d],
+                                            txg, B_TRUE);
                                 metaslab_group_alloc_decrement(spa,
                                     DVA_GET_VDEV(&dva[d]), zio, flags);
                                 bzero(&dva[d], sizeof (dva_t));
                         }
                         spa_config_exit(spa, SCL_ALLOC, FTAG);

@@ -3676,14 +3507,14 @@
                          * based on the newly allocated dva.
                          */
                         metaslab_group_alloc_increment(spa,
                             DVA_GET_VDEV(&dva[d]), zio, flags);
                 }
-
         }
-        ASSERT(error == 0);
         ASSERT(BP_GET_NDVAS(bp) == ndvas);
+        }
+        ASSERT(error == 0);
 
         spa_config_exit(spa, SCL_ALLOC, FTAG);
 
         BP_SET_BIRTH(bp, txg, txg);
 

@@ -3699,17 +3530,32 @@
         ASSERT(!BP_IS_HOLE(bp));
         ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
 
         spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
 
-        for (int d = 0; d < ndvas; d++) {
-                if (now) {
-                        metaslab_unalloc_dva(spa, &dva[d], txg);
+        if (BP_IS_SPECIAL(bp)) {
+                int start_dva;
+                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
+
+                mutex_enter(&wbc_data->wbc_lock);
+                start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
+                mutex_exit(&wbc_data->wbc_lock);
+
+                /*
+                 * The actual freeing does not need to be done under the
+                 * lock: the block has already been removed from the WBC
+                 * trees and thus will not be moved.
+                 */
+                metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
+                if (start_dva == 0) {
+                        metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
+                            txg, now);
+                }
                 } else {
-                        metaslab_free_dva(spa, &dva[d], txg);
+                for (int d = 0; d < ndvas; d++)
+                        metaslab_free_dva(spa, &dva[d], txg, now);
                 }
-        }
 
         spa_config_exit(spa, SCL_FREE, FTAG);
 }
 
 int

@@ -3730,81 +3576,346 @@
                         return (error);
         }
 
         spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
 
+        if (BP_IS_SPECIAL(bp)) {
+                int start_dva;
+                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
+
+                mutex_enter(&wbc_data->wbc_lock);
+                start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
+
+                /*
+                 * Actual claiming must be done under the lock for WBC blocks
+                 * to ensure that zdb does not fail. The only other user of
+                 * claiming is the ZIL, whose blocks cannot be WBC ones, so
+                 * the lock is not held for them.
+                 */
+                error = metaslab_claim_dva(spa,
+                    &dva[WBC_NORMAL_DVA], txg);
+                if (error == 0 && start_dva == 0) {
+                        error = metaslab_claim_dva(spa,
+                            &dva[WBC_SPECIAL_DVA], txg);
+                }
+
+                mutex_exit(&wbc_data->wbc_lock);
+        } else {
         for (int d = 0; d < ndvas; d++)
-                if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
+                        if ((error = metaslab_claim_dva(spa,
+                            &dva[d], txg)) != 0)
                         break;
+        }
 
         spa_config_exit(spa, SCL_ALLOC, FTAG);
 
         ASSERT(error == 0 || txg == 0);
 
         return (error);
 }
 
-/* ARGSUSED */
-static void
-metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
-    uint64_t size, void *arg)
+void
+metaslab_check_free(spa_t *spa, const blkptr_t *bp)
 {
-        if (vd->vdev_ops == &vdev_indirect_ops)
-                return;
-
-        metaslab_check_free_impl(vd, offset, size);
-}
-
-static void
-metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
-{
-        metaslab_t *msp;
-        spa_t *spa = vd->vdev_spa;
-
         if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
                 return;
 
-        if (vd->vdev_ops->vdev_op_remap != NULL) {
-                vd->vdev_ops->vdev_op_remap(vd, offset, size,
-                    metaslab_check_free_impl_cb, NULL);
+        if (BP_IS_SPECIAL(bp)) {
+                /* Do not check frees for WBC blocks */
                 return;
         }
 
-        ASSERT(vdev_is_concrete(vd));
-        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
-        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
+        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
+        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
+                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
+                vdev_t *vd = vdev_lookup_top(spa, vdev);
+                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
+                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
+                metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
 
-        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
-
-        mutex_enter(&msp->ms_lock);
-        if (msp->ms_loaded)
+                if (msp->ms_loaded) {
                 range_tree_verify(msp->ms_tree, offset, size);
+                        range_tree_verify(msp->ms_cur_ts->ts_tree,
+                            offset, size);
+                        if (msp->ms_prev_ts != NULL) {
+                                range_tree_verify(msp->ms_prev_ts->ts_tree,
+                                    offset, size);
+                        }
+                }
 
         range_tree_verify(msp->ms_freeingtree, offset, size);
         range_tree_verify(msp->ms_freedtree, offset, size);
         for (int j = 0; j < TXG_DEFER_SIZE; j++)
                 range_tree_verify(msp->ms_defertree[j], offset, size);
+        }
+        spa_config_exit(spa, SCL_VDEV, FTAG);
+}
+
+/*
+ * Trims all free space in the metaslab. Returns the root TRIM zio (that the
+ * caller should zio_wait() for) and the amount of space in the metaslab that
+ * has been scheduled for trimming in the `delta' return argument. Returns
+ * NULL if the metaslab could not be loaded.
+ */
+zio_t *
+metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
+{
+        boolean_t was_loaded;
+        uint64_t trimmed_space;
+        zio_t *trim_io;
+
+        ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
+
+        mutex_enter(&msp->ms_lock);
+
+        while (msp->ms_loading)
+                metaslab_load_wait(msp);
+        /* If we loaded the metaslab, unload it when we're done. */
+        was_loaded = msp->ms_loaded;
+        if (!was_loaded) {
+                if (metaslab_load(msp) != 0) {
         mutex_exit(&msp->ms_lock);
+                        return (NULL);
+                }
+        }
+        /* Flush out any scheduled extents and add everything in ms_tree. */
+        range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
+        range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
+
+        /* Force this trim to take place ASAP. */
+        if (msp->ms_prev_ts != NULL)
+                metaslab_free_trimset(msp->ms_prev_ts);
+        msp->ms_prev_ts = msp->ms_cur_ts;
+        msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
+        trimmed_space = range_tree_space(msp->ms_tree);
+        if (!was_loaded)
+                metaslab_unload(msp);
+
+        trim_io = metaslab_exec_trim(msp);
+        mutex_exit(&msp->ms_lock);
+        *delta = trimmed_space;
+
+        return (trim_io);
 }
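
Editor's sketch, not part of the patch above: a minimal illustration of the
calling convention described in the comment before metaslab_trim_all(),
assuming the declarations in this file. example_trim_vdev is a made-up name;
the real manual-trim driver lives elsewhere and also accounts for the trimmed
bytes returned in delta.

/*
 * Sketch only: walk every metaslab of a top-level vdev, schedule a full
 * trim of its free space and wait for the resulting TRIM zio. Error and
 * progress accounting are omitted.
 */
static void
example_trim_vdev(vdev_t *vd)
{
        for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
                uint64_t delta = 0;
                zio_t *zio = metaslab_trim_all(vd->vdev_ms[m], &delta);

                if (zio != NULL)
                        (void) zio_wait(zio);
        }
}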
 
+/*
+ * Notifies the trimsets in a metaslab that an extent has been allocated.
+ * This removes the segment from the queues of extents waiting to be trimmed.
+ */
+static void
+metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
+{
+        metaslab_t *msp = arg;
+
+        range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
+        if (msp->ms_prev_ts != NULL) {
+                range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
+                    size);
+        }
+}
+
+/*
+ * Notifies the trimsets in a metaslab that an extent has been freed.
+ * This adds the segment to the currently open queue of extents waiting
+ * to be trimmed.
+ */
+static void
+metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
+{
+        metaslab_t *msp = arg;
+        ASSERT(msp->ms_cur_ts != NULL);
+        range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
+}
+
+/*
+ * Performs a metaslab's automatic trim processing. This must be called
+ * from metaslab_sync with the currently syncing txg number. This function
+ * issues trims at intervals dictated by the zfs_txgs_per_trim tunable.
+ */
 void
-metaslab_check_free(spa_t *spa, const blkptr_t *bp)
+metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
 {
-        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
-                return;
+        /* snapshot the tunable so a consistent value is used throughout */
+        uint64_t txgs_per_trim = zfs_txgs_per_trim;
 
-        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
-        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
-                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
-                vdev_t *vd = vdev_lookup_top(spa, vdev);
-                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
-                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
+        ASSERT(!MUTEX_HELD(&msp->ms_lock));
+        mutex_enter(&msp->ms_lock);
 
-                if (DVA_GET_GANG(&bp->blk_dva[i]))
-                        size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
+        /*
+         * Since we typically have hundreds of metaslabs per vdev, but we only
+         * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
+         * could sequence the TRIM commands from all metaslabs so that they
+         * don't all pound the device in the same txg. We do so by
+         * artificially inflating the birth txg of the first trim set by a
+         * sequence number derived from the metaslab's starting offset
+         * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
+         * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
+         *
+         * If we detect that the txg has advanced too far ahead of ts_birth,
+         * it means our birth txg is out of lockstep. Recompute it by
+         * rounding down to the nearest zfs_txgs_per_trim multiple and adding
+         * our metaslab id modulo zfs_txgs_per_trim.
+         */
+        if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
+                msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
+                    txgs_per_trim + (msp->ms_id % txgs_per_trim);
+        }
 
-                ASSERT3P(vd, !=, NULL);
+        /* Time to swap out the current and previous trimsets */
+        if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
+                if (msp->ms_prev_ts != NULL) {
+                        if (msp->ms_trimming_ts != NULL) {
+                                spa_t *spa = msp->ms_group->mg_class->mc_spa;
+                                /*
+                                 * The previous trim run is still ongoing, so
+                                 * the device is reacting slowly to our trim
+                                 * requests. Drop this trimset, so as not to
+                                 * back the device up with trim requests.
+                                 */
+                                spa_trimstats_auto_slow_incr(spa);
+                                metaslab_free_trimset(msp->ms_prev_ts);
+                        } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
+                                /*
+                                 * If a manual trim is ongoing, we want to
+                                 * inhibit autotrim temporarily so it doesn't
+                                 * slow down the manual trim.
+                                 */
+                                metaslab_free_trimset(msp->ms_prev_ts);
+                        } else {
+                                /*
+                                 * Trim out aged extents on the vdevs - these
+                                 * are safe to be destroyed now. We'll keep
+                                 * the trimset around to deny allocations from
+                                 * these regions while the trims are ongoing.
+                                 */
+                                zio_nowait(metaslab_exec_trim(msp));
+                        }
+                }
+                msp->ms_prev_ts = msp->ms_cur_ts;
+                msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
+        }
+        mutex_exit(&msp->ms_lock);
+}
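
Editor's sketch, not part of the patch above: a standalone model of the
staggering arithmetic described in the comment inside metaslab_auto_trim().
It only demonstrates the phase computation (metaslab id modulo
zfs_txgs_per_trim); the real code additionally tracks ts_birth so a metaslab
that falls behind is resynchronized. The 200 and 32 figures are the defaults
quoted in that comment.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        const uint64_t txgs_per_trim = 32;      /* default zfs_txgs_per_trim */
        const uint64_t ms_count = 200;          /* typical metaslabs per vdev */
        const uint64_t txg = 1000;              /* arbitrary syncing txg */
        uint64_t n = 0;

        for (uint64_t id = 0; id < ms_count; id++) {
                /* a metaslab swaps trimsets when its phase matches the txg */
                if ((id % txgs_per_trim) == (txg % txgs_per_trim))
                        n++;
        }
        /* prints 6 of 200, i.e. roughly 200/32 ~= 6.25 on average */
        printf("%llu of %llu metaslabs trim in txg %llu\n",
            (unsigned long long)n, (unsigned long long)ms_count,
            (unsigned long long)txg);
        return (0);
}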
 
-                metaslab_check_free_impl(vd, offset, size);
+static void
+metaslab_trim_done(zio_t *zio)
+{
+        metaslab_t *msp = zio->io_private;
+        boolean_t held;
+
+        ASSERT(msp != NULL);
+        ASSERT(msp->ms_trimming_ts != NULL);
+        held = MUTEX_HELD(&msp->ms_lock);
+        if (!held)
+                mutex_enter(&msp->ms_lock);
+        metaslab_free_trimset(msp->ms_trimming_ts);
+        msp->ms_trimming_ts = NULL;
+        cv_signal(&msp->ms_trim_cv);
+        if (!held)
+                mutex_exit(&msp->ms_lock);
+}
+
+/*
+ * Executes a zio_trim on a range tree holding freed extents in the metaslab.
+ */
+static zio_t *
+metaslab_exec_trim(metaslab_t *msp)
+{
+        metaslab_group_t *mg = msp->ms_group;
+        spa_t *spa = mg->mg_class->mc_spa;
+        vdev_t *vd = mg->mg_vd;
+        range_tree_t *trim_tree;
+        zio_t *zio;
+
+        ASSERT(MUTEX_HELD(&msp->ms_lock));
+
+        /* wait for a preceding trim to finish */
+        while (msp->ms_trimming_ts != NULL)
+                cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
+        msp->ms_trimming_ts = msp->ms_prev_ts;
+        msp->ms_prev_ts = NULL;
+        trim_tree = msp->ms_trimming_ts->ts_tree;
+#ifdef  DEBUG
+        if (msp->ms_loaded) {
+                for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
+                    rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
+                        if (!range_tree_contains(msp->ms_tree,
+                            rs->rs_start, rs->rs_end - rs->rs_start)) {
+                                panic("trimming allocated region; rs=%p",
+                                    (void *)rs);
         }
-        spa_config_exit(spa, SCL_VDEV, FTAG);
+                }
+        }
+#endif
+
+        /* Nothing to trim */
+        if (range_tree_space(trim_tree) == 0) {
+                metaslab_free_trimset(msp->ms_trimming_ts);
+                msp->ms_trimming_ts = NULL;
+                return (zio_root(spa, NULL, NULL, 0));
+        }
+        zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
+            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
+            ZIO_FLAG_CONFIG_WRITER, msp);
+
+        return (zio);
+}
+
+/*
+ * Allocates and initializes a new trimset structure. The `txg' argument
+ * indicates when this trimset was born and `lock' is the lock that the
+ * trimset's range tree will use.
+ */
+static metaslab_trimset_t *
+metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
+{
+        metaslab_trimset_t *ts;
+
+        ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
+        ts->ts_birth = txg;
+        ts->ts_tree = range_tree_create(NULL, NULL, lock);
+
+        return (ts);
+}
+
+/*
+ * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
+ */
+static void
+metaslab_free_trimset(metaslab_trimset_t *ts)
+{
+        range_tree_vacate(ts->ts_tree, NULL, NULL);
+        range_tree_destroy(ts->ts_tree);
+        kmem_free(ts, sizeof (*ts));
+}
+
+/*
+ * Checks whether an allocation conflicts with an ongoing trim operation in
+ * the given metaslab. This function takes a segment of `size' bytes starting
+ * at `*offset' and checks whether it overlaps any region in the metaslab
+ * that is currently being trimmed. If so, it tries to adjust the allocation
+ * to the end of
+ * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
+ * `limit' (no part of the allocation is allowed to go past this point).
+ *
+ * Returns B_FALSE if either the original allocation wasn't in conflict, or
+ * the conflict could be resolved by adjusting the value stored in `offset'
+ * such that the whole allocation still fits below `limit'. Returns B_TRUE
+ * if the allocation conflict couldn't be resolved.
+ */
+static boolean_t
+metaslab_check_trim_conflict(metaslab_t *msp, uint64_t *offset, uint64_t size,
+    uint64_t align, uint64_t limit)
+{
+        uint64_t new_offset;
+
+        if (msp->ms_trimming_ts == NULL)
+                /* no trim conflict, original offset is OK */
+                return (B_FALSE);
+
+        new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
+            *offset, size), align);
+        if (new_offset != *offset && new_offset + size > limit)
+                /* trim conflict and adjustment not possible */
+                return (B_TRUE);
+
+        /* trim conflict, but adjusted offset still within limit */
+        *offset = new_offset;
+        return (B_FALSE);
 }
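
Editor's sketch, not part of the patch above: a standalone model of the offset
adjustment performed by metaslab_check_trim_conflict(). The trim_end value
stands in for what range_tree_find_gap() would return (the end of the
overlapping region being trimmed), the P2ROUNDUP macro is defined locally,
and all numbers are made up for illustration.

#include <stdint.h>
#include <stdio.h>

/* round x up to the next multiple of the power-of-2 `align' */
#define P2ROUNDUP(x, align)     ((((x) - 1) | ((align) - 1)) + 1)

int
main(void)
{
        uint64_t offset = 0x20000;      /* proposed allocation start */
        uint64_t size = 0x2000;         /* 8K allocation */
        uint64_t align = 0x1000;        /* 4K alignment */
        uint64_t limit = 0x30000;       /* allocation must end at or below */
        uint64_t trim_end = 0x23000;    /* end of the region being trimmed */
        uint64_t new_offset = P2ROUNDUP(trim_end, align);

        if (new_offset != offset && new_offset + size > limit) {
                printf("conflict cannot be resolved\n");
        } else {
                offset = new_offset;    /* adjusted past the trimmed region */
                printf("allocation moved to 0x%llx\n",
                    (unsigned long long)offset);
        }
        return (0);
}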