NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrectly
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
 * refactored WRC move-logic to use zio kmem_caches
 * replaced the size and compression fields with a single blk_prop field
   (as in blkptr_t) to slightly reduce the size of wrc_block_t
   and to reuse blkptr_t-style macros for PSIZE, LSIZE
   and COMPRESSION (a minimal packing sketch follows this list)
 * reduced the number of atomic calls to lower CPU overhead
 * removed unused code
 * fixed variable naming
 * fixed a possible system panic after restarting the system
   with WRC enabled
 * fixed a race that caused a system panic
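
The blk_prop packing mentioned above can be pictured with a small stand-alone sketch. The field widths, shift positions and helper names below are invented for illustration and do not match the real blk_prop layout in blkptr_t or the actual wrc_block_t definition; the point is only how LSIZE, PSIZE and COMPRESSION can travel in one 64-bit word with shift-and-mask accessors.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Illustrative only: pack LSIZE, PSIZE and COMPRESSION into a single
     * 64-bit "prop" word, in the spirit of blkptr_t's blk_prop.  The
     * field widths and positions are made up for this sketch.
     */
    static inline uint64_t
    toy_prop_pack(uint64_t lsize, uint64_t psize, uint64_t comp)
    {
            return ((lsize & 0xffffffULL) |
                ((psize & 0xffffffULL) << 24) |
                ((comp & 0xffULL) << 48));
    }

    static inline uint64_t toy_prop_lsize(uint64_t p) { return (p & 0xffffffULL); }
    static inline uint64_t toy_prop_psize(uint64_t p) { return ((p >> 24) & 0xffffffULL); }
    static inline uint64_t toy_prop_comp(uint64_t p)  { return ((p >> 48) & 0xffULL); }

    int
    main(void)
    {
            uint64_t prop = toy_prop_pack(131072, 4096, 2);

            printf("lsize=%llu psize=%llu comp=%llu\n",
                (unsigned long long)toy_prop_lsize(prop),
                (unsigned long long)toy_prop_psize(prop),
                (unsigned long long)toy_prop_comp(prop));
            return (0);
    }
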
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures
        
*** 21,30 ****
--- 21,31 ----
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
   * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
+  * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
   */
  
  #include <sys/zfs_context.h>
  #include <sys/dmu.h>
  #include <sys/dmu_tx.h>
*** 32,42 ****
  #include <sys/metaslab_impl.h>
  #include <sys/vdev_impl.h>
  #include <sys/zio.h>
  #include <sys/spa_impl.h>
  #include <sys/zfeature.h>
! #include <sys/vdev_indirect_mapping.h>
  
  #define GANG_ALLOCATION(flags) \
          ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
  
  uint64_t metaslab_aliquot = 512ULL << 10;
--- 33,43 ----
  #include <sys/metaslab_impl.h>
  #include <sys/vdev_impl.h>
  #include <sys/zio.h>
  #include <sys/spa_impl.h>
  #include <sys/zfeature.h>
! #include <sys/wbc.h>
  
  #define GANG_ALLOCATION(flags) \
          ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
  
  uint64_t metaslab_aliquot = 512ULL << 10;
*** 165,179 ****
   * Enable/disable metaslab group biasing.
   */
  boolean_t metaslab_bias_enabled = B_TRUE;
  
  /*
-  * Enable/disable remapping of indirect DVAs to their concrete vdevs.
-  */
- boolean_t zfs_remap_blkptr_enable = B_TRUE;
- 
- /*
   * Enable/disable segment-based metaslab selection.
   */
  boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
  
  /*
--- 166,175 ----
*** 199,214 ****
   */
  uint64_t metaslab_trace_max_entries = 5000;
  
  static uint64_t metaslab_weight(metaslab_t *);
  static void metaslab_set_fragmentation(metaslab_t *);
- static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
- static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
  
  kmem_cache_t *metaslab_alloc_trace_cache;
  
  /*
   * ==========================================================================
   * Metaslab classes
   * ==========================================================================
   */
  metaslab_class_t *
--- 195,244 ----
   */
  uint64_t metaslab_trace_max_entries = 5000;
  
  static uint64_t metaslab_weight(metaslab_t *);
  static void metaslab_set_fragmentation(metaslab_t *);
  
  kmem_cache_t *metaslab_alloc_trace_cache;
  
  /*
+  * Toggle between space-based DVA allocator 0, latency-based 1 or hybrid 2.
+  * A value other than 0, 1 or 2 will be considered 0 (default).
+  */
+ int metaslab_alloc_dva_algorithm = 0;
+ 
+ /*
+  * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
+  * issued to the underlying vdev. We keep two range trees of extents
+  * (called "trim sets") to be trimmed per metaslab, the `current' and
+  * the `previous' TS. New free's are added to the current TS. Then,
+  * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
+  * TS becomes the `previous' TS and a new, blank TS is created to be
+  * the new `current', which will then start accumulating any new frees.
+  * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
+  * extents are trimmed, the TS is destroyed and the current TS again
+  * becomes the previous TS.
+  * This serves to fulfill two functions: aggregate many small frees
+  * into fewer larger trim operations (which should help with devices
+  * which do not take so kindly to them) and to allow for disaster
+  * recovery (extents won't get trimmed immediately, but instead only
+  * after passing this rather long timeout, thus preserving
+  * 'zfs import -F' functionality).
+  */
+ unsigned int zfs_txgs_per_trim = 32;
+ 
+ static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
+ static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
+ 
+ static zio_t *metaslab_exec_trim(metaslab_t *msp);
+ 
+ static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
+ static void metaslab_free_trimset(metaslab_trimset_t *ts);
+ static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
+     uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
+ 
+ /*
   * ==========================================================================
   * Metaslab classes
   * ==========================================================================
   */
  metaslab_class_t *
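
The two-trimset rotation described in the comment above can be modeled in a few lines of user-space C. The sketch below is only an illustration of the policy, not the kernel code: toy_trimset_t, the per-txg extent counter and the printf() are invented for the example; the rotation itself (a set accumulates frees as the `current' set for zfs_txgs_per_trim txgs, then ages as the `previous' set for another interval, and is trimmed and destroyed when it leaves that role) follows the comment.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-in for a trimset: an extent count instead of a range tree. */
    typedef struct toy_trimset {
            unsigned long ts_birth_txg;     /* txg in which this set was opened */
            unsigned long ts_nextents;      /* frees accumulated so far */
    } toy_trimset_t;

    static const unsigned long txgs_per_trim = 32;  /* mirrors zfs_txgs_per_trim */
    static toy_trimset_t *cur_ts, *prev_ts;

    static toy_trimset_t *
    toy_trimset_new(unsigned long txg)
    {
            toy_trimset_t *ts = calloc(1, sizeof (*ts));
            ts->ts_birth_txg = txg;
            return (ts);
    }

    /* Called at the end of every simulated txg, after frees hit cur_ts. */
    static void
    toy_txg_done(unsigned long txg)
    {
            if (txg - cur_ts->ts_birth_txg < txgs_per_trim)
                    return;
            if (prev_ts != NULL) {
                    /* prev_ts has now aged two full intervals: issue its TRIMs. */
                    printf("txg %lu: trim %lu extents queued since txg %lu\n",
                        txg, prev_ts->ts_nextents, prev_ts->ts_birth_txg);
                    free(prev_ts);
            }
            prev_ts = cur_ts;               /* current becomes previous */
            cur_ts = toy_trimset_new(txg);  /* open a fresh current set */
    }

    int
    main(void)
    {
            cur_ts = toy_trimset_new(0);
            for (unsigned long txg = 1; txg <= 100; txg++) {
                    cur_ts->ts_nextents += 10;      /* pretend 10 frees per txg */
                    toy_txg_done(txg);
            }
            return (0);
    }
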
*** 216,225 ****
--- 246,259 ----
  {
          metaslab_class_t *mc;
  
          mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
  
+         mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
+         avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
+             sizeof (zio_t), offsetof(zio_t, io_alloc_node));
+ 
          mc->mc_spa = spa;
          mc->mc_rotor = NULL;
          mc->mc_ops = ops;
          mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
          refcount_create_tracked(&mc->mc_alloc_slots);
*** 234,243 ****
--- 268,280 ----
          ASSERT(mc->mc_alloc == 0);
          ASSERT(mc->mc_deferred == 0);
          ASSERT(mc->mc_space == 0);
          ASSERT(mc->mc_dspace == 0);
  
+         avl_destroy(&mc->mc_alloc_tree);
+         mutex_destroy(&mc->mc_alloc_lock);
+ 
          refcount_destroy(&mc->mc_alloc_slots);
          mutex_destroy(&mc->mc_lock);
          kmem_free(mc, sizeof (metaslab_class_t));
  }
  
*** 320,330 ****
  
                  /*
                   * Skip any holes, uninitialized top-levels, or
                   * vdevs that are not in this metalab class.
                   */
!                 if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
--- 357,367 ----
  
                  /*
                   * Skip any holes, uninitialized top-levels, or
                   * vdevs that are not in this metalab class.
                   */
!                 if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
*** 355,368 ****
          for (int c = 0; c < rvd->vdev_children; c++) {
                  vdev_t *tvd = rvd->vdev_child[c];
                  metaslab_group_t *mg = tvd->vdev_mg;
  
                  /*
!                  * Skip any holes, uninitialized top-levels,
!                  * or vdevs that are not in this metalab class.
                   */
!                 if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  /*
--- 392,405 ----
          for (int c = 0; c < rvd->vdev_children; c++) {
                  vdev_t *tvd = rvd->vdev_child[c];
                  metaslab_group_t *mg = tvd->vdev_mg;
  
                  /*
!                  * Skip any holes, uninitialized top-levels, or
!                  * vdevs that are not in this metalab class.
                   */
!                 if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  /*
*** 404,414 ****
          for (int c = 0; c < rvd->vdev_children; c++) {
                  uint64_t tspace;
                  vdev_t *tvd = rvd->vdev_child[c];
                  metaslab_group_t *mg = tvd->vdev_mg;
  
!                 if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  /*
--- 441,451 ----
          for (int c = 0; c < rvd->vdev_children; c++) {
                  uint64_t tspace;
                  vdev_t *tvd = rvd->vdev_child[c];
                  metaslab_group_t *mg = tvd->vdev_mg;
  
!                 if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
                      mg->mg_class != mc) {
                          continue;
                  }
  
                  /*
*** 516,527 ****
          vdev_stat_t *vs = &vd->vdev_stat;
          boolean_t was_allocatable;
          boolean_t was_initialized;
  
          ASSERT(vd == vd->vdev_top);
-         ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
-             SCL_ALLOC);
  
          mutex_enter(&mg->mg_lock);
          was_allocatable = mg->mg_allocatable;
          was_initialized = mg->mg_initialized;
  
--- 553,562 ----
*** 615,624 ****
--- 650,660 ----
           * either because we never activated in the first place or
           * because we're done, and possibly removing the vdev.
           */
          ASSERT(mg->mg_activation_count <= 0);
  
+         if (mg->mg_taskq)
                  taskq_destroy(mg->mg_taskq);
          avl_destroy(&mg->mg_metaslab_tree);
          mutex_destroy(&mg->mg_lock);
          refcount_destroy(&mg->mg_alloc_queue_depth);
          kmem_free(mg, sizeof (metaslab_group_t));
*** 628,638 ****
  metaslab_group_activate(metaslab_group_t *mg)
  {
          metaslab_class_t *mc = mg->mg_class;
          metaslab_group_t *mgprev, *mgnext;
  
!         ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
  
          ASSERT(mc->mc_rotor != mg);
          ASSERT(mg->mg_prev == NULL);
          ASSERT(mg->mg_next == NULL);
          ASSERT(mg->mg_activation_count <= 0);
--- 664,674 ----
  metaslab_group_activate(metaslab_group_t *mg)
  {
          metaslab_class_t *mc = mg->mg_class;
          metaslab_group_t *mgprev, *mgnext;
  
!         ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
  
          ASSERT(mc->mc_rotor != mg);
          ASSERT(mg->mg_prev == NULL);
          ASSERT(mg->mg_next == NULL);
          ASSERT(mg->mg_activation_count <= 0);
*** 654,705 ****
                  mgnext->mg_prev = mg;
          }
          mc->mc_rotor = mg;
  }
  
- /*
-  * Passivate a metaslab group and remove it from the allocation rotor.
-  * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
-  * a metaslab group. This function will momentarily drop spa_config_locks
-  * that are lower than the SCL_ALLOC lock (see comment below).
-  */
  void
  metaslab_group_passivate(metaslab_group_t *mg)
  {
          metaslab_class_t *mc = mg->mg_class;
-         spa_t *spa = mc->mc_spa;
          metaslab_group_t *mgprev, *mgnext;
-         int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
  
!         ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
!             (SCL_ALLOC | SCL_ZIO));
  
          if (--mg->mg_activation_count != 0) {
                  ASSERT(mc->mc_rotor != mg);
                  ASSERT(mg->mg_prev == NULL);
                  ASSERT(mg->mg_next == NULL);
                  ASSERT(mg->mg_activation_count < 0);
                  return;
          }
  
-         /*
-          * The spa_config_lock is an array of rwlocks, ordered as
-          * follows (from highest to lowest):
-          *      SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
-          *      SCL_ZIO > SCL_FREE > SCL_VDEV
-          * (For more information about the spa_config_lock see spa_misc.c)
-          * The higher the lock, the broader its coverage. When we passivate
-          * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
-          * config locks. However, the metaslab group's taskq might be trying
-          * to preload metaslabs so we must drop the SCL_ZIO lock and any
-          * lower locks to allow the I/O to complete. At a minimum,
-          * we continue to hold the SCL_ALLOC lock, which prevents any future
-          * allocations from taking place and any changes to the vdev tree.
-          */
-         spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
          taskq_wait(mg->mg_taskq);
-         spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
          metaslab_group_alloc_update(mg);
  
          mgprev = mg->mg_prev;
          mgnext = mg->mg_next;
  
--- 690,716 ----
                  mgnext->mg_prev = mg;
          }
          mc->mc_rotor = mg;
  }
  
  void
  metaslab_group_passivate(metaslab_group_t *mg)
  {
          metaslab_class_t *mc = mg->mg_class;
          metaslab_group_t *mgprev, *mgnext;
  
!         ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
  
          if (--mg->mg_activation_count != 0) {
                  ASSERT(mc->mc_rotor != mg);
                  ASSERT(mg->mg_prev == NULL);
                  ASSERT(mg->mg_next == NULL);
                  ASSERT(mg->mg_activation_count < 0);
                  return;
          }
  
          taskq_wait(mg->mg_taskq);
          metaslab_group_alloc_update(mg);
  
          mgprev = mg->mg_prev;
          mgnext = mg->mg_next;
  
*** 1139,1161 ****
   * This is a helper function that can be used by the allocator to find
   * a suitable block to allocate. This will search the specified AVL
   * tree looking for a block that matches the specified criteria.
   */
  static uint64_t
! metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
!     uint64_t align)
  {
          range_seg_t *rs = metaslab_block_find(t, *cursor, size);
  
!         while (rs != NULL) {
                  uint64_t offset = P2ROUNDUP(rs->rs_start, align);
  
!                 if (offset + size <= rs->rs_end) {
                          *cursor = offset + size;
                          return (offset);
                  }
-                 rs = AVL_NEXT(t, rs);
          }
  
          /*
           * If we know we've searched the whole map (*cursor == 0), give up.
           * Otherwise, reset the cursor to the beginning and try again.
--- 1150,1173 ----
   * This is a helper function that can be used by the allocator to find
   * a suitable block to allocate. This will search the specified AVL
   * tree looking for a block that matches the specified criteria.
   */
  static uint64_t
! metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
!     uint64_t size, uint64_t align)
  {
          range_seg_t *rs = metaslab_block_find(t, *cursor, size);
  
!         for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
                  uint64_t offset = P2ROUNDUP(rs->rs_start, align);
  
!                 if (offset + size <= rs->rs_end &&
!                     !metaslab_check_trim_conflict(msp, &offset, size, align,
!                     rs->rs_end)) {
                          *cursor = offset + size;
                          return (offset);
                  }
          }
  
          /*
           * If we know we've searched the whole map (*cursor == 0), give up.
           * Otherwise, reset the cursor to the beginning and try again.
*** 1162,1172 ****
           */
          if (*cursor == 0)
                  return (-1ULL);
  
          *cursor = 0;
!         return (metaslab_block_picker(t, cursor, size, align));
  }
  
  /*
   * ==========================================================================
   * The first-fit block allocator
--- 1174,1184 ----
           */
          if (*cursor == 0)
                  return (-1ULL);
  
          *cursor = 0;
!         return (metaslab_block_picker(msp, t, cursor, size, align));
  }
  
  /*
   * ==========================================================================
   * The first-fit block allocator
*** 1184,1194 ****
           */
          uint64_t align = size & -size;
          uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
          avl_tree_t *t = &msp->ms_tree->rt_root;
  
!         return (metaslab_block_picker(t, cursor, size, align));
  }
  
  static metaslab_ops_t metaslab_ff_ops = {
          metaslab_ff_alloc
  };
--- 1196,1206 ----
           */
          uint64_t align = size & -size;
          uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
          avl_tree_t *t = &msp->ms_tree->rt_root;
  
!         return (metaslab_block_picker(msp, t, cursor, size, align));
  }
  
  static metaslab_ops_t metaslab_ff_ops = {
          metaslab_ff_alloc
  };
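
The picker changes above thread the metaslab through so that each candidate segment can be rejected (or nudged forward) when it overlaps an extent that is currently being trimmed. The fragment below is a self-contained first-fit sketch of that idea: seg_t, trim_conflict() and the plain sorted-array walk are invented for the illustration and stand in for the range-tree/AVL machinery; only the cursor handling, the power-of-two alignment round-up, the skip-on-conflict check and the single cursor rewind mirror metaslab_block_picker().

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct seg {
            uint64_t start;
            uint64_t end;
    } seg_t;

    /* Round x up to the next multiple of a; a must be a power of two. */
    #define P2ROUNDUP(x, a) (-(-(x) & -(a)))

    /*
     * Stand-in for metaslab_check_trim_conflict(): return nonzero when
     * [*offset, *offset + size) overlaps a region currently being trimmed.
     * The real helper can also advance *offset past the conflicting range.
     */
    static int
    trim_conflict(uint64_t *offset, uint64_t size)
    {
            (void) offset;
            (void) size;
            return (0);             /* no trims in flight in this toy */
    }

    /*
     * First-fit over a sorted segment array: resume at *cursor, take the
     * first aligned fit that is not under an active trim, and rewind the
     * cursor once before giving up.
     */
    static uint64_t
    pick_block(const seg_t *segs, size_t nsegs, uint64_t *cursor,
        uint64_t size, uint64_t align)
    {
            for (;;) {
                    for (size_t i = 0; i < nsegs; i++) {
                            uint64_t off;

                            if (segs[i].end <= *cursor)
                                    continue;       /* already behind the cursor */
                            off = P2ROUNDUP(segs[i].start, align);
                            if (off < *cursor)
                                    off = P2ROUNDUP(*cursor, align);
                            if (off + size <= segs[i].end &&
                                !trim_conflict(&off, size)) {
                                    *cursor = off + size;
                                    return (off);
                            }
                    }
                    if (*cursor == 0)
                            return ((uint64_t)-1);  /* whole map searched */
                    *cursor = 0;                    /* rewind and retry once */
            }
    }

    int
    main(void)
    {
            seg_t segs[] = { { 0, 4096 }, { 8192, 65536 } };
            uint64_t cursor = 0;
            uint64_t off = pick_block(segs, 2, &cursor, 16384, 4096);

            printf("allocated at %llu, cursor now %llu\n",
                (unsigned long long)off, (unsigned long long)cursor);
            return (0);
    }
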
*** 1232,1242 ****
              free_pct < metaslab_df_free_pct) {
                  t = &msp->ms_size_tree;
                  *cursor = 0;
          }
  
!         return (metaslab_block_picker(t, cursor, size, 1ULL));
  }
  
  static metaslab_ops_t metaslab_df_ops = {
          metaslab_df_alloc
  };
--- 1244,1254 ----
              free_pct < metaslab_df_free_pct) {
                  t = &msp->ms_size_tree;
                  *cursor = 0;
          }
  
!         return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
  }
  
  static metaslab_ops_t metaslab_df_ops = {
          metaslab_df_alloc
  };
*** 1264,1281 ****
  
          ASSERT3U(*cursor_end, >=, *cursor);
  
          if ((*cursor + size) > *cursor_end) {
                  range_seg_t *rs;
! 
!                 rs = avl_last(&msp->ms_size_tree);
!                 if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
!                         return (-1ULL);
! 
                  *cursor = rs->rs_start;
                  *cursor_end = rs->rs_end;
          }
  
          offset = *cursor;
          *cursor += size;
  
          return (offset);
--- 1276,1299 ----
  
          ASSERT3U(*cursor_end, >=, *cursor);
  
          if ((*cursor + size) > *cursor_end) {
                  range_seg_t *rs;
!                 for (rs = avl_last(&msp->ms_size_tree);
!                     rs != NULL && rs->rs_end - rs->rs_start >= size;
!                     rs = AVL_PREV(&msp->ms_size_tree, rs)) {
                          *cursor = rs->rs_start;
                          *cursor_end = rs->rs_end;
+                         if (!metaslab_check_trim_conflict(msp, cursor, size,
+                             1, *cursor_end)) {
+                                 /* segment appears to be acceptable */
+                                 break;
                          }
+                 }
+                 if (rs == NULL || rs->rs_end - rs->rs_start < size)
+                         return (-1ULL);
+         }
  
          offset = *cursor;
          *cursor += size;
  
          return (offset);
*** 1307,1316 ****
--- 1325,1336 ----
          avl_index_t where;
          range_seg_t *rs, rsearch;
          uint64_t hbit = highbit64(size);
          uint64_t *cursor = &msp->ms_lbas[hbit - 1];
          uint64_t max_size = metaslab_block_maxsize(msp);
+         /* mutable copy for adjustment by metaslab_check_trim_conflict */
+         uint64_t adjustable_start;
  
          ASSERT(MUTEX_HELD(&msp->ms_lock));
          ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
  
          if (max_size < size)
*** 1318,1344 ****
  
          rsearch.rs_start = *cursor;
          rsearch.rs_end = *cursor + size;
  
          rs = avl_find(t, &rsearch, &where);
!         if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
                  t = &msp->ms_size_tree;
  
                  rsearch.rs_start = 0;
                  rsearch.rs_end = MIN(max_size,
                      1ULL << (hbit + metaslab_ndf_clump_shift));
                  rs = avl_find(t, &rsearch, &where);
                  if (rs == NULL)
                          rs = avl_nearest(t, where, AVL_AFTER);
                  ASSERT(rs != NULL);
          }
- 
-         if ((rs->rs_end - rs->rs_start) >= size) {
-                 *cursor = rs->rs_start + size;
-                 return (rs->rs_start);
          }
!         return (-1ULL);
  }
  
  static metaslab_ops_t metaslab_ndf_ops = {
          metaslab_ndf_alloc
  };
--- 1338,1373 ----
  
          rsearch.rs_start = *cursor;
          rsearch.rs_end = *cursor + size;
  
          rs = avl_find(t, &rsearch, &where);
!         if (rs != NULL)
!                 adjustable_start = rs->rs_start;
!         if (rs == NULL || rs->rs_end - adjustable_start < size ||
!             metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
!             rs->rs_end)) {
!                 /* segment not usable, try the largest remaining one */
                  t = &msp->ms_size_tree;
  
                  rsearch.rs_start = 0;
                  rsearch.rs_end = MIN(max_size,
                      1ULL << (hbit + metaslab_ndf_clump_shift));
                  rs = avl_find(t, &rsearch, &where);
                  if (rs == NULL)
                          rs = avl_nearest(t, where, AVL_AFTER);
                  ASSERT(rs != NULL);
+                 adjustable_start = rs->rs_start;
+                 if (rs->rs_end - adjustable_start < size ||
+                     metaslab_check_trim_conflict(msp, &adjustable_start,
+                     size, 1, rs->rs_end)) {
+                         /* even largest remaining segment not usable */
+                         return (-1ULL);
                  }
          }
! 
!         *cursor = adjustable_start + size;
!         return (*cursor);
  }
  
  static metaslab_ops_t metaslab_ndf_ops = {
          metaslab_ndf_alloc
  };
*** 1374,1389 ****
          ASSERT(MUTEX_HELD(&msp->ms_lock));
          ASSERT(!msp->ms_loaded);
          ASSERT(!msp->ms_loading);
  
          msp->ms_loading = B_TRUE;
-         /*
-          * Nobody else can manipulate a loading metaslab, so it's now safe
-          * to drop the lock.  This way we don't have to hold the lock while
-          * reading the spacemap from disk.
-          */
-         mutex_exit(&msp->ms_lock);
  
          /*
           * If the space map has not been allocated yet, then treat
           * all the space in the metaslab as free and add it to the
           * ms_tree.
--- 1403,1412 ----
*** 1392,1412 ****
                  error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
          else
                  range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
  
          success = (error == 0);
- 
-         mutex_enter(&msp->ms_lock);
          msp->ms_loading = B_FALSE;
  
          if (success) {
                  ASSERT3P(msp->ms_group, !=, NULL);
                  msp->ms_loaded = B_TRUE;
  
                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                          range_tree_walk(msp->ms_defertree[t],
                              range_tree_remove, msp->ms_tree);
                  }
                  msp->ms_max_size = metaslab_block_maxsize(msp);
          }
          cv_broadcast(&msp->ms_load_cv);
          return (error);
--- 1415,1435 ----
                  error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
          else
                  range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
  
          success = (error == 0);
          msp->ms_loading = B_FALSE;
  
          if (success) {
                  ASSERT3P(msp->ms_group, !=, NULL);
                  msp->ms_loaded = B_TRUE;
  
                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                          range_tree_walk(msp->ms_defertree[t],
                              range_tree_remove, msp->ms_tree);
+                         range_tree_walk(msp->ms_defertree[t],
+                             metaslab_trim_remove, msp);
                  }
                  msp->ms_max_size = metaslab_block_maxsize(msp);
          }
          cv_broadcast(&msp->ms_load_cv);
          return (error);
*** 1431,1442 ****
          metaslab_t *ms;
          int error;
  
          ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
          mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
-         mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
          cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
          ms->ms_id = id;
          ms->ms_start = id << vd->vdev_ms_shift;
          ms->ms_size = 1ULL << vd->vdev_ms_shift;
  
          /*
--- 1454,1465 ----
          metaslab_t *ms;
          int error;
  
          ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
          mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
          cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
+         cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
          ms->ms_id = id;
          ms->ms_start = id << vd->vdev_ms_shift;
          ms->ms_size = 1ULL << vd->vdev_ms_shift;
  
          /*
*** 1443,1470 ****
           * We only open space map objects that already exist. All others
           * will be opened when we finally allocate an object for it.
           */
          if (object != 0) {
                  error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
!                     ms->ms_size, vd->vdev_ashift);
  
                  if (error != 0) {
                          kmem_free(ms, sizeof (metaslab_t));
                          return (error);
                  }
  
                  ASSERT(ms->ms_sm != NULL);
          }
  
          /*
           * We create the main range tree here, but we don't create the
           * other range trees until metaslab_sync_done().  This serves
           * two purposes: it allows metaslab_sync_done() to detect the
           * addition of new space; and for debugging, it ensures that we'd
           * data fault on any attempt to use this metaslab before it's ready.
           */
!         ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
          metaslab_group_add(mg, ms);
  
          metaslab_set_fragmentation(ms);
  
          /*
--- 1466,1495 ----
           * We only open space map objects that already exist. All others
           * will be opened when we finally allocate an object for it.
           */
          if (object != 0) {
                  error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
!                     ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
  
                  if (error != 0) {
                          kmem_free(ms, sizeof (metaslab_t));
                          return (error);
                  }
  
                  ASSERT(ms->ms_sm != NULL);
          }
  
+         ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
+ 
          /*
           * We create the main range tree here, but we don't create the
           * other range trees until metaslab_sync_done().  This serves
           * two purposes: it allows metaslab_sync_done() to detect the
           * addition of new space; and for debugging, it ensures that we'd
           * data fault on any attempt to use this metaslab before it's ready.
           */
!         ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
          metaslab_group_add(mg, ms);
  
          metaslab_set_fragmentation(ms);
  
          /*
*** 1524,1539 ****
  
          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                  range_tree_destroy(msp->ms_defertree[t]);
          }
  
          ASSERT0(msp->ms_deferspace);
  
          mutex_exit(&msp->ms_lock);
          cv_destroy(&msp->ms_load_cv);
          mutex_destroy(&msp->ms_lock);
-         mutex_destroy(&msp->ms_sync_lock);
  
          kmem_free(msp, sizeof (metaslab_t));
  }
  
  #define FRAGMENTATION_TABLE_SIZE        17
--- 1549,1569 ----
  
          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                  range_tree_destroy(msp->ms_defertree[t]);
          }
  
+         metaslab_free_trimset(msp->ms_cur_ts);
+         if (msp->ms_prev_ts)
+                 metaslab_free_trimset(msp->ms_prev_ts);
+         ASSERT3P(msp->ms_trimming_ts, ==, NULL);
+ 
          ASSERT0(msp->ms_deferspace);
  
          mutex_exit(&msp->ms_lock);
          cv_destroy(&msp->ms_load_cv);
+         cv_destroy(&msp->ms_trim_cv);
          mutex_destroy(&msp->ms_lock);
  
          kmem_free(msp, sizeof (metaslab_t));
  }
  
  #define FRAGMENTATION_TABLE_SIZE        17
*** 1895,1909 ****
          uint64_t weight;
  
          ASSERT(MUTEX_HELD(&msp->ms_lock));
  
          /*
!          * If this vdev is in the process of being removed, there is nothing
           * for us to do here.
           */
!         if (vd->vdev_removing)
                  return (0);
  
          metaslab_set_fragmentation(msp);
  
          /*
           * Update the maximum size if the metaslab is loaded. This will
--- 1925,1942 ----
          uint64_t weight;
  
          ASSERT(MUTEX_HELD(&msp->ms_lock));
  
          /*
!          * This vdev is in the process of being removed so there is nothing
           * for us to do here.
           */
!         if (vd->vdev_removing) {
!                 ASSERT0(space_map_allocated(msp->ms_sm));
!                 ASSERT0(vd->vdev_ms_shift);
                  return (0);
+         }
  
          metaslab_set_fragmentation(msp);
  
          /*
           * Update the maximum size if the metaslab is loaded. This will
*** 2031,2047 ****
                  taskq_wait(mg->mg_taskq);
                  return;
          }
  
          mutex_enter(&mg->mg_lock);
- 
          /*
           * Load the next potential metaslabs
           */
          for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
-                 ASSERT3P(msp->ms_group, ==, mg);
- 
                  /*
                   * We preload only the maximum number of metaslabs specified
                   * by metaslab_preload_limit. If a metaslab is being forced
                   * to condense then we preload it too. This will ensure
                   * that force condensing happens in the next txg.
--- 2064,2077 ----
*** 2064,2074 ****
   * 1. The size of the space map object should not dramatically increase as a
   * result of writing out the free space range tree.
   *
   * 2. The minimal on-disk space map representation is zfs_condense_pct/100
   * times the size than the free space range tree representation
!  * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
   *
   * 3. The on-disk size of the space map should actually decrease.
   *
   * Checking the first condition is tricky since we don't want to walk
   * the entire AVL tree calculating the estimated on-disk size. Instead we
--- 2094,2104 ----
   * 1. The size of the space map object should not dramatically increase as a
   * result of writing out the free space range tree.
   *
   * 2. The minimal on-disk space map representation is zfs_condense_pct/100
   * times the size than the free space range tree representation
!  * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
   *
   * 3. The on-disk size of the space map should actually decrease.
   *
   * Checking the first condition is tricky since we don't want to walk
   * the entire AVL tree calculating the estimated on-disk size. Instead we
*** 2161,2171 ****
           * that have been freed in this txg, any deferred frees that exist,
           * and any allocation in the future. Removing segments should be
           * a relatively inexpensive operation since we expect these trees to
           * have a small number of nodes.
           */
!         condense_tree = range_tree_create(NULL, NULL);
          range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
  
          /*
           * Remove what's been freed in this txg from the condense_tree.
           * Since we're in sync_pass 1, we know that all the frees from
--- 2191,2201 ----
           * that have been freed in this txg, any deferred frees that exist,
           * and any allocation in the future. Removing segments should be
           * a relatively inexpensive operation since we expect these trees to
           * have a small number of nodes.
           */
!         condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
          range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
  
          /*
           * Remove what's been freed in this txg from the condense_tree.
           * Since we're in sync_pass 1, we know that all the frees from
*** 2194,2203 ****
--- 2224,2234 ----
           */
          msp->ms_condensing = B_TRUE;
  
          mutex_exit(&msp->ms_lock);
          space_map_truncate(sm, tx);
+         mutex_enter(&msp->ms_lock);
  
          /*
           * While we would ideally like to create a space map representation
           * that consists only of allocation records, doing so can be
           * prohibitively expensive because the in-core free tree can be
*** 2210,2220 ****
          space_map_write(sm, condense_tree, SM_ALLOC, tx);
          range_tree_vacate(condense_tree, NULL, NULL);
          range_tree_destroy(condense_tree);
  
          space_map_write(sm, msp->ms_tree, SM_FREE, tx);
-         mutex_enter(&msp->ms_lock);
          msp->ms_condensing = B_FALSE;
  }
  
  /*
   * Write a metaslab to disk in the context of the specified transaction group.
--- 2241,2250 ----
*** 2230,2244 ****
--- 2260,2277 ----
          dmu_tx_t *tx;
          uint64_t object = space_map_object(msp->ms_sm);
  
          ASSERT(!vd->vdev_ishole);
  
+         mutex_enter(&msp->ms_lock);
+ 
          /*
           * This metaslab has just been added so there's no work to do now.
           */
          if (msp->ms_freeingtree == NULL) {
                  ASSERT3P(alloctree, ==, NULL);
+                 mutex_exit(&msp->ms_lock);
                  return;
          }
  
          ASSERT3P(alloctree, !=, NULL);
          ASSERT3P(msp->ms_freeingtree, !=, NULL);
*** 2250,2277 ****
           * is being forced to condense and it's loaded, we need to let it
           * through.
           */
          if (range_tree_space(alloctree) == 0 &&
              range_tree_space(msp->ms_freeingtree) == 0 &&
!             !(msp->ms_loaded && msp->ms_condense_wanted))
                  return;
  
  
          VERIFY(txg <= spa_final_dirty_txg(spa));
  
          /*
           * The only state that can actually be changing concurrently with
           * metaslab_sync() is the metaslab's ms_tree.  No other thread can
           * be modifying this txg's alloctree, freeingtree, freedtree, or
!          * space_map_phys_t.  We drop ms_lock whenever we could call
!          * into the DMU, because the DMU can call down to us
!          * (e.g. via zio_free()) at any time.
!          *
!          * The spa_vdev_remove_thread() can be reading metaslab state
!          * concurrently, and it is locked out by the ms_sync_lock.  Note
!          * that the ms_lock is insufficient for this, because it is dropped
!          * by space_map_write().
           */
  
          tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
  
          if (msp->ms_sm == NULL) {
--- 2283,2308 ----
           * is being forced to condense and it's loaded, we need to let it
           * through.
           */
          if (range_tree_space(alloctree) == 0 &&
              range_tree_space(msp->ms_freeingtree) == 0 &&
!             !(msp->ms_loaded && msp->ms_condense_wanted)) {
!                 mutex_exit(&msp->ms_lock);
                  return;
+         }
  
  
          VERIFY(txg <= spa_final_dirty_txg(spa));
  
          /*
           * The only state that can actually be changing concurrently with
           * metaslab_sync() is the metaslab's ms_tree.  No other thread can
           * be modifying this txg's alloctree, freeingtree, freedtree, or
!          * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
!          * space map ASSERTs. We drop it whenever we call into the DMU,
!          * because the DMU can call down to us (e.g. via zio_free()) at
!          * any time.
           */
  
          tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
  
          if (msp->ms_sm == NULL) {
*** 2279,2295 ****
  
                  new_object = space_map_alloc(mos, tx);
                  VERIFY3U(new_object, !=, 0);
  
                  VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
!                     msp->ms_start, msp->ms_size, vd->vdev_ashift));
                  ASSERT(msp->ms_sm != NULL);
          }
  
-         mutex_enter(&msp->ms_sync_lock);
-         mutex_enter(&msp->ms_lock);
- 
          /*
           * Note: metaslab_condense() clears the space map's histogram.
           * Therefore we must verify and remove this histogram before
           * condensing.
           */
--- 2310,2324 ----
  
                  new_object = space_map_alloc(mos, tx);
                  VERIFY3U(new_object, !=, 0);
  
                  VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
!                     msp->ms_start, msp->ms_size, vd->vdev_ashift,
!                     &msp->ms_lock));
                  ASSERT(msp->ms_sm != NULL);
          }
  
          /*
           * Note: metaslab_condense() clears the space map's histogram.
           * Therefore we must verify and remove this histogram before
           * condensing.
           */
*** 2299,2317 ****
  
          if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
              metaslab_should_condense(msp)) {
                  metaslab_condense(msp, txg, tx);
          } else {
-                 mutex_exit(&msp->ms_lock);
                  space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
                  space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
-                 mutex_enter(&msp->ms_lock);
          }
  
          if (msp->ms_loaded) {
                  /*
!                  * When the space map is loaded, we have an accurate
                   * histogram in the range tree. This gives us an opportunity
                   * to bring the space map's histogram up-to-date so we clear
                   * it first before updating it.
                   */
                  space_map_histogram_clear(msp->ms_sm);
--- 2328,2344 ----
  
          if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
              metaslab_should_condense(msp)) {
                  metaslab_condense(msp, txg, tx);
          } else {
                  space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
                  space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
          }
  
          if (msp->ms_loaded) {
                  /*
!                  * When the space map is loaded, we have an accurate
                   * histogram in the range tree. This gives us an opportunity
                   * to bring the space map's histogram up-to-date so we clear
                   * it first before updating it.
                   */
                  space_map_histogram_clear(msp->ms_sm);
*** 2375,2385 ****
          if (object != space_map_object(msp->ms_sm)) {
                  object = space_map_object(msp->ms_sm);
                  dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
                      msp->ms_id, sizeof (uint64_t), &object, tx);
          }
-         mutex_exit(&msp->ms_sync_lock);
          dmu_tx_commit(tx);
  }
  
  /*
   * Called after a transaction group has completely synced to mark
--- 2402,2411 ----
*** 2405,2437 ****
           */
          if (msp->ms_freedtree == NULL) {
                  for (int t = 0; t < TXG_SIZE; t++) {
                          ASSERT(msp->ms_alloctree[t] == NULL);
  
!                         msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
                  }
  
                  ASSERT3P(msp->ms_freeingtree, ==, NULL);
!                 msp->ms_freeingtree = range_tree_create(NULL, NULL);
  
                  ASSERT3P(msp->ms_freedtree, ==, NULL);
!                 msp->ms_freedtree = range_tree_create(NULL, NULL);
  
                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                          ASSERT(msp->ms_defertree[t] == NULL);
  
!                         msp->ms_defertree[t] = range_tree_create(NULL, NULL);
                  }
  
                  vdev_space_update(vd, 0, 0, msp->ms_size);
          }
  
          defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
  
          uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
              metaslab_class_get_alloc(spa_normal_class(spa));
!         if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
                  defer_allowed = B_FALSE;
          }
  
          defer_delta = 0;
          alloc_delta = space_map_alloc_delta(msp->ms_sm);
--- 2431,2467 ----
           */
          if (msp->ms_freedtree == NULL) {
                  for (int t = 0; t < TXG_SIZE; t++) {
                          ASSERT(msp->ms_alloctree[t] == NULL);
  
!                         msp->ms_alloctree[t] = range_tree_create(NULL, msp,
!                             &msp->ms_lock);
                  }
  
                  ASSERT3P(msp->ms_freeingtree, ==, NULL);
!                 msp->ms_freeingtree = range_tree_create(NULL, msp,
!                     &msp->ms_lock);
  
                  ASSERT3P(msp->ms_freedtree, ==, NULL);
!                 msp->ms_freedtree = range_tree_create(NULL, msp,
!                     &msp->ms_lock);
  
                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
                          ASSERT(msp->ms_defertree[t] == NULL);
  
!                         msp->ms_defertree[t] = range_tree_create(NULL, msp,
!                             &msp->ms_lock);
                  }
  
                  vdev_space_update(vd, 0, 0, msp->ms_size);
          }
  
          defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
  
          uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
              metaslab_class_get_alloc(spa_normal_class(spa));
!         if (free_space <= spa_get_slop_space(spa)) {
                  defer_allowed = B_FALSE;
          }
  
          defer_delta = 0;
          alloc_delta = space_map_alloc_delta(msp->ms_sm);
*** 2454,2463 ****
--- 2484,2501 ----
           * Move the frees from the defer_tree back to the free
           * range tree (if it's loaded). Swap the freed_tree and the
           * defer_tree -- this is safe to do because we've just emptied out
           * the defer_tree.
           */
+         if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+             !vd->vdev_man_trimming) {
+                 range_tree_walk(*defer_tree, metaslab_trim_add, msp);
+                 if (!defer_allowed) {
+                         range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
+                             msp);
+                 }
+         }
          range_tree_vacate(*defer_tree,
              msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
          if (defer_allowed) {
                  range_tree_swap(&msp->ms_freedtree, defer_tree);
          } else {
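
A toy model of the metaslab_sync_done() flow above may help show where a freed extent travels from txg to txg. The counters below stand in for the real range trees and the helper is invented; the model assumes defer is allowed and no manual trim is running. Only the rotation reflects the code: the frees deferred TXG_DEFER_SIZE txgs ago become allocatable again and, with autotrim on, are also queued for TRIM, while this txg's frees take over the emptied defer slot.

    #include <stdio.h>

    #define TXG_DEFER_SIZE  2       /* matches ZFS's TXG_DEFER_SIZE */

    /* Toy model: byte counters stand in for the metaslab's range trees. */
    static unsigned long freed;                     /* ms_freedtree   */
    static unsigned long defer[TXG_DEFER_SIZE];     /* ms_defertree[] */
    static unsigned long allocatable;               /* ms_tree        */
    static unsigned long trim_queue;                /* current trimset */

    static void
    toy_sync_done(unsigned long txg, int autotrim)
    {
            unsigned long *dt = &defer[txg % TXG_DEFER_SIZE];

            /* Frees deferred TXG_DEFER_SIZE txgs ago become allocatable... */
            allocatable += *dt;
            /* ...and, with autotrim on, are also queued for TRIM. */
            if (autotrim)
                    trim_queue += *dt;
            /* This txg's frees take over the now-empty defer slot. */
            *dt = freed;
            freed = 0;
    }

    int
    main(void)
    {
            for (unsigned long txg = 1; txg <= 5; txg++) {
                    freed += 100;                   /* pretend 100 bytes freed */
                    toy_sync_done(txg, 1);
                    printf("txg %lu: allocatable=%lu trim_queue=%lu\n",
                        txg, allocatable, trim_queue);
            }
            return (0);
    }
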
*** 2497,2533 ****
  
                  if (!metaslab_debug_unload)
                          metaslab_unload(msp);
          }
  
-         ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
-         ASSERT0(range_tree_space(msp->ms_freeingtree));
-         ASSERT0(range_tree_space(msp->ms_freedtree));
- 
          mutex_exit(&msp->ms_lock);
  }
  
  void
  metaslab_sync_reassess(metaslab_group_t *mg)
  {
-         spa_t *spa = mg->mg_class->mc_spa;
- 
-         spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
          metaslab_group_alloc_update(mg);
          mg->mg_fragmentation = metaslab_group_fragmentation(mg);
  
          /*
!          * Preload the next potential metaslabs but only on active
!          * metaslab groups. We can get into a state where the metaslab
!          * is no longer active since we dirty metaslabs as we remove a
!          * a device, thus potentially making the metaslab group eligible
!          * for preloading.
           */
-         if (mg->mg_activation_count > 0) {
                  metaslab_group_preload(mg);
-         }
-         spa_config_exit(spa, SCL_ALLOC, FTAG);
  }
  
  static uint64_t
  metaslab_distance(metaslab_t *msp, dva_t *dva)
  {
--- 2535,2557 ----
  
                  if (!metaslab_debug_unload)
                          metaslab_unload(msp);
          }
  
          mutex_exit(&msp->ms_lock);
  }
  
  void
  metaslab_sync_reassess(metaslab_group_t *mg)
  {
          metaslab_group_alloc_update(mg);
          mg->mg_fragmentation = metaslab_group_fragmentation(mg);
  
          /*
!          * Preload the next potential metaslabs
           */
          metaslab_group_preload(mg);
  }
  
  static uint64_t
  metaslab_distance(metaslab_t *msp, dva_t *dva)
  {
*** 2717,2726 ****
--- 2741,2751 ----
  
                  VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
                  VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
                  VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
                  range_tree_remove(rt, start, size);
+                 metaslab_trim_remove(msp, start, size);
  
                  if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
                          vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
  
                  range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
*** 2738,2748 ****
          return (start);
  }
  
  static uint64_t
  metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
!     uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
  {
          metaslab_t *msp = NULL;
          uint64_t offset = -1ULL;
          uint64_t activation_weight;
          uint64_t target_distance;
--- 2763,2774 ----
          return (start);
  }
  
  static uint64_t
  metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
!     uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
!     int flags)
  {
          metaslab_t *msp = NULL;
          uint64_t offset = -1ULL;
          uint64_t activation_weight;
          uint64_t target_distance;
*** 2759,2768 ****
--- 2785,2795 ----
          metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
          search->ms_weight = UINT64_MAX;
          search->ms_start = 0;
          for (;;) {
                  boolean_t was_active;
+                 boolean_t pass_primary = B_TRUE;
                  avl_tree_t *t = &mg->mg_metaslab_tree;
                  avl_index_t idx;
  
                  mutex_enter(&mg->mg_lock);
  
*** 2796,2820 ****
                           */
                          if (msp->ms_condensing)
                                  continue;
  
                          was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
!                         if (activation_weight == METASLAB_WEIGHT_PRIMARY)
                                  break;
  
                          target_distance = min_distance +
                              (space_map_allocated(msp->ms_sm) != 0 ? 0 :
                              min_distance >> 1);
  
!                         for (i = 0; i < d; i++) {
                                  if (metaslab_distance(msp, &dva[i]) <
                                      target_distance)
                                          break;
-                         }
                          if (i == d)
                                  break;
                  }
                  mutex_exit(&mg->mg_lock);
                  if (msp == NULL) {
                          kmem_free(search, sizeof (*search));
                          return (-1ULL);
                  }
--- 2823,2858 ----
                           */
                          if (msp->ms_condensing)
                                  continue;
  
                          was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
!                         if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
!                                 if (!pass_primary) {
!                                         DTRACE_PROBE(metaslab_use_secondary);
!                                         activation_weight =
!                                             METASLAB_WEIGHT_SECONDARY;
                                          break;
+                                 }
  
+                                 pass_primary = B_FALSE;
+                         } else {
+                                 if (activation_weight ==
+                                     METASLAB_WEIGHT_PRIMARY)
+                                         break;
+ 
                                  target_distance = min_distance +
                                      (space_map_allocated(msp->ms_sm) != 0 ? 0 :
                                      min_distance >> 1);
  
!                                 for (i = 0; i < d; i++)
                                          if (metaslab_distance(msp, &dva[i]) <
                                              target_distance)
                                                  break;
                                  if (i == d)
                                          break;
                          }
+                 }
                  mutex_exit(&mg->mg_lock);
                  if (msp == NULL) {
                          kmem_free(search, sizeof (*search));
                          return (-1ULL);
                  }
*** 2931,2947 ****
          return (offset);
  }
  
  static uint64_t
  metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
!     uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
  {
          uint64_t offset;
          ASSERT(mg->mg_initialized);
  
          offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
!             min_distance, dva, d);
  
          mutex_enter(&mg->mg_lock);
          if (offset == -1ULL) {
                  mg->mg_failed_allocations++;
                  metaslab_trace_add(zal, mg, NULL, asize, d,
--- 2969,2986 ----
          return (offset);
  }
  
  static uint64_t
  metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
!     uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
!     int d, int flags)
  {
          uint64_t offset;
          ASSERT(mg->mg_initialized);
  
          offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
!             min_distance, dva, d, flags);
  
          mutex_enter(&mg->mg_lock);
          if (offset == -1ULL) {
                  mg->mg_failed_allocations++;
                  metaslab_trace_add(zal, mg, NULL, asize, d,
*** 2975,2985 ****
  int ditto_same_vdev_distance_shift = 3;
  
  /*
   * Allocate a block for the specified i/o.
   */
! int
  metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
      dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
      zio_alloc_list_t *zal)
  {
          metaslab_group_t *mg, *rotor;
--- 3014,3024 ----
  int ditto_same_vdev_distance_shift = 3;
  
  /*
   * Allocate a block for the specified i/o.
   */
! static int
  metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
      dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
      zio_alloc_list_t *zal)
  {
          metaslab_group_t *mg, *rotor;
*** 3021,3035 ****
          if (hintdva) {
                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
  
                  /*
                   * It's possible the vdev we're using as the hint no
!                  * longer exists or its mg has been closed (e.g. by
!                  * device removal).  Consult the rotor when
                   * all else fails.
                   */
!                 if (vd != NULL && vd->vdev_mg != NULL) {
                          mg = vd->vdev_mg;
  
                          if (flags & METASLAB_HINTBP_AVOID &&
                              mg->mg_next != NULL)
                                  mg = mg->mg_next;
--- 3060,3073 ----
          if (hintdva) {
                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
  
                  /*
                   * It's possible the vdev we're using as the hint no
!                  * longer exists (i.e. removed). Consult the rotor when
                   * all else fails.
                   */
!                 if (vd != NULL) {
                          mg = vd->vdev_mg;
  
                          if (flags & METASLAB_HINTBP_AVOID &&
                              mg->mg_next != NULL)
                                  mg = mg->mg_next;
*** 3120,3130 ****
  
                  uint64_t asize = vdev_psize_to_asize(vd, psize);
                  ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
  
                  uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
!                     distance, dva, d);
  
                  if (offset != -1ULL) {
                          /*
                           * If we've just selected this metaslab group,
                           * figure out whether the corresponding vdev is
--- 3158,3168 ----
  
                  uint64_t asize = vdev_psize_to_asize(vd, psize);
                  ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
  
                  uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
!                     distance, dva, d, flags);
  
                  if (offset != -1ULL) {
                          /*
                           * If we've just selected this metaslab group,
                           * figure out whether the corresponding vdev is
*** 3131,3144 ****
                           * over- or under-used relative to the pool,
                           * and set an allocation bias to even it out.
                           */
                          if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
                                  vdev_stat_t *vs = &vd->vdev_stat;
!                                 int64_t vu, cu;
  
                                  vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
                                  cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
  
                                  /*
                                   * Calculate how much more or less we should
                                   * try to allocate from this device during
                                   * this iteration around the rotor.
--- 3169,3187 ----
                           * over- or under-used relative to the pool,
                           * and set an allocation bias to even it out.
                           */
                          if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
                                  vdev_stat_t *vs = &vd->vdev_stat;
!                                 vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
!                                 int64_t vu, cu, vu_io;
  
                                  vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
                                  cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
+                                 vu_io =
+                                     (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
+                                     (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
+                                     (vd->vdev_parent->vdev_children)) - 100;
  
                                  /*
                                   * Calculate how much more or less we should
                                   * try to allocate from this device during
                                   * this iteration around the rotor.
*** 3151,3160 ****
--- 3194,3222 ----
                                   * This reduces allocations by 307K for this
                                   * iteration.
                                   */
                                  mg->mg_bias = ((cu - vu) *
                                      (int64_t)mg->mg_aliquot) / 100;
+ 
+                                 /*
+                                  * Experimental DVA allocator: 0 space-based
+                                  * (the default), 1 latency-based, 2 hybrid.
+                                  */
+                                 switch (metaslab_alloc_dva_algorithm) {
+                                 case 1:
+                                         mg->mg_bias =
+                                             (vu_io * (int64_t)mg->mg_aliquot) /
+                                             100;
+                                         break;
+                                 case 2:
+                                         mg->mg_bias =
+                                             ((((cu - vu) + vu_io) / 2) *
+                                             (int64_t)mg->mg_aliquot) / 100;
+                                         break;
+                                 default:
+                                         break;
+                                 }
                          } else if (!metaslab_bias_enabled) {
                                  mg->mg_bias = 0;
                          }
  
                          if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
*** 3165,3174 ****
--- 3227,3238 ----
  
                          DVA_SET_VDEV(&dva[d], vd->vdev_id);
                          DVA_SET_OFFSET(&dva[d], offset);
                          DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
                          DVA_SET_ASIZE(&dva[d], asize);
+                         DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
+                             uint64_t, offset, uint64_t, psize);
  
                          return (0);
                  }
  next:
                  mc->mc_rotor = mg->mg_next;
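
For readability outside the diff markers, here is a minimal standalone C sketch (not part of the patch; the helper name and parameters are illustrative only) of the three bias modes selected by the metaslab_alloc_dva_algorithm tunable above. In it, cu and vu are the class and vdev space-utilization percentages, and vu_io is the vdev's percentage deviation from an even share of its parent's write I/O time, as computed in the hunk above.

    /*
     * Sketch only: how mg_bias is derived under each experimental
     * algorithm.  Inputs mirror the variables computed in the hunk above.
     */
    static int64_t
    bias_sketch(int algorithm, int64_t cu, int64_t vu, int64_t vu_io,
        int64_t aliquot)
    {
            switch (algorithm) {
            case 1:         /* latency-based */
                    return ((vu_io * aliquot) / 100);
            case 2:         /* hybrid: average the space and latency terms */
                    return (((((cu - vu) + vu_io) / 2) * aliquot) / 100);
            default:        /* 0: space-based, the stock calculation */
                    return (((cu - vu) * aliquot) / 100);
            }
    }
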
*** 3187,3418 ****
  
          metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
          return (SET_ERROR(ENOSPC));
  }
  
- void
- metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
-     uint64_t txg)
- {
-         metaslab_t *msp;
-         spa_t *spa = vd->vdev_spa;
- 
-         ASSERT3U(txg, ==, spa->spa_syncing_txg);
-         ASSERT(vdev_is_concrete(vd));
-         ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
-         ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
- 
-         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
- 
-         VERIFY(!msp->ms_condensing);
-         VERIFY3U(offset, >=, msp->ms_start);
-         VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
-         VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
-         VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
- 
-         metaslab_check_free_impl(vd, offset, asize);
-         mutex_enter(&msp->ms_lock);
-         if (range_tree_space(msp->ms_freeingtree) == 0) {
-                 vdev_dirty(vd, VDD_METASLAB, msp, txg);
-         }
-         range_tree_add(msp->ms_freeingtree, offset, asize);
-         mutex_exit(&msp->ms_lock);
- }
- 
- /* ARGSUSED */
- void
- metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-     uint64_t size, void *arg)
- {
-         uint64_t *txgp = arg;
- 
-         if (vd->vdev_ops->vdev_op_remap != NULL)
-                 vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
-         else
-                 metaslab_free_impl(vd, offset, size, *txgp);
- }
- 
- static void
- metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
-     uint64_t txg)
- {
-         spa_t *spa = vd->vdev_spa;
- 
-         ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
- 
-         if (txg > spa_freeze_txg(spa))
-                 return;
- 
-         if (spa->spa_vdev_removal != NULL &&
-             spa->spa_vdev_removal->svr_vdev == vd &&
-             vdev_is_concrete(vd)) {
-                 /*
-                  * Note: we check if the vdev is concrete because when
-                  * we complete the removal, we first change the vdev to be
-                  * an indirect vdev (in open context), and then (in syncing
-                  * context) clear spa_vdev_removal.
-                  */
-                 free_from_removing_vdev(vd, offset, size, txg);
-         } else if (vd->vdev_ops->vdev_op_remap != NULL) {
-                 vdev_indirect_mark_obsolete(vd, offset, size, txg);
-                 vd->vdev_ops->vdev_op_remap(vd, offset, size,
-                     metaslab_free_impl_cb, &txg);
-         } else {
-                 metaslab_free_concrete(vd, offset, size, txg);
-         }
- }
- 
- typedef struct remap_blkptr_cb_arg {
-         blkptr_t *rbca_bp;
-         spa_remap_cb_t rbca_cb;
-         vdev_t *rbca_remap_vd;
-         uint64_t rbca_remap_offset;
-         void *rbca_cb_arg;
- } remap_blkptr_cb_arg_t;
- 
- void
- remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-     uint64_t size, void *arg)
- {
-         remap_blkptr_cb_arg_t *rbca = arg;
-         blkptr_t *bp = rbca->rbca_bp;
- 
-         /* We can not remap split blocks. */
-         if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
-                 return;
-         ASSERT0(inner_offset);
- 
-         if (rbca->rbca_cb != NULL) {
-                 /*
-                  * At this point we know that we are not handling split
-                  * blocks and we invoke the callback on the previous
-                  * vdev which must be indirect.
-                  */
-                 ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
- 
-                 rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
-                     rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
- 
-                 /* set up remap_blkptr_cb_arg for the next call */
-                 rbca->rbca_remap_vd = vd;
-                 rbca->rbca_remap_offset = offset;
-         }
- 
-         /*
-          * The phys birth time is that of dva[0].  This ensures that we know
-          * when each dva was written, so that resilver can determine which
-          * blocks need to be scrubbed (i.e. those written during the time
-          * the vdev was offline).  It also ensures that the key used in
-          * the ARC hash table is unique (i.e. dva[0] + phys_birth).  If
-          * we didn't change the phys_birth, a lookup in the ARC for a
-          * remapped BP could find the data that was previously stored at
-          * this vdev + offset.
-          */
-         vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
-             DVA_GET_VDEV(&bp->blk_dva[0]));
-         vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
-         bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
-             DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
- 
-         DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
-         DVA_SET_OFFSET(&bp->blk_dva[0], offset);
- }
- 
  /*
!  * If the block pointer contains any indirect DVAs, modify them to refer to
!  * concrete DVAs.  Note that this will sometimes not be possible, leaving
!  * the indirect DVA in place.  This happens if the indirect DVA spans multiple
!  * segments in the mapping (i.e. it is a "split block").
!  *
!  * If the BP was remapped, calls the callback on the original dva (note the
!  * callback can be called multiple times if the original indirect DVA refers
!  * to another indirect DVA, etc).
!  *
!  * Returns TRUE if the BP was remapped.
   */
- boolean_t
- spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
- {
-         remap_blkptr_cb_arg_t rbca;
- 
-         if (!zfs_remap_blkptr_enable)
-                 return (B_FALSE);
- 
-         if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
-                 return (B_FALSE);
- 
-         /*
-          * Dedup BP's can not be remapped, because ddt_phys_select() depends
-          * on DVA[0] being the same in the BP as in the DDT (dedup table).
-          */
-         if (BP_GET_DEDUP(bp))
-                 return (B_FALSE);
- 
-         /*
-          * Gang blocks can not be remapped, because
-          * zio_checksum_gang_verifier() depends on the DVA[0] that's in
-          * the BP used to read the gang block header (GBH) being the same
-          * as the DVA[0] that we allocated for the GBH.
-          */
-         if (BP_IS_GANG(bp))
-                 return (B_FALSE);
- 
-         /*
-          * Embedded BP's have no DVA to remap.
-          */
-         if (BP_GET_NDVAS(bp) < 1)
-                 return (B_FALSE);
- 
-         /*
-          * Note: we only remap dva[0].  If we remapped other dvas, we
-          * would no longer know what their phys birth txg is.
-          */
-         dva_t *dva = &bp->blk_dva[0];
- 
-         uint64_t offset = DVA_GET_OFFSET(dva);
-         uint64_t size = DVA_GET_ASIZE(dva);
-         vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
- 
-         if (vd->vdev_ops->vdev_op_remap == NULL)
-                 return (B_FALSE);
- 
-         rbca.rbca_bp = bp;
-         rbca.rbca_cb = callback;
-         rbca.rbca_remap_vd = vd;
-         rbca.rbca_remap_offset = offset;
-         rbca.rbca_cb_arg = arg;
- 
-         /*
-          * remap_blkptr_cb() will be called in order for each level of
-          * indirection, until a concrete vdev is reached or a split block is
-          * encountered. old_vd and old_offset are updated within the callback
-          * as we go from the one indirect vdev to the next one (either concrete
-          * or indirect again) in that order.
-          */
-         vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
- 
-         /* Check if the DVA wasn't remapped because it is a split block */
-         if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
-                 return (B_FALSE);
- 
-         return (B_TRUE);
- }
- 
- /*
-  * Undo the allocation of a DVA which happened in the given transaction group.
-  */
  void
! metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
  {
-         metaslab_t *msp;
-         vdev_t *vd;
          uint64_t vdev = DVA_GET_VDEV(dva);
          uint64_t offset = DVA_GET_OFFSET(dva);
          uint64_t size = DVA_GET_ASIZE(dva);
  
          ASSERT(DVA_IS_VALID(dva));
-         ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
  
          if (txg > spa_freeze_txg(spa))
                  return;
  
          if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
--- 3251,3277 ----
  
          metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
          return (SET_ERROR(ENOSPC));
  }
  
  /*
!  * Free the block represented by DVA in the context of the specified
!  * transaction group.
   */
  void
! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
  {
          uint64_t vdev = DVA_GET_VDEV(dva);
          uint64_t offset = DVA_GET_OFFSET(dva);
          uint64_t size = DVA_GET_ASIZE(dva);
+         vdev_t *vd;
+         metaslab_t *msp;
  
+         DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
+             uint64_t, offset, uint64_t, size);
+ 
          ASSERT(DVA_IS_VALID(dva));
  
          if (txg > spa_freeze_txg(spa))
                  return;
  
          if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
*** 3421,3441 ****
                      (u_longlong_t)vdev, (u_longlong_t)offset);
                  ASSERT(0);
                  return;
          }
  
!         ASSERT(!vd->vdev_removing);
!         ASSERT(vdev_is_concrete(vd));
!         ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
!         ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
  
          if (DVA_GET_GANG(dva))
                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
  
-         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
- 
          mutex_enter(&msp->ms_lock);
          range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
              offset, size);
  
          VERIFY(!msp->ms_condensing);
          VERIFY3U(offset, >=, msp->ms_start);
--- 3280,3297 ----
                      (u_longlong_t)vdev, (u_longlong_t)offset);
                  ASSERT(0);
                  return;
          }
  
!         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
  
          if (DVA_GET_GANG(dva))
                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
  
          mutex_enter(&msp->ms_lock);
+ 
+         if (now) {
                  range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
                      offset, size);
  
                  VERIFY(!msp->ms_condensing);
                  VERIFY3U(offset, >=, msp->ms_start);
*** 3443,3475 ****
          VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
              msp->ms_size);
          VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
          VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
          range_tree_add(msp->ms_tree, offset, size);
          mutex_exit(&msp->ms_lock);
  }
  
  /*
!  * Free the block represented by DVA in the context of the specified
!  * transaction group.
   */
! void
! metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
  {
          uint64_t vdev = DVA_GET_VDEV(dva);
          uint64_t offset = DVA_GET_OFFSET(dva);
          uint64_t size = DVA_GET_ASIZE(dva);
!         vdev_t *vd = vdev_lookup_top(spa, vdev);
  
          ASSERT(DVA_IS_VALID(dva));
-         ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
  
!         if (DVA_GET_GANG(dva)) {
                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
          }
  
!         metaslab_free_impl(vd, offset, size, txg);
  }
  
  /*
   * Reserve some allocation slots. The reservation system must be called
   * before we call into the allocator. If there aren't any available slots
--- 3299,3378 ----
                  VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
                      msp->ms_size);
                  VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
                  VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
                  range_tree_add(msp->ms_tree, offset, size);
+                 if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
+                     !vd->vdev_man_trimming)
+                         metaslab_trim_add(msp, offset, size);
+                 msp->ms_max_size = metaslab_block_maxsize(msp);
+         } else {
+                 VERIFY3U(txg, ==, spa->spa_syncing_txg);
+                 if (range_tree_space(msp->ms_freeingtree) == 0)
+                         vdev_dirty(vd, VDD_METASLAB, msp, txg);
+                 range_tree_add(msp->ms_freeingtree, offset, size);
+         }
+ 
          mutex_exit(&msp->ms_lock);
  }
  
  /*
!  * Intent log support: upon opening the pool after a crash, notify the SPA
!  * of blocks that the intent log has allocated for immediate write, but
!  * which are still considered free by the SPA because the last transaction
!  * group didn't commit yet.
   */
! static int
! metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
  {
          uint64_t vdev = DVA_GET_VDEV(dva);
          uint64_t offset = DVA_GET_OFFSET(dva);
          uint64_t size = DVA_GET_ASIZE(dva);
!         vdev_t *vd;
!         metaslab_t *msp;
!         int error = 0;
  
          ASSERT(DVA_IS_VALID(dva));
  
!         if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
!             (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
!                 return (SET_ERROR(ENXIO));
! 
!         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
! 
!         if (DVA_GET_GANG(dva))
                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
+ 
+         mutex_enter(&msp->ms_lock);
+ 
+         if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
+                 error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
+ 
+         if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
+                 error = SET_ERROR(ENOENT);
+ 
+         if (error || txg == 0) {        /* txg == 0 indicates dry run */
+                 mutex_exit(&msp->ms_lock);
+                 return (error);
          }
  
!         VERIFY(!msp->ms_condensing);
!         VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
!         VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
!         VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
!         range_tree_remove(msp->ms_tree, offset, size);
!         metaslab_trim_remove(msp, offset, size);
! 
!         if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
!                 if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
!                         vdev_dirty(vd, VDD_METASLAB, msp, txg);
!                 range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
!         }
! 
!         mutex_exit(&msp->ms_lock);
! 
!         return (0);
  }
  
  /*
   * Reserve some allocation slots. The reservation system must be called
   * before we call into the allocator. If there aren't any available slots
*** 3516,3642 ****
                  (void) refcount_remove(&mc->mc_alloc_slots, zio);
          }
          mutex_exit(&mc->mc_lock);
  }
  
- static int
- metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
-     uint64_t txg)
- {
-         metaslab_t *msp;
-         spa_t *spa = vd->vdev_spa;
-         int error = 0;
- 
-         if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
-                 return (ENXIO);
- 
-         ASSERT3P(vd->vdev_ms, !=, NULL);
-         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
- 
-         mutex_enter(&msp->ms_lock);
- 
-         if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
-                 error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
- 
-         if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
-                 error = SET_ERROR(ENOENT);
- 
-         if (error || txg == 0) {        /* txg == 0 indicates dry run */
-                 mutex_exit(&msp->ms_lock);
-                 return (error);
-         }
- 
-         VERIFY(!msp->ms_condensing);
-         VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
-         VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
-         VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
-         range_tree_remove(msp->ms_tree, offset, size);
- 
-         if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
-                 if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
-                         vdev_dirty(vd, VDD_METASLAB, msp, txg);
-                 range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
-         }
- 
-         mutex_exit(&msp->ms_lock);
- 
-         return (0);
- }
- 
- typedef struct metaslab_claim_cb_arg_t {
-         uint64_t        mcca_txg;
-         int             mcca_error;
- } metaslab_claim_cb_arg_t;
- 
- /* ARGSUSED */
- static void
- metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
-     uint64_t size, void *arg)
- {
-         metaslab_claim_cb_arg_t *mcca_arg = arg;
- 
-         if (mcca_arg->mcca_error == 0) {
-                 mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
-                     size, mcca_arg->mcca_txg);
-         }
- }
- 
  int
- metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
- {
-         if (vd->vdev_ops->vdev_op_remap != NULL) {
-                 metaslab_claim_cb_arg_t arg;
- 
-                 /*
-                  * Only zdb(1M) can claim on indirect vdevs.  This is used
-                  * to detect leaks of mapped space (that are not accounted
-                  * for in the obsolete counts, spacemap, or bpobj).
-                  */
-                 ASSERT(!spa_writeable(vd->vdev_spa));
-                 arg.mcca_error = 0;
-                 arg.mcca_txg = txg;
- 
-                 vd->vdev_ops->vdev_op_remap(vd, offset, size,
-                     metaslab_claim_impl_cb, &arg);
- 
-                 if (arg.mcca_error == 0) {
-                         arg.mcca_error = metaslab_claim_concrete(vd,
-                             offset, size, txg);
-                 }
-                 return (arg.mcca_error);
-         } else {
-                 return (metaslab_claim_concrete(vd, offset, size, txg));
-         }
- }
- 
- /*
-  * Intent log support: upon opening the pool after a crash, notify the SPA
-  * of blocks that the intent log has allocated for immediate write, but
-  * which are still considered free by the SPA because the last transaction
-  * group didn't commit yet.
-  */
- static int
- metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
- {
-         uint64_t vdev = DVA_GET_VDEV(dva);
-         uint64_t offset = DVA_GET_OFFSET(dva);
-         uint64_t size = DVA_GET_ASIZE(dva);
-         vdev_t *vd;
- 
-         if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
-                 return (SET_ERROR(ENXIO));
-         }
- 
-         ASSERT(DVA_IS_VALID(dva));
- 
-         if (DVA_GET_GANG(dva))
-                 size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
- 
-         return (metaslab_claim_impl(vd, offset, size, txg));
- }
- 
- int
  metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
      int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
      zio_alloc_list_t *zal, zio_t *zio)
  {
          dva_t *dva = bp->blk_dva;
--- 3419,3429 ----
*** 3656,3671 ****
          ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
          ASSERT(BP_GET_NDVAS(bp) == 0);
          ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
          ASSERT3P(zal, !=, NULL);
  
          for (int d = 0; d < ndvas; d++) {
!                 error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
!                     txg, flags, zal);
                  if (error != 0) {
                          for (d--; d >= 0; d--) {
!                                 metaslab_unalloc_dva(spa, &dva[d], txg);
                                  metaslab_group_alloc_decrement(spa,
                                      DVA_GET_VDEV(&dva[d]), zio, flags);
                                  bzero(&dva[d], sizeof (dva_t));
                          }
                          spa_config_exit(spa, SCL_ALLOC, FTAG);
--- 3443,3502 ----
          ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
          ASSERT(BP_GET_NDVAS(bp) == 0);
          ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
          ASSERT3P(zal, !=, NULL);
  
+         if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
+             !(flags & (METASLAB_GANG_HEADER)) &&
+             !(spa->spa_meta_policy.spa_small_data_to_special &&
+             psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
+                 error = metaslab_alloc_dva(spa, spa_normal_class(spa),
+                     psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
+                     flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
+                 if (error == 0) {
+                         error = metaslab_alloc_dva(spa, mc, psize,
+                             &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
+                         if (error != 0) {
+                                 error = 0;
+                                 /*
+                                  * Move the NORMAL DVA into the first slot
+                                  * and clear the second DVA. After that this
+                                  * BP is just a regular BP with one DVA.
+                                  *
+                                  * This operation is valid only if:
+                                  * WBC_SPECIAL_DVA is dva[0]
+                                  * WBC_NORMAL_DVA is dva[1]
+                                  *
+                                  * see wbc.h
+                                  */
+                                 bcopy(&dva[WBC_NORMAL_DVA],
+                                     &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
+                                 bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
+ 
+                                 /*
+                                  * Allocation of the special DVA has failed,
+                                  * so this BP will be a regular BP and we
+                                  * need to update the metaslab group's queue
+                                  * depth based on the newly allocated DVA.
+                                  */
+                                 metaslab_group_alloc_increment(spa,
+                                     DVA_GET_VDEV(&dva[0]), zio, flags);
+                         } else {
+                                 BP_SET_SPECIAL(bp, 1);
+                         }
+                 } else {
+                         spa_config_exit(spa, SCL_ALLOC, FTAG);
+                         return (error);
+                 }
+         } else {
                  for (int d = 0; d < ndvas; d++) {
!                         error = metaslab_alloc_dva(spa, mc, psize, dva, d,
!                             hintdva, txg, flags, zal);
                          if (error != 0) {
                                  for (d--; d >= 0; d--) {
!                                         metaslab_free_dva(spa, &dva[d],
!                                             txg, B_TRUE);
                                          metaslab_group_alloc_decrement(spa,
                                              DVA_GET_VDEV(&dva[d]), zio, flags);
                                          bzero(&dva[d], sizeof (dva_t));
                                  }
                                  spa_config_exit(spa, SCL_ALLOC, FTAG);
*** 3676,3689 ****
                           * based on the newly allocated dva.
                           */
                          metaslab_group_alloc_increment(spa,
                              DVA_GET_VDEV(&dva[d]), zio, flags);
                  }
- 
          }
-         ASSERT(error == 0);
          ASSERT(BP_GET_NDVAS(bp) == ndvas);
  
          spa_config_exit(spa, SCL_ALLOC, FTAG);
  
          BP_SET_BIRTH(bp, txg, txg);
  
--- 3507,3520 ----
                                   * based on the newly allocated dva.
                                   */
                                  metaslab_group_alloc_increment(spa,
                                      DVA_GET_VDEV(&dva[d]), zio, flags);
                          }
                  }
                  ASSERT(BP_GET_NDVAS(bp) == ndvas);
+         }
+         ASSERT(error == 0);
  
          spa_config_exit(spa, SCL_ALLOC, FTAG);
  
          BP_SET_BIRTH(bp, txg, txg);
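
The fallback in the special-class branch above depends on a fixed DVA slot layout. The following standalone sketch (not part of the patch; the helper name is invented, and the slot indices follow the comment stating that WBC_SPECIAL_DVA is dva[0] and WBC_NORMAL_DVA is dva[1], see wbc.h) shows the demotion of a would-be special BP to a plain single-DVA BP when the special-class allocation fails:

    static void
    wbc_demote_sketch(blkptr_t *bp)
    {
            dva_t *dva = bp->blk_dva;

            /* Move the normal-class DVA into slot 0 (WBC_SPECIAL_DVA)... */
            bcopy(&dva[1], &dva[0], sizeof (dva_t));
            /* ...and clear slot 1 (WBC_NORMAL_DVA); the BP now has one DVA. */
            bzero(&dva[1], sizeof (dva_t));
            /* BP_SET_SPECIAL() is only called on the success path above. */
    }
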
  
*** 3699,3715 ****
          ASSERT(!BP_IS_HOLE(bp));
          ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
  
          spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
  
!         for (int d = 0; d < ndvas; d++) {
!                 if (now) {
!                         metaslab_unalloc_dva(spa, &dva[d], txg);
                  } else {
!                         metaslab_free_dva(spa, &dva[d], txg);
                  }
-         }
  
          spa_config_exit(spa, SCL_FREE, FTAG);
  }
  
  int
--- 3530,3561 ----
          ASSERT(!BP_IS_HOLE(bp));
          ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
  
          spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
  
!         if (BP_IS_SPECIAL(bp)) {
!                 int start_dva;
!                 wbc_data_t *wbc_data = spa_get_wbc_data(spa);
! 
!                 mutex_enter(&wbc_data->wbc_lock);
!                 start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
!                 mutex_exit(&wbc_data->wbc_lock);
! 
!                 /*
!                  * The actual freeing need not be done under the
!                  * lock, because the block has already been removed
!                  * from the WBC trees and thus will not be moved.
!                  */
!                 metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
!                 if (start_dva == 0) {
!                         metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
!                             txg, now);
!                 }
          } else {
!                 for (int d = 0; d < ndvas; d++)
!                         metaslab_free_dva(spa, &dva[d], txg, now);
          }
  
          spa_config_exit(spa, SCL_FREE, FTAG);
  }
  
  int
*** 3730,3810 ****
                          return (error);
          }
  
          spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
  
          for (int d = 0; d < ndvas; d++)
!                 if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
                          break;
  
          spa_config_exit(spa, SCL_ALLOC, FTAG);
  
          ASSERT(error == 0 || txg == 0);
  
          return (error);
  }
  
! /* ARGSUSED */
! static void
! metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
!     uint64_t size, void *arg)
  {
-         if (vd->vdev_ops == &vdev_indirect_ops)
-                 return;
- 
-         metaslab_check_free_impl(vd, offset, size);
- }
- 
- static void
- metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
- {
-         metaslab_t *msp;
-         spa_t *spa = vd->vdev_spa;
- 
          if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
                  return;
  
!         if (vd->vdev_ops->vdev_op_remap != NULL) {
!                 vd->vdev_ops->vdev_op_remap(vd, offset, size,
!                     metaslab_check_free_impl_cb, NULL);
                  return;
          }
  
!         ASSERT(vdev_is_concrete(vd));
!         ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
!         ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
  
!         msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
! 
!         mutex_enter(&msp->ms_lock);
!         if (msp->ms_loaded)
                  range_tree_verify(msp->ms_tree, offset, size);
  
          range_tree_verify(msp->ms_freeingtree, offset, size);
          range_tree_verify(msp->ms_freedtree, offset, size);
          for (int j = 0; j < TXG_DEFER_SIZE; j++)
                  range_tree_verify(msp->ms_defertree[j], offset, size);
          mutex_exit(&msp->ms_lock);
  }
  
  void
! metaslab_check_free(spa_t *spa, const blkptr_t *bp)
  {
!         if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
!                 return;
  
!         spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
!         for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
!                 uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
!                 vdev_t *vd = vdev_lookup_top(spa, vdev);
!                 uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
!                 uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
  
!                 if (DVA_GET_GANG(&bp->blk_dva[i]))
!                         size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
  
!                 ASSERT3P(vd, !=, NULL);
  
!                 metaslab_check_free_impl(vd, offset, size);
          }
!         spa_config_exit(spa, SCL_VDEV, FTAG);
  }
--- 3576,3921 ----
                          return (error);
          }
  
          spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
  
+         if (BP_IS_SPECIAL(bp)) {
+                 int start_dva;
+                 wbc_data_t *wbc_data = spa_get_wbc_data(spa);
+ 
+                 mutex_enter(&wbc_data->wbc_lock);
+                 start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
+ 
+                 /*
+                  * Claiming of WBC blocks must be done under the lock to
+                  * ensure zdb will not fail. The only other user of claiming
+                  * is the ZIL, whose blocks cannot be WBC blocks, so the
+                  * lock is not held for them.
+                  */
+                 error = metaslab_claim_dva(spa,
+                     &dva[WBC_NORMAL_DVA], txg);
+                 if (error == 0 && start_dva == 0) {
+                         error = metaslab_claim_dva(spa,
+                             &dva[WBC_SPECIAL_DVA], txg);
+                 }
+ 
+                 mutex_exit(&wbc_data->wbc_lock);
+         } else {
                  for (int d = 0; d < ndvas; d++)
!                         if ((error = metaslab_claim_dva(spa,
!                             &dva[d], txg)) != 0)
                                  break;
+         }
  
          spa_config_exit(spa, SCL_ALLOC, FTAG);
  
          ASSERT(error == 0 || txg == 0);
  
          return (error);
  }
  
! void
! metaslab_check_free(spa_t *spa, const blkptr_t *bp)
  {
          if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
                  return;
  
!         if (BP_IS_SPECIAL(bp)) {
!                 /* Do not check frees for WBC blocks */
                  return;
          }
  
!         spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
!         for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
!                 uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
!                 vdev_t *vd = vdev_lookup_top(spa, vdev);
!                 uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
!                 uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
!                 metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
  
!                 if (msp->ms_loaded) {
                          range_tree_verify(msp->ms_tree, offset, size);
+                         range_tree_verify(msp->ms_cur_ts->ts_tree,
+                             offset, size);
+                         if (msp->ms_prev_ts != NULL) {
+                                 range_tree_verify(msp->ms_prev_ts->ts_tree,
+                                     offset, size);
+                         }
+                 }
  
                  range_tree_verify(msp->ms_freeingtree, offset, size);
                  range_tree_verify(msp->ms_freedtree, offset, size);
                  for (int j = 0; j < TXG_DEFER_SIZE; j++)
                          range_tree_verify(msp->ms_defertree[j], offset, size);
+         }
+         spa_config_exit(spa, SCL_VDEV, FTAG);
+ }
+ 
+ /*
+  * Trims all free space in the metaslab. Returns the root TRIM zio (that the
+  * caller should zio_wait() for) and the amount of space in the metaslab that
+  * has been scheduled for trimming in the `delta' return argument.
+  */
+ zio_t *
+ metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
+ {
+         boolean_t was_loaded;
+         uint64_t trimmed_space;
+         zio_t *trim_io;
+ 
+         ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
+ 
+         mutex_enter(&msp->ms_lock);
+ 
+         while (msp->ms_loading)
+                 metaslab_load_wait(msp);
+         /* If we loaded the metaslab, unload it when we're done. */
+         was_loaded = msp->ms_loaded;
+         if (!was_loaded) {
+                 if (metaslab_load(msp) != 0) {
                          mutex_exit(&msp->ms_lock);
+                         return (NULL);
+                 }
+         }
+         /* Flush out any scheduled extents and add everything in ms_tree. */
+         range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
+         range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
+ 
+         /* Force this trim to take place ASAP. */
+         if (msp->ms_prev_ts != NULL)
+                 metaslab_free_trimset(msp->ms_prev_ts);
+         msp->ms_prev_ts = msp->ms_cur_ts;
+         msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
+         trimmed_space = range_tree_space(msp->ms_tree);
+         if (!was_loaded)
+                 metaslab_unload(msp);
+ 
+         trim_io = metaslab_exec_trim(msp);
+         mutex_exit(&msp->ms_lock);
+         *delta = trimmed_space;
+ 
+         return (trim_io);
  }
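
A hypothetical caller sketch (not part of the patch; the helper name and the surrounding loop over metaslabs are assumed, not shown) of the contract described above, in which the caller waits on the returned root TRIM zio and accounts the space reported through `delta':

    static void
    trim_one_metaslab_sketch(metaslab_t *msp, uint64_t *total_trimmed)
    {
            uint64_t delta = 0;
            zio_t *zio = metaslab_trim_all(msp, &delta);

            if (zio != NULL)
                    (void) zio_wait(zio);
            *total_trimmed += delta;
    }
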
  
+ /*
+  * Notifies the trimsets in a metaslab that an extent has been allocated.
+  * This removes the segment from the queues of extents awaiting trimming.
+  */
+ static void
+ metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
+ {
+         metaslab_t *msp = arg;
+ 
+         range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
+         if (msp->ms_prev_ts != NULL) {
+                 range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
+                     size);
+         }
+ }
+ 
+ /*
+  * Notifies the trimsets in a metaslab that an extent has been freed.
+  * This adds the segment to the currently open queue of extents awaiting
+  * trimming.
+  */
+ static void
+ metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
+ {
+         metaslab_t *msp = arg;
+         ASSERT(msp->ms_cur_ts != NULL);
+         range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
+ }
+ 
+ /*
+  * Performs a metaslab's automatic trim processing. This must be called
+  * from metaslab_sync with the number of the syncing txg. Trims are issued
+  * at intervals dictated by the zfs_txgs_per_trim tunable.
+  */
  void
! metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
  {
!         /* Copy the tunable once so its value stays consistent below. */
!         uint64_t txgs_per_trim = zfs_txgs_per_trim;
  
!         ASSERT(!MUTEX_HELD(&msp->ms_lock));
!         mutex_enter(&msp->ms_lock);
  
!         /*
!          * Since we typically have hundreds of metaslabs per vdev, but we only
!          * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
!          * could sequence the TRIM commands from all metaslabs so that they
!          * don't all always pound the device in the same txg. We do so by
!          * artificially inflating the birth txg of the first trim set by a
!          * sequence number derived from the metaslab's starting offset
!          * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
!          * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
!          *
!          * If we detect that the txg has advanced too far ahead of ts_birth,
!          * it means our birth txg is out of lockstep. Recompute it by
!          * rounding down to the nearest zfs_txgs_per_trim multiple and adding
!          * our metaslab id modulo zfs_txgs_per_trim.
!          */
!         if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
!                 msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
!                     txgs_per_trim + (msp->ms_id % txgs_per_trim);
!         }
  
!         /* Time to swap out the current and previous trimsets */
!         if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
!                 if (msp->ms_prev_ts != NULL) {
!                         if (msp->ms_trimming_ts != NULL) {
!                                 spa_t *spa = msp->ms_group->mg_class->mc_spa;
!                                 /*
!                                  * The previous trim run is still ongoing, so
!                                  * the device is reacting slowly to our trim
!                                  * requests. Drop this trimset, so as not to
!                                  * back the device up with trim requests.
!                                  */
!                                 spa_trimstats_auto_slow_incr(spa);
!                                 metaslab_free_trimset(msp->ms_prev_ts);
!                         } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
!                                 /*
!                                  * If a manual trim is ongoing, we want to
!                                  * inhibit autotrim temporarily so it doesn't
!                                  * slow down the manual trim.
!                                  */
!                                 metaslab_free_trimset(msp->ms_prev_ts);
!                         } else {
!                                 /*
!                                  * Trim out aged extents on the vdevs - these
!                                  * are safe to be destroyed now. We'll keep
!                                  * the trimset around to deny allocations from
!                                  * these regions while the trims are ongoing.
!                                  */
!                                 zio_nowait(metaslab_exec_trim(msp));
!                         }
!                 }
!                 msp->ms_prev_ts = msp->ms_cur_ts;
!                 msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
!         }
!         mutex_exit(&msp->ms_lock);
! }
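
As a standalone sketch of the resynchronization arithmetic in metaslab_auto_trim() above (not part of the patch; the helper name is invented), the recomputed birth txg rounds down to the nearest zfs_txgs_per_trim boundary and adds the metaslab's slot, so with the example values from the comment (200 metaslabs, 32 txgs per trim) only about 6 or 7 metaslabs come due for a trim in any one txg:

    static uint64_t
    resync_trim_birth_sketch(uint64_t txg, uint64_t ms_id,
        uint64_t txgs_per_trim)
    {
            /* Round down to the interval boundary, then add this slot. */
            return ((txg / txgs_per_trim) * txgs_per_trim +
                (ms_id % txgs_per_trim));
    }
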
  
! static void
! metaslab_trim_done(zio_t *zio)
! {
!         metaslab_t *msp = zio->io_private;
!         boolean_t held;
! 
!         ASSERT(msp != NULL);
!         ASSERT(msp->ms_trimming_ts != NULL);
!         held = MUTEX_HELD(&msp->ms_lock);
!         if (!held)
!                 mutex_enter(&msp->ms_lock);
!         metaslab_free_trimset(msp->ms_trimming_ts);
!         msp->ms_trimming_ts = NULL;
!         cv_signal(&msp->ms_trim_cv);
!         if (!held)
!                 mutex_exit(&msp->ms_lock);
! }
! 
! /*
!  * Executes a zio_trim on a range tree holding freed extents in the metaslab.
!  */
! static zio_t *
! metaslab_exec_trim(metaslab_t *msp)
! {
!         metaslab_group_t *mg = msp->ms_group;
!         spa_t *spa = mg->mg_class->mc_spa;
!         vdev_t *vd = mg->mg_vd;
!         range_tree_t *trim_tree;
!         zio_t *zio;
! 
!         ASSERT(MUTEX_HELD(&msp->ms_lock));
! 
!         /* wait for a preceding trim to finish */
!         while (msp->ms_trimming_ts != NULL)
!                 cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
!         msp->ms_trimming_ts = msp->ms_prev_ts;
!         msp->ms_prev_ts = NULL;
!         trim_tree = msp->ms_trimming_ts->ts_tree;
! #ifdef  DEBUG
!         if (msp->ms_loaded) {
!                 for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
!                     rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
!                         if (!range_tree_contains(msp->ms_tree,
!                             rs->rs_start, rs->rs_end - rs->rs_start)) {
!                                 panic("trimming allocated region; mss=%p",
!                                     (void*)rs);
                          }
!                 }
!         }
! #endif
! 
!         /* Nothing to trim */
!         if (range_tree_space(trim_tree) == 0) {
!                 metaslab_free_trimset(msp->ms_trimming_ts);
!                 msp->ms_trimming_ts = NULL;
!                 return (zio_root(spa, NULL, NULL, 0));
!         }
!         zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
!             ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
!             ZIO_FLAG_CONFIG_WRITER, msp);
! 
!         return (zio);
! }
! 
! /*
!  * Allocates and initializes a new trimset structure. The `txg' argument
!  * indicates when this trimset was born and `lock' indicates the lock to
!  * link to the range tree.
!  */
! static metaslab_trimset_t *
! metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
! {
!         metaslab_trimset_t *ts;
! 
!         ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
!         ts->ts_birth = txg;
!         ts->ts_tree = range_tree_create(NULL, NULL, lock);
! 
!         return (ts);
! }
! 
! /*
!  * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
!  */
! static void
! metaslab_free_trimset(metaslab_trimset_t *ts)
! {
!         range_tree_vacate(ts->ts_tree, NULL, NULL);
!         range_tree_destroy(ts->ts_tree);
!         kmem_free(ts, sizeof (*ts));
! }
! 
! /*
!  * Checks whether an allocation conflicts with an ongoing trim operation in
!  * the given metaslab. This function takes a segment starting at `*offset'
!  * of `size' and checks whether it hits any region in the metaslab currently
!  * being trimmed. If yes, it tries to adjust the allocation to the end of
!  * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
!  * `limit' (no part of the allocation is allowed to go past this point).
!  *
!  * Returns B_FALSE if either the original allocation wasn't in conflict, or
!  * the conflict could be resolved by adjusting the value stored in `offset'
!  * such that the whole allocation still fits below `limit'. Returns B_TRUE
!  * if the allocation conflict couldn't be resolved.
!  */
! static boolean_t
! metaslab_check_trim_conflict(metaslab_t *msp, uint64_t *offset, uint64_t size,
!     uint64_t align, uint64_t limit)
! {
!         uint64_t new_offset;
! 
!         if (msp->ms_trimming_ts == NULL)
!                 /* no trim conflict, original offset is OK */
!                 return (B_FALSE);
! 
!         new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
!             *offset, size), align);
!         if (new_offset != *offset && new_offset + size > limit)
!                 /* trim conflict and adjustment not possible */
!                 return (B_TRUE);
! 
!         /* trim conflict, but adjusted offset still within limit */
!         *offset = new_offset;
!         return (B_FALSE);
  }
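
Finally, a hypothetical allocator-side sketch (not part of the patch; the helper name and loop context are assumed) of how metaslab_check_trim_conflict() is intended to be used: accept the possibly adjusted offset, or give up on this candidate when the conflict cannot be resolved below `limit':

    static uint64_t
    alloc_avoiding_trim_sketch(metaslab_t *msp, uint64_t offset, uint64_t size,
        uint64_t align, uint64_t limit)
    {
            if (metaslab_check_trim_conflict(msp, &offset, size, align,
                limit)) {
                    /* Conflict with an ongoing trim; cannot fit below limit. */
                    return (-1ULL);
            }
            /* `offset' may have been bumped past the region being trimmed. */
            return (offset);
    }
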