NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrectly
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
 * refactored WRC move-logic to use zio kmem caches
 * replaced the size and compression fields with a blk_prop field
   (the same as in blkptr_t) to slightly reduce the size of wrc_block_t,
   and use blkptr_t-style macros to get PSIZE, LSIZE
   and COMPRESSION
 * reduced atomic calls to make the CPU happier
 * removed unused code
 * fixed naming of variables
 * fixed a possible system panic after restarting the system
   with WRC enabled
 * fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures

          --- old/usr/src/uts/common/fs/zfs/metaslab.c
          +++ new/usr/src/uts/common/fs/zfs/metaslab.c
[ 15 lines elided ]
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   23   * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
  24   24   * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  25   25   * Copyright (c) 2014 Integros [integros.com]
       26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
  26   27   */
  27   28  
  28   29  #include <sys/zfs_context.h>
  29   30  #include <sys/dmu.h>
  30   31  #include <sys/dmu_tx.h>
  31   32  #include <sys/space_map.h>
  32   33  #include <sys/metaslab_impl.h>
  33   34  #include <sys/vdev_impl.h>
  34   35  #include <sys/zio.h>
  35   36  #include <sys/spa_impl.h>
  36   37  #include <sys/zfeature.h>
  37      -#include <sys/vdev_indirect_mapping.h>
       38 +#include <sys/wbc.h>
  38   39  
  39   40  #define GANG_ALLOCATION(flags) \
  40   41          ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
  41   42  
  42   43  uint64_t metaslab_aliquot = 512ULL << 10;
  43   44  uint64_t metaslab_gang_bang = SPA_MAXBLOCKSIZE + 1;     /* force gang blocks */
  44   45  
  45   46  /*
  46   47   * The in-core space map representation is more compact than its on-disk form.
  47   48   * The zfs_condense_pct determines how much more compact the in-core
[ 112 lines elided ]
 160  161   * Enable/disable lba weighting (i.e. outer tracks are given preference).
 161  162   */
 162  163  boolean_t metaslab_lba_weighting_enabled = B_TRUE;
 163  164  
 164  165  /*
 165  166   * Enable/disable metaslab group biasing.
 166  167   */
 167  168  boolean_t metaslab_bias_enabled = B_TRUE;
 168  169  
 169  170  /*
 170      - * Enable/disable remapping of indirect DVAs to their concrete vdevs.
 171      - */
 172      -boolean_t zfs_remap_blkptr_enable = B_TRUE;
 173      -
 174      -/*
 175  171   * Enable/disable segment-based metaslab selection.
 176  172   */
 177  173  boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
 178  174  
 179  175  /*
 180  176   * When using segment-based metaslab selection, we will continue
 181  177   * allocating from the active metaslab until we have exhausted
 182  178   * zfs_metaslab_switch_threshold of its buckets.
 183  179   */
 184  180  int zfs_metaslab_switch_threshold = 2;
[ 9 lines elided ]
 194  190   * in a given list when running in non-debug mode. We limit the number
 195  191   * of entries in non-debug mode to prevent us from using up too much memory.
 196  192   * The limit should be sufficiently large that we don't expect any allocation
  197  193   * to ever exceed this value. In debug mode, the system will panic if this
 198  194   * limit is ever reached allowing for further investigation.
 199  195   */
 200  196  uint64_t metaslab_trace_max_entries = 5000;
 201  197  
 202  198  static uint64_t metaslab_weight(metaslab_t *);
 203  199  static void metaslab_set_fragmentation(metaslab_t *);
 204      -static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
 205      -static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
 206  200  
 207  201  kmem_cache_t *metaslab_alloc_trace_cache;
 208  202  
 209  203  /*
       204 + * Select the DVA allocator: 0 = space-based (default), 1 = latency-based,
       205 + * 2 = hybrid.  A value other than 0, 1 or 2 is treated as 0.
      206 + */
      207 +int metaslab_alloc_dva_algorithm = 0;
      208 +
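
/*
 * Illustrative sketch, not part of this webrev: the comment above says any
 * value outside 0..2 falls back to the space-based allocator.  A caller
 * could normalize the tunable as below; the helper name is hypothetical.
 */
static int
metaslab_alloc_dva_algorithm_get(void)
{
        int alg = metaslab_alloc_dva_algorithm;

        /* 1 = latency-based, 2 = hybrid; anything else means 0 (space-based) */
        return ((alg == 1 || alg == 2) ? alg : 0);
}
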
      209 +/*
      210 + * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
      211 + * issued to the underlying vdev. We keep two range trees of extents
      212 + * (called "trim sets") to be trimmed per metaslab, the `current' and
       213 + * the `previous' TS. New frees are added to the current TS. Then,
      214 + * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
      215 + * TS becomes the `previous' TS and a new, blank TS is created to be
      216 + * the new `current', which will then start accumulating any new frees.
      217 + * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
      218 + * extents are trimmed, the TS is destroyed and the current TS again
      219 + * becomes the previous TS.
      220 + * This serves to fulfill two functions: aggregate many small frees
      221 + * into fewer larger trim operations (which should help with devices
      222 + * which do not take so kindly to them) and to allow for disaster
      223 + * recovery (extents won't get trimmed immediately, but instead only
       224 + * after passing this rather long timeout, thus preserving
      225 + * 'zfs import -F' functionality).
      226 + */
      227 +unsigned int zfs_txgs_per_trim = 32;
      228 +
      229 +static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
      230 +static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
      231 +
      232 +static zio_t *metaslab_exec_trim(metaslab_t *msp);
      233 +
      234 +static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
      235 +static void metaslab_free_trimset(metaslab_trimset_t *ts);
      236 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
      237 +    uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
      238 +
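
/*
 * Illustrative sketch, not part of this webrev: a minimal picture of the
 * trim-set rotation described above, using the helpers declared above.
 * Where the rotation actually happens, how the trim zio returned by
 * metaslab_exec_trim() is issued, and when the retired set is freed are
 * assumptions here; the real logic lives in the metaslab sync/trim paths.
 */
static void
metaslab_trimset_rotate_sketch(metaslab_t *msp, uint64_t txg)
{
        ASSERT(MUTEX_HELD(&msp->ms_lock));

        /* only rotate once every zfs_txgs_per_trim TXGs */
        if ((txg % zfs_txgs_per_trim) != 0)
                return;

        if (msp->ms_prev_ts != NULL) {
                zio_t *trim_io;

                /* issue TRIMs for extents that have aged a full cycle */
                trim_io = metaslab_exec_trim(msp);
                if (trim_io != NULL)
                        zio_nowait(trim_io);
        }

        /* the current trim set ages into the `previous' slot */
        msp->ms_prev_ts = msp->ms_cur_ts;
        msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
}
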
      239 +/*
 210  240   * ==========================================================================
 211  241   * Metaslab classes
 212  242   * ==========================================================================
 213  243   */
 214  244  metaslab_class_t *
 215  245  metaslab_class_create(spa_t *spa, metaslab_ops_t *ops)
 216  246  {
 217  247          metaslab_class_t *mc;
 218  248  
 219  249          mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
 220  250  
      251 +        mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
      252 +        avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
      253 +            sizeof (zio_t), offsetof(zio_t, io_alloc_node));
      254 +
 221  255          mc->mc_spa = spa;
 222  256          mc->mc_rotor = NULL;
 223  257          mc->mc_ops = ops;
 224  258          mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
 225  259          refcount_create_tracked(&mc->mc_alloc_slots);
 226  260  
 227  261          return (mc);
 228  262  }
 229  263  
 230  264  void
 231  265  metaslab_class_destroy(metaslab_class_t *mc)
 232  266  {
 233  267          ASSERT(mc->mc_rotor == NULL);
 234  268          ASSERT(mc->mc_alloc == 0);
 235  269          ASSERT(mc->mc_deferred == 0);
 236  270          ASSERT(mc->mc_space == 0);
 237  271          ASSERT(mc->mc_dspace == 0);
 238  272  
      273 +        avl_destroy(&mc->mc_alloc_tree);
      274 +        mutex_destroy(&mc->mc_alloc_lock);
      275 +
 239  276          refcount_destroy(&mc->mc_alloc_slots);
 240  277          mutex_destroy(&mc->mc_lock);
 241  278          kmem_free(mc, sizeof (metaslab_class_t));
 242  279  }
 243  280  
 244  281  int
 245  282  metaslab_class_validate(metaslab_class_t *mc)
 246  283  {
 247  284          metaslab_group_t *mg;
 248  285          vdev_t *vd;
[ 66 lines elided ]
 315  352              KM_SLEEP);
 316  353  
 317  354          for (int c = 0; c < rvd->vdev_children; c++) {
 318  355                  vdev_t *tvd = rvd->vdev_child[c];
 319  356                  metaslab_group_t *mg = tvd->vdev_mg;
 320  357  
 321  358                  /*
 322  359                   * Skip any holes, uninitialized top-levels, or
  323  360                   * vdevs that are not in this metaslab class.
 324  361                   */
 325      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      362 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 326  363                      mg->mg_class != mc) {
 327  364                          continue;
 328  365                  }
 329  366  
 330  367                  for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
 331  368                          mc_hist[i] += mg->mg_histogram[i];
 332  369          }
 333  370  
 334  371          for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
 335  372                  VERIFY3U(mc_hist[i], ==, mc->mc_histogram[i]);
[ 14 lines elided ]
 350  387          vdev_t *rvd = mc->mc_spa->spa_root_vdev;
 351  388          uint64_t fragmentation = 0;
 352  389  
 353  390          spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
 354  391  
 355  392          for (int c = 0; c < rvd->vdev_children; c++) {
 356  393                  vdev_t *tvd = rvd->vdev_child[c];
 357  394                  metaslab_group_t *mg = tvd->vdev_mg;
 358  395  
 359  396                  /*
 360      -                 * Skip any holes, uninitialized top-levels,
 361      -                 * or vdevs that are not in this metalab class.
      397 +                 * Skip any holes, uninitialized top-levels, or
       398 +                 * vdevs that are not in this metaslab class.
 362  399                   */
 363      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      400 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 364  401                      mg->mg_class != mc) {
 365  402                          continue;
 366  403                  }
 367  404  
 368  405                  /*
 369  406                   * If a metaslab group does not contain a fragmentation
 370  407                   * metric then just bail out.
 371  408                   */
 372  409                  if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
 373  410                          spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
[ 25 lines elided ]
 399  436  {
 400  437          vdev_t *rvd = mc->mc_spa->spa_root_vdev;
 401  438          uint64_t space = 0;
 402  439  
 403  440          spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
 404  441          for (int c = 0; c < rvd->vdev_children; c++) {
 405  442                  uint64_t tspace;
 406  443                  vdev_t *tvd = rvd->vdev_child[c];
 407  444                  metaslab_group_t *mg = tvd->vdev_mg;
 408  445  
 409      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      446 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 410  447                      mg->mg_class != mc) {
 411  448                          continue;
 412  449                  }
 413  450  
 414  451                  /*
 415  452                   * Calculate if we have enough space to add additional
 416  453                   * metaslabs. We report the expandable space in terms
 417  454                   * of the metaslab size since that's the unit of expansion.
 418  455                   * Adjust by efi system partition size.
 419  456                   */
[ 91 lines elided ]
 511  548  static void
 512  549  metaslab_group_alloc_update(metaslab_group_t *mg)
 513  550  {
 514  551          vdev_t *vd = mg->mg_vd;
 515  552          metaslab_class_t *mc = mg->mg_class;
 516  553          vdev_stat_t *vs = &vd->vdev_stat;
 517  554          boolean_t was_allocatable;
 518  555          boolean_t was_initialized;
 519  556  
 520  557          ASSERT(vd == vd->vdev_top);
 521      -        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
 522      -            SCL_ALLOC);
 523  558  
 524  559          mutex_enter(&mg->mg_lock);
 525  560          was_allocatable = mg->mg_allocatable;
 526  561          was_initialized = mg->mg_initialized;
 527  562  
 528  563          mg->mg_free_capacity = ((vs->vs_space - vs->vs_alloc) * 100) /
 529  564              (vs->vs_space + 1);
 530  565  
 531  566          mutex_enter(&mc->mc_lock);
 532  567  
[ 77 lines elided ]
 610  645  {
 611  646          ASSERT(mg->mg_prev == NULL);
 612  647          ASSERT(mg->mg_next == NULL);
 613  648          /*
 614  649           * We may have gone below zero with the activation count
 615  650           * either because we never activated in the first place or
 616  651           * because we're done, and possibly removing the vdev.
 617  652           */
 618  653          ASSERT(mg->mg_activation_count <= 0);
 619  654  
 620      -        taskq_destroy(mg->mg_taskq);
      655 +        if (mg->mg_taskq)
      656 +                taskq_destroy(mg->mg_taskq);
 621  657          avl_destroy(&mg->mg_metaslab_tree);
 622  658          mutex_destroy(&mg->mg_lock);
 623  659          refcount_destroy(&mg->mg_alloc_queue_depth);
 624  660          kmem_free(mg, sizeof (metaslab_group_t));
 625  661  }
 626  662  
 627  663  void
 628  664  metaslab_group_activate(metaslab_group_t *mg)
 629  665  {
 630  666          metaslab_class_t *mc = mg->mg_class;
 631  667          metaslab_group_t *mgprev, *mgnext;
 632  668  
 633      -        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
      669 +        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 634  670  
 635  671          ASSERT(mc->mc_rotor != mg);
 636  672          ASSERT(mg->mg_prev == NULL);
 637  673          ASSERT(mg->mg_next == NULL);
 638  674          ASSERT(mg->mg_activation_count <= 0);
 639  675  
 640  676          if (++mg->mg_activation_count <= 0)
 641  677                  return;
 642  678  
 643  679          mg->mg_aliquot = metaslab_aliquot * MAX(1, mg->mg_vd->vdev_children);
[ 5 lines elided ]
 649  685          } else {
 650  686                  mgnext = mgprev->mg_next;
 651  687                  mg->mg_prev = mgprev;
 652  688                  mg->mg_next = mgnext;
 653  689                  mgprev->mg_next = mg;
 654  690                  mgnext->mg_prev = mg;
 655  691          }
 656  692          mc->mc_rotor = mg;
 657  693  }
 658  694  
 659      -/*
 660      - * Passivate a metaslab group and remove it from the allocation rotor.
 661      - * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
 662      - * a metaslab group. This function will momentarily drop spa_config_locks
 663      - * that are lower than the SCL_ALLOC lock (see comment below).
 664      - */
 665  695  void
 666  696  metaslab_group_passivate(metaslab_group_t *mg)
 667  697  {
 668  698          metaslab_class_t *mc = mg->mg_class;
 669      -        spa_t *spa = mc->mc_spa;
 670  699          metaslab_group_t *mgprev, *mgnext;
 671      -        int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
 672  700  
 673      -        ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
 674      -            (SCL_ALLOC | SCL_ZIO));
      701 +        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 675  702  
 676  703          if (--mg->mg_activation_count != 0) {
 677  704                  ASSERT(mc->mc_rotor != mg);
 678  705                  ASSERT(mg->mg_prev == NULL);
 679  706                  ASSERT(mg->mg_next == NULL);
 680  707                  ASSERT(mg->mg_activation_count < 0);
 681  708                  return;
 682  709          }
 683  710  
 684      -        /*
 685      -         * The spa_config_lock is an array of rwlocks, ordered as
 686      -         * follows (from highest to lowest):
 687      -         *      SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
 688      -         *      SCL_ZIO > SCL_FREE > SCL_VDEV
 689      -         * (For more information about the spa_config_lock see spa_misc.c)
 690      -         * The higher the lock, the broader its coverage. When we passivate
 691      -         * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
 692      -         * config locks. However, the metaslab group's taskq might be trying
 693      -         * to preload metaslabs so we must drop the SCL_ZIO lock and any
 694      -         * lower locks to allow the I/O to complete. At a minimum,
 695      -         * we continue to hold the SCL_ALLOC lock, which prevents any future
 696      -         * allocations from taking place and any changes to the vdev tree.
 697      -         */
 698      -        spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
 699  711          taskq_wait(mg->mg_taskq);
 700      -        spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
 701  712          metaslab_group_alloc_update(mg);
 702  713  
 703  714          mgprev = mg->mg_prev;
 704  715          mgnext = mg->mg_next;
 705  716  
 706  717          if (mg == mgnext) {
 707  718                  mc->mc_rotor = NULL;
 708  719          } else {
 709  720                  mc->mc_rotor = mgnext;
 710  721                  mgprev->mg_next = mgnext;
[ 423 lines elided ]
1134 1145  
1135 1146          return (rs);
1136 1147  }
1137 1148  
1138 1149  /*
1139 1150   * This is a helper function that can be used by the allocator to find
1140 1151   * a suitable block to allocate. This will search the specified AVL
1141 1152   * tree looking for a block that matches the specified criteria.
1142 1153   */
1143 1154  static uint64_t
1144      -metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
1145      -    uint64_t align)
     1155 +metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
     1156 +    uint64_t size, uint64_t align)
1146 1157  {
1147 1158          range_seg_t *rs = metaslab_block_find(t, *cursor, size);
1148 1159  
1149      -        while (rs != NULL) {
     1160 +        for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
1150 1161                  uint64_t offset = P2ROUNDUP(rs->rs_start, align);
1151 1162  
1152      -                if (offset + size <= rs->rs_end) {
     1163 +                if (offset + size <= rs->rs_end &&
     1164 +                    !metaslab_check_trim_conflict(msp, &offset, size, align,
     1165 +                    rs->rs_end)) {
1153 1166                          *cursor = offset + size;
1154 1167                          return (offset);
1155 1168                  }
1156      -                rs = AVL_NEXT(t, rs);
1157 1169          }
1158 1170  
1159 1171          /*
1160 1172           * If we know we've searched the whole map (*cursor == 0), give up.
1161 1173           * Otherwise, reset the cursor to the beginning and try again.
1162 1174           */
1163 1175          if (*cursor == 0)
1164 1176                  return (-1ULL);
1165 1177  
1166 1178          *cursor = 0;
1167      -        return (metaslab_block_picker(t, cursor, size, align));
     1179 +        return (metaslab_block_picker(msp, t, cursor, size, align));
1168 1180  }
1169 1181  
1170 1182  /*
1171 1183   * ==========================================================================
1172 1184   * The first-fit block allocator
1173 1185   * ==========================================================================
1174 1186   */
1175 1187  static uint64_t
1176 1188  metaslab_ff_alloc(metaslab_t *msp, uint64_t size)
1177 1189  {
[ 1 line elided ]
1179 1191           * Find the largest power of 2 block size that evenly divides the
1180 1192           * requested size. This is used to try to allocate blocks with similar
1181 1193           * alignment from the same area of the metaslab (i.e. same cursor
 1182 1194   * bucket) but it does not guarantee that other allocation sizes
1183 1195           * may exist in the same region.
1184 1196           */
1185 1197          uint64_t align = size & -size;
1186 1198          uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
1187 1199          avl_tree_t *t = &msp->ms_tree->rt_root;
1188 1200  
1189      -        return (metaslab_block_picker(t, cursor, size, align));
     1201 +        return (metaslab_block_picker(msp, t, cursor, size, align));
1190 1202  }
1191 1203  
1192 1204  static metaslab_ops_t metaslab_ff_ops = {
1193 1205          metaslab_ff_alloc
1194 1206  };
1195 1207  
1196 1208  /*
1197 1209   * ==========================================================================
1198 1210   * Dynamic block allocator -
1199 1211   * Uses the first fit allocation scheme until space get low and then
[ 27 lines elided ]
1227 1239          /*
1228 1240           * If we're running low on space switch to using the size
1229 1241           * sorted AVL tree (best-fit).
1230 1242           */
1231 1243          if (max_size < metaslab_df_alloc_threshold ||
1232 1244              free_pct < metaslab_df_free_pct) {
1233 1245                  t = &msp->ms_size_tree;
1234 1246                  *cursor = 0;
1235 1247          }
1236 1248  
1237      -        return (metaslab_block_picker(t, cursor, size, 1ULL));
     1249 +        return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
1238 1250  }
1239 1251  
1240 1252  static metaslab_ops_t metaslab_df_ops = {
1241 1253          metaslab_df_alloc
1242 1254  };
1243 1255  
1244 1256  /*
1245 1257   * ==========================================================================
1246 1258   * Cursor fit block allocator -
1247 1259   * Select the largest region in the metaslab, set the cursor to the beginning
[ 11 lines elided ]
1259 1271          uint64_t *cursor_end = &msp->ms_lbas[1];
1260 1272          uint64_t offset = 0;
1261 1273  
1262 1274          ASSERT(MUTEX_HELD(&msp->ms_lock));
1263 1275          ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&rt->rt_root));
1264 1276  
1265 1277          ASSERT3U(*cursor_end, >=, *cursor);
1266 1278  
1267 1279          if ((*cursor + size) > *cursor_end) {
1268 1280                  range_seg_t *rs;
1269      -
1270      -                rs = avl_last(&msp->ms_size_tree);
1271      -                if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
     1281 +                for (rs = avl_last(&msp->ms_size_tree);
     1282 +                    rs != NULL && rs->rs_end - rs->rs_start >= size;
     1283 +                    rs = AVL_PREV(&msp->ms_size_tree, rs)) {
     1284 +                        *cursor = rs->rs_start;
     1285 +                        *cursor_end = rs->rs_end;
     1286 +                        if (!metaslab_check_trim_conflict(msp, cursor, size,
     1287 +                            1, *cursor_end)) {
     1288 +                                /* segment appears to be acceptable */
     1289 +                                break;
     1290 +                        }
     1291 +                }
     1292 +                if (rs == NULL || rs->rs_end - rs->rs_start < size)
1272 1293                          return (-1ULL);
1273      -
1274      -                *cursor = rs->rs_start;
1275      -                *cursor_end = rs->rs_end;
1276 1294          }
1277 1295  
1278 1296          offset = *cursor;
1279 1297          *cursor += size;
1280 1298  
1281 1299          return (offset);
1282 1300  }
1283 1301  
1284 1302  static metaslab_ops_t metaslab_cf_ops = {
1285 1303          metaslab_cf_alloc
[ 16 lines elided ]
1302 1320  
1303 1321  static uint64_t
1304 1322  metaslab_ndf_alloc(metaslab_t *msp, uint64_t size)
1305 1323  {
1306 1324          avl_tree_t *t = &msp->ms_tree->rt_root;
1307 1325          avl_index_t where;
1308 1326          range_seg_t *rs, rsearch;
1309 1327          uint64_t hbit = highbit64(size);
1310 1328          uint64_t *cursor = &msp->ms_lbas[hbit - 1];
1311 1329          uint64_t max_size = metaslab_block_maxsize(msp);
     1330 +        /* mutable copy for adjustment by metaslab_check_trim_conflict */
     1331 +        uint64_t adjustable_start;
1312 1332  
1313 1333          ASSERT(MUTEX_HELD(&msp->ms_lock));
1314 1334          ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
1315 1335  
1316 1336          if (max_size < size)
1317 1337                  return (-1ULL);
1318 1338  
1319 1339          rsearch.rs_start = *cursor;
1320 1340          rsearch.rs_end = *cursor + size;
1321 1341  
1322 1342          rs = avl_find(t, &rsearch, &where);
1323      -        if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
     1343 +        if (rs != NULL)
     1344 +                adjustable_start = rs->rs_start;
     1345 +        if (rs == NULL || rs->rs_end - adjustable_start < size ||
     1346 +            metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
     1347 +            rs->rs_end)) {
     1348 +                /* segment not usable, try the largest remaining one */
1324 1349                  t = &msp->ms_size_tree;
1325 1350  
1326 1351                  rsearch.rs_start = 0;
1327 1352                  rsearch.rs_end = MIN(max_size,
1328 1353                      1ULL << (hbit + metaslab_ndf_clump_shift));
1329 1354                  rs = avl_find(t, &rsearch, &where);
1330 1355                  if (rs == NULL)
1331 1356                          rs = avl_nearest(t, where, AVL_AFTER);
1332 1357                  ASSERT(rs != NULL);
     1358 +                adjustable_start = rs->rs_start;
     1359 +                if (rs->rs_end - adjustable_start < size ||
     1360 +                    metaslab_check_trim_conflict(msp, &adjustable_start,
     1361 +                    size, 1, rs->rs_end)) {
     1362 +                        /* even largest remaining segment not usable */
     1363 +                        return (-1ULL);
     1364 +                }
1333 1365          }
1334 1366  
1335      -        if ((rs->rs_end - rs->rs_start) >= size) {
1336      -                *cursor = rs->rs_start + size;
1337      -                return (rs->rs_start);
1338      -        }
1339      -        return (-1ULL);
     1367 +        *cursor = adjustable_start + size;
      1368 +        return (adjustable_start);
1340 1369  }
1341 1370  
1342 1371  static metaslab_ops_t metaslab_ndf_ops = {
1343 1372          metaslab_ndf_alloc
1344 1373  };
1345 1374  
1346 1375  metaslab_ops_t *zfs_metaslab_ops = &metaslab_df_ops;
1347 1376  
1348 1377  /*
1349 1378   * ==========================================================================
[ 19 lines elided ]
1369 1398  metaslab_load(metaslab_t *msp)
1370 1399  {
1371 1400          int error = 0;
1372 1401          boolean_t success = B_FALSE;
1373 1402  
1374 1403          ASSERT(MUTEX_HELD(&msp->ms_lock));
1375 1404          ASSERT(!msp->ms_loaded);
1376 1405          ASSERT(!msp->ms_loading);
1377 1406  
1378 1407          msp->ms_loading = B_TRUE;
1379      -        /*
1380      -         * Nobody else can manipulate a loading metaslab, so it's now safe
1381      -         * to drop the lock.  This way we don't have to hold the lock while
1382      -         * reading the spacemap from disk.
1383      -         */
1384      -        mutex_exit(&msp->ms_lock);
1385 1408  
1386 1409          /*
1387 1410           * If the space map has not been allocated yet, then treat
1388 1411           * all the space in the metaslab as free and add it to the
1389 1412           * ms_tree.
1390 1413           */
1391 1414          if (msp->ms_sm != NULL)
1392 1415                  error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
1393 1416          else
1394 1417                  range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
1395 1418  
1396 1419          success = (error == 0);
1397      -
1398      -        mutex_enter(&msp->ms_lock);
1399 1420          msp->ms_loading = B_FALSE;
1400 1421  
1401 1422          if (success) {
1402 1423                  ASSERT3P(msp->ms_group, !=, NULL);
1403 1424                  msp->ms_loaded = B_TRUE;
1404 1425  
1405 1426                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1406 1427                          range_tree_walk(msp->ms_defertree[t],
1407 1428                              range_tree_remove, msp->ms_tree);
     1429 +                        range_tree_walk(msp->ms_defertree[t],
     1430 +                            metaslab_trim_remove, msp);
1408 1431                  }
1409 1432                  msp->ms_max_size = metaslab_block_maxsize(msp);
1410 1433          }
1411 1434          cv_broadcast(&msp->ms_load_cv);
1412 1435          return (error);
1413 1436  }
1414 1437  
1415 1438  void
1416 1439  metaslab_unload(metaslab_t *msp)
1417 1440  {
[ 8 lines elided ]
1426 1449  metaslab_init(metaslab_group_t *mg, uint64_t id, uint64_t object, uint64_t txg,
1427 1450      metaslab_t **msp)
1428 1451  {
1429 1452          vdev_t *vd = mg->mg_vd;
1430 1453          objset_t *mos = vd->vdev_spa->spa_meta_objset;
1431 1454          metaslab_t *ms;
1432 1455          int error;
1433 1456  
1434 1457          ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
1435 1458          mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
1436      -        mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
1437 1459          cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
     1460 +        cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
1438 1461          ms->ms_id = id;
1439 1462          ms->ms_start = id << vd->vdev_ms_shift;
1440 1463          ms->ms_size = 1ULL << vd->vdev_ms_shift;
1441 1464  
1442 1465          /*
1443 1466           * We only open space map objects that already exist. All others
1444 1467           * will be opened when we finally allocate an object for it.
1445 1468           */
1446 1469          if (object != 0) {
1447 1470                  error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
1448      -                    ms->ms_size, vd->vdev_ashift);
     1471 +                    ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
1449 1472  
1450 1473                  if (error != 0) {
1451 1474                          kmem_free(ms, sizeof (metaslab_t));
1452 1475                          return (error);
1453 1476                  }
1454 1477  
1455 1478                  ASSERT(ms->ms_sm != NULL);
1456 1479          }
1457 1480  
     1481 +        ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
     1482 +
1458 1483          /*
1459 1484           * We create the main range tree here, but we don't create the
1460 1485           * other range trees until metaslab_sync_done().  This serves
1461 1486           * two purposes: it allows metaslab_sync_done() to detect the
1462 1487           * addition of new space; and for debugging, it ensures that we'd
1463 1488           * data fault on any attempt to use this metaslab before it's ready.
1464 1489           */
1465      -        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
     1490 +        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
1466 1491          metaslab_group_add(mg, ms);
1467 1492  
1468 1493          metaslab_set_fragmentation(ms);
1469 1494  
1470 1495          /*
1471 1496           * If we're opening an existing pool (txg == 0) or creating
1472 1497           * a new one (txg == TXG_INITIAL), all space is available now.
1473 1498           * If we're adding space to an existing pool, the new space
1474 1499           * does not become available until after this txg has synced.
1475 1500           * The metaslab's weight will also be initialized when we sync
[ 43 lines elided ]
1519 1544          range_tree_destroy(msp->ms_freedtree);
1520 1545  
1521 1546          for (int t = 0; t < TXG_SIZE; t++) {
1522 1547                  range_tree_destroy(msp->ms_alloctree[t]);
1523 1548          }
1524 1549  
1525 1550          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1526 1551                  range_tree_destroy(msp->ms_defertree[t]);
1527 1552          }
1528 1553  
     1554 +        metaslab_free_trimset(msp->ms_cur_ts);
     1555 +        if (msp->ms_prev_ts)
     1556 +                metaslab_free_trimset(msp->ms_prev_ts);
     1557 +        ASSERT3P(msp->ms_trimming_ts, ==, NULL);
     1558 +
1529 1559          ASSERT0(msp->ms_deferspace);
1530 1560  
1531 1561          mutex_exit(&msp->ms_lock);
1532 1562          cv_destroy(&msp->ms_load_cv);
     1563 +        cv_destroy(&msp->ms_trim_cv);
1533 1564          mutex_destroy(&msp->ms_lock);
1534      -        mutex_destroy(&msp->ms_sync_lock);
1535 1565  
1536 1566          kmem_free(msp, sizeof (metaslab_t));
1537 1567  }
1538 1568  
1539 1569  #define FRAGMENTATION_TABLE_SIZE        17
1540 1570  
1541 1571  /*
1542 1572   * This table defines a segment size based fragmentation metric that will
1543 1573   * allow each metaslab to derive its own fragmentation value. This is done
1544 1574   * by calculating the space in each bucket of the spacemap histogram and
[ 345 lines elided ]
1890 1920  static uint64_t
1891 1921  metaslab_weight(metaslab_t *msp)
1892 1922  {
1893 1923          vdev_t *vd = msp->ms_group->mg_vd;
1894 1924          spa_t *spa = vd->vdev_spa;
1895 1925          uint64_t weight;
1896 1926  
1897 1927          ASSERT(MUTEX_HELD(&msp->ms_lock));
1898 1928  
1899 1929          /*
1900      -         * If this vdev is in the process of being removed, there is nothing
     1930 +         * This vdev is in the process of being removed so there is nothing
1901 1931           * for us to do here.
1902 1932           */
1903      -        if (vd->vdev_removing)
     1933 +        if (vd->vdev_removing) {
     1934 +                ASSERT0(space_map_allocated(msp->ms_sm));
     1935 +                ASSERT0(vd->vdev_ms_shift);
1904 1936                  return (0);
     1937 +        }
1905 1938  
1906 1939          metaslab_set_fragmentation(msp);
1907 1940  
1908 1941          /*
1909 1942           * Update the maximum size if the metaslab is loaded. This will
1910 1943           * ensure that we get an accurate maximum size if newly freed space
1911 1944           * has been added back into the free tree.
1912 1945           */
1913 1946          if (msp->ms_loaded)
1914 1947                  msp->ms_max_size = metaslab_block_maxsize(msp);
[ 111 lines elided ]
2026 2059          metaslab_t *msp;
2027 2060          avl_tree_t *t = &mg->mg_metaslab_tree;
2028 2061          int m = 0;
2029 2062  
2030 2063          if (spa_shutting_down(spa) || !metaslab_preload_enabled) {
2031 2064                  taskq_wait(mg->mg_taskq);
2032 2065                  return;
2033 2066          }
2034 2067  
2035 2068          mutex_enter(&mg->mg_lock);
2036      -
2037 2069          /*
2038 2070           * Load the next potential metaslabs
2039 2071           */
2040 2072          for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
2041      -                ASSERT3P(msp->ms_group, ==, mg);
2042      -
2043 2073                  /*
2044 2074                   * We preload only the maximum number of metaslabs specified
2045 2075                   * by metaslab_preload_limit. If a metaslab is being forced
2046 2076                   * to condense then we preload it too. This will ensure
2047 2077                   * that force condensing happens in the next txg.
2048 2078                   */
2049 2079                  if (++m > metaslab_preload_limit && !msp->ms_condense_wanted) {
2050 2080                          continue;
2051 2081                  }
2052 2082  
[ 6 lines elided ]
2059 2089  /*
2060 2090   * Determine if the space map's on-disk footprint is past our tolerance
2061 2091   * for inefficiency. We would like to use the following criteria to make
2062 2092   * our decision:
2063 2093   *
2064 2094   * 1. The size of the space map object should not dramatically increase as a
2065 2095   * result of writing out the free space range tree.
2066 2096   *
2067 2097   * 2. The minimal on-disk space map representation is zfs_condense_pct/100
 2068 2098   * times the size of the free space range tree representation
2069      - * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
      2099 + * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
2070 2100   *
2071 2101   * 3. The on-disk size of the space map should actually decrease.
2072 2102   *
2073 2103   * Checking the first condition is tricky since we don't want to walk
2074 2104   * the entire AVL tree calculating the estimated on-disk size. Instead we
2075 2105   * use the size-ordered range tree in the metaslab and calculate the
2076 2106   * size required to write out the largest segment in our free tree. If the
2077 2107   * size required to represent that segment on disk is larger than the space
2078 2108   * map object then we avoid condensing this map.
2079 2109   *
[ 76 lines elided ]
2156 2186  
2157 2187          msp->ms_condense_wanted = B_FALSE;
2158 2188  
2159 2189          /*
 2160 2190           * Create a range tree that is 100% allocated. We remove segments
2161 2191           * that have been freed in this txg, any deferred frees that exist,
2162 2192           * and any allocation in the future. Removing segments should be
2163 2193           * a relatively inexpensive operation since we expect these trees to
2164 2194           * have a small number of nodes.
2165 2195           */
2166      -        condense_tree = range_tree_create(NULL, NULL);
     2196 +        condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
2167 2197          range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
2168 2198  
2169 2199          /*
2170 2200           * Remove what's been freed in this txg from the condense_tree.
2171 2201           * Since we're in sync_pass 1, we know that all the frees from
2172 2202           * this txg are in the freeingtree.
2173 2203           */
2174 2204          range_tree_walk(msp->ms_freeingtree, range_tree_remove, condense_tree);
2175 2205  
2176 2206          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
[ 12 lines elided ]
2189 2219           * metaslab's ms_condensing flag to ensure that
2190 2220           * allocations on this metaslab do not occur while we're
2191 2221           * in the middle of committing it to disk. This is only critical
2192 2222           * for the ms_tree as all other range trees use per txg
2193 2223           * views of their content.
2194 2224           */
2195 2225          msp->ms_condensing = B_TRUE;
2196 2226  
2197 2227          mutex_exit(&msp->ms_lock);
2198 2228          space_map_truncate(sm, tx);
     2229 +        mutex_enter(&msp->ms_lock);
2199 2230  
2200 2231          /*
2201 2232           * While we would ideally like to create a space map representation
2202 2233           * that consists only of allocation records, doing so can be
2203 2234           * prohibitively expensive because the in-core free tree can be
2204 2235           * large, and therefore computationally expensive to subtract
2205 2236           * from the condense_tree. Instead we sync out two trees, a cheap
2206 2237           * allocation only tree followed by the in-core free tree. While not
2207 2238           * optimal, this is typically close to optimal, and much cheaper to
2208 2239           * compute.
2209 2240           */
2210 2241          space_map_write(sm, condense_tree, SM_ALLOC, tx);
2211 2242          range_tree_vacate(condense_tree, NULL, NULL);
2212 2243          range_tree_destroy(condense_tree);
2213 2244  
2214 2245          space_map_write(sm, msp->ms_tree, SM_FREE, tx);
2215      -        mutex_enter(&msp->ms_lock);
2216 2246          msp->ms_condensing = B_FALSE;
2217 2247  }
2218 2248  
2219 2249  /*
2220 2250   * Write a metaslab to disk in the context of the specified transaction group.
2221 2251   */
2222 2252  void
2223 2253  metaslab_sync(metaslab_t *msp, uint64_t txg)
2224 2254  {
2225 2255          metaslab_group_t *mg = msp->ms_group;
2226 2256          vdev_t *vd = mg->mg_vd;
2227 2257          spa_t *spa = vd->vdev_spa;
2228 2258          objset_t *mos = spa_meta_objset(spa);
2229 2259          range_tree_t *alloctree = msp->ms_alloctree[txg & TXG_MASK];
2230 2260          dmu_tx_t *tx;
2231 2261          uint64_t object = space_map_object(msp->ms_sm);
2232 2262  
2233 2263          ASSERT(!vd->vdev_ishole);
2234 2264  
     2265 +        mutex_enter(&msp->ms_lock);
     2266 +
2235 2267          /*
2236 2268           * This metaslab has just been added so there's no work to do now.
2237 2269           */
2238 2270          if (msp->ms_freeingtree == NULL) {
2239 2271                  ASSERT3P(alloctree, ==, NULL);
     2272 +                mutex_exit(&msp->ms_lock);
2240 2273                  return;
2241 2274          }
2242 2275  
2243 2276          ASSERT3P(alloctree, !=, NULL);
2244 2277          ASSERT3P(msp->ms_freeingtree, !=, NULL);
2245 2278          ASSERT3P(msp->ms_freedtree, !=, NULL);
2246 2279  
2247 2280          /*
2248 2281           * Normally, we don't want to process a metaslab if there
2249 2282           * are no allocations or frees to perform. However, if the metaslab
2250 2283           * is being forced to condense and it's loaded, we need to let it
2251 2284           * through.
2252 2285           */
2253 2286          if (range_tree_space(alloctree) == 0 &&
2254 2287              range_tree_space(msp->ms_freeingtree) == 0 &&
2255      -            !(msp->ms_loaded && msp->ms_condense_wanted))
     2288 +            !(msp->ms_loaded && msp->ms_condense_wanted)) {
     2289 +                mutex_exit(&msp->ms_lock);
2256 2290                  return;
     2291 +        }
2257 2292  
2258 2293  
2259 2294          VERIFY(txg <= spa_final_dirty_txg(spa));
2260 2295  
2261 2296          /*
2262 2297           * The only state that can actually be changing concurrently with
2263 2298           * metaslab_sync() is the metaslab's ms_tree.  No other thread can
2264 2299           * be modifying this txg's alloctree, freeingtree, freedtree, or
2265      -         * space_map_phys_t.  We drop ms_lock whenever we could call
2266      -         * into the DMU, because the DMU can call down to us
2267      -         * (e.g. via zio_free()) at any time.
2268      -         *
2269      -         * The spa_vdev_remove_thread() can be reading metaslab state
2270      -         * concurrently, and it is locked out by the ms_sync_lock.  Note
2271      -         * that the ms_lock is insufficient for this, because it is dropped
2272      -         * by space_map_write().
      2300 +         * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
     2301 +         * space map ASSERTs. We drop it whenever we call into the DMU,
     2302 +         * because the DMU can call down to us (e.g. via zio_free()) at
     2303 +         * any time.
2273 2304           */
2274 2305  
2275 2306          tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
2276 2307  
2277 2308          if (msp->ms_sm == NULL) {
2278 2309                  uint64_t new_object;
2279 2310  
2280 2311                  new_object = space_map_alloc(mos, tx);
2281 2312                  VERIFY3U(new_object, !=, 0);
2282 2313  
2283 2314                  VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
2284      -                    msp->ms_start, msp->ms_size, vd->vdev_ashift));
     2315 +                    msp->ms_start, msp->ms_size, vd->vdev_ashift,
     2316 +                    &msp->ms_lock));
2285 2317                  ASSERT(msp->ms_sm != NULL);
2286 2318          }
2287 2319  
2288      -        mutex_enter(&msp->ms_sync_lock);
2289      -        mutex_enter(&msp->ms_lock);
2290      -
2291 2320          /*
2292 2321           * Note: metaslab_condense() clears the space map's histogram.
2293 2322           * Therefore we must verify and remove this histogram before
2294 2323           * condensing.
2295 2324           */
2296 2325          metaslab_group_histogram_verify(mg);
2297 2326          metaslab_class_histogram_verify(mg->mg_class);
2298 2327          metaslab_group_histogram_remove(mg, msp);
2299 2328  
2300 2329          if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
2301 2330              metaslab_should_condense(msp)) {
2302 2331                  metaslab_condense(msp, txg, tx);
2303 2332          } else {
2304      -                mutex_exit(&msp->ms_lock);
2305 2333                  space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
2306 2334                  space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
2307      -                mutex_enter(&msp->ms_lock);
2308 2335          }
2309 2336  
2310 2337          if (msp->ms_loaded) {
2311 2338                  /*
2312      -                 * When the space map is loaded, we have an accurate
      2339 +                 * When the space map is loaded, we have an accurate
2313 2340                   * histogram in the range tree. This gives us an opportunity
2314 2341                   * to bring the space map's histogram up-to-date so we clear
2315 2342                   * it first before updating it.
2316 2343                   */
2317 2344                  space_map_histogram_clear(msp->ms_sm);
2318 2345                  space_map_histogram_add(msp->ms_sm, msp->ms_tree, tx);
2319 2346  
2320 2347                  /*
2321 2348                   * Since we've cleared the histogram we need to add back
2322 2349                   * any free space that has already been processed, plus
[ 47 lines elided ]
2370 2397          ASSERT0(range_tree_space(msp->ms_alloctree[TXG_CLEAN(txg) & TXG_MASK]));
2371 2398          ASSERT0(range_tree_space(msp->ms_freeingtree));
2372 2399  
2373 2400          mutex_exit(&msp->ms_lock);
2374 2401  
2375 2402          if (object != space_map_object(msp->ms_sm)) {
2376 2403                  object = space_map_object(msp->ms_sm);
2377 2404                  dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
2378 2405                      msp->ms_id, sizeof (uint64_t), &object, tx);
2379 2406          }
2380      -        mutex_exit(&msp->ms_sync_lock);
2381 2407          dmu_tx_commit(tx);
2382 2408  }
2383 2409  
2384 2410  /*
2385 2411   * Called after a transaction group has completely synced to mark
2386 2412   * all of the metaslab's free space as usable.
2387 2413   */
2388 2414  void
2389 2415  metaslab_sync_done(metaslab_t *msp, uint64_t txg)
2390 2416  {
[ 9 lines elided ]
2400 2426          mutex_enter(&msp->ms_lock);
2401 2427  
2402 2428          /*
2403 2429           * If this metaslab is just becoming available, initialize its
2404 2430           * range trees and add its capacity to the vdev.
2405 2431           */
2406 2432          if (msp->ms_freedtree == NULL) {
2407 2433                  for (int t = 0; t < TXG_SIZE; t++) {
2408 2434                          ASSERT(msp->ms_alloctree[t] == NULL);
2409 2435  
2410      -                        msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
     2436 +                        msp->ms_alloctree[t] = range_tree_create(NULL, msp,
     2437 +                            &msp->ms_lock);
2411 2438                  }
2412 2439  
2413 2440                  ASSERT3P(msp->ms_freeingtree, ==, NULL);
2414      -                msp->ms_freeingtree = range_tree_create(NULL, NULL);
     2441 +                msp->ms_freeingtree = range_tree_create(NULL, msp,
     2442 +                    &msp->ms_lock);
2415 2443  
2416 2444                  ASSERT3P(msp->ms_freedtree, ==, NULL);
2417      -                msp->ms_freedtree = range_tree_create(NULL, NULL);
     2445 +                msp->ms_freedtree = range_tree_create(NULL, msp,
     2446 +                    &msp->ms_lock);
2418 2447  
2419 2448                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
2420 2449                          ASSERT(msp->ms_defertree[t] == NULL);
2421 2450  
2422      -                        msp->ms_defertree[t] = range_tree_create(NULL, NULL);
     2451 +                        msp->ms_defertree[t] = range_tree_create(NULL, msp,
     2452 +                            &msp->ms_lock);
2423 2453                  }
2424 2454  
2425 2455                  vdev_space_update(vd, 0, 0, msp->ms_size);
2426 2456          }
2427 2457  
2428 2458          defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
2429 2459  
2430 2460          uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
2431 2461              metaslab_class_get_alloc(spa_normal_class(spa));
2432      -        if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
     2462 +        if (free_space <= spa_get_slop_space(spa)) {
2433 2463                  defer_allowed = B_FALSE;
2434 2464          }
2435 2465  
2436 2466          defer_delta = 0;
2437 2467          alloc_delta = space_map_alloc_delta(msp->ms_sm);
2438 2468          if (defer_allowed) {
2439 2469                  defer_delta = range_tree_space(msp->ms_freedtree) -
2440 2470                      range_tree_space(*defer_tree);
2441 2471          } else {
2442 2472                  defer_delta -= range_tree_space(*defer_tree);
[ 6 lines elided ]
2449 2479           * so that we have a consistent view of the in-core space map.
2450 2480           */
2451 2481          metaslab_load_wait(msp);
2452 2482  
2453 2483          /*
2454 2484           * Move the frees from the defer_tree back to the free
2455 2485           * range tree (if it's loaded). Swap the freed_tree and the
2456 2486           * defer_tree -- this is safe to do because we've just emptied out
2457 2487           * the defer_tree.
2458 2488           */
     2489 +        if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
     2490 +            !vd->vdev_man_trimming) {
     2491 +                range_tree_walk(*defer_tree, metaslab_trim_add, msp);
     2492 +                if (!defer_allowed) {
     2493 +                        range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
     2494 +                            msp);
     2495 +                }
     2496 +        }
2459 2497          range_tree_vacate(*defer_tree,
2460 2498              msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2461 2499          if (defer_allowed) {
2462 2500                  range_tree_swap(&msp->ms_freedtree, defer_tree);
2463 2501          } else {
2464 2502                  range_tree_vacate(msp->ms_freedtree,
2465 2503                      msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2466 2504          }
2467 2505  
2468 2506          space_map_update(msp->ms_sm);
[ 23 lines elided ]
2492 2530              msp->ms_selected_txg + metaslab_unload_delay < txg) {
2493 2531                  for (int t = 1; t < TXG_CONCURRENT_STATES; t++) {
2494 2532                          VERIFY0(range_tree_space(
2495 2533                              msp->ms_alloctree[(txg + t) & TXG_MASK]));
2496 2534                  }
2497 2535  
2498 2536                  if (!metaslab_debug_unload)
2499 2537                          metaslab_unload(msp);
2500 2538          }
2501 2539  
2502      -        ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
2503      -        ASSERT0(range_tree_space(msp->ms_freeingtree));
2504      -        ASSERT0(range_tree_space(msp->ms_freedtree));
2505      -
2506 2540          mutex_exit(&msp->ms_lock);
2507 2541  }
2508 2542  
2509 2543  void
2510 2544  metaslab_sync_reassess(metaslab_group_t *mg)
2511 2545  {
2512      -        spa_t *spa = mg->mg_class->mc_spa;
2513      -
2514      -        spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
2515 2546          metaslab_group_alloc_update(mg);
2516 2547          mg->mg_fragmentation = metaslab_group_fragmentation(mg);
2517 2548  
2518 2549          /*
2519      -         * Preload the next potential metaslabs but only on active
2520      -         * metaslab groups. We can get into a state where the metaslab
2521      -         * is no longer active since we dirty metaslabs as we remove a
2522      -         * a device, thus potentially making the metaslab group eligible
2523      -         * for preloading.
     2550 +         * Preload the next potential metaslabs
2524 2551           */
2525      -        if (mg->mg_activation_count > 0) {
2526      -                metaslab_group_preload(mg);
2527      -        }
2528      -        spa_config_exit(spa, SCL_ALLOC, FTAG);
     2552 +        metaslab_group_preload(mg);
2529 2553  }
2530 2554  
2531 2555  static uint64_t
2532 2556  metaslab_distance(metaslab_t *msp, dva_t *dva)
2533 2557  {
2534 2558          uint64_t ms_shift = msp->ms_group->mg_vd->vdev_ms_shift;
2535 2559          uint64_t offset = DVA_GET_OFFSET(dva) >> ms_shift;
2536 2560          uint64_t start = msp->ms_id;
2537 2561  
2538 2562          if (msp->ms_group->mg_vd->vdev_id != DVA_GET_VDEV(dva))
[ 173 lines elided ]
2712 2736  
2713 2737          start = mc->mc_ops->msop_alloc(msp, size);
2714 2738          if (start != -1ULL) {
2715 2739                  metaslab_group_t *mg = msp->ms_group;
2716 2740                  vdev_t *vd = mg->mg_vd;
2717 2741  
2718 2742                  VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
2719 2743                  VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
2720 2744                  VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
2721 2745                  range_tree_remove(rt, start, size);
     2746 +                metaslab_trim_remove(msp, start, size);
2722 2747  
2723 2748                  if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
2724 2749                          vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
2725 2750  
2726 2751                  range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
2727 2752  
2728 2753                  /* Track the last successful allocation */
2729 2754                  msp->ms_alloc_txg = txg;
2730 2755                  metaslab_verify_space(msp, txg);
2731 2756          }
[ 1 line elided ]
2733 2758          /*
2734 2759           * Now that we've attempted the allocation we need to update the
2735 2760           * metaslab's maximum block size since it may have changed.
2736 2761           */
2737 2762          msp->ms_max_size = metaslab_block_maxsize(msp);
2738 2763          return (start);
2739 2764  }
2740 2765  
2741 2766  static uint64_t
2742 2767  metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
2743      -    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
     2768 +    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
     2769 +    int flags)
2744 2770  {
2745 2771          metaslab_t *msp = NULL;
2746 2772          uint64_t offset = -1ULL;
2747 2773          uint64_t activation_weight;
2748 2774          uint64_t target_distance;
2749 2775          int i;
2750 2776  
2751 2777          activation_weight = METASLAB_WEIGHT_PRIMARY;
2752 2778          for (i = 0; i < d; i++) {
2753 2779                  if (DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) {
2754 2780                          activation_weight = METASLAB_WEIGHT_SECONDARY;
2755 2781                          break;
2756 2782                  }
2757 2783          }
2758 2784  
2759 2785          metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
2760 2786          search->ms_weight = UINT64_MAX;
2761 2787          search->ms_start = 0;
2762 2788          for (;;) {
2763 2789                  boolean_t was_active;
     2790 +                boolean_t pass_primary = B_TRUE;
2764 2791                  avl_tree_t *t = &mg->mg_metaslab_tree;
2765 2792                  avl_index_t idx;
2766 2793  
2767 2794                  mutex_enter(&mg->mg_lock);
2768 2795  
2769 2796                  /*
2770 2797                   * Find the metaslab with the highest weight that is less
2771 2798                   * than what we've already tried.  In the common case, this
2772 2799                   * means that we will examine each metaslab at most once.
2773 2800                   * Note that concurrent callers could reorder metaslabs
[ 17 lines elided ]
2791 2818                                  continue;
2792 2819                          }
2793 2820  
2794 2821                          /*
2795 2822                           * If the selected metaslab is condensing, skip it.
2796 2823                           */
2797 2824                          if (msp->ms_condensing)
2798 2825                                  continue;
2799 2826  
2800 2827                          was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
2801      -                        if (activation_weight == METASLAB_WEIGHT_PRIMARY)
2802      -                                break;
     2828 +                        if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
     2829 +                                if (!pass_primary) {
     2830 +                                        DTRACE_PROBE(metaslab_use_secondary);
     2831 +                                        activation_weight =
     2832 +                                            METASLAB_WEIGHT_SECONDARY;
     2833 +                                        break;
     2834 +                                }
2803 2835  
2804      -                        target_distance = min_distance +
2805      -                            (space_map_allocated(msp->ms_sm) != 0 ? 0 :
2806      -                            min_distance >> 1);
     2836 +                                pass_primary = B_FALSE;
     2837 +                        } else {
     2838 +                                if (activation_weight ==
     2839 +                                    METASLAB_WEIGHT_PRIMARY)
     2840 +                                        break;
2807 2841  
2808      -                        for (i = 0; i < d; i++) {
2809      -                                if (metaslab_distance(msp, &dva[i]) <
2810      -                                    target_distance)
     2842 +                                target_distance = min_distance +
     2843 +                                    (space_map_allocated(msp->ms_sm) != 0 ? 0 :
     2844 +                                    min_distance >> 1);
     2845 +
     2846 +                                for (i = 0; i < d; i++)
     2847 +                                        if (metaslab_distance(msp, &dva[i]) <
     2848 +                                            target_distance)
     2849 +                                                break;
     2850 +                                if (i == d)
2811 2851                                          break;
2812 2852                          }
2813      -                        if (i == d)
2814      -                                break;
2815 2853                  }
2816 2854                  mutex_exit(&mg->mg_lock);
2817 2855                  if (msp == NULL) {
2818 2856                          kmem_free(search, sizeof (*search));
2819 2857                          return (-1ULL);
2820 2858                  }
2821 2859                  search->ms_weight = msp->ms_weight;
2822 2860                  search->ms_start = msp->ms_start + 1;
2823 2861  
2824 2862                  mutex_enter(&msp->ms_lock);
[ 101 lines elided ]
2926 2964                  ASSERT(!metaslab_should_allocate(msp, asize));
2927 2965                  mutex_exit(&msp->ms_lock);
2928 2966          }
2929 2967          mutex_exit(&msp->ms_lock);
2930 2968          kmem_free(search, sizeof (*search));
2931 2969          return (offset);
2932 2970  }
2933 2971  
2934 2972  static uint64_t
2935 2973  metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
2936      -    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
     2974 +    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
     2975 +    int d, int flags)
2937 2976  {
2938 2977          uint64_t offset;
2939 2978          ASSERT(mg->mg_initialized);
2940 2979  
2941 2980          offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
2942      -            min_distance, dva, d);
     2981 +            min_distance, dva, d, flags);
2943 2982  
2944 2983          mutex_enter(&mg->mg_lock);
2945 2984          if (offset == -1ULL) {
2946 2985                  mg->mg_failed_allocations++;
2947 2986                  metaslab_trace_add(zal, mg, NULL, asize, d,
2948 2987                      TRACE_GROUP_FAILURE);
2949 2988                  if (asize == SPA_GANGBLOCKSIZE) {
2950 2989                          /*
2951 2990                           * This metaslab group was unable to allocate
2952 2991                           * the minimum gang block size so it must be out of
[ 17 lines elided ]
2970 3009   * If we have to write a ditto block (i.e. more than one DVA for a given BP)
2971 3010   * on the same vdev as an existing DVA of this BP, then try to allocate it
2972 3011   * at least (vdev_asize / (2 ^ ditto_same_vdev_distance_shift)) away from the
2973 3012   * existing DVAs.
2974 3013   */
2975 3014  int ditto_same_vdev_distance_shift = 3;
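As a worked example of the formula above, with the default shift of 3 and hypothetical vdev sizes:

        /*
         * 1 TiB vdev:  min_distance = (1ULL << 40) >> 3 = 128 GiB
         * 8 GiB vdev:  min_distance = (8ULL << 30) >> 3 = 1 GiB
         *
         * metaslab_alloc_dva() below drops the distance to 0 whenever it is
         * no larger than a single metaslab (1ULL << vdev_ms_shift).
         */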
2976 3015  
2977 3016  /*
2978 3017   * Allocate a block for the specified i/o.
2979 3018   */
2980      -int
     3019 +static int
2981 3020  metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
2982 3021      dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
2983 3022      zio_alloc_list_t *zal)
2984 3023  {
2985 3024          metaslab_group_t *mg, *rotor;
2986 3025          vdev_t *vd;
2987 3026          boolean_t try_hard = B_FALSE;
2988 3027  
2989 3028          ASSERT(!DVA_IS_VALID(&dva[d]));
2990 3029  
[ 25 lines elided ]
3016 3055           * If we are doing gang blocks (hintdva is non-NULL), try to keep
3017 3056           * ourselves on the same vdev as our gang block header.  That
3018 3057           * way, we can hope for locality in vdev_cache, plus it makes our
3019 3058           * fault domains something tractable.
3020 3059           */
3021 3060          if (hintdva) {
3022 3061                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
3023 3062  
3024 3063                  /*
3025 3064                   * It's possible the vdev we're using as the hint no
3026      -                 * longer exists or its mg has been closed (e.g. by
3027      -                 * device removal).  Consult the rotor when
     3065 +                 * longer exists (i.e. removed). Consult the rotor when
3028 3066                   * all else fails.
3029 3067                   */
3030      -                if (vd != NULL && vd->vdev_mg != NULL) {
     3068 +                if (vd != NULL) {
3031 3069                          mg = vd->vdev_mg;
3032 3070  
3033 3071                          if (flags & METASLAB_HINTBP_AVOID &&
3034 3072                              mg->mg_next != NULL)
3035 3073                                  mg = mg->mg_next;
3036 3074                  } else {
3037 3075                          mg = mc->mc_rotor;
3038 3076                  }
3039 3077          } else if (d != 0) {
3040 3078                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
[ 74 lines elided ]
3115 3153                          distance = vd->vdev_asize >>
3116 3154                              ditto_same_vdev_distance_shift;
3117 3155                          if (distance <= (1ULL << vd->vdev_ms_shift))
3118 3156                                  distance = 0;
3119 3157                  }
3120 3158  
3121 3159                  uint64_t asize = vdev_psize_to_asize(vd, psize);
3122 3160                  ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
3123 3161  
3124 3162                  uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
3125      -                    distance, dva, d);
     3163 +                    distance, dva, d, flags);
3126 3164  
3127 3165                  if (offset != -1ULL) {
3128 3166                          /*
3129 3167                           * If we've just selected this metaslab group,
3130 3168                           * figure out whether the corresponding vdev is
3131 3169                           * over- or under-used relative to the pool,
3132 3170                           * and set an allocation bias to even it out.
3133 3171                           */
3134 3172                          if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
3135 3173                                  vdev_stat_t *vs = &vd->vdev_stat;
3136      -                                int64_t vu, cu;
     3174 +                                vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
     3175 +                                int64_t vu, cu, vu_io;
3137 3176  
3138 3177                                  vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
3139 3178                                  cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
     3179 +                                vu_io =
     3180 +                                    (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
     3181 +                                    (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
     3182 +                                    (vd->vdev_parent->vdev_children)) - 100;
3140 3183  
3141 3184                                  /*
3142 3185                                   * Calculate how much more or less we should
3143 3186                                   * try to allocate from this device during
3144 3187                                   * this iteration around the rotor.
3145 3188                                   * For example, if a device is 80% full
3146 3189                                   * and the pool is 20% full then we should
3147 3190                                   * reduce allocations by 60% on this device.
3148 3191                                   *
3149 3192                                   * mg_bias = (20 - 80) * 512K / 100 = -307K
3150 3193                                   *
3151 3194                                   * This reduces allocations by 307K for this
3152 3195                                   * iteration.
3153 3196                                   */
3154 3197                                  mg->mg_bias = ((cu - vu) *
3155 3198                                      (int64_t)mg->mg_aliquot) / 100;
     3199 +
     3200 +                                /*
     3201 +                                 * Experiment: select the DVA allocator:
     3202 +                                 * 0 = space-based, 1 = latency-based, 2 = hybrid.
     3203 +                                 */
     3204 +                                switch (metaslab_alloc_dva_algorithm) {
     3205 +                                case 1:
     3206 +                                        mg->mg_bias =
     3207 +                                            (vu_io * (int64_t)mg->mg_aliquot) /
     3208 +                                            100;
     3209 +                                        break;
     3210 +                                case 2:
     3211 +                                        mg->mg_bias =
     3212 +                                            ((((cu - vu) + vu_io) / 2) *
     3213 +                                            (int64_t)mg->mg_aliquot) / 100;
     3214 +                                        break;
     3215 +                                default:
     3216 +                                        break;
     3217 +                                }
3156 3218                          } else if (!metaslab_bias_enabled) {
3157 3219                                  mg->mg_bias = 0;
3158 3220                          }
3159 3221  
3160 3222                          if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
3161 3223                              mg->mg_aliquot + mg->mg_bias) {
3162 3224                                  mc->mc_rotor = mg->mg_next;
3163 3225                                  mc->mc_aliquot = 0;
3164 3226                          }
3165 3227  
3166 3228                          DVA_SET_VDEV(&dva[d], vd->vdev_id);
3167 3229                          DVA_SET_OFFSET(&dva[d], offset);
3168 3230                          DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
3169 3231                          DVA_SET_ASIZE(&dva[d], asize);
     3232 +                        DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
     3233 +                            uint64_t, offset, uint64_t, psize);
3170 3234  
3171 3235                          return (0);
3172 3236                  }
3173 3237  next:
3174 3238                  mc->mc_rotor = mg->mg_next;
3175 3239                  mc->mc_aliquot = 0;
3176 3240          } while ((mg = mg->mg_next) != rotor);
3177 3241  
3178 3242          /*
3179 3243           * If we haven't tried hard, do so now.
[ 2 lines elided ]
3182 3246                  try_hard = B_TRUE;
3183 3247                  goto top;
3184 3248          }
3185 3249  
3186 3250          bzero(&dva[d], sizeof (dva_t));
3187 3251  
3188 3252          metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
3189 3253          return (SET_ERROR(ENOSPC));
3190 3254  }
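To make the bias arithmetic in metaslab_alloc_dva() concrete, here is a worked example with hypothetical numbers: a 4-child parent, this vdev 80% allocated, the pool 20% allocated, and this vdev accounting for 30% of the parent's write iotime:

        /*
         * vu    = 80
         * cu    = 20
         * vu_io = (30 * 4) - 100 = 20     (an even share would be 25%)
         *
         * metaslab_alloc_dva_algorithm == 0 (space, the default):
         *         mg_bias = (cu - vu) * mg_aliquot / 100         = -60% of mg_aliquot
         * metaslab_alloc_dva_algorithm == 1 (latency):
         *         mg_bias = vu_io * mg_aliquot / 100             = +20% of mg_aliquot
         * metaslab_alloc_dva_algorithm == 2 (hybrid):
         *         mg_bias = ((cu - vu) + vu_io) / 2 * mg_aliquot / 100 = -20% of mg_aliquot
         */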
3191 3255  
3192      -void
3193      -metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
3194      -    uint64_t txg)
3195      -{
3196      -        metaslab_t *msp;
3197      -        spa_t *spa = vd->vdev_spa;
3198      -
3199      -        ASSERT3U(txg, ==, spa->spa_syncing_txg);
3200      -        ASSERT(vdev_is_concrete(vd));
3201      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3202      -        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3203      -
3204      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3205      -
3206      -        VERIFY(!msp->ms_condensing);
3207      -        VERIFY3U(offset, >=, msp->ms_start);
3208      -        VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
3209      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3210      -        VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
3211      -
3212      -        metaslab_check_free_impl(vd, offset, asize);
3213      -        mutex_enter(&msp->ms_lock);
3214      -        if (range_tree_space(msp->ms_freeingtree) == 0) {
3215      -                vdev_dirty(vd, VDD_METASLAB, msp, txg);
3216      -        }
3217      -        range_tree_add(msp->ms_freeingtree, offset, asize);
3218      -        mutex_exit(&msp->ms_lock);
3219      -}
3220      -
3221      -/* ARGSUSED */
3222      -void
3223      -metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3224      -    uint64_t size, void *arg)
3225      -{
3226      -        uint64_t *txgp = arg;
3227      -
3228      -        if (vd->vdev_ops->vdev_op_remap != NULL)
3229      -                vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
3230      -        else
3231      -                metaslab_free_impl(vd, offset, size, *txgp);
3232      -}
3233      -
3234      -static void
3235      -metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
3236      -    uint64_t txg)
3237      -{
3238      -        spa_t *spa = vd->vdev_spa;
3239      -
3240      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3241      -
3242      -        if (txg > spa_freeze_txg(spa))
3243      -                return;
3244      -
3245      -        if (spa->spa_vdev_removal != NULL &&
3246      -            spa->spa_vdev_removal->svr_vdev == vd &&
3247      -            vdev_is_concrete(vd)) {
3248      -                /*
3249      -                 * Note: we check if the vdev is concrete because when
3250      -                 * we complete the removal, we first change the vdev to be
3251      -                 * an indirect vdev (in open context), and then (in syncing
3252      -                 * context) clear spa_vdev_removal.
3253      -                 */
3254      -                free_from_removing_vdev(vd, offset, size, txg);
3255      -        } else if (vd->vdev_ops->vdev_op_remap != NULL) {
3256      -                vdev_indirect_mark_obsolete(vd, offset, size, txg);
3257      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3258      -                    metaslab_free_impl_cb, &txg);
3259      -        } else {
3260      -                metaslab_free_concrete(vd, offset, size, txg);
3261      -        }
3262      -}
3263      -
3264      -typedef struct remap_blkptr_cb_arg {
3265      -        blkptr_t *rbca_bp;
3266      -        spa_remap_cb_t rbca_cb;
3267      -        vdev_t *rbca_remap_vd;
3268      -        uint64_t rbca_remap_offset;
3269      -        void *rbca_cb_arg;
3270      -} remap_blkptr_cb_arg_t;
3271      -
3272      -void
3273      -remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3274      -    uint64_t size, void *arg)
3275      -{
3276      -        remap_blkptr_cb_arg_t *rbca = arg;
3277      -        blkptr_t *bp = rbca->rbca_bp;
3278      -
3279      -        /* We can not remap split blocks. */
3280      -        if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
3281      -                return;
3282      -        ASSERT0(inner_offset);
3283      -
3284      -        if (rbca->rbca_cb != NULL) {
3285      -                /*
3286      -                 * At this point we know that we are not handling split
3287      -                 * blocks and we invoke the callback on the previous
3288      -                 * vdev which must be indirect.
3289      -                 */
3290      -                ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
3291      -
3292      -                rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
3293      -                    rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
3294      -
3295      -                /* set up remap_blkptr_cb_arg for the next call */
3296      -                rbca->rbca_remap_vd = vd;
3297      -                rbca->rbca_remap_offset = offset;
3298      -        }
3299      -
3300      -        /*
3301      -         * The phys birth time is that of dva[0].  This ensures that we know
3302      -         * when each dva was written, so that resilver can determine which
3303      -         * blocks need to be scrubbed (i.e. those written during the time
3304      -         * the vdev was offline).  It also ensures that the key used in
3305      -         * the ARC hash table is unique (i.e. dva[0] + phys_birth).  If
3306      -         * we didn't change the phys_birth, a lookup in the ARC for a
3307      -         * remapped BP could find the data that was previously stored at
3308      -         * this vdev + offset.
3309      -         */
3310      -        vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
3311      -            DVA_GET_VDEV(&bp->blk_dva[0]));
3312      -        vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
3313      -        bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
3314      -            DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
3315      -
3316      -        DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
3317      -        DVA_SET_OFFSET(&bp->blk_dva[0], offset);
3318      -}
3319      -
3320 3256  /*
3321      - * If the block pointer contains any indirect DVAs, modify them to refer to
3322      - * concrete DVAs.  Note that this will sometimes not be possible, leaving
3323      - * the indirect DVA in place.  This happens if the indirect DVA spans multiple
3324      - * segments in the mapping (i.e. it is a "split block").
3325      - *
3326      - * If the BP was remapped, calls the callback on the original dva (note the
3327      - * callback can be called multiple times if the original indirect DVA refers
3328      - * to another indirect DVA, etc).
3329      - *
3330      - * Returns TRUE if the BP was remapped.
     3257 + * Free the block represented by DVA in the context of the specified
     3258 + * transaction group.
3331 3259   */
3332      -boolean_t
3333      -spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
3334      -{
3335      -        remap_blkptr_cb_arg_t rbca;
3336      -
3337      -        if (!zfs_remap_blkptr_enable)
3338      -                return (B_FALSE);
3339      -
3340      -        if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
3341      -                return (B_FALSE);
3342      -
3343      -        /*
3344      -         * Dedup BP's can not be remapped, because ddt_phys_select() depends
3345      -         * on DVA[0] being the same in the BP as in the DDT (dedup table).
3346      -         */
3347      -        if (BP_GET_DEDUP(bp))
3348      -                return (B_FALSE);
3349      -
3350      -        /*
3351      -         * Gang blocks can not be remapped, because
3352      -         * zio_checksum_gang_verifier() depends on the DVA[0] that's in
3353      -         * the BP used to read the gang block header (GBH) being the same
3354      -         * as the DVA[0] that we allocated for the GBH.
3355      -         */
3356      -        if (BP_IS_GANG(bp))
3357      -                return (B_FALSE);
3358      -
3359      -        /*
3360      -         * Embedded BP's have no DVA to remap.
3361      -         */
3362      -        if (BP_GET_NDVAS(bp) < 1)
3363      -                return (B_FALSE);
3364      -
3365      -        /*
3366      -         * Note: we only remap dva[0].  If we remapped other dvas, we
3367      -         * would no longer know what their phys birth txg is.
3368      -         */
3369      -        dva_t *dva = &bp->blk_dva[0];
3370      -
3371      -        uint64_t offset = DVA_GET_OFFSET(dva);
3372      -        uint64_t size = DVA_GET_ASIZE(dva);
3373      -        vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
3374      -
3375      -        if (vd->vdev_ops->vdev_op_remap == NULL)
3376      -                return (B_FALSE);
3377      -
3378      -        rbca.rbca_bp = bp;
3379      -        rbca.rbca_cb = callback;
3380      -        rbca.rbca_remap_vd = vd;
3381      -        rbca.rbca_remap_offset = offset;
3382      -        rbca.rbca_cb_arg = arg;
3383      -
3384      -        /*
3385      -         * remap_blkptr_cb() will be called in order for each level of
3386      -         * indirection, until a concrete vdev is reached or a split block is
3387      -         * encountered. old_vd and old_offset are updated within the callback
3388      -         * as we go from the one indirect vdev to the next one (either concrete
3389      -         * or indirect again) in that order.
3390      -         */
3391      -        vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
3392      -
3393      -        /* Check if the DVA wasn't remapped because it is a split block */
3394      -        if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
3395      -                return (B_FALSE);
3396      -
3397      -        return (B_TRUE);
3398      -}
3399      -
3400      -/*
3401      - * Undo the allocation of a DVA which happened in the given transaction group.
3402      - */
3403 3260  void
3404      -metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
     3261 +metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
3405 3262  {
3406      -        metaslab_t *msp;
3407      -        vdev_t *vd;
3408 3263          uint64_t vdev = DVA_GET_VDEV(dva);
3409 3264          uint64_t offset = DVA_GET_OFFSET(dva);
3410 3265          uint64_t size = DVA_GET_ASIZE(dva);
     3266 +        vdev_t *vd;
     3267 +        metaslab_t *msp;
3411 3268  
     3269 +        DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
     3270 +            uint64_t, offset, uint64_t, size);
     3271 +
3412 3272          ASSERT(DVA_IS_VALID(dva));
3413      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3414 3273  
3415 3274          if (txg > spa_freeze_txg(spa))
3416 3275                  return;
3417 3276  
3418 3277          if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
3419 3278              (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count) {
3420 3279                  cmn_err(CE_WARN, "metaslab_free_dva(): bad DVA %llu:%llu",
3421 3280                      (u_longlong_t)vdev, (u_longlong_t)offset);
3422 3281                  ASSERT(0);
3423 3282                  return;
3424 3283          }
3425 3284  
3426      -        ASSERT(!vd->vdev_removing);
3427      -        ASSERT(vdev_is_concrete(vd));
3428      -        ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
3429      -        ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
     3285 +        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3430 3286  
3431 3287          if (DVA_GET_GANG(dva))
3432 3288                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3433 3289  
3434      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3435      -
3436 3290          mutex_enter(&msp->ms_lock);
3437      -        range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
3438      -            offset, size);
3439 3291  
3440      -        VERIFY(!msp->ms_condensing);
3441      -        VERIFY3U(offset, >=, msp->ms_start);
3442      -        VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
3443      -        VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
3444      -            msp->ms_size);
3445      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3446      -        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3447      -        range_tree_add(msp->ms_tree, offset, size);
     3292 +        if (now) {
     3293 +                range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
     3294 +                    offset, size);
     3295 +
     3296 +                VERIFY(!msp->ms_condensing);
     3297 +                VERIFY3U(offset, >=, msp->ms_start);
     3298 +                VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
     3299 +                VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
     3300 +                    msp->ms_size);
     3301 +                VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
     3302 +                VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
     3303 +                range_tree_add(msp->ms_tree, offset, size);
     3304 +                if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
     3305 +                    !vd->vdev_man_trimming)
     3306 +                        metaslab_trim_add(msp, offset, size);
     3307 +                msp->ms_max_size = metaslab_block_maxsize(msp);
     3308 +        } else {
     3309 +                VERIFY3U(txg, ==, spa->spa_syncing_txg);
     3310 +                if (range_tree_space(msp->ms_freeingtree) == 0)
     3311 +                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
     3312 +                range_tree_add(msp->ms_freeingtree, offset, size);
     3313 +        }
     3314 +
3448 3315          mutex_exit(&msp->ms_lock);
3449 3316  }
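A minimal caller-side sketch of the two modes of the reworked metaslab_free_dva(), mirroring its callers later in this file (spa, bp and txg are placeholders):

        /* now == B_TRUE: undo an allocation made in this txg; the extent
         * goes straight back into ms_tree (and the trimset, if autotrim
         * is on and no manual trim is running). */
        metaslab_free_dva(spa, &bp->blk_dva[0], txg, B_TRUE);

        /* now == B_FALSE: normal free path; the extent is queued in
         * ms_freeingtree for the syncing txg and only becomes
         * allocatable again after the defer cycle. */
        metaslab_free_dva(spa, &bp->blk_dva[0], spa_syncing_txg(spa), B_FALSE);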
3450 3317  
3451 3318  /*
3452      - * Free the block represented by DVA in the context of the specified
3453      - * transaction group.
     3319 + * Intent log support: upon opening the pool after a crash, notify the SPA
     3320 + * of blocks that the intent log has allocated for immediate write, but
     3321 + * which are still considered free by the SPA because the last transaction
     3322 + * group didn't commit yet.
3454 3323   */
3455      -void
3456      -metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
     3324 +static int
     3325 +metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3457 3326  {
3458 3327          uint64_t vdev = DVA_GET_VDEV(dva);
3459 3328          uint64_t offset = DVA_GET_OFFSET(dva);
3460 3329          uint64_t size = DVA_GET_ASIZE(dva);
3461      -        vdev_t *vd = vdev_lookup_top(spa, vdev);
     3330 +        vdev_t *vd;
     3331 +        metaslab_t *msp;
     3332 +        int error = 0;
3462 3333  
3463 3334          ASSERT(DVA_IS_VALID(dva));
3464      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3465 3335  
3466      -        if (DVA_GET_GANG(dva)) {
     3336 +        if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
     3337 +            (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
     3338 +                return (SET_ERROR(ENXIO));
     3339 +
     3340 +        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3341 +
     3342 +        if (DVA_GET_GANG(dva))
3467 3343                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
     3344 +
     3345 +        mutex_enter(&msp->ms_lock);
     3346 +
     3347 +        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
     3348 +                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
     3349 +
     3350 +        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
     3351 +                error = SET_ERROR(ENOENT);
     3352 +
     3353 +        if (error || txg == 0) {        /* txg == 0 indicates dry run */
     3354 +                mutex_exit(&msp->ms_lock);
     3355 +                return (error);
3468 3356          }
3469 3357  
3470      -        metaslab_free_impl(vd, offset, size, txg);
     3358 +        VERIFY(!msp->ms_condensing);
     3359 +        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
     3360 +        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
     3361 +        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
     3362 +        range_tree_remove(msp->ms_tree, offset, size);
     3363 +        metaslab_trim_remove(msp, offset, size);
     3364 +
     3365 +        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
     3366 +                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
     3367 +                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
     3368 +                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
     3369 +        }
     3370 +
     3371 +        mutex_exit(&msp->ms_lock);
     3372 +
     3373 +        return (0);
3471 3374  }
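A short sketch of the txg == 0 dry-run convention used by metaslab_claim() below (error handling reduced to the essentials; spa, bp and txg are placeholders):

        /* Dry run: verify the extent is still free without dirtying anything. */
        int error = metaslab_claim_dva(spa, &bp->blk_dva[0], 0);

        /* Real claim: move the extent from ms_tree into ms_alloctree for txg
         * and drop it from the pending trimsets. */
        if (error == 0)
                error = metaslab_claim_dva(spa, &bp->blk_dva[0], txg);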
3472 3375  
3473 3376  /*
3474 3377   * Reserve some allocation slots. The reservation system must be called
3475 3378   * before we call into the allocator. If there aren't any available slots
3476 3379   * then the I/O will be throttled until an I/O completes and its slots are
3477 3380   * freed up. The function returns true if it was successful in placing
3478 3381   * the reservation.
3479 3382   */
3480 3383  boolean_t
↓ open down ↓ 30 lines elided ↑ open up ↑
3511 3414  metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, zio_t *zio)
3512 3415  {
3513 3416          ASSERT(mc->mc_alloc_throttle_enabled);
3514 3417          mutex_enter(&mc->mc_lock);
3515 3418          for (int d = 0; d < slots; d++) {
3516 3419                  (void) refcount_remove(&mc->mc_alloc_slots, zio);
3517 3420          }
3518 3421          mutex_exit(&mc->mc_lock);
3519 3422  }
3520 3423  
3521      -static int
3522      -metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
3523      -    uint64_t txg)
3524      -{
3525      -        metaslab_t *msp;
3526      -        spa_t *spa = vd->vdev_spa;
3527      -        int error = 0;
3528      -
3529      -        if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
3530      -                return (ENXIO);
3531      -
3532      -        ASSERT3P(vd->vdev_ms, !=, NULL);
3533      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3534      -
3535      -        mutex_enter(&msp->ms_lock);
3536      -
3537      -        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
3538      -                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
3539      -
3540      -        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
3541      -                error = SET_ERROR(ENOENT);
3542      -
3543      -        if (error || txg == 0) {        /* txg == 0 indicates dry run */
3544      -                mutex_exit(&msp->ms_lock);
3545      -                return (error);
3546      -        }
3547      -
3548      -        VERIFY(!msp->ms_condensing);
3549      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3550      -        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3551      -        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
3552      -        range_tree_remove(msp->ms_tree, offset, size);
3553      -
3554      -        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
3555      -                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
3556      -                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
3557      -                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
3558      -        }
3559      -
3560      -        mutex_exit(&msp->ms_lock);
3561      -
3562      -        return (0);
3563      -}
3564      -
3565      -typedef struct metaslab_claim_cb_arg_t {
3566      -        uint64_t        mcca_txg;
3567      -        int             mcca_error;
3568      -} metaslab_claim_cb_arg_t;
3569      -
3570      -/* ARGSUSED */
3571      -static void
3572      -metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3573      -    uint64_t size, void *arg)
3574      -{
3575      -        metaslab_claim_cb_arg_t *mcca_arg = arg;
3576      -
3577      -        if (mcca_arg->mcca_error == 0) {
3578      -                mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
3579      -                    size, mcca_arg->mcca_txg);
3580      -        }
3581      -}
3582      -
3583 3424  int
3584      -metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
3585      -{
3586      -        if (vd->vdev_ops->vdev_op_remap != NULL) {
3587      -                metaslab_claim_cb_arg_t arg;
3588      -
3589      -                /*
3590      -                 * Only zdb(1M) can claim on indirect vdevs.  This is used
3591      -                 * to detect leaks of mapped space (that are not accounted
3592      -                 * for in the obsolete counts, spacemap, or bpobj).
3593      -                 */
3594      -                ASSERT(!spa_writeable(vd->vdev_spa));
3595      -                arg.mcca_error = 0;
3596      -                arg.mcca_txg = txg;
3597      -
3598      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3599      -                    metaslab_claim_impl_cb, &arg);
3600      -
3601      -                if (arg.mcca_error == 0) {
3602      -                        arg.mcca_error = metaslab_claim_concrete(vd,
3603      -                            offset, size, txg);
3604      -                }
3605      -                return (arg.mcca_error);
3606      -        } else {
3607      -                return (metaslab_claim_concrete(vd, offset, size, txg));
3608      -        }
3609      -}
3610      -
3611      -/*
3612      - * Intent log support: upon opening the pool after a crash, notify the SPA
3613      - * of blocks that the intent log has allocated for immediate write, but
3614      - * which are still considered free by the SPA because the last transaction
3615      - * group didn't commit yet.
3616      - */
3617      -static int
3618      -metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3619      -{
3620      -        uint64_t vdev = DVA_GET_VDEV(dva);
3621      -        uint64_t offset = DVA_GET_OFFSET(dva);
3622      -        uint64_t size = DVA_GET_ASIZE(dva);
3623      -        vdev_t *vd;
3624      -
3625      -        if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
3626      -                return (SET_ERROR(ENXIO));
3627      -        }
3628      -
3629      -        ASSERT(DVA_IS_VALID(dva));
3630      -
3631      -        if (DVA_GET_GANG(dva))
3632      -                size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3633      -
3634      -        return (metaslab_claim_impl(vd, offset, size, txg));
3635      -}
3636      -
3637      -int
3638 3425  metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
3639 3426      int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
3640 3427      zio_alloc_list_t *zal, zio_t *zio)
3641 3428  {
3642 3429          dva_t *dva = bp->blk_dva;
3643 3430          dva_t *hintdva = hintbp->blk_dva;
3644 3431          int error = 0;
3645 3432  
3646 3433          ASSERT(bp->blk_birth == 0);
3647 3434          ASSERT(BP_PHYSICAL_BIRTH(bp) == 0);
[ 3 lines elided ]
3651 3438          if (mc->mc_rotor == NULL) {     /* no vdevs in this class */
3652 3439                  spa_config_exit(spa, SCL_ALLOC, FTAG);
3653 3440                  return (SET_ERROR(ENOSPC));
3654 3441          }
3655 3442  
3656 3443          ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
3657 3444          ASSERT(BP_GET_NDVAS(bp) == 0);
3658 3445          ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
3659 3446          ASSERT3P(zal, !=, NULL);
3660 3447  
3661      -        for (int d = 0; d < ndvas; d++) {
3662      -                error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
3663      -                    txg, flags, zal);
3664      -                if (error != 0) {
3665      -                        for (d--; d >= 0; d--) {
3666      -                                metaslab_unalloc_dva(spa, &dva[d], txg);
3667      -                                metaslab_group_alloc_decrement(spa,
3668      -                                    DVA_GET_VDEV(&dva[d]), zio, flags);
3669      -                                bzero(&dva[d], sizeof (dva_t));
     3448 +        if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
     3449 +            !(flags & (METASLAB_GANG_HEADER)) &&
     3450 +            !(spa->spa_meta_policy.spa_small_data_to_special &&
     3451 +            psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
     3452 +                error = metaslab_alloc_dva(spa, spa_normal_class(spa),
     3453 +                    psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
     3454 +                    flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
     3455 +                if (error == 0) {
     3456 +                        error = metaslab_alloc_dva(spa, mc, psize,
     3457 +                            &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
     3458 +                        if (error != 0) {
     3459 +                                error = 0;
     3460 +                                /*
     3461 +                                 * Move the NORMAL copy into the first DVA slot
     3462 +                                 * and clear the second DVA. After that this BP
     3463 +                                 * is just a regular BP with a single DVA.
     3464 +                                 *
     3465 +                                 * This operation is valid only if:
     3466 +                                 * WBC_SPECIAL_DVA is dva[0]
     3467 +                                 * WBC_NORMAL_DVA is dva[1]
     3468 +                                 *
     3469 +                                 * see wbc.h
     3470 +                                 */
     3471 +                                bcopy(&dva[WBC_NORMAL_DVA],
     3472 +                                    &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
     3473 +                                bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
     3474 +
     3475 +                                /*
     3476 +                                 * Allocation of the special DVA has failed,
     3477 +                                 * so this BP will be a regular BP; update
     3478 +                                 * the metaslab group's queue depth based
     3479 +                                 * on the newly allocated dva.
     3480 +                                 */
     3481 +                                metaslab_group_alloc_increment(spa,
     3482 +                                    DVA_GET_VDEV(&dva[0]), zio, flags);
     3483 +                        } else {
     3484 +                                BP_SET_SPECIAL(bp, 1);
3670 3485                          }
     3486 +                } else {
3671 3487                          spa_config_exit(spa, SCL_ALLOC, FTAG);
3672 3488                          return (error);
3673      -                } else {
3674      -                        /*
3675      -                         * Update the metaslab group's queue depth
3676      -                         * based on the newly allocated dva.
3677      -                         */
3678      -                        metaslab_group_alloc_increment(spa,
3679      -                            DVA_GET_VDEV(&dva[d]), zio, flags);
3680 3489                  }
3681      -
     3490 +        } else {
     3491 +                for (int d = 0; d < ndvas; d++) {
     3492 +                        error = metaslab_alloc_dva(spa, mc, psize, dva, d,
     3493 +                            hintdva, txg, flags, zal);
     3494 +                        if (error != 0) {
     3495 +                                for (d--; d >= 0; d--) {
     3496 +                                        metaslab_free_dva(spa, &dva[d],
     3497 +                                            txg, B_TRUE);
     3498 +                                        metaslab_group_alloc_decrement(spa,
     3499 +                                            DVA_GET_VDEV(&dva[d]), zio, flags);
     3500 +                                        bzero(&dva[d], sizeof (dva_t));
     3501 +                                }
     3502 +                                spa_config_exit(spa, SCL_ALLOC, FTAG);
     3503 +                                return (error);
     3504 +                        } else {
     3505 +                                /*
     3506 +                                 * Update the metaslab group's queue depth
     3507 +                                 * based on the newly allocated dva.
     3508 +                                 */
     3509 +                                metaslab_group_alloc_increment(spa,
     3510 +                                    DVA_GET_VDEV(&dva[d]), zio, flags);
     3511 +                        }
     3512 +                }
     3513 +                ASSERT(BP_GET_NDVAS(bp) == ndvas);
3682 3514          }
3683 3515          ASSERT(error == 0);
3684      -        ASSERT(BP_GET_NDVAS(bp) == ndvas);
3685 3516  
3686 3517          spa_config_exit(spa, SCL_ALLOC, FTAG);
3687 3518  
3688 3519          BP_SET_BIRTH(bp, txg, txg);
3689 3520  
3690 3521          return (0);
3691 3522  }
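A summary sketch of the DVA layout produced by the special-class branch above, restating its comments (see wbc.h for the authoritative slot definitions):

        /*
         * BP_IS_SPECIAL(bp) set:
         *         dva[WBC_SPECIAL_DVA]  - copy on the special class
         *         dva[WBC_NORMAL_DVA]   - copy on the normal class
         *
         * If the special-class allocation fails, the normal DVA is moved into
         * dva[WBC_SPECIAL_DVA], dva[WBC_NORMAL_DVA] is zeroed, and
         * BP_SET_SPECIAL() is never applied, leaving an ordinary
         * single-DVA BP.
         */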
3692 3523  
3693 3524  void
3694 3525  metaslab_free(spa_t *spa, const blkptr_t *bp, uint64_t txg, boolean_t now)
3695 3526  {
3696 3527          const dva_t *dva = bp->blk_dva;
3697 3528          int ndvas = BP_GET_NDVAS(bp);
3698 3529  
3699 3530          ASSERT(!BP_IS_HOLE(bp));
3700 3531          ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
3701 3532  
3702 3533          spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
3703 3534  
3704      -        for (int d = 0; d < ndvas; d++) {
3705      -                if (now) {
3706      -                        metaslab_unalloc_dva(spa, &dva[d], txg);
3707      -                } else {
3708      -                        metaslab_free_dva(spa, &dva[d], txg);
     3535 +        if (BP_IS_SPECIAL(bp)) {
     3536 +                int start_dva;
     3537 +                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
     3538 +
     3539 +                mutex_enter(&wbc_data->wbc_lock);
     3540 +                start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
     3541 +                mutex_exit(&wbc_data->wbc_lock);
     3542 +
     3543 +                /*
     3544 +                 * The actual freeing need not hold the lock:
     3545 +                 * the block is already exempted from the WBC
     3546 +                 * trees and thus will not be moved.
     3547 +                 */
     3548 +                metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
     3549 +                if (start_dva == 0) {
     3550 +                        metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
     3551 +                            txg, now);
3709 3552                  }
     3553 +        } else {
     3554 +                for (int d = 0; d < ndvas; d++)
     3555 +                        metaslab_free_dva(spa, &dva[d], txg, now);
3710 3556          }
3711 3557  
3712 3558          spa_config_exit(spa, SCL_FREE, FTAG);
3713 3559  }
3714 3560  
3715 3561  int
3716 3562  metaslab_claim(spa_t *spa, const blkptr_t *bp, uint64_t txg)
3717 3563  {
3718 3564          const dva_t *dva = bp->blk_dva;
3719 3565          int ndvas = BP_GET_NDVAS(bp);
[ 5 lines elided ]
3725 3571                  /*
3726 3572                   * First do a dry run to make sure all DVAs are claimable,
3727 3573                   * so we don't have to unwind from partial failures below.
3728 3574                   */
3729 3575                  if ((error = metaslab_claim(spa, bp, 0)) != 0)
3730 3576                          return (error);
3731 3577          }
3732 3578  
3733 3579          spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
3734 3580  
3735      -        for (int d = 0; d < ndvas; d++)
3736      -                if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
3737      -                        break;
     3581 +        if (BP_IS_SPECIAL(bp)) {
     3582 +                int start_dva;
     3583 +                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
3738 3584  
     3585 +                mutex_enter(&wbc_data->wbc_lock);
     3586 +                start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
     3587 +
     3588 +                /*
     3589 +                 * Actual claiming must be done under the lock for WBC blocks
     3590 +                 * to ensure that zdb does not fail. The only other user of
     3591 +                 * claiming is the ZIL, whose blocks cannot be WBC ones, and
     3592 +                 * thus the lock is not held for them.
     3593 +                 */
     3594 +                error = metaslab_claim_dva(spa,
     3595 +                    &dva[WBC_NORMAL_DVA], txg);
     3596 +                if (error == 0 && start_dva == 0) {
     3597 +                        error = metaslab_claim_dva(spa,
     3598 +                            &dva[WBC_SPECIAL_DVA], txg);
     3599 +                }
     3600 +
     3601 +                mutex_exit(&wbc_data->wbc_lock);
     3602 +        } else {
     3603 +                for (int d = 0; d < ndvas; d++)
     3604 +                        if ((error = metaslab_claim_dva(spa,
     3605 +                            &dva[d], txg)) != 0)
     3606 +                                break;
     3607 +        }
     3608 +
3739 3609          spa_config_exit(spa, SCL_ALLOC, FTAG);
3740 3610  
3741 3611          ASSERT(error == 0 || txg == 0);
3742 3612  
3743 3613          return (error);
3744 3614  }
3745 3615  
3746      -/* ARGSUSED */
3747      -static void
3748      -metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
3749      -    uint64_t size, void *arg)
     3616 +void
     3617 +metaslab_check_free(spa_t *spa, const blkptr_t *bp)
3750 3618  {
3751      -        if (vd->vdev_ops == &vdev_indirect_ops)
     3619 +        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3752 3620                  return;
3753 3621  
3754      -        metaslab_check_free_impl(vd, offset, size);
     3622 +        if (BP_IS_SPECIAL(bp)) {
     3623 +                /* Do not check frees for WBC blocks */
     3624 +                return;
     3625 +        }
     3626 +
     3627 +        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
     3628 +        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
     3629 +                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
     3630 +                vdev_t *vd = vdev_lookup_top(spa, vdev);
     3631 +                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
     3632 +                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
     3633 +                metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3634 +
     3635 +                if (msp->ms_loaded) {
     3636 +                        range_tree_verify(msp->ms_tree, offset, size);
     3637 +                        range_tree_verify(msp->ms_cur_ts->ts_tree,
     3638 +                            offset, size);
     3639 +                        if (msp->ms_prev_ts != NULL) {
     3640 +                                range_tree_verify(msp->ms_prev_ts->ts_tree,
     3641 +                                    offset, size);
     3642 +                        }
     3643 +                }
     3644 +
     3645 +                range_tree_verify(msp->ms_freeingtree, offset, size);
     3646 +                range_tree_verify(msp->ms_freedtree, offset, size);
     3647 +                for (int j = 0; j < TXG_DEFER_SIZE; j++)
     3648 +                        range_tree_verify(msp->ms_defertree[j], offset, size);
     3649 +        }
     3650 +        spa_config_exit(spa, SCL_VDEV, FTAG);
3755 3651  }
3756 3652  
3757      -static void
3758      -metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
     3653 +/*
     3654 + * Trims all free space in the metaslab. Returns the root TRIM zio (that the
     3655 + * caller should zio_wait() for) and the amount of space in the metaslab that
     3656 + * has been scheduled for trimming in the `delta' return argument.
     3657 + */
     3658 +zio_t *
     3659 +metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
3759 3660  {
3760      -        metaslab_t *msp;
3761      -        spa_t *spa = vd->vdev_spa;
     3661 +        boolean_t was_loaded;
     3662 +        uint64_t trimmed_space;
     3663 +        zio_t *trim_io;
3762 3664  
3763      -        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3764      -                return;
     3665 +        ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
3765 3666  
3766      -        if (vd->vdev_ops->vdev_op_remap != NULL) {
3767      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3768      -                    metaslab_check_free_impl_cb, NULL);
3769      -                return;
     3667 +        mutex_enter(&msp->ms_lock);
     3668 +
     3669 +        while (msp->ms_loading)
     3670 +                metaslab_load_wait(msp);
     3671 +        /* If we loaded the metaslab, unload it when we're done. */
     3672 +        was_loaded = msp->ms_loaded;
     3673 +        if (!was_loaded) {
     3674 +                if (metaslab_load(msp) != 0) {
     3675 +                        mutex_exit(&msp->ms_lock);
     3676 +                        return (0);
     3677 +                }
3770 3678          }
     3679 +        /* Flush out any scheduled extents and add everything in ms_tree. */
     3680 +        range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
     3681 +        range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
3771 3682  
3772      -        ASSERT(vdev_is_concrete(vd));
3773      -        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3774      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
     3683 +        /* Force this trim to take place ASAP. */
     3684 +        if (msp->ms_prev_ts != NULL)
     3685 +                metaslab_free_trimset(msp->ms_prev_ts);
     3686 +        msp->ms_prev_ts = msp->ms_cur_ts;
     3687 +        msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
     3688 +        trimmed_space = range_tree_space(msp->ms_tree);
     3689 +        if (!was_loaded)
     3690 +                metaslab_unload(msp);
3775 3691  
3776      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3692 +        trim_io = metaslab_exec_trim(msp);
     3693 +        mutex_exit(&msp->ms_lock);
     3694 +        *delta = trimmed_space;
3777 3695  
     3696 +        return (trim_io);
     3697 +}
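A minimal usage sketch for metaslab_trim_all(), following the contract in the comment above (msp is assumed to be a valid metaslab; the caller-side iteration and locking of a real on-demand trim are omitted):

        uint64_t delta = 0;
        zio_t *trim_zio = metaslab_trim_all(msp, &delta);

        if (trim_zio != NULL)
                (void) zio_wait(trim_zio);      /* wait for the TRIM zio tree */
        /* delta now holds the amount of space scheduled for trimming. */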
     3698 +
     3699 +/*
     3700 + * Notifies the trimsets in a metaslab that an extent has been allocated.
     3701 + * This removes the segment from the queues of extents waiting to be trimmed.
     3702 + */
     3703 +static void
     3704 +metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
     3705 +{
     3706 +        metaslab_t *msp = arg;
     3707 +
     3708 +        range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
     3709 +        if (msp->ms_prev_ts != NULL) {
     3710 +                range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
     3711 +                    size);
     3712 +        }
     3713 +}
     3714 +
     3715 +/*
     3716 + * Notifies the trimsets in a metaslab that an extent has been freed.
     3717 + * This adds the segment to the currently open queue of extents waiting
     3718 + * to be trimmed.
     3719 + */
     3720 +static void
     3721 +metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
     3722 +{
     3723 +        metaslab_t *msp = arg;
     3724 +        ASSERT(msp->ms_cur_ts != NULL);
     3725 +        range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
     3726 +}
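
Both callbacks share the range-tree callback shape, void (*)(void *, uint64_t, uint64_t), so they can be handed to range tree iterators (as metaslab_trim_all does with range_tree_walk above) or called directly. A hedged sketch of the intended use follows, with a hypothetical wrapper name; the real call sites are in the allocation and free paths elsewhere in this patch:

/*
 * Hypothetical wrapper, for illustration only: keep the trimsets in
 * sync with an extent's life cycle.
 */
static void
example_note_extent(metaslab_t *msp, uint64_t offset, uint64_t size,
    boolean_t allocated)
{
        ASSERT(MUTEX_HELD(&msp->ms_lock));

        if (allocated)
                metaslab_trim_remove(msp, offset, size); /* cancel pending trim */
        else
                metaslab_trim_add(msp, offset, size);    /* queue for trimming */
}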
     3727 +
     3728 +/*
     3729 + * Performs a metaslab's automatic trim processing. This must be called
     3730 + * from metaslab_sync with the number of the txg being synced. Trims are
     3731 + * issued at intervals dictated by the zfs_txgs_per_trim tunable.
     3732 + */
     3733 +void
     3734 +metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
     3735 +{
     3736 +        /* snapshot the tunable so we use one consistent value throughout */
     3737 +        uint64_t txgs_per_trim = zfs_txgs_per_trim;
     3738 +
     3739 +        ASSERT(!MUTEX_HELD(&msp->ms_lock));
3778 3740          mutex_enter(&msp->ms_lock);
3779      -        if (msp->ms_loaded)
3780      -                range_tree_verify(msp->ms_tree, offset, size);
3781 3741  
3782      -        range_tree_verify(msp->ms_freeingtree, offset, size);
3783      -        range_tree_verify(msp->ms_freedtree, offset, size);
3784      -        for (int j = 0; j < TXG_DEFER_SIZE; j++)
3785      -                range_tree_verify(msp->ms_defertree[j], offset, size);
     3742 +        /*
     3743 +         * Since we typically have hundreds of metaslabs per vdev, but we only
     3744 +         * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
     3745 +         * could sequence the TRIM commands from all metaslabs so that they
     3746 + * don't all pound the device in the same txg. We do so by
     3747 +         * artificially inflating the birth txg of the first trim set by a
     3748 +         * sequence number derived from the metaslab's starting offset
     3749 +         * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
     3750 +         * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
     3751 +         *
     3752 +         * If we detect that the txg has advanced too far ahead of ts_birth,
     3753 +         * it means our birth txg is out of lockstep. Recompute it by
     3754 +         * rounding down to the nearest zfs_txgs_per_trim multiple and adding
     3755 +         * our metaslab id modulo zfs_txgs_per_trim.
     3756 +         */
     3757 +        if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
     3758 +                msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
     3759 +                    txgs_per_trim + (msp->ms_id % txgs_per_trim);
     3760 +        }
     3761 +
     3762 +        /* Time to swap out the current and previous trimsets */
     3763 +        if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
     3764 +                if (msp->ms_prev_ts != NULL) {
     3765 +                        if (msp->ms_trimming_ts != NULL) {
     3766 +                                spa_t *spa = msp->ms_group->mg_class->mc_spa;
     3767 +                                /*
     3768 +                                 * The previous trim run is still ongoing, so
     3769 +                                 * the device is reacting slowly to our trim
     3770 +                                 * requests. Drop this trimset, so as not to
     3771 +                                 * back the device up with trim requests.
     3772 +                                 */
     3773 +                                spa_trimstats_auto_slow_incr(spa);
     3774 +                                metaslab_free_trimset(msp->ms_prev_ts);
     3775 +                        } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
     3776 +                                /*
     3777 +                                 * If a manual trim is ongoing, we want to
     3778 +                                 * inhibit autotrim temporarily so it doesn't
     3779 +                                 * slow down the manual trim.
     3780 +                                 */
     3781 +                                metaslab_free_trimset(msp->ms_prev_ts);
     3782 +                        } else {
     3783 +                                /*
     3784 +                                 * Trim out aged extents on the vdevs - these
     3785 +                                 * are safe to be destroyed now. We'll keep
     3786 +                                 * the trimset around to deny allocations from
     3787 +                                 * these regions while the trims are ongoing.
     3788 +                                 */
     3789 +                                zio_nowait(metaslab_exec_trim(msp));
     3790 +                        }
     3791 +                }
     3792 +                msp->ms_prev_ts = msp->ms_cur_ts;
     3793 +                msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
     3794 +        }
3786 3795          mutex_exit(&msp->ms_lock);
3787 3796  }
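
The staggering arithmetic in the comment above can be made concrete with a worked example under the defaults it mentions (about 200 metaslabs per vdev, zfs_txgs_per_trim = 32). The helper below is purely illustrative and simply re-derives ts_birth the way the resynchronization branch does:

/*
 * Illustrative only: given the current txg and a metaslab id, compute
 * the txg in which that metaslab will next swap trimsets, using the
 * same arithmetic as the resync branch in metaslab_auto_trim().
 */
static uint64_t
example_next_trimset_swap(uint64_t txg, uint64_t ms_id,
    uint64_t txgs_per_trim)
{
        uint64_t birth = (txg / txgs_per_trim) * txgs_per_trim +
            (ms_id % txgs_per_trim);

        /* the swap fires when txg reaches ts_birth + txgs_per_trim */
        return (birth + txgs_per_trim);
}

At txg 1000 this yields txg 1024 for metaslab 0, 1025 for metaslab 1, up to 1055 for metaslab 31, wrapping back to 1024 for metaslab 32; with 200 metaslabs that works out to the ~6.25 metaslabs trimmed per txg quoted in the comment.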
3788 3797  
3789      -void
3790      -metaslab_check_free(spa_t *spa, const blkptr_t *bp)
     3798 +static void
     3799 +metaslab_trim_done(zio_t *zio)
3791 3800  {
3792      -        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3793      -                return;
     3801 +        metaslab_t *msp = zio->io_private;
     3802 +        boolean_t held;
3794 3803  
3795      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3796      -        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
3797      -                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
3798      -                vdev_t *vd = vdev_lookup_top(spa, vdev);
3799      -                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
3800      -                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
     3804 +        ASSERT(msp != NULL);
     3805 +        ASSERT(msp->ms_trimming_ts != NULL);
     3806 +        held = MUTEX_HELD(&msp->ms_lock);
     3807 +        if (!held)
     3808 +                mutex_enter(&msp->ms_lock);
     3809 +        metaslab_free_trimset(msp->ms_trimming_ts);
     3810 +        msp->ms_trimming_ts = NULL;
     3811 +        cv_signal(&msp->ms_trim_cv);
     3812 +        if (!held)
     3813 +                mutex_exit(&msp->ms_lock);
     3814 +}
3801 3815  
3802      -                if (DVA_GET_GANG(&bp->blk_dva[i]))
3803      -                        size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
     3816 +/*
     3817 + * Executes a zio_trim on a range tree holding freed extents in the metaslab.
     3818 + */
     3819 +static zio_t *
     3820 +metaslab_exec_trim(metaslab_t *msp)
     3821 +{
     3822 +        metaslab_group_t *mg = msp->ms_group;
     3823 +        spa_t *spa = mg->mg_class->mc_spa;
     3824 +        vdev_t *vd = mg->mg_vd;
     3825 +        range_tree_t *trim_tree;
     3826 +        zio_t *zio;
3804 3827  
3805      -                ASSERT3P(vd, !=, NULL);
     3828 +        ASSERT(MUTEX_HELD(&msp->ms_lock));
3806 3829  
3807      -                metaslab_check_free_impl(vd, offset, size);
     3830 +        /* wait for a preceding trim to finish */
     3831 +        while (msp->ms_trimming_ts != NULL)
     3832 +                cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
     3833 +        msp->ms_trimming_ts = msp->ms_prev_ts;
     3834 +        msp->ms_prev_ts = NULL;
     3835 +        trim_tree = msp->ms_trimming_ts->ts_tree;
     3836 +#ifdef  DEBUG
     3837 +        if (msp->ms_loaded) {
     3838 +                for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
     3839 +                    rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
     3840 +                        if (!range_tree_contains(msp->ms_tree,
     3841 +                            rs->rs_start, rs->rs_end - rs->rs_start)) {
     3842 +                                panic("trimming allocated region; mss=%p",
     3843 +                                    (void *)rs);
     3844 +                        }
     3845 +                }
3808 3846          }
3809      -        spa_config_exit(spa, SCL_VDEV, FTAG);
     3847 +#endif
     3848 +
     3849 +        /* Nothing to trim */
     3850 +        if (range_tree_space(trim_tree) == 0) {
     3851 +                metaslab_free_trimset(msp->ms_trimming_ts);
     3852 +                msp->ms_trimming_ts = NULL;
     3853 +                return (zio_root(spa, NULL, NULL, 0));
     3854 +        }
     3855 +        zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
     3856 +            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
     3857 +            ZIO_FLAG_CONFIG_WRITER, msp);
     3858 +
     3859 +        return (zio);
     3860 +}
     3861 +
     3862 +/*
     3863 + * Allocates and initializes a new trimset structure. The `txg' argument
     3864 + * indicates when this trimset was born and `lock' indicates the lock to
     3865 + * link to the range tree.
     3866 + */
     3867 +static metaslab_trimset_t *
     3868 +metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
     3869 +{
     3870 +        metaslab_trimset_t *ts;
     3871 +
     3872 +        ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
     3873 +        ts->ts_birth = txg;
     3874 +        ts->ts_tree = range_tree_create(NULL, NULL, lock);
     3875 +
     3876 +        return (ts);
     3877 +}
     3878 +
     3879 +/*
     3880 + * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
     3881 + */
     3882 +static void
     3883 +metaslab_free_trimset(metaslab_trimset_t *ts)
     3884 +{
     3885 +        range_tree_vacate(ts->ts_tree, NULL, NULL);
     3886 +        range_tree_destroy(ts->ts_tree);
     3887 +        kmem_free(ts, sizeof (*ts));
     3888 +}
     3889 +
     3890 +/*
     3891 + * Checks whether an allocation conflicts with an ongoing trim operation in
     3892 + * the given metaslab. This function takes a segment starting at `*offset'
     3893 + * of `size' and checks whether it hits any region in the metaslab currently
     3894 + * being trimmed. If yes, it tries to adjust the allocation to the end of
     3895 + * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
     3896 + * `limit' (no part of the allocation is allowed to go past this point).
     3897 + *
     3898 + * Returns B_FALSE if either the original allocation wasn't in conflict, or
     3899 + * the conflict could be resolved by adjusting the value stored in `offset'
     3900 + * such that the whole allocation still fits below `limit'. Returns B_TRUE
     3901 + * if the allocation conflict couldn't be resolved.
     3902 + */
     3903 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
     3904 +    uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit)
     3905 +{
     3906 +        uint64_t new_offset;
     3907 +
     3908 +        if (msp->ms_trimming_ts == NULL)
     3909 +                /* no trim conflict, original offset is OK */
     3910 +                return (B_FALSE);
     3911 +
     3912 +        new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
     3913 +            *offset, size), align);
     3914 +        if (new_offset != *offset && new_offset + size > limit)
     3915 +                /* trim conflict and adjustment not possible */
     3916 +                return (B_TRUE);
     3917 +
     3918 +        /* no conflict, or the adjusted offset still fits below the limit */
     3919 +        *offset = new_offset;
     3920 +        return (B_FALSE);
3810 3921  }
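
A hedged sketch of how an allocator consumes this helper; the function name and the segment-based placement below are hypothetical (the real call sites are the block allocators elsewhere in this patch), but they show the intended meaning of `align' and `limit':

/*
 * Illustrative only: try to place `size' bytes inside the free segment
 * [rs_start, rs_end), skipping past any region currently being trimmed.
 * Returns the chosen offset, or -1ULL if the segment cannot be used.
 */
static uint64_t
example_place_allocation(metaslab_t *msp, range_seg_t *rs, uint64_t size,
    uint64_t align)
{
        uint64_t offset = P2ROUNDUP(rs->rs_start, align);

        if (offset + size > rs->rs_end)
                return (-1ULL);
        if (metaslab_check_trim_conflict(msp, &offset, size, align,
            rs->rs_end))
                return (-1ULL); /* conflict could not be resolved */

        /* `offset' may now sit just past the region being trimmed */
        return (offset);
}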
    