NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrectly
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
 * refactored WRC move-logic to use zio kmem caches
 * replaced the size and compression fields with a blk_prop field
   (the same as in blkptr_t) to slightly reduce the size of wrc_block_t,
   and use blkptr_t-style macros to get PSIZE, LSIZE
   and COMPRESSION
 * reduced atomic calls to make the CPU happier
 * removed unused code
 * fixed naming of variables
 * fixed a possible system panic after restarting the system
   with WRC enabled
 * fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures

          --- old/usr/src/uts/common/fs/zfs/metaslab.c
          +++ new/usr/src/uts/common/fs/zfs/metaslab.c
[ 15 lines elided ]
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   23   * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
  24   24   * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  25   25   * Copyright (c) 2014 Integros [integros.com]
       26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
  26   27   */
  27   28  
  28   29  #include <sys/zfs_context.h>
  29   30  #include <sys/dmu.h>
  30   31  #include <sys/dmu_tx.h>
  31   32  #include <sys/space_map.h>
  32   33  #include <sys/metaslab_impl.h>
  33   34  #include <sys/vdev_impl.h>
  34   35  #include <sys/zio.h>
  35   36  #include <sys/spa_impl.h>
  36   37  #include <sys/zfeature.h>
  37      -#include <sys/vdev_indirect_mapping.h>
       38 +#include <sys/wbc.h>
  38   39  
  39   40  #define GANG_ALLOCATION(flags) \
  40   41          ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
  41   42  
  42   43  uint64_t metaslab_aliquot = 512ULL << 10;
  43   44  uint64_t metaslab_gang_bang = SPA_MAXBLOCKSIZE + 1;     /* force gang blocks */
  44   45  
  45   46  /*
  46   47   * The in-core space map representation is more compact than its on-disk form.
  47   48   * The zfs_condense_pct determines how much more compact the in-core
[ 112 lines elided ]
 160  161   * Enable/disable lba weighting (i.e. outer tracks are given preference).
 161  162   */
 162  163  boolean_t metaslab_lba_weighting_enabled = B_TRUE;
 163  164  
 164  165  /*
 165  166   * Enable/disable metaslab group biasing.
 166  167   */
 167  168  boolean_t metaslab_bias_enabled = B_TRUE;
 168  169  
 169  170  /*
 170      - * Enable/disable remapping of indirect DVAs to their concrete vdevs.
 171      - */
 172      -boolean_t zfs_remap_blkptr_enable = B_TRUE;
 173      -
 174      -/*
 175  171   * Enable/disable segment-based metaslab selection.
 176  172   */
 177  173  boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
 178  174  
 179  175  /*
 180  176   * When using segment-based metaslab selection, we will continue
 181  177   * allocating from the active metaslab until we have exhausted
 182  178   * zfs_metaslab_switch_threshold of its buckets.
 183  179   */
 184  180  int zfs_metaslab_switch_threshold = 2;
[ 9 lines elided ]
 194  190   * in a given list when running in non-debug mode. We limit the number
 195  191   * of entries in non-debug mode to prevent us from using up too much memory.
 196  192   * The limit should be sufficiently large that we don't expect any allocation
  197  193   * to ever exceed this value. In debug mode, the system will panic if this
 198  194   * limit is ever reached allowing for further investigation.
 199  195   */
 200  196  uint64_t metaslab_trace_max_entries = 5000;
 201  197  
 202  198  static uint64_t metaslab_weight(metaslab_t *);
 203  199  static void metaslab_set_fragmentation(metaslab_t *);
 204      -static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
 205      -static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
 206  200  
 207  201  kmem_cache_t *metaslab_alloc_trace_cache;
 208  202  
 209  203  /*
       204 + * Select the DVA allocator: 0 = space-based (default), 1 = latency-based,
       205 + * 2 = hybrid.  A value other than 0, 1 or 2 is treated as 0.
      206 + */
      207 +int metaslab_alloc_dva_algorithm = 0;
      208 +
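
/*
 * Illustrative sketch, not part of this webrev: the comment above says any
 * value outside 0..2 falls back to the space-based allocator.  A caller
 * could normalize the tunable as below; the helper name is hypothetical.
 */
static int
metaslab_alloc_dva_algorithm_get(void)
{
        int alg = metaslab_alloc_dva_algorithm;

        /* 1 = latency-based, 2 = hybrid; anything else means 0 (space-based) */
        return ((alg == 1 || alg == 2) ? alg : 0);
}
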
      209 +/*
      210 + * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
      211 + * issued to the underlying vdev. We keep two range trees of extents
      212 + * (called "trim sets") to be trimmed per metaslab, the `current' and
       213 + * the `previous' TS. New frees are added to the current TS. Then,
      214 + * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
      215 + * TS becomes the `previous' TS and a new, blank TS is created to be
      216 + * the new `current', which will then start accumulating any new frees.
      217 + * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
      218 + * extents are trimmed, the TS is destroyed and the current TS again
      219 + * becomes the previous TS.
      220 + * This serves to fulfill two functions: aggregate many small frees
      221 + * into fewer larger trim operations (which should help with devices
      222 + * which do not take so kindly to them) and to allow for disaster
      223 + * recovery (extents won't get trimmed immediately, but instead only
       224 + * after passing this rather long timeout, thus preserving
      225 + * 'zfs import -F' functionality).
      226 + */
      227 +unsigned int zfs_txgs_per_trim = 32;
      228 +
      229 +static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
      230 +static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
      231 +
      232 +static zio_t *metaslab_exec_trim(metaslab_t *msp);
      233 +
      234 +static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
      235 +static void metaslab_free_trimset(metaslab_trimset_t *ts);
      236 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
      237 +    uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
      238 +
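
/*
 * Illustrative sketch, not part of this webrev: a minimal picture of the
 * trim-set rotation described above, using the helpers declared above.
 * Where the rotation actually happens, how the trim zio returned by
 * metaslab_exec_trim() is issued, and when the retired set is freed are
 * assumptions here; the real logic lives in the metaslab sync/trim paths.
 */
static void
metaslab_trimset_rotate_sketch(metaslab_t *msp, uint64_t txg)
{
        ASSERT(MUTEX_HELD(&msp->ms_lock));

        /* only rotate once every zfs_txgs_per_trim TXGs */
        if ((txg % zfs_txgs_per_trim) != 0)
                return;

        if (msp->ms_prev_ts != NULL) {
                zio_t *trim_io;

                /* issue TRIMs for extents that have aged a full cycle */
                trim_io = metaslab_exec_trim(msp);
                if (trim_io != NULL)
                        zio_nowait(trim_io);
        }

        /* the current trim set ages into the `previous' slot */
        msp->ms_prev_ts = msp->ms_cur_ts;
        msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
}
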
      239 +/*
 210  240   * ==========================================================================
 211  241   * Metaslab classes
 212  242   * ==========================================================================
 213  243   */
 214  244  metaslab_class_t *
 215  245  metaslab_class_create(spa_t *spa, metaslab_ops_t *ops)
 216  246  {
 217  247          metaslab_class_t *mc;
 218  248  
 219  249          mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
 220  250  
      251 +        mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
      252 +        avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
      253 +            sizeof (zio_t), offsetof(zio_t, io_alloc_node));
      254 +
 221  255          mc->mc_spa = spa;
 222  256          mc->mc_rotor = NULL;
 223  257          mc->mc_ops = ops;
 224  258          mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
 225  259          refcount_create_tracked(&mc->mc_alloc_slots);
 226  260  
 227  261          return (mc);
 228  262  }
 229  263  
 230  264  void
 231  265  metaslab_class_destroy(metaslab_class_t *mc)
 232  266  {
 233  267          ASSERT(mc->mc_rotor == NULL);
 234  268          ASSERT(mc->mc_alloc == 0);
 235  269          ASSERT(mc->mc_deferred == 0);
 236  270          ASSERT(mc->mc_space == 0);
 237  271          ASSERT(mc->mc_dspace == 0);
 238  272  
      273 +        avl_destroy(&mc->mc_alloc_tree);
      274 +        mutex_destroy(&mc->mc_alloc_lock);
      275 +
 239  276          refcount_destroy(&mc->mc_alloc_slots);
 240  277          mutex_destroy(&mc->mc_lock);
 241  278          kmem_free(mc, sizeof (metaslab_class_t));
 242  279  }
 243  280  
 244  281  int
 245  282  metaslab_class_validate(metaslab_class_t *mc)
 246  283  {
 247  284          metaslab_group_t *mg;
 248  285          vdev_t *vd;
[ 66 lines elided ]
 315  352              KM_SLEEP);
 316  353  
 317  354          for (int c = 0; c < rvd->vdev_children; c++) {
 318  355                  vdev_t *tvd = rvd->vdev_child[c];
 319  356                  metaslab_group_t *mg = tvd->vdev_mg;
 320  357  
 321  358                  /*
 322  359                   * Skip any holes, uninitialized top-levels, or
  323  360                   * vdevs that are not in this metaslab class.
 324  361                   */
 325      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      362 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 326  363                      mg->mg_class != mc) {
 327  364                          continue;
 328  365                  }
 329  366  
 330  367                  for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
 331  368                          mc_hist[i] += mg->mg_histogram[i];
 332  369          }
 333  370  
 334  371          for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
 335  372                  VERIFY3U(mc_hist[i], ==, mc->mc_histogram[i]);
[ 14 lines elided ]
 350  387          vdev_t *rvd = mc->mc_spa->spa_root_vdev;
 351  388          uint64_t fragmentation = 0;
 352  389  
 353  390          spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
 354  391  
 355  392          for (int c = 0; c < rvd->vdev_children; c++) {
 356  393                  vdev_t *tvd = rvd->vdev_child[c];
 357  394                  metaslab_group_t *mg = tvd->vdev_mg;
 358  395  
 359  396                  /*
 360      -                 * Skip any holes, uninitialized top-levels,
 361      -                 * or vdevs that are not in this metalab class.
      397 +                 * Skip any holes, uninitialized top-levels, or
       398 +                 * vdevs that are not in this metaslab class.
 362  399                   */
 363      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      400 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 364  401                      mg->mg_class != mc) {
 365  402                          continue;
 366  403                  }
 367  404  
 368  405                  /*
 369  406                   * If a metaslab group does not contain a fragmentation
 370  407                   * metric then just bail out.
 371  408                   */
 372  409                  if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
 373  410                          spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
[ 25 lines elided ]
 399  436  {
 400  437          vdev_t *rvd = mc->mc_spa->spa_root_vdev;
 401  438          uint64_t space = 0;
 402  439  
 403  440          spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
 404  441          for (int c = 0; c < rvd->vdev_children; c++) {
 405  442                  uint64_t tspace;
 406  443                  vdev_t *tvd = rvd->vdev_child[c];
 407  444                  metaslab_group_t *mg = tvd->vdev_mg;
 408  445  
 409      -                if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
      446 +                if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
 410  447                      mg->mg_class != mc) {
 411  448                          continue;
 412  449                  }
 413  450  
 414  451                  /*
 415  452                   * Calculate if we have enough space to add additional
 416  453                   * metaslabs. We report the expandable space in terms
 417  454                   * of the metaslab size since that's the unit of expansion.
 418  455                   * Adjust by efi system partition size.
 419  456                   */
[ 91 lines elided ]
 511  548  static void
 512  549  metaslab_group_alloc_update(metaslab_group_t *mg)
 513  550  {
 514  551          vdev_t *vd = mg->mg_vd;
 515  552          metaslab_class_t *mc = mg->mg_class;
 516  553          vdev_stat_t *vs = &vd->vdev_stat;
 517  554          boolean_t was_allocatable;
 518  555          boolean_t was_initialized;
 519  556  
 520  557          ASSERT(vd == vd->vdev_top);
 521      -        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
 522      -            SCL_ALLOC);
 523  558  
 524  559          mutex_enter(&mg->mg_lock);
 525  560          was_allocatable = mg->mg_allocatable;
 526  561          was_initialized = mg->mg_initialized;
 527  562  
 528  563          mg->mg_free_capacity = ((vs->vs_space - vs->vs_alloc) * 100) /
 529  564              (vs->vs_space + 1);
 530  565  
 531  566          mutex_enter(&mc->mc_lock);
 532  567  
[ 77 lines elided ]
 610  645  {
 611  646          ASSERT(mg->mg_prev == NULL);
 612  647          ASSERT(mg->mg_next == NULL);
 613  648          /*
 614  649           * We may have gone below zero with the activation count
 615  650           * either because we never activated in the first place or
 616  651           * because we're done, and possibly removing the vdev.
 617  652           */
 618  653          ASSERT(mg->mg_activation_count <= 0);
 619  654  
 620      -        taskq_destroy(mg->mg_taskq);
      655 +        if (mg->mg_taskq)
      656 +                taskq_destroy(mg->mg_taskq);
 621  657          avl_destroy(&mg->mg_metaslab_tree);
 622  658          mutex_destroy(&mg->mg_lock);
 623  659          refcount_destroy(&mg->mg_alloc_queue_depth);
 624  660          kmem_free(mg, sizeof (metaslab_group_t));
 625  661  }
 626  662  
 627  663  void
 628  664  metaslab_group_activate(metaslab_group_t *mg)
 629  665  {
 630  666          metaslab_class_t *mc = mg->mg_class;
 631  667          metaslab_group_t *mgprev, *mgnext;
 632  668  
 633      -        ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
      669 +        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 634  670  
 635  671          ASSERT(mc->mc_rotor != mg);
 636  672          ASSERT(mg->mg_prev == NULL);
 637  673          ASSERT(mg->mg_next == NULL);
 638  674          ASSERT(mg->mg_activation_count <= 0);
 639  675  
 640  676          if (++mg->mg_activation_count <= 0)
 641  677                  return;
 642  678  
 643  679          mg->mg_aliquot = metaslab_aliquot * MAX(1, mg->mg_vd->vdev_children);
[ 5 lines elided ]
 649  685          } else {
 650  686                  mgnext = mgprev->mg_next;
 651  687                  mg->mg_prev = mgprev;
 652  688                  mg->mg_next = mgnext;
 653  689                  mgprev->mg_next = mg;
 654  690                  mgnext->mg_prev = mg;
 655  691          }
 656  692          mc->mc_rotor = mg;
 657  693  }
 658  694  
 659      -/*
 660      - * Passivate a metaslab group and remove it from the allocation rotor.
 661      - * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
 662      - * a metaslab group. This function will momentarily drop spa_config_locks
 663      - * that are lower than the SCL_ALLOC lock (see comment below).
 664      - */
 665  695  void
 666  696  metaslab_group_passivate(metaslab_group_t *mg)
 667  697  {
 668  698          metaslab_class_t *mc = mg->mg_class;
 669      -        spa_t *spa = mc->mc_spa;
 670  699          metaslab_group_t *mgprev, *mgnext;
 671      -        int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
 672  700  
 673      -        ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
 674      -            (SCL_ALLOC | SCL_ZIO));
      701 +        ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
 675  702  
 676  703          if (--mg->mg_activation_count != 0) {
 677  704                  ASSERT(mc->mc_rotor != mg);
 678  705                  ASSERT(mg->mg_prev == NULL);
 679  706                  ASSERT(mg->mg_next == NULL);
 680  707                  ASSERT(mg->mg_activation_count < 0);
 681  708                  return;
 682  709          }
 683  710  
 684      -        /*
 685      -         * The spa_config_lock is an array of rwlocks, ordered as
 686      -         * follows (from highest to lowest):
 687      -         *      SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
 688      -         *      SCL_ZIO > SCL_FREE > SCL_VDEV
 689      -         * (For more information about the spa_config_lock see spa_misc.c)
 690      -         * The higher the lock, the broader its coverage. When we passivate
 691      -         * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
 692      -         * config locks. However, the metaslab group's taskq might be trying
 693      -         * to preload metaslabs so we must drop the SCL_ZIO lock and any
 694      -         * lower locks to allow the I/O to complete. At a minimum,
 695      -         * we continue to hold the SCL_ALLOC lock, which prevents any future
 696      -         * allocations from taking place and any changes to the vdev tree.
 697      -         */
 698      -        spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
 699  711          taskq_wait(mg->mg_taskq);
 700      -        spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
 701  712          metaslab_group_alloc_update(mg);
 702  713  
 703  714          mgprev = mg->mg_prev;
 704  715          mgnext = mg->mg_next;
 705  716  
 706  717          if (mg == mgnext) {
 707  718                  mc->mc_rotor = NULL;
 708  719          } else {
 709  720                  mc->mc_rotor = mgnext;
 710  721                  mgprev->mg_next = mgnext;
[ 423 lines elided ]
1134 1145  
1135 1146          return (rs);
1136 1147  }
1137 1148  
1138 1149  /*
1139 1150   * This is a helper function that can be used by the allocator to find
1140 1151   * a suitable block to allocate. This will search the specified AVL
1141 1152   * tree looking for a block that matches the specified criteria.
1142 1153   */
1143 1154  static uint64_t
1144      -metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
1145      -    uint64_t align)
     1155 +metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
     1156 +    uint64_t size, uint64_t align)
1146 1157  {
1147 1158          range_seg_t *rs = metaslab_block_find(t, *cursor, size);
1148 1159  
1149      -        while (rs != NULL) {
     1160 +        for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
1150 1161                  uint64_t offset = P2ROUNDUP(rs->rs_start, align);
1151 1162  
1152      -                if (offset + size <= rs->rs_end) {
     1163 +                if (offset + size <= rs->rs_end &&
     1164 +                    !metaslab_check_trim_conflict(msp, &offset, size, align,
     1165 +                    rs->rs_end)) {
1153 1166                          *cursor = offset + size;
1154 1167                          return (offset);
1155 1168                  }
1156      -                rs = AVL_NEXT(t, rs);
1157 1169          }
1158 1170  
1159 1171          /*
1160 1172           * If we know we've searched the whole map (*cursor == 0), give up.
1161 1173           * Otherwise, reset the cursor to the beginning and try again.
1162 1174           */
1163 1175          if (*cursor == 0)
1164 1176                  return (-1ULL);
1165 1177  
1166 1178          *cursor = 0;
1167      -        return (metaslab_block_picker(t, cursor, size, align));
     1179 +        return (metaslab_block_picker(msp, t, cursor, size, align));
1168 1180  }
1169 1181  
1170 1182  /*
1171 1183   * ==========================================================================
1172 1184   * The first-fit block allocator
1173 1185   * ==========================================================================
1174 1186   */
1175 1187  static uint64_t
1176 1188  metaslab_ff_alloc(metaslab_t *msp, uint64_t size)
1177 1189  {
[ 1 line elided ]
1179 1191           * Find the largest power of 2 block size that evenly divides the
1180 1192           * requested size. This is used to try to allocate blocks with similar
1181 1193           * alignment from the same area of the metaslab (i.e. same cursor
 1182 1194   * bucket) but it does not guarantee that other allocation sizes
1183 1195           * may exist in the same region.
1184 1196           */
1185 1197          uint64_t align = size & -size;
1186 1198          uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
1187 1199          avl_tree_t *t = &msp->ms_tree->rt_root;
1188 1200  
1189      -        return (metaslab_block_picker(t, cursor, size, align));
     1201 +        return (metaslab_block_picker(msp, t, cursor, size, align));
1190 1202  }
1191 1203  
1192 1204  static metaslab_ops_t metaslab_ff_ops = {
1193 1205          metaslab_ff_alloc
1194 1206  };
1195 1207  
1196 1208  /*
1197 1209   * ==========================================================================
1198 1210   * Dynamic block allocator -
1199 1211   * Uses the first fit allocation scheme until space get low and then
[ 27 lines elided ]
1227 1239          /*
1228 1240           * If we're running low on space switch to using the size
1229 1241           * sorted AVL tree (best-fit).
1230 1242           */
1231 1243          if (max_size < metaslab_df_alloc_threshold ||
1232 1244              free_pct < metaslab_df_free_pct) {
1233 1245                  t = &msp->ms_size_tree;
1234 1246                  *cursor = 0;
1235 1247          }
1236 1248  
1237      -        return (metaslab_block_picker(t, cursor, size, 1ULL));
     1249 +        return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
1238 1250  }
1239 1251  
1240 1252  static metaslab_ops_t metaslab_df_ops = {
1241 1253          metaslab_df_alloc
1242 1254  };
1243 1255  
1244 1256  /*
1245 1257   * ==========================================================================
1246 1258   * Cursor fit block allocator -
1247 1259   * Select the largest region in the metaslab, set the cursor to the beginning
[ 11 lines elided ]
1259 1271          uint64_t *cursor_end = &msp->ms_lbas[1];
1260 1272          uint64_t offset = 0;
1261 1273  
1262 1274          ASSERT(MUTEX_HELD(&msp->ms_lock));
1263 1275          ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&rt->rt_root));
1264 1276  
1265 1277          ASSERT3U(*cursor_end, >=, *cursor);
1266 1278  
1267 1279          if ((*cursor + size) > *cursor_end) {
1268 1280                  range_seg_t *rs;
1269      -
1270      -                rs = avl_last(&msp->ms_size_tree);
1271      -                if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
     1281 +                for (rs = avl_last(&msp->ms_size_tree);
     1282 +                    rs != NULL && rs->rs_end - rs->rs_start >= size;
     1283 +                    rs = AVL_PREV(&msp->ms_size_tree, rs)) {
     1284 +                        *cursor = rs->rs_start;
     1285 +                        *cursor_end = rs->rs_end;
     1286 +                        if (!metaslab_check_trim_conflict(msp, cursor, size,
     1287 +                            1, *cursor_end)) {
     1288 +                                /* segment appears to be acceptable */
     1289 +                                break;
     1290 +                        }
     1291 +                }
     1292 +                if (rs == NULL || rs->rs_end - rs->rs_start < size)
1272 1293                          return (-1ULL);
1273      -
1274      -                *cursor = rs->rs_start;
1275      -                *cursor_end = rs->rs_end;
1276 1294          }
1277 1295  
1278 1296          offset = *cursor;
1279 1297          *cursor += size;
1280 1298  
1281 1299          return (offset);
1282 1300  }
1283 1301  
1284 1302  static metaslab_ops_t metaslab_cf_ops = {
1285 1303          metaslab_cf_alloc
[ 16 lines elided ]
1302 1320  
1303 1321  static uint64_t
1304 1322  metaslab_ndf_alloc(metaslab_t *msp, uint64_t size)
1305 1323  {
1306 1324          avl_tree_t *t = &msp->ms_tree->rt_root;
1307 1325          avl_index_t where;
1308 1326          range_seg_t *rs, rsearch;
1309 1327          uint64_t hbit = highbit64(size);
1310 1328          uint64_t *cursor = &msp->ms_lbas[hbit - 1];
1311 1329          uint64_t max_size = metaslab_block_maxsize(msp);
     1330 +        /* mutable copy for adjustment by metaslab_check_trim_conflict */
     1331 +        uint64_t adjustable_start;
1312 1332  
1313 1333          ASSERT(MUTEX_HELD(&msp->ms_lock));
1314 1334          ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
1315 1335  
1316 1336          if (max_size < size)
1317 1337                  return (-1ULL);
1318 1338  
1319 1339          rsearch.rs_start = *cursor;
1320 1340          rsearch.rs_end = *cursor + size;
1321 1341  
1322 1342          rs = avl_find(t, &rsearch, &where);
1323      -        if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
     1343 +        if (rs != NULL)
     1344 +                adjustable_start = rs->rs_start;
     1345 +        if (rs == NULL || rs->rs_end - adjustable_start < size ||
     1346 +            metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
     1347 +            rs->rs_end)) {
     1348 +                /* segment not usable, try the largest remaining one */
1324 1349                  t = &msp->ms_size_tree;
1325 1350  
1326 1351                  rsearch.rs_start = 0;
1327 1352                  rsearch.rs_end = MIN(max_size,
1328 1353                      1ULL << (hbit + metaslab_ndf_clump_shift));
1329 1354                  rs = avl_find(t, &rsearch, &where);
1330 1355                  if (rs == NULL)
1331 1356                          rs = avl_nearest(t, where, AVL_AFTER);
1332 1357                  ASSERT(rs != NULL);
     1358 +                adjustable_start = rs->rs_start;
     1359 +                if (rs->rs_end - adjustable_start < size ||
     1360 +                    metaslab_check_trim_conflict(msp, &adjustable_start,
     1361 +                    size, 1, rs->rs_end)) {
     1362 +                        /* even largest remaining segment not usable */
     1363 +                        return (-1ULL);
     1364 +                }
1333 1365          }
1334 1366  
1335      -        if ((rs->rs_end - rs->rs_start) >= size) {
1336      -                *cursor = rs->rs_start + size;
1337      -                return (rs->rs_start);
1338      -        }
1339      -        return (-1ULL);
     1367 +        *cursor = adjustable_start + size;
      1368 +        return (adjustable_start);
1340 1369  }
1341 1370  
1342 1371  static metaslab_ops_t metaslab_ndf_ops = {
1343 1372          metaslab_ndf_alloc
1344 1373  };
1345 1374  
1346 1375  metaslab_ops_t *zfs_metaslab_ops = &metaslab_df_ops;
1347 1376  
1348 1377  /*
1349 1378   * ==========================================================================
[ 19 lines elided ]
1369 1398  metaslab_load(metaslab_t *msp)
1370 1399  {
1371 1400          int error = 0;
1372 1401          boolean_t success = B_FALSE;
1373 1402  
1374 1403          ASSERT(MUTEX_HELD(&msp->ms_lock));
1375 1404          ASSERT(!msp->ms_loaded);
1376 1405          ASSERT(!msp->ms_loading);
1377 1406  
1378 1407          msp->ms_loading = B_TRUE;
1379      -        /*
1380      -         * Nobody else can manipulate a loading metaslab, so it's now safe
1381      -         * to drop the lock.  This way we don't have to hold the lock while
1382      -         * reading the spacemap from disk.
1383      -         */
1384      -        mutex_exit(&msp->ms_lock);
1385 1408  
1386 1409          /*
1387 1410           * If the space map has not been allocated yet, then treat
1388 1411           * all the space in the metaslab as free and add it to the
1389 1412           * ms_tree.
1390 1413           */
1391 1414          if (msp->ms_sm != NULL)
1392 1415                  error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
1393 1416          else
1394 1417                  range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
1395 1418  
1396 1419          success = (error == 0);
1397      -
1398      -        mutex_enter(&msp->ms_lock);
1399 1420          msp->ms_loading = B_FALSE;
1400 1421  
1401 1422          if (success) {
1402 1423                  ASSERT3P(msp->ms_group, !=, NULL);
1403 1424                  msp->ms_loaded = B_TRUE;
1404 1425  
1405 1426                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1406 1427                          range_tree_walk(msp->ms_defertree[t],
1407 1428                              range_tree_remove, msp->ms_tree);
     1429 +                        range_tree_walk(msp->ms_defertree[t],
     1430 +                            metaslab_trim_remove, msp);
1408 1431                  }
1409 1432                  msp->ms_max_size = metaslab_block_maxsize(msp);
1410 1433          }
1411 1434          cv_broadcast(&msp->ms_load_cv);
1412 1435          return (error);
1413 1436  }
1414 1437  
1415 1438  void
1416 1439  metaslab_unload(metaslab_t *msp)
1417 1440  {
[ 8 lines elided ]
1426 1449  metaslab_init(metaslab_group_t *mg, uint64_t id, uint64_t object, uint64_t txg,
1427 1450      metaslab_t **msp)
1428 1451  {
1429 1452          vdev_t *vd = mg->mg_vd;
1430 1453          objset_t *mos = vd->vdev_spa->spa_meta_objset;
1431 1454          metaslab_t *ms;
1432 1455          int error;
1433 1456  
1434 1457          ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
1435 1458          mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
1436      -        mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
1437 1459          cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
     1460 +        cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
1438 1461          ms->ms_id = id;
1439 1462          ms->ms_start = id << vd->vdev_ms_shift;
1440 1463          ms->ms_size = 1ULL << vd->vdev_ms_shift;
1441 1464  
1442 1465          /*
1443 1466           * We only open space map objects that already exist. All others
1444 1467           * will be opened when we finally allocate an object for it.
1445 1468           */
1446 1469          if (object != 0) {
1447 1470                  error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
1448      -                    ms->ms_size, vd->vdev_ashift);
     1471 +                    ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
1449 1472  
1450 1473                  if (error != 0) {
1451 1474                          kmem_free(ms, sizeof (metaslab_t));
1452 1475                          return (error);
1453 1476                  }
1454 1477  
1455 1478                  ASSERT(ms->ms_sm != NULL);
1456 1479          }
1457 1480  
     1481 +        ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
     1482 +
1458 1483          /*
1459 1484           * We create the main range tree here, but we don't create the
1460 1485           * other range trees until metaslab_sync_done().  This serves
1461 1486           * two purposes: it allows metaslab_sync_done() to detect the
1462 1487           * addition of new space; and for debugging, it ensures that we'd
1463 1488           * data fault on any attempt to use this metaslab before it's ready.
1464 1489           */
1465      -        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
     1490 +        ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
1466 1491          metaslab_group_add(mg, ms);
1467 1492  
1468 1493          metaslab_set_fragmentation(ms);
1469 1494  
1470 1495          /*
1471 1496           * If we're opening an existing pool (txg == 0) or creating
1472 1497           * a new one (txg == TXG_INITIAL), all space is available now.
1473 1498           * If we're adding space to an existing pool, the new space
1474 1499           * does not become available until after this txg has synced.
1475 1500           * The metaslab's weight will also be initialized when we sync
[ 43 lines elided ]
1519 1544          range_tree_destroy(msp->ms_freedtree);
1520 1545  
1521 1546          for (int t = 0; t < TXG_SIZE; t++) {
1522 1547                  range_tree_destroy(msp->ms_alloctree[t]);
1523 1548          }
1524 1549  
1525 1550          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1526 1551                  range_tree_destroy(msp->ms_defertree[t]);
1527 1552          }
1528 1553  
     1554 +        metaslab_free_trimset(msp->ms_cur_ts);
     1555 +        if (msp->ms_prev_ts)
     1556 +                metaslab_free_trimset(msp->ms_prev_ts);
     1557 +        ASSERT3P(msp->ms_trimming_ts, ==, NULL);
     1558 +
1529 1559          ASSERT0(msp->ms_deferspace);
1530 1560  
1531 1561          mutex_exit(&msp->ms_lock);
1532 1562          cv_destroy(&msp->ms_load_cv);
     1563 +        cv_destroy(&msp->ms_trim_cv);
1533 1564          mutex_destroy(&msp->ms_lock);
1534      -        mutex_destroy(&msp->ms_sync_lock);
1535 1565  
1536 1566          kmem_free(msp, sizeof (metaslab_t));
1537 1567  }
1538 1568  
1539 1569  #define FRAGMENTATION_TABLE_SIZE        17
1540 1570  
1541 1571  /*
1542 1572   * This table defines a segment size based fragmentation metric that will
1543 1573   * allow each metaslab to derive its own fragmentation value. This is done
1544 1574   * by calculating the space in each bucket of the spacemap histogram and
[ 345 lines elided ]
1890 1920  static uint64_t
1891 1921  metaslab_weight(metaslab_t *msp)
1892 1922  {
1893 1923          vdev_t *vd = msp->ms_group->mg_vd;
1894 1924          spa_t *spa = vd->vdev_spa;
1895 1925          uint64_t weight;
1896 1926  
1897 1927          ASSERT(MUTEX_HELD(&msp->ms_lock));
1898 1928  
1899 1929          /*
1900      -         * If this vdev is in the process of being removed, there is nothing
     1930 +         * This vdev is in the process of being removed so there is nothing
1901 1931           * for us to do here.
1902 1932           */
1903      -        if (vd->vdev_removing)
     1933 +        if (vd->vdev_removing) {
     1934 +                ASSERT0(space_map_allocated(msp->ms_sm));
     1935 +                ASSERT0(vd->vdev_ms_shift);
1904 1936                  return (0);
     1937 +        }
1905 1938  
1906 1939          metaslab_set_fragmentation(msp);
1907 1940  
1908 1941          /*
1909 1942           * Update the maximum size if the metaslab is loaded. This will
1910 1943           * ensure that we get an accurate maximum size if newly freed space
1911 1944           * has been added back into the free tree.
1912 1945           */
1913 1946          if (msp->ms_loaded)
1914 1947                  msp->ms_max_size = metaslab_block_maxsize(msp);
[ 111 lines elided ]
2026 2059          metaslab_t *msp;
2027 2060          avl_tree_t *t = &mg->mg_metaslab_tree;
2028 2061          int m = 0;
2029 2062  
2030 2063          if (spa_shutting_down(spa) || !metaslab_preload_enabled) {
2031 2064                  taskq_wait(mg->mg_taskq);
2032 2065                  return;
2033 2066          }
2034 2067  
2035 2068          mutex_enter(&mg->mg_lock);
2036      -
2037 2069          /*
2038 2070           * Load the next potential metaslabs
2039 2071           */
2040 2072          for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
2041      -                ASSERT3P(msp->ms_group, ==, mg);
2042      -
2043 2073                  /*
2044 2074                   * We preload only the maximum number of metaslabs specified
2045 2075                   * by metaslab_preload_limit. If a metaslab is being forced
2046 2076                   * to condense then we preload it too. This will ensure
2047 2077                   * that force condensing happens in the next txg.
2048 2078                   */
2049 2079                  if (++m > metaslab_preload_limit && !msp->ms_condense_wanted) {
2050 2080                          continue;
2051 2081                  }
2052 2082  
[ 6 lines elided ]
2059 2089  /*
2060 2090   * Determine if the space map's on-disk footprint is past our tolerance
2061 2091   * for inefficiency. We would like to use the following criteria to make
2062 2092   * our decision:
2063 2093   *
2064 2094   * 1. The size of the space map object should not dramatically increase as a
2065 2095   * result of writing out the free space range tree.
2066 2096   *
2067 2097   * 2. The minimal on-disk space map representation is zfs_condense_pct/100
 2068 2098   * times the size of the free space range tree representation
2069      - * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
      2099 + * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
2070 2100   *
2071 2101   * 3. The on-disk size of the space map should actually decrease.
2072 2102   *
2073 2103   * Checking the first condition is tricky since we don't want to walk
2074 2104   * the entire AVL tree calculating the estimated on-disk size. Instead we
2075 2105   * use the size-ordered range tree in the metaslab and calculate the
2076 2106   * size required to write out the largest segment in our free tree. If the
2077 2107   * size required to represent that segment on disk is larger than the space
2078 2108   * map object then we avoid condensing this map.
2079 2109   *
[ 76 lines elided ]
2156 2186  
2157 2187          msp->ms_condense_wanted = B_FALSE;
2158 2188  
2159 2189          /*
 2160 2190           * Create a range tree that is 100% allocated. We remove segments
2161 2191           * that have been freed in this txg, any deferred frees that exist,
2162 2192           * and any allocation in the future. Removing segments should be
2163 2193           * a relatively inexpensive operation since we expect these trees to
2164 2194           * have a small number of nodes.
2165 2195           */
2166      -        condense_tree = range_tree_create(NULL, NULL);
     2196 +        condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
2167 2197          range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
2168 2198  
2169 2199          /*
2170 2200           * Remove what's been freed in this txg from the condense_tree.
2171 2201           * Since we're in sync_pass 1, we know that all the frees from
2172 2202           * this txg are in the freeingtree.
2173 2203           */
2174 2204          range_tree_walk(msp->ms_freeingtree, range_tree_remove, condense_tree);
2175 2205  
2176 2206          for (int t = 0; t < TXG_DEFER_SIZE; t++) {
[ 12 lines elided ]
2189 2219           * metaslab's ms_condensing flag to ensure that
2190 2220           * allocations on this metaslab do not occur while we're
2191 2221           * in the middle of committing it to disk. This is only critical
2192 2222           * for the ms_tree as all other range trees use per txg
2193 2223           * views of their content.
2194 2224           */
2195 2225          msp->ms_condensing = B_TRUE;
2196 2226  
2197 2227          mutex_exit(&msp->ms_lock);
2198 2228          space_map_truncate(sm, tx);
     2229 +        mutex_enter(&msp->ms_lock);
2199 2230  
2200 2231          /*
2201 2232           * While we would ideally like to create a space map representation
2202 2233           * that consists only of allocation records, doing so can be
2203 2234           * prohibitively expensive because the in-core free tree can be
2204 2235           * large, and therefore computationally expensive to subtract
2205 2236           * from the condense_tree. Instead we sync out two trees, a cheap
2206 2237           * allocation only tree followed by the in-core free tree. While not
2207 2238           * optimal, this is typically close to optimal, and much cheaper to
2208 2239           * compute.
2209 2240           */
2210 2241          space_map_write(sm, condense_tree, SM_ALLOC, tx);
2211 2242          range_tree_vacate(condense_tree, NULL, NULL);
2212 2243          range_tree_destroy(condense_tree);
2213 2244  
2214 2245          space_map_write(sm, msp->ms_tree, SM_FREE, tx);
2215      -        mutex_enter(&msp->ms_lock);
2216 2246          msp->ms_condensing = B_FALSE;
2217 2247  }
2218 2248  
2219 2249  /*
2220 2250   * Write a metaslab to disk in the context of the specified transaction group.
2221 2251   */
2222 2252  void
2223 2253  metaslab_sync(metaslab_t *msp, uint64_t txg)
2224 2254  {
2225 2255          metaslab_group_t *mg = msp->ms_group;
2226 2256          vdev_t *vd = mg->mg_vd;
2227 2257          spa_t *spa = vd->vdev_spa;
2228 2258          objset_t *mos = spa_meta_objset(spa);
2229 2259          range_tree_t *alloctree = msp->ms_alloctree[txg & TXG_MASK];
2230 2260          dmu_tx_t *tx;
2231 2261          uint64_t object = space_map_object(msp->ms_sm);
2232 2262  
2233 2263          ASSERT(!vd->vdev_ishole);
2234 2264  
     2265 +        mutex_enter(&msp->ms_lock);
     2266 +
2235 2267          /*
2236 2268           * This metaslab has just been added so there's no work to do now.
2237 2269           */
2238 2270          if (msp->ms_freeingtree == NULL) {
2239 2271                  ASSERT3P(alloctree, ==, NULL);
     2272 +                mutex_exit(&msp->ms_lock);
2240 2273                  return;
2241 2274          }
2242 2275  
2243 2276          ASSERT3P(alloctree, !=, NULL);
2244 2277          ASSERT3P(msp->ms_freeingtree, !=, NULL);
2245 2278          ASSERT3P(msp->ms_freedtree, !=, NULL);
2246 2279  
2247 2280          /*
2248 2281           * Normally, we don't want to process a metaslab if there
2249 2282           * are no allocations or frees to perform. However, if the metaslab
2250 2283           * is being forced to condense and it's loaded, we need to let it
2251 2284           * through.
2252 2285           */
2253 2286          if (range_tree_space(alloctree) == 0 &&
2254 2287              range_tree_space(msp->ms_freeingtree) == 0 &&
2255      -            !(msp->ms_loaded && msp->ms_condense_wanted))
     2288 +            !(msp->ms_loaded && msp->ms_condense_wanted)) {
     2289 +                mutex_exit(&msp->ms_lock);
2256 2290                  return;
     2291 +        }
2257 2292  
2258 2293  
2259 2294          VERIFY(txg <= spa_final_dirty_txg(spa));
2260 2295  
2261 2296          /*
2262 2297           * The only state that can actually be changing concurrently with
2263 2298           * metaslab_sync() is the metaslab's ms_tree.  No other thread can
2264 2299           * be modifying this txg's alloctree, freeingtree, freedtree, or
2265      -         * space_map_phys_t.  We drop ms_lock whenever we could call
2266      -         * into the DMU, because the DMU can call down to us
2267      -         * (e.g. via zio_free()) at any time.
2268      -         *
2269      -         * The spa_vdev_remove_thread() can be reading metaslab state
2270      -         * concurrently, and it is locked out by the ms_sync_lock.  Note
2271      -         * that the ms_lock is insufficient for this, because it is dropped
2272      -         * by space_map_write().
      2300 +         * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
     2301 +         * space map ASSERTs. We drop it whenever we call into the DMU,
     2302 +         * because the DMU can call down to us (e.g. via zio_free()) at
     2303 +         * any time.
2273 2304           */
2274 2305  
2275 2306          tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
2276 2307  
2277 2308          if (msp->ms_sm == NULL) {
2278 2309                  uint64_t new_object;
2279 2310  
2280 2311                  new_object = space_map_alloc(mos, tx);
2281 2312                  VERIFY3U(new_object, !=, 0);
2282 2313  
2283 2314                  VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
2284      -                    msp->ms_start, msp->ms_size, vd->vdev_ashift));
     2315 +                    msp->ms_start, msp->ms_size, vd->vdev_ashift,
     2316 +                    &msp->ms_lock));
2285 2317                  ASSERT(msp->ms_sm != NULL);
2286 2318          }
2287 2319  
2288      -        mutex_enter(&msp->ms_sync_lock);
2289      -        mutex_enter(&msp->ms_lock);
2290      -
2291 2320          /*
2292 2321           * Note: metaslab_condense() clears the space map's histogram.
2293 2322           * Therefore we must verify and remove this histogram before
2294 2323           * condensing.
2295 2324           */
2296 2325          metaslab_group_histogram_verify(mg);
2297 2326          metaslab_class_histogram_verify(mg->mg_class);
2298 2327          metaslab_group_histogram_remove(mg, msp);
2299 2328  
2300 2329          if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
2301 2330              metaslab_should_condense(msp)) {
2302 2331                  metaslab_condense(msp, txg, tx);
2303 2332          } else {
2304      -                mutex_exit(&msp->ms_lock);
2305 2333                  space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
2306 2334                  space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
2307      -                mutex_enter(&msp->ms_lock);
2308 2335          }
2309 2336  
2310 2337          if (msp->ms_loaded) {
2311 2338                  /*
2312      -                 * When the space map is loaded, we have an accurate
      2339 +                 * When the space map is loaded, we have an accurate
2313 2340                   * histogram in the range tree. This gives us an opportunity
2314 2341                   * to bring the space map's histogram up-to-date so we clear
2315 2342                   * it first before updating it.
2316 2343                   */
2317 2344                  space_map_histogram_clear(msp->ms_sm);
2318 2345                  space_map_histogram_add(msp->ms_sm, msp->ms_tree, tx);
2319 2346  
2320 2347                  /*
2321 2348                   * Since we've cleared the histogram we need to add back
2322 2349                   * any free space that has already been processed, plus
[ 47 lines elided ]
2370 2397          ASSERT0(range_tree_space(msp->ms_alloctree[TXG_CLEAN(txg) & TXG_MASK]));
2371 2398          ASSERT0(range_tree_space(msp->ms_freeingtree));
2372 2399  
2373 2400          mutex_exit(&msp->ms_lock);
2374 2401  
2375 2402          if (object != space_map_object(msp->ms_sm)) {
2376 2403                  object = space_map_object(msp->ms_sm);
2377 2404                  dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
2378 2405                      msp->ms_id, sizeof (uint64_t), &object, tx);
2379 2406          }
2380      -        mutex_exit(&msp->ms_sync_lock);
2381 2407          dmu_tx_commit(tx);
2382 2408  }
2383 2409  
2384 2410  /*
2385 2411   * Called after a transaction group has completely synced to mark
2386 2412   * all of the metaslab's free space as usable.
2387 2413   */
2388 2414  void
2389 2415  metaslab_sync_done(metaslab_t *msp, uint64_t txg)
2390 2416  {
[ 9 lines elided ]
2400 2426          mutex_enter(&msp->ms_lock);
2401 2427  
2402 2428          /*
2403 2429           * If this metaslab is just becoming available, initialize its
2404 2430           * range trees and add its capacity to the vdev.
2405 2431           */
2406 2432          if (msp->ms_freedtree == NULL) {
2407 2433                  for (int t = 0; t < TXG_SIZE; t++) {
2408 2434                          ASSERT(msp->ms_alloctree[t] == NULL);
2409 2435  
2410      -                        msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
     2436 +                        msp->ms_alloctree[t] = range_tree_create(NULL, msp,
     2437 +                            &msp->ms_lock);
2411 2438                  }
2412 2439  
2413 2440                  ASSERT3P(msp->ms_freeingtree, ==, NULL);
2414      -                msp->ms_freeingtree = range_tree_create(NULL, NULL);
     2441 +                msp->ms_freeingtree = range_tree_create(NULL, msp,
     2442 +                    &msp->ms_lock);
2415 2443  
2416 2444                  ASSERT3P(msp->ms_freedtree, ==, NULL);
2417      -                msp->ms_freedtree = range_tree_create(NULL, NULL);
     2445 +                msp->ms_freedtree = range_tree_create(NULL, msp,
     2446 +                    &msp->ms_lock);
2418 2447  
2419 2448                  for (int t = 0; t < TXG_DEFER_SIZE; t++) {
2420 2449                          ASSERT(msp->ms_defertree[t] == NULL);
2421 2450  
2422      -                        msp->ms_defertree[t] = range_tree_create(NULL, NULL);
     2451 +                        msp->ms_defertree[t] = range_tree_create(NULL, msp,
     2452 +                            &msp->ms_lock);
2423 2453                  }
2424 2454  
2425 2455                  vdev_space_update(vd, 0, 0, msp->ms_size);
2426 2456          }
2427 2457  
2428 2458          defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
2429 2459  
2430 2460          uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
2431 2461              metaslab_class_get_alloc(spa_normal_class(spa));
2432      -        if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
     2462 +        if (free_space <= spa_get_slop_space(spa)) {
2433 2463                  defer_allowed = B_FALSE;
2434 2464          }
2435 2465  
2436 2466          defer_delta = 0;
2437 2467          alloc_delta = space_map_alloc_delta(msp->ms_sm);
2438 2468          if (defer_allowed) {
2439 2469                  defer_delta = range_tree_space(msp->ms_freedtree) -
2440 2470                      range_tree_space(*defer_tree);
2441 2471          } else {
2442 2472                  defer_delta -= range_tree_space(*defer_tree);
[ 6 lines elided ]
2449 2479           * so that we have a consistent view of the in-core space map.
2450 2480           */
2451 2481          metaslab_load_wait(msp);
2452 2482  
2453 2483          /*
2454 2484           * Move the frees from the defer_tree back to the free
2455 2485           * range tree (if it's loaded). Swap the freed_tree and the
2456 2486           * defer_tree -- this is safe to do because we've just emptied out
2457 2487           * the defer_tree.
2458 2488           */
     2489 +        if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
     2490 +            !vd->vdev_man_trimming) {
     2491 +                range_tree_walk(*defer_tree, metaslab_trim_add, msp);
     2492 +                if (!defer_allowed) {
     2493 +                        range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
     2494 +                            msp);
     2495 +                }
     2496 +        }
2459 2497          range_tree_vacate(*defer_tree,
2460 2498              msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2461 2499          if (defer_allowed) {
2462 2500                  range_tree_swap(&msp->ms_freedtree, defer_tree);
2463 2501          } else {
2464 2502                  range_tree_vacate(msp->ms_freedtree,
2465 2503                      msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2466 2504          }
2467 2505  
2468 2506          space_map_update(msp->ms_sm);
[ 23 lines elided ]
2492 2530              msp->ms_selected_txg + metaslab_unload_delay < txg) {
2493 2531                  for (int t = 1; t < TXG_CONCURRENT_STATES; t++) {
2494 2532                          VERIFY0(range_tree_space(
2495 2533                              msp->ms_alloctree[(txg + t) & TXG_MASK]));
2496 2534                  }
2497 2535  
2498 2536                  if (!metaslab_debug_unload)
2499 2537                          metaslab_unload(msp);
2500 2538          }
2501 2539  
2502      -        ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
2503      -        ASSERT0(range_tree_space(msp->ms_freeingtree));
2504      -        ASSERT0(range_tree_space(msp->ms_freedtree));
2505      -
2506 2540          mutex_exit(&msp->ms_lock);
2507 2541  }
2508 2542  
2509 2543  void
2510 2544  metaslab_sync_reassess(metaslab_group_t *mg)
2511 2545  {
2512      -        spa_t *spa = mg->mg_class->mc_spa;
2513      -
2514      -        spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
2515 2546          metaslab_group_alloc_update(mg);
2516 2547          mg->mg_fragmentation = metaslab_group_fragmentation(mg);
2517 2548  
2518 2549          /*
2519      -         * Preload the next potential metaslabs but only on active
2520      -         * metaslab groups. We can get into a state where the metaslab
2521      -         * is no longer active since we dirty metaslabs as we remove a
2522      -         * a device, thus potentially making the metaslab group eligible
2523      -         * for preloading.
     2550 +         * Preload the next potential metaslabs
2524 2551           */
2525      -        if (mg->mg_activation_count > 0) {
2526      -                metaslab_group_preload(mg);
2527      -        }
2528      -        spa_config_exit(spa, SCL_ALLOC, FTAG);
     2552 +        metaslab_group_preload(mg);
2529 2553  }
2530 2554  
2531 2555  static uint64_t
2532 2556  metaslab_distance(metaslab_t *msp, dva_t *dva)
2533 2557  {
2534 2558          uint64_t ms_shift = msp->ms_group->mg_vd->vdev_ms_shift;
2535 2559          uint64_t offset = DVA_GET_OFFSET(dva) >> ms_shift;
2536 2560          uint64_t start = msp->ms_id;
2537 2561  
2538 2562          if (msp->ms_group->mg_vd->vdev_id != DVA_GET_VDEV(dva))
[ 173 lines elided ]
2712 2736  
2713 2737          start = mc->mc_ops->msop_alloc(msp, size);
2714 2738          if (start != -1ULL) {
2715 2739                  metaslab_group_t *mg = msp->ms_group;
2716 2740                  vdev_t *vd = mg->mg_vd;
2717 2741  
2718 2742                  VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
2719 2743                  VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
2720 2744                  VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
2721 2745                  range_tree_remove(rt, start, size);
     2746 +                metaslab_trim_remove(msp, start, size);
2722 2747  
2723 2748                  if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
2724 2749                          vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
2725 2750  
2726 2751                  range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
2727 2752  
2728 2753                  /* Track the last successful allocation */
2729 2754                  msp->ms_alloc_txg = txg;
2730 2755                  metaslab_verify_space(msp, txg);
2731 2756          }
[ 1 line elided ]
2733 2758          /*
2734 2759           * Now that we've attempted the allocation we need to update the
2735 2760           * metaslab's maximum block size since it may have changed.
2736 2761           */
2737 2762          msp->ms_max_size = metaslab_block_maxsize(msp);
2738 2763          return (start);
2739 2764  }
2740 2765  
2741 2766  static uint64_t
2742 2767  metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
2743      -    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
     2768 +    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
     2769 +    int flags)
2744 2770  {
2745 2771          metaslab_t *msp = NULL;
2746 2772          uint64_t offset = -1ULL;
2747 2773          uint64_t activation_weight;
2748 2774          uint64_t target_distance;
2749 2775          int i;
2750 2776  
2751 2777          activation_weight = METASLAB_WEIGHT_PRIMARY;
2752 2778          for (i = 0; i < d; i++) {
2753 2779                  if (DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) {
2754 2780                          activation_weight = METASLAB_WEIGHT_SECONDARY;
2755 2781                          break;
2756 2782                  }
2757 2783          }
2758 2784  
2759 2785          metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
2760 2786          search->ms_weight = UINT64_MAX;
2761 2787          search->ms_start = 0;
2762 2788          for (;;) {
2763 2789                  boolean_t was_active;
     2790 +                boolean_t pass_primary = B_TRUE;
2764 2791                  avl_tree_t *t = &mg->mg_metaslab_tree;
2765 2792                  avl_index_t idx;
2766 2793  
2767 2794                  mutex_enter(&mg->mg_lock);
2768 2795  
2769 2796                  /*
2770 2797                   * Find the metaslab with the highest weight that is less
2771 2798                   * than what we've already tried.  In the common case, this
2772 2799                   * means that we will examine each metaslab at most once.
2773 2800                   * Note that concurrent callers could reorder metaslabs
[ 17 lines elided ]
2791 2818                                  continue;
2792 2819                          }
2793 2820  
2794 2821                          /*
2795 2822                           * If the selected metaslab is condensing, skip it.
2796 2823                           */
2797 2824                          if (msp->ms_condensing)
2798 2825                                  continue;
2799 2826  
2800 2827                          was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
2801      -                        if (activation_weight == METASLAB_WEIGHT_PRIMARY)
2802      -                                break;
     2828 +                        if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
     2829 +                                if (!pass_primary) {
     2830 +                                        DTRACE_PROBE(metaslab_use_secondary);
     2831 +                                        activation_weight =
     2832 +                                            METASLAB_WEIGHT_SECONDARY;
     2833 +                                        break;
     2834 +                                }
2803 2835  
2804      -                        target_distance = min_distance +
2805      -                            (space_map_allocated(msp->ms_sm) != 0 ? 0 :
2806      -                            min_distance >> 1);
     2836 +                                pass_primary = B_FALSE;
     2837 +                        } else {
     2838 +                                if (activation_weight ==
     2839 +                                    METASLAB_WEIGHT_PRIMARY)
     2840 +                                        break;
2807 2841  
2808      -                        for (i = 0; i < d; i++) {
2809      -                                if (metaslab_distance(msp, &dva[i]) <
2810      -                                    target_distance)
     2842 +                                target_distance = min_distance +
     2843 +                                    (space_map_allocated(msp->ms_sm) != 0 ? 0 :
     2844 +                                    min_distance >> 1);
     2845 +
     2846 +                                for (i = 0; i < d; i++)
     2847 +                                        if (metaslab_distance(msp, &dva[i]) <
     2848 +                                            target_distance)
     2849 +                                                break;
     2850 +                                if (i == d)
2811 2851                                          break;
2812 2852                          }
2813      -                        if (i == d)
2814      -                                break;
2815 2853                  }
2816 2854                  mutex_exit(&mg->mg_lock);
2817 2855                  if (msp == NULL) {
2818 2856                          kmem_free(search, sizeof (*search));
2819 2857                          return (-1ULL);
2820 2858                  }
2821 2859                  search->ms_weight = msp->ms_weight;
2822 2860                  search->ms_start = msp->ms_start + 1;
2823 2861  
2824 2862                  mutex_enter(&msp->ms_lock);
[ 101 lines elided ]
2926 2964                  ASSERT(!metaslab_should_allocate(msp, asize));
2927 2965                  mutex_exit(&msp->ms_lock);
2928 2966          }
2929 2967          mutex_exit(&msp->ms_lock);
2930 2968          kmem_free(search, sizeof (*search));
2931 2969          return (offset);
2932 2970  }
2933 2971  
2934 2972  static uint64_t
2935 2973  metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
2936      -    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
     2974 +    uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
     2975 +    int d, int flags)
2937 2976  {
2938 2977          uint64_t offset;
2939 2978          ASSERT(mg->mg_initialized);
2940 2979  
2941 2980          offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
2942      -            min_distance, dva, d);
     2981 +            min_distance, dva, d, flags);
2943 2982  
2944 2983          mutex_enter(&mg->mg_lock);
2945 2984          if (offset == -1ULL) {
2946 2985                  mg->mg_failed_allocations++;
2947 2986                  metaslab_trace_add(zal, mg, NULL, asize, d,
2948 2987                      TRACE_GROUP_FAILURE);
2949 2988                  if (asize == SPA_GANGBLOCKSIZE) {
2950 2989                          /*
2951 2990                           * This metaslab group was unable to allocate
2952 2991                           * the minimum gang block size so it must be out of
[ 17 lines elided ]
2970 3009   * If we have to write a ditto block (i.e. more than one DVA for a given BP)
2971 3010   * on the same vdev as an existing DVA of this BP, then try to allocate it
2972 3011   * at least (vdev_asize / (2 ^ ditto_same_vdev_distance_shift)) away from the
2973 3012   * existing DVAs.
2974 3013   */
2975 3014  int ditto_same_vdev_distance_shift = 3;
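As a worked example of the formula above, with the default shift of 3 and hypothetical vdev sizes:

        /*
         * 1 TiB vdev:  min_distance = (1ULL << 40) >> 3 = 128 GiB
         * 8 GiB vdev:  min_distance = (8ULL << 30) >> 3 = 1 GiB
         *
         * metaslab_alloc_dva() below drops the distance to 0 whenever it is
         * no larger than a single metaslab (1ULL << vdev_ms_shift).
         */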
2976 3015  
2977 3016  /*
2978 3017   * Allocate a block for the specified i/o.
2979 3018   */
2980      -int
     3019 +static int
2981 3020  metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
2982 3021      dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
2983 3022      zio_alloc_list_t *zal)
2984 3023  {
2985 3024          metaslab_group_t *mg, *rotor;
2986 3025          vdev_t *vd;
2987 3026          boolean_t try_hard = B_FALSE;
2988 3027  
2989 3028          ASSERT(!DVA_IS_VALID(&dva[d]));
2990 3029  
[ 25 lines elided ]
3016 3055           * If we are doing gang blocks (hintdva is non-NULL), try to keep
3017 3056           * ourselves on the same vdev as our gang block header.  That
3018 3057           * way, we can hope for locality in vdev_cache, plus it makes our
3019 3058           * fault domains something tractable.
3020 3059           */
3021 3060          if (hintdva) {
3022 3061                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
3023 3062  
3024 3063                  /*
3025 3064                   * It's possible the vdev we're using as the hint no
3026      -                 * longer exists or its mg has been closed (e.g. by
3027      -                 * device removal).  Consult the rotor when
     3065 +                 * longer exists (i.e. removed). Consult the rotor when
3028 3066                   * all else fails.
3029 3067                   */
3030      -                if (vd != NULL && vd->vdev_mg != NULL) {
     3068 +                if (vd != NULL) {
3031 3069                          mg = vd->vdev_mg;
3032 3070  
3033 3071                          if (flags & METASLAB_HINTBP_AVOID &&
3034 3072                              mg->mg_next != NULL)
3035 3073                                  mg = mg->mg_next;
3036 3074                  } else {
3037 3075                          mg = mc->mc_rotor;
3038 3076                  }
3039 3077          } else if (d != 0) {
3040 3078                  vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
[ 74 lines elided ]
3115 3153                          distance = vd->vdev_asize >>
3116 3154                              ditto_same_vdev_distance_shift;
3117 3155                          if (distance <= (1ULL << vd->vdev_ms_shift))
3118 3156                                  distance = 0;
3119 3157                  }
3120 3158  
3121 3159                  uint64_t asize = vdev_psize_to_asize(vd, psize);
3122 3160                  ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
3123 3161  
3124 3162                  uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
3125      -                    distance, dva, d);
     3163 +                    distance, dva, d, flags);
3126 3164  
3127 3165                  if (offset != -1ULL) {
3128 3166                          /*
3129 3167                           * If we've just selected this metaslab group,
3130 3168                           * figure out whether the corresponding vdev is
3131 3169                           * over- or under-used relative to the pool,
3132 3170                           * and set an allocation bias to even it out.
3133 3171                           */
3134 3172                          if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
3135 3173                                  vdev_stat_t *vs = &vd->vdev_stat;
3136      -                                int64_t vu, cu;
     3174 +                                vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
     3175 +                                int64_t vu, cu, vu_io;
3137 3176  
3138 3177                                  vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
3139 3178                                  cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
     3179 +                                vu_io =
     3180 +                                    (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
     3181 +                                    (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
     3182 +                                    (vd->vdev_parent->vdev_children)) - 100;
3140 3183  
3141 3184                                  /*
3142 3185                                   * Calculate how much more or less we should
3143 3186                                   * try to allocate from this device during
3144 3187                                   * this iteration around the rotor.
3145 3188                                   * For example, if a device is 80% full
3146 3189                                   * and the pool is 20% full then we should
3147 3190                                   * reduce allocations by 60% on this device.
3148 3191                                   *
3149 3192                                   * mg_bias = (20 - 80) * 512K / 100 = -307K
3150 3193                                   *
3151 3194                                   * This reduces allocations by 307K for this
3152 3195                                   * iteration.
3153 3196                                   */
3154 3197                                  mg->mg_bias = ((cu - vu) *
3155 3198                                      (int64_t)mg->mg_aliquot) / 100;
     3199 +
     3200 +                                /*
     3201 +                                 * Experiment: select the DVA allocator:
     3202 +                                 * 0 = space-based, 1 = latency-based, 2 = hybrid.
     3203 +                                 */
     3204 +                                switch (metaslab_alloc_dva_algorithm) {
     3205 +                                case 1:
     3206 +                                        mg->mg_bias =
     3207 +                                            (vu_io * (int64_t)mg->mg_aliquot) /
     3208 +                                            100;
     3209 +                                        break;
     3210 +                                case 2:
     3211 +                                        mg->mg_bias =
     3212 +                                            ((((cu - vu) + vu_io) / 2) *
     3213 +                                            (int64_t)mg->mg_aliquot) / 100;
     3214 +                                        break;
     3215 +                                default:
     3216 +                                        break;
     3217 +                                }
3156 3218                          } else if (!metaslab_bias_enabled) {
3157 3219                                  mg->mg_bias = 0;
3158 3220                          }
3159 3221  
3160 3222                          if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
3161 3223                              mg->mg_aliquot + mg->mg_bias) {
3162 3224                                  mc->mc_rotor = mg->mg_next;
3163 3225                                  mc->mc_aliquot = 0;
3164 3226                          }
3165 3227  
3166 3228                          DVA_SET_VDEV(&dva[d], vd->vdev_id);
3167 3229                          DVA_SET_OFFSET(&dva[d], offset);
3168 3230                          DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
3169 3231                          DVA_SET_ASIZE(&dva[d], asize);
     3232 +                        DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
     3233 +                            uint64_t, offset, uint64_t, psize);
3170 3234  
3171 3235                          return (0);
3172 3236                  }
3173 3237  next:
3174 3238                  mc->mc_rotor = mg->mg_next;
3175 3239                  mc->mc_aliquot = 0;
3176 3240          } while ((mg = mg->mg_next) != rotor);
3177 3241  
3178 3242          /*
3179 3243           * If we haven't tried hard, do so now.
[ 2 lines elided ]
3182 3246                  try_hard = B_TRUE;
3183 3247                  goto top;
3184 3248          }
3185 3249  
3186 3250          bzero(&dva[d], sizeof (dva_t));
3187 3251  
3188 3252          metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
3189 3253          return (SET_ERROR(ENOSPC));
3190 3254  }
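To make the bias arithmetic in metaslab_alloc_dva() concrete, here is a worked example with hypothetical numbers: a 4-child parent, this vdev 80% allocated, the pool 20% allocated, and this vdev accounting for 30% of the parent's write iotime:

        /*
         * vu    = 80
         * cu    = 20
         * vu_io = (30 * 4) - 100 = 20     (an even share would be 25%)
         *
         * metaslab_alloc_dva_algorithm == 0 (space, the default):
         *         mg_bias = (cu - vu) * mg_aliquot / 100         = -60% of mg_aliquot
         * metaslab_alloc_dva_algorithm == 1 (latency):
         *         mg_bias = vu_io * mg_aliquot / 100             = +20% of mg_aliquot
         * metaslab_alloc_dva_algorithm == 2 (hybrid):
         *         mg_bias = ((cu - vu) + vu_io) / 2 * mg_aliquot / 100 = -20% of mg_aliquot
         */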
3191 3255  
3192      -void
3193      -metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
3194      -    uint64_t txg)
3195      -{
3196      -        metaslab_t *msp;
3197      -        spa_t *spa = vd->vdev_spa;
3198      -
3199      -        ASSERT3U(txg, ==, spa->spa_syncing_txg);
3200      -        ASSERT(vdev_is_concrete(vd));
3201      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3202      -        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3203      -
3204      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3205      -
3206      -        VERIFY(!msp->ms_condensing);
3207      -        VERIFY3U(offset, >=, msp->ms_start);
3208      -        VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
3209      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3210      -        VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
3211      -
3212      -        metaslab_check_free_impl(vd, offset, asize);
3213      -        mutex_enter(&msp->ms_lock);
3214      -        if (range_tree_space(msp->ms_freeingtree) == 0) {
3215      -                vdev_dirty(vd, VDD_METASLAB, msp, txg);
3216      -        }
3217      -        range_tree_add(msp->ms_freeingtree, offset, asize);
3218      -        mutex_exit(&msp->ms_lock);
3219      -}
3220      -
3221      -/* ARGSUSED */
3222      -void
3223      -metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3224      -    uint64_t size, void *arg)
3225      -{
3226      -        uint64_t *txgp = arg;
3227      -
3228      -        if (vd->vdev_ops->vdev_op_remap != NULL)
3229      -                vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
3230      -        else
3231      -                metaslab_free_impl(vd, offset, size, *txgp);
3232      -}
3233      -
3234      -static void
3235      -metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
3236      -    uint64_t txg)
3237      -{
3238      -        spa_t *spa = vd->vdev_spa;
3239      -
3240      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3241      -
3242      -        if (txg > spa_freeze_txg(spa))
3243      -                return;
3244      -
3245      -        if (spa->spa_vdev_removal != NULL &&
3246      -            spa->spa_vdev_removal->svr_vdev == vd &&
3247      -            vdev_is_concrete(vd)) {
3248      -                /*
3249      -                 * Note: we check if the vdev is concrete because when
3250      -                 * we complete the removal, we first change the vdev to be
3251      -                 * an indirect vdev (in open context), and then (in syncing
3252      -                 * context) clear spa_vdev_removal.
3253      -                 */
3254      -                free_from_removing_vdev(vd, offset, size, txg);
3255      -        } else if (vd->vdev_ops->vdev_op_remap != NULL) {
3256      -                vdev_indirect_mark_obsolete(vd, offset, size, txg);
3257      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3258      -                    metaslab_free_impl_cb, &txg);
3259      -        } else {
3260      -                metaslab_free_concrete(vd, offset, size, txg);
3261      -        }
3262      -}
3263      -
3264      -typedef struct remap_blkptr_cb_arg {
3265      -        blkptr_t *rbca_bp;
3266      -        spa_remap_cb_t rbca_cb;
3267      -        vdev_t *rbca_remap_vd;
3268      -        uint64_t rbca_remap_offset;
3269      -        void *rbca_cb_arg;
3270      -} remap_blkptr_cb_arg_t;
3271      -
3272      -void
3273      -remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3274      -    uint64_t size, void *arg)
3275      -{
3276      -        remap_blkptr_cb_arg_t *rbca = arg;
3277      -        blkptr_t *bp = rbca->rbca_bp;
3278      -
3279      -        /* We can not remap split blocks. */
3280      -        if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
3281      -                return;
3282      -        ASSERT0(inner_offset);
3283      -
3284      -        if (rbca->rbca_cb != NULL) {
3285      -                /*
3286      -                 * At this point we know that we are not handling split
3287      -                 * blocks and we invoke the callback on the previous
3288      -                 * vdev which must be indirect.
3289      -                 */
3290      -                ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
3291      -
3292      -                rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
3293      -                    rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
3294      -
3295      -                /* set up remap_blkptr_cb_arg for the next call */
3296      -                rbca->rbca_remap_vd = vd;
3297      -                rbca->rbca_remap_offset = offset;
3298      -        }
3299      -
3300      -        /*
3301      -         * The phys birth time is that of dva[0].  This ensures that we know
3302      -         * when each dva was written, so that resilver can determine which
3303      -         * blocks need to be scrubbed (i.e. those written during the time
3304      -         * the vdev was offline).  It also ensures that the key used in
3305      -         * the ARC hash table is unique (i.e. dva[0] + phys_birth).  If
3306      -         * we didn't change the phys_birth, a lookup in the ARC for a
3307      -         * remapped BP could find the data that was previously stored at
3308      -         * this vdev + offset.
3309      -         */
3310      -        vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
3311      -            DVA_GET_VDEV(&bp->blk_dva[0]));
3312      -        vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
3313      -        bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
3314      -            DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
3315      -
3316      -        DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
3317      -        DVA_SET_OFFSET(&bp->blk_dva[0], offset);
3318      -}
3319      -
3320 3256  /*
3321      - * If the block pointer contains any indirect DVAs, modify them to refer to
3322      - * concrete DVAs.  Note that this will sometimes not be possible, leaving
3323      - * the indirect DVA in place.  This happens if the indirect DVA spans multiple
3324      - * segments in the mapping (i.e. it is a "split block").
3325      - *
3326      - * If the BP was remapped, calls the callback on the original dva (note the
3327      - * callback can be called multiple times if the original indirect DVA refers
3328      - * to another indirect DVA, etc).
3329      - *
3330      - * Returns TRUE if the BP was remapped.
     3257 + * Free the block represented by DVA in the context of the specified
     3258 + * transaction group.
3331 3259   */
3332      -boolean_t
3333      -spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
3334      -{
3335      -        remap_blkptr_cb_arg_t rbca;
3336      -
3337      -        if (!zfs_remap_blkptr_enable)
3338      -                return (B_FALSE);
3339      -
3340      -        if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
3341      -                return (B_FALSE);
3342      -
3343      -        /*
3344      -         * Dedup BP's can not be remapped, because ddt_phys_select() depends
3345      -         * on DVA[0] being the same in the BP as in the DDT (dedup table).
3346      -         */
3347      -        if (BP_GET_DEDUP(bp))
3348      -                return (B_FALSE);
3349      -
3350      -        /*
3351      -         * Gang blocks can not be remapped, because
3352      -         * zio_checksum_gang_verifier() depends on the DVA[0] that's in
3353      -         * the BP used to read the gang block header (GBH) being the same
3354      -         * as the DVA[0] that we allocated for the GBH.
3355      -         */
3356      -        if (BP_IS_GANG(bp))
3357      -                return (B_FALSE);
3358      -
3359      -        /*
3360      -         * Embedded BP's have no DVA to remap.
3361      -         */
3362      -        if (BP_GET_NDVAS(bp) < 1)
3363      -                return (B_FALSE);
3364      -
3365      -        /*
3366      -         * Note: we only remap dva[0].  If we remapped other dvas, we
3367      -         * would no longer know what their phys birth txg is.
3368      -         */
3369      -        dva_t *dva = &bp->blk_dva[0];
3370      -
3371      -        uint64_t offset = DVA_GET_OFFSET(dva);
3372      -        uint64_t size = DVA_GET_ASIZE(dva);
3373      -        vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
3374      -
3375      -        if (vd->vdev_ops->vdev_op_remap == NULL)
3376      -                return (B_FALSE);
3377      -
3378      -        rbca.rbca_bp = bp;
3379      -        rbca.rbca_cb = callback;
3380      -        rbca.rbca_remap_vd = vd;
3381      -        rbca.rbca_remap_offset = offset;
3382      -        rbca.rbca_cb_arg = arg;
3383      -
3384      -        /*
3385      -         * remap_blkptr_cb() will be called in order for each level of
3386      -         * indirection, until a concrete vdev is reached or a split block is
3387      -         * encountered. old_vd and old_offset are updated within the callback
3388      -         * as we go from the one indirect vdev to the next one (either concrete
3389      -         * or indirect again) in that order.
3390      -         */
3391      -        vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
3392      -
3393      -        /* Check if the DVA wasn't remapped because it is a split block */
3394      -        if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
3395      -                return (B_FALSE);
3396      -
3397      -        return (B_TRUE);
3398      -}
3399      -
3400      -/*
3401      - * Undo the allocation of a DVA which happened in the given transaction group.
3402      - */
3403 3260  void
3404      -metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
     3261 +metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
3405 3262  {
3406      -        metaslab_t *msp;
3407      -        vdev_t *vd;
3408 3263          uint64_t vdev = DVA_GET_VDEV(dva);
3409 3264          uint64_t offset = DVA_GET_OFFSET(dva);
3410 3265          uint64_t size = DVA_GET_ASIZE(dva);
     3266 +        vdev_t *vd;
     3267 +        metaslab_t *msp;
3411 3268  
     3269 +        DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
     3270 +            uint64_t, offset, uint64_t, size);
     3271 +
3412 3272          ASSERT(DVA_IS_VALID(dva));
3413      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3414 3273  
3415 3274          if (txg > spa_freeze_txg(spa))
3416 3275                  return;
3417 3276  
3418 3277          if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
3419 3278              (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count) {
3420 3279                  cmn_err(CE_WARN, "metaslab_free_dva(): bad DVA %llu:%llu",
3421 3280                      (u_longlong_t)vdev, (u_longlong_t)offset);
3422 3281                  ASSERT(0);
3423 3282                  return;
3424 3283          }
3425 3284  
3426      -        ASSERT(!vd->vdev_removing);
3427      -        ASSERT(vdev_is_concrete(vd));
3428      -        ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
3429      -        ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
     3285 +        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3430 3286  
3431 3287          if (DVA_GET_GANG(dva))
3432 3288                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3433 3289  
3434      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3435      -
3436 3290          mutex_enter(&msp->ms_lock);
3437      -        range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
3438      -            offset, size);
3439 3291  
3440      -        VERIFY(!msp->ms_condensing);
3441      -        VERIFY3U(offset, >=, msp->ms_start);
3442      -        VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
3443      -        VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
3444      -            msp->ms_size);
3445      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3446      -        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3447      -        range_tree_add(msp->ms_tree, offset, size);
     3292 +        if (now) {
     3293 +                range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
     3294 +                    offset, size);
     3295 +
     3296 +                VERIFY(!msp->ms_condensing);
     3297 +                VERIFY3U(offset, >=, msp->ms_start);
     3298 +                VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
     3299 +                VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
     3300 +                    msp->ms_size);
     3301 +                VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
     3302 +                VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
     3303 +                range_tree_add(msp->ms_tree, offset, size);
     3304 +                if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
     3305 +                    !vd->vdev_man_trimming)
     3306 +                        metaslab_trim_add(msp, offset, size);
     3307 +                msp->ms_max_size = metaslab_block_maxsize(msp);
     3308 +        } else {
     3309 +                VERIFY3U(txg, ==, spa->spa_syncing_txg);
     3310 +                if (range_tree_space(msp->ms_freeingtree) == 0)
     3311 +                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
     3312 +                range_tree_add(msp->ms_freeingtree, offset, size);
     3313 +        }
     3314 +
3448 3315          mutex_exit(&msp->ms_lock);
3449 3316  }
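A minimal caller-side sketch of the two modes of the reworked metaslab_free_dva(), mirroring its callers later in this file (spa, bp and txg are placeholders):

        /* now == B_TRUE: undo an allocation made in this txg; the extent
         * goes straight back into ms_tree (and the trimset, if autotrim
         * is on and no manual trim is running). */
        metaslab_free_dva(spa, &bp->blk_dva[0], txg, B_TRUE);

        /* now == B_FALSE: normal free path; the extent is queued in
         * ms_freeingtree for the syncing txg and only becomes
         * allocatable again after the defer cycle. */
        metaslab_free_dva(spa, &bp->blk_dva[0], spa_syncing_txg(spa), B_FALSE);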
3450 3317  
3451 3318  /*
3452      - * Free the block represented by DVA in the context of the specified
3453      - * transaction group.
     3319 + * Intent log support: upon opening the pool after a crash, notify the SPA
     3320 + * of blocks that the intent log has allocated for immediate write, but
     3321 + * which are still considered free by the SPA because the last transaction
     3322 + * group didn't commit yet.
3454 3323   */
3455      -void
3456      -metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
     3324 +static int
     3325 +metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3457 3326  {
3458 3327          uint64_t vdev = DVA_GET_VDEV(dva);
3459 3328          uint64_t offset = DVA_GET_OFFSET(dva);
3460 3329          uint64_t size = DVA_GET_ASIZE(dva);
3461      -        vdev_t *vd = vdev_lookup_top(spa, vdev);
     3330 +        vdev_t *vd;
     3331 +        metaslab_t *msp;
     3332 +        int error = 0;
3462 3333  
3463 3334          ASSERT(DVA_IS_VALID(dva));
3464      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3465 3335  
3466      -        if (DVA_GET_GANG(dva)) {
     3336 +        if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
     3337 +            (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
     3338 +                return (SET_ERROR(ENXIO));
     3339 +
     3340 +        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3341 +
     3342 +        if (DVA_GET_GANG(dva))
3467 3343                  size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
     3344 +
     3345 +        mutex_enter(&msp->ms_lock);
     3346 +
     3347 +        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
     3348 +                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
     3349 +
     3350 +        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
     3351 +                error = SET_ERROR(ENOENT);
     3352 +
     3353 +        if (error || txg == 0) {        /* txg == 0 indicates dry run */
     3354 +                mutex_exit(&msp->ms_lock);
     3355 +                return (error);
3468 3356          }
3469 3357  
3470      -        metaslab_free_impl(vd, offset, size, txg);
     3358 +        VERIFY(!msp->ms_condensing);
     3359 +        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
     3360 +        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
     3361 +        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
     3362 +        range_tree_remove(msp->ms_tree, offset, size);
     3363 +        metaslab_trim_remove(msp, offset, size);
     3364 +
     3365 +        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
     3366 +                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
     3367 +                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
     3368 +                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
     3369 +        }
     3370 +
     3371 +        mutex_exit(&msp->ms_lock);
     3372 +
     3373 +        return (0);
3471 3374  }
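A short sketch of the txg == 0 dry-run convention used by metaslab_claim() below (error handling reduced to the essentials; spa, bp and txg are placeholders):

        /* Dry run: verify the extent is still free without dirtying anything. */
        int error = metaslab_claim_dva(spa, &bp->blk_dva[0], 0);

        /* Real claim: move the extent from ms_tree into ms_alloctree for txg
         * and drop it from the pending trimsets. */
        if (error == 0)
                error = metaslab_claim_dva(spa, &bp->blk_dva[0], txg);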
3472 3375  
3473 3376  /*
3474 3377   * Reserve some allocation slots. The reservation system must be called
3475 3378   * before we call into the allocator. If there aren't any available slots
3476 3379   * then the I/O will be throttled until an I/O completes and its slots are
3477 3380   * freed up. The function returns true if it was successful in placing
3478 3381   * the reservation.
3479 3382   */
3480 3383  boolean_t
↓ open down ↓ 30 lines elided ↑ open up ↑
3511 3414  metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, zio_t *zio)
3512 3415  {
3513 3416          ASSERT(mc->mc_alloc_throttle_enabled);
3514 3417          mutex_enter(&mc->mc_lock);
3515 3418          for (int d = 0; d < slots; d++) {
3516 3419                  (void) refcount_remove(&mc->mc_alloc_slots, zio);
3517 3420          }
3518 3421          mutex_exit(&mc->mc_lock);
3519 3422  }
3520 3423  
3521      -static int
3522      -metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
3523      -    uint64_t txg)
3524      -{
3525      -        metaslab_t *msp;
3526      -        spa_t *spa = vd->vdev_spa;
3527      -        int error = 0;
3528      -
3529      -        if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
3530      -                return (ENXIO);
3531      -
3532      -        ASSERT3P(vd->vdev_ms, !=, NULL);
3533      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3534      -
3535      -        mutex_enter(&msp->ms_lock);
3536      -
3537      -        if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
3538      -                error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
3539      -
3540      -        if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
3541      -                error = SET_ERROR(ENOENT);
3542      -
3543      -        if (error || txg == 0) {        /* txg == 0 indicates dry run */
3544      -                mutex_exit(&msp->ms_lock);
3545      -                return (error);
3546      -        }
3547      -
3548      -        VERIFY(!msp->ms_condensing);
3549      -        VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3550      -        VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3551      -        VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
3552      -        range_tree_remove(msp->ms_tree, offset, size);
3553      -
3554      -        if (spa_writeable(spa)) {       /* don't dirty if we're zdb(1M) */
3555      -                if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
3556      -                        vdev_dirty(vd, VDD_METASLAB, msp, txg);
3557      -                range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
3558      -        }
3559      -
3560      -        mutex_exit(&msp->ms_lock);
3561      -
3562      -        return (0);
3563      -}
3564      -
3565      -typedef struct metaslab_claim_cb_arg_t {
3566      -        uint64_t        mcca_txg;
3567      -        int             mcca_error;
3568      -} metaslab_claim_cb_arg_t;
3569      -
3570      -/* ARGSUSED */
3571      -static void
3572      -metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3573      -    uint64_t size, void *arg)
3574      -{
3575      -        metaslab_claim_cb_arg_t *mcca_arg = arg;
3576      -
3577      -        if (mcca_arg->mcca_error == 0) {
3578      -                mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
3579      -                    size, mcca_arg->mcca_txg);
3580      -        }
3581      -}
3582      -
3583 3424  int
3584      -metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
3585      -{
3586      -        if (vd->vdev_ops->vdev_op_remap != NULL) {
3587      -                metaslab_claim_cb_arg_t arg;
3588      -
3589      -                /*
3590      -                 * Only zdb(1M) can claim on indirect vdevs.  This is used
3591      -                 * to detect leaks of mapped space (that are not accounted
3592      -                 * for in the obsolete counts, spacemap, or bpobj).
3593      -                 */
3594      -                ASSERT(!spa_writeable(vd->vdev_spa));
3595      -                arg.mcca_error = 0;
3596      -                arg.mcca_txg = txg;
3597      -
3598      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3599      -                    metaslab_claim_impl_cb, &arg);
3600      -
3601      -                if (arg.mcca_error == 0) {
3602      -                        arg.mcca_error = metaslab_claim_concrete(vd,
3603      -                            offset, size, txg);
3604      -                }
3605      -                return (arg.mcca_error);
3606      -        } else {
3607      -                return (metaslab_claim_concrete(vd, offset, size, txg));
3608      -        }
3609      -}
3610      -
3611      -/*
3612      - * Intent log support: upon opening the pool after a crash, notify the SPA
3613      - * of blocks that the intent log has allocated for immediate write, but
3614      - * which are still considered free by the SPA because the last transaction
3615      - * group didn't commit yet.
3616      - */
3617      -static int
3618      -metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3619      -{
3620      -        uint64_t vdev = DVA_GET_VDEV(dva);
3621      -        uint64_t offset = DVA_GET_OFFSET(dva);
3622      -        uint64_t size = DVA_GET_ASIZE(dva);
3623      -        vdev_t *vd;
3624      -
3625      -        if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
3626      -                return (SET_ERROR(ENXIO));
3627      -        }
3628      -
3629      -        ASSERT(DVA_IS_VALID(dva));
3630      -
3631      -        if (DVA_GET_GANG(dva))
3632      -                size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3633      -
3634      -        return (metaslab_claim_impl(vd, offset, size, txg));
3635      -}
3636      -
3637      -int
3638 3425  metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
3639 3426      int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
3640 3427      zio_alloc_list_t *zal, zio_t *zio)
3641 3428  {
3642 3429          dva_t *dva = bp->blk_dva;
3643 3430          dva_t *hintdva = hintbp->blk_dva;
3644 3431          int error = 0;
3645 3432  
3646 3433          ASSERT(bp->blk_birth == 0);
3647 3434          ASSERT(BP_PHYSICAL_BIRTH(bp) == 0);
[ 3 lines elided ]
3651 3438          if (mc->mc_rotor == NULL) {     /* no vdevs in this class */
3652 3439                  spa_config_exit(spa, SCL_ALLOC, FTAG);
3653 3440                  return (SET_ERROR(ENOSPC));
3654 3441          }
3655 3442  
3656 3443          ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
3657 3444          ASSERT(BP_GET_NDVAS(bp) == 0);
3658 3445          ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
3659 3446          ASSERT3P(zal, !=, NULL);
3660 3447  
3661      -        for (int d = 0; d < ndvas; d++) {
3662      -                error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
3663      -                    txg, flags, zal);
3664      -                if (error != 0) {
3665      -                        for (d--; d >= 0; d--) {
3666      -                                metaslab_unalloc_dva(spa, &dva[d], txg);
3667      -                                metaslab_group_alloc_decrement(spa,
3668      -                                    DVA_GET_VDEV(&dva[d]), zio, flags);
3669      -                                bzero(&dva[d], sizeof (dva_t));
     3448 +        if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
     3449 +            !(flags & (METASLAB_GANG_HEADER)) &&
     3450 +            !(spa->spa_meta_policy.spa_small_data_to_special &&
     3451 +            psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
     3452 +                error = metaslab_alloc_dva(spa, spa_normal_class(spa),
     3453 +                    psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
     3454 +                    flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
     3455 +                if (error == 0) {
     3456 +                        error = metaslab_alloc_dva(spa, mc, psize,
     3457 +                            &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
     3458 +                        if (error != 0) {
     3459 +                                error = 0;
     3460 +                                /*
     3461 +                                 * Move the NORMAL copy into the first DVA slot
     3462 +                                 * and clear the second DVA. After that this BP
     3463 +                                 * is just a regular BP with a single DVA.
     3464 +                                 *
     3465 +                                 * This operation is valid only if:
     3466 +                                 * WBC_SPECIAL_DVA is dva[0]
     3467 +                                 * WBC_NORMAL_DVA is dva[1]
     3468 +                                 *
     3469 +                                 * see wbc.h
     3470 +                                 */
     3471 +                                bcopy(&dva[WBC_NORMAL_DVA],
     3472 +                                    &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
     3473 +                                bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
     3474 +
     3475 +                                /*
     3476 +                                 * Allocation of the special DVA has failed,
     3477 +                                 * so this BP will be a regular BP; update
     3478 +                                 * the metaslab group's queue depth based
     3479 +                                 * on the newly allocated dva.
     3480 +                                 */
     3481 +                                metaslab_group_alloc_increment(spa,
     3482 +                                    DVA_GET_VDEV(&dva[0]), zio, flags);
     3483 +                        } else {
     3484 +                                BP_SET_SPECIAL(bp, 1);
3670 3485                          }
     3486 +                } else {
3671 3487                          spa_config_exit(spa, SCL_ALLOC, FTAG);
3672 3488                          return (error);
3673      -                } else {
3674      -                        /*
3675      -                         * Update the metaslab group's queue depth
3676      -                         * based on the newly allocated dva.
3677      -                         */
3678      -                        metaslab_group_alloc_increment(spa,
3679      -                            DVA_GET_VDEV(&dva[d]), zio, flags);
3680 3489                  }
3681      -
     3490 +        } else {
     3491 +                for (int d = 0; d < ndvas; d++) {
     3492 +                        error = metaslab_alloc_dva(spa, mc, psize, dva, d,
     3493 +                            hintdva, txg, flags, zal);
     3494 +                        if (error != 0) {
     3495 +                                for (d--; d >= 0; d--) {
     3496 +                                        metaslab_free_dva(spa, &dva[d],
     3497 +                                            txg, B_TRUE);
     3498 +                                        metaslab_group_alloc_decrement(spa,
     3499 +                                            DVA_GET_VDEV(&dva[d]), zio, flags);
     3500 +                                        bzero(&dva[d], sizeof (dva_t));
     3501 +                                }
     3502 +                                spa_config_exit(spa, SCL_ALLOC, FTAG);
     3503 +                                return (error);
     3504 +                        } else {
     3505 +                                /*
     3506 +                                 * Update the metaslab group's queue depth
     3507 +                                 * based on the newly allocated dva.
     3508 +                                 */
     3509 +                                metaslab_group_alloc_increment(spa,
     3510 +                                    DVA_GET_VDEV(&dva[d]), zio, flags);
     3511 +                        }
     3512 +                }
     3513 +                ASSERT(BP_GET_NDVAS(bp) == ndvas);
3682 3514          }
3683 3515          ASSERT(error == 0);
3684      -        ASSERT(BP_GET_NDVAS(bp) == ndvas);
3685 3516  
3686 3517          spa_config_exit(spa, SCL_ALLOC, FTAG);
3687 3518  
3688 3519          BP_SET_BIRTH(bp, txg, txg);
3689 3520  
3690 3521          return (0);
3691 3522  }
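A summary sketch of the DVA layout produced by the special-class branch above, restating its comments (see wbc.h for the authoritative slot definitions):

        /*
         * BP_IS_SPECIAL(bp) set:
         *         dva[WBC_SPECIAL_DVA]  - copy on the special class
         *         dva[WBC_NORMAL_DVA]   - copy on the normal class
         *
         * If the special-class allocation fails, the normal DVA is moved into
         * dva[WBC_SPECIAL_DVA], dva[WBC_NORMAL_DVA] is zeroed, and
         * BP_SET_SPECIAL() is never applied, leaving an ordinary
         * single-DVA BP.
         */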
3692 3523  
3693 3524  void
3694 3525  metaslab_free(spa_t *spa, const blkptr_t *bp, uint64_t txg, boolean_t now)
3695 3526  {
3696 3527          const dva_t *dva = bp->blk_dva;
3697 3528          int ndvas = BP_GET_NDVAS(bp);
3698 3529  
3699 3530          ASSERT(!BP_IS_HOLE(bp));
3700 3531          ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
3701 3532  
3702 3533          spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
3703 3534  
3704      -        for (int d = 0; d < ndvas; d++) {
3705      -                if (now) {
3706      -                        metaslab_unalloc_dva(spa, &dva[d], txg);
3707      -                } else {
3708      -                        metaslab_free_dva(spa, &dva[d], txg);
     3535 +        if (BP_IS_SPECIAL(bp)) {
     3536 +                int start_dva;
     3537 +                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
     3538 +
     3539 +                mutex_enter(&wbc_data->wbc_lock);
     3540 +                start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
     3541 +                mutex_exit(&wbc_data->wbc_lock);
     3542 +
     3543 +                /*
     3544 +                 * The actual freeing need not hold the lock:
     3545 +                 * the block is already exempted from the WBC
     3546 +                 * trees and thus will not be moved.
     3547 +                 */
     3548 +                metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
     3549 +                if (start_dva == 0) {
     3550 +                        metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
     3551 +                            txg, now);
3709 3552                  }
     3553 +        } else {
     3554 +                for (int d = 0; d < ndvas; d++)
     3555 +                        metaslab_free_dva(spa, &dva[d], txg, now);
3710 3556          }
3711 3557  
3712 3558          spa_config_exit(spa, SCL_FREE, FTAG);
3713 3559  }
3714 3560  
3715 3561  int
3716 3562  metaslab_claim(spa_t *spa, const blkptr_t *bp, uint64_t txg)
3717 3563  {
3718 3564          const dva_t *dva = bp->blk_dva;
3719 3565          int ndvas = BP_GET_NDVAS(bp);
[ 5 lines elided ]
3725 3571                  /*
3726 3572                   * First do a dry run to make sure all DVAs are claimable,
3727 3573                   * so we don't have to unwind from partial failures below.
3728 3574                   */
3729 3575                  if ((error = metaslab_claim(spa, bp, 0)) != 0)
3730 3576                          return (error);
3731 3577          }
3732 3578  
3733 3579          spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
3734 3580  
3735      -        for (int d = 0; d < ndvas; d++)
3736      -                if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
3737      -                        break;
     3581 +        if (BP_IS_SPECIAL(bp)) {
     3582 +                int start_dva;
     3583 +                wbc_data_t *wbc_data = spa_get_wbc_data(spa);
3738 3584  
     3585 +                mutex_enter(&wbc_data->wbc_lock);
     3586 +                start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
     3587 +
     3588 +                /*
     3589 +                 * Actual claiming must be done under the lock for WBC blocks
     3590 +                 * to ensure that zdb does not fail. The only other user of
     3591 +                 * claiming is the ZIL, whose blocks cannot be WBC ones, and
     3592 +                 * thus the lock is not held for them.
     3593 +                 */
     3594 +                error = metaslab_claim_dva(spa,
     3595 +                    &dva[WBC_NORMAL_DVA], txg);
     3596 +                if (error == 0 && start_dva == 0) {
     3597 +                        error = metaslab_claim_dva(spa,
     3598 +                            &dva[WBC_SPECIAL_DVA], txg);
     3599 +                }
     3600 +
     3601 +                mutex_exit(&wbc_data->wbc_lock);
     3602 +        } else {
     3603 +                for (int d = 0; d < ndvas; d++)
     3604 +                        if ((error = metaslab_claim_dva(spa,
     3605 +                            &dva[d], txg)) != 0)
     3606 +                                break;
     3607 +        }
     3608 +
3739 3609          spa_config_exit(spa, SCL_ALLOC, FTAG);
3740 3610  
3741 3611          ASSERT(error == 0 || txg == 0);
3742 3612  
3743 3613          return (error);
3744 3614  }
3745 3615  
3746      -/* ARGSUSED */
3747      -static void
3748      -metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
3749      -    uint64_t size, void *arg)
     3616 +void
     3617 +metaslab_check_free(spa_t *spa, const blkptr_t *bp)
3750 3618  {
3751      -        if (vd->vdev_ops == &vdev_indirect_ops)
     3619 +        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3752 3620                  return;
3753 3621  
3754      -        metaslab_check_free_impl(vd, offset, size);
     3622 +        if (BP_IS_SPECIAL(bp)) {
     3623 +                /* Do not check frees for WBC blocks */
     3624 +                return;
     3625 +        }
     3626 +
     3627 +        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
     3628 +        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
     3629 +                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
     3630 +                vdev_t *vd = vdev_lookup_top(spa, vdev);
     3631 +                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
     3632 +                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
     3633 +                metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3634 +
     3635 +                if (msp->ms_loaded) {
     3636 +                        range_tree_verify(msp->ms_tree, offset, size);
     3637 +                        range_tree_verify(msp->ms_cur_ts->ts_tree,
     3638 +                            offset, size);
     3639 +                        if (msp->ms_prev_ts != NULL) {
     3640 +                                range_tree_verify(msp->ms_prev_ts->ts_tree,
     3641 +                                    offset, size);
     3642 +                        }
     3643 +                }
     3644 +
     3645 +                range_tree_verify(msp->ms_freeingtree, offset, size);
     3646 +                range_tree_verify(msp->ms_freedtree, offset, size);
     3647 +                for (int j = 0; j < TXG_DEFER_SIZE; j++)
     3648 +                        range_tree_verify(msp->ms_defertree[j], offset, size);
     3649 +        }
     3650 +        spa_config_exit(spa, SCL_VDEV, FTAG);
3755 3651  }
3756 3652  
3757      -static void
3758      -metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
     3653 +/*
     3654 + * Trims all free space in the metaslab. Returns the root TRIM zio (that the
     3655 + * caller should zio_wait() for) and the amount of space in the metaslab that
     3656 + * has been scheduled for trimming in the `delta' return argument.
     3657 + */
     3658 +zio_t *
     3659 +metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
3759 3660  {
3760      -        metaslab_t *msp;
3761      -        spa_t *spa = vd->vdev_spa;
     3661 +        boolean_t was_loaded;
     3662 +        uint64_t trimmed_space;
     3663 +        zio_t *trim_io;
3762 3664  
3763      -        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3764      -                return;
     3665 +        ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
3765 3666  
3766      -        if (vd->vdev_ops->vdev_op_remap != NULL) {
3767      -                vd->vdev_ops->vdev_op_remap(vd, offset, size,
3768      -                    metaslab_check_free_impl_cb, NULL);
3769      -                return;
     3667 +        mutex_enter(&msp->ms_lock);
     3668 +
     3669 +        while (msp->ms_loading)
     3670 +                metaslab_load_wait(msp);
     3671 +        /* If we loaded the metaslab, unload it when we're done. */
     3672 +        was_loaded = msp->ms_loaded;
     3673 +        if (!was_loaded) {
     3674 +                if (metaslab_load(msp) != 0) {
     3675 +                        mutex_exit(&msp->ms_lock);
     3676 +                        return (0);
     3677 +                }
3770 3678          }
     3679 +        /* Flush out any scheduled extents and add everything in ms_tree. */
     3680 +        range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
     3681 +        range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
3771 3682  
3772      -        ASSERT(vdev_is_concrete(vd));
3773      -        ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3774      -        ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
     3683 +        /* Force this trim to take place ASAP. */
     3684 +        if (msp->ms_prev_ts != NULL)
     3685 +                metaslab_free_trimset(msp->ms_prev_ts);
     3686 +        msp->ms_prev_ts = msp->ms_cur_ts;
     3687 +        msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
     3688 +        trimmed_space = range_tree_space(msp->ms_tree);
     3689 +        if (!was_loaded)
     3690 +                metaslab_unload(msp);
3775 3691  
3776      -        msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
     3692 +        trim_io = metaslab_exec_trim(msp);
     3693 +        mutex_exit(&msp->ms_lock);
     3694 +        *delta = trimmed_space;
3777 3695  
     3696 +        return (trim_io);
     3697 +}
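A minimal usage sketch for metaslab_trim_all(), following the contract in the comment above (msp is assumed to be a valid metaslab; the caller-side iteration and locking of a real on-demand trim are omitted):

        uint64_t delta = 0;
        zio_t *trim_zio = metaslab_trim_all(msp, &delta);

        if (trim_zio != NULL)
                (void) zio_wait(trim_zio);      /* wait for the TRIM zio tree */
        /* delta now holds the amount of space scheduled for trimming. */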
     3698 +
     3699 +/*
     3700 + * Notifies the trimsets in a metaslab that an extent has been allocated.
     3701 + * This removes the segment from the queues of extents waiting to be trimmed.
     3702 + */
     3703 +static void
     3704 +metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
     3705 +{
     3706 +        metaslab_t *msp = arg;
     3707 +
     3708 +        range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
     3709 +        if (msp->ms_prev_ts != NULL) {
     3710 +                range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
     3711 +                    size);
     3712 +        }
     3713 +}
     3714 +
     3715 +/*
     3716 + * Notifies the trimsets in a metaslab that an extent has been freed.
     3717 + * This adds the segment to the currently open queue of extents waiting
     3718 + * to be trimmed.
     3719 + */
     3720 +static void
     3721 +metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
     3722 +{
     3723 +        metaslab_t *msp = arg;
     3724 +        ASSERT(msp->ms_cur_ts != NULL);
     3725 +        range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
     3726 +}
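
Both callbacks share the range-tree callback shape, void (*)(void *, uint64_t, uint64_t), so they can be handed to range tree iterators (as metaslab_trim_all does with range_tree_walk above) or called directly. A hedged sketch of the intended use follows, with a hypothetical wrapper name; the real call sites are in the allocation and free paths elsewhere in this patch:

/*
 * Hypothetical wrapper, for illustration only: keep the trimsets in
 * sync with an extent's life cycle.
 */
static void
example_note_extent(metaslab_t *msp, uint64_t offset, uint64_t size,
    boolean_t allocated)
{
        ASSERT(MUTEX_HELD(&msp->ms_lock));

        if (allocated)
                metaslab_trim_remove(msp, offset, size); /* cancel pending trim */
        else
                metaslab_trim_add(msp, offset, size);    /* queue for trimming */
}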
     3727 +
     3728 +/*
     3729 + * Performs a metaslab's automatic trim processing. This must be called
     3730 + * from metaslab_sync with the number of the txg being synced. Trims are
     3731 + * issued at intervals dictated by the zfs_txgs_per_trim tunable.
     3732 + */
     3733 +void
     3734 +metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
     3735 +{
     3736 +        /* snapshot the tunable so we use one consistent value throughout */
     3737 +        uint64_t txgs_per_trim = zfs_txgs_per_trim;
     3738 +
     3739 +        ASSERT(!MUTEX_HELD(&msp->ms_lock));
3778 3740          mutex_enter(&msp->ms_lock);
3779      -        if (msp->ms_loaded)
3780      -                range_tree_verify(msp->ms_tree, offset, size);
3781 3741  
3782      -        range_tree_verify(msp->ms_freeingtree, offset, size);
3783      -        range_tree_verify(msp->ms_freedtree, offset, size);
3784      -        for (int j = 0; j < TXG_DEFER_SIZE; j++)
3785      -                range_tree_verify(msp->ms_defertree[j], offset, size);
     3742 +        /*
     3743 +         * Since we typically have hundreds of metaslabs per vdev, but we only
     3744 +         * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
     3745 +         * could sequence the TRIM commands from all metaslabs so that they
     3746 + * don't all pound the device in the same txg. We do so by
     3747 +         * artificially inflating the birth txg of the first trim set by a
     3748 +         * sequence number derived from the metaslab's starting offset
     3749 +         * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
     3750 +         * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
     3751 +         *
     3752 +         * If we detect that the txg has advanced too far ahead of ts_birth,
     3753 +         * it means our birth txg is out of lockstep. Recompute it by
     3754 +         * rounding down to the nearest zfs_txgs_per_trim multiple and adding
     3755 +         * our metaslab id modulo zfs_txgs_per_trim.
     3756 +         */
     3757 +        if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
     3758 +                msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
     3759 +                    txgs_per_trim + (msp->ms_id % txgs_per_trim);
     3760 +        }
     3761 +
     3762 +        /* Time to swap out the current and previous trimsets */
     3763 +        if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
     3764 +                if (msp->ms_prev_ts != NULL) {
     3765 +                        if (msp->ms_trimming_ts != NULL) {
     3766 +                                spa_t *spa = msp->ms_group->mg_class->mc_spa;
     3767 +                                /*
     3768 +                                 * The previous trim run is still ongoing, so
     3769 +                                 * the device is reacting slowly to our trim
     3770 +                                 * requests. Drop this trimset, so as not to
     3771 +                                 * back the device up with trim requests.
     3772 +                                 */
     3773 +                                spa_trimstats_auto_slow_incr(spa);
     3774 +                                metaslab_free_trimset(msp->ms_prev_ts);
     3775 +                        } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
     3776 +                                /*
     3777 +                                 * If a manual trim is ongoing, we want to
     3778 +                                 * inhibit autotrim temporarily so it doesn't
     3779 +                                 * slow down the manual trim.
     3780 +                                 */
     3781 +                                metaslab_free_trimset(msp->ms_prev_ts);
     3782 +                        } else {
     3783 +                                /*
     3784 +                                 * Trim out aged extents on the vdevs - these
     3785 +                                 * are safe to be destroyed now. We'll keep
     3786 +                                 * the trimset around to deny allocations from
     3787 +                                 * these regions while the trims are ongoing.
     3788 +                                 */
     3789 +                                zio_nowait(metaslab_exec_trim(msp));
     3790 +                        }
     3791 +                }
     3792 +                msp->ms_prev_ts = msp->ms_cur_ts;
     3793 +                msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
     3794 +        }
3786 3795          mutex_exit(&msp->ms_lock);
3787 3796  }
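
The staggering arithmetic in the comment above can be made concrete with a worked example under the defaults it mentions (about 200 metaslabs per vdev, zfs_txgs_per_trim = 32). The helper below is purely illustrative and simply re-derives ts_birth the way the resynchronization branch does:

/*
 * Illustrative only: given the current txg and a metaslab id, compute
 * the txg in which that metaslab will next swap trimsets, using the
 * same arithmetic as the resync branch in metaslab_auto_trim().
 */
static uint64_t
example_next_trimset_swap(uint64_t txg, uint64_t ms_id,
    uint64_t txgs_per_trim)
{
        uint64_t birth = (txg / txgs_per_trim) * txgs_per_trim +
            (ms_id % txgs_per_trim);

        /* the swap fires when txg reaches ts_birth + txgs_per_trim */
        return (birth + txgs_per_trim);
}

At txg 1000 this yields txg 1024 for metaslab 0, 1025 for metaslab 1, up to 1055 for metaslab 31, wrapping back to 1024 for metaslab 32; with 200 metaslabs that works out to the ~6.25 metaslabs trimmed per txg quoted in the comment.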
3788 3797  
3789      -void
3790      -metaslab_check_free(spa_t *spa, const blkptr_t *bp)
     3798 +static void
     3799 +metaslab_trim_done(zio_t *zio)
3791 3800  {
3792      -        if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3793      -                return;
     3801 +        metaslab_t *msp = zio->io_private;
     3802 +        boolean_t held;
3794 3803  
3795      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3796      -        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
3797      -                uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
3798      -                vdev_t *vd = vdev_lookup_top(spa, vdev);
3799      -                uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
3800      -                uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
     3804 +        ASSERT(msp != NULL);
     3805 +        ASSERT(msp->ms_trimming_ts != NULL);
     3806 +        held = MUTEX_HELD(&msp->ms_lock);
     3807 +        if (!held)
     3808 +                mutex_enter(&msp->ms_lock);
     3809 +        metaslab_free_trimset(msp->ms_trimming_ts);
     3810 +        msp->ms_trimming_ts = NULL;
     3811 +        cv_signal(&msp->ms_trim_cv);
     3812 +        if (!held)
     3813 +                mutex_exit(&msp->ms_lock);
     3814 +}
3801 3815  
3802      -                if (DVA_GET_GANG(&bp->blk_dva[i]))
3803      -                        size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
     3816 +/*
     3817 + * Executes a zio_trim on a range tree holding freed extents in the metaslab.
     3818 + */
     3819 +static zio_t *
     3820 +metaslab_exec_trim(metaslab_t *msp)
     3821 +{
     3822 +        metaslab_group_t *mg = msp->ms_group;
     3823 +        spa_t *spa = mg->mg_class->mc_spa;
     3824 +        vdev_t *vd = mg->mg_vd;
     3825 +        range_tree_t *trim_tree;
     3826 +        zio_t *zio;
3804 3827  
3805      -                ASSERT3P(vd, !=, NULL);
     3828 +        ASSERT(MUTEX_HELD(&msp->ms_lock));
3806 3829  
3807      -                metaslab_check_free_impl(vd, offset, size);
     3830 +        /* wait for a preceding trim to finish */
     3831 +        while (msp->ms_trimming_ts != NULL)
     3832 +                cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
     3833 +        msp->ms_trimming_ts = msp->ms_prev_ts;
     3834 +        msp->ms_prev_ts = NULL;
     3835 +        trim_tree = msp->ms_trimming_ts->ts_tree;
     3836 +#ifdef  DEBUG
     3837 +        if (msp->ms_loaded) {
     3838 +                for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
     3839 +                    rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
     3840 +                        if (!range_tree_contains(msp->ms_tree,
     3841 +                            rs->rs_start, rs->rs_end - rs->rs_start)) {
     3842 +                                panic("trimming allocated region; mss=%p",
     3843 +                                    (void *)rs);
     3844 +                        }
     3845 +                }
3808 3846          }
3809      -        spa_config_exit(spa, SCL_VDEV, FTAG);
     3847 +#endif
     3848 +
     3849 +        /* Nothing to trim */
     3850 +        if (range_tree_space(trim_tree) == 0) {
     3851 +                metaslab_free_trimset(msp->ms_trimming_ts);
     3852 +                msp->ms_trimming_ts = NULL;
     3853 +                return (zio_root(spa, NULL, NULL, 0));
     3854 +        }
     3855 +        zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
     3856 +            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
     3857 +            ZIO_FLAG_CONFIG_WRITER, msp);
     3858 +
     3859 +        return (zio);
     3860 +}
     3861 +
     3862 +/*
     3863 + * Allocates and initializes a new trimset structure. The `txg' argument
     3864 + * indicates when this trimset was born and `lock' indicates the lock to
     3865 + * link to the range tree.
     3866 + */
     3867 +static metaslab_trimset_t *
     3868 +metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
     3869 +{
     3870 +        metaslab_trimset_t *ts;
     3871 +
     3872 +        ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
     3873 +        ts->ts_birth = txg;
     3874 +        ts->ts_tree = range_tree_create(NULL, NULL, lock);
     3875 +
     3876 +        return (ts);
     3877 +}
     3878 +
     3879 +/*
     3880 + * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
     3881 + */
     3882 +static void
     3883 +metaslab_free_trimset(metaslab_trimset_t *ts)
     3884 +{
     3885 +        range_tree_vacate(ts->ts_tree, NULL, NULL);
     3886 +        range_tree_destroy(ts->ts_tree);
     3887 +        kmem_free(ts, sizeof (*ts));
     3888 +}
     3889 +
     3890 +/*
     3891 + * Checks whether an allocation conflicts with an ongoing trim operation in
     3892 + * the given metaslab. This function takes a segment starting at `*offset'
     3893 + * of `size' and checks whether it hits any region in the metaslab currently
     3894 + * being trimmed. If yes, it tries to adjust the allocation to the end of
     3895 + * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
     3896 + * `limit' (no part of the allocation is allowed to go past this point).
     3897 + *
     3898 + * Returns B_FALSE if either the original allocation wasn't in conflict, or
     3899 + * the conflict could be resolved by adjusting the value stored in `offset'
     3900 + * such that the whole allocation still fits below `limit'. Returns B_TRUE
     3901 + * if the allocation conflict couldn't be resolved.
     3902 + */
     3903 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
     3904 +    uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit)
     3905 +{
     3906 +        uint64_t new_offset;
     3907 +
     3908 +        if (msp->ms_trimming_ts == NULL)
     3909 +                /* no trim conflict, original offset is OK */
     3910 +                return (B_FALSE);
     3911 +
     3912 +        new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
     3913 +            *offset, size), align);
     3914 +        if (new_offset != *offset && new_offset + size > limit)
     3915 +                /* trim conflict and adjustment not possible */
     3916 +                return (B_TRUE);
     3917 +
     3918 +        /* no conflict, or the adjusted offset still fits below the limit */
     3919 +        *offset = new_offset;
     3920 +        return (B_FALSE);
3810 3921  }
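
A hedged sketch of how an allocator consumes this helper; the function name and the segment-based placement below are hypothetical (the real call sites are the block allocators elsewhere in this patch), but they show the intended meaning of `align' and `limit':

/*
 * Illustrative only: try to place `size' bytes inside the free segment
 * [rs_start, rs_end), skipping past any region currently being trimmed.
 * Returns the chosen offset, or -1ULL if the segment cannot be used.
 */
static uint64_t
example_place_allocation(metaslab_t *msp, range_seg_t *rs, uint64_t size,
    uint64_t align)
{
        uint64_t offset = P2ROUNDUP(rs->rs_start, align);

        if (offset + size > rs->rs_end)
                return (-1ULL);
        if (metaslab_check_trim_conflict(msp, &offset, size, align,
            rs->rs_end))
                return (-1ULL); /* conflict could not be resolved */

        /* `offset' may now sit just past the region being trimmed */
        return (offset);
}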
    