NEX-18589 checksum errors on SSD-based pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14242 Getting panic in module "zfs" due to a NULL pointer dereference
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-17716 reoccurring checksum errors on pool
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15266 Default resilver throttling values are too aggressive
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-9719 Reorganize scan_io_t to make it smaller and improve scan performance
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9658 Resilver code leaks the block sorting queues
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9651 Resilver code leaks a bit of memory in the dataset processing queue
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9551 Resilver algorithm should properly sort metadata and data with copies > 1
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9601 New resilver algorithm causes scrub errors on WBC devices
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9593 New resilvering algorithm can panic when WBC mirrors are in use
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4705 WRC: Kernel-panic during the destroying of a pool with activated WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6292 exporting a pool while an async destroy is running can leave entries in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix multi-proto)
6251 add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
4391 panic system rather than corrupting pool if we hit bug 4390
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Issue #26: partial scrub
Added partial scrub options:
-M for MOS only scrub
-m for metadata scrub
Issue #2: optimize DDE lookup in DDT objects
Added an option to control the number of DDE classes in the DDT.
The new default is one, i.e. all DDEs are stored together
regardless of refcount.
re #12619 rb4429 More dp->dp_config_rwlock holds
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
    
      
          --- old/usr/src/uts/common/fs/zfs/dsl_scan.c
          +++ new/usr/src/uts/common/fs/zfs/dsl_scan.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  
[ 12 lines elided ]
  
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
       23 + * Copyright 2017 Nexenta Systems, Inc.  All rights reserved.
  23   24   * Copyright 2016 Gary Mills
  24      - * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
       25 + * Copyright (c) 2011, 2016 by Delphix. All rights reserved.
  25   26   * Copyright 2017 Joyent, Inc.
  26   27   * Copyright (c) 2017 Datto Inc.
  27   28   */
  28   29  
  29   30  #include <sys/dsl_scan.h>
  30   31  #include <sys/dsl_pool.h>
  31   32  #include <sys/dsl_dataset.h>
  32   33  #include <sys/dsl_prop.h>
  33   34  #include <sys/dsl_dir.h>
  34   35  #include <sys/dsl_synctask.h>
  35   36  #include <sys/dnode.h>
  36   37  #include <sys/dmu_tx.h>
  37   38  #include <sys/dmu_objset.h>
  38   39  #include <sys/arc.h>
  39   40  #include <sys/zap.h>
  40   41  #include <sys/zio.h>
  41   42  #include <sys/zfs_context.h>
  42   43  #include <sys/fs/zfs.h>
  43   44  #include <sys/zfs_znode.h>
  44   45  #include <sys/spa_impl.h>
  45   46  #include <sys/vdev_impl.h>
  
[ 11 lines elided ]
  
  46   47  #include <sys/zil_impl.h>
  47   48  #include <sys/zio_checksum.h>
  48   49  #include <sys/ddt.h>
  49   50  #include <sys/sa.h>
  50   51  #include <sys/sa_impl.h>
  51   52  #include <sys/zfeature.h>
  52   53  #include <sys/abd.h>
  53   54  #ifdef _KERNEL
  54   55  #include <sys/zfs_vfsops.h>
  55   56  #endif
       57 +#include <sys/range_tree.h>
  56   58  
       59 +extern int zfs_vdev_async_write_active_min_dirty_percent;
       60 +
       61 +typedef struct {
       62 +        uint64_t        sds_dsobj;
       63 +        uint64_t        sds_txg;
       64 +        avl_node_t      sds_node;
       65 +} scan_ds_t;
       66 +
       67 +typedef struct {
       68 +        dsl_scan_io_queue_t     *qri_queue;
       69 +        uint64_t                qri_limit;
       70 +} io_queue_run_info_t;
       71 +
       72 +/*
       73 + * This controls what conditions are placed on dsl_scan_sync_state():
       74 + * SYNC_OPTIONAL) write out scn_phys iff scn_bytes_pending == 0
       75 + * SYNC_MANDATORY) write out scn_phys always. scn_bytes_pending must be 0.
       76 + * SYNC_CACHED) if scn_bytes_pending == 0, write out scn_phys. Otherwise
       77 + *      write out the scn_phys_cached version.
       78 + * See dsl_scan_sync_state for details.
       79 + */
       80 +typedef enum {
       81 +        SYNC_OPTIONAL,
       82 +        SYNC_MANDATORY,
       83 +        SYNC_CACHED
       84 +} state_sync_type_t;
       85 +
  57   86  typedef int (scan_cb_t)(dsl_pool_t *, const blkptr_t *,
  58   87      const zbookmark_phys_t *);
  59   88  
  60   89  static scan_cb_t dsl_scan_scrub_cb;
  61   90  static void dsl_scan_cancel_sync(void *, dmu_tx_t *);
  62      -static void dsl_scan_sync_state(dsl_scan_t *, dmu_tx_t *);
       91 +static void dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx,
       92 +    state_sync_type_t sync_type);
  63   93  static boolean_t dsl_scan_restarting(dsl_scan_t *, dmu_tx_t *);
  64   94  
  65      -int zfs_top_maxinflight = 32;           /* maximum I/Os per top-level */
  66      -int zfs_resilver_delay = 2;             /* number of ticks to delay resilver */
  67      -int zfs_scrub_delay = 4;                /* number of ticks to delay scrub */
  68      -int zfs_scan_idle = 50;                 /* idle window in clock ticks */
       95 +static int scan_ds_queue_compar(const void *a, const void *b);
       96 +static void scan_ds_queue_empty(dsl_scan_t *scn, boolean_t destroy);
       97 +static boolean_t scan_ds_queue_contains(dsl_scan_t *scn, uint64_t dsobj,
       98 +    uint64_t *txg);
       99 +static int scan_ds_queue_insert(dsl_scan_t *scn, uint64_t dsobj, uint64_t txg);
      100 +static void scan_ds_queue_remove(dsl_scan_t *scn, uint64_t dsobj);
      101 +static boolean_t scan_ds_queue_first(dsl_scan_t *scn, uint64_t *dsobj,
      102 +    uint64_t *txg);
      103 +static void scan_ds_queue_sync(dsl_scan_t *scn, dmu_tx_t *tx);
  69  104  
      105 +/*
       106 + * Maximum number of concurrently executing I/Os per top-level vdev.
      107 + * Tune with care. Very high settings (hundreds) are known to trigger
      108 + * some firmware bugs and resets on certain SSDs.
      109 + */
      110 +int zfs_top_maxinflight = 32;
      111 +
      112 +/*
      113 + * Minimum amount of data we dequeue if our queues are full and the
      114 + * dirty data limit for a txg has been reached.
      115 + */
      116 +uint64_t zfs_scan_dequeue_min =                 16 << 20;
      117 +/*
       118 + * The duration we are aiming for a dsl_scan_sync to take due to our
      119 + * dequeued data. If we go over that value, we lower the amount we dequeue
      120 + * each run and vice versa. The bonus value below is just something we add
       121 + * on top of the target value so that we have a little bit of fudging in case
      122 + * some top-level vdevs finish before others - we want to keep the vdevs as
      123 + * hot as possible.
      124 + */
      125 +uint64_t zfs_scan_dequeue_run_target_ms =       2000;
      126 +uint64_t zfs_dequeue_run_bonus_ms =             1000;
      127 +#define DEQUEUE_BONUS_MS_MAX                    100000
      128 +
      129 +boolean_t zfs_scan_direct = B_FALSE;    /* don't queue & sort zios, go direct */
      130 +uint64_t zfs_scan_max_ext_gap = 2 << 20;        /* bytes */
      131 +/* See scan_io_queue_mem_lim for details on the memory limit tunables */
      132 +uint64_t zfs_scan_mem_lim_fact = 20;            /* fraction of physmem */
      133 +uint64_t zfs_scan_mem_lim_soft_fact = 20;       /* fraction of mem lim above */
      134 +uint64_t zfs_scan_checkpoint_intval = 7200;     /* seconds */
      135 +/*
      136 + * fill_weight is non-tunable at runtime, so we copy it at module init from
      137 + * zfs_scan_fill_weight. Runtime adjustments to zfs_scan_fill_weight would
      138 + * break queue sorting.
      139 + */
      140 +uint64_t zfs_scan_fill_weight = 3;
      141 +static uint64_t fill_weight = 3;
      142 +
      143 +/* See scan_io_queue_mem_lim for details on the memory limit tunables */
      144 +uint64_t zfs_scan_mem_lim_min = 16 << 20;       /* bytes */
      145 +uint64_t zfs_scan_mem_lim_soft_max = 128 << 20; /* bytes */
      146 +
      147 +#define ZFS_SCAN_CHECKPOINT_INTVAL      SEC_TO_TICK(zfs_scan_checkpoint_intval)
      148 +
  70  149  int zfs_scan_min_time_ms = 1000; /* min millisecs to scrub per txg */
  71  150  int zfs_free_min_time_ms = 1000; /* min millisecs to free per txg */
  72      -int zfs_obsolete_min_time_ms = 500; /* min millisecs to obsolete per txg */
  73  151  int zfs_resilver_min_time_ms = 3000; /* min millisecs to resilver per txg */
  74  152  boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */
  75  153  boolean_t zfs_no_scrub_prefetch = B_FALSE; /* set to disable scrub prefetch */
  76  154  enum ddt_class zfs_scrub_ddt_class_max = DDT_CLASS_DUPLICATE;
  77  155  int dsl_scan_delay_completion = B_FALSE; /* set to delay scan completion */
  78  156  /* max number of blocks to free in a single TXG */
  79      -uint64_t zfs_async_block_max_blocks = UINT64_MAX;
      157 +uint64_t zfs_free_max_blocks = UINT64_MAX;
  80  158  
  81  159  #define DSL_SCAN_IS_SCRUB_RESILVER(scn) \
  82  160          ((scn)->scn_phys.scn_func == POOL_SCAN_SCRUB || \
  83      -        (scn)->scn_phys.scn_func == POOL_SCAN_RESILVER)
      161 +        (scn)->scn_phys.scn_func == POOL_SCAN_RESILVER || \
      162 +        (scn)->scn_phys.scn_func == POOL_SCAN_MOS || \
      163 +        (scn)->scn_phys.scn_func == POOL_SCAN_META)
  84  164  
  85  165  extern int zfs_txg_timeout;
  86  166  
  87  167  /*
  88  168   * Enable/disable the processing of the free_bpobj object.
  89  169   */
  90  170  boolean_t zfs_free_bpobj_enabled = B_TRUE;
  91  171  
  92  172  /* the order has to match pool_scan_type */
  93  173  static scan_cb_t *scan_funcs[POOL_SCAN_FUNCS] = {
  94  174          NULL,
  95  175          dsl_scan_scrub_cb,      /* POOL_SCAN_SCRUB */
  96  176          dsl_scan_scrub_cb,      /* POOL_SCAN_RESILVER */
      177 +        dsl_scan_scrub_cb,      /* POOL_SCAN_MOS */
      178 +        dsl_scan_scrub_cb,      /* POOL_SCAN_META */
  97  179  };
  98  180  
      181 +typedef struct scan_io {
      182 +        uint64_t                sio_prop;
      183 +        uint64_t                sio_phys_birth;
      184 +        uint64_t                sio_birth;
      185 +        zio_cksum_t             sio_cksum;
      186 +        zbookmark_phys_t        sio_zb;
      187 +        union {
      188 +                avl_node_t      sio_addr_node;
      189 +                list_node_t     sio_list_node;
      190 +        } sio_nodes;
      191 +        uint64_t                sio_dva_word1;
      192 +        uint32_t                sio_asize;
      193 +        int                     sio_flags;
      194 +} scan_io_t;
      195 +
      196 +struct dsl_scan_io_queue {
      197 +        dsl_scan_t      *q_scn;
      198 +        vdev_t          *q_vd;
      199 +
      200 +        kcondvar_t      q_cv;
      201 +
      202 +        range_tree_t    *q_exts_by_addr;
      203 +        avl_tree_t      q_zios_by_addr;
      204 +        avl_tree_t      q_exts_by_size;
      205 +
      206 +        /* number of bytes in queued zios - atomic ops */
      207 +        uint64_t        q_zio_bytes;
      208 +
      209 +        range_seg_t     q_issuing_rs;
      210 +        uint64_t        q_num_issuing_zios;
      211 +};
      212 +
      213 +#define SCAN_IO_GET_OFFSET(sio) \
      214 +        BF64_GET_SB((sio)->sio_dva_word1, 0, 63, SPA_MINBLOCKSHIFT, 0)
      215 +#define SCAN_IO_SET_OFFSET(sio, offset) \
      216 +        BF64_SET_SB((sio)->sio_dva_word1, 0, 63, SPA_MINBLOCKSHIFT, 0, offset)
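/*
 * Editorial note, not part of this changeset: sio_dva_word1 carries the
 * same encoding as word 1 of a DVA, so the two macros above mirror
 * DVA_GET_OFFSET()/DVA_SET_OFFSET() -- the byte offset is kept in the low
 * 63 bits in units of 512-byte sectors (SPA_MINBLOCKSHIFT).
 */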
      217 +
      218 +static void scan_io_queue_insert_cb(range_tree_t *rt, range_seg_t *rs,
      219 +    void *arg);
      220 +static void scan_io_queue_remove_cb(range_tree_t *rt, range_seg_t *rs,
      221 +    void *arg);
      222 +static void scan_io_queue_vacate_cb(range_tree_t *rt, void *arg);
      223 +static int ext_size_compar(const void *x, const void *y);
      224 +static int io_addr_compar(const void *x, const void *y);
      225 +
      226 +static struct range_tree_ops scan_io_queue_ops = {
      227 +        .rtop_create = NULL,
      228 +        .rtop_destroy = NULL,
      229 +        .rtop_add = scan_io_queue_insert_cb,
      230 +        .rtop_remove = scan_io_queue_remove_cb,
      231 +        .rtop_vacate = scan_io_queue_vacate_cb
      232 +};
      233 +
      234 +typedef enum {
      235 +        MEM_LIM_NONE,
      236 +        MEM_LIM_SOFT,
      237 +        MEM_LIM_HARD
      238 +} mem_lim_t;
      239 +
      240 +static void dsl_scan_enqueue(dsl_pool_t *dp, const blkptr_t *bp,
      241 +    int zio_flags, const zbookmark_phys_t *zb);
      242 +static void scan_exec_io(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
      243 +    const zbookmark_phys_t *zb, boolean_t limit_inflight);
      244 +static void scan_io_queue_insert(dsl_scan_t *scn, dsl_scan_io_queue_t *queue,
      245 +    const blkptr_t *bp, int dva_i, int zio_flags, const zbookmark_phys_t *zb);
      246 +
      247 +static void scan_io_queues_run_one(io_queue_run_info_t *info);
      248 +static void scan_io_queues_run(dsl_scan_t *scn);
      249 +static mem_lim_t scan_io_queue_mem_lim(dsl_scan_t *scn);
      250 +
      251 +static dsl_scan_io_queue_t *scan_io_queue_create(vdev_t *vd);
      252 +static void scan_io_queues_destroy(dsl_scan_t *scn);
      253 +static void dsl_scan_freed_dva(spa_t *spa, const blkptr_t *bp, int dva_i);
      254 +
      255 +static inline boolean_t
      256 +dsl_scan_is_running(const dsl_scan_t *scn)
      257 +{
      258 +        return (scn->scn_phys.scn_state == DSS_SCANNING ||
      259 +            scn->scn_phys.scn_state == DSS_FINISHING);
      260 +}
      261 +
      262 +static inline void
      263 +sio2bp(const scan_io_t *sio, blkptr_t *bp, uint64_t vdev_id)
      264 +{
      265 +        bzero(bp, sizeof (*bp));
      266 +        DVA_SET_ASIZE(&bp->blk_dva[0], sio->sio_asize);
      267 +        DVA_SET_VDEV(&bp->blk_dva[0], vdev_id);
      268 +        bp->blk_dva[0].dva_word[1] = sio->sio_dva_word1;
      269 +        bp->blk_prop = sio->sio_prop;
      270 +        /*
      271 +         * We must reset the special flag, because the rebuilt BP lacks
      272 +         * a second DVA, so wbc_select_dva must not be allowed to run.
      273 +         */
      274 +        BP_SET_SPECIAL(bp, 0);
      275 +        bp->blk_phys_birth = sio->sio_phys_birth;
      276 +        bp->blk_birth = sio->sio_birth;
      277 +        bp->blk_fill = 1;       /* we always only work with data pointers */
      278 +        bp->blk_cksum = sio->sio_cksum;
      279 +}
      280 +
      281 +static inline void
      282 +bp2sio(const blkptr_t *bp, scan_io_t *sio, int dva_i)
      283 +{
      284 +        if (BP_IS_SPECIAL(bp))
      285 +                ASSERT3S(dva_i, ==, WBC_NORMAL_DVA);
      286 +        /* we discard the vdev guid, since we can deduce it from the queue */
      287 +        sio->sio_dva_word1 = bp->blk_dva[dva_i].dva_word[1];
      288 +        sio->sio_asize = DVA_GET_ASIZE(&bp->blk_dva[dva_i]);
      289 +        sio->sio_prop = bp->blk_prop;
      290 +        sio->sio_phys_birth = bp->blk_phys_birth;
      291 +        sio->sio_birth = bp->blk_birth;
      292 +        sio->sio_cksum = bp->blk_cksum;
      293 +}
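/*
 * Editorial sketch, not part of this changeset: the intended round trip
 * between a blkptr_t and a scan_io_t. A scan_io_t captures everything
 * about one DVA of a block except the vdev, which is implied by the
 * per-vdev queue it sits on and is supplied again when the blkptr_t is
 * rebuilt for issue. The locals "bp" and "queue" are hypothetical.
 */
#if 0
	scan_io_t sio;
	blkptr_t rebuilt;

	bp2sio(bp, &sio, 0);				/* capture DVA 0 */
	sio2bp(&sio, &rebuilt, queue->q_vd->vdev_id);	/* restore it */
#endif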
      294 +
      295 +void
      296 +dsl_scan_global_init()
      297 +{
      298 +        fill_weight = zfs_scan_fill_weight;
      299 +}
      300 +
  99  301  int
 100  302  dsl_scan_init(dsl_pool_t *dp, uint64_t txg)
 101  303  {
 102  304          int err;
 103  305          dsl_scan_t *scn;
 104  306          spa_t *spa = dp->dp_spa;
 105  307          uint64_t f;
 106  308  
 107  309          scn = dp->dp_scan = kmem_zalloc(sizeof (dsl_scan_t), KM_SLEEP);
 108  310          scn->scn_dp = dp;
 109  311  
      312 +        mutex_init(&scn->scn_sorted_lock, NULL, MUTEX_DEFAULT, NULL);
      313 +        mutex_init(&scn->scn_status_lock, NULL, MUTEX_DEFAULT, NULL);
      314 +
 110  315          /*
 111  316           * It's possible that we're resuming a scan after a reboot so
 112  317           * make sure that the scan_async_destroying flag is initialized
 113  318           * appropriately.
 114  319           */
 115  320          ASSERT(!scn->scn_async_destroying);
 116  321          scn->scn_async_destroying = spa_feature_is_active(dp->dp_spa,
 117  322              SPA_FEATURE_ASYNC_DESTROY);
 118  323  
      324 +        bcopy(&scn->scn_phys, &scn->scn_phys_cached, sizeof (scn->scn_phys));
      325 +        mutex_init(&scn->scn_queue_lock, NULL, MUTEX_DEFAULT, NULL);
      326 +        avl_create(&scn->scn_queue, scan_ds_queue_compar, sizeof (scan_ds_t),
      327 +            offsetof(scan_ds_t, sds_node));
      328 +
 119  329          err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
 120  330              "scrub_func", sizeof (uint64_t), 1, &f);
 121  331          if (err == 0) {
 122  332                  /*
 123  333                   * There was an old-style scrub in progress.  Restart a
 124  334                   * new-style scrub from the beginning.
 125  335                   */
 126  336                  scn->scn_restart_txg = txg;
      337 +                DTRACE_PROBE2(scan_init__old2new, dsl_scan_t *, scn,
      338 +                    uint64_t, txg);
 127  339                  zfs_dbgmsg("old-style scrub was in progress; "
 128  340                      "restarting new-style scrub in txg %llu",
 129  341                      scn->scn_restart_txg);
 130  342  
 131  343                  /*
 132  344                   * Load the queue obj from the old location so that it
 133  345                   * can be freed by dsl_scan_done().
 134  346                   */
 135  347                  (void) zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
 136  348                      "scrub_queue", sizeof (uint64_t), 1,
 137  349                      &scn->scn_phys.scn_queue_obj);
 138  350          } else {
 139  351                  err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
 140  352                      DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
 141  353                      &scn->scn_phys);
 142  354                  if (err == ENOENT)
 143  355                          return (0);
 144  356                  else if (err)
 145  357                          return (err);
 146  358  
 147      -                if (scn->scn_phys.scn_state == DSS_SCANNING &&
      359 +                /*
      360 +                 * We might be restarting after a reboot, so jump the issued
      361 +                 * counter to how far we've scanned. We know we're consistent
      362 +                 * up to here.
      363 +                 */
      364 +                scn->scn_bytes_issued = scn->scn_phys.scn_examined;
      365 +
      366 +                if (dsl_scan_is_running(scn) &&
 148  367                      spa_prev_software_version(dp->dp_spa) < SPA_VERSION_SCAN) {
 149  368                          /*
 150  369                           * A new-type scrub was in progress on an old
 151  370                           * pool, and the pool was accessed by old
 152  371                           * software.  Restart from the beginning, since
 153  372                           * the old software may have changed the pool in
 154  373                           * the meantime.
 155  374                           */
 156  375                          scn->scn_restart_txg = txg;
      376 +                        DTRACE_PROBE2(scan_init__new2old2new,
      377 +                            dsl_scan_t *, scn, uint64_t, txg);
 157  378                          zfs_dbgmsg("new-style scrub was modified "
 158  379                              "by old software; restarting in txg %llu",
 159  380                              scn->scn_restart_txg);
 160  381                  }
 161  382          }
 162  383  
      384 +        /* reload the queue into the in-core state */
      385 +        if (scn->scn_phys.scn_queue_obj != 0) {
      386 +                zap_cursor_t zc;
      387 +                zap_attribute_t za;
      388 +
      389 +                for (zap_cursor_init(&zc, dp->dp_meta_objset,
      390 +                    scn->scn_phys.scn_queue_obj);
      391 +                    zap_cursor_retrieve(&zc, &za) == 0;
      392 +                    (void) zap_cursor_advance(&zc)) {
      393 +                        VERIFY0(scan_ds_queue_insert(scn,
      394 +                            zfs_strtonum(za.za_name, NULL),
      395 +                            za.za_first_integer));
      396 +                }
      397 +                zap_cursor_fini(&zc);
      398 +        }
      399 +
 163  400          spa_scan_stat_init(spa);
 164  401          return (0);
 165  402  }
 166  403  
 167  404  void
 168  405  dsl_scan_fini(dsl_pool_t *dp)
 169  406  {
 170      -        if (dp->dp_scan) {
      407 +        if (dp->dp_scan != NULL) {
      408 +                dsl_scan_t *scn = dp->dp_scan;
      409 +
      410 +                mutex_destroy(&scn->scn_sorted_lock);
      411 +                mutex_destroy(&scn->scn_status_lock);
      412 +                if (scn->scn_taskq != NULL)
      413 +                        taskq_destroy(scn->scn_taskq);
      414 +                scan_ds_queue_empty(scn, B_TRUE);
      415 +                mutex_destroy(&scn->scn_queue_lock);
      416 +
 171  417                  kmem_free(dp->dp_scan, sizeof (dsl_scan_t));
 172  418                  dp->dp_scan = NULL;
 173  419          }
 174  420  }
 175  421  
 176  422  /* ARGSUSED */
 177  423  static int
 178  424  dsl_scan_setup_check(void *arg, dmu_tx_t *tx)
 179  425  {
 180  426          dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
 181  427  
 182      -        if (scn->scn_phys.scn_state == DSS_SCANNING)
      428 +        if (dsl_scan_is_running(scn))
 183  429                  return (SET_ERROR(EBUSY));
 184  430  
 185  431          return (0);
 186  432  }
 187  433  
 188  434  static void
 189  435  dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)
 190  436  {
 191  437          dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
 192  438          pool_scan_func_t *funcp = arg;
 193  439          dmu_object_type_t ot = 0;
 194  440          dsl_pool_t *dp = scn->scn_dp;
 195  441          spa_t *spa = dp->dp_spa;
 196  442  
 197      -        ASSERT(scn->scn_phys.scn_state != DSS_SCANNING);
      443 +        ASSERT(!dsl_scan_is_running(scn));
 198  444          ASSERT(*funcp > POOL_SCAN_NONE && *funcp < POOL_SCAN_FUNCS);
 199  445          bzero(&scn->scn_phys, sizeof (scn->scn_phys));
 200  446          scn->scn_phys.scn_func = *funcp;
 201  447          scn->scn_phys.scn_state = DSS_SCANNING;
 202  448          scn->scn_phys.scn_min_txg = 0;
 203  449          scn->scn_phys.scn_max_txg = tx->tx_txg;
 204      -        scn->scn_phys.scn_ddt_class_max = DDT_CLASSES - 1; /* the entire DDT */
      450 +        /* the entire DDT */
      451 +        scn->scn_phys.scn_ddt_class_max = spa->spa_ddt_class_max;
 205  452          scn->scn_phys.scn_start_time = gethrestime_sec();
 206  453          scn->scn_phys.scn_errors = 0;
 207  454          scn->scn_phys.scn_to_examine = spa->spa_root_vdev->vdev_stat.vs_alloc;
 208  455          scn->scn_restart_txg = 0;
 209  456          scn->scn_done_txg = 0;
      457 +        scn->scn_bytes_issued = 0;
      458 +        scn->scn_checkpointing = B_FALSE;
      459 +        scn->scn_last_checkpoint = 0;
 210  460          spa_scan_stat_init(spa);
 211  461  
 212  462          if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
 213      -                scn->scn_phys.scn_ddt_class_max = zfs_scrub_ddt_class_max;
      463 +                scn->scn_phys.scn_ddt_class_max =
      464 +                    MIN(zfs_scrub_ddt_class_max, spa->spa_ddt_class_max);
 214  465  
 215  466                  /* rewrite all disk labels */
 216  467                  vdev_config_dirty(spa->spa_root_vdev);
 217  468  
 218  469                  if (vdev_resilver_needed(spa->spa_root_vdev,
 219  470                      &scn->scn_phys.scn_min_txg, &scn->scn_phys.scn_max_txg)) {
 220  471                          spa_event_notify(spa, NULL, NULL,
 221  472                              ESC_ZFS_RESILVER_START);
 222  473                  } else {
 223  474                          spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_START);
 224  475                  }
 225  476  
 226  477                  spa->spa_scrub_started = B_TRUE;
 227  478                  /*
 228  479                   * If this is an incremental scrub, limit the DDT scrub phase
 229  480                   * to just the auto-ditto class (for correctness); the rest
 230  481                   * of the scrub should go faster using top-down pruning.
 231  482                   */
 232  483                  if (scn->scn_phys.scn_min_txg > TXG_INITIAL)
 233      -                        scn->scn_phys.scn_ddt_class_max = DDT_CLASS_DITTO;
      484 +                        scn->scn_phys.scn_ddt_class_max =
      485 +                            MIN(DDT_CLASS_DITTO, spa->spa_ddt_class_max);
 234  486  
 235  487          }
 236  488  
 237  489          /* back to the generic stuff */
 238  490  
 239  491          if (dp->dp_blkstats == NULL) {
 240  492                  dp->dp_blkstats =
 241  493                      kmem_alloc(sizeof (zfs_all_blkstats_t), KM_SLEEP);
 242  494          }
 243  495          bzero(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
 244  496  
 245  497          if (spa_version(spa) < SPA_VERSION_DSL_SCRUB)
 246  498                  ot = DMU_OT_ZAP_OTHER;
 247  499  
 248  500          scn->scn_phys.scn_queue_obj = zap_create(dp->dp_meta_objset,
 249  501              ot ? ot : DMU_OT_SCAN_QUEUE, DMU_OT_NONE, 0, tx);
 250  502  
 251      -        dsl_scan_sync_state(scn, tx);
      503 +        bcopy(&scn->scn_phys, &scn->scn_phys_cached, sizeof (scn->scn_phys));
 252  504  
      505 +        dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
      506 +
 253  507          spa_history_log_internal(spa, "scan setup", tx,
 254  508              "func=%u mintxg=%llu maxtxg=%llu",
 255  509              *funcp, scn->scn_phys.scn_min_txg, scn->scn_phys.scn_max_txg);
 256  510  }
 257  511  
 258  512  /* ARGSUSED */
 259  513  static void
 260  514  dsl_scan_done(dsl_scan_t *scn, boolean_t complete, dmu_tx_t *tx)
 261  515  {
 262  516          static const char *old_names[] = {
 263  517                  "scrub_bookmark",
 264  518                  "scrub_ddt_bookmark",
 265  519                  "scrub_ddt_class_max",
 266  520                  "scrub_queue",
 267  521                  "scrub_min_txg",
 268  522                  "scrub_max_txg",
 269  523                  "scrub_func",
 270  524                  "scrub_errors",
 271  525                  NULL
 272  526          };
 273  527  
 274  528          dsl_pool_t *dp = scn->scn_dp;
  
[ 12 lines elided ]
  
 275  529          spa_t *spa = dp->dp_spa;
 276  530          int i;
 277  531  
 278  532          /* Remove any remnants of an old-style scrub. */
 279  533          for (i = 0; old_names[i]; i++) {
 280  534                  (void) zap_remove(dp->dp_meta_objset,
 281  535                      DMU_POOL_DIRECTORY_OBJECT, old_names[i], tx);
 282  536          }
 283  537  
 284  538          if (scn->scn_phys.scn_queue_obj != 0) {
 285      -                VERIFY(0 == dmu_object_free(dp->dp_meta_objset,
      539 +                VERIFY0(dmu_object_free(dp->dp_meta_objset,
 286  540                      scn->scn_phys.scn_queue_obj, tx));
 287  541                  scn->scn_phys.scn_queue_obj = 0;
 288  542          }
      543 +        scan_ds_queue_empty(scn, B_FALSE);
 289  544  
 290  545          scn->scn_phys.scn_flags &= ~DSF_SCRUB_PAUSED;
 291  546  
 292  547          /*
 293  548           * If we were "restarted" from a stopped state, don't bother
 294  549           * with anything else.
 295  550           */
 296      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
      551 +        if (!dsl_scan_is_running(scn)) {
      552 +                ASSERT(!scn->scn_is_sorted);
 297  553                  return;
      554 +        }
 298  555  
 299      -        if (complete)
 300      -                scn->scn_phys.scn_state = DSS_FINISHED;
 301      -        else
 302      -                scn->scn_phys.scn_state = DSS_CANCELED;
      556 +        if (scn->scn_is_sorted) {
      557 +                scan_io_queues_destroy(scn);
      558 +                scn->scn_is_sorted = B_FALSE;
 303  559  
      560 +                if (scn->scn_taskq != NULL) {
      561 +                        taskq_destroy(scn->scn_taskq);
      562 +                        scn->scn_taskq = NULL;
      563 +                }
      564 +        }
      565 +
      566 +        scn->scn_phys.scn_state = complete ? DSS_FINISHED : DSS_CANCELED;
      567 +
 304  568          if (dsl_scan_restarting(scn, tx))
 305  569                  spa_history_log_internal(spa, "scan aborted, restarting", tx,
 306  570                      "errors=%llu", spa_get_errlog_size(spa));
 307  571          else if (!complete)
 308  572                  spa_history_log_internal(spa, "scan cancelled", tx,
 309  573                      "errors=%llu", spa_get_errlog_size(spa));
 310  574          else
 311  575                  spa_history_log_internal(spa, "scan done", tx,
 312  576                      "errors=%llu", spa_get_errlog_size(spa));
 313  577  
 314  578          if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
 315  579                  mutex_enter(&spa->spa_scrub_lock);
 316  580                  while (spa->spa_scrub_inflight > 0) {
 317  581                          cv_wait(&spa->spa_scrub_io_cv,
 318  582                              &spa->spa_scrub_lock);
 319  583                  }
 320  584                  mutex_exit(&spa->spa_scrub_lock);
 321  585                  spa->spa_scrub_started = B_FALSE;
 322  586                  spa->spa_scrub_active = B_FALSE;
 323  587  
 324  588                  /*
 325  589                   * If the scrub/resilver completed, update all DTLs to
 326  590                   * reflect this.  Whether it succeeded or not, vacate
 327  591                   * all temporary scrub DTLs.
 328  592                   */
 329  593                  vdev_dtl_reassess(spa->spa_root_vdev, tx->tx_txg,
 330  594                      complete ? scn->scn_phys.scn_max_txg : 0, B_TRUE);
 331  595                  if (complete) {
 332  596                          spa_event_notify(spa, NULL, NULL,
 333  597                              scn->scn_phys.scn_min_txg ?
 334  598                              ESC_ZFS_RESILVER_FINISH : ESC_ZFS_SCRUB_FINISH);
 335  599                  }
  
[ 22 lines elided ]
  
 336  600                  spa_errlog_rotate(spa);
 337  601  
 338  602                  /*
 339  603                   * We may have finished replacing a device.
 340  604                   * Let the async thread assess this and handle the detach.
 341  605                   */
 342  606                  spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
 343  607          }
 344  608  
 345  609          scn->scn_phys.scn_end_time = gethrestime_sec();
      610 +
      611 +        ASSERT(!dsl_scan_is_running(scn));
      612 +
      613 +        /*
       614 +         * If the special vdev does not have any errors after a
       615 +         * scrub/resilver, clear the flag that prevents writes to
       616 +         * the special vdev.
      617 +         */
      618 +        spa_special_check_errors(spa);
 346  619  }
 347  620  
 348  621  /* ARGSUSED */
 349  622  static int
 350  623  dsl_scan_cancel_check(void *arg, dmu_tx_t *tx)
 351  624  {
 352  625          dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
 353  626  
 354      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
      627 +        if (!dsl_scan_is_running(scn))
 355  628                  return (SET_ERROR(ENOENT));
 356  629          return (0);
 357  630  }
 358  631  
 359  632  /* ARGSUSED */
 360  633  static void
 361  634  dsl_scan_cancel_sync(void *arg, dmu_tx_t *tx)
 362  635  {
 363  636          dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
 364  637  
 365  638          dsl_scan_done(scn, B_FALSE, tx);
 366      -        dsl_scan_sync_state(scn, tx);
 367      -        spa_event_notify(scn->scn_dp->dp_spa, NULL, NULL, ESC_ZFS_SCRUB_ABORT);
      639 +        dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
 368  640  }
 369  641  
 370  642  int
 371  643  dsl_scan_cancel(dsl_pool_t *dp)
 372  644  {
 373  645          return (dsl_sync_task(spa_name(dp->dp_spa), dsl_scan_cancel_check,
 374  646              dsl_scan_cancel_sync, NULL, 3, ZFS_SPACE_CHECK_RESERVED));
 375  647  }
 376  648  
 377  649  boolean_t
 378  650  dsl_scan_is_paused_scrub(const dsl_scan_t *scn)
 379  651  {
 380  652          if (dsl_scan_scrubbing(scn->scn_dp) &&
 381  653              scn->scn_phys.scn_flags & DSF_SCRUB_PAUSED)
 382  654                  return (B_TRUE);
 383  655  
 384  656          return (B_FALSE);
 385  657  }
 386  658  
 387  659  static int
 388  660  dsl_scrub_pause_resume_check(void *arg, dmu_tx_t *tx)
 389  661  {
 390  662          pool_scrub_cmd_t *cmd = arg;
 391  663          dsl_pool_t *dp = dmu_tx_pool(tx);
 392  664          dsl_scan_t *scn = dp->dp_scan;
 393  665  
 394  666          if (*cmd == POOL_SCRUB_PAUSE) {
 395  667                  /* can't pause a scrub when there is no in-progress scrub */
 396  668                  if (!dsl_scan_scrubbing(dp))
 397  669                          return (SET_ERROR(ENOENT));
 398  670  
 399  671                  /* can't pause a paused scrub */
 400  672                  if (dsl_scan_is_paused_scrub(scn))
 401  673                          return (SET_ERROR(EBUSY));
 402  674          } else if (*cmd != POOL_SCRUB_NORMAL) {
 403  675                  return (SET_ERROR(ENOTSUP));
 404  676          }
 405  677  
 406  678          return (0);
 407  679  }
 408  680  
 409  681  static void
 410  682  dsl_scrub_pause_resume_sync(void *arg, dmu_tx_t *tx)
  
[ 33 lines elided ]
  
 411  683  {
 412  684          pool_scrub_cmd_t *cmd = arg;
 413  685          dsl_pool_t *dp = dmu_tx_pool(tx);
 414  686          spa_t *spa = dp->dp_spa;
 415  687          dsl_scan_t *scn = dp->dp_scan;
 416  688  
 417  689          if (*cmd == POOL_SCRUB_PAUSE) {
 418  690                  /* can't pause a scrub when there is no in-progress scrub */
 419  691                  spa->spa_scan_pass_scrub_pause = gethrestime_sec();
 420  692                  scn->scn_phys.scn_flags |= DSF_SCRUB_PAUSED;
 421      -                dsl_scan_sync_state(scn, tx);
 422      -                spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_PAUSED);
      693 +                scn->scn_phys_cached.scn_flags |= DSF_SCRUB_PAUSED;
      694 +                dsl_scan_sync_state(scn, tx, SYNC_CACHED);
 423  695          } else {
 424  696                  ASSERT3U(*cmd, ==, POOL_SCRUB_NORMAL);
 425  697                  if (dsl_scan_is_paused_scrub(scn)) {
 426  698                          /*
 427  699                           * We need to keep track of how much time we spend
 428  700                           * paused per pass so that we can adjust the scrub rate
 429  701                           * shown in the output of 'zpool status'
 430  702                           */
 431  703                          spa->spa_scan_pass_scrub_spent_paused +=
 432  704                              gethrestime_sec() - spa->spa_scan_pass_scrub_pause;
 433  705                          spa->spa_scan_pass_scrub_pause = 0;
 434  706                          scn->scn_phys.scn_flags &= ~DSF_SCRUB_PAUSED;
 435      -                        dsl_scan_sync_state(scn, tx);
      707 +                        scn->scn_phys_cached.scn_flags &= ~DSF_SCRUB_PAUSED;
      708 +                        dsl_scan_sync_state(scn, tx, SYNC_CACHED);
 436  709                  }
 437  710          }
 438  711  }
 439  712  
 440  713  /*
 441  714   * Set scrub pause/resume state if it makes sense to do so
 442  715   */
 443  716  int
 444  717  dsl_scrub_set_pause_resume(const dsl_pool_t *dp, pool_scrub_cmd_t cmd)
 445  718  {
 446  719          return (dsl_sync_task(spa_name(dp->dp_spa),
 447  720              dsl_scrub_pause_resume_check, dsl_scrub_pause_resume_sync, &cmd, 3,
 448  721              ZFS_SPACE_CHECK_RESERVED));
 449  722  }
 450  723  
 451  724  boolean_t
 452  725  dsl_scan_scrubbing(const dsl_pool_t *dp)
 453  726  {
 454  727          dsl_scan_t *scn = dp->dp_scan;
 455  728  
 456      -        if (scn->scn_phys.scn_state == DSS_SCANNING &&
      729 +        if ((scn->scn_phys.scn_state == DSS_SCANNING ||
      730 +            scn->scn_phys.scn_state == DSS_FINISHING) &&
 457  731              scn->scn_phys.scn_func == POOL_SCAN_SCRUB)
 458  732                  return (B_TRUE);
 459  733  
 460  734          return (B_FALSE);
 461  735  }
 462  736  
 463  737  static void dsl_scan_visitbp(blkptr_t *bp, const zbookmark_phys_t *zb,
 464  738      dnode_phys_t *dnp, dsl_dataset_t *ds, dsl_scan_t *scn,
 465  739      dmu_objset_type_t ostype, dmu_tx_t *tx);
 466  740  static void dsl_scan_visitdnode(dsl_scan_t *, dsl_dataset_t *ds,
 467  741      dmu_objset_type_t ostype,
 468  742      dnode_phys_t *dnp, uint64_t object, dmu_tx_t *tx);
 469  743  
 470  744  void
 471  745  dsl_free(dsl_pool_t *dp, uint64_t txg, const blkptr_t *bp)
 472  746  {
 473  747          zio_free(dp->dp_spa, txg, bp);
 474  748  }
 475  749  
 476  750  void
 477  751  dsl_free_sync(zio_t *pio, dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp)
 478  752  {
 479  753          ASSERT(dsl_pool_sync_context(dp));
 480  754          zio_nowait(zio_free_sync(pio, dp->dp_spa, txg, bpp, pio->io_flags));
 481  755  }
  
[ 15 lines elided ]
  
 482  756  
 483  757  static uint64_t
 484  758  dsl_scan_ds_maxtxg(dsl_dataset_t *ds)
 485  759  {
 486  760          uint64_t smt = ds->ds_dir->dd_pool->dp_scan->scn_phys.scn_max_txg;
 487  761          if (ds->ds_is_snapshot)
 488  762                  return (MIN(smt, dsl_dataset_phys(ds)->ds_creation_txg));
 489  763          return (smt);
 490  764  }
 491  765  
      766 +/*
      767 + * This is the dataset processing "queue", i.e. the datasets that are to be
      768 + * scanned for data locations and inserted into the LBA reordering tree.
      769 + * Please note that even though we call this a "queue", the actual
      770 + * implementation uses an avl tree (to detect double insertion). The tree
      771 + * uses the dataset object set number for the sorting criterion, so
      772 + * scan_ds_queue_insert CANNOT be guaranteed to always append stuff at the
      773 + * end (datasets are inserted by the scanner in discovery order, i.e.
      774 + * parent-child relationships). Consequently, the scanner must never step
      775 + * through the AVL tree in a naively sequential fashion using AVL_NEXT.
      776 + * We must always use scan_ds_queue_first to pick the first dataset in the
      777 + * list, process it, remove it using scan_ds_queue_remove and pick the next
      778 + * first dataset, again using scan_ds_queue_first.
      779 + */
      780 +static int
      781 +scan_ds_queue_compar(const void *a, const void *b)
      782 +{
      783 +        const scan_ds_t *sds_a = a, *sds_b = b;
      784 +
      785 +        if (sds_a->sds_dsobj < sds_b->sds_dsobj)
      786 +                return (-1);
      787 +        if (sds_a->sds_dsobj == sds_b->sds_dsobj)
      788 +                return (0);
      789 +        return (1);
      790 +}
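/*
 * Editorial sketch, not part of this changeset: the drain pattern the
 * comment above prescribes. Always re-fetch the head of the queue with
 * scan_ds_queue_first() and remove each entry once it has been processed;
 * never walk the tree with AVL_NEXT. The processing hook named here is
 * hypothetical.
 */
#if 0
static void
scan_ds_queue_drain_sketch(dsl_scan_t *scn, dmu_tx_t *tx)
{
	uint64_t dsobj, txg;

	while (scan_ds_queue_first(scn, &dsobj, &txg)) {
		process_one_dataset(scn, dsobj, txg, tx);	/* hypothetical */
		scan_ds_queue_remove(scn, dsobj);
	}
}
#endif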
      791 +
 492  792  static void
 493      -dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx)
      793 +scan_ds_queue_empty(dsl_scan_t *scn, boolean_t destroy)
 494  794  {
 495      -        VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
 496      -            DMU_POOL_DIRECTORY_OBJECT,
 497      -            DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
 498      -            &scn->scn_phys, tx));
      795 +        void *cookie = NULL;
      796 +        scan_ds_t *sds;
      797 +
      798 +        mutex_enter(&scn->scn_queue_lock);
      799 +        while ((sds = avl_destroy_nodes(&scn->scn_queue, &cookie)) != NULL)
      800 +                kmem_free(sds, sizeof (*sds));
      801 +        mutex_exit(&scn->scn_queue_lock);
      802 +
      803 +        if (destroy)
      804 +                avl_destroy(&scn->scn_queue);
 499  805  }
 500  806  
 501      -extern int zfs_vdev_async_write_active_min_dirty_percent;
      807 +static boolean_t
      808 +scan_ds_queue_contains(dsl_scan_t *scn, uint64_t dsobj, uint64_t *txg)
      809 +{
      810 +        scan_ds_t *sds;
      811 +        scan_ds_t srch = { .sds_dsobj = dsobj };
 502  812  
      813 +        mutex_enter(&scn->scn_queue_lock);
      814 +        sds = avl_find(&scn->scn_queue, &srch, NULL);
      815 +        if (sds != NULL && txg != NULL)
      816 +                *txg = sds->sds_txg;
      817 +        mutex_exit(&scn->scn_queue_lock);
      818 +
      819 +        return (sds != NULL);
      820 +}
      821 +
      822 +static int
      823 +scan_ds_queue_insert(dsl_scan_t *scn, uint64_t dsobj, uint64_t txg)
      824 +{
      825 +        scan_ds_t *sds;
      826 +        avl_index_t where;
      827 +
      828 +        sds = kmem_zalloc(sizeof (*sds), KM_SLEEP);
      829 +        sds->sds_dsobj = dsobj;
      830 +        sds->sds_txg = txg;
      831 +
      832 +        mutex_enter(&scn->scn_queue_lock);
      833 +        if (avl_find(&scn->scn_queue, sds, &where) != NULL) {
      834 +                kmem_free(sds, sizeof (*sds));
      835 +                return (EEXIST);
      836 +        }
      837 +        avl_insert(&scn->scn_queue, sds, where);
      838 +        mutex_exit(&scn->scn_queue_lock);
      839 +
      840 +        return (0);
      841 +}
      842 +
      843 +static void
      844 +scan_ds_queue_remove(dsl_scan_t *scn, uint64_t dsobj)
      845 +{
      846 +        scan_ds_t srch, *sds;
      847 +
      848 +        srch.sds_dsobj = dsobj;
      849 +
      850 +        mutex_enter(&scn->scn_queue_lock);
      851 +        sds = avl_find(&scn->scn_queue, &srch, NULL);
      852 +        VERIFY(sds != NULL);
      853 +        avl_remove(&scn->scn_queue, sds);
      854 +        mutex_exit(&scn->scn_queue_lock);
      855 +
      856 +        kmem_free(sds, sizeof (*sds));
      857 +}
      858 +
 503  859  static boolean_t
      860 +scan_ds_queue_first(dsl_scan_t *scn, uint64_t *dsobj, uint64_t *txg)
      861 +{
      862 +        scan_ds_t *sds;
      863 +
      864 +        mutex_enter(&scn->scn_queue_lock);
      865 +        sds = avl_first(&scn->scn_queue);
      866 +        if (sds != NULL) {
      867 +                *dsobj = sds->sds_dsobj;
      868 +                *txg = sds->sds_txg;
      869 +        }
      870 +        mutex_exit(&scn->scn_queue_lock);
      871 +
      872 +        return (sds != NULL);
      873 +}
      874 +
      875 +static void
      876 +scan_ds_queue_sync(dsl_scan_t *scn, dmu_tx_t *tx)
      877 +{
      878 +        dsl_pool_t *dp = scn->scn_dp;
      879 +        spa_t *spa = dp->dp_spa;
      880 +        dmu_object_type_t ot = (spa_version(spa) >= SPA_VERSION_DSL_SCRUB) ?
      881 +            DMU_OT_SCAN_QUEUE : DMU_OT_ZAP_OTHER;
      882 +
      883 +        ASSERT0(scn->scn_bytes_pending);
      884 +        ASSERT(scn->scn_phys.scn_queue_obj != 0);
      885 +
      886 +        VERIFY0(dmu_object_free(dp->dp_meta_objset,
      887 +            scn->scn_phys.scn_queue_obj, tx));
      888 +        scn->scn_phys.scn_queue_obj = zap_create(dp->dp_meta_objset, ot,
      889 +            DMU_OT_NONE, 0, tx);
      890 +
      891 +        mutex_enter(&scn->scn_queue_lock);
      892 +        for (scan_ds_t *sds = avl_first(&scn->scn_queue);
      893 +            sds != NULL; sds = AVL_NEXT(&scn->scn_queue, sds)) {
      894 +                VERIFY0(zap_add_int_key(dp->dp_meta_objset,
      895 +                    scn->scn_phys.scn_queue_obj, sds->sds_dsobj,
      896 +                    sds->sds_txg, tx));
      897 +        }
      898 +        mutex_exit(&scn->scn_queue_lock);
      899 +}
      900 +
      901 +/*
      902 + * Writes out a persistent dsl_scan_phys_t record to the pool directory.
      903 + * Because we can be running in the block sorting algorithm, we do not always
      904 + * want to write out the record, only when it is "safe" to do so. This safety
      905 + * condition is achieved by making sure that the sorting queues are empty
      906 + * (scn_bytes_pending==0). The sync'ed state could be inconsistent with how
      907 + * much actual scanning progress has been made. What kind of sync is performed
       908 + * is specified by the sync_type argument. If the sync is optional, we only
      909 + * sync if the queues are empty. If the sync is mandatory, we do a hard VERIFY
      910 + * to make sure that the queues are empty. The third possible state is a
      911 + * "cached" sync. This is done in response to:
      912 + * 1) The dataset that was in the last sync'ed dsl_scan_phys_t having been
      913 + *      destroyed, so we wouldn't be able to restart scanning from it.
      914 + * 2) The snapshot that was in the last sync'ed dsl_scan_phys_t having been
      915 + *      superseded by a newer snapshot.
      916 + * 3) The dataset that was in the last sync'ed dsl_scan_phys_t having been
      917 + *      swapped with its clone.
      918 + * In all cases, a cached sync simply rewrites the last record we've written,
      919 + * just slightly modified. For the modifications that are performed to the
      920 + * last written dsl_scan_phys_t, see dsl_scan_ds_destroyed,
      921 + * dsl_scan_ds_snapshotted and dsl_scan_ds_clone_swapped.
      922 + */
      923 +static void
      924 +dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx, state_sync_type_t sync_type)
      925 +{
      926 +        mutex_enter(&scn->scn_status_lock);
      927 +        ASSERT(sync_type != SYNC_MANDATORY || scn->scn_bytes_pending == 0);
      928 +        if (scn->scn_bytes_pending == 0) {
      929 +                if (scn->scn_phys.scn_queue_obj != 0)
      930 +                        scan_ds_queue_sync(scn, tx);
      931 +                VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
      932 +                    DMU_POOL_DIRECTORY_OBJECT,
      933 +                    DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
      934 +                    &scn->scn_phys, tx));
      935 +                bcopy(&scn->scn_phys, &scn->scn_phys_cached,
      936 +                    sizeof (scn->scn_phys));
      937 +                scn->scn_checkpointing = B_FALSE;
      938 +                scn->scn_last_checkpoint = ddi_get_lbolt();
      939 +        } else if (sync_type == SYNC_CACHED) {
      940 +                VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
      941 +                    DMU_POOL_DIRECTORY_OBJECT,
      942 +                    DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
      943 +                    &scn->scn_phys_cached, tx));
      944 +        }
      945 +        mutex_exit(&scn->scn_status_lock);
      946 +}
      947 +
      948 +static boolean_t
 504  949  dsl_scan_check_suspend(dsl_scan_t *scn, const zbookmark_phys_t *zb)
 505  950  {
 506  951          /* we never skip user/group accounting objects */
 507  952          if (zb && (int64_t)zb->zb_object < 0)
 508  953                  return (B_FALSE);
 509  954  
 510  955          if (scn->scn_suspending)
 511  956                  return (B_TRUE); /* we're already suspending */
 512  957  
 513  958          if (!ZB_IS_ZERO(&scn->scn_phys.scn_bookmark))
 514  959                  return (B_FALSE); /* we're resuming */
 515  960  
 516  961          /* We only know how to resume from level-0 blocks. */
 517  962          if (zb && zb->zb_level != 0)
 518  963                  return (B_FALSE);
 519  964  
 520  965          /*
 521  966           * We suspend if:
 522  967           *  - we have scanned for the maximum time: an entire txg
  
[ 9 lines elided ]
  
 523  968           *    timeout (default 5 sec)
 524  969           *  or
 525  970           *  - we have scanned for at least the minimum time (default 1 sec
 526  971           *    for scrub, 3 sec for resilver), and either we have sufficient
 527  972           *    dirty data that we are starting to write more quickly
 528  973           *    (default 30%), or someone is explicitly waiting for this txg
 529  974           *    to complete.
 530  975           *  or
 531  976           *  - the spa is shutting down because this pool is being exported
 532  977           *    or the machine is rebooting.
      978 +         *  or
      979 +         *  - the scan queue has reached its memory use limit
 533  980           */
 534  981          int mintime = (scn->scn_phys.scn_func == POOL_SCAN_RESILVER) ?
 535  982              zfs_resilver_min_time_ms : zfs_scan_min_time_ms;
 536  983          uint64_t elapsed_nanosecs = gethrtime() - scn->scn_sync_start_time;
 537  984          int dirty_pct = scn->scn_dp->dp_dirty_total * 100 / zfs_dirty_data_max;
 538  985          if (elapsed_nanosecs / NANOSEC >= zfs_txg_timeout ||
 539  986              (NSEC2MSEC(elapsed_nanosecs) > mintime &&
 540  987              (txg_sync_waiting(scn->scn_dp) ||
 541  988              dirty_pct >= zfs_vdev_async_write_active_min_dirty_percent)) ||
 542      -            spa_shutting_down(scn->scn_dp->dp_spa)) {
      989 +            spa_shutting_down(scn->scn_dp->dp_spa) || scn->scn_clearing ||
      990 +            scan_io_queue_mem_lim(scn) == MEM_LIM_HARD) {
 543  991                  if (zb) {
      992 +                        DTRACE_PROBE1(scan_pause, zbookmark_phys_t *, zb);
 544  993                          dprintf("suspending at bookmark %llx/%llx/%llx/%llx\n",
 545  994                              (longlong_t)zb->zb_objset,
 546  995                              (longlong_t)zb->zb_object,
 547  996                              (longlong_t)zb->zb_level,
 548  997                              (longlong_t)zb->zb_blkid);
 549  998                          scn->scn_phys.scn_bookmark = *zb;
      999 +                } else {
     1000 +                        DTRACE_PROBE1(scan_pause_ddt, ddt_bookmark_t *,
     1001 +                            &scn->scn_phys.scn_ddt_bookmark);
     1002 +                        dprintf("pausing at DDT bookmark %llx/%llx/%llx/%llx\n",
     1003 +                            (longlong_t)scn->scn_phys.scn_ddt_bookmark.
     1004 +                            ddb_class,
     1005 +                            (longlong_t)scn->scn_phys.scn_ddt_bookmark.
     1006 +                            ddb_type,
     1007 +                            (longlong_t)scn->scn_phys.scn_ddt_bookmark.
     1008 +                            ddb_checksum,
     1009 +                            (longlong_t)scn->scn_phys.scn_ddt_bookmark.
     1010 +                            ddb_cursor);
 550 1011                  }
 551 1012                  dprintf("suspending at DDT bookmark %llx/%llx/%llx/%llx\n",
 552 1013                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_class,
 553 1014                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_type,
 554 1015                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_checksum,
 555 1016                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_cursor);
 556 1017                  scn->scn_suspending = B_TRUE;
 557 1018                  return (B_TRUE);
 558 1019          }
 559 1020          return (B_FALSE);
 560 1021  }
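
Reduced to a predicate, the check above suspends the scan on a hard per-txg timeout, on a softer minimum-time test gated by dirty-data pressure or a waiting txg sync, on pool shutdown, and on the two new triggers: scn_clearing and the scan queue hitting its hard memory limit. A standalone sketch with plain integers standing in for the dsl_scan_t fields (the helper name is hypothetical):

    #include <stdbool.h>

    /*
     * Hypothetical boiled-down version of the suspend decision; the arguments
     * stand in for the fields read by dsl_scan_check_suspend().
     */
    static bool
    scan_should_suspend(unsigned long long elapsed_ns, int min_time_ms,
        int txg_timeout_sec, bool txg_sync_waiting, int dirty_pct,
        int dirty_pct_limit, bool spa_shutting_down, bool scn_clearing,
        bool mem_limit_hard)
    {
            bool hard_timeout = elapsed_ns / 1000000000ULL >=
                (unsigned long long)txg_timeout_sec;
            bool soft_limit = elapsed_ns / 1000000ULL >
                (unsigned long long)min_time_ms &&
                (txg_sync_waiting || dirty_pct >= dirty_pct_limit);

            return (hard_timeout || soft_limit || spa_shutting_down ||
                scn_clearing || mem_limit_hard);
    }

For a resilver with the defaults quoted in the comment, min_time_ms would correspond to zfs_resilver_min_time_ms (3 sec) and txg_timeout_sec to zfs_txg_timeout (5 sec).
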
 561 1022  
 562 1023  typedef struct zil_scan_arg {
 563 1024          dsl_pool_t      *zsa_dp;
 564 1025          zil_header_t    *zsa_zh;
 565 1026  } zil_scan_arg_t;
 566 1027  
 567 1028  /* ARGSUSED */
 568 1029  static int
 569 1030  dsl_scan_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
 570 1031  {
 571 1032          zil_scan_arg_t *zsa = arg;
 572 1033          dsl_pool_t *dp = zsa->zsa_dp;
 573 1034          dsl_scan_t *scn = dp->dp_scan;
 574 1035          zil_header_t *zh = zsa->zsa_zh;
 575 1036          zbookmark_phys_t zb;
 576 1037  
 577 1038          if (BP_IS_HOLE(bp) || bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
 578 1039                  return (0);
 579 1040  
 580 1041          /*
 581 1042           * One block ("stubby") can be allocated a long time ago; we
 582 1043           * want to visit that one because it has been allocated
 583 1044           * (on-disk) even if it hasn't been claimed (even though for
 584 1045           * scrub there's nothing to do to it).
 585 1046           */
 586 1047          if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(dp->dp_spa))
 587 1048                  return (0);
 588 1049  
 589 1050          SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
 590 1051              ZB_ZIL_OBJECT, ZB_ZIL_LEVEL, bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);
 591 1052  
 592 1053          VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
 593 1054          return (0);
 594 1055  }
 595 1056  
 596 1057  /* ARGSUSED */
 597 1058  static int
 598 1059  dsl_scan_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
 599 1060  {
 600 1061          if (lrc->lrc_txtype == TX_WRITE) {
 601 1062                  zil_scan_arg_t *zsa = arg;
 602 1063                  dsl_pool_t *dp = zsa->zsa_dp;
 603 1064                  dsl_scan_t *scn = dp->dp_scan;
 604 1065                  zil_header_t *zh = zsa->zsa_zh;
 605 1066                  lr_write_t *lr = (lr_write_t *)lrc;
 606 1067                  blkptr_t *bp = &lr->lr_blkptr;
 607 1068                  zbookmark_phys_t zb;
 608 1069  
 609 1070                  if (BP_IS_HOLE(bp) ||
 610 1071                      bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
 611 1072                          return (0);
 612 1073  
 613 1074                  /*
 614 1075                   * birth can be < claim_txg if this record's txg is
 615 1076                   * already txg sync'ed (but this log block contains
 616 1077                   * other records that are not synced)
 617 1078                   */
 618 1079                  if (claim_txg == 0 || bp->blk_birth < claim_txg)
 619 1080                          return (0);
 620 1081  
 621 1082                  SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
 622 1083                      lr->lr_foid, ZB_ZIL_LEVEL,
 623 1084                      lr->lr_offset / BP_GET_LSIZE(bp));
 624 1085  
 625 1086                  VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
 626 1087          }
 627 1088          return (0);
 628 1089  }
 629 1090  
 630 1091  static void
 631 1092  dsl_scan_zil(dsl_pool_t *dp, zil_header_t *zh)
 632 1093  {
 633 1094          uint64_t claim_txg = zh->zh_claim_txg;
 634 1095          zil_scan_arg_t zsa = { dp, zh };
 635 1096          zilog_t *zilog;
 636 1097  
 637 1098          /*
 638 1099           * We only want to visit blocks that have been claimed but not yet
 639 1100           * replayed (or, in read-only mode, blocks that *would* be claimed).
 640 1101           */
 641 1102          if (claim_txg == 0 && spa_writeable(dp->dp_spa))
 642 1103                  return;
 643 1104  
 644 1105          zilog = zil_alloc(dp->dp_meta_objset, zh);
 645 1106  
 646 1107          (void) zil_parse(zilog, dsl_scan_zil_block, dsl_scan_zil_record, &zsa,
 647 1108              claim_txg);
 648 1109  
 649 1110          zil_free(zilog);
 650 1111  }
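
The two ZIL callbacks encode which log blocks and write records are worth handing to the scan functions. Reduced to predicates (a sketch with hypothetical names; 0 stands for a hole or an unclaimed log, as in the callers above):

    #include <stdbool.h>

    /* Hypothetical reduction of dsl_scan_zil_block(). */
    static bool
    zil_block_wanted(unsigned long long birth, unsigned long long cur_min_txg,
        unsigned long long claim_txg, unsigned long long spa_first_txg)
    {
            if (birth == 0 || birth <= cur_min_txg)
                    return (false);
            /*
             * If the ZIL was never claimed, only an old "stubby" block born
             * before this pool's first txg is still worth visiting.
             */
            if (claim_txg == 0 && birth >= spa_first_txg)
                    return (false);
            return (true);
    }

    /* Hypothetical reduction of dsl_scan_zil_record() for TX_WRITE records. */
    static bool
    zil_record_wanted(unsigned long long birth, unsigned long long cur_min_txg,
        unsigned long long claim_txg)
    {
            if (birth == 0 || birth <= cur_min_txg)
                    return (false);
            /* Records whose txg was already synced out are not revisited. */
            return (claim_txg != 0 && birth >= claim_txg);
    }
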
 651 1112  
 652 1113  /* ARGSUSED */
 653 1114  static void
 654 1115  dsl_scan_prefetch(dsl_scan_t *scn, arc_buf_t *buf, blkptr_t *bp,
 655 1116      uint64_t objset, uint64_t object, uint64_t blkid)
 656 1117  {
 657 1118          zbookmark_phys_t czb;
 658 1119          arc_flags_t flags = ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH;
 659 1120  
 660 1121          if (zfs_no_scrub_prefetch)
 661 1122                  return;
 662 1123  
 663 1124          if (BP_IS_HOLE(bp) || bp->blk_birth <= scn->scn_phys.scn_min_txg ||
 664 1125              (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_DNODE))
 665 1126                  return;
 666 1127  
 667 1128          SET_BOOKMARK(&czb, objset, object, BP_GET_LEVEL(bp), blkid);
 668 1129  
 669 1130          (void) arc_read(scn->scn_zio_root, scn->scn_dp->dp_spa, bp,
 670 1131              NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
 671 1132              ZIO_FLAG_CANFAIL | ZIO_FLAG_SCAN_THREAD, &flags, &czb);
 672 1133  }
 673 1134  
 674 1135  static boolean_t
 675 1136  dsl_scan_check_resume(dsl_scan_t *scn, const dnode_phys_t *dnp,
 676 1137      const zbookmark_phys_t *zb)
 677 1138  {
 678 1139          /*
 679 1140           * We never skip over user/group accounting objects (obj<0)
 680 1141           */
 681 1142          if (!ZB_IS_ZERO(&scn->scn_phys.scn_bookmark) &&
 682 1143              (int64_t)zb->zb_object >= 0) {
 683 1144                  /*
 684 1145                   * If we already visited this bp & everything below (in
 685 1146                   * a prior txg sync), don't bother doing it again.
 686 1147                   */
 687 1148                  if (zbookmark_subtree_completed(dnp, zb,
 688 1149                      &scn->scn_phys.scn_bookmark))
  
     129 lines elided
  
 689 1150                          return (B_TRUE);
 690 1151  
 691 1152                  /*
 692 1153                   * If we found the block we're trying to resume from, or
 693 1154                   * we went past it to a different object, zero it out to
 694 1155                   * indicate that it's OK to start checking for suspending
 695 1156                   * again.
 696 1157                   */
 697 1158                  if (bcmp(zb, &scn->scn_phys.scn_bookmark, sizeof (*zb)) == 0 ||
 698 1159                      zb->zb_object > scn->scn_phys.scn_bookmark.zb_object) {
     1160 +                        DTRACE_PROBE1(scan_resume, zbookmark_phys_t *, zb);
 699 1161                          dprintf("resuming at %llx/%llx/%llx/%llx\n",
 700 1162                              (longlong_t)zb->zb_objset,
 701 1163                              (longlong_t)zb->zb_object,
 702 1164                              (longlong_t)zb->zb_level,
 703 1165                              (longlong_t)zb->zb_blkid);
 704 1166                          bzero(&scn->scn_phys.scn_bookmark, sizeof (*zb));
 705 1167                  }
 706 1168          }
 707 1169          return (B_FALSE);
 708 1170  }
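
dsl_scan_check_resume() is the mirror image of the suspend path: skip everything that precedes the saved bookmark, and clear the bookmark once the traversal reaches or passes it so that suspend checks can fire again. A flattened sketch (toy_zb_t and subtree_done are hypothetical stand-ins for zbookmark_phys_t and zbookmark_subtree_completed()):

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical flattened bookmark: (objset, object, level, blkid). */
    typedef struct { long long objset, object, level, blkid; } toy_zb_t;

    static bool
    toy_check_resume(toy_zb_t *saved, const toy_zb_t *zb, bool subtree_done)
    {
            static const toy_zb_t zero;

            /* Not resuming, or a user/group accounting object: never skip. */
            if (memcmp(saved, &zero, sizeof (zero)) == 0 || zb->object < 0)
                    return (false);
            if (subtree_done)
                    return (true);          /* visited in a prior txg; skip */
            if (memcmp(zb, saved, sizeof (*zb)) == 0 ||
                zb->object > saved->object) {
                    /* Resume point reached or passed; allow suspending again. */
                    (void) memset(saved, 0, sizeof (*saved));
            }
            return (false);
    }
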
 709 1171  
 710 1172  /*
 711 1173   * Return nonzero on i/o error.
 712 1174   * Return new buf to write out in *bufp.
 713 1175   */
 714 1176  static int
 715 1177  dsl_scan_recurse(dsl_scan_t *scn, dsl_dataset_t *ds, dmu_objset_type_t ostype,
 716 1178      dnode_phys_t *dnp, const blkptr_t *bp,
 717 1179      const zbookmark_phys_t *zb, dmu_tx_t *tx)
 718 1180  {
 719 1181          dsl_pool_t *dp = scn->scn_dp;
 720 1182          int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SCAN_THREAD;
 721 1183          int err;
 722 1184  
  
     14 lines elided
  
 723 1185          if (BP_GET_LEVEL(bp) > 0) {
 724 1186                  arc_flags_t flags = ARC_FLAG_WAIT;
 725 1187                  int i;
 726 1188                  blkptr_t *cbp;
 727 1189                  int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
 728 1190                  arc_buf_t *buf;
 729 1191  
 730 1192                  err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
 731 1193                      ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
 732 1194                  if (err) {
 733      -                        scn->scn_phys.scn_errors++;
     1195 +                        atomic_inc_64(&scn->scn_phys.scn_errors);
 734 1196                          return (err);
 735 1197                  }
 736 1198                  for (i = 0, cbp = buf->b_data; i < epb; i++, cbp++) {
 737 1199                          dsl_scan_prefetch(scn, buf, cbp, zb->zb_objset,
 738 1200                              zb->zb_object, zb->zb_blkid * epb + i);
 739 1201                  }
 740 1202                  for (i = 0, cbp = buf->b_data; i < epb; i++, cbp++) {
 741 1203                          zbookmark_phys_t czb;
 742 1204  
 743 1205                          SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
 744 1206                              zb->zb_level - 1,
 745 1207                              zb->zb_blkid * epb + i);
 746 1208                          dsl_scan_visitbp(cbp, &czb, dnp,
 747 1209                              ds, scn, ostype, tx);
 748 1210                  }
 749 1211                  arc_buf_destroy(buf, &buf);
  
     6 lines elided
  
 750 1212          } else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
 751 1213                  arc_flags_t flags = ARC_FLAG_WAIT;
 752 1214                  dnode_phys_t *cdnp;
 753 1215                  int i, j;
 754 1216                  int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
 755 1217                  arc_buf_t *buf;
 756 1218  
 757 1219                  err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
 758 1220                      ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
 759 1221                  if (err) {
 760      -                        scn->scn_phys.scn_errors++;
     1222 +                        atomic_inc_64(&scn->scn_phys.scn_errors);
 761 1223                          return (err);
 762 1224                  }
 763 1225                  for (i = 0, cdnp = buf->b_data; i < epb; i++, cdnp++) {
 764 1226                          for (j = 0; j < cdnp->dn_nblkptr; j++) {
 765 1227                                  blkptr_t *cbp = &cdnp->dn_blkptr[j];
 766 1228                                  dsl_scan_prefetch(scn, buf, cbp,
 767 1229                                      zb->zb_objset, zb->zb_blkid * epb + i, j);
 768 1230                          }
 769 1231                  }
 770 1232                  for (i = 0, cdnp = buf->b_data; i < epb; i++, cdnp++) {
 771 1233                          dsl_scan_visitdnode(scn, ds, ostype,
 772 1234                              cdnp, zb->zb_blkid * epb + i, tx);
 773 1235                  }
  
     3 lines elided
  
 774 1236  
 775 1237                  arc_buf_destroy(buf, &buf);
 776 1238          } else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
 777 1239                  arc_flags_t flags = ARC_FLAG_WAIT;
 778 1240                  objset_phys_t *osp;
 779 1241                  arc_buf_t *buf;
 780 1242  
 781 1243                  err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
 782 1244                      ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
 783 1245                  if (err) {
 784      -                        scn->scn_phys.scn_errors++;
     1246 +                        atomic_inc_64(&scn->scn_phys.scn_errors);
 785 1247                          return (err);
 786 1248                  }
 787 1249  
 788 1250                  osp = buf->b_data;
 789 1251  
 790 1252                  dsl_scan_visitdnode(scn, ds, osp->os_type,
 791 1253                      &osp->os_meta_dnode, DMU_META_DNODE_OBJECT, tx);
 792 1254  
 793 1255                  if (OBJSET_BUF_HAS_USERUSED(buf)) {
 794 1256                          /*
 795 1257                           * We also always visit user/group accounting
 796 1258                           * objects, and never skip them, even if we are
 797 1259                           * suspending.  This is necessary so that the space
 798 1260                           * deltas from this txg get integrated.
 799 1261                           */
 800 1262                          dsl_scan_visitdnode(scn, ds, osp->os_type,
 801 1263                              &osp->os_groupused_dnode,
 802 1264                              DMU_GROUPUSED_OBJECT, tx);
 803 1265                          dsl_scan_visitdnode(scn, ds, osp->os_type,
 804 1266                              &osp->os_userused_dnode,
 805 1267                              DMU_USERUSED_OBJECT, tx);
 806 1268                  }
 807 1269                  arc_buf_destroy(buf, &buf);
 808 1270          }
 809 1271  
 810 1272          return (0);
 811 1273  }
 812 1274  
 813 1275  static void
 814 1276  dsl_scan_visitdnode(dsl_scan_t *scn, dsl_dataset_t *ds,
 815 1277      dmu_objset_type_t ostype, dnode_phys_t *dnp,
 816 1278      uint64_t object, dmu_tx_t *tx)
 817 1279  {
 818 1280          int j;
 819 1281  
 820 1282          for (j = 0; j < dnp->dn_nblkptr; j++) {
 821 1283                  zbookmark_phys_t czb;
 822 1284  
 823 1285                  SET_BOOKMARK(&czb, ds ? ds->ds_object : 0, object,
 824 1286                      dnp->dn_nlevels - 1, j);
 825 1287                  dsl_scan_visitbp(&dnp->dn_blkptr[j],
 826 1288                      &czb, dnp, ds, scn, ostype, tx);
 827 1289          }
 828 1290  
 829 1291          if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
 830 1292                  zbookmark_phys_t czb;
 831 1293                  SET_BOOKMARK(&czb, ds ? ds->ds_object : 0, object,
 832 1294                      0, DMU_SPILL_BLKID);
 833 1295                  dsl_scan_visitbp(&dnp->dn_spill,
 834 1296                      &czb, dnp, ds, scn, ostype, tx);
 835 1297          }
 836 1298  }
 837 1299  
 838 1300  /*
 839 1301   * The arguments are in this order because mdb can only print the
 840 1302   * first 5; we want them to be useful.
 841 1303   */
 842 1304  static void
 843 1305  dsl_scan_visitbp(blkptr_t *bp, const zbookmark_phys_t *zb,
 844 1306      dnode_phys_t *dnp, dsl_dataset_t *ds, dsl_scan_t *scn,
 845 1307      dmu_objset_type_t ostype, dmu_tx_t *tx)
 846 1308  {
 847 1309          dsl_pool_t *dp = scn->scn_dp;
 848 1310          arc_buf_t *buf = NULL;
 849 1311          blkptr_t bp_toread = *bp;
 850 1312  
 851 1313          /* ASSERT(pbuf == NULL || arc_released(pbuf)); */
 852 1314  
 853 1315          if (dsl_scan_check_suspend(scn, zb))
  
     59 lines elided
  
 854 1316                  return;
 855 1317  
 856 1318          if (dsl_scan_check_resume(scn, dnp, zb))
 857 1319                  return;
 858 1320  
 859 1321          if (BP_IS_HOLE(bp))
 860 1322                  return;
 861 1323  
 862 1324          scn->scn_visited_this_txg++;
 863 1325  
     1326 +#ifdef  _KERNEL
     1327 +        DTRACE_PROBE7(scan_visitbp, blkptr_t *, bp, zbookmark_phys_t *, zb,
     1328 +            dnode_phys_t *, dnp, dsl_dataset_t *, ds, dsl_scan_t *, scn,
     1329 +            dmu_objset_type_t, ostype, dmu_tx_t *, tx);
     1330 +#endif  /* _KERNEL */
 864 1331          dprintf_bp(bp,
 865 1332              "visiting ds=%p/%llu zb=%llx/%llx/%llx/%llx bp=%p",
 866 1333              ds, ds ? ds->ds_object : 0,
 867 1334              zb->zb_objset, zb->zb_object, zb->zb_level, zb->zb_blkid,
 868 1335              bp);
 869 1336  
 870 1337          if (bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
 871 1338                  return;
 872 1339  
 873 1340          if (dsl_scan_recurse(scn, ds, ostype, dnp, &bp_toread, zb, tx) != 0)
 874 1341                  return;
 875 1342  
 876 1343          /*
 877 1344           * If dsl_scan_ddt() has already visited this block, it will have
 878 1345           * already done any translations or scrubbing, so don't call the
 879 1346           * callback again.
 880 1347           */
 881 1348          if (ddt_class_contains(dp->dp_spa,
 882 1349              scn->scn_phys.scn_ddt_class_max, bp)) {
 883 1350                  ASSERT(buf == NULL);
 884 1351                  return;
 885 1352          }
 886 1353  
 887 1354          /*
 888 1355           * If this block is from the future (after cur_max_txg), then we
 889 1356           * are doing this on behalf of a deleted snapshot, and we will
 890 1357           * revisit the future block on the next pass of this dataset.
 891 1358           * Don't scan it now unless we need to because something
 892 1359           * under it was modified.
 893 1360           */
 894 1361          if (BP_PHYSICAL_BIRTH(bp) <= scn->scn_phys.scn_cur_max_txg) {
 895 1362                  scan_funcs[scn->scn_phys.scn_func](dp, bp, zb);
 896 1363          }
 897 1364  }
 898 1365  
 899 1366  static void
  
     26 lines elided
  
 900 1367  dsl_scan_visit_rootbp(dsl_scan_t *scn, dsl_dataset_t *ds, blkptr_t *bp,
 901 1368      dmu_tx_t *tx)
 902 1369  {
 903 1370          zbookmark_phys_t zb;
 904 1371  
 905 1372          SET_BOOKMARK(&zb, ds ? ds->ds_object : DMU_META_OBJSET,
 906 1373              ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
 907 1374          dsl_scan_visitbp(bp, &zb, NULL,
 908 1375              ds, scn, DMU_OST_NONE, tx);
 909 1376  
     1377 +        DTRACE_PROBE4(scan_finished, dsl_scan_t *, scn, dsl_dataset_t *, ds,
     1378 +            blkptr_t *, bp, dmu_tx_t *, tx);
 910 1379          dprintf_ds(ds, "finished scan%s", "");
 911 1380  }
 912 1381  
 913      -void
 914      -dsl_scan_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
     1382 +static void
     1383 +ds_destroyed_scn_phys(dsl_dataset_t *ds, dsl_scan_phys_t *scn_phys)
 915 1384  {
 916      -        dsl_pool_t *dp = ds->ds_dir->dd_pool;
 917      -        dsl_scan_t *scn = dp->dp_scan;
 918      -        uint64_t mintxg;
 919      -
 920      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
 921      -                return;
 922      -
 923      -        if (scn->scn_phys.scn_bookmark.zb_objset == ds->ds_object) {
     1385 +        if (scn_phys->scn_bookmark.zb_objset == ds->ds_object) {
 924 1386                  if (ds->ds_is_snapshot) {
 925 1387                          /*
 926 1388                           * Note:
 927 1389                           *  - scn_cur_{min,max}_txg stays the same.
 928 1390                           *  - Setting the flag is not really necessary if
 929 1391                           *    scn_cur_max_txg == scn_max_txg, because there
 930 1392                           *    is nothing after this snapshot that we care
 931 1393                           *    about.  However, we set it anyway and then
 932 1394                           *    ignore it when we retraverse it in
 933 1395                           *    dsl_scan_visitds().
 934 1396                           */
 935      -                        scn->scn_phys.scn_bookmark.zb_objset =
     1397 +                        scn_phys->scn_bookmark.zb_objset =
 936 1398                              dsl_dataset_phys(ds)->ds_next_snap_obj;
 937 1399                          zfs_dbgmsg("destroying ds %llu; currently traversing; "
 938 1400                              "reset zb_objset to %llu",
 939 1401                              (u_longlong_t)ds->ds_object,
 940 1402                              (u_longlong_t)dsl_dataset_phys(ds)->
 941 1403                              ds_next_snap_obj);
 942      -                        scn->scn_phys.scn_flags |= DSF_VISIT_DS_AGAIN;
     1404 +                        scn_phys->scn_flags |= DSF_VISIT_DS_AGAIN;
 943 1405                  } else {
 944      -                        SET_BOOKMARK(&scn->scn_phys.scn_bookmark,
     1406 +                        SET_BOOKMARK(&scn_phys->scn_bookmark,
 945 1407                              ZB_DESTROYED_OBJSET, 0, 0, 0);
 946 1408                          zfs_dbgmsg("destroying ds %llu; currently traversing; "
 947 1409                              "reset bookmark to -1,0,0,0",
 948 1410                              (u_longlong_t)ds->ds_object);
 949 1411                  }
 950      -        } else if (zap_lookup_int_key(dp->dp_meta_objset,
 951      -            scn->scn_phys.scn_queue_obj, ds->ds_object, &mintxg) == 0) {
     1412 +        }
     1413 +}
     1414 +
     1415 +/*
     1416 + * Invoked when a dataset is destroyed. We need to make sure that:
     1417 + *
     1418 + * 1) If it is the dataset that was currently being scanned, we write
      1419 + *      a new dsl_scan_phys_t, marking the objset reference in it
     1420 + *      as destroyed.
     1421 + * 2) Remove it from the work queue, if it was present.
     1422 + *
     1423 + * If the dataset was actually a snapshot, instead of marking the dataset
     1424 + * as destroyed, we instead substitute the next snapshot in line.
     1425 + */
     1426 +void
     1427 +dsl_scan_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
     1428 +{
     1429 +        dsl_pool_t *dp = ds->ds_dir->dd_pool;
     1430 +        dsl_scan_t *scn = dp->dp_scan;
     1431 +        uint64_t mintxg;
     1432 +
     1433 +        if (!dsl_scan_is_running(scn))
     1434 +                return;
     1435 +
     1436 +        ds_destroyed_scn_phys(ds, &scn->scn_phys);
     1437 +        ds_destroyed_scn_phys(ds, &scn->scn_phys_cached);
     1438 +
     1439 +        if (scan_ds_queue_contains(scn, ds->ds_object, &mintxg)) {
     1440 +                scan_ds_queue_remove(scn, ds->ds_object);
     1441 +                if (ds->ds_is_snapshot) {
     1442 +                        VERIFY0(scan_ds_queue_insert(scn,
     1443 +                            dsl_dataset_phys(ds)->ds_next_snap_obj, mintxg));
     1444 +                }
     1445 +        }
     1446 +
     1447 +        if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
     1448 +            ds->ds_object, &mintxg) == 0) {
     1449 +                DTRACE_PROBE3(scan_ds_destroyed__in_queue,
     1450 +                    dsl_scan_t *, scn, dsl_dataset_t *, ds, dmu_tx_t *, tx);
 952 1451                  ASSERT3U(dsl_dataset_phys(ds)->ds_num_children, <=, 1);
 953 1452                  VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
 954 1453                      scn->scn_phys.scn_queue_obj, ds->ds_object, tx));
 955 1454                  if (ds->ds_is_snapshot) {
 956 1455                          /*
 957 1456                           * We keep the same mintxg; it could be >
 958 1457                           * ds_creation_txg if the previous snapshot was
 959 1458                           * deleted too.
 960 1459                           */
 961 1460                          VERIFY(zap_add_int_key(dp->dp_meta_objset,
 962 1461                              scn->scn_phys.scn_queue_obj,
 963 1462                              dsl_dataset_phys(ds)->ds_next_snap_obj,
 964 1463                              mintxg, tx) == 0);
 965 1464                          zfs_dbgmsg("destroying ds %llu; in queue; "
 966 1465                              "replacing with %llu",
 967 1466                              (u_longlong_t)ds->ds_object,
 968 1467                              (u_longlong_t)dsl_dataset_phys(ds)->
 969 1468                              ds_next_snap_obj);
  
     8 lines elided
  
 970 1469                  } else {
 971 1470                          zfs_dbgmsg("destroying ds %llu; in queue; removing",
 972 1471                              (u_longlong_t)ds->ds_object);
 973 1472                  }
 974 1473          }
 975 1474  
 976 1475          /*
 977 1476           * dsl_scan_sync() should be called after this, and should sync
 978 1477           * out our changed state, but just to be safe, do it here.
 979 1478           */
 980      -        dsl_scan_sync_state(scn, tx);
     1479 +        dsl_scan_sync_state(scn, tx, SYNC_CACHED);
 981 1480  }
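
With the block-sorting rewrite there are now two queues to repair on dataset destroy: the on-disk ZAP queue handled by the zap_remove_int()/zap_add_int_key() pair above, and the in-memory scan_ds queue handled by scan_ds_queue_remove()/scan_ds_queue_insert(). The sketch below models only the in-memory half, with a trivial array in place of the real queue; all toy_* names are hypothetical.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical in-memory scan queue: (dsobj, mintxg) pairs. */
    typedef struct { unsigned long long dsobj, mintxg; bool used; } toy_qent_t;
    static toy_qent_t toy_q[8];

    static bool
    toy_q_contains(unsigned long long dsobj, unsigned long long *mintxg)
    {
            for (int i = 0; i < 8; i++) {
                    if (toy_q[i].used && toy_q[i].dsobj == dsobj) {
                            *mintxg = toy_q[i].mintxg;
                            return (true);
                    }
            }
            return (false);
    }

    static void
    toy_q_remove(unsigned long long dsobj)
    {
            for (int i = 0; i < 8; i++) {
                    if (toy_q[i].used && toy_q[i].dsobj == dsobj)
                            toy_q[i].used = false;
            }
    }

    static void
    toy_q_insert(unsigned long long dsobj, unsigned long long mintxg)
    {
            for (int i = 0; i < 8; i++) {
                    if (!toy_q[i].used) {
                            toy_q[i] = (toy_qent_t){ dsobj, mintxg, true };
                            return;
                    }
            }
    }

    /* Mirrors the in-memory-queue half of dsl_scan_ds_destroyed(). */
    static void
    toy_ds_destroyed(unsigned long long dsobj, bool is_snapshot,
        unsigned long long next_snap_obj)
    {
            unsigned long long mintxg;

            if (toy_q_contains(dsobj, &mintxg)) {
                    toy_q_remove(dsobj);
                    if (is_snapshot)        /* next snapshot keeps the same mintxg */
                            toy_q_insert(next_snap_obj, mintxg);
            }
    }

    int
    main(void)
    {
            unsigned long long mintxg = 0;
            bool found;

            toy_q_insert(100, 7);                   /* snapshot 100 queued at txg 7 */
            toy_ds_destroyed(100, true, 101);       /* destroyed; 101 inherits txg 7 */
            found = toy_q_contains(101, &mintxg);
            (void) printf("101 queued: %d, mintxg %llu\n", (int)found, mintxg);
            return (0);
    }
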
 982 1481  
     1482 +static void
     1483 +ds_snapshotted_bookmark(dsl_dataset_t *ds, zbookmark_phys_t *scn_bookmark)
     1484 +{
     1485 +        if (scn_bookmark->zb_objset == ds->ds_object) {
     1486 +                scn_bookmark->zb_objset =
     1487 +                    dsl_dataset_phys(ds)->ds_prev_snap_obj;
     1488 +                zfs_dbgmsg("snapshotting ds %llu; currently traversing; "
     1489 +                    "reset zb_objset to %llu",
     1490 +                    (u_longlong_t)ds->ds_object,
     1491 +                    (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
     1492 +        }
     1493 +}
     1494 +
     1495 +/*
     1496 + * Called when a dataset is snapshotted. If we were currently traversing
     1497 + * this snapshot, we reset our bookmark to point at the newly created
     1498 + * snapshot. We also modify our work queue to remove the old snapshot and
     1499 + * replace with the new one.
     1500 + */
 983 1501  void
 984 1502  dsl_scan_ds_snapshotted(dsl_dataset_t *ds, dmu_tx_t *tx)
 985 1503  {
 986 1504          dsl_pool_t *dp = ds->ds_dir->dd_pool;
 987 1505          dsl_scan_t *scn = dp->dp_scan;
 988 1506          uint64_t mintxg;
 989 1507  
 990      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
     1508 +        if (!dsl_scan_is_running(scn))
 991 1509                  return;
 992 1510  
 993 1511          ASSERT(dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
 994 1512  
 995      -        if (scn->scn_phys.scn_bookmark.zb_objset == ds->ds_object) {
 996      -                scn->scn_phys.scn_bookmark.zb_objset =
 997      -                    dsl_dataset_phys(ds)->ds_prev_snap_obj;
 998      -                zfs_dbgmsg("snapshotting ds %llu; currently traversing; "
 999      -                    "reset zb_objset to %llu",
1000      -                    (u_longlong_t)ds->ds_object,
1001      -                    (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
1002      -        } else if (zap_lookup_int_key(dp->dp_meta_objset,
1003      -            scn->scn_phys.scn_queue_obj, ds->ds_object, &mintxg) == 0) {
     1513 +        ds_snapshotted_bookmark(ds, &scn->scn_phys.scn_bookmark);
     1514 +        ds_snapshotted_bookmark(ds, &scn->scn_phys_cached.scn_bookmark);
     1515 +
     1516 +        if (scan_ds_queue_contains(scn, ds->ds_object, &mintxg)) {
     1517 +                scan_ds_queue_remove(scn, ds->ds_object);
     1518 +                VERIFY0(scan_ds_queue_insert(scn,
     1519 +                    dsl_dataset_phys(ds)->ds_prev_snap_obj, mintxg));
     1520 +        }
     1521 +
     1522 +        if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
     1523 +            ds->ds_object, &mintxg) == 0) {
     1524 +                DTRACE_PROBE3(scan_ds_snapshotted__in_queue,
     1525 +                    dsl_scan_t *, scn, dsl_dataset_t *, ds, dmu_tx_t *, tx);
     1526 +
1004 1527                  VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1005 1528                      scn->scn_phys.scn_queue_obj, ds->ds_object, tx));
1006 1529                  VERIFY(zap_add_int_key(dp->dp_meta_objset,
1007 1530                      scn->scn_phys.scn_queue_obj,
1008 1531                      dsl_dataset_phys(ds)->ds_prev_snap_obj, mintxg, tx) == 0);
1009 1532                  zfs_dbgmsg("snapshotting ds %llu; in queue; "
1010 1533                      "replacing with %llu",
1011 1534                      (u_longlong_t)ds->ds_object,
1012 1535                      (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
1013 1536          }
1014      -        dsl_scan_sync_state(scn, tx);
     1537 +
     1538 +        dsl_scan_sync_state(scn, tx, SYNC_CACHED);
1015 1539  }
1016 1540  
     1541 +static void
     1542 +ds_clone_swapped_bookmark(dsl_dataset_t *ds1, dsl_dataset_t *ds2,
     1543 +    zbookmark_phys_t *scn_bookmark)
     1544 +{
     1545 +        if (scn_bookmark->zb_objset == ds1->ds_object) {
     1546 +                scn_bookmark->zb_objset = ds2->ds_object;
     1547 +                zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
     1548 +                    "reset zb_objset to %llu",
     1549 +                    (u_longlong_t)ds1->ds_object,
     1550 +                    (u_longlong_t)ds2->ds_object);
     1551 +        } else if (scn_bookmark->zb_objset == ds2->ds_object) {
     1552 +                scn_bookmark->zb_objset = ds1->ds_object;
     1553 +                zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
     1554 +                    "reset zb_objset to %llu",
     1555 +                    (u_longlong_t)ds2->ds_object,
     1556 +                    (u_longlong_t)ds1->ds_object);
     1557 +        }
     1558 +}
     1559 +
     1560 +/*
     1561 + * Called when a parent dataset and its clone are swapped. If we were
     1562 + * currently traversing the dataset, we need to switch to traversing the
     1563 + * newly promoted parent.
     1564 + */
1017 1565  void
1018 1566  dsl_scan_ds_clone_swapped(dsl_dataset_t *ds1, dsl_dataset_t *ds2, dmu_tx_t *tx)
1019 1567  {
1020 1568          dsl_pool_t *dp = ds1->ds_dir->dd_pool;
1021 1569          dsl_scan_t *scn = dp->dp_scan;
1022 1570          uint64_t mintxg;
1023 1571  
1024      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
     1572 +        if (!dsl_scan_is_running(scn))
1025 1573                  return;
1026 1574  
1027      -        if (scn->scn_phys.scn_bookmark.zb_objset == ds1->ds_object) {
1028      -                scn->scn_phys.scn_bookmark.zb_objset = ds2->ds_object;
     1575 +        ds_clone_swapped_bookmark(ds1, ds2, &scn->scn_phys.scn_bookmark);
     1576 +        ds_clone_swapped_bookmark(ds1, ds2, &scn->scn_phys_cached.scn_bookmark);
     1577 +
     1578 +        if (scan_ds_queue_contains(scn, ds1->ds_object, &mintxg)) {
     1579 +                int err;
     1580 +
     1581 +                scan_ds_queue_remove(scn, ds1->ds_object);
     1582 +                err = scan_ds_queue_insert(scn, ds2->ds_object, mintxg);
     1583 +                VERIFY(err == 0 || err == EEXIST);
     1584 +                if (err == EEXIST) {
     1585 +                        /* Both were there to begin with */
     1586 +                        VERIFY0(scan_ds_queue_insert(scn, ds1->ds_object,
     1587 +                            mintxg));
     1588 +                }
1029 1589                  zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1030 1590                      "reset zb_objset to %llu",
1031 1591                      (u_longlong_t)ds1->ds_object,
1032 1592                      (u_longlong_t)ds2->ds_object);
1033      -        } else if (scn->scn_phys.scn_bookmark.zb_objset == ds2->ds_object) {
1034      -                scn->scn_phys.scn_bookmark.zb_objset = ds1->ds_object;
     1593 +        } else if (scan_ds_queue_contains(scn, ds2->ds_object, &mintxg)) {
     1594 +                scan_ds_queue_remove(scn, ds2->ds_object);
     1595 +                VERIFY0(scan_ds_queue_insert(scn, ds1->ds_object, mintxg));
1035 1596                  zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1036 1597                      "reset zb_objset to %llu",
1037 1598                      (u_longlong_t)ds2->ds_object,
1038 1599                      (u_longlong_t)ds1->ds_object);
1039 1600          }
1040 1601  
1041 1602          if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1042 1603              ds1->ds_object, &mintxg) == 0) {
1043 1604                  int err;
1044 1605  
     1606 +                DTRACE_PROBE4(scan_ds_clone_swapped__in_queue_ds1,
     1607 +                    dsl_scan_t *, scn, dsl_dataset_t *, ds1,
     1608 +                    dsl_dataset_t *, ds2, dmu_tx_t *, tx);
1045 1609                  ASSERT3U(mintxg, ==, dsl_dataset_phys(ds1)->ds_prev_snap_txg);
1046 1610                  ASSERT3U(mintxg, ==, dsl_dataset_phys(ds2)->ds_prev_snap_txg);
1047 1611                  VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1048 1612                      scn->scn_phys.scn_queue_obj, ds1->ds_object, tx));
1049 1613                  err = zap_add_int_key(dp->dp_meta_objset,
1050 1614                      scn->scn_phys.scn_queue_obj, ds2->ds_object, mintxg, tx);
1051 1615                  VERIFY(err == 0 || err == EEXIST);
1052 1616                  if (err == EEXIST) {
1053 1617                          /* Both were there to begin with */
1054 1618                          VERIFY(0 == zap_add_int_key(dp->dp_meta_objset,
1055 1619                              scn->scn_phys.scn_queue_obj,
1056 1620                              ds1->ds_object, mintxg, tx));
1057 1621                  }
1058 1622                  zfs_dbgmsg("clone_swap ds %llu; in queue; "
1059 1623                      "replacing with %llu",
1060 1624                      (u_longlong_t)ds1->ds_object,
1061 1625                      (u_longlong_t)ds2->ds_object);
1062 1626          } else if (zap_lookup_int_key(dp->dp_meta_objset,
1063 1627              scn->scn_phys.scn_queue_obj, ds2->ds_object, &mintxg) == 0) {
     1628 +                DTRACE_PROBE4(scan_ds_clone_swapped__in_queue_ds2,
     1629 +                    dsl_scan_t *, scn, dsl_dataset_t *, ds1,
     1630 +                    dsl_dataset_t *, ds2, dmu_tx_t *, tx);
1064 1631                  ASSERT3U(mintxg, ==, dsl_dataset_phys(ds1)->ds_prev_snap_txg);
1065 1632                  ASSERT3U(mintxg, ==, dsl_dataset_phys(ds2)->ds_prev_snap_txg);
1066 1633                  VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1067 1634                      scn->scn_phys.scn_queue_obj, ds2->ds_object, tx));
1068 1635                  VERIFY(0 == zap_add_int_key(dp->dp_meta_objset,
1069 1636                      scn->scn_phys.scn_queue_obj, ds1->ds_object, mintxg, tx));
1070 1637                  zfs_dbgmsg("clone_swap ds %llu; in queue; "
1071 1638                      "replacing with %llu",
1072 1639                      (u_longlong_t)ds2->ds_object,
1073 1640                      (u_longlong_t)ds1->ds_object);
1074 1641          }
1075 1642  
1076      -        dsl_scan_sync_state(scn, tx);
     1643 +        dsl_scan_sync_state(scn, tx, SYNC_CACHED);
1077 1644  }
1078 1645  
1079      -struct enqueue_clones_arg {
1080      -        dmu_tx_t *tx;
1081      -        uint64_t originobj;
1082      -};
1083      -
1084 1646  /* ARGSUSED */
1085 1647  static int
1086 1648  enqueue_clones_cb(dsl_pool_t *dp, dsl_dataset_t *hds, void *arg)
1087 1649  {
1088      -        struct enqueue_clones_arg *eca = arg;
     1650 +        uint64_t originobj = *(uint64_t *)arg;
1089 1651          dsl_dataset_t *ds;
1090 1652          int err;
1091 1653          dsl_scan_t *scn = dp->dp_scan;
1092 1654  
1093      -        if (dsl_dir_phys(hds->ds_dir)->dd_origin_obj != eca->originobj)
     1655 +        if (dsl_dir_phys(hds->ds_dir)->dd_origin_obj != originobj)
1094 1656                  return (0);
1095 1657  
1096 1658          err = dsl_dataset_hold_obj(dp, hds->ds_object, FTAG, &ds);
1097 1659          if (err)
1098 1660                  return (err);
1099 1661  
1100      -        while (dsl_dataset_phys(ds)->ds_prev_snap_obj != eca->originobj) {
     1662 +        while (dsl_dataset_phys(ds)->ds_prev_snap_obj != originobj) {
1101 1663                  dsl_dataset_t *prev;
1102 1664                  err = dsl_dataset_hold_obj(dp,
1103 1665                      dsl_dataset_phys(ds)->ds_prev_snap_obj, FTAG, &prev);
1104 1666  
1105 1667                  dsl_dataset_rele(ds, FTAG);
1106 1668                  if (err)
1107 1669                          return (err);
1108 1670                  ds = prev;
1109 1671          }
1110      -        VERIFY(zap_add_int_key(dp->dp_meta_objset,
1111      -            scn->scn_phys.scn_queue_obj, ds->ds_object,
1112      -            dsl_dataset_phys(ds)->ds_prev_snap_txg, eca->tx) == 0);
     1672 +        VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
     1673 +            dsl_dataset_phys(ds)->ds_prev_snap_txg));
1113 1674          dsl_dataset_rele(ds, FTAG);
1114 1675          return (0);
1115 1676  }
1116 1677  
1117 1678  static void
1118 1679  dsl_scan_visitds(dsl_scan_t *scn, uint64_t dsobj, dmu_tx_t *tx)
1119 1680  {
1120 1681          dsl_pool_t *dp = scn->scn_dp;
1121 1682          dsl_dataset_t *ds;
     1683 +        objset_t *os;
1122 1684  
1123 1685          VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
1124 1686  
1125 1687          if (scn->scn_phys.scn_cur_min_txg >=
1126 1688              scn->scn_phys.scn_max_txg) {
1127 1689                  /*
1128 1690                   * This can happen if this snapshot was created after the
1129 1691                   * scan started, and we already completed a previous snapshot
1130 1692                   * that was created after the scan started.  This snapshot
1131 1693                   * only references blocks with:
1132 1694                   *
1133 1695                   *      birth < our ds_creation_txg
1134 1696                   *      cur_min_txg is no less than ds_creation_txg.
1135 1697                   *      We have already visited these blocks.
1136 1698                   * or
1137 1699                   *      birth > scn_max_txg
1138 1700                   *      The scan requested not to visit these blocks.
1139 1701                   *
1140 1702                   * Subsequent snapshots (and clones) can reference our
1141 1703                   * blocks, or blocks with even higher birth times.
1142 1704                   * Therefore we do not need to visit them either,
1143 1705                   * so we do not add them to the work queue.
1144 1706                   *
1145 1707                   * Note that checking for cur_min_txg >= cur_max_txg
1146 1708                   * is not sufficient, because in that case we may need to
1147 1709                   * visit subsequent snapshots.  This happens when min_txg > 0,
1148 1710                   * which raises cur_min_txg.  In this case we will visit
1149 1711                   * this dataset but skip all of its blocks, because the
1150 1712                   * rootbp's birth time is < cur_min_txg.  Then we will
1151 1713                   * add the next snapshots/clones to the work queue.
1152 1714                   */
1153 1715                  char *dsname = kmem_alloc(MAXNAMELEN, KM_SLEEP);
1154 1716                  dsl_dataset_name(ds, dsname);
  
     23 lines elided
  
1155 1717                  zfs_dbgmsg("scanning dataset %llu (%s) is unnecessary because "
1156 1718                      "cur_min_txg (%llu) >= max_txg (%llu)",
1157 1719                      dsobj, dsname,
1158 1720                      scn->scn_phys.scn_cur_min_txg,
1159 1721                      scn->scn_phys.scn_max_txg);
1160 1722                  kmem_free(dsname, MAXNAMELEN);
1161 1723  
1162 1724                  goto out;
1163 1725          }
1164 1726  
     1727 +        if (dmu_objset_from_ds(ds, &os))
     1728 +                goto out;
     1729 +
1165 1730          /*
1166      -         * Only the ZIL in the head (non-snapshot) is valid. Even though
     1731 +         * Only the ZIL in the head (non-snapshot) is valid.  Even though
1167 1732           * snapshots can have ZIL block pointers (which may be the same
1168      -         * BP as in the head), they must be ignored. In addition, $ORIGIN
1169      -         * doesn't have a objset (i.e. its ds_bp is a hole) so we don't
1170      -         * need to look for a ZIL in it either. So we traverse the ZIL here,
1171      -         * rather than in scan_recurse(), because the regular snapshot
1172      -         * block-sharing rules don't apply to it.
     1733 +         * BP as in the head), they must be ignored.  So we traverse the
     1734 +         * ZIL here, rather than in scan_recurse(), because the regular
     1735 +         * snapshot block-sharing rules don't apply to it.
1173 1736           */
1174      -        if (DSL_SCAN_IS_SCRUB_RESILVER(scn) && !dsl_dataset_is_snapshot(ds) &&
1175      -            ds->ds_dir != dp->dp_origin_snap->ds_dir) {
1176      -                objset_t *os;
1177      -                if (dmu_objset_from_ds(ds, &os) != 0) {
1178      -                        goto out;
1179      -                }
     1737 +        if (DSL_SCAN_IS_SCRUB_RESILVER(scn) && !ds->ds_is_snapshot)
1180 1738                  dsl_scan_zil(dp, &os->os_zil_header);
1181      -        }
1182 1739  
1183 1740          /*
1184 1741           * Iterate over the bps in this ds.
1185 1742           */
1186 1743          dmu_buf_will_dirty(ds->ds_dbuf, tx);
1187 1744          rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG);
1188 1745          dsl_scan_visit_rootbp(scn, ds, &dsl_dataset_phys(ds)->ds_bp, tx);
1189 1746          rrw_exit(&ds->ds_bp_rwlock, FTAG);
1190 1747  
1191 1748          char *dsname = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN, KM_SLEEP);
1192 1749          dsl_dataset_name(ds, dsname);
1193 1750          zfs_dbgmsg("scanned dataset %llu (%s) with min=%llu max=%llu; "
1194 1751              "suspending=%u",
1195 1752              (longlong_t)dsobj, dsname,
1196 1753              (longlong_t)scn->scn_phys.scn_cur_min_txg,
1197 1754              (longlong_t)scn->scn_phys.scn_cur_max_txg,
1198 1755              (int)scn->scn_suspending);
1199 1756          kmem_free(dsname, ZFS_MAX_DATASET_NAME_LEN);
1200 1757  
     1758 +        DTRACE_PROBE3(scan_done, dsl_scan_t *, scn, dsl_dataset_t *, ds,
     1759 +            dmu_tx_t *, tx);
     1760 +
1201 1761          if (scn->scn_suspending)
1202 1762                  goto out;
1203 1763  
1204 1764          /*
1205 1765           * We've finished this pass over this dataset.
1206 1766           */
1207 1767  
1208 1768          /*
1209 1769           * If we did not completely visit this dataset, do another pass.
1210 1770           */
1211 1771          if (scn->scn_phys.scn_flags & DSF_VISIT_DS_AGAIN) {
     1772 +                DTRACE_PROBE3(scan_incomplete, dsl_scan_t *, scn,
     1773 +                    dsl_dataset_t *, ds, dmu_tx_t *, tx);
1212 1774                  zfs_dbgmsg("incomplete pass; visiting again");
1213 1775                  scn->scn_phys.scn_flags &= ~DSF_VISIT_DS_AGAIN;
1214      -                VERIFY(zap_add_int_key(dp->dp_meta_objset,
1215      -                    scn->scn_phys.scn_queue_obj, ds->ds_object,
1216      -                    scn->scn_phys.scn_cur_max_txg, tx) == 0);
     1776 +                VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
     1777 +                    scn->scn_phys.scn_cur_max_txg));
1217 1778                  goto out;
1218 1779          }
1219 1780  
1220 1781          /*
1221 1782           * Add descendent datasets to work queue.
1222 1783           */
1223 1784          if (dsl_dataset_phys(ds)->ds_next_snap_obj != 0) {
1224      -                VERIFY(zap_add_int_key(dp->dp_meta_objset,
1225      -                    scn->scn_phys.scn_queue_obj,
     1785 +                VERIFY0(scan_ds_queue_insert(scn,
1226 1786                      dsl_dataset_phys(ds)->ds_next_snap_obj,
1227      -                    dsl_dataset_phys(ds)->ds_creation_txg, tx) == 0);
     1787 +                    dsl_dataset_phys(ds)->ds_creation_txg));
1228 1788          }
1229 1789          if (dsl_dataset_phys(ds)->ds_num_children > 1) {
1230 1790                  boolean_t usenext = B_FALSE;
1231 1791                  if (dsl_dataset_phys(ds)->ds_next_clones_obj != 0) {
1232 1792                          uint64_t count;
1233 1793                          /*
1234 1794                           * A bug in a previous version of the code could
1235 1795                           * cause upgrade_clones_cb() to not set
1236 1796                           * ds_next_snap_obj when it should, leading to a
1237 1797                           * missing entry.  Therefore we can only use the
1238 1798                           * next_clones_obj when its count is correct.
1239 1799                           */
1240 1800                          int err = zap_count(dp->dp_meta_objset,
1241 1801                              dsl_dataset_phys(ds)->ds_next_clones_obj, &count);
1242 1802                          if (err == 0 &&
1243 1803                              count == dsl_dataset_phys(ds)->ds_num_children - 1)
1244 1804                                  usenext = B_TRUE;
1245 1805                  }
1246 1806  
1247 1807                  if (usenext) {
1248      -                        VERIFY0(zap_join_key(dp->dp_meta_objset,
1249      -                            dsl_dataset_phys(ds)->ds_next_clones_obj,
1250      -                            scn->scn_phys.scn_queue_obj,
1251      -                            dsl_dataset_phys(ds)->ds_creation_txg, tx));
     1808 +                        zap_cursor_t zc;
     1809 +                        zap_attribute_t za;
     1810 +                        for (zap_cursor_init(&zc, dp->dp_meta_objset,
     1811 +                            dsl_dataset_phys(ds)->ds_next_clones_obj);
     1812 +                            zap_cursor_retrieve(&zc, &za) == 0;
     1813 +                            (void) zap_cursor_advance(&zc)) {
     1814 +                                VERIFY0(scan_ds_queue_insert(scn,
     1815 +                                    zfs_strtonum(za.za_name, NULL),
     1816 +                                    dsl_dataset_phys(ds)->ds_creation_txg));
     1817 +                        }
     1818 +                        zap_cursor_fini(&zc);
1252 1819                  } else {
1253      -                        struct enqueue_clones_arg eca;
1254      -                        eca.tx = tx;
1255      -                        eca.originobj = ds->ds_object;
1256      -
1257 1820                          VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1258      -                            enqueue_clones_cb, &eca, DS_FIND_CHILDREN));
     1821 +                            enqueue_clones_cb, &ds->ds_object,
     1822 +                            DS_FIND_CHILDREN));
1259 1823                  }
1260 1824          }
1261 1825  
1262 1826  out:
1263 1827          dsl_dataset_rele(ds, FTAG);
1264 1828  }
1265 1829  
1266 1830  /* ARGSUSED */
1267 1831  static int
1268 1832  enqueue_cb(dsl_pool_t *dp, dsl_dataset_t *hds, void *arg)
1269 1833  {
1270      -        dmu_tx_t *tx = arg;
1271 1834          dsl_dataset_t *ds;
1272 1835          int err;
1273 1836          dsl_scan_t *scn = dp->dp_scan;
1274 1837  
1275 1838          err = dsl_dataset_hold_obj(dp, hds->ds_object, FTAG, &ds);
1276 1839          if (err)
1277 1840                  return (err);
1278 1841  
1279 1842          while (dsl_dataset_phys(ds)->ds_prev_snap_obj != 0) {
1280 1843                  dsl_dataset_t *prev;
1281 1844                  err = dsl_dataset_hold_obj(dp,
1282 1845                      dsl_dataset_phys(ds)->ds_prev_snap_obj, FTAG, &prev);
1283 1846                  if (err) {
1284 1847                          dsl_dataset_rele(ds, FTAG);
1285 1848                          return (err);
1286 1849                  }
1287 1850  
1288 1851                  /*
1289 1852                   * If this is a clone, we don't need to worry about it for now.
  
     9 lines elided
  
1290 1853                   */
1291 1854                  if (dsl_dataset_phys(prev)->ds_next_snap_obj != ds->ds_object) {
1292 1855                          dsl_dataset_rele(ds, FTAG);
1293 1856                          dsl_dataset_rele(prev, FTAG);
1294 1857                          return (0);
1295 1858                  }
1296 1859                  dsl_dataset_rele(ds, FTAG);
1297 1860                  ds = prev;
1298 1861          }
1299 1862  
1300      -        VERIFY(zap_add_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1301      -            ds->ds_object, dsl_dataset_phys(ds)->ds_prev_snap_txg, tx) == 0);
     1863 +        VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
     1864 +            dsl_dataset_phys(ds)->ds_prev_snap_txg));
1302 1865          dsl_dataset_rele(ds, FTAG);
1303 1866          return (0);
1304 1867  }
1305 1868  
1306 1869  /*
1307 1870   * Scrub/dedup interaction.
1308 1871   *
1309 1872   * If there are N references to a deduped block, we don't want to scrub it
1310 1873   * N times -- ideally, we should scrub it exactly once.
1311 1874   *
1312 1875   * We leverage the fact that the dde's replication class (enum ddt_class)
1313 1876   * is ordered from highest replication class (DDT_CLASS_DITTO) to lowest
1314 1877   * (DDT_CLASS_UNIQUE) so that we may walk the DDT in that order.
1315 1878   *
1316 1879   * To prevent excess scrubbing, the scrub begins by walking the DDT
1317 1880   * to find all blocks with refcnt > 1, and scrubs each of these once.
1318 1881   * Since there are two replication classes which contain blocks with
1319 1882   * refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
1320 1883   * Finally the top-down scrub begins, only visiting blocks with refcnt == 1.
1321 1884   *
1322 1885   * There would be nothing more to say if a block's refcnt couldn't change
1323 1886   * during a scrub, but of course it can so we must account for changes
1324 1887   * in a block's replication class.
1325 1888   *
1326 1889   * Here's an example of what can occur:
1327 1890   *
1328 1891   * If a block has refcnt > 1 during the DDT scrub phase, but has refcnt == 1
1329 1892   * when visited during the top-down scrub phase, it will be scrubbed twice.
1330 1893   * This negates our scrub optimization, but is otherwise harmless.
1331 1894   *
1332 1895   * If a block has refcnt == 1 during the DDT scrub phase, but has refcnt > 1
1333 1896   * on each visit during the top-down scrub phase, it will never be scrubbed.
1334 1897   * To catch this, ddt_sync_entry() notifies the scrub code whenever a block's
1335 1898   * reference class transitions to a higher level (i.e DDT_CLASS_UNIQUE to
1336 1899   * DDT_CLASS_DUPLICATE); if it transitions from refcnt == 1 to refcnt > 1
1337 1900   * while a scrub is in progress, it scrubs the block right then.
1338 1901   */
1339 1902  static void
1340 1903  dsl_scan_ddt(dsl_scan_t *scn, dmu_tx_t *tx)
1341 1904  {
  
     30 lines elided
  
1342 1905          ddt_bookmark_t *ddb = &scn->scn_phys.scn_ddt_bookmark;
1343 1906          ddt_entry_t dde = { 0 };
1344 1907          int error;
1345 1908          uint64_t n = 0;
1346 1909  
1347 1910          while ((error = ddt_walk(scn->scn_dp->dp_spa, ddb, &dde)) == 0) {
1348 1911                  ddt_t *ddt;
1349 1912  
1350 1913                  if (ddb->ddb_class > scn->scn_phys.scn_ddt_class_max)
1351 1914                          break;
     1915 +                DTRACE_PROBE1(scan_ddb, ddt_bookmark_t *, ddb);
1352 1916                  dprintf("visiting ddb=%llu/%llu/%llu/%llx\n",
1353 1917                      (longlong_t)ddb->ddb_class,
1354 1918                      (longlong_t)ddb->ddb_type,
1355 1919                      (longlong_t)ddb->ddb_checksum,
1356 1920                      (longlong_t)ddb->ddb_cursor);
1357 1921  
1358 1922                  /* There should be no pending changes to the dedup table */
1359 1923                  ddt = scn->scn_dp->dp_spa->spa_ddt[ddb->ddb_checksum];
1360      -                ASSERT(avl_first(&ddt->ddt_tree) == NULL);
1361      -
     1924 +#ifdef ZFS_DEBUG
     1925 +                for (uint_t i = 0; i < DDT_HASHSZ; i++)
     1926 +                        ASSERT(avl_first(&ddt->ddt_tree[i]) == NULL);
     1927 +#endif
1362 1928                  dsl_scan_ddt_entry(scn, ddb->ddb_checksum, &dde, tx);
1363 1929                  n++;
1364 1930  
1365 1931                  if (dsl_scan_check_suspend(scn, NULL))
1366 1932                          break;
1367 1933          }
1368 1934  
     1935 +        DTRACE_PROBE2(scan_ddt_done, dsl_scan_t *, scn, uint64_t, n);
1369 1936          zfs_dbgmsg("scanned %llu ddt entries with class_max = %u; "
1370 1937              "suspending=%u", (longlong_t)n,
1371 1938              (int)scn->scn_phys.scn_ddt_class_max, (int)scn->scn_suspending);
1372 1939  
1373 1940          ASSERT(error == 0 || error == ENOENT);
1374 1941          ASSERT(error != ENOENT ||
1375 1942              ddb->ddb_class > scn->scn_phys.scn_ddt_class_max);
1376 1943  }
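
The block comment before dsl_scan_ddt() describes a two-phase scrub: classes up to scn_ddt_class_max are covered while walking the DDT, and everything else is covered by the later top-down traversal (which skips already-covered blocks via the ddt_class_contains() test in dsl_scan_visitbp()). A small illustration of that split, assuming the usual ditto < duplicate < unique class ordering and a class_max of duplicate chosen only for the example; the toy_* names are hypothetical:

    #include <stdio.h>

    /* Hypothetical mirror of the ddt_class ordering used by the scan. */
    typedef enum { TOY_DITTO, TOY_DUPLICATE, TOY_UNIQUE } toy_ddt_class_t;

    /*
     * Blocks whose class is <= class_max are scrubbed during the DDT walk;
     * the later top-down pass skips them and handles the rest.
     */
    static const char *
    toy_scrubbed_in(toy_ddt_class_t class, toy_ddt_class_t class_max)
    {
            return (class <= class_max ? "DDT walk" : "top-down traversal");
    }

    int
    main(void)
    {
            const char *names[] = { "ditto", "duplicate", "unique" };

            for (toy_ddt_class_t c = TOY_DITTO; c <= TOY_UNIQUE; c++) {
                    (void) printf("%-9s blocks: %s\n", names[c],
                        toy_scrubbed_in(c, TOY_DUPLICATE));
            }
            return (0);
    }
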
1377 1944  
1378 1945  /* ARGSUSED */
1379 1946  void
1380 1947  dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
1381 1948      ddt_entry_t *dde, dmu_tx_t *tx)
1382 1949  {
1383 1950          const ddt_key_t *ddk = &dde->dde_key;
1384 1951          ddt_phys_t *ddp = dde->dde_phys;
1385 1952          blkptr_t bp;
1386 1953          zbookmark_phys_t zb = { 0 };
1387 1954  
1388 1955          if (scn->scn_phys.scn_state != DSS_SCANNING)
1389 1956                  return;
1390 1957  
1391 1958          for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
1392 1959                  if (ddp->ddp_phys_birth == 0 ||
1393 1960                      ddp->ddp_phys_birth > scn->scn_phys.scn_max_txg)
1394 1961                          continue;
1395 1962                  ddt_bp_create(checksum, ddk, ddp, &bp);
  
     17 lines elided
  
1396 1963  
1397 1964                  scn->scn_visited_this_txg++;
1398 1965                  scan_funcs[scn->scn_phys.scn_func](scn->scn_dp, &bp, &zb);
1399 1966          }
1400 1967  }
1401 1968  
1402 1969  static void
1403 1970  dsl_scan_visit(dsl_scan_t *scn, dmu_tx_t *tx)
1404 1971  {
1405 1972          dsl_pool_t *dp = scn->scn_dp;
1406      -        zap_cursor_t zc;
1407      -        zap_attribute_t za;
     1973 +        uint64_t dsobj, txg;
1408 1974  
1409 1975          if (scn->scn_phys.scn_ddt_bookmark.ddb_class <=
1410 1976              scn->scn_phys.scn_ddt_class_max) {
1411 1977                  scn->scn_phys.scn_cur_min_txg = scn->scn_phys.scn_min_txg;
1412 1978                  scn->scn_phys.scn_cur_max_txg = scn->scn_phys.scn_max_txg;
1413 1979                  dsl_scan_ddt(scn, tx);
1414 1980                  if (scn->scn_suspending)
1415 1981                          return;
1416 1982          }
1417 1983  
1418 1984          if (scn->scn_phys.scn_bookmark.zb_objset == DMU_META_OBJSET) {
1419 1985                  /* First do the MOS & ORIGIN */
1420 1986  
  
     3 lines elided
  
1421 1987                  scn->scn_phys.scn_cur_min_txg = scn->scn_phys.scn_min_txg;
1422 1988                  scn->scn_phys.scn_cur_max_txg = scn->scn_phys.scn_max_txg;
1423 1989                  dsl_scan_visit_rootbp(scn, NULL,
1424 1990                      &dp->dp_meta_rootbp, tx);
1425 1991                  spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp);
1426 1992                  if (scn->scn_suspending)
1427 1993                          return;
1428 1994  
1429 1995                  if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
1430 1996                          VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1431      -                            enqueue_cb, tx, DS_FIND_CHILDREN));
     1997 +                            enqueue_cb, NULL, DS_FIND_CHILDREN));
1432 1998                  } else {
1433 1999                          dsl_scan_visitds(scn,
1434 2000                              dp->dp_origin_snap->ds_object, tx);
1435 2001                  }
1436 2002                  ASSERT(!scn->scn_suspending);
1437 2003          } else if (scn->scn_phys.scn_bookmark.zb_objset !=
1438 2004              ZB_DESTROYED_OBJSET) {
     2005 +                uint64_t dsobj = scn->scn_phys.scn_bookmark.zb_objset;
1439 2006                  /*
1440 2007                   * If we were suspended, continue from here.  Note if the
1441 2008                   * ds we were suspended on was deleted, the zb_objset may
1442 2009                   * be -1, so we will skip this and find a new objset
1443 2010                   * below.
1444 2011                   */
1445      -                dsl_scan_visitds(scn, scn->scn_phys.scn_bookmark.zb_objset, tx);
     2012 +                dsl_scan_visitds(scn, dsobj, tx);
1446 2013                  if (scn->scn_suspending)
1447 2014                          return;
1448 2015          }
1449 2016  
1450 2017          /*
1451 2018           * In case we were suspended right at the end of the ds, zero the
1452 2019           * bookmark so we don't think that we're still trying to resume.
1453 2020           */
1454 2021          bzero(&scn->scn_phys.scn_bookmark, sizeof (zbookmark_phys_t));
1455 2022  
1456 2023          /* keep pulling things out of the zap-object-as-queue */
1457      -        while (zap_cursor_init(&zc, dp->dp_meta_objset,
1458      -            scn->scn_phys.scn_queue_obj),
1459      -            zap_cursor_retrieve(&zc, &za) == 0) {
     2024 +        while (scan_ds_queue_first(scn, &dsobj, &txg)) {
1460 2025                  dsl_dataset_t *ds;
1461      -                uint64_t dsobj;
1462 2026  
1463      -                dsobj = zfs_strtonum(za.za_name, NULL);
1464      -                VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1465      -                    scn->scn_phys.scn_queue_obj, dsobj, tx));
     2027 +                scan_ds_queue_remove(scn, dsobj);
1466 2028  
1467 2029                  /* Set up min/max txg */
1468 2030                  VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
1469      -                if (za.za_first_integer != 0) {
     2031 +                if (txg != 0) {
1470 2032                          scn->scn_phys.scn_cur_min_txg =
1471      -                            MAX(scn->scn_phys.scn_min_txg,
1472      -                            za.za_first_integer);
     2033 +                            MAX(scn->scn_phys.scn_min_txg, txg);
1473 2034                  } else {
1474 2035                          scn->scn_phys.scn_cur_min_txg =
1475 2036                              MAX(scn->scn_phys.scn_min_txg,
1476 2037                              dsl_dataset_phys(ds)->ds_prev_snap_txg);
1477 2038                  }
1478 2039                  scn->scn_phys.scn_cur_max_txg = dsl_scan_ds_maxtxg(ds);
1479 2040                  dsl_dataset_rele(ds, FTAG);
1480 2041  
1481 2042                  dsl_scan_visitds(scn, dsobj, tx);
1482      -                zap_cursor_fini(&zc);
1483 2043                  if (scn->scn_suspending)
1484 2044                          return;
1485 2045          }
1486      -        zap_cursor_fini(&zc);
     2046 +        /* No more objsets to fetch, we're done */
     2047 +        scn->scn_phys.scn_bookmark.zb_objset = ZB_DESTROYED_OBJSET;
     2048 +        ASSERT0(scn->scn_suspending);
1487 2049  }
1488 2050  
1489 2051  static boolean_t
1490      -dsl_scan_async_block_should_pause(dsl_scan_t *scn)
     2052 +dsl_scan_free_should_suspend(dsl_scan_t *scn)
1491 2053  {
1492 2054          uint64_t elapsed_nanosecs;
1493 2055  
1494 2056          if (zfs_recover)
1495 2057                  return (B_FALSE);
1496 2058  
1497      -        if (scn->scn_visited_this_txg >= zfs_async_block_max_blocks)
     2059 +        if (scn->scn_visited_this_txg >= zfs_free_max_blocks)
1498 2060                  return (B_TRUE);
1499 2061  
1500 2062          elapsed_nanosecs = gethrtime() - scn->scn_sync_start_time;
1501 2063          return (elapsed_nanosecs / NANOSEC > zfs_txg_timeout ||
1502      -            (NSEC2MSEC(elapsed_nanosecs) > scn->scn_async_block_min_time_ms &&
     2064 +            (NSEC2MSEC(elapsed_nanosecs) > zfs_free_min_time_ms &&
1503 2065              txg_sync_waiting(scn->scn_dp)) ||
1504 2066              spa_shutting_down(scn->scn_dp->dp_spa));
1505 2067  }
1506 2068  
1507 2069  static int
1508 2070  dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
1509 2071  {
1510 2072          dsl_scan_t *scn = arg;
1511 2073  
1512 2074          if (!scn->scn_is_bptree ||
1513 2075              (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_OBJSET)) {
1514      -                if (dsl_scan_async_block_should_pause(scn))
     2076 +                if (dsl_scan_free_should_suspend(scn))
1515 2077                          return (SET_ERROR(ERESTART));
1516 2078          }
1517 2079  
1518 2080          zio_nowait(zio_free_sync(scn->scn_zio_root, scn->scn_dp->dp_spa,
1519 2081              dmu_tx_get_txg(tx), bp, 0));
1520 2082          dsl_dir_diduse_space(tx->tx_pool->dp_free_dir, DD_USED_HEAD,
1521 2083              -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
1522 2084              -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
1523 2085          scn->scn_visited_this_txg++;
1524 2086          return (0);
1525 2087  }
1526 2088  
1527      -static int
1528      -dsl_scan_obsolete_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
1529      -{
1530      -        dsl_scan_t *scn = arg;
1531      -        const dva_t *dva = &bp->blk_dva[0];
1532      -
1533      -        if (dsl_scan_async_block_should_pause(scn))
1534      -                return (SET_ERROR(ERESTART));
1535      -
1536      -        spa_vdev_indirect_mark_obsolete(scn->scn_dp->dp_spa,
1537      -            DVA_GET_VDEV(dva), DVA_GET_OFFSET(dva),
1538      -            DVA_GET_ASIZE(dva), tx);
1539      -        scn->scn_visited_this_txg++;
1540      -        return (0);
1541      -}
1542      -
1543 2089  boolean_t
1544 2090  dsl_scan_active(dsl_scan_t *scn)
1545 2091  {
1546 2092          spa_t *spa = scn->scn_dp->dp_spa;
1547 2093          uint64_t used = 0, comp, uncomp;
1548 2094  
1549 2095          if (spa->spa_load_state != SPA_LOAD_NONE)
1550 2096                  return (B_FALSE);
1551 2097          if (spa_shutting_down(spa))
1552 2098                  return (B_FALSE);
1553      -        if ((scn->scn_phys.scn_state == DSS_SCANNING &&
1554      -            !dsl_scan_is_paused_scrub(scn)) ||
     2099 +        if ((dsl_scan_is_running(scn) && !dsl_scan_is_paused_scrub(scn)) ||
1555 2100              (scn->scn_async_destroying && !scn->scn_async_stalled))
1556 2101                  return (B_TRUE);
1557 2102  
1558 2103          if (spa_version(scn->scn_dp->dp_spa) >= SPA_VERSION_DEADLISTS) {
1559 2104                  (void) bpobj_space(&scn->scn_dp->dp_free_bpobj,
1560 2105                      &used, &comp, &uncomp);
1561 2106          }
1562 2107          return (used != 0);
1563 2108  }
1564 2109  
1565 2110  /* Called whenever a txg syncs. */
1566 2111  void
1567 2112  dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
1568 2113  {
1569 2114          dsl_scan_t *scn = dp->dp_scan;
1570 2115          spa_t *spa = dp->dp_spa;
1571 2116          int err = 0;
1572 2117  
1573 2118          /*
1574 2119           * Check for scn_restart_txg before checking spa_load_state, so
1575 2120           * that we can restart an old-style scan while the pool is being
1576 2121           * imported (see dsl_scan_init).
1577 2122           */
1578 2123          if (dsl_scan_restarting(scn, tx)) {
1579 2124                  pool_scan_func_t func = POOL_SCAN_SCRUB;
1580 2125                  dsl_scan_done(scn, B_FALSE, tx);
1581 2126                  if (vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
1582 2127                          func = POOL_SCAN_RESILVER;
1583 2128                  zfs_dbgmsg("restarting scan func=%u txg=%llu",
1584 2129                      func, tx->tx_txg);
1585 2130                  dsl_scan_setup_sync(&func, tx);
1586 2131          }
1587 2132  
1588 2133          /*
1589 2134           * Only process scans in sync pass 1.
1590 2135           */
1591 2136          if (spa_sync_pass(dp->dp_spa) > 1)
1592 2137                  return;
1593 2138  
1594 2139          /*
1595 2140           * If the spa is shutting down, then stop scanning. This will
1596 2141           * ensure that the scan does not dirty any new data during the
1597 2142           * shutdown phase.
1598 2143           */
1599 2144          if (spa_shutting_down(spa))
1600 2145                  return;
1601 2146  
1602 2147          /*
1603 2148           * If the scan is inactive due to a stalled async destroy, try again.
1604 2149           */
1605 2150          if (!scn->scn_async_stalled && !dsl_scan_active(scn))
1606 2151                  return;
1607 2152  
1608 2153          scn->scn_visited_this_txg = 0;
1609 2154          scn->scn_suspending = B_FALSE;
1610 2155          scn->scn_sync_start_time = gethrtime();
1611 2156          spa->spa_scrub_active = B_TRUE;
1612 2157  
  
       [48 lines elided]
  
1613 2158          /*
1614 2159           * First process the async destroys.  If we suspend, don't do
1615 2160           * any scrubbing or resilvering.  This ensures that there are no
1616 2161           * async destroys while we are scanning, so the scan code doesn't
1617 2162           * have to worry about traversing it.  It is also faster to free the
1618 2163           * blocks than to scrub them.
1619 2164           */
1620 2165          if (zfs_free_bpobj_enabled &&
1621 2166              spa_version(dp->dp_spa) >= SPA_VERSION_DEADLISTS) {
1622 2167                  scn->scn_is_bptree = B_FALSE;
1623      -                scn->scn_async_block_min_time_ms = zfs_free_min_time_ms;
1624 2168                  scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1625 2169                      NULL, ZIO_FLAG_MUSTSUCCEED);
1626 2170                  err = bpobj_iterate(&dp->dp_free_bpobj,
1627 2171                      dsl_scan_free_block_cb, scn, tx);
1628 2172                  VERIFY3U(0, ==, zio_wait(scn->scn_zio_root));
1629 2173  
1630 2174                  if (err != 0 && err != ERESTART)
1631 2175                          zfs_panic_recover("error %u from bpobj_iterate()", err);
1632 2176          }
1633 2177  
1634 2178          if (err == 0 && spa_feature_is_active(spa, SPA_FEATURE_ASYNC_DESTROY)) {
1635 2179                  ASSERT(scn->scn_async_destroying);
1636 2180                  scn->scn_is_bptree = B_TRUE;
1637 2181                  scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1638 2182                      NULL, ZIO_FLAG_MUSTSUCCEED);
1639 2183                  err = bptree_iterate(dp->dp_meta_objset,
1640 2184                      dp->dp_bptree_obj, B_TRUE, dsl_scan_free_block_cb, scn, tx);
1641 2185                  VERIFY0(zio_wait(scn->scn_zio_root));
1642 2186  
1643 2187                  if (err == EIO || err == ECKSUM) {
1644 2188                          err = 0;
1645 2189                  } else if (err != 0 && err != ERESTART) {
1646 2190                          zfs_panic_recover("error %u from "
1647 2191                              "traverse_dataset_destroyed()", err);
1648 2192                  }
1649 2193  
1650 2194                  if (bptree_is_empty(dp->dp_meta_objset, dp->dp_bptree_obj)) {
1651 2195                          /* finished; deactivate async destroy feature */
1652 2196                          spa_feature_decr(spa, SPA_FEATURE_ASYNC_DESTROY, tx);
1653 2197                          ASSERT(!spa_feature_is_active(spa,
1654 2198                              SPA_FEATURE_ASYNC_DESTROY));
1655 2199                          VERIFY0(zap_remove(dp->dp_meta_objset,
1656 2200                              DMU_POOL_DIRECTORY_OBJECT,
1657 2201                              DMU_POOL_BPTREE_OBJ, tx));
1658 2202                          VERIFY0(bptree_free(dp->dp_meta_objset,
1659 2203                              dp->dp_bptree_obj, tx));
1660 2204                          dp->dp_bptree_obj = 0;
1661 2205                          scn->scn_async_destroying = B_FALSE;
1662 2206                          scn->scn_async_stalled = B_FALSE;
1663 2207                  } else {
1664 2208                          /*
1665 2209                           * If we didn't make progress, mark the async
1666 2210                           * destroy as stalled, so that we will not initiate
1667 2211                           * a spa_sync() on its behalf.  Note that we only
1668 2212                           * check this if we are not finished, because if the
1669 2213                           * bptree had no blocks for us to visit, we can
1670 2214                           * finish without "making progress".
1671 2215                           */
1672 2216                          scn->scn_async_stalled =
1673 2217                              (scn->scn_visited_this_txg == 0);
1674 2218                  }
1675 2219          }
1676 2220          if (scn->scn_visited_this_txg) {
1677 2221                  zfs_dbgmsg("freed %llu blocks in %llums from "
1678 2222                      "free_bpobj/bptree txg %llu; err=%u",
1679 2223                      (longlong_t)scn->scn_visited_this_txg,
1680 2224                      (longlong_t)
1681 2225                      NSEC2MSEC(gethrtime() - scn->scn_sync_start_time),
1682 2226                      (longlong_t)tx->tx_txg, err);
1683 2227                  scn->scn_visited_this_txg = 0;
1684 2228  
1685 2229                  /*
1686 2230                   * Write out changes to the DDT that may be required as a
1687 2231                   * result of the blocks freed.  This ensures that the DDT
1688 2232                   * is clean when a scrub/resilver runs.
1689 2233                   */
1690 2234                  ddt_sync(spa, tx->tx_txg);
1691 2235          }
1692 2236          if (err != 0)
1693 2237                  return;
1694 2238          if (dp->dp_free_dir != NULL && !scn->scn_async_destroying &&
1695 2239              zfs_free_leak_on_eio &&
1696 2240              (dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes != 0 ||
1697 2241              dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes != 0 ||
1698 2242              dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes != 0)) {
1699 2243                  /*
1700 2244                   * We have finished background destroying, but there is still
1701 2245                   * some space left in the dp_free_dir. Transfer this leaked
1702 2246                   * space to the dp_leak_dir.
1703 2247                   */
1704 2248                  if (dp->dp_leak_dir == NULL) {
1705 2249                          rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);
1706 2250                          (void) dsl_dir_create_sync(dp, dp->dp_root_dir,
1707 2251                              LEAK_DIR_NAME, tx);
1708 2252                          VERIFY0(dsl_pool_open_special_dir(dp,
1709 2253                              LEAK_DIR_NAME, &dp->dp_leak_dir));
1710 2254                          rrw_exit(&dp->dp_config_rwlock, FTAG);
  
       [77 lines elided]
  
1711 2255                  }
1712 2256                  dsl_dir_diduse_space(dp->dp_leak_dir, DD_USED_HEAD,
1713 2257                      dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
1714 2258                      dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
1715 2259                      dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
1716 2260                  dsl_dir_diduse_space(dp->dp_free_dir, DD_USED_HEAD,
1717 2261                      -dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
1718 2262                      -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
1719 2263                      -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
1720 2264          }
1721      -
1722 2265          if (dp->dp_free_dir != NULL && !scn->scn_async_destroying) {
1723 2266                  /* finished; verify that space accounting went to zero */
1724 2267                  ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
1725 2268                  ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);
1726 2269                  ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes);
1727 2270          }
1728 2271  
1729      -        EQUIV(bpobj_is_open(&dp->dp_obsolete_bpobj),
1730      -            0 == zap_contains(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
1731      -            DMU_POOL_OBSOLETE_BPOBJ));
1732      -        if (err == 0 && bpobj_is_open(&dp->dp_obsolete_bpobj)) {
1733      -                ASSERT(spa_feature_is_active(dp->dp_spa,
1734      -                    SPA_FEATURE_OBSOLETE_COUNTS));
     2272 +        if (!dsl_scan_is_running(scn))
     2273 +                return;
1735 2274  
1736      -                scn->scn_is_bptree = B_FALSE;
1737      -                scn->scn_async_block_min_time_ms = zfs_obsolete_min_time_ms;
1738      -                err = bpobj_iterate(&dp->dp_obsolete_bpobj,
1739      -                    dsl_scan_obsolete_block_cb, scn, tx);
1740      -                if (err != 0 && err != ERESTART)
1741      -                        zfs_panic_recover("error %u from bpobj_iterate()", err);
1742      -
1743      -                if (bpobj_is_empty(&dp->dp_obsolete_bpobj))
1744      -                        dsl_pool_destroy_obsolete_bpobj(dp, tx);
     2275 +        if (!zfs_scan_direct) {
     2276 +                if (!scn->scn_is_sorted)
     2277 +                        scn->scn_last_queue_run_time = 0;
     2278 +                scn->scn_is_sorted = B_TRUE;
1745 2279          }
1746 2280  
1747      -        if (scn->scn_phys.scn_state != DSS_SCANNING)
1748      -                return;
1749      -
1750      -        if (scn->scn_done_txg == tx->tx_txg) {
     2281 +        if (scn->scn_done_txg == tx->tx_txg ||
     2282 +            scn->scn_phys.scn_state == DSS_FINISHING) {
1751 2283                  ASSERT(!scn->scn_suspending);
     2284 +                if (scn->scn_bytes_pending != 0) {
     2285 +                        ASSERT(scn->scn_is_sorted);
     2286 +                        scn->scn_phys.scn_state = DSS_FINISHING;
     2287 +                        goto finish;
     2288 +                }
1752 2289                  /* finished with scan. */
1753 2290                  zfs_dbgmsg("txg %llu scan complete", tx->tx_txg);
1754 2291                  dsl_scan_done(scn, B_TRUE, tx);
1755 2292                  ASSERT3U(spa->spa_scrub_inflight, ==, 0);
1756      -                dsl_scan_sync_state(scn, tx);
     2293 +                dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
1757 2294                  return;
1758 2295          }
1759 2296  
1760 2297          if (dsl_scan_is_paused_scrub(scn))
1761 2298                  return;
1762 2299  
1763 2300          if (scn->scn_phys.scn_ddt_bookmark.ddb_class <=
1764 2301              scn->scn_phys.scn_ddt_class_max) {
1765 2302                  zfs_dbgmsg("doing scan sync txg %llu; "
1766 2303                      "ddt bm=%llu/%llu/%llu/%llx",
1767 2304                      (longlong_t)tx->tx_txg,
1768 2305                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_class,
1769 2306                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_type,
1770 2307                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_checksum,
1771 2308                      (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_cursor);
1772 2309                  ASSERT(scn->scn_phys.scn_bookmark.zb_objset == 0);
1773 2310                  ASSERT(scn->scn_phys.scn_bookmark.zb_object == 0);
1774 2311                  ASSERT(scn->scn_phys.scn_bookmark.zb_level == 0);
  
       [8 lines elided]
  
1775 2312                  ASSERT(scn->scn_phys.scn_bookmark.zb_blkid == 0);
1776 2313          } else {
1777 2314                  zfs_dbgmsg("doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu",
1778 2315                      (longlong_t)tx->tx_txg,
1779 2316                      (longlong_t)scn->scn_phys.scn_bookmark.zb_objset,
1780 2317                      (longlong_t)scn->scn_phys.scn_bookmark.zb_object,
1781 2318                      (longlong_t)scn->scn_phys.scn_bookmark.zb_level,
1782 2319                      (longlong_t)scn->scn_phys.scn_bookmark.zb_blkid);
1783 2320          }
1784 2321  
1785      -        scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1786      -            NULL, ZIO_FLAG_CANFAIL);
1787      -        dsl_pool_config_enter(dp, FTAG);
1788      -        dsl_scan_visit(scn, tx);
1789      -        dsl_pool_config_exit(dp, FTAG);
1790      -        (void) zio_wait(scn->scn_zio_root);
1791      -        scn->scn_zio_root = NULL;
     2322 +        if (scn->scn_is_sorted) {
     2323 +                /*
     2324 +                 * This is the out-of-order queue handling. We determine our
     2325 +                 * memory usage and, based on that, switch states between normal
     2326 +                 * operation (i.e. don't issue queued up I/O unless we've
     2327 +                 * reached the end of scanning) and 'clearing' (issue queued
     2328 +                 * extents just to clear up some memory).
     2329 +                 */
     2330 +                mem_lim_t mlim = scan_io_queue_mem_lim(scn);
1792 2331  
1793      -        zfs_dbgmsg("visited %llu blocks in %llums",
1794      -            (longlong_t)scn->scn_visited_this_txg,
1795      -            (longlong_t)NSEC2MSEC(gethrtime() - scn->scn_sync_start_time));
     2332 +                if (mlim == MEM_LIM_HARD && !scn->scn_clearing)
     2333 +                        scn->scn_clearing = B_TRUE;
     2334 +                else if (mlim == MEM_LIM_NONE && scn->scn_clearing)
     2335 +                        scn->scn_clearing = B_FALSE;
1796 2336  
     2337 +                if ((scn->scn_checkpointing || ddi_get_lbolt() -
     2338 +                    scn->scn_last_checkpoint > ZFS_SCAN_CHECKPOINT_INTVAL) &&
     2339 +                    scn->scn_phys.scn_state != DSS_FINISHING &&
     2340 +                    !scn->scn_clearing) {
     2341 +                        scn->scn_checkpointing = B_TRUE;
     2342 +                }
     2343 +        }
     2344 +
     2345 +        if (!scn->scn_clearing && !scn->scn_checkpointing) {
     2346 +                scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
     2347 +                    NULL, ZIO_FLAG_CANFAIL);
     2348 +                dsl_pool_config_enter(dp, FTAG);
     2349 +                dsl_scan_visit(scn, tx);
     2350 +                dsl_pool_config_exit(dp, FTAG);
     2351 +                (void) zio_wait(scn->scn_zio_root);
     2352 +                scn->scn_zio_root = NULL;
     2353 +
     2354 +                zfs_dbgmsg("visited %llu blocks in %llums",
     2355 +                    (longlong_t)scn->scn_visited_this_txg,
     2356 +                    (longlong_t)NSEC2MSEC(gethrtime() -
     2357 +                    scn->scn_sync_start_time));
     2358 +
     2359 +                if (!scn->scn_suspending) {
     2360 +                        scn->scn_done_txg = tx->tx_txg + 1;
     2361 +                        zfs_dbgmsg("txg %llu traversal complete, waiting "
     2362 +                            "till txg %llu", tx->tx_txg, scn->scn_done_txg);
     2363 +                }
     2364 +        }
1797 2365          if (!scn->scn_suspending) {
1798 2366                  scn->scn_done_txg = tx->tx_txg + 1;
1799 2367                  zfs_dbgmsg("txg %llu traversal complete, waiting till txg %llu",
1800 2368                      tx->tx_txg, scn->scn_done_txg);
1801 2369          }
     2370 +finish:
     2371 +        if (scn->scn_is_sorted) {
     2372 +                dsl_pool_config_enter(dp, FTAG);
     2373 +                scan_io_queues_run(scn);
     2374 +                dsl_pool_config_exit(dp, FTAG);
     2375 +        }
1802 2376  
1803 2377          if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
1804 2378                  mutex_enter(&spa->spa_scrub_lock);
1805 2379                  while (spa->spa_scrub_inflight > 0) {
1806 2380                          cv_wait(&spa->spa_scrub_io_cv,
1807 2381                              &spa->spa_scrub_lock);
1808 2382                  }
1809 2383                  mutex_exit(&spa->spa_scrub_lock);
1810 2384          }
1811 2385  
1812      -        dsl_scan_sync_state(scn, tx);
     2386 +        dsl_scan_sync_state(scn, tx, SYNC_OPTIONAL);
1813 2387  }
1814 2388  
1815 2389  /*
1816 2390   * This will start a new scan, or restart an existing one.
1817 2391   */
1818 2392  void
1819 2393  dsl_resilver_restart(dsl_pool_t *dp, uint64_t txg)
1820 2394  {
     2395 +        /* Stop any ongoing TRIMs */
     2396 +        spa_man_trim_stop(dp->dp_spa);
     2397 +
1821 2398          if (txg == 0) {
1822 2399                  dmu_tx_t *tx;
1823 2400                  tx = dmu_tx_create_dd(dp->dp_mos_dir);
1824 2401                  VERIFY(0 == dmu_tx_assign(tx, TXG_WAIT));
1825 2402  
1826 2403                  txg = dmu_tx_get_txg(tx);
1827 2404                  dp->dp_scan->scn_restart_txg = txg;
1828 2405                  dmu_tx_commit(tx);
1829 2406          } else {
1830 2407                  dp->dp_scan->scn_restart_txg = txg;
1831 2408          }
1832 2409          zfs_dbgmsg("restarting resilver txg=%llu", txg);
1833 2410  }
1834 2411  
1835 2412  boolean_t
1836 2413  dsl_scan_resilvering(dsl_pool_t *dp)
1837 2414  {
1838      -        return (dp->dp_scan->scn_phys.scn_state == DSS_SCANNING &&
     2415 +        return (dsl_scan_is_running(dp->dp_scan) &&
1839 2416              dp->dp_scan->scn_phys.scn_func == POOL_SCAN_RESILVER);
1840 2417  }
1841 2418  
1842 2419  /*
1843 2420   * scrub consumers
1844 2421   */
1845 2422  
1846 2423  static void
1847      -count_block(zfs_all_blkstats_t *zab, const blkptr_t *bp)
     2424 +count_block(dsl_scan_t *scn, zfs_all_blkstats_t *zab, const blkptr_t *bp)
1848 2425  {
1849 2426          int i;
1850 2427  
     2428 +        for (i = 0; i < BP_GET_NDVAS(bp); i++)
     2429 +                atomic_add_64(&scn->scn_bytes_issued,
     2430 +                    DVA_GET_ASIZE(&bp->blk_dva[i]));
     2431 +
1851 2432          /*
1852 2433           * If we resume after a reboot, zab will be NULL; don't record
1853 2434           * incomplete stats in that case.
1854 2435           */
1855 2436          if (zab == NULL)
1856 2437                  return;
1857 2438  
1858 2439          for (i = 0; i < 4; i++) {
1859 2440                  int l = (i < 2) ? BP_GET_LEVEL(bp) : DN_MAX_LEVELS;
1860 2441                  int t = (i & 1) ? BP_GET_TYPE(bp) : DMU_OT_TOTAL;
1861 2442                  if (t & DMU_OT_NEWTYPE)
1862 2443                          t = DMU_OT_OTHER;
1863 2444                  zfs_blkstat_t *zb = &zab->zab_type[l][t];
1864 2445                  int equal;
1865 2446  
1866 2447                  zb->zb_count++;
1867 2448                  zb->zb_asize += BP_GET_ASIZE(bp);
1868 2449                  zb->zb_lsize += BP_GET_LSIZE(bp);
1869 2450                  zb->zb_psize += BP_GET_PSIZE(bp);
1870 2451                  zb->zb_gangs += BP_COUNT_GANG(bp);
1871 2452  
1872 2453                  switch (BP_GET_NDVAS(bp)) {
1873 2454                  case 2:
1874 2455                          if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1875 2456                              DVA_GET_VDEV(&bp->blk_dva[1]))
1876 2457                                  zb->zb_ditto_2_of_2_samevdev++;
1877 2458                          break;
1878 2459                  case 3:
1879 2460                          equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1880 2461                              DVA_GET_VDEV(&bp->blk_dva[1])) +
1881 2462                              (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1882 2463                              DVA_GET_VDEV(&bp->blk_dva[2])) +
1883 2464                              (DVA_GET_VDEV(&bp->blk_dva[1]) ==
1884 2465                              DVA_GET_VDEV(&bp->blk_dva[2]));
1885 2466                          if (equal == 1)
1886 2467                                  zb->zb_ditto_2_of_3_samevdev++;
1887 2468                          else if (equal == 3)
1888 2469                                  zb->zb_ditto_3_of_3_samevdev++;
1889 2470                          break;
1890 2471                  }
1891 2472          }
1892 2473  }
1893 2474  
  
       [33 lines elided]
  
1894 2475  static void
1895 2476  dsl_scan_scrub_done(zio_t *zio)
1896 2477  {
1897 2478          spa_t *spa = zio->io_spa;
1898 2479  
1899 2480          abd_free(zio->io_abd);
1900 2481  
1901 2482          mutex_enter(&spa->spa_scrub_lock);
1902 2483          spa->spa_scrub_inflight--;
1903 2484          cv_broadcast(&spa->spa_scrub_io_cv);
     2485 +        mutex_exit(&spa->spa_scrub_lock);
1904 2486  
1905 2487          if (zio->io_error && (zio->io_error != ECKSUM ||
1906 2488              !(zio->io_flags & ZIO_FLAG_SPECULATIVE))) {
1907      -                spa->spa_dsl_pool->dp_scan->scn_phys.scn_errors++;
     2489 +                atomic_inc_64(&spa->spa_dsl_pool->dp_scan->scn_phys.scn_errors);
     2490 +                DTRACE_PROBE1(scan_error, zio_t *, zio);
1908 2491          }
1909      -        mutex_exit(&spa->spa_scrub_lock);
1910 2492  }
1911 2493  
1912 2494  static int
1913 2495  dsl_scan_scrub_cb(dsl_pool_t *dp,
1914 2496      const blkptr_t *bp, const zbookmark_phys_t *zb)
1915 2497  {
1916 2498          dsl_scan_t *scn = dp->dp_scan;
1917      -        size_t size = BP_GET_PSIZE(bp);
1918 2499          spa_t *spa = dp->dp_spa;
1919 2500          uint64_t phys_birth = BP_PHYSICAL_BIRTH(bp);
1920 2501          boolean_t needs_io;
1921 2502          int zio_flags = ZIO_FLAG_SCAN_THREAD | ZIO_FLAG_RAW | ZIO_FLAG_CANFAIL;
1922      -        int scan_delay = 0;
     2503 +        boolean_t ignore_dva0;
1923 2504  
1924 2505          if (phys_birth <= scn->scn_phys.scn_min_txg ||
1925 2506              phys_birth >= scn->scn_phys.scn_max_txg)
1926 2507                  return (0);
1927 2508  
1928      -        count_block(dp->dp_blkstats, bp);
1929      -
1930      -        if (BP_IS_EMBEDDED(bp))
     2509 +        if (BP_IS_EMBEDDED(bp)) {
     2510 +                count_block(scn, dp->dp_blkstats, bp);
1931 2511                  return (0);
     2512 +        }
1932 2513  
1933 2514          ASSERT(DSL_SCAN_IS_SCRUB_RESILVER(scn));
1934      -        if (scn->scn_phys.scn_func == POOL_SCAN_SCRUB) {
     2515 +        if (scn->scn_phys.scn_func == POOL_SCAN_SCRUB ||
     2516 +            scn->scn_phys.scn_func == POOL_SCAN_MOS ||
     2517 +            scn->scn_phys.scn_func == POOL_SCAN_META) {
1935 2518                  zio_flags |= ZIO_FLAG_SCRUB;
1936 2519                  needs_io = B_TRUE;
1937      -                scan_delay = zfs_scrub_delay;
1938 2520          } else {
1939 2521                  ASSERT3U(scn->scn_phys.scn_func, ==, POOL_SCAN_RESILVER);
1940 2522                  zio_flags |= ZIO_FLAG_RESILVER;
1941 2523                  needs_io = B_FALSE;
1942      -                scan_delay = zfs_resilver_delay;
1943 2524          }
1944 2525  
1945 2526          /* If it's an intent log block, failure is expected. */
1946 2527          if (zb->zb_level == ZB_ZIL_LEVEL)
1947 2528                  zio_flags |= ZIO_FLAG_SPECULATIVE;
1948 2529  
     2530 +        if (scn->scn_phys.scn_func == POOL_SCAN_MOS)
     2531 +                needs_io = (zb->zb_objset == 0);
     2532 +
     2533 +        if (scn->scn_phys.scn_func == POOL_SCAN_META)
     2534 +                needs_io = zb->zb_objset == 0 || BP_GET_LEVEL(bp) != 0 ||
     2535 +                    DMU_OT_IS_METADATA(BP_GET_TYPE(bp));
     2536 +
     2537 +        DTRACE_PROBE3(scan_needs_io, boolean_t, needs_io,
     2538 +            const blkptr_t *, bp, spa_t *, spa);
     2539 +
     2540 +        /*
     2541 +         * WBC will invalidate DVA[0] after migrating the block to the main
     2542 +         * pool. If the user subsequently disables WBC and removes the special
     2543 +         * device, DVA[0] can now point to a hole vdev. We won't try to do
     2544 +         * I/O to it, but we must also avoid doing DTL checks.
     2545 +         */
     2546 +        ignore_dva0 = (BP_IS_SPECIAL(bp) &&
     2547 +            wbc_bp_is_migrated(spa_get_wbc_data(spa), bp));
     2548 +
1949 2549          for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
1950      -                vdev_t *vd = vdev_lookup_top(spa,
1951      -                    DVA_GET_VDEV(&bp->blk_dva[d]));
     2550 +                vdev_t *vd;
1952 2551  
1953 2552                  /*
1954 2553                   * Keep track of how much data we've examined so that
1955 2554                   * zpool(1M) status can make useful progress reports.
1956 2555                   */
1957 2556                  scn->scn_phys.scn_examined += DVA_GET_ASIZE(&bp->blk_dva[d]);
1958 2557                  spa->spa_scan_pass_exam += DVA_GET_ASIZE(&bp->blk_dva[d]);
1959 2558  
     2559 +                /* WBC-invalidated DVA post-migration, so skip it */
     2560 +                if (d == 0 && ignore_dva0)
     2561 +                        continue;
     2562 +                vd = vdev_lookup_top(spa, DVA_GET_VDEV(&bp->blk_dva[d]));
     2563 +
1960 2564                  /* if it's a resilver, this may not be in the target range */
1961      -                if (!needs_io) {
     2565 +                if (!needs_io && scn->scn_phys.scn_func != POOL_SCAN_MOS &&
     2566 +                    scn->scn_phys.scn_func != POOL_SCAN_META) {
1962 2567                          if (DVA_GET_GANG(&bp->blk_dva[d])) {
1963 2568                                  /*
1964 2569                                   * Gang members may be spread across multiple
1965 2570                                   * vdevs, so the best estimate we have is the
1966 2571                                   * scrub range, which has already been checked.
1967 2572                                   * XXX -- it would be better to change our
1968 2573                                   * allocation policy to ensure that all
1969 2574                                   * gang members reside on the same vdev.
1970 2575                                   */
1971 2576                                  needs_io = B_TRUE;
     2577 +                                DTRACE_PROBE2(gang_bp, const blkptr_t *, bp,
     2578 +                                    spa_t *, spa);
1972 2579                          } else {
1973 2580                                  needs_io = vdev_dtl_contains(vd, DTL_PARTIAL,
1974 2581                                      phys_birth, 1);
     2582 +                                if (needs_io)
     2583 +                                        DTRACE_PROBE2(dtl, const blkptr_t *,
     2584 +                                            bp, spa_t *, spa);
1975 2585                          }
1976 2586                  }
1977 2587          }
1978 2588  
1979 2589          if (needs_io && !zfs_no_scrub_io) {
1980      -                vdev_t *rvd = spa->spa_root_vdev;
1981      -                uint64_t maxinflight = rvd->vdev_children * zfs_top_maxinflight;
1982      -
1983      -                mutex_enter(&spa->spa_scrub_lock);
1984      -                while (spa->spa_scrub_inflight >= maxinflight)
1985      -                        cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
1986      -                spa->spa_scrub_inflight++;
1987      -                mutex_exit(&spa->spa_scrub_lock);
1988      -
1989      -                /*
1990      -                 * If we're seeing recent (zfs_scan_idle) "important" I/Os
1991      -                 * then throttle our workload to limit the impact of a scan.
1992      -                 */
1993      -                if (ddi_get_lbolt64() - spa->spa_last_io <= zfs_scan_idle)
1994      -                        delay(scan_delay);
1995      -
1996      -                zio_nowait(zio_read(NULL, spa, bp,
1997      -                    abd_alloc_for_io(size, B_FALSE), size, dsl_scan_scrub_done,
1998      -                    NULL, ZIO_PRIORITY_SCRUB, zio_flags, zb));
     2590 +                dsl_scan_enqueue(dp, bp, zio_flags, zb);
     2591 +        } else {
     2592 +                count_block(scn, dp->dp_blkstats, bp);
1999 2593          }
2000 2594  
2001 2595          /* do not relocate this block */
2002 2596          return (0);
2003 2597  }
2004 2598  
2005 2599  /*
2006 2600   * Called by the ZFS_IOC_POOL_SCAN ioctl to start a scrub or resilver.
2007 2601   * Can also be called to resume a paused scrub.
2008 2602   */
2009 2603  int
2010 2604  dsl_scan(dsl_pool_t *dp, pool_scan_func_t func)
2011 2605  {
2012 2606          spa_t *spa = dp->dp_spa;
2013 2607          dsl_scan_t *scn = dp->dp_scan;
2014 2608  
2015 2609          /*
2016 2610           * Purge all vdev caches and probe all devices.  We do this here
2017 2611           * rather than in sync context because this requires a writer lock
2018 2612           * on the spa_config lock, which we can't do from sync context.  The
2019 2613           * spa_scrub_reopen flag indicates that vdev_open() should not
2020 2614           * attempt to start another scrub.
2021 2615           */
  
       [13 lines elided]
  
2022 2616          spa_vdev_state_enter(spa, SCL_NONE);
2023 2617          spa->spa_scrub_reopen = B_TRUE;
2024 2618          vdev_reopen(spa->spa_root_vdev);
2025 2619          spa->spa_scrub_reopen = B_FALSE;
2026 2620          (void) spa_vdev_state_exit(spa, NULL, 0);
2027 2621  
2028 2622          if (func == POOL_SCAN_SCRUB && dsl_scan_is_paused_scrub(scn)) {
2029 2623                  /* got scrub start cmd, resume paused scrub */
2030 2624                  int err = dsl_scrub_set_pause_resume(scn->scn_dp,
2031 2625                      POOL_SCRUB_NORMAL);
2032      -                if (err == 0) {
2033      -                        spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_RESUME);
     2626 +                if (err == 0)
2034 2627                          return (ECANCELED);
2035      -                }
2036 2628  
2037 2629                  return (SET_ERROR(err));
2038 2630          }
2039 2631  
2040 2632          return (dsl_sync_task(spa_name(spa), dsl_scan_setup_check,
2041 2633              dsl_scan_setup_sync, &func, 0, ZFS_SPACE_CHECK_NONE));
2042 2634  }
2043 2635  
2044 2636  static boolean_t
2045 2637  dsl_scan_restarting(dsl_scan_t *scn, dmu_tx_t *tx)
2046 2638  {
2047 2639          return (scn->scn_restart_txg != 0 &&
2048 2640              scn->scn_restart_txg <= tx->tx_txg);
     2641 +}
     2642 +
     2643 +/*
     2644 + * Grand theory statement on scan queue sorting
     2645 + *
     2646 + * Scanning is implemented by recursively traversing all indirection levels
     2647 + * in an object and reading all blocks referenced from said objects. This
     2648 + * results in us approximately traversing the object from lowest logical
     2649 + * offset to the highest. Naturally, if we simply read all blocks in
     2650 + * this order, we would require that the blocks also be physically arranged
     2651 + * in sort of a linear fashion on the vdevs. However, this is frequently
     2652 + * not the case on pools. So we instead stick the I/Os into a reordering
     2653 + * queue and issue them out of logical order and in a way that most benefits
     2654 + * physical disks (LBA-order).
     2655 + *
     2656 + * This sorting algorithm is subject to limitations. We can't do this with
     2657 + * blocks that are non-leaf, because the scanner itself depends on these
     2658 + * being available ASAP for further metadata traversal. So we exclude any
     2659 + * block that is bp_level > 0. Fortunately, this usually represents only
     2660 + * around 1% of our data volume, so no great loss.
     2661 + *
     2662 + * As a further limitation, we cannot sort blocks which have more than
     2663 + * one DVA present (copies > 1), because there's no sensible way to sort
     2664 + * these (how do you sort a queue based on multiple contradictory
     2665 + * criteria?). So we exclude those as well. Again, these are very rarely
     2666 + * used for leaf blocks, usually only on metadata.
     2667 + *
     2668 + * WBC consideration: we can't sort blocks which have not yet been fully
     2669 + * migrated to normal devices, because their data can reside purely on the
     2670 + * special device or on both normal and special. This would require larger
     2671 + * data structures to track both DVAs in our queues and we need the
     2672 + * smallest in-core structures we can possibly get to get good sorting
     2673 + * performance. Therefore, blocks which have not yet been fully migrated
     2674 + * out of the WBC are processed as non-sortable and issued immediately.
     2675 + *
     2676 + * Queue management:
     2677 + *
     2678 + * Ideally, we would want to scan all metadata and queue up all leaf block
     2679 + * I/O prior to starting to issue it, because that allows us to do an
     2680 + * optimal sorting job. This can however consume large amounts of memory.
     2681 + * Therefore we continuously monitor the size of the queues and constrain
     2682 + * them to 5% (zfs_scan_mem_lim_fact) of physmem. If the queues grow larger
     2683 + * than this limit, we clear out a few of the largest extents at the head
     2684 + * of the queues to make room for more scanning. Hopefully, these extents
     2685 + * will be fairly large and contiguous, allowing us to approach sequential
     2686 + * I/O throughput even without a fully sorted tree.
     2687 + *
     2688 + * Metadata scanning takes place in dsl_scan_visit(), which is called from
     2689 + * dsl_scan_sync() every spa_sync(). If we have either fully scanned all
     2690 + * metadata on the pool, or we need to make room in memory because our
     2691 + * queues are too large, dsl_scan_visit() is postponed and
     2692 + * scan_io_queues_run() is called from dsl_scan_sync() instead. That means
     2693 + * metadata scanning and queued I/O issuing are mutually exclusive. This is
     2694 + * to provide maximum sequential I/O throughput for the queued I/O issue
     2695 + * process. Sequential I/O performance is significantly negatively impacted
     2696 + * if it is interleaved with random I/O.
     2697 + *
     2698 + * Backwards compatibility
     2699 + *
     2700 + * This new algorithm is backwards compatible with the legacy on-disk data
     2701 + * structures. If the pool is imported on a machine without the new sorting
     2702 + * algorithm, the scan simply resumes from the last checkpoint.
     2703 + */
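
To make the queue-management state machine above concrete, the following condensed restatement (it paraphrases the dsl_scan_sync() hunk earlier in this webrev and introduces no new interfaces) shows how each sync pass chooses between metadata scanning and queued I/O issue:

        /* condensed from the dsl_scan_sync() changes above */
        mem_lim_t mlim = scan_io_queue_mem_lim(scn);

        if (mlim == MEM_LIM_HARD)
                scn->scn_clearing = B_TRUE;     /* queues too large: drain them */
        else if (mlim == MEM_LIM_NONE)
                scn->scn_clearing = B_FALSE;    /* memory is fine: keep scanning */

        if (!scn->scn_clearing && !scn->scn_checkpointing)
                dsl_scan_visit(scn, tx);        /* gather more scan_io_t's */

        if (scn->scn_is_sorted)
                scan_io_queues_run(scn);        /* issue queued extents sequentially */
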
     2704 +
     2705 +/*
     2706 + * Given a set of I/O parameters as discovered by the metadata traversal
     2707 + * process, attempts to place the I/O into the reordering queue (if
     2708 + * possible), or immediately executes the I/O. The check for whether an
     2709 + * I/O is suitable for sorting is performed here.
     2710 + */
     2711 +static void
     2712 +dsl_scan_enqueue(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
     2713 +    const zbookmark_phys_t *zb)
     2714 +{
     2715 +        spa_t *spa = dp->dp_spa;
     2716 +
     2717 +        ASSERT(!BP_IS_EMBEDDED(bp));
     2718 +        if (!dp->dp_scan->scn_is_sorted || (BP_IS_SPECIAL(bp) &&
     2719 +            !wbc_bp_is_migrated(spa_get_wbc_data(spa), bp))) {
     2720 +                scan_exec_io(dp, bp, zio_flags, zb, B_TRUE);
     2721 +                return;
     2722 +        }
     2723 +
     2724 +        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
     2725 +                dva_t dva;
     2726 +                vdev_t *vdev;
     2727 +
     2728 +                /* On special BPs we only support handling the normal DVA */
     2729 +                if (BP_IS_SPECIAL(bp) && i != WBC_NORMAL_DVA)
     2730 +                        continue;
     2731 +
     2732 +                dva = bp->blk_dva[i];
     2733 +                vdev = vdev_lookup_top(spa, DVA_GET_VDEV(&dva));
     2734 +                ASSERT(vdev != NULL);
     2735 +
     2736 +                mutex_enter(&vdev->vdev_scan_io_queue_lock);
     2737 +                if (vdev->vdev_scan_io_queue == NULL)
     2738 +                        vdev->vdev_scan_io_queue = scan_io_queue_create(vdev);
     2739 +                ASSERT(dp->dp_scan != NULL);
     2740 +                scan_io_queue_insert(dp->dp_scan, vdev->vdev_scan_io_queue, bp,
     2741 +                    i, zio_flags, zb);
     2742 +                mutex_exit(&vdev->vdev_scan_io_queue_lock);
     2743 +        }
     2744 +}
     2745 +
     2746 +/*
     2747 + * Given a scanning zio's information, executes the zio. The zio need
     2748 + * not necessarily be sortable; this function simply executes the
     2749 + * zio, no matter what it is. The limit_inflight flag controls whether
     2750 + * we limit the number of concurrently executing scan zio's to
     2751 + * zfs_top_maxinflight times the number of top-level vdevs. This is
     2752 + * used during metadata discovery to pace the generation of I/O and
     2753 + * properly time the pausing of the scanning algorithm. The queue
     2754 + * processing part uses a different method of controlling timing and
     2755 + * so doesn't need this limit applied to its zio's.
     2756 + */
     2757 +static void
     2758 +scan_exec_io(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
     2759 +    const zbookmark_phys_t *zb, boolean_t limit_inflight)
     2760 +{
     2761 +        spa_t *spa = dp->dp_spa;
     2762 +        size_t size = BP_GET_PSIZE(bp);
     2763 +        vdev_t *rvd = spa->spa_root_vdev;
     2764 +        uint64_t maxinflight = rvd->vdev_children * zfs_top_maxinflight;
     2765 +        dsl_scan_t *scn = dp->dp_scan;
     2766 +        zio_priority_t prio;
     2767 +
     2768 +        mutex_enter(&spa->spa_scrub_lock);
     2769 +        while (limit_inflight && spa->spa_scrub_inflight >= maxinflight)
     2770 +                cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
     2771 +        spa->spa_scrub_inflight++;
     2772 +        mutex_exit(&spa->spa_scrub_lock);
     2773 +
     2774 +        for (int i = 0; i < BP_GET_NDVAS(bp); i++)
     2775 +                atomic_add_64(&spa->spa_scan_pass_work,
     2776 +                    DVA_GET_ASIZE(&bp->blk_dva[i]));
     2777 +
     2778 +        count_block(dp->dp_scan, dp->dp_blkstats, bp);
     2779 +        DTRACE_PROBE3(do_io, uint64_t, dp->dp_scan->scn_phys.scn_func,
     2780 +            boolean_t, B_TRUE, spa_t *, spa);
     2781 +        prio = (scn->scn_phys.scn_func == POOL_SCAN_RESILVER ?
     2782 +            ZIO_PRIORITY_RESILVER : ZIO_PRIORITY_SCRUB);
     2783 +        zio_nowait(zio_read(NULL, spa, bp, abd_alloc_for_io(size, B_FALSE),
     2784 +            size, dsl_scan_scrub_done, NULL, prio, zio_flags, zb));
     2785 +}
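
As a rough worked example of the in-flight cap described above (the numbers are hypothetical, not values taken from this webrev): if zfs_top_maxinflight were 32 and the pool had 10 top-level vdevs, metadata discovery could keep at most 32 * 10 = 320 scan reads outstanding before scan_exec_io() blocks on spa_scrub_io_cv. The queue-driven issue path is paced by its own mechanism, as the comment notes, and does not rely on this cap.
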
     2786 +
     2787 +/*
     2788 + * Given all the info we got from our metadata scanning process, we
     2789 + * construct a scan_io_t and insert it into the scan sorting queue. The
     2790 + * I/O must already be suitable for us to process. This is controlled
     2791 + * by dsl_scan_enqueue().
     2792 + */
     2793 +static void
     2794 +scan_io_queue_insert(dsl_scan_t *scn, dsl_scan_io_queue_t *queue,
     2795 +    const blkptr_t *bp, int dva_i, int zio_flags, const zbookmark_phys_t *zb)
     2796 +{
     2797 +        scan_io_t *sio = kmem_zalloc(sizeof (*sio), KM_SLEEP);
     2798 +        avl_index_t idx;
     2799 +        uint64_t offset, asize;
     2800 +
     2801 +        ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
     2802 +
     2803 +        bp2sio(bp, sio, dva_i);
     2804 +        sio->sio_flags = zio_flags;
     2805 +        sio->sio_zb = *zb;
     2806 +        offset = SCAN_IO_GET_OFFSET(sio);
     2807 +        asize = sio->sio_asize;
     2808 +
     2809 +        if (avl_find(&queue->q_zios_by_addr, sio, &idx) != NULL) {
     2810 +                /* block is already scheduled for reading */
     2811 +                kmem_free(sio, sizeof (*sio));
     2812 +                return;
     2813 +        }
     2814 +        avl_insert(&queue->q_zios_by_addr, sio, idx);
     2815 +        atomic_add_64(&queue->q_zio_bytes, asize);
     2816 +
     2817 +        /*
     2818 +         * Increment the bytes pending counter now so that we can't
     2819 +         * get an integer underflow in case the worker processes the
     2820 +         * zio before we get to incrementing this counter.
     2821 +         */
     2822 +        mutex_enter(&scn->scn_status_lock);
     2823 +        scn->scn_bytes_pending += asize;
     2824 +        mutex_exit(&scn->scn_status_lock);
     2825 +
     2826 +        range_tree_set_gap(queue->q_exts_by_addr, zfs_scan_max_ext_gap);
     2827 +        range_tree_add_fill(queue->q_exts_by_addr, offset, asize, asize);
     2828 +}
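
For intuition about what range_tree_set_gap() and range_tree_add_fill() accomplish here, consider the sketch below. It is illustrative only: the 2 MiB gap, the offsets and the block sizes are made-up values, and the coalescing behavior described in the comment reflects a reading of the fill/gap semantics referenced by NEX-9553 rather than code visible in this webrev.

        /*
         * Hypothetical example: with a maximum extent gap of 2 MiB, a
         * 512 KiB block at offset 0 and another at offset 1.5 MiB would
         * be expected to coalesce into a single range_seg_t spanning
         * [0, 2 MiB) with rs_fill = 1 MiB, so only one extent competes
         * for a slot in q_exts_by_size.
         */
        range_tree_set_gap(queue->q_exts_by_addr, 2ULL << 20);
        range_tree_add_fill(queue->q_exts_by_addr, 0, 512 << 10, 512 << 10);
        range_tree_add_fill(queue->q_exts_by_addr,
            3ULL << 19, 512 << 10, 512 << 10);
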
     2829 +
     2830 +/* q_exts_by_addr segment add callback. */
     2831 +/*ARGSUSED*/
     2832 +static void
     2833 +scan_io_queue_insert_cb(range_tree_t *rt, range_seg_t *rs, void *arg)
     2834 +{
     2835 +        dsl_scan_io_queue_t *queue = arg;
     2836 +        avl_index_t idx;
     2837 +        ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
     2838 +        VERIFY3P(avl_find(&queue->q_exts_by_size, rs, &idx), ==, NULL);
     2839 +        avl_insert(&queue->q_exts_by_size, rs, idx);
     2840 +}
     2841 +
     2842 +/* q_exts_by_addr segment remove callback. */
     2843 +/*ARGSUSED*/
     2844 +static void
     2845 +scan_io_queue_remove_cb(range_tree_t *rt, range_seg_t *rs, void *arg)
     2846 +{
     2847 +        dsl_scan_io_queue_t *queue = arg;
     2848 +        avl_remove(&queue->q_exts_by_size, rs);
     2849 +}
     2850 +
     2851 +/* q_exts_by_addr vacate callback. */
     2852 +/*ARGSUSED*/
     2853 +static void
     2854 +scan_io_queue_vacate_cb(range_tree_t *rt, void *arg)
     2855 +{
     2856 +        dsl_scan_io_queue_t *queue = arg;
     2857 +        void *cookie = NULL;
     2858 +        while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL)
     2859 +                ;
     2860 +}
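
These three callbacks keep q_exts_by_size synchronized with q_exts_by_addr; they are wired into the range tree through the scan_io_queue_ops table passed to range_tree_create() in scan_io_queue_create() below. The table's definition is outside the lines shown here, so the following is only a guess at its shape, assuming the conventional illumos range_tree_ops_t layout of create/destroy/add/remove/vacate callbacks in that order:

        /* Assumed wiring; the actual definition is not visible in this hunk. */
        static range_tree_ops_t scan_io_queue_ops = {
                NULL,                           /* create */
                NULL,                           /* destroy */
                scan_io_queue_insert_cb,        /* add */
                scan_io_queue_remove_cb,        /* remove */
                scan_io_queue_vacate_cb         /* vacate */
        };
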
     2861 +
     2862 +/*
     2863 + * This is the primary extent sorting algorithm. We balance two parameters:
     2864 + * 1) how many bytes of I/O are in an extent
     2865 + * 2) how well the extent is filled with I/O (as a fraction of its total size)
     2866 + * Since we allow extents to have gaps between their constituent I/Os, it's
     2867 + * possible to have a fairly large extent that contains the same amount of
     2868 + * I/O bytes as a much smaller extent, which just packs the I/O more tightly.
     2869 + * The algorithm sorts based on a score calculated from the extent's size,
     2870 + * the relative fill volume (in %) and a "fill weight" parameter that controls
     2871 + * the split between whether we prefer larger extents or more well populated
     2872 + * extents:
     2873 + *
     2874 + * SCORE = FILL_IN_BYTES + (FILL_IN_PERCENT * FILL_IN_BYTES * FILL_WEIGHT)
     2875 + *
     2876 + * Example:
     2877 + * 1) assume extsz = 64 MiB
     2878 + * 2) assume fill = 32 MiB (extent is half full)
     2879 + * 3) assume fill_weight = 3
     2880 + * 4)   SCORE = 32M + (((32M * 100) / 64M) * 3 * 32M) / 100
     2881 + *      SCORE = 32M + (50 * 3 * 32M) / 100
     2882 + *      SCORE = 32M + (4800M / 100)
     2883 + *      SCORE = 32M + 48M
     2884 + *               ^     ^
     2885 + *               |     +--- final total relative fill-based score
     2886 + *               +--------- final total fill-based score
     2887 + *      SCORE = 80M
     2888 + *
     2889 + * As can be seen, at fill_weight=3, the algorithm is slightly biased towards
     2890 + * extents that are more completely filled (in a 3:2 ratio) vs just larger.
     2891 + */
     2892 +static int
     2893 +ext_size_compar(const void *x, const void *y)
     2894 +{
     2895 +        const range_seg_t *rsa = x, *rsb = y;
     2896 +        uint64_t sa = rsa->rs_end - rsa->rs_start,
     2897 +            sb = rsb->rs_end - rsb->rs_start;
     2898 +        uint64_t score_a, score_b;
     2899 +
     2900 +        score_a = rsa->rs_fill + (((rsa->rs_fill * 100) / sa) *
     2901 +            fill_weight * rsa->rs_fill) / 100;
     2902 +        score_b = rsb->rs_fill + (((rsb->rs_fill * 100) / sb) *
     2903 +            fill_weight * rsb->rs_fill) / 100;
     2904 +
     2905 +        if (score_a > score_b)
     2906 +                return (-1);
     2907 +        if (score_a == score_b) {
     2908 +                if (rsa->rs_start < rsb->rs_start)
     2909 +                        return (-1);
     2910 +                if (rsa->rs_start == rsb->rs_start)
     2911 +                        return (0);
     2912 +                return (1);
     2913 +        }
     2914 +        return (1);
     2915 +}
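
As a quick sanity check of the worked example in the comment above, the following small userland snippet (not part of the patch; ext_score() is a hypothetical helper that simply mirrors the arithmetic used in ext_size_compar()) reproduces the 80M figure:

        #include <stdint.h>
        #include <stdio.h>

        /* Mirrors the score arithmetic in ext_size_compar(). */
        static uint64_t
        ext_score(uint64_t ext_size, uint64_t fill, uint64_t fill_weight)
        {
                return (fill + (((fill * 100) / ext_size) * fill_weight * fill) / 100);
        }

        int
        main(void)
        {
                uint64_t m = 1024ULL * 1024;    /* 1 MiB */

                /* 64 MiB extent, 32 MiB filled, fill_weight = 3 -> prints 80 (MiB) */
                printf("%llu\n",
                    (unsigned long long)(ext_score(64 * m, 32 * m, 3) / m));
                return (0);
        }
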
     2916 +
     2917 +/*
     2918 + * Comparator for the q_zios_by_addr tree. Sorting is simply performed
     2919 + * based on LBA-order (from lowest to highest).
     2920 + */
     2921 +static int
     2922 +io_addr_compar(const void *x, const void *y)
     2923 +{
     2924 +        const scan_io_t *a = x, *b = y;
     2925 +        uint64_t off_a = SCAN_IO_GET_OFFSET(a);
     2926 +        uint64_t off_b = SCAN_IO_GET_OFFSET(b);
     2927 +        if (off_a < off_b)
     2928 +                return (-1);
     2929 +        if (off_a == off_b)
     2930 +                return (0);
     2931 +        return (1);
     2932 +}
     2933 +
     2934 +static dsl_scan_io_queue_t *
     2935 +scan_io_queue_create(vdev_t *vd)
     2936 +{
     2937 +        dsl_scan_t *scn = vd->vdev_spa->spa_dsl_pool->dp_scan;
     2938 +        dsl_scan_io_queue_t *q = kmem_zalloc(sizeof (*q), KM_SLEEP);
     2939 +
     2940 +        q->q_scn = scn;
     2941 +        q->q_vd = vd;
     2942 +        cv_init(&q->q_cv, NULL, CV_DEFAULT, NULL);
     2943 +        q->q_exts_by_addr = range_tree_create(&scan_io_queue_ops, q,
     2944 +            &q->q_vd->vdev_scan_io_queue_lock);
     2945 +        avl_create(&q->q_exts_by_size, ext_size_compar,
     2946 +            sizeof (range_seg_t), offsetof(range_seg_t, rs_pp_node));
     2947 +        avl_create(&q->q_zios_by_addr, io_addr_compar,
     2948 +            sizeof (scan_io_t), offsetof(scan_io_t, sio_nodes.sio_addr_node));
     2949 +
     2950 +        return (q);
     2951 +}
     2952 +
     2953 +/*
     2954 + * Destroys a scan queue and all segments and scan_io_t's contained in it.
     2955 + * No further execution of I/O occurs; anything pending in the queue is
     2956 + * simply dropped. Prior to calling this, the queue should have been
     2957 + * removed from its parent top-level vdev, hence holding the queue's
     2958 + * lock is not permitted.
     2959 + */
     2960 +void
     2961 +dsl_scan_io_queue_destroy(dsl_scan_io_queue_t *queue)
     2962 +{
     2963 +        dsl_scan_t *scn = queue->q_scn;
     2964 +        scan_io_t *sio;
     2965 +        uint64_t bytes_dequeued = 0;
     2966 +        kmutex_t *q_lock = &queue->q_vd->vdev_scan_io_queue_lock;
     2967 +
     2968 +        ASSERT(!MUTEX_HELD(q_lock));
     2969 +
     2970 +#ifdef  DEBUG   /* This is for the ASSERT(range_tree_contains... below */
     2971 +        mutex_enter(q_lock);
     2972 +#endif
     2973 +        while ((sio = avl_first(&queue->q_zios_by_addr)) != NULL) {
     2974 +                ASSERT(range_tree_contains(queue->q_exts_by_addr,
     2975 +                    SCAN_IO_GET_OFFSET(sio), sio->sio_asize));
     2976 +                bytes_dequeued += sio->sio_asize;
     2977 +                avl_remove(&queue->q_zios_by_addr, sio);
     2978 +                kmem_free(sio, sizeof (*sio));
     2979 +        }
     2980 +#ifdef  DEBUG
     2981 +        mutex_exit(q_lock);
     2982 +#endif
     2983 +
     2984 +        mutex_enter(&scn->scn_status_lock);
     2985 +        ASSERT3U(scn->scn_bytes_pending, >=, bytes_dequeued);
     2986 +        scn->scn_bytes_pending -= bytes_dequeued;
     2987 +        mutex_exit(&scn->scn_status_lock);
     2988 +
     2989 +        /* lock here to avoid tripping assertion in range_tree_vacate */
     2990 +        mutex_enter(q_lock);
     2991 +        range_tree_vacate(queue->q_exts_by_addr, NULL, queue);
     2992 +        mutex_exit(q_lock);
     2993 +
     2994 +        range_tree_destroy(queue->q_exts_by_addr);
     2995 +        avl_destroy(&queue->q_exts_by_size);
     2996 +        avl_destroy(&queue->q_zios_by_addr);
     2997 +        cv_destroy(&queue->q_cv);
     2998 +
     2999 +        kmem_free(queue, sizeof (*queue));
     3000 +}
     3001 +
     3002 +/*
     3003 + * Properly transfers a dsl_scan_io_queue_t from `svd' to `tvd'. This is
     3004 + * called on behalf of vdev_top_transfer when creating or destroying
     3005 + * a mirror vdev due to zpool attach/detach.
     3006 + */
     3007 +void
     3008 +dsl_scan_io_queue_vdev_xfer(vdev_t *svd, vdev_t *tvd)
     3009 +{
     3010 +        mutex_enter(&svd->vdev_scan_io_queue_lock);
     3011 +        mutex_enter(&tvd->vdev_scan_io_queue_lock);
     3012 +
     3013 +        VERIFY3P(tvd->vdev_scan_io_queue, ==, NULL);
     3014 +        tvd->vdev_scan_io_queue = svd->vdev_scan_io_queue;
     3015 +        svd->vdev_scan_io_queue = NULL;
     3016 +        if (tvd->vdev_scan_io_queue != NULL) {
     3017 +                tvd->vdev_scan_io_queue->q_vd = tvd;
     3018 +                range_tree_set_lock(tvd->vdev_scan_io_queue->q_exts_by_addr,
     3019 +                    &tvd->vdev_scan_io_queue_lock);
     3020 +        }
     3021 +
     3022 +        mutex_exit(&tvd->vdev_scan_io_queue_lock);
     3023 +        mutex_exit(&svd->vdev_scan_io_queue_lock);
     3024 +}
     3025 +
     3026 +static void
     3027 +scan_io_queues_destroy(dsl_scan_t *scn)
     3028 +{
     3029 +        vdev_t *rvd = scn->scn_dp->dp_spa->spa_root_vdev;
     3030 +
     3031 +        for (uint64_t i = 0; i < rvd->vdev_children; i++) {
     3032 +                vdev_t *tvd = rvd->vdev_child[i];
     3033 +                dsl_scan_io_queue_t *queue;
     3034 +
     3035 +                mutex_enter(&tvd->vdev_scan_io_queue_lock);
     3036 +                queue = tvd->vdev_scan_io_queue;
     3037 +                tvd->vdev_scan_io_queue = NULL;
     3038 +                mutex_exit(&tvd->vdev_scan_io_queue_lock);
     3039 +
     3040 +                if (queue != NULL)
     3041 +                        dsl_scan_io_queue_destroy(queue);
     3042 +        }
     3043 +}
     3044 +
     3045 +/*
     3046 + * Computes the memory limit state that we're currently in. A sorted scan
     3047 + * needs quite a bit of memory to hold the sorting queues, so we need to
     3048 + * reasonably constrain their size so they don't impact overall system
     3049 + * performance. We compute two limits:
     3050 + * 1) Hard memory limit: if the amount of memory used by the sorting
     3051 + *      queues on a pool gets above this value, we stop the metadata
     3052 + *      scanning portion and start issuing the queued up and sorted
     3053 + *      I/Os to reduce memory usage.
     3054 + *      This limit is calculated as a fraction of physmem (by default 5%).
     3055 + *      We constrain the lower bound of the hard limit to an absolute
     3056 + *      minimum of zfs_scan_mem_lim_min (default: 16 MiB). We also constrain
     3057 + *      the upper bound to 5% of the total pool size - no chance we'll
     3058 + *      ever need that much memory, but just to keep the value in check.
     3059 + * 2) Soft memory limit: once we hit the hard memory limit, we start
     3060 + *      issuing I/O to lower queue memory usage, but we don't want to
     3061 + *      completely empty them out, as having more in the queues allows
     3062 + *      us to make better sorting decisions. So we stop the issuing of
     3063 + *      I/Os once the amount of memory used drops below the soft limit
     3064 + *      (at which point we stop issuing I/O and start scanning metadata
     3065 + *      again).
     3066 + *      The limit is calculated by subtracting a fraction of the hard
     3067 + *      limit from the hard limit. By default this fraction is 10%, so
     3068 + *      the soft limit is 90% of the hard limit. We cap the size of the
     3069 + *      difference between the hard and soft limits at an absolute
     3070 + *      maximum of zfs_scan_mem_lim_soft_max (default: 128 MiB) - this is
     3071 + *      sufficient to not cause too frequent switching between the
     3072 + *      metadata scan and I/O issue (even at 2k recordsize, 128 MiB's
     3073 + *      worth of queues is about 1.2 GiB of on-pool data, so scanning
     3074 + *      that should take at least a decent fraction of a second).
     3075 + */
     3076 +static mem_lim_t
     3077 +scan_io_queue_mem_lim(dsl_scan_t *scn)
     3078 +{
     3079 +        vdev_t *rvd = scn->scn_dp->dp_spa->spa_root_vdev;
     3080 +        uint64_t mlim_hard, mlim_soft, mused;
     3081 +        uint64_t alloc = metaslab_class_get_alloc(spa_normal_class(
     3082 +            scn->scn_dp->dp_spa));
     3083 +
     3084 +        mlim_hard = MAX((physmem / zfs_scan_mem_lim_fact) * PAGESIZE,
     3085 +            zfs_scan_mem_lim_min);
     3086 +        mlim_hard = MIN(mlim_hard, alloc / 20);
     3087 +        mlim_soft = mlim_hard - MIN(mlim_hard / zfs_scan_mem_lim_soft_fact,
     3088 +            zfs_scan_mem_lim_soft_max);
     3089 +        mused = 0;
     3090 +        for (uint64_t i = 0; i < rvd->vdev_children; i++) {
     3091 +                vdev_t *tvd = rvd->vdev_child[i];
     3092 +                dsl_scan_io_queue_t *queue;
     3093 +
     3094 +                mutex_enter(&tvd->vdev_scan_io_queue_lock);
     3095 +                queue = tvd->vdev_scan_io_queue;
     3096 +                if (queue != NULL) {
     3097 +                        /* #extents in exts_by_size = # in exts_by_addr */
     3098 +                        mused += avl_numnodes(&queue->q_exts_by_size) *
     3099 +                            sizeof (range_seg_t) +
     3100 +                            (avl_numnodes(&queue->q_zios_by_addr) +
     3101 +                            queue->q_num_issuing_zios) * sizeof (scan_io_t);
     3102 +                }
     3103 +                mutex_exit(&tvd->vdev_scan_io_queue_lock);
     3104 +        }
     3105 +        DTRACE_PROBE4(queue_mem_lim, dsl_scan_t *, scn, uint64_t, mlim_hard,
     3106 +            uint64_t, mlim_soft, uint64_t, mused);
     3107 +
     3108 +        if (mused >= mlim_hard)
     3109 +                return (MEM_LIM_HARD);
     3110 +        else if (mused >= mlim_soft)
     3111 +                return (MEM_LIM_SOFT);
     3112 +        else
     3113 +                return (MEM_LIM_NONE);
     3114 +}
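
As a rough illustration of the limit arithmetic described in the comment above, the following user-space sketch (not part of the patch) plugs assumed values into the same formulas: 32 GiB of RAM, 10 TiB allocated on the pool, and tunables consistent with the percentages quoted in the comment. For simplicity physmem is expressed directly in bytes here, whereas the kernel code works in pages times PAGESIZE.

#include <stdio.h>
#include <stdint.h>

#define MiB     (1024ULL * 1024ULL)
#define GiB     (1024ULL * MiB)

static uint64_t umax(uint64_t a, uint64_t b) { return (a > b ? a : b); }
static uint64_t umin(uint64_t a, uint64_t b) { return (a < b ? a : b); }

int
main(void)
{
        /* assumed machine and pool */
        uint64_t phys_bytes = 32 * GiB;
        uint64_t alloc = 10240ULL * GiB;

        /* assumed tunables, consistent with the defaults quoted above */
        uint64_t lim_fact = 20;         /* hard limit = physmem / 20 = 5% */
        uint64_t lim_min = 16 * MiB;    /* zfs_scan_mem_lim_min */
        uint64_t soft_fact = 10;        /* soft gap = hard / 10 */
        uint64_t soft_max = 128 * MiB;  /* zfs_scan_mem_lim_soft_max */

        uint64_t mlim_hard = umax(phys_bytes / lim_fact, lim_min);
        mlim_hard = umin(mlim_hard, alloc / 20);        /* cap at 5% of pool */
        uint64_t mlim_soft = mlim_hard -
            umin(mlim_hard / soft_fact, soft_max);

        /* prints "hard: 1638 MiB, soft: 1510 MiB" for these inputs */
        printf("hard: %llu MiB, soft: %llu MiB\n",
            (unsigned long long)(mlim_hard / MiB),
            (unsigned long long)(mlim_soft / MiB));
        return (0);
}
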
     3115 +
     3116 +/*
     3117 + * Given a list of scan_io_t's in io_list, this issues the io's out to
     3118 + * disk in the order given. This consumes the io_list and frees the
     3119 + * scan_io_t's.
     3120 + * This is called when emptying queues, either when we're up against
     3121 + * the memory limit or we have finished scanning.
     3122 + */
     3123 +static void
     3124 +scan_io_queue_issue(list_t *io_list, dsl_scan_io_queue_t *queue)
     3125 +{
     3126 +        dsl_scan_t *scn = queue->q_scn;
     3127 +        scan_io_t *sio;
     3128 +        int64_t bytes_issued = 0;
     3129 +
     3130 +        while ((sio = list_head(io_list)) != NULL) {
     3131 +                blkptr_t bp;
     3132 +
     3133 +                sio2bp(sio, &bp, queue->q_vd->vdev_id);
     3134 +                bytes_issued += sio->sio_asize;
     3135 +                scan_exec_io(scn->scn_dp, &bp, sio->sio_flags, &sio->sio_zb,
     3136 +                    B_FALSE);
     3137 +                (void) list_remove_head(io_list);
     3138 +                ASSERT(queue->q_num_issuing_zios > 0);
     3139 +                atomic_dec_64(&queue->q_num_issuing_zios);
     3140 +                kmem_free(sio, sizeof (*sio));
     3141 +        }
     3142 +
     3143 +        mutex_enter(&scn->scn_status_lock);
     3144 +        ASSERT3U(scn->scn_bytes_pending, >=, bytes_issued);
     3145 +        scn->scn_bytes_pending -= bytes_issued;
     3146 +        mutex_exit(&scn->scn_status_lock);
     3147 +
     3148 +        ASSERT3U(queue->q_zio_bytes, >=, bytes_issued);
     3149 +        atomic_add_64(&queue->q_zio_bytes, -bytes_issued);
     3150 +
     3151 +        list_destroy(io_list);
     3152 +}
     3153 +
     3154 +/*
     3155 + * Given a range_seg_t (extent) and a list, this function passes over a
     3156 + * scan queue and gathers up the appropriate ios which fit into that
     3157 + * scan seg (starting from lowest LBA). While doing so, we take care
     3158 + * not to exceed `limit' in the total number of scan_io_t bytes
     3159 + * gathered. At the end, we remove the appropriate amount of space
     3160 + * from the q_exts_by_addr. If we have consumed the entire scan seg, we
     3161 + * remove it completely from q_exts_by_addr. If we've only consumed a
     3162 + * portion of it, we shorten the scan seg appropriately. A future call
     3163 + * will consume more of the scan seg's constituent io's until the
     3164 + * extent is completely consumed. If we've reduced the size of the
     3165 + * scan seg, we of course reinsert it in the appropriate spot in the
     3166 + * q_exts_by_size tree.
     3167 + */
     3168 +static uint64_t
     3169 +scan_io_queue_gather(const range_seg_t *rs, list_t *list,
     3170 +    dsl_scan_io_queue_t *queue, uint64_t limit)
     3171 +{
     3172 +        scan_io_t srch_sio, *sio, *next_sio;
     3173 +        avl_index_t idx;
     3174 +        int64_t num_zios = 0, bytes = 0;
     3175 +        boolean_t size_limited = B_FALSE;
     3176 +
     3177 +        ASSERT(rs != NULL);
     3178 +        ASSERT3U(limit, !=, 0);
     3179 +        ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
     3180 +
     3181 +        list_create(list, sizeof (scan_io_t),
     3182 +            offsetof(scan_io_t, sio_nodes.sio_list_node));
     3183 +        SCAN_IO_SET_OFFSET(&srch_sio, rs->rs_start);
     3184 +
     3185 +        /*
     3186 +         * The exact start of the extent might not contain any matching zios,
     3187 +         * so if that's the case, examine the next one in the tree.
     3188 +         */
     3189 +        sio = avl_find(&queue->q_zios_by_addr, &srch_sio, &idx);
     3190 +        if (sio == NULL)
     3191 +                sio = avl_nearest(&queue->q_zios_by_addr, idx, AVL_AFTER);
     3192 +
     3193 +        while (sio != NULL && SCAN_IO_GET_OFFSET(sio) < rs->rs_end) {
     3194 +                if (bytes >= limit) {
     3195 +                        size_limited = B_TRUE;
     3196 +                        break;
     3197 +                }
     3198 +                ASSERT3U(SCAN_IO_GET_OFFSET(sio), >=, rs->rs_start);
     3199 +                ASSERT3U(SCAN_IO_GET_OFFSET(sio) + sio->sio_asize, <=,
     3200 +                    rs->rs_end);
     3201 +
     3202 +                next_sio = AVL_NEXT(&queue->q_zios_by_addr, sio);
     3203 +                avl_remove(&queue->q_zios_by_addr, sio);
     3204 +                list_insert_tail(list, sio);
     3205 +                num_zios++;
     3206 +                bytes += sio->sio_asize;
     3207 +                sio = next_sio;
     3208 +        }
     3209 +
     3210 +        if (size_limited) {
     3211 +                uint64_t end;
     3212 +                sio = list_tail(list);
     3213 +                end = SCAN_IO_GET_OFFSET(sio) + sio->sio_asize;
     3214 +                range_tree_remove_fill(queue->q_exts_by_addr, rs->rs_start,
     3215 +                    end - rs->rs_start, bytes, 0);
     3216 +        } else {
     3217 +                /*
     3218 +                 * Whole extent consumed, remove it all, including any head
     3219 +                 * or tail overhang.
     3220 +                 */
     3221 +                range_tree_remove_fill(queue->q_exts_by_addr, rs->rs_start,
     3222 +                    rs->rs_end - rs->rs_start, bytes, 0);
     3223 +        }
     3224 +        atomic_add_64(&queue->q_num_issuing_zios, num_zios);
     3225 +
     3226 +        return (bytes);
     3227 +}
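
The gathering pass described above can be modelled with a plain array in place of the AVL tree. The sketch below (not part of the patch; offsets, sizes and the limit are made-up values) collects queued ios inside one extent in LBA order until the byte limit is hit, and reports how far into the extent it got, mirroring how the real code trims the range_seg_t when only part of it was consumed.

#include <stdio.h>
#include <stdint.h>

typedef struct {
        uint64_t offset;
        uint64_t asize;
} io_t;

/*
 * Gather ios that fall inside [start, end), lowest LBA first, stopping
 * once 'limit' bytes have been collected.  Returns the bytes gathered;
 * *consumed_end reports how far into the extent we got.
 */
static uint64_t
gather(const io_t *ios, int nios, uint64_t start, uint64_t end,
    uint64_t limit, uint64_t *consumed_end)
{
        uint64_t bytes = 0;

        *consumed_end = start;
        for (int i = 0; i < nios; i++) {
                if (ios[i].offset < start || ios[i].offset >= end)
                        continue;
                if (bytes >= limit)
                        break;          /* extent only partially consumed */
                bytes += ios[i].asize;
                *consumed_end = ios[i].offset + ios[i].asize;
        }
        return (bytes);
}

int
main(void)
{
        /* three queued ios inside a [0x0, 0x30000) extent */
        io_t ios[] = {
                { 0x00000, 0x8000 }, { 0x10000, 0x8000 }, { 0x20000, 0x8000 }
        };
        uint64_t end;
        uint64_t got = gather(ios, 3, 0x0, 0x30000, 0x10000, &end);

        /* prints: gathered 0x10000 bytes, extent consumed up to 0x18000 */
        printf("gathered 0x%llx bytes, extent consumed up to 0x%llx\n",
            (unsigned long long)got, (unsigned long long)end);
        return (0);
}
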
     3228 +
     3229 +/*
     3230 + * This is called from the queue emptying thread and selects the next
     3231 + * extent from which we are to issue io's. The behavior of this function
     3232 + * depends on the state of the scan, the current memory consumption and
     3233 + * whether or not we are performing a scan shutdown.
     3234 + * 1) We select extents in an elevator algorithm (LBA-order) if:
     3235 + *      a) the scan has finished traversing metadata (DSS_FINISHING)
     3236 + *      b) the scan needs to perform a checkpoint
     3237 + * 2) We select the largest available extent if we are up against the
     3238 + *      memory limit.
     3239 + * 3) Otherwise we don't select any extents.
     3240 + */
     3241 +static const range_seg_t *
     3242 +scan_io_queue_fetch_ext(dsl_scan_io_queue_t *queue)
     3243 +{
     3244 +        dsl_scan_t *scn = queue->q_scn;
     3245 +
     3246 +        ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
     3247 +        ASSERT0(queue->q_issuing_rs.rs_start);
     3248 +        ASSERT0(queue->q_issuing_rs.rs_end);
     3249 +        ASSERT(scn->scn_is_sorted);
     3250 +
     3251 +        if (scn->scn_phys.scn_state == DSS_FINISHING ||
     3252 +            scn->scn_checkpointing) {
     3253 +                /*
     3254 +                 * When the scan has finished traversing all metadata and is
     3255 +                 * in the DSS_FINISHING state or a checkpoint has been
     3256 +                 * requested, no new extents will be added to the sorting
     3257 +                 * queue, so the way we are sorted now is as good as it'll
     3258 +                 * get. So instead, switch to issuing extents in linear order.
     3259 +                 */
     3260 +                return (range_tree_first(queue->q_exts_by_addr));
     3261 +        } else if (scn->scn_clearing) {
     3262 +                return (avl_first(&queue->q_exts_by_size));
     3263 +        } else {
     3264 +                return (NULL);
     3265 +        }
     3266 +}
     3267 +
     3268 +/*
     3269 + * Empties a scan queue until we have issued at least info->qri_limit
     3270 + * bytes, or the queue is empty. This is called via the scn_taskq so as
     3271 + * to parallelize processing of all top-level vdevs as much as possible.
     3272 + */
     3273 +static void
     3274 +scan_io_queues_run_one(io_queue_run_info_t *info)
     3275 +{
     3276 +        dsl_scan_io_queue_t *queue = info->qri_queue;
     3277 +        uint64_t limit = info->qri_limit;
     3278 +        dsl_scan_t *scn = queue->q_scn;
     3279 +        kmutex_t *q_lock = &queue->q_vd->vdev_scan_io_queue_lock;
     3280 +        list_t zio_list;
     3281 +        const range_seg_t *rs;
     3282 +        uint64_t issued = 0;
     3283 +
     3284 +        ASSERT(scn->scn_is_sorted);
     3285 +
     3286 +        /* loop until we have issued as much I/O as was requested */
     3287 +        while (issued < limit) {
     3288 +                scan_io_t *first_io, *last_io;
     3289 +
     3290 +                mutex_enter(q_lock);
     3291 +                /* First we select the extent we'll be issuing from next. */
     3292 +                rs = scan_io_queue_fetch_ext(queue);
     3293 +                DTRACE_PROBE2(queue_fetch_ext, range_seg_t *, rs,
     3294 +                    dsl_scan_io_queue_t *, queue);
     3295 +                if (rs == NULL) {
     3296 +                        mutex_exit(q_lock);
     3297 +                        break;
     3298 +                }
     3299 +
     3300 +                /*
     3301 +                 * We have selected which extent needs to be processed next,
     3302 +                 * gather up the corresponding zio's, taking care not to step
     3303 +                 * over the limit.
     3304 +                 */
     3305 +                issued += scan_io_queue_gather(rs, &zio_list, queue,
     3306 +                    limit - issued);
     3307 +                first_io = list_head(&zio_list);
     3308 +                last_io = list_tail(&zio_list);
     3309 +                if (first_io != NULL) {
     3310 +                        /*
     3311 +                         * We have zio's to issue. Construct a fake range
     3312 +                         * seg that covers the whole list of zio's to issue
     3313 +                         * (the list is guaranteed to be LBA-ordered) and
     3314 +                         * save that in the queue's "in flight" segment.
     3315 +                         * This is used to prevent freeing I/O from hitting
     3316 +                         * that range while we're working on it.
     3317 +                         */
     3318 +                        ASSERT(last_io != NULL);
     3319 +                        queue->q_issuing_rs.rs_start =
     3320 +                            SCAN_IO_GET_OFFSET(first_io);
     3321 +                        queue->q_issuing_rs.rs_end =
     3322 +                            SCAN_IO_GET_OFFSET(last_io) + last_io->sio_asize;
     3323 +                }
     3324 +                mutex_exit(q_lock);
     3325 +
     3326 +                /*
     3327 +                 * Issuing zio's can take a long time (especially because
     3328 +                 * we are constrained by zfs_top_maxinflight), so drop the
     3329 +                 * queue lock.
     3330 +                 */
     3331 +                scan_io_queue_issue(&zio_list, queue);
     3332 +
     3333 +                mutex_enter(q_lock);
     3334 +                /* invalidate the in-flight I/O range */
     3335 +                bzero(&queue->q_issuing_rs, sizeof (queue->q_issuing_rs));
     3336 +                cv_broadcast(&queue->q_cv);
     3337 +                mutex_exit(q_lock);
     3338 +        }
     3339 +}
     3340 +
     3341 +/*
     3342 + * Performs an emptying run on all scan queues in the pool. This just
     3343 + * punches out one thread per top-level vdev, each of which processes
     3344 + * only that vdev's scan queue. We can parallelize the I/O here because
     3345 + * we know that each queue's io's only affect its own top-level vdev.
     3346 + * The amount of I/O dequeued per run of this function is calibrated
     3347 + * dynamically so that its total run time doesn't exceed
     3348 + * zfs_scan_dequeue_run_target_ms + zfs_dequeue_run_bonus_ms. The
     3349 + * timing algorithm aims to hit the target value, but still
     3350 + * oversubscribes the amount of data that it is allowed to fetch by
     3351 + * the bonus value. This is to allow for non-equal completion times
     3352 + * across top-level vdevs.
     3353 + *
     3354 + * This function waits for the queue runs to complete, and must be
     3355 + * called from dsl_scan_sync (or in general, syncing context).
     3356 + */
     3357 +static void
     3358 +scan_io_queues_run(dsl_scan_t *scn)
     3359 +{
     3360 +        spa_t *spa = scn->scn_dp->dp_spa;
     3361 +        uint64_t dirty_limit, total_limit, total_bytes;
     3362 +        io_queue_run_info_t *info;
     3363 +        uint64_t dequeue_min = zfs_scan_dequeue_min *
     3364 +            spa->spa_root_vdev->vdev_children;
     3365 +
     3366 +        ASSERT(scn->scn_is_sorted);
     3367 +        ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
     3368 +
     3369 +        if (scn->scn_taskq == NULL) {
     3370 +                char *tq_name = kmem_zalloc(ZFS_MAX_DATASET_NAME_LEN + 16,
     3371 +                    KM_SLEEP);
     3372 +                const int nthreads = spa->spa_root_vdev->vdev_children;
     3373 +
     3374 +                /*
     3375 +                 * We need to make this taskq *always* execute as many
     3376 +                 * threads in parallel as we have top-level vdevs and no
     3377 +                 * less, otherwise strange serialization of the calls to
     3378 +                 * scan_io_queues_run_one can occur during spa_sync runs
     3379 +                 * and that significantly impacts performance.
     3380 +                 */
     3381 +                (void) snprintf(tq_name, ZFS_MAX_DATASET_NAME_LEN + 16,
     3382 +                    "dsl_scan_tq_%s", spa->spa_name);
     3383 +                scn->scn_taskq = taskq_create(tq_name, nthreads, minclsyspri,
     3384 +                    nthreads, nthreads, TASKQ_PREPOPULATE);
     3385 +                kmem_free(tq_name, ZFS_MAX_DATASET_NAME_LEN + 16);
     3386 +        }
     3387 +
     3388 +        /*
     3389 +         * This is the automatic run time calibration algorithm. We gauge
     3390 +         * how long spa_sync took since last time we were invoked. If it
     3391 +         * took longer than our target + bonus values, we reduce the
     3392 +         * amount of data that the queues are allowed to process in this
     3393 +         * iteration. Conversely, if it took less than target + bonus,
     3394 +         * we increase the amount of data the queues are allowed to process.
     3395 +         * This is designed as a partial load-following algorithm, so if
     3396 +         * other ZFS users start issuing I/O, we back off, until we hit our
     3397 +         * minimum issue amount (per-TL-vdev) of zfs_scan_dequeue_min bytes.
     3398 +         */
     3399 +        if (scn->scn_last_queue_run_time != 0) {
     3400 +                uint64_t now = ddi_get_lbolt64();
     3401 +                uint64_t delta_ms = TICK_TO_MSEC(now -
     3402 +                    scn->scn_last_queue_run_time);
     3403 +                uint64_t bonus = zfs_dequeue_run_bonus_ms;
     3404 +
     3405 +                bonus = MIN(bonus, DEQUEUE_BONUS_MS_MAX);
     3406 +                if (delta_ms <= bonus)
     3407 +                        delta_ms = bonus + 1;
     3408 +
     3409 +                scn->scn_last_dequeue_limit = MAX(dequeue_min,
     3410 +                    (scn->scn_last_dequeue_limit *
     3411 +                    zfs_scan_dequeue_run_target_ms) / (delta_ms - bonus));
     3412 +                scn->scn_last_queue_run_time = now;
     3413 +        } else {
     3414 +                scn->scn_last_queue_run_time = ddi_get_lbolt64();
     3415 +                scn->scn_last_dequeue_limit = dequeue_min;
     3416 +        }
     3417 +
     3418 +        /*
     3419 +         * We also constrain the amount of data we are allowed to issue
     3420 +         * by the zfs_dirty_data_max value - this serves as a sanity
     3421 +         * check to prevent us from issuing huge amounts of data to be
     3422 +         * dequeued per run.
     3423 +         */
     3424 +        dirty_limit = (zfs_vdev_async_write_active_min_dirty_percent *
     3425 +            zfs_dirty_data_max) / 100;
     3426 +        if (dirty_limit >= scn->scn_dp->dp_dirty_total)
     3427 +                dirty_limit -= scn->scn_dp->dp_dirty_total;
     3428 +        else
     3429 +                dirty_limit = 0;
     3430 +
     3431 +        total_limit = MAX(MIN(scn->scn_last_dequeue_limit, dirty_limit),
     3432 +            dequeue_min);
     3433 +
     3434 +        /*
     3435 +         * We use this to determine how much data each queue is allowed to
     3436 +         * issue this run. We take the amount of dirty data available in
     3437 +         * the current txg and split it proportionally among the queues,
     3438 +         * depending on how full a given queue is. No need to lock here:
     3439 +         * new data can't enter a queue, since that's only done in our
     3440 +         * sync thread.
     3441 +         */
     3442 +        total_bytes = scn->scn_bytes_pending;
     3443 +        if (total_bytes == 0)
     3444 +                return;
     3445 +
     3446 +        info = kmem_zalloc(sizeof (*info) * spa->spa_root_vdev->vdev_children,
     3447 +            KM_SLEEP);
     3448 +        for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
     3449 +                vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
     3450 +                dsl_scan_io_queue_t *queue;
     3451 +                uint64_t limit;
     3452 +
     3453 +                /*
     3454 +                 * No need to lock to check if the queue exists, since this
     3455 +                 * is called from sync context only and queues are only
     3456 +                 * created in sync context also.
     3457 +                 */
     3458 +                queue = vd->vdev_scan_io_queue;
     3459 +                if (queue == NULL)
     3460 +                        continue;
     3461 +
     3462 +                /*
     3463 +                 * Compute the per-queue limit as a fraction of the queue's
     3464 +                 * size, relative to the total amount of zio bytes in
     3465 +                 * all queues. 1000 here is the fixed-point precision. If
     3466 +                 * there are ever more than 1000 top-level vdevs, this
     3467 +                 * code might misbehave.
     3468 +                 */
     3469 +                limit = MAX((((queue->q_zio_bytes * 1000) / total_bytes) *
     3470 +                    total_limit) / 1000, zfs_scan_dequeue_min);
     3471 +
     3472 +                info[i].qri_queue = queue;
     3473 +                info[i].qri_limit = limit;
     3474 +
     3475 +                VERIFY(taskq_dispatch(scn->scn_taskq,
     3476 +                    (void (*)(void *))scan_io_queues_run_one, &info[i],
     3477 +                    TQ_SLEEP) != NULL);
     3478 +        }
     3479 +
     3480 +        /*
     3481 +         * We need to wait for all queues to finish their run, just to keep
     3482 +         * things nice and consistent. This doesn't necessarily mean all
     3483 +         * I/O generated by the queues emptying has finished (there may be
     3484 +         * up to zfs_top_maxinflight zio's still processing on behalf of
     3485 +         * each queue).
     3486 +         */
     3487 +        taskq_wait(scn->scn_taskq);
     3488 +
     3489 +        kmem_free(info, sizeof (*info) * spa->spa_root_vdev->vdev_children);
     3490 +}
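
To illustrate the run-time calibration and the per-queue split performed in scan_io_queues_run above, here is a standalone sketch (not part of the patch). The tunable values are placeholders, not the shipped defaults: a run whose spa_sync took longer than target plus bonus scales the next total limit down, and the total is then divided among the queues in proportion to their queued bytes using the same 1000-step fixed-point arithmetic.

#include <stdio.h>
#include <stdint.h>

#define MiB     (1024ULL * 1024ULL)

static uint64_t umax(uint64_t a, uint64_t b) { return (a > b ? a : b); }

int
main(void)
{
        /* assumed tunables and topology: two top-level vdevs */
        uint64_t target_ms = 3000, bonus_ms = 500;
        uint64_t per_vdev_min = 32 * MiB; /* stand-in for zfs_scan_dequeue_min */
        uint64_t dequeue_min = per_vdev_min * 2;

        /* last run was allowed 512 MiB, but spa_sync took 7000 ms */
        uint64_t last_limit = 512 * MiB, delta_ms = 7000;

        if (delta_ms <= bonus_ms)
                delta_ms = bonus_ms + 1;
        uint64_t total_limit = umax(dequeue_min,
            (last_limit * target_ms) / (delta_ms - bonus_ms));
        printf("next total limit: %llu MiB\n",
            (unsigned long long)(total_limit / MiB));

        /* split in proportion to queued bytes; 1000 is the fixed point */
        uint64_t q_bytes[2] = { 300 * MiB, 100 * MiB };
        uint64_t total_bytes = q_bytes[0] + q_bytes[1];
        for (int i = 0; i < 2; i++) {
                uint64_t lim = umax(
                    (((q_bytes[i] * 1000) / total_bytes) * total_limit) / 1000,
                    per_vdev_min);
                printf("queue %d limit: %llu MiB\n", i,
                    (unsigned long long)(lim / MiB));
        }
        return (0);
}

With these inputs the total limit drops from 512 MiB to roughly 236 MiB, of which the busier queue receives about three quarters.
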
     3491 +
     3492 +/*
     3493 + * Callback invoked when a zio_free() zio is executing. This needs to be
     3494 + * intercepted to prevent the zio from deallocating a particular portion
     3495 + * of disk space and it then getting reallocated and written to, while we
     3496 + * still have it queued up for processing, or even while we're trying to
     3497 + * scrub or resilver it.
     3498 + */
     3499 +void
     3500 +dsl_scan_freed(spa_t *spa, const blkptr_t *bp)
     3501 +{
     3502 +        dsl_pool_t *dp = spa->spa_dsl_pool;
     3503 +        dsl_scan_t *scn = dp->dp_scan;
     3504 +
     3505 +        ASSERT(!BP_IS_EMBEDDED(bp));
     3506 +        ASSERT(scn != NULL);
     3507 +        if (!dsl_scan_is_running(scn))
     3508 +                return;
     3509 +
     3510 +        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
     3511 +                if (BP_IS_SPECIAL(bp) && i != WBC_NORMAL_DVA)
     3512 +                        continue;
     3513 +                dsl_scan_freed_dva(spa, bp, i);
     3514 +        }
     3515 +}
     3516 +
     3517 +static void
     3518 +dsl_scan_freed_dva(spa_t *spa, const blkptr_t *bp, int dva_i)
     3519 +{
     3520 +        dsl_pool_t *dp = spa->spa_dsl_pool;
     3521 +        dsl_scan_t *scn = dp->dp_scan;
     3522 +        vdev_t *vdev;
     3523 +        kmutex_t *q_lock;
     3524 +        dsl_scan_io_queue_t *queue;
     3525 +        scan_io_t srch, *sio;
     3526 +        avl_index_t idx;
     3527 +        uint64_t offset;
     3528 +        int64_t asize;
     3529 +
     3530 +        ASSERT(!BP_IS_EMBEDDED(bp));
     3531 +        ASSERT(scn != NULL);
     3532 +        ASSERT(!BP_IS_SPECIAL(bp) || dva_i == WBC_NORMAL_DVA);
     3533 +
     3534 +        vdev = vdev_lookup_top(spa, DVA_GET_VDEV(&bp->blk_dva[dva_i]));
     3535 +        ASSERT(vdev != NULL);
     3536 +        q_lock = &vdev->vdev_scan_io_queue_lock;
     3537 +        queue = vdev->vdev_scan_io_queue;
     3538 +
     3539 +        mutex_enter(q_lock);
     3540 +        if (queue == NULL) {
     3541 +                mutex_exit(q_lock);
     3542 +                return;
     3543 +        }
     3544 +
     3545 +        bp2sio(bp, &srch, dva_i);
     3546 +        offset = SCAN_IO_GET_OFFSET(&srch);
     3547 +        asize = srch.sio_asize;
     3548 +
     3549 +        /*
     3550 +         * We can find the zio in two states:
     3551 +         * 1) Cold, just sitting in the queue of zio's to be issued at
     3552 +         *      some point in the future. In this case, all we do is
     3553 +         *      remove the zio from the q_zios_by_addr tree, decrement
     3554 +         *      its data volume from the containing range_seg_t and
     3555 +         *      resort the q_exts_by_size tree to reflect that the
     3556 +         *      range_seg_t has lost some of its 'fill'. We don't shorten
     3557 +         *      the range_seg_t - this is usually rare enough not to be
     3558 +         *      worth the extra hassle of trying to keep track of precise
     3559 +         *      extent boundaries.
     3560 +         * 2) Hot, where the zio is currently in-flight in
     3561 +         *      dsl_scan_issue_ios. In this case, we can't simply
     3562 +         *      reach in and stop the in-flight zio's, so we instead
     3563 +         *      block the caller. Eventually, dsl_scan_issue_ios will
     3564 +         *      be done with issuing the zio's it gathered and will
     3565 +         *      signal us.
     3566 +         */
     3567 +        sio = avl_find(&queue->q_zios_by_addr, &srch, &idx);
     3568 +        if (sio != NULL) {
     3569 +                range_seg_t *rs;
     3570 +
     3571 +                /* Got it while it was cold in the queue */
     3572 +                ASSERT3U(srch.sio_asize, ==, sio->sio_asize);
     3573 +                DTRACE_PROBE2(dequeue_now, const blkptr_t *, bp,
     3574 +                    dsl_scan_queue_t *, queue);
     3575 +                count_block(scn, dp->dp_blkstats, bp);
     3576 +                ASSERT(range_tree_contains(queue->q_exts_by_addr, offset,
     3577 +                    asize));
     3578 +                avl_remove(&queue->q_zios_by_addr, sio);
     3579 +
     3580 +                /*
     3581 +                 * Since we're taking this scan_io_t out of its parent
     3582 +                 * range_seg_t, we need to alter the range_seg_t's rs_fill
     3583 +                 * value, so this changes its ordering position. We need
     3584 +                 * to reinsert in its appropriate place in q_exts_by_size.
     3585 +                 */
     3586 +                rs = range_tree_find(queue->q_exts_by_addr,
     3587 +                    SCAN_IO_GET_OFFSET(sio), sio->sio_asize);
     3588 +                ASSERT(rs != NULL);
     3589 +                ASSERT3U(rs->rs_fill, >=, sio->sio_asize);
     3590 +                avl_remove(&queue->q_exts_by_size, rs);
     3591 +                ASSERT3U(rs->rs_fill, >=, sio->sio_asize);
     3592 +                rs->rs_fill -= sio->sio_asize;
     3593 +                VERIFY3P(avl_find(&queue->q_exts_by_size, rs, &idx), ==, NULL);
     3594 +                avl_insert(&queue->q_exts_by_size, rs, idx);
     3595 +
     3596 +                /*
     3597 +                 * We only update the queue byte counter in the cold path,
     3598 +                 * otherwise it will already have been accounted for as
     3599 +                 * part of the zio's execution.
     3600 +                 */
     3601 +                ASSERT3U(queue->q_zio_bytes, >=, asize);
     3602 +                atomic_add_64(&queue->q_zio_bytes, -asize);
     3603 +
     3604 +                mutex_enter(&scn->scn_status_lock);
     3605 +                ASSERT3U(scn->scn_bytes_pending, >=, asize);
     3606 +                scn->scn_bytes_pending -= asize;
     3607 +                mutex_exit(&scn->scn_status_lock);
     3608 +
     3609 +                kmem_free(sio, sizeof (*sio));
     3610 +        } else {
     3611 +                /*
     3612 +                 * If it's part of an extent that's currently being issued,
     3613 +                 * wait until the extent has been consumed. In this case it's
     3614 +                 * not us who is dequeueing this zio, so no need to
     3615 +                 * decrement its size from scn_bytes_pending or the queue.
     3616 +                 */
     3617 +                while (queue->q_issuing_rs.rs_start <= offset &&
     3618 +                    queue->q_issuing_rs.rs_end >= offset + asize) {
     3619 +                        DTRACE_PROBE2(dequeue_wait, const blkptr_t *, bp,
     3620 +                            dsl_scan_queue_t *, queue);
     3621 +                        cv_wait(&queue->q_cv, &vdev->vdev_scan_io_queue_lock);
     3622 +                }
     3623 +        }
     3624 +        mutex_exit(q_lock);
2049 3625  }
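
The "hot" branch above parks the freeing thread on q_cv until the issuing side clears q_issuing_rs and broadcasts. The pthreads sketch below (hypothetical types and constants, not ZFS code) models that handshake in user space: one thread publishes an in-flight range and later invalidates it, while the other refuses to proceed with a "free" that overlaps it.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* miniature of the queue's in-flight range plus its cv handshake */
typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        uint64_t        rs_start, rs_end;       /* 0/0 = nothing in flight */
} mini_queue_t;

static mini_queue_t q = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
};

/* stands in for the issuing path: publish the range, "issue", then clear */
static void *
issuer(void *arg)
{
        (void) arg;
        pthread_mutex_lock(&q.lock);
        q.rs_start = 0x10000;
        q.rs_end = 0x20000;
        pthread_mutex_unlock(&q.lock);

        usleep(100 * 1000);             /* pretend the zios take 100 ms */

        pthread_mutex_lock(&q.lock);
        q.rs_start = q.rs_end = 0;      /* invalidate the in-flight range */
        pthread_cond_broadcast(&q.cv);
        pthread_mutex_unlock(&q.lock);
        return (NULL);
}

int
main(void)
{
        pthread_t tid;
        uint64_t offset = 0x18000, asize = 0x2000;

        pthread_create(&tid, NULL, issuer, NULL);
        usleep(10 * 1000);              /* let the issuer publish its range */

        /* stands in for the "hot" branch of dsl_scan_freed_dva */
        pthread_mutex_lock(&q.lock);
        while (q.rs_start <= offset && q.rs_end >= offset + asize) {
                printf("free of 0x%llx blocked on in-flight extent\n",
                    (unsigned long long)offset);
                pthread_cond_wait(&q.cv, &q.lock);
        }
        pthread_mutex_unlock(&q.lock);
        printf("extent issued, free may proceed\n");

        pthread_join(tid, NULL);
        return (0);
}
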
    