NEX-18589 checksum errors on SSD-based pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14242 Getting panic in module "zfs" due to a NULL pointer dereference
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-17716 reoccurring checksum errors on pool
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15266 Default resilver throttling values are too aggressive
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-9719 Reorganize scan_io_t to make it smaller and improve scan performance
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9658 Resilver code leaks the block sorting queues
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9651 Resilver code leaks a bit of memory in the dataset processing queue
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9551 Resilver algorithm should properly sort metadata and data with copies > 1
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9601 New resilver algorithm causes scrub errors on WBC devices
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-9593 New resilvering algorithm can panic when WBC mirrors are in use
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9554 dsl_scan.c internals contain some confusingly similar function names for handling the dataset and block sorting queues
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9562 Attaching a vdev while resilver/scrub is running causes panic.
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-4940 Special Vdev operation in presence (or absence) of IO Errors
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4705 WRC: Kernel-panic during the destroying of a pool with activated WRC
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6292 exporting a pool while an async destroy is running can leave entries in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix multi-proto)
6251 add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
4391 panic system rather than corrupting pool if we hit bug 4390
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Issue #26: partial scrub
Added partial scrub options:
-M for MOS only scrub
-m for metadata scrub
Issue #2: optimize DDE lookup in DDT objects
Added option to control number of classes of DDE's in DDT.
New default is one, that is all DDE's are stored together
regardless of refcount.
re #12619 rb4429 More dp->dp_config_rwlock holds
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
--- old/usr/src/uts/common/fs/zfs/dsl_scan.c
+++ new/usr/src/uts/common/fs/zfs/dsl_scan.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
23 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
23 24 * Copyright 2016 Gary Mills
24 - * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
25 + * Copyright (c) 2011, 2016 by Delphix. All rights reserved.
25 26 * Copyright 2017 Joyent, Inc.
26 27 * Copyright (c) 2017 Datto Inc.
27 28 */
28 29
29 30 #include <sys/dsl_scan.h>
30 31 #include <sys/dsl_pool.h>
31 32 #include <sys/dsl_dataset.h>
32 33 #include <sys/dsl_prop.h>
33 34 #include <sys/dsl_dir.h>
34 35 #include <sys/dsl_synctask.h>
35 36 #include <sys/dnode.h>
36 37 #include <sys/dmu_tx.h>
37 38 #include <sys/dmu_objset.h>
38 39 #include <sys/arc.h>
39 40 #include <sys/zap.h>
40 41 #include <sys/zio.h>
41 42 #include <sys/zfs_context.h>
42 43 #include <sys/fs/zfs.h>
43 44 #include <sys/zfs_znode.h>
44 45 #include <sys/spa_impl.h>
45 46 #include <sys/vdev_impl.h>
46 47 #include <sys/zil_impl.h>
47 48 #include <sys/zio_checksum.h>
48 49 #include <sys/ddt.h>
49 50 #include <sys/sa.h>
50 51 #include <sys/sa_impl.h>
51 52 #include <sys/zfeature.h>
52 53 #include <sys/abd.h>
53 54 #ifdef _KERNEL
54 55 #include <sys/zfs_vfsops.h>
55 56 #endif
57 +#include <sys/range_tree.h>
56 58
59 +extern int zfs_vdev_async_write_active_min_dirty_percent;
60 +
61 +typedef struct {
62 + uint64_t sds_dsobj;
63 + uint64_t sds_txg;
64 + avl_node_t sds_node;
65 +} scan_ds_t;
66 +
67 +typedef struct {
68 + dsl_scan_io_queue_t *qri_queue;
69 + uint64_t qri_limit;
70 +} io_queue_run_info_t;
71 +
72 +/*
73 + * This controls what conditions are placed on dsl_scan_sync_state():
74 + * SYNC_OPTIONAL) write out scn_phys iff scn_bytes_pending == 0
75 + * SYNC_MANDATORY) write out scn_phys always. scn_bytes_pending must be 0.
76 + * SYNC_CACHED) if scn_bytes_pending == 0, write out scn_phys. Otherwise
77 + * write out the scn_phys_cached version.
78 + * See dsl_scan_sync_state for details.
79 + */
80 +typedef enum {
81 + SYNC_OPTIONAL,
82 + SYNC_MANDATORY,
83 + SYNC_CACHED
84 +} state_sync_type_t;
85 +
57 86 typedef int (scan_cb_t)(dsl_pool_t *, const blkptr_t *,
58 87 const zbookmark_phys_t *);
59 88
60 89 static scan_cb_t dsl_scan_scrub_cb;
61 90 static void dsl_scan_cancel_sync(void *, dmu_tx_t *);
62 -static void dsl_scan_sync_state(dsl_scan_t *, dmu_tx_t *);
91 +static void dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx,
92 + state_sync_type_t sync_type);
63 93 static boolean_t dsl_scan_restarting(dsl_scan_t *, dmu_tx_t *);
64 94
65 -int zfs_top_maxinflight = 32; /* maximum I/Os per top-level */
66 -int zfs_resilver_delay = 2; /* number of ticks to delay resilver */
67 -int zfs_scrub_delay = 4; /* number of ticks to delay scrub */
68 -int zfs_scan_idle = 50; /* idle window in clock ticks */
95 +static int scan_ds_queue_compar(const void *a, const void *b);
96 +static void scan_ds_queue_empty(dsl_scan_t *scn, boolean_t destroy);
97 +static boolean_t scan_ds_queue_contains(dsl_scan_t *scn, uint64_t dsobj,
98 + uint64_t *txg);
99 +static int scan_ds_queue_insert(dsl_scan_t *scn, uint64_t dsobj, uint64_t txg);
100 +static void scan_ds_queue_remove(dsl_scan_t *scn, uint64_t dsobj);
101 +static boolean_t scan_ds_queue_first(dsl_scan_t *scn, uint64_t *dsobj,
102 + uint64_t *txg);
103 +static void scan_ds_queue_sync(dsl_scan_t *scn, dmu_tx_t *tx);
69 104
105 +/*
 106 + * Maximum number of concurrently executing I/Os per top-level vdev.
107 + * Tune with care. Very high settings (hundreds) are known to trigger
108 + * some firmware bugs and resets on certain SSDs.
109 + */
110 +int zfs_top_maxinflight = 32;
111 +
112 +/*
113 + * Minimum amount of data we dequeue if our queues are full and the
114 + * dirty data limit for a txg has been reached.
115 + */
116 +uint64_t zfs_scan_dequeue_min = 16 << 20;
117 +/*
 118 + * The duration we're targeting for a dsl_scan_sync to take due to our
119 + * dequeued data. If we go over that value, we lower the amount we dequeue
120 + * each run and vice versa. The bonus value below is just something we add
 121 + * on top of the target value so that we have a little bit of fudging in case
122 + * some top-level vdevs finish before others - we want to keep the vdevs as
123 + * hot as possible.
124 + */
125 +uint64_t zfs_scan_dequeue_run_target_ms = 2000;
126 +uint64_t zfs_dequeue_run_bonus_ms = 1000;
127 +#define DEQUEUE_BONUS_MS_MAX 100000
128 +
129 +boolean_t zfs_scan_direct = B_FALSE; /* don't queue & sort zios, go direct */
130 +uint64_t zfs_scan_max_ext_gap = 2 << 20; /* bytes */
131 +/* See scan_io_queue_mem_lim for details on the memory limit tunables */
132 +uint64_t zfs_scan_mem_lim_fact = 20; /* fraction of physmem */
133 +uint64_t zfs_scan_mem_lim_soft_fact = 20; /* fraction of mem lim above */
134 +uint64_t zfs_scan_checkpoint_intval = 7200; /* seconds */
135 +/*
136 + * fill_weight is non-tunable at runtime, so we copy it at module init from
137 + * zfs_scan_fill_weight. Runtime adjustments to zfs_scan_fill_weight would
138 + * break queue sorting.
139 + */
140 +uint64_t zfs_scan_fill_weight = 3;
141 +static uint64_t fill_weight = 3;
142 +
143 +/* See scan_io_queue_mem_lim for details on the memory limit tunables */
144 +uint64_t zfs_scan_mem_lim_min = 16 << 20; /* bytes */
145 +uint64_t zfs_scan_mem_lim_soft_max = 128 << 20; /* bytes */
146 +
147 +#define ZFS_SCAN_CHECKPOINT_INTVAL SEC_TO_TICK(zfs_scan_checkpoint_intval)
148 +
70 149 int zfs_scan_min_time_ms = 1000; /* min millisecs to scrub per txg */
71 150 int zfs_free_min_time_ms = 1000; /* min millisecs to free per txg */
72 -int zfs_obsolete_min_time_ms = 500; /* min millisecs to obsolete per txg */
73 151 int zfs_resilver_min_time_ms = 3000; /* min millisecs to resilver per txg */
74 152 boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */
75 153 boolean_t zfs_no_scrub_prefetch = B_FALSE; /* set to disable scrub prefetch */
76 154 enum ddt_class zfs_scrub_ddt_class_max = DDT_CLASS_DUPLICATE;
77 155 int dsl_scan_delay_completion = B_FALSE; /* set to delay scan completion */
78 156 /* max number of blocks to free in a single TXG */
79 -uint64_t zfs_async_block_max_blocks = UINT64_MAX;
157 +uint64_t zfs_free_max_blocks = UINT64_MAX;
80 158
81 159 #define DSL_SCAN_IS_SCRUB_RESILVER(scn) \
82 160 ((scn)->scn_phys.scn_func == POOL_SCAN_SCRUB || \
83 - (scn)->scn_phys.scn_func == POOL_SCAN_RESILVER)
161 + (scn)->scn_phys.scn_func == POOL_SCAN_RESILVER || \
162 + (scn)->scn_phys.scn_func == POOL_SCAN_MOS || \
163 + (scn)->scn_phys.scn_func == POOL_SCAN_META)
84 164
85 165 extern int zfs_txg_timeout;
86 166
87 167 /*
88 168 * Enable/disable the processing of the free_bpobj object.
89 169 */
90 170 boolean_t zfs_free_bpobj_enabled = B_TRUE;
91 171
92 172 /* the order has to match pool_scan_type */
93 173 static scan_cb_t *scan_funcs[POOL_SCAN_FUNCS] = {
94 174 NULL,
95 175 dsl_scan_scrub_cb, /* POOL_SCAN_SCRUB */
96 176 dsl_scan_scrub_cb, /* POOL_SCAN_RESILVER */
177 + dsl_scan_scrub_cb, /* POOL_SCAN_MOS */
178 + dsl_scan_scrub_cb, /* POOL_SCAN_META */
97 179 };
98 180
181 +typedef struct scan_io {
182 + uint64_t sio_prop;
183 + uint64_t sio_phys_birth;
184 + uint64_t sio_birth;
185 + zio_cksum_t sio_cksum;
186 + zbookmark_phys_t sio_zb;
187 + union {
188 + avl_node_t sio_addr_node;
189 + list_node_t sio_list_node;
190 + } sio_nodes;
191 + uint64_t sio_dva_word1;
192 + uint32_t sio_asize;
193 + int sio_flags;
194 +} scan_io_t;
195 +
196 +struct dsl_scan_io_queue {
197 + dsl_scan_t *q_scn;
198 + vdev_t *q_vd;
199 +
200 + kcondvar_t q_cv;
201 +
202 + range_tree_t *q_exts_by_addr;
203 + avl_tree_t q_zios_by_addr;
204 + avl_tree_t q_exts_by_size;
205 +
206 + /* number of bytes in queued zios - atomic ops */
207 + uint64_t q_zio_bytes;
208 +
209 + range_seg_t q_issuing_rs;
210 + uint64_t q_num_issuing_zios;
211 +};
212 +
213 +#define SCAN_IO_GET_OFFSET(sio) \
214 + BF64_GET_SB((sio)->sio_dva_word1, 0, 63, SPA_MINBLOCKSHIFT, 0)
215 +#define SCAN_IO_SET_OFFSET(sio, offset) \
216 + BF64_SET_SB((sio)->sio_dva_word1, 0, 63, SPA_MINBLOCKSHIFT, 0, offset)
217 +
218 +static void scan_io_queue_insert_cb(range_tree_t *rt, range_seg_t *rs,
219 + void *arg);
220 +static void scan_io_queue_remove_cb(range_tree_t *rt, range_seg_t *rs,
221 + void *arg);
222 +static void scan_io_queue_vacate_cb(range_tree_t *rt, void *arg);
223 +static int ext_size_compar(const void *x, const void *y);
224 +static int io_addr_compar(const void *x, const void *y);
225 +
226 +static struct range_tree_ops scan_io_queue_ops = {
227 + .rtop_create = NULL,
228 + .rtop_destroy = NULL,
229 + .rtop_add = scan_io_queue_insert_cb,
230 + .rtop_remove = scan_io_queue_remove_cb,
231 + .rtop_vacate = scan_io_queue_vacate_cb
232 +};
233 +
234 +typedef enum {
235 + MEM_LIM_NONE,
236 + MEM_LIM_SOFT,
237 + MEM_LIM_HARD
238 +} mem_lim_t;
239 +
240 +static void dsl_scan_enqueue(dsl_pool_t *dp, const blkptr_t *bp,
241 + int zio_flags, const zbookmark_phys_t *zb);
242 +static void scan_exec_io(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
243 + const zbookmark_phys_t *zb, boolean_t limit_inflight);
244 +static void scan_io_queue_insert(dsl_scan_t *scn, dsl_scan_io_queue_t *queue,
245 + const blkptr_t *bp, int dva_i, int zio_flags, const zbookmark_phys_t *zb);
246 +
247 +static void scan_io_queues_run_one(io_queue_run_info_t *info);
248 +static void scan_io_queues_run(dsl_scan_t *scn);
249 +static mem_lim_t scan_io_queue_mem_lim(dsl_scan_t *scn);
250 +
251 +static dsl_scan_io_queue_t *scan_io_queue_create(vdev_t *vd);
252 +static void scan_io_queues_destroy(dsl_scan_t *scn);
253 +static void dsl_scan_freed_dva(spa_t *spa, const blkptr_t *bp, int dva_i);
254 +
255 +static inline boolean_t
256 +dsl_scan_is_running(const dsl_scan_t *scn)
257 +{
258 + return (scn->scn_phys.scn_state == DSS_SCANNING ||
259 + scn->scn_phys.scn_state == DSS_FINISHING);
260 +}
261 +
262 +static inline void
263 +sio2bp(const scan_io_t *sio, blkptr_t *bp, uint64_t vdev_id)
264 +{
265 + bzero(bp, sizeof (*bp));
266 + DVA_SET_ASIZE(&bp->blk_dva[0], sio->sio_asize);
267 + DVA_SET_VDEV(&bp->blk_dva[0], vdev_id);
268 + bp->blk_dva[0].dva_word[1] = sio->sio_dva_word1;
269 + bp->blk_prop = sio->sio_prop;
270 + /*
271 + * We must reset the special flag, because the rebuilt BP lacks
272 + * a second DVA, so wbc_select_dva must not be allowed to run.
273 + */
274 + BP_SET_SPECIAL(bp, 0);
275 + bp->blk_phys_birth = sio->sio_phys_birth;
276 + bp->blk_birth = sio->sio_birth;
277 + bp->blk_fill = 1; /* we always only work with data pointers */
278 + bp->blk_cksum = sio->sio_cksum;
279 +}
280 +
281 +static inline void
282 +bp2sio(const blkptr_t *bp, scan_io_t *sio, int dva_i)
283 +{
284 + if (BP_IS_SPECIAL(bp))
285 + ASSERT3S(dva_i, ==, WBC_NORMAL_DVA);
286 + /* we discard the vdev guid, since we can deduce it from the queue */
287 + sio->sio_dva_word1 = bp->blk_dva[dva_i].dva_word[1];
288 + sio->sio_asize = DVA_GET_ASIZE(&bp->blk_dva[dva_i]);
289 + sio->sio_prop = bp->blk_prop;
290 + sio->sio_phys_birth = bp->blk_phys_birth;
291 + sio->sio_birth = bp->blk_birth;
292 + sio->sio_cksum = bp->blk_cksum;
293 +}
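
[Editor's note, not part of the changeset] The bp2sio()/sio2bp() pair above shrinks a full blkptr_t down to the queue-sized scan_io_t and later rebuilds a single-DVA blkptr_t when the queued I/O is finally issued; the vdev id is not stored per entry because it can be deduced from the owning queue. A minimal sketch of that round trip, assuming bp, dva_i and queue (a dsl_scan_io_queue_t *) are in scope:

	scan_io_t sio;
	blkptr_t issue_bp;

	bp2sio(bp, &sio, dva_i);	/* compress bp into its queueable form */
	/* ... sio waits in the sorting queue until its extent is issued ... */
	sio2bp(&sio, &issue_bp, queue->q_vd->vdev_id);	/* rebuild a one-DVA bp */
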
294 +
295 +void
296 +dsl_scan_global_init()
297 +{
298 + fill_weight = zfs_scan_fill_weight;
299 +}
300 +
99 301 int
100 302 dsl_scan_init(dsl_pool_t *dp, uint64_t txg)
101 303 {
102 304 int err;
103 305 dsl_scan_t *scn;
104 306 spa_t *spa = dp->dp_spa;
105 307 uint64_t f;
106 308
107 309 scn = dp->dp_scan = kmem_zalloc(sizeof (dsl_scan_t), KM_SLEEP);
108 310 scn->scn_dp = dp;
109 311
312 + mutex_init(&scn->scn_sorted_lock, NULL, MUTEX_DEFAULT, NULL);
313 + mutex_init(&scn->scn_status_lock, NULL, MUTEX_DEFAULT, NULL);
314 +
110 315 /*
111 316 * It's possible that we're resuming a scan after a reboot so
112 317 * make sure that the scan_async_destroying flag is initialized
113 318 * appropriately.
114 319 */
115 320 ASSERT(!scn->scn_async_destroying);
116 321 scn->scn_async_destroying = spa_feature_is_active(dp->dp_spa,
117 322 SPA_FEATURE_ASYNC_DESTROY);
118 323
324 + bcopy(&scn->scn_phys, &scn->scn_phys_cached, sizeof (scn->scn_phys));
325 + mutex_init(&scn->scn_queue_lock, NULL, MUTEX_DEFAULT, NULL);
326 + avl_create(&scn->scn_queue, scan_ds_queue_compar, sizeof (scan_ds_t),
327 + offsetof(scan_ds_t, sds_node));
328 +
119 329 err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
120 330 "scrub_func", sizeof (uint64_t), 1, &f);
121 331 if (err == 0) {
122 332 /*
123 333 * There was an old-style scrub in progress. Restart a
124 334 * new-style scrub from the beginning.
125 335 */
126 336 scn->scn_restart_txg = txg;
337 + DTRACE_PROBE2(scan_init__old2new, dsl_scan_t *, scn,
338 + uint64_t, txg);
127 339 zfs_dbgmsg("old-style scrub was in progress; "
128 340 "restarting new-style scrub in txg %llu",
129 341 scn->scn_restart_txg);
130 342
131 343 /*
132 344 * Load the queue obj from the old location so that it
133 345 * can be freed by dsl_scan_done().
134 346 */
135 347 (void) zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
136 348 "scrub_queue", sizeof (uint64_t), 1,
137 349 &scn->scn_phys.scn_queue_obj);
138 350 } else {
139 351 err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
140 352 DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
141 353 &scn->scn_phys);
142 354 if (err == ENOENT)
143 355 return (0);
144 356 else if (err)
145 357 return (err);
146 358
147 - if (scn->scn_phys.scn_state == DSS_SCANNING &&
359 + /*
360 + * We might be restarting after a reboot, so jump the issued
361 + * counter to how far we've scanned. We know we're consistent
362 + * up to here.
363 + */
364 + scn->scn_bytes_issued = scn->scn_phys.scn_examined;
365 +
366 + if (dsl_scan_is_running(scn) &&
148 367 spa_prev_software_version(dp->dp_spa) < SPA_VERSION_SCAN) {
149 368 /*
150 369 * A new-type scrub was in progress on an old
151 370 * pool, and the pool was accessed by old
152 371 * software. Restart from the beginning, since
153 372 * the old software may have changed the pool in
154 373 * the meantime.
155 374 */
156 375 scn->scn_restart_txg = txg;
376 + DTRACE_PROBE2(scan_init__new2old2new,
377 + dsl_scan_t *, scn, uint64_t, txg);
157 378 zfs_dbgmsg("new-style scrub was modified "
158 379 "by old software; restarting in txg %llu",
159 380 scn->scn_restart_txg);
160 381 }
161 382 }
162 383
384 + /* reload the queue into the in-core state */
385 + if (scn->scn_phys.scn_queue_obj != 0) {
386 + zap_cursor_t zc;
387 + zap_attribute_t za;
388 +
389 + for (zap_cursor_init(&zc, dp->dp_meta_objset,
390 + scn->scn_phys.scn_queue_obj);
391 + zap_cursor_retrieve(&zc, &za) == 0;
392 + (void) zap_cursor_advance(&zc)) {
393 + VERIFY0(scan_ds_queue_insert(scn,
394 + zfs_strtonum(za.za_name, NULL),
395 + za.za_first_integer));
396 + }
397 + zap_cursor_fini(&zc);
398 + }
399 +
163 400 spa_scan_stat_init(spa);
164 401 return (0);
165 402 }
166 403
167 404 void
168 405 dsl_scan_fini(dsl_pool_t *dp)
169 406 {
170 - if (dp->dp_scan) {
407 + if (dp->dp_scan != NULL) {
408 + dsl_scan_t *scn = dp->dp_scan;
409 +
410 + mutex_destroy(&scn->scn_sorted_lock);
411 + mutex_destroy(&scn->scn_status_lock);
412 + if (scn->scn_taskq != NULL)
413 + taskq_destroy(scn->scn_taskq);
414 + scan_ds_queue_empty(scn, B_TRUE);
415 + mutex_destroy(&scn->scn_queue_lock);
416 +
171 417 kmem_free(dp->dp_scan, sizeof (dsl_scan_t));
172 418 dp->dp_scan = NULL;
173 419 }
174 420 }
175 421
176 422 /* ARGSUSED */
177 423 static int
178 424 dsl_scan_setup_check(void *arg, dmu_tx_t *tx)
179 425 {
180 426 dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
181 427
182 - if (scn->scn_phys.scn_state == DSS_SCANNING)
428 + if (dsl_scan_is_running(scn))
183 429 return (SET_ERROR(EBUSY));
184 430
185 431 return (0);
186 432 }
187 433
188 434 static void
189 435 dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)
190 436 {
191 437 dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
192 438 pool_scan_func_t *funcp = arg;
193 439 dmu_object_type_t ot = 0;
194 440 dsl_pool_t *dp = scn->scn_dp;
195 441 spa_t *spa = dp->dp_spa;
196 442
197 - ASSERT(scn->scn_phys.scn_state != DSS_SCANNING);
443 + ASSERT(!dsl_scan_is_running(scn));
198 444 ASSERT(*funcp > POOL_SCAN_NONE && *funcp < POOL_SCAN_FUNCS);
199 445 bzero(&scn->scn_phys, sizeof (scn->scn_phys));
200 446 scn->scn_phys.scn_func = *funcp;
201 447 scn->scn_phys.scn_state = DSS_SCANNING;
202 448 scn->scn_phys.scn_min_txg = 0;
203 449 scn->scn_phys.scn_max_txg = tx->tx_txg;
204 - scn->scn_phys.scn_ddt_class_max = DDT_CLASSES - 1; /* the entire DDT */
450 + /* the entire DDT */
451 + scn->scn_phys.scn_ddt_class_max = spa->spa_ddt_class_max;
205 452 scn->scn_phys.scn_start_time = gethrestime_sec();
206 453 scn->scn_phys.scn_errors = 0;
207 454 scn->scn_phys.scn_to_examine = spa->spa_root_vdev->vdev_stat.vs_alloc;
208 455 scn->scn_restart_txg = 0;
209 456 scn->scn_done_txg = 0;
457 + scn->scn_bytes_issued = 0;
458 + scn->scn_checkpointing = B_FALSE;
459 + scn->scn_last_checkpoint = 0;
210 460 spa_scan_stat_init(spa);
211 461
212 462 if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
213 - scn->scn_phys.scn_ddt_class_max = zfs_scrub_ddt_class_max;
463 + scn->scn_phys.scn_ddt_class_max =
464 + MIN(zfs_scrub_ddt_class_max, spa->spa_ddt_class_max);
214 465
215 466 /* rewrite all disk labels */
216 467 vdev_config_dirty(spa->spa_root_vdev);
217 468
218 469 if (vdev_resilver_needed(spa->spa_root_vdev,
219 470 &scn->scn_phys.scn_min_txg, &scn->scn_phys.scn_max_txg)) {
220 471 spa_event_notify(spa, NULL, NULL,
221 472 ESC_ZFS_RESILVER_START);
222 473 } else {
223 474 spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_START);
224 475 }
225 476
226 477 spa->spa_scrub_started = B_TRUE;
227 478 /*
228 479 * If this is an incremental scrub, limit the DDT scrub phase
229 480 * to just the auto-ditto class (for correctness); the rest
230 481 * of the scrub should go faster using top-down pruning.
231 482 */
232 483 if (scn->scn_phys.scn_min_txg > TXG_INITIAL)
233 - scn->scn_phys.scn_ddt_class_max = DDT_CLASS_DITTO;
484 + scn->scn_phys.scn_ddt_class_max =
485 + MIN(DDT_CLASS_DITTO, spa->spa_ddt_class_max);
234 486
235 487 }
236 488
237 489 /* back to the generic stuff */
238 490
239 491 if (dp->dp_blkstats == NULL) {
240 492 dp->dp_blkstats =
241 493 kmem_alloc(sizeof (zfs_all_blkstats_t), KM_SLEEP);
242 494 }
243 495 bzero(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
244 496
245 497 if (spa_version(spa) < SPA_VERSION_DSL_SCRUB)
246 498 ot = DMU_OT_ZAP_OTHER;
247 499
248 500 scn->scn_phys.scn_queue_obj = zap_create(dp->dp_meta_objset,
249 501 ot ? ot : DMU_OT_SCAN_QUEUE, DMU_OT_NONE, 0, tx);
250 502
251 - dsl_scan_sync_state(scn, tx);
503 + bcopy(&scn->scn_phys, &scn->scn_phys_cached, sizeof (scn->scn_phys));
252 504
505 + dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
506 +
253 507 spa_history_log_internal(spa, "scan setup", tx,
254 508 "func=%u mintxg=%llu maxtxg=%llu",
255 509 *funcp, scn->scn_phys.scn_min_txg, scn->scn_phys.scn_max_txg);
256 510 }
257 511
258 512 /* ARGSUSED */
259 513 static void
260 514 dsl_scan_done(dsl_scan_t *scn, boolean_t complete, dmu_tx_t *tx)
261 515 {
262 516 static const char *old_names[] = {
263 517 "scrub_bookmark",
264 518 "scrub_ddt_bookmark",
265 519 "scrub_ddt_class_max",
266 520 "scrub_queue",
267 521 "scrub_min_txg",
268 522 "scrub_max_txg",
269 523 "scrub_func",
270 524 "scrub_errors",
271 525 NULL
272 526 };
273 527
274 528 dsl_pool_t *dp = scn->scn_dp;
275 529 spa_t *spa = dp->dp_spa;
276 530 int i;
277 531
278 532 /* Remove any remnants of an old-style scrub. */
279 533 for (i = 0; old_names[i]; i++) {
280 534 (void) zap_remove(dp->dp_meta_objset,
281 535 DMU_POOL_DIRECTORY_OBJECT, old_names[i], tx);
282 536 }
283 537
284 538 if (scn->scn_phys.scn_queue_obj != 0) {
285 - VERIFY(0 == dmu_object_free(dp->dp_meta_objset,
539 + VERIFY0(dmu_object_free(dp->dp_meta_objset,
286 540 scn->scn_phys.scn_queue_obj, tx));
287 541 scn->scn_phys.scn_queue_obj = 0;
288 542 }
543 + scan_ds_queue_empty(scn, B_FALSE);
289 544
290 545 scn->scn_phys.scn_flags &= ~DSF_SCRUB_PAUSED;
291 546
292 547 /*
293 548 * If we were "restarted" from a stopped state, don't bother
294 549 * with anything else.
295 550 */
296 - if (scn->scn_phys.scn_state != DSS_SCANNING)
551 + if (!dsl_scan_is_running(scn)) {
552 + ASSERT(!scn->scn_is_sorted);
297 553 return;
554 + }
298 555
299 - if (complete)
300 - scn->scn_phys.scn_state = DSS_FINISHED;
301 - else
302 - scn->scn_phys.scn_state = DSS_CANCELED;
556 + if (scn->scn_is_sorted) {
557 + scan_io_queues_destroy(scn);
558 + scn->scn_is_sorted = B_FALSE;
303 559
560 + if (scn->scn_taskq != NULL) {
561 + taskq_destroy(scn->scn_taskq);
562 + scn->scn_taskq = NULL;
563 + }
564 + }
565 +
566 + scn->scn_phys.scn_state = complete ? DSS_FINISHED : DSS_CANCELED;
567 +
304 568 if (dsl_scan_restarting(scn, tx))
305 569 spa_history_log_internal(spa, "scan aborted, restarting", tx,
306 570 "errors=%llu", spa_get_errlog_size(spa));
307 571 else if (!complete)
308 572 spa_history_log_internal(spa, "scan cancelled", tx,
309 573 "errors=%llu", spa_get_errlog_size(spa));
310 574 else
311 575 spa_history_log_internal(spa, "scan done", tx,
312 576 "errors=%llu", spa_get_errlog_size(spa));
313 577
314 578 if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
315 579 mutex_enter(&spa->spa_scrub_lock);
316 580 while (spa->spa_scrub_inflight > 0) {
317 581 cv_wait(&spa->spa_scrub_io_cv,
318 582 &spa->spa_scrub_lock);
319 583 }
320 584 mutex_exit(&spa->spa_scrub_lock);
321 585 spa->spa_scrub_started = B_FALSE;
322 586 spa->spa_scrub_active = B_FALSE;
323 587
324 588 /*
325 589 * If the scrub/resilver completed, update all DTLs to
326 590 * reflect this. Whether it succeeded or not, vacate
327 591 * all temporary scrub DTLs.
328 592 */
329 593 vdev_dtl_reassess(spa->spa_root_vdev, tx->tx_txg,
330 594 complete ? scn->scn_phys.scn_max_txg : 0, B_TRUE);
331 595 if (complete) {
332 596 spa_event_notify(spa, NULL, NULL,
333 597 scn->scn_phys.scn_min_txg ?
334 598 ESC_ZFS_RESILVER_FINISH : ESC_ZFS_SCRUB_FINISH);
335 599 }
336 600 spa_errlog_rotate(spa);
337 601
338 602 /*
339 603 * We may have finished replacing a device.
340 604 * Let the async thread assess this and handle the detach.
341 605 */
342 606 spa_async_request(spa, SPA_ASYNC_RESILVER_DONE);
343 607 }
344 608
345 609 scn->scn_phys.scn_end_time = gethrestime_sec();
610 +
611 + ASSERT(!dsl_scan_is_running(scn));
612 +
613 + /*
 614 +	 * If the special vdev has no errors after a scrub/resilver,
 615 +	 * we need to clear the flag that prevents writes to the
 616 +	 * special vdev.
617 + */
618 + spa_special_check_errors(spa);
346 619 }
347 620
348 621 /* ARGSUSED */
349 622 static int
350 623 dsl_scan_cancel_check(void *arg, dmu_tx_t *tx)
351 624 {
352 625 dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
353 626
354 - if (scn->scn_phys.scn_state != DSS_SCANNING)
627 + if (!dsl_scan_is_running(scn))
355 628 return (SET_ERROR(ENOENT));
356 629 return (0);
357 630 }
358 631
359 632 /* ARGSUSED */
360 633 static void
361 634 dsl_scan_cancel_sync(void *arg, dmu_tx_t *tx)
362 635 {
363 636 dsl_scan_t *scn = dmu_tx_pool(tx)->dp_scan;
364 637
365 638 dsl_scan_done(scn, B_FALSE, tx);
366 - dsl_scan_sync_state(scn, tx);
367 - spa_event_notify(scn->scn_dp->dp_spa, NULL, NULL, ESC_ZFS_SCRUB_ABORT);
639 + dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
368 640 }
369 641
370 642 int
371 643 dsl_scan_cancel(dsl_pool_t *dp)
372 644 {
373 645 return (dsl_sync_task(spa_name(dp->dp_spa), dsl_scan_cancel_check,
374 646 dsl_scan_cancel_sync, NULL, 3, ZFS_SPACE_CHECK_RESERVED));
375 647 }
376 648
377 649 boolean_t
378 650 dsl_scan_is_paused_scrub(const dsl_scan_t *scn)
379 651 {
380 652 if (dsl_scan_scrubbing(scn->scn_dp) &&
381 653 scn->scn_phys.scn_flags & DSF_SCRUB_PAUSED)
382 654 return (B_TRUE);
383 655
384 656 return (B_FALSE);
385 657 }
386 658
387 659 static int
388 660 dsl_scrub_pause_resume_check(void *arg, dmu_tx_t *tx)
389 661 {
390 662 pool_scrub_cmd_t *cmd = arg;
391 663 dsl_pool_t *dp = dmu_tx_pool(tx);
392 664 dsl_scan_t *scn = dp->dp_scan;
393 665
394 666 if (*cmd == POOL_SCRUB_PAUSE) {
395 667 /* can't pause a scrub when there is no in-progress scrub */
396 668 if (!dsl_scan_scrubbing(dp))
397 669 return (SET_ERROR(ENOENT));
398 670
399 671 /* can't pause a paused scrub */
400 672 if (dsl_scan_is_paused_scrub(scn))
401 673 return (SET_ERROR(EBUSY));
402 674 } else if (*cmd != POOL_SCRUB_NORMAL) {
403 675 return (SET_ERROR(ENOTSUP));
404 676 }
405 677
406 678 return (0);
407 679 }
408 680
409 681 static void
410 682 dsl_scrub_pause_resume_sync(void *arg, dmu_tx_t *tx)
411 683 {
412 684 pool_scrub_cmd_t *cmd = arg;
413 685 dsl_pool_t *dp = dmu_tx_pool(tx);
414 686 spa_t *spa = dp->dp_spa;
415 687 dsl_scan_t *scn = dp->dp_scan;
416 688
417 689 if (*cmd == POOL_SCRUB_PAUSE) {
418 690 /* can't pause a scrub when there is no in-progress scrub */
419 691 spa->spa_scan_pass_scrub_pause = gethrestime_sec();
420 692 scn->scn_phys.scn_flags |= DSF_SCRUB_PAUSED;
421 - dsl_scan_sync_state(scn, tx);
422 - spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_PAUSED);
693 + scn->scn_phys_cached.scn_flags |= DSF_SCRUB_PAUSED;
694 + dsl_scan_sync_state(scn, tx, SYNC_CACHED);
423 695 } else {
424 696 ASSERT3U(*cmd, ==, POOL_SCRUB_NORMAL);
425 697 if (dsl_scan_is_paused_scrub(scn)) {
426 698 /*
427 699 * We need to keep track of how much time we spend
428 700 * paused per pass so that we can adjust the scrub rate
429 701 * shown in the output of 'zpool status'
430 702 */
431 703 spa->spa_scan_pass_scrub_spent_paused +=
432 704 gethrestime_sec() - spa->spa_scan_pass_scrub_pause;
433 705 spa->spa_scan_pass_scrub_pause = 0;
434 706 scn->scn_phys.scn_flags &= ~DSF_SCRUB_PAUSED;
435 - dsl_scan_sync_state(scn, tx);
707 + scn->scn_phys_cached.scn_flags &= ~DSF_SCRUB_PAUSED;
708 + dsl_scan_sync_state(scn, tx, SYNC_CACHED);
436 709 }
437 710 }
438 711 }
439 712
440 713 /*
441 714 * Set scrub pause/resume state if it makes sense to do so
442 715 */
443 716 int
444 717 dsl_scrub_set_pause_resume(const dsl_pool_t *dp, pool_scrub_cmd_t cmd)
445 718 {
446 719 return (dsl_sync_task(spa_name(dp->dp_spa),
447 720 dsl_scrub_pause_resume_check, dsl_scrub_pause_resume_sync, &cmd, 3,
448 721 ZFS_SPACE_CHECK_RESERVED));
449 722 }
450 723
451 724 boolean_t
452 725 dsl_scan_scrubbing(const dsl_pool_t *dp)
453 726 {
454 727 dsl_scan_t *scn = dp->dp_scan;
455 728
456 - if (scn->scn_phys.scn_state == DSS_SCANNING &&
729 + if ((scn->scn_phys.scn_state == DSS_SCANNING ||
730 + scn->scn_phys.scn_state == DSS_FINISHING) &&
457 731 scn->scn_phys.scn_func == POOL_SCAN_SCRUB)
458 732 return (B_TRUE);
459 733
460 734 return (B_FALSE);
461 735 }
462 736
463 737 static void dsl_scan_visitbp(blkptr_t *bp, const zbookmark_phys_t *zb,
464 738 dnode_phys_t *dnp, dsl_dataset_t *ds, dsl_scan_t *scn,
465 739 dmu_objset_type_t ostype, dmu_tx_t *tx);
466 740 static void dsl_scan_visitdnode(dsl_scan_t *, dsl_dataset_t *ds,
467 741 dmu_objset_type_t ostype,
468 742 dnode_phys_t *dnp, uint64_t object, dmu_tx_t *tx);
469 743
470 744 void
471 745 dsl_free(dsl_pool_t *dp, uint64_t txg, const blkptr_t *bp)
472 746 {
473 747 zio_free(dp->dp_spa, txg, bp);
474 748 }
475 749
476 750 void
477 751 dsl_free_sync(zio_t *pio, dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp)
478 752 {
479 753 ASSERT(dsl_pool_sync_context(dp));
480 754 zio_nowait(zio_free_sync(pio, dp->dp_spa, txg, bpp, pio->io_flags));
481 755 }
482 756
483 757 static uint64_t
484 758 dsl_scan_ds_maxtxg(dsl_dataset_t *ds)
485 759 {
486 760 uint64_t smt = ds->ds_dir->dd_pool->dp_scan->scn_phys.scn_max_txg;
487 761 if (ds->ds_is_snapshot)
488 762 return (MIN(smt, dsl_dataset_phys(ds)->ds_creation_txg));
489 763 return (smt);
490 764 }
491 765
766 +/*
767 + * This is the dataset processing "queue", i.e. the datasets that are to be
768 + * scanned for data locations and inserted into the LBA reordering tree.
769 + * Please note that even though we call this a "queue", the actual
770 + * implementation uses an avl tree (to detect double insertion). The tree
771 + * uses the dataset object set number for the sorting criterion, so
772 + * scan_ds_queue_insert CANNOT be guaranteed to always append stuff at the
773 + * end (datasets are inserted by the scanner in discovery order, i.e.
774 + * parent-child relationships). Consequently, the scanner must never step
775 + * through the AVL tree in a naively sequential fashion using AVL_NEXT.
776 + * We must always use scan_ds_queue_first to pick the first dataset in the
777 + * list, process it, remove it using scan_ds_queue_remove and pick the next
778 + * first dataset, again using scan_ds_queue_first.
779 + */
780 +static int
781 +scan_ds_queue_compar(const void *a, const void *b)
782 +{
783 + const scan_ds_t *sds_a = a, *sds_b = b;
784 +
785 + if (sds_a->sds_dsobj < sds_b->sds_dsobj)
786 + return (-1);
787 + if (sds_a->sds_dsobj == sds_b->sds_dsobj)
788 + return (0);
789 + return (1);
790 +}
791 +
492 792 static void
493 -dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx)
793 +scan_ds_queue_empty(dsl_scan_t *scn, boolean_t destroy)
494 794 {
495 - VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
496 - DMU_POOL_DIRECTORY_OBJECT,
497 - DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
498 - &scn->scn_phys, tx));
795 + void *cookie = NULL;
796 + scan_ds_t *sds;
797 +
798 + mutex_enter(&scn->scn_queue_lock);
799 + while ((sds = avl_destroy_nodes(&scn->scn_queue, &cookie)) != NULL)
800 + kmem_free(sds, sizeof (*sds));
801 + mutex_exit(&scn->scn_queue_lock);
802 +
803 + if (destroy)
804 + avl_destroy(&scn->scn_queue);
499 805 }
500 806
501 -extern int zfs_vdev_async_write_active_min_dirty_percent;
807 +static boolean_t
808 +scan_ds_queue_contains(dsl_scan_t *scn, uint64_t dsobj, uint64_t *txg)
809 +{
810 + scan_ds_t *sds;
811 + scan_ds_t srch = { .sds_dsobj = dsobj };
502 812
813 + mutex_enter(&scn->scn_queue_lock);
814 + sds = avl_find(&scn->scn_queue, &srch, NULL);
815 + if (sds != NULL && txg != NULL)
816 + *txg = sds->sds_txg;
817 + mutex_exit(&scn->scn_queue_lock);
818 +
819 + return (sds != NULL);
820 +}
821 +
822 +static int
823 +scan_ds_queue_insert(dsl_scan_t *scn, uint64_t dsobj, uint64_t txg)
824 +{
825 + scan_ds_t *sds;
826 + avl_index_t where;
827 +
828 + sds = kmem_zalloc(sizeof (*sds), KM_SLEEP);
829 + sds->sds_dsobj = dsobj;
830 + sds->sds_txg = txg;
831 +
832 + mutex_enter(&scn->scn_queue_lock);
833 + if (avl_find(&scn->scn_queue, sds, &where) != NULL) {
834 + kmem_free(sds, sizeof (*sds));
835 + return (EEXIST);
836 + }
837 + avl_insert(&scn->scn_queue, sds, where);
838 + mutex_exit(&scn->scn_queue_lock);
839 +
840 + return (0);
841 +}
842 +
843 +static void
844 +scan_ds_queue_remove(dsl_scan_t *scn, uint64_t dsobj)
845 +{
846 + scan_ds_t srch, *sds;
847 +
848 + srch.sds_dsobj = dsobj;
849 +
850 + mutex_enter(&scn->scn_queue_lock);
851 + sds = avl_find(&scn->scn_queue, &srch, NULL);
852 + VERIFY(sds != NULL);
853 + avl_remove(&scn->scn_queue, sds);
854 + mutex_exit(&scn->scn_queue_lock);
855 +
856 + kmem_free(sds, sizeof (*sds));
857 +}
858 +
503 859 static boolean_t
860 +scan_ds_queue_first(dsl_scan_t *scn, uint64_t *dsobj, uint64_t *txg)
861 +{
862 + scan_ds_t *sds;
863 +
864 + mutex_enter(&scn->scn_queue_lock);
865 + sds = avl_first(&scn->scn_queue);
866 + if (sds != NULL) {
867 + *dsobj = sds->sds_dsobj;
868 + *txg = sds->sds_txg;
869 + }
870 + mutex_exit(&scn->scn_queue_lock);
871 +
872 + return (sds != NULL);
873 +}
874 +
875 +static void
876 +scan_ds_queue_sync(dsl_scan_t *scn, dmu_tx_t *tx)
877 +{
878 + dsl_pool_t *dp = scn->scn_dp;
879 + spa_t *spa = dp->dp_spa;
880 + dmu_object_type_t ot = (spa_version(spa) >= SPA_VERSION_DSL_SCRUB) ?
881 + DMU_OT_SCAN_QUEUE : DMU_OT_ZAP_OTHER;
882 +
883 + ASSERT0(scn->scn_bytes_pending);
884 + ASSERT(scn->scn_phys.scn_queue_obj != 0);
885 +
886 + VERIFY0(dmu_object_free(dp->dp_meta_objset,
887 + scn->scn_phys.scn_queue_obj, tx));
888 + scn->scn_phys.scn_queue_obj = zap_create(dp->dp_meta_objset, ot,
889 + DMU_OT_NONE, 0, tx);
890 +
891 + mutex_enter(&scn->scn_queue_lock);
892 + for (scan_ds_t *sds = avl_first(&scn->scn_queue);
893 + sds != NULL; sds = AVL_NEXT(&scn->scn_queue, sds)) {
894 + VERIFY0(zap_add_int_key(dp->dp_meta_objset,
895 + scn->scn_phys.scn_queue_obj, sds->sds_dsobj,
896 + sds->sds_txg, tx));
897 + }
898 + mutex_exit(&scn->scn_queue_lock);
899 +}
900 +
901 +/*
902 + * Writes out a persistent dsl_scan_phys_t record to the pool directory.
903 + * Because we can be running in the block sorting algorithm, we do not always
904 + * want to write out the record, only when it is "safe" to do so. This safety
905 + * condition is achieved by making sure that the sorting queues are empty
906 + * (scn_bytes_pending==0). The sync'ed state could be inconsistent with how
 907 + * much actual scanning progress has been made. The kind of sync performed is
 908 + * specified by the sync_type argument. If the sync is optional, we only
909 + * sync if the queues are empty. If the sync is mandatory, we do a hard VERIFY
910 + * to make sure that the queues are empty. The third possible state is a
911 + * "cached" sync. This is done in response to:
912 + * 1) The dataset that was in the last sync'ed dsl_scan_phys_t having been
913 + * destroyed, so we wouldn't be able to restart scanning from it.
914 + * 2) The snapshot that was in the last sync'ed dsl_scan_phys_t having been
915 + * superseded by a newer snapshot.
916 + * 3) The dataset that was in the last sync'ed dsl_scan_phys_t having been
917 + * swapped with its clone.
918 + * In all cases, a cached sync simply rewrites the last record we've written,
919 + * just slightly modified. For the modifications that are performed to the
920 + * last written dsl_scan_phys_t, see dsl_scan_ds_destroyed,
921 + * dsl_scan_ds_snapshotted and dsl_scan_ds_clone_swapped.
922 + */
923 +static void
924 +dsl_scan_sync_state(dsl_scan_t *scn, dmu_tx_t *tx, state_sync_type_t sync_type)
925 +{
926 + mutex_enter(&scn->scn_status_lock);
927 + ASSERT(sync_type != SYNC_MANDATORY || scn->scn_bytes_pending == 0);
928 + if (scn->scn_bytes_pending == 0) {
929 + if (scn->scn_phys.scn_queue_obj != 0)
930 + scan_ds_queue_sync(scn, tx);
931 + VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
932 + DMU_POOL_DIRECTORY_OBJECT,
933 + DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
934 + &scn->scn_phys, tx));
935 + bcopy(&scn->scn_phys, &scn->scn_phys_cached,
936 + sizeof (scn->scn_phys));
937 + scn->scn_checkpointing = B_FALSE;
938 + scn->scn_last_checkpoint = ddi_get_lbolt();
939 + } else if (sync_type == SYNC_CACHED) {
940 + VERIFY0(zap_update(scn->scn_dp->dp_meta_objset,
941 + DMU_POOL_DIRECTORY_OBJECT,
942 + DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS,
943 + &scn->scn_phys_cached, tx));
944 + }
945 + mutex_exit(&scn->scn_status_lock);
946 +}
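
[Editor's note, not part of the changeset] For reference, the call sites visible in this change select the sync type as follows: scan setup and cancel use SYNC_MANDATORY, while the pause/resume path writes the cached record:

	dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);	/* setup/cancel: queues must be empty */
	dsl_scan_sync_state(scn, tx, SYNC_CACHED);	/* pause/resume: rewrite scn_phys_cached */
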
947 +
948 +static boolean_t
504 949 dsl_scan_check_suspend(dsl_scan_t *scn, const zbookmark_phys_t *zb)
505 950 {
506 951 /* we never skip user/group accounting objects */
507 952 if (zb && (int64_t)zb->zb_object < 0)
508 953 return (B_FALSE);
509 954
510 955 if (scn->scn_suspending)
511 956 return (B_TRUE); /* we're already suspending */
512 957
513 958 if (!ZB_IS_ZERO(&scn->scn_phys.scn_bookmark))
514 959 return (B_FALSE); /* we're resuming */
515 960
516 961 /* We only know how to resume from level-0 blocks. */
517 962 if (zb && zb->zb_level != 0)
518 963 return (B_FALSE);
519 964
520 965 /*
521 966 * We suspend if:
522 967 * - we have scanned for the maximum time: an entire txg
523 968 * timeout (default 5 sec)
524 969 * or
525 970 * - we have scanned for at least the minimum time (default 1 sec
526 971 * for scrub, 3 sec for resilver), and either we have sufficient
527 972 * dirty data that we are starting to write more quickly
528 973 * (default 30%), or someone is explicitly waiting for this txg
529 974 * to complete.
530 975 * or
531 976 * - the spa is shutting down because this pool is being exported
532 977 * or the machine is rebooting.
978 + * or
979 + * - the scan queue has reached its memory use limit
533 980 */
534 981 int mintime = (scn->scn_phys.scn_func == POOL_SCAN_RESILVER) ?
535 982 zfs_resilver_min_time_ms : zfs_scan_min_time_ms;
536 983 uint64_t elapsed_nanosecs = gethrtime() - scn->scn_sync_start_time;
537 984 int dirty_pct = scn->scn_dp->dp_dirty_total * 100 / zfs_dirty_data_max;
538 985 if (elapsed_nanosecs / NANOSEC >= zfs_txg_timeout ||
539 986 (NSEC2MSEC(elapsed_nanosecs) > mintime &&
540 987 (txg_sync_waiting(scn->scn_dp) ||
541 988 dirty_pct >= zfs_vdev_async_write_active_min_dirty_percent)) ||
542 - spa_shutting_down(scn->scn_dp->dp_spa)) {
989 + spa_shutting_down(scn->scn_dp->dp_spa) || scn->scn_clearing ||
990 + scan_io_queue_mem_lim(scn) == MEM_LIM_HARD) {
543 991 if (zb) {
992 + DTRACE_PROBE1(scan_pause, zbookmark_phys_t *, zb);
544 993 dprintf("suspending at bookmark %llx/%llx/%llx/%llx\n",
545 994 (longlong_t)zb->zb_objset,
546 995 (longlong_t)zb->zb_object,
547 996 (longlong_t)zb->zb_level,
548 997 (longlong_t)zb->zb_blkid);
549 998 scn->scn_phys.scn_bookmark = *zb;
999 + } else {
1000 + DTRACE_PROBE1(scan_pause_ddt, ddt_bookmark_t *,
1001 + &scn->scn_phys.scn_ddt_bookmark);
1002 + dprintf("pausing at DDT bookmark %llx/%llx/%llx/%llx\n",
1003 + (longlong_t)scn->scn_phys.scn_ddt_bookmark.
1004 + ddb_class,
1005 + (longlong_t)scn->scn_phys.scn_ddt_bookmark.
1006 + ddb_type,
1007 + (longlong_t)scn->scn_phys.scn_ddt_bookmark.
1008 + ddb_checksum,
1009 + (longlong_t)scn->scn_phys.scn_ddt_bookmark.
1010 + ddb_cursor);
550 1011 }
551 1012 dprintf("suspending at DDT bookmark %llx/%llx/%llx/%llx\n",
552 1013 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_class,
553 1014 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_type,
554 1015 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_checksum,
555 1016 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_cursor);
556 1017 scn->scn_suspending = B_TRUE;
557 1018 return (B_TRUE);
558 1019 }
559 1020 return (B_FALSE);
560 1021 }
561 1022
562 1023 typedef struct zil_scan_arg {
563 1024 dsl_pool_t *zsa_dp;
564 1025 zil_header_t *zsa_zh;
565 1026 } zil_scan_arg_t;
566 1027
567 1028 /* ARGSUSED */
568 1029 static int
569 1030 dsl_scan_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
570 1031 {
571 1032 zil_scan_arg_t *zsa = arg;
572 1033 dsl_pool_t *dp = zsa->zsa_dp;
573 1034 dsl_scan_t *scn = dp->dp_scan;
574 1035 zil_header_t *zh = zsa->zsa_zh;
575 1036 zbookmark_phys_t zb;
576 1037
577 1038 if (BP_IS_HOLE(bp) || bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
578 1039 return (0);
579 1040
580 1041 /*
581 1042 * One block ("stubby") can be allocated a long time ago; we
582 1043 * want to visit that one because it has been allocated
583 1044 * (on-disk) even if it hasn't been claimed (even though for
584 1045 * scrub there's nothing to do to it).
585 1046 */
586 1047 if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(dp->dp_spa))
587 1048 return (0);
588 1049
589 1050 SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
590 1051 ZB_ZIL_OBJECT, ZB_ZIL_LEVEL, bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);
591 1052
592 1053 VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
593 1054 return (0);
594 1055 }
595 1056
596 1057 /* ARGSUSED */
597 1058 static int
598 1059 dsl_scan_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
599 1060 {
600 1061 if (lrc->lrc_txtype == TX_WRITE) {
601 1062 zil_scan_arg_t *zsa = arg;
602 1063 dsl_pool_t *dp = zsa->zsa_dp;
603 1064 dsl_scan_t *scn = dp->dp_scan;
604 1065 zil_header_t *zh = zsa->zsa_zh;
605 1066 lr_write_t *lr = (lr_write_t *)lrc;
606 1067 blkptr_t *bp = &lr->lr_blkptr;
607 1068 zbookmark_phys_t zb;
608 1069
609 1070 if (BP_IS_HOLE(bp) ||
610 1071 bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
611 1072 return (0);
612 1073
613 1074 /*
614 1075 * birth can be < claim_txg if this record's txg is
615 1076 * already txg sync'ed (but this log block contains
616 1077 * other records that are not synced)
617 1078 */
618 1079 if (claim_txg == 0 || bp->blk_birth < claim_txg)
619 1080 return (0);
620 1081
621 1082 SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
622 1083 lr->lr_foid, ZB_ZIL_LEVEL,
623 1084 lr->lr_offset / BP_GET_LSIZE(bp));
624 1085
625 1086 VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
626 1087 }
627 1088 return (0);
628 1089 }
629 1090
630 1091 static void
631 1092 dsl_scan_zil(dsl_pool_t *dp, zil_header_t *zh)
632 1093 {
633 1094 uint64_t claim_txg = zh->zh_claim_txg;
634 1095 zil_scan_arg_t zsa = { dp, zh };
635 1096 zilog_t *zilog;
636 1097
637 1098 /*
638 1099 * We only want to visit blocks that have been claimed but not yet
639 1100 * replayed (or, in read-only mode, blocks that *would* be claimed).
640 1101 */
641 1102 if (claim_txg == 0 && spa_writeable(dp->dp_spa))
642 1103 return;
643 1104
644 1105 zilog = zil_alloc(dp->dp_meta_objset, zh);
645 1106
646 1107 (void) zil_parse(zilog, dsl_scan_zil_block, dsl_scan_zil_record, &zsa,
647 1108 claim_txg);
648 1109
649 1110 zil_free(zilog);
650 1111 }
651 1112
652 1113 /* ARGSUSED */
653 1114 static void
654 1115 dsl_scan_prefetch(dsl_scan_t *scn, arc_buf_t *buf, blkptr_t *bp,
655 1116 uint64_t objset, uint64_t object, uint64_t blkid)
656 1117 {
657 1118 zbookmark_phys_t czb;
658 1119 arc_flags_t flags = ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH;
659 1120
660 1121 if (zfs_no_scrub_prefetch)
661 1122 return;
662 1123
663 1124 if (BP_IS_HOLE(bp) || bp->blk_birth <= scn->scn_phys.scn_min_txg ||
664 1125 (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_DNODE))
665 1126 return;
666 1127
667 1128 SET_BOOKMARK(&czb, objset, object, BP_GET_LEVEL(bp), blkid);
668 1129
669 1130 (void) arc_read(scn->scn_zio_root, scn->scn_dp->dp_spa, bp,
670 1131 NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
671 1132 ZIO_FLAG_CANFAIL | ZIO_FLAG_SCAN_THREAD, &flags, &czb);
672 1133 }
673 1134
674 1135 static boolean_t
675 1136 dsl_scan_check_resume(dsl_scan_t *scn, const dnode_phys_t *dnp,
676 1137 const zbookmark_phys_t *zb)
677 1138 {
678 1139 /*
679 1140 * We never skip over user/group accounting objects (obj<0)
680 1141 */
681 1142 if (!ZB_IS_ZERO(&scn->scn_phys.scn_bookmark) &&
682 1143 (int64_t)zb->zb_object >= 0) {
683 1144 /*
684 1145 * If we already visited this bp & everything below (in
685 1146 * a prior txg sync), don't bother doing it again.
686 1147 */
687 1148 if (zbookmark_subtree_completed(dnp, zb,
688 1149 &scn->scn_phys.scn_bookmark))
689 1150 return (B_TRUE);
690 1151
691 1152 /*
692 1153 * If we found the block we're trying to resume from, or
693 1154 * we went past it to a different object, zero it out to
694 1155 * indicate that it's OK to start checking for suspending
695 1156 * again.
696 1157 */
697 1158 if (bcmp(zb, &scn->scn_phys.scn_bookmark, sizeof (*zb)) == 0 ||
698 1159 zb->zb_object > scn->scn_phys.scn_bookmark.zb_object) {
1160 + DTRACE_PROBE1(scan_resume, zbookmark_phys_t *, zb);
699 1161 dprintf("resuming at %llx/%llx/%llx/%llx\n",
700 1162 (longlong_t)zb->zb_objset,
701 1163 (longlong_t)zb->zb_object,
702 1164 (longlong_t)zb->zb_level,
703 1165 (longlong_t)zb->zb_blkid);
704 1166 bzero(&scn->scn_phys.scn_bookmark, sizeof (*zb));
705 1167 }
706 1168 }
707 1169 return (B_FALSE);
708 1170 }
709 1171
710 1172 /*
711 1173 * Return nonzero on i/o error.
712 1174 * Return new buf to write out in *bufp.
713 1175 */
714 1176 static int
715 1177 dsl_scan_recurse(dsl_scan_t *scn, dsl_dataset_t *ds, dmu_objset_type_t ostype,
716 1178 dnode_phys_t *dnp, const blkptr_t *bp,
717 1179 const zbookmark_phys_t *zb, dmu_tx_t *tx)
718 1180 {
719 1181 dsl_pool_t *dp = scn->scn_dp;
720 1182 int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SCAN_THREAD;
721 1183 int err;
722 1184
723 1185 if (BP_GET_LEVEL(bp) > 0) {
724 1186 arc_flags_t flags = ARC_FLAG_WAIT;
725 1187 int i;
726 1188 blkptr_t *cbp;
727 1189 int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
728 1190 arc_buf_t *buf;
729 1191
730 1192 err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
731 1193 ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
732 1194 if (err) {
733 - scn->scn_phys.scn_errors++;
1195 + atomic_inc_64(&scn->scn_phys.scn_errors);
734 1196 return (err);
735 1197 }
736 1198 for (i = 0, cbp = buf->b_data; i < epb; i++, cbp++) {
737 1199 dsl_scan_prefetch(scn, buf, cbp, zb->zb_objset,
738 1200 zb->zb_object, zb->zb_blkid * epb + i);
739 1201 }
740 1202 for (i = 0, cbp = buf->b_data; i < epb; i++, cbp++) {
741 1203 zbookmark_phys_t czb;
742 1204
743 1205 SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
744 1206 zb->zb_level - 1,
745 1207 zb->zb_blkid * epb + i);
746 1208 dsl_scan_visitbp(cbp, &czb, dnp,
747 1209 ds, scn, ostype, tx);
748 1210 }
749 1211 arc_buf_destroy(buf, &buf);
|
↓ open down ↓ |
6 lines elided |
↑ open up ↑ |
750 1212 } else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
751 1213 arc_flags_t flags = ARC_FLAG_WAIT;
752 1214 dnode_phys_t *cdnp;
753 1215 int i, j;
754 1216 int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
755 1217 arc_buf_t *buf;
756 1218
757 1219 err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
758 1220 ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
759 1221 if (err) {
760 - scn->scn_phys.scn_errors++;
1222 + atomic_inc_64(&scn->scn_phys.scn_errors);
761 1223 return (err);
762 1224 }
763 1225 for (i = 0, cdnp = buf->b_data; i < epb; i++, cdnp++) {
764 1226 for (j = 0; j < cdnp->dn_nblkptr; j++) {
765 1227 blkptr_t *cbp = &cdnp->dn_blkptr[j];
766 1228 dsl_scan_prefetch(scn, buf, cbp,
767 1229 zb->zb_objset, zb->zb_blkid * epb + i, j);
768 1230 }
769 1231 }
770 1232 for (i = 0, cdnp = buf->b_data; i < epb; i++, cdnp++) {
771 1233 dsl_scan_visitdnode(scn, ds, ostype,
772 1234 cdnp, zb->zb_blkid * epb + i, tx);
773 1235 }
|
↓ open down ↓ |
3 lines elided |
↑ open up ↑ |
774 1236
775 1237 arc_buf_destroy(buf, &buf);
776 1238 } else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
777 1239 arc_flags_t flags = ARC_FLAG_WAIT;
778 1240 objset_phys_t *osp;
779 1241 arc_buf_t *buf;
780 1242
781 1243 err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
782 1244 ZIO_PRIORITY_ASYNC_READ, zio_flags, &flags, zb);
783 1245 if (err) {
784 - scn->scn_phys.scn_errors++;
1246 + atomic_inc_64(&scn->scn_phys.scn_errors);
785 1247 return (err);
786 1248 }
787 1249
788 1250 osp = buf->b_data;
789 1251
790 1252 dsl_scan_visitdnode(scn, ds, osp->os_type,
791 1253 &osp->os_meta_dnode, DMU_META_DNODE_OBJECT, tx);
792 1254
793 1255 if (OBJSET_BUF_HAS_USERUSED(buf)) {
794 1256 /*
795 1257 * We also always visit user/group accounting
796 1258 * objects, and never skip them, even if we are
797 1259 * suspending. This is necessary so that the space
798 1260 * deltas from this txg get integrated.
799 1261 */
800 1262 dsl_scan_visitdnode(scn, ds, osp->os_type,
801 1263 &osp->os_groupused_dnode,
802 1264 DMU_GROUPUSED_OBJECT, tx);
803 1265 dsl_scan_visitdnode(scn, ds, osp->os_type,
804 1266 &osp->os_userused_dnode,
805 1267 DMU_USERUSED_OBJECT, tx);
806 1268 }
807 1269 arc_buf_destroy(buf, &buf);
808 1270 }
809 1271
810 1272 return (0);
811 1273 }
812 1274
813 1275 static void
814 1276 dsl_scan_visitdnode(dsl_scan_t *scn, dsl_dataset_t *ds,
815 1277 dmu_objset_type_t ostype, dnode_phys_t *dnp,
816 1278 uint64_t object, dmu_tx_t *tx)
817 1279 {
818 1280 int j;
819 1281
820 1282 for (j = 0; j < dnp->dn_nblkptr; j++) {
821 1283 zbookmark_phys_t czb;
822 1284
823 1285 SET_BOOKMARK(&czb, ds ? ds->ds_object : 0, object,
824 1286 dnp->dn_nlevels - 1, j);
825 1287 dsl_scan_visitbp(&dnp->dn_blkptr[j],
826 1288 &czb, dnp, ds, scn, ostype, tx);
827 1289 }
828 1290
829 1291 if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
830 1292 zbookmark_phys_t czb;
831 1293 SET_BOOKMARK(&czb, ds ? ds->ds_object : 0, object,
832 1294 0, DMU_SPILL_BLKID);
833 1295 dsl_scan_visitbp(&dnp->dn_spill,
834 1296 &czb, dnp, ds, scn, ostype, tx);
835 1297 }
836 1298 }
837 1299
838 1300 /*
839 1301 * The arguments are in this order because mdb can only print the
840 1302 * first 5; we want them to be useful.
841 1303 */
842 1304 static void
843 1305 dsl_scan_visitbp(blkptr_t *bp, const zbookmark_phys_t *zb,
844 1306 dnode_phys_t *dnp, dsl_dataset_t *ds, dsl_scan_t *scn,
845 1307 dmu_objset_type_t ostype, dmu_tx_t *tx)
846 1308 {
847 1309 dsl_pool_t *dp = scn->scn_dp;
848 1310 arc_buf_t *buf = NULL;
849 1311 blkptr_t bp_toread = *bp;
850 1312
851 1313 /* ASSERT(pbuf == NULL || arc_released(pbuf)); */
852 1314
853 1315 if (dsl_scan_check_suspend(scn, zb))
854 1316 return;
855 1317
856 1318 if (dsl_scan_check_resume(scn, dnp, zb))
857 1319 return;
858 1320
859 1321 if (BP_IS_HOLE(bp))
860 1322 return;
861 1323
862 1324 scn->scn_visited_this_txg++;
863 1325
1326 +#ifdef _KERNEL
1327 + DTRACE_PROBE7(scan_visitbp, blkptr_t *, bp, zbookmark_phys_t *, zb,
1328 + dnode_phys_t *, dnp, dsl_dataset_t *, ds, dsl_scan_t *, scn,
1329 + dmu_objset_type_t, ostype, dmu_tx_t *, tx);
1330 +#endif /* _KERNEL */
864 1331 dprintf_bp(bp,
865 1332 "visiting ds=%p/%llu zb=%llx/%llx/%llx/%llx bp=%p",
866 1333 ds, ds ? ds->ds_object : 0,
867 1334 zb->zb_objset, zb->zb_object, zb->zb_level, zb->zb_blkid,
868 1335 bp);
869 1336
870 1337 if (bp->blk_birth <= scn->scn_phys.scn_cur_min_txg)
871 1338 return;
872 1339
873 1340 if (dsl_scan_recurse(scn, ds, ostype, dnp, &bp_toread, zb, tx) != 0)
874 1341 return;
875 1342
876 1343 /*
877 1344 * If dsl_scan_ddt() has already visited this block, it will have
878 1345 * already done any translations or scrubbing, so don't call the
879 1346 * callback again.
880 1347 */
881 1348 if (ddt_class_contains(dp->dp_spa,
882 1349 scn->scn_phys.scn_ddt_class_max, bp)) {
883 1350 ASSERT(buf == NULL);
884 1351 return;
885 1352 }
886 1353
887 1354 /*
888 1355 * If this block is from the future (after cur_max_txg), then we
889 1356 * are doing this on behalf of a deleted snapshot, and we will
890 1357 * revisit the future block on the next pass of this dataset.
891 1358 * Don't scan it now unless we need to because something
892 1359 * under it was modified.
893 1360 */
894 1361 if (BP_PHYSICAL_BIRTH(bp) <= scn->scn_phys.scn_cur_max_txg) {
895 1362 scan_funcs[scn->scn_phys.scn_func](dp, bp, zb);
896 1363 }
897 1364 }
898 1365
899 1366 static void
26 lines elided
900 1367 dsl_scan_visit_rootbp(dsl_scan_t *scn, dsl_dataset_t *ds, blkptr_t *bp,
901 1368 dmu_tx_t *tx)
902 1369 {
903 1370 zbookmark_phys_t zb;
904 1371
905 1372 SET_BOOKMARK(&zb, ds ? ds->ds_object : DMU_META_OBJSET,
906 1373 ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
907 1374 dsl_scan_visitbp(bp, &zb, NULL,
908 1375 ds, scn, DMU_OST_NONE, tx);
909 1376
1377 + DTRACE_PROBE4(scan_finished, dsl_scan_t *, scn, dsl_dataset_t *, ds,
1378 + blkptr_t *, bp, dmu_tx_t *, tx);
910 1379 dprintf_ds(ds, "finished scan%s", "");
911 1380 }
912 1381
913 -void
914 -dsl_scan_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
1382 +static void
1383 +ds_destroyed_scn_phys(dsl_dataset_t *ds, dsl_scan_phys_t *scn_phys)
915 1384 {
916 - dsl_pool_t *dp = ds->ds_dir->dd_pool;
917 - dsl_scan_t *scn = dp->dp_scan;
918 - uint64_t mintxg;
919 -
920 - if (scn->scn_phys.scn_state != DSS_SCANNING)
921 - return;
922 -
923 - if (scn->scn_phys.scn_bookmark.zb_objset == ds->ds_object) {
1385 + if (scn_phys->scn_bookmark.zb_objset == ds->ds_object) {
924 1386 if (ds->ds_is_snapshot) {
925 1387 /*
926 1388 * Note:
927 1389 * - scn_cur_{min,max}_txg stays the same.
928 1390 * - Setting the flag is not really necessary if
929 1391 * scn_cur_max_txg == scn_max_txg, because there
930 1392 * is nothing after this snapshot that we care
931 1393 * about. However, we set it anyway and then
932 1394 * ignore it when we retraverse it in
933 1395 * dsl_scan_visitds().
934 1396 */
935 - scn->scn_phys.scn_bookmark.zb_objset =
1397 + scn_phys->scn_bookmark.zb_objset =
936 1398 dsl_dataset_phys(ds)->ds_next_snap_obj;
937 1399 zfs_dbgmsg("destroying ds %llu; currently traversing; "
938 1400 "reset zb_objset to %llu",
939 1401 (u_longlong_t)ds->ds_object,
940 1402 (u_longlong_t)dsl_dataset_phys(ds)->
941 1403 ds_next_snap_obj);
942 - scn->scn_phys.scn_flags |= DSF_VISIT_DS_AGAIN;
1404 + scn_phys->scn_flags |= DSF_VISIT_DS_AGAIN;
943 1405 } else {
944 - SET_BOOKMARK(&scn->scn_phys.scn_bookmark,
1406 + SET_BOOKMARK(&scn_phys->scn_bookmark,
945 1407 ZB_DESTROYED_OBJSET, 0, 0, 0);
946 1408 zfs_dbgmsg("destroying ds %llu; currently traversing; "
947 1409 "reset bookmark to -1,0,0,0",
948 1410 (u_longlong_t)ds->ds_object);
949 1411 }
950 - } else if (zap_lookup_int_key(dp->dp_meta_objset,
951 - scn->scn_phys.scn_queue_obj, ds->ds_object, &mintxg) == 0) {
1412 + }
1413 +}
1414 +
1415 +/*
1416 + * Invoked when a dataset is destroyed. We need to make sure that:
1417 + *
1418 + * 1) If it is the dataset that was currently being scanned, we write
     1419 + *    a new dsl_scan_phys_t and mark the objset reference in it
1420 + * as destroyed.
1421 + * 2) Remove it from the work queue, if it was present.
1422 + *
1423 + * If the dataset was actually a snapshot, instead of marking the dataset
     1424 + * as destroyed, we substitute the next snapshot in line.
1425 + */
1426 +void
1427 +dsl_scan_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
1428 +{
1429 + dsl_pool_t *dp = ds->ds_dir->dd_pool;
1430 + dsl_scan_t *scn = dp->dp_scan;
1431 + uint64_t mintxg;
1432 +
1433 + if (!dsl_scan_is_running(scn))
1434 + return;
1435 +
1436 + ds_destroyed_scn_phys(ds, &scn->scn_phys);
1437 + ds_destroyed_scn_phys(ds, &scn->scn_phys_cached);
1438 +
1439 + if (scan_ds_queue_contains(scn, ds->ds_object, &mintxg)) {
1440 + scan_ds_queue_remove(scn, ds->ds_object);
1441 + if (ds->ds_is_snapshot) {
1442 + VERIFY0(scan_ds_queue_insert(scn,
1443 + dsl_dataset_phys(ds)->ds_next_snap_obj, mintxg));
1444 + }
1445 + }
1446 +
1447 + if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1448 + ds->ds_object, &mintxg) == 0) {
1449 + DTRACE_PROBE3(scan_ds_destroyed__in_queue,
1450 + dsl_scan_t *, scn, dsl_dataset_t *, ds, dmu_tx_t *, tx);
952 1451 ASSERT3U(dsl_dataset_phys(ds)->ds_num_children, <=, 1);
953 1452 VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
954 1453 scn->scn_phys.scn_queue_obj, ds->ds_object, tx));
955 1454 if (ds->ds_is_snapshot) {
956 1455 /*
957 1456 * We keep the same mintxg; it could be >
958 1457 * ds_creation_txg if the previous snapshot was
959 1458 * deleted too.
960 1459 */
961 1460 VERIFY(zap_add_int_key(dp->dp_meta_objset,
962 1461 scn->scn_phys.scn_queue_obj,
963 1462 dsl_dataset_phys(ds)->ds_next_snap_obj,
964 1463 mintxg, tx) == 0);
965 1464 zfs_dbgmsg("destroying ds %llu; in queue; "
966 1465 "replacing with %llu",
967 1466 (u_longlong_t)ds->ds_object,
968 1467 (u_longlong_t)dsl_dataset_phys(ds)->
969 1468 ds_next_snap_obj);
8 lines elided
970 1469 } else {
971 1470 zfs_dbgmsg("destroying ds %llu; in queue; removing",
972 1471 (u_longlong_t)ds->ds_object);
973 1472 }
974 1473 }
975 1474
976 1475 /*
977 1476 * dsl_scan_sync() should be called after this, and should sync
978 1477 * out our changed state, but just to be safe, do it here.
979 1478 */
980 - dsl_scan_sync_state(scn, tx);
1479 + dsl_scan_sync_state(scn, tx, SYNC_CACHED);
981 1480 }
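
The scan_ds_queue_contains/remove/insert calls above operate on an in-core work queue of (dataset object, mintxg) pairs that shadows the on-disk ZAP queue object. The actual container is defined earlier in dsl_scan.c and is not part of this hunk; the following is a minimal user-space sketch of the interface these call sites assume. All sketch_* names are illustrative, not the real implementation.

	#include <errno.h>
	#include <stdint.h>
	#include <stdlib.h>

	/* One queued dataset: "scan dsobj for blocks born after de_mintxg". */
	typedef struct sketch_ds_ent {
		uint64_t		de_dsobj;
		uint64_t		de_mintxg;
		struct sketch_ds_ent	*de_next;
	} sketch_ds_ent_t;

	static sketch_ds_ent_t *sketch_queue;	/* simple list; the real code keeps it sorted */

	/* Returns 0 on success, EEXIST if dsobj is already queued, ENOMEM on failure. */
	static int
	sketch_queue_insert(uint64_t dsobj, uint64_t mintxg)
	{
		sketch_ds_ent_t *e;

		for (e = sketch_queue; e != NULL; e = e->de_next)
			if (e->de_dsobj == dsobj)
				return (EEXIST);
		if ((e = malloc(sizeof (*e))) == NULL)
			return (ENOMEM);
		e->de_dsobj = dsobj;
		e->de_mintxg = mintxg;
		e->de_next = sketch_queue;
		sketch_queue = e;
		return (0);
	}

	/* Returns nonzero and fills *mintxg if dsobj is queued. */
	static int
	sketch_queue_contains(uint64_t dsobj, uint64_t *mintxg)
	{
		for (sketch_ds_ent_t *e = sketch_queue; e != NULL; e = e->de_next) {
			if (e->de_dsobj == dsobj) {
				if (mintxg != NULL)
					*mintxg = e->de_mintxg;
				return (1);
			}
		}
		return (0);
	}

	/* Removes dsobj if present; a no-op otherwise. */
	static void
	sketch_queue_remove(uint64_t dsobj)
	{
		sketch_ds_ent_t **ep = &sketch_queue;

		while (*ep != NULL && (*ep)->de_dsobj != dsobj)
			ep = &(*ep)->de_next;
		if (*ep != NULL) {
			sketch_ds_ent_t *e = *ep;
			*ep = e->de_next;
			free(e);
		}
	}

With that picture in mind, the destroyed-dataset hook above reduces to "remove the old key, re-insert it under the next snapshot's object number", applied to both scn_phys and scn_phys_cached and mirrored against the on-disk ZAP queue.
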
982 1481
1482 +static void
1483 +ds_snapshotted_bookmark(dsl_dataset_t *ds, zbookmark_phys_t *scn_bookmark)
1484 +{
1485 + if (scn_bookmark->zb_objset == ds->ds_object) {
1486 + scn_bookmark->zb_objset =
1487 + dsl_dataset_phys(ds)->ds_prev_snap_obj;
1488 + zfs_dbgmsg("snapshotting ds %llu; currently traversing; "
1489 + "reset zb_objset to %llu",
1490 + (u_longlong_t)ds->ds_object,
1491 + (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
1492 + }
1493 +}
1494 +
1495 +/*
1496 + * Called when a dataset is snapshotted. If we were currently traversing
1497 + * this snapshot, we reset our bookmark to point at the newly created
1498 + * snapshot. We also modify our work queue to remove the old snapshot and
1499 + * replace with the new one.
1500 + */
983 1501 void
984 1502 dsl_scan_ds_snapshotted(dsl_dataset_t *ds, dmu_tx_t *tx)
985 1503 {
986 1504 dsl_pool_t *dp = ds->ds_dir->dd_pool;
987 1505 dsl_scan_t *scn = dp->dp_scan;
988 1506 uint64_t mintxg;
989 1507
990 - if (scn->scn_phys.scn_state != DSS_SCANNING)
1508 + if (!dsl_scan_is_running(scn))
991 1509 return;
992 1510
993 1511 ASSERT(dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
994 1512
995 - if (scn->scn_phys.scn_bookmark.zb_objset == ds->ds_object) {
996 - scn->scn_phys.scn_bookmark.zb_objset =
997 - dsl_dataset_phys(ds)->ds_prev_snap_obj;
998 - zfs_dbgmsg("snapshotting ds %llu; currently traversing; "
999 - "reset zb_objset to %llu",
1000 - (u_longlong_t)ds->ds_object,
1001 - (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
1002 - } else if (zap_lookup_int_key(dp->dp_meta_objset,
1003 - scn->scn_phys.scn_queue_obj, ds->ds_object, &mintxg) == 0) {
1513 + ds_snapshotted_bookmark(ds, &scn->scn_phys.scn_bookmark);
1514 + ds_snapshotted_bookmark(ds, &scn->scn_phys_cached.scn_bookmark);
1515 +
1516 + if (scan_ds_queue_contains(scn, ds->ds_object, &mintxg)) {
1517 + scan_ds_queue_remove(scn, ds->ds_object);
1518 + VERIFY0(scan_ds_queue_insert(scn,
1519 + dsl_dataset_phys(ds)->ds_prev_snap_obj, mintxg));
1520 + }
1521 +
1522 + if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1523 + ds->ds_object, &mintxg) == 0) {
1524 + DTRACE_PROBE3(scan_ds_snapshotted__in_queue,
1525 + dsl_scan_t *, scn, dsl_dataset_t *, ds, dmu_tx_t *, tx);
1526 +
1004 1527 VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1005 1528 scn->scn_phys.scn_queue_obj, ds->ds_object, tx));
1006 1529 VERIFY(zap_add_int_key(dp->dp_meta_objset,
1007 1530 scn->scn_phys.scn_queue_obj,
1008 1531 dsl_dataset_phys(ds)->ds_prev_snap_obj, mintxg, tx) == 0);
1009 1532 zfs_dbgmsg("snapshotting ds %llu; in queue; "
1010 1533 "replacing with %llu",
1011 1534 (u_longlong_t)ds->ds_object,
1012 1535 (u_longlong_t)dsl_dataset_phys(ds)->ds_prev_snap_obj);
1013 1536 }
1014 - dsl_scan_sync_state(scn, tx);
1537 +
1538 + dsl_scan_sync_state(scn, tx, SYNC_CACHED);
1015 1539 }
1016 1540
1541 +static void
1542 +ds_clone_swapped_bookmark(dsl_dataset_t *ds1, dsl_dataset_t *ds2,
1543 + zbookmark_phys_t *scn_bookmark)
1544 +{
1545 + if (scn_bookmark->zb_objset == ds1->ds_object) {
1546 + scn_bookmark->zb_objset = ds2->ds_object;
1547 + zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1548 + "reset zb_objset to %llu",
1549 + (u_longlong_t)ds1->ds_object,
1550 + (u_longlong_t)ds2->ds_object);
1551 + } else if (scn_bookmark->zb_objset == ds2->ds_object) {
1552 + scn_bookmark->zb_objset = ds1->ds_object;
1553 + zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1554 + "reset zb_objset to %llu",
1555 + (u_longlong_t)ds2->ds_object,
1556 + (u_longlong_t)ds1->ds_object);
1557 + }
1558 +}
1559 +
1560 +/*
1561 + * Called when a parent dataset and its clone are swapped. If we were
1562 + * currently traversing the dataset, we need to switch to traversing the
1563 + * newly promoted parent.
1564 + */
1017 1565 void
1018 1566 dsl_scan_ds_clone_swapped(dsl_dataset_t *ds1, dsl_dataset_t *ds2, dmu_tx_t *tx)
1019 1567 {
1020 1568 dsl_pool_t *dp = ds1->ds_dir->dd_pool;
1021 1569 dsl_scan_t *scn = dp->dp_scan;
1022 1570 uint64_t mintxg;
1023 1571
1024 - if (scn->scn_phys.scn_state != DSS_SCANNING)
1572 + if (!dsl_scan_is_running(scn))
1025 1573 return;
1026 1574
1027 - if (scn->scn_phys.scn_bookmark.zb_objset == ds1->ds_object) {
1028 - scn->scn_phys.scn_bookmark.zb_objset = ds2->ds_object;
1575 + ds_clone_swapped_bookmark(ds1, ds2, &scn->scn_phys.scn_bookmark);
1576 + ds_clone_swapped_bookmark(ds1, ds2, &scn->scn_phys_cached.scn_bookmark);
1577 +
1578 + if (scan_ds_queue_contains(scn, ds1->ds_object, &mintxg)) {
1579 + int err;
1580 +
1581 + scan_ds_queue_remove(scn, ds1->ds_object);
1582 + err = scan_ds_queue_insert(scn, ds2->ds_object, mintxg);
1583 + VERIFY(err == 0 || err == EEXIST);
1584 + if (err == EEXIST) {
1585 + /* Both were there to begin with */
1586 + VERIFY0(scan_ds_queue_insert(scn, ds1->ds_object,
1587 + mintxg));
1588 + }
1029 1589 zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1030 1590 "reset zb_objset to %llu",
1031 1591 (u_longlong_t)ds1->ds_object,
1032 1592 (u_longlong_t)ds2->ds_object);
1033 - } else if (scn->scn_phys.scn_bookmark.zb_objset == ds2->ds_object) {
1034 - scn->scn_phys.scn_bookmark.zb_objset = ds1->ds_object;
1593 + } else if (scan_ds_queue_contains(scn, ds2->ds_object, &mintxg)) {
1594 + scan_ds_queue_remove(scn, ds2->ds_object);
1595 + VERIFY0(scan_ds_queue_insert(scn, ds1->ds_object, mintxg));
1035 1596 zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
1036 1597 "reset zb_objset to %llu",
1037 1598 (u_longlong_t)ds2->ds_object,
1038 1599 (u_longlong_t)ds1->ds_object);
1039 1600 }
1040 1601
1041 1602 if (zap_lookup_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1042 1603 ds1->ds_object, &mintxg) == 0) {
1043 1604 int err;
1044 1605
1606 + DTRACE_PROBE4(scan_ds_clone_swapped__in_queue_ds1,
1607 + dsl_scan_t *, scn, dsl_dataset_t *, ds1,
1608 + dsl_dataset_t *, ds2, dmu_tx_t *, tx);
1045 1609 ASSERT3U(mintxg, ==, dsl_dataset_phys(ds1)->ds_prev_snap_txg);
1046 1610 ASSERT3U(mintxg, ==, dsl_dataset_phys(ds2)->ds_prev_snap_txg);
1047 1611 VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1048 1612 scn->scn_phys.scn_queue_obj, ds1->ds_object, tx));
1049 1613 err = zap_add_int_key(dp->dp_meta_objset,
1050 1614 scn->scn_phys.scn_queue_obj, ds2->ds_object, mintxg, tx);
1051 1615 VERIFY(err == 0 || err == EEXIST);
1052 1616 if (err == EEXIST) {
1053 1617 /* Both were there to begin with */
1054 1618 VERIFY(0 == zap_add_int_key(dp->dp_meta_objset,
1055 1619 scn->scn_phys.scn_queue_obj,
1056 1620 ds1->ds_object, mintxg, tx));
1057 1621 }
1058 1622 zfs_dbgmsg("clone_swap ds %llu; in queue; "
1059 1623 "replacing with %llu",
1060 1624 (u_longlong_t)ds1->ds_object,
1061 1625 (u_longlong_t)ds2->ds_object);
1062 1626 } else if (zap_lookup_int_key(dp->dp_meta_objset,
1063 1627 scn->scn_phys.scn_queue_obj, ds2->ds_object, &mintxg) == 0) {
1628 + DTRACE_PROBE4(scan_ds_clone_swapped__in_queue_ds2,
1629 + dsl_scan_t *, scn, dsl_dataset_t *, ds1,
1630 + dsl_dataset_t *, ds2, dmu_tx_t *, tx);
1064 1631 ASSERT3U(mintxg, ==, dsl_dataset_phys(ds1)->ds_prev_snap_txg);
1065 1632 ASSERT3U(mintxg, ==, dsl_dataset_phys(ds2)->ds_prev_snap_txg);
1066 1633 VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1067 1634 scn->scn_phys.scn_queue_obj, ds2->ds_object, tx));
1068 1635 VERIFY(0 == zap_add_int_key(dp->dp_meta_objset,
1069 1636 scn->scn_phys.scn_queue_obj, ds1->ds_object, mintxg, tx));
1070 1637 zfs_dbgmsg("clone_swap ds %llu; in queue; "
1071 1638 "replacing with %llu",
1072 1639 (u_longlong_t)ds2->ds_object,
1073 1640 (u_longlong_t)ds1->ds_object);
1074 1641 }
1075 1642
1076 - dsl_scan_sync_state(scn, tx);
1643 + dsl_scan_sync_state(scn, tx, SYNC_CACHED);
1077 1644 }
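
Expressed against the sketch interface from the earlier note, the clone-swap substitution above boils down to the pattern below (illustrative only, error handling beyond EEXIST elided). The EEXIST case is the "both datasets were already queued" situation handled explicitly in the code above.

	/* Swap dsobj1 for dsobj2 in the sketch queue, preserving its mintxg. */
	static void
	sketch_queue_swap(uint64_t dsobj1, uint64_t dsobj2)
	{
		uint64_t mintxg;

		if (!sketch_queue_contains(dsobj1, &mintxg))
			return;
		sketch_queue_remove(dsobj1);
		/* EEXIST: dsobj2 was queued too, so dsobj1 must stay queued as well. */
		if (sketch_queue_insert(dsobj2, mintxg) == EEXIST)
			(void) sketch_queue_insert(dsobj1, mintxg);
	}
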
1078 1645
1079 -struct enqueue_clones_arg {
1080 - dmu_tx_t *tx;
1081 - uint64_t originobj;
1082 -};
1083 -
1084 1646 /* ARGSUSED */
1085 1647 static int
1086 1648 enqueue_clones_cb(dsl_pool_t *dp, dsl_dataset_t *hds, void *arg)
1087 1649 {
1088 - struct enqueue_clones_arg *eca = arg;
1650 + uint64_t originobj = *(uint64_t *)arg;
1089 1651 dsl_dataset_t *ds;
1090 1652 int err;
1091 1653 dsl_scan_t *scn = dp->dp_scan;
1092 1654
1093 - if (dsl_dir_phys(hds->ds_dir)->dd_origin_obj != eca->originobj)
1655 + if (dsl_dir_phys(hds->ds_dir)->dd_origin_obj != originobj)
1094 1656 return (0);
1095 1657
1096 1658 err = dsl_dataset_hold_obj(dp, hds->ds_object, FTAG, &ds);
1097 1659 if (err)
1098 1660 return (err);
1099 1661
1100 - while (dsl_dataset_phys(ds)->ds_prev_snap_obj != eca->originobj) {
1662 + while (dsl_dataset_phys(ds)->ds_prev_snap_obj != originobj) {
1101 1663 dsl_dataset_t *prev;
1102 1664 err = dsl_dataset_hold_obj(dp,
1103 1665 dsl_dataset_phys(ds)->ds_prev_snap_obj, FTAG, &prev);
1104 1666
1105 1667 dsl_dataset_rele(ds, FTAG);
1106 1668 if (err)
1107 1669 return (err);
1108 1670 ds = prev;
1109 1671 }
1110 - VERIFY(zap_add_int_key(dp->dp_meta_objset,
1111 - scn->scn_phys.scn_queue_obj, ds->ds_object,
1112 - dsl_dataset_phys(ds)->ds_prev_snap_txg, eca->tx) == 0);
1672 + VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
1673 + dsl_dataset_phys(ds)->ds_prev_snap_txg));
1113 1674 dsl_dataset_rele(ds, FTAG);
1114 1675 return (0);
1115 1676 }
1116 1677
1117 1678 static void
1118 1679 dsl_scan_visitds(dsl_scan_t *scn, uint64_t dsobj, dmu_tx_t *tx)
1119 1680 {
1120 1681 dsl_pool_t *dp = scn->scn_dp;
1121 1682 dsl_dataset_t *ds;
1683 + objset_t *os;
1122 1684
1123 1685 VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
1124 1686
1125 1687 if (scn->scn_phys.scn_cur_min_txg >=
1126 1688 scn->scn_phys.scn_max_txg) {
1127 1689 /*
1128 1690 * This can happen if this snapshot was created after the
1129 1691 * scan started, and we already completed a previous snapshot
1130 1692 * that was created after the scan started. This snapshot
1131 1693 * only references blocks with:
1132 1694 *
1133 1695 * birth < our ds_creation_txg
1134 1696 * cur_min_txg is no less than ds_creation_txg.
1135 1697 * We have already visited these blocks.
1136 1698 * or
1137 1699 * birth > scn_max_txg
1138 1700 * The scan requested not to visit these blocks.
1139 1701 *
1140 1702 * Subsequent snapshots (and clones) can reference our
1141 1703 * blocks, or blocks with even higher birth times.
1142 1704 * Therefore we do not need to visit them either,
1143 1705 * so we do not add them to the work queue.
1144 1706 *
1145 1707 * Note that checking for cur_min_txg >= cur_max_txg
1146 1708 * is not sufficient, because in that case we may need to
1147 1709 * visit subsequent snapshots. This happens when min_txg > 0,
1148 1710 * which raises cur_min_txg. In this case we will visit
1149 1711 * this dataset but skip all of its blocks, because the
1150 1712 * rootbp's birth time is < cur_min_txg. Then we will
1151 1713 * add the next snapshots/clones to the work queue.
1152 1714 */
1153 1715 char *dsname = kmem_alloc(MAXNAMELEN, KM_SLEEP);
1154 1716 dsl_dataset_name(ds, dsname);
23 lines elided
1155 1717 zfs_dbgmsg("scanning dataset %llu (%s) is unnecessary because "
1156 1718 "cur_min_txg (%llu) >= max_txg (%llu)",
1157 1719 dsobj, dsname,
1158 1720 scn->scn_phys.scn_cur_min_txg,
1159 1721 scn->scn_phys.scn_max_txg);
1160 1722 kmem_free(dsname, MAXNAMELEN);
1161 1723
1162 1724 goto out;
1163 1725 }
1164 1726
1727 + if (dmu_objset_from_ds(ds, &os))
1728 + goto out;
1729 +
1165 1730 /*
1166 - * Only the ZIL in the head (non-snapshot) is valid. Even though
1731 + * Only the ZIL in the head (non-snapshot) is valid. Even though
1167 1732 * snapshots can have ZIL block pointers (which may be the same
1168 - * BP as in the head), they must be ignored. In addition, $ORIGIN
1169 - * doesn't have a objset (i.e. its ds_bp is a hole) so we don't
1170 - * need to look for a ZIL in it either. So we traverse the ZIL here,
1171 - * rather than in scan_recurse(), because the regular snapshot
1172 - * block-sharing rules don't apply to it.
1733 + * BP as in the head), they must be ignored. So we traverse the
1734 + * ZIL here, rather than in scan_recurse(), because the regular
1735 + * snapshot block-sharing rules don't apply to it.
1173 1736 */
1174 - if (DSL_SCAN_IS_SCRUB_RESILVER(scn) && !dsl_dataset_is_snapshot(ds) &&
1175 - ds->ds_dir != dp->dp_origin_snap->ds_dir) {
1176 - objset_t *os;
1177 - if (dmu_objset_from_ds(ds, &os) != 0) {
1178 - goto out;
1179 - }
1737 + if (DSL_SCAN_IS_SCRUB_RESILVER(scn) && !ds->ds_is_snapshot)
1180 1738 dsl_scan_zil(dp, &os->os_zil_header);
1181 - }
1182 1739
1183 1740 /*
1184 1741 * Iterate over the bps in this ds.
1185 1742 */
1186 1743 dmu_buf_will_dirty(ds->ds_dbuf, tx);
1187 1744 rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG);
1188 1745 dsl_scan_visit_rootbp(scn, ds, &dsl_dataset_phys(ds)->ds_bp, tx);
1189 1746 rrw_exit(&ds->ds_bp_rwlock, FTAG);
1190 1747
1191 1748 char *dsname = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN, KM_SLEEP);
1192 1749 dsl_dataset_name(ds, dsname);
1193 1750 zfs_dbgmsg("scanned dataset %llu (%s) with min=%llu max=%llu; "
1194 1751 "suspending=%u",
1195 1752 (longlong_t)dsobj, dsname,
1196 1753 (longlong_t)scn->scn_phys.scn_cur_min_txg,
1197 1754 (longlong_t)scn->scn_phys.scn_cur_max_txg,
1198 1755 (int)scn->scn_suspending);
1199 1756 kmem_free(dsname, ZFS_MAX_DATASET_NAME_LEN);
1200 1757
1758 + DTRACE_PROBE3(scan_done, dsl_scan_t *, scn, dsl_dataset_t *, ds,
1759 + dmu_tx_t *, tx);
1760 +
1201 1761 if (scn->scn_suspending)
1202 1762 goto out;
1203 1763
1204 1764 /*
1205 1765 * We've finished this pass over this dataset.
1206 1766 */
1207 1767
1208 1768 /*
1209 1769 * If we did not completely visit this dataset, do another pass.
1210 1770 */
1211 1771 if (scn->scn_phys.scn_flags & DSF_VISIT_DS_AGAIN) {
1772 + DTRACE_PROBE3(scan_incomplete, dsl_scan_t *, scn,
1773 + dsl_dataset_t *, ds, dmu_tx_t *, tx);
1212 1774 zfs_dbgmsg("incomplete pass; visiting again");
1213 1775 scn->scn_phys.scn_flags &= ~DSF_VISIT_DS_AGAIN;
1214 - VERIFY(zap_add_int_key(dp->dp_meta_objset,
1215 - scn->scn_phys.scn_queue_obj, ds->ds_object,
1216 - scn->scn_phys.scn_cur_max_txg, tx) == 0);
1776 + VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
1777 + scn->scn_phys.scn_cur_max_txg));
1217 1778 goto out;
1218 1779 }
1219 1780
1220 1781 /*
1221 1782 * Add descendent datasets to work queue.
1222 1783 */
1223 1784 if (dsl_dataset_phys(ds)->ds_next_snap_obj != 0) {
1224 - VERIFY(zap_add_int_key(dp->dp_meta_objset,
1225 - scn->scn_phys.scn_queue_obj,
1785 + VERIFY0(scan_ds_queue_insert(scn,
1226 1786 dsl_dataset_phys(ds)->ds_next_snap_obj,
1227 - dsl_dataset_phys(ds)->ds_creation_txg, tx) == 0);
1787 + dsl_dataset_phys(ds)->ds_creation_txg));
1228 1788 }
1229 1789 if (dsl_dataset_phys(ds)->ds_num_children > 1) {
1230 1790 boolean_t usenext = B_FALSE;
1231 1791 if (dsl_dataset_phys(ds)->ds_next_clones_obj != 0) {
1232 1792 uint64_t count;
1233 1793 /*
1234 1794 * A bug in a previous version of the code could
1235 1795 * cause upgrade_clones_cb() to not set
1236 1796 * ds_next_snap_obj when it should, leading to a
1237 1797 * missing entry. Therefore we can only use the
1238 1798 * next_clones_obj when its count is correct.
1239 1799 */
1240 1800 int err = zap_count(dp->dp_meta_objset,
1241 1801 dsl_dataset_phys(ds)->ds_next_clones_obj, &count);
1242 1802 if (err == 0 &&
1243 1803 count == dsl_dataset_phys(ds)->ds_num_children - 1)
1244 1804 usenext = B_TRUE;
1245 1805 }
1246 1806
1247 1807 if (usenext) {
1248 - VERIFY0(zap_join_key(dp->dp_meta_objset,
1249 - dsl_dataset_phys(ds)->ds_next_clones_obj,
1250 - scn->scn_phys.scn_queue_obj,
1251 - dsl_dataset_phys(ds)->ds_creation_txg, tx));
1808 + zap_cursor_t zc;
1809 + zap_attribute_t za;
1810 + for (zap_cursor_init(&zc, dp->dp_meta_objset,
1811 + dsl_dataset_phys(ds)->ds_next_clones_obj);
1812 + zap_cursor_retrieve(&zc, &za) == 0;
1813 + (void) zap_cursor_advance(&zc)) {
1814 + VERIFY0(scan_ds_queue_insert(scn,
1815 + zfs_strtonum(za.za_name, NULL),
1816 + dsl_dataset_phys(ds)->ds_creation_txg));
1817 + }
1818 + zap_cursor_fini(&zc);
1252 1819 } else {
1253 - struct enqueue_clones_arg eca;
1254 - eca.tx = tx;
1255 - eca.originobj = ds->ds_object;
1256 -
1257 1820 VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1258 - enqueue_clones_cb, &eca, DS_FIND_CHILDREN));
1821 + enqueue_clones_cb, &ds->ds_object,
1822 + DS_FIND_CHILDREN));
1259 1823 }
1260 1824 }
1261 1825
1262 1826 out:
1263 1827 dsl_dataset_rele(ds, FTAG);
1264 1828 }
1265 1829
1266 1830 /* ARGSUSED */
1267 1831 static int
1268 1832 enqueue_cb(dsl_pool_t *dp, dsl_dataset_t *hds, void *arg)
1269 1833 {
1270 - dmu_tx_t *tx = arg;
1271 1834 dsl_dataset_t *ds;
1272 1835 int err;
1273 1836 dsl_scan_t *scn = dp->dp_scan;
1274 1837
1275 1838 err = dsl_dataset_hold_obj(dp, hds->ds_object, FTAG, &ds);
1276 1839 if (err)
1277 1840 return (err);
1278 1841
1279 1842 while (dsl_dataset_phys(ds)->ds_prev_snap_obj != 0) {
1280 1843 dsl_dataset_t *prev;
1281 1844 err = dsl_dataset_hold_obj(dp,
1282 1845 dsl_dataset_phys(ds)->ds_prev_snap_obj, FTAG, &prev);
1283 1846 if (err) {
1284 1847 dsl_dataset_rele(ds, FTAG);
1285 1848 return (err);
1286 1849 }
1287 1850
1288 1851 /*
1289 1852 * If this is a clone, we don't need to worry about it for now.
9 lines elided
1290 1853 */
1291 1854 if (dsl_dataset_phys(prev)->ds_next_snap_obj != ds->ds_object) {
1292 1855 dsl_dataset_rele(ds, FTAG);
1293 1856 dsl_dataset_rele(prev, FTAG);
1294 1857 return (0);
1295 1858 }
1296 1859 dsl_dataset_rele(ds, FTAG);
1297 1860 ds = prev;
1298 1861 }
1299 1862
1300 - VERIFY(zap_add_int_key(dp->dp_meta_objset, scn->scn_phys.scn_queue_obj,
1301 - ds->ds_object, dsl_dataset_phys(ds)->ds_prev_snap_txg, tx) == 0);
1863 + VERIFY0(scan_ds_queue_insert(scn, ds->ds_object,
1864 + dsl_dataset_phys(ds)->ds_prev_snap_txg));
1302 1865 dsl_dataset_rele(ds, FTAG);
1303 1866 return (0);
1304 1867 }
1305 1868
1306 1869 /*
1307 1870 * Scrub/dedup interaction.
1308 1871 *
1309 1872 * If there are N references to a deduped block, we don't want to scrub it
1310 1873 * N times -- ideally, we should scrub it exactly once.
1311 1874 *
1312 1875 * We leverage the fact that the dde's replication class (enum ddt_class)
1313 1876 * is ordered from highest replication class (DDT_CLASS_DITTO) to lowest
1314 1877 * (DDT_CLASS_UNIQUE) so that we may walk the DDT in that order.
1315 1878 *
1316 1879 * To prevent excess scrubbing, the scrub begins by walking the DDT
1317 1880 * to find all blocks with refcnt > 1, and scrubs each of these once.
1318 1881 * Since there are two replication classes which contain blocks with
1319 1882 * refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
1320 1883 * Finally the top-down scrub begins, only visiting blocks with refcnt == 1.
1321 1884 *
1322 1885 * There would be nothing more to say if a block's refcnt couldn't change
1323 1886 * during a scrub, but of course it can so we must account for changes
1324 1887 * in a block's replication class.
1325 1888 *
1326 1889 * Here's an example of what can occur:
1327 1890 *
1328 1891 * If a block has refcnt > 1 during the DDT scrub phase, but has refcnt == 1
1329 1892 * when visited during the top-down scrub phase, it will be scrubbed twice.
1330 1893 * This negates our scrub optimization, but is otherwise harmless.
1331 1894 *
1332 1895 * If a block has refcnt == 1 during the DDT scrub phase, but has refcnt > 1
1333 1896 * on each visit during the top-down scrub phase, it will never be scrubbed.
1334 1897 * To catch this, ddt_sync_entry() notifies the scrub code whenever a block's
1335 1898 * reference class transitions to a higher level (i.e DDT_CLASS_UNIQUE to
1336 1899 * DDT_CLASS_DUPLICATE); if it transitions from refcnt == 1 to refcnt > 1
1337 1900 * while a scrub is in progress, it scrubs the block right then.
1338 1901 */
1339 1902 static void
1340 1903 dsl_scan_ddt(dsl_scan_t *scn, dmu_tx_t *tx)
1341 1904 {
30 lines elided
1342 1905 ddt_bookmark_t *ddb = &scn->scn_phys.scn_ddt_bookmark;
1343 1906 ddt_entry_t dde = { 0 };
1344 1907 int error;
1345 1908 uint64_t n = 0;
1346 1909
1347 1910 while ((error = ddt_walk(scn->scn_dp->dp_spa, ddb, &dde)) == 0) {
1348 1911 ddt_t *ddt;
1349 1912
1350 1913 if (ddb->ddb_class > scn->scn_phys.scn_ddt_class_max)
1351 1914 break;
1915 + DTRACE_PROBE1(scan_ddb, ddt_bookmark_t *, ddb);
1352 1916 dprintf("visiting ddb=%llu/%llu/%llu/%llx\n",
1353 1917 (longlong_t)ddb->ddb_class,
1354 1918 (longlong_t)ddb->ddb_type,
1355 1919 (longlong_t)ddb->ddb_checksum,
1356 1920 (longlong_t)ddb->ddb_cursor);
1357 1921
1358 1922 /* There should be no pending changes to the dedup table */
1359 1923 ddt = scn->scn_dp->dp_spa->spa_ddt[ddb->ddb_checksum];
1360 - ASSERT(avl_first(&ddt->ddt_tree) == NULL);
1361 -
1924 +#ifdef ZFS_DEBUG
1925 + for (uint_t i = 0; i < DDT_HASHSZ; i++)
1926 + ASSERT(avl_first(&ddt->ddt_tree[i]) == NULL);
1927 +#endif
1362 1928 dsl_scan_ddt_entry(scn, ddb->ddb_checksum, &dde, tx);
1363 1929 n++;
1364 1930
1365 1931 if (dsl_scan_check_suspend(scn, NULL))
1366 1932 break;
1367 1933 }
1368 1934
1935 + DTRACE_PROBE2(scan_ddt_done, dsl_scan_t *, scn, uint64_t, n);
1369 1936 zfs_dbgmsg("scanned %llu ddt entries with class_max = %u; "
1370 1937 "suspending=%u", (longlong_t)n,
1371 1938 (int)scn->scn_phys.scn_ddt_class_max, (int)scn->scn_suspending);
1372 1939
1373 1940 ASSERT(error == 0 || error == ENOENT);
1374 1941 ASSERT(error != ENOENT ||
1375 1942 ddb->ddb_class > scn->scn_phys.scn_ddt_class_max);
1376 1943 }
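
The "Scrub/dedup interaction" comment above leans on two facts: the replication classes are ordered from most to least replicated, and scn_ddt_class_max is the boundary between the DDT phase and the top-down phase. A compact restatement of that partitioning, using hypothetical stand-ins for the real ddt_class_t values from ddt.h:

	/* Hypothetical mirror of the ddt_class ordering relied on above. */
	typedef enum sketch_ddt_class {
		SKETCH_CLASS_DITTO = 0,		/* most replicated; scanned first */
		SKETCH_CLASS_DUPLICATE,		/* refcnt > 1 */
		SKETCH_CLASS_UNIQUE		/* refcnt == 1; left to the top-down scan */
	} sketch_ddt_class_t;

	/* Nonzero if the DDT phase (rather than the top-down phase) owns this class. */
	static int
	sketch_ddt_phase_owns(sketch_ddt_class_t cls, sketch_ddt_class_t class_max)
	{
		return (cls <= class_max);
	}

dsl_scan_ddt() above walks exactly the classes this predicate admits (breaking once ddb_class exceeds class_max), and dsl_scan_visitbp() skips the scrub callback for any block that ddt_class_contains() reports as belonging to one of those classes, which is what avoids scrubbing deduped blocks once per reference.
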
1377 1944
1378 1945 /* ARGSUSED */
1379 1946 void
1380 1947 dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
1381 1948 ddt_entry_t *dde, dmu_tx_t *tx)
1382 1949 {
1383 1950 const ddt_key_t *ddk = &dde->dde_key;
1384 1951 ddt_phys_t *ddp = dde->dde_phys;
1385 1952 blkptr_t bp;
1386 1953 zbookmark_phys_t zb = { 0 };
1387 1954
1388 1955 if (scn->scn_phys.scn_state != DSS_SCANNING)
1389 1956 return;
1390 1957
1391 1958 for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
1392 1959 if (ddp->ddp_phys_birth == 0 ||
1393 1960 ddp->ddp_phys_birth > scn->scn_phys.scn_max_txg)
1394 1961 continue;
1395 1962 ddt_bp_create(checksum, ddk, ddp, &bp);
17 lines elided
1396 1963
1397 1964 scn->scn_visited_this_txg++;
1398 1965 scan_funcs[scn->scn_phys.scn_func](scn->scn_dp, &bp, &zb);
1399 1966 }
1400 1967 }
1401 1968
1402 1969 static void
1403 1970 dsl_scan_visit(dsl_scan_t *scn, dmu_tx_t *tx)
1404 1971 {
1405 1972 dsl_pool_t *dp = scn->scn_dp;
1406 - zap_cursor_t zc;
1407 - zap_attribute_t za;
1973 + uint64_t dsobj, txg;
1408 1974
1409 1975 if (scn->scn_phys.scn_ddt_bookmark.ddb_class <=
1410 1976 scn->scn_phys.scn_ddt_class_max) {
1411 1977 scn->scn_phys.scn_cur_min_txg = scn->scn_phys.scn_min_txg;
1412 1978 scn->scn_phys.scn_cur_max_txg = scn->scn_phys.scn_max_txg;
1413 1979 dsl_scan_ddt(scn, tx);
1414 1980 if (scn->scn_suspending)
1415 1981 return;
1416 1982 }
1417 1983
1418 1984 if (scn->scn_phys.scn_bookmark.zb_objset == DMU_META_OBJSET) {
1419 1985 /* First do the MOS & ORIGIN */
1420 1986
3 lines elided
1421 1987 scn->scn_phys.scn_cur_min_txg = scn->scn_phys.scn_min_txg;
1422 1988 scn->scn_phys.scn_cur_max_txg = scn->scn_phys.scn_max_txg;
1423 1989 dsl_scan_visit_rootbp(scn, NULL,
1424 1990 &dp->dp_meta_rootbp, tx);
1425 1991 spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp);
1426 1992 if (scn->scn_suspending)
1427 1993 return;
1428 1994
1429 1995 if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
1430 1996 VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj,
1431 - enqueue_cb, tx, DS_FIND_CHILDREN));
1997 + enqueue_cb, NULL, DS_FIND_CHILDREN));
1432 1998 } else {
1433 1999 dsl_scan_visitds(scn,
1434 2000 dp->dp_origin_snap->ds_object, tx);
1435 2001 }
1436 2002 ASSERT(!scn->scn_suspending);
1437 2003 } else if (scn->scn_phys.scn_bookmark.zb_objset !=
1438 2004 ZB_DESTROYED_OBJSET) {
2005 + uint64_t dsobj = scn->scn_phys.scn_bookmark.zb_objset;
1439 2006 /*
1440 2007 * If we were suspended, continue from here. Note if the
1441 2008 * ds we were suspended on was deleted, the zb_objset may
1442 2009 * be -1, so we will skip this and find a new objset
1443 2010 * below.
1444 2011 */
1445 - dsl_scan_visitds(scn, scn->scn_phys.scn_bookmark.zb_objset, tx);
2012 + dsl_scan_visitds(scn, dsobj, tx);
1446 2013 if (scn->scn_suspending)
1447 2014 return;
1448 2015 }
1449 2016
1450 2017 /*
1451 2018 * In case we were suspended right at the end of the ds, zero the
1452 2019 * bookmark so we don't think that we're still trying to resume.
1453 2020 */
1454 2021 bzero(&scn->scn_phys.scn_bookmark, sizeof (zbookmark_phys_t));
1455 2022
1456 2023 /* keep pulling things out of the zap-object-as-queue */
1457 - while (zap_cursor_init(&zc, dp->dp_meta_objset,
1458 - scn->scn_phys.scn_queue_obj),
1459 - zap_cursor_retrieve(&zc, &za) == 0) {
2024 + while (scan_ds_queue_first(scn, &dsobj, &txg)) {
1460 2025 dsl_dataset_t *ds;
1461 - uint64_t dsobj;
1462 2026
1463 - dsobj = zfs_strtonum(za.za_name, NULL);
1464 - VERIFY3U(0, ==, zap_remove_int(dp->dp_meta_objset,
1465 - scn->scn_phys.scn_queue_obj, dsobj, tx));
2027 + scan_ds_queue_remove(scn, dsobj);
1466 2028
1467 2029 /* Set up min/max txg */
1468 2030 VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
1469 - if (za.za_first_integer != 0) {
2031 + if (txg != 0) {
1470 2032 scn->scn_phys.scn_cur_min_txg =
1471 - MAX(scn->scn_phys.scn_min_txg,
1472 - za.za_first_integer);
2033 + MAX(scn->scn_phys.scn_min_txg, txg);
1473 2034 } else {
1474 2035 scn->scn_phys.scn_cur_min_txg =
1475 2036 MAX(scn->scn_phys.scn_min_txg,
1476 2037 dsl_dataset_phys(ds)->ds_prev_snap_txg);
1477 2038 }
1478 2039 scn->scn_phys.scn_cur_max_txg = dsl_scan_ds_maxtxg(ds);
1479 2040 dsl_dataset_rele(ds, FTAG);
1480 2041
1481 2042 dsl_scan_visitds(scn, dsobj, tx);
1482 - zap_cursor_fini(&zc);
1483 2043 if (scn->scn_suspending)
1484 2044 return;
1485 2045 }
1486 - zap_cursor_fini(&zc);
2046 + /* No more objsets to fetch, we're done */
2047 + scn->scn_phys.scn_bookmark.zb_objset = ZB_DESTROYED_OBJSET;
2048 + ASSERT0(scn->scn_suspending);
1487 2049 }
1488 2050
1489 2051 static boolean_t
1490 -dsl_scan_async_block_should_pause(dsl_scan_t *scn)
2052 +dsl_scan_free_should_suspend(dsl_scan_t *scn)
1491 2053 {
1492 2054 uint64_t elapsed_nanosecs;
1493 2055
1494 2056 if (zfs_recover)
1495 2057 return (B_FALSE);
1496 2058
1497 - if (scn->scn_visited_this_txg >= zfs_async_block_max_blocks)
2059 + if (scn->scn_visited_this_txg >= zfs_free_max_blocks)
1498 2060 return (B_TRUE);
1499 2061
1500 2062 elapsed_nanosecs = gethrtime() - scn->scn_sync_start_time;
1501 2063 return (elapsed_nanosecs / NANOSEC > zfs_txg_timeout ||
1502 - (NSEC2MSEC(elapsed_nanosecs) > scn->scn_async_block_min_time_ms &&
2064 + (NSEC2MSEC(elapsed_nanosecs) > zfs_free_min_time_ms &&
1503 2065 txg_sync_waiting(scn->scn_dp)) ||
1504 2066 spa_shutting_down(scn->scn_dp->dp_spa));
1505 2067 }
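
dsl_scan_free_should_suspend() above bounds how long a single txg's free pass may run. A self-contained restatement of the same decision, with the inputs passed explicitly, is sketched below; the kernel version reads zfs_free_max_blocks, zfs_free_min_time_ms and zfs_txg_timeout directly, and its zfs_recover escape hatch (which disables suspension entirely) is omitted here. The helper and its names are illustrative.

	#include <stdint.h>

	#define	SKETCH_NANOSEC		1000000000ULL
	#define	SKETCH_MSEC2NSEC(ms)	((uint64_t)(ms) * 1000000ULL)

	/* Nonzero means "stop freeing for now and let the txg sync finish". */
	static int
	sketch_free_should_suspend(uint64_t visited, uint64_t max_blocks,
	    uint64_t elapsed_ns, uint64_t txg_timeout_sec, uint64_t min_time_ms,
	    int txg_sync_waiting, int spa_shutting_down)
	{
		if (visited >= max_blocks)
			return (1);			/* enough work done this txg */
		if (elapsed_ns / SKETCH_NANOSEC > txg_timeout_sec)
			return (1);			/* don't hold the txg hostage */
		if (elapsed_ns > SKETCH_MSEC2NSEC(min_time_ms) && txg_sync_waiting)
			return (1);			/* minimum time served, yield */
		return (spa_shutting_down);		/* always bail out on export */
	}
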
1506 2068
1507 2069 static int
1508 2070 dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
1509 2071 {
1510 2072 dsl_scan_t *scn = arg;
1511 2073
1512 2074 if (!scn->scn_is_bptree ||
1513 2075 (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_OBJSET)) {
1514 - if (dsl_scan_async_block_should_pause(scn))
2076 + if (dsl_scan_free_should_suspend(scn))
1515 2077 return (SET_ERROR(ERESTART));
1516 2078 }
1517 2079
1518 2080 zio_nowait(zio_free_sync(scn->scn_zio_root, scn->scn_dp->dp_spa,
1519 2081 dmu_tx_get_txg(tx), bp, 0));
1520 2082 dsl_dir_diduse_space(tx->tx_pool->dp_free_dir, DD_USED_HEAD,
1521 2083 -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
1522 2084 -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
1523 2085 scn->scn_visited_this_txg++;
1524 2086 return (0);
1525 2087 }
1526 2088
1527 -static int
1528 -dsl_scan_obsolete_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
1529 -{
1530 - dsl_scan_t *scn = arg;
1531 - const dva_t *dva = &bp->blk_dva[0];
1532 -
1533 - if (dsl_scan_async_block_should_pause(scn))
1534 - return (SET_ERROR(ERESTART));
1535 -
1536 - spa_vdev_indirect_mark_obsolete(scn->scn_dp->dp_spa,
1537 - DVA_GET_VDEV(dva), DVA_GET_OFFSET(dva),
1538 - DVA_GET_ASIZE(dva), tx);
1539 - scn->scn_visited_this_txg++;
1540 - return (0);
1541 -}
1542 -
1543 2089 boolean_t
1544 2090 dsl_scan_active(dsl_scan_t *scn)
1545 2091 {
1546 2092 spa_t *spa = scn->scn_dp->dp_spa;
1547 2093 uint64_t used = 0, comp, uncomp;
1548 2094
1549 2095 if (spa->spa_load_state != SPA_LOAD_NONE)
1550 2096 return (B_FALSE);
1551 2097 if (spa_shutting_down(spa))
1552 2098 return (B_FALSE);
1553 - if ((scn->scn_phys.scn_state == DSS_SCANNING &&
1554 - !dsl_scan_is_paused_scrub(scn)) ||
2099 + if ((dsl_scan_is_running(scn) && !dsl_scan_is_paused_scrub(scn)) ||
1555 2100 (scn->scn_async_destroying && !scn->scn_async_stalled))
1556 2101 return (B_TRUE);
1557 2102
1558 2103 if (spa_version(scn->scn_dp->dp_spa) >= SPA_VERSION_DEADLISTS) {
1559 2104 (void) bpobj_space(&scn->scn_dp->dp_free_bpobj,
1560 2105 &used, &comp, &uncomp);
1561 2106 }
1562 2107 return (used != 0);
1563 2108 }
1564 2109
1565 2110 /* Called whenever a txg syncs. */
1566 2111 void
1567 2112 dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
1568 2113 {
1569 2114 dsl_scan_t *scn = dp->dp_scan;
1570 2115 spa_t *spa = dp->dp_spa;
1571 2116 int err = 0;
1572 2117
1573 2118 /*
1574 2119 * Check for scn_restart_txg before checking spa_load_state, so
1575 2120 * that we can restart an old-style scan while the pool is being
1576 2121 * imported (see dsl_scan_init).
1577 2122 */
1578 2123 if (dsl_scan_restarting(scn, tx)) {
1579 2124 pool_scan_func_t func = POOL_SCAN_SCRUB;
1580 2125 dsl_scan_done(scn, B_FALSE, tx);
1581 2126 if (vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL))
1582 2127 func = POOL_SCAN_RESILVER;
1583 2128 zfs_dbgmsg("restarting scan func=%u txg=%llu",
1584 2129 func, tx->tx_txg);
1585 2130 dsl_scan_setup_sync(&func, tx);
1586 2131 }
1587 2132
1588 2133 /*
1589 2134 * Only process scans in sync pass 1.
1590 2135 */
1591 2136 if (spa_sync_pass(dp->dp_spa) > 1)
1592 2137 return;
1593 2138
1594 2139 /*
1595 2140 * If the spa is shutting down, then stop scanning. This will
1596 2141 * ensure that the scan does not dirty any new data during the
1597 2142 * shutdown phase.
1598 2143 */
1599 2144 if (spa_shutting_down(spa))
1600 2145 return;
1601 2146
1602 2147 /*
1603 2148 * If the scan is inactive due to a stalled async destroy, try again.
1604 2149 */
1605 2150 if (!scn->scn_async_stalled && !dsl_scan_active(scn))
1606 2151 return;
1607 2152
1608 2153 scn->scn_visited_this_txg = 0;
1609 2154 scn->scn_suspending = B_FALSE;
1610 2155 scn->scn_sync_start_time = gethrtime();
1611 2156 spa->spa_scrub_active = B_TRUE;
1612 2157
48 lines elided
1613 2158 /*
1614 2159 * First process the async destroys. If we suspend, don't do
1615 2160 * any scrubbing or resilvering. This ensures that there are no
1616 2161 * async destroys while we are scanning, so the scan code doesn't
1617 2162 * have to worry about traversing it. It is also faster to free the
1618 2163 * blocks than to scrub them.
1619 2164 */
1620 2165 if (zfs_free_bpobj_enabled &&
1621 2166 spa_version(dp->dp_spa) >= SPA_VERSION_DEADLISTS) {
1622 2167 scn->scn_is_bptree = B_FALSE;
1623 - scn->scn_async_block_min_time_ms = zfs_free_min_time_ms;
1624 2168 scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1625 2169 NULL, ZIO_FLAG_MUSTSUCCEED);
1626 2170 err = bpobj_iterate(&dp->dp_free_bpobj,
1627 2171 dsl_scan_free_block_cb, scn, tx);
1628 2172 VERIFY3U(0, ==, zio_wait(scn->scn_zio_root));
1629 2173
1630 2174 if (err != 0 && err != ERESTART)
1631 2175 zfs_panic_recover("error %u from bpobj_iterate()", err);
1632 2176 }
1633 2177
1634 2178 if (err == 0 && spa_feature_is_active(spa, SPA_FEATURE_ASYNC_DESTROY)) {
1635 2179 ASSERT(scn->scn_async_destroying);
1636 2180 scn->scn_is_bptree = B_TRUE;
1637 2181 scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1638 2182 NULL, ZIO_FLAG_MUSTSUCCEED);
1639 2183 err = bptree_iterate(dp->dp_meta_objset,
1640 2184 dp->dp_bptree_obj, B_TRUE, dsl_scan_free_block_cb, scn, tx);
1641 2185 VERIFY0(zio_wait(scn->scn_zio_root));
1642 2186
1643 2187 if (err == EIO || err == ECKSUM) {
1644 2188 err = 0;
1645 2189 } else if (err != 0 && err != ERESTART) {
1646 2190 zfs_panic_recover("error %u from "
1647 2191 "traverse_dataset_destroyed()", err);
1648 2192 }
1649 2193
1650 2194 if (bptree_is_empty(dp->dp_meta_objset, dp->dp_bptree_obj)) {
1651 2195 /* finished; deactivate async destroy feature */
1652 2196 spa_feature_decr(spa, SPA_FEATURE_ASYNC_DESTROY, tx);
1653 2197 ASSERT(!spa_feature_is_active(spa,
1654 2198 SPA_FEATURE_ASYNC_DESTROY));
1655 2199 VERIFY0(zap_remove(dp->dp_meta_objset,
1656 2200 DMU_POOL_DIRECTORY_OBJECT,
1657 2201 DMU_POOL_BPTREE_OBJ, tx));
1658 2202 VERIFY0(bptree_free(dp->dp_meta_objset,
1659 2203 dp->dp_bptree_obj, tx));
1660 2204 dp->dp_bptree_obj = 0;
1661 2205 scn->scn_async_destroying = B_FALSE;
1662 2206 scn->scn_async_stalled = B_FALSE;
1663 2207 } else {
1664 2208 /*
1665 2209 * If we didn't make progress, mark the async
1666 2210 * destroy as stalled, so that we will not initiate
1667 2211 * a spa_sync() on its behalf. Note that we only
1668 2212 * check this if we are not finished, because if the
1669 2213 * bptree had no blocks for us to visit, we can
1670 2214 * finish without "making progress".
1671 2215 */
1672 2216 scn->scn_async_stalled =
1673 2217 (scn->scn_visited_this_txg == 0);
1674 2218 }
1675 2219 }
1676 2220 if (scn->scn_visited_this_txg) {
1677 2221 zfs_dbgmsg("freed %llu blocks in %llums from "
1678 2222 "free_bpobj/bptree txg %llu; err=%u",
1679 2223 (longlong_t)scn->scn_visited_this_txg,
1680 2224 (longlong_t)
1681 2225 NSEC2MSEC(gethrtime() - scn->scn_sync_start_time),
1682 2226 (longlong_t)tx->tx_txg, err);
1683 2227 scn->scn_visited_this_txg = 0;
1684 2228
1685 2229 /*
1686 2230 * Write out changes to the DDT that may be required as a
1687 2231 * result of the blocks freed. This ensures that the DDT
1688 2232 * is clean when a scrub/resilver runs.
1689 2233 */
1690 2234 ddt_sync(spa, tx->tx_txg);
1691 2235 }
1692 2236 if (err != 0)
1693 2237 return;
1694 2238 if (dp->dp_free_dir != NULL && !scn->scn_async_destroying &&
1695 2239 zfs_free_leak_on_eio &&
1696 2240 (dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes != 0 ||
1697 2241 dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes != 0 ||
1698 2242 dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes != 0)) {
1699 2243 /*
1700 2244 * We have finished background destroying, but there is still
1701 2245 * some space left in the dp_free_dir. Transfer this leaked
1702 2246 * space to the dp_leak_dir.
1703 2247 */
1704 2248 if (dp->dp_leak_dir == NULL) {
1705 2249 rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);
1706 2250 (void) dsl_dir_create_sync(dp, dp->dp_root_dir,
1707 2251 LEAK_DIR_NAME, tx);
1708 2252 VERIFY0(dsl_pool_open_special_dir(dp,
1709 2253 LEAK_DIR_NAME, &dp->dp_leak_dir));
1710 2254 rrw_exit(&dp->dp_config_rwlock, FTAG);
77 lines elided
1711 2255 }
1712 2256 dsl_dir_diduse_space(dp->dp_leak_dir, DD_USED_HEAD,
1713 2257 dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
1714 2258 dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
1715 2259 dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
1716 2260 dsl_dir_diduse_space(dp->dp_free_dir, DD_USED_HEAD,
1717 2261 -dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
1718 2262 -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
1719 2263 -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
1720 2264 }
1721 -
1722 2265 if (dp->dp_free_dir != NULL && !scn->scn_async_destroying) {
1723 2266 /* finished; verify that space accounting went to zero */
1724 2267 ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
1725 2268 ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);
1726 2269 ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes);
1727 2270 }
1728 2271
1729 - EQUIV(bpobj_is_open(&dp->dp_obsolete_bpobj),
1730 - 0 == zap_contains(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
1731 - DMU_POOL_OBSOLETE_BPOBJ));
1732 - if (err == 0 && bpobj_is_open(&dp->dp_obsolete_bpobj)) {
1733 - ASSERT(spa_feature_is_active(dp->dp_spa,
1734 - SPA_FEATURE_OBSOLETE_COUNTS));
2272 + if (!dsl_scan_is_running(scn))
2273 + return;
1735 2274
1736 - scn->scn_is_bptree = B_FALSE;
1737 - scn->scn_async_block_min_time_ms = zfs_obsolete_min_time_ms;
1738 - err = bpobj_iterate(&dp->dp_obsolete_bpobj,
1739 - dsl_scan_obsolete_block_cb, scn, tx);
1740 - if (err != 0 && err != ERESTART)
1741 - zfs_panic_recover("error %u from bpobj_iterate()", err);
1742 -
1743 - if (bpobj_is_empty(&dp->dp_obsolete_bpobj))
1744 - dsl_pool_destroy_obsolete_bpobj(dp, tx);
2275 + if (!zfs_scan_direct) {
2276 + if (!scn->scn_is_sorted)
2277 + scn->scn_last_queue_run_time = 0;
2278 + scn->scn_is_sorted = B_TRUE;
1745 2279 }
1746 2280
1747 - if (scn->scn_phys.scn_state != DSS_SCANNING)
1748 - return;
1749 -
1750 - if (scn->scn_done_txg == tx->tx_txg) {
2281 + if (scn->scn_done_txg == tx->tx_txg ||
2282 + scn->scn_phys.scn_state == DSS_FINISHING) {
1751 2283 ASSERT(!scn->scn_suspending);
2284 + if (scn->scn_bytes_pending != 0) {
2285 + ASSERT(scn->scn_is_sorted);
2286 + scn->scn_phys.scn_state = DSS_FINISHING;
2287 + goto finish;
2288 + }
1752 2289 /* finished with scan. */
1753 2290 zfs_dbgmsg("txg %llu scan complete", tx->tx_txg);
1754 2291 dsl_scan_done(scn, B_TRUE, tx);
1755 2292 ASSERT3U(spa->spa_scrub_inflight, ==, 0);
1756 - dsl_scan_sync_state(scn, tx);
2293 + dsl_scan_sync_state(scn, tx, SYNC_MANDATORY);
1757 2294 return;
1758 2295 }
1759 2296
1760 2297 if (dsl_scan_is_paused_scrub(scn))
1761 2298 return;
1762 2299
1763 2300 if (scn->scn_phys.scn_ddt_bookmark.ddb_class <=
1764 2301 scn->scn_phys.scn_ddt_class_max) {
1765 2302 zfs_dbgmsg("doing scan sync txg %llu; "
1766 2303 "ddt bm=%llu/%llu/%llu/%llx",
1767 2304 (longlong_t)tx->tx_txg,
1768 2305 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_class,
1769 2306 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_type,
1770 2307 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_checksum,
1771 2308 (longlong_t)scn->scn_phys.scn_ddt_bookmark.ddb_cursor);
1772 2309 ASSERT(scn->scn_phys.scn_bookmark.zb_objset == 0);
1773 2310 ASSERT(scn->scn_phys.scn_bookmark.zb_object == 0);
1774 2311 ASSERT(scn->scn_phys.scn_bookmark.zb_level == 0);
8 lines elided
1775 2312 ASSERT(scn->scn_phys.scn_bookmark.zb_blkid == 0);
1776 2313 } else {
1777 2314 zfs_dbgmsg("doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu",
1778 2315 (longlong_t)tx->tx_txg,
1779 2316 (longlong_t)scn->scn_phys.scn_bookmark.zb_objset,
1780 2317 (longlong_t)scn->scn_phys.scn_bookmark.zb_object,
1781 2318 (longlong_t)scn->scn_phys.scn_bookmark.zb_level,
1782 2319 (longlong_t)scn->scn_phys.scn_bookmark.zb_blkid);
1783 2320 }
1784 2321
1785 - scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
1786 - NULL, ZIO_FLAG_CANFAIL);
1787 - dsl_pool_config_enter(dp, FTAG);
1788 - dsl_scan_visit(scn, tx);
1789 - dsl_pool_config_exit(dp, FTAG);
1790 - (void) zio_wait(scn->scn_zio_root);
1791 - scn->scn_zio_root = NULL;
2322 + if (scn->scn_is_sorted) {
2323 + /*
2324 + * This is the out-of-order queue handling. We determine our
          2325 +		 * memory usage and, based on that, switch states between normal
2326 + * operation (i.e. don't issue queued up I/O unless we've
2327 + * reached the end of scanning) and 'clearing' (issue queued
2328 + * extents just to clear up some memory).
2329 + */
2330 + mem_lim_t mlim = scan_io_queue_mem_lim(scn);
1792 2331
1793 - zfs_dbgmsg("visited %llu blocks in %llums",
1794 - (longlong_t)scn->scn_visited_this_txg,
1795 - (longlong_t)NSEC2MSEC(gethrtime() - scn->scn_sync_start_time));
2332 + if (mlim == MEM_LIM_HARD && !scn->scn_clearing)
2333 + scn->scn_clearing = B_TRUE;
2334 + else if (mlim == MEM_LIM_NONE && scn->scn_clearing)
2335 + scn->scn_clearing = B_FALSE;
1796 2336
2337 + if ((scn->scn_checkpointing || ddi_get_lbolt() -
2338 + scn->scn_last_checkpoint > ZFS_SCAN_CHECKPOINT_INTVAL) &&
2339 + scn->scn_phys.scn_state != DSS_FINISHING &&
2340 + !scn->scn_clearing) {
2341 + scn->scn_checkpointing = B_TRUE;
2342 + }
2343 + }
2344 +
2345 + if (!scn->scn_clearing && !scn->scn_checkpointing) {
2346 + scn->scn_zio_root = zio_root(dp->dp_spa, NULL,
2347 + NULL, ZIO_FLAG_CANFAIL);
2348 + dsl_pool_config_enter(dp, FTAG);
2349 + dsl_scan_visit(scn, tx);
2350 + dsl_pool_config_exit(dp, FTAG);
2351 + (void) zio_wait(scn->scn_zio_root);
2352 + scn->scn_zio_root = NULL;
2353 +
2354 + zfs_dbgmsg("visited %llu blocks in %llums",
2355 + (longlong_t)scn->scn_visited_this_txg,
2356 + (longlong_t)NSEC2MSEC(gethrtime() -
2357 + scn->scn_sync_start_time));
2358 +
2359 + if (!scn->scn_suspending) {
2360 + scn->scn_done_txg = tx->tx_txg + 1;
2361 + zfs_dbgmsg("txg %llu traversal complete, waiting "
2362 + "till txg %llu", tx->tx_txg, scn->scn_done_txg);
2363 + }
2364 + }
1797 2365 if (!scn->scn_suspending) {
1798 2366 scn->scn_done_txg = tx->tx_txg + 1;
1799 2367 zfs_dbgmsg("txg %llu traversal complete, waiting till txg %llu",
1800 2368 tx->tx_txg, scn->scn_done_txg);
1801 2369 }
2370 +finish:
2371 + if (scn->scn_is_sorted) {
2372 + dsl_pool_config_enter(dp, FTAG);
2373 + scan_io_queues_run(scn);
2374 + dsl_pool_config_exit(dp, FTAG);
2375 + }
1802 2376
1803 2377 if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
1804 2378 mutex_enter(&spa->spa_scrub_lock);
1805 2379 while (spa->spa_scrub_inflight > 0) {
1806 2380 cv_wait(&spa->spa_scrub_io_cv,
1807 2381 &spa->spa_scrub_lock);
1808 2382 }
1809 2383 mutex_exit(&spa->spa_scrub_lock);
1810 2384 }
1811 2385
1812 - dsl_scan_sync_state(scn, tx);
2386 + dsl_scan_sync_state(scn, tx, SYNC_OPTIONAL);
1813 2387 }
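
In the scn_is_sorted branch above, dsl_scan_sync() flips between plain scanning, 'clearing' (issuing already-queued extents to relieve memory pressure) and periodic checkpointing. The state update can be summarized as below; mem_lim_t, MEM_LIM_HARD/MEM_LIM_NONE and ZFS_SCAN_CHECKPOINT_INTVAL come from this change, while the sketch types and names are illustrative and only model the two memory-limit levels referenced in this hunk.

	#include <stdint.h>

	typedef enum { SKETCH_MEM_LIM_NONE, SKETCH_MEM_LIM_HARD } sketch_mem_lim_t;

	typedef struct sketch_scan_state {
		int	ss_clearing;		/* issuing queued I/O to free memory */
		int	ss_checkpointing;	/* draining queues to checkpoint */
		int	ss_finishing;		/* analogue of DSS_FINISHING */
		int64_t	ss_last_checkpoint;	/* lbolt of the last checkpoint */
	} sketch_scan_state_t;

	static void
	sketch_update_sorted_state(sketch_scan_state_t *ss, sketch_mem_lim_t mlim,
	    int64_t now, int64_t checkpoint_intval)
	{
		/* Enter clearing mode under hard memory pressure; leave when it lifts. */
		if (mlim == SKETCH_MEM_LIM_HARD && !ss->ss_clearing)
			ss->ss_clearing = 1;
		else if (mlim == SKETCH_MEM_LIM_NONE && ss->ss_clearing)
			ss->ss_clearing = 0;

		/* Checkpoint periodically, but never while clearing or finishing. */
		if ((ss->ss_checkpointing ||
		    now - ss->ss_last_checkpoint > checkpoint_intval) &&
		    !ss->ss_finishing && !ss->ss_clearing)
			ss->ss_checkpointing = 1;
	}

Only when neither flag is set does the sync path descend into dsl_scan_visit(); in either special state it skips straight to scan_io_queues_run() at the finish label and drains what has already been queued.
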
1814 2388
1815 2389 /*
1816 2390 * This will start a new scan, or restart an existing one.
1817 2391 */
1818 2392 void
1819 2393 dsl_resilver_restart(dsl_pool_t *dp, uint64_t txg)
1820 2394 {
2395 + /* Stop any ongoing TRIMs */
2396 + spa_man_trim_stop(dp->dp_spa);
2397 +
1821 2398 if (txg == 0) {
1822 2399 dmu_tx_t *tx;
1823 2400 tx = dmu_tx_create_dd(dp->dp_mos_dir);
1824 2401 VERIFY(0 == dmu_tx_assign(tx, TXG_WAIT));
1825 2402
1826 2403 txg = dmu_tx_get_txg(tx);
1827 2404 dp->dp_scan->scn_restart_txg = txg;
1828 2405 dmu_tx_commit(tx);
1829 2406 } else {
1830 2407 dp->dp_scan->scn_restart_txg = txg;
1831 2408 }
1832 2409 zfs_dbgmsg("restarting resilver txg=%llu", txg);
1833 2410 }
1834 2411
1835 2412 boolean_t
1836 2413 dsl_scan_resilvering(dsl_pool_t *dp)
1837 2414 {
1838 - return (dp->dp_scan->scn_phys.scn_state == DSS_SCANNING &&
2415 + return (dsl_scan_is_running(dp->dp_scan) &&
1839 2416 dp->dp_scan->scn_phys.scn_func == POOL_SCAN_RESILVER);
1840 2417 }
1841 2418
1842 2419 /*
1843 2420 * scrub consumers
1844 2421 */
1845 2422
1846 2423 static void
1847 -count_block(zfs_all_blkstats_t *zab, const blkptr_t *bp)
2424 +count_block(dsl_scan_t *scn, zfs_all_blkstats_t *zab, const blkptr_t *bp)
1848 2425 {
1849 2426 int i;
1850 2427
2428 + for (i = 0; i < BP_GET_NDVAS(bp); i++)
2429 + atomic_add_64(&scn->scn_bytes_issued,
2430 + DVA_GET_ASIZE(&bp->blk_dva[i]));
2431 +
1851 2432 /*
1852 2433 * If we resume after a reboot, zab will be NULL; don't record
1853 2434 * incomplete stats in that case.
1854 2435 */
1855 2436 if (zab == NULL)
1856 2437 return;
1857 2438
1858 2439 for (i = 0; i < 4; i++) {
1859 2440 int l = (i < 2) ? BP_GET_LEVEL(bp) : DN_MAX_LEVELS;
1860 2441 int t = (i & 1) ? BP_GET_TYPE(bp) : DMU_OT_TOTAL;
1861 2442 if (t & DMU_OT_NEWTYPE)
1862 2443 t = DMU_OT_OTHER;
1863 2444 zfs_blkstat_t *zb = &zab->zab_type[l][t];
1864 2445 int equal;
1865 2446
1866 2447 zb->zb_count++;
1867 2448 zb->zb_asize += BP_GET_ASIZE(bp);
1868 2449 zb->zb_lsize += BP_GET_LSIZE(bp);
1869 2450 zb->zb_psize += BP_GET_PSIZE(bp);
1870 2451 zb->zb_gangs += BP_COUNT_GANG(bp);
1871 2452
1872 2453 switch (BP_GET_NDVAS(bp)) {
1873 2454 case 2:
1874 2455 if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1875 2456 DVA_GET_VDEV(&bp->blk_dva[1]))
1876 2457 zb->zb_ditto_2_of_2_samevdev++;
1877 2458 break;
1878 2459 case 3:
1879 2460 equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1880 2461 DVA_GET_VDEV(&bp->blk_dva[1])) +
1881 2462 (DVA_GET_VDEV(&bp->blk_dva[0]) ==
1882 2463 DVA_GET_VDEV(&bp->blk_dva[2])) +
1883 2464 (DVA_GET_VDEV(&bp->blk_dva[1]) ==
1884 2465 DVA_GET_VDEV(&bp->blk_dva[2]));
1885 2466 if (equal == 1)
1886 2467 zb->zb_ditto_2_of_3_samevdev++;
1887 2468 else if (equal == 3)
1888 2469 zb->zb_ditto_3_of_3_samevdev++;
1889 2470 break;
1890 2471 }
1891 2472 }
1892 2473 }
1893 2474
33 lines elided
1894 2475 static void
1895 2476 dsl_scan_scrub_done(zio_t *zio)
1896 2477 {
1897 2478 spa_t *spa = zio->io_spa;
1898 2479
1899 2480 abd_free(zio->io_abd);
1900 2481
1901 2482 mutex_enter(&spa->spa_scrub_lock);
1902 2483 spa->spa_scrub_inflight--;
1903 2484 cv_broadcast(&spa->spa_scrub_io_cv);
2485 + mutex_exit(&spa->spa_scrub_lock);
1904 2486
1905 2487 if (zio->io_error && (zio->io_error != ECKSUM ||
1906 2488 !(zio->io_flags & ZIO_FLAG_SPECULATIVE))) {
1907 - spa->spa_dsl_pool->dp_scan->scn_phys.scn_errors++;
2489 + atomic_inc_64(&spa->spa_dsl_pool->dp_scan->scn_phys.scn_errors);
2490 + DTRACE_PROBE1(scan_error, zio_t *, zio);
1908 2491 }
1909 - mutex_exit(&spa->spa_scrub_lock);
1910 2492 }
1911 2493
1912 2494 static int
1913 2495 dsl_scan_scrub_cb(dsl_pool_t *dp,
1914 2496 const blkptr_t *bp, const zbookmark_phys_t *zb)
1915 2497 {
1916 2498 dsl_scan_t *scn = dp->dp_scan;
1917 - size_t size = BP_GET_PSIZE(bp);
1918 2499 spa_t *spa = dp->dp_spa;
1919 2500 uint64_t phys_birth = BP_PHYSICAL_BIRTH(bp);
1920 2501 boolean_t needs_io;
1921 2502 int zio_flags = ZIO_FLAG_SCAN_THREAD | ZIO_FLAG_RAW | ZIO_FLAG_CANFAIL;
1922 - int scan_delay = 0;
2503 + boolean_t ignore_dva0;
1923 2504
1924 2505 if (phys_birth <= scn->scn_phys.scn_min_txg ||
1925 2506 phys_birth >= scn->scn_phys.scn_max_txg)
1926 2507 return (0);
1927 2508
1928 - count_block(dp->dp_blkstats, bp);
1929 -
1930 - if (BP_IS_EMBEDDED(bp))
2509 + if (BP_IS_EMBEDDED(bp)) {
2510 + count_block(scn, dp->dp_blkstats, bp);
1931 2511 return (0);
2512 + }
1932 2513
1933 2514 ASSERT(DSL_SCAN_IS_SCRUB_RESILVER(scn));
1934 - if (scn->scn_phys.scn_func == POOL_SCAN_SCRUB) {
2515 + if (scn->scn_phys.scn_func == POOL_SCAN_SCRUB ||
2516 + scn->scn_phys.scn_func == POOL_SCAN_MOS ||
2517 + scn->scn_phys.scn_func == POOL_SCAN_META) {
1935 2518 zio_flags |= ZIO_FLAG_SCRUB;
1936 2519 needs_io = B_TRUE;
1937 - scan_delay = zfs_scrub_delay;
1938 2520 } else {
1939 2521 ASSERT3U(scn->scn_phys.scn_func, ==, POOL_SCAN_RESILVER);
1940 2522 zio_flags |= ZIO_FLAG_RESILVER;
1941 2523 needs_io = B_FALSE;
1942 - scan_delay = zfs_resilver_delay;
1943 2524 }
1944 2525
1945 2526 /* If it's an intent log block, failure is expected. */
1946 2527 if (zb->zb_level == ZB_ZIL_LEVEL)
1947 2528 zio_flags |= ZIO_FLAG_SPECULATIVE;
1948 2529
2530 + if (scn->scn_phys.scn_func == POOL_SCAN_MOS)
2531 + needs_io = (zb->zb_objset == 0);
2532 +
2533 + if (scn->scn_phys.scn_func == POOL_SCAN_META)
2534 + needs_io = zb->zb_objset == 0 || BP_GET_LEVEL(bp) != 0 ||
2535 + DMU_OT_IS_METADATA(BP_GET_TYPE(bp));
2536 +
2537 + DTRACE_PROBE3(scan_needs_io, boolean_t, needs_io,
2538 + const blkptr_t *, bp, spa_t *, spa);
2539 +
2540 + /*
2541 + * WBC will invalidate DVA[0] after migrating the block to the main
2542 + * pool. If the user subsequently disables WBC and removes the special
2543 + * device, DVA[0] can now point to a hole vdev. We won't try to do
2544 + * I/O to it, but we must also avoid doing DTL checks.
2545 + */
2546 + ignore_dva0 = (BP_IS_SPECIAL(bp) &&
2547 + wbc_bp_is_migrated(spa_get_wbc_data(spa), bp));
2548 +
1949 2549 for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
1950 - vdev_t *vd = vdev_lookup_top(spa,
1951 - DVA_GET_VDEV(&bp->blk_dva[d]));
2550 + vdev_t *vd;
1952 2551
1953 2552 /*
1954 2553 * Keep track of how much data we've examined so that
1955 2554 * zpool(1M) status can make useful progress reports.
1956 2555 */
1957 2556 scn->scn_phys.scn_examined += DVA_GET_ASIZE(&bp->blk_dva[d]);
1958 2557 spa->spa_scan_pass_exam += DVA_GET_ASIZE(&bp->blk_dva[d]);
1959 2558
2559 + /* WBC-invalidated DVA post-migration, so skip it */
2560 + if (d == 0 && ignore_dva0)
2561 + continue;
2562 + vd = vdev_lookup_top(spa, DVA_GET_VDEV(&bp->blk_dva[d]));
2563 +
1960 2564 /* if it's a resilver, this may not be in the target range */
1961 - if (!needs_io) {
2565 + if (!needs_io && scn->scn_phys.scn_func != POOL_SCAN_MOS &&
2566 + scn->scn_phys.scn_func != POOL_SCAN_META) {
1962 2567 if (DVA_GET_GANG(&bp->blk_dva[d])) {
1963 2568 /*
1964 2569 * Gang members may be spread across multiple
1965 2570 * vdevs, so the best estimate we have is the
1966 2571 * scrub range, which has already been checked.
1967 2572 * XXX -- it would be better to change our
1968 2573 * allocation policy to ensure that all
1969 2574 * gang members reside on the same vdev.
1970 2575 */
1971 2576 needs_io = B_TRUE;
2577 + DTRACE_PROBE2(gang_bp, const blkptr_t *, bp,
2578 + spa_t *, spa);
1972 2579 } else {
1973 2580 needs_io = vdev_dtl_contains(vd, DTL_PARTIAL,
1974 2581 phys_birth, 1);
2582 + if (needs_io)
2583 + DTRACE_PROBE2(dtl, const blkptr_t *,
2584 + bp, spa_t *, spa);
1975 2585 }
1976 2586 }
1977 2587 }
1978 2588
1979 2589 if (needs_io && !zfs_no_scrub_io) {
1980 - vdev_t *rvd = spa->spa_root_vdev;
1981 - uint64_t maxinflight = rvd->vdev_children * zfs_top_maxinflight;
1982 -
1983 - mutex_enter(&spa->spa_scrub_lock);
1984 - while (spa->spa_scrub_inflight >= maxinflight)
1985 - cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
1986 - spa->spa_scrub_inflight++;
1987 - mutex_exit(&spa->spa_scrub_lock);
1988 -
1989 - /*
1990 - * If we're seeing recent (zfs_scan_idle) "important" I/Os
1991 - * then throttle our workload to limit the impact of a scan.
1992 - */
1993 - if (ddi_get_lbolt64() - spa->spa_last_io <= zfs_scan_idle)
1994 - delay(scan_delay);
1995 -
1996 - zio_nowait(zio_read(NULL, spa, bp,
1997 - abd_alloc_for_io(size, B_FALSE), size, dsl_scan_scrub_done,
1998 - NULL, ZIO_PRIORITY_SCRUB, zio_flags, zb));
2590 + dsl_scan_enqueue(dp, bp, zio_flags, zb);
2591 + } else {
2592 + count_block(scn, dp->dp_blkstats, bp);
1999 2593 }
2000 2594
2001 2595 /* do not relocate this block */
2002 2596 return (0);
2003 2597 }
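
The POOL_SCAN_MOS and POOL_SCAN_META scan functions referenced above (additions in this tree) reuse the scrub path but narrow which blocks are actually read. Restated as a standalone predicate, with illustrative helper and parameter names: objset corresponds to zb->zb_objset, level to BP_GET_LEVEL(bp) and is_metadata to DMU_OT_IS_METADATA(BP_GET_TYPE(bp)).

	#include <stdint.h>

	/* Nonzero if this block should be read for the given scan variant. */
	static int
	sketch_scan_needs_io(int func_is_mos, int func_is_meta, uint64_t objset,
	    int level, int is_metadata)
	{
		if (func_is_mos)
			return (objset == 0);		/* MOS-only scan */
		if (func_is_meta)
			return (objset == 0 || level != 0 || is_metadata);
		return (1);				/* full scrub reads everything */
	}

For a resilver the decision instead falls back to the per-vdev DTL check shown above. When needs_io holds (and zfs_no_scrub_io is not set), the block is handed to dsl_scan_enqueue() rather than read immediately, which is where the sorted-issue machinery takes over; otherwise it is only counted.
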
2004 2598
2005 2599 /*
2006 2600 * Called by the ZFS_IOC_POOL_SCAN ioctl to start a scrub or resilver.
2007 2601 * Can also be called to resume a paused scrub.
2008 2602 */
2009 2603 int
2010 2604 dsl_scan(dsl_pool_t *dp, pool_scan_func_t func)
2011 2605 {
2012 2606 spa_t *spa = dp->dp_spa;
2013 2607 dsl_scan_t *scn = dp->dp_scan;
2014 2608
2015 2609 /*
2016 2610 * Purge all vdev caches and probe all devices. We do this here
2017 2611 * rather than in sync context because this requires a writer lock
2018 2612 * on the spa_config lock, which we can't do from sync context. The
2019 2613 * spa_scrub_reopen flag indicates that vdev_open() should not
2020 2614 * attempt to start another scrub.
2021 2615 */
[ 13 lines elided ]
2022 2616 spa_vdev_state_enter(spa, SCL_NONE);
2023 2617 spa->spa_scrub_reopen = B_TRUE;
2024 2618 vdev_reopen(spa->spa_root_vdev);
2025 2619 spa->spa_scrub_reopen = B_FALSE;
2026 2620 (void) spa_vdev_state_exit(spa, NULL, 0);
2027 2621
2028 2622 if (func == POOL_SCAN_SCRUB && dsl_scan_is_paused_scrub(scn)) {
2029 2623 /* got scrub start cmd, resume paused scrub */
2030 2624 int err = dsl_scrub_set_pause_resume(scn->scn_dp,
2031 2625 POOL_SCRUB_NORMAL);
2032 - if (err == 0) {
2033 - spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_RESUME);
2626 + if (err == 0)
2034 2627 return (ECANCELED);
2035 - }
2036 2628
2037 2629 return (SET_ERROR(err));
2038 2630 }
2039 2631
2040 2632 return (dsl_sync_task(spa_name(spa), dsl_scan_setup_check,
2041 2633 dsl_scan_setup_sync, &func, 0, ZFS_SPACE_CHECK_NONE));
2042 2634 }
2043 2635
2044 2636 static boolean_t
2045 2637 dsl_scan_restarting(dsl_scan_t *scn, dmu_tx_t *tx)
2046 2638 {
2047 2639 return (scn->scn_restart_txg != 0 &&
2048 2640 scn->scn_restart_txg <= tx->tx_txg);
2641 +}
2642 +
2643 +/*
2644 + * Grand theory statement on scan queue sorting
2645 + *
2646 + * Scanning is implemented by recursively traversing all indirection levels
2647 + * in an object and reading all blocks referenced from said objects. This
2648 + * results in us approximately traversing the object from lowest logical
 2649 + * offset to the highest. Naturally, if we simply read all blocks in
 2650 + * this order, we would require that the blocks also be physically arranged
 2651 + * in a roughly linear fashion on the vdevs. However, this is frequently
2652 + * not the case on pools. So we instead stick the I/Os into a reordering
2653 + * queue and issue them out of logical order and in a way that most benefits
2654 + * physical disks (LBA-order).
2655 + *
2656 + * This sorting algorithm is subject to limitations. We can't do this with
2657 + * blocks that are non-leaf, because the scanner itself depends on these
2658 + * being available ASAP for further metadata traversal. So we exclude any
 2659 + * block whose bp_level is > 0. Fortunately, this usually represents only
2660 + * around 1% of our data volume, so no great loss.
2661 + *
2662 + * As a further limitation, we cannot sort blocks which have more than
2663 + * one DVA present (copies > 1), because there's no sensible way to sort
2664 + * these (how do you sort a queue based on multiple contradictory
2665 + * criteria?). So we exclude those as well. Again, these are very rarely
2666 + * used for leaf blocks, usually only on metadata.
2667 + *
2668 + * WBC consideration: we can't sort blocks which have not yet been fully
2669 + * migrated to normal devices, because their data can reside purely on the
2670 + * special device or on both normal and special. This would require larger
 2671 + * data structures to track both DVAs in our queues, and we need the
 2672 + * smallest possible in-core structures to achieve good sorting
2673 + * performance. Therefore, blocks which have not yet been fully migrated
2674 + * out of the WBC are processed as non-sortable and issued immediately.
2675 + *
2676 + * Queue management:
2677 + *
2678 + * Ideally, we would want to scan all metadata and queue up all leaf block
2679 + * I/O prior to starting to issue it, because that allows us to do an
2680 + * optimal sorting job. This can however consume large amounts of memory.
2681 + * Therefore we continuously monitor the size of the queues and constrain
 2682 + * them to 5% (1/zfs_scan_mem_lim_fact) of physmem. If the queues grow larger
2683 + * than this limit, we clear out a few of the largest extents at the head
2684 + * of the queues to make room for more scanning. Hopefully, these extents
2685 + * will be fairly large and contiguous, allowing us to approach sequential
2686 + * I/O throughput even without a fully sorted tree.
2687 + *
2688 + * Metadata scanning takes place in dsl_scan_visit(), which is called from
2689 + * dsl_scan_sync() every spa_sync(). If we have either fully scanned all
2690 + * metadata on the pool, or we need to make room in memory because our
2691 + * queues are too large, dsl_scan_visit() is postponed and
 2692 + * scan_io_queues_run() is called from dsl_scan_sync() instead. This means
2693 + * metadata scanning and queued I/O issuing are mutually exclusive. This is
2694 + * to provide maximum sequential I/O throughput for the queued I/O issue
2695 + * process. Sequential I/O performance is significantly negatively impacted
2696 + * if it is interleaved with random I/O.
2697 + *
2698 + * Backwards compatibility
2699 + *
2700 + * This new algorithm is backwards compatible with the legacy on-disk data
2701 + * structures. If imported on a machine without the new sorting algorithm,
2702 + * the scan simply resumes from the last checkpoint.
2703 + */
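
As a quick illustration of the queue-management policy described above, the short user-space sketch below (not part of the patch) models the per-spa_sync decision between metadata traversal and queued I/O issue. The names choose_scan_pass(), all_metadata_scanned and over_hard_mem_limit are hypothetical stand-ins, not functions or fields from this file; the real decision is made inside dsl_scan_sync().

#include <stdbool.h>
#include <stdio.h>

enum scan_pass { PASS_SCAN_METADATA, PASS_ISSUE_QUEUED_IO };

static enum scan_pass
choose_scan_pass(bool all_metadata_scanned, bool over_hard_mem_limit)
{
	/*
	 * Metadata traversal and queued I/O issue never run in the same
	 * pass: sorted I/O is issued once traversal is finished or the
	 * queues have grown past the hard memory limit.
	 */
	if (all_metadata_scanned || over_hard_mem_limit)
		return (PASS_ISSUE_QUEUED_IO);
	return (PASS_SCAN_METADATA);
}

int
main(void)
{
	printf("%d\n", choose_scan_pass(false, true));	/* 1: issue pass */
	printf("%d\n", choose_scan_pass(false, false));	/* 0: scan pass */
	return (0);
}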
2704 +
2705 +/*
2706 + * Given a set of I/O parameters as discovered by the metadata traversal
2707 + * process, attempts to place the I/O into the reordering queue (if
2708 + * possible), or immediately executes the I/O. The check for whether an
2709 + * I/O is suitable for sorting is performed here.
2710 + */
2711 +static void
2712 +dsl_scan_enqueue(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
2713 + const zbookmark_phys_t *zb)
2714 +{
2715 + spa_t *spa = dp->dp_spa;
2716 +
2717 + ASSERT(!BP_IS_EMBEDDED(bp));
2718 + if (!dp->dp_scan->scn_is_sorted || (BP_IS_SPECIAL(bp) &&
2719 + !wbc_bp_is_migrated(spa_get_wbc_data(spa), bp))) {
2720 + scan_exec_io(dp, bp, zio_flags, zb, B_TRUE);
2721 + return;
2722 + }
2723 +
2724 + for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
2725 + dva_t dva;
2726 + vdev_t *vdev;
2727 +
2728 + /* On special BPs we only support handling the normal DVA */
2729 + if (BP_IS_SPECIAL(bp) && i != WBC_NORMAL_DVA)
2730 + continue;
2731 +
2732 + dva = bp->blk_dva[i];
2733 + vdev = vdev_lookup_top(spa, DVA_GET_VDEV(&dva));
2734 + ASSERT(vdev != NULL);
2735 +
2736 + mutex_enter(&vdev->vdev_scan_io_queue_lock);
2737 + if (vdev->vdev_scan_io_queue == NULL)
2738 + vdev->vdev_scan_io_queue = scan_io_queue_create(vdev);
2739 + ASSERT(dp->dp_scan != NULL);
2740 + scan_io_queue_insert(dp->dp_scan, vdev->vdev_scan_io_queue, bp,
2741 + i, zio_flags, zb);
2742 + mutex_exit(&vdev->vdev_scan_io_queue_lock);
2743 + }
2744 +}
2745 +
2746 +/*
2747 + * Given a scanning zio's information, executes the zio. The zio need
 2748 + * not be sortable; this function simply executes the
2749 + * zio, no matter what it is. The limit_inflight flag controls whether
2750 + * we limit the number of concurrently executing scan zio's to
2751 + * zfs_top_maxinflight times the number of top-level vdevs. This is
2752 + * used during metadata discovery to pace the generation of I/O and
2753 + * properly time the pausing of the scanning algorithm. The queue
2754 + * processing part uses a different method of controlling timing and
2755 + * so doesn't need this limit applied to its zio's.
2756 + */
2757 +static void
2758 +scan_exec_io(dsl_pool_t *dp, const blkptr_t *bp, int zio_flags,
2759 + const zbookmark_phys_t *zb, boolean_t limit_inflight)
2760 +{
2761 + spa_t *spa = dp->dp_spa;
2762 + size_t size = BP_GET_PSIZE(bp);
2763 + vdev_t *rvd = spa->spa_root_vdev;
2764 + uint64_t maxinflight = rvd->vdev_children * zfs_top_maxinflight;
2765 + dsl_scan_t *scn = dp->dp_scan;
2766 + zio_priority_t prio;
2767 +
2768 + mutex_enter(&spa->spa_scrub_lock);
2769 + while (limit_inflight && spa->spa_scrub_inflight >= maxinflight)
2770 + cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
2771 + spa->spa_scrub_inflight++;
2772 + mutex_exit(&spa->spa_scrub_lock);
2773 +
2774 + for (int i = 0; i < BP_GET_NDVAS(bp); i++)
2775 + atomic_add_64(&spa->spa_scan_pass_work,
2776 + DVA_GET_ASIZE(&bp->blk_dva[i]));
2777 +
2778 + count_block(dp->dp_scan, dp->dp_blkstats, bp);
2779 + DTRACE_PROBE3(do_io, uint64_t, dp->dp_scan->scn_phys.scn_func,
2780 + boolean_t, B_TRUE, spa_t *, spa);
2781 + prio = (scn->scn_phys.scn_func == POOL_SCAN_RESILVER ?
2782 + ZIO_PRIORITY_RESILVER : ZIO_PRIORITY_SCRUB);
2783 + zio_nowait(zio_read(NULL, spa, bp, abd_alloc_for_io(size, B_FALSE),
2784 + size, dsl_scan_scrub_done, NULL, prio, zio_flags, zb));
2785 +}
2786 +
2787 +/*
2788 + * Given all the info we got from our metadata scanning process, we
2789 + * construct a scan_io_t and insert it into the scan sorting queue. The
2790 + * I/O must already be suitable for us to process. This is controlled
2791 + * by dsl_scan_enqueue().
2792 + */
2793 +static void
2794 +scan_io_queue_insert(dsl_scan_t *scn, dsl_scan_io_queue_t *queue,
2795 + const blkptr_t *bp, int dva_i, int zio_flags, const zbookmark_phys_t *zb)
2796 +{
2797 + scan_io_t *sio = kmem_zalloc(sizeof (*sio), KM_SLEEP);
2798 + avl_index_t idx;
2799 + uint64_t offset, asize;
2800 +
2801 + ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
2802 +
2803 + bp2sio(bp, sio, dva_i);
2804 + sio->sio_flags = zio_flags;
2805 + sio->sio_zb = *zb;
2806 + offset = SCAN_IO_GET_OFFSET(sio);
2807 + asize = sio->sio_asize;
2808 +
2809 + if (avl_find(&queue->q_zios_by_addr, sio, &idx) != NULL) {
2810 + /* block is already scheduled for reading */
2811 + kmem_free(sio, sizeof (*sio));
2812 + return;
2813 + }
2814 + avl_insert(&queue->q_zios_by_addr, sio, idx);
2815 + atomic_add_64(&queue->q_zio_bytes, asize);
2816 +
2817 + /*
2818 + * Increment the bytes pending counter now so that we can't
2819 + * get an integer underflow in case the worker processes the
2820 + * zio before we get to incrementing this counter.
2821 + */
2822 + mutex_enter(&scn->scn_status_lock);
2823 + scn->scn_bytes_pending += asize;
2824 + mutex_exit(&scn->scn_status_lock);
2825 +
2826 + range_tree_set_gap(queue->q_exts_by_addr, zfs_scan_max_ext_gap);
2827 + range_tree_add_fill(queue->q_exts_by_addr, offset, asize, asize);
2828 +}
2829 +
2830 +/* q_exts_by_addr segment add callback. */
2831 +/*ARGSUSED*/
2832 +static void
2833 +scan_io_queue_insert_cb(range_tree_t *rt, range_seg_t *rs, void *arg)
2834 +{
2835 + dsl_scan_io_queue_t *queue = arg;
2836 + avl_index_t idx;
2837 + ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
2838 + VERIFY3P(avl_find(&queue->q_exts_by_size, rs, &idx), ==, NULL);
2839 + avl_insert(&queue->q_exts_by_size, rs, idx);
2840 +}
2841 +
2842 +/* q_exts_by_addr segment remove callback. */
2843 +/*ARGSUSED*/
2844 +static void
2845 +scan_io_queue_remove_cb(range_tree_t *rt, range_seg_t *rs, void *arg)
2846 +{
2847 + dsl_scan_io_queue_t *queue = arg;
2848 + avl_remove(&queue->q_exts_by_size, rs);
2849 +}
2850 +
2851 +/* q_exts_by_addr vacate callback. */
2852 +/*ARGSUSED*/
2853 +static void
2854 +scan_io_queue_vacate_cb(range_tree_t *rt, void *arg)
2855 +{
2856 + dsl_scan_io_queue_t *queue = arg;
2857 + void *cookie = NULL;
2858 + while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL)
2859 + ;
2860 +}
2861 +
2862 +/*
2863 + * This is the primary extent sorting algorithm. We balance two parameters:
2864 + * 1) how many bytes of I/O are in an extent
2865 + * 2) how well the extent is filled with I/O (as a fraction of its total size)
2866 + * Since we allow extents to have gaps between their constituent I/Os, it's
 2867 + * possible for a fairly large extent to contain the same amount of
 2868 + * I/O bytes as a much smaller extent that simply packs the I/O more tightly.
2869 + * The algorithm sorts based on a score calculated from the extent's size,
2870 + * the relative fill volume (in %) and a "fill weight" parameter that controls
 2871 + * the split between whether we prefer larger extents or better-populated
2872 + * extents:
2873 + *
2874 + * SCORE = FILL_IN_BYTES + (FILL_IN_PERCENT * FILL_IN_BYTES * FILL_WEIGHT)
2875 + *
2876 + * Example:
2877 + * 1) assume extsz = 64 MiB
2878 + * 2) assume fill = 32 MiB (extent is half full)
2879 + * 3) assume fill_weight = 3
2880 + * 4) SCORE = 32M + (((32M * 100) / 64M) * 3 * 32M) / 100
2881 + * SCORE = 32M + (50 * 3 * 32M) / 100
2882 + * SCORE = 32M + (4800M / 100)
2883 + * SCORE = 32M + 48M
2884 + * ^ ^
2885 + * | +--- final total relative fill-based score
2886 + * +--------- final total fill-based score
2887 + * SCORE = 80M
2888 + *
 2889 + * As can be seen, at fill_weight=3, the algorithm is slightly biased towards
2890 + * extents that are more completely filled (in a 3:2 ratio) vs just larger.
2891 + */
2892 +static int
2893 +ext_size_compar(const void *x, const void *y)
2894 +{
2895 + const range_seg_t *rsa = x, *rsb = y;
2896 + uint64_t sa = rsa->rs_end - rsa->rs_start,
2897 + sb = rsb->rs_end - rsb->rs_start;
2898 + uint64_t score_a, score_b;
2899 +
2900 + score_a = rsa->rs_fill + (((rsa->rs_fill * 100) / sa) *
2901 + fill_weight * rsa->rs_fill) / 100;
2902 + score_b = rsb->rs_fill + (((rsb->rs_fill * 100) / sb) *
2903 + fill_weight * rsb->rs_fill) / 100;
2904 +
2905 + if (score_a > score_b)
2906 + return (-1);
2907 + if (score_a == score_b) {
2908 + if (rsa->rs_start < rsb->rs_start)
2909 + return (-1);
2910 + if (rsa->rs_start == rsb->rs_start)
2911 + return (0);
2912 + return (1);
2913 + }
2914 + return (1);
2915 +}
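
To sanity-check the scoring formula, the user-space sketch below (not part of the patch) plugs the comment's own 64 MiB extent / 32 MiB fill / fill_weight=3 example into the same integer arithmetic used by ext_size_compar(). The wrapper name extent_score() is hypothetical; compiled on its own, the program prints "80 MiB", matching the worked example above.

#include <stdint.h>
#include <stdio.h>

/* Same integer arithmetic as the score terms in ext_size_compar(). */
static uint64_t
extent_score(uint64_t ext_size, uint64_t fill, uint64_t fill_weight)
{
	return (fill + (((fill * 100) / ext_size) * fill_weight * fill) / 100);
}

int
main(void)
{
	uint64_t mib = 1024 * 1024;

	/* 64 MiB extent, half full, fill_weight = 3: expect 80 MiB. */
	printf("%llu MiB\n",
	    (unsigned long long)(extent_score(64 * mib, 32 * mib, 3) / mib));
	return (0);
}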
2916 +
2917 +/*
2918 + * Comparator for the q_zios_by_addr tree. Sorting is simply performed
2919 + * based on LBA-order (from lowest to highest).
2920 + */
2921 +static int
2922 +io_addr_compar(const void *x, const void *y)
2923 +{
2924 + const scan_io_t *a = x, *b = y;
2925 + uint64_t off_a = SCAN_IO_GET_OFFSET(a);
2926 + uint64_t off_b = SCAN_IO_GET_OFFSET(b);
2927 + if (off_a < off_b)
2928 + return (-1);
2929 + if (off_a == off_b)
2930 + return (0);
2931 + return (1);
2932 +}
2933 +
2934 +static dsl_scan_io_queue_t *
2935 +scan_io_queue_create(vdev_t *vd)
2936 +{
2937 + dsl_scan_t *scn = vd->vdev_spa->spa_dsl_pool->dp_scan;
2938 + dsl_scan_io_queue_t *q = kmem_zalloc(sizeof (*q), KM_SLEEP);
2939 +
2940 + q->q_scn = scn;
2941 + q->q_vd = vd;
2942 + cv_init(&q->q_cv, NULL, CV_DEFAULT, NULL);
2943 + q->q_exts_by_addr = range_tree_create(&scan_io_queue_ops, q,
2944 + &q->q_vd->vdev_scan_io_queue_lock);
2945 + avl_create(&q->q_exts_by_size, ext_size_compar,
2946 + sizeof (range_seg_t), offsetof(range_seg_t, rs_pp_node));
2947 + avl_create(&q->q_zios_by_addr, io_addr_compar,
2948 + sizeof (scan_io_t), offsetof(scan_io_t, sio_nodes.sio_addr_node));
2949 +
2950 + return (q);
2951 +}
2952 +
2953 +/*
 2954 + * Destroys a scan queue and all segments and scan_io_t's contained in it.
2955 + * No further execution of I/O occurs, anything pending in the queue is
2956 + * simply dropped. Prior to calling this, the queue should have been
2957 + * removed from its parent top-level vdev, hence holding the queue's
2958 + * lock is not permitted.
2959 + */
2960 +void
2961 +dsl_scan_io_queue_destroy(dsl_scan_io_queue_t *queue)
2962 +{
2963 + dsl_scan_t *scn = queue->q_scn;
2964 + scan_io_t *sio;
2965 + uint64_t bytes_dequeued = 0;
2966 + kmutex_t *q_lock = &queue->q_vd->vdev_scan_io_queue_lock;
2967 +
2968 + ASSERT(!MUTEX_HELD(q_lock));
2969 +
2970 +#ifdef DEBUG /* This is for the ASSERT(range_tree_contains... below */
2971 + mutex_enter(q_lock);
2972 +#endif
2973 + while ((sio = avl_first(&queue->q_zios_by_addr)) != NULL) {
2974 + ASSERT(range_tree_contains(queue->q_exts_by_addr,
2975 + SCAN_IO_GET_OFFSET(sio), sio->sio_asize));
2976 + bytes_dequeued += sio->sio_asize;
2977 + avl_remove(&queue->q_zios_by_addr, sio);
2978 + kmem_free(sio, sizeof (*sio));
2979 + }
2980 +#ifdef DEBUG
2981 + mutex_exit(q_lock);
2982 +#endif
2983 +
2984 + mutex_enter(&scn->scn_status_lock);
2985 + ASSERT3U(scn->scn_bytes_pending, >=, bytes_dequeued);
2986 + scn->scn_bytes_pending -= bytes_dequeued;
2987 + mutex_exit(&scn->scn_status_lock);
2988 +
2989 + /* lock here to avoid tripping assertion in range_tree_vacate */
2990 + mutex_enter(q_lock);
2991 + range_tree_vacate(queue->q_exts_by_addr, NULL, queue);
2992 + mutex_exit(q_lock);
2993 +
2994 + range_tree_destroy(queue->q_exts_by_addr);
2995 + avl_destroy(&queue->q_exts_by_size);
2996 + avl_destroy(&queue->q_zios_by_addr);
2997 + cv_destroy(&queue->q_cv);
2998 +
2999 + kmem_free(queue, sizeof (*queue));
3000 +}
3001 +
3002 +/*
 3003 + * Properly transfers a dsl_scan_io_queue_t from `svd' to `tvd'. This is
3004 + * called on behalf of vdev_top_transfer when creating or destroying
3005 + * a mirror vdev due to zpool attach/detach.
3006 + */
3007 +void
3008 +dsl_scan_io_queue_vdev_xfer(vdev_t *svd, vdev_t *tvd)
3009 +{
3010 + mutex_enter(&svd->vdev_scan_io_queue_lock);
3011 + mutex_enter(&tvd->vdev_scan_io_queue_lock);
3012 +
3013 + VERIFY3P(tvd->vdev_scan_io_queue, ==, NULL);
3014 + tvd->vdev_scan_io_queue = svd->vdev_scan_io_queue;
3015 + svd->vdev_scan_io_queue = NULL;
3016 + if (tvd->vdev_scan_io_queue != NULL) {
3017 + tvd->vdev_scan_io_queue->q_vd = tvd;
3018 + range_tree_set_lock(tvd->vdev_scan_io_queue->q_exts_by_addr,
3019 + &tvd->vdev_scan_io_queue_lock);
3020 + }
3021 +
3022 + mutex_exit(&tvd->vdev_scan_io_queue_lock);
3023 + mutex_exit(&svd->vdev_scan_io_queue_lock);
3024 +}
3025 +
3026 +static void
3027 +scan_io_queues_destroy(dsl_scan_t *scn)
3028 +{
3029 + vdev_t *rvd = scn->scn_dp->dp_spa->spa_root_vdev;
3030 +
3031 + for (uint64_t i = 0; i < rvd->vdev_children; i++) {
3032 + vdev_t *tvd = rvd->vdev_child[i];
3033 + dsl_scan_io_queue_t *queue;
3034 +
3035 + mutex_enter(&tvd->vdev_scan_io_queue_lock);
3036 + queue = tvd->vdev_scan_io_queue;
3037 + tvd->vdev_scan_io_queue = NULL;
3038 + mutex_exit(&tvd->vdev_scan_io_queue_lock);
3039 +
3040 + if (queue != NULL)
3041 + dsl_scan_io_queue_destroy(queue);
3042 + }
3043 +}
3044 +
3045 +/*
3046 + * Computes the memory limit state that we're currently in. A sorted scan
3047 + * needs quite a bit of memory to hold the sorting queues, so we need to
3048 + * reasonably constrain their size so they don't impact overall system
3049 + * performance. We compute two limits:
3050 + * 1) Hard memory limit: if the amount of memory used by the sorting
3051 + * queues on a pool gets above this value, we stop the metadata
3052 + * scanning portion and start issuing the queued up and sorted
3053 + * I/Os to reduce memory usage.
3054 + * This limit is calculated as a fraction of physmem (by default 5%).
3055 + * We constrain the lower bound of the hard limit to an absolute
3056 + * minimum of zfs_scan_mem_lim_min (default: 16 MiB). We also constrain
 3057 + * the upper bound to 5% of the allocated pool space - no chance we'll
3058 + * ever need that much memory, but just to keep the value in check.
3059 + * 2) Soft memory limit: once we hit the hard memory limit, we start
3060 + * issuing I/O to lower queue memory usage, but we don't want to
3061 + * completely empty them out, as having more in the queues allows
3062 + * us to make better sorting decisions. So we stop the issuing of
3063 + * I/Os once the amount of memory used drops below the soft limit
 3064 + * (at which point we switch back from issuing I/O to scanning
 3065 + * metadata again).
3066 + * The limit is calculated by subtracting a fraction of the hard
3067 + * limit from the hard limit. By default this fraction is 10%, so
3068 + * the soft limit is 90% of the hard limit. We cap the size of the
3069 + * difference between the hard and soft limits at an absolute
3070 + * maximum of zfs_scan_mem_lim_soft_max (default: 128 MiB) - this is
3071 + * sufficient to not cause too frequent switching between the
3072 + * metadata scan and I/O issue (even at 2k recordsize, 128 MiB's
3073 + * worth of queues is about 1.2 GiB of on-pool data, so scanning
3074 + * that should take at least a decent fraction of a second).
3075 + */
3076 +static mem_lim_t
3077 +scan_io_queue_mem_lim(dsl_scan_t *scn)
3078 +{
3079 + vdev_t *rvd = scn->scn_dp->dp_spa->spa_root_vdev;
3080 + uint64_t mlim_hard, mlim_soft, mused;
3081 + uint64_t alloc = metaslab_class_get_alloc(spa_normal_class(
3082 + scn->scn_dp->dp_spa));
3083 +
3084 + mlim_hard = MAX((physmem / zfs_scan_mem_lim_fact) * PAGESIZE,
3085 + zfs_scan_mem_lim_min);
3086 + mlim_hard = MIN(mlim_hard, alloc / 20);
3087 + mlim_soft = mlim_hard - MIN(mlim_hard / zfs_scan_mem_lim_soft_fact,
3088 + zfs_scan_mem_lim_soft_max);
3089 + mused = 0;
3090 + for (uint64_t i = 0; i < rvd->vdev_children; i++) {
3091 + vdev_t *tvd = rvd->vdev_child[i];
3092 + dsl_scan_io_queue_t *queue;
3093 +
3094 + mutex_enter(&tvd->vdev_scan_io_queue_lock);
3095 + queue = tvd->vdev_scan_io_queue;
3096 + if (queue != NULL) {
3097 + /* #extents in exts_by_size = # in exts_by_addr */
3098 + mused += avl_numnodes(&queue->q_exts_by_size) *
3099 + sizeof (range_seg_t) +
3100 + (avl_numnodes(&queue->q_zios_by_addr) +
3101 + queue->q_num_issuing_zios) * sizeof (scan_io_t);
3102 + }
3103 + mutex_exit(&tvd->vdev_scan_io_queue_lock);
3104 + }
3105 + DTRACE_PROBE4(queue_mem_lim, dsl_scan_t *, scn, uint64_t, mlim_hard,
3106 + uint64_t, mlim_soft, uint64_t, mused);
3107 +
3108 + if (mused >= mlim_hard)
3109 + return (MEM_LIM_HARD);
3110 + else if (mused >= mlim_soft)
3111 + return (MEM_LIM_SOFT);
3112 + else
3113 + return (MEM_LIM_NONE);
3114 +}
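
As a worked example of the limit math above, the user-space sketch below (not part of the patch) reproduces the hard/soft limit computation of scan_io_queue_mem_lim() for one assumed configuration. All figures (16 GiB of RAM, 10 TiB allocated, the default tunable values) are hypothetical, and the sketch works directly in bytes rather than pages and PAGESIZE as the kernel code does; it prints "hard 819 MiB, soft 737 MiB".

#include <stdint.h>
#include <stdio.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

int
main(void)
{
	/* Hypothetical example values; the real code reads the tunables. */
	uint64_t mib = 1024ULL * 1024;
	uint64_t physmem_bytes = 16384 * mib;		/* 16 GiB of RAM */
	uint64_t pool_alloc = 10ULL * 1024 * 1024 * mib; /* 10 TiB allocated */
	uint64_t lim_fact = 20;		/* hard limit = 1/20 (5%) of RAM */
	uint64_t lim_min = 16 * mib;	/* absolute hard-limit minimum */
	uint64_t soft_fact = 10;	/* soft = hard - hard/10 ... */
	uint64_t soft_max = 128 * mib;	/* ... with the gap capped at 128 MiB */
	uint64_t mlim_hard, mlim_soft;

	mlim_hard = MAX(physmem_bytes / lim_fact, lim_min);
	mlim_hard = MIN(mlim_hard, pool_alloc / 20);
	mlim_soft = mlim_hard - MIN(mlim_hard / soft_fact, soft_max);

	printf("hard %llu MiB, soft %llu MiB\n",
	    (unsigned long long)(mlim_hard / mib),
	    (unsigned long long)(mlim_soft / mib));
	return (0);
}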
3115 +
3116 +/*
3117 + * Given a list of scan_io_t's in io_list, this issues the io's out to
3118 + * disk. Passing shutdown=B_TRUE instead discards the zio's without
3119 + * issuing them. This consumes the io_list and frees the scan_io_t's.
3120 + * This is called when emptying queues, either when we're up against
3121 + * the memory limit or we have finished scanning.
3122 + */
3123 +static void
3124 +scan_io_queue_issue(list_t *io_list, dsl_scan_io_queue_t *queue)
3125 +{
3126 + dsl_scan_t *scn = queue->q_scn;
3127 + scan_io_t *sio;
3128 + int64_t bytes_issued = 0;
3129 +
3130 + while ((sio = list_head(io_list)) != NULL) {
3131 + blkptr_t bp;
3132 +
3133 + sio2bp(sio, &bp, queue->q_vd->vdev_id);
3134 + bytes_issued += sio->sio_asize;
3135 + scan_exec_io(scn->scn_dp, &bp, sio->sio_flags, &sio->sio_zb,
3136 + B_FALSE);
3137 + (void) list_remove_head(io_list);
3138 + ASSERT(queue->q_num_issuing_zios > 0);
3139 + atomic_dec_64(&queue->q_num_issuing_zios);
3140 + kmem_free(sio, sizeof (*sio));
3141 + }
3142 +
3143 + mutex_enter(&scn->scn_status_lock);
3144 + ASSERT3U(scn->scn_bytes_pending, >=, bytes_issued);
3145 + scn->scn_bytes_pending -= bytes_issued;
3146 + mutex_exit(&scn->scn_status_lock);
3147 +
3148 + ASSERT3U(queue->q_zio_bytes, >=, bytes_issued);
3149 + atomic_add_64(&queue->q_zio_bytes, -bytes_issued);
3150 +
3151 + list_destroy(io_list);
3152 +}
3153 +
3154 +/*
3155 + * Given a range_seg_t (extent) and a list, this function passes over a
3156 + * scan queue and gathers up the appropriate ios which fit into that
 3157 + * scan seg (starting from lowest LBA). While doing so, we ensure that
 3158 + * we do not exceed `limit' in the total amount of scan_io_t bytes
 3159 + * gathered. At the end, we remove the appropriate amount of space
3160 + * from the q_exts_by_addr. If we have consumed the entire scan seg, we
3161 + * remove it completely from q_exts_by_addr. If we've only consumed a
3162 + * portion of it, we shorten the scan seg appropriately. A future call
 3163 + * will consume more of the scan seg's constituent io's until the
 3164 + * extent has been consumed completely. If we've reduced the size of the
3165 + * scan seg, we of course reinsert it in the appropriate spot in the
3166 + * q_exts_by_size tree.
3167 + */
3168 +static uint64_t
3169 +scan_io_queue_gather(const range_seg_t *rs, list_t *list,
3170 + dsl_scan_io_queue_t *queue, uint64_t limit)
3171 +{
3172 + scan_io_t srch_sio, *sio, *next_sio;
3173 + avl_index_t idx;
3174 + int64_t num_zios = 0, bytes = 0;
3175 + boolean_t size_limited = B_FALSE;
3176 +
3177 + ASSERT(rs != NULL);
3178 + ASSERT3U(limit, !=, 0);
3179 + ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
3180 +
3181 + list_create(list, sizeof (scan_io_t),
3182 + offsetof(scan_io_t, sio_nodes.sio_list_node));
3183 + SCAN_IO_SET_OFFSET(&srch_sio, rs->rs_start);
3184 +
3185 + /*
3186 + * The exact start of the extent might not contain any matching zios,
3187 + * so if that's the case, examine the next one in the tree.
3188 + */
3189 + sio = avl_find(&queue->q_zios_by_addr, &srch_sio, &idx);
3190 + if (sio == NULL)
3191 + sio = avl_nearest(&queue->q_zios_by_addr, idx, AVL_AFTER);
3192 +
3193 + while (sio != NULL && SCAN_IO_GET_OFFSET(sio) < rs->rs_end) {
3194 + if (bytes >= limit) {
3195 + size_limited = B_TRUE;
3196 + break;
3197 + }
3198 + ASSERT3U(SCAN_IO_GET_OFFSET(sio), >=, rs->rs_start);
3199 + ASSERT3U(SCAN_IO_GET_OFFSET(sio) + sio->sio_asize, <=,
3200 + rs->rs_end);
3201 +
3202 + next_sio = AVL_NEXT(&queue->q_zios_by_addr, sio);
3203 + avl_remove(&queue->q_zios_by_addr, sio);
3204 + list_insert_tail(list, sio);
3205 + num_zios++;
3206 + bytes += sio->sio_asize;
3207 + sio = next_sio;
3208 + }
3209 +
3210 + if (size_limited) {
3211 + uint64_t end;
3212 + sio = list_tail(list);
3213 + end = SCAN_IO_GET_OFFSET(sio) + sio->sio_asize;
3214 + range_tree_remove_fill(queue->q_exts_by_addr, rs->rs_start,
3215 + end - rs->rs_start, bytes, 0);
3216 + } else {
3217 + /*
3218 + * Whole extent consumed, remove it all, including any head
3219 + * or tail overhang.
3220 + */
3221 + range_tree_remove_fill(queue->q_exts_by_addr, rs->rs_start,
3222 + rs->rs_end - rs->rs_start, bytes, 0);
3223 + }
3224 + atomic_add_64(&queue->q_num_issuing_zios, num_zios);
3225 +
3226 + return (bytes);
3227 +}
3228 +
3229 +/*
3230 + * This is called from the queue emptying thread and selects the next
3231 + * extent from which we are to issue io's. The behavior of this function
3232 + * depends on the state of the scan, the current memory consumption and
3233 + * whether or not we are performing a scan shutdown.
3234 + * 1) We select extents in an elevator algorithm (LBA-order) if:
3235 + * a) the scan has finished traversing metadata (DSS_FINISHING)
3236 + * b) the scan needs to perform a checkpoint
3237 + * 2) We select the largest available extent if we are up against the
3238 + * memory limit.
3239 + * 3) Otherwise we don't select any extents.
3240 + */
3241 +static const range_seg_t *
3242 +scan_io_queue_fetch_ext(dsl_scan_io_queue_t *queue)
3243 +{
3244 + dsl_scan_t *scn = queue->q_scn;
3245 +
3246 + ASSERT(MUTEX_HELD(&queue->q_vd->vdev_scan_io_queue_lock));
3247 + ASSERT0(queue->q_issuing_rs.rs_start);
3248 + ASSERT0(queue->q_issuing_rs.rs_end);
3249 + ASSERT(scn->scn_is_sorted);
3250 +
3251 + if (scn->scn_phys.scn_state == DSS_FINISHING ||
3252 + scn->scn_checkpointing) {
3253 + /*
3254 + * When the scan has finished traversing all metadata and is
3255 + * in the DSS_FINISHING state or a checkpoint has been
3256 + * requested, no new extents will be added to the sorting
3257 + * queue, so the way we are sorted now is as good as it'll
3258 + * get. So instead, switch to issuing extents in linear order.
3259 + */
3260 + return (range_tree_first(queue->q_exts_by_addr));
3261 + } else if (scn->scn_clearing) {
3262 + return (avl_first(&queue->q_exts_by_size));
3263 + } else {
3264 + return (NULL);
3265 + }
3266 +}
3267 +
3268 +/*
3269 + * Empties a scan queue until we have issued at least info->qri_limit
3270 + * bytes, or the queue is empty. This is called via the scn_taskq so as
3271 + * to parallelize processing of all top-level vdevs as much as possible.
3272 + */
3273 +static void
3274 +scan_io_queues_run_one(io_queue_run_info_t *info)
3275 +{
3276 + dsl_scan_io_queue_t *queue = info->qri_queue;
3277 + uint64_t limit = info->qri_limit;
3278 + dsl_scan_t *scn = queue->q_scn;
3279 + kmutex_t *q_lock = &queue->q_vd->vdev_scan_io_queue_lock;
3280 + list_t zio_list;
3281 + const range_seg_t *rs;
3282 + uint64_t issued = 0;
3283 +
3284 + ASSERT(scn->scn_is_sorted);
3285 +
3286 + /* loop until we have issued as much I/O as was requested */
3287 + while (issued < limit) {
3288 + scan_io_t *first_io, *last_io;
3289 +
3290 + mutex_enter(q_lock);
3291 + /* First we select the extent we'll be issuing from next. */
3292 + rs = scan_io_queue_fetch_ext(queue);
3293 + DTRACE_PROBE2(queue_fetch_ext, range_seg_t *, rs,
3294 + dsl_scan_io_queue_t *, queue);
3295 + if (rs == NULL) {
3296 + mutex_exit(q_lock);
3297 + break;
3298 + }
3299 +
3300 + /*
3301 + * We have selected which extent needs to be processed next,
3302 + * gather up the corresponding zio's, taking care not to step
3303 + * over the limit.
3304 + */
3305 + issued += scan_io_queue_gather(rs, &zio_list, queue,
3306 + limit - issued);
3307 + first_io = list_head(&zio_list);
3308 + last_io = list_tail(&zio_list);
3309 + if (first_io != NULL) {
3310 + /*
3311 + * We have zio's to issue. Construct a fake range
3312 + * seg that covers the whole list of zio's to issue
3313 + * (the list is guaranteed to be LBA-ordered) and
3314 + * save that in the queue's "in flight" segment.
3315 + * This is used to prevent freeing I/O from hitting
3316 + * that range while we're working on it.
3317 + */
3318 + ASSERT(last_io != NULL);
3319 + queue->q_issuing_rs.rs_start =
3320 + SCAN_IO_GET_OFFSET(first_io);
3321 + queue->q_issuing_rs.rs_end =
3322 + SCAN_IO_GET_OFFSET(last_io) + last_io->sio_asize;
3323 + }
3324 + mutex_exit(q_lock);
3325 +
3326 + /*
3327 + * Issuing zio's can take a long time (especially because
 3328 + * we are constrained by zfs_top_maxinflight), so drop the
3329 + * queue lock.
3330 + */
3331 + scan_io_queue_issue(&zio_list, queue);
3332 +
3333 + mutex_enter(q_lock);
3334 + /* invalidate the in-flight I/O range */
3335 + bzero(&queue->q_issuing_rs, sizeof (queue->q_issuing_rs));
3336 + cv_broadcast(&queue->q_cv);
3337 + mutex_exit(q_lock);
3338 + }
3339 +}
3340 +
3341 +/*
3342 + * Performs an emptying run on all scan queues in the pool. This just
3343 + * punches out one thread per top-level vdev, each of which processes
3344 + * only that vdev's scan queue. We can parallelize the I/O here because
3345 + * we know that each queue's io's only affect its own top-level vdev.
3346 + * The amount of I/O dequeued per run of this function is calibrated
3347 + * dynamically so that its total run time doesn't exceed
3348 + * zfs_scan_dequeue_run_target_ms + zfs_dequeue_run_bonus_ms. The
3349 + * timing algorithm aims to hit the target value, but still
3350 + * oversubscribes the amount of data that it is allowed to fetch by
3351 + * the bonus value. This is to allow for non-equal completion times
3352 + * across top-level vdevs.
3353 + *
3354 + * This function waits for the queue runs to complete, and must be
3355 + * called from dsl_scan_sync (or in general, syncing context).
3356 + */
3357 +static void
3358 +scan_io_queues_run(dsl_scan_t *scn)
3359 +{
3360 + spa_t *spa = scn->scn_dp->dp_spa;
3361 + uint64_t dirty_limit, total_limit, total_bytes;
3362 + io_queue_run_info_t *info;
3363 + uint64_t dequeue_min = zfs_scan_dequeue_min *
3364 + spa->spa_root_vdev->vdev_children;
3365 +
3366 + ASSERT(scn->scn_is_sorted);
3367 + ASSERT(spa_config_held(spa, SCL_CONFIG, RW_READER));
3368 +
3369 + if (scn->scn_taskq == NULL) {
3370 + char *tq_name = kmem_zalloc(ZFS_MAX_DATASET_NAME_LEN + 16,
3371 + KM_SLEEP);
3372 + const int nthreads = spa->spa_root_vdev->vdev_children;
3373 +
3374 + /*
3375 + * We need to make this taskq *always* execute as many
3376 + * threads in parallel as we have top-level vdevs and no
3377 + * less, otherwise strange serialization of the calls to
3378 + * scan_io_queues_run_one can occur during spa_sync runs
3379 + * and that significantly impacts performance.
3380 + */
3381 + (void) snprintf(tq_name, ZFS_MAX_DATASET_NAME_LEN + 16,
3382 + "dsl_scan_tq_%s", spa->spa_name);
3383 + scn->scn_taskq = taskq_create(tq_name, nthreads, minclsyspri,
3384 + nthreads, nthreads, TASKQ_PREPOPULATE);
3385 + kmem_free(tq_name, ZFS_MAX_DATASET_NAME_LEN + 16);
3386 + }
3387 +
3388 + /*
3389 + * This is the automatic run time calibration algorithm. We gauge
3390 + * how long spa_sync took since last time we were invoked. If it
3391 + * took longer than our target + bonus values, we reduce the
3392 + * amount of data that the queues are allowed to process in this
3393 + * iteration. Conversely, if it took less than target + bonus,
3394 + * we increase the amount of data the queues are allowed to process.
3395 + * This is designed as a partial load-following algorithm, so if
3396 + * other ZFS users start issuing I/O, we back off, until we hit our
3397 + * minimum issue amount (per-TL-vdev) of zfs_scan_dequeue_min bytes.
3398 + */
3399 + if (scn->scn_last_queue_run_time != 0) {
3400 + uint64_t now = ddi_get_lbolt64();
3401 + uint64_t delta_ms = TICK_TO_MSEC(now -
3402 + scn->scn_last_queue_run_time);
3403 + uint64_t bonus = zfs_dequeue_run_bonus_ms;
3404 +
3405 + bonus = MIN(bonus, DEQUEUE_BONUS_MS_MAX);
3406 + if (delta_ms <= bonus)
3407 + delta_ms = bonus + 1;
3408 +
3409 + scn->scn_last_dequeue_limit = MAX(dequeue_min,
3410 + (scn->scn_last_dequeue_limit *
3411 + zfs_scan_dequeue_run_target_ms) / (delta_ms - bonus));
3412 + scn->scn_last_queue_run_time = now;
3413 + } else {
3414 + scn->scn_last_queue_run_time = ddi_get_lbolt64();
3415 + scn->scn_last_dequeue_limit = dequeue_min;
3416 + }
3417 +
3418 + /*
3419 + * We also constrain the amount of data we are allowed to issue
3420 + * by the zfs_dirty_data_max value - this serves as basically a
3421 + * sanity check just to prevent us from issuing huge amounts of
3422 + * data to be dequeued per run.
3423 + */
3424 + dirty_limit = (zfs_vdev_async_write_active_min_dirty_percent *
3425 + zfs_dirty_data_max) / 100;
3426 + if (dirty_limit >= scn->scn_dp->dp_dirty_total)
3427 + dirty_limit -= scn->scn_dp->dp_dirty_total;
3428 + else
3429 + dirty_limit = 0;
3430 +
3431 + total_limit = MAX(MIN(scn->scn_last_dequeue_limit, dirty_limit),
3432 + dequeue_min);
3433 +
3434 + /*
3435 + * We use this to determine how much data each queue is allowed to
 3436 + * issue this run. We take the total issue limit for this run and
 3437 + * proportionally split it among the queues, based on how full each
 3438 + * queue is relative to the total bytes pending. No need to lock here,
3439 + * new data can't enter the queue, since that's only done in our
3440 + * sync thread.
3441 + */
3442 + total_bytes = scn->scn_bytes_pending;
3443 + if (total_bytes == 0)
3444 + return;
3445 +
3446 + info = kmem_zalloc(sizeof (*info) * spa->spa_root_vdev->vdev_children,
3447 + KM_SLEEP);
3448 + for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
3449 + vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
3450 + dsl_scan_io_queue_t *queue;
3451 + uint64_t limit;
3452 +
3453 + /*
3454 + * No need to lock to check if the queue exists, since this
3455 + * is called from sync context only and queues are only
3456 + * created in sync context also.
3457 + */
3458 + queue = vd->vdev_scan_io_queue;
3459 + if (queue == NULL)
3460 + continue;
3461 +
3462 + /*
3463 + * Compute the per-queue limit as a fraction of the queue's
3464 + * size, relative to the total amount of zio bytes in all
 3465 + * queues. 1000 here is the fixed-point precision. If
3466 + * there are ever more than 1000 top-level vdevs, this
3467 + * code might misbehave.
3468 + */
3469 + limit = MAX((((queue->q_zio_bytes * 1000) / total_bytes) *
3470 + total_limit) / 1000, zfs_scan_dequeue_min);
3471 +
3472 + info[i].qri_queue = queue;
3473 + info[i].qri_limit = limit;
3474 +
3475 + VERIFY(taskq_dispatch(scn->scn_taskq,
3476 + (void (*)(void *))scan_io_queues_run_one, &info[i],
3477 + TQ_SLEEP) != NULL);
3478 + }
3479 +
3480 + /*
3481 + * We need to wait for all queues to finish their run, just to keep
3482 + * things nice and consistent. This doesn't necessarily mean all
3483 + * I/O generated by the queues emptying has finished (there may be
3484 + * up to zfs_top_maxinflight zio's still processing on behalf of
3485 + * each queue).
3486 + */
3487 + taskq_wait(scn->scn_taskq);
3488 +
3489 + kmem_free(info, sizeof (*info) * spa->spa_root_vdev->vdev_children);
3490 +}
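
The user-space sketch below (not part of the patch) restates the two calculations in scan_io_queues_run() above: the load-following recalibration of the total dequeue budget, and the fixed-point proportional split of that budget across one queue. The wrapper names recalibrate_limit() and per_queue_limit() and all sample figures are hypothetical; with the values shown it prints a new budget of 32 MiB (halved because the last spa_sync took roughly twice the target time) and a queue share of 18 MiB.

#include <stdint.h>
#include <stdio.h>

#define	MAX(a, b)	((a) > (b) ? (a) : (b))

/* Load-following recalibration of the total per-run dequeue budget. */
static uint64_t
recalibrate_limit(uint64_t last_limit, uint64_t delta_ms, uint64_t target_ms,
    uint64_t bonus_ms, uint64_t dequeue_min)
{
	if (delta_ms <= bonus_ms)
		delta_ms = bonus_ms + 1;
	return (MAX(dequeue_min,
	    (last_limit * target_ms) / (delta_ms - bonus_ms)));
}

/* Fixed-point (precision 1000) proportional split for a single queue. */
static uint64_t
per_queue_limit(uint64_t queue_bytes, uint64_t total_bytes,
    uint64_t total_limit, uint64_t dequeue_min)
{
	return (MAX((((queue_bytes * 1000) / total_bytes) * total_limit) /
	    1000, dequeue_min));
}

int
main(void)
{
	uint64_t mib = 1024 * 1024;

	/* Last run moved 64 MiB, but spa_sync took ~2x the 500 ms target. */
	printf("new budget: %llu MiB\n", (unsigned long long)
	    (recalibrate_limit(64 * mib, 1100, 500, 100, 4 * mib) / mib));

	/* A queue holding 300 MiB of 1 GiB pending, with a 64 MiB budget. */
	printf("queue share: %llu MiB\n", (unsigned long long)
	    (per_queue_limit(300 * mib, 1024 * mib, 64 * mib, 4 * mib) / mib));
	return (0);
}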
3491 +
3492 +/*
3493 + * Callback invoked when a zio_free() zio is executing. This needs to be
3494 + * intercepted to prevent the zio from deallocating a particular portion
3495 + * of disk space and it then getting reallocated and written to, while we
3496 + * still have it queued up for processing, or even while we're trying to
3497 + * scrub or resilver it.
3498 + */
3499 +void
3500 +dsl_scan_freed(spa_t *spa, const blkptr_t *bp)
3501 +{
3502 + dsl_pool_t *dp = spa->spa_dsl_pool;
3503 + dsl_scan_t *scn = dp->dp_scan;
3504 +
3505 + ASSERT(!BP_IS_EMBEDDED(bp));
3506 + ASSERT(scn != NULL);
3507 + if (!dsl_scan_is_running(scn))
3508 + return;
3509 +
3510 + for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
3511 + if (BP_IS_SPECIAL(bp) && i != WBC_NORMAL_DVA)
3512 + continue;
3513 + dsl_scan_freed_dva(spa, bp, i);
3514 + }
3515 +}
3516 +
3517 +static void
3518 +dsl_scan_freed_dva(spa_t *spa, const blkptr_t *bp, int dva_i)
3519 +{
3520 + dsl_pool_t *dp = spa->spa_dsl_pool;
3521 + dsl_scan_t *scn = dp->dp_scan;
3522 + vdev_t *vdev;
3523 + kmutex_t *q_lock;
3524 + dsl_scan_io_queue_t *queue;
3525 + scan_io_t srch, *sio;
3526 + avl_index_t idx;
3527 + uint64_t offset;
3528 + int64_t asize;
3529 +
3530 + ASSERT(!BP_IS_EMBEDDED(bp));
3531 + ASSERT(scn != NULL);
3532 + ASSERT(!BP_IS_SPECIAL(bp) || dva_i == WBC_NORMAL_DVA);
3533 +
3534 + vdev = vdev_lookup_top(spa, DVA_GET_VDEV(&bp->blk_dva[dva_i]));
3535 + ASSERT(vdev != NULL);
3536 + q_lock = &vdev->vdev_scan_io_queue_lock;
3537 + queue = vdev->vdev_scan_io_queue;
3538 +
3539 + mutex_enter(q_lock);
3540 + if (queue == NULL) {
3541 + mutex_exit(q_lock);
3542 + return;
3543 + }
3544 +
3545 + bp2sio(bp, &srch, dva_i);
3546 + offset = SCAN_IO_GET_OFFSET(&srch);
3547 + asize = srch.sio_asize;
3548 +
3549 + /*
3550 + * We can find the zio in two states:
3551 + * 1) Cold, just sitting in the queue of zio's to be issued at
3552 + * some point in the future. In this case, all we do is
3553 + * remove the zio from the q_zios_by_addr tree, decrement
3554 + * its data volume from the containing range_seg_t and
3555 + * resort the q_exts_by_size tree to reflect that the
3556 + * range_seg_t has lost some of its 'fill'. We don't shorten
3557 + * the range_seg_t - this is usually rare enough not to be
 3558 + * worth the extra hassle of trying to keep track of precise
3559 + * extent boundaries.
3560 + * 2) Hot, where the zio is currently in-flight in
3561 + * dsl_scan_issue_ios. In this case, we can't simply
3562 + * reach in and stop the in-flight zio's, so we instead
3563 + * block the caller. Eventually, dsl_scan_issue_ios will
3564 + * be done with issuing the zio's it gathered and will
3565 + * signal us.
3566 + */
3567 + sio = avl_find(&queue->q_zios_by_addr, &srch, &idx);
3568 + if (sio != NULL) {
3569 + range_seg_t *rs;
3570 +
3571 + /* Got it while it was cold in the queue */
3572 + ASSERT3U(srch.sio_asize, ==, sio->sio_asize);
3573 + DTRACE_PROBE2(dequeue_now, const blkptr_t *, bp,
3574 + dsl_scan_queue_t *, queue);
3575 + count_block(scn, dp->dp_blkstats, bp);
3576 + ASSERT(range_tree_contains(queue->q_exts_by_addr, offset,
3577 + asize));
3578 + avl_remove(&queue->q_zios_by_addr, sio);
3579 +
3580 + /*
3581 + * Since we're taking this scan_io_t out of its parent
3582 + * range_seg_t, we need to alter the range_seg_t's rs_fill
3583 + * value, so this changes its ordering position. We need
3584 + * to reinsert in its appropriate place in q_exts_by_size.
3585 + */
3586 + rs = range_tree_find(queue->q_exts_by_addr,
3587 + SCAN_IO_GET_OFFSET(sio), sio->sio_asize);
3588 + ASSERT(rs != NULL);
3589 + ASSERT3U(rs->rs_fill, >=, sio->sio_asize);
3590 + avl_remove(&queue->q_exts_by_size, rs);
3591 + ASSERT3U(rs->rs_fill, >=, sio->sio_asize);
3592 + rs->rs_fill -= sio->sio_asize;
3593 + VERIFY3P(avl_find(&queue->q_exts_by_size, rs, &idx), ==, NULL);
3594 + avl_insert(&queue->q_exts_by_size, rs, idx);
3595 +
3596 + /*
3597 + * We only update the queue byte counter in the cold path,
3598 + * otherwise it will already have been accounted for as
3599 + * part of the zio's execution.
3600 + */
3601 + ASSERT3U(queue->q_zio_bytes, >=, asize);
3602 + atomic_add_64(&queue->q_zio_bytes, -asize);
3603 +
3604 + mutex_enter(&scn->scn_status_lock);
3605 + ASSERT3U(scn->scn_bytes_pending, >=, asize);
3606 + scn->scn_bytes_pending -= asize;
3607 + mutex_exit(&scn->scn_status_lock);
3608 +
3609 + kmem_free(sio, sizeof (*sio));
3610 + } else {
3611 + /*
3612 + * If it's part of an extent that's currently being issued,
3613 + * wait until the extent has been consumed. In this case it's
3614 + * not us who is dequeueing this zio, so no need to
3615 + * decrement its size from scn_bytes_pending or the queue.
3616 + */
3617 + while (queue->q_issuing_rs.rs_start <= offset &&
3618 + queue->q_issuing_rs.rs_end >= offset + asize) {
3619 + DTRACE_PROBE2(dequeue_wait, const blkptr_t *, bp,
3620 + dsl_scan_queue_t *, queue);
3621 + cv_wait(&queue->q_cv, &vdev->vdev_scan_io_queue_lock);
3622 + }
3623 + }
3624 + mutex_exit(q_lock);
2049 3625 }