NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-13135 Running BDD tests exposes a panic in ZFS TRIM due to a trimset overlap
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-10069 ZFS_READONLY is a little too strict (fix test lint)
NEX-9553 Move ss_fill gap logic from scan algorithm into range_tree.c
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5553 ZFS auto-trim, manual-trim and scrub can race and deadlock
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4720 WRC: DVA allocation bypass for special BPs works incorrectly
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4683 WRC: Special block pointer must know that it is special
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4620 ZFS autotrim triggering is unreliable
NEX-4622 On-demand TRIM code illogically enumerates metaslabs via mg_ms_tree
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4245 WRC: Code cleanup and refactoring to simplify merge with upstream
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4059 On-demand TRIM can sometimes race in metaslab_load
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
NEX-3710 WRC improvements and bug-fixes
* refactored the WRC move logic to use the zio kmem caches
* replaced the separate size and compression fields with a single
blk_prop field (as in blkptr_t) to slightly reduce the size of
wrc_block_t, and added blkptr_t-style macros to extract PSIZE,
LSIZE and COMPRESSION (see the sketch after this entry)
* reduced the number of atomic operations to ease CPU load
* removed unused code
* fixed variable naming
* fixed a possible system panic after restarting a system
with WRC enabled
* fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
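For reference, the blk_prop packing mentioned in the NEX-3710 notes above can
be illustrated with the same BF64_GET_SB() bitfield macros that blkptr_t uses
for its own blk_prop word. This is only a sketch: the WBEX_* names, the
wb_prop field and the exact bit offsets are placeholders, since the real
wrc_block_t definition is not part of this webrev.

    #include <sys/spa.h>    /* BF64_GET, BF64_GET_SB, SPA_LSIZEBITS, ... */

    /* Hypothetical stand-in for wrc_block_t's packed property word. */
    typedef struct wbex_block {
            uint64_t wb_prop;       /* LSIZE, PSIZE and compression, packed */
    } wbex_block_t;

    /* Mirrors the blkptr_t layout: LSIZE in bits 0-15, PSIZE in 16-31. */
    #define WBEX_GET_LSIZE(wb)      \
            BF64_GET_SB((wb)->wb_prop, 0, SPA_LSIZEBITS, SPA_MINBLOCKSHIFT, 1)
    #define WBEX_GET_PSIZE(wb)      \
            BF64_GET_SB((wb)->wb_prop, 16, SPA_PSIZEBITS, SPA_MINBLOCKSHIFT, 1)
    #define WBEX_GET_COMPRESS(wb)   \
            BF64_GET((wb)->wb_prop, 32, 7)  /* compression algorithm */

Packing all three properties into a single 64-bit word is what shrinks
wrc_block_t compared to carrying separate size and compression fields.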
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
usr/src/uts/common/io/scsi/targets/sd.c
usr/src/uts/common/sys/scsi/targets/sddef.h
OS-197 Series of zpool exports and imports can hang the system
Reviewed by: Sarah Jelinek <sarah.jelinek@nexetna.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittens@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
re #8346 rb2639 KT disk failures
--- old/usr/src/uts/common/fs/zfs/metaslab.c
+++ new/usr/src/uts/common/fs/zfs/metaslab.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
23 23 * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
24 24 * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
25 25 * Copyright (c) 2014 Integros [integros.com]
26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
26 27 */
27 28
28 29 #include <sys/zfs_context.h>
29 30 #include <sys/dmu.h>
30 31 #include <sys/dmu_tx.h>
31 32 #include <sys/space_map.h>
32 33 #include <sys/metaslab_impl.h>
33 34 #include <sys/vdev_impl.h>
34 35 #include <sys/zio.h>
35 36 #include <sys/spa_impl.h>
36 37 #include <sys/zfeature.h>
37 -#include <sys/vdev_indirect_mapping.h>
38 +#include <sys/wbc.h>
38 39
39 40 #define GANG_ALLOCATION(flags) \
40 41 ((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER))
41 42
42 43 uint64_t metaslab_aliquot = 512ULL << 10;
43 44 uint64_t metaslab_gang_bang = SPA_MAXBLOCKSIZE + 1; /* force gang blocks */
44 45
45 46 /*
46 47 * The in-core space map representation is more compact than its on-disk form.
47 48 * The zfs_condense_pct determines how much more compact the in-core
48 49 * space map representation must be before we compact it on-disk.
49 50 * Values should be greater than or equal to 100.
50 51 */
51 52 int zfs_condense_pct = 200;
52 53
53 54 /*
54 55 * Condensing a metaslab is not guaranteed to actually reduce the amount of
55 56 * space used on disk. In particular, a space map uses data in increments of
56 57 * MAX(1 << ashift, space_map_blksize), so a metaslab might use the
57 58 * same number of blocks after condensing. Since the goal of condensing is to
58 59 * reduce the number of IOPs required to read the space map, we only want to
59 60 * condense when we can be sure we will reduce the number of blocks used by the
60 61 * space map. Unfortunately, we cannot precisely compute whether or not this is
61 62 * the case in metaslab_should_condense since we are holding ms_lock. Instead,
62 63 * we apply the following heuristic: do not condense a spacemap unless the
63 64 * uncondensed size consumes greater than zfs_metaslab_condense_block_threshold
64 65 * blocks.
65 66 */
66 67 int zfs_metaslab_condense_block_threshold = 4;
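/*
 * Worked example for the two condensing tunables above (hypothetical
 * numbers): with ashift = 12 and a 4K space_map_blksize the space map
 * grows in 4K increments, so an uncondensed map occupying only 3 blocks
 * is never condensed no matter how inflated it is, because 3 does not
 * exceed zfs_metaslab_condense_block_threshold (4).  A 40-block map whose
 * condensed form would need roughly 15 blocks does qualify, since
 * 40 >= 15 * zfs_condense_pct / 100 = 30 and 40 > 4.
 */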
67 68
68 69 /*
69 70 * The zfs_mg_noalloc_threshold defines which metaslab groups should
70 71 * be eligible for allocation. The value is defined as a percentage of
71 72 * free space. Metaslab groups that have more free space than
72 73 * zfs_mg_noalloc_threshold are always eligible for allocations. Once
73 74 * a metaslab group's free space is less than or equal to the
74 75 * zfs_mg_noalloc_threshold the allocator will avoid allocating to that
75 76 * group unless all groups in the pool have reached zfs_mg_noalloc_threshold.
76 77 * Once all groups in the pool reach zfs_mg_noalloc_threshold then all
77 78 * groups are allowed to accept allocations. Gang blocks are always
78 79 * eligible to allocate on any metaslab group. The default value of 0 means
79 80 * no metaslab group will be excluded based on this criterion.
80 81 */
81 82 int zfs_mg_noalloc_threshold = 0;
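/*
 * Example of the behavior described above (hypothetical value): with
 * zfs_mg_noalloc_threshold = 5, a metaslab group that has dropped to 3%
 * free space is skipped as long as some other group in the pool is still
 * above 5%; once every group is at or below 5%, all of them become
 * eligible again.  Gang block allocations ignore the threshold entirely.
 */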
82 83
83 84 /*
84 85 * Metaslab groups are considered eligible for allocations if their
85 86 * fragmentation metric (measured as a percentage) is less than or equal to
86 87 * zfs_mg_fragmentation_threshold. If a metaslab group exceeds this threshold
87 88 * then it will be skipped unless all metaslab groups within the metaslab
88 89 * class have also crossed this threshold.
89 90 */
90 91 int zfs_mg_fragmentation_threshold = 85;
91 92
92 93 /*
93 94 * Allow metaslabs to keep their active state as long as their fragmentation
94 95 * percentage is less than or equal to zfs_metaslab_fragmentation_threshold. An
95 96 * active metaslab that exceeds this threshold will no longer keep its active
96 97 * status allowing better metaslabs to be selected.
97 98 */
98 99 int zfs_metaslab_fragmentation_threshold = 70;
99 100
100 101 /*
101 102 * When set will load all metaslabs when pool is first opened.
102 103 */
103 104 int metaslab_debug_load = 0;
104 105
105 106 /*
106 107 * When set will prevent metaslabs from being unloaded.
107 108 */
108 109 int metaslab_debug_unload = 0;
109 110
110 111 /*
111 112 * Minimum size which forces the dynamic allocator to change
112 113 * its allocation strategy. Once the space map cannot satisfy
113 114 * an allocation of this size then it switches to using a more
114 115 * aggressive strategy (i.e. search by size rather than offset).
115 116 */
116 117 uint64_t metaslab_df_alloc_threshold = SPA_OLD_MAXBLOCKSIZE;
117 118
118 119 /*
119 120 * The minimum free space, in percent, which must be available
120 121 * in a space map to continue allocations in a first-fit fashion.
121 122 * Once the space map's free space drops below this level we dynamically
122 123 * switch to using best-fit allocations.
123 124 */
124 125 int metaslab_df_free_pct = 4;
125 126
126 127 /*
127 128 * A metaslab is considered "free" if it contains a contiguous
128 129 * segment which is greater than metaslab_min_alloc_size.
129 130 */
130 131 uint64_t metaslab_min_alloc_size = DMU_MAX_ACCESS;
131 132
132 133 /*
133 134 * Percentage of all cpus that can be used by the metaslab taskq.
134 135 */
135 136 int metaslab_load_pct = 50;
136 137
137 138 /*
138 139 * Determines how many txgs a metaslab may remain loaded without having any
139 140 * allocations from it. As long as a metaslab continues to be used we will
140 141 * keep it loaded.
141 142 */
142 143 int metaslab_unload_delay = TXG_SIZE * 2;
143 144
144 145 /*
145 146 * Max number of metaslabs per group to preload.
146 147 */
147 148 int metaslab_preload_limit = SPA_DVAS_PER_BP;
148 149
149 150 /*
150 151 * Enable/disable preloading of metaslab.
151 152 */
152 153 boolean_t metaslab_preload_enabled = B_TRUE;
153 154
154 155 /*
155 156 * Enable/disable fragmentation weighting on metaslabs.
156 157 */
157 158 boolean_t metaslab_fragmentation_factor_enabled = B_TRUE;
158 159
159 160 /*
160 161 * Enable/disable lba weighting (i.e. outer tracks are given preference).
161 162 */
162 163 boolean_t metaslab_lba_weighting_enabled = B_TRUE;
163 164
164 165 /*
165 166 * Enable/disable metaslab group biasing.
166 167 */
167 168 boolean_t metaslab_bias_enabled = B_TRUE;
168 169
169 170 /*
170 - * Enable/disable remapping of indirect DVAs to their concrete vdevs.
171 - */
172 -boolean_t zfs_remap_blkptr_enable = B_TRUE;
173 -
174 -/*
175 171 * Enable/disable segment-based metaslab selection.
176 172 */
177 173 boolean_t zfs_metaslab_segment_weight_enabled = B_TRUE;
178 174
179 175 /*
180 176 * When using segment-based metaslab selection, we will continue
181 177 * allocating from the active metaslab until we have exhausted
182 178 * zfs_metaslab_switch_threshold of its buckets.
183 179 */
184 180 int zfs_metaslab_switch_threshold = 2;
185 181
186 182 /*
187 183 * Internal switch to enable/disable the metaslab allocation tracing
188 184 * facility.
189 185 */
190 186 boolean_t metaslab_trace_enabled = B_TRUE;
191 187
192 188 /*
193 189 * Maximum entries that the metaslab allocation tracing facility will keep
194 190 * in a given list when running in non-debug mode. We limit the number
195 191 * of entries in non-debug mode to prevent us from using up too much memory.
196 192 * The limit should be sufficiently large that we don't expect any allocation
197 193 * to ever exceed this value. In debug mode, the system will panic if this
198 194 * limit is ever reached allowing for further investigation.
199 195 */
200 196 uint64_t metaslab_trace_max_entries = 5000;
201 197
202 198 static uint64_t metaslab_weight(metaslab_t *);
203 199 static void metaslab_set_fragmentation(metaslab_t *);
204 -static void metaslab_free_impl(vdev_t *, uint64_t, uint64_t, uint64_t);
205 -static void metaslab_check_free_impl(vdev_t *, uint64_t, uint64_t);
206 200
207 201 kmem_cache_t *metaslab_alloc_trace_cache;
208 202
209 203 /*
204 + * Toggle among the space-based (0), latency-based (1) and hybrid (2)
205 + * DVA allocators. Any value other than 0, 1 or 2 is treated as 0 (default).
206 + */
207 +int metaslab_alloc_dva_algorithm = 0;
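/*
 * Illustrative usage, assuming the usual illumos tunable mechanisms: the
 * allocator could be selected at boot by adding
 *
 *      set zfs:metaslab_alloc_dva_algorithm = 1
 *
 * to /etc/system (here picking the latency-based allocator), or patched
 * on a live system with mdb -kw.
 */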
208 +
209 +/*
210 + * How many TXG's worth of updates should be aggregated per TRIM/UNMAP
211 + * issued to the underlying vdev. We keep two range trees of extents
212 + * (called "trim sets") to be trimmed per metaslab, the `current' and
213 + * the `previous' TS. New frees are added to the current TS. Then,
214 + * once `zfs_txgs_per_trim' transactions have elapsed, the `current'
215 + * TS becomes the `previous' TS and a new, blank TS is created to be
216 + * the new `current', which will then start accumulating any new frees.
217 + * Once another zfs_txgs_per_trim TXGs have passed, the previous TS's
218 + * extents are trimmed, the TS is destroyed and the current TS again
219 + * becomes the previous TS.
220 + * This serves to fulfill two functions: aggregate many small frees
221 + * into fewer larger trim operations (which should help with devices
222 + * which do not take so kindly to them) and to allow for disaster
223 + * recovery (extents won't get trimmed immediately, but instead only
224 + * after passing this rather long timeout, thus preserving
225 + * 'zfs import -F' functionality).
226 + */
227 +unsigned int zfs_txgs_per_trim = 32;
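/*
 * Worked timeline for the scheme described above, using the default
 * zfs_txgs_per_trim = 32 (a sketch):
 *
 *      txg  0..31    frees accumulate in trimset A (the `current' TS)
 *      txg  32       A becomes `previous'; a new trimset B becomes `current'
 *      txg  32..63   frees accumulate in B while A sits untouched
 *      txg  64       A's extents are issued as TRIMs and A is destroyed;
 *                    B becomes `previous'; a new trimset C becomes `current'
 *
 * An extent freed in txg N is therefore not trimmed until roughly
 * N + 32 to N + 64 txgs later, which is what keeps recently freed blocks
 * available to 'zfs import -F'.
 */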
228 +
229 +static void metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size);
230 +static void metaslab_trim_add(void *arg, uint64_t offset, uint64_t size);
231 +
232 +static zio_t *metaslab_exec_trim(metaslab_t *msp);
233 +
234 +static metaslab_trimset_t *metaslab_new_trimset(uint64_t txg, kmutex_t *lock);
235 +static void metaslab_free_trimset(metaslab_trimset_t *ts);
236 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
237 + uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit);
238 +
239 +/*
210 240 * ==========================================================================
211 241 * Metaslab classes
212 242 * ==========================================================================
213 243 */
214 244 metaslab_class_t *
215 245 metaslab_class_create(spa_t *spa, metaslab_ops_t *ops)
216 246 {
217 247 metaslab_class_t *mc;
218 248
219 249 mc = kmem_zalloc(sizeof (metaslab_class_t), KM_SLEEP);
220 250
251 + mutex_init(&mc->mc_alloc_lock, NULL, MUTEX_DEFAULT, NULL);
252 + avl_create(&mc->mc_alloc_tree, zio_bookmark_compare,
253 + sizeof (zio_t), offsetof(zio_t, io_alloc_node));
254 +
221 255 mc->mc_spa = spa;
222 256 mc->mc_rotor = NULL;
223 257 mc->mc_ops = ops;
224 258 mutex_init(&mc->mc_lock, NULL, MUTEX_DEFAULT, NULL);
225 259 refcount_create_tracked(&mc->mc_alloc_slots);
226 260
227 261 return (mc);
228 262 }
229 263
230 264 void
231 265 metaslab_class_destroy(metaslab_class_t *mc)
232 266 {
233 267 ASSERT(mc->mc_rotor == NULL);
234 268 ASSERT(mc->mc_alloc == 0);
235 269 ASSERT(mc->mc_deferred == 0);
236 270 ASSERT(mc->mc_space == 0);
237 271 ASSERT(mc->mc_dspace == 0);
238 272
273 + avl_destroy(&mc->mc_alloc_tree);
274 + mutex_destroy(&mc->mc_alloc_lock);
275 +
239 276 refcount_destroy(&mc->mc_alloc_slots);
240 277 mutex_destroy(&mc->mc_lock);
241 278 kmem_free(mc, sizeof (metaslab_class_t));
242 279 }
243 280
244 281 int
245 282 metaslab_class_validate(metaslab_class_t *mc)
246 283 {
247 284 metaslab_group_t *mg;
248 285 vdev_t *vd;
249 286
250 287 /*
251 288 * Must hold one of the spa_config locks.
252 289 */
253 290 ASSERT(spa_config_held(mc->mc_spa, SCL_ALL, RW_READER) ||
254 291 spa_config_held(mc->mc_spa, SCL_ALL, RW_WRITER));
255 292
256 293 if ((mg = mc->mc_rotor) == NULL)
257 294 return (0);
258 295
259 296 do {
260 297 vd = mg->mg_vd;
261 298 ASSERT(vd->vdev_mg != NULL);
262 299 ASSERT3P(vd->vdev_top, ==, vd);
263 300 ASSERT3P(mg->mg_class, ==, mc);
264 301 ASSERT3P(vd->vdev_ops, !=, &vdev_hole_ops);
265 302 } while ((mg = mg->mg_next) != mc->mc_rotor);
266 303
267 304 return (0);
268 305 }
269 306
270 307 void
271 308 metaslab_class_space_update(metaslab_class_t *mc, int64_t alloc_delta,
272 309 int64_t defer_delta, int64_t space_delta, int64_t dspace_delta)
273 310 {
274 311 atomic_add_64(&mc->mc_alloc, alloc_delta);
275 312 atomic_add_64(&mc->mc_deferred, defer_delta);
276 313 atomic_add_64(&mc->mc_space, space_delta);
277 314 atomic_add_64(&mc->mc_dspace, dspace_delta);
278 315 }
279 316
280 317 uint64_t
281 318 metaslab_class_get_alloc(metaslab_class_t *mc)
282 319 {
283 320 return (mc->mc_alloc);
284 321 }
285 322
286 323 uint64_t
287 324 metaslab_class_get_deferred(metaslab_class_t *mc)
288 325 {
289 326 return (mc->mc_deferred);
290 327 }
291 328
292 329 uint64_t
293 330 metaslab_class_get_space(metaslab_class_t *mc)
294 331 {
295 332 return (mc->mc_space);
296 333 }
297 334
298 335 uint64_t
299 336 metaslab_class_get_dspace(metaslab_class_t *mc)
300 337 {
301 338 return (spa_deflate(mc->mc_spa) ? mc->mc_dspace : mc->mc_space);
302 339 }
303 340
304 341 void
305 342 metaslab_class_histogram_verify(metaslab_class_t *mc)
306 343 {
307 344 vdev_t *rvd = mc->mc_spa->spa_root_vdev;
308 345 uint64_t *mc_hist;
309 346 int i;
310 347
311 348 if ((zfs_flags & ZFS_DEBUG_HISTOGRAM_VERIFY) == 0)
312 349 return;
313 350
314 351 mc_hist = kmem_zalloc(sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE,
315 352 KM_SLEEP);
316 353
317 354 for (int c = 0; c < rvd->vdev_children; c++) {
318 355 vdev_t *tvd = rvd->vdev_child[c];
319 356 metaslab_group_t *mg = tvd->vdev_mg;
320 357
321 358 /*
322 359 * Skip any holes, uninitialized top-levels, or
323 360 * vdevs that are not in this metaslab class.
324 361 */
325 - if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
362 + if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
326 363 mg->mg_class != mc) {
327 364 continue;
328 365 }
329 366
330 367 for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
331 368 mc_hist[i] += mg->mg_histogram[i];
332 369 }
333 370
334 371 for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
335 372 VERIFY3U(mc_hist[i], ==, mc->mc_histogram[i]);
336 373
337 374 kmem_free(mc_hist, sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE);
338 375 }
339 376
340 377 /*
341 378 * Calculate the metaslab class's fragmentation metric. The metric
342 379 * is weighted based on the space contribution of each metaslab group.
343 380 * The return value will be a number between 0 and 100 (inclusive), or
344 381 * ZFS_FRAG_INVALID if the metric has not been set. See comment above the
345 382 * zfs_frag_table for more information about the metric.
346 383 */
347 384 uint64_t
348 385 metaslab_class_fragmentation(metaslab_class_t *mc)
349 386 {
350 387 vdev_t *rvd = mc->mc_spa->spa_root_vdev;
351 388 uint64_t fragmentation = 0;
352 389
353 390 spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
354 391
355 392 for (int c = 0; c < rvd->vdev_children; c++) {
356 393 vdev_t *tvd = rvd->vdev_child[c];
357 394 metaslab_group_t *mg = tvd->vdev_mg;
358 395
359 396 /*
360 - * Skip any holes, uninitialized top-levels,
361 - * or vdevs that are not in this metalab class.
397 + * Skip any holes, uninitialized top-levels, or
398 + * vdevs that are not in this metalab class.
362 399 */
363 - if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
400 + if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
364 401 mg->mg_class != mc) {
365 402 continue;
366 403 }
367 404
368 405 /*
369 406 * If a metaslab group does not contain a fragmentation
370 407 * metric then just bail out.
371 408 */
372 409 if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
373 410 spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
374 411 return (ZFS_FRAG_INVALID);
375 412 }
376 413
377 414 /*
378 415 * Determine how much this metaslab_group is contributing
379 416 * to the overall pool fragmentation metric.
380 417 */
381 418 fragmentation += mg->mg_fragmentation *
382 419 metaslab_group_get_space(mg);
383 420 }
384 421 fragmentation /= metaslab_class_get_space(mc);
385 422
386 423 ASSERT3U(fragmentation, <=, 100);
387 424 spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
388 425 return (fragmentation);
389 426 }
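/*
 * Worked example of the weighting above (hypothetical numbers): a class
 * with one 1 TB metaslab group at 10% fragmentation and one 3 TB group
 * at 50% reports (10 * 1 + 50 * 3) / (1 + 3) = 40, so the larger group
 * dominates the class-wide metric.
 */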
390 427
391 428 /*
392 429 * Calculate the amount of expandable space that is available in
393 430 * this metaslab class. If a device is expanded then its expandable
394 431 * space will be the amount of allocatable space that is currently not
395 432 * part of this metaslab class.
396 433 */
397 434 uint64_t
398 435 metaslab_class_expandable_space(metaslab_class_t *mc)
399 436 {
400 437 vdev_t *rvd = mc->mc_spa->spa_root_vdev;
401 438 uint64_t space = 0;
402 439
403 440 spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
404 441 for (int c = 0; c < rvd->vdev_children; c++) {
405 442 uint64_t tspace;
406 443 vdev_t *tvd = rvd->vdev_child[c];
407 444 metaslab_group_t *mg = tvd->vdev_mg;
408 445
409 - if (!vdev_is_concrete(tvd) || tvd->vdev_ms_shift == 0 ||
446 + if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
410 447 mg->mg_class != mc) {
411 448 continue;
412 449 }
413 450
414 451 /*
415 452 * Calculate if we have enough space to add additional
416 453 * metaslabs. We report the expandable space in terms
417 454 * of the metaslab size since that's the unit of expansion.
418 455 * Adjust by efi system partition size.
419 456 */
420 457 tspace = tvd->vdev_max_asize - tvd->vdev_asize;
421 458 if (tspace > mc->mc_spa->spa_bootsize) {
422 459 tspace -= mc->mc_spa->spa_bootsize;
423 460 }
424 461 space += P2ALIGN(tspace, 1ULL << tvd->vdev_ms_shift);
425 462 }
426 463 spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
427 464 return (space);
428 465 }
429 466
430 467 static int
431 468 metaslab_compare(const void *x1, const void *x2)
432 469 {
433 470 const metaslab_t *m1 = x1;
434 471 const metaslab_t *m2 = x2;
435 472
436 473 if (m1->ms_weight < m2->ms_weight)
437 474 return (1);
438 475 if (m1->ms_weight > m2->ms_weight)
439 476 return (-1);
440 477
441 478 /*
442 479 * If the weights are identical, use the offset to force uniqueness.
443 480 */
444 481 if (m1->ms_start < m2->ms_start)
445 482 return (-1);
446 483 if (m1->ms_start > m2->ms_start)
447 484 return (1);
448 485
449 486 ASSERT3P(m1, ==, m2);
450 487
451 488 return (0);
452 489 }
453 490
454 491 /*
455 492 * Verify that the space accounting on disk matches the in-core range_trees.
456 493 */
457 494 void
458 495 metaslab_verify_space(metaslab_t *msp, uint64_t txg)
459 496 {
460 497 spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
461 498 uint64_t allocated = 0;
462 499 uint64_t sm_free_space, msp_free_space;
463 500
464 501 ASSERT(MUTEX_HELD(&msp->ms_lock));
465 502
466 503 if ((zfs_flags & ZFS_DEBUG_METASLAB_VERIFY) == 0)
467 504 return;
468 505
469 506 /*
470 507 * We can only verify the metaslab space when we're called
471 508 * from syncing context with a loaded metaslab that has an allocated
472 509 * space map. Calling this in non-syncing context does not
473 510 * provide a consistent view of the metaslab since we're performing
474 511 * allocations in the future.
475 512 */
476 513 if (txg != spa_syncing_txg(spa) || msp->ms_sm == NULL ||
477 514 !msp->ms_loaded)
478 515 return;
479 516
480 517 sm_free_space = msp->ms_size - space_map_allocated(msp->ms_sm) -
481 518 space_map_alloc_delta(msp->ms_sm);
482 519
483 520 /*
484 521 * Account for future allocations since we would have already
485 522 * deducted that space from the ms_freetree.
486 523 */
487 524 for (int t = 0; t < TXG_CONCURRENT_STATES; t++) {
488 525 allocated +=
489 526 range_tree_space(msp->ms_alloctree[(txg + t) & TXG_MASK]);
490 527 }
491 528
492 529 msp_free_space = range_tree_space(msp->ms_tree) + allocated +
493 530 msp->ms_deferspace + range_tree_space(msp->ms_freedtree);
494 531
495 532 VERIFY3U(sm_free_space, ==, msp_free_space);
496 533 }
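/*
 * The identity verified above, spelled out (a sketch of the reasoning):
 * free space according to the space map (ms_size minus what the map has
 * recorded as allocated, minus the not-yet-synced alloc delta) must match
 * the in-core view: whatever is left in ms_tree, plus space already
 * claimed in the ms_alloctrees for the next TXG_CONCURRENT_STATES txgs,
 * plus deferred frees, plus this txg's ms_freedtree contents.
 */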
497 534
498 535 /*
499 536 * ==========================================================================
500 537 * Metaslab groups
501 538 * ==========================================================================
502 539 */
503 540 /*
504 541 * Update the allocatable flag and the metaslab group's capacity.
505 542 * The allocatable flag is set to true if the capacity is below
506 543 * the zfs_mg_noalloc_threshold or has a fragmentation value that is
507 544 * greater than zfs_mg_fragmentation_threshold. If a metaslab group
508 545 * transitions from allocatable to non-allocatable or vice versa then the
509 546 * metaslab group's class is updated to reflect the transition.
510 547 */
511 548 static void
512 549 metaslab_group_alloc_update(metaslab_group_t *mg)
513 550 {
514 551 vdev_t *vd = mg->mg_vd;
515 552 metaslab_class_t *mc = mg->mg_class;
516 553 vdev_stat_t *vs = &vd->vdev_stat;
517 554 boolean_t was_allocatable;
518 555 boolean_t was_initialized;
519 556
520 557 ASSERT(vd == vd->vdev_top);
521 - ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_READER), ==,
522 - SCL_ALLOC);
523 558
524 559 mutex_enter(&mg->mg_lock);
525 560 was_allocatable = mg->mg_allocatable;
526 561 was_initialized = mg->mg_initialized;
527 562
528 563 mg->mg_free_capacity = ((vs->vs_space - vs->vs_alloc) * 100) /
529 564 (vs->vs_space + 1);
530 565
531 566 mutex_enter(&mc->mc_lock);
532 567
533 568 /*
534 569 * If the metaslab group was just added then it won't
535 570 * have any space until we finish syncing out this txg.
536 571 * At that point we will consider it initialized and available
537 572 * for allocations. We also don't consider non-activated
538 573 * metaslab groups (e.g. vdevs that are in the middle of being removed)
539 574 * to be initialized, because they can't be used for allocation.
540 575 */
541 576 mg->mg_initialized = metaslab_group_initialized(mg);
542 577 if (!was_initialized && mg->mg_initialized) {
543 578 mc->mc_groups++;
544 579 } else if (was_initialized && !mg->mg_initialized) {
545 580 ASSERT3U(mc->mc_groups, >, 0);
546 581 mc->mc_groups--;
547 582 }
548 583 if (mg->mg_initialized)
549 584 mg->mg_no_free_space = B_FALSE;
550 585
551 586 /*
552 587 * A metaslab group is considered allocatable if it has plenty
553 588 * of free space or is not heavily fragmented. We only take
554 589 * fragmentation into account if the metaslab group has a valid
555 590 * fragmentation metric (i.e. a value between 0 and 100).
556 591 */
557 592 mg->mg_allocatable = (mg->mg_activation_count > 0 &&
558 593 mg->mg_free_capacity > zfs_mg_noalloc_threshold &&
559 594 (mg->mg_fragmentation == ZFS_FRAG_INVALID ||
560 595 mg->mg_fragmentation <= zfs_mg_fragmentation_threshold));
561 596
562 597 /*
563 598 * The mc_alloc_groups maintains a count of the number of
564 599 * groups in this metaslab class that are still above the
565 600 * zfs_mg_noalloc_threshold. This is used by the allocating
566 601 * threads to determine if they should avoid allocations to
567 602 * a given group. The allocator will avoid allocations to a group
568 603 * if that group has reached or is below the zfs_mg_noalloc_threshold
569 604 * and there are still other groups that are above the threshold.
570 605 * When a group transitions from allocatable to non-allocatable or
571 606 * vice versa we update the metaslab class to reflect that change.
572 607 * When the mc_alloc_groups value drops to 0 that means that all
573 608 * groups have reached the zfs_mg_noalloc_threshold making all groups
574 609 * eligible for allocations. This effectively means that all devices
575 610 * are balanced again.
576 611 */
577 612 if (was_allocatable && !mg->mg_allocatable)
578 613 mc->mc_alloc_groups--;
579 614 else if (!was_allocatable && mg->mg_allocatable)
580 615 mc->mc_alloc_groups++;
581 616 mutex_exit(&mc->mc_lock);
582 617
583 618 mutex_exit(&mg->mg_lock);
584 619 }
585 620
586 621 metaslab_group_t *
587 622 metaslab_group_create(metaslab_class_t *mc, vdev_t *vd)
588 623 {
589 624 metaslab_group_t *mg;
590 625
591 626 mg = kmem_zalloc(sizeof (metaslab_group_t), KM_SLEEP);
592 627 mutex_init(&mg->mg_lock, NULL, MUTEX_DEFAULT, NULL);
593 628 avl_create(&mg->mg_metaslab_tree, metaslab_compare,
594 629 sizeof (metaslab_t), offsetof(struct metaslab, ms_group_node));
595 630 mg->mg_vd = vd;
596 631 mg->mg_class = mc;
597 632 mg->mg_activation_count = 0;
598 633 mg->mg_initialized = B_FALSE;
599 634 mg->mg_no_free_space = B_TRUE;
600 635 refcount_create_tracked(&mg->mg_alloc_queue_depth);
601 636
602 637 mg->mg_taskq = taskq_create("metaslab_group_taskq", metaslab_load_pct,
603 638 minclsyspri, 10, INT_MAX, TASKQ_THREADS_CPU_PCT);
604 639
605 640 return (mg);
606 641 }
607 642
608 643 void
609 644 metaslab_group_destroy(metaslab_group_t *mg)
610 645 {
611 646 ASSERT(mg->mg_prev == NULL);
612 647 ASSERT(mg->mg_next == NULL);
613 648 /*
614 649 * We may have gone below zero with the activation count
615 650 * either because we never activated in the first place or
616 651 * because we're done, and possibly removing the vdev.
617 652 */
618 653 ASSERT(mg->mg_activation_count <= 0);
619 654
620 - taskq_destroy(mg->mg_taskq);
655 + if (mg->mg_taskq)
656 + taskq_destroy(mg->mg_taskq);
621 657 avl_destroy(&mg->mg_metaslab_tree);
622 658 mutex_destroy(&mg->mg_lock);
623 659 refcount_destroy(&mg->mg_alloc_queue_depth);
624 660 kmem_free(mg, sizeof (metaslab_group_t));
625 661 }
626 662
627 663 void
628 664 metaslab_group_activate(metaslab_group_t *mg)
629 665 {
630 666 metaslab_class_t *mc = mg->mg_class;
631 667 metaslab_group_t *mgprev, *mgnext;
632 668
633 - ASSERT3U(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER), !=, 0);
669 + ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
634 670
635 671 ASSERT(mc->mc_rotor != mg);
636 672 ASSERT(mg->mg_prev == NULL);
637 673 ASSERT(mg->mg_next == NULL);
638 674 ASSERT(mg->mg_activation_count <= 0);
639 675
640 676 if (++mg->mg_activation_count <= 0)
641 677 return;
642 678
643 679 mg->mg_aliquot = metaslab_aliquot * MAX(1, mg->mg_vd->vdev_children);
644 680 metaslab_group_alloc_update(mg);
645 681
646 682 if ((mgprev = mc->mc_rotor) == NULL) {
647 683 mg->mg_prev = mg;
648 684 mg->mg_next = mg;
649 685 } else {
650 686 mgnext = mgprev->mg_next;
651 687 mg->mg_prev = mgprev;
652 688 mg->mg_next = mgnext;
653 689 mgprev->mg_next = mg;
654 690 mgnext->mg_prev = mg;
655 691 }
656 692 mc->mc_rotor = mg;
657 693 }
658 694
659 -/*
660 - * Passivate a metaslab group and remove it from the allocation rotor.
661 - * Callers must hold both the SCL_ALLOC and SCL_ZIO lock prior to passivating
662 - * a metaslab group. This function will momentarily drop spa_config_locks
663 - * that are lower than the SCL_ALLOC lock (see comment below).
664 - */
665 695 void
666 696 metaslab_group_passivate(metaslab_group_t *mg)
667 697 {
668 698 metaslab_class_t *mc = mg->mg_class;
669 - spa_t *spa = mc->mc_spa;
670 699 metaslab_group_t *mgprev, *mgnext;
671 - int locks = spa_config_held(spa, SCL_ALL, RW_WRITER);
672 700
673 - ASSERT3U(spa_config_held(spa, SCL_ALLOC | SCL_ZIO, RW_WRITER), ==,
674 - (SCL_ALLOC | SCL_ZIO));
701 + ASSERT(spa_config_held(mc->mc_spa, SCL_ALLOC, RW_WRITER));
675 702
676 703 if (--mg->mg_activation_count != 0) {
677 704 ASSERT(mc->mc_rotor != mg);
678 705 ASSERT(mg->mg_prev == NULL);
679 706 ASSERT(mg->mg_next == NULL);
680 707 ASSERT(mg->mg_activation_count < 0);
681 708 return;
682 709 }
683 710
684 - /*
685 - * The spa_config_lock is an array of rwlocks, ordered as
686 - * follows (from highest to lowest):
687 - * SCL_CONFIG > SCL_STATE > SCL_L2ARC > SCL_ALLOC >
688 - * SCL_ZIO > SCL_FREE > SCL_VDEV
689 - * (For more information about the spa_config_lock see spa_misc.c)
690 - * The higher the lock, the broader its coverage. When we passivate
691 - * a metaslab group, we must hold both the SCL_ALLOC and the SCL_ZIO
692 - * config locks. However, the metaslab group's taskq might be trying
693 - * to preload metaslabs so we must drop the SCL_ZIO lock and any
694 - * lower locks to allow the I/O to complete. At a minimum,
695 - * we continue to hold the SCL_ALLOC lock, which prevents any future
696 - * allocations from taking place and any changes to the vdev tree.
697 - */
698 - spa_config_exit(spa, locks & ~(SCL_ZIO - 1), spa);
699 711 taskq_wait(mg->mg_taskq);
700 - spa_config_enter(spa, locks & ~(SCL_ZIO - 1), spa, RW_WRITER);
701 712 metaslab_group_alloc_update(mg);
702 713
703 714 mgprev = mg->mg_prev;
704 715 mgnext = mg->mg_next;
705 716
706 717 if (mg == mgnext) {
707 718 mc->mc_rotor = NULL;
708 719 } else {
709 720 mc->mc_rotor = mgnext;
710 721 mgprev->mg_next = mgnext;
711 722 mgnext->mg_prev = mgprev;
712 723 }
713 724
714 725 mg->mg_prev = NULL;
715 726 mg->mg_next = NULL;
716 727 }
717 728
718 729 boolean_t
719 730 metaslab_group_initialized(metaslab_group_t *mg)
720 731 {
721 732 vdev_t *vd = mg->mg_vd;
722 733 vdev_stat_t *vs = &vd->vdev_stat;
723 734
724 735 return (vs->vs_space != 0 && mg->mg_activation_count > 0);
725 736 }
726 737
727 738 uint64_t
728 739 metaslab_group_get_space(metaslab_group_t *mg)
729 740 {
730 741 return ((1ULL << mg->mg_vd->vdev_ms_shift) * mg->mg_vd->vdev_ms_count);
731 742 }
732 743
733 744 void
734 745 metaslab_group_histogram_verify(metaslab_group_t *mg)
735 746 {
736 747 uint64_t *mg_hist;
737 748 vdev_t *vd = mg->mg_vd;
738 749 uint64_t ashift = vd->vdev_ashift;
739 750 int i;
740 751
741 752 if ((zfs_flags & ZFS_DEBUG_HISTOGRAM_VERIFY) == 0)
742 753 return;
743 754
744 755 mg_hist = kmem_zalloc(sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE,
745 756 KM_SLEEP);
746 757
747 758 ASSERT3U(RANGE_TREE_HISTOGRAM_SIZE, >=,
748 759 SPACE_MAP_HISTOGRAM_SIZE + ashift);
749 760
750 761 for (int m = 0; m < vd->vdev_ms_count; m++) {
751 762 metaslab_t *msp = vd->vdev_ms[m];
752 763
753 764 if (msp->ms_sm == NULL)
754 765 continue;
755 766
756 767 for (i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++)
757 768 mg_hist[i + ashift] +=
758 769 msp->ms_sm->sm_phys->smp_histogram[i];
759 770 }
760 771
761 772 for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i ++)
762 773 VERIFY3U(mg_hist[i], ==, mg->mg_histogram[i]);
763 774
764 775 kmem_free(mg_hist, sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE);
765 776 }
766 777
767 778 static void
768 779 metaslab_group_histogram_add(metaslab_group_t *mg, metaslab_t *msp)
769 780 {
770 781 metaslab_class_t *mc = mg->mg_class;
771 782 uint64_t ashift = mg->mg_vd->vdev_ashift;
772 783
773 784 ASSERT(MUTEX_HELD(&msp->ms_lock));
774 785 if (msp->ms_sm == NULL)
775 786 return;
776 787
777 788 mutex_enter(&mg->mg_lock);
778 789 for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) {
779 790 mg->mg_histogram[i + ashift] +=
780 791 msp->ms_sm->sm_phys->smp_histogram[i];
781 792 mc->mc_histogram[i + ashift] +=
782 793 msp->ms_sm->sm_phys->smp_histogram[i];
783 794 }
784 795 mutex_exit(&mg->mg_lock);
785 796 }
786 797
787 798 void
788 799 metaslab_group_histogram_remove(metaslab_group_t *mg, metaslab_t *msp)
789 800 {
790 801 metaslab_class_t *mc = mg->mg_class;
791 802 uint64_t ashift = mg->mg_vd->vdev_ashift;
792 803
793 804 ASSERT(MUTEX_HELD(&msp->ms_lock));
794 805 if (msp->ms_sm == NULL)
795 806 return;
796 807
797 808 mutex_enter(&mg->mg_lock);
798 809 for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) {
799 810 ASSERT3U(mg->mg_histogram[i + ashift], >=,
800 811 msp->ms_sm->sm_phys->smp_histogram[i]);
801 812 ASSERT3U(mc->mc_histogram[i + ashift], >=,
802 813 msp->ms_sm->sm_phys->smp_histogram[i]);
803 814
804 815 mg->mg_histogram[i + ashift] -=
805 816 msp->ms_sm->sm_phys->smp_histogram[i];
806 817 mc->mc_histogram[i + ashift] -=
807 818 msp->ms_sm->sm_phys->smp_histogram[i];
808 819 }
809 820 mutex_exit(&mg->mg_lock);
810 821 }
811 822
812 823 static void
813 824 metaslab_group_add(metaslab_group_t *mg, metaslab_t *msp)
814 825 {
815 826 ASSERT(msp->ms_group == NULL);
816 827 mutex_enter(&mg->mg_lock);
817 828 msp->ms_group = mg;
818 829 msp->ms_weight = 0;
819 830 avl_add(&mg->mg_metaslab_tree, msp);
820 831 mutex_exit(&mg->mg_lock);
821 832
822 833 mutex_enter(&msp->ms_lock);
823 834 metaslab_group_histogram_add(mg, msp);
824 835 mutex_exit(&msp->ms_lock);
825 836 }
826 837
827 838 static void
828 839 metaslab_group_remove(metaslab_group_t *mg, metaslab_t *msp)
829 840 {
830 841 mutex_enter(&msp->ms_lock);
831 842 metaslab_group_histogram_remove(mg, msp);
832 843 mutex_exit(&msp->ms_lock);
833 844
834 845 mutex_enter(&mg->mg_lock);
835 846 ASSERT(msp->ms_group == mg);
836 847 avl_remove(&mg->mg_metaslab_tree, msp);
837 848 msp->ms_group = NULL;
838 849 mutex_exit(&mg->mg_lock);
839 850 }
840 851
841 852 static void
842 853 metaslab_group_sort(metaslab_group_t *mg, metaslab_t *msp, uint64_t weight)
843 854 {
844 855 /*
845 856 * Although in principle the weight can be any value, in
846 857 * practice we do not use values in the range [1, 511].
847 858 */
848 859 ASSERT(weight >= SPA_MINBLOCKSIZE || weight == 0);
849 860 ASSERT(MUTEX_HELD(&msp->ms_lock));
850 861
851 862 mutex_enter(&mg->mg_lock);
852 863 ASSERT(msp->ms_group == mg);
853 864 avl_remove(&mg->mg_metaslab_tree, msp);
854 865 msp->ms_weight = weight;
855 866 avl_add(&mg->mg_metaslab_tree, msp);
856 867 mutex_exit(&mg->mg_lock);
857 868 }
858 869
859 870 /*
860 871 * Calculate the fragmentation for a given metaslab group. We can use
861 872 * a simple average here since all metaslabs within the group must have
862 873 * the same size. The return value will be a value between 0 and 100
863 874 * (inclusive), or ZFS_FRAG_INVALID if less than half of the metaslabs in this
864 875 * group have a fragmentation metric.
865 876 */
866 877 uint64_t
867 878 metaslab_group_fragmentation(metaslab_group_t *mg)
868 879 {
869 880 vdev_t *vd = mg->mg_vd;
870 881 uint64_t fragmentation = 0;
871 882 uint64_t valid_ms = 0;
872 883
873 884 for (int m = 0; m < vd->vdev_ms_count; m++) {
874 885 metaslab_t *msp = vd->vdev_ms[m];
875 886
876 887 if (msp->ms_fragmentation == ZFS_FRAG_INVALID)
877 888 continue;
878 889
879 890 valid_ms++;
880 891 fragmentation += msp->ms_fragmentation;
881 892 }
882 893
883 894 if (valid_ms <= vd->vdev_ms_count / 2)
884 895 return (ZFS_FRAG_INVALID);
885 896
886 897 fragmentation /= valid_ms;
887 898 ASSERT3U(fragmentation, <=, 100);
888 899 return (fragmentation);
889 900 }
890 901
891 902 /*
892 903 * Determine if a given metaslab group should skip allocations. A metaslab
893 904 * group should avoid allocations if its free capacity is less than the
894 905 * zfs_mg_noalloc_threshold or its fragmentation metric is greater than
895 906 * zfs_mg_fragmentation_threshold and there is at least one metaslab group
896 907 * that can still handle allocations. If the allocation throttle is enabled
897 908 * then we skip allocations to devices that have reached their maximum
898 909 * allocation queue depth unless the selected metaslab group is the only
899 910 * eligible group remaining.
900 911 */
901 912 static boolean_t
902 913 metaslab_group_allocatable(metaslab_group_t *mg, metaslab_group_t *rotor,
903 914 uint64_t psize)
904 915 {
905 916 spa_t *spa = mg->mg_vd->vdev_spa;
906 917 metaslab_class_t *mc = mg->mg_class;
907 918
908 919 /*
909 920 * We can only consider skipping this metaslab group if it's
910 921 * in the normal metaslab class and there are other metaslab
911 922 * groups to select from. Otherwise, we always consider it eligible
912 923 * for allocations.
913 924 */
914 925 if (mc != spa_normal_class(spa) || mc->mc_groups <= 1)
915 926 return (B_TRUE);
916 927
917 928 /*
918 929 * If the metaslab group's mg_allocatable flag is set (see comments
919 930 * in metaslab_group_alloc_update() for more information) and
920 931 * the allocation throttle is disabled then allow allocations to this
921 932 * device. However, if the allocation throttle is enabled then
922 933 * check if we have reached our allocation limit (mg_alloc_queue_depth)
923 934 * to determine if we should allow allocations to this metaslab group.
924 935 * If all metaslab groups are no longer considered allocatable
925 936 * (mc_alloc_groups == 0) or we're trying to allocate the smallest
926 937 * gang block size then we allow allocations on this metaslab group
927 938 * regardless of the mg_allocatable or throttle settings.
928 939 */
929 940 if (mg->mg_allocatable) {
930 941 metaslab_group_t *mgp;
931 942 int64_t qdepth;
932 943 uint64_t qmax = mg->mg_max_alloc_queue_depth;
933 944
934 945 if (!mc->mc_alloc_throttle_enabled)
935 946 return (B_TRUE);
936 947
937 948 /*
938 949 * If this metaslab group does not have any free space, then
939 950 * there is no point in looking further.
940 951 */
941 952 if (mg->mg_no_free_space)
942 953 return (B_FALSE);
943 954
944 955 qdepth = refcount_count(&mg->mg_alloc_queue_depth);
945 956
946 957 /*
947 958 * If this metaslab group is below its qmax or it's
948 959 * the only allocatable metaslab group, then attempt
949 960 * to allocate from it.
950 961 */
951 962 if (qdepth < qmax || mc->mc_alloc_groups == 1)
952 963 return (B_TRUE);
953 964 ASSERT3U(mc->mc_alloc_groups, >, 1);
954 965
955 966 /*
956 967 * Since this metaslab group is at or over its qmax, we
957 968 * need to determine if there are metaslab groups after this
958 969 * one that might be able to handle this allocation. This is
959 970 * racy since we can't hold the locks for all metaslab
960 971 * groups at the same time when we make this check.
961 972 */
962 973 for (mgp = mg->mg_next; mgp != rotor; mgp = mgp->mg_next) {
963 974 qmax = mgp->mg_max_alloc_queue_depth;
964 975
965 976 qdepth = refcount_count(&mgp->mg_alloc_queue_depth);
966 977
967 978 /*
968 979 * If there is another metaslab group that
969 980 * might be able to handle the allocation, then
970 981 * we return false so that we skip this group.
971 982 */
972 983 if (qdepth < qmax && !mgp->mg_no_free_space)
973 984 return (B_FALSE);
974 985 }
975 986
976 987 /*
977 988 * We didn't find another group to handle the allocation
978 989 * so we can't skip this metaslab group even though
979 990 * we are at or over our qmax.
980 991 */
981 992 return (B_TRUE);
982 993
983 994 } else if (mc->mc_alloc_groups == 0 || psize == SPA_MINBLOCKSIZE) {
984 995 return (B_TRUE);
985 996 }
986 997 return (B_FALSE);
987 998 }
988 999
989 1000 /*
990 1001 * ==========================================================================
991 1002 * Range tree callbacks
992 1003 * ==========================================================================
993 1004 */
994 1005
995 1006 /*
996 1007 * Comparison function for the private size-ordered tree. Tree is sorted
997 1008 * by size, larger sizes at the end of the tree.
998 1009 */
999 1010 static int
1000 1011 metaslab_rangesize_compare(const void *x1, const void *x2)
1001 1012 {
1002 1013 const range_seg_t *r1 = x1;
1003 1014 const range_seg_t *r2 = x2;
1004 1015 uint64_t rs_size1 = r1->rs_end - r1->rs_start;
1005 1016 uint64_t rs_size2 = r2->rs_end - r2->rs_start;
1006 1017
1007 1018 if (rs_size1 < rs_size2)
1008 1019 return (-1);
1009 1020 if (rs_size1 > rs_size2)
1010 1021 return (1);
1011 1022
1012 1023 if (r1->rs_start < r2->rs_start)
1013 1024 return (-1);
1014 1025
1015 1026 if (r1->rs_start > r2->rs_start)
1016 1027 return (1);
1017 1028
1018 1029 return (0);
1019 1030 }
1020 1031
1021 1032 /*
1022 1033 * Create any block allocator specific components. The current allocators
1023 1034 * rely on using both a size-ordered range_tree_t and an array of uint64_t's.
1024 1035 */
1025 1036 static void
1026 1037 metaslab_rt_create(range_tree_t *rt, void *arg)
1027 1038 {
1028 1039 metaslab_t *msp = arg;
1029 1040
1030 1041 ASSERT3P(rt->rt_arg, ==, msp);
1031 1042 ASSERT(msp->ms_tree == NULL);
1032 1043
1033 1044 avl_create(&msp->ms_size_tree, metaslab_rangesize_compare,
1034 1045 sizeof (range_seg_t), offsetof(range_seg_t, rs_pp_node));
1035 1046 }
1036 1047
1037 1048 /*
1038 1049 * Destroy the block allocator specific components.
1039 1050 */
1040 1051 static void
1041 1052 metaslab_rt_destroy(range_tree_t *rt, void *arg)
1042 1053 {
1043 1054 metaslab_t *msp = arg;
1044 1055
1045 1056 ASSERT3P(rt->rt_arg, ==, msp);
1046 1057 ASSERT3P(msp->ms_tree, ==, rt);
1047 1058 ASSERT0(avl_numnodes(&msp->ms_size_tree));
1048 1059
1049 1060 avl_destroy(&msp->ms_size_tree);
1050 1061 }
1051 1062
1052 1063 static void
1053 1064 metaslab_rt_add(range_tree_t *rt, range_seg_t *rs, void *arg)
1054 1065 {
1055 1066 metaslab_t *msp = arg;
1056 1067
1057 1068 ASSERT3P(rt->rt_arg, ==, msp);
1058 1069 ASSERT3P(msp->ms_tree, ==, rt);
1059 1070 VERIFY(!msp->ms_condensing);
1060 1071 avl_add(&msp->ms_size_tree, rs);
1061 1072 }
1062 1073
1063 1074 static void
1064 1075 metaslab_rt_remove(range_tree_t *rt, range_seg_t *rs, void *arg)
1065 1076 {
1066 1077 metaslab_t *msp = arg;
1067 1078
1068 1079 ASSERT3P(rt->rt_arg, ==, msp);
1069 1080 ASSERT3P(msp->ms_tree, ==, rt);
1070 1081 VERIFY(!msp->ms_condensing);
1071 1082 avl_remove(&msp->ms_size_tree, rs);
1072 1083 }
1073 1084
1074 1085 static void
1075 1086 metaslab_rt_vacate(range_tree_t *rt, void *arg)
1076 1087 {
1077 1088 metaslab_t *msp = arg;
1078 1089
1079 1090 ASSERT3P(rt->rt_arg, ==, msp);
1080 1091 ASSERT3P(msp->ms_tree, ==, rt);
1081 1092
1082 1093 /*
1083 1094 * Normally one would walk the tree freeing nodes along the way.
1084 1095 * Since the nodes are shared with the range trees we can avoid
1085 1096 * walking all nodes and just reinitialize the avl tree. The nodes
1086 1097 * will be freed by the range tree, so we don't want to free them here.
1087 1098 */
1088 1099 avl_create(&msp->ms_size_tree, metaslab_rangesize_compare,
1089 1100 sizeof (range_seg_t), offsetof(range_seg_t, rs_pp_node));
1090 1101 }
1091 1102
1092 1103 static range_tree_ops_t metaslab_rt_ops = {
1093 1104 metaslab_rt_create,
1094 1105 metaslab_rt_destroy,
1095 1106 metaslab_rt_add,
1096 1107 metaslab_rt_remove,
1097 1108 metaslab_rt_vacate
1098 1109 };
1099 1110
1100 1111 /*
1101 1112 * ==========================================================================
1102 1113 * Common allocator routines
1103 1114 * ==========================================================================
1104 1115 */
1105 1116
1106 1117 /*
1107 1118 * Return the maximum contiguous segment within the metaslab.
1108 1119 */
1109 1120 uint64_t
1110 1121 metaslab_block_maxsize(metaslab_t *msp)
1111 1122 {
1112 1123 avl_tree_t *t = &msp->ms_size_tree;
1113 1124 range_seg_t *rs;
1114 1125
1115 1126 if (t == NULL || (rs = avl_last(t)) == NULL)
1116 1127 return (0ULL);
1117 1128
1118 1129 return (rs->rs_end - rs->rs_start);
1119 1130 }
1120 1131
1121 1132 static range_seg_t *
1122 1133 metaslab_block_find(avl_tree_t *t, uint64_t start, uint64_t size)
1123 1134 {
1124 1135 range_seg_t *rs, rsearch;
1125 1136 avl_index_t where;
1126 1137
1127 1138 rsearch.rs_start = start;
1128 1139 rsearch.rs_end = start + size;
1129 1140
1130 1141 rs = avl_find(t, &rsearch, &where);
1131 1142 if (rs == NULL) {
1132 1143 rs = avl_nearest(t, where, AVL_AFTER);
1133 1144 }
1134 1145
1135 1146 return (rs);
1136 1147 }
1137 1148
1138 1149 /*
1139 1150 * This is a helper function that can be used by the allocator to find
1140 1151 * a suitable block to allocate. This will search the specified AVL
1141 1152 * tree looking for a block that matches the specified criteria.
1142 1153 */
1143 1154 static uint64_t
1144 -metaslab_block_picker(avl_tree_t *t, uint64_t *cursor, uint64_t size,
1145 - uint64_t align)
1155 +metaslab_block_picker(metaslab_t *msp, avl_tree_t *t, uint64_t *cursor,
1156 + uint64_t size, uint64_t align)
1146 1157 {
1147 1158 range_seg_t *rs = metaslab_block_find(t, *cursor, size);
1148 1159
1149 - while (rs != NULL) {
1160 + for (; rs != NULL; rs = AVL_NEXT(t, rs)) {
1150 1161 uint64_t offset = P2ROUNDUP(rs->rs_start, align);
1151 1162
1152 - if (offset + size <= rs->rs_end) {
1163 + if (offset + size <= rs->rs_end &&
1164 + !metaslab_check_trim_conflict(msp, &offset, size, align,
1165 + rs->rs_end)) {
1153 1166 *cursor = offset + size;
1154 1167 return (offset);
1155 1168 }
1156 - rs = AVL_NEXT(t, rs);
1157 1169 }
1158 1170
1159 1171 /*
1160 1172 * If we know we've searched the whole map (*cursor == 0), give up.
1161 1173 * Otherwise, reset the cursor to the beginning and try again.
1162 1174 */
1163 1175 if (*cursor == 0)
1164 1176 return (-1ULL);
1165 1177
1166 1178 *cursor = 0;
1167 - return (metaslab_block_picker(t, cursor, size, align));
1179 + return (metaslab_block_picker(msp, t, cursor, size, align));
1168 1180 }
1169 1181
1170 1182 /*
1171 1183 * ==========================================================================
1172 1184 * The first-fit block allocator
1173 1185 * ==========================================================================
1174 1186 */
1175 1187 static uint64_t
1176 1188 metaslab_ff_alloc(metaslab_t *msp, uint64_t size)
1177 1189 {
1178 1190 /*
1179 1191 * Find the largest power of 2 block size that evenly divides the
1180 1192 * requested size. This is used to try to allocate blocks with similar
1181 1193 * alignment from the same area of the metaslab (i.e. same cursor
1182 1194 * bucket) but it does not guarantee that other allocations sizes
1183 1195 * may exist in the same region.
1184 1196 */
1185 1197 uint64_t align = size & -size;
1186 1198 uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
1187 1199 avl_tree_t *t = &msp->ms_tree->rt_root;
1188 1200
1189 - return (metaslab_block_picker(t, cursor, size, align));
1201 + return (metaslab_block_picker(msp, t, cursor, size, align));
1190 1202 }
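/*
 * Example of the alignment bucketing used by metaslab_ff_alloc() above
 * (and by metaslab_df_alloc() below): for a 24K request, size & -size
 * isolates the lowest set bit, so align = 8K (0x2000) and the search
 * resumes from the cursor slot ms_lbas[highbit64(0x2000) - 1], which is
 * shared by every request whose size is an odd multiple of 8K.  Candidate
 * offsets are rounded up to that alignment with P2ROUNDUP() before the
 * picker checks that the segment still fits.
 */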
1191 1203
1192 1204 static metaslab_ops_t metaslab_ff_ops = {
1193 1205 metaslab_ff_alloc
1194 1206 };
1195 1207
1196 1208 /*
1197 1209 * ==========================================================================
1198 1210 * Dynamic block allocator -
1199 1211 * Uses the first fit allocation scheme until space get low and then
1200 1212 * adjusts to a best fit allocation method. Uses metaslab_df_alloc_threshold
1201 1213 * and metaslab_df_free_pct to determine when to switch the allocation scheme.
1202 1214 * ==========================================================================
1203 1215 */
1204 1216 static uint64_t
1205 1217 metaslab_df_alloc(metaslab_t *msp, uint64_t size)
1206 1218 {
1207 1219 /*
1208 1220 * Find the largest power of 2 block size that evenly divides the
1209 1221 * requested size. This is used to try to allocate blocks with similar
1210 1222 * alignment from the same area of the metaslab (i.e. same cursor
1211 1223 * bucket) but it does not guarantee that other allocations sizes
1212 1224 * may exist in the same region.
1213 1225 */
1214 1226 uint64_t align = size & -size;
1215 1227 uint64_t *cursor = &msp->ms_lbas[highbit64(align) - 1];
1216 1228 range_tree_t *rt = msp->ms_tree;
1217 1229 avl_tree_t *t = &rt->rt_root;
1218 1230 uint64_t max_size = metaslab_block_maxsize(msp);
1219 1231 int free_pct = range_tree_space(rt) * 100 / msp->ms_size;
1220 1232
1221 1233 ASSERT(MUTEX_HELD(&msp->ms_lock));
1222 1234 ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
1223 1235
1224 1236 if (max_size < size)
1225 1237 return (-1ULL);
1226 1238
|
↓ open down ↓ |
27 lines elided |
↑ open up ↑ |
1227 1239 /*
1228 1240 * If we're running low on space switch to using the size
1229 1241 * sorted AVL tree (best-fit).
1230 1242 */
1231 1243 if (max_size < metaslab_df_alloc_threshold ||
1232 1244 free_pct < metaslab_df_free_pct) {
1233 1245 t = &msp->ms_size_tree;
1234 1246 *cursor = 0;
1235 1247 }
1236 1248
1237 - return (metaslab_block_picker(t, cursor, size, 1ULL));
1249 + return (metaslab_block_picker(msp, t, cursor, size, 1ULL));
1238 1250 }
1239 1251
1240 1252 static metaslab_ops_t metaslab_df_ops = {
1241 1253 metaslab_df_alloc
1242 1254 };
1243 1255
1244 1256 /*
1245 1257 * ==========================================================================
1246 1258 * Cursor fit block allocator -
1247 1259 * Select the largest region in the metaslab, set the cursor to the beginning
1248 1260 * of the range and the cursor_end to the end of the range. As allocations
1249 1261 * are made advance the cursor. Continue allocating from the cursor until
1250 1262 * the range is exhausted and then find a new range.
1251 1263 * ==========================================================================
1252 1264 */
1253 1265 static uint64_t
1254 1266 metaslab_cf_alloc(metaslab_t *msp, uint64_t size)
1255 1267 {
1256 1268 range_tree_t *rt = msp->ms_tree;
1257 1269 avl_tree_t *t = &msp->ms_size_tree;
1258 1270 uint64_t *cursor = &msp->ms_lbas[0];
1259 1271 uint64_t *cursor_end = &msp->ms_lbas[1];
1260 1272 uint64_t offset = 0;
1261 1273
1262 1274 ASSERT(MUTEX_HELD(&msp->ms_lock));
1263 1275 ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&rt->rt_root));
1264 1276
1265 1277 ASSERT3U(*cursor_end, >=, *cursor);
1266 1278
1267 1279 if ((*cursor + size) > *cursor_end) {
1268 1280 range_seg_t *rs;
1269 -
1270 - rs = avl_last(&msp->ms_size_tree);
1271 - if (rs == NULL || (rs->rs_end - rs->rs_start) < size)
1281 + for (rs = avl_last(&msp->ms_size_tree);
1282 + rs != NULL && rs->rs_end - rs->rs_start >= size;
1283 + rs = AVL_PREV(&msp->ms_size_tree, rs)) {
1284 + *cursor = rs->rs_start;
1285 + *cursor_end = rs->rs_end;
1286 + if (!metaslab_check_trim_conflict(msp, cursor, size,
1287 + 1, *cursor_end)) {
1288 + /* segment appears to be acceptable */
1289 + break;
1290 + }
1291 + }
1292 + if (rs == NULL || rs->rs_end - rs->rs_start < size)
1272 1293 return (-1ULL);
1273 -
1274 - *cursor = rs->rs_start;
1275 - *cursor_end = rs->rs_end;
1276 1294 }
1277 1295
1278 1296 offset = *cursor;
1279 1297 *cursor += size;
1280 1298
1281 1299 return (offset);
1282 1300 }
1283 1301
1284 1302 static metaslab_ops_t metaslab_cf_ops = {
1285 1303 metaslab_cf_alloc
1286 1304 };
1287 1305
1288 1306 /*
1289 1307 * ==========================================================================
1290 1308 * New dynamic fit allocator -
1291 1309 * Select a region that is large enough to allocate 2^metaslab_ndf_clump_shift
1292 1310 * contiguous blocks. If no region is found then just use the largest segment
1293 1311 * that remains.
1294 1312 * ==========================================================================
1295 1313 */
1296 1314
1297 1315 /*
1298 1316 * Determines desired number of contiguous blocks (2^metaslab_ndf_clump_shift)
1299 1317 * to request from the allocator.
1300 1318 */
1301 1319 uint64_t metaslab_ndf_clump_shift = 4;
|
↓ open down ↓ |
16 lines elided |
↑ open up ↑ |
1302 1320
1303 1321 static uint64_t
1304 1322 metaslab_ndf_alloc(metaslab_t *msp, uint64_t size)
1305 1323 {
1306 1324 avl_tree_t *t = &msp->ms_tree->rt_root;
1307 1325 avl_index_t where;
1308 1326 range_seg_t *rs, rsearch;
1309 1327 uint64_t hbit = highbit64(size);
1310 1328 uint64_t *cursor = &msp->ms_lbas[hbit - 1];
1311 1329 uint64_t max_size = metaslab_block_maxsize(msp);
1330 + /* mutable copy for adjustment by metaslab_check_trim_conflict */
1331 + uint64_t adjustable_start;
1312 1332
1313 1333 ASSERT(MUTEX_HELD(&msp->ms_lock));
1314 1334 ASSERT3U(avl_numnodes(t), ==, avl_numnodes(&msp->ms_size_tree));
1315 1335
1316 1336 if (max_size < size)
1317 1337 return (-1ULL);
1318 1338
1319 1339 rsearch.rs_start = *cursor;
1320 1340 rsearch.rs_end = *cursor + size;
1321 1341
1322 1342 rs = avl_find(t, &rsearch, &where);
1323 - if (rs == NULL || (rs->rs_end - rs->rs_start) < size) {
1343 + if (rs != NULL)
1344 + adjustable_start = rs->rs_start;
1345 + if (rs == NULL || rs->rs_end - adjustable_start < size ||
1346 + metaslab_check_trim_conflict(msp, &adjustable_start, size, 1,
1347 + rs->rs_end)) {
1348 + /* segment not usable, try the largest remaining one */
1324 1349 t = &msp->ms_size_tree;
1325 1350
1326 1351 rsearch.rs_start = 0;
1327 1352 rsearch.rs_end = MIN(max_size,
1328 1353 1ULL << (hbit + metaslab_ndf_clump_shift));
1329 1354 rs = avl_find(t, &rsearch, &where);
1330 1355 if (rs == NULL)
1331 1356 rs = avl_nearest(t, where, AVL_AFTER);
1332 1357 ASSERT(rs != NULL);
1358 + adjustable_start = rs->rs_start;
1359 + if (rs->rs_end - adjustable_start < size ||
1360 + metaslab_check_trim_conflict(msp, &adjustable_start,
1361 + size, 1, rs->rs_end)) {
1362 + /* even largest remaining segment not usable */
1363 + return (-1ULL);
1364 + }
1333 1365 }
1334 1366
1335 - if ((rs->rs_end - rs->rs_start) >= size) {
1336 - *cursor = rs->rs_start + size;
1337 - return (rs->rs_start);
1338 - }
1339 - return (-1ULL);
1367 + *cursor = adjustable_start + size;
1368 +	return (adjustable_start);
1340 1369 }
1341 1370
1342 1371 static metaslab_ops_t metaslab_ndf_ops = {
1343 1372 metaslab_ndf_alloc
1344 1373 };
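A hedged, user-space approximation of the dynamic-fit idea (hypothetical names; a linear search stands in for the offset- and size-ordered AVL lookups, and the TRIM check is omitted):

#include <stdint.h>
#include <stddef.h>

struct seg { uint64_t start, end; };

#define NDF_CLUMP_SHIFT 4               /* like metaslab_ndf_clump_shift */

static uint64_t ndf_cursor[64];         /* one cursor per size class */

/* floor(log2(v)); the kernel uses highbit64(size) - 1 for the same index */
static int
size_class(uint64_t v)
{
        int h = 0;
        while (v >>= 1)
                h++;
        return (h);
}

uint64_t
ndf_alloc(const struct seg *segs, size_t nsegs, uint64_t size)
{
        uint64_t *cursor = &ndf_cursor[size_class(size)];

        /* First try to keep allocating at the per-class cursor. */
        for (size_t i = 0; i < nsegs; i++) {
                if (segs[i].start <= *cursor &&
                    *cursor + size <= segs[i].end) {
                        *cursor += size;
                        return (*cursor - size);
                }
        }

        /*
         * Otherwise prefer a region big enough for a clump of
         * 2^NDF_CLUMP_SHIFT blocks, falling back to any segment
         * that fits a single block.
         */
        const struct seg *fallback = NULL;
        for (size_t i = 0; i < nsegs; i++) {
                uint64_t len = segs[i].end - segs[i].start;
                if (len >= (size << NDF_CLUMP_SHIFT)) {
                        *cursor = segs[i].start + size;
                        return (segs[i].start);
                }
                if (len >= size && fallback == NULL)
                        fallback = &segs[i];
        }
        if (fallback == NULL)
                return (UINT64_MAX);
        *cursor = fallback->start + size;
        return (fallback->start);
}

Note that this allocator is not the default; zfs_metaslab_ops below points at the df allocator.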
1345 1374
1346 1375 metaslab_ops_t *zfs_metaslab_ops = &metaslab_df_ops;
1347 1376
1348 1377 /*
1349 1378 * ==========================================================================
1350 1379 * Metaslabs
1351 1380 * ==========================================================================
1352 1381 */
1353 1382
1354 1383 /*
1355 1384 * Wait for any in-progress metaslab loads to complete.
1356 1385 */
1357 1386 void
1358 1387 metaslab_load_wait(metaslab_t *msp)
1359 1388 {
1360 1389 ASSERT(MUTEX_HELD(&msp->ms_lock));
1361 1390
1362 1391 while (msp->ms_loading) {
1363 1392 ASSERT(!msp->ms_loaded);
1364 1393 cv_wait(&msp->ms_load_cv, &msp->ms_lock);
1365 1394 }
1366 1395 }
1367 1396
1368 1397 int
1369 1398 metaslab_load(metaslab_t *msp)
1370 1399 {
1371 1400 int error = 0;
1372 1401 boolean_t success = B_FALSE;
1373 1402
1374 1403 ASSERT(MUTEX_HELD(&msp->ms_lock));
1375 1404 ASSERT(!msp->ms_loaded);
1376 1405 ASSERT(!msp->ms_loading);
1377 1406
1378 1407 msp->ms_loading = B_TRUE;
1379 - /*
1380 - * Nobody else can manipulate a loading metaslab, so it's now safe
1381 - * to drop the lock. This way we don't have to hold the lock while
1382 - * reading the spacemap from disk.
1383 - */
1384 - mutex_exit(&msp->ms_lock);
1385 1408
1386 1409 /*
1387 1410 * If the space map has not been allocated yet, then treat
1388 1411 * all the space in the metaslab as free and add it to the
1389 1412 * ms_tree.
1390 1413 */
1391 1414 if (msp->ms_sm != NULL)
1392 1415 error = space_map_load(msp->ms_sm, msp->ms_tree, SM_FREE);
1393 1416 else
1394 1417 range_tree_add(msp->ms_tree, msp->ms_start, msp->ms_size);
1395 1418
1396 1419 success = (error == 0);
1397 -
1398 - mutex_enter(&msp->ms_lock);
1399 1420 msp->ms_loading = B_FALSE;
1400 1421
1401 1422 if (success) {
1402 1423 ASSERT3P(msp->ms_group, !=, NULL);
1403 1424 msp->ms_loaded = B_TRUE;
1404 1425
1405 1426 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1406 1427 range_tree_walk(msp->ms_defertree[t],
1407 1428 range_tree_remove, msp->ms_tree);
1429 + range_tree_walk(msp->ms_defertree[t],
1430 + metaslab_trim_remove, msp);
1408 1431 }
1409 1432 msp->ms_max_size = metaslab_block_maxsize(msp);
1410 1433 }
1411 1434 cv_broadcast(&msp->ms_load_cv);
1412 1435 return (error);
1413 1436 }
1414 1437
1415 1438 void
1416 1439 metaslab_unload(metaslab_t *msp)
1417 1440 {
1418 1441 ASSERT(MUTEX_HELD(&msp->ms_lock));
1419 1442 range_tree_vacate(msp->ms_tree, NULL, NULL);
1420 1443 msp->ms_loaded = B_FALSE;
1421 1444 msp->ms_weight &= ~METASLAB_ACTIVE_MASK;
1422 1445 msp->ms_max_size = 0;
1423 1446 }
1424 1447
1425 1448 int
1426 1449 metaslab_init(metaslab_group_t *mg, uint64_t id, uint64_t object, uint64_t txg,
1427 1450 metaslab_t **msp)
1428 1451 {
1429 1452 vdev_t *vd = mg->mg_vd;
1430 1453 objset_t *mos = vd->vdev_spa->spa_meta_objset;
1431 1454 metaslab_t *ms;
1432 1455 int error;
1433 1456
1434 1457 ms = kmem_zalloc(sizeof (metaslab_t), KM_SLEEP);
1435 1458 mutex_init(&ms->ms_lock, NULL, MUTEX_DEFAULT, NULL);
1436 - mutex_init(&ms->ms_sync_lock, NULL, MUTEX_DEFAULT, NULL);
1437 1459 cv_init(&ms->ms_load_cv, NULL, CV_DEFAULT, NULL);
1460 + cv_init(&ms->ms_trim_cv, NULL, CV_DEFAULT, NULL);
1438 1461 ms->ms_id = id;
1439 1462 ms->ms_start = id << vd->vdev_ms_shift;
1440 1463 ms->ms_size = 1ULL << vd->vdev_ms_shift;
1441 1464
1442 1465 /*
1443 1466 * We only open space map objects that already exist. All others
1444 1467  * will be opened when we finally allocate an object for them.
1445 1468 */
1446 1469 if (object != 0) {
1447 1470 error = space_map_open(&ms->ms_sm, mos, object, ms->ms_start,
1448 - ms->ms_size, vd->vdev_ashift);
1471 + ms->ms_size, vd->vdev_ashift, &ms->ms_lock);
1449 1472
1450 1473 if (error != 0) {
1451 1474 kmem_free(ms, sizeof (metaslab_t));
1452 1475 return (error);
1453 1476 }
1454 1477
1455 1478 ASSERT(ms->ms_sm != NULL);
1456 1479 }
1457 1480
1481 + ms->ms_cur_ts = metaslab_new_trimset(0, &ms->ms_lock);
1482 +
1458 1483 /*
1459 1484 * We create the main range tree here, but we don't create the
1460 1485 * other range trees until metaslab_sync_done(). This serves
1461 1486 * two purposes: it allows metaslab_sync_done() to detect the
1462 1487  * addition of new space; and, for debugging, it ensures that we'd
1463 1488  * take a data fault on any attempt to use this metaslab before it's ready.
1464 1489 */
1465 - ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms);
1490 + ms->ms_tree = range_tree_create(&metaslab_rt_ops, ms, &ms->ms_lock);
1466 1491 metaslab_group_add(mg, ms);
1467 1492
1468 1493 metaslab_set_fragmentation(ms);
1469 1494
1470 1495 /*
1471 1496 * If we're opening an existing pool (txg == 0) or creating
1472 1497 * a new one (txg == TXG_INITIAL), all space is available now.
1473 1498 * If we're adding space to an existing pool, the new space
1474 1499 * does not become available until after this txg has synced.
1475 1500 * The metaslab's weight will also be initialized when we sync
1476 1501 * out this txg. This ensures that we don't attempt to allocate
1477 1502 * from it before we have initialized it completely.
1478 1503 */
1479 1504 if (txg <= TXG_INITIAL)
1480 1505 metaslab_sync_done(ms, 0);
1481 1506
1482 1507 /*
1483 1508 * If metaslab_debug_load is set and we're initializing a metaslab
1484 1509  * that has an allocated space map object, then load its space
1485 1510  * map so that we can verify frees.
1486 1511 */
1487 1512 if (metaslab_debug_load && ms->ms_sm != NULL) {
1488 1513 mutex_enter(&ms->ms_lock);
1489 1514 VERIFY0(metaslab_load(ms));
1490 1515 mutex_exit(&ms->ms_lock);
1491 1516 }
1492 1517
1493 1518 if (txg != 0) {
1494 1519 vdev_dirty(vd, 0, NULL, txg);
1495 1520 vdev_dirty(vd, VDD_METASLAB, ms, txg);
1496 1521 }
1497 1522
1498 1523 *msp = ms;
1499 1524
1500 1525 return (0);
1501 1526 }
1502 1527
1503 1528 void
1504 1529 metaslab_fini(metaslab_t *msp)
1505 1530 {
1506 1531 metaslab_group_t *mg = msp->ms_group;
1507 1532
1508 1533 metaslab_group_remove(mg, msp);
1509 1534
1510 1535 mutex_enter(&msp->ms_lock);
1511 1536 VERIFY(msp->ms_group == NULL);
1512 1537 vdev_space_update(mg->mg_vd, -space_map_allocated(msp->ms_sm),
1513 1538 0, -msp->ms_size);
1514 1539 space_map_close(msp->ms_sm);
1515 1540
1516 1541 metaslab_unload(msp);
1517 1542 range_tree_destroy(msp->ms_tree);
1518 1543 range_tree_destroy(msp->ms_freeingtree);
1519 1544 range_tree_destroy(msp->ms_freedtree);
1520 1545
1521 1546 for (int t = 0; t < TXG_SIZE; t++) {
1522 1547 range_tree_destroy(msp->ms_alloctree[t]);
1523 1548 }
1524 1549
1525 1550 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
1526 1551 range_tree_destroy(msp->ms_defertree[t]);
1527 1552 }
1528 1553
1554 + metaslab_free_trimset(msp->ms_cur_ts);
1555 + if (msp->ms_prev_ts)
1556 + metaslab_free_trimset(msp->ms_prev_ts);
1557 + ASSERT3P(msp->ms_trimming_ts, ==, NULL);
1558 +
1529 1559 ASSERT0(msp->ms_deferspace);
1530 1560
1531 1561 mutex_exit(&msp->ms_lock);
1532 1562 cv_destroy(&msp->ms_load_cv);
1563 + cv_destroy(&msp->ms_trim_cv);
1533 1564 mutex_destroy(&msp->ms_lock);
1534 - mutex_destroy(&msp->ms_sync_lock);
1535 1565
1536 1566 kmem_free(msp, sizeof (metaslab_t));
1537 1567 }
1538 1568
1539 1569 #define FRAGMENTATION_TABLE_SIZE 17
1540 1570
1541 1571 /*
1542 1572 * This table defines a segment size based fragmentation metric that will
1543 1573 * allow each metaslab to derive its own fragmentation value. This is done
1544 1574 * by calculating the space in each bucket of the spacemap histogram and
1545 1575  * multiplying that by the fragmentation metric in this table. Doing
1546 1576 * this for all buckets and dividing it by the total amount of free
1547 1577 * space in this metaslab (i.e. the total free space in all buckets) gives
1548 1578 * us the fragmentation metric. This means that a high fragmentation metric
1549 1579 * equates to most of the free space being comprised of small segments.
1550 1580 * Conversely, if the metric is low, then most of the free space is in
1551 1581 * large segments. A 10% change in fragmentation equates to approximately
1552 1582 * double the number of segments.
1553 1583 *
1554 1584 * This table defines 0% fragmented space using 16MB segments. Testing has
1555 1585 * shown that segments that are greater than or equal to 16MB do not suffer
1556 1586 * from drastic performance problems. Using this value, we derive the rest
1557 1587 * of the table. Since the fragmentation value is never stored on disk, it
1558 1588 * is possible to change these calculations in the future.
1559 1589 */
1560 1590 int zfs_frag_table[FRAGMENTATION_TABLE_SIZE] = {
1561 1591 100, /* 512B */
1562 1592 100, /* 1K */
1563 1593 98, /* 2K */
1564 1594 95, /* 4K */
1565 1595 90, /* 8K */
1566 1596 80, /* 16K */
1567 1597 70, /* 32K */
1568 1598 60, /* 64K */
1569 1599 50, /* 128K */
1570 1600 40, /* 256K */
1571 1601 30, /* 512K */
1572 1602 20, /* 1M */
1573 1603 15, /* 2M */
1574 1604 10, /* 4M */
1575 1605 5, /* 8M */
1576 1606 0 /* 16M */
1577 1607 };
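As an illustration of how such a table could be applied, a small user-space sketch of the weighted average described above (it assumes the first bucket corresponds to the 512B row; the kernel offsets the index by sm_shift - SPA_MINBLOCKSHIFT and reads the space map histogram):

#include <stdint.h>

#define FRAG_TABLE_SIZE 17

/*
 * hist[i] = number of free segments in bucket i, where bucket i covers
 * segments of roughly (512 << i) bytes. Returns a fragmentation
 * percentage in [0, 100], or 0 if the metaslab is empty.
 */
uint64_t
fragmentation_pct(const uint64_t hist[FRAG_TABLE_SIZE],
    const int frag_table[FRAG_TABLE_SIZE])
{
        uint64_t frag = 0, total = 0;

        for (int i = 0; i < FRAG_TABLE_SIZE; i++) {
                if (hist[i] == 0)
                        continue;
                /* space represented by this bucket */
                uint64_t space = hist[i] << (i + 9);    /* 512B << i */
                total += space;
                frag += space * (uint64_t)frag_table[i];
        }
        return (total > 0 ? frag / total : 0);
}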
1578 1608
1579 1609 /*
1580 1610  * Calculate the metaslab's fragmentation metric and store it in
1581 1611  * ms_fragmentation. A value of ZFS_FRAG_INVALID means that the metaslab
1582 1612  * has not been upgraded and does not support this metric. Otherwise,
1583 1613  * the value should be in the range [0, 100].
1584 1614 */
1585 1615 static void
1586 1616 metaslab_set_fragmentation(metaslab_t *msp)
1587 1617 {
1588 1618 spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
1589 1619 uint64_t fragmentation = 0;
1590 1620 uint64_t total = 0;
1591 1621 boolean_t feature_enabled = spa_feature_is_enabled(spa,
1592 1622 SPA_FEATURE_SPACEMAP_HISTOGRAM);
1593 1623
1594 1624 if (!feature_enabled) {
1595 1625 msp->ms_fragmentation = ZFS_FRAG_INVALID;
1596 1626 return;
1597 1627 }
1598 1628
1599 1629 /*
1600 1630 * A null space map means that the entire metaslab is free
1601 1631 * and thus is not fragmented.
1602 1632 */
1603 1633 if (msp->ms_sm == NULL) {
1604 1634 msp->ms_fragmentation = 0;
1605 1635 return;
1606 1636 }
1607 1637
1608 1638 /*
1609 1639 * If this metaslab's space map has not been upgraded, flag it
1610 1640 * so that we upgrade next time we encounter it.
1611 1641 */
1612 1642 if (msp->ms_sm->sm_dbuf->db_size != sizeof (space_map_phys_t)) {
1613 1643 uint64_t txg = spa_syncing_txg(spa);
1614 1644 vdev_t *vd = msp->ms_group->mg_vd;
1615 1645
1616 1646 /*
1617 1647 * If we've reached the final dirty txg, then we must
1618 1648 * be shutting down the pool. We don't want to dirty
1619 1649 * any data past this point so skip setting the condense
1620 1650 * flag. We can retry this action the next time the pool
1621 1651 * is imported.
1622 1652 */
1623 1653 if (spa_writeable(spa) && txg < spa_final_dirty_txg(spa)) {
1624 1654 msp->ms_condense_wanted = B_TRUE;
1625 1655 vdev_dirty(vd, VDD_METASLAB, msp, txg + 1);
1626 1656 spa_dbgmsg(spa, "txg %llu, requesting force condense: "
1627 1657 "ms_id %llu, vdev_id %llu", txg, msp->ms_id,
1628 1658 vd->vdev_id);
1629 1659 }
1630 1660 msp->ms_fragmentation = ZFS_FRAG_INVALID;
1631 1661 return;
1632 1662 }
1633 1663
1634 1664 for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) {
1635 1665 uint64_t space = 0;
1636 1666 uint8_t shift = msp->ms_sm->sm_shift;
1637 1667
1638 1668 int idx = MIN(shift - SPA_MINBLOCKSHIFT + i,
1639 1669 FRAGMENTATION_TABLE_SIZE - 1);
1640 1670
1641 1671 if (msp->ms_sm->sm_phys->smp_histogram[i] == 0)
1642 1672 continue;
1643 1673
1644 1674 space = msp->ms_sm->sm_phys->smp_histogram[i] << (i + shift);
1645 1675 total += space;
1646 1676
1647 1677 ASSERT3U(idx, <, FRAGMENTATION_TABLE_SIZE);
1648 1678 fragmentation += space * zfs_frag_table[idx];
1649 1679 }
1650 1680
1651 1681 if (total > 0)
1652 1682 fragmentation /= total;
1653 1683 ASSERT3U(fragmentation, <=, 100);
1654 1684
1655 1685 msp->ms_fragmentation = fragmentation;
1656 1686 }
1657 1687
1658 1688 /*
1659 1689 * Compute a weight -- a selection preference value -- for the given metaslab.
1660 1690 * This is based on the amount of free space, the level of fragmentation,
1661 1691 * the LBA range, and whether the metaslab is loaded.
1662 1692 */
1663 1693 static uint64_t
1664 1694 metaslab_space_weight(metaslab_t *msp)
1665 1695 {
1666 1696 metaslab_group_t *mg = msp->ms_group;
1667 1697 vdev_t *vd = mg->mg_vd;
1668 1698 uint64_t weight, space;
1669 1699
1670 1700 ASSERT(MUTEX_HELD(&msp->ms_lock));
1671 1701 ASSERT(!vd->vdev_removing);
1672 1702
1673 1703 /*
1674 1704 * The baseline weight is the metaslab's free space.
1675 1705 */
1676 1706 space = msp->ms_size - space_map_allocated(msp->ms_sm);
1677 1707
1678 1708 if (metaslab_fragmentation_factor_enabled &&
1679 1709 msp->ms_fragmentation != ZFS_FRAG_INVALID) {
1680 1710 /*
1681 1711 * Use the fragmentation information to inversely scale
1682 1712 * down the baseline weight. We need to ensure that we
1683 1713 * don't exclude this metaslab completely when it's 100%
1684 1714 * fragmented. To avoid this we reduce the fragmented value
1685 1715 * by 1.
1686 1716 */
1687 1717 space = (space * (100 - (msp->ms_fragmentation - 1))) / 100;
1688 1718
1689 1719 /*
1690 1720 * If space < SPA_MINBLOCKSIZE, then we will not allocate from
1691 1721 * this metaslab again. The fragmentation metric may have
1692 1722 * decreased the space to something smaller than
1693 1723 * SPA_MINBLOCKSIZE, so reset the space to SPA_MINBLOCKSIZE
1694 1724 * so that we can consume any remaining space.
1695 1725 */
1696 1726 if (space > 0 && space < SPA_MINBLOCKSIZE)
1697 1727 space = SPA_MINBLOCKSIZE;
1698 1728 }
1699 1729 weight = space;
1700 1730
1701 1731 /*
1702 1732 * Modern disks have uniform bit density and constant angular velocity.
1703 1733 * Therefore, the outer recording zones are faster (higher bandwidth)
1704 1734 * than the inner zones by the ratio of outer to inner track diameter,
1705 1735 * which is typically around 2:1. We account for this by assigning
1706 1736 * higher weight to lower metaslabs (multiplier ranging from 2x to 1x).
1707 1737 * In effect, this means that we'll select the metaslab with the most
1708 1738 * free bandwidth rather than simply the one with the most free space.
1709 1739 */
1710 1740 if (metaslab_lba_weighting_enabled) {
1711 1741 weight = 2 * weight - (msp->ms_id * weight) / vd->vdev_ms_count;
1712 1742 ASSERT(weight >= space && weight <= 2 * space);
1713 1743 }
1714 1744
1715 1745 /*
1716 1746 * If this metaslab is one we're actively using, adjust its
1717 1747 * weight to make it preferable to any inactive metaslab so
1718 1748 * we'll polish it off. If the fragmentation on this metaslab
1719 1749  * has exceeded our threshold, then don't mark it active.
1720 1750 */
1721 1751 if (msp->ms_loaded && msp->ms_fragmentation != ZFS_FRAG_INVALID &&
1722 1752 msp->ms_fragmentation <= zfs_metaslab_fragmentation_threshold) {
1723 1753 weight |= (msp->ms_weight & METASLAB_ACTIVE_MASK);
1724 1754 }
1725 1755
1726 1756 WEIGHT_SET_SPACEBASED(weight);
1727 1757 return (weight);
1728 1758 }
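To make the 2x-to-1x LBA multiplier concrete, a small stand-alone example (the 200-metaslab vdev and 1 GiB of free space are assumed values, not taken from the code):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t space = 1ULL << 30;    /* 1 GiB free, for example */
        uint64_t ms_count = 200;        /* assumed metaslabs on the vdev */
        uint64_t ids[] = { 0, 100, 199 };

        for (int i = 0; i < 3; i++) {
                uint64_t ms_id = ids[i];
                /* same formula as the weighting above */
                uint64_t weight = 2 * space - (ms_id * space) / ms_count;
                printf("ms_id %3llu -> %.3fx\n",
                    (unsigned long long)ms_id, (double)weight / (double)space);
        }
        /* prints roughly 2.000x, 1.500x and 1.005x */
        return (0);
}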
1729 1759
1730 1760 /*
1731 1761 * Return the weight of the specified metaslab, according to the segment-based
1732 1762 * weighting algorithm. The metaslab must be loaded. This function can
1733 1763 * be called within a sync pass since it relies only on the metaslab's
1734 1764 * range tree which is always accurate when the metaslab is loaded.
1735 1765 */
1736 1766 static uint64_t
1737 1767 metaslab_weight_from_range_tree(metaslab_t *msp)
1738 1768 {
1739 1769 uint64_t weight = 0;
1740 1770 uint32_t segments = 0;
1741 1771
1742 1772 ASSERT(msp->ms_loaded);
1743 1773
1744 1774 for (int i = RANGE_TREE_HISTOGRAM_SIZE - 1; i >= SPA_MINBLOCKSHIFT;
1745 1775 i--) {
1746 1776 uint8_t shift = msp->ms_group->mg_vd->vdev_ashift;
1747 1777 int max_idx = SPACE_MAP_HISTOGRAM_SIZE + shift - 1;
1748 1778
1749 1779 segments <<= 1;
1750 1780 segments += msp->ms_tree->rt_histogram[i];
1751 1781
1752 1782 /*
1753 1783 * The range tree provides more precision than the space map
1754 1784 * and must be downgraded so that all values fit within the
1755 1785 * space map's histogram. This allows us to compare loaded
1756 1786 * vs. unloaded metaslabs to determine which metaslab is
1757 1787 * considered "best".
1758 1788 */
1759 1789 if (i > max_idx)
1760 1790 continue;
1761 1791
1762 1792 if (segments != 0) {
1763 1793 WEIGHT_SET_COUNT(weight, segments);
1764 1794 WEIGHT_SET_INDEX(weight, i);
1765 1795 WEIGHT_SET_ACTIVE(weight, 0);
1766 1796 break;
1767 1797 }
1768 1798 }
1769 1799 return (weight);
1770 1800 }
1771 1801
1772 1802 /*
1773 1803 * Calculate the weight based on the on-disk histogram. This should only
1774 1804 * be called after a sync pass has completely finished since the on-disk
1775 1805 * information is updated in metaslab_sync().
1776 1806 */
1777 1807 static uint64_t
1778 1808 metaslab_weight_from_spacemap(metaslab_t *msp)
1779 1809 {
1780 1810 uint64_t weight = 0;
1781 1811
1782 1812 for (int i = SPACE_MAP_HISTOGRAM_SIZE - 1; i >= 0; i--) {
1783 1813 if (msp->ms_sm->sm_phys->smp_histogram[i] != 0) {
1784 1814 WEIGHT_SET_COUNT(weight,
1785 1815 msp->ms_sm->sm_phys->smp_histogram[i]);
1786 1816 WEIGHT_SET_INDEX(weight, i +
1787 1817 msp->ms_sm->sm_shift);
1788 1818 WEIGHT_SET_ACTIVE(weight, 0);
1789 1819 break;
1790 1820 }
1791 1821 }
1792 1822 return (weight);
1793 1823 }
1794 1824
1795 1825 /*
1796 1826 * Compute a segment-based weight for the specified metaslab. The weight
1797 1827  * is determined by the highest bucket in the histogram. The information
1798 1828 * for the highest bucket is encoded into the weight value.
1799 1829 */
1800 1830 static uint64_t
1801 1831 metaslab_segment_weight(metaslab_t *msp)
1802 1832 {
1803 1833 metaslab_group_t *mg = msp->ms_group;
1804 1834 uint64_t weight = 0;
1805 1835 uint8_t shift = mg->mg_vd->vdev_ashift;
1806 1836
1807 1837 ASSERT(MUTEX_HELD(&msp->ms_lock));
1808 1838
1809 1839 /*
1810 1840 * The metaslab is completely free.
1811 1841 */
1812 1842 if (space_map_allocated(msp->ms_sm) == 0) {
1813 1843 int idx = highbit64(msp->ms_size) - 1;
1814 1844 int max_idx = SPACE_MAP_HISTOGRAM_SIZE + shift - 1;
1815 1845
1816 1846 if (idx < max_idx) {
1817 1847 WEIGHT_SET_COUNT(weight, 1ULL);
1818 1848 WEIGHT_SET_INDEX(weight, idx);
1819 1849 } else {
1820 1850 WEIGHT_SET_COUNT(weight, 1ULL << (idx - max_idx));
1821 1851 WEIGHT_SET_INDEX(weight, max_idx);
1822 1852 }
1823 1853 WEIGHT_SET_ACTIVE(weight, 0);
1824 1854 ASSERT(!WEIGHT_IS_SPACEBASED(weight));
1825 1855
1826 1856 return (weight);
1827 1857 }
1828 1858
1829 1859 ASSERT3U(msp->ms_sm->sm_dbuf->db_size, ==, sizeof (space_map_phys_t));
1830 1860
1831 1861 /*
1832 1862 * If the metaslab is fully allocated then just make the weight 0.
1833 1863 */
1834 1864 if (space_map_allocated(msp->ms_sm) == msp->ms_size)
1835 1865 return (0);
1836 1866 /*
1837 1867 * If the metaslab is already loaded, then use the range tree to
1838 1868 * determine the weight. Otherwise, we rely on the space map information
1839 1869 * to generate the weight.
1840 1870 */
1841 1871 if (msp->ms_loaded) {
1842 1872 weight = metaslab_weight_from_range_tree(msp);
1843 1873 } else {
1844 1874 weight = metaslab_weight_from_spacemap(msp);
1845 1875 }
1846 1876
1847 1877 /*
1848 1878 * If the metaslab was active the last time we calculated its weight
1849 1879 * then keep it active. We want to consume the entire region that
1850 1880 * is associated with this weight.
1851 1881 */
1852 1882 if (msp->ms_activation_weight != 0 && weight != 0)
1853 1883 WEIGHT_SET_ACTIVE(weight, WEIGHT_GET_ACTIVE(msp->ms_weight));
1854 1884 return (weight);
1855 1885 }
1856 1886
1857 1887 /*
1858 1888 * Determine if we should attempt to allocate from this metaslab. If the
1859 1889 * metaslab has a maximum size then we can quickly determine if the desired
1860 1890 * allocation size can be satisfied. Otherwise, if we're using segment-based
1861 1891 * weighting then we can determine the maximum allocation that this metaslab
1862 1892 * can accommodate based on the index encoded in the weight. If we're using
1863 1893 * space-based weights then rely on the entire weight (excluding the weight
1864 1894 * type bit).
1865 1895 */
1866 1896 boolean_t
1867 1897 metaslab_should_allocate(metaslab_t *msp, uint64_t asize)
1868 1898 {
1869 1899 boolean_t should_allocate;
1870 1900
1871 1901 if (msp->ms_max_size != 0)
1872 1902 return (msp->ms_max_size >= asize);
1873 1903
1874 1904 if (!WEIGHT_IS_SPACEBASED(msp->ms_weight)) {
1875 1905 /*
1876 1906 * The metaslab segment weight indicates segments in the
1877 1907 * range [2^i, 2^(i+1)), where i is the index in the weight.
1878 1908 * Since the asize might be in the middle of the range, we
1879 1909 * should attempt the allocation if asize < 2^(i+1).
1880 1910 */
1881 1911 should_allocate = (asize <
1882 1912 1ULL << (WEIGHT_GET_INDEX(msp->ms_weight) + 1));
1883 1913 } else {
1884 1914 should_allocate = (asize <=
1885 1915 (msp->ms_weight & ~METASLAB_WEIGHT_TYPE));
1886 1916 }
1887 1917 return (should_allocate);
1888 1918 }
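A short worked example of the segment-based branch above; the index is modeled here as a plain integer rather than the kernel's WEIGHT_GET_INDEX() macro:

#include <stdint.h>
#include <assert.h>

/*
 * A segment-based weight with index i means the metaslab is known to
 * contain free segments in [2^i, 2^(i+1)), so an allocation of asize
 * can be attempted whenever asize < 2^(i+1).
 */
static int
segment_weight_should_allocate(int index, uint64_t asize)
{
        return (asize < (1ULL << (index + 1)));
}

int
main(void)
{
        /* index 17 means segments of at least 128K and under 256K */
        assert(segment_weight_should_allocate(17, 128 * 1024));
        assert(segment_weight_should_allocate(17, 200 * 1024));
        assert(!segment_weight_should_allocate(17, 256 * 1024));
        return (0);
}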
1889 1919
1890 1920 static uint64_t
1891 1921 metaslab_weight(metaslab_t *msp)
1892 1922 {
1893 1923 vdev_t *vd = msp->ms_group->mg_vd;
1894 1924 spa_t *spa = vd->vdev_spa;
1895 1925 uint64_t weight;
1896 1926
1897 1927 ASSERT(MUTEX_HELD(&msp->ms_lock));
1898 1928
1899 1929 /*
1900 - * If this vdev is in the process of being removed, there is nothing
1930 + * This vdev is in the process of being removed so there is nothing
1901 1931 * for us to do here.
1902 1932 */
1903 - if (vd->vdev_removing)
1933 + if (vd->vdev_removing) {
1934 + ASSERT0(space_map_allocated(msp->ms_sm));
1935 + ASSERT0(vd->vdev_ms_shift);
1904 1936 return (0);
1937 + }
1905 1938
1906 1939 metaslab_set_fragmentation(msp);
1907 1940
1908 1941 /*
1909 1942 * Update the maximum size if the metaslab is loaded. This will
1910 1943 * ensure that we get an accurate maximum size if newly freed space
1911 1944 * has been added back into the free tree.
1912 1945 */
1913 1946 if (msp->ms_loaded)
1914 1947 msp->ms_max_size = metaslab_block_maxsize(msp);
1915 1948
1916 1949 /*
1917 1950 * Segment-based weighting requires space map histogram support.
1918 1951 */
1919 1952 if (zfs_metaslab_segment_weight_enabled &&
1920 1953 spa_feature_is_enabled(spa, SPA_FEATURE_SPACEMAP_HISTOGRAM) &&
1921 1954 (msp->ms_sm == NULL || msp->ms_sm->sm_dbuf->db_size ==
1922 1955 sizeof (space_map_phys_t))) {
1923 1956 weight = metaslab_segment_weight(msp);
1924 1957 } else {
1925 1958 weight = metaslab_space_weight(msp);
1926 1959 }
1927 1960 return (weight);
1928 1961 }
1929 1962
1930 1963 static int
1931 1964 metaslab_activate(metaslab_t *msp, uint64_t activation_weight)
1932 1965 {
1933 1966 ASSERT(MUTEX_HELD(&msp->ms_lock));
1934 1967
1935 1968 if ((msp->ms_weight & METASLAB_ACTIVE_MASK) == 0) {
1936 1969 metaslab_load_wait(msp);
1937 1970 if (!msp->ms_loaded) {
1938 1971 int error = metaslab_load(msp);
1939 1972 if (error) {
1940 1973 metaslab_group_sort(msp->ms_group, msp, 0);
1941 1974 return (error);
1942 1975 }
1943 1976 }
1944 1977
1945 1978 msp->ms_activation_weight = msp->ms_weight;
1946 1979 metaslab_group_sort(msp->ms_group, msp,
1947 1980 msp->ms_weight | activation_weight);
1948 1981 }
1949 1982 ASSERT(msp->ms_loaded);
1950 1983 ASSERT(msp->ms_weight & METASLAB_ACTIVE_MASK);
1951 1984
1952 1985 return (0);
1953 1986 }
1954 1987
1955 1988 static void
1956 1989 metaslab_passivate(metaslab_t *msp, uint64_t weight)
1957 1990 {
1958 1991 uint64_t size = weight & ~METASLAB_WEIGHT_TYPE;
1959 1992
1960 1993 /*
1961 1994 * If size < SPA_MINBLOCKSIZE, then we will not allocate from
1962 1995 * this metaslab again. In that case, it had better be empty,
1963 1996 * or we would be leaving space on the table.
1964 1997 */
1965 1998 ASSERT(size >= SPA_MINBLOCKSIZE ||
1966 1999 range_tree_space(msp->ms_tree) == 0);
1967 2000 ASSERT0(weight & METASLAB_ACTIVE_MASK);
1968 2001
1969 2002 msp->ms_activation_weight = 0;
1970 2003 metaslab_group_sort(msp->ms_group, msp, weight);
1971 2004 ASSERT((msp->ms_weight & METASLAB_ACTIVE_MASK) == 0);
1972 2005 }
1973 2006
1974 2007 /*
1975 2008 * Segment-based metaslabs are activated once and remain active until
1976 2009 * we either fail an allocation attempt (similar to space-based metaslabs)
1977 2010 * or have exhausted the free space in zfs_metaslab_switch_threshold
1978 2011 * buckets since the metaslab was activated. This function checks to see
1979 2012 * if we've exhaused the zfs_metaslab_switch_threshold buckets in the
1980 2013  * if we've exhausted the zfs_metaslab_switch_threshold buckets in the
1981 2014  * metaslab and passivates it proactively. This will allow us to select a
1982 2015  * metaslab with a larger contiguous region, if any remains, within this
1983 2016  * metaslab group. If we're in sync pass > 1, then we continue using this
1984 2017  * metaslab so that we don't dirty more blocks and cause more sync passes.
1985 2018 void
1986 2019 metaslab_segment_may_passivate(metaslab_t *msp)
1987 2020 {
1988 2021 spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
1989 2022
1990 2023 if (WEIGHT_IS_SPACEBASED(msp->ms_weight) || spa_sync_pass(spa) > 1)
1991 2024 return;
1992 2025
1993 2026 /*
1994 2027 * Since we are in the middle of a sync pass, the most accurate
1995 2028 * information that is accessible to us is the in-core range tree
1996 2029 * histogram; calculate the new weight based on that information.
1997 2030 */
1998 2031 uint64_t weight = metaslab_weight_from_range_tree(msp);
1999 2032 int activation_idx = WEIGHT_GET_INDEX(msp->ms_activation_weight);
2000 2033 int current_idx = WEIGHT_GET_INDEX(weight);
2001 2034
2002 2035 if (current_idx <= activation_idx - zfs_metaslab_switch_threshold)
2003 2036 metaslab_passivate(msp, weight);
2004 2037 }
2005 2038
2006 2039 static void
2007 2040 metaslab_preload(void *arg)
2008 2041 {
2009 2042 metaslab_t *msp = arg;
2010 2043 spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
2011 2044
2012 2045 ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
2013 2046
2014 2047 mutex_enter(&msp->ms_lock);
2015 2048 metaslab_load_wait(msp);
2016 2049 if (!msp->ms_loaded)
2017 2050 (void) metaslab_load(msp);
2018 2051 msp->ms_selected_txg = spa_syncing_txg(spa);
2019 2052 mutex_exit(&msp->ms_lock);
2020 2053 }
2021 2054
2022 2055 static void
2023 2056 metaslab_group_preload(metaslab_group_t *mg)
2024 2057 {
2025 2058 spa_t *spa = mg->mg_vd->vdev_spa;
2026 2059 metaslab_t *msp;
2027 2060 avl_tree_t *t = &mg->mg_metaslab_tree;
2028 2061 int m = 0;
2029 2062
2030 2063 if (spa_shutting_down(spa) || !metaslab_preload_enabled) {
2031 2064 taskq_wait(mg->mg_taskq);
2032 2065 return;
2033 2066 }
2034 2067
2035 2068 mutex_enter(&mg->mg_lock);
2036 -
2037 2069 /*
2038 2070 * Load the next potential metaslabs
2039 2071 */
2040 2072 for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
2041 - ASSERT3P(msp->ms_group, ==, mg);
2042 -
2043 2073 /*
2044 2074 * We preload only the maximum number of metaslabs specified
2045 2075 * by metaslab_preload_limit. If a metaslab is being forced
2046 2076 * to condense then we preload it too. This will ensure
2047 2077 * that force condensing happens in the next txg.
2048 2078 */
2049 2079 if (++m > metaslab_preload_limit && !msp->ms_condense_wanted) {
2050 2080 continue;
2051 2081 }
2052 2082
2053 2083 VERIFY(taskq_dispatch(mg->mg_taskq, metaslab_preload,
2054 2084 msp, TQ_SLEEP) != NULL);
2055 2085 }
2056 2086 mutex_exit(&mg->mg_lock);
2057 2087 }
2058 2088
2059 2089 /*
2060 2090 * Determine if the space map's on-disk footprint is past our tolerance
2061 2091 * for inefficiency. We would like to use the following criteria to make
2062 2092 * our decision:
2063 2093 *
2064 2094 * 1. The size of the space map object should not dramatically increase as a
2065 2095 * result of writing out the free space range tree.
2066 2096 *
2067 2097 * 2. The minimal on-disk space map representation is zfs_condense_pct/100
2068 2098  * times the size of the free space range tree representation
2069 - * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
2099 +     * (i.e. zfs_condense_pct = 110 and in-core = 1MB, minimal = 1.1MB).
2070 2100 *
2071 2101 * 3. The on-disk size of the space map should actually decrease.
2072 2102 *
2073 2103 * Checking the first condition is tricky since we don't want to walk
2074 2104 * the entire AVL tree calculating the estimated on-disk size. Instead we
2075 2105 * use the size-ordered range tree in the metaslab and calculate the
2076 2106 * size required to write out the largest segment in our free tree. If the
2077 2107 * size required to represent that segment on disk is larger than the space
2078 2108 * map object then we avoid condensing this map.
2079 2109 *
2080 2110 * To determine the second criterion we use a best-case estimate and assume
2081 2111 * each segment can be represented on-disk as a single 64-bit entry. We refer
2082 2112 * to this best-case estimate as the space map's minimal form.
2083 2113 *
2084 2114 * Unfortunately, we cannot compute the on-disk size of the space map in this
2085 2115 * context because we cannot accurately compute the effects of compression, etc.
2086 2116 * Instead, we apply the heuristic described in the block comment for
2087 2117 * zfs_metaslab_condense_block_threshold - we only condense if the space used
2088 2118 * is greater than a threshold number of blocks.
2089 2119 */
2090 2120 static boolean_t
2091 2121 metaslab_should_condense(metaslab_t *msp)
2092 2122 {
2093 2123 space_map_t *sm = msp->ms_sm;
2094 2124 range_seg_t *rs;
2095 2125 uint64_t size, entries, segsz, object_size, optimal_size, record_size;
2096 2126 dmu_object_info_t doi;
2097 2127 uint64_t vdev_blocksize = 1 << msp->ms_group->mg_vd->vdev_ashift;
2098 2128
2099 2129 ASSERT(MUTEX_HELD(&msp->ms_lock));
2100 2130 ASSERT(msp->ms_loaded);
2101 2131
2102 2132 /*
2103 2133 * Use the ms_size_tree range tree, which is ordered by size, to
2104 2134 * obtain the largest segment in the free tree. We always condense
2105 2135 * metaslabs that are empty and metaslabs for which a condense
2106 2136 * request has been made.
2107 2137 */
2108 2138 rs = avl_last(&msp->ms_size_tree);
2109 2139 if (rs == NULL || msp->ms_condense_wanted)
2110 2140 return (B_TRUE);
2111 2141
2112 2142 /*
2113 2143 * Calculate the number of 64-bit entries this segment would
2114 2144 * require when written to disk. If this single segment would be
2115 2145 * larger on-disk than the entire current on-disk structure, then
2116 2146 * clearly condensing will increase the on-disk structure size.
2117 2147 */
2118 2148 size = (rs->rs_end - rs->rs_start) >> sm->sm_shift;
2119 2149 entries = size / (MIN(size, SM_RUN_MAX));
2120 2150 segsz = entries * sizeof (uint64_t);
2121 2151
2122 2152 optimal_size = sizeof (uint64_t) * avl_numnodes(&msp->ms_tree->rt_root);
2123 2153 object_size = space_map_length(msp->ms_sm);
2124 2154
2125 2155 dmu_object_info_from_db(sm->sm_dbuf, &doi);
2126 2156 record_size = MAX(doi.doi_data_block_size, vdev_blocksize);
2127 2157
2128 2158 return (segsz <= object_size &&
2129 2159 object_size >= (optimal_size * zfs_condense_pct / 100) &&
2130 2160 object_size > zfs_metaslab_condense_block_threshold * record_size);
2131 2161 }
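To make the criteria above concrete, a toy model of the decision with example numbers; zfs_condense_pct = 110 matches the comment, while the 16-block threshold and the sample sizes are assumptions for illustration:

#include <stdint.h>
#include <stdio.h>

/*
 * Condense only if the current on-disk space map is at least
 * condense_pct/100 times its minimal (one 64-bit entry per free segment)
 * form, exceeds a block-count threshold, and would not grow just to
 * re-express the largest free segment.
 */
static int
should_condense(uint64_t largest_seg_entries, uint64_t nsegs,
    uint64_t object_size, uint64_t record_size)
{
        uint64_t condense_pct = 110;            /* assumed tunable value */
        uint64_t block_threshold = 16;          /* assumed tunable value */
        uint64_t segsz = largest_seg_entries * sizeof (uint64_t);
        uint64_t optimal_size = nsegs * sizeof (uint64_t);

        return (segsz <= object_size &&
            object_size >= optimal_size * condense_pct / 100 &&
            object_size > block_threshold * record_size);
}

int
main(void)
{
        /* 10,000 free segments, 3 MB on disk, 4K records: condense */
        printf("%d\n", should_condense(1, 10000, 3 << 20, 4096));
        /* same tree but only 80 KB on disk: not worth condensing */
        printf("%d\n", should_condense(1, 10000, 80 << 10, 4096));
        return (0);
}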
2132 2162
2133 2163 /*
2134 2164 * Condense the on-disk space map representation to its minimized form.
2135 2165 * The minimized form consists of a small number of allocations followed by
2136 2166 * the entries of the free range tree.
2137 2167 */
2138 2168 static void
2139 2169 metaslab_condense(metaslab_t *msp, uint64_t txg, dmu_tx_t *tx)
2140 2170 {
2141 2171 spa_t *spa = msp->ms_group->mg_vd->vdev_spa;
2142 2172 range_tree_t *condense_tree;
2143 2173 space_map_t *sm = msp->ms_sm;
2144 2174
2145 2175 ASSERT(MUTEX_HELD(&msp->ms_lock));
2146 2176 ASSERT3U(spa_sync_pass(spa), ==, 1);
2147 2177 ASSERT(msp->ms_loaded);
2148 2178
2149 2179
2150 2180 spa_dbgmsg(spa, "condensing: txg %llu, msp[%llu] %p, vdev id %llu, "
2151 2181 "spa %s, smp size %llu, segments %lu, forcing condense=%s", txg,
2152 2182 msp->ms_id, msp, msp->ms_group->mg_vd->vdev_id,
2153 2183 msp->ms_group->mg_vd->vdev_spa->spa_name,
2154 2184 space_map_length(msp->ms_sm), avl_numnodes(&msp->ms_tree->rt_root),
2155 2185 msp->ms_condense_wanted ? "TRUE" : "FALSE");
2156 2186
2157 2187 msp->ms_condense_wanted = B_FALSE;
2158 2188
2159 2189 /*
2160 2190  * Create a range tree that is 100% allocated. We remove segments
2161 2191  * that have been freed in this txg, any deferred frees that exist,
2162 2192  * and any allocations in the future. Removing segments should be
2163 2193 * a relatively inexpensive operation since we expect these trees to
2164 2194 * have a small number of nodes.
2165 2195 */
2166 - condense_tree = range_tree_create(NULL, NULL);
2196 + condense_tree = range_tree_create(NULL, NULL, &msp->ms_lock);
2167 2197 range_tree_add(condense_tree, msp->ms_start, msp->ms_size);
2168 2198
2169 2199 /*
2170 2200 * Remove what's been freed in this txg from the condense_tree.
2171 2201 * Since we're in sync_pass 1, we know that all the frees from
2172 2202 * this txg are in the freeingtree.
2173 2203 */
2174 2204 range_tree_walk(msp->ms_freeingtree, range_tree_remove, condense_tree);
2175 2205
2176 2206 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
2177 2207 range_tree_walk(msp->ms_defertree[t],
2178 2208 range_tree_remove, condense_tree);
2179 2209 }
2180 2210
2181 2211 for (int t = 1; t < TXG_CONCURRENT_STATES; t++) {
2182 2212 range_tree_walk(msp->ms_alloctree[(txg + t) & TXG_MASK],
2183 2213 range_tree_remove, condense_tree);
2184 2214 }
2185 2215
2186 2216 /*
2187 2217 * We're about to drop the metaslab's lock thus allowing
2188 2218  * other consumers to change its content. Set the
2189 2219 * metaslab's ms_condensing flag to ensure that
2190 2220 * allocations on this metaslab do not occur while we're
2191 2221 * in the middle of committing it to disk. This is only critical
2192 2222 * for the ms_tree as all other range trees use per txg
2193 2223 * views of their content.
2194 2224 */
2195 2225 msp->ms_condensing = B_TRUE;
2196 2226
2197 2227 mutex_exit(&msp->ms_lock);
2198 2228 space_map_truncate(sm, tx);
2229 + mutex_enter(&msp->ms_lock);
2199 2230
2200 2231 /*
2201 2232 * While we would ideally like to create a space map representation
2202 2233 * that consists only of allocation records, doing so can be
2203 2234 * prohibitively expensive because the in-core free tree can be
2204 2235 * large, and therefore computationally expensive to subtract
2205 2236 * from the condense_tree. Instead we sync out two trees, a cheap
2206 2237 * allocation only tree followed by the in-core free tree. While not
2207 2238 * optimal, this is typically close to optimal, and much cheaper to
2208 2239 * compute.
2209 2240 */
2210 2241 space_map_write(sm, condense_tree, SM_ALLOC, tx);
2211 2242 range_tree_vacate(condense_tree, NULL, NULL);
2212 2243 range_tree_destroy(condense_tree);
2213 2244
2214 2245 space_map_write(sm, msp->ms_tree, SM_FREE, tx);
2215 - mutex_enter(&msp->ms_lock);
2216 2246 msp->ms_condensing = B_FALSE;
2217 2247 }
2218 2248
2219 2249 /*
2220 2250 * Write a metaslab to disk in the context of the specified transaction group.
2221 2251 */
2222 2252 void
2223 2253 metaslab_sync(metaslab_t *msp, uint64_t txg)
2224 2254 {
2225 2255 metaslab_group_t *mg = msp->ms_group;
2226 2256 vdev_t *vd = mg->mg_vd;
2227 2257 spa_t *spa = vd->vdev_spa;
2228 2258 objset_t *mos = spa_meta_objset(spa);
2229 2259 range_tree_t *alloctree = msp->ms_alloctree[txg & TXG_MASK];
2230 2260 dmu_tx_t *tx;
2231 2261 uint64_t object = space_map_object(msp->ms_sm);
2232 2262
2233 2263 ASSERT(!vd->vdev_ishole);
2234 2264
2265 + mutex_enter(&msp->ms_lock);
2266 +
2235 2267 /*
2236 2268 * This metaslab has just been added so there's no work to do now.
2237 2269 */
2238 2270 if (msp->ms_freeingtree == NULL) {
2239 2271 ASSERT3P(alloctree, ==, NULL);
2272 + mutex_exit(&msp->ms_lock);
2240 2273 return;
2241 2274 }
2242 2275
2243 2276 ASSERT3P(alloctree, !=, NULL);
2244 2277 ASSERT3P(msp->ms_freeingtree, !=, NULL);
2245 2278 ASSERT3P(msp->ms_freedtree, !=, NULL);
2246 2279
2247 2280 /*
2248 2281 * Normally, we don't want to process a metaslab if there
2249 2282 * are no allocations or frees to perform. However, if the metaslab
2250 2283 * is being forced to condense and it's loaded, we need to let it
2251 2284 * through.
2252 2285 */
2253 2286 if (range_tree_space(alloctree) == 0 &&
2254 2287 range_tree_space(msp->ms_freeingtree) == 0 &&
2255 - !(msp->ms_loaded && msp->ms_condense_wanted))
2288 + !(msp->ms_loaded && msp->ms_condense_wanted)) {
2289 + mutex_exit(&msp->ms_lock);
2256 2290 return;
2291 + }
2257 2292
2258 2293
2259 2294 VERIFY(txg <= spa_final_dirty_txg(spa));
2260 2295
2261 2296 /*
2262 2297 * The only state that can actually be changing concurrently with
2263 2298 * metaslab_sync() is the metaslab's ms_tree. No other thread can
2264 2299 * be modifying this txg's alloctree, freeingtree, freedtree, or
2265 - * space_map_phys_t. We drop ms_lock whenever we could call
2266 - * into the DMU, because the DMU can call down to us
2267 - * (e.g. via zio_free()) at any time.
2268 - *
2269 - * The spa_vdev_remove_thread() can be reading metaslab state
2270 - * concurrently, and it is locked out by the ms_sync_lock. Note
2271 - * that the ms_lock is insufficient for this, because it is dropped
2272 - * by space_map_write().
2300 +     * space_map_phys_t. Therefore, we only hold ms_lock to satisfy
2301 + * space map ASSERTs. We drop it whenever we call into the DMU,
2302 + * because the DMU can call down to us (e.g. via zio_free()) at
2303 + * any time.
2273 2304 */
2274 2305
2275 2306 tx = dmu_tx_create_assigned(spa_get_dsl(spa), txg);
2276 2307
2277 2308 if (msp->ms_sm == NULL) {
2278 2309 uint64_t new_object;
2279 2310
2280 2311 new_object = space_map_alloc(mos, tx);
2281 2312 VERIFY3U(new_object, !=, 0);
2282 2313
2283 2314 VERIFY0(space_map_open(&msp->ms_sm, mos, new_object,
2284 - msp->ms_start, msp->ms_size, vd->vdev_ashift));
2315 + msp->ms_start, msp->ms_size, vd->vdev_ashift,
2316 + &msp->ms_lock));
2285 2317 ASSERT(msp->ms_sm != NULL);
2286 2318 }
2287 2319
2288 - mutex_enter(&msp->ms_sync_lock);
2289 - mutex_enter(&msp->ms_lock);
2290 -
2291 2320 /*
2292 2321 * Note: metaslab_condense() clears the space map's histogram.
2293 2322 * Therefore we must verify and remove this histogram before
2294 2323 * condensing.
2295 2324 */
2296 2325 metaslab_group_histogram_verify(mg);
2297 2326 metaslab_class_histogram_verify(mg->mg_class);
2298 2327 metaslab_group_histogram_remove(mg, msp);
2299 2328
2300 2329 if (msp->ms_loaded && spa_sync_pass(spa) == 1 &&
2301 2330 metaslab_should_condense(msp)) {
2302 2331 metaslab_condense(msp, txg, tx);
2303 2332 } else {
2304 - mutex_exit(&msp->ms_lock);
2305 2333 space_map_write(msp->ms_sm, alloctree, SM_ALLOC, tx);
2306 2334 space_map_write(msp->ms_sm, msp->ms_freeingtree, SM_FREE, tx);
2307 - mutex_enter(&msp->ms_lock);
2308 2335 }
2309 2336
2310 2337 if (msp->ms_loaded) {
2311 2338 /*
2312 - * When the space map is loaded, we have an accurate
2339 +     * When the space map is loaded, we have an accurate
2313 2340 * histogram in the range tree. This gives us an opportunity
2314 2341 * to bring the space map's histogram up-to-date so we clear
2315 2342 * it first before updating it.
2316 2343 */
2317 2344 space_map_histogram_clear(msp->ms_sm);
2318 2345 space_map_histogram_add(msp->ms_sm, msp->ms_tree, tx);
2319 2346
2320 2347 /*
2321 2348 * Since we've cleared the histogram we need to add back
2322 2349 * any free space that has already been processed, plus
2323 2350 * any deferred space. This allows the on-disk histogram
2324 2351 * to accurately reflect all free space even if some space
2325 2352 * is not yet available for allocation (i.e. deferred).
2326 2353 */
2327 2354 space_map_histogram_add(msp->ms_sm, msp->ms_freedtree, tx);
2328 2355
2329 2356 /*
2330 2357 * Add back any deferred free space that has not been
2331 2358 * added back into the in-core free tree yet. This will
2332 2359 * ensure that we don't end up with a space map histogram
2333 2360 * that is completely empty unless the metaslab is fully
2334 2361 * allocated.
2335 2362 */
2336 2363 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
2337 2364 space_map_histogram_add(msp->ms_sm,
2338 2365 msp->ms_defertree[t], tx);
2339 2366 }
2340 2367 }
2341 2368
2342 2369 /*
2343 2370 * Always add the free space from this sync pass to the space
2344 2371 * map histogram. We want to make sure that the on-disk histogram
2345 2372 * accounts for all free space. If the space map is not loaded,
2346 2373 * then we will lose some accuracy but will correct it the next
2347 2374 * time we load the space map.
2348 2375 */
2349 2376 space_map_histogram_add(msp->ms_sm, msp->ms_freeingtree, tx);
2350 2377
2351 2378 metaslab_group_histogram_add(mg, msp);
2352 2379 metaslab_group_histogram_verify(mg);
2353 2380 metaslab_class_histogram_verify(mg->mg_class);
2354 2381
2355 2382 /*
2356 2383 * For sync pass 1, we avoid traversing this txg's free range tree
2357 2384 * and instead will just swap the pointers for freeingtree and
2358 2385 * freedtree. We can safely do this since the freed_tree is
2359 2386 * guaranteed to be empty on the initial pass.
2360 2387 */
2361 2388 if (spa_sync_pass(spa) == 1) {
2362 2389 range_tree_swap(&msp->ms_freeingtree, &msp->ms_freedtree);
2363 2390 } else {
2364 2391 range_tree_vacate(msp->ms_freeingtree,
2365 2392 range_tree_add, msp->ms_freedtree);
2366 2393 }
2367 2394 range_tree_vacate(alloctree, NULL, NULL);
2368 2395
2369 2396 ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
2370 2397 ASSERT0(range_tree_space(msp->ms_alloctree[TXG_CLEAN(txg) & TXG_MASK]));
2371 2398 ASSERT0(range_tree_space(msp->ms_freeingtree));
2372 2399
2373 2400 mutex_exit(&msp->ms_lock);
2374 2401
2375 2402 if (object != space_map_object(msp->ms_sm)) {
2376 2403 object = space_map_object(msp->ms_sm);
2377 2404 dmu_write(mos, vd->vdev_ms_array, sizeof (uint64_t) *
2378 2405 msp->ms_id, sizeof (uint64_t), &object, tx);
2379 2406 }
2380 - mutex_exit(&msp->ms_sync_lock);
2381 2407 dmu_tx_commit(tx);
2382 2408 }
2383 2409
2384 2410 /*
2385 2411 * Called after a transaction group has completely synced to mark
2386 2412 * all of the metaslab's free space as usable.
2387 2413 */
2388 2414 void
2389 2415 metaslab_sync_done(metaslab_t *msp, uint64_t txg)
2390 2416 {
2391 2417 metaslab_group_t *mg = msp->ms_group;
2392 2418 vdev_t *vd = mg->mg_vd;
2393 2419 spa_t *spa = vd->vdev_spa;
2394 2420 range_tree_t **defer_tree;
2395 2421 int64_t alloc_delta, defer_delta;
2396 2422 boolean_t defer_allowed = B_TRUE;
2397 2423
2398 2424 ASSERT(!vd->vdev_ishole);
2399 2425
2400 2426 mutex_enter(&msp->ms_lock);
2401 2427
2402 2428 /*
2403 2429 * If this metaslab is just becoming available, initialize its
2404 2430 * range trees and add its capacity to the vdev.
2405 2431 */
2406 2432 if (msp->ms_freedtree == NULL) {
2407 2433 for (int t = 0; t < TXG_SIZE; t++) {
2408 2434 ASSERT(msp->ms_alloctree[t] == NULL);
2409 2435
2410 - msp->ms_alloctree[t] = range_tree_create(NULL, NULL);
2436 + msp->ms_alloctree[t] = range_tree_create(NULL, msp,
2437 + &msp->ms_lock);
2411 2438 }
2412 2439
2413 2440 ASSERT3P(msp->ms_freeingtree, ==, NULL);
2414 - msp->ms_freeingtree = range_tree_create(NULL, NULL);
2441 + msp->ms_freeingtree = range_tree_create(NULL, msp,
2442 + &msp->ms_lock);
2415 2443
2416 2444 ASSERT3P(msp->ms_freedtree, ==, NULL);
2417 - msp->ms_freedtree = range_tree_create(NULL, NULL);
2445 + msp->ms_freedtree = range_tree_create(NULL, msp,
2446 + &msp->ms_lock);
2418 2447
2419 2448 for (int t = 0; t < TXG_DEFER_SIZE; t++) {
2420 2449 ASSERT(msp->ms_defertree[t] == NULL);
2421 2450
2422 - msp->ms_defertree[t] = range_tree_create(NULL, NULL);
2451 + msp->ms_defertree[t] = range_tree_create(NULL, msp,
2452 + &msp->ms_lock);
2423 2453 }
2424 2454
2425 2455 vdev_space_update(vd, 0, 0, msp->ms_size);
2426 2456 }
2427 2457
2428 2458 defer_tree = &msp->ms_defertree[txg % TXG_DEFER_SIZE];
2429 2459
2430 2460 uint64_t free_space = metaslab_class_get_space(spa_normal_class(spa)) -
2431 2461 metaslab_class_get_alloc(spa_normal_class(spa));
2432 - if (free_space <= spa_get_slop_space(spa) || vd->vdev_removing) {
2462 + if (free_space <= spa_get_slop_space(spa)) {
2433 2463 defer_allowed = B_FALSE;
2434 2464 }
2435 2465
2436 2466 defer_delta = 0;
2437 2467 alloc_delta = space_map_alloc_delta(msp->ms_sm);
2438 2468 if (defer_allowed) {
2439 2469 defer_delta = range_tree_space(msp->ms_freedtree) -
2440 2470 range_tree_space(*defer_tree);
2441 2471 } else {
2442 2472 defer_delta -= range_tree_space(*defer_tree);
2443 2473 }
2444 2474
2445 2475 vdev_space_update(vd, alloc_delta + defer_delta, defer_delta, 0);
2446 2476
2447 2477 /*
2448 2478 * If there's a metaslab_load() in progress, wait for it to complete
2449 2479 * so that we have a consistent view of the in-core space map.
2450 2480 */
2451 2481 metaslab_load_wait(msp);
2452 2482
2453 2483 /*
2454 2484 * Move the frees from the defer_tree back to the free
2455 2485 * range tree (if it's loaded). Swap the freed_tree and the
2456 2486 * defer_tree -- this is safe to do because we've just emptied out
2457 2487 * the defer_tree.
2458 2488 */
2489 + if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
2490 + !vd->vdev_man_trimming) {
2491 + range_tree_walk(*defer_tree, metaslab_trim_add, msp);
2492 + if (!defer_allowed) {
2493 + range_tree_walk(msp->ms_freedtree, metaslab_trim_add,
2494 + msp);
2495 + }
2496 + }
2459 2497 range_tree_vacate(*defer_tree,
2460 2498 msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2461 2499 if (defer_allowed) {
2462 2500 range_tree_swap(&msp->ms_freedtree, defer_tree);
2463 2501 } else {
2464 2502 range_tree_vacate(msp->ms_freedtree,
2465 2503 msp->ms_loaded ? range_tree_add : NULL, msp->ms_tree);
2466 2504 }
2467 2505
2468 2506 space_map_update(msp->ms_sm);
2469 2507
2470 2508 msp->ms_deferspace += defer_delta;
2471 2509 ASSERT3S(msp->ms_deferspace, >=, 0);
2472 2510 ASSERT3S(msp->ms_deferspace, <=, msp->ms_size);
2473 2511 if (msp->ms_deferspace != 0) {
2474 2512 /*
2475 2513 * Keep syncing this metaslab until all deferred frees
2476 2514 * are back in circulation.
2477 2515 */
2478 2516 vdev_dirty(vd, VDD_METASLAB, msp, txg + 1);
2479 2517 }
2480 2518
2481 2519 /*
2482 2520 * Calculate the new weights before unloading any metaslabs.
2483 2521 * This will give us the most accurate weighting.
2484 2522 */
2485 2523 metaslab_group_sort(mg, msp, metaslab_weight(msp));
2486 2524
2487 2525 /*
2488 2526 * If the metaslab is loaded and we've not tried to load or allocate
2489 2527 * from it in 'metaslab_unload_delay' txgs, then unload it.
2490 2528 */
2491 2529 if (msp->ms_loaded &&
2492 2530 msp->ms_selected_txg + metaslab_unload_delay < txg) {
2493 2531 for (int t = 1; t < TXG_CONCURRENT_STATES; t++) {
2494 2532 VERIFY0(range_tree_space(
2495 2533 msp->ms_alloctree[(txg + t) & TXG_MASK]));
2496 2534 }
2497 2535
2498 2536 if (!metaslab_debug_unload)
2499 2537 metaslab_unload(msp);
2500 2538 }
2501 2539
2502 - ASSERT0(range_tree_space(msp->ms_alloctree[txg & TXG_MASK]));
2503 - ASSERT0(range_tree_space(msp->ms_freeingtree));
2504 - ASSERT0(range_tree_space(msp->ms_freedtree));
2505 -
2506 2540 mutex_exit(&msp->ms_lock);
2507 2541 }
2508 2542
2509 2543 void
2510 2544 metaslab_sync_reassess(metaslab_group_t *mg)
2511 2545 {
2512 - spa_t *spa = mg->mg_class->mc_spa;
2513 -
2514 - spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
2515 2546 metaslab_group_alloc_update(mg);
2516 2547 mg->mg_fragmentation = metaslab_group_fragmentation(mg);
2517 2548
2518 2549 /*
2519 - * Preload the next potential metaslabs but only on active
2520 - * metaslab groups. We can get into a state where the metaslab
2521 - * is no longer active since we dirty metaslabs as we remove a
2522 - * a device, thus potentially making the metaslab group eligible
2523 - * for preloading.
2550 + * Preload the next potential metaslabs
2524 2551 */
2525 - if (mg->mg_activation_count > 0) {
2526 - metaslab_group_preload(mg);
2527 - }
2528 - spa_config_exit(spa, SCL_ALLOC, FTAG);
2552 + metaslab_group_preload(mg);
2529 2553 }
2530 2554
2531 2555 static uint64_t
2532 2556 metaslab_distance(metaslab_t *msp, dva_t *dva)
2533 2557 {
2534 2558 uint64_t ms_shift = msp->ms_group->mg_vd->vdev_ms_shift;
2535 2559 uint64_t offset = DVA_GET_OFFSET(dva) >> ms_shift;
2536 2560 uint64_t start = msp->ms_id;
2537 2561
2538 2562 if (msp->ms_group->mg_vd->vdev_id != DVA_GET_VDEV(dva))
2539 2563 return (1ULL << 63);
2540 2564
2541 2565 if (offset < start)
2542 2566 return ((start - offset) << ms_shift);
2543 2567 if (offset > start)
2544 2568 return ((offset - start) << ms_shift);
2545 2569 return (0);
2546 2570 }
2547 2571
2548 2572 /*
2549 2573 * ==========================================================================
2550 2574 * Metaslab allocation tracing facility
2551 2575 * ==========================================================================
2552 2576 */
2553 2577 kstat_t *metaslab_trace_ksp;
2554 2578 kstat_named_t metaslab_trace_over_limit;
2555 2579
2556 2580 void
2557 2581 metaslab_alloc_trace_init(void)
2558 2582 {
2559 2583 ASSERT(metaslab_alloc_trace_cache == NULL);
2560 2584 metaslab_alloc_trace_cache = kmem_cache_create(
2561 2585 "metaslab_alloc_trace_cache", sizeof (metaslab_alloc_trace_t),
2562 2586 0, NULL, NULL, NULL, NULL, NULL, 0);
2563 2587 metaslab_trace_ksp = kstat_create("zfs", 0, "metaslab_trace_stats",
2564 2588 "misc", KSTAT_TYPE_NAMED, 1, KSTAT_FLAG_VIRTUAL);
2565 2589 if (metaslab_trace_ksp != NULL) {
2566 2590 metaslab_trace_ksp->ks_data = &metaslab_trace_over_limit;
2567 2591 kstat_named_init(&metaslab_trace_over_limit,
2568 2592 "metaslab_trace_over_limit", KSTAT_DATA_UINT64);
2569 2593 kstat_install(metaslab_trace_ksp);
2570 2594 }
2571 2595 }
2572 2596
2573 2597 void
2574 2598 metaslab_alloc_trace_fini(void)
2575 2599 {
2576 2600 if (metaslab_trace_ksp != NULL) {
2577 2601 kstat_delete(metaslab_trace_ksp);
2578 2602 metaslab_trace_ksp = NULL;
2579 2603 }
2580 2604 kmem_cache_destroy(metaslab_alloc_trace_cache);
2581 2605 metaslab_alloc_trace_cache = NULL;
2582 2606 }
2583 2607
2584 2608 /*
2585 2609 * Add an allocation trace element to the allocation tracing list.
2586 2610 */
2587 2611 static void
2588 2612 metaslab_trace_add(zio_alloc_list_t *zal, metaslab_group_t *mg,
2589 2613 metaslab_t *msp, uint64_t psize, uint32_t dva_id, uint64_t offset)
2590 2614 {
2591 2615 if (!metaslab_trace_enabled)
2592 2616 return;
2593 2617
2594 2618 /*
2595 2619 * When the tracing list reaches its maximum we remove
2596 2620 * the second element in the list before adding a new one.
2597 2621 * By removing the second element we preserve the original
2598 2622  * entry as a clue to what allocation steps have already been
2599 2623 * performed.
2600 2624 */
2601 2625 if (zal->zal_size == metaslab_trace_max_entries) {
2602 2626 metaslab_alloc_trace_t *mat_next;
2603 2627 #ifdef DEBUG
2604 2628 panic("too many entries in allocation list");
2605 2629 #endif
2606 2630 atomic_inc_64(&metaslab_trace_over_limit.value.ui64);
2607 2631 zal->zal_size--;
2608 2632 mat_next = list_next(&zal->zal_list, list_head(&zal->zal_list));
2609 2633 list_remove(&zal->zal_list, mat_next);
2610 2634 kmem_cache_free(metaslab_alloc_trace_cache, mat_next);
2611 2635 }
2612 2636
2613 2637 metaslab_alloc_trace_t *mat =
2614 2638 kmem_cache_alloc(metaslab_alloc_trace_cache, KM_SLEEP);
2615 2639 list_link_init(&mat->mat_list_node);
2616 2640 mat->mat_mg = mg;
2617 2641 mat->mat_msp = msp;
2618 2642 mat->mat_size = psize;
2619 2643 mat->mat_dva_id = dva_id;
2620 2644 mat->mat_offset = offset;
2621 2645 mat->mat_weight = 0;
2622 2646
2623 2647 if (msp != NULL)
2624 2648 mat->mat_weight = msp->ms_weight;
2625 2649
2626 2650 /*
2627 2651 * The list is part of the zio so locking is not required. Only
2628 2652 * a single thread will perform allocations for a given zio.
2629 2653 */
2630 2654 list_insert_tail(&zal->zal_list, mat);
2631 2655 zal->zal_size++;
2632 2656
2633 2657 ASSERT3U(zal->zal_size, <=, metaslab_trace_max_entries);
2634 2658 }
2635 2659
2636 2660 void
2637 2661 metaslab_trace_init(zio_alloc_list_t *zal)
2638 2662 {
2639 2663 list_create(&zal->zal_list, sizeof (metaslab_alloc_trace_t),
2640 2664 offsetof(metaslab_alloc_trace_t, mat_list_node));
2641 2665 zal->zal_size = 0;
2642 2666 }
2643 2667
2644 2668 void
2645 2669 metaslab_trace_fini(zio_alloc_list_t *zal)
2646 2670 {
2647 2671 metaslab_alloc_trace_t *mat;
2648 2672
2649 2673 while ((mat = list_remove_head(&zal->zal_list)) != NULL)
2650 2674 kmem_cache_free(metaslab_alloc_trace_cache, mat);
2651 2675 list_destroy(&zal->zal_list);
2652 2676 zal->zal_size = 0;
2653 2677 }
2654 2678
2655 2679 /*
2656 2680 * ==========================================================================
2657 2681 * Metaslab block operations
2658 2682 * ==========================================================================
2659 2683 */
2660 2684
2661 2685 static void
2662 2686 metaslab_group_alloc_increment(spa_t *spa, uint64_t vdev, void *tag, int flags)
2663 2687 {
2664 2688 if (!(flags & METASLAB_ASYNC_ALLOC) ||
2665 2689 flags & METASLAB_DONT_THROTTLE)
2666 2690 return;
2667 2691
2668 2692 metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg;
2669 2693 if (!mg->mg_class->mc_alloc_throttle_enabled)
2670 2694 return;
2671 2695
2672 2696 (void) refcount_add(&mg->mg_alloc_queue_depth, tag);
2673 2697 }
2674 2698
2675 2699 void
2676 2700 metaslab_group_alloc_decrement(spa_t *spa, uint64_t vdev, void *tag, int flags)
2677 2701 {
2678 2702 if (!(flags & METASLAB_ASYNC_ALLOC) ||
2679 2703 flags & METASLAB_DONT_THROTTLE)
2680 2704 return;
2681 2705
2682 2706 metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg;
2683 2707 if (!mg->mg_class->mc_alloc_throttle_enabled)
2684 2708 return;
2685 2709
2686 2710 (void) refcount_remove(&mg->mg_alloc_queue_depth, tag);
2687 2711 }
2688 2712
2689 2713 void
2690 2714 metaslab_group_alloc_verify(spa_t *spa, const blkptr_t *bp, void *tag)
2691 2715 {
2692 2716 #ifdef ZFS_DEBUG
2693 2717 const dva_t *dva = bp->blk_dva;
2694 2718 int ndvas = BP_GET_NDVAS(bp);
2695 2719
2696 2720 for (int d = 0; d < ndvas; d++) {
2697 2721 uint64_t vdev = DVA_GET_VDEV(&dva[d]);
2698 2722 metaslab_group_t *mg = vdev_lookup_top(spa, vdev)->vdev_mg;
2699 2723 VERIFY(refcount_not_held(&mg->mg_alloc_queue_depth, tag));
2700 2724 }
2701 2725 #endif
2702 2726 }
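The increment/decrement pair above tracks per-group queue depth for throttle-eligible async allocations. A hedged stand-alone sketch of the same idea using a plain atomic counter (the kernel uses a refcount keyed by the zio tag, and the depth limit is enforced elsewhere):

#include <stdatomic.h>
#include <stdint.h>

/* toy model of per-group async allocation queue-depth throttling */
struct alloc_group {
        atomic_uint_fast64_t queue_depth;
        uint64_t max_queue_depth;
};

/* returns 1 if the caller may issue another async allocation */
int
group_alloc_reserve(struct alloc_group *g)
{
        if (atomic_fetch_add(&g->queue_depth, 1) >= g->max_queue_depth) {
                atomic_fetch_sub(&g->queue_depth, 1);
                return (0);             /* throttled */
        }
        return (1);
}

/* called when the allocation's zio completes */
void
group_alloc_release(struct alloc_group *g)
{
        atomic_fetch_sub(&g->queue_depth, 1);
}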
2703 2727
2704 2728 static uint64_t
2705 2729 metaslab_block_alloc(metaslab_t *msp, uint64_t size, uint64_t txg)
2706 2730 {
2707 2731 uint64_t start;
2708 2732 range_tree_t *rt = msp->ms_tree;
2709 2733 metaslab_class_t *mc = msp->ms_group->mg_class;
2710 2734
2711 2735 VERIFY(!msp->ms_condensing);
2712 2736
2713 2737 start = mc->mc_ops->msop_alloc(msp, size);
2714 2738 if (start != -1ULL) {
2715 2739 metaslab_group_t *mg = msp->ms_group;
2716 2740 vdev_t *vd = mg->mg_vd;
2717 2741
2718 2742 VERIFY0(P2PHASE(start, 1ULL << vd->vdev_ashift));
2719 2743 VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
2720 2744 VERIFY3U(range_tree_space(rt) - size, <=, msp->ms_size);
2721 2745 range_tree_remove(rt, start, size);
2746 + metaslab_trim_remove(msp, start, size);
2722 2747
2723 2748 if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
2724 2749 vdev_dirty(mg->mg_vd, VDD_METASLAB, msp, txg);
2725 2750
2726 2751 range_tree_add(msp->ms_alloctree[txg & TXG_MASK], start, size);
2727 2752
2728 2753 /* Track the last successful allocation */
2729 2754 msp->ms_alloc_txg = txg;
2730 2755 metaslab_verify_space(msp, txg);
2731 2756 }
2732 2757
(1 line elided)
2733 2758 /*
2734 2759 * Now that we've attempted the allocation we need to update the
2735 2760 * metaslab's maximum block size since it may have changed.
2736 2761 */
2737 2762 msp->ms_max_size = metaslab_block_maxsize(msp);
2738 2763 return (start);
2739 2764 }
2740 2765
2741 2766 static uint64_t
2742 2767 metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
2743 - uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
2768 + uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d,
2769 + int flags)
2744 2770 {
2745 2771 metaslab_t *msp = NULL;
2746 2772 uint64_t offset = -1ULL;
2747 2773 uint64_t activation_weight;
2748 2774 uint64_t target_distance;
2749 2775 int i;
2750 2776
2751 2777 activation_weight = METASLAB_WEIGHT_PRIMARY;
2752 2778 for (i = 0; i < d; i++) {
2753 2779 if (DVA_GET_VDEV(&dva[i]) == mg->mg_vd->vdev_id) {
2754 2780 activation_weight = METASLAB_WEIGHT_SECONDARY;
2755 2781 break;
2756 2782 }
2757 2783 }
2758 2784
2759 2785 metaslab_t *search = kmem_alloc(sizeof (*search), KM_SLEEP);
2760 2786 search->ms_weight = UINT64_MAX;
2761 2787 search->ms_start = 0;
2762 2788 for (;;) {
2763 2789 boolean_t was_active;
2790 + boolean_t pass_primary = B_TRUE;
2764 2791 avl_tree_t *t = &mg->mg_metaslab_tree;
2765 2792 avl_index_t idx;
2766 2793
2767 2794 mutex_enter(&mg->mg_lock);
2768 2795
2769 2796 /*
2770 2797 * Find the metaslab with the highest weight that is less
2771 2798 * than what we've already tried. In the common case, this
2772 2799 * means that we will examine each metaslab at most once.
2773 2800 * Note that concurrent callers could reorder metaslabs
2774 2801 * by activation/passivation once we have dropped the mg_lock.
2775 2802 * If a metaslab is activated by another thread, and we fail
2776 2803 * to allocate from the metaslab we have selected, we may
2777 2804 * not try the newly-activated metaslab, and instead activate
2778 2805 * another metaslab. This is not optimal, but generally
2779 2806 * does not cause any problems (a possible exception being
2780 2807 * if every metaslab is completely full except for
2781 2808 * the newly-activated metaslab which we fail to examine).
2782 2809 */
2783 2810 msp = avl_find(t, search, &idx);
2784 2811 if (msp == NULL)
2785 2812 msp = avl_nearest(t, idx, AVL_AFTER);
2786 2813 for (; msp != NULL; msp = AVL_NEXT(t, msp)) {
2787 2814
2788 2815 if (!metaslab_should_allocate(msp, asize)) {
2789 2816 metaslab_trace_add(zal, mg, msp, asize, d,
2790 2817 TRACE_TOO_SMALL);
(17 lines elided)
2791 2818 continue;
2792 2819 }
2793 2820
2794 2821 /*
2795 2822 * If the selected metaslab is condensing, skip it.
2796 2823 */
2797 2824 if (msp->ms_condensing)
2798 2825 continue;
2799 2826
2800 2827 was_active = msp->ms_weight & METASLAB_ACTIVE_MASK;
2801 - if (activation_weight == METASLAB_WEIGHT_PRIMARY)
2802 - break;
2828 + if (flags & METASLAB_USE_WEIGHT_SECONDARY) {
2829 + if (!pass_primary) {
2830 + DTRACE_PROBE(metaslab_use_secondary);
2831 + activation_weight =
2832 + METASLAB_WEIGHT_SECONDARY;
2833 + break;
2834 + }
2803 2835
2804 - target_distance = min_distance +
2805 - (space_map_allocated(msp->ms_sm) != 0 ? 0 :
2806 - min_distance >> 1);
2836 + pass_primary = B_FALSE;
2837 + } else {
2838 + if (activation_weight ==
2839 + METASLAB_WEIGHT_PRIMARY)
2840 + break;
2807 2841
2808 - for (i = 0; i < d; i++) {
2809 - if (metaslab_distance(msp, &dva[i]) <
2810 - target_distance)
2842 + target_distance = min_distance +
2843 + (space_map_allocated(msp->ms_sm) != 0 ? 0 :
2844 + min_distance >> 1);
2845 +
2846 + for (i = 0; i < d; i++)
2847 + if (metaslab_distance(msp, &dva[i]) <
2848 + target_distance)
2849 + break;
2850 + if (i == d)
2811 2851 break;
2812 2852 }
2813 - if (i == d)
2814 - break;
2815 2853 }
2816 2854 mutex_exit(&mg->mg_lock);
2817 2855 if (msp == NULL) {
2818 2856 kmem_free(search, sizeof (*search));
2819 2857 return (-1ULL);
2820 2858 }
2821 2859 search->ms_weight = msp->ms_weight;
2822 2860 search->ms_start = msp->ms_start + 1;
2823 2861
2824 2862 mutex_enter(&msp->ms_lock);
2825 2863
2826 2864 /*
2827 2865 * Ensure that the metaslab we have selected is still
2828 2866 * capable of handling our request. It's possible that
2829 2867 * another thread may have changed the weight while we
2830 2868 * were blocked on the metaslab lock. We check the
2831 2869 * active status first to see if we need to reselect
2832 2870 * a new metaslab.
2833 2871 */
2834 2872 if (was_active && !(msp->ms_weight & METASLAB_ACTIVE_MASK)) {
2835 2873 mutex_exit(&msp->ms_lock);
2836 2874 continue;
2837 2875 }
2838 2876
2839 2877 if ((msp->ms_weight & METASLAB_WEIGHT_SECONDARY) &&
2840 2878 activation_weight == METASLAB_WEIGHT_PRIMARY) {
2841 2879 metaslab_passivate(msp,
2842 2880 msp->ms_weight & ~METASLAB_ACTIVE_MASK);
2843 2881 mutex_exit(&msp->ms_lock);
2844 2882 continue;
2845 2883 }
2846 2884
2847 2885 if (metaslab_activate(msp, activation_weight) != 0) {
2848 2886 mutex_exit(&msp->ms_lock);
2849 2887 continue;
2850 2888 }
2851 2889 msp->ms_selected_txg = txg;
2852 2890
2853 2891 /*
2854 2892 * Now that we have the lock, recheck to see if we should
2855 2893 * continue to use this metaslab for this allocation. The
2856 2894 * metaslab is now loaded so metaslab_should_allocate() can
2857 2895 * accurately determine if the allocation attempt should
2858 2896 * proceed.
2859 2897 */
2860 2898 if (!metaslab_should_allocate(msp, asize)) {
2861 2899 /* Passivate this metaslab and select a new one. */
2862 2900 metaslab_trace_add(zal, mg, msp, asize, d,
2863 2901 TRACE_TOO_SMALL);
2864 2902 goto next;
2865 2903 }
2866 2904
2867 2905 /*
2868 2906 * If this metaslab is currently condensing then pick again as
2869 2907 * we can't manipulate this metaslab until it's committed
2870 2908 * to disk.
2871 2909 */
2872 2910 if (msp->ms_condensing) {
2873 2911 metaslab_trace_add(zal, mg, msp, asize, d,
2874 2912 TRACE_CONDENSING);
2875 2913 mutex_exit(&msp->ms_lock);
2876 2914 continue;
2877 2915 }
2878 2916
2879 2917 offset = metaslab_block_alloc(msp, asize, txg);
2880 2918 metaslab_trace_add(zal, mg, msp, asize, d, offset);
2881 2919
2882 2920 if (offset != -1ULL) {
2883 2921 /* Proactively passivate the metaslab, if needed */
2884 2922 metaslab_segment_may_passivate(msp);
2885 2923 break;
2886 2924 }
2887 2925 next:
2888 2926 ASSERT(msp->ms_loaded);
2889 2927
2890 2928 /*
2891 2929 * We were unable to allocate from this metaslab so determine
2892 2930 * a new weight for this metaslab. Now that we have loaded
2893 2931 * the metaslab we can provide a better hint to the metaslab
2894 2932 * selector.
2895 2933 *
2896 2934 * For space-based metaslabs, we use the maximum block size.
2897 2935 * This information is only available when the metaslab
2898 2936 * is loaded and is more accurate than the generic free
2899 2937 * space weight that was calculated by metaslab_weight().
2900 2938 * This information allows us to quickly compare the maximum
2901 2939 * available allocation in the metaslab to the allocation
2902 2940 * size being requested.
2903 2941 *
2904 2942 * For segment-based metaslabs, determine the new weight
2905 2943 * based on the highest bucket in the range tree. We
2906 2944 * explicitly use the loaded segment weight (i.e. the range
2907 2945 * tree histogram) since it contains the space that is
2908 2946 * currently available for allocation and is accurate
2909 2947 * even within a sync pass.
2910 2948 */
2911 2949 if (WEIGHT_IS_SPACEBASED(msp->ms_weight)) {
2912 2950 uint64_t weight = metaslab_block_maxsize(msp);
2913 2951 WEIGHT_SET_SPACEBASED(weight);
2914 2952 metaslab_passivate(msp, weight);
2915 2953 } else {
2916 2954 metaslab_passivate(msp,
2917 2955 metaslab_weight_from_range_tree(msp));
2918 2956 }
2919 2957
2920 2958 /*
2921 2959 * We have just failed an allocation attempt, check
2922 2960 * that metaslab_should_allocate() agrees. Otherwise,
2923 2961 * we may end up in an infinite loop retrying the same
2924 2962 * metaslab.
2925 2963 */
(101 lines elided)
2926 2964 ASSERT(!metaslab_should_allocate(msp, asize));
2927 2965 mutex_exit(&msp->ms_lock);
2928 2966 }
2929 2967 mutex_exit(&msp->ms_lock);
2930 2968 kmem_free(search, sizeof (*search));
2931 2969 return (offset);
2932 2970 }
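
The METASLAB_USE_WEIGHT_SECONDARY handling added above deliberately lets the first eligible candidate pass and activates the next one with a secondary weight; judging by its use in metaslab_alloc() below, this keeps the normal-class companion copies of special blocks off the primary metaslab. A compact, stand-alone sketch of just that two-pass selection follows; a plain array stands in for the AVL tree, the eligibility and locking checks are omitted, and all names are illustrative.

    #include <stdio.h>
    #include <stdbool.h>

    int
    main(void)
    {
    	int weights[] = { 90, 80, 70, 60 };	/* eligible candidates, best first */
    	int ncand = sizeof (weights) / sizeof (weights[0]);
    	bool pass_primary = true;		/* mirrors the flag added above */
    	int chosen = -1;

    	for (int i = 0; i < ncand; i++) {
    		if (!pass_primary) {
    			chosen = i;		/* would be activated SECONDARY */
    			break;
    		}
    		pass_primary = false;		/* let the primary candidate go by */
    	}
    	if (chosen >= 0)
    		printf("picked candidate %d (weight %d) for secondary activation\n",
    		    chosen, weights[chosen]);
    	return (0);
    }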
2933 2971
2934 2972 static uint64_t
2935 2973 metaslab_group_alloc(metaslab_group_t *mg, zio_alloc_list_t *zal,
2936 - uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva, int d)
2974 + uint64_t asize, uint64_t txg, uint64_t min_distance, dva_t *dva,
2975 + int d, int flags)
2937 2976 {
2938 2977 uint64_t offset;
2939 2978 ASSERT(mg->mg_initialized);
2940 2979
2941 2980 offset = metaslab_group_alloc_normal(mg, zal, asize, txg,
2942 - min_distance, dva, d);
2981 + min_distance, dva, d, flags);
2943 2982
2944 2983 mutex_enter(&mg->mg_lock);
2945 2984 if (offset == -1ULL) {
2946 2985 mg->mg_failed_allocations++;
2947 2986 metaslab_trace_add(zal, mg, NULL, asize, d,
2948 2987 TRACE_GROUP_FAILURE);
2949 2988 if (asize == SPA_GANGBLOCKSIZE) {
2950 2989 /*
2951 2990 * This metaslab group was unable to allocate
2952 2991 * the minimum gang block size so it must be out of
2953 2992 * space. We must notify the allocation throttle
2954 2993 * to start skipping allocation attempts to this
2955 2994 * metaslab group until more space becomes available.
2956 2995 * Note: this failure cannot be caused by the
2957 2996 * allocation throttle since the allocation throttle
2958 2997 * is only responsible for skipping devices and
2959 2998 * not failing block allocations.
2960 2999 */
2961 3000 mg->mg_no_free_space = B_TRUE;
2962 3001 }
2963 3002 }
2964 3003 mg->mg_allocations++;
2965 3004 mutex_exit(&mg->mg_lock);
2966 3005 return (offset);
2967 3006 }
2968 3007
2969 3008 /*
(17 lines elided)
2970 3009 * If we have to write a ditto block (i.e. more than one DVA for a given BP)
2971 3010 * on the same vdev as an existing DVA of this BP, then try to allocate it
2972 3011 * at least (vdev_asize / (2 ^ ditto_same_vdev_distance_shift)) away from the
2973 3012 * existing DVAs.
2974 3013 */
2975 3014 int ditto_same_vdev_distance_shift = 3;
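
To make the shift concrete: with the default value of 3, the allocator below asks each additional DVA to land at least asize/8 away from the existing ones, and drops the requirement entirely once that distance would be no larger than a single metaslab. A small stand-alone sketch of the arithmetic, with the vdev and metaslab sizes assumed purely for illustration:

    #include <stdio.h>
    #include <inttypes.h>

    int
    main(void)
    {
    	uint64_t vdev_asize = 1ULL << 40;	/* assume a 1 TiB top-level vdev */
    	uint64_t ms_shift = 34;			/* assume 16 GiB metaslabs */
    	int shift = 3;				/* ditto_same_vdev_distance_shift */
    	uint64_t distance = vdev_asize >> shift;

    	if (distance <= (1ULL << ms_shift))
    		distance = 0;			/* too small to be worth enforcing */
    	printf("min ditto distance: %" PRIu64 " bytes\n", distance);
    	return (0);
    }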
2976 3015
2977 3016 /*
2978 3017 * Allocate a block for the specified i/o.
2979 3018 */
2980 -int
3019 +static int
2981 3020 metaslab_alloc_dva(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
2982 3021 dva_t *dva, int d, dva_t *hintdva, uint64_t txg, int flags,
2983 3022 zio_alloc_list_t *zal)
2984 3023 {
2985 3024 metaslab_group_t *mg, *rotor;
2986 3025 vdev_t *vd;
2987 3026 boolean_t try_hard = B_FALSE;
2988 3027
2989 3028 ASSERT(!DVA_IS_VALID(&dva[d]));
2990 3029
2991 3030 /*
2992 3031 * For testing, make some blocks above a certain size be gang blocks.
2993 3032 */
2994 3033 if (psize >= metaslab_gang_bang && (ddi_get_lbolt() & 3) == 0) {
2995 3034 metaslab_trace_add(zal, NULL, NULL, psize, d, TRACE_FORCE_GANG);
2996 3035 return (SET_ERROR(ENOSPC));
2997 3036 }
2998 3037
2999 3038 /*
3000 3039 * Start at the rotor and loop through all mgs until we find something.
3001 3040 * Note that there's no locking on mc_rotor or mc_aliquot because
3002 3041 * nothing actually breaks if we miss a few updates -- we just won't
3003 3042 * allocate quite as evenly. It all balances out over time.
3004 3043 *
3005 3044 * If we are doing ditto or log blocks, try to spread them across
3006 3045 * consecutive vdevs. If we're forced to reuse a vdev before we've
3007 3046 * allocated all of our ditto blocks, then try and spread them out on
3008 3047 * that vdev as much as possible. If it turns out to not be possible,
3009 3048 * gradually lower our standards until anything becomes acceptable.
3010 3049 * Also, allocating on consecutive vdevs (as opposed to random vdevs)
3011 3050 * gives us hope of containing our fault domains to something we're
3012 3051 * able to reason about. Otherwise, any two top-level vdev failures
3013 3052 * will guarantee the loss of data. With consecutive allocation,
3014 3053 * only two adjacent top-level vdev failures will result in data loss.
3015 3054 *
(25 lines elided)
3016 3055 * If we are doing gang blocks (hintdva is non-NULL), try to keep
3017 3056 * ourselves on the same vdev as our gang block header. That
3018 3057 * way, we can hope for locality in vdev_cache, plus it makes our
3019 3058 * fault domains something tractable.
3020 3059 */
3021 3060 if (hintdva) {
3022 3061 vd = vdev_lookup_top(spa, DVA_GET_VDEV(&hintdva[d]));
3023 3062
3024 3063 /*
3025 3064 * It's possible the vdev we're using as the hint no
3026 - * longer exists or its mg has been closed (e.g. by
3027 - * device removal). Consult the rotor when
3065 + * longer exists (i.e. removed). Consult the rotor when
3028 3066 * all else fails.
3029 3067 */
3030 - if (vd != NULL && vd->vdev_mg != NULL) {
3068 + if (vd != NULL) {
3031 3069 mg = vd->vdev_mg;
3032 3070
3033 3071 if (flags & METASLAB_HINTBP_AVOID &&
3034 3072 mg->mg_next != NULL)
3035 3073 mg = mg->mg_next;
3036 3074 } else {
3037 3075 mg = mc->mc_rotor;
3038 3076 }
3039 3077 } else if (d != 0) {
3040 3078 vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
3041 3079 mg = vd->vdev_mg->mg_next;
3042 3080 } else {
3043 3081 mg = mc->mc_rotor;
3044 3082 }
3045 3083
3046 3084 /*
3047 3085 * If the hint put us into the wrong metaslab class, or into a
3048 3086 * metaslab group that has been passivated, just follow the rotor.
3049 3087 */
3050 3088 if (mg->mg_class != mc || mg->mg_activation_count <= 0)
3051 3089 mg = mc->mc_rotor;
3052 3090
3053 3091 rotor = mg;
3054 3092 top:
3055 3093 do {
3056 3094 boolean_t allocatable;
3057 3095
3058 3096 ASSERT(mg->mg_activation_count == 1);
3059 3097 vd = mg->mg_vd;
3060 3098
3061 3099 /*
3062 3100 * Don't allocate from faulted devices.
3063 3101 */
3064 3102 if (try_hard) {
3065 3103 spa_config_enter(spa, SCL_ZIO, FTAG, RW_READER);
3066 3104 allocatable = vdev_allocatable(vd);
3067 3105 spa_config_exit(spa, SCL_ZIO, FTAG);
3068 3106 } else {
3069 3107 allocatable = vdev_allocatable(vd);
3070 3108 }
3071 3109
3072 3110 /*
3073 3111 * Determine if the selected metaslab group is eligible
3074 3112 * for allocations. If we're ganging then don't allow
3075 3113 * this metaslab group to skip allocations since that would
3076 3114 * inadvertently return ENOSPC and suspend the pool
3077 3115 * even though space is still available.
3078 3116 */
3079 3117 if (allocatable && !GANG_ALLOCATION(flags) && !try_hard) {
3080 3118 allocatable = metaslab_group_allocatable(mg, rotor,
3081 3119 psize);
3082 3120 }
3083 3121
3084 3122 if (!allocatable) {
3085 3123 metaslab_trace_add(zal, mg, NULL, psize, d,
3086 3124 TRACE_NOT_ALLOCATABLE);
3087 3125 goto next;
3088 3126 }
3089 3127
3090 3128 ASSERT(mg->mg_initialized);
3091 3129
3092 3130 /*
3093 3131 * Avoid writing single-copy data to a failing,
3094 3132 * non-redundant vdev, unless we've already tried all
3095 3133 * other vdevs.
3096 3134 */
3097 3135 if ((vd->vdev_stat.vs_write_errors > 0 ||
3098 3136 vd->vdev_state < VDEV_STATE_HEALTHY) &&
3099 3137 d == 0 && !try_hard && vd->vdev_children == 0) {
3100 3138 metaslab_trace_add(zal, mg, NULL, psize, d,
3101 3139 TRACE_VDEV_ERROR);
3102 3140 goto next;
3103 3141 }
3104 3142
3105 3143 ASSERT(mg->mg_class == mc);
3106 3144
3107 3145 /*
3108 3146 * If we don't need to try hard, then require that the
3109 3147 * block be 1/8th of the device away from any other DVAs
3110 3148 * in this BP. If we are trying hard, allow any offset
3111 3149 * to be used (distance=0).
3112 3150 */
3113 3151 uint64_t distance = 0;
3114 3152 if (!try_hard) {
(74 lines elided)
3115 3153 distance = vd->vdev_asize >>
3116 3154 ditto_same_vdev_distance_shift;
3117 3155 if (distance <= (1ULL << vd->vdev_ms_shift))
3118 3156 distance = 0;
3119 3157 }
3120 3158
3121 3159 uint64_t asize = vdev_psize_to_asize(vd, psize);
3122 3160 ASSERT(P2PHASE(asize, 1ULL << vd->vdev_ashift) == 0);
3123 3161
3124 3162 uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
3125 - distance, dva, d);
3163 + distance, dva, d, flags);
3126 3164
3127 3165 if (offset != -1ULL) {
3128 3166 /*
3129 3167 * If we've just selected this metaslab group,
3130 3168 * figure out whether the corresponding vdev is
3131 3169 * over- or under-used relative to the pool,
3132 3170 * and set an allocation bias to even it out.
3133 3171 */
3134 3172 if (mc->mc_aliquot == 0 && metaslab_bias_enabled) {
3135 3173 vdev_stat_t *vs = &vd->vdev_stat;
3136 - int64_t vu, cu;
3174 + vdev_stat_t *pvs = &vd->vdev_parent->vdev_stat;
3175 + int64_t vu, cu, vu_io;
3137 3176
3138 3177 vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
3139 3178 cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);
3179 + vu_io =
3180 + (((vs->vs_iotime[ZIO_TYPE_WRITE] * 100) /
3181 + (pvs->vs_iotime[ZIO_TYPE_WRITE] + 1)) *
3182 + (vd->vdev_parent->vdev_children)) - 100;
3140 3183
3141 3184 /*
3142 3185 * Calculate how much more or less we should
3143 3186 * try to allocate from this device during
3144 3187 * this iteration around the rotor.
3145 3188 * For example, if a device is 80% full
3146 3189 * and the pool is 20% full then we should
3147 3190 * reduce allocations by 60% on this device.
3148 3191 *
3149 3192 * mg_bias = (20 - 80) * 512K / 100 = -307K
3150 3193 *
3151 3194 * This reduces allocations by 307K for this
3152 3195 * iteration.
3153 3196 */
3154 3197 mg->mg_bias = ((cu - vu) *
3155 3198 (int64_t)mg->mg_aliquot) / 100;
3199 +
3200 + /*
3201 + * Experiment: space-based DVA allocator 0,
3202 + * latency-based 1 or hybrid 2.
3203 + */
3204 + switch (metaslab_alloc_dva_algorithm) {
3205 + case 1:
3206 + mg->mg_bias =
3207 + (vu_io * (int64_t)mg->mg_aliquot) /
3208 + 100;
3209 + break;
3210 + case 2:
3211 + mg->mg_bias =
3212 + ((((cu - vu) + vu_io) / 2) *
3213 + (int64_t)mg->mg_aliquot) / 100;
3214 + break;
3215 + default:
3216 + break;
3217 + }
3156 3218 } else if (!metaslab_bias_enabled) {
3157 3219 mg->mg_bias = 0;
3158 3220 }
3159 3221
3160 3222 if (atomic_add_64_nv(&mc->mc_aliquot, asize) >=
3161 3223 mg->mg_aliquot + mg->mg_bias) {
3162 3224 mc->mc_rotor = mg->mg_next;
3163 3225 mc->mc_aliquot = 0;
3164 3226 }
3165 3227
3166 3228 DVA_SET_VDEV(&dva[d], vd->vdev_id);
3167 3229 DVA_SET_OFFSET(&dva[d], offset);
3168 3230 DVA_SET_GANG(&dva[d], !!(flags & METASLAB_GANG_HEADER));
3169 3231 DVA_SET_ASIZE(&dva[d], asize);
3232 + DTRACE_PROBE3(alloc_dva_probe, uint64_t, vd->vdev_id,
3233 + uint64_t, offset, uint64_t, psize);
3170 3234
3171 3235 return (0);
3172 3236 }
3173 3237 next:
3174 3238 mc->mc_rotor = mg->mg_next;
3175 3239 mc->mc_aliquot = 0;
3176 3240 } while ((mg = mg->mg_next) != rotor);
3177 3241
3178 3242 /*
3179 3243 * If we haven't tried hard, do so now.
3180 3244 */
3181 3245 if (!try_hard) {
(2 lines elided)
3182 3246 try_hard = B_TRUE;
3183 3247 goto top;
3184 3248 }
3185 3249
3186 3250 bzero(&dva[d], sizeof (dva_t));
3187 3251
3188 3252 metaslab_trace_add(zal, rotor, NULL, psize, d, TRACE_ENOSPC);
3189 3253 return (SET_ERROR(ENOSPC));
3190 3254 }
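
The metaslab_alloc_dva_algorithm switch above selects between the original space-based bias, a latency-based bias built from the per-vdev write iotime share, and a hybrid of the two. A stand-alone numeric sketch of the three formulas, using assumed sample values (a device 80% full in a 20% full pool, carrying 30 points more than its fair share of write iotime); this only reproduces the arithmetic, not the kernel code:

    #include <stdio.h>
    #include <inttypes.h>

    int
    main(void)
    {
    	int64_t vu = 80;		/* device space used, percent */
    	int64_t cu = 20;		/* class space used, percent */
    	int64_t vu_io = 30;		/* write-iotime share above fair, points */
    	int64_t aliquot = 512 << 10;	/* 512K per rotor visit */
    	int64_t bias;

    	bias = ((cu - vu) * aliquot) / 100;		/* algorithm 0: space */
    	printf("space-based bias:   %" PRId64 "\n", bias);	/* ~ -307K */

    	bias = (vu_io * aliquot) / 100;			/* algorithm 1: latency */
    	printf("latency-based bias: %" PRId64 "\n", bias);

    	bias = ((((cu - vu) + vu_io) / 2) * aliquot) / 100;	/* algorithm 2: hybrid */
    	printf("hybrid bias:        %" PRId64 "\n", bias);
    	return (0);
    }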
3191 3255
3192 -void
3193 -metaslab_free_concrete(vdev_t *vd, uint64_t offset, uint64_t asize,
3194 - uint64_t txg)
3195 -{
3196 - metaslab_t *msp;
3197 - spa_t *spa = vd->vdev_spa;
3198 -
3199 - ASSERT3U(txg, ==, spa->spa_syncing_txg);
3200 - ASSERT(vdev_is_concrete(vd));
3201 - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3202 - ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3203 -
3204 - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3205 -
3206 - VERIFY(!msp->ms_condensing);
3207 - VERIFY3U(offset, >=, msp->ms_start);
3208 - VERIFY3U(offset + asize, <=, msp->ms_start + msp->ms_size);
3209 - VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3210 - VERIFY0(P2PHASE(asize, 1ULL << vd->vdev_ashift));
3211 -
3212 - metaslab_check_free_impl(vd, offset, asize);
3213 - mutex_enter(&msp->ms_lock);
3214 - if (range_tree_space(msp->ms_freeingtree) == 0) {
3215 - vdev_dirty(vd, VDD_METASLAB, msp, txg);
3216 - }
3217 - range_tree_add(msp->ms_freeingtree, offset, asize);
3218 - mutex_exit(&msp->ms_lock);
3219 -}
3220 -
3221 -/* ARGSUSED */
3222 -void
3223 -metaslab_free_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3224 - uint64_t size, void *arg)
3225 -{
3226 - uint64_t *txgp = arg;
3227 -
3228 - if (vd->vdev_ops->vdev_op_remap != NULL)
3229 - vdev_indirect_mark_obsolete(vd, offset, size, *txgp);
3230 - else
3231 - metaslab_free_impl(vd, offset, size, *txgp);
3232 -}
3233 -
3234 -static void
3235 -metaslab_free_impl(vdev_t *vd, uint64_t offset, uint64_t size,
3236 - uint64_t txg)
3237 -{
3238 - spa_t *spa = vd->vdev_spa;
3239 -
3240 - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3241 -
3242 - if (txg > spa_freeze_txg(spa))
3243 - return;
3244 -
3245 - if (spa->spa_vdev_removal != NULL &&
3246 - spa->spa_vdev_removal->svr_vdev == vd &&
3247 - vdev_is_concrete(vd)) {
3248 - /*
3249 - * Note: we check if the vdev is concrete because when
3250 - * we complete the removal, we first change the vdev to be
3251 - * an indirect vdev (in open context), and then (in syncing
3252 - * context) clear spa_vdev_removal.
3253 - */
3254 - free_from_removing_vdev(vd, offset, size, txg);
3255 - } else if (vd->vdev_ops->vdev_op_remap != NULL) {
3256 - vdev_indirect_mark_obsolete(vd, offset, size, txg);
3257 - vd->vdev_ops->vdev_op_remap(vd, offset, size,
3258 - metaslab_free_impl_cb, &txg);
3259 - } else {
3260 - metaslab_free_concrete(vd, offset, size, txg);
3261 - }
3262 -}
3263 -
3264 -typedef struct remap_blkptr_cb_arg {
3265 - blkptr_t *rbca_bp;
3266 - spa_remap_cb_t rbca_cb;
3267 - vdev_t *rbca_remap_vd;
3268 - uint64_t rbca_remap_offset;
3269 - void *rbca_cb_arg;
3270 -} remap_blkptr_cb_arg_t;
3271 -
3272 -void
3273 -remap_blkptr_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3274 - uint64_t size, void *arg)
3275 -{
3276 - remap_blkptr_cb_arg_t *rbca = arg;
3277 - blkptr_t *bp = rbca->rbca_bp;
3278 -
3279 - /* We can not remap split blocks. */
3280 - if (size != DVA_GET_ASIZE(&bp->blk_dva[0]))
3281 - return;
3282 - ASSERT0(inner_offset);
3283 -
3284 - if (rbca->rbca_cb != NULL) {
3285 - /*
3286 - * At this point we know that we are not handling split
3287 - * blocks and we invoke the callback on the previous
3288 - * vdev which must be indirect.
3289 - */
3290 - ASSERT3P(rbca->rbca_remap_vd->vdev_ops, ==, &vdev_indirect_ops);
3291 -
3292 - rbca->rbca_cb(rbca->rbca_remap_vd->vdev_id,
3293 - rbca->rbca_remap_offset, size, rbca->rbca_cb_arg);
3294 -
3295 - /* set up remap_blkptr_cb_arg for the next call */
3296 - rbca->rbca_remap_vd = vd;
3297 - rbca->rbca_remap_offset = offset;
3298 - }
3299 -
3300 - /*
3301 - * The phys birth time is that of dva[0]. This ensures that we know
3302 - * when each dva was written, so that resilver can determine which
3303 - * blocks need to be scrubbed (i.e. those written during the time
3304 - * the vdev was offline). It also ensures that the key used in
3305 - * the ARC hash table is unique (i.e. dva[0] + phys_birth). If
3306 - * we didn't change the phys_birth, a lookup in the ARC for a
3307 - * remapped BP could find the data that was previously stored at
3308 - * this vdev + offset.
3309 - */
3310 - vdev_t *oldvd = vdev_lookup_top(vd->vdev_spa,
3311 - DVA_GET_VDEV(&bp->blk_dva[0]));
3312 - vdev_indirect_births_t *vib = oldvd->vdev_indirect_births;
3313 - bp->blk_phys_birth = vdev_indirect_births_physbirth(vib,
3314 - DVA_GET_OFFSET(&bp->blk_dva[0]), DVA_GET_ASIZE(&bp->blk_dva[0]));
3315 -
3316 - DVA_SET_VDEV(&bp->blk_dva[0], vd->vdev_id);
3317 - DVA_SET_OFFSET(&bp->blk_dva[0], offset);
3318 -}
3319 -
3320 3256 /*
3321 - * If the block pointer contains any indirect DVAs, modify them to refer to
3322 - * concrete DVAs. Note that this will sometimes not be possible, leaving
3323 - * the indirect DVA in place. This happens if the indirect DVA spans multiple
3324 - * segments in the mapping (i.e. it is a "split block").
3325 - *
3326 - * If the BP was remapped, calls the callback on the original dva (note the
3327 - * callback can be called multiple times if the original indirect DVA refers
3328 - * to another indirect DVA, etc).
3329 - *
3330 - * Returns TRUE if the BP was remapped.
3257 + * Free the block represented by DVA in the context of the specified
3258 + * transaction group.
3331 3259 */
3332 -boolean_t
3333 -spa_remap_blkptr(spa_t *spa, blkptr_t *bp, spa_remap_cb_t callback, void *arg)
3334 -{
3335 - remap_blkptr_cb_arg_t rbca;
3336 -
3337 - if (!zfs_remap_blkptr_enable)
3338 - return (B_FALSE);
3339 -
3340 - if (!spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS))
3341 - return (B_FALSE);
3342 -
3343 - /*
3344 - * Dedup BP's can not be remapped, because ddt_phys_select() depends
3345 - * on DVA[0] being the same in the BP as in the DDT (dedup table).
3346 - */
3347 - if (BP_GET_DEDUP(bp))
3348 - return (B_FALSE);
3349 -
3350 - /*
3351 - * Gang blocks can not be remapped, because
3352 - * zio_checksum_gang_verifier() depends on the DVA[0] that's in
3353 - * the BP used to read the gang block header (GBH) being the same
3354 - * as the DVA[0] that we allocated for the GBH.
3355 - */
3356 - if (BP_IS_GANG(bp))
3357 - return (B_FALSE);
3358 -
3359 - /*
3360 - * Embedded BP's have no DVA to remap.
3361 - */
3362 - if (BP_GET_NDVAS(bp) < 1)
3363 - return (B_FALSE);
3364 -
3365 - /*
3366 - * Note: we only remap dva[0]. If we remapped other dvas, we
3367 - * would no longer know what their phys birth txg is.
3368 - */
3369 - dva_t *dva = &bp->blk_dva[0];
3370 -
3371 - uint64_t offset = DVA_GET_OFFSET(dva);
3372 - uint64_t size = DVA_GET_ASIZE(dva);
3373 - vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
3374 -
3375 - if (vd->vdev_ops->vdev_op_remap == NULL)
3376 - return (B_FALSE);
3377 -
3378 - rbca.rbca_bp = bp;
3379 - rbca.rbca_cb = callback;
3380 - rbca.rbca_remap_vd = vd;
3381 - rbca.rbca_remap_offset = offset;
3382 - rbca.rbca_cb_arg = arg;
3383 -
3384 - /*
3385 - * remap_blkptr_cb() will be called in order for each level of
3386 - * indirection, until a concrete vdev is reached or a split block is
3387 - * encountered. old_vd and old_offset are updated within the callback
3388 - * as we go from the one indirect vdev to the next one (either concrete
3389 - * or indirect again) in that order.
3390 - */
3391 - vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);
3392 -
3393 - /* Check if the DVA wasn't remapped because it is a split block */
3394 - if (DVA_GET_VDEV(&rbca.rbca_bp->blk_dva[0]) == vd->vdev_id)
3395 - return (B_FALSE);
3396 -
3397 - return (B_TRUE);
3398 -}
3399 -
3400 -/*
3401 - * Undo the allocation of a DVA which happened in the given transaction group.
3402 - */
3403 3260 void
3404 -metaslab_unalloc_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3261 +metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg, boolean_t now)
3405 3262 {
3406 - metaslab_t *msp;
3407 - vdev_t *vd;
3408 3263 uint64_t vdev = DVA_GET_VDEV(dva);
3409 3264 uint64_t offset = DVA_GET_OFFSET(dva);
3410 3265 uint64_t size = DVA_GET_ASIZE(dva);
3266 + vdev_t *vd;
3267 + metaslab_t *msp;
3411 3268
3269 + DTRACE_PROBE3(free_dva_probe, uint64_t, vdev,
3270 + uint64_t, offset, uint64_t, size);
3271 +
3412 3272 ASSERT(DVA_IS_VALID(dva));
3413 - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3414 3273
3415 3274 if (txg > spa_freeze_txg(spa))
3416 3275 return;
3417 3276
3418 3277 if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
3419 3278 (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count) {
3420 3279 cmn_err(CE_WARN, "metaslab_free_dva(): bad DVA %llu:%llu",
3421 3280 (u_longlong_t)vdev, (u_longlong_t)offset);
3422 3281 ASSERT(0);
3423 3282 return;
3424 3283 }
3425 3284
3426 - ASSERT(!vd->vdev_removing);
3427 - ASSERT(vdev_is_concrete(vd));
3428 - ASSERT0(vd->vdev_indirect_config.vic_mapping_object);
3429 - ASSERT3P(vd->vdev_indirect_mapping, ==, NULL);
3285 + msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3430 3286
3431 3287 if (DVA_GET_GANG(dva))
3432 3288 size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3433 3289
3434 - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3435 -
3436 3290 mutex_enter(&msp->ms_lock);
3437 - range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
3438 - offset, size);
3439 3291
3440 - VERIFY(!msp->ms_condensing);
3441 - VERIFY3U(offset, >=, msp->ms_start);
3442 - VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
3443 - VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
3444 - msp->ms_size);
3445 - VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3446 - VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3447 - range_tree_add(msp->ms_tree, offset, size);
3292 + if (now) {
3293 + range_tree_remove(msp->ms_alloctree[txg & TXG_MASK],
3294 + offset, size);
3295 +
3296 + VERIFY(!msp->ms_condensing);
3297 + VERIFY3U(offset, >=, msp->ms_start);
3298 + VERIFY3U(offset + size, <=, msp->ms_start + msp->ms_size);
3299 + VERIFY3U(range_tree_space(msp->ms_tree) + size, <=,
3300 + msp->ms_size);
3301 + VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3302 + VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3303 + range_tree_add(msp->ms_tree, offset, size);
3304 + if (spa_get_auto_trim(spa) == SPA_AUTO_TRIM_ON &&
3305 + !vd->vdev_man_trimming)
3306 + metaslab_trim_add(msp, offset, size);
3307 + msp->ms_max_size = metaslab_block_maxsize(msp);
3308 + } else {
3309 + VERIFY3U(txg, ==, spa->spa_syncing_txg);
3310 + if (range_tree_space(msp->ms_freeingtree) == 0)
3311 + vdev_dirty(vd, VDD_METASLAB, msp, txg);
3312 + range_tree_add(msp->ms_freeingtree, offset, size);
3313 + }
3314 +
3448 3315 mutex_exit(&msp->ms_lock);
3449 3316 }
3450 3317
3451 3318 /*
3452 - * Free the block represented by DVA in the context of the specified
3453 - * transaction group.
3319 + * Intent log support: upon opening the pool after a crash, notify the SPA
3320 + * of blocks that the intent log has allocated for immediate write, but
3321 + * which are still considered free by the SPA because the last transaction
3322 + * group didn't commit yet.
3454 3323 */
3455 -void
3456 -metaslab_free_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3324 +static int
3325 +metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3457 3326 {
3458 3327 uint64_t vdev = DVA_GET_VDEV(dva);
3459 3328 uint64_t offset = DVA_GET_OFFSET(dva);
3460 3329 uint64_t size = DVA_GET_ASIZE(dva);
3461 - vdev_t *vd = vdev_lookup_top(spa, vdev);
3330 + vdev_t *vd;
3331 + metaslab_t *msp;
3332 + int error = 0;
3462 3333
3463 3334 ASSERT(DVA_IS_VALID(dva));
3464 - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3465 3335
3466 - if (DVA_GET_GANG(dva)) {
3336 + if ((vd = vdev_lookup_top(spa, vdev)) == NULL ||
3337 + (offset >> vd->vdev_ms_shift) >= vd->vdev_ms_count)
3338 + return (SET_ERROR(ENXIO));
3339 +
3340 + msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3341 +
3342 + if (DVA_GET_GANG(dva))
3467 3343 size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3344 +
3345 + mutex_enter(&msp->ms_lock);
3346 +
3347 + if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
3348 + error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
3349 +
3350 + if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
3351 + error = SET_ERROR(ENOENT);
3352 +
3353 + if (error || txg == 0) { /* txg == 0 indicates dry run */
3354 + mutex_exit(&msp->ms_lock);
3355 + return (error);
3468 3356 }
3469 3357
3470 - metaslab_free_impl(vd, offset, size, txg);
3358 + VERIFY(!msp->ms_condensing);
3359 + VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3360 + VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3361 + VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
3362 + range_tree_remove(msp->ms_tree, offset, size);
3363 + metaslab_trim_remove(msp, offset, size);
3364 +
3365 + if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */
3366 + if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
3367 + vdev_dirty(vd, VDD_METASLAB, msp, txg);
3368 + range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
3369 + }
3370 +
3371 + mutex_exit(&msp->ms_lock);
3372 +
3373 + return (0);
3471 3374 }
3472 3375
3473 3376 /*
3474 3377 * Reserve some allocation slots. The reservation system must be called
3475 3378 * before we call into the allocator. If there aren't any available slots
3476 3379 * then the I/O will be throttled until an I/O completes and its slots are
3477 3380 * freed up. The function returns true if it was successful in placing
3478 3381 * the reservation.
3479 3382 */
3480 3383 boolean_t
3481 3384 metaslab_class_throttle_reserve(metaslab_class_t *mc, int slots, zio_t *zio,
3482 3385 int flags)
3483 3386 {
3484 3387 uint64_t available_slots = 0;
3485 3388 boolean_t slot_reserved = B_FALSE;
3486 3389
3487 3390 ASSERT(mc->mc_alloc_throttle_enabled);
3488 3391 mutex_enter(&mc->mc_lock);
3489 3392
3490 3393 uint64_t reserved_slots = refcount_count(&mc->mc_alloc_slots);
3491 3394 if (reserved_slots < mc->mc_alloc_max_slots)
3492 3395 available_slots = mc->mc_alloc_max_slots - reserved_slots;
3493 3396
3494 3397 if (slots <= available_slots || GANG_ALLOCATION(flags)) {
3495 3398 /*
3496 3399 * We reserve the slots individually so that we can unreserve
3497 3400 * them individually when an I/O completes.
3498 3401 */
3499 3402 for (int d = 0; d < slots; d++) {
3500 3403 reserved_slots = refcount_add(&mc->mc_alloc_slots, zio);
3501 3404 }
3502 3405 zio->io_flags |= ZIO_FLAG_IO_ALLOCATING;
3503 3406 slot_reserved = B_TRUE;
3504 3407 }
3505 3408
3506 3409 mutex_exit(&mc->mc_lock);
3507 3410 return (slot_reserved);
3508 3411 }
3509 3412
3510 3413 void
(30 lines elided)
3511 3414 metaslab_class_throttle_unreserve(metaslab_class_t *mc, int slots, zio_t *zio)
3512 3415 {
3513 3416 ASSERT(mc->mc_alloc_throttle_enabled);
3514 3417 mutex_enter(&mc->mc_lock);
3515 3418 for (int d = 0; d < slots; d++) {
3516 3419 (void) refcount_remove(&mc->mc_alloc_slots, zio);
3517 3420 }
3518 3421 mutex_exit(&mc->mc_lock);
3519 3422 }
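
As the comment above the reserve routine notes, callers must obtain slots before entering the allocator and release them as each I/O completes. Below is a stand-alone model of that accounting with a plain counter in place of the refcount; the GANG_ALLOCATION bypass and per-zio tagging are omitted, and all names are illustrative.

    #include <stdio.h>
    #include <stdbool.h>

    static unsigned reserved, max_slots = 4;	/* stand-in for mc_alloc_max_slots */

    static bool
    throttle_reserve(unsigned slots)
    {
    	unsigned avail = (reserved < max_slots) ? max_slots - reserved : 0;

    	if (slots > avail)
    		return (false);		/* caller must wait for completions */
    	reserved += slots;		/* one unit per DVA being allocated */
    	return (true);
    }

    static void
    throttle_unreserve(unsigned slots)
    {
    	reserved -= slots;		/* called as each I/O completes */
    }

    int
    main(void)
    {
    	if (throttle_reserve(3))
    		printf("3 slots reserved, %u in use\n", reserved);
    	if (!throttle_reserve(2))
    		printf("second request throttled\n");
    	throttle_unreserve(3);
    	return (0);
    }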
3520 3423
3521 -static int
3522 -metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
3523 - uint64_t txg)
3524 -{
3525 - metaslab_t *msp;
3526 - spa_t *spa = vd->vdev_spa;
3527 - int error = 0;
3528 -
3529 - if (offset >> vd->vdev_ms_shift >= vd->vdev_ms_count)
3530 - return (ENXIO);
3531 -
3532 - ASSERT3P(vd->vdev_ms, !=, NULL);
3533 - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3534 -
3535 - mutex_enter(&msp->ms_lock);
3536 -
3537 - if ((txg != 0 && spa_writeable(spa)) || !msp->ms_loaded)
3538 - error = metaslab_activate(msp, METASLAB_WEIGHT_SECONDARY);
3539 -
3540 - if (error == 0 && !range_tree_contains(msp->ms_tree, offset, size))
3541 - error = SET_ERROR(ENOENT);
3542 -
3543 - if (error || txg == 0) { /* txg == 0 indicates dry run */
3544 - mutex_exit(&msp->ms_lock);
3545 - return (error);
3546 - }
3547 -
3548 - VERIFY(!msp->ms_condensing);
3549 - VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
3550 - VERIFY0(P2PHASE(size, 1ULL << vd->vdev_ashift));
3551 - VERIFY3U(range_tree_space(msp->ms_tree) - size, <=, msp->ms_size);
3552 - range_tree_remove(msp->ms_tree, offset, size);
3553 -
3554 - if (spa_writeable(spa)) { /* don't dirty if we're zdb(1M) */
3555 - if (range_tree_space(msp->ms_alloctree[txg & TXG_MASK]) == 0)
3556 - vdev_dirty(vd, VDD_METASLAB, msp, txg);
3557 - range_tree_add(msp->ms_alloctree[txg & TXG_MASK], offset, size);
3558 - }
3559 -
3560 - mutex_exit(&msp->ms_lock);
3561 -
3562 - return (0);
3563 -}
3564 -
3565 -typedef struct metaslab_claim_cb_arg_t {
3566 - uint64_t mcca_txg;
3567 - int mcca_error;
3568 -} metaslab_claim_cb_arg_t;
3569 -
3570 -/* ARGSUSED */
3571 -static void
3572 -metaslab_claim_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
3573 - uint64_t size, void *arg)
3574 -{
3575 - metaslab_claim_cb_arg_t *mcca_arg = arg;
3576 -
3577 - if (mcca_arg->mcca_error == 0) {
3578 - mcca_arg->mcca_error = metaslab_claim_concrete(vd, offset,
3579 - size, mcca_arg->mcca_txg);
3580 - }
3581 -}
3582 -
3583 3424 int
3584 -metaslab_claim_impl(vdev_t *vd, uint64_t offset, uint64_t size, uint64_t txg)
3585 -{
3586 - if (vd->vdev_ops->vdev_op_remap != NULL) {
3587 - metaslab_claim_cb_arg_t arg;
3588 -
3589 - /*
3590 - * Only zdb(1M) can claim on indirect vdevs. This is used
3591 - * to detect leaks of mapped space (that are not accounted
3592 - * for in the obsolete counts, spacemap, or bpobj).
3593 - */
3594 - ASSERT(!spa_writeable(vd->vdev_spa));
3595 - arg.mcca_error = 0;
3596 - arg.mcca_txg = txg;
3597 -
3598 - vd->vdev_ops->vdev_op_remap(vd, offset, size,
3599 - metaslab_claim_impl_cb, &arg);
3600 -
3601 - if (arg.mcca_error == 0) {
3602 - arg.mcca_error = metaslab_claim_concrete(vd,
3603 - offset, size, txg);
3604 - }
3605 - return (arg.mcca_error);
3606 - } else {
3607 - return (metaslab_claim_concrete(vd, offset, size, txg));
3608 - }
3609 -}
3610 -
3611 -/*
3612 - * Intent log support: upon opening the pool after a crash, notify the SPA
3613 - * of blocks that the intent log has allocated for immediate write, but
3614 - * which are still considered free by the SPA because the last transaction
3615 - * group didn't commit yet.
3616 - */
3617 -static int
3618 -metaslab_claim_dva(spa_t *spa, const dva_t *dva, uint64_t txg)
3619 -{
3620 - uint64_t vdev = DVA_GET_VDEV(dva);
3621 - uint64_t offset = DVA_GET_OFFSET(dva);
3622 - uint64_t size = DVA_GET_ASIZE(dva);
3623 - vdev_t *vd;
3624 -
3625 - if ((vd = vdev_lookup_top(spa, vdev)) == NULL) {
3626 - return (SET_ERROR(ENXIO));
3627 - }
3628 -
3629 - ASSERT(DVA_IS_VALID(dva));
3630 -
3631 - if (DVA_GET_GANG(dva))
3632 - size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3633 -
3634 - return (metaslab_claim_impl(vd, offset, size, txg));
3635 -}
3636 -
3637 -int
3638 3425 metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp,
3639 3426 int ndvas, uint64_t txg, blkptr_t *hintbp, int flags,
3640 3427 zio_alloc_list_t *zal, zio_t *zio)
3641 3428 {
3642 3429 dva_t *dva = bp->blk_dva;
3643 3430 dva_t *hintdva = hintbp->blk_dva;
3644 3431 int error = 0;
3645 3432
3646 3433 ASSERT(bp->blk_birth == 0);
3647 3434 ASSERT(BP_PHYSICAL_BIRTH(bp) == 0);
3648 3435
3649 3436 spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
3650 3437
(3 lines elided)
3651 3438 if (mc->mc_rotor == NULL) { /* no vdevs in this class */
3652 3439 spa_config_exit(spa, SCL_ALLOC, FTAG);
3653 3440 return (SET_ERROR(ENOSPC));
3654 3441 }
3655 3442
3656 3443 ASSERT(ndvas > 0 && ndvas <= spa_max_replication(spa));
3657 3444 ASSERT(BP_GET_NDVAS(bp) == 0);
3658 3445 ASSERT(hintbp == NULL || ndvas <= BP_GET_NDVAS(hintbp));
3659 3446 ASSERT3P(zal, !=, NULL);
3660 3447
3661 - for (int d = 0; d < ndvas; d++) {
3662 - error = metaslab_alloc_dva(spa, mc, psize, dva, d, hintdva,
3663 - txg, flags, zal);
3664 - if (error != 0) {
3665 - for (d--; d >= 0; d--) {
3666 - metaslab_unalloc_dva(spa, &dva[d], txg);
3667 - metaslab_group_alloc_decrement(spa,
3668 - DVA_GET_VDEV(&dva[d]), zio, flags);
3669 - bzero(&dva[d], sizeof (dva_t));
3448 + if (mc == spa_special_class(spa) && !BP_IS_METADATA(bp) &&
3449 + !(flags & (METASLAB_GANG_HEADER)) &&
3450 + !(spa->spa_meta_policy.spa_small_data_to_special &&
3451 + psize <= spa->spa_meta_policy.spa_small_data_to_special)) {
3452 + error = metaslab_alloc_dva(spa, spa_normal_class(spa),
3453 + psize, &dva[WBC_NORMAL_DVA], 0, NULL, txg,
3454 + flags | METASLAB_USE_WEIGHT_SECONDARY, zal);
3455 + if (error == 0) {
3456 + error = metaslab_alloc_dva(spa, mc, psize,
3457 + &dva[WBC_SPECIAL_DVA], 0, NULL, txg, flags, zal);
3458 + if (error != 0) {
3459 + error = 0;
3460 + /*
3461 + * Move the NORMAL DVA into the SPECIAL slot and
3462 + * clear the second DVA. After that this BP is
3463 + * just a regular BP with one DVA
3464 + *
3465 + * This operation is valid only if:
3466 + * WBC_SPECIAL_DVA is dva[0]
3467 + * WBC_NORMAL_DVA is dva[1]
3468 + *
3469 + * see wbc.h
3470 + */
3471 + bcopy(&dva[WBC_NORMAL_DVA],
3472 + &dva[WBC_SPECIAL_DVA], sizeof (dva_t));
3473 + bzero(&dva[WBC_NORMAL_DVA], sizeof (dva_t));
3474 +
3475 + * Allocation of the special DVA has failed,
3476 + * so this BP will be a regular BP and we need
3477 + * to update the metaslab group's queue depth
3478 + * based on the newly allocated dva.
3479 + * based on the newly allocated dva.
3480 + */
3481 + metaslab_group_alloc_increment(spa,
3482 + DVA_GET_VDEV(&dva[0]), zio, flags);
3483 + } else {
3484 + BP_SET_SPECIAL(bp, 1);
3670 3485 }
3486 + } else {
3671 3487 spa_config_exit(spa, SCL_ALLOC, FTAG);
3672 3488 return (error);
3673 - } else {
3674 - /*
3675 - * Update the metaslab group's queue depth
3676 - * based on the newly allocated dva.
3677 - */
3678 - metaslab_group_alloc_increment(spa,
3679 - DVA_GET_VDEV(&dva[d]), zio, flags);
3680 3489 }
3681 -
3490 + } else {
3491 + for (int d = 0; d < ndvas; d++) {
3492 + error = metaslab_alloc_dva(spa, mc, psize, dva, d,
3493 + hintdva, txg, flags, zal);
3494 + if (error != 0) {
3495 + for (d--; d >= 0; d--) {
3496 + metaslab_free_dva(spa, &dva[d],
3497 + txg, B_TRUE);
3498 + metaslab_group_alloc_decrement(spa,
3499 + DVA_GET_VDEV(&dva[d]), zio, flags);
3500 + bzero(&dva[d], sizeof (dva_t));
3501 + }
3502 + spa_config_exit(spa, SCL_ALLOC, FTAG);
3503 + return (error);
3504 + } else {
3505 + /*
3506 + * Update the metaslab group's queue depth
3507 + * based on the newly allocated dva.
3508 + */
3509 + metaslab_group_alloc_increment(spa,
3510 + DVA_GET_VDEV(&dva[d]), zio, flags);
3511 + }
3512 + }
3513 + ASSERT(BP_GET_NDVAS(bp) == ndvas);
3682 3514 }
3683 3515 ASSERT(error == 0);
3684 - ASSERT(BP_GET_NDVAS(bp) == ndvas);
3685 3516
3686 3517 spa_config_exit(spa, SCL_ALLOC, FTAG);
3687 3518
3688 3519 BP_SET_BIRTH(bp, txg, txg);
3689 3520
3690 3521 return (0);
3691 3522 }
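
For the write-back-cache path above, the comments state that the special copy lives in dva[0] (WBC_SPECIAL_DVA), the normal-class copy in dva[1] (WBC_NORMAL_DVA), and that a failed special allocation collapses the BP back to a regular single-DVA block. A stand-alone sketch of that slot handling follows; plain integers stand in for dva_t, and the index macros are defined locally to mirror the comment (the real values come from wbc.h).

    #include <stdio.h>

    #define	WBC_SPECIAL_DVA	0	/* slot layout per the comment above */
    #define	WBC_NORMAL_DVA	1

    int
    main(void)
    {
    	unsigned long dva[2] = { 0, 0 };	/* plain integers stand in for dva_t */
    	int special_failed = 1;			/* pretend the special class is full */

    	dva[WBC_NORMAL_DVA] = 0x2000;		/* normal-class copy allocated first */
    	if (special_failed) {
    		/* collapse to a regular one-DVA block pointer */
    		dva[WBC_SPECIAL_DVA] = dva[WBC_NORMAL_DVA];
    		dva[WBC_NORMAL_DVA] = 0;
    	} else {
    		dva[WBC_SPECIAL_DVA] = 0x9000;	/* special-class copy, BP marked special */
    	}
    	printf("dva[0]=%#lx dva[1]=%#lx\n", dva[0], dva[1]);
    	return (0);
    }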
3692 3523
3693 3524 void
3694 3525 metaslab_free(spa_t *spa, const blkptr_t *bp, uint64_t txg, boolean_t now)
3695 3526 {
3696 3527 const dva_t *dva = bp->blk_dva;
3697 3528 int ndvas = BP_GET_NDVAS(bp);
3698 3529
3699 3530 ASSERT(!BP_IS_HOLE(bp));
3700 3531 ASSERT(!now || bp->blk_birth >= spa_syncing_txg(spa));
3701 3532
3702 3533 spa_config_enter(spa, SCL_FREE, FTAG, RW_READER);
3703 3534
3704 - for (int d = 0; d < ndvas; d++) {
3705 - if (now) {
3706 - metaslab_unalloc_dva(spa, &dva[d], txg);
3707 - } else {
3708 - metaslab_free_dva(spa, &dva[d], txg);
3535 + if (BP_IS_SPECIAL(bp)) {
3536 + int start_dva;
3537 + wbc_data_t *wbc_data = spa_get_wbc_data(spa);
3538 +
3539 + mutex_enter(&wbc_data->wbc_lock);
3540 + start_dva = wbc_first_valid_dva(bp, wbc_data, B_TRUE);
3541 + mutex_exit(&wbc_data->wbc_lock);
3542 +
3543 + /*
3544 + * Actual freeing need not be done under the lock,
3545 + * as the block has already been exempted from the
3546 + * WBC trees and thus will not be moved
3547 + */
3548 + metaslab_free_dva(spa, &dva[WBC_NORMAL_DVA], txg, now);
3549 + if (start_dva == 0) {
3550 + metaslab_free_dva(spa, &dva[WBC_SPECIAL_DVA],
3551 + txg, now);
3709 3552 }
3553 + } else {
3554 + for (int d = 0; d < ndvas; d++)
3555 + metaslab_free_dva(spa, &dva[d], txg, now);
3710 3556 }
3711 3557
3712 3558 spa_config_exit(spa, SCL_FREE, FTAG);
3713 3559 }
3714 3560
3715 3561 int
3716 3562 metaslab_claim(spa_t *spa, const blkptr_t *bp, uint64_t txg)
3717 3563 {
3718 3564 const dva_t *dva = bp->blk_dva;
3719 3565 int ndvas = BP_GET_NDVAS(bp);
3720 3566 int error = 0;
3721 3567
3722 3568 ASSERT(!BP_IS_HOLE(bp));
3723 3569
3724 3570 if (txg != 0) {
(5 lines elided)
3725 3571 /*
3726 3572 * First do a dry run to make sure all DVAs are claimable,
3727 3573 * so we don't have to unwind from partial failures below.
3728 3574 */
3729 3575 if ((error = metaslab_claim(spa, bp, 0)) != 0)
3730 3576 return (error);
3731 3577 }
3732 3578
3733 3579 spa_config_enter(spa, SCL_ALLOC, FTAG, RW_READER);
3734 3580
3735 - for (int d = 0; d < ndvas; d++)
3736 - if ((error = metaslab_claim_dva(spa, &dva[d], txg)) != 0)
3737 - break;
3581 + if (BP_IS_SPECIAL(bp)) {
3582 + int start_dva;
3583 + wbc_data_t *wbc_data = spa_get_wbc_data(spa);
3738 3584
3585 + mutex_enter(&wbc_data->wbc_lock);
3586 + start_dva = wbc_first_valid_dva(bp, wbc_data, B_FALSE);
3587 +
3588 + /*
3589 + * Actual claiming must be done under the lock for WBC blocks
3590 + * to ensure zdb will not fail. The only other user of claiming
3591 + * is the ZIL, whose blocks cannot be WBC ones, so the lock is
3592 + * not held for them.
3593 + */
3594 + error = metaslab_claim_dva(spa,
3595 + &dva[WBC_NORMAL_DVA], txg);
3596 + if (error == 0 && start_dva == 0) {
3597 + error = metaslab_claim_dva(spa,
3598 + &dva[WBC_SPECIAL_DVA], txg);
3599 + }
3600 +
3601 + mutex_exit(&wbc_data->wbc_lock);
3602 + } else {
3603 + for (int d = 0; d < ndvas; d++)
3604 + if ((error = metaslab_claim_dva(spa,
3605 + &dva[d], txg)) != 0)
3606 + break;
3607 + }
3608 +
3739 3609 spa_config_exit(spa, SCL_ALLOC, FTAG);
3740 3610
3741 3611 ASSERT(error == 0 || txg == 0);
3742 3612
3743 3613 return (error);
3744 3614 }
3745 3615
3746 -/* ARGSUSED */
3747 -static void
3748 -metaslab_check_free_impl_cb(uint64_t inner, vdev_t *vd, uint64_t offset,
3749 - uint64_t size, void *arg)
3616 +void
3617 +metaslab_check_free(spa_t *spa, const blkptr_t *bp)
3750 3618 {
3751 - if (vd->vdev_ops == &vdev_indirect_ops)
3619 + if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3752 3620 return;
3753 3621
3754 - metaslab_check_free_impl(vd, offset, size);
3622 + if (BP_IS_SPECIAL(bp)) {
3623 + /* Do not check frees for WBC blocks */
3624 + return;
3625 + }
3626 +
3627 + spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3628 + for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
3629 + uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
3630 + vdev_t *vd = vdev_lookup_top(spa, vdev);
3631 + uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
3632 + uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
3633 + metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3634 +
3635 + if (msp->ms_loaded) {
3636 + range_tree_verify(msp->ms_tree, offset, size);
3637 + range_tree_verify(msp->ms_cur_ts->ts_tree,
3638 + offset, size);
3639 + if (msp->ms_prev_ts != NULL) {
3640 + range_tree_verify(msp->ms_prev_ts->ts_tree,
3641 + offset, size);
3642 + }
3643 + }
3644 +
3645 + range_tree_verify(msp->ms_freeingtree, offset, size);
3646 + range_tree_verify(msp->ms_freedtree, offset, size);
3647 + for (int j = 0; j < TXG_DEFER_SIZE; j++)
3648 + range_tree_verify(msp->ms_defertree[j], offset, size);
3649 + }
3650 + spa_config_exit(spa, SCL_VDEV, FTAG);
3755 3651 }
3756 3652
3757 -static void
3758 -metaslab_check_free_impl(vdev_t *vd, uint64_t offset, uint64_t size)
3653 +/*
3654 + * Trims all free space in the metaslab. Returns the root TRIM zio (that the
3655 + * caller should zio_wait() for) and the amount of space in the metaslab that
3656 + * has been scheduled for trimming in the `delta' return argument.
3657 + */
3658 +zio_t *
3659 +metaslab_trim_all(metaslab_t *msp, uint64_t *delta)
3759 3660 {
3760 - metaslab_t *msp;
3761 - spa_t *spa = vd->vdev_spa;
3661 + boolean_t was_loaded;
3662 + uint64_t trimmed_space;
3663 + zio_t *trim_io;
3762 3664
3763 - if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3764 - return;
3665 + ASSERT(!MUTEX_HELD(&msp->ms_group->mg_lock));
3765 3666
3766 - if (vd->vdev_ops->vdev_op_remap != NULL) {
3767 - vd->vdev_ops->vdev_op_remap(vd, offset, size,
3768 - metaslab_check_free_impl_cb, NULL);
3769 - return;
3667 + mutex_enter(&msp->ms_lock);
3668 +
3669 + while (msp->ms_loading)
3670 + metaslab_load_wait(msp);
3671 + /* If we loaded the metaslab, unload it when we're done. */
3672 + was_loaded = msp->ms_loaded;
3673 + if (!was_loaded) {
3674 + if (metaslab_load(msp) != 0) {
3675 + mutex_exit(&msp->ms_lock);
3676 + return (NULL);
3677 + }
3770 3678 }
3679 + /* Flush out any scheduled extents and add everything in ms_tree. */
3680 + range_tree_vacate(msp->ms_cur_ts->ts_tree, NULL, NULL);
3681 + range_tree_walk(msp->ms_tree, metaslab_trim_add, msp);
3771 3682
3772 - ASSERT(vdev_is_concrete(vd));
3773 - ASSERT3U(offset >> vd->vdev_ms_shift, <, vd->vdev_ms_count);
3774 - ASSERT3U(spa_config_held(spa, SCL_ALL, RW_READER), !=, 0);
3683 + /* Force this trim to take place ASAP. */
3684 + if (msp->ms_prev_ts != NULL)
3685 + metaslab_free_trimset(msp->ms_prev_ts);
3686 + msp->ms_prev_ts = msp->ms_cur_ts;
3687 + msp->ms_cur_ts = metaslab_new_trimset(0, &msp->ms_lock);
3688 + trimmed_space = range_tree_space(msp->ms_tree);
3689 + if (!was_loaded)
3690 + metaslab_unload(msp);
3775 3691
3776 - msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
3692 + trim_io = metaslab_exec_trim(msp);
3693 + mutex_exit(&msp->ms_lock);
3694 + *delta = trimmed_space;
3777 3695
3696 + return (trim_io);
3697 +}
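
Per the block comment above, the caller is expected to wait on the returned root zio and may use `delta' to account the space queued for trimming. A sketch of that calling convention; this is kernel context, so the fragment is not stand-alone, and the accumulator name is hypothetical.

    	uint64_t delta = 0;
    	zio_t *trim_zio = metaslab_trim_all(msp, &delta);

    	if (trim_zio != NULL)
    		(void) zio_wait(trim_zio);	/* wait for the whole trim tree */
    	trimmed_space += delta;			/* hypothetical accumulator */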
3698 +
3699 +/*
3700 + * Notifies the trimsets in a metaslab that an extent has been allocated.
3701 + * This removes the segment from the queues of extents awaiting to be trimmed.
3702 + */
3703 +static void
3704 +metaslab_trim_remove(void *arg, uint64_t offset, uint64_t size)
3705 +{
3706 + metaslab_t *msp = arg;
3707 +
3708 + range_tree_remove_overlap(msp->ms_cur_ts->ts_tree, offset, size);
3709 + if (msp->ms_prev_ts != NULL) {
3710 + range_tree_remove_overlap(msp->ms_prev_ts->ts_tree, offset,
3711 + size);
3712 + }
3713 +}
3714 +
3715 +/*
3716 + * Notifies the trimsets in a metaslab that an extent has been freed.
3717 + * This adds the segment to the currently open queue of extents awaiting
3718 + * to be trimmed.
3719 + */
3720 +static void
3721 +metaslab_trim_add(void *arg, uint64_t offset, uint64_t size)
3722 +{
3723 + metaslab_t *msp = arg;
3724 + ASSERT(msp->ms_cur_ts != NULL);
3725 + range_tree_add(msp->ms_cur_ts->ts_tree, offset, size);
3726 +}
3727 +
3728 +/*
3729 + * Does a metaslab's automatic trim operation processing. This must be
3730 + * called from metaslab_sync with the syncing txg's number. This function
3731 + * issues trims in intervals as dictated by the zfs_txgs_per_trim tunable.
3732 + */
3733 +void
3734 +metaslab_auto_trim(metaslab_t *msp, uint64_t txg)
3735 +{
3736 + /* for atomicity */
3737 + uint64_t txgs_per_trim = zfs_txgs_per_trim;
3738 +
3739 + ASSERT(!MUTEX_HELD(&msp->ms_lock));
3778 3740 mutex_enter(&msp->ms_lock);
3779 - if (msp->ms_loaded)
3780 - range_tree_verify(msp->ms_tree, offset, size);
3781 3741
3782 - range_tree_verify(msp->ms_freeingtree, offset, size);
3783 - range_tree_verify(msp->ms_freedtree, offset, size);
3784 - for (int j = 0; j < TXG_DEFER_SIZE; j++)
3785 - range_tree_verify(msp->ms_defertree[j], offset, size);
3742 + /*
3743 + * Since we typically have hundreds of metaslabs per vdev, but we only
3744 + * trim them once every zfs_txgs_per_trim txgs, it'd be best if we
3745 + * could sequence the TRIM commands from all metaslabs so that they
3746 + * don't all always pound the device in the same txg. We do so by
3747 + * artificially inflating the birth txg of the first trim set by a
3748 + * sequence number derived from the metaslab's starting offset
3749 + * (modulo zfs_txgs_per_trim). Thus, for the default 200 metaslabs and
3750 + * 32 txgs per trim, we'll only be trimming ~6.25 metaslabs per txg.
3751 + *
3752 + * If we detect that the txg has advanced too far ahead of ts_birth,
3753 + * it means our birth txg is out of lockstep. Recompute it by
3754 + * rounding down to the nearest zfs_txgs_per_trim multiple and adding
3755 + * our metaslab id modulo zfs_txgs_per_trim.
3756 + */
3757 + if (txg > msp->ms_cur_ts->ts_birth + txgs_per_trim) {
3758 + msp->ms_cur_ts->ts_birth = (txg / txgs_per_trim) *
3759 + txgs_per_trim + (msp->ms_id % txgs_per_trim);
3760 + }
3761 +
3762 + /* Time to swap out the current and previous trimsets */
3763 + if (txg == msp->ms_cur_ts->ts_birth + txgs_per_trim) {
3764 + if (msp->ms_prev_ts != NULL) {
3765 + if (msp->ms_trimming_ts != NULL) {
3766 + spa_t *spa = msp->ms_group->mg_class->mc_spa;
3767 + /*
3768 + * The previous trim run is still ongoing, so
3769 + * the device is reacting slowly to our trim
3770 + * requests. Drop this trimset, so as not to
3771 + * back the device up with trim requests.
3772 + */
3773 + spa_trimstats_auto_slow_incr(spa);
3774 + metaslab_free_trimset(msp->ms_prev_ts);
3775 + } else if (msp->ms_group->mg_vd->vdev_man_trimming) {
3776 + /*
3777 + * If a manual trim is ongoing, we want to
3778 + * inhibit autotrim temporarily so it doesn't
3779 + * slow down the manual trim.
3780 + */
3781 + metaslab_free_trimset(msp->ms_prev_ts);
3782 + } else {
3783 + /*
3784 + * Trim out aged extents on the vdevs - these
3785 + * are safe to be destroyed now. We'll keep
3786 + * the trimset around to deny allocations from
3787 + * these regions while the trims are ongoing.
3788 + */
3789 + zio_nowait(metaslab_exec_trim(msp));
3790 + }
3791 + }
3792 + msp->ms_prev_ts = msp->ms_cur_ts;
3793 + msp->ms_cur_ts = metaslab_new_trimset(txg, &msp->ms_lock);
3794 + }
3786 3795 mutex_exit(&msp->ms_lock);
3787 3796 }
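
A stand-alone sketch of the staggering arithmetic described in the comment above, using the 32-txgs-per-trim figure the comment itself assumes and an arbitrary metaslab id: the birth txg is rounded down to a multiple of the interval and offset by the id, so only about 1/32 of the metaslabs swap trimsets in any one txg.

    #include <stdio.h>
    #include <inttypes.h>

    int
    main(void)
    {
    	uint64_t txgs_per_trim = 32;	/* value the comment above assumes */
    	uint64_t ms_id = 7;		/* arbitrary metaslab id */
    	uint64_t txg = 1000;		/* currently syncing txg */
    	uint64_t birth;

    	/* re-derive the birth txg exactly as the lockstep recovery above does */
    	birth = (txg / txgs_per_trim) * txgs_per_trim + (ms_id % txgs_per_trim);
    	printf("trim set born at txg %" PRIu64 ", next swap at txg %" PRIu64 "\n",
    	    birth, birth + txgs_per_trim);
    	return (0);
    }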
3788 3797
3789 -void
3790 -metaslab_check_free(spa_t *spa, const blkptr_t *bp)
3798 +static void
3799 +metaslab_trim_done(zio_t *zio)
3791 3800 {
3792 - if ((zfs_flags & ZFS_DEBUG_ZIO_FREE) == 0)
3793 - return;
3801 + metaslab_t *msp = zio->io_private;
3802 + boolean_t held;
3794 3803
3795 - spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3796 - for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
3797 - uint64_t vdev = DVA_GET_VDEV(&bp->blk_dva[i]);
3798 - vdev_t *vd = vdev_lookup_top(spa, vdev);
3799 - uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
3800 - uint64_t size = DVA_GET_ASIZE(&bp->blk_dva[i]);
3804 + ASSERT(msp != NULL);
3805 + ASSERT(msp->ms_trimming_ts != NULL);
3806 + held = MUTEX_HELD(&msp->ms_lock);
3807 + if (!held)
3808 + mutex_enter(&msp->ms_lock);
3809 + metaslab_free_trimset(msp->ms_trimming_ts);
3810 + msp->ms_trimming_ts = NULL;
3811 + cv_signal(&msp->ms_trim_cv);
3812 + if (!held)
3813 + mutex_exit(&msp->ms_lock);
3814 +}
3801 3815
3802 - if (DVA_GET_GANG(&bp->blk_dva[i]))
3803 - size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
3816 +/*
3817 + * Executes a zio_trim on a range tree holding freed extents in the metaslab.
3818 + */
3819 +static zio_t *
3820 +metaslab_exec_trim(metaslab_t *msp)
3821 +{
3822 + metaslab_group_t *mg = msp->ms_group;
3823 + spa_t *spa = mg->mg_class->mc_spa;
3824 + vdev_t *vd = mg->mg_vd;
3825 + range_tree_t *trim_tree;
3826 + zio_t *zio;
3804 3827
3805 - ASSERT3P(vd, !=, NULL);
3828 + ASSERT(MUTEX_HELD(&msp->ms_lock));
3806 3829
3807 - metaslab_check_free_impl(vd, offset, size);
3830 + /* wait for a preceding trim to finish */
3831 + while (msp->ms_trimming_ts != NULL)
3832 + cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
3833 + msp->ms_trimming_ts = msp->ms_prev_ts;
3834 + msp->ms_prev_ts = NULL;
3835 + trim_tree = msp->ms_trimming_ts->ts_tree;
3836 +#ifdef DEBUG
3837 + if (msp->ms_loaded) {
3838 + for (range_seg_t *rs = avl_first(&trim_tree->rt_root);
3839 + rs != NULL; rs = AVL_NEXT(&trim_tree->rt_root, rs)) {
3840 + if (!range_tree_contains(msp->ms_tree,
3841 + rs->rs_start, rs->rs_end - rs->rs_start)) {
3842 + panic("trimming allocated region; mss=%p",
3843 + (void*)rs);
3844 + }
3845 + }
3808 3846 }
3809 - spa_config_exit(spa, SCL_VDEV, FTAG);
3847 +#endif
3848 +
3849 + /* Nothing to trim */
3850 + if (range_tree_space(trim_tree) == 0) {
3851 + metaslab_free_trimset(msp->ms_trimming_ts);
3852 + msp->ms_trimming_ts = NULL;
3853 + return (zio_root(spa, NULL, NULL, 0));
3854 + }
3855 + zio = zio_trim(spa, vd, trim_tree, metaslab_trim_done, msp, 0,
3856 + ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
3857 + ZIO_FLAG_CONFIG_WRITER, msp);
3858 +
3859 + return (zio);
3860 +}
3861 +
3862 +/*
3863 + * Allocates and initializes a new trimset structure. The `txg' argument
3864 + * indicates when this trimset was born and `lock' indicates the lock to
3865 + * link to the range tree.
3866 + */
3867 +static metaslab_trimset_t *
3868 +metaslab_new_trimset(uint64_t txg, kmutex_t *lock)
3869 +{
3870 + metaslab_trimset_t *ts;
3871 +
3872 + ts = kmem_zalloc(sizeof (*ts), KM_SLEEP);
3873 + ts->ts_birth = txg;
3874 + ts->ts_tree = range_tree_create(NULL, NULL, lock);
3875 +
3876 + return (ts);
3877 +}
3878 +
3879 +/*
3880 + * Destroys and frees a trim set previously allocated by metaslab_new_trimset.
3881 + */
3882 +static void
3883 +metaslab_free_trimset(metaslab_trimset_t *ts)
3884 +{
3885 + range_tree_vacate(ts->ts_tree, NULL, NULL);
3886 + range_tree_destroy(ts->ts_tree);
3887 + kmem_free(ts, sizeof (*ts));
3888 +}
3889 +
3890 +/*
3891 + * Checks whether an allocation conflicts with an ongoing trim operation in
3892 + * the given metaslab. This function takes a segment of `size' starting at
3893 + * `*offset' and checks whether it hits any region in the metaslab currently
3894 + * being trimmed. If yes, it tries to adjust the allocation to the end of
3895 + * the region being trimmed (P2ROUNDUP aligned by `align'), but only up to
3896 + * `limit' (no part of the allocation is allowed to go past this point).
3897 + *
3898 + * Returns B_FALSE if either the original allocation wasn't in conflict, or
3899 + * the conflict could be resolved by adjusting the value stored in `offset'
3900 + * such that the whole allocation still fits below `limit'. Returns B_TRUE
3901 + * if the allocation conflict couldn't be resolved.
3902 + */
3903 +static boolean_t metaslab_check_trim_conflict(metaslab_t *msp,
3904 + uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit)
3905 +{
3906 + uint64_t new_offset;
3907 +
3908 + if (msp->ms_trimming_ts == NULL)
3909 + /* no trim conflict, original offset is OK */
3910 + return (B_FALSE);
3911 +
3912 + new_offset = P2ROUNDUP(range_tree_find_gap(msp->ms_trimming_ts->ts_tree,
3913 + *offset, size), align);
3914 + if (new_offset != *offset && new_offset + size > limit)
3915 + /* trim conflict and adjustment not possible */
3916 + return (B_TRUE);
3917 +
3918 + /* trim conflict, but adjusted offset still within limit */
3919 + *offset = new_offset;
3920 + return (B_FALSE);
3810 3921 }
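
A stand-alone model of the adjustment the comment above describes: a proposed segment that overlaps the region being trimmed is pushed to the end of that region, rounded up to the requested alignment, and accepted only if it still fits below the limit. A single hard-coded trim region stands in for the trimming range tree; all names are illustrative.

    #include <stdio.h>
    #include <stdbool.h>
    #include <inttypes.h>

    #define	P2ROUNDUP(x, a)	((((x) - 1) | ((a) - 1)) + 1)

    static uint64_t trim_start = 0x10000, trim_end = 0x20000;	/* region being trimmed */

    static bool
    trim_conflict(uint64_t *offset, uint64_t size, uint64_t align, uint64_t limit)
    {
    	uint64_t new_offset = *offset;

    	if (*offset < trim_end && *offset + size > trim_start)
    		new_offset = P2ROUNDUP(trim_end, align);	/* skip past the trim */
    	if (new_offset != *offset && new_offset + size > limit)
    		return (true);		/* conflict could not be resolved */
    	*offset = new_offset;		/* original or adjusted offset is usable */
    	return (false);
    }

    int
    main(void)
    {
    	uint64_t offset = 0x18000;

    	if (!trim_conflict(&offset, 0x1000, 0x1000, 0x40000))
    		printf("allocation adjusted to %#" PRIx64 "\n", offset);
    	return (0);
    }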