NEX-20218 Backport Illumos #9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
MFV illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
NEX-20208 Backport Illumos #9993 zil writes can get delayed in zio pipeline
MFV illumos-gate@2258ad0b755b24a55c6173b1e6bb6188389f72dd
9993 zil writes can get delayed in zio pipeline
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15067 KRRP: system panics during ZFS-receive: assertion failed: arc_can_share(hdr, buf)
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8065 ZFS doesn't notice when disk vdevs have no write cache
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
NEX-5856 ddt_capped isn't reset when deduped dataset is destroyed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5188 Removed special-vdev causes panic on read or on get size of special-bp
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4794 Write Back Cache sync and async writes: adjust routing according to watermark limits
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4619 Want kstats to monitor TRIM and UNMAP operation
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/common/zfs/zpool_prop.c
usr/src/uts/common/sys/fs/zfs.h
NEX-4003 WRC: System panics on debug build
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
usr/src/uts/common/io/scsi/targets/sd.c
usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3411 Removal of small l2arc ddt vdev disables dedup despite enough RAM
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-1110 Odd zpool Latency Output
OS-70 remove zio timer code
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line), place each kmutex_t on cache line by itself to avoid false sharing
Fixup merge results
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
SUP-504 Multiple disks being falsely failed/retired by new zio_timeout handling code
re #12770 rb4121 zio latency reports can produce false positives
re #12645 rb4073 Make vdev delay simulator independent of DEBUG
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #12616 rb4051 zfs_log_write()/dmu_sync() write once to special refactoring
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
--- old/usr/src/uts/common/fs/zfs/zio.c
+++ new/usr/src/uts/common/fs/zfs/zio.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 +
21 22 /*
22 23 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
23 24 * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
24 - * Copyright (c) 2011 Nexenta Systems, Inc. All rights reserved.
25 25 * Copyright (c) 2014 Integros [integros.com]
26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
26 27 */
27 28
28 29 #include <sys/sysmacros.h>
29 30 #include <sys/zfs_context.h>
30 31 #include <sys/fm/fs/zfs.h>
31 32 #include <sys/spa.h>
32 33 #include <sys/txg.h>
33 34 #include <sys/spa_impl.h>
34 35 #include <sys/vdev_impl.h>
35 36 #include <sys/zio_impl.h>
36 37 #include <sys/zio_compress.h>
37 38 #include <sys/zio_checksum.h>
38 39 #include <sys/dmu_objset.h>
39 40 #include <sys/arc.h>
40 41 #include <sys/ddt.h>
41 42 #include <sys/blkptr.h>
43 +#include <sys/special.h>
44 +#include <sys/blkptr.h>
42 45 #include <sys/zfeature.h>
46 +#include <sys/dkioc_free_util.h>
47 +#include <sys/dsl_scan.h>
48 +
43 49 #include <sys/metaslab_impl.h>
44 50 #include <sys/abd.h>
45 51
52 +extern int zfs_txg_timeout;
53 +
46 54 /*
47 55 * ==========================================================================
48 56 * I/O type descriptions
49 57 * ==========================================================================
50 58 */
51 59 const char *zio_type_name[ZIO_TYPES] = {
52 60 "zio_null", "zio_read", "zio_write", "zio_free", "zio_claim",
53 61 "zio_ioctl"
54 62 };
55 63
56 64 boolean_t zio_dva_throttle_enabled = B_TRUE;
57 65
58 66 /*
59 67 * ==========================================================================
60 68 * I/O kmem caches
61 69 * ==========================================================================
62 70 */
63 71 kmem_cache_t *zio_cache;
64 72 kmem_cache_t *zio_link_cache;
65 73 kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
66 74 kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
67 75
68 76 #ifdef _KERNEL
69 77 extern vmem_t *zio_alloc_arena;
70 78 #endif
71 79
72 -#define ZIO_PIPELINE_CONTINUE 0x100
73 -#define ZIO_PIPELINE_STOP 0x101
74 -
75 80 #define BP_SPANB(indblkshift, level) \
76 81 (((uint64_t)1) << ((level) * ((indblkshift) - SPA_BLKPTRSHIFT)))
77 82 #define COMPARE_META_LEVEL 0x80000000ul
83 +
78 84 /*
79 85 * The following actions directly effect the spa's sync-to-convergence logic.
80 86 * The values below define the sync pass when we start performing the action.
81 87 * Care should be taken when changing these values as they directly impact
82 88 * spa_sync() performance. Tuning these values may introduce subtle performance
83 89 * pathologies and should only be done in the context of performance analysis.
84 90 * These tunables will eventually be removed and replaced with #defines once
85 91 * enough analysis has been done to determine optimal values.
86 92 *
87 93 * The 'zfs_sync_pass_deferred_free' pass must be greater than 1 to ensure that
88 94 * regular blocks are not deferred.
89 95 */
90 96 int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
91 97 int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
92 98 int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */
93 99
94 100 /*
95 101 * An allocating zio is one that either currently has the DVA allocate
96 102 * stage set or will have it later in its lifetime.
97 103 */
|
↓ open down ↓ |
10 lines elided |
↑ open up ↑ |
98 104 #define IO_IS_ALLOCATING(zio) ((zio)->io_orig_pipeline & ZIO_STAGE_DVA_ALLOCATE)
99 105
100 106 boolean_t zio_requeue_io_start_cut_in_line = B_TRUE;
101 107
102 108 #ifdef ZFS_DEBUG
103 109 int zio_buf_debug_limit = 16384;
104 110 #else
105 111 int zio_buf_debug_limit = 0;
106 112 #endif
107 113
114 +/*
115 + * Fault insertion for stress testing
116 + */
117 +int zio_faulty_vdev_enabled = 0;
118 +uint64_t zio_faulty_vdev_guid;
119 +uint64_t zio_faulty_vdev_delay_us = 1000000; /* 1 second */
120 +
121 +/*
122 + * Tunable to allow for debugging SCSI UNMAP/SATA TRIM calls. Disabling
123 + * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
124 + * underlying storage.
125 + */
126 +boolean_t zfs_trim = B_TRUE;
127 +uint64_t zfs_trim_min_ext_sz = 1 << 20; /* 1 MB */
128 +
108 129 static void zio_taskq_dispatch(zio_t *, zio_taskq_type_t, boolean_t);
109 130
110 131 void
111 132 zio_init(void)
112 133 {
113 134 size_t c;
114 135 vmem_t *data_alloc_arena = NULL;
115 136
116 137 #ifdef _KERNEL
117 138 data_alloc_arena = zio_alloc_arena;
118 139 #endif
119 140 zio_cache = kmem_cache_create("zio_cache",
120 141 sizeof (zio_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
121 142 zio_link_cache = kmem_cache_create("zio_link_cache",
122 143 sizeof (zio_link_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
123 144
124 145 /*
125 146 * For small buffers, we want a cache for each multiple of
126 147 * SPA_MINBLOCKSIZE. For larger buffers, we want a cache
127 148 * for each quarter-power of 2.
128 149 */
129 150 for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
130 151 size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
131 152 size_t p2 = size;
132 153 size_t align = 0;
133 154 size_t cflags = (size > zio_buf_debug_limit) ? KMC_NODEBUG : 0;
134 155
135 156 while (!ISP2(p2))
136 157 p2 &= p2 - 1;
137 158
138 159 #ifndef _KERNEL
139 160 /*
140 161 * If we are using watchpoints, put each buffer on its own page,
141 162 * to eliminate the performance overhead of trapping to the
142 163 * kernel when modifying a non-watched buffer that shares the
143 164 * page with a watched buffer.
144 165 */
145 166 if (arc_watch && !IS_P2ALIGNED(size, PAGESIZE))
146 167 continue;
147 168 #endif
148 169 if (size <= 4 * SPA_MINBLOCKSIZE) {
149 170 align = SPA_MINBLOCKSIZE;
150 171 } else if (IS_P2ALIGNED(size, p2 >> 2)) {
151 172 align = MIN(p2 >> 2, PAGESIZE);
152 173 }
153 174
154 175 if (align != 0) {
155 176 char name[36];
156 177 (void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
157 178 zio_buf_cache[c] = kmem_cache_create(name, size,
158 179 align, NULL, NULL, NULL, NULL, NULL, cflags);
159 180
160 181 /*
161 182 * Since zio_data bufs do not appear in crash dumps, we
162 183 * pass KMC_NOTOUCH so that no allocator metadata is
163 184 * stored with the buffers.
164 185 */
165 186 (void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
166 187 zio_data_buf_cache[c] = kmem_cache_create(name, size,
167 188 align, NULL, NULL, NULL, NULL, data_alloc_arena,
168 189 cflags | KMC_NOTOUCH);
169 190 }
170 191 }
171 192
172 193 while (--c != 0) {
173 194 ASSERT(zio_buf_cache[c] != NULL);
174 195 if (zio_buf_cache[c - 1] == NULL)
175 196 zio_buf_cache[c - 1] = zio_buf_cache[c];
176 197
177 198 ASSERT(zio_data_buf_cache[c] != NULL);
178 199 if (zio_data_buf_cache[c - 1] == NULL)
179 200 zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
180 201 }
181 202
182 203 zio_inject_init();
204 +
183 205 }
184 206
185 207 void
186 208 zio_fini(void)
187 209 {
188 210 size_t c;
189 211 kmem_cache_t *last_cache = NULL;
190 212 kmem_cache_t *last_data_cache = NULL;
191 213
192 214 for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
193 215 if (zio_buf_cache[c] != last_cache) {
194 216 last_cache = zio_buf_cache[c];
195 217 kmem_cache_destroy(zio_buf_cache[c]);
196 218 }
197 219 zio_buf_cache[c] = NULL;
198 220
199 221 if (zio_data_buf_cache[c] != last_data_cache) {
200 222 last_data_cache = zio_data_buf_cache[c];
201 223 kmem_cache_destroy(zio_data_buf_cache[c]);
202 224 }
203 225 zio_data_buf_cache[c] = NULL;
204 226 }
205 227
206 228 kmem_cache_destroy(zio_link_cache);
207 229 kmem_cache_destroy(zio_cache);
208 230
209 231 zio_inject_fini();
210 232 }
211 233
212 234 /*
213 235 * ==========================================================================
214 236 * Allocate and free I/O buffers
215 237 * ==========================================================================
216 238 */
217 239
218 240 /*
219 241 * Use zio_buf_alloc to allocate ZFS metadata. This data will appear in a
220 242 * crashdump if the kernel panics, so use it judiciously. Obviously, it's
221 243 * useful to inspect ZFS metadata, but if possible, we should avoid keeping
222 244 * excess / transient data in-core during a crashdump.
223 245 */
224 246 void *
225 247 zio_buf_alloc(size_t size)
226 248 {
227 249 size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
228 250
229 251 VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
230 252
231 253 return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
232 254 }
233 255
234 256 /*
235 257 * Use zio_data_buf_alloc to allocate data. The data will not appear in a
236 258 * crashdump if the kernel panics. This exists so that we will limit the amount
237 259 * of ZFS data that shows up in a kernel crashdump. (Thus reducing the amount
238 260 * of kernel heap dumped to disk when the kernel panics)
239 261 */
240 262 void *
241 263 zio_data_buf_alloc(size_t size)
242 264 {
243 265 size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
244 266
245 267 VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
246 268
247 269 return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
248 270 }
249 271
250 272 void
251 273 zio_buf_free(void *buf, size_t size)
252 274 {
253 275 size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
254 276
255 277 VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
256 278
257 279 kmem_cache_free(zio_buf_cache[c], buf);
258 280 }
259 281
260 282 void
261 283 zio_data_buf_free(void *buf, size_t size)
262 284 {
263 285 size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
264 286
265 287 VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
266 288
267 289 kmem_cache_free(zio_data_buf_cache[c], buf);
268 290 }
269 291
270 292 /*
271 293 * ==========================================================================
272 294 * Push and pop I/O transform buffers
273 295 * ==========================================================================
274 296 */
275 297 void
276 298 zio_push_transform(zio_t *zio, abd_t *data, uint64_t size, uint64_t bufsize,
277 299 zio_transform_func_t *transform)
278 300 {
279 301 zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);
280 302
281 303 /*
282 304 * Ensure that anyone expecting this zio to contain a linear ABD isn't
283 305 * going to get a nasty surprise when they try to access the data.
284 306 */
285 307 IMPLY(abd_is_linear(zio->io_abd), abd_is_linear(data));
286 308
287 309 zt->zt_orig_abd = zio->io_abd;
288 310 zt->zt_orig_size = zio->io_size;
289 311 zt->zt_bufsize = bufsize;
290 312 zt->zt_transform = transform;
291 313
292 314 zt->zt_next = zio->io_transform_stack;
293 315 zio->io_transform_stack = zt;
294 316
295 317 zio->io_abd = data;
296 318 zio->io_size = size;
297 319 }
298 320
299 321 void
300 322 zio_pop_transforms(zio_t *zio)
301 323 {
302 324 zio_transform_t *zt;
303 325
304 326 while ((zt = zio->io_transform_stack) != NULL) {
305 327 if (zt->zt_transform != NULL)
306 328 zt->zt_transform(zio,
307 329 zt->zt_orig_abd, zt->zt_orig_size);
308 330
309 331 if (zt->zt_bufsize != 0)
310 332 abd_free(zio->io_abd);
311 333
312 334 zio->io_abd = zt->zt_orig_abd;
313 335 zio->io_size = zt->zt_orig_size;
314 336 zio->io_transform_stack = zt->zt_next;
315 337
316 338 kmem_free(zt, sizeof (zio_transform_t));
317 339 }
318 340 }
319 341
320 342 /*
321 343 * ==========================================================================
322 344 * I/O transform callbacks for subblocks and decompression
323 345 * ==========================================================================
324 346 */
325 347 static void
326 348 zio_subblock(zio_t *zio, abd_t *data, uint64_t size)
327 349 {
328 350 ASSERT(zio->io_size > size);
329 351
330 352 if (zio->io_type == ZIO_TYPE_READ)
331 353 abd_copy(data, zio->io_abd, size);
332 354 }
333 355
334 356 static void
335 357 zio_decompress(zio_t *zio, abd_t *data, uint64_t size)
336 358 {
337 359 if (zio->io_error == 0) {
338 360 void *tmp = abd_borrow_buf(data, size);
339 361 int ret = zio_decompress_data(BP_GET_COMPRESS(zio->io_bp),
340 362 zio->io_abd, tmp, zio->io_size, size);
341 363 abd_return_buf_copy(data, tmp, size);
342 364
343 365 if (ret != 0)
344 366 zio->io_error = SET_ERROR(EIO);
345 367 }
346 368 }
347 369
348 370 /*
349 371 * ==========================================================================
350 372 * I/O parent/child relationships and pipeline interlocks
351 373 * ==========================================================================
352 374 */
353 375 zio_t *
354 376 zio_walk_parents(zio_t *cio, zio_link_t **zl)
355 377 {
356 378 list_t *pl = &cio->io_parent_list;
357 379
358 380 *zl = (*zl == NULL) ? list_head(pl) : list_next(pl, *zl);
359 381 if (*zl == NULL)
360 382 return (NULL);
361 383
362 384 ASSERT((*zl)->zl_child == cio);
363 385 return ((*zl)->zl_parent);
364 386 }
365 387
366 388 zio_t *
367 389 zio_walk_children(zio_t *pio, zio_link_t **zl)
368 390 {
369 391 list_t *cl = &pio->io_child_list;
370 392
371 393 *zl = (*zl == NULL) ? list_head(cl) : list_next(cl, *zl);
372 394 if (*zl == NULL)
373 395 return (NULL);
374 396
375 397 ASSERT((*zl)->zl_parent == pio);
376 398 return ((*zl)->zl_child);
377 399 }
378 400
379 401 zio_t *
380 402 zio_unique_parent(zio_t *cio)
381 403 {
382 404 zio_link_t *zl = NULL;
383 405 zio_t *pio = zio_walk_parents(cio, &zl);
384 406
385 407 VERIFY3P(zio_walk_parents(cio, &zl), ==, NULL);
386 408 return (pio);
387 409 }
388 410
389 411 void
390 412 zio_add_child(zio_t *pio, zio_t *cio)
391 413 {
392 414 zio_link_t *zl = kmem_cache_alloc(zio_link_cache, KM_SLEEP);
393 415
394 416 /*
395 417 * Logical I/Os can have logical, gang, or vdev children.
396 418 * Gang I/Os can have gang or vdev children.
397 419 * Vdev I/Os can only have vdev children.
398 420 * The following ASSERT captures all of these constraints.
399 421 */
400 422 ASSERT3S(cio->io_child_type, <=, pio->io_child_type);
401 423
402 424 zl->zl_parent = pio;
403 425 zl->zl_child = cio;
404 426
405 427 mutex_enter(&cio->io_lock);
406 428 mutex_enter(&pio->io_lock);
407 429
408 430 ASSERT(pio->io_state[ZIO_WAIT_DONE] == 0);
409 431
410 432 for (int w = 0; w < ZIO_WAIT_TYPES; w++)
411 433 pio->io_children[cio->io_child_type][w] += !cio->io_state[w];
412 434
413 435 list_insert_head(&pio->io_child_list, zl);
414 436 list_insert_head(&cio->io_parent_list, zl);
415 437
416 438 pio->io_child_count++;
417 439 cio->io_parent_count++;
418 440
419 441 mutex_exit(&pio->io_lock);
420 442 mutex_exit(&cio->io_lock);
421 443 }
422 444
423 445 static void
424 446 zio_remove_child(zio_t *pio, zio_t *cio, zio_link_t *zl)
425 447 {
426 448 ASSERT(zl->zl_parent == pio);
427 449 ASSERT(zl->zl_child == cio);
428 450
429 451 mutex_enter(&cio->io_lock);
430 452 mutex_enter(&pio->io_lock);
431 453
432 454 list_remove(&pio->io_child_list, zl);
433 455 list_remove(&cio->io_parent_list, zl);
434 456
435 457 pio->io_child_count--;
436 458 cio->io_parent_count--;
437 459
438 460 mutex_exit(&pio->io_lock);
439 461 mutex_exit(&cio->io_lock);
440 462
441 463 kmem_cache_free(zio_link_cache, zl);
442 464 }
443 465
444 466 static boolean_t
445 -zio_wait_for_children(zio_t *zio, uint8_t childbits, enum zio_wait_type wait)
467 +zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
446 468 {
469 + uint64_t *countp = &zio->io_children[child][wait];
447 470 boolean_t waiting = B_FALSE;
448 471
449 472 mutex_enter(&zio->io_lock);
450 473 ASSERT(zio->io_stall == NULL);
451 - for (int c = 0; c < ZIO_CHILD_TYPES; c++) {
452 - if (!(ZIO_CHILD_BIT_IS_SET(childbits, c)))
453 - continue;
454 -
455 - uint64_t *countp = &zio->io_children[c][wait];
456 - if (*countp != 0) {
457 - zio->io_stage >>= 1;
458 - ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
459 - zio->io_stall = countp;
460 - waiting = B_TRUE;
461 - break;
462 - }
474 + if (*countp != 0) {
475 + zio->io_stage >>= 1;
476 + ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
477 + zio->io_stall = countp;
478 + waiting = B_TRUE;
463 479 }
464 480 mutex_exit(&zio->io_lock);
481 +
465 482 return (waiting);
466 483 }
467 484
468 485 static void
469 486 zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
470 487 {
471 488 uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
472 489 int *errorp = &pio->io_child_error[zio->io_child_type];
473 490
474 491 mutex_enter(&pio->io_lock);
475 492 if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
476 493 *errorp = zio_worst_error(*errorp, zio->io_error);
477 494 pio->io_reexecute |= zio->io_reexecute;
478 495 ASSERT3U(*countp, >, 0);
479 496
480 497 (*countp)--;
481 498
482 499 if (*countp == 0 && pio->io_stall == countp) {
483 500 zio_taskq_type_t type =
484 501 pio->io_stage < ZIO_STAGE_VDEV_IO_START ? ZIO_TASKQ_ISSUE :
485 502 ZIO_TASKQ_INTERRUPT;
486 503 pio->io_stall = NULL;
487 504 mutex_exit(&pio->io_lock);
488 505 /*
489 506 * Dispatch the parent zio in its own taskq so that
490 507 * the child can continue to make progress. This also
491 508 * prevents overflowing the stack when we have deeply nested
492 509 * parent-child relationships.
493 510 */
494 511 zio_taskq_dispatch(pio, type, B_FALSE);
495 512 } else {
496 513 mutex_exit(&pio->io_lock);
497 514 }
498 515 }
499 516
500 517 static void
501 518 zio_inherit_child_errors(zio_t *zio, enum zio_child c)
502 519 {
503 520 if (zio->io_child_error[c] != 0 && zio->io_error == 0)
504 521 zio->io_error = zio->io_child_error[c];
505 522 }
506 523
507 524 int
508 525 zio_bookmark_compare(const void *x1, const void *x2)
509 526 {
510 527 const zio_t *z1 = x1;
511 528 const zio_t *z2 = x2;
512 529
513 530 if (z1->io_bookmark.zb_objset < z2->io_bookmark.zb_objset)
514 531 return (-1);
515 532 if (z1->io_bookmark.zb_objset > z2->io_bookmark.zb_objset)
516 533 return (1);
517 534
518 535 if (z1->io_bookmark.zb_object < z2->io_bookmark.zb_object)
519 536 return (-1);
520 537 if (z1->io_bookmark.zb_object > z2->io_bookmark.zb_object)
521 538 return (1);
522 539
523 540 if (z1->io_bookmark.zb_level < z2->io_bookmark.zb_level)
524 541 return (-1);
525 542 if (z1->io_bookmark.zb_level > z2->io_bookmark.zb_level)
526 543 return (1);
527 544
528 545 if (z1->io_bookmark.zb_blkid < z2->io_bookmark.zb_blkid)
529 546 return (-1);
530 547 if (z1->io_bookmark.zb_blkid > z2->io_bookmark.zb_blkid)
531 548 return (1);
532 549
533 550 if (z1 < z2)
534 551 return (-1);
535 552 if (z1 > z2)
536 553 return (1);
537 554
538 555 return (0);
539 556 }
540 557
541 558 /*
542 559 * ==========================================================================
543 560 * Create the various types of I/O (read, write, free, etc)
544 561 * ==========================================================================
545 562 */
546 563 static zio_t *
547 564 zio_create(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
548 565 abd_t *data, uint64_t lsize, uint64_t psize, zio_done_func_t *done,
549 566 void *private, zio_type_t type, zio_priority_t priority,
550 567 enum zio_flag flags, vdev_t *vd, uint64_t offset,
551 568 const zbookmark_phys_t *zb, enum zio_stage stage, enum zio_stage pipeline)
552 569 {
553 570 zio_t *zio;
554 571
555 572 ASSERT3U(psize, <=, SPA_MAXBLOCKSIZE);
556 573 ASSERT(P2PHASE(psize, SPA_MINBLOCKSIZE) == 0);
557 574 ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);
558 575
559 576 ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));
560 577 ASSERT(!bp || !(flags & ZIO_FLAG_CONFIG_WRITER));
561 578 ASSERT(vd || stage == ZIO_STAGE_OPEN);
562 579
563 580 IMPLY(lsize != psize, (flags & ZIO_FLAG_RAW) != 0);
564 581
565 582 zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
566 583 bzero(zio, sizeof (zio_t));
567 584
568 585 mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
569 586 cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);
570 587
571 588 list_create(&zio->io_parent_list, sizeof (zio_link_t),
572 589 offsetof(zio_link_t, zl_parent_node));
573 590 list_create(&zio->io_child_list, sizeof (zio_link_t),
574 591 offsetof(zio_link_t, zl_child_node));
575 592 metaslab_trace_init(&zio->io_alloc_list);
576 593
577 594 if (vd != NULL)
578 595 zio->io_child_type = ZIO_CHILD_VDEV;
579 596 else if (flags & ZIO_FLAG_GANG_CHILD)
580 597 zio->io_child_type = ZIO_CHILD_GANG;
581 598 else if (flags & ZIO_FLAG_DDT_CHILD)
582 599 zio->io_child_type = ZIO_CHILD_DDT;
583 600 else
584 601 zio->io_child_type = ZIO_CHILD_LOGICAL;
585 602
586 603 if (bp != NULL) {
587 604 zio->io_bp = (blkptr_t *)bp;
588 605 zio->io_bp_copy = *bp;
589 606 zio->io_bp_orig = *bp;
590 607 if (type != ZIO_TYPE_WRITE ||
591 608 zio->io_child_type == ZIO_CHILD_DDT)
592 609 zio->io_bp = &zio->io_bp_copy; /* so caller can free */
593 610 if (zio->io_child_type == ZIO_CHILD_LOGICAL)
594 611 zio->io_logical = zio;
595 612 if (zio->io_child_type > ZIO_CHILD_GANG && BP_IS_GANG(bp))
596 613 pipeline |= ZIO_GANG_STAGES;
597 614 }
598 615
599 616 zio->io_spa = spa;
600 617 zio->io_txg = txg;
601 618 zio->io_done = done;
602 619 zio->io_private = private;
603 620 zio->io_type = type;
604 621 zio->io_priority = priority;
605 622 zio->io_vd = vd;
606 623 zio->io_offset = offset;
607 624 zio->io_orig_abd = zio->io_abd = data;
608 625 zio->io_orig_size = zio->io_size = psize;
609 626 zio->io_lsize = lsize;
610 627 zio->io_orig_flags = zio->io_flags = flags;
611 628 zio->io_orig_stage = zio->io_stage = stage;
612 629 zio->io_orig_pipeline = zio->io_pipeline = pipeline;
613 630 zio->io_pipeline_trace = ZIO_STAGE_OPEN;
614 631
615 632 zio->io_state[ZIO_WAIT_READY] = (stage >= ZIO_STAGE_READY);
616 633 zio->io_state[ZIO_WAIT_DONE] = (stage >= ZIO_STAGE_DONE);
617 634
618 635 if (zb != NULL)
619 636 zio->io_bookmark = *zb;
620 637
621 638 if (pio != NULL) {
639 + zio->io_mc = pio->io_mc;
622 640 if (zio->io_logical == NULL)
623 641 zio->io_logical = pio->io_logical;
624 642 if (zio->io_child_type == ZIO_CHILD_GANG)
625 643 zio->io_gang_leader = pio->io_gang_leader;
626 644 zio_add_child(pio, zio);
645 +
646 + /* copy the smartcomp setting when creating child zio's */
647 + bcopy(&pio->io_smartcomp, &zio->io_smartcomp,
648 + sizeof (zio->io_smartcomp));
627 649 }
628 650
629 651 return (zio);
630 652 }
631 653
632 654 static void
633 655 zio_destroy(zio_t *zio)
634 656 {
635 657 metaslab_trace_fini(&zio->io_alloc_list);
636 658 list_destroy(&zio->io_parent_list);
637 659 list_destroy(&zio->io_child_list);
638 660 mutex_destroy(&zio->io_lock);
639 661 cv_destroy(&zio->io_cv);
640 662 kmem_cache_free(zio_cache, zio);
641 663 }
642 664
643 665 zio_t *
644 666 zio_null(zio_t *pio, spa_t *spa, vdev_t *vd, zio_done_func_t *done,
645 667 void *private, enum zio_flag flags)
646 668 {
647 669 zio_t *zio;
648 670
649 671 zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
650 672 ZIO_TYPE_NULL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
651 673 ZIO_STAGE_OPEN, ZIO_INTERLOCK_PIPELINE);
652 674
653 675 return (zio);
654 676 }
655 677
656 678 zio_t *
657 679 zio_root(spa_t *spa, zio_done_func_t *done, void *private, enum zio_flag flags)
658 680 {
659 681 return (zio_null(NULL, spa, NULL, done, private, flags));
660 682 }
661 683
662 684 void
663 685 zfs_blkptr_verify(spa_t *spa, const blkptr_t *bp)
664 686 {
687 + /*
688 + * SPECIAL-BP has two DVAs, but DVA[0] in this case is a
689 + * temporary DVA, and after migration only the DVA[1]
690 + * contains valid data. Therefore, we start walking for
691 + * these BPs from DVA[1].
692 + */
693 + int start_dva = BP_IS_SPECIAL(bp) ? 1 : 0;
694 +
665 695 if (!DMU_OT_IS_VALID(BP_GET_TYPE(bp))) {
666 696 zfs_panic_recover("blkptr at %p has invalid TYPE %llu",
667 697 bp, (longlong_t)BP_GET_TYPE(bp));
668 698 }
669 699 if (BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
670 700 BP_GET_CHECKSUM(bp) <= ZIO_CHECKSUM_ON) {
671 701 zfs_panic_recover("blkptr at %p has invalid CHECKSUM %llu",
672 702 bp, (longlong_t)BP_GET_CHECKSUM(bp));
673 703 }
674 704 if (BP_GET_COMPRESS(bp) >= ZIO_COMPRESS_FUNCTIONS ||
675 705 BP_GET_COMPRESS(bp) <= ZIO_COMPRESS_ON) {
676 706 zfs_panic_recover("blkptr at %p has invalid COMPRESS %llu",
677 707 bp, (longlong_t)BP_GET_COMPRESS(bp));
678 708 }
679 709 if (BP_GET_LSIZE(bp) > SPA_MAXBLOCKSIZE) {
680 710 zfs_panic_recover("blkptr at %p has invalid LSIZE %llu",
681 711 bp, (longlong_t)BP_GET_LSIZE(bp));
682 712 }
683 713 if (BP_GET_PSIZE(bp) > SPA_MAXBLOCKSIZE) {
684 714 zfs_panic_recover("blkptr at %p has invalid PSIZE %llu",
685 715 bp, (longlong_t)BP_GET_PSIZE(bp));
|
↓ open down ↓ |
11 lines elided |
↑ open up ↑ |
686 716 }
687 717
688 718 if (BP_IS_EMBEDDED(bp)) {
689 719 if (BPE_GET_ETYPE(bp) > NUM_BP_EMBEDDED_TYPES) {
690 720 zfs_panic_recover("blkptr at %p has invalid ETYPE %llu",
691 721 bp, (longlong_t)BPE_GET_ETYPE(bp));
692 722 }
693 723 }
694 724
695 725 /*
696 - * Do not verify individual DVAs if the config is not trusted. This
697 - * will be done once the zio is executed in vdev_mirror_map_alloc.
698 - */
699 - if (!spa->spa_trust_config)
700 - return;
701 -
702 - /*
703 726 * Pool-specific checks.
704 727 *
705 728 * Note: it would be nice to verify that the blk_birth and
706 729 * BP_PHYSICAL_BIRTH() are not too large. However, spa_freeze()
707 730 * allows the birth time of log blocks (and dmu_sync()-ed blocks
708 731 * that are in the log) to be arbitrarily large.
709 732 */
710 - for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
733 + for (int i = start_dva; i < BP_GET_NDVAS(bp); i++) {
711 734 uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[i]);
712 735 if (vdevid >= spa->spa_root_vdev->vdev_children) {
713 736 zfs_panic_recover("blkptr at %p DVA %u has invalid "
714 737 "VDEV %llu",
715 738 bp, i, (longlong_t)vdevid);
716 739 continue;
717 740 }
718 741 vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
719 742 if (vd == NULL) {
720 743 zfs_panic_recover("blkptr at %p DVA %u has invalid "
721 744 "VDEV %llu",
722 745 bp, i, (longlong_t)vdevid);
723 746 continue;
724 747 }
725 748 if (vd->vdev_ops == &vdev_hole_ops) {
726 749 zfs_panic_recover("blkptr at %p DVA %u has hole "
727 750 "VDEV %llu",
728 751 bp, i, (longlong_t)vdevid);
729 752 continue;
730 753 }
731 754 if (vd->vdev_ops == &vdev_missing_ops) {
732 755 /*
733 756 * "missing" vdevs are valid during import, but we
734 757 * don't have their detailed info (e.g. asize), so
735 758 * we can't perform any more checks on them.
736 759 */
737 760 continue;
738 761 }
739 762 uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
740 763 uint64_t asize = DVA_GET_ASIZE(&bp->blk_dva[i]);
|
↓ open down ↓ |
20 lines elided |
↑ open up ↑ |
741 764 if (BP_IS_GANG(bp))
742 765 asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
743 766 if (offset + asize > vd->vdev_asize) {
744 767 zfs_panic_recover("blkptr at %p DVA %u has invalid "
745 768 "OFFSET %llu",
746 769 bp, i, (longlong_t)offset);
747 770 }
748 771 }
749 772 }
750 773
751 -boolean_t
752 -zfs_dva_valid(spa_t *spa, const dva_t *dva, const blkptr_t *bp)
753 -{
754 - uint64_t vdevid = DVA_GET_VDEV(dva);
755 -
756 - if (vdevid >= spa->spa_root_vdev->vdev_children)
757 - return (B_FALSE);
758 -
759 - vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
760 - if (vd == NULL)
761 - return (B_FALSE);
762 -
763 - if (vd->vdev_ops == &vdev_hole_ops)
764 - return (B_FALSE);
765 -
766 - if (vd->vdev_ops == &vdev_missing_ops) {
767 - return (B_FALSE);
768 - }
769 -
770 - uint64_t offset = DVA_GET_OFFSET(dva);
771 - uint64_t asize = DVA_GET_ASIZE(dva);
772 -
773 - if (BP_IS_GANG(bp))
774 - asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
775 - if (offset + asize > vd->vdev_asize)
776 - return (B_FALSE);
777 -
778 - return (B_TRUE);
779 -}
780 -
781 774 zio_t *
782 775 zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
783 776 abd_t *data, uint64_t size, zio_done_func_t *done, void *private,
784 777 zio_priority_t priority, enum zio_flag flags, const zbookmark_phys_t *zb)
785 778 {
786 779 zio_t *zio;
787 780
788 781 zfs_blkptr_verify(spa, bp);
789 782
790 783 zio = zio_create(pio, spa, BP_PHYSICAL_BIRTH(bp), bp,
791 784 data, size, size, done, private,
792 785 ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
793 786 ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
794 787 ZIO_DDT_CHILD_READ_PIPELINE : ZIO_READ_PIPELINE);
795 788
796 789 return (zio);
797 790 }
798 791
799 792 zio_t *
800 793 zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
801 794 abd_t *data, uint64_t lsize, uint64_t psize, const zio_prop_t *zp,
802 795 zio_done_func_t *ready, zio_done_func_t *children_ready,
803 796 zio_done_func_t *physdone, zio_done_func_t *done,
804 797 void *private, zio_priority_t priority, enum zio_flag flags,
805 - const zbookmark_phys_t *zb)
798 + const zbookmark_phys_t *zb,
799 + const zio_smartcomp_info_t *smartcomp)
806 800 {
807 801 zio_t *zio;
808 802
809 803 ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
810 804 zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
811 805 zp->zp_compress >= ZIO_COMPRESS_OFF &&
812 806 zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
813 807 DMU_OT_IS_VALID(zp->zp_type) &&
814 808 zp->zp_level < 32 &&
815 809 zp->zp_copies > 0 &&
816 810 zp->zp_copies <= spa_max_replication(spa));
817 811
818 812 zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
819 813 ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
820 814 ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
821 815 ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);
822 816
823 817 zio->io_ready = ready;
824 818 zio->io_children_ready = children_ready;
825 819 zio->io_physdone = physdone;
826 820 zio->io_prop = *zp;
821 + if (smartcomp != NULL)
822 + bcopy(smartcomp, &zio->io_smartcomp, sizeof (*smartcomp));
827 823
828 824 /*
829 825 * Data can be NULL if we are going to call zio_write_override() to
830 826 * provide the already-allocated BP. But we may need the data to
831 827 * verify a dedup hit (if requested). In this case, don't try to
832 828 * dedup (just take the already-allocated BP verbatim).
833 829 */
834 830 if (data == NULL && zio->io_prop.zp_dedup_verify) {
835 831 zio->io_prop.zp_dedup = zio->io_prop.zp_dedup_verify = B_FALSE;
836 832 }
837 833
838 834 return (zio);
839 835 }
840 836
841 837 zio_t *
842 838 zio_rewrite(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, abd_t *data,
843 839 uint64_t size, zio_done_func_t *done, void *private,
844 840 zio_priority_t priority, enum zio_flag flags, zbookmark_phys_t *zb)
845 841 {
846 842 zio_t *zio;
847 843
848 844 zio = zio_create(pio, spa, txg, bp, data, size, size, done, private,
849 845 ZIO_TYPE_WRITE, priority, flags | ZIO_FLAG_IO_REWRITE, NULL, 0, zb,
850 846 ZIO_STAGE_OPEN, ZIO_REWRITE_PIPELINE);
851 847
852 848 return (zio);
853 849 }
854 850
855 851 void
856 852 zio_write_override(zio_t *zio, blkptr_t *bp, int copies, boolean_t nopwrite)
857 853 {
858 854 ASSERT(zio->io_type == ZIO_TYPE_WRITE);
859 855 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
860 856 ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
861 857 ASSERT(zio->io_txg == spa_syncing_txg(zio->io_spa));
862 858
863 859 /*
864 860 * We must reset the io_prop to match the values that existed
865 861 * when the bp was first written by dmu_sync() keeping in mind
866 862 * that nopwrite and dedup are mutually exclusive.
867 863 */
868 864 zio->io_prop.zp_dedup = nopwrite ? B_FALSE : zio->io_prop.zp_dedup;
869 865 zio->io_prop.zp_nopwrite = nopwrite;
870 866 zio->io_prop.zp_copies = copies;
871 867 zio->io_bp_override = bp;
872 868 }
873 869
874 870 void
875 871 zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp)
876 872 {
877 873
878 - zfs_blkptr_verify(spa, bp);
879 -
880 874 /*
881 875 * The check for EMBEDDED is a performance optimization. We
882 876 * process the free here (by ignoring it) rather than
883 877 * putting it on the list and then processing it in zio_free_sync().
884 878 */
885 879 if (BP_IS_EMBEDDED(bp))
886 880 return;
887 881 metaslab_check_free(spa, bp);
888 882
889 883 /*
890 884 * Frees that are for the currently-syncing txg, are not going to be
891 885 * deferred, and which will not need to do a read (i.e. not GANG or
892 886 * DEDUP), can be processed immediately. Otherwise, put them on the
893 887 * in-memory list for later processing.
894 888 */
895 889 if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
896 890 txg != spa->spa_syncing_txg ||
897 891 spa_sync_pass(spa) >= zfs_sync_pass_deferred_free) {
898 892 bplist_append(&spa->spa_free_bplist[txg & TXG_MASK], bp);
899 893 } else {
900 894 VERIFY0(zio_wait(zio_free_sync(NULL, spa, txg, bp, 0)));
901 895 }
902 896 }
903 897
904 898 zio_t *
905 899 zio_free_sync(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
906 900 enum zio_flag flags)
907 901 {
908 902 zio_t *zio;
909 903 enum zio_stage stage = ZIO_FREE_PIPELINE;
910 904
911 905 ASSERT(!BP_IS_HOLE(bp));
912 906 ASSERT(spa_syncing_txg(spa) == txg);
913 907 ASSERT(spa_sync_pass(spa) < zfs_sync_pass_deferred_free);
914 908
915 909 if (BP_IS_EMBEDDED(bp))
916 910 return (zio_null(pio, spa, NULL, NULL, NULL, 0));
917 911
918 912 metaslab_check_free(spa, bp);
919 913 arc_freed(spa, bp);
914 + dsl_scan_freed(spa, bp);
920 915
921 916 /*
922 917 * GANG and DEDUP blocks can induce a read (for the gang block header,
923 918 * or the DDT), so issue them asynchronously so that this thread is
924 919 * not tied up.
925 920 */
926 921 if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp))
927 922 stage |= ZIO_STAGE_ISSUE_ASYNC;
928 923
929 924 zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
930 925 BP_GET_PSIZE(bp), NULL, NULL, ZIO_TYPE_FREE, ZIO_PRIORITY_NOW,
931 926 flags, NULL, 0, NULL, ZIO_STAGE_OPEN, stage);
932 927
933 928 return (zio);
934 929 }
935 930
936 931 zio_t *
937 932 zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
938 933 zio_done_func_t *done, void *private, enum zio_flag flags)
939 934 {
940 935 zio_t *zio;
941 936
942 - zfs_blkptr_verify(spa, bp);
937 + dprintf_bp(bp, "claiming in txg %llu", txg);
943 938
944 939 if (BP_IS_EMBEDDED(bp))
945 940 return (zio_null(pio, spa, NULL, NULL, NULL, 0));
946 941
947 942 /*
948 943 * A claim is an allocation of a specific block. Claims are needed
949 944 * to support immediate writes in the intent log. The issue is that
950 945 * immediate writes contain committed data, but in a txg that was
951 946 * *not* committed. Upon opening the pool after an unclean shutdown,
952 947 * the intent log claims all blocks that contain immediate write data
953 948 * so that the SPA knows they're in use.
954 949 *
955 950 * All claims *must* be resolved in the first txg -- before the SPA
956 951 * starts allocating blocks -- so that nothing is allocated twice.
957 952 * If txg == 0 we just verify that the block is claimable.
958 953 */
959 954 ASSERT3U(spa->spa_uberblock.ub_rootbp.blk_birth, <, spa_first_txg(spa));
960 955 ASSERT(txg == spa_first_txg(spa) || txg == 0);
961 956 ASSERT(!BP_GET_DEDUP(bp) || !spa_writeable(spa)); /* zdb(1M) */
962 957
963 958 zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
964 959 BP_GET_PSIZE(bp), done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW,
965 960 flags, NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);
966 961 ASSERT0(zio->io_queued_timestamp);
967 962
968 963 return (zio);
969 964 }
970 965
971 -zio_t *
972 -zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
973 - zio_done_func_t *done, void *private, enum zio_flag flags)
966 +static zio_t *
967 +zio_ioctl_with_pipeline(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
968 + zio_done_func_t *done, void *private, enum zio_flag flags,
969 + enum zio_stage pipeline)
974 970 {
975 971 zio_t *zio;
976 972 int c;
977 973
978 974 if (vd->vdev_children == 0) {
979 975 zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
980 976 ZIO_TYPE_IOCTL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
981 - ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
977 + ZIO_STAGE_OPEN, pipeline);
982 978
983 979 zio->io_cmd = cmd;
984 980 } else {
985 - zio = zio_null(pio, spa, NULL, NULL, NULL, flags);
986 -
987 - for (c = 0; c < vd->vdev_children; c++)
988 - zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
989 - done, private, flags));
981 + zio = zio_null(pio, spa, vd, done, private, flags);
982 + /*
983 + * DKIOCFREE ioctl's need some special handling on interior
984 + * vdevs. If the device provides an ops function to handle
985 + * recomputing dkioc_free extents, then we call it.
986 + * Otherwise the default behavior applies, which simply fans
987 + * out the ioctl to all component vdevs.
988 + */
989 + if (cmd == DKIOCFREE && vd->vdev_ops->vdev_op_trim != NULL) {
990 + vd->vdev_ops->vdev_op_trim(vd, zio, private);
991 + } else {
992 + for (c = 0; c < vd->vdev_children; c++)
993 + zio_nowait(zio_ioctl_with_pipeline(zio,
994 + spa, vd->vdev_child[c], cmd, NULL,
995 + private, flags, pipeline));
996 + }
990 997 }
991 998
992 999 return (zio);
993 1000 }
994 1001
995 1002 zio_t *
1003 +zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
1004 + zio_done_func_t *done, void *private, enum zio_flag flags)
1005 +{
1006 + return (zio_ioctl_with_pipeline(pio, spa, vd, cmd, done,
1007 + private, flags, ZIO_IOCTL_PIPELINE));
1008 +}
1009 +
1010 +/*
1011 + * Callback for when a trim zio has completed. This simply frees the
1012 + * dkioc_free_list_t extent list of the DKIOCFREE ioctl.
1013 + */
1014 +static void
1015 +zio_trim_done(zio_t *zio)
1016 +{
1017 + VERIFY(zio->io_private != NULL);
1018 + dfl_free(zio->io_private);
1019 +}
1020 +
1021 +static void
1022 +zio_trim_check(uint64_t start, uint64_t len, void *msp)
1023 +{
1024 + metaslab_t *ms = msp;
1025 + boolean_t held = MUTEX_HELD(&ms->ms_lock);
1026 + if (!held)
1027 + mutex_enter(&ms->ms_lock);
1028 + ASSERT(ms->ms_trimming_ts != NULL);
1029 + ASSERT(range_tree_contains(ms->ms_trimming_ts->ts_tree,
1030 + start - VDEV_LABEL_START_SIZE, len));
1031 + if (!held)
1032 + mutex_exit(&ms->ms_lock);
1033 +}
1034 +
1035 +/*
1036 + * Takes a bunch of freed extents and tells the underlying vdevs that the
1037 + * space associated with these extents can be released.
1038 + * This is used by flash storage to pre-erase blocks for rapid reuse later
1039 + * and thin-provisioned block storage to reclaim unused blocks.
1040 + */
1041 +zio_t *
1042 +zio_trim(spa_t *spa, vdev_t *vd, struct range_tree *tree,
1043 + zio_done_func_t *done, void *private, enum zio_flag flags,
1044 + int trim_flags, metaslab_t *msp)
1045 +{
1046 + dkioc_free_list_t *dfl = NULL;
1047 + range_seg_t *rs;
1048 + uint64_t rs_idx;
1049 + uint64_t num_exts;
1050 + uint64_t bytes_issued = 0, bytes_skipped = 0, exts_skipped = 0;
1051 + /*
1052 + * We need this to invoke the caller's `done' callback with the
1053 + * correct io_private (not the dkioc_free_list_t, which is needed
1054 + * by the underlying DKIOCFREE ioctl).
1055 + */
1056 + zio_t *sub_pio = zio_root(spa, done, private, flags);
1057 +
1058 + ASSERT(range_tree_space(tree) != 0);
1059 +
1060 + if (!zfs_trim)
1061 + return (sub_pio);
1062 +
1063 + num_exts = avl_numnodes(&tree->rt_root);
1064 + dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
1065 + dfl->dfl_flags = trim_flags;
1066 + dfl->dfl_num_exts = num_exts;
1067 + dfl->dfl_offset = VDEV_LABEL_START_SIZE;
1068 + if (msp) {
1069 + dfl->dfl_ck_func = zio_trim_check;
1070 + dfl->dfl_ck_arg = msp;
1071 + }
1072 +
1073 + for (rs = avl_first(&tree->rt_root), rs_idx = 0; rs != NULL;
1074 + rs = AVL_NEXT(&tree->rt_root, rs)) {
1075 + uint64_t len = rs->rs_end - rs->rs_start;
1076 +
1077 + if (len < zfs_trim_min_ext_sz) {
1078 + bytes_skipped += len;
1079 + exts_skipped++;
1080 + continue;
1081 + }
1082 +
1083 + dfl->dfl_exts[rs_idx].dfle_start = rs->rs_start;
1084 + dfl->dfl_exts[rs_idx].dfle_length = len;
1085 +
1086 + // check we're a multiple of the vdev ashift
1087 + ASSERT0(dfl->dfl_exts[rs_idx].dfle_start &
1088 + ((1 << vd->vdev_ashift) - 1));
1089 + ASSERT0(dfl->dfl_exts[rs_idx].dfle_length &
1090 + ((1 << vd->vdev_ashift) - 1));
1091 +
1092 + rs_idx++;
1093 + bytes_issued += len;
1094 + }
1095 +
1096 + spa_trimstats_update(spa, rs_idx, bytes_issued, exts_skipped,
1097 + bytes_skipped);
1098 +
1099 + /* the zfs_trim_min_ext_sz filter may have shortened the list */
1100 + if (dfl->dfl_num_exts != rs_idx) {
1101 + dkioc_free_list_t *dfl2 = kmem_zalloc(DFL_SZ(rs_idx), KM_SLEEP);
1102 + bcopy(dfl, dfl2, DFL_SZ(rs_idx));
1103 + dfl2->dfl_num_exts = rs_idx;
1104 + dfl_free(dfl);
1105 + dfl = dfl2;
1106 + }
1107 +
1108 + zio_nowait(zio_ioctl_with_pipeline(sub_pio, spa, vd, DKIOCFREE,
1109 + zio_trim_done, dfl, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
1110 + ZIO_FLAG_DONT_RETRY, ZIO_TRIM_PIPELINE));
1111 + return (sub_pio);
1112 +}
1113 +
1114 +zio_t *
996 1115 zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
997 1116 abd_t *data, int checksum, zio_done_func_t *done, void *private,
998 1117 zio_priority_t priority, enum zio_flag flags, boolean_t labels)
999 1118 {
1000 1119 zio_t *zio;
1001 1120
1002 1121 ASSERT(vd->vdev_children == 0);
1003 1122 ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
1004 1123 offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
1005 1124 ASSERT3U(offset + size, <=, vd->vdev_psize);
1006 1125
1007 1126 zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, size, done,
1008 1127 private, ZIO_TYPE_READ, priority, flags | ZIO_FLAG_PHYSICAL, vd,
1009 1128 offset, NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
1010 1129
1011 1130 zio->io_prop.zp_checksum = checksum;
1012 1131
1013 1132 return (zio);
1014 1133 }
1015 1134
1016 1135 zio_t *
1017 1136 zio_write_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
1018 1137 abd_t *data, int checksum, zio_done_func_t *done, void *private,
1019 1138 zio_priority_t priority, enum zio_flag flags, boolean_t labels)
1020 1139 {
1021 1140 zio_t *zio;
1022 1141
1023 1142 ASSERT(vd->vdev_children == 0);
1024 1143 ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
1025 1144 offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
1026 1145 ASSERT3U(offset + size, <=, vd->vdev_psize);
1027 1146
1028 1147 zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, size, done,
1029 1148 private, ZIO_TYPE_WRITE, priority, flags | ZIO_FLAG_PHYSICAL, vd,
1030 1149 offset, NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
1031 1150
1032 1151 zio->io_prop.zp_checksum = checksum;
1033 1152
1034 1153 if (zio_checksum_table[checksum].ci_flags & ZCHECKSUM_FLAG_EMBEDDED) {
1035 1154 /*
1036 1155 * zec checksums are necessarily destructive -- they modify
1037 1156 * the end of the write buffer to hold the verifier/checksum.
1038 1157 * Therefore, we must make a local copy in case the data is
1039 1158 * being written to multiple places in parallel.
1040 1159 */
1041 1160 abd_t *wbuf = abd_alloc_sametype(data, size);
1042 1161 abd_copy(wbuf, data, size);
1043 1162
1044 1163 zio_push_transform(zio, wbuf, size, size, NULL);
1045 1164 }
1046 1165
1047 1166 return (zio);
1048 1167 }
1049 1168
1050 1169 /*
1051 1170 * Create a child I/O to do some work for us.
1052 1171 */
1053 1172 zio_t *
1054 1173 zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
1055 1174 abd_t *data, uint64_t size, int type, zio_priority_t priority,
1056 1175 enum zio_flag flags, zio_done_func_t *done, void *private)
1057 1176 {
1058 1177 enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;
1059 1178 zio_t *zio;
1060 1179
1061 - /*
1062 - * vdev child I/Os do not propagate their error to the parent.
1063 - * Therefore, for correct operation the caller *must* check for
1064 - * and handle the error in the child i/o's done callback.
1065 - * The only exceptions are i/os that we don't care about
1066 - * (OPTIONAL or REPAIR).
1067 - */
1068 - ASSERT((flags & ZIO_FLAG_OPTIONAL) || (flags & ZIO_FLAG_IO_REPAIR) ||
1069 - done != NULL);
1180 + ASSERT(vd->vdev_parent ==
1181 + (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
1070 1182
1071 - /*
1072 - * In the common case, where the parent zio was to a normal vdev,
1073 - * the child zio must be to a child vdev of that vdev. Otherwise,
1074 - * the child zio must be to a top-level vdev.
1075 - */
1076 - if (pio->io_vd != NULL && pio->io_vd->vdev_ops != &vdev_indirect_ops) {
1077 - ASSERT3P(vd->vdev_parent, ==, pio->io_vd);
1078 - } else {
1079 - ASSERT3P(vd, ==, vd->vdev_top);
1080 - }
1081 -
1082 1183 if (type == ZIO_TYPE_READ && bp != NULL) {
1083 1184 /*
1084 1185 * If we have the bp, then the child should perform the
1085 1186 * checksum and the parent need not. This pushes error
1086 1187 * detection as close to the leaves as possible and
1087 1188 * eliminates redundant checksums in the interior nodes.
1088 1189 */
1089 1190 pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
1090 1191 pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
1091 1192 }
1092 1193
1093 - if (vd->vdev_ops->vdev_op_leaf) {
1094 - ASSERT0(vd->vdev_children);
1194 + if (vd->vdev_children == 0)
1095 1195 offset += VDEV_LABEL_START_SIZE;
1096 - }
1097 1196
1098 - flags |= ZIO_VDEV_CHILD_FLAGS(pio);
1197 + flags |= ZIO_VDEV_CHILD_FLAGS(pio) | ZIO_FLAG_DONT_PROPAGATE;
1099 1198
1100 1199 /*
1101 1200 * If we've decided to do a repair, the write is not speculative --
1102 1201 * even if the original read was.
1103 1202 */
1104 1203 if (flags & ZIO_FLAG_IO_REPAIR)
1105 1204 flags &= ~ZIO_FLAG_SPECULATIVE;
1106 1205
1107 1206 /*
1108 1207 * If we're creating a child I/O that is not associated with a
1109 1208 * top-level vdev, then the child zio is not an allocating I/O.
1110 1209 * If this is a retried I/O then we ignore it since we will
1111 1210 * have already processed the original allocating I/O.
1112 1211 */
1113 1212 if (flags & ZIO_FLAG_IO_ALLOCATING &&
1114 1213 (vd != vd->vdev_top || (flags & ZIO_FLAG_IO_RETRY))) {
1115 - metaslab_class_t *mc = spa_normal_class(pio->io_spa);
1214 + metaslab_class_t *mc = pio->io_mc;
1116 1215
1117 1216 ASSERT(mc->mc_alloc_throttle_enabled);
1118 1217 ASSERT(type == ZIO_TYPE_WRITE);
1119 1218 ASSERT(priority == ZIO_PRIORITY_ASYNC_WRITE);
1120 1219 ASSERT(!(flags & ZIO_FLAG_IO_REPAIR));
1121 1220 ASSERT(!(pio->io_flags & ZIO_FLAG_IO_REWRITE) ||
1122 1221 pio->io_child_type == ZIO_CHILD_GANG);
1123 1222
1124 1223 flags &= ~ZIO_FLAG_IO_ALLOCATING;
1125 1224 }
1126 1225
1127 1226 zio = zio_create(pio, pio->io_spa, pio->io_txg, bp, data, size, size,
1128 1227 done, private, type, priority, flags, vd, offset, &pio->io_bookmark,
1129 1228 ZIO_STAGE_VDEV_IO_START >> 1, pipeline);
1130 1229 ASSERT3U(zio->io_child_type, ==, ZIO_CHILD_VDEV);
1131 1230
1132 1231 zio->io_physdone = pio->io_physdone;
1133 1232 if (vd->vdev_ops->vdev_op_leaf && zio->io_logical != NULL)
1134 1233 zio->io_logical->io_phys_children++;
1135 1234
1136 1235 return (zio);
1137 1236 }
1138 1237
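A vdev implementation typically fans its work out through zio_vdev_child_io() and collects the results in the done callback it passes in; a mirror, for instance, issues one child per readable side. A minimal sketch of that call pattern (child_vd, child_offset, my_child_done and my_private are illustrative names, not part of this change):

	/* Illustrative only: fan a parent I/O out to one child vdev. */
	zio_nowait(zio_vdev_child_io(zio, zio->io_bp, child_vd, child_offset,
	    zio->io_abd, zio->io_size, zio->io_type, zio->io_priority,
	    0, my_child_done, my_private));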
1139 1238 zio_t *
1140 1239 zio_vdev_delegated_io(vdev_t *vd, uint64_t offset, abd_t *data, uint64_t size,
1141 1240 int type, zio_priority_t priority, enum zio_flag flags,
1142 1241 zio_done_func_t *done, void *private)
1143 1242 {
1144 1243 zio_t *zio;
1145 1244
1146 1245 ASSERT(vd->vdev_ops->vdev_op_leaf);
1147 1246
1148 1247 zio = zio_create(NULL, vd->vdev_spa, 0, NULL,
1149 1248 data, size, size, done, private, type, priority,
1150 1249 flags | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_RETRY | ZIO_FLAG_DELEGATED,
1151 1250 vd, offset, NULL,
1152 1251 ZIO_STAGE_VDEV_IO_START >> 1, ZIO_VDEV_CHILD_PIPELINE);
1153 1252
1154 1253 return (zio);
1155 1254 }
1156 1255
1157 1256 void
1158 1257 zio_flush(zio_t *zio, vdev_t *vd)
1159 1258 {
1160 1259 zio_nowait(zio_ioctl(zio, zio->io_spa, vd, DKIOCFLUSHWRITECACHE,
1161 1260 NULL, NULL,
1162 1261 ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY));
1163 1262 }
1164 1263
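zio_flush() is normally hung off a shared parent so that cache-flush ioctls to many vdevs can be issued in parallel and waited on together. A minimal sketch of that pattern (the loop bounds and vdev array are illustrative, not taken from this change):

	/* Illustrative only: flush the write caches of a set of leaf vdevs. */
	zio_t *root = zio_root(spa, NULL, NULL, ZIO_FLAG_CANFAIL);
	for (int i = 0; i < nvdevs; i++)
		zio_flush(root, vdevs[i]);
	(void) zio_wait(root);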
1165 1264 void
1166 1265 zio_shrink(zio_t *zio, uint64_t size)
1167 1266 {
1168 1267 ASSERT3P(zio->io_executor, ==, NULL);
1169 1268 ASSERT3P(zio->io_orig_size, ==, zio->io_size);
1170 1269 ASSERT3U(size, <=, zio->io_size);
1171 1270
1172 1271 /*
1173 1272 * We don't shrink for raidz because of problems with the
1174 1273 * reconstruction when reading back less than the block size.
1175 1274 * Note, BP_IS_RAIDZ() assumes no compression.
1176 1275 */
1177 1276 ASSERT(BP_GET_COMPRESS(zio->io_bp) == ZIO_COMPRESS_OFF);
1178 1277 if (!BP_IS_RAIDZ(zio->io_bp)) {
1179 1278 /* we are not doing a raw write */
1180 1279 ASSERT3U(zio->io_size, ==, zio->io_lsize);
1181 1280 zio->io_orig_size = zio->io_size = zio->io_lsize = size;
1182 1281 }
1183 1282 }
1184 1283
1185 1284 /*
1186 1285 * ==========================================================================
1187 1286 * Prepare to read and write logical blocks
1188 1287 * ==========================================================================
1189 1288 */
1190 1289
1191 1290 static int
1192 1291 zio_read_bp_init(zio_t *zio)
1193 1292 {
1194 1293 blkptr_t *bp = zio->io_bp;
1195 1294
1196 - ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1197 -
1198 1295 if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF &&
1199 1296 zio->io_child_type == ZIO_CHILD_LOGICAL &&
1200 1297 !(zio->io_flags & ZIO_FLAG_RAW)) {
1201 1298 uint64_t psize =
1202 1299 BP_IS_EMBEDDED(bp) ? BPE_GET_PSIZE(bp) : BP_GET_PSIZE(bp);
1203 1300 zio_push_transform(zio, abd_alloc_sametype(zio->io_abd, psize),
1204 1301 psize, psize, zio_decompress);
1205 1302 }
1206 1303
1207 1304 if (BP_IS_EMBEDDED(bp) && BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA) {
1208 1305 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1209 1306
1210 1307 int psize = BPE_GET_PSIZE(bp);
1211 1308 void *data = abd_borrow_buf(zio->io_abd, psize);
1212 1309 decode_embedded_bp_compressed(bp, data);
1213 1310 abd_return_buf_copy(zio->io_abd, data, psize);
1214 1311 } else {
1215 1312 ASSERT(!BP_IS_EMBEDDED(bp));
1216 - ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1217 1313 }
1218 1314
1219 - if (!DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) && BP_GET_LEVEL(bp) == 0)
1315 + if (!BP_IS_METADATA(bp))
1220 1316 zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1221 1317
1222 1318 if (BP_GET_TYPE(bp) == DMU_OT_DDT_ZAP)
1223 1319 zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1224 1320
1225 1321 if (BP_GET_DEDUP(bp) && zio->io_child_type == ZIO_CHILD_LOGICAL)
1226 1322 zio->io_pipeline = ZIO_DDT_READ_PIPELINE;
1227 1323
1228 1324 return (ZIO_PIPELINE_CONTINUE);
1229 1325 }
1230 1326
1231 1327 static int
1232 1328 zio_write_bp_init(zio_t *zio)
1233 1329 {
1234 1330 if (!IO_IS_ALLOCATING(zio))
1235 1331 return (ZIO_PIPELINE_CONTINUE);
1236 1332
1237 1333 ASSERT(zio->io_child_type != ZIO_CHILD_DDT);
1238 1334
1239 1335 if (zio->io_bp_override) {
1240 1336 blkptr_t *bp = zio->io_bp;
1241 1337 zio_prop_t *zp = &zio->io_prop;
1242 1338
1243 1339 ASSERT(bp->blk_birth != zio->io_txg);
1244 1340 ASSERT(BP_GET_DEDUP(zio->io_bp_override) == 0);
1245 1341
1246 1342 *bp = *zio->io_bp_override;
1247 1343 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1248 1344
1249 1345 if (BP_IS_EMBEDDED(bp))
1250 1346 return (ZIO_PIPELINE_CONTINUE);
1251 1347
1252 1348 /*
1253 1349 * If we've been overridden and nopwrite is set then
1254 1350 * set the flag accordingly to indicate that a nopwrite
1255 1351 * has already occurred.
1256 1352 */
1257 1353 if (!BP_IS_HOLE(bp) && zp->zp_nopwrite) {
1258 1354 ASSERT(!zp->zp_dedup);
1259 1355 ASSERT3U(BP_GET_CHECKSUM(bp), ==, zp->zp_checksum);
1260 1356 zio->io_flags |= ZIO_FLAG_NOPWRITE;
1261 1357 return (ZIO_PIPELINE_CONTINUE);
1262 1358 }
1263 1359
1264 1360 ASSERT(!zp->zp_nopwrite);
1265 1361
1266 1362 if (BP_IS_HOLE(bp) || !zp->zp_dedup)
1267 1363 return (ZIO_PIPELINE_CONTINUE);
1268 1364
1269 1365 ASSERT((zio_checksum_table[zp->zp_checksum].ci_flags &
1270 1366 ZCHECKSUM_FLAG_DEDUP) || zp->zp_dedup_verify);
1271 1367
1272 1368 if (BP_GET_CHECKSUM(bp) == zp->zp_checksum) {
1273 1369 BP_SET_DEDUP(bp, 1);
1274 1370 zio->io_pipeline |= ZIO_STAGE_DDT_WRITE;
1275 1371 return (ZIO_PIPELINE_CONTINUE);
1276 1372 }
1277 1373
1278 1374 /*
1279 1375 * We were unable to handle this as an override bp, treat
1280 1376 * it as a regular write I/O.
1281 1377 */
1282 1378 zio->io_bp_override = NULL;
1283 1379 *bp = zio->io_bp_orig;
1284 1380 zio->io_pipeline = zio->io_orig_pipeline;
1285 1381 }
1286 1382
1287 1383 return (ZIO_PIPELINE_CONTINUE);
1288 1384 }
1289 1385
1290 1386 static int
1291 1387 zio_write_compress(zio_t *zio)
1292 1388 {
1293 1389 spa_t *spa = zio->io_spa;
1294 1390 zio_prop_t *zp = &zio->io_prop;
1295 1391 enum zio_compress compress = zp->zp_compress;
1296 1392 blkptr_t *bp = zio->io_bp;
1297 1393 uint64_t lsize = zio->io_lsize;
1298 1394 uint64_t psize = zio->io_size;
1299 1395 int pass = 1;
1300 1396
1301 1397 EQUIV(lsize != psize, (zio->io_flags & ZIO_FLAG_RAW) != 0);
1302 1398
1303 1399 /*
1304 1400 * If our children haven't all reached the ready stage,
1305 1401 * wait for them and then repeat this pipeline stage.
1306 1402 */
1307 - if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
1308 - ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
1403 + if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
1404 + zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
1309 1405 return (ZIO_PIPELINE_STOP);
1310 - }
1311 1406
1312 1407 if (!IO_IS_ALLOCATING(zio))
1313 1408 return (ZIO_PIPELINE_CONTINUE);
1314 1409
1315 1410 if (zio->io_children_ready != NULL) {
1316 1411 /*
1317 1412 * Now that all our children are ready, run the callback
1318 1413 * associated with this zio in case it wants to modify the
1319 1414 * data to be written.
1320 1415 */
1321 1416 ASSERT3U(zp->zp_level, >, 0);
1322 1417 zio->io_children_ready(zio);
1323 1418 }
1324 1419
1325 1420 ASSERT(zio->io_child_type != ZIO_CHILD_DDT);
1326 1421 ASSERT(zio->io_bp_override == NULL);
1327 1422
1328 1423 if (!BP_IS_HOLE(bp) && bp->blk_birth == zio->io_txg) {
1329 1424 /*
1330 1425 * We're rewriting an existing block, which means we're
1331 1426 * working on behalf of spa_sync(). For spa_sync() to
1332 1427 * converge, it must eventually be the case that we don't
1333 1428 * have to allocate new blocks. But compression changes
1334 1429 * the blocksize, which forces a reallocate, and makes
1335 1430 * convergence take longer. Therefore, after the first
1336 1431 * few passes, stop compressing to ensure convergence.
1337 1432 */
1338 1433 pass = spa_sync_pass(spa);
1339 1434
1340 1435 ASSERT(zio->io_txg == spa_syncing_txg(spa));
1341 1436 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1342 1437 ASSERT(!BP_GET_DEDUP(bp));
1343 1438
1344 1439 if (pass >= zfs_sync_pass_dont_compress)
1345 1440 compress = ZIO_COMPRESS_OFF;
1346 1441
1347 1442 /* Make sure someone doesn't change their mind on overwrites */
1348 1443 ASSERT(BP_IS_EMBEDDED(bp) || MIN(zp->zp_copies + BP_IS_GANG(bp),
1349 1444 spa_max_replication(spa)) == BP_GET_NDVAS(bp));
1350 1445 }
1351 1446
1447 + DTRACE_PROBE1(zio_compress_ready, zio_t *, zio);
1352 1448 /* If it's a compressed write that is not raw, compress the buffer. */
1353 - if (compress != ZIO_COMPRESS_OFF && psize == lsize) {
1449 + if (compress != ZIO_COMPRESS_OFF && psize == lsize &&
1450 + ZIO_SHOULD_COMPRESS(zio)) {
1354 1451 void *cbuf = zio_buf_alloc(lsize);
1355 1452 psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
1356 1453 if (psize == 0 || psize == lsize) {
1357 1454 compress = ZIO_COMPRESS_OFF;
1358 1455 zio_buf_free(cbuf, lsize);
1359 1456 } else if (!zp->zp_dedup && psize <= BPE_PAYLOAD_SIZE &&
1360 1457 zp->zp_level == 0 && !DMU_OT_HAS_FILL(zp->zp_type) &&
1361 1458 spa_feature_is_enabled(spa, SPA_FEATURE_EMBEDDED_DATA)) {
1362 1459 encode_embedded_bp_compressed(bp,
1363 1460 cbuf, compress, lsize, psize);
1364 1461 BPE_SET_ETYPE(bp, BP_EMBEDDED_TYPE_DATA);
1365 1462 BP_SET_TYPE(bp, zio->io_prop.zp_type);
1366 1463 BP_SET_LEVEL(bp, zio->io_prop.zp_level);
1367 1464 zio_buf_free(cbuf, lsize);
1368 1465 bp->blk_birth = zio->io_txg;
1369 1466 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1370 1467 ASSERT(spa_feature_is_active(spa,
1371 1468 SPA_FEATURE_EMBEDDED_DATA));
1469 + if (zio->io_smartcomp.sc_result != NULL) {
1470 + zio->io_smartcomp.sc_result(
1471 + zio->io_smartcomp.sc_userinfo, zio);
1472 + } else {
1473 + ASSERT(zio->io_smartcomp.sc_ask == NULL);
1474 + }
1372 1475 return (ZIO_PIPELINE_CONTINUE);
1373 1476 } else {
1374 1477 /*
1375 1478 * Round the compressed size up to the ashift
1376 1479 * of the smallest-ashift device, and zero the tail.
1377 1480 * This ensures that the compressed size of the BP
1378 1481 * (and thus compressratio property) are correct,
1379 1482 * in that we charge for the padding used to fill out
1380 1483 * the last sector.
1381 1484 */
1382 1485 ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
1383 1486 size_t rounded = (size_t)P2ROUNDUP(psize,
1384 1487 1ULL << spa->spa_min_ashift);
1385 1488 if (rounded >= lsize) {
1386 1489 compress = ZIO_COMPRESS_OFF;
1387 1490 zio_buf_free(cbuf, lsize);
1388 1491 psize = lsize;
1389 1492 } else {
1390 1493 abd_t *cdata = abd_get_from_buf(cbuf, lsize);
1391 1494 abd_take_ownership_of_buf(cdata, B_TRUE);
1392 1495 abd_zero_off(cdata, psize, rounded - psize);
1393 1496 psize = rounded;
1394 1497 zio_push_transform(zio, cdata,
1395 1498 psize, lsize, NULL);
1396 1499 }
1397 1500 }
1398 1501
1502 + if (zio->io_smartcomp.sc_result != NULL) {
1503 + zio->io_smartcomp.sc_result(
1504 + zio->io_smartcomp.sc_userinfo, zio);
1505 + } else {
1506 + ASSERT(zio->io_smartcomp.sc_ask == NULL);
1507 + }
1508 +
1399 1509 /*
1400 1510 * We were unable to handle this as an override bp, treat
1401 1511 * it as a regular write I/O.
1402 1512 */
1403 1513 zio->io_bp_override = NULL;
1404 1514 *bp = zio->io_bp_orig;
1405 1515 zio->io_pipeline = zio->io_orig_pipeline;
1406 1516 } else {
1407 1517 ASSERT3U(psize, !=, 0);
1518 +
1519 + /*
1520 + * We are here for one of three reasons:
1521 + * - compress == ZIO_COMPRESS_OFF,
1522 + * - SmartCompression decided not to compress this data, or
1523 + * - this is a RAW write.
1524 + *
1525 + * For a RAW write we must not override "compress".
1526 + */
1527 + if ((zio->io_flags & ZIO_FLAG_RAW) == 0)
1528 + compress = ZIO_COMPRESS_OFF;
1408 1529 }
1409 1530
1410 1531 /*
1411 1532 * The final pass of spa_sync() must be all rewrites, but the first
1412 1533 * few passes offer a trade-off: allocating blocks defers convergence,
1413 1534 * but newly allocated blocks are sequential, so they can be written
1414 1535 * to disk faster. Therefore, we allow the first few passes of
1415 1536 * spa_sync() to allocate new blocks, but force rewrites after that.
1416 1537 * There should only be a handful of blocks after pass 1 in any case.
1417 1538 */
1418 1539 if (!BP_IS_HOLE(bp) && bp->blk_birth == zio->io_txg &&
1419 1540 BP_GET_PSIZE(bp) == psize &&
1420 1541 pass >= zfs_sync_pass_rewrite) {
1421 1542 ASSERT(psize != 0);
1422 1543 enum zio_stage gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
1423 1544 zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
1424 1545 zio->io_flags |= ZIO_FLAG_IO_REWRITE;
1425 1546 } else {
1426 1547 BP_ZERO(bp);
1427 1548 zio->io_pipeline = ZIO_WRITE_PIPELINE;
1428 1549 }
1429 1550
1430 1551 if (psize == 0) {
1431 1552 if (zio->io_bp_orig.blk_birth != 0 &&
1432 1553 spa_feature_is_active(spa, SPA_FEATURE_HOLE_BIRTH)) {
1433 1554 BP_SET_LSIZE(bp, lsize);
1434 1555 BP_SET_TYPE(bp, zp->zp_type);
1435 1556 BP_SET_LEVEL(bp, zp->zp_level);
1436 1557 BP_SET_BIRTH(bp, zio->io_txg, 0);
1437 1558 }
1438 1559 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1439 1560 } else {
1561 + if (zp->zp_dedup) {
1562 + /* check the best-effort dedup setting */
1563 + zio_best_effort_dedup(zio);
1564 + }
1440 1565 ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
1441 1566 BP_SET_LSIZE(bp, lsize);
1442 1567 BP_SET_TYPE(bp, zp->zp_type);
1443 1568 BP_SET_LEVEL(bp, zp->zp_level);
1444 1569 BP_SET_PSIZE(bp, psize);
1445 1570 BP_SET_COMPRESS(bp, compress);
1446 1571 BP_SET_CHECKSUM(bp, zp->zp_checksum);
1447 1572 BP_SET_DEDUP(bp, zp->zp_dedup);
1448 1573 BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
1449 1574 if (zp->zp_dedup) {
1450 1575 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1451 1576 ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
1452 1577 zio->io_pipeline = ZIO_DDT_WRITE_PIPELINE;
1453 1578 }
1454 1579 if (zp->zp_nopwrite) {
1455 1580 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1456 1581 ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
1457 1582 zio->io_pipeline |= ZIO_STAGE_NOP_WRITE;
1458 1583 }
1459 1584 }
1460 1585 return (ZIO_PIPELINE_CONTINUE);
1461 1586 }
1462 1587
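The io_smartcomp hooks consulted above (sc_ask via ZIO_SHOULD_COMPRESS(), and sc_result once the outcome is known) are not defined in this file, so the exact contract is not visible here. A rough sketch of what a consumer might register, with the sc_ask signature assumed rather than confirmed:

	/* Hypothetical smart-compression callbacks; signatures are assumptions. */
	static boolean_t
	my_sc_ask(void *userinfo, zio_t *zio)
	{
		return (B_TRUE);	/* e.g. always worth trying to compress */
	}

	static void
	my_sc_result(void *userinfo, zio_t *zio)
	{
		/* inspect BP_GET_COMPRESS(zio->io_bp) to see what was chosen */
	}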
1463 1588 static int
1464 1589 zio_free_bp_init(zio_t *zio)
1465 1590 {
1466 1591 blkptr_t *bp = zio->io_bp;
1467 1592
1468 1593 if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
1469 1594 if (BP_GET_DEDUP(bp))
1470 1595 zio->io_pipeline = ZIO_DDT_FREE_PIPELINE;
1471 1596 }
1472 1597
1473 - ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1474 -
1475 1598 return (ZIO_PIPELINE_CONTINUE);
1476 1599 }
1477 1600
1478 1601 /*
1479 1602 * ==========================================================================
1480 1603 * Execute the I/O pipeline
1481 1604 * ==========================================================================
1482 1605 */
1483 1606
1484 1607 static void
1485 1608 zio_taskq_dispatch(zio_t *zio, zio_taskq_type_t q, boolean_t cutinline)
1486 1609 {
1487 1610 spa_t *spa = zio->io_spa;
1488 1611 zio_type_t t = zio->io_type;
1489 1612 int flags = (cutinline ? TQ_FRONT : 0);
1490 1613
1491 1614 /*
1492 1615 * If we're a config writer or a probe, the normal issue and
1493 1616 * interrupt threads may all be blocked waiting for the config lock.
1494 1617 * In this case, select the otherwise-unused taskq for ZIO_TYPE_NULL.
1495 1618 */
1496 1619 if (zio->io_flags & (ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_PROBE))
1497 1620 t = ZIO_TYPE_NULL;
1498 1621
1499 1622 /*
1500 1623 * A similar issue exists for the L2ARC write thread until L2ARC 2.0.
1501 1624 */
1502 1625 if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
1503 1626 t = ZIO_TYPE_NULL;
1504 1627
1505 1628 /*
1506 1629 * If this is a high priority I/O, then use the high priority taskq if
1507 1630 * available.
1508 1631 */
1509 - if (zio->io_priority == ZIO_PRIORITY_NOW &&
1632 + if ((zio->io_priority == ZIO_PRIORITY_NOW ||
1633 + zio->io_priority == ZIO_PRIORITY_SYNC_WRITE) &&
1510 1634 spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
1511 1635 q++;
1512 1636
1513 1637 ASSERT3U(q, <, ZIO_TASKQ_TYPES);
1514 1638
1515 1639 /*
1516 1640 * NB: We are assuming that the zio can only be dispatched
1517 1641 * to a single taskq at a time. It would be a grievous error
1518 1642 * to dispatch the zio to another taskq at the same time.
1519 1643 */
1520 1644 ASSERT(zio->io_tqent.tqent_next == NULL);
1521 1645 spa_taskq_dispatch_ent(spa, t, q, (task_func_t *)zio_execute, zio,
1522 1646 flags, &zio->io_tqent);
1523 1647 }
1524 1648
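The q++ above works because each high-priority queue sits immediately after its normal counterpart in zio_taskq_type_t (the enum itself is not shown in this excerpt). A small sketch of that selection under that assumption:

	/* Sketch: prefer the high-priority variant of a queue when configured. */
	static zio_taskq_type_t
	pick_taskq(spa_t *spa, zio_type_t t, zio_taskq_type_t q, boolean_t urgent)
	{
		if (urgent && spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
			q++;
		return (q);
	}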
1525 1649 static boolean_t
1526 1650 zio_taskq_member(zio_t *zio, zio_taskq_type_t q)
1527 1651 {
1528 1652 kthread_t *executor = zio->io_executor;
1529 1653 spa_t *spa = zio->io_spa;
1530 1654
1531 1655 for (zio_type_t t = 0; t < ZIO_TYPES; t++) {
1532 1656 spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
1533 1657 uint_t i;
1534 1658 for (i = 0; i < tqs->stqs_count; i++) {
1535 1659 if (taskq_member(tqs->stqs_taskq[i], executor))
1536 1660 return (B_TRUE);
1537 1661 }
1538 1662 }
1539 1663
1540 1664 return (B_FALSE);
1541 1665 }
1542 1666
1543 1667 static int
1544 1668 zio_issue_async(zio_t *zio)
1545 1669 {
1546 1670 zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);
1547 1671
1548 1672 return (ZIO_PIPELINE_STOP);
1549 1673 }
1550 1674
1551 1675 void
1552 1676 zio_interrupt(zio_t *zio)
1553 1677 {
1554 1678 zio_taskq_dispatch(zio, ZIO_TASKQ_INTERRUPT, B_FALSE);
1555 1679 }
1556 1680
1557 1681 void
1558 1682 zio_delay_interrupt(zio_t *zio)
1559 1683 {
1560 1684 /*
1561 1685 * The timeout_generic() function isn't defined in userspace, so
1562 1686 * rather than trying to implement the function, the zio delay
1563 1687 * functionality has been disabled for userspace builds.
1564 1688 */
1565 1689
1566 1690 #ifdef _KERNEL
1567 1691 /*
1568 1692 * If io_target_timestamp is zero, then no delay has been registered
1569 1693 * for this IO, so we "skip" the delay and issue it directly to the
1570 1694 * zio layer.
1571 1695 */
1572 1696 if (zio->io_target_timestamp != 0) {
1573 1697 hrtime_t now = gethrtime();
1574 1698
1575 1699 if (now >= zio->io_target_timestamp) {
1576 1700 /*
1577 1701 * This IO has already taken longer than the target
1578 1702 * delay to complete, so we don't want to delay it
1579 1703 * any longer; we "miss" the delay and issue it
1580 1704 * directly to the zio layer. This is likely due to
1581 1705 * the target latency being set to a value less than
1582 1706 * the underlying hardware can satisfy (e.g. delay
1583 1707 * set to 1ms, but the disks take 10ms to complete an
1584 1708 * IO request).
1585 1709 */
1586 1710
1587 1711 DTRACE_PROBE2(zio__delay__miss, zio_t *, zio,
1588 1712 hrtime_t, now);
1589 1713
1590 1714 zio_interrupt(zio);
1591 1715 } else {
1592 1716 hrtime_t diff = zio->io_target_timestamp - now;
1593 1717
1594 1718 DTRACE_PROBE3(zio__delay__hit, zio_t *, zio,
1595 1719 hrtime_t, now, hrtime_t, diff);
1596 1720
1597 1721 (void) timeout_generic(CALLOUT_NORMAL,
1598 1722 (void (*)(void *))zio_interrupt, zio, diff, 1, 0);
1599 1723 }
1600 1724
1601 1725 return;
1602 1726 }
1603 1727 #endif
1604 1728
1605 1729 DTRACE_PROBE1(zio__delay__skip, zio_t *, zio);
1606 1730 zio_interrupt(zio);
1607 1731 }
1608 1732
1609 1733 /*
1610 1734 * Execute the I/O pipeline until one of the following occurs:
1611 1735 *
1612 1736 * (1) the I/O completes
1613 1737 * (2) the pipeline stalls waiting for dependent child I/Os
1614 1738 * (3) the I/O issues, so we're waiting for an I/O completion interrupt
1615 1739 * (4) the I/O is delegated by vdev-level caching or aggregation
1616 1740 * (5) the I/O is deferred due to vdev-level queueing
1617 1741 * (6) the I/O is handed off to another thread.
1618 1742 *
1619 1743 * In all cases, the pipeline stops whenever there's no CPU work; it never
1620 1744 * burns a thread in cv_wait().
1621 1745 *
1622 1746 * There's no locking on io_stage because there's no legitimate way
1623 1747 * for multiple threads to be attempting to process the same I/O.
1624 1748 */
1625 1749 static zio_pipe_stage_t *zio_pipeline[];
1626 1750
1627 1751 void
1628 1752 zio_execute(zio_t *zio)
1629 1753 {
1630 1754 zio->io_executor = curthread;
1631 1755
1632 1756 ASSERT3U(zio->io_queued_timestamp, >, 0);
1633 1757
1634 1758 while (zio->io_stage < ZIO_STAGE_DONE) {
1635 1759 enum zio_stage pipeline = zio->io_pipeline;
1760 + enum zio_stage old_stage = zio->io_stage;
1636 1761 enum zio_stage stage = zio->io_stage;
1637 1762 int rv;
1638 1763
1639 1764 ASSERT(!MUTEX_HELD(&zio->io_lock));
1640 1765 ASSERT(ISP2(stage));
1641 1766 ASSERT(zio->io_stall == NULL);
1642 1767
1643 1768 do {
1644 1769 stage <<= 1;
1645 1770 } while ((stage & pipeline) == 0);
1646 1771
1647 1772 ASSERT(stage <= ZIO_STAGE_DONE);
1648 1773
1649 1774 /*
1650 1775 * If we are in interrupt context and this pipeline stage
1651 1776 * will grab a config lock that is held across I/O,
1652 1777 * or may wait for an I/O that needs an interrupt thread
1653 1778 * to complete, issue async to avoid deadlock.
1654 1779 *
1655 1780 * For VDEV_IO_START, we cut in line so that the io will
1656 1781 * be sent to disk promptly.
1657 1782 */
1658 1783 if ((stage & ZIO_BLOCKING_STAGES) && zio->io_vd == NULL &&
1659 1784 zio_taskq_member(zio, ZIO_TASKQ_INTERRUPT)) {
1660 1785 boolean_t cut = (stage == ZIO_STAGE_VDEV_IO_START) ?
1661 1786 zio_requeue_io_start_cut_in_line : B_FALSE;
1662 1787 zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, cut);
1663 1788 return;
1664 1789 }
1665 1790
1666 1791 zio->io_stage = stage;
1667 1792 zio->io_pipeline_trace |= zio->io_stage;
1668 1793 rv = zio_pipeline[highbit64(stage) - 1](zio);
1669 1794
1670 1795 if (rv == ZIO_PIPELINE_STOP)
1671 1796 return;
1672 1797
1798 + if (rv == ZIO_PIPELINE_RESTART_STAGE) {
1799 + zio->io_stage = old_stage;
1800 + (void) zio_issue_async(zio);
1801 + return;
1802 + }
1803 +
1673 1804 ASSERT(rv == ZIO_PIPELINE_CONTINUE);
1674 1805 }
1675 1806 }
1676 1807
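The do/while above finds the next stage to run by relying on every pipeline stage being a distinct, ordered power of two: shift the current stage bit left until it lands on a bit that is set in io_pipeline. A standalone sketch of just that walk (ZIO_STAGE_DONE is always part of the pipeline, so the loop terminates):

	/*
	 * Illustrative: advance from the current one-hot stage bit to the
	 * next stage enabled in the pipeline mask.
	 */
	static uint32_t
	next_stage(uint32_t cur_stage, uint32_t pipeline)
	{
		uint32_t stage = cur_stage;

		do {
			stage <<= 1;
		} while ((stage & pipeline) == 0);

		return (stage);
	}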
1677 1808 /*
1678 1809 * ==========================================================================
1679 1810 * Initiate I/O, either sync or async
1680 1811 * ==========================================================================
1681 1812 */
1682 1813 int
1683 1814 zio_wait(zio_t *zio)
1684 1815 {
1685 1816 int error;
1686 1817
1687 1818 ASSERT3P(zio->io_stage, ==, ZIO_STAGE_OPEN);
1688 1819 ASSERT3P(zio->io_executor, ==, NULL);
1689 1820
1690 1821 zio->io_waiter = curthread;
1691 1822 ASSERT0(zio->io_queued_timestamp);
1692 1823 zio->io_queued_timestamp = gethrtime();
1693 1824
1694 1825 zio_execute(zio);
1695 1826
1696 1827 mutex_enter(&zio->io_lock);
1697 1828 while (zio->io_executor != NULL)
1698 1829 cv_wait(&zio->io_cv, &zio->io_lock);
1699 1830 mutex_exit(&zio->io_lock);
1700 1831
1701 1832 error = zio->io_error;
1702 1833 zio_destroy(zio);
1703 1834
1704 1835 return (error);
1705 1836 }
1706 1837
1707 1838 void
1708 1839 zio_nowait(zio_t *zio)
1709 1840 {
1710 1841 ASSERT3P(zio->io_executor, ==, NULL);
1711 1842
1712 1843 if (zio->io_child_type == ZIO_CHILD_LOGICAL &&
1713 1844 zio_unique_parent(zio) == NULL) {
1714 1845 /*
1715 1846 * This is a logical async I/O with no parent to wait for it.
1716 1847 * We add it to the spa_async_root_zio "Godfather" I/O which
1717 1848 * will ensure they complete prior to unloading the pool.
1718 1849 */
1719 1850 spa_t *spa = zio->io_spa;
1720 1851
1721 1852 zio_add_child(spa->spa_async_zio_root[CPU_SEQID], zio);
1722 1853 }
1723 1854
1724 1855 ASSERT0(zio->io_queued_timestamp);
1725 1856 zio->io_queued_timestamp = gethrtime();
1726 1857 zio_execute(zio);
1727 1858 }
1728 1859
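zio_wait() and zio_nowait() are the two ways a constructed zio is actually set in motion: the former runs the pipeline and blocks for the result, the latter fires it off and relies on the done callback (with orphan logical zios reparented under the per-CPU godfather above). A hedged usage sketch, assuming bp, abd and the bookmark zb have been set up by the caller:

	/* Synchronous: build a read zio and block for its error code. */
	int err = zio_wait(zio_read(NULL, spa, bp, abd, BP_GET_PSIZE(bp),
	    NULL, NULL, ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &zb));

	/* Asynchronous: same construction, completion observed via the callback. */
	zio_nowait(zio_read(pio, spa, bp, abd, BP_GET_PSIZE(bp),
	    my_read_done, my_arg, ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &zb));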
1729 1860 /*
1730 1861 * ==========================================================================
1731 1862 * Reexecute, cancel, or suspend/resume failed I/O
1732 1863 * ==========================================================================
1733 1864 */
1734 1865
1735 1866 static void
1736 1867 zio_reexecute(zio_t *pio)
1737 1868 {
1738 1869 zio_t *cio, *cio_next;
1739 1870
1740 1871 ASSERT(pio->io_child_type == ZIO_CHILD_LOGICAL);
1741 1872 ASSERT(pio->io_orig_stage == ZIO_STAGE_OPEN);
1742 1873 ASSERT(pio->io_gang_leader == NULL);
1743 1874 ASSERT(pio->io_gang_tree == NULL);
1744 1875
1745 1876 pio->io_flags = pio->io_orig_flags;
1746 1877 pio->io_stage = pio->io_orig_stage;
1747 1878 pio->io_pipeline = pio->io_orig_pipeline;
1748 1879 pio->io_reexecute = 0;
1749 1880 pio->io_flags |= ZIO_FLAG_REEXECUTED;
1750 1881 pio->io_pipeline_trace = 0;
1751 1882 pio->io_error = 0;
1752 1883 for (int w = 0; w < ZIO_WAIT_TYPES; w++)
1753 1884 pio->io_state[w] = 0;
1754 1885 for (int c = 0; c < ZIO_CHILD_TYPES; c++)
1755 1886 pio->io_child_error[c] = 0;
1756 1887
1757 1888 if (IO_IS_ALLOCATING(pio))
1758 1889 BP_ZERO(pio->io_bp);
1759 1890
1760 1891 /*
1761 1892 * As we reexecute pio's children, new children could be created.
1762 1893 * New children go to the head of pio's io_child_list, however,
1763 1894 * so we will (correctly) not reexecute them. The key is that
1764 1895 * the remainder of pio's io_child_list, from 'cio_next' onward,
1765 1896 * cannot be affected by any side effects of reexecuting 'cio'.
1766 1897 */
1767 1898 zio_link_t *zl = NULL;
1768 1899 for (cio = zio_walk_children(pio, &zl); cio != NULL; cio = cio_next) {
1769 1900 cio_next = zio_walk_children(pio, &zl);
1770 1901 mutex_enter(&pio->io_lock);
1771 1902 for (int w = 0; w < ZIO_WAIT_TYPES; w++)
1772 1903 pio->io_children[cio->io_child_type][w]++;
1773 1904 mutex_exit(&pio->io_lock);
1774 1905 zio_reexecute(cio);
1775 1906 }
1776 1907
1777 1908 /*
1778 1909 * Now that all children have been reexecuted, execute the parent.
1779 1910 * We don't reexecute "The Godfather" I/O here as it's the
1780 1911 * responsibility of the caller to wait on it.
1781 1912 */
1782 1913 if (!(pio->io_flags & ZIO_FLAG_GODFATHER)) {
1783 1914 pio->io_queued_timestamp = gethrtime();
1784 1915 zio_execute(pio);
1785 1916 }
1786 1917 }
1787 1918
1788 1919 void
1789 1920 zio_suspend(spa_t *spa, zio_t *zio)
1790 1921 {
1791 1922 if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_PANIC)
1792 1923 fm_panic("Pool '%s' has encountered an uncorrectable I/O "
1793 1924 "failure and the failure mode property for this pool "
1794 1925 "is set to panic.", spa_name(spa));
1795 1926
1796 1927 zfs_ereport_post(FM_EREPORT_ZFS_IO_FAILURE, spa, NULL, NULL, 0, 0);
1797 1928
1798 1929 mutex_enter(&spa->spa_suspend_lock);
1799 1930
1800 1931 if (spa->spa_suspend_zio_root == NULL)
1801 1932 spa->spa_suspend_zio_root = zio_root(spa, NULL, NULL,
1802 1933 ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
1803 1934 ZIO_FLAG_GODFATHER);
1804 1935
1805 1936 spa->spa_suspended = B_TRUE;
1806 1937
1807 1938 if (zio != NULL) {
1808 1939 ASSERT(!(zio->io_flags & ZIO_FLAG_GODFATHER));
1809 1940 ASSERT(zio != spa->spa_suspend_zio_root);
1810 1941 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1811 1942 ASSERT(zio_unique_parent(zio) == NULL);
1812 1943 ASSERT(zio->io_stage == ZIO_STAGE_DONE);
1813 1944 zio_add_child(spa->spa_suspend_zio_root, zio);
1814 1945 }
1815 1946
1816 1947 mutex_exit(&spa->spa_suspend_lock);
1817 1948 }
1818 1949
1819 1950 int
1820 1951 zio_resume(spa_t *spa)
1821 1952 {
1822 1953 zio_t *pio;
1823 1954
1824 1955 /*
1825 1956 * Reexecute all previously suspended i/o.
1826 1957 */
1827 1958 mutex_enter(&spa->spa_suspend_lock);
1828 1959 spa->spa_suspended = B_FALSE;
1829 1960 cv_broadcast(&spa->spa_suspend_cv);
1830 1961 pio = spa->spa_suspend_zio_root;
1831 1962 spa->spa_suspend_zio_root = NULL;
1832 1963 mutex_exit(&spa->spa_suspend_lock);
1833 1964
1834 1965 if (pio == NULL)
1835 1966 return (0);
1836 1967
1837 1968 zio_reexecute(pio);
1838 1969 return (zio_wait(pio));
1839 1970 }
1840 1971
1841 1972 void
1842 1973 zio_resume_wait(spa_t *spa)
1843 1974 {
1844 1975 mutex_enter(&spa->spa_suspend_lock);
1845 1976 while (spa_suspended(spa))
1846 1977 cv_wait(&spa->spa_suspend_cv, &spa->spa_suspend_lock);
1847 1978 mutex_exit(&spa->spa_suspend_lock);
1848 1979 }
1849 1980
1850 1981 /*
1851 1982 * ==========================================================================
1852 1983 * Gang blocks.
1853 1984 *
1854 1985 * A gang block is a collection of small blocks that looks to the DMU
1855 1986 * like one large block. When zio_dva_allocate() cannot find a block
1856 1987 * of the requested size, due to either severe fragmentation or the pool
1857 1988 * being nearly full, it calls zio_write_gang_block() to construct the
1858 1989 * block from smaller fragments.
1859 1990 *
1860 1991 * A gang block consists of a gang header (zio_gbh_phys_t) and up to
1861 1992 * three (SPA_GBH_NBLKPTRS) gang members. The gang header is just like
1862 1993 * an indirect block: it's an array of block pointers. It consumes
1863 1994 * only one sector and hence is allocatable regardless of fragmentation.
1864 1995 * The gang header's bps point to its gang members, which hold the data.
1865 1996 *
1866 1997 * Gang blocks are self-checksumming, using the bp's <vdev, offset, txg>
1867 1998 * as the verifier to ensure uniqueness of the SHA256 checksum.
1868 1999 * Critically, the gang block bp's blk_cksum is the checksum of the data,
1869 2000 * not the gang header. This ensures that data block signatures (needed for
1870 2001 * deduplication) are independent of how the block is physically stored.
1871 2002 *
1872 2003 * Gang blocks can be nested: a gang member may itself be a gang block.
1873 2004 * Thus every gang block is a tree in which root and all interior nodes are
1874 2005 * gang headers, and the leaves are normal blocks that contain user data.
1875 2006 * The root of the gang tree is called the gang leader.
1876 2007 *
1877 2008 * To perform any operation (read, rewrite, free, claim) on a gang block,
1878 2009 * zio_gang_assemble() first assembles the gang tree (minus data leaves)
1879 2010 * in the io_gang_tree field of the original logical i/o by recursively
1880 2011 * reading the gang leader and all gang headers below it. This yields
1881 2012 * an in-core tree containing the contents of every gang header and the
1882 2013 * bps for every constituent of the gang block.
1883 2014 *
1884 2015 * With the gang tree now assembled, zio_gang_issue() just walks the gang tree
1885 2016 * and invokes a callback on each bp. To free a gang block, zio_gang_issue()
1886 2017 * calls zio_free_gang() -- a trivial wrapper around zio_free() -- for each bp.
1887 2018 * zio_claim_gang() provides a similarly trivial wrapper for zio_claim().
1888 2019 * zio_read_gang() is a wrapper around zio_read() that omits reading gang
1889 2020 * headers, since we already have those in io_gang_tree. zio_rewrite_gang()
1890 2021 * performs a zio_rewrite() of the data or, for gang headers, a zio_rewrite()
1891 2022 * of the gang header plus zio_checksum_compute() of the data to update the
1892 2023 * gang header's blk_cksum as described above.
1893 2024 *
1894 2025 * The two-phase assemble/issue model solves the problem of partial failure --
1895 2026 * what if you'd freed part of a gang block but then couldn't read the
1896 2027 * gang header for another part? Assembling the entire gang tree first
1897 2028 * ensures that all the necessary gang header I/O has succeeded before
1898 2029 * starting the actual work of free, claim, or write. Once the gang tree
1899 2030 * is assembled, free and claim are in-memory operations that cannot fail.
1900 2031 *
1901 2032 * In the event that a gang write fails, zio_dva_unallocate() walks the
1902 2033 * gang tree to immediately free (i.e. insert back into the space map)
1903 2034 * everything we've allocated. This ensures that we don't get ENOSPC
1904 2035 * errors during repeated suspend/resume cycles due to a flaky device.
1905 2036 *
1906 2037 * Gang rewrites only happen during sync-to-convergence. If we can't assemble
1907 2038 * the gang tree, we won't modify the block, so we can safely defer the free
1908 2039 * (knowing that the block is still intact). If we *can* assemble the gang
1909 2040 * tree, then even if some of the rewrites fail, zio_dva_unallocate() will free
1910 2041 * each constituent bp and we can allocate a new block on the next sync pass.
1911 2042 *
1912 2043 * In all cases, the gang tree allows complete recovery from partial failure.
1913 2044 * ==========================================================================
1914 2045 */
1915 2046
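For reference while reading the code below: the gang header described above is a single sector holding an array of SPA_GBH_NBLKPTRS block pointers, padding, and an embedded checksum at the tail. A sketch of that layout (the real type is zio_gbh_phys_t; the filler field here is an assumption based on the standard layout):

	/* Sketch of the one-sector gang header layout described above. */
	typedef struct zio_gbh_sketch {
		blkptr_t	zg_blkptr[SPA_GBH_NBLKPTRS];	/* gang members */
		uint64_t	zg_filler[SPA_GBH_FILLER];	/* pad to a sector */
		zio_eck_t	zg_tail;			/* embedded checksum */
	} zio_gbh_sketch_t;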
1916 2047 static void
1917 2048 zio_gang_issue_func_done(zio_t *zio)
1918 2049 {
1919 2050 abd_put(zio->io_abd);
1920 2051 }
1921 2052
1922 2053 static zio_t *
1923 2054 zio_read_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1924 2055 uint64_t offset)
1925 2056 {
1926 2057 if (gn != NULL)
1927 2058 return (pio);
1928 2059
1929 2060 return (zio_read(pio, pio->io_spa, bp, abd_get_offset(data, offset),
1930 2061 BP_GET_PSIZE(bp), zio_gang_issue_func_done,
1931 2062 NULL, pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
1932 2063 &pio->io_bookmark));
1933 2064 }
1934 2065
1935 2066 static zio_t *
1936 2067 zio_rewrite_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1937 2068 uint64_t offset)
1938 2069 {
1939 2070 zio_t *zio;
1940 2071
1941 2072 if (gn != NULL) {
1942 2073 abd_t *gbh_abd =
1943 2074 abd_get_from_buf(gn->gn_gbh, SPA_GANGBLOCKSIZE);
1944 2075 zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
1945 2076 gbh_abd, SPA_GANGBLOCKSIZE, zio_gang_issue_func_done, NULL,
1946 2077 pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
1947 2078 &pio->io_bookmark);
1948 2079 /*
1949 2080 * As we rewrite each gang header, the pipeline will compute
1950 2081 * a new gang block header checksum for it; but no one will
1951 2082 * compute a new data checksum, so we do that here. The one
1952 2083 * exception is the gang leader: the pipeline already computed
1953 2084 * its data checksum because that stage precedes gang assembly.
1954 2085 * (Presently, nothing actually uses interior data checksums;
1955 2086 * this is just good hygiene.)
1956 2087 */
1957 2088 if (gn != pio->io_gang_leader->io_gang_tree) {
1958 2089 abd_t *buf = abd_get_offset(data, offset);
1959 2090
1960 2091 zio_checksum_compute(zio, BP_GET_CHECKSUM(bp),
1961 2092 buf, BP_GET_PSIZE(bp));
1962 2093
1963 2094 abd_put(buf);
1964 2095 }
1965 2096 /*
1966 2097 * If we are here to damage data for testing purposes,
1967 2098 * leave the GBH alone so that we can detect the damage.
1968 2099 */
1969 2100 if (pio->io_gang_leader->io_flags & ZIO_FLAG_INDUCE_DAMAGE)
1970 2101 zio->io_pipeline &= ~ZIO_VDEV_IO_STAGES;
1971 2102 } else {
1972 2103 zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
1973 2104 abd_get_offset(data, offset), BP_GET_PSIZE(bp),
1974 2105 zio_gang_issue_func_done, NULL, pio->io_priority,
1975 2106 ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
1976 2107 }
1977 2108
1978 2109 return (zio);
1979 2110 }
1980 2111
1981 2112 /* ARGSUSED */
1982 2113 static zio_t *
1983 2114 zio_free_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1984 2115 uint64_t offset)
1985 2116 {
1986 2117 return (zio_free_sync(pio, pio->io_spa, pio->io_txg, bp,
1987 2118 ZIO_GANG_CHILD_FLAGS(pio)));
1988 2119 }
1989 2120
1990 2121 /* ARGSUSED */
1991 2122 static zio_t *
1992 2123 zio_claim_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1993 2124 uint64_t offset)
1994 2125 {
1995 2126 return (zio_claim(pio, pio->io_spa, pio->io_txg, bp,
1996 2127 NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
1997 2128 }
1998 2129
1999 2130 static zio_gang_issue_func_t *zio_gang_issue_func[ZIO_TYPES] = {
2000 2131 NULL,
2001 2132 zio_read_gang,
2002 2133 zio_rewrite_gang,
2003 2134 zio_free_gang,
2004 2135 zio_claim_gang,
2005 2136 NULL
2006 2137 };
2007 2138
2008 2139 static void zio_gang_tree_assemble_done(zio_t *zio);
2009 2140
2010 2141 static zio_gang_node_t *
2011 2142 zio_gang_node_alloc(zio_gang_node_t **gnpp)
2012 2143 {
2013 2144 zio_gang_node_t *gn;
2014 2145
2015 2146 ASSERT(*gnpp == NULL);
2016 2147
2017 2148 gn = kmem_zalloc(sizeof (*gn), KM_SLEEP);
2018 2149 gn->gn_gbh = zio_buf_alloc(SPA_GANGBLOCKSIZE);
2019 2150 *gnpp = gn;
2020 2151
2021 2152 return (gn);
2022 2153 }
2023 2154
2024 2155 static void
2025 2156 zio_gang_node_free(zio_gang_node_t **gnpp)
2026 2157 {
2027 2158 zio_gang_node_t *gn = *gnpp;
2028 2159
2029 2160 for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
2030 2161 ASSERT(gn->gn_child[g] == NULL);
2031 2162
2032 2163 zio_buf_free(gn->gn_gbh, SPA_GANGBLOCKSIZE);
2033 2164 kmem_free(gn, sizeof (*gn));
2034 2165 *gnpp = NULL;
2035 2166 }
2036 2167
2037 2168 static void
2038 2169 zio_gang_tree_free(zio_gang_node_t **gnpp)
2039 2170 {
2040 2171 zio_gang_node_t *gn = *gnpp;
2041 2172
2042 2173 if (gn == NULL)
2043 2174 return;
2044 2175
2045 2176 for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
2046 2177 zio_gang_tree_free(&gn->gn_child[g]);
2047 2178
2048 2179 zio_gang_node_free(gnpp);
2049 2180 }
2050 2181
2051 2182 static void
2052 2183 zio_gang_tree_assemble(zio_t *gio, blkptr_t *bp, zio_gang_node_t **gnpp)
2053 2184 {
2054 2185 zio_gang_node_t *gn = zio_gang_node_alloc(gnpp);
2055 2186 abd_t *gbh_abd = abd_get_from_buf(gn->gn_gbh, SPA_GANGBLOCKSIZE);
2056 2187
2057 2188 ASSERT(gio->io_gang_leader == gio);
2058 2189 ASSERT(BP_IS_GANG(bp));
2059 2190
2060 2191 zio_nowait(zio_read(gio, gio->io_spa, bp, gbh_abd, SPA_GANGBLOCKSIZE,
2061 2192 zio_gang_tree_assemble_done, gn, gio->io_priority,
2062 2193 ZIO_GANG_CHILD_FLAGS(gio), &gio->io_bookmark));
2063 2194 }
2064 2195
2065 2196 static void
2066 2197 zio_gang_tree_assemble_done(zio_t *zio)
2067 2198 {
2068 2199 zio_t *gio = zio->io_gang_leader;
2069 2200 zio_gang_node_t *gn = zio->io_private;
2070 2201 blkptr_t *bp = zio->io_bp;
2071 2202
2072 2203 ASSERT(gio == zio_unique_parent(zio));
2073 2204 ASSERT(zio->io_child_count == 0);
2074 2205
2075 2206 if (zio->io_error)
2076 2207 return;
2077 2208
2078 2209 /* this ABD was created from a linear buf in zio_gang_tree_assemble */
2079 2210 if (BP_SHOULD_BYTESWAP(bp))
2080 2211 byteswap_uint64_array(abd_to_buf(zio->io_abd), zio->io_size);
2081 2212
2082 2213 ASSERT3P(abd_to_buf(zio->io_abd), ==, gn->gn_gbh);
2083 2214 ASSERT(zio->io_size == SPA_GANGBLOCKSIZE);
2084 2215 ASSERT(gn->gn_gbh->zg_tail.zec_magic == ZEC_MAGIC);
2085 2216
2086 2217 abd_put(zio->io_abd);
2087 2218
2088 2219 for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2089 2220 blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
2090 2221 if (!BP_IS_GANG(gbp))
2091 2222 continue;
2092 2223 zio_gang_tree_assemble(gio, gbp, &gn->gn_child[g]);
2093 2224 }
2094 2225 }
2095 2226
2096 2227 static void
2097 2228 zio_gang_tree_issue(zio_t *pio, zio_gang_node_t *gn, blkptr_t *bp, abd_t *data,
2098 2229 uint64_t offset)
2099 2230 {
2100 2231 zio_t *gio = pio->io_gang_leader;
2101 2232 zio_t *zio;
2102 2233
2103 2234 ASSERT(BP_IS_GANG(bp) == !!gn);
2104 2235 ASSERT(BP_GET_CHECKSUM(bp) == BP_GET_CHECKSUM(gio->io_bp));
2105 2236 ASSERT(BP_GET_LSIZE(bp) == BP_GET_PSIZE(bp) || gn == gio->io_gang_tree);
2106 2237
2107 2238 /*
2108 2239 * If you're a gang header, your data is in gn->gn_gbh.
2109 2240 * If you're a gang member, your data is in 'data' and gn == NULL.
2110 2241 */
2111 2242 zio = zio_gang_issue_func[gio->io_type](pio, bp, gn, data, offset);
2112 2243
2113 2244 if (gn != NULL) {
2114 2245 ASSERT(gn->gn_gbh->zg_tail.zec_magic == ZEC_MAGIC);
2115 2246
2116 2247 for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2117 2248 blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
2118 2249 if (BP_IS_HOLE(gbp))
2119 2250 continue;
2120 2251 zio_gang_tree_issue(zio, gn->gn_child[g], gbp, data,
2121 2252 offset);
2122 2253 offset += BP_GET_PSIZE(gbp);
2123 2254 }
2124 2255 }
2125 2256
2126 2257 if (gn == gio->io_gang_tree)
2127 2258 ASSERT3U(gio->io_size, ==, offset);
2128 2259
2129 2260 if (zio != pio)
2130 2261 zio_nowait(zio);
2131 2262 }
2132 2263
2133 2264 static int
2134 2265 zio_gang_assemble(zio_t *zio)
2135 2266 {
2136 2267 blkptr_t *bp = zio->io_bp;
2137 2268
2138 2269 ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == NULL);
2139 2270 ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2140 2271
2141 2272 zio->io_gang_leader = zio;
2142 2273
2143 2274 zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);
2144 2275
2145 2276 return (ZIO_PIPELINE_CONTINUE);
2146 2277 }
2147 2278
2148 2279 static int
2149 2280 zio_gang_issue(zio_t *zio)
2150 2281 {
2151 2282 blkptr_t *bp = zio->io_bp;
2152 2283
2153 - if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT, ZIO_WAIT_DONE)) {
2284 + if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
2154 2285 return (ZIO_PIPELINE_STOP);
2155 - }
2156 2286
2157 2287 ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
2158 2288 ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2159 2289
2160 2290 if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
2161 2291 zio_gang_tree_issue(zio, zio->io_gang_tree, bp, zio->io_abd,
2162 2292 0);
2163 2293 else
2164 2294 zio_gang_tree_free(&zio->io_gang_tree);
2165 2295
2166 2296 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2167 2297
2168 2298 return (ZIO_PIPELINE_CONTINUE);
2169 2299 }
2170 2300
2171 2301 static void
2172 2302 zio_write_gang_member_ready(zio_t *zio)
2173 2303 {
2174 2304 zio_t *pio = zio_unique_parent(zio);
2175 2305 zio_t *gio = zio->io_gang_leader;
2176 2306 dva_t *cdva = zio->io_bp->blk_dva;
2177 2307 dva_t *pdva = pio->io_bp->blk_dva;
2178 2308 uint64_t asize;
2179 2309
2180 2310 if (BP_IS_HOLE(zio->io_bp))
2181 2311 return;
2182 2312
2183 2313 ASSERT(BP_IS_HOLE(&zio->io_bp_orig));
2184 2314
2185 2315 ASSERT(zio->io_child_type == ZIO_CHILD_GANG);
2186 2316 ASSERT3U(zio->io_prop.zp_copies, ==, gio->io_prop.zp_copies);
2187 2317 ASSERT3U(zio->io_prop.zp_copies, <=, BP_GET_NDVAS(zio->io_bp));
2188 2318 ASSERT3U(pio->io_prop.zp_copies, <=, BP_GET_NDVAS(pio->io_bp));
2189 2319 ASSERT3U(BP_GET_NDVAS(zio->io_bp), <=, BP_GET_NDVAS(pio->io_bp));
2190 2320
2191 2321 mutex_enter(&pio->io_lock);
2192 2322 for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
2193 2323 ASSERT(DVA_GET_GANG(&pdva[d]));
2194 2324 asize = DVA_GET_ASIZE(&pdva[d]);
2195 2325 asize += DVA_GET_ASIZE(&cdva[d]);
2196 2326 DVA_SET_ASIZE(&pdva[d], asize);
2197 2327 }
2198 2328 mutex_exit(&pio->io_lock);
2199 2329 }
2200 2330
2201 2331 static void
2202 2332 zio_write_gang_done(zio_t *zio)
2203 2333 {
2204 2334 abd_put(zio->io_abd);
2205 2335 }
2206 2336
2207 2337 static int
2208 2338 zio_write_gang_block(zio_t *pio)
2209 2339 {
2210 2340 spa_t *spa = pio->io_spa;
2211 - metaslab_class_t *mc = spa_normal_class(spa);
2341 + metaslab_class_t *mc = pio->io_mc;
2212 2342 blkptr_t *bp = pio->io_bp;
2213 2343 zio_t *gio = pio->io_gang_leader;
2214 2344 zio_t *zio;
2215 2345 zio_gang_node_t *gn, **gnpp;
2216 2346 zio_gbh_phys_t *gbh;
2217 2347 abd_t *gbh_abd;
2218 2348 uint64_t txg = pio->io_txg;
2219 2349 uint64_t resid = pio->io_size;
2220 2350 uint64_t lsize;
2221 2351 int copies = gio->io_prop.zp_copies;
2222 2352 int gbh_copies = MIN(copies + 1, spa_max_replication(spa));
2223 2353 zio_prop_t zp;
2224 2354 int error;
2225 2355
2226 2356 int flags = METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER;
2227 2357 if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2228 2358 ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2229 2359 ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2230 2360
2231 2361 flags |= METASLAB_ASYNC_ALLOC;
2232 2362 VERIFY(refcount_held(&mc->mc_alloc_slots, pio));
2233 2363
2234 2364 /*
2235 2365 * The logical zio has already placed a reservation for
2236 2366 * 'copies' allocation slots but gang blocks may require
2237 2367 * additional copies. These additional copies
2238 2368 * (i.e. gbh_copies - copies) are guaranteed to succeed
2239 2369 * since metaslab_class_throttle_reserve() always allows
2240 2370 * additional reservations for gang blocks.
2241 2371 */
2242 2372 VERIFY(metaslab_class_throttle_reserve(mc, gbh_copies - copies,
2243 2373 pio, flags));
2244 2374 }
2245 2375
2246 2376 error = metaslab_alloc(spa, mc, SPA_GANGBLOCKSIZE,
2247 2377 bp, gbh_copies, txg, pio == gio ? NULL : gio->io_bp, flags,
2248 2378 &pio->io_alloc_list, pio);
2249 2379 if (error) {
2250 2380 if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2251 2381 ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2252 2382 ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2253 2383
2254 2384 /*
2255 2385 * If we failed to allocate the gang block header then
2256 2386 * we remove any additional allocation reservations that
2257 2387 * we placed here. The original reservation will
2258 2388 * be removed when the logical I/O goes to the ready
2259 2389 * stage.
2260 2390 */
2261 2391 metaslab_class_throttle_unreserve(mc,
2262 2392 gbh_copies - copies, pio);
2263 2393 }
2264 2394 pio->io_error = error;
2265 2395 return (ZIO_PIPELINE_CONTINUE);
2266 2396 }
2267 2397
2268 2398 if (pio == gio) {
2269 2399 gnpp = &gio->io_gang_tree;
2270 2400 } else {
2271 2401 gnpp = pio->io_private;
2272 2402 ASSERT(pio->io_ready == zio_write_gang_member_ready);
2273 2403 }
2274 2404
2275 2405 gn = zio_gang_node_alloc(gnpp);
2276 2406 gbh = gn->gn_gbh;
2277 2407 bzero(gbh, SPA_GANGBLOCKSIZE);
2278 2408 gbh_abd = abd_get_from_buf(gbh, SPA_GANGBLOCKSIZE);
2279 2409
2280 2410 /*
2281 2411 * Create the gang header.
2282 2412 */
2283 2413 zio = zio_rewrite(pio, spa, txg, bp, gbh_abd, SPA_GANGBLOCKSIZE,
2284 2414 zio_write_gang_done, NULL, pio->io_priority,
2285 2415 ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
2286 2416
2287 2417 /*
2288 2418 * Create and nowait the gang children.
2289 2419 */
2290 2420 for (int g = 0; resid != 0; resid -= lsize, g++) {
2291 2421 lsize = P2ROUNDUP(resid / (SPA_GBH_NBLKPTRS - g),
2292 2422 SPA_MINBLOCKSIZE);
2293 2423 ASSERT(lsize >= SPA_MINBLOCKSIZE && lsize <= resid);
2294 2424
2295 2425 zp.zp_checksum = gio->io_prop.zp_checksum;
2296 2426 zp.zp_compress = ZIO_COMPRESS_OFF;
2297 2427 zp.zp_type = DMU_OT_NONE;
2298 2428 zp.zp_level = 0;
2299 2429 zp.zp_copies = gio->io_prop.zp_copies;
2300 2430 zp.zp_dedup = B_FALSE;
2301 2431 zp.zp_dedup_verify = B_FALSE;
2302 2432 zp.zp_nopwrite = B_FALSE;
2303 2433
2304 2434 zio_t *cio = zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
2305 2435 abd_get_offset(pio->io_abd, pio->io_size - resid), lsize,
2306 2436 lsize, &zp, zio_write_gang_member_ready, NULL, NULL,
2307 2437 zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
2308 - ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
2438 + ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark,
2439 + &pio->io_smartcomp);
2309 2440
2441 + cio->io_mc = mc;
2442 +
2310 2443 if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2311 2444 ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2312 2445 ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2313 2446
2314 2447 /*
2315 2448 * Gang children won't throttle but we should
2316 2449 * account for their work, so reserve an allocation
2317 2450 * slot for them here.
2318 2451 */
2319 2452 VERIFY(metaslab_class_throttle_reserve(mc,
2320 2453 zp.zp_copies, cio, flags));
2321 2454 }
2322 2455 zio_nowait(cio);
2323 2456 }
2324 2457
2325 2458 /*
2326 2459 * Set pio's pipeline to just wait for zio to finish.
2327 2460 */
2328 2461 pio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2329 2462
2330 2463 zio_nowait(zio);
2331 2464
2332 2465 return (ZIO_PIPELINE_CONTINUE);
2333 2466 }
2334 2467
2335 2468 /*
2336 2469 * The zio_nop_write stage in the pipeline determines if allocating a
2337 2470 * new bp is necessary. The nopwrite feature can handle writes in
2338 2471 * either syncing or open context (i.e. zil writes) and as a result is
2339 2472 * mutually exclusive with dedup.
2340 2473 *
2341 2474 * By leveraging a cryptographically secure checksum, such as SHA256, we
2342 2475 * can compare the checksums of the new data and the old to determine if
2343 2476 * allocating a new block is required. Note that our requirements for
2344 2477 * cryptographic strength are fairly weak: there can't be any accidental
2345 2478 * hash collisions, but we don't need to be secure against intentional
2346 2479 * (malicious) collisions. To trigger a nopwrite, you have to be able
2347 2480 * to write the file to begin with, and triggering an incorrect (hash
2348 2481 * collision) nopwrite is no worse than simply writing to the file.
2349 2482 * That said, there are no known attacks against the checksum algorithms
2350 2483 * used for nopwrite, assuming that the salt and the checksums
2351 2484 * themselves remain secret.
2352 2485 */
2353 2486 static int
2354 2487 zio_nop_write(zio_t *zio)
2355 2488 {
2356 2489 blkptr_t *bp = zio->io_bp;
2357 2490 blkptr_t *bp_orig = &zio->io_bp_orig;
2358 2491 zio_prop_t *zp = &zio->io_prop;
2359 2492
2360 2493 ASSERT(BP_GET_LEVEL(bp) == 0);
2361 2494 ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
2362 2495 ASSERT(zp->zp_nopwrite);
2363 2496 ASSERT(!zp->zp_dedup);
2364 2497 ASSERT(zio->io_bp_override == NULL);
2365 2498 ASSERT(IO_IS_ALLOCATING(zio));
2366 2499
2367 2500 /*
2368 2501 * Check to see if the original bp and the new bp have matching
2369 2502 * characteristics (i.e. same checksum, compression algorithms, etc).
2370 2503 * If they don't then just continue with the pipeline which will
2371 2504 * allocate a new bp.
2372 2505 */
2373 2506 if (BP_IS_HOLE(bp_orig) ||
2374 2507 !(zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_flags &
2375 2508 ZCHECKSUM_FLAG_NOPWRITE) ||
2376 2509 BP_GET_CHECKSUM(bp) != BP_GET_CHECKSUM(bp_orig) ||
2377 2510 BP_GET_COMPRESS(bp) != BP_GET_COMPRESS(bp_orig) ||
2378 2511 BP_GET_DEDUP(bp) != BP_GET_DEDUP(bp_orig) ||
2379 2512 zp->zp_copies != BP_GET_NDVAS(bp_orig))
2380 2513 return (ZIO_PIPELINE_CONTINUE);
2381 2514
2382 2515 /*
2383 2516 * If the checksums match then reset the pipeline so that we
2384 2517 * avoid allocating a new bp and issuing any I/O.
2385 2518 */
2386 2519 if (ZIO_CHECKSUM_EQUAL(bp->blk_cksum, bp_orig->blk_cksum)) {
2387 2520 ASSERT(zio_checksum_table[zp->zp_checksum].ci_flags &
2388 2521 ZCHECKSUM_FLAG_NOPWRITE);
2389 2522 ASSERT3U(BP_GET_PSIZE(bp), ==, BP_GET_PSIZE(bp_orig));
2390 2523 ASSERT3U(BP_GET_LSIZE(bp), ==, BP_GET_LSIZE(bp_orig));
2391 2524 ASSERT(zp->zp_compress != ZIO_COMPRESS_OFF);
2392 2525 ASSERT(bcmp(&bp->blk_prop, &bp_orig->blk_prop,
2393 2526 sizeof (uint64_t)) == 0);
2394 2527
2395 2528 *bp = *bp_orig;
2396 2529 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2397 2530 zio->io_flags |= ZIO_FLAG_NOPWRITE;
2398 2531 }
2399 2532
2400 2533 return (ZIO_PIPELINE_CONTINUE);
2401 2534 }
2402 2535
2403 2536 /*
2404 2537 * ==========================================================================
2405 2538 * Dedup
2406 2539 * ==========================================================================
2407 2540 */
2408 2541 static void
2409 2542 zio_ddt_child_read_done(zio_t *zio)
2410 2543 {
2411 2544 blkptr_t *bp = zio->io_bp;
2412 2545 ddt_entry_t *dde = zio->io_private;
2413 2546 ddt_phys_t *ddp;
2414 2547 zio_t *pio = zio_unique_parent(zio);
2415 2548
2416 2549 mutex_enter(&pio->io_lock);
2417 2550 ddp = ddt_phys_select(dde, bp);
2418 2551 if (zio->io_error == 0)
2419 2552 ddt_phys_clear(ddp); /* this ddp doesn't need repair */
2420 2553
2421 2554 if (zio->io_error == 0 && dde->dde_repair_abd == NULL)
2422 2555 dde->dde_repair_abd = zio->io_abd;
2423 2556 else
2424 2557 abd_free(zio->io_abd);
2425 2558 mutex_exit(&pio->io_lock);
2426 2559 }
2427 2560
2428 2561 static int
2429 2562 zio_ddt_read_start(zio_t *zio)
2430 2563 {
2431 2564 blkptr_t *bp = zio->io_bp;
2432 2565
2433 2566 ASSERT(BP_GET_DEDUP(bp));
2434 2567 ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
2435 2568 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2436 2569
2437 2570 if (zio->io_child_error[ZIO_CHILD_DDT]) {
2438 2571 ddt_t *ddt = ddt_select(zio->io_spa, bp);
2439 2572 ddt_entry_t *dde = ddt_repair_start(ddt, bp);
2440 2573 ddt_phys_t *ddp = dde->dde_phys;
2441 2574 ddt_phys_t *ddp_self = ddt_phys_select(dde, bp);
2442 2575 blkptr_t blk;
2443 2576
2444 2577 ASSERT(zio->io_vsd == NULL);
2445 2578 zio->io_vsd = dde;
2446 2579
2447 2580 if (ddp_self == NULL)
2448 2581 return (ZIO_PIPELINE_CONTINUE);
2449 2582
2450 2583 for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
2451 2584 if (ddp->ddp_phys_birth == 0 || ddp == ddp_self)
2452 2585 continue;
2453 2586 ddt_bp_create(ddt->ddt_checksum, &dde->dde_key, ddp,
2454 2587 &blk);
2455 2588 zio_nowait(zio_read(zio, zio->io_spa, &blk,
2456 2589 abd_alloc_for_io(zio->io_size, B_TRUE),
2457 2590 zio->io_size, zio_ddt_child_read_done, dde,
2458 2591 zio->io_priority, ZIO_DDT_CHILD_FLAGS(zio) |
2459 2592 ZIO_FLAG_DONT_PROPAGATE, &zio->io_bookmark));
2460 2593 }
2461 2594 return (ZIO_PIPELINE_CONTINUE);
2462 2595 }
2463 2596
2464 2597 zio_nowait(zio_read(zio, zio->io_spa, bp,
2465 2598 zio->io_abd, zio->io_size, NULL, NULL, zio->io_priority,
2466 2599 ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark));
2467 2600
2468 2601 return (ZIO_PIPELINE_CONTINUE);
2469 2602 }
2470 2603
2471 2604 static int
2472 2605 zio_ddt_read_done(zio_t *zio)
2473 2606 {
2474 2607 blkptr_t *bp = zio->io_bp;
2475 2608
2476 - if (zio_wait_for_children(zio, ZIO_CHILD_DDT_BIT, ZIO_WAIT_DONE)) {
2609 + if (zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE))
2477 2610 return (ZIO_PIPELINE_STOP);
2478 - }
2479 2611
2480 2612 ASSERT(BP_GET_DEDUP(bp));
2481 2613 ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
2482 2614 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2483 2615
2484 2616 if (zio->io_child_error[ZIO_CHILD_DDT]) {
2485 2617 ddt_t *ddt = ddt_select(zio->io_spa, bp);
2486 2618 ddt_entry_t *dde = zio->io_vsd;
2487 2619 if (ddt == NULL) {
2488 2620 ASSERT(spa_load_state(zio->io_spa) != SPA_LOAD_NONE);
2489 2621 return (ZIO_PIPELINE_CONTINUE);
2490 2622 }
2491 2623 if (dde == NULL) {
2492 2624 zio->io_stage = ZIO_STAGE_DDT_READ_START >> 1;
2493 2625 zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);
2494 2626 return (ZIO_PIPELINE_STOP);
2495 2627 }
2496 2628 if (dde->dde_repair_abd != NULL) {
2497 2629 abd_copy(zio->io_abd, dde->dde_repair_abd,
2498 2630 zio->io_size);
2499 2631 zio->io_child_error[ZIO_CHILD_DDT] = 0;
2500 2632 }
2501 2633 ddt_repair_done(ddt, dde);
2502 2634 zio->io_vsd = NULL;
2503 2635 }
2504 2636
2505 2637 ASSERT(zio->io_vsd == NULL);
2506 2638
2507 2639 return (ZIO_PIPELINE_CONTINUE);
2508 2640 }
2509 2641
2642 +/* ARGSUSED */
2510 2643 static boolean_t
2511 2644 zio_ddt_collision(zio_t *zio, ddt_t *ddt, ddt_entry_t *dde)
2512 2645 {
2513 2646 spa_t *spa = zio->io_spa;
2514 2647 boolean_t do_raw = (zio->io_flags & ZIO_FLAG_RAW);
2515 2648
2516 2649 /* We should never get a raw, override zio */
2517 2650 ASSERT(!(zio->io_bp_override && do_raw));
2518 2651
2519 2652 /*
2520 2653 * Note: we compare the original data, not the transformed data,
2521 2654 * because when zio->io_bp is an override bp, we will not have
2522 2655 * pushed the I/O transforms. That's an important optimization
2523 2656 * because otherwise we'd compress/encrypt all dmu_sync() data twice.
2524 2657 */
2525 2658 for (int p = DDT_PHYS_SINGLE; p <= DDT_PHYS_TRIPLE; p++) {
2526 2659 zio_t *lio = dde->dde_lead_zio[p];
2527 2660
2528 2661 if (lio != NULL) {
2529 2662 return (lio->io_orig_size != zio->io_orig_size ||
2530 2663 abd_cmp(zio->io_orig_abd, lio->io_orig_abd,
2531 2664 zio->io_orig_size) != 0);
2532 2665 }
2533 2666 }
2534 2667
2535 2668 for (int p = DDT_PHYS_SINGLE; p <= DDT_PHYS_TRIPLE; p++) {
2536 2669 ddt_phys_t *ddp = &dde->dde_phys[p];
2537 2670
2538 2671 if (ddp->ddp_phys_birth != 0) {
2539 2672 arc_buf_t *abuf = NULL;
2540 2673 arc_flags_t aflags = ARC_FLAG_WAIT;
2541 2674 int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE;
2542 2675 blkptr_t blk = *zio->io_bp;
2543 2676 int error;
2544 2677
2545 2678 ddt_bp_fill(ddp, &blk, ddp->ddp_phys_birth);
2546 2679
2547 - ddt_exit(ddt);
2680 + dde_exit(dde);
2548 2681
2549 2682 /*
2550 2683 * Intuitively, it would make more sense to compare
2551 2684 * io_abd than io_orig_abd in the raw case since you
2552 2685 * don't want to look at any transformations that have
2553 2686 * happened to the data. However, for raw I/Os the
2554 2687 * data will actually be the same in io_abd and
2555 2688 * io_orig_abd, so all we have to do is issue this as
2556 2689 * a raw ARC read.
2557 2690 */
2558 2691 if (do_raw) {
2559 2692 zio_flags |= ZIO_FLAG_RAW;
2560 2693 ASSERT3U(zio->io_size, ==, zio->io_orig_size);
2561 2694 ASSERT0(abd_cmp(zio->io_abd, zio->io_orig_abd,
2562 2695 zio->io_size));
2563 2696 ASSERT3P(zio->io_transform_stack, ==, NULL);
2564 2697 }
2565 2698
2566 2699 error = arc_read(NULL, spa, &blk,
2567 2700 arc_getbuf_func, &abuf, ZIO_PRIORITY_SYNC_READ,
10 lines elided
2568 2701 zio_flags, &aflags, &zio->io_bookmark);
2569 2702
2570 2703 if (error == 0) {
2571 2704 if (arc_buf_size(abuf) != zio->io_orig_size ||
2572 2705 abd_cmp_buf(zio->io_orig_abd, abuf->b_data,
2573 2706 zio->io_orig_size) != 0)
2574 2707 error = SET_ERROR(EEXIST);
2575 2708 arc_buf_destroy(abuf, &abuf);
2576 2709 }
2577 2710
2578 - ddt_enter(ddt);
2711 + dde_enter(dde);
2579 2712 return (error != 0);
2580 2713 }
2581 2714 }
2582 2715
2583 2716 return (B_FALSE);
2584 2717 }
2585 2718
2586 2719 static void
2587 2720 zio_ddt_child_write_ready(zio_t *zio)
2588 2721 {
2589 2722 int p = zio->io_prop.zp_copies;
2590 - ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2591 2723 ddt_entry_t *dde = zio->io_private;
2592 2724 ddt_phys_t *ddp = &dde->dde_phys[p];
2593 2725 zio_t *pio;
2594 2726
2595 2727 if (zio->io_error)
2596 2728 return;
2597 2729
2598 - ddt_enter(ddt);
2730 + dde_enter(dde);
2599 2731
2600 2732 ASSERT(dde->dde_lead_zio[p] == zio);
2601 2733
2602 2734 ddt_phys_fill(ddp, zio->io_bp);
2603 2735
2604 2736 zio_link_t *zl = NULL;
2605 2737 while ((pio = zio_walk_parents(zio, &zl)) != NULL)
2606 2738 ddt_bp_fill(ddp, pio->io_bp, zio->io_txg);
2607 2739
2608 - ddt_exit(ddt);
2740 + dde_exit(dde);
2609 2741 }
2610 2742
2611 2743 static void
2612 2744 zio_ddt_child_write_done(zio_t *zio)
2613 2745 {
2614 2746 int p = zio->io_prop.zp_copies;
2615 - ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2616 2747 ddt_entry_t *dde = zio->io_private;
2617 2748 ddt_phys_t *ddp = &dde->dde_phys[p];
2618 2749
2619 - ddt_enter(ddt);
2750 + dde_enter(dde);
2620 2751
2621 2752 ASSERT(ddp->ddp_refcnt == 0);
2622 2753 ASSERT(dde->dde_lead_zio[p] == zio);
2623 2754 dde->dde_lead_zio[p] = NULL;
2624 2755
2625 2756 if (zio->io_error == 0) {
2626 2757 zio_link_t *zl = NULL;
2627 2758 while (zio_walk_parents(zio, &zl) != NULL)
2628 2759 ddt_phys_addref(ddp);
2629 2760 } else {
2630 2761 ddt_phys_clear(ddp);
2631 2762 }
2632 2763
2633 - ddt_exit(ddt);
2764 + dde_exit(dde);
2634 2765 }
2635 2766
2636 2767 static void
2637 2768 zio_ddt_ditto_write_done(zio_t *zio)
2638 2769 {
2639 2770 int p = DDT_PHYS_DITTO;
2640 2771 zio_prop_t *zp = &zio->io_prop;
2641 2772 blkptr_t *bp = zio->io_bp;
2642 2773 ddt_t *ddt = ddt_select(zio->io_spa, bp);
2643 2774 ddt_entry_t *dde = zio->io_private;
2644 2775 ddt_phys_t *ddp = &dde->dde_phys[p];
2645 2776 ddt_key_t *ddk = &dde->dde_key;
2646 2777
2647 - ddt_enter(ddt);
2778 + dde_enter(dde);
2648 2779
2649 2780 ASSERT(ddp->ddp_refcnt == 0);
2650 2781 ASSERT(dde->dde_lead_zio[p] == zio);
2651 2782 dde->dde_lead_zio[p] = NULL;
2652 2783
2653 2784 if (zio->io_error == 0) {
2654 2785 ASSERT(ZIO_CHECKSUM_EQUAL(bp->blk_cksum, ddk->ddk_cksum));
2655 2786 ASSERT(zp->zp_copies < SPA_DVAS_PER_BP);
2656 2787 ASSERT(zp->zp_copies == BP_GET_NDVAS(bp) - BP_IS_GANG(bp));
2657 2788 if (ddp->ddp_phys_birth != 0)
2658 2789 ddt_phys_free(ddt, ddk, ddp, zio->io_txg);
2659 2790 ddt_phys_fill(ddp, bp);
2660 2791 }
2661 2792
2662 - ddt_exit(ddt);
2793 + dde_exit(dde);
2663 2794 }
2664 2795
2665 2796 static int
2666 2797 zio_ddt_write(zio_t *zio)
2667 2798 {
2668 2799 spa_t *spa = zio->io_spa;
2669 2800 blkptr_t *bp = zio->io_bp;
2670 2801 uint64_t txg = zio->io_txg;
2671 2802 zio_prop_t *zp = &zio->io_prop;
2672 2803 int p = zp->zp_copies;
2673 2804 int ditto_copies;
2674 2805 zio_t *cio = NULL;
2 lines elided
2675 2806 zio_t *dio = NULL;
2676 2807 ddt_t *ddt = ddt_select(spa, bp);
2677 2808 ddt_entry_t *dde;
2678 2809 ddt_phys_t *ddp;
2679 2810
2680 2811 ASSERT(BP_GET_DEDUP(bp));
2681 2812 ASSERT(BP_GET_CHECKSUM(bp) == zp->zp_checksum);
2682 2813 ASSERT(BP_IS_HOLE(bp) || zio->io_bp_override);
2683 2814 ASSERT(!(zio->io_bp_override && (zio->io_flags & ZIO_FLAG_RAW)));
2684 2815
2685 - ddt_enter(ddt);
2686 2816 dde = ddt_lookup(ddt, bp, B_TRUE);
2687 - ddp = &dde->dde_phys[p];
2688 2817
2818 + /*
2819 +	 * If we're not using the special tier: for each new DDE not yet on
2820 +	 * disk, disable dedup once the "allowed" DDT L2/ARC space is exhausted.
2821 + */
2822 + if ((dde->dde_state & DDE_NEW) && !spa->spa_usesc &&
2823 + (zfs_ddt_limit_type != DDT_NO_LIMIT || zfs_ddt_byte_ceiling != 0)) {
2824 + /* turn off dedup if we need to stop DDT growth */
2825 + if (spa_enable_dedup_cap(spa)) {
2826 + dde->dde_state |= DDE_DONT_SYNC;
2827 +
2828 + /* disable dedup and use the ordinary write pipeline */
2829 + zio_pop_transforms(zio);
2830 + zp->zp_dedup = zp->zp_dedup_verify = B_FALSE;
2831 + zio->io_stage = ZIO_STAGE_OPEN;
2832 + zio->io_pipeline = ZIO_WRITE_PIPELINE;
2833 + zio->io_bp_override = NULL;
2834 + BP_ZERO(bp);
2835 + dde_exit(dde);
2836 +
2837 + return (ZIO_PIPELINE_CONTINUE);
2838 + }
2839 + }
2840 + ASSERT(!(dde->dde_state & DDE_DONT_SYNC));
2841 +
2689 2842 if (zp->zp_dedup_verify && zio_ddt_collision(zio, ddt, dde)) {
2690 2843 /*
2691 2844 * If we're using a weak checksum, upgrade to a strong checksum
2692 2845 * and try again. If we're already using a strong checksum,
2693 2846 * we can't resolve it, so just convert to an ordinary write.
2694 2847 * (And automatically e-mail a paper to Nature?)
2695 2848 */
2696 2849 if (!(zio_checksum_table[zp->zp_checksum].ci_flags &
2697 2850 ZCHECKSUM_FLAG_DEDUP)) {
2698 2851 zp->zp_checksum = spa_dedup_checksum(spa);
2699 2852 zio_pop_transforms(zio);
2700 2853 zio->io_stage = ZIO_STAGE_OPEN;
2701 2854 BP_ZERO(bp);
2702 2855 } else {
2703 2856 zp->zp_dedup = B_FALSE;
2704 2857 BP_SET_DEDUP(bp, B_FALSE);
2705 2858 }
2706 2859 ASSERT(!BP_GET_DEDUP(bp));
2707 2860 zio->io_pipeline = ZIO_WRITE_PIPELINE;
2708 - ddt_exit(ddt);
2861 + dde_exit(dde);
2709 2862 return (ZIO_PIPELINE_CONTINUE);
2710 2863 }
2711 2864
2865 + ddp = &dde->dde_phys[p];
2712 2866 ditto_copies = ddt_ditto_copies_needed(ddt, dde, ddp);
2713 2867 ASSERT(ditto_copies < SPA_DVAS_PER_BP);
2714 2868
2715 2869 if (ditto_copies > ddt_ditto_copies_present(dde) &&
2716 2870 dde->dde_lead_zio[DDT_PHYS_DITTO] == NULL) {
2717 2871 zio_prop_t czp = *zp;
2718 2872
2719 2873 czp.zp_copies = ditto_copies;
2720 2874
2721 2875 /*
2722 2876 * If we arrived here with an override bp, we won't have run
2723 2877 * the transform stack, so we won't have the data we need to
2 lines elided
2724 2878 * generate a child i/o. So, toss the override bp and restart.
2725 2879 * This is safe, because using the override bp is just an
2726 2880 * optimization; and it's rare, so the cost doesn't matter.
2727 2881 */
2728 2882 if (zio->io_bp_override) {
2729 2883 zio_pop_transforms(zio);
2730 2884 zio->io_stage = ZIO_STAGE_OPEN;
2731 2885 zio->io_pipeline = ZIO_WRITE_PIPELINE;
2732 2886 zio->io_bp_override = NULL;
2733 2887 BP_ZERO(bp);
2734 - ddt_exit(ddt);
2888 + dde_exit(dde);
2735 2889 return (ZIO_PIPELINE_CONTINUE);
2736 2890 }
2737 2891
2738 2892 dio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2739 2893 zio->io_orig_size, zio->io_orig_size, &czp, NULL, NULL,
2740 2894 NULL, zio_ddt_ditto_write_done, dde, zio->io_priority,
2741 - ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
2895 + ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2742 2896
2743 2897 zio_push_transform(dio, zio->io_abd, zio->io_size, 0, NULL);
2744 2898 dde->dde_lead_zio[DDT_PHYS_DITTO] = dio;
2745 2899 }
2746 2900
2747 2901 if (ddp->ddp_phys_birth != 0 || dde->dde_lead_zio[p] != NULL) {
2748 2902 if (ddp->ddp_phys_birth != 0)
2749 2903 ddt_bp_fill(ddp, bp, txg);
2750 2904 if (dde->dde_lead_zio[p] != NULL)
2751 2905 zio_add_child(zio, dde->dde_lead_zio[p]);
2752 2906 else
2753 2907 ddt_phys_addref(ddp);
2 lines elided
2754 2908 } else if (zio->io_bp_override) {
2755 2909 ASSERT(bp->blk_birth == txg);
2756 2910 ASSERT(BP_EQUAL(bp, zio->io_bp_override));
2757 2911 ddt_phys_fill(ddp, bp);
2758 2912 ddt_phys_addref(ddp);
2759 2913 } else {
2760 2914 cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2761 2915 zio->io_orig_size, zio->io_orig_size, zp,
2762 2916 zio_ddt_child_write_ready, NULL, NULL,
2763 2917 zio_ddt_child_write_done, dde, zio->io_priority,
2764 - ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
2918 + ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2765 2919
2766 2920 zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);
2767 2921 dde->dde_lead_zio[p] = cio;
2768 2922 }
2769 2923
2770 - ddt_exit(ddt);
2924 + dde_exit(dde);
2771 2925
2772 2926 if (cio)
2773 2927 zio_nowait(cio);
2774 2928 if (dio)
2775 2929 zio_nowait(dio);
2776 2930
2777 2931 return (ZIO_PIPELINE_CONTINUE);
2778 2932 }
2779 2933
2780 2934 ddt_entry_t *freedde; /* for debugging */
2781 2935
2782 2936 static int
2783 2937 zio_ddt_free(zio_t *zio)
3 lines elided
2784 2938 {
2785 2939 spa_t *spa = zio->io_spa;
2786 2940 blkptr_t *bp = zio->io_bp;
2787 2941 ddt_t *ddt = ddt_select(spa, bp);
2788 2942 ddt_entry_t *dde;
2789 2943 ddt_phys_t *ddp;
2790 2944
2791 2945 ASSERT(BP_GET_DEDUP(bp));
2792 2946 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2793 2947
2794 - ddt_enter(ddt);
2795 2948 freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
2796 2949 ddp = ddt_phys_select(dde, bp);
2797 - ddt_phys_decref(ddp);
2798 - ddt_exit(ddt);
2950 + if (ddp)
2951 + ddt_phys_decref(ddp);
2952 + dde_exit(dde);
2799 2953
2800 2954 return (ZIO_PIPELINE_CONTINUE);
2801 2955 }
2802 2956
2803 2957 /*
2804 2958 * ==========================================================================
2805 2959 * Allocate and free blocks
2806 2960 * ==========================================================================
2807 2961 */
2808 2962
2809 2963 static zio_t *
2810 -zio_io_to_allocate(spa_t *spa)
2964 +zio_io_to_allocate(metaslab_class_t *mc)
2811 2965 {
2812 2966 zio_t *zio;
2813 2967
2814 - ASSERT(MUTEX_HELD(&spa->spa_alloc_lock));
2968 + ASSERT(MUTEX_HELD(&mc->mc_alloc_lock));
2815 2969
2816 - zio = avl_first(&spa->spa_alloc_tree);
2970 + zio = avl_first(&mc->mc_alloc_tree);
2817 2971 if (zio == NULL)
2818 2972 return (NULL);
2819 2973
2820 2974 ASSERT(IO_IS_ALLOCATING(zio));
2821 2975
2822 2976 /*
2823 2977 * Try to place a reservation for this zio. If we're unable to
2824 2978 * reserve then we throttle.
2825 2979 */
2826 - if (!metaslab_class_throttle_reserve(spa_normal_class(spa),
2980 + if (!metaslab_class_throttle_reserve(mc,
2827 2981 zio->io_prop.zp_copies, zio, 0)) {
2828 2982 return (NULL);
2829 2983 }
2830 2984
2831 - avl_remove(&spa->spa_alloc_tree, zio);
2985 + avl_remove(&mc->mc_alloc_tree, zio);
2832 2986 ASSERT3U(zio->io_stage, <, ZIO_STAGE_DVA_ALLOCATE);
2833 2987
2834 2988 return (zio);
2835 2989 }
2836 2990
2837 2991 static int
2838 2992 zio_dva_throttle(zio_t *zio)
2839 2993 {
2840 2994 spa_t *spa = zio->io_spa;
2841 2995 zio_t *nio;
2842 2996
2997 +	/* Use the parent's metaslab class; select one if it is not set yet */
2998 + if (zio->io_mc == NULL) {
2999 + zio->io_mc = spa_select_class(spa, zio);
3000 + if (zio->io_prop.zp_usewbc)
3001 + return (ZIO_PIPELINE_CONTINUE);
3002 + }
3003 +
2843 3004 if (zio->io_priority == ZIO_PRIORITY_SYNC_WRITE ||
2844 - !spa_normal_class(zio->io_spa)->mc_alloc_throttle_enabled ||
3005 + !zio->io_mc->mc_alloc_throttle_enabled ||
2845 3006 zio->io_child_type == ZIO_CHILD_GANG ||
2846 3007 zio->io_flags & ZIO_FLAG_NODATA) {
2847 3008 return (ZIO_PIPELINE_CONTINUE);
2848 3009 }
2849 3010
2850 3011 ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2851 3012
2852 3013 ASSERT3U(zio->io_queued_timestamp, >, 0);
2853 3014 ASSERT(zio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2854 3015
2855 - mutex_enter(&spa->spa_alloc_lock);
3016 + mutex_enter(&zio->io_mc->mc_alloc_lock);
2856 3017
2857 3018 ASSERT(zio->io_type == ZIO_TYPE_WRITE);
2858 - avl_add(&spa->spa_alloc_tree, zio);
3019 + avl_add(&zio->io_mc->mc_alloc_tree, zio);
2859 3020
2860 - nio = zio_io_to_allocate(zio->io_spa);
2861 - mutex_exit(&spa->spa_alloc_lock);
3021 + nio = zio_io_to_allocate(zio->io_mc);
3022 + mutex_exit(&zio->io_mc->mc_alloc_lock);
2862 3023
2863 3024 if (nio == zio)
2864 3025 return (ZIO_PIPELINE_CONTINUE);
2865 3026
2866 3027 if (nio != NULL) {
2867 3028 ASSERT(nio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2868 3029 /*
2869 3030 * We are passing control to a new zio so make sure that
2870 3031 * it is processed by a different thread. We do this to
2871 3032 * avoid stack overflows that can occur when parents are
2872 3033 * throttled and children are making progress. We allow
2873 3034 * it to go to the head of the taskq since it's already
2874 3035 * been waiting.
2875 3036 */
2876 3037 zio_taskq_dispatch(nio, ZIO_TASKQ_ISSUE, B_TRUE);
2877 3038 }
2878 3039 return (ZIO_PIPELINE_STOP);
2879 3040 }
2880 3041
2881 3042 void
2882 -zio_allocate_dispatch(spa_t *spa)
3043 +zio_allocate_dispatch(metaslab_class_t *mc)
2883 3044 {
2884 3045 zio_t *zio;
2885 3046
2886 - mutex_enter(&spa->spa_alloc_lock);
2887 - zio = zio_io_to_allocate(spa);
2888 - mutex_exit(&spa->spa_alloc_lock);
3047 + mutex_enter(&mc->mc_alloc_lock);
3048 + zio = zio_io_to_allocate(mc);
3049 + mutex_exit(&mc->mc_alloc_lock);
2889 3050 if (zio == NULL)
2890 3051 return;
2891 3052
2892 3053 ASSERT3U(zio->io_stage, ==, ZIO_STAGE_DVA_THROTTLE);
2893 3054 ASSERT0(zio->io_error);
2894 3055 zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_TRUE);
2895 3056 }
2896 3057
2897 3058 static int
2898 3059 zio_dva_allocate(zio_t *zio)
2899 3060 {
2900 3061 spa_t *spa = zio->io_spa;
2901 - metaslab_class_t *mc = spa_normal_class(spa);
3062 + metaslab_class_t *mc = zio->io_mc;
3063 +
2902 3064 blkptr_t *bp = zio->io_bp;
2903 3065 int error;
2904 3066 int flags = 0;
2905 3067
2906 3068 if (zio->io_gang_leader == NULL) {
2907 3069 ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2908 3070 zio->io_gang_leader = zio;
2909 3071 }
2910 3072
2911 3073 ASSERT(BP_IS_HOLE(bp));
2912 3074 ASSERT0(BP_GET_NDVAS(bp));
2913 3075 ASSERT3U(zio->io_prop.zp_copies, >, 0);
2914 3076 ASSERT3U(zio->io_prop.zp_copies, <=, spa_max_replication(spa));
2915 3077 ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
2916 3078
2917 - if (zio->io_flags & ZIO_FLAG_NODATA) {
3079 + if (zio->io_flags & ZIO_FLAG_NODATA || zio->io_prop.zp_usewbc) {
2918 3080 flags |= METASLAB_DONT_THROTTLE;
2919 3081 }
2920 3082 if (zio->io_flags & ZIO_FLAG_GANG_CHILD) {
2921 3083 flags |= METASLAB_GANG_CHILD;
2922 3084 }
2923 - if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE) {
3085 + if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE &&
3086 + zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2924 3087 flags |= METASLAB_ASYNC_ALLOC;
2925 3088 }
2926 3089
2927 3090 error = metaslab_alloc(spa, mc, zio->io_size, bp,
2928 3091 zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
2929 3092 &zio->io_alloc_list, zio);
2930 3093
3094 +#ifdef _KERNEL
3095 + DTRACE_PROBE6(zio_dva_allocate,
3096 + uint64_t, DVA_GET_VDEV(&bp->blk_dva[0]),
3097 + uint64_t, DVA_GET_VDEV(&bp->blk_dva[1]),
3098 + uint64_t, BP_GET_LEVEL(bp),
3099 + boolean_t, BP_IS_SPECIAL(bp),
3100 + boolean_t, BP_IS_METADATA(bp),
3101 + int, error);
3102 +#endif
3103 +
2931 3104 if (error != 0) {
2932 3105 spa_dbgmsg(spa, "%s: metaslab allocation failure: zio %p, "
2933 3106 "size %llu, error %d", spa_name(spa), zio, zio->io_size,
2934 3107 error);
2935 - if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
3108 + if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE) {
3109 + if (zio->io_prop.zp_usewbc) {
3110 + zio->io_prop.zp_usewbc = B_FALSE;
3111 + zio->io_prop.zp_usesc = B_FALSE;
3112 + zio->io_mc = spa_normal_class(spa);
3113 + }
3114 +
2936 3115 return (zio_write_gang_block(zio));
3116 + }
3117 +
2937 3118 zio->io_error = error;
2938 3119 }
2939 3120
2940 3121 return (ZIO_PIPELINE_CONTINUE);
2941 3122 }
2942 3123
2943 3124 static int
2944 3125 zio_dva_free(zio_t *zio)
2945 3126 {
2946 3127 metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
2947 3128
2948 3129 return (ZIO_PIPELINE_CONTINUE);
2949 3130 }
2950 3131
2951 3132 static int
2952 3133 zio_dva_claim(zio_t *zio)
2953 3134 {
2954 3135 int error;
2955 3136
2956 3137 error = metaslab_claim(zio->io_spa, zio->io_bp, zio->io_txg);
2957 3138 if (error)
2958 3139 zio->io_error = error;
2959 3140
2960 3141 return (ZIO_PIPELINE_CONTINUE);
2961 3142 }
2962 3143
2963 3144 /*
2964 3145 * Undo an allocation. This is used by zio_done() when an I/O fails
2965 3146 * and we want to give back the block we just allocated.
2966 3147 * This handles both normal blocks and gang blocks.
2967 3148 */
2968 3149 static void
2969 3150 zio_dva_unallocate(zio_t *zio, zio_gang_node_t *gn, blkptr_t *bp)
2970 3151 {
2971 3152 ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
2972 3153 ASSERT(zio->io_bp_override == NULL);
2973 3154
2974 3155 if (!BP_IS_HOLE(bp))
2975 3156 metaslab_free(zio->io_spa, bp, bp->blk_birth, B_TRUE);
2976 3157
2977 3158 if (gn != NULL) {
2978 3159 for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2979 3160 zio_dva_unallocate(zio, gn->gn_child[g],
2980 3161 &gn->gn_gbh->zg_blkptr[g]);
2981 3162 }
2982 3163 }
2983 3164 }
37 lines elided
2984 3165
2985 3166 /*
2986 3167 * Try to allocate an intent log block. Return 0 on success, errno on failure.
2987 3168 */
2988 3169 int
2989 3170 zio_alloc_zil(spa_t *spa, uint64_t txg, blkptr_t *new_bp, blkptr_t *old_bp,
2990 3171 uint64_t size, boolean_t *slog)
2991 3172 {
2992 3173 int error = 1;
2993 3174 zio_alloc_list_t io_alloc_list;
3175 + spa_meta_placement_t *mp = &spa->spa_meta_policy;
2994 3176
2995 3177 ASSERT(txg > spa_syncing_txg(spa));
2996 3178
2997 3179 metaslab_trace_init(&io_alloc_list);
2998 - error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
2999 - txg, old_bp, METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
3000 - if (error == 0) {
3001 - *slog = TRUE;
3002 - } else {
3180 +
3181 + /*
3182 + * ZIL blocks are always contiguous (i.e. not gang blocks)
3183 + * so we set the METASLAB_HINTBP_AVOID flag so that they
3184 + * don't "fast gang" when allocating them.
3185 +	 * If the caller indicates that the slog is not to be
3186 +	 * used (via use_slog), then no separate allocation class
3187 +	 * is used at all, regardless of whether that class is a
3188 +	 * log or a special class.
3189 + */
3190 +
3191 + if (spa_has_slogs(spa)) {
3192 + error = metaslab_alloc(spa, spa_log_class(spa),
3193 + size, new_bp, 1, txg, old_bp,
3194 + METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
3195 +
3196 + DTRACE_PROBE2(zio_alloc_zil_log,
3197 + spa_t *, spa, int, error);
3198 +
3199 + if (error == 0)
3200 + *slog = TRUE;
3201 + }
3202 +
3203 + /*
3204 +	 * Use the special class when allocation from the regular slog
3205 +	 * has failed, but only if that is allowed and the special class
3206 +	 * used space is still below the watermarks.
3207 + */
3208 + if (error != 0 && spa_can_special_be_used(spa) &&
3209 + mp->spa_sync_to_special != SYNC_TO_SPECIAL_DISABLED) {
3210 + error = metaslab_alloc(spa, spa_special_class(spa),
3211 + size, new_bp, 1, txg, old_bp,
3212 + METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
3213 +
3214 + DTRACE_PROBE2(zio_alloc_zil_special,
3215 + spa_t *, spa, int, error);
3216 +
3217 + if (error == 0)
3218 + *slog = FALSE;
3219 + }
3220 +
3221 + if (error != 0) {
3003 3222 error = metaslab_alloc(spa, spa_normal_class(spa), size,
3004 3223 new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID,
3005 3224 &io_alloc_list, NULL);
3225 +
3226 + DTRACE_PROBE2(zio_alloc_zil_normal,
3227 + spa_t *, spa, int, error);
3228 +
3006 3229 if (error == 0)
3007 3230 *slog = FALSE;
3008 3231 }
3232 +
3009 3233 metaslab_trace_fini(&io_alloc_list);
3010 3234
3011 3235 if (error == 0) {
3012 3236 BP_SET_LSIZE(new_bp, size);
3013 3237 BP_SET_PSIZE(new_bp, size);
3014 3238 BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
3015 3239 BP_SET_CHECKSUM(new_bp,
3016 3240 spa_version(spa) >= SPA_VERSION_SLIM_ZIL
3017 3241 ? ZIO_CHECKSUM_ZILOG2 : ZIO_CHECKSUM_ZILOG);
3018 3242 BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
3019 3243 BP_SET_LEVEL(new_bp, 0);
3020 3244 BP_SET_DEDUP(new_bp, 0);
3021 3245 BP_SET_BYTEORDER(new_bp, ZFS_HOST_BYTEORDER);
3022 3246 } else {
3023 3247 zfs_dbgmsg("%s: zil block allocation failure: "
3024 3248 "size %llu, error %d", spa_name(spa), size, error);
3025 3249 }
3026 3250
3027 3251 return (error);
3028 3252 }
3029 3253
3030 3254 /*
3031 3255 * Free an intent log block.
3032 3256 */
3033 3257 void
3034 3258 zio_free_zil(spa_t *spa, uint64_t txg, blkptr_t *bp)
3035 3259 {
3036 3260 ASSERT(BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG);
3037 3261 ASSERT(!BP_IS_GANG(bp));
3038 3262
3039 3263 zio_free(spa, txg, bp);
3040 3264 }
3041 3265
3042 3266 /*
3043 3267 * ==========================================================================
3044 3268 * Read and write to physical devices
3045 3269 * ==========================================================================
3046 3270 */
3047 3271
3048 3272
3049 3273 /*
3050 3274 * Issue an I/O to the underlying vdev. Typically the issue pipeline
3051 3275 * stops after this stage and will resume upon I/O completion.
3052 3276 * However, there are instances where the vdev layer may need to
3053 3277 * continue the pipeline when an I/O was not issued. Since the I/O
3054 3278 * that was sent to the vdev layer might be different than the one
36 lines elided
3055 3279 * currently active in the pipeline (see vdev_queue_io()), we explicitly
3056 3280 * force the underlying vdev layers to call either zio_execute() or
3057 3281 * zio_interrupt() to ensure that the pipeline continues with the correct I/O.
3058 3282 */
3059 3283 static int
3060 3284 zio_vdev_io_start(zio_t *zio)
3061 3285 {
3062 3286 vdev_t *vd = zio->io_vd;
3063 3287 uint64_t align;
3064 3288 spa_t *spa = zio->io_spa;
3289 + zio_type_t type = zio->io_type;
3290 + zio->io_vd_timestamp = gethrtime();
3065 3291
3066 3292 ASSERT(zio->io_error == 0);
3067 3293 ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
3068 3294
3069 3295 if (vd == NULL) {
3070 3296 if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3071 3297 spa_config_enter(spa, SCL_ZIO, zio, RW_READER);
3072 3298
3073 3299 /*
3074 3300 * The mirror_ops handle multiple DVAs in a single BP.
3075 3301 */
3076 3302 vdev_mirror_ops.vdev_op_io_start(zio);
3077 3303 return (ZIO_PIPELINE_STOP);
3078 3304 }
3079 3305
3080 3306 ASSERT3P(zio->io_logical, !=, zio);
3081 - if (zio->io_type == ZIO_TYPE_WRITE) {
3082 - ASSERT(spa->spa_trust_config);
3083 3307
3084 - if (zio->io_vd->vdev_removing) {
3085 - ASSERT(zio->io_flags &
3086 - (ZIO_FLAG_PHYSICAL | ZIO_FLAG_SELF_HEAL |
3087 - ZIO_FLAG_INDUCE_DAMAGE));
3088 - }
3089 - }
3090 -
3091 - /*
3092 - * We keep track of time-sensitive I/Os so that the scan thread
3093 - * can quickly react to certain workloads. In particular, we care
3094 - * about non-scrubbing, top-level reads and writes with the following
3095 - * characteristics:
3096 - * - synchronous writes of user data to non-slog devices
3097 - * - any reads of user data
3098 - * When these conditions are met, adjust the timestamp of spa_last_io
3099 - * which allows the scan thread to adjust its workload accordingly.
3100 - */
3101 - if (!(zio->io_flags & ZIO_FLAG_SCAN_THREAD) && zio->io_bp != NULL &&
3102 - vd == vd->vdev_top && !vd->vdev_islog &&
3103 - zio->io_bookmark.zb_objset != DMU_META_OBJSET &&
3104 - zio->io_txg != spa_syncing_txg(spa)) {
3105 - uint64_t old = spa->spa_last_io;
3106 - uint64_t new = ddi_get_lbolt64();
3107 - if (old != new)
3108 - (void) atomic_cas_64(&spa->spa_last_io, old, new);
3109 - }
3110 -
3111 3308 align = 1ULL << vd->vdev_top->vdev_ashift;
3112 3309
3113 3310 if (!(zio->io_flags & ZIO_FLAG_PHYSICAL) &&
3114 3311 P2PHASE(zio->io_size, align) != 0) {
3115 3312 /* Transform logical writes to be a full physical block size. */
3116 3313 uint64_t asize = P2ROUNDUP(zio->io_size, align);
3117 3314 abd_t *abuf = abd_alloc_sametype(zio->io_abd, asize);
3118 3315 ASSERT(vd == vd->vdev_top);
3119 - if (zio->io_type == ZIO_TYPE_WRITE) {
3316 + if (type == ZIO_TYPE_WRITE) {
3120 3317 abd_copy(abuf, zio->io_abd, zio->io_size);
3121 3318 abd_zero_off(abuf, zio->io_size, asize - zio->io_size);
3122 3319 }
3123 3320 zio_push_transform(zio, abuf, asize, asize, zio_subblock);
3124 3321 }
3125 3322
3126 3323 /*
3127 3324 * If this is not a physical io, make sure that it is properly aligned
3128 3325 * before proceeding.
3129 3326 */
3130 3327 if (!(zio->io_flags & ZIO_FLAG_PHYSICAL)) {
3131 3328 ASSERT0(P2PHASE(zio->io_offset, align));
2 lines elided
3132 3329 ASSERT0(P2PHASE(zio->io_size, align));
3133 3330 } else {
3134 3331 /*
3135 3332 * For physical writes, we allow 512b aligned writes and assume
3136 3333 * the device will perform a read-modify-write as necessary.
3137 3334 */
3138 3335 ASSERT0(P2PHASE(zio->io_offset, SPA_MINBLOCKSIZE));
3139 3336 ASSERT0(P2PHASE(zio->io_size, SPA_MINBLOCKSIZE));
3140 3337 }
3141 3338
3142 - VERIFY(zio->io_type != ZIO_TYPE_WRITE || spa_writeable(spa));
3339 + VERIFY(type != ZIO_TYPE_WRITE || spa_writeable(spa));
3143 3340
3144 3341 /*
3145 3342 * If this is a repair I/O, and there's no self-healing involved --
3146 3343 * that is, we're just resilvering what we expect to resilver --
3147 3344 * then don't do the I/O unless zio's txg is actually in vd's DTL.
3148 3345 * This prevents spurious resilvering with nested replication.
3149 3346 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
3150 3347 * A is out of date, we'll read from C+D, then use the data to
3151 3348 * resilver A+B -- but we don't actually want to resilver B, just A.
3152 3349 * The top-level mirror has no way to know this, so instead we just
3153 3350 * discard unnecessary repairs as we work our way down the vdev tree.
3154 3351 * The same logic applies to any form of nested replication:
3155 3352 * ditto + mirror, RAID-Z + replacing, etc. This covers them all.
3156 3353 */
3157 3354 if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
3158 3355 !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
3159 3356 zio->io_txg != 0 && /* not a delegated i/o */
3160 3357 !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
3161 - ASSERT(zio->io_type == ZIO_TYPE_WRITE);
3358 + ASSERT(type == ZIO_TYPE_WRITE);
3162 3359 zio_vdev_io_bypass(zio);
3163 3360 return (ZIO_PIPELINE_CONTINUE);
3164 3361 }
3165 3362
3166 3363 if (vd->vdev_ops->vdev_op_leaf &&
3167 - (zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
3168 -
3169 - if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio))
3364 + (type == ZIO_TYPE_READ || type == ZIO_TYPE_WRITE)) {
3365 + if (type == ZIO_TYPE_READ && vdev_cache_read(zio))
3170 3366 return (ZIO_PIPELINE_CONTINUE);
3171 3367
3172 3368 if ((zio = vdev_queue_io(zio)) == NULL)
3173 3369 return (ZIO_PIPELINE_STOP);
3174 3370
3175 3371 if (!vdev_accessible(vd, zio)) {
3176 3372 zio->io_error = SET_ERROR(ENXIO);
3177 3373 zio_interrupt(zio);
3178 3374 return (ZIO_PIPELINE_STOP);
3179 3375 }
3376 +
3377 + /*
3378 + * Insert a fault simulation delay for a particular vdev.
3379 + */
3380 + if (zio_faulty_vdev_enabled &&
3381 + (zio->io_vd->vdev_guid == zio_faulty_vdev_guid)) {
3382 + delay(NSEC_TO_TICK(zio_faulty_vdev_delay_us *
3383 + (NANOSEC / MICROSEC)));
3384 + }
3180 3385 }
3181 3386
3182 3387 vd->vdev_ops->vdev_op_io_start(zio);
3183 3388 return (ZIO_PIPELINE_STOP);
3184 3389 }
3185 3390
3186 3391 static int
3187 3392 zio_vdev_io_done(zio_t *zio)
3188 3393 {
3189 3394 vdev_t *vd = zio->io_vd;
3190 3395 vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
3191 3396 boolean_t unexpected_error = B_FALSE;
3192 3397
3193 - if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
3398 + if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3194 3399 return (ZIO_PIPELINE_STOP);
3195 - }
3196 3400
3197 3401 ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
3198 3402
3199 3403 if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
3200 -
3201 3404 vdev_queue_io_done(zio);
3202 3405
3203 3406 if (zio->io_type == ZIO_TYPE_WRITE)
3204 3407 vdev_cache_write(zio);
3205 3408
3206 3409 if (zio_injection_enabled && zio->io_error == 0)
3207 3410 zio->io_error = zio_handle_device_injection(vd,
3208 3411 zio, EIO);
3209 3412
3210 3413 if (zio_injection_enabled && zio->io_error == 0)
3211 3414 zio->io_error = zio_handle_label_injection(zio, EIO);
3212 3415
3213 3416 if (zio->io_error) {
3214 3417 if (!vdev_accessible(vd, zio)) {
3215 3418 zio->io_error = SET_ERROR(ENXIO);
3216 3419 } else {
6 lines elided
3217 3420 unexpected_error = B_TRUE;
3218 3421 }
3219 3422 }
3220 3423 }
3221 3424
3222 3425 ops->vdev_op_io_done(zio);
3223 3426
3224 3427 if (unexpected_error)
3225 3428 VERIFY(vdev_probe(vd, zio) == NULL);
3226 3429
3430 + /*
3431 + * Measure delta between start and end of the I/O in nanoseconds.
3432 + * XXX: Handle overflow.
3433 + */
3434 + zio->io_vd_timestamp = gethrtime() - zio->io_vd_timestamp;
3435 +
3227 3436 return (ZIO_PIPELINE_CONTINUE);
3228 3437 }
3229 3438
3230 3439 /*
3231 3440 * For non-raidz ZIOs, we can just copy aside the bad data read from the
3232 3441 * disk, and use that to finish the checksum ereport later.
3233 3442 */
3234 3443 static void
3235 3444 zio_vsd_default_cksum_finish(zio_cksum_report_t *zcr,
3236 3445 const void *good_buf)
3237 3446 {
3238 3447 /* no processing needed */
3239 3448 zfs_ereport_finish_checksum(zcr, good_buf, zcr->zcr_cbdata, B_FALSE);
3240 3449 }
3241 3450
3242 3451 /*ARGSUSED*/
3243 3452 void
3244 3453 zio_vsd_default_cksum_report(zio_t *zio, zio_cksum_report_t *zcr, void *ignored)
3245 3454 {
3246 3455 void *buf = zio_buf_alloc(zio->io_size);
3247 3456
3248 3457 abd_copy_to_buf(buf, zio->io_abd, zio->io_size);
3249 3458
3250 3459 zcr->zcr_cbinfo = zio->io_size;
14 lines elided
3251 3460 zcr->zcr_cbdata = buf;
3252 3461 zcr->zcr_finish = zio_vsd_default_cksum_finish;
3253 3462 zcr->zcr_free = zio_buf_free;
3254 3463 }
3255 3464
3256 3465 static int
3257 3466 zio_vdev_io_assess(zio_t *zio)
3258 3467 {
3259 3468 vdev_t *vd = zio->io_vd;
3260 3469
3261 - if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
3470 + if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3262 3471 return (ZIO_PIPELINE_STOP);
3263 - }
3264 3472
3265 3473 if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3266 3474 spa_config_exit(zio->io_spa, SCL_ZIO, zio);
3267 3475
3268 3476 if (zio->io_vsd != NULL) {
3269 3477 zio->io_vsd_ops->vsd_free(zio);
3270 3478 zio->io_vsd = NULL;
3271 3479 }
3272 3480
3273 3481 if (zio_injection_enabled && zio->io_error == 0)
3274 3482 zio->io_error = zio_handle_fault_injection(zio, EIO);
3275 3483
3276 3484 /*
3277 3485 * If the I/O failed, determine whether we should attempt to retry it.
3278 3486 *
3279 3487 * On retry, we cut in line in the issue queue, since we don't want
3280 3488 * compression/checksumming/etc. work to prevent our (cheap) IO reissue.
3281 3489 */
3282 3490 if (zio->io_error && vd == NULL &&
3283 3491 !(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
3284 3492 ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
3285 3493 ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS)); /* not a leaf */
3286 3494 zio->io_error = 0;
3287 3495 zio->io_flags |= ZIO_FLAG_IO_RETRY |
3288 3496 ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
3289 3497 zio->io_stage = ZIO_STAGE_VDEV_IO_START >> 1;
3290 3498 zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE,
3291 3499 zio_requeue_io_start_cut_in_line);
3292 3500 return (ZIO_PIPELINE_STOP);
3293 3501 }
3294 3502
3295 3503 /*
3296 3504 * If we got an error on a leaf device, convert it to ENXIO
3297 3505 * if the device is not accessible at all.
3298 3506 */
3299 3507 if (zio->io_error && vd != NULL && vd->vdev_ops->vdev_op_leaf &&
3300 3508 !vdev_accessible(vd, zio))
3301 3509 zio->io_error = SET_ERROR(ENXIO);
3302 3510
3303 3511 /*
3304 3512 * If we can't write to an interior vdev (mirror or RAID-Z),
3305 3513 * set vdev_cant_write so that we stop trying to allocate from it.
3306 3514 */
3307 3515 if (zio->io_error == ENXIO && zio->io_type == ZIO_TYPE_WRITE &&
3308 3516 vd != NULL && !vd->vdev_ops->vdev_op_leaf) {
3309 3517 vd->vdev_cant_write = B_TRUE;
3310 3518 }
3311 3519
3312 3520 /*
3313 3521 * If a cache flush returns ENOTSUP or ENOTTY, we know that no future
3314 3522 * attempts will ever succeed. In this case we set a persistent bit so
3315 3523 * that we don't bother with it in the future.
3316 3524 */
3317 3525 if ((zio->io_error == ENOTSUP || zio->io_error == ENOTTY) &&
3318 3526 zio->io_type == ZIO_TYPE_IOCTL &&
3319 3527 zio->io_cmd == DKIOCFLUSHWRITECACHE && vd != NULL)
3320 3528 vd->vdev_nowritecache = B_TRUE;
3321 3529
3322 3530 if (zio->io_error)
3323 3531 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
3324 3532
3325 3533 if (vd != NULL && vd->vdev_ops->vdev_op_leaf &&
3326 3534 zio->io_physdone != NULL) {
3327 3535 ASSERT(!(zio->io_flags & ZIO_FLAG_DELEGATED));
3328 3536 ASSERT(zio->io_child_type == ZIO_CHILD_VDEV);
3329 3537 zio->io_physdone(zio->io_logical);
3330 3538 }
3331 3539
3332 3540 return (ZIO_PIPELINE_CONTINUE);
3333 3541 }
3334 3542
3335 3543 void
3336 3544 zio_vdev_io_reissue(zio_t *zio)
3337 3545 {
3338 3546 ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
3339 3547 ASSERT(zio->io_error == 0);
3340 3548
3341 3549 zio->io_stage >>= 1;
3342 3550 }
3343 3551
3344 3552 void
3345 3553 zio_vdev_io_redone(zio_t *zio)
3346 3554 {
3347 3555 ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_DONE);
3348 3556
3349 3557 zio->io_stage >>= 1;
3350 3558 }
3351 3559
3352 3560 void
3353 3561 zio_vdev_io_bypass(zio_t *zio)
3354 3562 {
3355 3563 ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
3356 3564 ASSERT(zio->io_error == 0);
3357 3565
3358 3566 zio->io_flags |= ZIO_FLAG_IO_BYPASS;
3359 3567 zio->io_stage = ZIO_STAGE_VDEV_IO_ASSESS >> 1;
3360 3568 }
3361 3569
3362 3570 /*
3363 3571 * ==========================================================================
3364 3572 * Generate and verify checksums
3365 3573 * ==========================================================================
3366 3574 */
3367 3575 static int
3368 3576 zio_checksum_generate(zio_t *zio)
3369 3577 {
3370 3578 blkptr_t *bp = zio->io_bp;
3371 3579 enum zio_checksum checksum;
3372 3580
3373 3581 if (bp == NULL) {
3374 3582 /*
3375 3583 * This is zio_write_phys().
3376 3584 * We're either generating a label checksum, or none at all.
3377 3585 */
3378 3586 checksum = zio->io_prop.zp_checksum;
3379 3587
3380 3588 if (checksum == ZIO_CHECKSUM_OFF)
3381 3589 return (ZIO_PIPELINE_CONTINUE);
3382 3590
3383 3591 ASSERT(checksum == ZIO_CHECKSUM_LABEL);
3384 3592 } else {
3385 3593 if (BP_IS_GANG(bp) && zio->io_child_type == ZIO_CHILD_GANG) {
3386 3594 ASSERT(!IO_IS_ALLOCATING(zio));
3387 3595 checksum = ZIO_CHECKSUM_GANG_HEADER;
3388 3596 } else {
3389 3597 checksum = BP_GET_CHECKSUM(bp);
3390 3598 }
3391 3599 }
3392 3600
3393 3601 zio_checksum_compute(zio, checksum, zio->io_abd, zio->io_size);
3394 3602
3395 3603 return (ZIO_PIPELINE_CONTINUE);
3396 3604 }
3397 3605
3398 3606 static int
3399 3607 zio_checksum_verify(zio_t *zio)
3400 3608 {
3401 3609 zio_bad_cksum_t info;
3402 3610 blkptr_t *bp = zio->io_bp;
3403 3611 int error;
3404 3612
3405 3613 ASSERT(zio->io_vd != NULL);
3406 3614
3407 3615 if (bp == NULL) {
3408 3616 /*
3409 3617 * This is zio_read_phys().
3410 3618 * We're either verifying a label checksum, or nothing at all.
3411 3619 */
3412 3620 if (zio->io_prop.zp_checksum == ZIO_CHECKSUM_OFF)
3413 3621 return (ZIO_PIPELINE_CONTINUE);
3414 3622
3415 3623 ASSERT(zio->io_prop.zp_checksum == ZIO_CHECKSUM_LABEL);
3416 3624 }
3417 3625
3418 3626 if ((error = zio_checksum_error(zio, &info)) != 0) {
3419 3627 zio->io_error = error;
3420 3628 if (error == ECKSUM &&
3421 3629 !(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
3422 3630 zfs_ereport_start_checksum(zio->io_spa,
3423 3631 zio->io_vd, zio, zio->io_offset,
3424 3632 zio->io_size, NULL, &info);
3425 3633 }
3426 3634 }
3427 3635
3428 3636 return (ZIO_PIPELINE_CONTINUE);
3429 3637 }
3430 3638
3431 3639 /*
3432 3640 * Called by RAID-Z to ensure we don't compute the checksum twice.
3433 3641 */
3434 3642 void
3435 3643 zio_checksum_verified(zio_t *zio)
3436 3644 {
3437 3645 zio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
3438 3646 }
3439 3647
3440 3648 /*
3441 3649 * ==========================================================================
3442 3650  * Error rank. Errors are ranked in the order 0, ENXIO, ECKSUM, EIO, other.
3443 3651 * An error of 0 indicates success. ENXIO indicates whole-device failure,
3444 3652  * which may be transient (e.g. unplugged) or permanent. ECKSUM and EIO
3445 3653 * indicate errors that are specific to one I/O, and most likely permanent.
3446 3654 * Any other error is presumed to be worse because we weren't expecting it.
3447 3655 * ==========================================================================
3448 3656 */
3449 3657 int
3450 3658 zio_worst_error(int e1, int e2)
3451 3659 {
3452 3660 static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
3453 3661 int r1, r2;
3454 3662
3455 3663 for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
3456 3664 if (e1 == zio_error_rank[r1])
3457 3665 break;
3458 3666
3459 3667 for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
3460 3668 if (e2 == zio_error_rank[r2])
3461 3669 break;
3462 3670
3463 3671 return (r1 > r2 ? e1 : e2);
3464 3672 }
3465 3673
3466 3674 /*
3467 3675 * ==========================================================================
194 lines elided
3468 3676 * I/O completion
3469 3677 * ==========================================================================
3470 3678 */
3471 3679 static int
3472 3680 zio_ready(zio_t *zio)
3473 3681 {
3474 3682 blkptr_t *bp = zio->io_bp;
3475 3683 zio_t *pio, *pio_next;
3476 3684 zio_link_t *zl = NULL;
3477 3685
3478 - if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT | ZIO_CHILD_DDT_BIT,
3479 - ZIO_WAIT_READY)) {
3686 + if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
3687 + zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_READY))
3480 3688 return (ZIO_PIPELINE_STOP);
3481 - }
3482 3689
3483 3690 if (zio->io_ready) {
3484 3691 ASSERT(IO_IS_ALLOCATING(zio));
3485 3692 ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
3486 3693 (zio->io_flags & ZIO_FLAG_NOPWRITE));
3487 3694 ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
3488 3695
3489 3696 zio->io_ready(zio);
3490 3697 }
3491 3698
3492 3699 if (bp != NULL && bp != &zio->io_bp_copy)
3493 3700 zio->io_bp_copy = *bp;
3494 3701
3 lines elided
3495 3702 if (zio->io_error != 0) {
3496 3703 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
3497 3704
3498 3705 if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
3499 3706 ASSERT(IO_IS_ALLOCATING(zio));
3500 3707 ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
3501 3708 /*
3502 3709 * We were unable to allocate anything, unreserve and
3503 3710 * issue the next I/O to allocate.
3504 3711 */
3505 - metaslab_class_throttle_unreserve(
3506 - spa_normal_class(zio->io_spa),
3712 + metaslab_class_throttle_unreserve(zio->io_mc,
3507 3713 zio->io_prop.zp_copies, zio);
3508 - zio_allocate_dispatch(zio->io_spa);
3714 + zio_allocate_dispatch(zio->io_mc);
3509 3715 }
3510 3716 }
3511 3717
3512 3718 mutex_enter(&zio->io_lock);
3513 3719 zio->io_state[ZIO_WAIT_READY] = 1;
3514 3720 pio = zio_walk_parents(zio, &zl);
3515 3721 mutex_exit(&zio->io_lock);
3516 3722
3517 3723 /*
3518 3724 * As we notify zio's parents, new parents could be added.
3519 3725 * New parents go to the head of zio's io_parent_list, however,
3520 3726 * so we will (correctly) not notify them. The remainder of zio's
3521 3727 * io_parent_list, from 'pio_next' onward, cannot change because
3522 3728 * all parents must wait for us to be done before they can be done.
3523 3729 */
3524 3730 for (; pio != NULL; pio = pio_next) {
3525 3731 pio_next = zio_walk_parents(zio, &zl);
3526 3732 zio_notify_parent(pio, zio, ZIO_WAIT_READY);
3527 3733 }
3528 3734
3529 3735 if (zio->io_flags & ZIO_FLAG_NODATA) {
3530 3736 if (BP_IS_GANG(bp)) {
3531 3737 zio->io_flags &= ~ZIO_FLAG_NODATA;
3532 3738 } else {
3533 3739 ASSERT((uintptr_t)zio->io_abd < SPA_MAXBLOCKSIZE);
3534 3740 zio->io_pipeline &= ~ZIO_VDEV_IO_STAGES;
3535 3741 }
3536 3742 }
3537 3743
3538 3744 if (zio_injection_enabled &&
3539 3745 zio->io_spa->spa_syncing_txg == zio->io_txg)
3540 3746 zio_handle_ignored_writes(zio);
3541 3747
3542 3748 return (ZIO_PIPELINE_CONTINUE);
3543 3749 }
3544 3750
3545 3751 /*
3546 3752 * Update the allocation throttle accounting.
3547 3753 */
3548 3754 static void
3549 3755 zio_dva_throttle_done(zio_t *zio)
3550 3756 {
3551 3757 zio_t *lio = zio->io_logical;
3552 3758 zio_t *pio = zio_unique_parent(zio);
3553 3759 vdev_t *vd = zio->io_vd;
3554 3760 int flags = METASLAB_ASYNC_ALLOC;
3555 3761
3556 3762 ASSERT3P(zio->io_bp, !=, NULL);
3557 3763 ASSERT3U(zio->io_type, ==, ZIO_TYPE_WRITE);
3558 3764 ASSERT3U(zio->io_priority, ==, ZIO_PRIORITY_ASYNC_WRITE);
3559 3765 ASSERT3U(zio->io_child_type, ==, ZIO_CHILD_VDEV);
3560 3766 ASSERT(vd != NULL);
3561 3767 ASSERT3P(vd, ==, vd->vdev_top);
3562 3768 ASSERT(!(zio->io_flags & (ZIO_FLAG_IO_REPAIR | ZIO_FLAG_IO_RETRY)));
3563 3769 ASSERT(zio->io_flags & ZIO_FLAG_IO_ALLOCATING);
3564 3770 ASSERT(!(lio->io_flags & ZIO_FLAG_IO_REWRITE));
3565 3771 ASSERT(!(lio->io_orig_flags & ZIO_FLAG_NODATA));
3566 3772
3567 3773 /*
3568 3774 * Parents of gang children can have two flavors -- ones that
3569 3775 * allocated the gang header (will have ZIO_FLAG_IO_REWRITE set)
3570 3776 * and ones that allocated the constituent blocks. The allocation
3571 3777 * throttle needs to know the allocating parent zio so we must find
3572 3778 * it here.
3573 3779 */
3574 3780 if (pio->io_child_type == ZIO_CHILD_GANG) {
3575 3781 /*
3576 3782 * If our parent is a rewrite gang child then our grandparent
3577 3783 * would have been the one that performed the allocation.
3578 3784 */
3579 3785 if (pio->io_flags & ZIO_FLAG_IO_REWRITE)
3580 3786 pio = zio_unique_parent(pio);
3581 3787 flags |= METASLAB_GANG_CHILD;
3582 3788 }
3583 3789
65 lines elided
3584 3790 ASSERT(IO_IS_ALLOCATING(pio));
3585 3791 ASSERT3P(zio, !=, zio->io_logical);
3586 3792 ASSERT(zio->io_logical != NULL);
3587 3793 ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REPAIR));
3588 3794 ASSERT0(zio->io_flags & ZIO_FLAG_NOPWRITE);
3589 3795
3590 3796 mutex_enter(&pio->io_lock);
3591 3797 metaslab_group_alloc_decrement(zio->io_spa, vd->vdev_id, pio, flags);
3592 3798 mutex_exit(&pio->io_lock);
3593 3799
3594 - metaslab_class_throttle_unreserve(spa_normal_class(zio->io_spa),
3595 - 1, pio);
3800 + metaslab_class_throttle_unreserve(pio->io_mc, 1, pio);
3596 3801
3597 3802 /*
3598 3803 * Call into the pipeline to see if there is more work that
3599 3804 * needs to be done. If there is work to be done it will be
3600 3805 * dispatched to another taskq thread.
3601 3806 */
3602 - zio_allocate_dispatch(zio->io_spa);
3807 + zio_allocate_dispatch(pio->io_mc);
3603 3808 }
3604 3809
3605 3810 static int
3606 3811 zio_done(zio_t *zio)
3607 3812 {
3608 3813 spa_t *spa = zio->io_spa;
3609 3814 zio_t *lio = zio->io_logical;
3610 3815 blkptr_t *bp = zio->io_bp;
3611 3816 vdev_t *vd = zio->io_vd;
3612 3817 uint64_t psize = zio->io_size;
3613 3818 zio_t *pio, *pio_next;
3614 - metaslab_class_t *mc = spa_normal_class(spa);
3819 + metaslab_class_t *mc = zio->io_mc;
3615 3820 zio_link_t *zl = NULL;
3616 3821
3617 3822 /*
3618 3823 * If our children haven't all completed,
3619 3824 * wait for them and then repeat this pipeline stage.
3620 3825 */
3621 - if (zio_wait_for_children(zio, ZIO_CHILD_ALL_BITS, ZIO_WAIT_DONE)) {
3826 + if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
3827 + zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
3828 + zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) ||
3829 + zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
3622 3830 return (ZIO_PIPELINE_STOP);
3623 - }
3624 3831
3625 3832 /*
3626 3833 * If the allocation throttle is enabled, then update the accounting.
3627 3834 * We only track child I/Os that are part of an allocating async
3628 3835 * write. We must do this since the allocation is performed
3629 3836 * by the logical I/O but the actual write is done by child I/Os.
3630 3837 */
3631 3838 if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING &&
3632 3839 zio->io_child_type == ZIO_CHILD_VDEV) {
3633 3840 ASSERT(mc->mc_alloc_throttle_enabled);
3634 3841 zio_dva_throttle_done(zio);
3635 3842 }
3636 3843
3637 3844 /*
3638 3845 * If the allocation throttle is enabled, verify that
3639 3846 * we have decremented the refcounts for every I/O that was throttled.
3640 3847 */
3641 3848 if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
3642 3849 ASSERT(zio->io_type == ZIO_TYPE_WRITE);
3643 3850 ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
3644 3851 ASSERT(bp != NULL);
3645 3852 metaslab_group_alloc_verify(spa, zio->io_bp, zio);
3646 3853 VERIFY(refcount_not_held(&mc->mc_alloc_slots, zio));
3647 3854 }
3648 3855
3649 3856 for (int c = 0; c < ZIO_CHILD_TYPES; c++)
3650 3857 for (int w = 0; w < ZIO_WAIT_TYPES; w++)
3651 3858 ASSERT(zio->io_children[c][w] == 0);
3652 3859
3653 3860 if (bp != NULL && !BP_IS_EMBEDDED(bp)) {
3654 3861 ASSERT(bp->blk_pad[0] == 0);
3655 3862 ASSERT(bp->blk_pad[1] == 0);
3656 3863 ASSERT(bcmp(bp, &zio->io_bp_copy, sizeof (blkptr_t)) == 0 ||
3657 3864 (bp == zio_unique_parent(zio)->io_bp));
3658 3865 if (zio->io_type == ZIO_TYPE_WRITE && !BP_IS_HOLE(bp) &&
3659 3866 zio->io_bp_override == NULL &&
3660 3867 !(zio->io_flags & ZIO_FLAG_IO_REPAIR)) {
3661 3868 ASSERT(!BP_SHOULD_BYTESWAP(bp));
3662 3869 ASSERT3U(zio->io_prop.zp_copies, <=, BP_GET_NDVAS(bp));
3663 3870 ASSERT(BP_COUNT_GANG(bp) == 0 ||
3664 3871 (BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
3665 3872 }
3666 3873 if (zio->io_flags & ZIO_FLAG_NOPWRITE)
3667 3874 VERIFY(BP_EQUAL(bp, &zio->io_bp_orig));
3668 3875 }
3669 3876
3670 3877 /*
3671 3878 * If there were child vdev/gang/ddt errors, they apply to us now.
3672 3879 */
3673 3880 zio_inherit_child_errors(zio, ZIO_CHILD_VDEV);
3674 3881 zio_inherit_child_errors(zio, ZIO_CHILD_GANG);
3675 3882 zio_inherit_child_errors(zio, ZIO_CHILD_DDT);
3676 3883
3677 3884 /*
3678 3885 * If the I/O on the transformed data was successful, generate any
3679 3886 * checksum reports now while we still have the transformed data.
3680 3887 */
3681 3888 if (zio->io_error == 0) {
3682 3889 while (zio->io_cksum_report != NULL) {
3683 3890 zio_cksum_report_t *zcr = zio->io_cksum_report;
3684 3891 uint64_t align = zcr->zcr_align;
3685 3892 uint64_t asize = P2ROUNDUP(psize, align);
3686 3893 char *abuf = NULL;
3687 3894 abd_t *adata = zio->io_abd;
3688 3895
3689 3896 if (asize != psize) {
3690 3897 adata = abd_alloc_linear(asize, B_TRUE);
3691 3898 abd_copy(adata, zio->io_abd, psize);
3692 3899 abd_zero_off(adata, psize, asize - psize);
3693 3900 }
3694 3901
3695 3902 if (adata != NULL)
3696 3903 abuf = abd_borrow_buf_copy(adata, asize);
3697 3904
3698 3905 zio->io_cksum_report = zcr->zcr_next;
3699 3906 zcr->zcr_next = NULL;
3700 3907 zcr->zcr_finish(zcr, abuf);
3701 3908 zfs_ereport_free_checksum(zcr);
3702 3909
3703 3910 if (adata != NULL)
3704 3911 abd_return_buf(adata, abuf, asize);
3705 3912
3706 3913 if (asize != psize)
3707 3914 abd_free(adata);
3708 3915 }
3709 3916 }
3710 3917
3711 3918 zio_pop_transforms(zio); /* note: may set zio->io_error */
3712 3919
3713 3920 vdev_stat_update(zio, psize);
3714 3921
3715 3922 if (zio->io_error) {
3716 3923 /*
3717 3924 * If this I/O is attached to a particular vdev,
3718 3925 * generate an error message describing the I/O failure
3719 3926 * at the block level. We ignore these errors if the
3720 3927 * device is currently unavailable.
3721 3928 */
3722 3929 if (zio->io_error != ECKSUM && vd != NULL && !vdev_is_dead(vd))
3723 3930 zfs_ereport_post(FM_EREPORT_ZFS_IO, spa, vd, zio, 0, 0);
3724 3931
3725 3932 if ((zio->io_error == EIO || !(zio->io_flags &
3726 3933 (ZIO_FLAG_SPECULATIVE | ZIO_FLAG_DONT_PROPAGATE))) &&
3727 3934 zio == lio) {
3728 3935 /*
3729 3936 * For logical I/O requests, tell the SPA to log the
3730 3937 * error and generate a logical data ereport.
3731 3938 */
3732 3939 spa_log_error(spa, zio);
3733 3940 zfs_ereport_post(FM_EREPORT_ZFS_DATA, spa, NULL, zio,
3734 3941 0, 0);
3735 3942 }
3736 3943 }
3737 3944
3738 3945 if (zio->io_error && zio == lio) {
3739 3946 /*
3740 3947 * Determine whether zio should be reexecuted. This will
3741 3948 * propagate all the way to the root via zio_notify_parent().
3742 3949 */
3743 3950 ASSERT(vd == NULL && bp != NULL);
3744 3951 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
3745 3952
3746 3953 if (IO_IS_ALLOCATING(zio) &&
3747 3954 !(zio->io_flags & ZIO_FLAG_CANFAIL)) {
3748 3955 if (zio->io_error != ENOSPC)
3749 3956 zio->io_reexecute |= ZIO_REEXECUTE_NOW;
3750 3957 else
3751 3958 zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3752 3959 }
3753 3960
3754 3961 if ((zio->io_type == ZIO_TYPE_READ ||
3755 3962 zio->io_type == ZIO_TYPE_FREE) &&
3756 3963 !(zio->io_flags & ZIO_FLAG_SCAN_THREAD) &&
3757 3964 zio->io_error == ENXIO &&
3758 3965 spa_load_state(spa) == SPA_LOAD_NONE &&
3759 3966 spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)
3760 3967 zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3761 3968
3762 3969 if (!(zio->io_flags & ZIO_FLAG_CANFAIL) && !zio->io_reexecute)
3763 3970 zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3764 3971
3765 3972 /*
3766 3973 * Here is a possibly good place to attempt to do
3767 3974 * either combinatorial reconstruction or error correction
3768 3975 * based on checksums. It also might be a good place
3769 3976 * to send out preliminary ereports before we suspend
3770 3977 * processing.
3771 3978 */
3772 3979 }
3773 3980
3774 3981 /*
3775 3982 * If there were logical child errors, they apply to us now.
3776 3983 * We defer this until now to avoid conflating logical child
3777 3984 * errors with errors that happened to the zio itself when
3778 3985 * updating vdev stats and reporting FMA events above.
3779 3986 */
3780 3987 zio_inherit_child_errors(zio, ZIO_CHILD_LOGICAL);
3781 3988
3782 3989 if ((zio->io_error || zio->io_reexecute) &&
3783 3990 IO_IS_ALLOCATING(zio) && zio->io_gang_leader == zio &&
3784 3991 !(zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)))
3785 3992 zio_dva_unallocate(zio, zio->io_gang_tree, bp);
3786 3993
3787 3994 zio_gang_tree_free(&zio->io_gang_tree);
3788 3995
3789 3996 /*
3790 3997 * Godfather I/Os should never suspend.
3791 3998 */
3792 3999 if ((zio->io_flags & ZIO_FLAG_GODFATHER) &&
3793 4000 (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND))
3794 4001 zio->io_reexecute = 0;
3795 4002
3796 4003 if (zio->io_reexecute) {
3797 4004 /*
3798 4005 * This is a logical I/O that wants to reexecute.
3799 4006 *
3800 4007 * Reexecute is top-down. When an i/o fails, if it's not
3801 4008 * the root, it simply notifies its parent and sticks around.
3802 4009 * The parent, seeing that it still has children in zio_done(),
3803 4010 * does the same. This percolates all the way up to the root.
3804 4011 * The root i/o will reexecute or suspend the entire tree.
3805 4012 *
3806 4013 * This approach ensures that zio_reexecute() honors
3807 4014 * all the original i/o dependency relationships, e.g.
3808 4015 * parents not executing until children are ready.
3809 4016 */
3810 4017 ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
3811 4018
3812 4019 zio->io_gang_leader = NULL;
3813 4020
3814 4021 mutex_enter(&zio->io_lock);
3815 4022 zio->io_state[ZIO_WAIT_DONE] = 1;
3816 4023 mutex_exit(&zio->io_lock);
3817 4024
3818 4025 /*
3819 4026 * "The Godfather" I/O monitors its children but is
3820 4027 * not a true parent to them. It will track them through
3821 4028 * the pipeline but severs its ties whenever they get into
3822 4029 * trouble (e.g. suspended). This allows "The Godfather"
3823 4030 * I/O to return status without blocking.
3824 4031 */
3825 4032 zl = NULL;
3826 4033 for (pio = zio_walk_parents(zio, &zl); pio != NULL;
3827 4034 pio = pio_next) {
3828 4035 zio_link_t *remove_zl = zl;
3829 4036 pio_next = zio_walk_parents(zio, &zl);
3830 4037
3831 4038 if ((pio->io_flags & ZIO_FLAG_GODFATHER) &&
3832 4039 (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND)) {
3833 4040 zio_remove_child(pio, zio, remove_zl);
3834 4041 zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3835 4042 }
3836 4043 }
3837 4044
3838 4045 if ((pio = zio_unique_parent(zio)) != NULL) {
3839 4046 /*
3840 4047 * We're not a root i/o, so there's nothing to do
3841 4048 * but notify our parent. Don't propagate errors
3842 4049 * upward since we haven't permanently failed yet.
3843 4050 */
3844 4051 ASSERT(!(zio->io_flags & ZIO_FLAG_GODFATHER));
3845 4052 zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
3846 4053 zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3847 4054 } else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
3848 4055 /*
3849 4056 * We'd fail again if we reexecuted now, so suspend
3850 4057 * until conditions improve (e.g. device comes online).
3851 4058 */
3852 4059 zio_suspend(spa, zio);
3853 4060 } else {
3854 4061 /*
3855 4062 * Reexecution is potentially a huge amount of work.
3856 4063 * Hand it off to the otherwise-unused claim taskq.
3857 4064 */
3858 4065 ASSERT(zio->io_tqent.tqent_next == NULL);
3859 4066 spa_taskq_dispatch_ent(spa, ZIO_TYPE_CLAIM,
3860 4067 ZIO_TASKQ_ISSUE, (task_func_t *)zio_reexecute, zio,
3861 4068 0, &zio->io_tqent);
3862 4069 }
3863 4070 return (ZIO_PIPELINE_STOP);
3864 4071 }
3865 4072
3866 4073 ASSERT(zio->io_child_count == 0);
3867 4074 ASSERT(zio->io_reexecute == 0);
3868 4075 ASSERT(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL));
3869 4076
3870 4077 /*
3871 4078 * Report any checksum errors, since the I/O is complete.
3872 4079 */
3873 4080 while (zio->io_cksum_report != NULL) {
3874 4081 zio_cksum_report_t *zcr = zio->io_cksum_report;
3875 4082 zio->io_cksum_report = zcr->zcr_next;
3876 4083 zcr->zcr_next = NULL;
3877 4084 zcr->zcr_finish(zcr, NULL);
3878 4085 zfs_ereport_free_checksum(zcr);
3879 4086 }
3880 4087
3881 4088 /*
3882 4089 * It is the responsibility of the done callback to ensure that this
3883 4090 * particular zio is no longer discoverable for adoption, and as
3884 4091 * such, cannot acquire any new parents.
3885 4092 */
3886 4093 if (zio->io_done)
3887 4094 zio->io_done(zio);
3888 4095
3889 4096 mutex_enter(&zio->io_lock);
3890 4097 zio->io_state[ZIO_WAIT_DONE] = 1;
3891 4098 mutex_exit(&zio->io_lock);
3892 4099
3893 4100 zl = NULL;
3894 4101 for (pio = zio_walk_parents(zio, &zl); pio != NULL; pio = pio_next) {
3895 4102 zio_link_t *remove_zl = zl;
3896 4103 pio_next = zio_walk_parents(zio, &zl);
3897 4104 zio_remove_child(pio, zio, remove_zl);
3898 4105 zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3899 4106 }
3900 4107
3901 4108 if (zio->io_waiter != NULL) {
3902 4109 mutex_enter(&zio->io_lock);
269 lines elided
3903 4110 zio->io_executor = NULL;
3904 4111 cv_broadcast(&zio->io_cv);
3905 4112 mutex_exit(&zio->io_lock);
3906 4113 } else {
3907 4114 zio_destroy(zio);
3908 4115 }
3909 4116
3910 4117 return (ZIO_PIPELINE_STOP);
3911 4118 }
3912 4119
4120 +zio_t *
4121 +zio_wbc(zio_type_t type, vdev_t *vd, abd_t *data,
4122 + uint64_t size, uint64_t offset)
4123 +{
4124 + zio_t *zio = NULL;
4125 +
4126 + switch (type) {
4127 + case ZIO_TYPE_WRITE:
4128 + zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
4129 + size, NULL, NULL, ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
4130 + ZIO_FLAG_PHYSICAL, vd, offset,
4131 + NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
4132 + break;
4133 + case ZIO_TYPE_READ:
4134 + zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
4135 + size, NULL, NULL, ZIO_TYPE_READ, ZIO_PRIORITY_ASYNC_READ,
4136 + ZIO_FLAG_DONT_CACHE | ZIO_FLAG_PHYSICAL, vd, offset,
4137 + NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
4138 + break;
4139 + default:
4140 + ASSERT(0);
4141 + }
4142 +
4143 + zio->io_prop.zp_checksum = ZIO_CHECKSUM_OFF;
4144 +
4145 + return (zio);
4146 +}
4147 +
3913 4148 /*
3914 4149 * ==========================================================================
3915 4150 * I/O pipeline definition
3916 4151 * ==========================================================================
3917 4152 */
3918 4153 static zio_pipe_stage_t *zio_pipeline[] = {
3919 4154 NULL,
3920 4155 zio_read_bp_init,
3921 4156 zio_write_bp_init,
3922 4157 zio_free_bp_init,
3923 4158 zio_issue_async,
3924 4159 zio_write_compress,
3925 4160 zio_checksum_generate,
3926 4161 zio_nop_write,
3927 4162 zio_ddt_read_start,
3928 4163 zio_ddt_read_done,
3929 4164 zio_ddt_write,
3930 4165 zio_ddt_free,
3931 4166 zio_gang_assemble,
3932 4167 zio_gang_issue,
3933 4168 zio_dva_throttle,
3934 4169 zio_dva_allocate,
3935 4170 zio_dva_free,
3936 4171 zio_dva_claim,
3937 4172 zio_ready,
3938 4173 zio_vdev_io_start,
3939 4174 zio_vdev_io_done,
3940 4175 zio_vdev_io_assess,
3941 4176 zio_checksum_verify,
3942 4177 zio_done
3943 4178 };
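Each zio carries an io_pipeline bitmask of ZIO_STAGE_* bits, and zio_execute() walks that mask to decide which entries of this table actually run. The following is a simplified sketch of that dispatch loop (the real function also handles the ZIO_STAGE_ISSUE_ASYNC taskq hand-off and config-lock recursion; only the core advance-and-call logic is shown, to illustrate how the table is indexed):

	static void
	zio_execute_sketch(zio_t *zio)
	{
		while (zio->io_stage < ZIO_STAGE_DONE) {
			enum zio_stage pipeline = zio->io_pipeline;
			enum zio_stage stage = zio->io_stage;
			int rv;

			/* Advance to the next stage that is set in this zio's pipeline. */
			do {
				stage <<= 1;
			} while ((stage & pipeline) == 0);
			ASSERT3U(stage, <=, ZIO_STAGE_DONE);

			zio->io_stage = stage;
			rv = zio_pipeline[highbit64(stage) - 1](zio);
			if (rv == ZIO_PIPELINE_STOP)
				return;
			ASSERT3U(rv, ==, ZIO_PIPELINE_CONTINUE);
		}
	}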
3944 4179
3945 4180
3946 4181
3947 4182
3948 4183 /*
3949 4184 * Compare two zbookmark_phys_t's to see which we would reach first in a
3950 4185 * pre-order traversal of the object tree.
3951 4186 *
3952 4187 * This is simple in every case aside from the meta-dnode object. For all other
3953 4188 * objects, we traverse them in order (object 1 before object 2, and so on).
3954 4189 * However, all of these objects are traversed while traversing object 0, since
3955 4190 * the data it points to is the list of objects. Thus, we need to convert to a
3956 4191 * canonical representation so we can compare meta-dnode bookmarks to
3957 4192 * non-meta-dnode bookmarks.
3958 4193 *
3959 4194 * We do this by calculating "equivalents" for each field of the zbookmark.
3960 4195 * zbookmarks outside of the meta-dnode use their own object and level, and
3961 4196 * calculate the level 0 equivalent (the first L0 blkid that is contained in the
3962 4197 * blocks this bookmark refers to) by multiplying their blkid by their span
3963 4198 * (the number of L0 blocks contained within one block at their level).
3964 4199 * zbookmarks inside the meta-dnode calculate their object equivalent
3965 4200 * (which is L0equiv * dnodes per data block), use 0 for their L0equiv, and use
3966 4201  * level + (1<<31) (a value larger than any level could ever be) for their level.
3967 4202 * This causes them to always compare before a bookmark in their object
3968 4203 * equivalent, compare appropriately to bookmarks in other objects, and to
3969 4204 * compare appropriately to other bookmarks in the meta-dnode.
3970 4205 */
3971 4206 int
3972 4207 zbookmark_compare(uint16_t dbss1, uint8_t ibs1, uint16_t dbss2, uint8_t ibs2,
3973 4208 const zbookmark_phys_t *zb1, const zbookmark_phys_t *zb2)
3974 4209 {
3975 4210 /*
3976 4211 * These variables represent the "equivalent" values for the zbookmark,
3977 4212 * after converting zbookmarks inside the meta dnode to their
3978 4213 * normal-object equivalents.
3979 4214 */
3980 4215 uint64_t zb1obj, zb2obj;
3981 4216 uint64_t zb1L0, zb2L0;
3982 4217 uint64_t zb1level, zb2level;
3983 4218
3984 4219 if (zb1->zb_object == zb2->zb_object &&
3985 4220 zb1->zb_level == zb2->zb_level &&
3986 4221 zb1->zb_blkid == zb2->zb_blkid)
3987 4222 return (0);
3988 4223
3989 4224 /*
3990 4225 * BP_SPANB calculates the span in blocks.
3991 4226 */
3992 4227 zb1L0 = (zb1->zb_blkid) * BP_SPANB(ibs1, zb1->zb_level);
3993 4228 zb2L0 = (zb2->zb_blkid) * BP_SPANB(ibs2, zb2->zb_level);
3994 4229
3995 4230 if (zb1->zb_object == DMU_META_DNODE_OBJECT) {
3996 4231 zb1obj = zb1L0 * (dbss1 << (SPA_MINBLOCKSHIFT - DNODE_SHIFT));
3997 4232 zb1L0 = 0;
3998 4233 zb1level = zb1->zb_level + COMPARE_META_LEVEL;
3999 4234 } else {
4000 4235 zb1obj = zb1->zb_object;
4001 4236 zb1level = zb1->zb_level;
4002 4237 }
4003 4238
4004 4239 if (zb2->zb_object == DMU_META_DNODE_OBJECT) {
4005 4240 zb2obj = zb2L0 * (dbss2 << (SPA_MINBLOCKSHIFT - DNODE_SHIFT));
4006 4241 zb2L0 = 0;
4007 4242 zb2level = zb2->zb_level + COMPARE_META_LEVEL;
4008 4243 } else {
4009 4244 zb2obj = zb2->zb_object;
4010 4245 zb2level = zb2->zb_level;
4011 4246 }
4012 4247
4013 4248 /* Now that we have a canonical representation, do the comparison. */
4014 4249 if (zb1obj != zb2obj)
4015 4250 return (zb1obj < zb2obj ? -1 : 1);
4016 4251 else if (zb1L0 != zb2L0)
4017 4252 return (zb1L0 < zb2L0 ? -1 : 1);
4018 4253 else if (zb1level != zb2level)
4019 4254 return (zb1level > zb2level ? -1 : 1);
4020 4255 /*
4021 4256  * This can (theoretically) happen if the bookmarks have the same object
4022 4257  * and level but different blkids, which requires their block sizes to
4023 4258  * differ.  There is presently no way to change the indirect block sizes.
4024 4259 */
4025 4260 return (0);
4026 4261 }
4027 4262
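A concrete example of the canonicalization may help. Assume, purely for illustration, 16K dnode blocks, so dn_datablkszsec == 32 and one level-0 meta-dnode block covers 32 dnodes:

	/*
	 * mdn: level-0 block 2 of the meta-dnode, i.e. the block holding the
	 *      dnodes for objects 64..95.
	 * obj: level-0 block 0 of object 64.
	 */
	zbookmark_phys_t mdn = { 0, DMU_META_DNODE_OBJECT, 0, 2 };
	zbookmark_phys_t obj = { 0, 64, 0, 0 };

	/*
	 * Canonical forms: mdn becomes object 2 * 32 = 64, L0 0, level
	 * 0 + COMPARE_META_LEVEL; obj stays object 64, L0 0, level 0.  The
	 * objects and L0 equivalents tie, and the larger level sorts first,
	 * so the meta-dnode block compares before the data it describes:
	 */
	ASSERT3S(zbookmark_compare(32, 0, 32, 0, &mdn, &obj), ==, -1);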
4028 4263 /*
4029 4264 * This function checks the following: given that last_block is the place that
4030 4265 * our traversal stopped last time, does that guarantee that we've visited
4031 4266  * every node under subtree_root?  Because subtree_root itself sorts before
4032 4267  * its children, the raw output of zbookmark_compare is not enough.  We pass
4033 4268  * in a modified version of subtree_root with its block id incremented; checking
4034 4269  * whether that sorts at or before last_block tells us whether or not having
4035 4270 * visited last_block implies that all of subtree_root's children have been
4036 4271 * visited.
4037 4272 */
4038 4273 boolean_t
4039 4274 zbookmark_subtree_completed(const dnode_phys_t *dnp,
4040 4275 const zbookmark_phys_t *subtree_root, const zbookmark_phys_t *last_block)
4041 4276 {
4042 4277 zbookmark_phys_t mod_zb = *subtree_root;
4043 4278 mod_zb.zb_blkid++;
4044 4279 ASSERT(last_block->zb_level == 0);
4045 4280
4046 4281 /* The objset_phys_t isn't before anything. */
4047 4282 if (dnp == NULL)
4048 4283 return (B_FALSE);
4049 4284
4050 4285 /*
4051 4286 * We pass in 1ULL << (DNODE_BLOCK_SHIFT - SPA_MINBLOCKSHIFT) for the
4052 4287 * data block size in sectors, because that variable is only used if
4053 4288 * the bookmark refers to a block in the meta-dnode. Since we don't
4054 4289 * know without examining it what object it refers to, and there's no
4055 4290 * harm in passing in this value in other cases, we always pass it in.
4056 4291 *
4057 4292 * We pass in 0 for the indirect block size shift because zb2 must be
4058 4293 * level 0. The indirect block size is only used to calculate the span
4059 4294 * of the bookmark, but since the bookmark must be level 0, the span is
4060 4295 * always 1, so the math works out.
4061 4296 *
4062 4297 * If you make changes to how the zbookmark_compare code works, be sure
4063 4298 * to make sure that this code still works afterwards.
4064 4299 */
4065 4300 return (zbookmark_compare(dnp->dn_datablkszsec, dnp->dn_indblkshift,
4066 4301 1ULL << (DNODE_BLOCK_SHIFT - SPA_MINBLOCKSHIFT), 0, &mod_zb,
4067 4302 last_block) <= 0);
4068 4303 }
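As a usage sketch (the helper below is hypothetical, not part of this file), resumable traversal code can use this predicate to prune a whole subtree once the saved bookmark shows it has already been covered:

	static boolean_t
	traverse_should_prune(const dnode_phys_t *dnp, const zbookmark_phys_t *zb,
	    const zbookmark_phys_t *resume)
	{
		/*
		 * If every block under 'zb' sorts at or before the bookmark we
		 * stopped on last time, the subtree was fully visited on the
		 * previous pass and the caller can skip descending into it.
		 */
		if (resume != NULL && zbookmark_subtree_completed(dnp, zb, resume))
			return (B_TRUE);
		return (B_FALSE);
	}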