re #13729 assign each ARC hash bucket its own mutex
In the ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM, but the number of locks protecting
the headers in those buckets is fixed at 256. Hence, on systems with large
memory (>= 128GB), many unrelated buffer headers are protected by the same
mutex.
When the memory in the system is fragmented, this can lead to a deadlock:
- An arc_read() thread may be trying to allocate a 128K buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread, which is supposed to evict some buffers and free
up memory, calls an evict callback on one of those buffers.
- Before freeing the memory, the callback attempts to take the lock on the
buffer header.
- Incidentally, that buffer header is protected by the same lock as the one
held by the arc_read() thread (see the sketch after this list).
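A minimal sketch of the lock aliasing in the original scheme (illustration
only, not part of the patch): any two bucket indexes that are congruent
mod BUF_LOCKS (256) map to the same mutex.

    #include <stdio.h>

    #define BUF_LOCKS   256

    /* Original mapping: a bucket's lock is chosen by masking the bucket
     * index down to one of only 256 locks. */
    static unsigned long
    old_lock_index(unsigned long bucket_idx)
    {
            return (bucket_idx & (BUF_LOCKS - 1));
    }

    int
    main(void)
    {
            /* Two hypothetical, unrelated buckets in a 2M-bucket table. */
            unsigned long a = 5;
            unsigned long b = 5 + 256UL * 1234;

            printf("bucket %lu -> lock %lu\n", a, old_lock_index(a));
            printf("bucket %lu -> lock %lu\n", b, old_lock_index(b));
            /* Both print lock 5: the arc_read() holder and the eviction
             * callback can end up blocked behind the same mutex. */
            return (0);
    }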
The solution in this patch is not perfect - it still protects all of the
headers in a given hash bucket with the same lock. However, the probability
of such a collision is very low and does not depend on memory size.
By the same argument, padding the locks out to a cache line would be a waste
of memory here, since the probability of contention on a cache line is quite
low given the number of buckets, the number of locks per cache line (4), and
the fact that the hash function (crc64 % hash table size) is supposed to be a
very good randomizer.
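For reference, the core of the new scheme, condensed from the diff below
(comments added here): each bucket carries its own chain head and mutex, and
the lock is selected directly by bucket index rather than through a separate,
smaller lock array.

    struct ht_table {
            arc_buf_hdr_t *hdr;     /* head of this bucket's hash chain */
            kmutex_t lock;          /* protects all headers in this bucket */
    };

    typedef struct buf_hash_table {
            uint64_t ht_mask;
            struct ht_table *ht_table;
    } buf_hash_table_t;

    #define BUF_HASH_LOCK(idx)      (&buf_hash_table.ht_table[idx].lock)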
The effect on memory usage is as follows. For a hash table with n buckets:
- The original code uses 16K + 16 + n * 8 bytes of memory.
- This fix uses 2 * n * 8 + 8 bytes of memory.
- The net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size. For 128GB of
physical memory it is 2M, so the memory overhead is 16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller, and
for larger memory configurations it is proportionally bigger (a quick sanity
check of the arithmetic is sketched below).
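A quick sanity check of that accounting (illustration only, assuming 8-byte
pointers and an 8-byte kmutex_t, as the figures above imply):

    #include <stdio.h>

    /* Old scheme: 256 locks padded to 64 bytes each, the ht_mask and
     * ht_table fields (16 bytes), plus n chain-head pointers. */
    static unsigned long long
    old_bytes(unsigned long long n)
    {
            return (256ULL * 64 + 16 + n * 8);
    }

    /* New scheme: n entries of { pointer, mutex } (16 bytes each),
     * plus the 8-byte ht_mask. */
    static unsigned long long
    new_bytes(unsigned long long n)
    {
            return (2 * n * 8 + 8);
    }

    int
    main(void)
    {
            unsigned long long n = 2ULL << 20;  /* 2M buckets (128GB of RAM) */

            printf("overhead = %llu bytes\n", new_bytes(n) - old_bytes(n));
            /* Prints 16760824, which is 16M - 16K - 8. */
            return (0);
    }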
The patch has been tested for 30+ hours using a vdbench script that, with the
original code, reproduces the hang 100% of the time within 20-30 minutes.
--- old/usr/src/uts/common/fs/zfs/arc.c
+++ new/usr/src/uts/common/fs/zfs/arc.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
23 23 * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
24 24 * Copyright (c) 2013 by Delphix. All rights reserved.
25 25 */
26 26
27 27 /*
28 28 * DVA-based Adjustable Replacement Cache
29 29 *
30 30 * While much of the theory of operation used here is
31 31 * based on the self-tuning, low overhead replacement cache
32 32 * presented by Megiddo and Modha at FAST 2003, there are some
33 33 * significant differences:
34 34 *
35 35 * 1. The Megiddo and Modha model assumes any page is evictable.
36 36 * Pages in its cache cannot be "locked" into memory. This makes
37 37 * the eviction algorithm simple: evict the last page in the list.
38 38 * This also make the performance characteristics easy to reason
39 39 * about. Our cache is not so simple. At any given moment, some
40 40 * subset of the blocks in the cache are un-evictable because we
41 41 * have handed out a reference to them. Blocks are only evictable
42 42 * when there are no external references active. This makes
43 43 * eviction far more problematic: we choose to evict the evictable
44 44 * blocks that are the "lowest" in the list.
45 45 *
46 46 * There are times when it is not possible to evict the requested
47 47 * space. In these circumstances we are unable to adjust the cache
48 48 * size. To prevent the cache growing unbounded at these times we
49 49 * implement a "cache throttle" that slows the flow of new data
50 50 * into the cache until we can make space available.
51 51 *
52 52 * 2. The Megiddo and Modha model assumes a fixed cache size.
53 53 * Pages are evicted when the cache is full and there is a cache
54 54 * miss. Our model has a variable sized cache. It grows with
55 55 * high use, but also tries to react to memory pressure from the
56 56 * operating system: decreasing its size when system memory is
57 57 * tight.
58 58 *
59 59 * 3. The Megiddo and Modha model assumes a fixed page size. All
60 60 * elements of the cache are therefor exactly the same size. So
61 61 * when adjusting the cache size following a cache miss, its simply
62 62 * a matter of choosing a single page to evict. In our model, we
63 63 * have variable sized cache blocks (rangeing from 512 bytes to
64 64 * 128K bytes). We therefor choose a set of blocks to evict to make
65 65 * space for a cache miss that approximates as closely as possible
66 66 * the space used by the new block.
67 67 *
68 68 * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
69 69 * by N. Megiddo & D. Modha, FAST 2003
70 70 */
71 71
72 72 /*
73 73 * The locking model:
74 74 *
75 75 * A new reference to a cache buffer can be obtained in two
76 76 * ways: 1) via a hash table lookup using the DVA as a key,
77 77 * or 2) via one of the ARC lists. The arc_read() interface
78 78 * uses method 1, while the internal arc algorithms for
79 79 * adjusting the cache use method 2. We therefor provide two
80 80 * types of locks: 1) the hash table lock array, and 2) the
81 81 * arc list locks.
82 82 *
83 83 * Buffers do not have their own mutexes, rather they rely on the
84 84 * hash table mutexes for the bulk of their protection (i.e. most
85 85 * fields in the arc_buf_hdr_t are protected by these mutexes).
86 86 *
87 87 * buf_hash_find() returns the appropriate mutex (held) when it
88 88 * locates the requested buffer in the hash table. It returns
89 89 * NULL for the mutex if the buffer was not in the table.
90 90 *
91 91 * buf_hash_remove() expects the appropriate hash mutex to be
92 92 * already held before it is invoked.
93 93 *
94 94 * Each arc state also has a mutex which is used to protect the
95 95 * buffer list associated with the state. When attempting to
96 96 * obtain a hash table lock while holding an arc list lock you
97 97 * must use: mutex_tryenter() to avoid deadlock. Also note that
98 98 * the active state mutex must be held before the ghost state mutex.
99 99 *
100 100 * Arc buffers may have an associated eviction callback function.
101 101 * This function will be invoked prior to removing the buffer (e.g.
102 102 * in arc_do_user_evicts()). Note however that the data associated
103 103 * with the buffer may be evicted prior to the callback. The callback
104 104 * must be made with *no locks held* (to prevent deadlock). Additionally,
105 105 * the users of callbacks must ensure that their private data is
106 106 * protected from simultaneous callbacks from arc_buf_evict()
107 107 * and arc_do_user_evicts().
108 108 *
109 109 * Note that the majority of the performance stats are manipulated
110 110 * with atomic operations.
111 111 *
112 112 * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
113 113 *
114 114 * - L2ARC buflist creation
115 115 * - L2ARC buflist eviction
116 116 * - L2ARC write completion, which walks L2ARC buflists
117 117 * - ARC header destruction, as it removes from L2ARC buflists
118 118 * - ARC header release, as it removes from L2ARC buflists
119 119 */
120 120
121 121 #include <sys/spa.h>
122 122 #include <sys/zio.h>
123 123 #include <sys/zfs_context.h>
124 124 #include <sys/arc.h>
125 125 #include <sys/refcount.h>
126 126 #include <sys/vdev.h>
127 127 #include <sys/vdev_impl.h>
128 128 #ifdef _KERNEL
129 129 #include <sys/vmsystm.h>
130 130 #include <vm/anon.h>
131 131 #include <sys/fs/swapnode.h>
132 132 #include <sys/dnlc.h>
133 133 #endif
134 134 #include <sys/callb.h>
135 135 #include <sys/kstat.h>
136 136 #include <zfs_fletcher.h>
137 137
138 138 #ifndef _KERNEL
139 139 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
140 140 boolean_t arc_watch = B_FALSE;
141 141 int arc_procfd;
142 142 #endif
143 143
144 144 static kmutex_t arc_reclaim_thr_lock;
145 145 static kcondvar_t arc_reclaim_thr_cv; /* used to signal reclaim thr */
146 146 static uint8_t arc_thread_exit;
147 147
148 148 extern int zfs_write_limit_shift;
149 149 extern uint64_t zfs_write_limit_max;
150 150 extern kmutex_t zfs_write_limit_lock;
151 151
152 152 #define ARC_REDUCE_DNLC_PERCENT 3
153 153 uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;
154 154
155 155 typedef enum arc_reclaim_strategy {
156 156 ARC_RECLAIM_AGGR, /* Aggressive reclaim strategy */
157 157 ARC_RECLAIM_CONS /* Conservative reclaim strategy */
158 158 } arc_reclaim_strategy_t;
159 159
160 160 /* number of seconds before growing cache again */
161 161 static int arc_grow_retry = 60;
162 162
163 163 /* shift of arc_c for calculating both min and max arc_p */
164 164 static int arc_p_min_shift = 4;
165 165
166 166 /* log2(fraction of arc to reclaim) */
167 167 static int arc_shrink_shift = 5;
168 168
169 169 /*
170 170 * minimum lifespan of a prefetch block in clock ticks
171 171 * (initialized in arc_init())
172 172 */
173 173 static int arc_min_prefetch_lifespan;
174 174
175 175 static int arc_dead;
176 176
177 177 /*
178 178 * The arc has filled available memory and has now warmed up.
179 179 */
180 180 static boolean_t arc_warm;
181 181
182 182 /*
183 183 * These tunables are for performance analysis.
184 184 */
185 185 uint64_t zfs_arc_max;
186 186 uint64_t zfs_arc_min;
187 187 uint64_t zfs_arc_meta_limit = 0;
188 188 int zfs_arc_grow_retry = 0;
189 189 int zfs_arc_shrink_shift = 0;
190 190 int zfs_arc_p_min_shift = 0;
191 191 int zfs_disable_dup_eviction = 0;
192 192
193 193 /*
194 194 * Note that buffers can be in one of 6 states:
195 195 * ARC_anon - anonymous (discussed below)
196 196 * ARC_mru - recently used, currently cached
197 197 * ARC_mru_ghost - recentely used, no longer in cache
198 198 * ARC_mfu - frequently used, currently cached
199 199 * ARC_mfu_ghost - frequently used, no longer in cache
200 200 * ARC_l2c_only - exists in L2ARC but not other states
201 201 * When there are no active references to the buffer, they are
202 202 * are linked onto a list in one of these arc states. These are
203 203 * the only buffers that can be evicted or deleted. Within each
204 204 * state there are multiple lists, one for meta-data and one for
205 205 * non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
206 206 * etc.) is tracked separately so that it can be managed more
207 207 * explicitly: favored over data, limited explicitly.
208 208 *
209 209 * Anonymous buffers are buffers that are not associated with
210 210 * a DVA. These are buffers that hold dirty block copies
211 211 * before they are written to stable storage. By definition,
212 212 * they are "ref'd" and are considered part of arc_mru
213 213 * that cannot be freed. Generally, they will aquire a DVA
214 214 * as they are written and migrate onto the arc_mru list.
215 215 *
216 216 * The ARC_l2c_only state is for buffers that are in the second
217 217 * level ARC but no longer in any of the ARC_m* lists. The second
218 218 * level ARC itself may also contain buffers that are in any of
219 219 * the ARC_m* states - meaning that a buffer can exist in two
220 220 * places. The reason for the ARC_l2c_only state is to keep the
221 221 * buffer header in the hash table, so that reads that hit the
222 222 * second level ARC benefit from these fast lookups.
223 223 */
224 224
225 225 typedef struct arc_state {
226 226 list_t arcs_list[ARC_BUFC_NUMTYPES]; /* list of evictable buffers */
227 227 uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
228 228 uint64_t arcs_size; /* total amount of data in this state */
229 229 kmutex_t arcs_mtx;
230 230 } arc_state_t;
231 231
232 232 /* The 6 states: */
233 233 static arc_state_t ARC_anon;
234 234 static arc_state_t ARC_mru;
235 235 static arc_state_t ARC_mru_ghost;
236 236 static arc_state_t ARC_mfu;
237 237 static arc_state_t ARC_mfu_ghost;
238 238 static arc_state_t ARC_l2c_only;
239 239
240 240 typedef struct arc_stats {
241 241 kstat_named_t arcstat_hits;
242 242 kstat_named_t arcstat_misses;
243 243 kstat_named_t arcstat_demand_data_hits;
244 244 kstat_named_t arcstat_demand_data_misses;
245 245 kstat_named_t arcstat_demand_metadata_hits;
246 246 kstat_named_t arcstat_demand_metadata_misses;
247 247 kstat_named_t arcstat_prefetch_data_hits;
248 248 kstat_named_t arcstat_prefetch_data_misses;
249 249 kstat_named_t arcstat_prefetch_metadata_hits;
250 250 kstat_named_t arcstat_prefetch_metadata_misses;
251 251 kstat_named_t arcstat_mru_hits;
252 252 kstat_named_t arcstat_mru_ghost_hits;
253 253 kstat_named_t arcstat_mfu_hits;
254 254 kstat_named_t arcstat_mfu_ghost_hits;
255 255 kstat_named_t arcstat_deleted;
256 256 kstat_named_t arcstat_recycle_miss;
257 257 kstat_named_t arcstat_mutex_miss;
258 258 kstat_named_t arcstat_evict_skip;
259 259 kstat_named_t arcstat_evict_l2_cached;
260 260 kstat_named_t arcstat_evict_l2_eligible;
261 261 kstat_named_t arcstat_evict_l2_ineligible;
262 262 kstat_named_t arcstat_hash_elements;
263 263 kstat_named_t arcstat_hash_elements_max;
264 264 kstat_named_t arcstat_hash_collisions;
265 265 kstat_named_t arcstat_hash_chains;
266 266 kstat_named_t arcstat_hash_chain_max;
267 267 kstat_named_t arcstat_p;
268 268 kstat_named_t arcstat_c;
269 269 kstat_named_t arcstat_c_min;
270 270 kstat_named_t arcstat_c_max;
271 271 kstat_named_t arcstat_size;
272 272 kstat_named_t arcstat_hdr_size;
273 273 kstat_named_t arcstat_data_size;
274 274 kstat_named_t arcstat_other_size;
275 275 kstat_named_t arcstat_l2_hits;
276 276 kstat_named_t arcstat_l2_misses;
277 277 kstat_named_t arcstat_l2_feeds;
278 278 kstat_named_t arcstat_l2_rw_clash;
279 279 kstat_named_t arcstat_l2_read_bytes;
280 280 kstat_named_t arcstat_l2_write_bytes;
281 281 kstat_named_t arcstat_l2_writes_sent;
282 282 kstat_named_t arcstat_l2_writes_done;
283 283 kstat_named_t arcstat_l2_writes_error;
284 284 kstat_named_t arcstat_l2_writes_hdr_miss;
285 285 kstat_named_t arcstat_l2_evict_lock_retry;
286 286 kstat_named_t arcstat_l2_evict_reading;
287 287 kstat_named_t arcstat_l2_free_on_write;
288 288 kstat_named_t arcstat_l2_abort_lowmem;
289 289 kstat_named_t arcstat_l2_cksum_bad;
290 290 kstat_named_t arcstat_l2_io_error;
291 291 kstat_named_t arcstat_l2_size;
292 292 kstat_named_t arcstat_l2_hdr_size;
293 293 kstat_named_t arcstat_memory_throttle_count;
294 294 kstat_named_t arcstat_duplicate_buffers;
295 295 kstat_named_t arcstat_duplicate_buffers_size;
296 296 kstat_named_t arcstat_duplicate_reads;
297 297 kstat_named_t arcstat_meta_used;
298 298 kstat_named_t arcstat_meta_limit;
299 299 kstat_named_t arcstat_meta_max;
300 300 } arc_stats_t;
301 301
302 302 static arc_stats_t arc_stats = {
303 303 { "hits", KSTAT_DATA_UINT64 },
304 304 { "misses", KSTAT_DATA_UINT64 },
305 305 { "demand_data_hits", KSTAT_DATA_UINT64 },
306 306 { "demand_data_misses", KSTAT_DATA_UINT64 },
307 307 { "demand_metadata_hits", KSTAT_DATA_UINT64 },
308 308 { "demand_metadata_misses", KSTAT_DATA_UINT64 },
309 309 { "prefetch_data_hits", KSTAT_DATA_UINT64 },
310 310 { "prefetch_data_misses", KSTAT_DATA_UINT64 },
311 311 { "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
312 312 { "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
313 313 { "mru_hits", KSTAT_DATA_UINT64 },
314 314 { "mru_ghost_hits", KSTAT_DATA_UINT64 },
315 315 { "mfu_hits", KSTAT_DATA_UINT64 },
316 316 { "mfu_ghost_hits", KSTAT_DATA_UINT64 },
317 317 { "deleted", KSTAT_DATA_UINT64 },
318 318 { "recycle_miss", KSTAT_DATA_UINT64 },
319 319 { "mutex_miss", KSTAT_DATA_UINT64 },
320 320 { "evict_skip", KSTAT_DATA_UINT64 },
321 321 { "evict_l2_cached", KSTAT_DATA_UINT64 },
322 322 { "evict_l2_eligible", KSTAT_DATA_UINT64 },
323 323 { "evict_l2_ineligible", KSTAT_DATA_UINT64 },
324 324 { "hash_elements", KSTAT_DATA_UINT64 },
325 325 { "hash_elements_max", KSTAT_DATA_UINT64 },
326 326 { "hash_collisions", KSTAT_DATA_UINT64 },
327 327 { "hash_chains", KSTAT_DATA_UINT64 },
328 328 { "hash_chain_max", KSTAT_DATA_UINT64 },
329 329 { "p", KSTAT_DATA_UINT64 },
330 330 { "c", KSTAT_DATA_UINT64 },
331 331 { "c_min", KSTAT_DATA_UINT64 },
332 332 { "c_max", KSTAT_DATA_UINT64 },
333 333 { "size", KSTAT_DATA_UINT64 },
334 334 { "hdr_size", KSTAT_DATA_UINT64 },
335 335 { "data_size", KSTAT_DATA_UINT64 },
336 336 { "other_size", KSTAT_DATA_UINT64 },
337 337 { "l2_hits", KSTAT_DATA_UINT64 },
338 338 { "l2_misses", KSTAT_DATA_UINT64 },
339 339 { "l2_feeds", KSTAT_DATA_UINT64 },
340 340 { "l2_rw_clash", KSTAT_DATA_UINT64 },
341 341 { "l2_read_bytes", KSTAT_DATA_UINT64 },
342 342 { "l2_write_bytes", KSTAT_DATA_UINT64 },
343 343 { "l2_writes_sent", KSTAT_DATA_UINT64 },
344 344 { "l2_writes_done", KSTAT_DATA_UINT64 },
345 345 { "l2_writes_error", KSTAT_DATA_UINT64 },
346 346 { "l2_writes_hdr_miss", KSTAT_DATA_UINT64 },
347 347 { "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
348 348 { "l2_evict_reading", KSTAT_DATA_UINT64 },
349 349 { "l2_free_on_write", KSTAT_DATA_UINT64 },
350 350 { "l2_abort_lowmem", KSTAT_DATA_UINT64 },
351 351 { "l2_cksum_bad", KSTAT_DATA_UINT64 },
352 352 { "l2_io_error", KSTAT_DATA_UINT64 },
353 353 { "l2_size", KSTAT_DATA_UINT64 },
354 354 { "l2_hdr_size", KSTAT_DATA_UINT64 },
355 355 { "memory_throttle_count", KSTAT_DATA_UINT64 },
356 356 { "duplicate_buffers", KSTAT_DATA_UINT64 },
357 357 { "duplicate_buffers_size", KSTAT_DATA_UINT64 },
358 358 { "duplicate_reads", KSTAT_DATA_UINT64 },
359 359 { "arc_meta_used", KSTAT_DATA_UINT64 },
360 360 { "arc_meta_limit", KSTAT_DATA_UINT64 },
361 361 { "arc_meta_max", KSTAT_DATA_UINT64 }
362 362 };
363 363
364 364 #define ARCSTAT(stat) (arc_stats.stat.value.ui64)
365 365
366 366 #define ARCSTAT_INCR(stat, val) \
367 367 atomic_add_64(&arc_stats.stat.value.ui64, (val));
368 368
369 369 #define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
370 370 #define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)
371 371
372 372 #define ARCSTAT_MAX(stat, val) { \
373 373 uint64_t m; \
374 374 while ((val) > (m = arc_stats.stat.value.ui64) && \
375 375 (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
376 376 continue; \
377 377 }
378 378
379 379 #define ARCSTAT_MAXSTAT(stat) \
380 380 ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
381 381
382 382 /*
383 383 * We define a macro to allow ARC hits/misses to be easily broken down by
384 384 * two separate conditions, giving a total of four different subtypes for
385 385 * each of hits and misses (so eight statistics total).
386 386 */
387 387 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
388 388 if (cond1) { \
389 389 if (cond2) { \
390 390 ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
391 391 } else { \
392 392 ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
393 393 } \
394 394 } else { \
395 395 if (cond2) { \
396 396 ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
397 397 } else { \
398 398 ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
399 399 } \
400 400 }
401 401
402 402 kstat_t *arc_ksp;
403 403 static arc_state_t *arc_anon;
404 404 static arc_state_t *arc_mru;
405 405 static arc_state_t *arc_mru_ghost;
406 406 static arc_state_t *arc_mfu;
407 407 static arc_state_t *arc_mfu_ghost;
408 408 static arc_state_t *arc_l2c_only;
409 409
410 410 /*
411 411 * There are several ARC variables that are critical to export as kstats --
412 412 * but we don't want to have to grovel around in the kstat whenever we wish to
413 413 * manipulate them. For these variables, we therefore define them to be in
414 414 * terms of the statistic variable. This assures that we are not introducing
415 415 * the possibility of inconsistency by having shadow copies of the variables,
416 416 * while still allowing the code to be readable.
417 417 */
418 418 #define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
419 419 #define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
420 420 #define arc_c ARCSTAT(arcstat_c) /* target size of cache */
421 421 #define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
422 422 #define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
423 423 #define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */
424 424 #define arc_meta_used ARCSTAT(arcstat_meta_used) /* size of metadata */
425 425 #define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */
426 426
427 427 static int arc_no_grow; /* Don't try to grow cache size */
428 428 static uint64_t arc_tempreserve;
429 429 static uint64_t arc_loaned_bytes;
430 430
431 431 typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;
432 432
433 433 typedef struct arc_callback arc_callback_t;
434 434
435 435 struct arc_callback {
436 436 void *acb_private;
437 437 arc_done_func_t *acb_done;
438 438 arc_buf_t *acb_buf;
439 439 zio_t *acb_zio_dummy;
440 440 arc_callback_t *acb_next;
441 441 };
442 442
443 443 typedef struct arc_write_callback arc_write_callback_t;
444 444
445 445 struct arc_write_callback {
446 446 void *awcb_private;
447 447 arc_done_func_t *awcb_ready;
448 448 arc_done_func_t *awcb_done;
449 449 arc_buf_t *awcb_buf;
450 450 };
451 451
452 452 struct arc_buf_hdr {
453 453 /* protected by hash lock */
454 454 dva_t b_dva;
455 455 uint64_t b_birth;
456 456 uint64_t b_cksum0;
457 457
458 458 kmutex_t b_freeze_lock;
459 459 zio_cksum_t *b_freeze_cksum;
460 460 void *b_thawed;
461 461
462 462 arc_buf_hdr_t *b_hash_next;
463 463 arc_buf_t *b_buf;
464 464 uint32_t b_flags;
465 465 uint32_t b_datacnt;
466 466
467 467 arc_callback_t *b_acb;
468 468 kcondvar_t b_cv;
469 469
470 470 /* immutable */
471 471 arc_buf_contents_t b_type;
472 472 uint64_t b_size;
473 473 uint64_t b_spa;
474 474
475 475 /* protected by arc state mutex */
476 476 arc_state_t *b_state;
477 477 list_node_t b_arc_node;
478 478
479 479 /* updated atomically */
480 480 clock_t b_arc_access;
481 481
482 482 /* self protecting */
483 483 refcount_t b_refcnt;
484 484
485 485 l2arc_buf_hdr_t *b_l2hdr;
486 486 list_node_t b_l2node;
487 487 };
488 488
489 489 static arc_buf_t *arc_eviction_list;
490 490 static kmutex_t arc_eviction_mtx;
491 491 static arc_buf_hdr_t arc_eviction_hdr;
492 492 static void arc_get_data_buf(arc_buf_t *buf);
493 493 static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
494 494 static int arc_evict_needed(arc_buf_contents_t type);
495 495 static void arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes);
496 496 static void arc_buf_watch(arc_buf_t *buf);
497 497
498 498 static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab);
499 499
500 500 #define GHOST_STATE(state) \
501 501 ((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
502 502 (state) == arc_l2c_only)
503 503
504 504 /*
505 505 * Private ARC flags. These flags are private ARC only flags that will show up
506 506 * in b_flags in the arc_hdr_buf_t. Some flags are publicly declared, and can
507 507 * be passed in as arc_flags in things like arc_read. However, these flags
508 508 * should never be passed and should only be set by ARC code. When adding new
509 509 * public flags, make sure not to smash the private ones.
510 510 */
511 511
512 512 #define ARC_IN_HASH_TABLE (1 << 9) /* this buffer is hashed */
513 513 #define ARC_IO_IN_PROGRESS (1 << 10) /* I/O in progress for buf */
514 514 #define ARC_IO_ERROR (1 << 11) /* I/O failed for buf */
515 515 #define ARC_FREED_IN_READ (1 << 12) /* buf freed while in read */
516 516 #define ARC_BUF_AVAILABLE (1 << 13) /* block not in active use */
517 517 #define ARC_INDIRECT (1 << 14) /* this is an indirect block */
518 518 #define ARC_FREE_IN_PROGRESS (1 << 15) /* hdr about to be freed */
519 519 #define ARC_L2_WRITING (1 << 16) /* L2ARC write in progress */
520 520 #define ARC_L2_EVICTED (1 << 17) /* evicted during I/O */
521 521 #define ARC_L2_WRITE_HEAD (1 << 18) /* head of write list */
522 522
523 523 #define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_IN_HASH_TABLE)
524 524 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
525 525 #define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_IO_ERROR)
526 526 #define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_PREFETCH)
527 527 #define HDR_FREED_IN_READ(hdr) ((hdr)->b_flags & ARC_FREED_IN_READ)
528 528 #define HDR_BUF_AVAILABLE(hdr) ((hdr)->b_flags & ARC_BUF_AVAILABLE)
529 529 #define HDR_FREE_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
530 530 #define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_L2CACHE)
531 531 #define HDR_L2_READING(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
532 532 (hdr)->b_l2hdr != NULL)
533 533 #define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_L2_WRITING)
534 534 #define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_L2_EVICTED)
535 535 #define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
536 536
537 537 /*
538 538 * Other sizes
539 539 */
540 540
541 541 #define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
542 542 #define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
543 543
544 544 /*
545 545 * Hash table routines
546 546 */
547 547
548 -#define HT_LOCK_PAD 64
549 -
550 -struct ht_lock {
551 - kmutex_t ht_lock;
552 -#ifdef _KERNEL
553 - unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
554 -#endif
548 +struct ht_table {
549 + arc_buf_hdr_t *hdr;
550 + kmutex_t lock;
555 551 };
556 552
557 -#define BUF_LOCKS 256
558 553 typedef struct buf_hash_table {
559 554 uint64_t ht_mask;
560 - arc_buf_hdr_t **ht_table;
561 - struct ht_lock ht_locks[BUF_LOCKS];
555 + struct ht_table *ht_table;
562 556 } buf_hash_table_t;
563 557
564 558 static buf_hash_table_t buf_hash_table;
565 559
566 560 #define BUF_HASH_INDEX(spa, dva, birth) \
567 561 (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
568 -#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
569 -#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
562 +#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)
570 563 #define HDR_LOCK(hdr) \
571 564 (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
572 565
573 566 uint64_t zfs_crc64_table[256];
574 567
575 568 /*
576 569 * Level 2 ARC
577 570 */
578 571
579 572 #define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
580 573 #define L2ARC_HEADROOM 2 /* num of writes */
581 574 #define L2ARC_FEED_SECS 1 /* caching interval secs */
582 575 #define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
583 576
584 577 #define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
585 578 #define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
586 579
587 580 /*
588 581 * L2ARC Performance Tunables
589 582 */
590 583 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */
591 584 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */
592 585 uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */
593 586 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
594 587 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
595 588 boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
596 589 boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
597 590 boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
598 591
599 592 /*
600 593 * L2ARC Internals
601 594 */
602 595 typedef struct l2arc_dev {
603 596 vdev_t *l2ad_vdev; /* vdev */
604 597 spa_t *l2ad_spa; /* spa */
605 598 uint64_t l2ad_hand; /* next write location */
606 599 uint64_t l2ad_write; /* desired write size, bytes */
607 600 uint64_t l2ad_boost; /* warmup write boost, bytes */
608 601 uint64_t l2ad_start; /* first addr on device */
609 602 uint64_t l2ad_end; /* last addr on device */
610 603 uint64_t l2ad_evict; /* last addr eviction reached */
611 604 boolean_t l2ad_first; /* first sweep through */
612 605 boolean_t l2ad_writing; /* currently writing */
613 606 list_t *l2ad_buflist; /* buffer list */
614 607 list_node_t l2ad_node; /* device list node */
615 608 } l2arc_dev_t;
616 609
617 610 static list_t L2ARC_dev_list; /* device list */
618 611 static list_t *l2arc_dev_list; /* device list pointer */
619 612 static kmutex_t l2arc_dev_mtx; /* device list mutex */
620 613 static l2arc_dev_t *l2arc_dev_last; /* last device used */
621 614 static kmutex_t l2arc_buflist_mtx; /* mutex for all buflists */
622 615 static list_t L2ARC_free_on_write; /* free after write buf list */
623 616 static list_t *l2arc_free_on_write; /* free after write list ptr */
624 617 static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
625 618 static uint64_t l2arc_ndev; /* number of devices */
626 619
627 620 typedef struct l2arc_read_callback {
628 621 arc_buf_t *l2rcb_buf; /* read buffer */
629 622 spa_t *l2rcb_spa; /* spa */
630 623 blkptr_t l2rcb_bp; /* original blkptr */
631 624 zbookmark_t l2rcb_zb; /* original bookmark */
632 625 int l2rcb_flags; /* original flags */
633 626 } l2arc_read_callback_t;
634 627
635 628 typedef struct l2arc_write_callback {
636 629 l2arc_dev_t *l2wcb_dev; /* device info */
637 630 arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
638 631 } l2arc_write_callback_t;
639 632
640 633 struct l2arc_buf_hdr {
641 634 /* protected by arc_buf_hdr mutex */
642 635 l2arc_dev_t *b_dev; /* L2ARC device */
643 636 uint64_t b_daddr; /* disk address, offset byte */
644 637 };
645 638
646 639 typedef struct l2arc_data_free {
647 640 /* protected by l2arc_free_on_write_mtx */
648 641 void *l2df_data;
649 642 size_t l2df_size;
650 643 void (*l2df_func)(void *, size_t);
651 644 list_node_t l2df_list_node;
652 645 } l2arc_data_free_t;
653 646
654 647 static kmutex_t l2arc_feed_thr_lock;
655 648 static kcondvar_t l2arc_feed_thr_cv;
656 649 static uint8_t l2arc_thread_exit;
657 650
658 651 static void l2arc_read_done(zio_t *zio);
659 652 static void l2arc_hdr_stat_add(void);
660 653 static void l2arc_hdr_stat_remove(void);
661 654
662 655 static uint64_t
663 656 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
664 657 {
665 658 uint8_t *vdva = (uint8_t *)dva;
666 659 uint64_t crc = -1ULL;
667 660 int i;
668 661
669 662 ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
670 663
671 664 for (i = 0; i < sizeof (dva_t); i++)
672 665 crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
673 666
674 667 crc ^= (spa>>8) ^ birth;
675 668
676 669 return (crc);
677 670 }
678 671
679 672 #define BUF_EMPTY(buf) \
680 673 ((buf)->b_dva.dva_word[0] == 0 && \
681 674 (buf)->b_dva.dva_word[1] == 0 && \
682 675 (buf)->b_birth == 0)
683 676
684 677 #define BUF_EQUAL(spa, dva, birth, buf) \
685 678 ((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
686 679 ((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
687 680 ((buf)->b_birth == birth) && ((buf)->b_spa == spa)
688 681
689 682 static void
690 683 buf_discard_identity(arc_buf_hdr_t *hdr)
691 684 {
692 685 hdr->b_dva.dva_word[0] = 0;
693 686 hdr->b_dva.dva_word[1] = 0;
694 687 hdr->b_birth = 0;
695 688 hdr->b_cksum0 = 0;
696 689 }
697 690
698 691 static arc_buf_hdr_t *
699 692 buf_hash_find(uint64_t spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
700 693 {
701 694 uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
702 695 kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
703 696 arc_buf_hdr_t *buf;
704 697
705 698 mutex_enter(hash_lock);
706 - for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
699 + for (buf = buf_hash_table.ht_table[idx].hdr; buf != NULL;
707 700 buf = buf->b_hash_next) {
708 701 if (BUF_EQUAL(spa, dva, birth, buf)) {
709 702 *lockp = hash_lock;
710 703 return (buf);
711 704 }
712 705 }
713 706 mutex_exit(hash_lock);
714 707 *lockp = NULL;
715 708 return (NULL);
716 709 }
717 710
718 711 /*
719 712 * Insert an entry into the hash table. If there is already an element
720 713 * equal to elem in the hash table, then the already existing element
721 714 * will be returned and the new element will not be inserted.
722 715 * Otherwise returns NULL.
723 716 */
724 717 static arc_buf_hdr_t *
725 718 buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
726 719 {
727 720 uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
728 721 kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
729 722 arc_buf_hdr_t *fbuf;
730 723 uint32_t i;
731 724
732 725 ASSERT(!HDR_IN_HASH_TABLE(buf));
733 726 *lockp = hash_lock;
734 727 mutex_enter(hash_lock);
735 - for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
728 + for (fbuf = buf_hash_table.ht_table[idx].hdr, i = 0; fbuf != NULL;
736 729 fbuf = fbuf->b_hash_next, i++) {
737 730 if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
738 731 return (fbuf);
739 732 }
740 733
741 - buf->b_hash_next = buf_hash_table.ht_table[idx];
742 - buf_hash_table.ht_table[idx] = buf;
734 + buf->b_hash_next = buf_hash_table.ht_table[idx].hdr;
735 + buf_hash_table.ht_table[idx].hdr = buf;
743 736 buf->b_flags |= ARC_IN_HASH_TABLE;
744 737
745 738 /* collect some hash table performance data */
746 739 if (i > 0) {
747 740 ARCSTAT_BUMP(arcstat_hash_collisions);
748 741 if (i == 1)
749 742 ARCSTAT_BUMP(arcstat_hash_chains);
750 743
751 744 ARCSTAT_MAX(arcstat_hash_chain_max, i);
752 745 }
753 746
754 747 ARCSTAT_BUMP(arcstat_hash_elements);
755 748 ARCSTAT_MAXSTAT(arcstat_hash_elements);
756 749
757 750 return (NULL);
758 751 }
759 752
760 753 static void
761 754 buf_hash_remove(arc_buf_hdr_t *buf)
762 755 {
763 756 arc_buf_hdr_t *fbuf, **bufp;
764 757 uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
765 758
766 759 ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
767 760 ASSERT(HDR_IN_HASH_TABLE(buf));
768 761
769 - bufp = &buf_hash_table.ht_table[idx];
762 + bufp = &buf_hash_table.ht_table[idx].hdr;
770 763 while ((fbuf = *bufp) != buf) {
771 764 ASSERT(fbuf != NULL);
772 765 bufp = &fbuf->b_hash_next;
773 766 }
774 767 *bufp = buf->b_hash_next;
775 768 buf->b_hash_next = NULL;
776 769 buf->b_flags &= ~ARC_IN_HASH_TABLE;
777 770
778 771 /* collect some hash table performance data */
779 772 ARCSTAT_BUMPDOWN(arcstat_hash_elements);
780 773
781 - if (buf_hash_table.ht_table[idx] &&
782 - buf_hash_table.ht_table[idx]->b_hash_next == NULL)
774 + if (buf_hash_table.ht_table[idx].hdr &&
775 + buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
783 776 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
784 777 }
785 778
786 779 /*
787 780 * Global data structures and functions for the buf kmem cache.
788 781 */
789 782 static kmem_cache_t *hdr_cache;
790 783 static kmem_cache_t *buf_cache;
791 784
792 785 static void
793 786 buf_fini(void)
794 787 {
795 788 int i;
796 789
790 + for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
791 + mutex_destroy(&buf_hash_table.ht_table[i].lock);
797 792 kmem_free(buf_hash_table.ht_table,
798 - (buf_hash_table.ht_mask + 1) * sizeof (void *));
799 - for (i = 0; i < BUF_LOCKS; i++)
800 - mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
793 + (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));
801 794 kmem_cache_destroy(hdr_cache);
802 795 kmem_cache_destroy(buf_cache);
803 796 }
804 797
805 798 /*
806 799 * Constructor callback - called when the cache is empty
807 800 * and a new buf is requested.
808 801 */
809 802 /* ARGSUSED */
810 803 static int
811 804 hdr_cons(void *vbuf, void *unused, int kmflag)
812 805 {
813 806 arc_buf_hdr_t *buf = vbuf;
814 807
815 808 bzero(buf, sizeof (arc_buf_hdr_t));
816 809 refcount_create(&buf->b_refcnt);
817 810 cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
818 811 mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
819 812 arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
820 813
821 814 return (0);
822 815 }
823 816
824 817 /* ARGSUSED */
825 818 static int
826 819 buf_cons(void *vbuf, void *unused, int kmflag)
827 820 {
828 821 arc_buf_t *buf = vbuf;
829 822
830 823 bzero(buf, sizeof (arc_buf_t));
831 824 mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
832 825 arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
833 826
834 827 return (0);
835 828 }
836 829
837 830 /*
838 831 * Destructor callback - called when a cached buf is
839 832 * no longer required.
840 833 */
841 834 /* ARGSUSED */
842 835 static void
843 836 hdr_dest(void *vbuf, void *unused)
844 837 {
845 838 arc_buf_hdr_t *buf = vbuf;
846 839
847 840 ASSERT(BUF_EMPTY(buf));
848 841 refcount_destroy(&buf->b_refcnt);
849 842 cv_destroy(&buf->b_cv);
850 843 mutex_destroy(&buf->b_freeze_lock);
851 844 arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
852 845 }
853 846
854 847 /* ARGSUSED */
855 848 static void
856 849 buf_dest(void *vbuf, void *unused)
857 850 {
858 851 arc_buf_t *buf = vbuf;
859 852
860 853 mutex_destroy(&buf->b_evict_lock);
861 854 arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
862 855 }
863 856
864 857 /*
865 858 * Reclaim callback -- invoked when memory is low.
866 859 */
867 860 /* ARGSUSED */
868 861 static void
869 862 hdr_recl(void *unused)
870 863 {
871 864 dprintf("hdr_recl called\n");
872 865 /*
873 866 * umem calls the reclaim func when we destroy the buf cache,
874 867 * which is after we do arc_fini().
875 868 */
876 869 if (!arc_dead)
877 870 cv_signal(&arc_reclaim_thr_cv);
878 871 }
879 872
880 873 static void
881 874 buf_init(void)
882 875 {
883 876 uint64_t *ct;
884 877 uint64_t hsize = 1ULL << 12;
885 878 int i, j;
886 879
887 880 /*
888 881 * The hash table is big enough to fill all of physical memory
889 882 * with an average 64K block size. The table will take up
890 883 * totalmem*sizeof(void*)/64K (eg. 128KB/GB with 8-byte pointers).
891 884 */
892 885 while (hsize * 65536 < physmem * PAGESIZE)
893 886 hsize <<= 1;
894 887 retry:
895 888 buf_hash_table.ht_mask = hsize - 1;
896 889 buf_hash_table.ht_table =
897 - kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
890 + kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
898 891 if (buf_hash_table.ht_table == NULL) {
899 892 ASSERT(hsize > (1ULL << 8));
900 893 hsize >>= 1;
901 894 goto retry;
902 895 }
903 896
904 897 hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
905 898 0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
906 899 buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
907 900 0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
908 901
909 902 for (i = 0; i < 256; i++)
910 903 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
911 904 *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
912 905
913 - for (i = 0; i < BUF_LOCKS; i++) {
914 - mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
906 + for (i = 0; i < hsize; i++) {
907 + mutex_init(&buf_hash_table.ht_table[i].lock,
915 908 NULL, MUTEX_DEFAULT, NULL);
916 909 }
917 910 }
918 911
919 912 #define ARC_MINTIME (hz>>4) /* 62 ms */
920 913
921 914 static void
922 915 arc_cksum_verify(arc_buf_t *buf)
923 916 {
924 917 zio_cksum_t zc;
925 918
926 919 if (!(zfs_flags & ZFS_DEBUG_MODIFY))
927 920 return;
928 921
929 922 mutex_enter(&buf->b_hdr->b_freeze_lock);
930 923 if (buf->b_hdr->b_freeze_cksum == NULL ||
931 924 (buf->b_hdr->b_flags & ARC_IO_ERROR)) {
932 925 mutex_exit(&buf->b_hdr->b_freeze_lock);
933 926 return;
934 927 }
935 928 fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
936 929 if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
937 930 panic("buffer modified while frozen!");
938 931 mutex_exit(&buf->b_hdr->b_freeze_lock);
939 932 }
940 933
941 934 static int
942 935 arc_cksum_equal(arc_buf_t *buf)
943 936 {
944 937 zio_cksum_t zc;
945 938 int equal;
946 939
947 940 mutex_enter(&buf->b_hdr->b_freeze_lock);
948 941 fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
949 942 equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
950 943 mutex_exit(&buf->b_hdr->b_freeze_lock);
951 944
952 945 return (equal);
953 946 }
954 947
955 948 static void
956 949 arc_cksum_compute(arc_buf_t *buf, boolean_t force)
957 950 {
958 951 if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
959 952 return;
960 953
961 954 mutex_enter(&buf->b_hdr->b_freeze_lock);
962 955 if (buf->b_hdr->b_freeze_cksum != NULL) {
963 956 mutex_exit(&buf->b_hdr->b_freeze_lock);
964 957 return;
965 958 }
966 959 buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
967 960 fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
968 961 buf->b_hdr->b_freeze_cksum);
969 962 mutex_exit(&buf->b_hdr->b_freeze_lock);
970 963 arc_buf_watch(buf);
971 964 }
972 965
973 966 #ifndef _KERNEL
974 967 typedef struct procctl {
975 968 long cmd;
976 969 prwatch_t prwatch;
977 970 } procctl_t;
978 971 #endif
979 972
980 973 /* ARGSUSED */
981 974 static void
982 975 arc_buf_unwatch(arc_buf_t *buf)
983 976 {
984 977 #ifndef _KERNEL
985 978 if (arc_watch) {
986 979 int result;
987 980 procctl_t ctl;
988 981 ctl.cmd = PCWATCH;
989 982 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
990 983 ctl.prwatch.pr_size = 0;
991 984 ctl.prwatch.pr_wflags = 0;
992 985 result = write(arc_procfd, &ctl, sizeof (ctl));
993 986 ASSERT3U(result, ==, sizeof (ctl));
994 987 }
995 988 #endif
996 989 }
997 990
998 991 /* ARGSUSED */
999 992 static void
1000 993 arc_buf_watch(arc_buf_t *buf)
1001 994 {
1002 995 #ifndef _KERNEL
1003 996 if (arc_watch) {
1004 997 int result;
1005 998 procctl_t ctl;
1006 999 ctl.cmd = PCWATCH;
1007 1000 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1008 1001 ctl.prwatch.pr_size = buf->b_hdr->b_size;
1009 1002 ctl.prwatch.pr_wflags = WA_WRITE;
1010 1003 result = write(arc_procfd, &ctl, sizeof (ctl));
1011 1004 ASSERT3U(result, ==, sizeof (ctl));
1012 1005 }
1013 1006 #endif
1014 1007 }
1015 1008
1016 1009 void
1017 1010 arc_buf_thaw(arc_buf_t *buf)
1018 1011 {
1019 1012 if (zfs_flags & ZFS_DEBUG_MODIFY) {
1020 1013 if (buf->b_hdr->b_state != arc_anon)
1021 1014 panic("modifying non-anon buffer!");
1022 1015 if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
1023 1016 panic("modifying buffer while i/o in progress!");
1024 1017 arc_cksum_verify(buf);
1025 1018 }
1026 1019
1027 1020 mutex_enter(&buf->b_hdr->b_freeze_lock);
1028 1021 if (buf->b_hdr->b_freeze_cksum != NULL) {
1029 1022 kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1030 1023 buf->b_hdr->b_freeze_cksum = NULL;
1031 1024 }
1032 1025
1033 1026 if (zfs_flags & ZFS_DEBUG_MODIFY) {
1034 1027 if (buf->b_hdr->b_thawed)
1035 1028 kmem_free(buf->b_hdr->b_thawed, 1);
1036 1029 buf->b_hdr->b_thawed = kmem_alloc(1, KM_SLEEP);
1037 1030 }
1038 1031
1039 1032 mutex_exit(&buf->b_hdr->b_freeze_lock);
1040 1033
1041 1034 arc_buf_unwatch(buf);
1042 1035 }
1043 1036
1044 1037 void
1045 1038 arc_buf_freeze(arc_buf_t *buf)
1046 1039 {
1047 1040 kmutex_t *hash_lock;
1048 1041
1049 1042 if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1050 1043 return;
1051 1044
1052 1045 hash_lock = HDR_LOCK(buf->b_hdr);
1053 1046 mutex_enter(hash_lock);
1054 1047
1055 1048 ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
1056 1049 buf->b_hdr->b_state == arc_anon);
1057 1050 arc_cksum_compute(buf, B_FALSE);
1058 1051 mutex_exit(hash_lock);
1059 1052
1060 1053 }
1061 1054
1062 1055 static void
1063 1056 add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
1064 1057 {
1065 1058 ASSERT(MUTEX_HELD(hash_lock));
1066 1059
1067 1060 if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
1068 1061 (ab->b_state != arc_anon)) {
1069 1062 uint64_t delta = ab->b_size * ab->b_datacnt;
1070 1063 list_t *list = &ab->b_state->arcs_list[ab->b_type];
1071 1064 uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
1072 1065
1073 1066 ASSERT(!MUTEX_HELD(&ab->b_state->arcs_mtx));
1074 1067 mutex_enter(&ab->b_state->arcs_mtx);
1075 1068 ASSERT(list_link_active(&ab->b_arc_node));
1076 1069 list_remove(list, ab);
1077 1070 if (GHOST_STATE(ab->b_state)) {
1078 1071 ASSERT0(ab->b_datacnt);
1079 1072 ASSERT3P(ab->b_buf, ==, NULL);
1080 1073 delta = ab->b_size;
1081 1074 }
1082 1075 ASSERT(delta > 0);
1083 1076 ASSERT3U(*size, >=, delta);
1084 1077 atomic_add_64(size, -delta);
1085 1078 mutex_exit(&ab->b_state->arcs_mtx);
1086 1079 /* remove the prefetch flag if we get a reference */
1087 1080 if (ab->b_flags & ARC_PREFETCH)
1088 1081 ab->b_flags &= ~ARC_PREFETCH;
1089 1082 }
1090 1083 }
1091 1084
1092 1085 static int
1093 1086 remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
1094 1087 {
1095 1088 int cnt;
1096 1089 arc_state_t *state = ab->b_state;
1097 1090
1098 1091 ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
1099 1092 ASSERT(!GHOST_STATE(state));
1100 1093
1101 1094 if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
1102 1095 (state != arc_anon)) {
1103 1096 uint64_t *size = &state->arcs_lsize[ab->b_type];
1104 1097
1105 1098 ASSERT(!MUTEX_HELD(&state->arcs_mtx));
1106 1099 mutex_enter(&state->arcs_mtx);
1107 1100 ASSERT(!list_link_active(&ab->b_arc_node));
1108 1101 list_insert_head(&state->arcs_list[ab->b_type], ab);
1109 1102 ASSERT(ab->b_datacnt > 0);
1110 1103 atomic_add_64(size, ab->b_size * ab->b_datacnt);
1111 1104 mutex_exit(&state->arcs_mtx);
1112 1105 }
1113 1106 return (cnt);
1114 1107 }
1115 1108
1116 1109 /*
1117 1110 * Move the supplied buffer to the indicated state. The mutex
1118 1111 * for the buffer must be held by the caller.
1119 1112 */
1120 1113 static void
1121 1114 arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
1122 1115 {
1123 1116 arc_state_t *old_state = ab->b_state;
1124 1117 int64_t refcnt = refcount_count(&ab->b_refcnt);
1125 1118 uint64_t from_delta, to_delta;
1126 1119
1127 1120 ASSERT(MUTEX_HELD(hash_lock));
1128 1121 ASSERT(new_state != old_state);
1129 1122 ASSERT(refcnt == 0 || ab->b_datacnt > 0);
1130 1123 ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
1131 1124 ASSERT(ab->b_datacnt <= 1 || old_state != arc_anon);
1132 1125
1133 1126 from_delta = to_delta = ab->b_datacnt * ab->b_size;
1134 1127
1135 1128 /*
1136 1129 * If this buffer is evictable, transfer it from the
1137 1130 * old state list to the new state list.
1138 1131 */
1139 1132 if (refcnt == 0) {
1140 1133 if (old_state != arc_anon) {
1141 1134 int use_mutex = !MUTEX_HELD(&old_state->arcs_mtx);
1142 1135 uint64_t *size = &old_state->arcs_lsize[ab->b_type];
1143 1136
1144 1137 if (use_mutex)
1145 1138 mutex_enter(&old_state->arcs_mtx);
1146 1139
1147 1140 ASSERT(list_link_active(&ab->b_arc_node));
1148 1141 list_remove(&old_state->arcs_list[ab->b_type], ab);
1149 1142
1150 1143 /*
1151 1144 * If prefetching out of the ghost cache,
1152 1145 * we will have a non-zero datacnt.
1153 1146 */
1154 1147 if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
1155 1148 /* ghost elements have a ghost size */
1156 1149 ASSERT(ab->b_buf == NULL);
1157 1150 from_delta = ab->b_size;
1158 1151 }
1159 1152 ASSERT3U(*size, >=, from_delta);
1160 1153 atomic_add_64(size, -from_delta);
1161 1154
1162 1155 if (use_mutex)
1163 1156 mutex_exit(&old_state->arcs_mtx);
1164 1157 }
1165 1158 if (new_state != arc_anon) {
1166 1159 int use_mutex = !MUTEX_HELD(&new_state->arcs_mtx);
1167 1160 uint64_t *size = &new_state->arcs_lsize[ab->b_type];
1168 1161
1169 1162 if (use_mutex)
1170 1163 mutex_enter(&new_state->arcs_mtx);
1171 1164
1172 1165 list_insert_head(&new_state->arcs_list[ab->b_type], ab);
1173 1166
1174 1167 /* ghost elements have a ghost size */
1175 1168 if (GHOST_STATE(new_state)) {
1176 1169 ASSERT(ab->b_datacnt == 0);
1177 1170 ASSERT(ab->b_buf == NULL);
1178 1171 to_delta = ab->b_size;
1179 1172 }
1180 1173 atomic_add_64(size, to_delta);
1181 1174
1182 1175 if (use_mutex)
1183 1176 mutex_exit(&new_state->arcs_mtx);
1184 1177 }
1185 1178 }
1186 1179
1187 1180 ASSERT(!BUF_EMPTY(ab));
1188 1181 if (new_state == arc_anon && HDR_IN_HASH_TABLE(ab))
1189 1182 buf_hash_remove(ab);
1190 1183
1191 1184 /* adjust state sizes */
1192 1185 if (to_delta)
1193 1186 atomic_add_64(&new_state->arcs_size, to_delta);
1194 1187 if (from_delta) {
1195 1188 ASSERT3U(old_state->arcs_size, >=, from_delta);
1196 1189 atomic_add_64(&old_state->arcs_size, -from_delta);
1197 1190 }
1198 1191 ab->b_state = new_state;
1199 1192
1200 1193 /* adjust l2arc hdr stats */
1201 1194 if (new_state == arc_l2c_only)
1202 1195 l2arc_hdr_stat_add();
1203 1196 else if (old_state == arc_l2c_only)
1204 1197 l2arc_hdr_stat_remove();
1205 1198 }
1206 1199
1207 1200 void
1208 1201 arc_space_consume(uint64_t space, arc_space_type_t type)
1209 1202 {
1210 1203 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
1211 1204
1212 1205 switch (type) {
1213 1206 case ARC_SPACE_DATA:
1214 1207 ARCSTAT_INCR(arcstat_data_size, space);
1215 1208 break;
1216 1209 case ARC_SPACE_OTHER:
1217 1210 ARCSTAT_INCR(arcstat_other_size, space);
1218 1211 break;
1219 1212 case ARC_SPACE_HDRS:
1220 1213 ARCSTAT_INCR(arcstat_hdr_size, space);
1221 1214 break;
1222 1215 case ARC_SPACE_L2HDRS:
1223 1216 ARCSTAT_INCR(arcstat_l2_hdr_size, space);
1224 1217 break;
1225 1218 }
1226 1219
1227 1220 ARCSTAT_INCR(arcstat_meta_used, space);
1228 1221 atomic_add_64(&arc_size, space);
1229 1222 }
1230 1223
1231 1224 void
1232 1225 arc_space_return(uint64_t space, arc_space_type_t type)
1233 1226 {
1234 1227 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
1235 1228
1236 1229 switch (type) {
1237 1230 case ARC_SPACE_DATA:
1238 1231 ARCSTAT_INCR(arcstat_data_size, -space);
1239 1232 break;
1240 1233 case ARC_SPACE_OTHER:
1241 1234 ARCSTAT_INCR(arcstat_other_size, -space);
1242 1235 break;
1243 1236 case ARC_SPACE_HDRS:
1244 1237 ARCSTAT_INCR(arcstat_hdr_size, -space);
1245 1238 break;
1246 1239 case ARC_SPACE_L2HDRS:
1247 1240 ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
1248 1241 break;
1249 1242 }
1250 1243
1251 1244 ASSERT(arc_meta_used >= space);
1252 1245 if (arc_meta_max < arc_meta_used)
1253 1246 arc_meta_max = arc_meta_used;
1254 1247 ARCSTAT_INCR(arcstat_meta_used, -space);
1255 1248 ASSERT(arc_size >= space);
1256 1249 atomic_add_64(&arc_size, -space);
1257 1250 }
1258 1251
1259 1252 void *
1260 1253 arc_data_buf_alloc(uint64_t size)
1261 1254 {
1262 1255 if (arc_evict_needed(ARC_BUFC_DATA))
1263 1256 cv_signal(&arc_reclaim_thr_cv);
1264 1257 atomic_add_64(&arc_size, size);
1265 1258 return (zio_data_buf_alloc(size));
1266 1259 }
1267 1260
1268 1261 void
1269 1262 arc_data_buf_free(void *buf, uint64_t size)
1270 1263 {
1271 1264 zio_data_buf_free(buf, size);
1272 1265 ASSERT(arc_size >= size);
1273 1266 atomic_add_64(&arc_size, -size);
1274 1267 }
1275 1268
1276 1269 arc_buf_t *
1277 1270 arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
1278 1271 {
1279 1272 arc_buf_hdr_t *hdr;
1280 1273 arc_buf_t *buf;
1281 1274
1282 1275 ASSERT3U(size, >, 0);
1283 1276 hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
1284 1277 ASSERT(BUF_EMPTY(hdr));
1285 1278 hdr->b_size = size;
1286 1279 hdr->b_type = type;
1287 1280 hdr->b_spa = spa_load_guid(spa);
1288 1281 hdr->b_state = arc_anon;
1289 1282 hdr->b_arc_access = 0;
1290 1283 buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
1291 1284 buf->b_hdr = hdr;
1292 1285 buf->b_data = NULL;
1293 1286 buf->b_efunc = NULL;
1294 1287 buf->b_private = NULL;
1295 1288 buf->b_next = NULL;
1296 1289 hdr->b_buf = buf;
1297 1290 arc_get_data_buf(buf);
1298 1291 hdr->b_datacnt = 1;
1299 1292 hdr->b_flags = 0;
1300 1293 ASSERT(refcount_is_zero(&hdr->b_refcnt));
1301 1294 (void) refcount_add(&hdr->b_refcnt, tag);
1302 1295
1303 1296 return (buf);
1304 1297 }
1305 1298
1306 1299 static char *arc_onloan_tag = "onloan";
1307 1300
1308 1301 /*
1309 1302 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
1310 1303 * flight data by arc_tempreserve_space() until they are "returned". Loaned
1311 1304 * buffers must be returned to the arc before they can be used by the DMU or
1312 1305 * freed.
1313 1306 */
1314 1307 arc_buf_t *
1315 1308 arc_loan_buf(spa_t *spa, int size)
1316 1309 {
1317 1310 arc_buf_t *buf;
1318 1311
1319 1312 buf = arc_buf_alloc(spa, size, arc_onloan_tag, ARC_BUFC_DATA);
1320 1313
1321 1314 atomic_add_64(&arc_loaned_bytes, size);
1322 1315 return (buf);
1323 1316 }
1324 1317
1325 1318 /*
1326 1319 * Return a loaned arc buffer to the arc.
1327 1320 */
1328 1321 void
1329 1322 arc_return_buf(arc_buf_t *buf, void *tag)
1330 1323 {
1331 1324 arc_buf_hdr_t *hdr = buf->b_hdr;
1332 1325
1333 1326 ASSERT(buf->b_data != NULL);
1334 1327 (void) refcount_add(&hdr->b_refcnt, tag);
1335 1328 (void) refcount_remove(&hdr->b_refcnt, arc_onloan_tag);
1336 1329
1337 1330 atomic_add_64(&arc_loaned_bytes, -hdr->b_size);
1338 1331 }
1339 1332
1340 1333 /* Detach an arc_buf from a dbuf (tag) */
1341 1334 void
1342 1335 arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
1343 1336 {
1344 1337 arc_buf_hdr_t *hdr;
1345 1338
1346 1339 ASSERT(buf->b_data != NULL);
1347 1340 hdr = buf->b_hdr;
1348 1341 (void) refcount_add(&hdr->b_refcnt, arc_onloan_tag);
1349 1342 (void) refcount_remove(&hdr->b_refcnt, tag);
1350 1343 buf->b_efunc = NULL;
1351 1344 buf->b_private = NULL;
1352 1345
1353 1346 atomic_add_64(&arc_loaned_bytes, hdr->b_size);
1354 1347 }
1355 1348
1356 1349 static arc_buf_t *
1357 1350 arc_buf_clone(arc_buf_t *from)
1358 1351 {
1359 1352 arc_buf_t *buf;
1360 1353 arc_buf_hdr_t *hdr = from->b_hdr;
1361 1354 uint64_t size = hdr->b_size;
1362 1355
1363 1356 ASSERT(hdr->b_state != arc_anon);
1364 1357
1365 1358 buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
1366 1359 buf->b_hdr = hdr;
1367 1360 buf->b_data = NULL;
1368 1361 buf->b_efunc = NULL;
1369 1362 buf->b_private = NULL;
1370 1363 buf->b_next = hdr->b_buf;
1371 1364 hdr->b_buf = buf;
1372 1365 arc_get_data_buf(buf);
1373 1366 bcopy(from->b_data, buf->b_data, size);
1374 1367
1375 1368 /*
1376 1369 * This buffer already exists in the arc so create a duplicate
1377 1370 * copy for the caller. If the buffer is associated with user data
1378 1371 * then track the size and number of duplicates. These stats will be
1379 1372 * updated as duplicate buffers are created and destroyed.
1380 1373 */
1381 1374 if (hdr->b_type == ARC_BUFC_DATA) {
1382 1375 ARCSTAT_BUMP(arcstat_duplicate_buffers);
1383 1376 ARCSTAT_INCR(arcstat_duplicate_buffers_size, size);
1384 1377 }
1385 1378 hdr->b_datacnt += 1;
1386 1379 return (buf);
1387 1380 }
1388 1381
1389 1382 void
1390 1383 arc_buf_add_ref(arc_buf_t *buf, void* tag)
1391 1384 {
1392 1385 arc_buf_hdr_t *hdr;
1393 1386 kmutex_t *hash_lock;
1394 1387
1395 1388 /*
1396 1389 * Check to see if this buffer is evicted. Callers
1397 1390 * must verify b_data != NULL to know if the add_ref
1398 1391 * was successful.
1399 1392 */
1400 1393 mutex_enter(&buf->b_evict_lock);
1401 1394 if (buf->b_data == NULL) {
1402 1395 mutex_exit(&buf->b_evict_lock);
1403 1396 return;
1404 1397 }
1405 1398 hash_lock = HDR_LOCK(buf->b_hdr);
1406 1399 mutex_enter(hash_lock);
1407 1400 hdr = buf->b_hdr;
1408 1401 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1409 1402 mutex_exit(&buf->b_evict_lock);
1410 1403
1411 1404 ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
1412 1405 add_reference(hdr, hash_lock, tag);
1413 1406 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
1414 1407 arc_access(hdr, hash_lock);
1415 1408 mutex_exit(hash_lock);
1416 1409 ARCSTAT_BUMP(arcstat_hits);
1417 1410 ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
1418 1411 demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
1419 1412 data, metadata, hits);
1420 1413 }
1421 1414
1422 1415 /*
1423 1416 * Free the arc data buffer. If it is an l2arc write in progress,
1424 1417 * the buffer is placed on l2arc_free_on_write to be freed later.
1425 1418 */
1426 1419 static void
1427 1420 arc_buf_data_free(arc_buf_t *buf, void (*free_func)(void *, size_t))
1428 1421 {
1429 1422 arc_buf_hdr_t *hdr = buf->b_hdr;
1430 1423
1431 1424 if (HDR_L2_WRITING(hdr)) {
1432 1425 l2arc_data_free_t *df;
1433 1426 df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
1434 1427 df->l2df_data = buf->b_data;
1435 1428 df->l2df_size = hdr->b_size;
1436 1429 df->l2df_func = free_func;
1437 1430 mutex_enter(&l2arc_free_on_write_mtx);
1438 1431 list_insert_head(l2arc_free_on_write, df);
1439 1432 mutex_exit(&l2arc_free_on_write_mtx);
1440 1433 ARCSTAT_BUMP(arcstat_l2_free_on_write);
1441 1434 } else {
1442 1435 free_func(buf->b_data, hdr->b_size);
1443 1436 }
1444 1437 }
1445 1438
1446 1439 static void
1447 1440 arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
1448 1441 {
1449 1442 arc_buf_t **bufp;
1450 1443
1451 1444 /* free up data associated with the buf */
1452 1445 if (buf->b_data) {
1453 1446 arc_state_t *state = buf->b_hdr->b_state;
1454 1447 uint64_t size = buf->b_hdr->b_size;
1455 1448 arc_buf_contents_t type = buf->b_hdr->b_type;
1456 1449
1457 1450 arc_cksum_verify(buf);
1458 1451 arc_buf_unwatch(buf);
1459 1452
1460 1453 if (!recycle) {
1461 1454 if (type == ARC_BUFC_METADATA) {
1462 1455 arc_buf_data_free(buf, zio_buf_free);
1463 1456 arc_space_return(size, ARC_SPACE_DATA);
1464 1457 } else {
1465 1458 ASSERT(type == ARC_BUFC_DATA);
1466 1459 arc_buf_data_free(buf, zio_data_buf_free);
1467 1460 ARCSTAT_INCR(arcstat_data_size, -size);
1468 1461 atomic_add_64(&arc_size, -size);
1469 1462 }
1470 1463 }
1471 1464 if (list_link_active(&buf->b_hdr->b_arc_node)) {
1472 1465 uint64_t *cnt = &state->arcs_lsize[type];
1473 1466
1474 1467 ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
1475 1468 ASSERT(state != arc_anon);
1476 1469
1477 1470 ASSERT3U(*cnt, >=, size);
1478 1471 atomic_add_64(cnt, -size);
1479 1472 }
1480 1473 ASSERT3U(state->arcs_size, >=, size);
1481 1474 atomic_add_64(&state->arcs_size, -size);
1482 1475 buf->b_data = NULL;
1483 1476
1484 1477 /*
1485 1478 * If we're destroying a duplicate buffer make sure
1486 1479 * that the appropriate statistics are updated.
1487 1480 */
1488 1481 if (buf->b_hdr->b_datacnt > 1 &&
1489 1482 buf->b_hdr->b_type == ARC_BUFC_DATA) {
1490 1483 ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
1491 1484 ARCSTAT_INCR(arcstat_duplicate_buffers_size, -size);
1492 1485 }
1493 1486 ASSERT(buf->b_hdr->b_datacnt > 0);
1494 1487 buf->b_hdr->b_datacnt -= 1;
1495 1488 }
1496 1489
1497 1490 /* only remove the buf if requested */
1498 1491 if (!all)
1499 1492 return;
1500 1493
1501 1494 /* remove the buf from the hdr list */
1502 1495 for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
1503 1496 continue;
1504 1497 *bufp = buf->b_next;
1505 1498 buf->b_next = NULL;
1506 1499
1507 1500 ASSERT(buf->b_efunc == NULL);
1508 1501
1509 1502 /* clean up the buf */
1510 1503 buf->b_hdr = NULL;
1511 1504 kmem_cache_free(buf_cache, buf);
1512 1505 }
1513 1506
1514 1507 static void
1515 1508 arc_hdr_destroy(arc_buf_hdr_t *hdr)
1516 1509 {
1517 1510 ASSERT(refcount_is_zero(&hdr->b_refcnt));
1518 1511 ASSERT3P(hdr->b_state, ==, arc_anon);
1519 1512 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1520 1513 l2arc_buf_hdr_t *l2hdr = hdr->b_l2hdr;
1521 1514
1522 1515 if (l2hdr != NULL) {
1523 1516 boolean_t buflist_held = MUTEX_HELD(&l2arc_buflist_mtx);
1524 1517 /*
1525 1518 * To prevent arc_free() and l2arc_evict() from
1526 1519 * attempting to free the same buffer at the same time,
1527 1520 * a FREE_IN_PROGRESS flag is given to arc_free() to
1528 1521 * give it priority. l2arc_evict() can't destroy this
1529 1522 * header while we are waiting on l2arc_buflist_mtx.
1530 1523 *
1531 1524 * The hdr may be removed from l2ad_buflist before we
1532 1525 * grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
1533 1526 */
1534 1527 if (!buflist_held) {
1535 1528 mutex_enter(&l2arc_buflist_mtx);
1536 1529 l2hdr = hdr->b_l2hdr;
1537 1530 }
1538 1531
1539 1532 if (l2hdr != NULL) {
1540 1533 list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
1541 1534 ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
1542 1535 kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
1543 1536 if (hdr->b_state == arc_l2c_only)
1544 1537 l2arc_hdr_stat_remove();
1545 1538 hdr->b_l2hdr = NULL;
1546 1539 }
1547 1540
1548 1541 if (!buflist_held)
1549 1542 mutex_exit(&l2arc_buflist_mtx);
1550 1543 }
1551 1544
1552 1545 if (!BUF_EMPTY(hdr)) {
1553 1546 ASSERT(!HDR_IN_HASH_TABLE(hdr));
1554 1547 buf_discard_identity(hdr);
1555 1548 }
1556 1549 while (hdr->b_buf) {
1557 1550 arc_buf_t *buf = hdr->b_buf;
1558 1551
1559 1552 if (buf->b_efunc) {
1560 1553 mutex_enter(&arc_eviction_mtx);
1561 1554 mutex_enter(&buf->b_evict_lock);
1562 1555 ASSERT(buf->b_hdr != NULL);
1563 1556 arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
1564 1557 hdr->b_buf = buf->b_next;
1565 1558 buf->b_hdr = &arc_eviction_hdr;
1566 1559 buf->b_next = arc_eviction_list;
1567 1560 arc_eviction_list = buf;
1568 1561 mutex_exit(&buf->b_evict_lock);
1569 1562 mutex_exit(&arc_eviction_mtx);
1570 1563 } else {
1571 1564 arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
1572 1565 }
1573 1566 }
1574 1567 if (hdr->b_freeze_cksum != NULL) {
1575 1568 kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1576 1569 hdr->b_freeze_cksum = NULL;
1577 1570 }
1578 1571 if (hdr->b_thawed) {
1579 1572 kmem_free(hdr->b_thawed, 1);
1580 1573 hdr->b_thawed = NULL;
1581 1574 }
1582 1575
1583 1576 ASSERT(!list_link_active(&hdr->b_arc_node));
1584 1577 ASSERT3P(hdr->b_hash_next, ==, NULL);
1585 1578 ASSERT3P(hdr->b_acb, ==, NULL);
1586 1579 kmem_cache_free(hdr_cache, hdr);
1587 1580 }
1588 1581
1589 1582 void
1590 1583 arc_buf_free(arc_buf_t *buf, void *tag)
1591 1584 {
1592 1585 arc_buf_hdr_t *hdr = buf->b_hdr;
1593 1586 int hashed = hdr->b_state != arc_anon;
1594 1587
1595 1588 ASSERT(buf->b_efunc == NULL);
1596 1589 ASSERT(buf->b_data != NULL);
1597 1590
1598 1591 if (hashed) {
1599 1592 kmutex_t *hash_lock = HDR_LOCK(hdr);
1600 1593
1601 1594 mutex_enter(hash_lock);
1602 1595 hdr = buf->b_hdr;
1603 1596 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1604 1597
1605 1598 (void) remove_reference(hdr, hash_lock, tag);
1606 1599 if (hdr->b_datacnt > 1) {
1607 1600 arc_buf_destroy(buf, FALSE, TRUE);
1608 1601 } else {
1609 1602 ASSERT(buf == hdr->b_buf);
1610 1603 ASSERT(buf->b_efunc == NULL);
1611 1604 hdr->b_flags |= ARC_BUF_AVAILABLE;
1612 1605 }
1613 1606 mutex_exit(hash_lock);
1614 1607 } else if (HDR_IO_IN_PROGRESS(hdr)) {
1615 1608 int destroy_hdr;
1616 1609 /*
1617 1610 * We are in the middle of an async write. Don't destroy
1618 1611 * this buffer unless the write completes before we finish
1619 1612 * decrementing the reference count.
1620 1613 */
1621 1614 mutex_enter(&arc_eviction_mtx);
1622 1615 (void) remove_reference(hdr, NULL, tag);
1623 1616 ASSERT(refcount_is_zero(&hdr->b_refcnt));
1624 1617 destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
1625 1618 mutex_exit(&arc_eviction_mtx);
1626 1619 if (destroy_hdr)
1627 1620 arc_hdr_destroy(hdr);
1628 1621 } else {
1629 1622 if (remove_reference(hdr, NULL, tag) > 0)
1630 1623 arc_buf_destroy(buf, FALSE, TRUE);
1631 1624 else
1632 1625 arc_hdr_destroy(hdr);
1633 1626 }
1634 1627 }
1635 1628
1636 1629 boolean_t
1637 1630 arc_buf_remove_ref(arc_buf_t *buf, void* tag)
1638 1631 {
1639 1632 arc_buf_hdr_t *hdr = buf->b_hdr;
1640 1633 kmutex_t *hash_lock = HDR_LOCK(hdr);
1641 1634 boolean_t no_callback = (buf->b_efunc == NULL);
1642 1635
1643 1636 if (hdr->b_state == arc_anon) {
1644 1637 ASSERT(hdr->b_datacnt == 1);
1645 1638 arc_buf_free(buf, tag);
1646 1639 return (no_callback);
1647 1640 }
1648 1641
1649 1642 mutex_enter(hash_lock);
1650 1643 hdr = buf->b_hdr;
1651 1644 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1652 1645 ASSERT(hdr->b_state != arc_anon);
1653 1646 ASSERT(buf->b_data != NULL);
1654 1647
1655 1648 (void) remove_reference(hdr, hash_lock, tag);
1656 1649 if (hdr->b_datacnt > 1) {
1657 1650 if (no_callback)
1658 1651 arc_buf_destroy(buf, FALSE, TRUE);
1659 1652 } else if (no_callback) {
1660 1653 ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
1661 1654 ASSERT(buf->b_efunc == NULL);
1662 1655 hdr->b_flags |= ARC_BUF_AVAILABLE;
1663 1656 }
1664 1657 ASSERT(no_callback || hdr->b_datacnt > 1 ||
1665 1658 refcount_is_zero(&hdr->b_refcnt));
1666 1659 mutex_exit(hash_lock);
1667 1660 return (no_callback);
1668 1661 }
1669 1662
1670 1663 int
1671 1664 arc_buf_size(arc_buf_t *buf)
1672 1665 {
1673 1666 return (buf->b_hdr->b_size);
1674 1667 }
1675 1668
1676 1669 /*
1677 1670 * Called from the DMU to determine if the current buffer should be
1678 1671 * evicted. In order to ensure proper locking, the eviction must be initiated
1679 1672 * from the DMU. Return true if the buffer is associated with user data and
1680 1673 * duplicate buffers still exist.
1681 1674 */
1682 1675 boolean_t
1683 1676 arc_buf_eviction_needed(arc_buf_t *buf)
1684 1677 {
1685 1678 arc_buf_hdr_t *hdr;
1686 1679 boolean_t evict_needed = B_FALSE;
1687 1680
1688 1681 if (zfs_disable_dup_eviction)
1689 1682 return (B_FALSE);
1690 1683
1691 1684 mutex_enter(&buf->b_evict_lock);
1692 1685 hdr = buf->b_hdr;
1693 1686 if (hdr == NULL) {
1694 1687 /*
1695 1688 * We are in arc_do_user_evicts(); let that function
1696 1689 * perform the eviction.
1697 1690 */
1698 1691 ASSERT(buf->b_data == NULL);
1699 1692 mutex_exit(&buf->b_evict_lock);
1700 1693 return (B_FALSE);
1701 1694 } else if (buf->b_data == NULL) {
1702 1695 /*
1703 1696 * We have already been added to the arc eviction list;
1704 1697 * recommend eviction.
1705 1698 */
1706 1699 ASSERT3P(hdr, ==, &arc_eviction_hdr);
1707 1700 mutex_exit(&buf->b_evict_lock);
1708 1701 return (B_TRUE);
1709 1702 }
1710 1703
1711 1704 if (hdr->b_datacnt > 1 && hdr->b_type == ARC_BUFC_DATA)
1712 1705 evict_needed = B_TRUE;
1713 1706
1714 1707 mutex_exit(&buf->b_evict_lock);
1715 1708 return (evict_needed);
1716 1709 }
1717 1710
1718 1711 /*
1719 1712 * Evict buffers from list until we've removed the specified number of
1720 1713 * bytes. Move the removed buffers to the appropriate evict state.
1721 1714 * If the recycle flag is set, then attempt to "recycle" a buffer:
1722 1715 * - look for a buffer to evict that is `bytes' long.
1723 1716 * - return the data block from this buffer rather than freeing it.
1724 1717 * This flag is used by callers that are trying to make space for a
1725 1718 * new buffer in a full arc cache.
1726 1719 *
1727 1720 * This function makes a "best effort". It skips over any buffers
1728 1721 * it can't get a hash_lock on, and so may not catch all candidates.
1729 1722 * It may also return without evicting as much space as requested.
1730 1723 */
1731 1724 static void *
1732 1725 arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
1733 1726 arc_buf_contents_t type)
1734 1727 {
1735 1728 arc_state_t *evicted_state;
1736 1729 uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
1737 1730 arc_buf_hdr_t *ab, *ab_prev = NULL;
1738 1731 list_t *list = &state->arcs_list[type];
1739 1732 kmutex_t *hash_lock;
1740 1733 boolean_t have_lock;
1741 1734 void *stolen = NULL;
1742 1735
1743 1736 ASSERT(state == arc_mru || state == arc_mfu);
1744 1737
1745 1738 evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
1746 1739
1747 1740 mutex_enter(&state->arcs_mtx);
1748 1741 mutex_enter(&evicted_state->arcs_mtx);
1749 1742
1750 1743 for (ab = list_tail(list); ab; ab = ab_prev) {
1751 1744 ab_prev = list_prev(list, ab);
1752 1745 /* prefetch buffers have a minimum lifespan */
1753 1746 if (HDR_IO_IN_PROGRESS(ab) ||
1754 1747 (spa && ab->b_spa != spa) ||
1755 1748 (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
1756 1749 ddi_get_lbolt() - ab->b_arc_access <
1757 1750 arc_min_prefetch_lifespan)) {
1758 1751 skipped++;
1759 1752 continue;
1760 1753 }
1761 1754 /* "lookahead" for better eviction candidate */
1762 1755 if (recycle && ab->b_size != bytes &&
1763 1756 ab_prev && ab_prev->b_size == bytes)
1764 1757 continue;
1765 1758 hash_lock = HDR_LOCK(ab);
1766 1759 have_lock = MUTEX_HELD(hash_lock);
1767 1760 if (have_lock || mutex_tryenter(hash_lock)) {
1768 1761 ASSERT0(refcount_count(&ab->b_refcnt));
1769 1762 ASSERT(ab->b_datacnt > 0);
1770 1763 while (ab->b_buf) {
1771 1764 arc_buf_t *buf = ab->b_buf;
1772 1765 if (!mutex_tryenter(&buf->b_evict_lock)) {
1773 1766 missed += 1;
1774 1767 break;
1775 1768 }
1776 1769 if (buf->b_data) {
1777 1770 bytes_evicted += ab->b_size;
1778 1771 if (recycle && ab->b_type == type &&
1779 1772 ab->b_size == bytes &&
1780 1773 !HDR_L2_WRITING(ab)) {
1781 1774 stolen = buf->b_data;
1782 1775 recycle = FALSE;
1783 1776 }
1784 1777 }
1785 1778 if (buf->b_efunc) {
1786 1779 mutex_enter(&arc_eviction_mtx);
1787 1780 arc_buf_destroy(buf,
1788 1781 buf->b_data == stolen, FALSE);
1789 1782 ab->b_buf = buf->b_next;
1790 1783 buf->b_hdr = &arc_eviction_hdr;
1791 1784 buf->b_next = arc_eviction_list;
1792 1785 arc_eviction_list = buf;
1793 1786 mutex_exit(&arc_eviction_mtx);
1794 1787 mutex_exit(&buf->b_evict_lock);
1795 1788 } else {
1796 1789 mutex_exit(&buf->b_evict_lock);
1797 1790 arc_buf_destroy(buf,
1798 1791 buf->b_data == stolen, TRUE);
1799 1792 }
1800 1793 }
1801 1794
1802 1795 if (ab->b_l2hdr) {
1803 1796 ARCSTAT_INCR(arcstat_evict_l2_cached,
1804 1797 ab->b_size);
1805 1798 } else {
1806 1799 if (l2arc_write_eligible(ab->b_spa, ab)) {
1807 1800 ARCSTAT_INCR(arcstat_evict_l2_eligible,
1808 1801 ab->b_size);
1809 1802 } else {
1810 1803 ARCSTAT_INCR(
1811 1804 arcstat_evict_l2_ineligible,
1812 1805 ab->b_size);
1813 1806 }
1814 1807 }
1815 1808
1816 1809 if (ab->b_datacnt == 0) {
1817 1810 arc_change_state(evicted_state, ab, hash_lock);
1818 1811 ASSERT(HDR_IN_HASH_TABLE(ab));
1819 1812 ab->b_flags |= ARC_IN_HASH_TABLE;
1820 1813 ab->b_flags &= ~ARC_BUF_AVAILABLE;
1821 1814 DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
1822 1815 }
1823 1816 if (!have_lock)
1824 1817 mutex_exit(hash_lock);
1825 1818 if (bytes >= 0 && bytes_evicted >= bytes)
1826 1819 break;
1827 1820 } else {
1828 1821 missed += 1;
1829 1822 }
1830 1823 }
1831 1824
1832 1825 mutex_exit(&evicted_state->arcs_mtx);
1833 1826 mutex_exit(&state->arcs_mtx);
1834 1827
1835 1828 if (bytes_evicted < bytes)
1836 1829 dprintf("only evicted %lld bytes from %x",
1837 1830 (longlong_t)bytes_evicted, state);
1838 1831
1839 1832 if (skipped)
1840 1833 ARCSTAT_INCR(arcstat_evict_skip, skipped);
1841 1834
1842 1835 if (missed)
1843 1836 ARCSTAT_INCR(arcstat_mutex_miss, missed);
1844 1837
1845 1838 /*
1846 1839 * We have just evicted some data into the ghost state, make
1847 1840 * sure we also adjust the ghost state size if necessary.
1848 1841 */
1849 1842 if (arc_no_grow &&
1850 1843 arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
1851 1844 int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
1852 1845 arc_mru_ghost->arcs_size - arc_c;
1853 1846
1854 1847 if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
1855 1848 int64_t todelete =
1856 1849 MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
1857 1850 arc_evict_ghost(arc_mru_ghost, NULL, todelete);
1858 1851 } else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
1859 1852 int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
1860 1853 arc_mru_ghost->arcs_size +
1861 1854 arc_mfu_ghost->arcs_size - arc_c);
1862 1855 arc_evict_ghost(arc_mfu_ghost, NULL, todelete);
1863 1856 }
1864 1857 }
1865 1858
1866 1859 return (stolen);
1867 1860 }
1868 1861
1869 1862 /*
1870 1863 * Remove buffers from list until we've removed the specified number of
1871 1864 * bytes. Destroy the buffers that are removed.
1872 1865 */
1873 1866 static void
1874 1867 arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes)
1875 1868 {
1876 1869 arc_buf_hdr_t *ab, *ab_prev;
1877 1870 arc_buf_hdr_t marker = { 0 };
1878 1871 list_t *list = &state->arcs_list[ARC_BUFC_DATA];
1879 1872 kmutex_t *hash_lock;
1880 1873 uint64_t bytes_deleted = 0;
1881 1874 uint64_t bufs_skipped = 0;
1882 1875
1883 1876 ASSERT(GHOST_STATE(state));
1884 1877 top:
1885 1878 mutex_enter(&state->arcs_mtx);
1886 1879 for (ab = list_tail(list); ab; ab = ab_prev) {
1887 1880 ab_prev = list_prev(list, ab);
1888 1881 if (spa && ab->b_spa != spa)
1889 1882 continue;
1890 1883
1891 1884 /* ignore markers */
1892 1885 if (ab->b_spa == 0)
1893 1886 continue;
1894 1887
1895 1888 hash_lock = HDR_LOCK(ab);
1896 1889 /* caller may be trying to modify this buffer, skip it */
1897 1890 if (MUTEX_HELD(hash_lock))
1898 1891 continue;
1899 1892 if (mutex_tryenter(hash_lock)) {
1900 1893 ASSERT(!HDR_IO_IN_PROGRESS(ab));
1901 1894 ASSERT(ab->b_buf == NULL);
1902 1895 ARCSTAT_BUMP(arcstat_deleted);
1903 1896 bytes_deleted += ab->b_size;
1904 1897
1905 1898 if (ab->b_l2hdr != NULL) {
1906 1899 /*
1907 1900 * This buffer is cached on the 2nd Level ARC;
1908 1901 * don't destroy the header.
1909 1902 */
1910 1903 arc_change_state(arc_l2c_only, ab, hash_lock);
1911 1904 mutex_exit(hash_lock);
1912 1905 } else {
1913 1906 arc_change_state(arc_anon, ab, hash_lock);
1914 1907 mutex_exit(hash_lock);
1915 1908 arc_hdr_destroy(ab);
1916 1909 }
1917 1910
1918 1911 DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
1919 1912 if (bytes >= 0 && bytes_deleted >= bytes)
1920 1913 break;
1921 1914 } else if (bytes < 0) {
1922 1915 /*
1923 1916 * Insert a list marker and then wait for the
1924 1917 * hash lock to become available. Once it's
1925 1918 * available, restart from where we left off.
1926 1919 */
1927 1920 list_insert_after(list, ab, &marker);
1928 1921 mutex_exit(&state->arcs_mtx);
1929 1922 mutex_enter(hash_lock);
1930 1923 mutex_exit(hash_lock);
1931 1924 mutex_enter(&state->arcs_mtx);
1932 1925 ab_prev = list_prev(list, &marker);
1933 1926 list_remove(list, &marker);
1934 1927 } else
1935 1928 bufs_skipped += 1;
1936 1929 }
1937 1930 mutex_exit(&state->arcs_mtx);
1938 1931
1939 1932 if (list == &state->arcs_list[ARC_BUFC_DATA] &&
1940 1933 (bytes < 0 || bytes_deleted < bytes)) {
1941 1934 list = &state->arcs_list[ARC_BUFC_METADATA];
1942 1935 goto top;
1943 1936 }
1944 1937
1945 1938 if (bufs_skipped) {
1946 1939 ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
1947 1940 ASSERT(bytes >= 0);
1948 1941 }
1949 1942
1950 1943 if (bytes_deleted < bytes)
1951 1944 dprintf("only deleted %lld bytes from %p",
1952 1945 (longlong_t)bytes_deleted, state);
1953 1946 }
1954 1947
1955 1948 static void
1956 1949 arc_adjust(void)
1957 1950 {
1958 1951 int64_t adjustment, delta;
1959 1952
1960 1953 /*
1961 1954 * Adjust MRU size
1962 1955 */
1963 1956
1964 1957 adjustment = MIN((int64_t)(arc_size - arc_c),
1965 1958 (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used -
1966 1959 arc_p));
1967 1960
1968 1961 if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
1969 1962 delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
1970 1963 (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
1971 1964 adjustment -= delta;
1972 1965 }
1973 1966
1974 1967 if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
1975 1968 delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
1976 1969 (void) arc_evict(arc_mru, NULL, delta, FALSE,
1977 1970 ARC_BUFC_METADATA);
1978 1971 }
1979 1972
1980 1973 /*
1981 1974 * Adjust MFU size
1982 1975 */
1983 1976
1984 1977 adjustment = arc_size - arc_c;
1985 1978
1986 1979 if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
1987 1980 delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
1988 1981 (void) arc_evict(arc_mfu, NULL, delta, FALSE, ARC_BUFC_DATA);
1989 1982 adjustment -= delta;
1990 1983 }
1991 1984
1992 1985 if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
1993 1986 int64_t delta = MIN(adjustment,
1994 1987 arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
1995 1988 (void) arc_evict(arc_mfu, NULL, delta, FALSE,
1996 1989 ARC_BUFC_METADATA);
1997 1990 }
1998 1991
1999 1992 /*
2000 1993 * Adjust ghost lists
2001 1994 */
2002 1995
2003 1996 adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
2004 1997
2005 1998 if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
2006 1999 delta = MIN(arc_mru_ghost->arcs_size, adjustment);
2007 2000 arc_evict_ghost(arc_mru_ghost, NULL, delta);
2008 2001 }
2009 2002
2010 2003 adjustment =
2011 2004 arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
2012 2005
2013 2006 if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
2014 2007 delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
2015 2008 arc_evict_ghost(arc_mfu_ghost, NULL, delta);
2016 2009 }
2017 2010 }
2018 2011
2019 2012 static void
2020 2013 arc_do_user_evicts(void)
2021 2014 {
2022 2015 mutex_enter(&arc_eviction_mtx);
2023 2016 while (arc_eviction_list != NULL) {
2024 2017 arc_buf_t *buf = arc_eviction_list;
2025 2018 arc_eviction_list = buf->b_next;
2026 2019 mutex_enter(&buf->b_evict_lock);
2027 2020 buf->b_hdr = NULL;
2028 2021 mutex_exit(&buf->b_evict_lock);
2029 2022 mutex_exit(&arc_eviction_mtx);
2030 2023
2031 2024 if (buf->b_efunc != NULL)
2032 2025 VERIFY(buf->b_efunc(buf) == 0);
2033 2026
2034 2027 buf->b_efunc = NULL;
2035 2028 buf->b_private = NULL;
2036 2029 kmem_cache_free(buf_cache, buf);
2037 2030 mutex_enter(&arc_eviction_mtx);
2038 2031 }
2039 2032 mutex_exit(&arc_eviction_mtx);
2040 2033 }
2041 2034
2042 2035 /*
2043 2036 * Flush all *evictable* data from the cache for the given spa.
2044 2037 * NOTE: this will not touch "active" (i.e. referenced) data.
2045 2038 */
2046 2039 void
2047 2040 arc_flush(spa_t *spa)
2048 2041 {
2049 2042 uint64_t guid = 0;
2050 2043
2051 2044 if (spa)
2052 2045 guid = spa_load_guid(spa);
2053 2046
2054 2047 while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {
2055 2048 (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);
2056 2049 if (spa)
2057 2050 break;
2058 2051 }
2059 2052 while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {
2060 2053 (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);
2061 2054 if (spa)
2062 2055 break;
2063 2056 }
2064 2057 while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {
2065 2058 (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);
2066 2059 if (spa)
2067 2060 break;
2068 2061 }
2069 2062 while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {
2070 2063 (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);
2071 2064 if (spa)
2072 2065 break;
2073 2066 }
2074 2067
2075 2068 arc_evict_ghost(arc_mru_ghost, guid, -1);
2076 2069 arc_evict_ghost(arc_mfu_ghost, guid, -1);
2077 2070
2078 2071 mutex_enter(&arc_reclaim_thr_lock);
2079 2072 arc_do_user_evicts();
2080 2073 mutex_exit(&arc_reclaim_thr_lock);
2081 2074 ASSERT(spa || arc_eviction_list == NULL);
2082 2075 }
2083 2076
2084 2077 void
2085 2078 arc_shrink(void)
2086 2079 {
2087 2080 if (arc_c > arc_c_min) {
2088 2081 uint64_t to_free;
2089 2082
2090 2083 #ifdef _KERNEL
2091 2084 to_free = MAX(arc_c >> arc_shrink_shift, ptob(needfree));
2092 2085 #else
2093 2086 to_free = arc_c >> arc_shrink_shift;
2094 2087 #endif
2095 2088 if (arc_c > arc_c_min + to_free)
2096 2089 atomic_add_64(&arc_c, -to_free);
2097 2090 else
2098 2091 arc_c = arc_c_min;
2099 2092
2100 2093 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
2101 2094 if (arc_c > arc_size)
2102 2095 arc_c = MAX(arc_size, arc_c_min);
2103 2096 if (arc_p > arc_c)
2104 2097 arc_p = (arc_c >> 1);
2105 2098 ASSERT(arc_c >= arc_c_min);
2106 2099 ASSERT((int64_t)arc_p >= 0);
2107 2100 }
2108 2101
2109 2102 if (arc_size > arc_c)
2110 2103 arc_adjust();
2111 2104 }
2112 2105
2113 2106 /*
2114 2107 * Determine if the system is under memory pressure and is asking
2115 2108 * to reclaim memory. A return value of 1 indicates that the system
2116 2109 * is under memory pressure and that the arc should adjust accordingly.
2117 2110 */
2118 2111 static int
2119 2112 arc_reclaim_needed(void)
2120 2113 {
2121 2114 uint64_t extra;
2122 2115
2123 2116 #ifdef _KERNEL
2124 2117
2125 2118 if (needfree)
2126 2119 return (1);
2127 2120
2128 2121 /*
2129 2122 * take 'desfree' extra pages, so we reclaim sooner, rather than later
2130 2123 */
2131 2124 extra = desfree;
2132 2125
2133 2126 /*
2134 2127 * check that we're out of range of the pageout scanner. It starts to
2135 2128 * schedule paging if freemem is less than lotsfree and needfree.
2136 2129 * lotsfree is the high-water mark for pageout, and needfree is the
2137 2130 * number of needed free pages. We add extra pages here to make sure
2138 2131 * the scanner doesn't start up while we're freeing memory.
2139 2132 */
2140 2133 if (freemem < lotsfree + needfree + extra)
2141 2134 return (1);
2142 2135
2143 2136 /*
2144 2137 * check to make sure that swapfs has enough space so that anon
2145 2138 * reservations can still succeed. anon_resvmem() checks that the
2146 2139 * availrmem is greater than swapfs_minfree, and the number of reserved
2147 2140 * swap pages. We also add a bit of extra here just to prevent
2148 2141 * circumstances from getting really dire.
2149 2142 */
2150 2143 if (availrmem < swapfs_minfree + swapfs_reserve + extra)
2151 2144 return (1);
2152 2145
2153 2146 #if defined(__i386)
2154 2147 /*
2155 2148 * If we're on an i386 platform, it's possible that we'll exhaust the
2156 2149 * kernel heap space before we ever run out of available physical
2157 2150 * memory. Most checks of the size of the heap_area compare against
2158 2151 * tune.t_minarmem, which is the minimum available real memory that we
2159 2152 * can have in the system. However, this is generally fixed at 25 pages
2160 2153 * which is so low that it's useless. In this comparison, we seek to
2161 2154 * calculate the total heap-size, and reclaim if more than 3/4ths of the
2162 2155 * heap is allocated. (Or, in the calculation, if less than 1/4th is
2163 2156 * free)
2164 2157 */
2165 2158 if (vmem_size(heap_arena, VMEM_FREE) <
2166 2159 (vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2))
2167 2160 return (1);
2168 2161 #endif
2169 2162
2170 2163 /*
2171 2164 * If zio data pages are being allocated out of a separate heap segment,
2172 2165 * then enforce that the size of available vmem for this arena remains
2173 2166 * above about 1/16th free.
2174 2167 *
2175 2168 * Note: The 1/16th arena free requirement was put in place
2176 2169 * to aggressively evict memory from the arc in order to avoid
2177 2170 * memory fragmentation issues.
2178 2171 */
2179 2172 if (zio_arena != NULL &&
2180 2173 vmem_size(zio_arena, VMEM_FREE) <
2181 2174 (vmem_size(zio_arena, VMEM_ALLOC) >> 4))
2182 2175 return (1);
2183 2176 #else
2184 2177 if (spa_get_random(100) == 0)
2185 2178 return (1);
2186 2179 #endif
2187 2180 return (0);
2188 2181 }
2189 2182
2190 2183 static void
2191 2184 arc_kmem_reap_now(arc_reclaim_strategy_t strat)
2192 2185 {
2193 2186 size_t i;
2194 2187 kmem_cache_t *prev_cache = NULL;
2195 2188 kmem_cache_t *prev_data_cache = NULL;
2196 2189 extern kmem_cache_t *zio_buf_cache[];
2197 2190 extern kmem_cache_t *zio_data_buf_cache[];
2198 2191
2199 2192 #ifdef _KERNEL
2200 2193 if (arc_meta_used >= arc_meta_limit) {
2201 2194 /*
2202 2195 * We are exceeding our meta-data cache limit.
2203 2196 * Purge some DNLC entries to release holds on meta-data.
2204 2197 */
2205 2198 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
2206 2199 }
2207 2200 #if defined(__i386)
2208 2201 /*
2209 2202 * Reclaim unused memory from all kmem caches.
2210 2203 */
2211 2204 kmem_reap();
2212 2205 #endif
2213 2206 #endif
2214 2207
2215 2208 /*
2216 2209 * An aggressive reclamation will shrink the cache size as well as
2217 2210 * reap free buffers from the arc kmem caches.
2218 2211 */
2219 2212 if (strat == ARC_RECLAIM_AGGR)
2220 2213 arc_shrink();
2221 2214
2222 2215 for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
2223 2216 if (zio_buf_cache[i] != prev_cache) {
2224 2217 prev_cache = zio_buf_cache[i];
2225 2218 kmem_cache_reap_now(zio_buf_cache[i]);
2226 2219 }
2227 2220 if (zio_data_buf_cache[i] != prev_data_cache) {
2228 2221 prev_data_cache = zio_data_buf_cache[i];
2229 2222 kmem_cache_reap_now(zio_data_buf_cache[i]);
2230 2223 }
2231 2224 }
2232 2225 kmem_cache_reap_now(buf_cache);
2233 2226 kmem_cache_reap_now(hdr_cache);
2234 2227
2235 2228 /*
2236 2229 * Ask the vmem arena to reclaim unused memory from its
2237 2230 * quantum caches.
2238 2231 */
2239 2232 if (zio_arena != NULL && strat == ARC_RECLAIM_AGGR)
2240 2233 vmem_qcache_reap(zio_arena);
2241 2234 }
2242 2235
2243 2236 static void
2244 2237 arc_reclaim_thread(void)
2245 2238 {
2246 2239 clock_t growtime = 0;
2247 2240 arc_reclaim_strategy_t last_reclaim = ARC_RECLAIM_CONS;
2248 2241 callb_cpr_t cpr;
2249 2242
2250 2243 CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
2251 2244
2252 2245 mutex_enter(&arc_reclaim_thr_lock);
2253 2246 while (arc_thread_exit == 0) {
2254 2247 if (arc_reclaim_needed()) {
2255 2248
2256 2249 if (arc_no_grow) {
2257 2250 if (last_reclaim == ARC_RECLAIM_CONS) {
2258 2251 last_reclaim = ARC_RECLAIM_AGGR;
2259 2252 } else {
2260 2253 last_reclaim = ARC_RECLAIM_CONS;
2261 2254 }
2262 2255 } else {
2263 2256 arc_no_grow = TRUE;
2264 2257 last_reclaim = ARC_RECLAIM_AGGR;
2265 2258 membar_producer();
2266 2259 }
2267 2260
2268 2261 /* reset the growth delay for every reclaim */
2269 2262 growtime = ddi_get_lbolt() + (arc_grow_retry * hz);
2270 2263
2271 2264 arc_kmem_reap_now(last_reclaim);
2272 2265 arc_warm = B_TRUE;
2273 2266
2274 2267 } else if (arc_no_grow && ddi_get_lbolt() >= growtime) {
2275 2268 arc_no_grow = FALSE;
2276 2269 }
2277 2270
2278 2271 arc_adjust();
2279 2272
2280 2273 if (arc_eviction_list != NULL)
2281 2274 arc_do_user_evicts();
2282 2275
2283 2276 /* block until needed, or one second, whichever is shorter */
2284 2277 CALLB_CPR_SAFE_BEGIN(&cpr);
2285 2278 (void) cv_timedwait(&arc_reclaim_thr_cv,
2286 2279 &arc_reclaim_thr_lock, (ddi_get_lbolt() + hz));
2287 2280 CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
2288 2281 }
2289 2282
2290 2283 arc_thread_exit = 0;
2291 2284 cv_broadcast(&arc_reclaim_thr_cv);
2292 2285 CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_thr_lock */
2293 2286 thread_exit();
2294 2287 }
2295 2288
2296 2289 /*
2297 2290 * Adapt arc info given the number of bytes we are trying to add and
2298 2291 * the state that we are coming from. This function is only called
2299 2292 * when we are adding new content to the cache.
2300 2293 */
2301 2294 static void
2302 2295 arc_adapt(int bytes, arc_state_t *state)
2303 2296 {
2304 2297 int mult;
2305 2298 uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
2306 2299
2307 2300 if (state == arc_l2c_only)
2308 2301 return;
2309 2302
2310 2303 ASSERT(bytes > 0);
2311 2304 /*
2312 2305 * Adapt the target size of the MRU list:
2313 2306 * - if we just hit in the MRU ghost list, then increase
2314 2307 * the target size of the MRU list.
2315 2308 * - if we just hit in the MFU ghost list, then increase
2316 2309 * the target size of the MFU list by decreasing the
2317 2310 * target size of the MRU list.
2318 2311 */
2319 2312 if (state == arc_mru_ghost) {
2320 2313 mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
2321 2314 1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
2322 2315 mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
2323 2316
2324 2317 arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
2325 2318 } else if (state == arc_mfu_ghost) {
2326 2319 uint64_t delta;
2327 2320
2328 2321 mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
2329 2322 1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
2330 2323 mult = MIN(mult, 10);
2331 2324
2332 2325 delta = MIN(bytes * mult, arc_p);
2333 2326 arc_p = MAX(arc_p_min, arc_p - delta);
2334 2327 }
2335 2328 ASSERT((int64_t)arc_p >= 0);
2336 2329
2337 2330 if (arc_reclaim_needed()) {
2338 2331 cv_signal(&arc_reclaim_thr_cv);
2339 2332 return;
2340 2333 }
2341 2334
2342 2335 if (arc_no_grow)
2343 2336 return;
2344 2337
2345 2338 if (arc_c >= arc_c_max)
2346 2339 return;
2347 2340
2348 2341 /*
2349 2342 * If we're within (2 * maxblocksize) bytes of the target
2350 2343 * cache size, increment the target cache size
2351 2344 */
2352 2345 if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
2353 2346 atomic_add_64(&arc_c, (int64_t)bytes);
2354 2347 if (arc_c > arc_c_max)
2355 2348 arc_c = arc_c_max;
2356 2349 else if (state == arc_anon)
2357 2350 atomic_add_64(&arc_p, (int64_t)bytes);
2358 2351 if (arc_p > arc_c)
2359 2352 arc_p = arc_c;
2360 2353 }
2361 2354 ASSERT((int64_t)arc_p >= 0);
2362 2355 }
2363 2356
2364 2357 /*
2365 2358 * Check if the cache has reached its limits and eviction is required
2366 2359 * prior to insert.
2367 2360 */
2368 2361 static int
2369 2362 arc_evict_needed(arc_buf_contents_t type)
2370 2363 {
2371 2364 if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
2372 2365 return (1);
2373 2366
2374 2367 if (arc_reclaim_needed())
2375 2368 return (1);
2376 2369
2377 2370 return (arc_size > arc_c);
2378 2371 }
2379 2372
2380 2373 /*
2381 2374 * The buffer, supplied as the first argument, needs a data block.
2382 2375 * So, if we are at cache max, determine which cache should be victimized.
2383 2376 * We have the following cases:
2384 2377 *
2385 2378 * 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
2386 2379 * In this situation if we're out of space, but the resident size of the MFU is
2387 2380 * under the limit, victimize the MFU cache to satisfy this insertion request.
2388 2381 *
2389 2382 * 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
2390 2383 * Here, we've used up all of the available space for the MRU, so we need to
2391 2384 * evict from our own cache instead. Evict from the set of resident MRU
2392 2385 * entries.
2393 2386 *
2394 2387 * 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
2395 2388 * c minus p represents the MFU space in the cache, since p is the size of the
2396 2389 * cache that is dedicated to the MRU. In this situation there's still space on
2397 2390 * the MFU side, so the MRU side needs to be victimized.
2398 2391 *
2399 2392 * 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
2400 2393 * MFU's resident set is consuming more space than it has been allotted. In
2401 2394 * this situation, we must victimize our own cache, the MFU, for this insertion.
2402 2395 */
2403 2396 static void
2404 2397 arc_get_data_buf(arc_buf_t *buf)
2405 2398 {
2406 2399 arc_state_t *state = buf->b_hdr->b_state;
2407 2400 uint64_t size = buf->b_hdr->b_size;
2408 2401 arc_buf_contents_t type = buf->b_hdr->b_type;
2409 2402
2410 2403 arc_adapt(size, state);
2411 2404
2412 2405 /*
2413 2406 * We have not yet reached cache maximum size,
2414 2407 * just allocate a new buffer.
2415 2408 */
2416 2409 if (!arc_evict_needed(type)) {
2417 2410 if (type == ARC_BUFC_METADATA) {
2418 2411 buf->b_data = zio_buf_alloc(size);
2419 2412 arc_space_consume(size, ARC_SPACE_DATA);
2420 2413 } else {
2421 2414 ASSERT(type == ARC_BUFC_DATA);
2422 2415 buf->b_data = zio_data_buf_alloc(size);
2423 2416 ARCSTAT_INCR(arcstat_data_size, size);
2424 2417 atomic_add_64(&arc_size, size);
2425 2418 }
2426 2419 goto out;
2427 2420 }
2428 2421
2429 2422 /*
2430 2423 * If we are prefetching from the mfu ghost list, this buffer
2431 2424 * will end up on the mru list; so steal space from there.
2432 2425 */
2433 2426 if (state == arc_mfu_ghost)
2434 2427 state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
2435 2428 else if (state == arc_mru_ghost)
2436 2429 state = arc_mru;
2437 2430
2438 2431 if (state == arc_mru || state == arc_anon) {
2439 2432 uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
2440 2433 state = (arc_mfu->arcs_lsize[type] >= size &&
2441 2434 arc_p > mru_used) ? arc_mfu : arc_mru;
2442 2435 } else {
2443 2436 /* MFU cases */
2444 2437 uint64_t mfu_space = arc_c - arc_p;
2445 2438 state = (arc_mru->arcs_lsize[type] >= size &&
2446 2439 mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
2447 2440 }
2448 2441 if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
2449 2442 if (type == ARC_BUFC_METADATA) {
2450 2443 buf->b_data = zio_buf_alloc(size);
2451 2444 arc_space_consume(size, ARC_SPACE_DATA);
2452 2445 } else {
2453 2446 ASSERT(type == ARC_BUFC_DATA);
2454 2447 buf->b_data = zio_data_buf_alloc(size);
2455 2448 ARCSTAT_INCR(arcstat_data_size, size);
2456 2449 atomic_add_64(&arc_size, size);
2457 2450 }
2458 2451 ARCSTAT_BUMP(arcstat_recycle_miss);
2459 2452 }
2460 2453 ASSERT(buf->b_data != NULL);
2461 2454 out:
2462 2455 /*
2463 2456 * Update the state size. Note that ghost states have a
2464 2457 * "ghost size" and so don't need to be updated.
2465 2458 */
2466 2459 if (!GHOST_STATE(buf->b_hdr->b_state)) {
2467 2460 arc_buf_hdr_t *hdr = buf->b_hdr;
2468 2461
2469 2462 atomic_add_64(&hdr->b_state->arcs_size, size);
2470 2463 if (list_link_active(&hdr->b_arc_node)) {
2471 2464 ASSERT(refcount_is_zero(&hdr->b_refcnt));
2472 2465 atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
2473 2466 }
2474 2467 /*
2475 2468 * If we are growing the cache, and we are adding anonymous
2476 2469 * data, and we have outgrown arc_p, update arc_p
2477 2470 */
2478 2471 if (arc_size < arc_c && hdr->b_state == arc_anon &&
2479 2472 arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
2480 2473 arc_p = MIN(arc_c, arc_p + size);
2481 2474 }
2482 2475 }
2483 2476
2484 2477 /*
2485 2478 * This routine is called whenever a buffer is accessed.
2486 2479 * NOTE: the hash lock is dropped in this function.
2487 2480 */
2488 2481 static void
2489 2482 arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
2490 2483 {
2491 2484 clock_t now;
2492 2485
2493 2486 ASSERT(MUTEX_HELD(hash_lock));
2494 2487
2495 2488 if (buf->b_state == arc_anon) {
2496 2489 /*
2497 2490 * This buffer is not in the cache, and does not
2498 2491 * appear in our "ghost" list. Add the new buffer
2499 2492 * to the MRU state.
2500 2493 */
2501 2494
2502 2495 ASSERT(buf->b_arc_access == 0);
2503 2496 buf->b_arc_access = ddi_get_lbolt();
2504 2497 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
2505 2498 arc_change_state(arc_mru, buf, hash_lock);
2506 2499
2507 2500 } else if (buf->b_state == arc_mru) {
2508 2501 now = ddi_get_lbolt();
2509 2502
2510 2503 /*
2511 2504 * If this buffer is here because of a prefetch, then either:
2512 2505 * - clear the flag if this is a "referencing" read
2513 2506 * (any subsequent access will bump this into the MFU state).
2514 2507 * or
2515 2508 * - move the buffer to the head of the list if this is
2516 2509 * another prefetch (to make it less likely to be evicted).
2517 2510 */
2518 2511 if ((buf->b_flags & ARC_PREFETCH) != 0) {
2519 2512 if (refcount_count(&buf->b_refcnt) == 0) {
2520 2513 ASSERT(list_link_active(&buf->b_arc_node));
2521 2514 } else {
2522 2515 buf->b_flags &= ~ARC_PREFETCH;
2523 2516 ARCSTAT_BUMP(arcstat_mru_hits);
2524 2517 }
2525 2518 buf->b_arc_access = now;
2526 2519 return;
2527 2520 }
2528 2521
2529 2522 /*
2530 2523 * This buffer has been "accessed" only once so far,
2531 2524 * but it is still in the cache. Move it to the MFU
2532 2525 * state.
2533 2526 */
2534 2527 if (now > buf->b_arc_access + ARC_MINTIME) {
2535 2528 /*
2536 2529 * More than 125ms have passed since we
2537 2530 * instantiated this buffer. Move it to the
2538 2531 * most frequently used state.
2539 2532 */
2540 2533 buf->b_arc_access = now;
2541 2534 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2542 2535 arc_change_state(arc_mfu, buf, hash_lock);
2543 2536 }
2544 2537 ARCSTAT_BUMP(arcstat_mru_hits);
2545 2538 } else if (buf->b_state == arc_mru_ghost) {
2546 2539 arc_state_t *new_state;
2547 2540 /*
2548 2541 * This buffer has been "accessed" recently, but
2549 2542 * was evicted from the cache. Move it to the
2550 2543 * MFU state.
2551 2544 */
2552 2545
2553 2546 if (buf->b_flags & ARC_PREFETCH) {
2554 2547 new_state = arc_mru;
2555 2548 if (refcount_count(&buf->b_refcnt) > 0)
2556 2549 buf->b_flags &= ~ARC_PREFETCH;
2557 2550 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
2558 2551 } else {
2559 2552 new_state = arc_mfu;
2560 2553 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2561 2554 }
2562 2555
2563 2556 buf->b_arc_access = ddi_get_lbolt();
2564 2557 arc_change_state(new_state, buf, hash_lock);
2565 2558
2566 2559 ARCSTAT_BUMP(arcstat_mru_ghost_hits);
2567 2560 } else if (buf->b_state == arc_mfu) {
2568 2561 /*
2569 2562 * This buffer has been accessed more than once and is
2570 2563 * still in the cache. Keep it in the MFU state.
2571 2564 *
2572 2565 * NOTE: an add_reference() that occurred when we did
2573 2566 * the arc_read() will have kicked this off the list.
2574 2567 * If it was a prefetch, we will explicitly move it to
2575 2568 * the head of the list now.
2576 2569 */
2577 2570 if ((buf->b_flags & ARC_PREFETCH) != 0) {
2578 2571 ASSERT(refcount_count(&buf->b_refcnt) == 0);
2579 2572 ASSERT(list_link_active(&buf->b_arc_node));
2580 2573 }
2581 2574 ARCSTAT_BUMP(arcstat_mfu_hits);
2582 2575 buf->b_arc_access = ddi_get_lbolt();
2583 2576 } else if (buf->b_state == arc_mfu_ghost) {
2584 2577 arc_state_t *new_state = arc_mfu;
2585 2578 /*
2586 2579 * This buffer has been accessed more than once but has
2587 2580 * been evicted from the cache. Move it back to the
2588 2581 * MFU state.
2589 2582 */
2590 2583
2591 2584 if (buf->b_flags & ARC_PREFETCH) {
2592 2585 /*
2593 2586 * This is a prefetch access...
2594 2587 * move this block back to the MRU state.
2595 2588 */
2596 2589 ASSERT0(refcount_count(&buf->b_refcnt));
2597 2590 new_state = arc_mru;
2598 2591 }
2599 2592
2600 2593 buf->b_arc_access = ddi_get_lbolt();
2601 2594 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2602 2595 arc_change_state(new_state, buf, hash_lock);
2603 2596
2604 2597 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
2605 2598 } else if (buf->b_state == arc_l2c_only) {
2606 2599 /*
2607 2600 * This buffer is on the 2nd Level ARC.
2608 2601 */
2609 2602
2610 2603 buf->b_arc_access = ddi_get_lbolt();
2611 2604 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2612 2605 arc_change_state(arc_mfu, buf, hash_lock);
2613 2606 } else {
2614 2607 ASSERT(!"invalid arc state");
2615 2608 }
2616 2609 }
2617 2610
2618 2611 /* a generic arc_done_func_t which you can use */
2619 2612 /* ARGSUSED */
2620 2613 void
2621 2614 arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
2622 2615 {
2623 2616 if (zio == NULL || zio->io_error == 0)
2624 2617 bcopy(buf->b_data, arg, buf->b_hdr->b_size);
2625 2618 VERIFY(arc_buf_remove_ref(buf, arg));
2626 2619 }
2627 2620
2628 2621 /* a generic arc_done_func_t */
2629 2622 void
2630 2623 arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
2631 2624 {
2632 2625 arc_buf_t **bufp = arg;
2633 2626 if (zio && zio->io_error) {
2634 2627 VERIFY(arc_buf_remove_ref(buf, arg));
2635 2628 *bufp = NULL;
2636 2629 } else {
2637 2630 *bufp = buf;
2638 2631 ASSERT(buf->b_data);
2639 2632 }
2640 2633 }
2641 2634
2642 2635 static void
2643 2636 arc_read_done(zio_t *zio)
2644 2637 {
2645 2638 arc_buf_hdr_t *hdr, *found;
2646 2639 arc_buf_t *buf;
2647 2640 arc_buf_t *abuf; /* buffer we're assigning to callback */
2648 2641 kmutex_t *hash_lock;
2649 2642 arc_callback_t *callback_list, *acb;
2650 2643 int freeable = FALSE;
2651 2644
2652 2645 buf = zio->io_private;
2653 2646 hdr = buf->b_hdr;
2654 2647
2655 2648 /*
2656 2649 * The hdr was inserted into hash-table and removed from lists
2657 2650 * prior to starting I/O. We should find this header, since
2658 2651 * it's in the hash table, and it should be legit since it's
2659 2652 * not possible to evict it during the I/O. The only possible
2660 2653 * reason for it not to be found is if we were freed during the
2661 2654 * read.
2662 2655 */
2663 2656 found = buf_hash_find(hdr->b_spa, &hdr->b_dva, hdr->b_birth,
2664 2657 &hash_lock);
2665 2658
2666 2659 ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
2667 2660 (found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
2668 2661 (found == hdr && HDR_L2_READING(hdr)));
2669 2662
2670 2663 hdr->b_flags &= ~ARC_L2_EVICTED;
2671 2664 if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
2672 2665 hdr->b_flags &= ~ARC_L2CACHE;
2673 2666
2674 2667 /* byteswap if necessary */
2675 2668 callback_list = hdr->b_acb;
2676 2669 ASSERT(callback_list != NULL);
2677 2670 if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
2678 2671 dmu_object_byteswap_t bswap =
2679 2672 DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
2680 2673 arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
2681 2674 byteswap_uint64_array :
2682 2675 dmu_ot_byteswap[bswap].ob_func;
2683 2676 func(buf->b_data, hdr->b_size);
2684 2677 }
2685 2678
2686 2679 arc_cksum_compute(buf, B_FALSE);
2687 2680 arc_buf_watch(buf);
2688 2681
2689 2682 if (hash_lock && zio->io_error == 0 && hdr->b_state == arc_anon) {
2690 2683 /*
2691 2684 * Only call arc_access on anonymous buffers. This is because
2692 2685 * if we've issued an I/O for an evicted buffer, we've already
2693 2686 * called arc_access (to prevent any simultaneous readers from
2694 2687 * getting confused).
2695 2688 */
2696 2689 arc_access(hdr, hash_lock);
2697 2690 }
2698 2691
2699 2692 /* create copies of the data buffer for the callers */
2700 2693 abuf = buf;
2701 2694 for (acb = callback_list; acb; acb = acb->acb_next) {
2702 2695 if (acb->acb_done) {
2703 2696 if (abuf == NULL) {
2704 2697 ARCSTAT_BUMP(arcstat_duplicate_reads);
2705 2698 abuf = arc_buf_clone(buf);
2706 2699 }
2707 2700 acb->acb_buf = abuf;
2708 2701 abuf = NULL;
2709 2702 }
2710 2703 }
2711 2704 hdr->b_acb = NULL;
2712 2705 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
2713 2706 ASSERT(!HDR_BUF_AVAILABLE(hdr));
2714 2707 if (abuf == buf) {
2715 2708 ASSERT(buf->b_efunc == NULL);
2716 2709 ASSERT(hdr->b_datacnt == 1);
2717 2710 hdr->b_flags |= ARC_BUF_AVAILABLE;
2718 2711 }
2719 2712
2720 2713 ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);
2721 2714
2722 2715 if (zio->io_error != 0) {
2723 2716 hdr->b_flags |= ARC_IO_ERROR;
2724 2717 if (hdr->b_state != arc_anon)
2725 2718 arc_change_state(arc_anon, hdr, hash_lock);
2726 2719 if (HDR_IN_HASH_TABLE(hdr))
2727 2720 buf_hash_remove(hdr);
2728 2721 freeable = refcount_is_zero(&hdr->b_refcnt);
2729 2722 }
2730 2723
2731 2724 /*
2732 2725 * Broadcast before we drop the hash_lock to avoid the possibility
2733 2726 * that the hdr (and hence the cv) might be freed before we get to
2734 2727 * the cv_broadcast().
2735 2728 */
2736 2729 cv_broadcast(&hdr->b_cv);
2737 2730
2738 2731 if (hash_lock) {
2739 2732 mutex_exit(hash_lock);
2740 2733 } else {
2741 2734 /*
2742 2735 * This block was freed while we waited for the read to
2743 2736 * complete. It has been removed from the hash table and
2744 2737 * moved to the anonymous state (so that it won't show up
2745 2738 * in the cache).
2746 2739 */
2747 2740 ASSERT3P(hdr->b_state, ==, arc_anon);
2748 2741 freeable = refcount_is_zero(&hdr->b_refcnt);
2749 2742 }
2750 2743
2751 2744 /* execute each callback and free its structure */
2752 2745 while ((acb = callback_list) != NULL) {
2753 2746 if (acb->acb_done)
2754 2747 acb->acb_done(zio, acb->acb_buf, acb->acb_private);
2755 2748
2756 2749 if (acb->acb_zio_dummy != NULL) {
2757 2750 acb->acb_zio_dummy->io_error = zio->io_error;
2758 2751 zio_nowait(acb->acb_zio_dummy);
2759 2752 }
2760 2753
2761 2754 callback_list = acb->acb_next;
2762 2755 kmem_free(acb, sizeof (arc_callback_t));
2763 2756 }
2764 2757
2765 2758 if (freeable)
2766 2759 arc_hdr_destroy(hdr);
2767 2760 }
2768 2761
2769 2762 /*
2770 2763 * "Read" the block at the specified DVA (in bp) via the
2771 2764 * cache. If the block is found in the cache, invoke the provided
2772 2765 * callback immediately and return. Note that the `zio' parameter
2773 2766 * in the callback will be NULL in this case, since no IO was
2774 2767 * required. If the block is not in the cache pass the read request
2775 2768 * on to the spa with a substitute callback function, so that the
2776 2769 * requested block will be added to the cache.
2777 2770 *
2778 2771 * If a read request arrives for a block that has a read in-progress,
2779 2772 * either wait for the in-progress read to complete (and return the
2780 2773 * results); or, if this is a read with a "done" func, add a record
2781 2774 * to the read to invoke the "done" func when the read completes,
2782 2775 * and return; or just return.
2783 2776 *
2784 2777 * arc_read_done() will invoke all the requested "done" functions
2785 2778 * for readers of this block.
2786 2779 */
2787 2780 int
2788 2781 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
2789 2782 void *private, int priority, int zio_flags, uint32_t *arc_flags,
2790 2783 const zbookmark_t *zb)
2791 2784 {
2792 2785 arc_buf_hdr_t *hdr;
2793 2786 arc_buf_t *buf = NULL;
2794 2787 kmutex_t *hash_lock;
2795 2788 zio_t *rzio;
2796 2789 uint64_t guid = spa_load_guid(spa);
2797 2790
2798 2791 top:
2799 2792 hdr = buf_hash_find(guid, BP_IDENTITY(bp), BP_PHYSICAL_BIRTH(bp),
2800 2793 &hash_lock);
2801 2794 if (hdr && hdr->b_datacnt > 0) {
2802 2795
2803 2796 *arc_flags |= ARC_CACHED;
2804 2797
2805 2798 if (HDR_IO_IN_PROGRESS(hdr)) {
2806 2799
2807 2800 if (*arc_flags & ARC_WAIT) {
2808 2801 cv_wait(&hdr->b_cv, hash_lock);
2809 2802 mutex_exit(hash_lock);
2810 2803 goto top;
2811 2804 }
2812 2805 ASSERT(*arc_flags & ARC_NOWAIT);
2813 2806
2814 2807 if (done) {
2815 2808 arc_callback_t *acb = NULL;
2816 2809
2817 2810 acb = kmem_zalloc(sizeof (arc_callback_t),
2818 2811 KM_SLEEP);
2819 2812 acb->acb_done = done;
2820 2813 acb->acb_private = private;
2821 2814 if (pio != NULL)
2822 2815 acb->acb_zio_dummy = zio_null(pio,
2823 2816 spa, NULL, NULL, NULL, zio_flags);
2824 2817
2825 2818 ASSERT(acb->acb_done != NULL);
2826 2819 acb->acb_next = hdr->b_acb;
2827 2820 hdr->b_acb = acb;
2828 2821 add_reference(hdr, hash_lock, private);
2829 2822 mutex_exit(hash_lock);
2830 2823 return (0);
2831 2824 }
2832 2825 mutex_exit(hash_lock);
2833 2826 return (0);
2834 2827 }
2835 2828
2836 2829 ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
2837 2830
2838 2831 if (done) {
2839 2832 add_reference(hdr, hash_lock, private);
2840 2833 /*
2841 2834 * If this block is already in use, create a new
2842 2835 * copy of the data so that we will be guaranteed
2843 2836 * that arc_release() will always succeed.
2844 2837 */
2845 2838 buf = hdr->b_buf;
2846 2839 ASSERT(buf);
2847 2840 ASSERT(buf->b_data);
2848 2841 if (HDR_BUF_AVAILABLE(hdr)) {
2849 2842 ASSERT(buf->b_efunc == NULL);
2850 2843 hdr->b_flags &= ~ARC_BUF_AVAILABLE;
2851 2844 } else {
2852 2845 buf = arc_buf_clone(buf);
2853 2846 }
2854 2847
2855 2848 } else if (*arc_flags & ARC_PREFETCH &&
2856 2849 refcount_count(&hdr->b_refcnt) == 0) {
2857 2850 hdr->b_flags |= ARC_PREFETCH;
2858 2851 }
2859 2852 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
2860 2853 arc_access(hdr, hash_lock);
2861 2854 if (*arc_flags & ARC_L2CACHE)
2862 2855 hdr->b_flags |= ARC_L2CACHE;
2863 2856 mutex_exit(hash_lock);
2864 2857 ARCSTAT_BUMP(arcstat_hits);
2865 2858 ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
2866 2859 demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
2867 2860 data, metadata, hits);
2868 2861
2869 2862 if (done)
2870 2863 done(NULL, buf, private);
2871 2864 } else {
2872 2865 uint64_t size = BP_GET_LSIZE(bp);
2873 2866 arc_callback_t *acb;
2874 2867 vdev_t *vd = NULL;
2875 2868 uint64_t addr = 0;
2876 2869 boolean_t devw = B_FALSE;
2877 2870
2878 2871 if (hdr == NULL) {
2879 2872 /* this block is not in the cache */
2880 2873 arc_buf_hdr_t *exists;
2881 2874 arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
2882 2875 buf = arc_buf_alloc(spa, size, private, type);
2883 2876 hdr = buf->b_hdr;
2884 2877 hdr->b_dva = *BP_IDENTITY(bp);
2885 2878 hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
2886 2879 hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
2887 2880 exists = buf_hash_insert(hdr, &hash_lock);
2888 2881 if (exists) {
2889 2882 /* somebody beat us to the hash insert */
2890 2883 mutex_exit(hash_lock);
2891 2884 buf_discard_identity(hdr);
2892 2885 (void) arc_buf_remove_ref(buf, private);
2893 2886 goto top; /* restart the IO request */
2894 2887 }
2895 2888 /* if this is a prefetch, we don't have a reference */
2896 2889 if (*arc_flags & ARC_PREFETCH) {
2897 2890 (void) remove_reference(hdr, hash_lock,
2898 2891 private);
2899 2892 hdr->b_flags |= ARC_PREFETCH;
2900 2893 }
2901 2894 if (*arc_flags & ARC_L2CACHE)
2902 2895 hdr->b_flags |= ARC_L2CACHE;
2903 2896 if (BP_GET_LEVEL(bp) > 0)
2904 2897 hdr->b_flags |= ARC_INDIRECT;
2905 2898 } else {
2906 2899 /* this block is in the ghost cache */
2907 2900 ASSERT(GHOST_STATE(hdr->b_state));
2908 2901 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
2909 2902 ASSERT0(refcount_count(&hdr->b_refcnt));
2910 2903 ASSERT(hdr->b_buf == NULL);
2911 2904
2912 2905 /* if this is a prefetch, we don't have a reference */
2913 2906 if (*arc_flags & ARC_PREFETCH)
2914 2907 hdr->b_flags |= ARC_PREFETCH;
2915 2908 else
2916 2909 add_reference(hdr, hash_lock, private);
2917 2910 if (*arc_flags & ARC_L2CACHE)
2918 2911 hdr->b_flags |= ARC_L2CACHE;
2919 2912 buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2920 2913 buf->b_hdr = hdr;
2921 2914 buf->b_data = NULL;
2922 2915 buf->b_efunc = NULL;
2923 2916 buf->b_private = NULL;
2924 2917 buf->b_next = NULL;
2925 2918 hdr->b_buf = buf;
2926 2919 ASSERT(hdr->b_datacnt == 0);
2927 2920 hdr->b_datacnt = 1;
2928 2921 arc_get_data_buf(buf);
2929 2922 arc_access(hdr, hash_lock);
2930 2923 }
2931 2924
2932 2925 ASSERT(!GHOST_STATE(hdr->b_state));
2933 2926
2934 2927 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
2935 2928 acb->acb_done = done;
2936 2929 acb->acb_private = private;
2937 2930
2938 2931 ASSERT(hdr->b_acb == NULL);
2939 2932 hdr->b_acb = acb;
2940 2933 hdr->b_flags |= ARC_IO_IN_PROGRESS;
2941 2934
2942 2935 if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
2943 2936 (vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
2944 2937 devw = hdr->b_l2hdr->b_dev->l2ad_writing;
2945 2938 addr = hdr->b_l2hdr->b_daddr;
2946 2939 /*
2947 2940 * Lock out device removal.
2948 2941 */
2949 2942 if (vdev_is_dead(vd) ||
2950 2943 !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
2951 2944 vd = NULL;
2952 2945 }
2953 2946
2954 2947 mutex_exit(hash_lock);
2955 2948
2956 2949 ASSERT3U(hdr->b_size, ==, size);
2957 2950 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
2958 2951 uint64_t, size, zbookmark_t *, zb);
2959 2952 ARCSTAT_BUMP(arcstat_misses);
2960 2953 ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
2961 2954 demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
2962 2955 data, metadata, misses);
2963 2956
2964 2957 if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
2965 2958 /*
2966 2959 * Read from the L2ARC if the following are true:
2967 2960 * 1. The L2ARC vdev was previously cached.
2968 2961 * 2. This buffer still has L2ARC metadata.
2969 2962 * 3. This buffer isn't currently writing to the L2ARC.
2970 2963 * 4. The L2ARC entry wasn't evicted, which may
2971 2964 * also have invalidated the vdev.
2972 2965 * 5. This isn't prefetch and l2arc_noprefetch is set.
2973 2966 */
2974 2967 if (hdr->b_l2hdr != NULL &&
2975 2968 !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
2976 2969 !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
2977 2970 l2arc_read_callback_t *cb;
2978 2971
2979 2972 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
2980 2973 ARCSTAT_BUMP(arcstat_l2_hits);
2981 2974
2982 2975 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
2983 2976 KM_SLEEP);
2984 2977 cb->l2rcb_buf = buf;
2985 2978 cb->l2rcb_spa = spa;
2986 2979 cb->l2rcb_bp = *bp;
2987 2980 cb->l2rcb_zb = *zb;
2988 2981 cb->l2rcb_flags = zio_flags;
2989 2982
2990 2983 ASSERT(addr >= VDEV_LABEL_START_SIZE &&
2991 2984 addr + size < vd->vdev_psize -
2992 2985 VDEV_LABEL_END_SIZE);
2993 2986
2994 2987 /*
2995 2988 * l2arc read. The SCL_L2ARC lock will be
2996 2989 * released by l2arc_read_done().
2997 2990 */
2998 2991 rzio = zio_read_phys(pio, vd, addr, size,
2999 2992 buf->b_data, ZIO_CHECKSUM_OFF,
3000 2993 l2arc_read_done, cb, priority, zio_flags |
3001 2994 ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
3002 2995 ZIO_FLAG_DONT_PROPAGATE |
3003 2996 ZIO_FLAG_DONT_RETRY, B_FALSE);
3004 2997 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
3005 2998 zio_t *, rzio);
3006 2999 ARCSTAT_INCR(arcstat_l2_read_bytes, size);
3007 3000
3008 3001 if (*arc_flags & ARC_NOWAIT) {
3009 3002 zio_nowait(rzio);
3010 3003 return (0);
3011 3004 }
3012 3005
3013 3006 ASSERT(*arc_flags & ARC_WAIT);
3014 3007 if (zio_wait(rzio) == 0)
3015 3008 return (0);
3016 3009
3017 3010 /* l2arc read error; goto zio_read() */
3018 3011 } else {
3019 3012 DTRACE_PROBE1(l2arc__miss,
3020 3013 arc_buf_hdr_t *, hdr);
3021 3014 ARCSTAT_BUMP(arcstat_l2_misses);
3022 3015 if (HDR_L2_WRITING(hdr))
3023 3016 ARCSTAT_BUMP(arcstat_l2_rw_clash);
3024 3017 spa_config_exit(spa, SCL_L2ARC, vd);
3025 3018 }
3026 3019 } else {
3027 3020 if (vd != NULL)
3028 3021 spa_config_exit(spa, SCL_L2ARC, vd);
3029 3022 if (l2arc_ndev != 0) {
3030 3023 DTRACE_PROBE1(l2arc__miss,
3031 3024 arc_buf_hdr_t *, hdr);
3032 3025 ARCSTAT_BUMP(arcstat_l2_misses);
3033 3026 }
3034 3027 }
3035 3028
3036 3029 rzio = zio_read(pio, spa, bp, buf->b_data, size,
3037 3030 arc_read_done, buf, priority, zio_flags, zb);
3038 3031
3039 3032 if (*arc_flags & ARC_WAIT)
3040 3033 return (zio_wait(rzio));
3041 3034
3042 3035 ASSERT(*arc_flags & ARC_NOWAIT);
3043 3036 zio_nowait(rzio);
3044 3037 }
3045 3038 return (0);
3046 3039 }
3047 3040
3048 3041 void
3049 3042 arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
3050 3043 {
3051 3044 ASSERT(buf->b_hdr != NULL);
3052 3045 ASSERT(buf->b_hdr->b_state != arc_anon);
3053 3046 ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
3054 3047 ASSERT(buf->b_efunc == NULL);
3055 3048 ASSERT(!HDR_BUF_AVAILABLE(buf->b_hdr));
3056 3049
3057 3050 buf->b_efunc = func;
3058 3051 buf->b_private = private;
3059 3052 }
3060 3053
3061 3054 /*
3062 3055 * This is used by the DMU to let the ARC know that a buffer is
3063 3056 * being evicted, so the ARC should clean up. If this arc buf
3064 3057 * is not yet in the evicted state, it will be put there.
3065 3058 */
3066 3059 int
3067 3060 arc_buf_evict(arc_buf_t *buf)
3068 3061 {
3069 3062 arc_buf_hdr_t *hdr;
3070 3063 kmutex_t *hash_lock;
3071 3064 arc_buf_t **bufp;
3072 3065
3073 3066 mutex_enter(&buf->b_evict_lock);
3074 3067 hdr = buf->b_hdr;
3075 3068 if (hdr == NULL) {
3076 3069 /*
3077 3070 * We are in arc_do_user_evicts().
3078 3071 */
3079 3072 ASSERT(buf->b_data == NULL);
3080 3073 mutex_exit(&buf->b_evict_lock);
3081 3074 return (0);
3082 3075 } else if (buf->b_data == NULL) {
3083 3076 arc_buf_t copy = *buf; /* structure assignment */
3084 3077 /*
3085 3078 * We are on the eviction list; process this buffer now
3086 3079 * but let arc_do_user_evicts() do the reaping.
3087 3080 */
3088 3081 buf->b_efunc = NULL;
3089 3082 mutex_exit(&buf->b_evict_lock);
3090 3083		VERIFY(copy.b_efunc(&copy) == 0);
3091 3084 return (1);
3092 3085 }
3093 3086 hash_lock = HDR_LOCK(hdr);
3094 3087 mutex_enter(hash_lock);
3095 3088 hdr = buf->b_hdr;
3096 3089 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3097 3090
3098 3091 ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
3099 3092 ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
3100 3093
3101 3094 /*
3102 3095 * Pull this buffer off of the hdr
3103 3096 */
3104 3097 bufp = &hdr->b_buf;
3105 3098 while (*bufp != buf)
3106 3099 bufp = &(*bufp)->b_next;
3107 3100 *bufp = buf->b_next;
3108 3101
3109 3102 ASSERT(buf->b_data != NULL);
3110 3103 arc_buf_destroy(buf, FALSE, FALSE);
3111 3104
3112 3105 if (hdr->b_datacnt == 0) {
3113 3106 arc_state_t *old_state = hdr->b_state;
3114 3107 arc_state_t *evicted_state;
3115 3108
3116 3109 ASSERT(hdr->b_buf == NULL);
3117 3110 ASSERT(refcount_is_zero(&hdr->b_refcnt));
3118 3111
3119 3112 evicted_state =
3120 3113 (old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
3121 3114
3122 3115 mutex_enter(&old_state->arcs_mtx);
3123 3116 mutex_enter(&evicted_state->arcs_mtx);
3124 3117
3125 3118 arc_change_state(evicted_state, hdr, hash_lock);
3126 3119 ASSERT(HDR_IN_HASH_TABLE(hdr));
3127 3120 hdr->b_flags |= ARC_IN_HASH_TABLE;
3128 3121 hdr->b_flags &= ~ARC_BUF_AVAILABLE;
3129 3122
3130 3123 mutex_exit(&evicted_state->arcs_mtx);
3131 3124 mutex_exit(&old_state->arcs_mtx);
3132 3125 }
3133 3126 mutex_exit(hash_lock);
3134 3127 mutex_exit(&buf->b_evict_lock);
3135 3128
3136 3129 VERIFY(buf->b_efunc(buf) == 0);
3137 3130 buf->b_efunc = NULL;
3138 3131 buf->b_private = NULL;
3139 3132 buf->b_hdr = NULL;
3140 3133 buf->b_next = NULL;
3141 3134 kmem_cache_free(buf_cache, buf);
3142 3135 return (1);
3143 3136 }
3144 3137
3145 3138 /*
3146 3139 * Release this buffer from the cache. This must be done
3147 3140 * after a read and prior to modifying the buffer contents.
3148 3141 * If the buffer has more than one reference, we must make
3149 3142 * a new hdr for the buffer.
3150 3143 */
3151 3144 void
3152 3145 arc_release(arc_buf_t *buf, void *tag)
3153 3146 {
3154 3147 arc_buf_hdr_t *hdr;
3155 3148 kmutex_t *hash_lock = NULL;
3156 3149 l2arc_buf_hdr_t *l2hdr;
3157 3150 uint64_t buf_size;
3158 3151
3159 3152 /*
3160 3153 * It would be nice to assert that if it's DMU metadata (level >
3161 3154 * 0 || it's the dnode file), then it must be syncing context.
3162 3155 * But we don't know that information at this level.
3163 3156 */
3164 3157
3165 3158 mutex_enter(&buf->b_evict_lock);
3166 3159 hdr = buf->b_hdr;
3167 3160
3168 3161 /* this buffer is not on any list */
3169 3162 ASSERT(refcount_count(&hdr->b_refcnt) > 0);
3170 3163
3171 3164 if (hdr->b_state == arc_anon) {
3172 3165 /* this buffer is already released */
3173 3166 ASSERT(buf->b_efunc == NULL);
3174 3167 } else {
3175 3168 hash_lock = HDR_LOCK(hdr);
3176 3169 mutex_enter(hash_lock);
3177 3170 hdr = buf->b_hdr;
3178 3171 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3179 3172 }
3180 3173
3181 3174 l2hdr = hdr->b_l2hdr;
3182 3175 if (l2hdr) {
3183 3176 mutex_enter(&l2arc_buflist_mtx);
3184 3177 hdr->b_l2hdr = NULL;
3185 3178 }
3186 3179 buf_size = hdr->b_size;
3187 3180
3188 3181 /*
3189 3182 * Do we have more than one buf?
3190 3183 */
3191 3184 if (hdr->b_datacnt > 1) {
3192 3185 arc_buf_hdr_t *nhdr;
3193 3186 arc_buf_t **bufp;
3194 3187 uint64_t blksz = hdr->b_size;
3195 3188 uint64_t spa = hdr->b_spa;
3196 3189 arc_buf_contents_t type = hdr->b_type;
3197 3190 uint32_t flags = hdr->b_flags;
3198 3191
3199 3192 ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
3200 3193 /*
3201 3194 * Pull the data off of this hdr and attach it to
3202 3195 * a new anonymous hdr.
3203 3196 */
3204 3197 (void) remove_reference(hdr, hash_lock, tag);
3205 3198 bufp = &hdr->b_buf;
3206 3199 while (*bufp != buf)
3207 3200 bufp = &(*bufp)->b_next;
3208 3201 *bufp = buf->b_next;
3209 3202 buf->b_next = NULL;
3210 3203
3211 3204 ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
3212 3205 atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
3213 3206 if (refcount_is_zero(&hdr->b_refcnt)) {
3214 3207 uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
3215 3208 ASSERT3U(*size, >=, hdr->b_size);
3216 3209 atomic_add_64(size, -hdr->b_size);
3217 3210 }
3218 3211
3219 3212 /*
3220 3213 * We're releasing a duplicate user data buffer, update
3221 3214 * our statistics accordingly.
3222 3215 */
3223 3216 if (hdr->b_type == ARC_BUFC_DATA) {
3224 3217 ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
3225 3218 ARCSTAT_INCR(arcstat_duplicate_buffers_size,
3226 3219 -hdr->b_size);
3227 3220 }
3228 3221 hdr->b_datacnt -= 1;
3229 3222 arc_cksum_verify(buf);
3230 3223 arc_buf_unwatch(buf);
3231 3224
3232 3225 mutex_exit(hash_lock);
3233 3226
3234 3227 nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
3235 3228 nhdr->b_size = blksz;
3236 3229 nhdr->b_spa = spa;
3237 3230 nhdr->b_type = type;
3238 3231 nhdr->b_buf = buf;
3239 3232 nhdr->b_state = arc_anon;
3240 3233 nhdr->b_arc_access = 0;
3241 3234 nhdr->b_flags = flags & ARC_L2_WRITING;
3242 3235 nhdr->b_l2hdr = NULL;
3243 3236 nhdr->b_datacnt = 1;
3244 3237 nhdr->b_freeze_cksum = NULL;
3245 3238 (void) refcount_add(&nhdr->b_refcnt, tag);
3246 3239 buf->b_hdr = nhdr;
3247 3240 mutex_exit(&buf->b_evict_lock);
3248 3241 atomic_add_64(&arc_anon->arcs_size, blksz);
3249 3242 } else {
3250 3243 mutex_exit(&buf->b_evict_lock);
3251 3244 ASSERT(refcount_count(&hdr->b_refcnt) == 1);
3252 3245 ASSERT(!list_link_active(&hdr->b_arc_node));
3253 3246 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3254 3247 if (hdr->b_state != arc_anon)
3255 3248 arc_change_state(arc_anon, hdr, hash_lock);
3256 3249 hdr->b_arc_access = 0;
3257 3250 if (hash_lock)
3258 3251 mutex_exit(hash_lock);
3259 3252
3260 3253 buf_discard_identity(hdr);
3261 3254 arc_buf_thaw(buf);
3262 3255 }
3263 3256 buf->b_efunc = NULL;
3264 3257 buf->b_private = NULL;
3265 3258
3266 3259 if (l2hdr) {
3267 3260 list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
3268 3261 kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
3269 3262 ARCSTAT_INCR(arcstat_l2_size, -buf_size);
3270 3263 mutex_exit(&l2arc_buflist_mtx);
3271 3264 }
3272 3265 }
3273 3266
3274 3267 int
3275 3268 arc_released(arc_buf_t *buf)
3276 3269 {
3277 3270 int released;
3278 3271
3279 3272 mutex_enter(&buf->b_evict_lock);
3280 3273 released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
3281 3274 mutex_exit(&buf->b_evict_lock);
3282 3275 return (released);
3283 3276 }
3284 3277
3285 3278 int
3286 3279 arc_has_callback(arc_buf_t *buf)
3287 3280 {
3288 3281 int callback;
3289 3282
3290 3283 mutex_enter(&buf->b_evict_lock);
3291 3284 callback = (buf->b_efunc != NULL);
3292 3285 mutex_exit(&buf->b_evict_lock);
3293 3286 return (callback);
3294 3287 }
3295 3288
3296 3289 #ifdef ZFS_DEBUG
3297 3290 int
3298 3291 arc_referenced(arc_buf_t *buf)
3299 3292 {
3300 3293 int referenced;
3301 3294
3302 3295 mutex_enter(&buf->b_evict_lock);
3303 3296 referenced = (refcount_count(&buf->b_hdr->b_refcnt));
3304 3297 mutex_exit(&buf->b_evict_lock);
3305 3298 return (referenced);
3306 3299 }
3307 3300 #endif
3308 3301
3309 3302 static void
3310 3303 arc_write_ready(zio_t *zio)
3311 3304 {
3312 3305 arc_write_callback_t *callback = zio->io_private;
3313 3306 arc_buf_t *buf = callback->awcb_buf;
3314 3307 arc_buf_hdr_t *hdr = buf->b_hdr;
3315 3308
3316 3309 ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
3317 3310 callback->awcb_ready(zio, buf, callback->awcb_private);
3318 3311
3319 3312 /*
3320 3313 * If the IO is already in progress, then this is a re-write
3321 3314 * attempt, so we need to thaw and re-compute the cksum.
3322 3315 * It is the responsibility of the callback to handle the
3323 3316 * accounting for any re-write attempt.
3324 3317 */
3325 3318 if (HDR_IO_IN_PROGRESS(hdr)) {
3326 3319 mutex_enter(&hdr->b_freeze_lock);
3327 3320 if (hdr->b_freeze_cksum != NULL) {
3328 3321 kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
3329 3322 hdr->b_freeze_cksum = NULL;
3330 3323 }
3331 3324 mutex_exit(&hdr->b_freeze_lock);
3332 3325 }
3333 3326 arc_cksum_compute(buf, B_FALSE);
3334 3327 hdr->b_flags |= ARC_IO_IN_PROGRESS;
3335 3328 }
3336 3329
3337 3330 static void
3338 3331 arc_write_done(zio_t *zio)
3339 3332 {
3340 3333 arc_write_callback_t *callback = zio->io_private;
3341 3334 arc_buf_t *buf = callback->awcb_buf;
3342 3335 arc_buf_hdr_t *hdr = buf->b_hdr;
3343 3336
3344 3337 ASSERT(hdr->b_acb == NULL);
3345 3338
3346 3339 if (zio->io_error == 0) {
3347 3340 hdr->b_dva = *BP_IDENTITY(zio->io_bp);
3348 3341 hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
3349 3342 hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
3350 3343 } else {
3351 3344 ASSERT(BUF_EMPTY(hdr));
3352 3345 }
3353 3346
3354 3347 /*
3355 3348 * If the block to be written was all-zero, we may have
3356 3349 * compressed it away. In this case no write was performed
3357 3350 * so there will be no dva/birth/checksum. The buffer must
3358 3351 * therefore remain anonymous (and uncached).
3359 3352 */
3360 3353 if (!BUF_EMPTY(hdr)) {
3361 3354 arc_buf_hdr_t *exists;
3362 3355 kmutex_t *hash_lock;
3363 3356
3364 3357 ASSERT(zio->io_error == 0);
3365 3358
3366 3359 arc_cksum_verify(buf);
3367 3360
3368 3361 exists = buf_hash_insert(hdr, &hash_lock);
3369 3362 if (exists) {
3370 3363 /*
3371 3364 * This can only happen if we overwrite for
3372 3365 * sync-to-convergence, because we remove
3373 3366 * buffers from the hash table when we arc_free().
3374 3367 */
3375 3368 if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
3376 3369 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
3377 3370 panic("bad overwrite, hdr=%p exists=%p",
3378 3371 (void *)hdr, (void *)exists);
3379 3372 ASSERT(refcount_is_zero(&exists->b_refcnt));
3380 3373 arc_change_state(arc_anon, exists, hash_lock);
3381 3374 mutex_exit(hash_lock);
3382 3375 arc_hdr_destroy(exists);
3383 3376 exists = buf_hash_insert(hdr, &hash_lock);
3384 3377 ASSERT3P(exists, ==, NULL);
3385 3378 } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
3386 3379 /* nopwrite */
3387 3380 ASSERT(zio->io_prop.zp_nopwrite);
3388 3381 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
3389 3382 panic("bad nopwrite, hdr=%p exists=%p",
3390 3383 (void *)hdr, (void *)exists);
3391 3384 } else {
3392 3385 /* Dedup */
3393 3386 ASSERT(hdr->b_datacnt == 1);
3394 3387 ASSERT(hdr->b_state == arc_anon);
3395 3388 ASSERT(BP_GET_DEDUP(zio->io_bp));
3396 3389 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
3397 3390 }
3398 3391 }
3399 3392 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
3400 3393 /* if it's not anon, we are doing a scrub */
3401 3394 if (!exists && hdr->b_state == arc_anon)
3402 3395 arc_access(hdr, hash_lock);
3403 3396 mutex_exit(hash_lock);
3404 3397 } else {
3405 3398 hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
3406 3399 }
3407 3400
3408 3401 ASSERT(!refcount_is_zero(&hdr->b_refcnt));
3409 3402 callback->awcb_done(zio, buf, callback->awcb_private);
3410 3403
3411 3404 kmem_free(callback, sizeof (arc_write_callback_t));
3412 3405 }
3413 3406
3414 3407 zio_t *
3415 3408 arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
3416 3409 blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp,
3417 3410 arc_done_func_t *ready, arc_done_func_t *done, void *private,
3418 3411 int priority, int zio_flags, const zbookmark_t *zb)
3419 3412 {
3420 3413 arc_buf_hdr_t *hdr = buf->b_hdr;
3421 3414 arc_write_callback_t *callback;
3422 3415 zio_t *zio;
3423 3416
3424 3417 ASSERT(ready != NULL);
3425 3418 ASSERT(done != NULL);
3426 3419 ASSERT(!HDR_IO_ERROR(hdr));
3427 3420 ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
3428 3421 ASSERT(hdr->b_acb == NULL);
3429 3422 if (l2arc)
3430 3423 hdr->b_flags |= ARC_L2CACHE;
3431 3424 callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
3432 3425 callback->awcb_ready = ready;
3433 3426 callback->awcb_done = done;
3434 3427 callback->awcb_private = private;
3435 3428 callback->awcb_buf = buf;
3436 3429
3437 3430 zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, zp,
3438 3431 arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);
3439 3432
3440 3433 return (zio);
3441 3434 }
3442 3435
3443 3436 static int
3444 3437 arc_memory_throttle(uint64_t reserve, uint64_t inflight_data, uint64_t txg)
3445 3438 {
3446 3439 #ifdef _KERNEL
3447 3440 uint64_t available_memory = ptob(freemem);
3448 3441 static uint64_t page_load = 0;
3449 3442 static uint64_t last_txg = 0;
3450 3443
3451 3444 #if defined(__i386)
3452 3445 available_memory =
3453 3446 MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
3454 3447 #endif
3455 3448 if (available_memory >= zfs_write_limit_max)
3456 3449 return (0);
3457 3450
3458 3451 if (txg > last_txg) {
3459 3452 last_txg = txg;
3460 3453 page_load = 0;
3461 3454 }
3462 3455 /*
3463 3456	 * If we are in pageout, we know that memory is already tight
3464 3457	 * and the arc is already going to be evicting, so we just want
3465 3458	 * to continue to let page writes occur as quickly as possible.
3466 3459 */
3467 3460 if (curproc == proc_pageout) {
3468 3461 if (page_load > MAX(ptob(minfree), available_memory) / 4)
3469 3462 return (SET_ERROR(ERESTART));
3470 3463 /* Note: reserve is inflated, so we deflate */
3471 3464 page_load += reserve / 8;
3472 3465 return (0);
3473 3466 } else if (page_load > 0 && arc_reclaim_needed()) {
3474 3467 /* memory is low, delay before restarting */
3475 3468 ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
3476 3469 return (SET_ERROR(EAGAIN));
3477 3470 }
3478 3471 page_load = 0;
3479 3472
3480 3473 if (arc_size > arc_c_min) {
3481 3474 uint64_t evictable_memory =
3482 3475 arc_mru->arcs_lsize[ARC_BUFC_DATA] +
3483 3476 arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
3484 3477 arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
3485 3478 arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
3486 3479 available_memory += MIN(evictable_memory, arc_size - arc_c_min);
3487 3480 }
3488 3481
3489 3482 if (inflight_data > available_memory / 4) {
3490 3483 ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
3491 3484 return (SET_ERROR(ERESTART));
3492 3485 }
3493 3486 #endif
3494 3487 return (0);
3495 3488 }
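
To give the thresholds above a sense of scale (illustrative numbers, not measured defaults, and assuming the early return for available_memory >= zfs_write_limit_max is not taken and the caller is not pageout): if ptob(freemem) is 1GB and the ARC holds 2GB of evictable data while sitting 1GB above arc_c_min, then available_memory grows to 1GB + MIN(2GB, 1GB) = 2GB, and arc_memory_throttle() returns ERESTART only once the in-flight dirty data exceeds 2GB / 4 = 512MB.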
3496 3489
3497 3490 void
3498 3491 arc_tempreserve_clear(uint64_t reserve)
3499 3492 {
3500 3493 atomic_add_64(&arc_tempreserve, -reserve);
3501 3494 ASSERT((int64_t)arc_tempreserve >= 0);
3502 3495 }
3503 3496
3504 3497 int
3505 3498 arc_tempreserve_space(uint64_t reserve, uint64_t txg)
3506 3499 {
3507 3500 int error;
3508 3501 uint64_t anon_size;
3509 3502
3510 3503 #ifdef ZFS_DEBUG
3511 3504 /*
3512 3505 * Once in a while, fail for no reason. Everything should cope.
3513 3506 */
3514 3507 if (spa_get_random(10000) == 0) {
3515 3508 dprintf("forcing random failure\n");
3516 3509 return (SET_ERROR(ERESTART));
3517 3510 }
3518 3511 #endif
3519 3512 if (reserve > arc_c/4 && !arc_no_grow)
3520 3513 arc_c = MIN(arc_c_max, reserve * 4);
3521 3514 if (reserve > arc_c)
3522 3515 return (SET_ERROR(ENOMEM));
3523 3516
3524 3517 /*
3525 3518 * Don't count loaned bufs as in flight dirty data to prevent long
3526 3519 * network delays from blocking transactions that are ready to be
3527 3520 * assigned to a txg.
3528 3521 */
3529 3522 anon_size = MAX((int64_t)(arc_anon->arcs_size - arc_loaned_bytes), 0);
3530 3523
3531 3524 /*
3532 3525 * Writes will, almost always, require additional memory allocations
3533 3526	 * in order to compress/encrypt/etc the data. We therefore need to
3534 3527 * make sure that there is sufficient available memory for this.
3535 3528 */
3536 3529 if (error = arc_memory_throttle(reserve, anon_size, txg))
3537 3530 return (error);
3538 3531
3539 3532 /*
3540 3533 * Throttle writes when the amount of dirty data in the cache
3541 3534 * gets too large. We try to keep the cache less than half full
3542 3535 * of dirty blocks so that our sync times don't grow too large.
3543 3536 * Note: if two requests come in concurrently, we might let them
3544 3537 * both succeed, when one of them should fail. Not a huge deal.
3545 3538 */
3546 3539
3547 3540 if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
3548 3541 anon_size > arc_c / 4) {
3549 3542 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
3550 3543 "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
3551 3544 arc_tempreserve>>10,
3552 3545 arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
3553 3546 arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
3554 3547 reserve>>10, arc_c>>10);
3555 3548 return (SET_ERROR(ERESTART));
3556 3549 }
3557 3550 atomic_add_64(&arc_tempreserve, reserve);
3558 3551 return (0);
3559 3552 }
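
To make the dirty-data throttle above concrete (illustrative arc_c of 4GB, not a default): the ERESTART path fires only when both conditions hold, i.e. anonymous (dirty) data exceeds arc_c / 4 = 1GB and reserve + arc_tempreserve + anon_size exceeds arc_c / 2 = 2GB.

A minimal caller sketch of the reserve/clear pairing, under the assumption that nbytes and txg are supplied by the caller; the real callers live in the DMU and are not part of this file:

	/* hypothetical caller, not part of arc.c */
	static int
	example_reserve(uint64_t nbytes, uint64_t txg)
	{
		int error = arc_tempreserve_space(nbytes, txg);

		if (error == 0) {
			/* ... dirty up to nbytes of buffers for this txg ... */
			arc_tempreserve_clear(nbytes);	/* drop the reservation */
		}
		/* ERESTART/EAGAIN indicate memory or dirty-data pressure: retry later */
		return (error);
	}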
3560 3553
3561 3554 void
3562 3555 arc_init(void)
3563 3556 {
3564 3557 mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
3565 3558 cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
3566 3559
3567 3560 /* Convert seconds to clock ticks */
3568 3561 arc_min_prefetch_lifespan = 1 * hz;
3569 3562
3570 3563 /* Start out with 1/8 of all memory */
3571 3564 arc_c = physmem * PAGESIZE / 8;
3572 3565
3573 3566 #ifdef _KERNEL
3574 3567 /*
3575 3568 * On architectures where the physical memory can be larger
3576 3569 * than the addressable space (intel in 32-bit mode), we may
3577 3570 * need to limit the cache to 1/8 of VM size.
3578 3571 */
3579 3572 arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
3580 3573 #endif
3581 3574
3582 3575 /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
3583 3576 arc_c_min = MAX(arc_c / 4, 64<<20);
3584 3577 /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
3585 3578 if (arc_c * 8 >= 1<<30)
3586 3579 arc_c_max = (arc_c * 8) - (1<<30);
3587 3580 else
3588 3581 arc_c_max = arc_c_min;
3589 3582 arc_c_max = MAX(arc_c * 6, arc_c_max);
3590 3583
3591 3584 /*
3592 3585 * Allow the tunables to override our calculations if they are
3593 3586 * reasonable (ie. over 64MB)
3594 3587 */
3595 3588 if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
3596 3589 arc_c_max = zfs_arc_max;
3597 3590 if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max)
3598 3591 arc_c_min = zfs_arc_min;
3599 3592
3600 3593 arc_c = arc_c_max;
3601 3594 arc_p = (arc_c >> 1);
3602 3595
3603 3596 /* limit meta-data to 1/4 of the arc capacity */
3604 3597 arc_meta_limit = arc_c_max / 4;
3605 3598
3606 3599 /* Allow the tunable to override if it is reasonable */
3607 3600 if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
3608 3601 arc_meta_limit = zfs_arc_meta_limit;
3609 3602
3610 3603 if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
3611 3604 arc_c_min = arc_meta_limit / 2;
3612 3605
3613 3606 if (zfs_arc_grow_retry > 0)
3614 3607 arc_grow_retry = zfs_arc_grow_retry;
3615 3608
3616 3609 if (zfs_arc_shrink_shift > 0)
3617 3610 arc_shrink_shift = zfs_arc_shrink_shift;
3618 3611
3619 3612 if (zfs_arc_p_min_shift > 0)
3620 3613 arc_p_min_shift = zfs_arc_p_min_shift;
3621 3614
3622 3615 /* if kmem_flags are set, lets try to use less memory */
3623 3616 if (kmem_debugging())
3624 3617 arc_c = arc_c / 2;
3625 3618 if (arc_c < arc_c_min)
3626 3619 arc_c = arc_c_min;
3627 3620
3628 3621 arc_anon = &ARC_anon;
3629 3622 arc_mru = &ARC_mru;
3630 3623 arc_mru_ghost = &ARC_mru_ghost;
3631 3624 arc_mfu = &ARC_mfu;
3632 3625 arc_mfu_ghost = &ARC_mfu_ghost;
3633 3626 arc_l2c_only = &ARC_l2c_only;
3634 3627 arc_size = 0;
3635 3628
3636 3629 mutex_init(&arc_anon->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3637 3630 mutex_init(&arc_mru->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3638 3631 mutex_init(&arc_mru_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3639 3632 mutex_init(&arc_mfu->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3640 3633 mutex_init(&arc_mfu_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3641 3634 mutex_init(&arc_l2c_only->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3642 3635
3643 3636 list_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
3644 3637 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3645 3638 list_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
3646 3639 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3647 3640 list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
3648 3641 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3649 3642 list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
3650 3643 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3651 3644 list_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
3652 3645 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3653 3646 list_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
3654 3647 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3655 3648 list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
3656 3649 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3657 3650 list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
3658 3651 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3659 3652 list_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
3660 3653 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3661 3654 list_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
3662 3655 sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3663 3656
3664 3657 buf_init();
3665 3658
3666 3659 arc_thread_exit = 0;
3667 3660 arc_eviction_list = NULL;
3668 3661 mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
3669 3662 bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
3670 3663
3671 3664 arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
3672 3665 sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
3673 3666
3674 3667 if (arc_ksp != NULL) {
3675 3668 arc_ksp->ks_data = &arc_stats;
3676 3669 kstat_install(arc_ksp);
3677 3670 }
3678 3671
3679 3672 (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
3680 3673 TS_RUN, minclsyspri);
3681 3674
3682 3675 arc_dead = FALSE;
3683 3676 arc_warm = B_FALSE;
3684 3677
3685 3678 if (zfs_write_limit_max == 0)
3686 3679 zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
3687 3680 else
3688 3681 zfs_write_limit_shift = 0;
3689 3682 mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);
3690 3683 }
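
Worked example of the sizing logic above for a hypothetical machine with 16GB of physical memory (ignoring the 32-bit heap clamp, the kmem-debugging halving and any tunable overrides): arc_c starts at 16GB / 8 = 2GB; arc_c_min = MAX(2GB / 4, 64MB) = 512MB; since arc_c * 8 = 16GB >= 1GB, arc_c_max = 16GB - 1GB = 15GB, then MAX(6 * 2GB, 15GB) = 15GB; arc_c and arc_p settle at 15GB and 7.5GB; arc_meta_limit = 15GB / 4 = 3.75GB, which in turn raises arc_c_min to arc_meta_limit / 2 = 1.875GB because zfs_arc_min is assumed unset.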
3691 3684
3692 3685 void
3693 3686 arc_fini(void)
3694 3687 {
3695 3688 mutex_enter(&arc_reclaim_thr_lock);
3696 3689 arc_thread_exit = 1;
3697 3690 while (arc_thread_exit != 0)
3698 3691 cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
3699 3692 mutex_exit(&arc_reclaim_thr_lock);
3700 3693
3701 3694 arc_flush(NULL);
3702 3695
3703 3696 arc_dead = TRUE;
3704 3697
3705 3698 if (arc_ksp != NULL) {
3706 3699 kstat_delete(arc_ksp);
3707 3700 arc_ksp = NULL;
3708 3701 }
3709 3702
3710 3703 mutex_destroy(&arc_eviction_mtx);
3711 3704 mutex_destroy(&arc_reclaim_thr_lock);
3712 3705 cv_destroy(&arc_reclaim_thr_cv);
3713 3706
3714 3707 list_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
3715 3708 list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
3716 3709 list_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
3717 3710 list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
3718 3711 list_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
3719 3712 list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
3720 3713 list_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
3721 3714 list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
3722 3715
3723 3716 mutex_destroy(&arc_anon->arcs_mtx);
3724 3717 mutex_destroy(&arc_mru->arcs_mtx);
3725 3718 mutex_destroy(&arc_mru_ghost->arcs_mtx);
3726 3719 mutex_destroy(&arc_mfu->arcs_mtx);
3727 3720 mutex_destroy(&arc_mfu_ghost->arcs_mtx);
3728 3721 mutex_destroy(&arc_l2c_only->arcs_mtx);
3729 3722
3730 3723 mutex_destroy(&zfs_write_limit_lock);
3731 3724
3732 3725 buf_fini();
3733 3726
3734 3727 ASSERT(arc_loaned_bytes == 0);
3735 3728 }
3736 3729
3737 3730 /*
3738 3731 * Level 2 ARC
3739 3732 *
3740 3733 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
3741 3734 * It uses dedicated storage devices to hold cached data, which are populated
3742 3735 * using large infrequent writes. The main role of this cache is to boost
3743 3736 * the performance of random read workloads. The intended L2ARC devices
3744 3737 * include short-stroked disks, solid state disks, and other media with
3745 3738 * substantially faster read latency than disk.
3746 3739 *
3747 3740 * +-----------------------+
3748 3741 * | ARC |
3749 3742 * +-----------------------+
3750 3743 * | ^ ^
3751 3744 * | | |
3752 3745 * l2arc_feed_thread() arc_read()
3753 3746 * | | |
3754 3747 * | l2arc read |
3755 3748 * V | |
3756 3749 * +---------------+ |
3757 3750 * | L2ARC | |
3758 3751 * +---------------+ |
3759 3752 * | ^ |
3760 3753 * l2arc_write() | |
3761 3754 * | | |
3762 3755 * V | |
3763 3756 * +-------+ +-------+
3764 3757 * | vdev | | vdev |
3765 3758 * | cache | | cache |
3766 3759 * +-------+ +-------+
3767 3760 * +=========+ .-----.
3768 3761 * : L2ARC : |-_____-|
3769 3762 * : devices : | Disks |
3770 3763 * +=========+ `-_____-'
3771 3764 *
3772 3765 * Read requests are satisfied from the following sources, in order:
3773 3766 *
3774 3767 * 1) ARC
3775 3768 * 2) vdev cache of L2ARC devices
3776 3769 * 3) L2ARC devices
3777 3770 * 4) vdev cache of disks
3778 3771 * 5) disks
3779 3772 *
3780 3773 * Some L2ARC device types exhibit extremely slow write performance.
3781 3774	 * To accommodate this, there are some significant differences between
3782 3775 * the L2ARC and traditional cache design:
3783 3776 *
3784 3777 * 1. There is no eviction path from the ARC to the L2ARC. Evictions from
3785 3778 * the ARC behave as usual, freeing buffers and placing headers on ghost
3786 3779 * lists. The ARC does not send buffers to the L2ARC during eviction as
3787 3780 * this would add inflated write latencies for all ARC memory pressure.
3788 3781 *
3789 3782 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
3790 3783 * It does this by periodically scanning buffers from the eviction-end of
3791 3784 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
3792 3785 * not already there. It scans until a headroom of buffers is satisfied,
3793 3786 * which itself is a buffer for ARC eviction. The thread that does this is
3794 3787 * l2arc_feed_thread(), illustrated below; example sizes are included to
3795 3788 * provide a better sense of ratio than this diagram:
3796 3789 *
3797 3790 * head --> tail
3798 3791 * +---------------------+----------+
3799 3792 * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC
3800 3793 * +---------------------+----------+ | o L2ARC eligible
3801 3794 * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer
3802 3795 * +---------------------+----------+ |
3803 3796 * 15.9 Gbytes ^ 32 Mbytes |
3804 3797 * headroom |
3805 3798 * l2arc_feed_thread()
3806 3799 * |
3807 3800 * l2arc write hand <--[oooo]--'
3808 3801 * | 8 Mbyte
3809 3802 * | write max
3810 3803 * V
3811 3804 * +==============================+
3812 3805 * L2ARC dev |####|#|###|###| |####| ... |
3813 3806 * +==============================+
3814 3807 * 32 Gbytes
3815 3808 *
3816 3809 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
3817 3810 * evicted, then the L2ARC has cached a buffer much sooner than it probably
3818 3811 * needed to, potentially wasting L2ARC device bandwidth and storage. It is
3819 3812 * safe to say that this is an uncommon case, since buffers at the end of
3820 3813 * the ARC lists have moved there due to inactivity.
3821 3814 *
3822 3815 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
3823 3816 * then the L2ARC simply misses copying some buffers. This serves as a
3824 3817 * pressure valve to prevent heavy read workloads from both stalling the ARC
3825 3818 * with waits and clogging the L2ARC with writes. This also helps prevent
3826 3819 * the potential for the L2ARC to churn if it attempts to cache content too
3827 3820 * quickly, such as during backups of the entire pool.
3828 3821 *
3829 3822 * 5. After system boot and before the ARC has filled main memory, there are
3830 3823 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
3831 3824 * lists can remain mostly static. Instead of searching from tail of these
3832 3825 * lists as pictured, the l2arc_feed_thread() will search from the list heads
3833 3826 * for eligible buffers, greatly increasing its chance of finding them.
3834 3827 *
3835 3828 * The L2ARC device write speed is also boosted during this time so that
3836 3829 * the L2ARC warms up faster. Since there have been no ARC evictions yet,
3837 3830 * there are no L2ARC reads, and no fear of degrading read performance
3838 3831 * through increased writes.
3839 3832 *
3840 3833 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
3841 3834 * the vdev queue can aggregate them into larger and fewer writes. Each
3842 3835 * device is written to in a rotor fashion, sweeping writes through
3843 3836 * available space then repeating.
3844 3837 *
3845 3838 * 7. The L2ARC does not store dirty content. It never needs to flush
3846 3839 * write buffers back to disk based storage.
3847 3840 *
3848 3841 * 8. If an ARC buffer is written (and dirtied) which also exists in the
3849 3842 * L2ARC, the now stale L2ARC buffer is immediately dropped.
3850 3843 *
3851 3844 * The performance of the L2ARC can be tweaked by a number of tunables, which
3852 3845 * may be necessary for different workloads:
3853 3846 *
3854 3847 * l2arc_write_max max write bytes per interval
3855 3848 * l2arc_write_boost extra write bytes during device warmup
3856 3849 * l2arc_noprefetch skip caching prefetched buffers
3857 3850 * l2arc_headroom number of max device writes to precache
3858 3851 * l2arc_feed_secs seconds between L2ARC writing
3859 3852 *
3860 3853 * Tunables may be removed or added as future performance improvements are
3861 3854 * integrated, and also may become zpool properties.
3862 3855 *
3863 3856 * There are three key functions that control how the L2ARC warms up:
3864 3857 *
3865 3858 * l2arc_write_eligible() check if a buffer is eligible to cache
3866 3859 * l2arc_write_size() calculate how much to write
3867 3860 * l2arc_write_interval() calculate sleep delay between writes
3868 3861 *
3869 3862 * These three functions determine what to write, how much, and how quickly
3870 3863 * to send writes.
3871 3864 */
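
For scale on the write tunables above (illustrative values, not the compiled-in defaults): with l2arc_write_max = 8MB and l2arc_write_boost = 8MB, l2arc_write_size() below allows up to 8MB per feed interval once the ARC is warm, and up to 8MB + 8MB = 16MB while arc_warm is still B_FALSE.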
3872 3865
3873 3866 static boolean_t
3874 3867 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab)
3875 3868 {
3876 3869 /*
3877 3870 * A buffer is *not* eligible for the L2ARC if it:
3878 3871 * 1. belongs to a different spa.
3879 3872 * 2. is already cached on the L2ARC.
3880 3873 * 3. has an I/O in progress (it may be an incomplete read).
3881 3874 * 4. is flagged not eligible (zfs property).
3882 3875 */
3883 3876 if (ab->b_spa != spa_guid || ab->b_l2hdr != NULL ||
3884 3877 HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab))
3885 3878 return (B_FALSE);
3886 3879
3887 3880 return (B_TRUE);
3888 3881 }
3889 3882
3890 3883 static uint64_t
3891 3884 l2arc_write_size(l2arc_dev_t *dev)
3892 3885 {
3893 3886 uint64_t size;
3894 3887
3895 3888 size = dev->l2ad_write;
3896 3889
3897 3890 if (arc_warm == B_FALSE)
3898 3891 size += dev->l2ad_boost;
3899 3892
3900 3893 return (size);
3901 3894
3902 3895 }
3903 3896
3904 3897 static clock_t
3905 3898 l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
3906 3899 {
3907 3900 clock_t interval, next, now;
3908 3901
3909 3902 /*
3910 3903 * If the ARC lists are busy, increase our write rate; if the
3911 3904 * lists are stale, idle back. This is achieved by checking
3912 3905 * how much we previously wrote - if it was more than half of
3913 3906 * what we wanted, schedule the next write much sooner.
3914 3907 */
3915 3908 if (l2arc_feed_again && wrote > (wanted / 2))
3916 3909 interval = (hz * l2arc_feed_min_ms) / 1000;
3917 3910 else
3918 3911 interval = hz * l2arc_feed_secs;
3919 3912
3920 3913 now = ddi_get_lbolt();
3921 3914 next = MAX(now, MIN(now + interval, began + interval));
3922 3915
3923 3916 return (next);
3924 3917 }
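
Worked example of the interval math above, with purely illustrative values hz = 1000, l2arc_feed_secs = 1 and l2arc_feed_min_ms = 200 (not asserting these are the defaults) and l2arc_feed_again enabled: a pass that wrote more than half of what it wanted is rescheduled 200 ticks (0.2s) after 'began', otherwise 1000 ticks (1s) after 'began'; and because next = MAX(now, MIN(now + interval, began + interval)), a pass that itself took longer than the interval is rescheduled immediately (next = now).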
3925 3918
3926 3919 static void
3927 3920 l2arc_hdr_stat_add(void)
3928 3921 {
3929 3922 ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
3930 3923 ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
3931 3924 }
3932 3925
3933 3926 static void
3934 3927 l2arc_hdr_stat_remove(void)
3935 3928 {
3936 3929 ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
3937 3930 ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
3938 3931 }
3939 3932
3940 3933 /*
3941 3934 * Cycle through L2ARC devices. This is how L2ARC load balances.
3942 3935 * If a device is returned, this also returns holding the spa config lock.
3943 3936 */
3944 3937 static l2arc_dev_t *
3945 3938 l2arc_dev_get_next(void)
3946 3939 {
3947 3940 l2arc_dev_t *first, *next = NULL;
3948 3941
3949 3942 /*
3950 3943 * Lock out the removal of spas (spa_namespace_lock), then removal
3951 3944 * of cache devices (l2arc_dev_mtx). Once a device has been selected,
3952 3945 * both locks will be dropped and a spa config lock held instead.
3953 3946 */
3954 3947 mutex_enter(&spa_namespace_lock);
3955 3948 mutex_enter(&l2arc_dev_mtx);
3956 3949
3957 3950 /* if there are no vdevs, there is nothing to do */
3958 3951 if (l2arc_ndev == 0)
3959 3952 goto out;
3960 3953
3961 3954 first = NULL;
3962 3955 next = l2arc_dev_last;
3963 3956 do {
3964 3957 /* loop around the list looking for a non-faulted vdev */
3965 3958 if (next == NULL) {
3966 3959 next = list_head(l2arc_dev_list);
3967 3960 } else {
3968 3961 next = list_next(l2arc_dev_list, next);
3969 3962 if (next == NULL)
3970 3963 next = list_head(l2arc_dev_list);
3971 3964 }
3972 3965
3973 3966 /* if we have come back to the start, bail out */
3974 3967 if (first == NULL)
3975 3968 first = next;
3976 3969 else if (next == first)
3977 3970 break;
3978 3971
3979 3972 } while (vdev_is_dead(next->l2ad_vdev));
3980 3973
3981 3974 /* if we were unable to find any usable vdevs, return NULL */
3982 3975 if (vdev_is_dead(next->l2ad_vdev))
3983 3976 next = NULL;
3984 3977
3985 3978 l2arc_dev_last = next;
3986 3979
3987 3980 out:
3988 3981 mutex_exit(&l2arc_dev_mtx);
3989 3982
3990 3983 /*
3991 3984 * Grab the config lock to prevent the 'next' device from being
3992 3985 * removed while we are writing to it.
3993 3986 */
3994 3987 if (next != NULL)
3995 3988 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
3996 3989 mutex_exit(&spa_namespace_lock);
3997 3990
3998 3991 return (next);
3999 3992 }
4000 3993
4001 3994 /*
4002 3995 * Free buffers that were tagged for destruction.
4003 3996 */
4004 3997 static void
4005 3998 l2arc_do_free_on_write()
4006 3999 {
4007 4000 list_t *buflist;
4008 4001 l2arc_data_free_t *df, *df_prev;
4009 4002
4010 4003 mutex_enter(&l2arc_free_on_write_mtx);
4011 4004 buflist = l2arc_free_on_write;
4012 4005
4013 4006 for (df = list_tail(buflist); df; df = df_prev) {
4014 4007 df_prev = list_prev(buflist, df);
4015 4008 ASSERT(df->l2df_data != NULL);
4016 4009 ASSERT(df->l2df_func != NULL);
4017 4010 df->l2df_func(df->l2df_data, df->l2df_size);
4018 4011 list_remove(buflist, df);
4019 4012 kmem_free(df, sizeof (l2arc_data_free_t));
4020 4013 }
4021 4014
4022 4015 mutex_exit(&l2arc_free_on_write_mtx);
4023 4016 }
4024 4017
4025 4018 /*
4026 4019 * A write to a cache device has completed. Update all headers to allow
4027 4020 * reads from these buffers to begin.
4028 4021 */
4029 4022 static void
4030 4023 l2arc_write_done(zio_t *zio)
4031 4024 {
4032 4025 l2arc_write_callback_t *cb;
4033 4026 l2arc_dev_t *dev;
4034 4027 list_t *buflist;
4035 4028 arc_buf_hdr_t *head, *ab, *ab_prev;
4036 4029 l2arc_buf_hdr_t *abl2;
4037 4030 kmutex_t *hash_lock;
4038 4031
4039 4032 cb = zio->io_private;
4040 4033 ASSERT(cb != NULL);
4041 4034 dev = cb->l2wcb_dev;
4042 4035 ASSERT(dev != NULL);
4043 4036 head = cb->l2wcb_head;
4044 4037 ASSERT(head != NULL);
4045 4038 buflist = dev->l2ad_buflist;
4046 4039 ASSERT(buflist != NULL);
4047 4040 DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
4048 4041 l2arc_write_callback_t *, cb);
4049 4042
4050 4043 if (zio->io_error != 0)
4051 4044 ARCSTAT_BUMP(arcstat_l2_writes_error);
4052 4045
4053 4046 mutex_enter(&l2arc_buflist_mtx);
4054 4047
4055 4048 /*
4056 4049 * All writes completed, or an error was hit.
4057 4050 */
4058 4051 for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
4059 4052 ab_prev = list_prev(buflist, ab);
4060 4053
4061 4054 hash_lock = HDR_LOCK(ab);
4062 4055 if (!mutex_tryenter(hash_lock)) {
4063 4056 /*
4064 4057 * This buffer misses out. It may be in a stage
4065 4058 * of eviction. Its ARC_L2_WRITING flag will be
4066 4059 * left set, denying reads to this buffer.
4067 4060 */
4068 4061 ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
4069 4062 continue;
4070 4063 }
4071 4064
4072 4065 if (zio->io_error != 0) {
4073 4066 /*
4074 4067 * Error - drop L2ARC entry.
4075 4068 */
4076 4069 list_remove(buflist, ab);
4077 4070 abl2 = ab->b_l2hdr;
4078 4071 ab->b_l2hdr = NULL;
4079 4072 kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
4080 4073 ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
4081 4074 }
4082 4075
4083 4076 /*
4084 4077 * Allow ARC to begin reads to this L2ARC entry.
4085 4078 */
4086 4079 ab->b_flags &= ~ARC_L2_WRITING;
4087 4080
4088 4081 mutex_exit(hash_lock);
4089 4082 }
4090 4083
4091 4084 atomic_inc_64(&l2arc_writes_done);
4092 4085 list_remove(buflist, head);
4093 4086 kmem_cache_free(hdr_cache, head);
4094 4087 mutex_exit(&l2arc_buflist_mtx);
4095 4088
4096 4089 l2arc_do_free_on_write();
4097 4090
4098 4091 kmem_free(cb, sizeof (l2arc_write_callback_t));
4099 4092 }
4100 4093
4101 4094 /*
4102 4095 * A read to a cache device completed. Validate buffer contents before
4103 4096 * handing over to the regular ARC routines.
4104 4097 */
4105 4098 static void
4106 4099 l2arc_read_done(zio_t *zio)
4107 4100 {
4108 4101 l2arc_read_callback_t *cb;
4109 4102 arc_buf_hdr_t *hdr;
4110 4103 arc_buf_t *buf;
4111 4104 kmutex_t *hash_lock;
4112 4105 int equal;
4113 4106
4114 4107 ASSERT(zio->io_vd != NULL);
4115 4108 ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
4116 4109
4117 4110 spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
4118 4111
4119 4112 cb = zio->io_private;
4120 4113 ASSERT(cb != NULL);
4121 4114 buf = cb->l2rcb_buf;
4122 4115 ASSERT(buf != NULL);
4123 4116
4124 4117 hash_lock = HDR_LOCK(buf->b_hdr);
4125 4118 mutex_enter(hash_lock);
4126 4119 hdr = buf->b_hdr;
4127 4120 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
4128 4121
4129 4122 /*
4130 4123 * Check this survived the L2ARC journey.
4131 4124 */
4132 4125 equal = arc_cksum_equal(buf);
4133 4126 if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
4134 4127 mutex_exit(hash_lock);
4135 4128 zio->io_private = buf;
4136 4129 zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
4137 4130 zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
4138 4131 arc_read_done(zio);
4139 4132 } else {
4140 4133 mutex_exit(hash_lock);
4141 4134 /*
4142 4135 * Buffer didn't survive caching. Increment stats and
4143 4136 * reissue to the original storage device.
4144 4137 */
4145 4138 if (zio->io_error != 0) {
4146 4139 ARCSTAT_BUMP(arcstat_l2_io_error);
4147 4140 } else {
4148 4141 zio->io_error = SET_ERROR(EIO);
4149 4142 }
4150 4143 if (!equal)
4151 4144 ARCSTAT_BUMP(arcstat_l2_cksum_bad);
4152 4145
4153 4146 /*
4154 4147 * If there's no waiter, issue an async i/o to the primary
4155 4148 * storage now. If there *is* a waiter, the caller must
4156 4149 * issue the i/o in a context where it's OK to block.
4157 4150 */
4158 4151 if (zio->io_waiter == NULL) {
4159 4152 zio_t *pio = zio_unique_parent(zio);
4160 4153
4161 4154 ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
4162 4155
4163 4156 zio_nowait(zio_read(pio, cb->l2rcb_spa, &cb->l2rcb_bp,
4164 4157 buf->b_data, zio->io_size, arc_read_done, buf,
4165 4158 zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
4166 4159 }
4167 4160 }
4168 4161
4169 4162 kmem_free(cb, sizeof (l2arc_read_callback_t));
4170 4163 }
4171 4164
4172 4165 /*
4173 4166 * This is the list priority from which the L2ARC will search for pages to
4174 4167 * cache. This is used within loops (0..3) to cycle through lists in the
4175 4168 * desired order. This order can have a significant effect on cache
4176 4169 * performance.
4177 4170 *
4178 4171 * Currently the metadata lists are hit first, MFU then MRU, followed by
4179 4172 * the data lists. This function returns a locked list, and also returns
4180 4173 * the lock pointer.
4181 4174 */
4182 4175 static list_t *
4183 4176 l2arc_list_locked(int list_num, kmutex_t **lock)
4184 4177 {
4185 4178 list_t *list = NULL;
4186 4179
4187 4180 ASSERT(list_num >= 0 && list_num <= 3);
4188 4181
4189 4182 switch (list_num) {
4190 4183 case 0:
4191 4184 list = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
4192 4185 *lock = &arc_mfu->arcs_mtx;
4193 4186 break;
4194 4187 case 1:
4195 4188 list = &arc_mru->arcs_list[ARC_BUFC_METADATA];
4196 4189 *lock = &arc_mru->arcs_mtx;
4197 4190 break;
4198 4191 case 2:
4199 4192 list = &arc_mfu->arcs_list[ARC_BUFC_DATA];
4200 4193 *lock = &arc_mfu->arcs_mtx;
4201 4194 break;
4202 4195 case 3:
4203 4196 list = &arc_mru->arcs_list[ARC_BUFC_DATA];
4204 4197 *lock = &arc_mru->arcs_mtx;
4205 4198 break;
4206 4199 }
4207 4200
4208 4201 ASSERT(!(MUTEX_HELD(*lock)));
4209 4202 mutex_enter(*lock);
4210 4203 return (list);
4211 4204 }
4212 4205
4213 4206 /*
4214 4207 * Evict buffers from the device write hand to the distance specified in
4215 4208	 * bytes. This distance may span populated buffers, or it may span nothing.
4216 4209 * This is clearing a region on the L2ARC device ready for writing.
4217 4210 * If the 'all' boolean is set, every buffer is evicted.
4218 4211 */
4219 4212 static void
4220 4213 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
4221 4214 {
4222 4215 list_t *buflist;
4223 4216 l2arc_buf_hdr_t *abl2;
4224 4217 arc_buf_hdr_t *ab, *ab_prev;
4225 4218 kmutex_t *hash_lock;
4226 4219 uint64_t taddr;
4227 4220
4228 4221 buflist = dev->l2ad_buflist;
4229 4222
4230 4223 if (buflist == NULL)
4231 4224 return;
4232 4225
4233 4226 if (!all && dev->l2ad_first) {
4234 4227 /*
4235 4228 * This is the first sweep through the device. There is
4236 4229 * nothing to evict.
4237 4230 */
4238 4231 return;
4239 4232 }
4240 4233
4241 4234 if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
4242 4235 /*
4243 4236 * When nearing the end of the device, evict to the end
4244 4237 * before the device write hand jumps to the start.
4245 4238 */
4246 4239 taddr = dev->l2ad_end;
4247 4240 } else {
4248 4241 taddr = dev->l2ad_hand + distance;
4249 4242 }
4250 4243 DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
4251 4244 uint64_t, taddr, boolean_t, all);
4252 4245
4253 4246 top:
4254 4247 mutex_enter(&l2arc_buflist_mtx);
4255 4248 for (ab = list_tail(buflist); ab; ab = ab_prev) {
4256 4249 ab_prev = list_prev(buflist, ab);
4257 4250
4258 4251 hash_lock = HDR_LOCK(ab);
4259 4252 if (!mutex_tryenter(hash_lock)) {
4260 4253 /*
4261 4254 * Missed the hash lock. Retry.
4262 4255 */
4263 4256 ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
4264 4257 mutex_exit(&l2arc_buflist_mtx);
4265 4258 mutex_enter(hash_lock);
4266 4259 mutex_exit(hash_lock);
4267 4260 goto top;
4268 4261 }
4269 4262
4270 4263 if (HDR_L2_WRITE_HEAD(ab)) {
4271 4264 /*
4272 4265 * We hit a write head node. Leave it for
4273 4266 * l2arc_write_done().
4274 4267 */
4275 4268 list_remove(buflist, ab);
4276 4269 mutex_exit(hash_lock);
4277 4270 continue;
4278 4271 }
4279 4272
4280 4273 if (!all && ab->b_l2hdr != NULL &&
4281 4274 (ab->b_l2hdr->b_daddr > taddr ||
4282 4275 ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
4283 4276 /*
4284 4277 * We've evicted to the target address,
4285 4278 * or the end of the device.
4286 4279 */
4287 4280 mutex_exit(hash_lock);
4288 4281 break;
4289 4282 }
4290 4283
4291 4284 if (HDR_FREE_IN_PROGRESS(ab)) {
4292 4285 /*
4293 4286 * Already on the path to destruction.
4294 4287 */
4295 4288 mutex_exit(hash_lock);
4296 4289 continue;
4297 4290 }
4298 4291
4299 4292 if (ab->b_state == arc_l2c_only) {
4300 4293 ASSERT(!HDR_L2_READING(ab));
4301 4294 /*
4302 4295 * This doesn't exist in the ARC. Destroy.
4303 4296 * arc_hdr_destroy() will call list_remove()
4304 4297 * and decrement arcstat_l2_size.
4305 4298 */
4306 4299 arc_change_state(arc_anon, ab, hash_lock);
4307 4300 arc_hdr_destroy(ab);
4308 4301 } else {
4309 4302 /*
4310 4303 * Invalidate issued or about to be issued
4311 4304 * reads, since we may be about to write
4312 4305 * over this location.
4313 4306 */
4314 4307 if (HDR_L2_READING(ab)) {
4315 4308 ARCSTAT_BUMP(arcstat_l2_evict_reading);
4316 4309 ab->b_flags |= ARC_L2_EVICTED;
4317 4310 }
4318 4311
4319 4312 /*
4320 4313 * Tell ARC this no longer exists in L2ARC.
4321 4314 */
4322 4315 if (ab->b_l2hdr != NULL) {
4323 4316 abl2 = ab->b_l2hdr;
4324 4317 ab->b_l2hdr = NULL;
4325 4318 kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
4326 4319 ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
4327 4320 }
4328 4321 list_remove(buflist, ab);
4329 4322
4330 4323 /*
4331 4324 * This may have been leftover after a
4332 4325 * failed write.
4333 4326 */
4334 4327 ab->b_flags &= ~ARC_L2_WRITING;
4335 4328 }
4336 4329 mutex_exit(hash_lock);
4337 4330 }
4338 4331 mutex_exit(&l2arc_buflist_mtx);
4339 4332
4340 4333 vdev_space_update(dev->l2ad_vdev, -(taddr - dev->l2ad_evict), 0, 0);
4341 4334 dev->l2ad_evict = taddr;
4342 4335 }
4343 4336
4344 4337 /*
4345 4338 * Find and write ARC buffers to the L2ARC device.
4346 4339 *
4347 4340 * An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
4348 4341 * for reading until they have completed writing.
4349 4342 */
4350 4343 static uint64_t
4351 4344 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
4352 4345 {
4353 4346 arc_buf_hdr_t *ab, *ab_prev, *head;
4354 4347 l2arc_buf_hdr_t *hdrl2;
4355 4348 list_t *list;
4356 4349 uint64_t passed_sz, write_sz, buf_sz, headroom;
4357 4350 void *buf_data;
4358 4351 kmutex_t *hash_lock, *list_lock;
4359 4352 boolean_t have_lock, full;
4360 4353 l2arc_write_callback_t *cb;
4361 4354 zio_t *pio, *wzio;
4362 4355 uint64_t guid = spa_load_guid(spa);
4363 4356
4364 4357 ASSERT(dev->l2ad_vdev != NULL);
4365 4358
4366 4359 pio = NULL;
4367 4360 write_sz = 0;
4368 4361 full = B_FALSE;
4369 4362 head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
4370 4363 head->b_flags |= ARC_L2_WRITE_HEAD;
4371 4364
4372 4365 /*
4373 4366 * Copy buffers for L2ARC writing.
4374 4367 */
4375 4368 mutex_enter(&l2arc_buflist_mtx);
4376 4369 for (int try = 0; try <= 3; try++) {
4377 4370 list = l2arc_list_locked(try, &list_lock);
4378 4371 passed_sz = 0;
4379 4372
4380 4373 /*
4381 4374 * L2ARC fast warmup.
4382 4375 *
4383 4376 * Until the ARC is warm and starts to evict, read from the
4384 4377 * head of the ARC lists rather than the tail.
4385 4378 */
4386 4379 headroom = target_sz * l2arc_headroom;
4387 4380 if (arc_warm == B_FALSE)
4388 4381 ab = list_head(list);
4389 4382 else
4390 4383 ab = list_tail(list);
4391 4384
4392 4385 for (; ab; ab = ab_prev) {
4393 4386 if (arc_warm == B_FALSE)
4394 4387 ab_prev = list_next(list, ab);
4395 4388 else
4396 4389 ab_prev = list_prev(list, ab);
4397 4390
4398 4391 hash_lock = HDR_LOCK(ab);
4399 4392 have_lock = MUTEX_HELD(hash_lock);
4400 4393 if (!have_lock && !mutex_tryenter(hash_lock)) {
4401 4394 /*
4402 4395 * Skip this buffer rather than waiting.
4403 4396 */
4404 4397 continue;
4405 4398 }
4406 4399
4407 4400 passed_sz += ab->b_size;
4408 4401 if (passed_sz > headroom) {
4409 4402 /*
4410 4403 * Searched too far.
4411 4404 */
4412 4405 mutex_exit(hash_lock);
4413 4406 break;
4414 4407 }
4415 4408
4416 4409 if (!l2arc_write_eligible(guid, ab)) {
4417 4410 mutex_exit(hash_lock);
4418 4411 continue;
4419 4412 }
4420 4413
4421 4414 if ((write_sz + ab->b_size) > target_sz) {
4422 4415 full = B_TRUE;
4423 4416 mutex_exit(hash_lock);
4424 4417 break;
4425 4418 }
4426 4419
4427 4420 if (pio == NULL) {
4428 4421 /*
4429 4422 * Insert a dummy header on the buflist so
4430 4423 * l2arc_write_done() can find where the
4431 4424 * write buffers begin without searching.
4432 4425 */
4433 4426 list_insert_head(dev->l2ad_buflist, head);
4434 4427
4435 4428 cb = kmem_alloc(
4436 4429 sizeof (l2arc_write_callback_t), KM_SLEEP);
4437 4430 cb->l2wcb_dev = dev;
4438 4431 cb->l2wcb_head = head;
4439 4432 pio = zio_root(spa, l2arc_write_done, cb,
4440 4433 ZIO_FLAG_CANFAIL);
4441 4434 }
4442 4435
4443 4436 /*
4444 4437 * Create and add a new L2ARC header.
4445 4438 */
4446 4439 hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
4447 4440 hdrl2->b_dev = dev;
4448 4441 hdrl2->b_daddr = dev->l2ad_hand;
4449 4442
4450 4443 ab->b_flags |= ARC_L2_WRITING;
4451 4444 ab->b_l2hdr = hdrl2;
4452 4445 list_insert_head(dev->l2ad_buflist, ab);
4453 4446 buf_data = ab->b_buf->b_data;
4454 4447 buf_sz = ab->b_size;
4455 4448
4456 4449 /*
4457 4450 * Compute and store the buffer cksum before
4458 4451 * writing. On debug the cksum is verified first.
4459 4452 */
4460 4453 arc_cksum_verify(ab->b_buf);
4461 4454 arc_cksum_compute(ab->b_buf, B_TRUE);
4462 4455
4463 4456 mutex_exit(hash_lock);
4464 4457
4465 4458 wzio = zio_write_phys(pio, dev->l2ad_vdev,
4466 4459 dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
4467 4460 NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
4468 4461 ZIO_FLAG_CANFAIL, B_FALSE);
4469 4462
4470 4463 DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
4471 4464 zio_t *, wzio);
4472 4465 (void) zio_nowait(wzio);
4473 4466
4474 4467 /*
4475 4468 * Keep the clock hand suitably device-aligned.
4476 4469 */
4477 4470 buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
4478 4471
4479 4472 write_sz += buf_sz;
4480 4473 dev->l2ad_hand += buf_sz;
4481 4474 }
4482 4475
4483 4476 mutex_exit(list_lock);
4484 4477
4485 4478 if (full == B_TRUE)
4486 4479 break;
4487 4480 }
4488 4481 mutex_exit(&l2arc_buflist_mtx);
4489 4482
4490 4483 if (pio == NULL) {
4491 4484 ASSERT0(write_sz);
4492 4485 kmem_cache_free(hdr_cache, head);
4493 4486 return (0);
4494 4487 }
4495 4488
4496 4489 ASSERT3U(write_sz, <=, target_sz);
4497 4490 ARCSTAT_BUMP(arcstat_l2_writes_sent);
4498 4491 ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
4499 4492 ARCSTAT_INCR(arcstat_l2_size, write_sz);
4500 4493 vdev_space_update(dev->l2ad_vdev, write_sz, 0, 0);
4501 4494
4502 4495 /*
4503 4496 * Bump device hand to the device start if it is approaching the end.
4504 4497 * l2arc_evict() will already have evicted ahead for this case.
4505 4498 */
4506 4499 if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
4507 4500 vdev_space_update(dev->l2ad_vdev,
4508 4501 dev->l2ad_end - dev->l2ad_hand, 0, 0);
4509 4502 dev->l2ad_hand = dev->l2ad_start;
4510 4503 dev->l2ad_evict = dev->l2ad_start;
4511 4504 dev->l2ad_first = B_FALSE;
4512 4505 }
4513 4506
4514 4507 dev->l2ad_writing = B_TRUE;
4515 4508 (void) zio_wait(pio);
4516 4509 dev->l2ad_writing = B_FALSE;
4517 4510
4518 4511 return (write_sz);
4519 4512 }
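
To illustrate the scan bounds above (hypothetical sizes, not defaults): with target_sz = 8MB and l2arc_headroom = 2, each of the four lists is walked until either 16MB of buffers have been passed over (passed_sz > headroom) or queuing the next buffer would push the total past 8MB (full = B_TRUE), whichever comes first; while arc_warm is B_FALSE the walk starts from the list head instead of the tail, per the fast-warmup comment.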
4520 4513
4521 4514 /*
4522 4515 * This thread feeds the L2ARC at regular intervals. This is the beating
4523 4516 * heart of the L2ARC.
4524 4517 */
4525 4518 static void
4526 4519 l2arc_feed_thread(void)
4527 4520 {
4528 4521 callb_cpr_t cpr;
4529 4522 l2arc_dev_t *dev;
4530 4523 spa_t *spa;
4531 4524 uint64_t size, wrote;
4532 4525 clock_t begin, next = ddi_get_lbolt();
4533 4526
4534 4527 CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
4535 4528
4536 4529 mutex_enter(&l2arc_feed_thr_lock);
4537 4530
4538 4531 while (l2arc_thread_exit == 0) {
4539 4532 CALLB_CPR_SAFE_BEGIN(&cpr);
4540 4533 (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
4541 4534 next);
4542 4535 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
4543 4536 next = ddi_get_lbolt() + hz;
4544 4537
4545 4538 /*
4546 4539 * Quick check for L2ARC devices.
4547 4540 */
4548 4541 mutex_enter(&l2arc_dev_mtx);
4549 4542 if (l2arc_ndev == 0) {
4550 4543 mutex_exit(&l2arc_dev_mtx);
4551 4544 continue;
4552 4545 }
4553 4546 mutex_exit(&l2arc_dev_mtx);
4554 4547 begin = ddi_get_lbolt();
4555 4548
4556 4549 /*
4557 4550 * This selects the next l2arc device to write to, and in
4558 4551 * doing so the next spa to feed from: dev->l2ad_spa. This
4559 4552 * will return NULL if there are now no l2arc devices or if
4560 4553 * they are all faulted.
4561 4554 *
4562 4555 * If a device is returned, its spa's config lock is also
4563 4556 * held to prevent device removal. l2arc_dev_get_next()
4564 4557 * will grab and release l2arc_dev_mtx.
4565 4558 */
4566 4559 if ((dev = l2arc_dev_get_next()) == NULL)
4567 4560 continue;
4568 4561
4569 4562 spa = dev->l2ad_spa;
4570 4563 ASSERT(spa != NULL);
4571 4564
4572 4565 /*
4573 4566 * If the pool is read-only then force the feed thread to
4574 4567 * sleep a little longer.
4575 4568 */
4576 4569 if (!spa_writeable(spa)) {
4577 4570 next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
4578 4571 spa_config_exit(spa, SCL_L2ARC, dev);
4579 4572 continue;
4580 4573 }
4581 4574
4582 4575 /*
4583 4576 * Avoid contributing to memory pressure.
4584 4577 */
4585 4578 if (arc_reclaim_needed()) {
4586 4579 ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
4587 4580 spa_config_exit(spa, SCL_L2ARC, dev);
4588 4581 continue;
4589 4582 }
4590 4583
4591 4584 ARCSTAT_BUMP(arcstat_l2_feeds);
4592 4585
4593 4586 size = l2arc_write_size(dev);
4594 4587
4595 4588 /*
4596 4589 * Evict L2ARC buffers that will be overwritten.
4597 4590 */
4598 4591 l2arc_evict(dev, size, B_FALSE);
4599 4592
4600 4593 /*
4601 4594 * Write ARC buffers.
4602 4595 */
4603 4596 wrote = l2arc_write_buffers(spa, dev, size);
4604 4597
4605 4598 /*
4606 4599 * Calculate interval between writes.
4607 4600 */
4608 4601 next = l2arc_write_interval(begin, size, wrote);
4609 4602 spa_config_exit(spa, SCL_L2ARC, dev);
4610 4603 }
4611 4604
4612 4605 l2arc_thread_exit = 0;
4613 4606 cv_broadcast(&l2arc_feed_thr_cv);
4614 4607 CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
4615 4608 thread_exit();
4616 4609 }
4617 4610
4618 4611 boolean_t
4619 4612 l2arc_vdev_present(vdev_t *vd)
4620 4613 {
4621 4614 l2arc_dev_t *dev;
4622 4615
4623 4616 mutex_enter(&l2arc_dev_mtx);
4624 4617 for (dev = list_head(l2arc_dev_list); dev != NULL;
4625 4618 dev = list_next(l2arc_dev_list, dev)) {
4626 4619 if (dev->l2ad_vdev == vd)
4627 4620 break;
4628 4621 }
4629 4622 mutex_exit(&l2arc_dev_mtx);
4630 4623
4631 4624 return (dev != NULL);
4632 4625 }
4633 4626
4634 4627 /*
4635 4628 * Add a vdev for use by the L2ARC. By this point the spa has already
4636 4629 * validated the vdev and opened it.
4637 4630 */
4638 4631 void
4639 4632 l2arc_add_vdev(spa_t *spa, vdev_t *vd)
4640 4633 {
4641 4634 l2arc_dev_t *adddev;
4642 4635
4643 4636 ASSERT(!l2arc_vdev_present(vd));
4644 4637
4645 4638 /*
4646 4639 * Create a new l2arc device entry.
4647 4640 */
4648 4641 adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
4649 4642 adddev->l2ad_spa = spa;
4650 4643 adddev->l2ad_vdev = vd;
4651 4644 adddev->l2ad_write = l2arc_write_max;
4652 4645 adddev->l2ad_boost = l2arc_write_boost;
4653 4646 adddev->l2ad_start = VDEV_LABEL_START_SIZE;
4654 4647 adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
4655 4648 adddev->l2ad_hand = adddev->l2ad_start;
4656 4649 adddev->l2ad_evict = adddev->l2ad_start;
4657 4650 adddev->l2ad_first = B_TRUE;
4658 4651 adddev->l2ad_writing = B_FALSE;
4659 4652 ASSERT3U(adddev->l2ad_write, >, 0);
4660 4653
4661 4654 /*
4662 4655 * This is a list of all ARC buffers that are still valid on the
4663 4656 * device.
4664 4657 */
4665 4658 adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
4666 4659 list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
4667 4660 offsetof(arc_buf_hdr_t, b_l2node));
4668 4661
4669 4662 vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
4670 4663
4671 4664 /*
4672 4665 * Add device to global list
4673 4666 */
4674 4667 mutex_enter(&l2arc_dev_mtx);
4675 4668 list_insert_head(l2arc_dev_list, adddev);
4676 4669 atomic_inc_64(&l2arc_ndev);
4677 4670 mutex_exit(&l2arc_dev_mtx);
4678 4671 }
4679 4672
4680 4673 /*
4681 4674 * Remove a vdev from the L2ARC.
4682 4675 */
4683 4676 void
4684 4677 l2arc_remove_vdev(vdev_t *vd)
4685 4678 {
4686 4679 l2arc_dev_t *dev, *nextdev, *remdev = NULL;
4687 4680
4688 4681 /*
4689 4682 * Find the device by vdev
4690 4683 */
4691 4684 mutex_enter(&l2arc_dev_mtx);
4692 4685 for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
4693 4686 nextdev = list_next(l2arc_dev_list, dev);
4694 4687 if (vd == dev->l2ad_vdev) {
4695 4688 remdev = dev;
4696 4689 break;
4697 4690 }
4698 4691 }
4699 4692 ASSERT(remdev != NULL);
4700 4693
4701 4694 /*
4702 4695 * Remove device from global list
4703 4696 */
4704 4697 list_remove(l2arc_dev_list, remdev);
4705 4698 l2arc_dev_last = NULL; /* may have been invalidated */
4706 4699 atomic_dec_64(&l2arc_ndev);
4707 4700 mutex_exit(&l2arc_dev_mtx);
4708 4701
4709 4702 /*
4710 4703 * Clear all buflists and ARC references. L2ARC device flush.
4711 4704 */
4712 4705 l2arc_evict(remdev, 0, B_TRUE);
4713 4706 list_destroy(remdev->l2ad_buflist);
4714 4707 kmem_free(remdev->l2ad_buflist, sizeof (list_t));
4715 4708 kmem_free(remdev, sizeof (l2arc_dev_t));
4716 4709 }
4717 4710
4718 4711 void
4719 4712 l2arc_init(void)
4720 4713 {
4721 4714 l2arc_thread_exit = 0;
4722 4715 l2arc_ndev = 0;
4723 4716 l2arc_writes_sent = 0;
4724 4717 l2arc_writes_done = 0;
4725 4718
4726 4719 mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
4727 4720 cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
4728 4721 mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
4729 4722 mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
4730 4723 mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
4731 4724
4732 4725 l2arc_dev_list = &L2ARC_dev_list;
4733 4726 l2arc_free_on_write = &L2ARC_free_on_write;
4734 4727 list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
4735 4728 offsetof(l2arc_dev_t, l2ad_node));
4736 4729 list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
4737 4730 offsetof(l2arc_data_free_t, l2df_list_node));
4738 4731 }
4739 4732
4740 4733 void
4741 4734 l2arc_fini(void)
4742 4735 {
4743 4736 /*
4744 4737 * This is called from dmu_fini(), which is called from spa_fini();
4745 4738 * Because of this, we can assume that all l2arc devices have
4746 4739 * already been removed when the pools themselves were removed.
4747 4740 */
4748 4741
4749 4742 l2arc_do_free_on_write();
4750 4743
4751 4744 mutex_destroy(&l2arc_feed_thr_lock);
4752 4745 cv_destroy(&l2arc_feed_thr_cv);
4753 4746 mutex_destroy(&l2arc_dev_mtx);
4754 4747 mutex_destroy(&l2arc_buflist_mtx);
4755 4748 mutex_destroy(&l2arc_free_on_write_mtx);
4756 4749
4757 4750 list_destroy(l2arc_dev_list);
4758 4751 list_destroy(l2arc_free_on_write);
4759 4752 }
4760 4753
4761 4754 void
4762 4755 l2arc_start(void)
4763 4756 {
4764 4757 if (!(spa_mode_global & FWRITE))
4765 4758 return;
4766 4759
4767 4760 (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
4768 4761 TS_RUN, minclsyspri);
4769 4762 }
4770 4763
4771 4764 void
4772 4765 l2arc_stop(void)
4773 4766 {
4774 4767 if (!(spa_mode_global & FWRITE))
4775 4768 return;
4776 4769
4777 4770 mutex_enter(&l2arc_feed_thr_lock);
4778 4771 cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
4779 4772 l2arc_thread_exit = 1;
4780 4773 while (l2arc_thread_exit != 0)
4781 4774 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
4782 4775 mutex_exit(&l2arc_feed_thr_lock);
4783 4776 }