big-one Sdiff usr/src/uts/common/fs/zfs/arc.c

NEX-19742 A race between ARC and L2ARC causes system panic
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrect when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-8057 renaming of mount points should not be allowed (redo)
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5785 zdb: assertion failed for thread 0xf8a20240, thread-id 130: mp->initialized == B_TRUE, file ../common/kernel.c, line 162
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexent.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-4228 dedup arcstats are redundant
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-7317 Getting assert !refcount_is_zero(&scl->scl_count) when trying to import pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "Merge pull request #520 in OS/nza-kernel from ~SASO.KISELKOV/nza-kernel:NEX-5671-pl2arc-le_psize to master"
This reverts commit b63e91b939886744224854ea365d70e05ddd6077, reversing
changes made to a6e3a0255c8b22f65343bf641ffefaf9ae948fd4.
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6421 Add missing multilist_destroy calls to arc_fini
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
5219 l2arc_write_buffers() may write beyond target_sz
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Justin Gibbs <gibbs@FreeBSD.org>
Approved by: Matthew Ahrens <mahrens@delphix.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6220 memleak in l2arc on debug build
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
5847 libzfs_diff should check zfs_prop_get() return
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-3879 L2ARC evict task allocates a useless struct
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption (fix pL2ARC integration)
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3979 fix arc_mru/mfu typo
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3961 arc_meta_max is not counted correctly
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3946 Port Illumos 5983 to release-5.0
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Jean McCormack <jean.maccormack@nexenta.com>
NEX-3945 file-backed cache devices considered harmful
Reviewed by: Alek Pinchuk <alek@nexenta.com>
NEX-3541 Implement persistent L2ARC - fix build breakage in libzpool (v2).
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3630 Backport illumos #5701 from master to 5.0
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3558 KRRP Integration
NEX-3387 ARC stats appear to be in wrong/weird order
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3296 turn on DDT limit by default
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
 NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3079 port illumos ARC improvements
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0 (part 2)
NEX-2704 smbstat man page needs update
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0
3995 Memory leak of compressed buffers in l2arc_write_done
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
NEX-463: bumped max queue size for L2ARC async evict
Maximum length of a taskq used for async arc and l2arc flush is
now a tuneable (zfs_flush_ntasks) that is initialized to 64.
The number is equally arbitrary, yet higher than original 4.
Real fix should rework l2arc evict according to OS-53, but for now
just longer queue should suffice.
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line), place each kmutex_t on cache line by itself to avoid false sharing
re #14119 BAD-TRAP panic under load
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
re #13729 assign each ARC hash bucket its own mutex
In ARC the number of buckets in buffer header hash table is
proportional to the size of physical RAM.
The number of locks protecting headers in the buckets is fixed to 256 though.
Hence, on systems with large memory (>= 128GB) too many unrelated buffer
headers are protected by the same mutex.
When the memory in the system is fragmented this may cause a deadlock:
- An arc_read thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no contiguous
chunk of the requested size is available.
- ARC eviction thread that is supposed to evict some buffers would call
an evict callback on one of the buffers.
- Before freeing the memory, the callback will attempt to take a lock on buffer
header.
- Incidentally, this buffer header will be protected by the same lock as
the one in arc_read() thread.
The solution in this patch is not perfect - that is, it protects all headers
in the hash bucket by the same lock.
However, a probability of collision is very low and does not depend on memory
size.
By the same argument, padding locks to cacheline looks like a waste of memory
here since the probability of contention on a cacheline is quite low, given
the number of buckets, number of locks per cacheline (4) and the fact that
the hash function (crc64 % hash table size) is supposed to be a very good
randomizer.
The effect on memory usage is as follows:
Per hash table size n,
- Original code uses 16K + 16 + n * 8 bytes of memory
- This fix uses 2 * n * 8 + 8 bytes of memory
- The net memory overhead is therefore n * 8 - 16K - 8 bytes
The value of n grows proportionally to physical memory size.
For 128GB of physical memory it is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller, and
for larger memory configurations it is proportionally bigger.
The patch has been tested for 30+ hours using a vdbench script that reproduces
the hang with the original code 100% of the time in 20-30 minutes.
re #10054 rb4467 Support for asynchronous ARC/L2ARC eviction
re #13165 rb4265 zfs-monitor should fallback to using DEV_BSIZE
re #10054 rb4249 Long export time causes failover to fail

   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 /*
  22  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23  * Copyright (c) 2018, Joyent, Inc.
  24  * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25  * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
  26  * Copyright 2017 Nexenta Systems, Inc.  All rights reserved.
  27  */
  28 
  29 /*
  30  * DVA-based Adjustable Replacement Cache
  31  *
  32  * While much of the theory of operation used here is
  33  * based on the self-tuning, low overhead replacement cache
  34  * presented by Megiddo and Modha at FAST 2003, there are some
  35  * significant differences:
  36  *
  37  * 1. The Megiddo and Modha model assumes any page is evictable.
  38  * Pages in its cache cannot be "locked" into memory.  This makes
  39  * the eviction algorithm simple: evict the last page in the list.
  40  * This also make the performance characteristics easy to reason
  41  * about.  Our cache is not so simple.  At any given moment, some
  42  * subset of the blocks in the cache are un-evictable because we
  43  * have handed out a reference to them.  Blocks are only evictable
  44  * when there are no external references active.  This makes
  45  * eviction far more problematic:  we choose to evict the evictable
  46  * blocks that are the "lowest" in the list.

 236  * it may compress the data before writing it to disk. The ARC will be called
 237  * with the transformed data and will bcopy the transformed on-disk block into
 238  * a newly allocated b_pabd. Writes are always done into buffers which have
 239  * either been loaned (and hence are new and don't have other readers) or
 240  * buffers which have been released (and hence have their own hdr, if there
 241  * were originally other readers of the buf's original hdr). This ensures that
 242  * the ARC only needs to update a single buf and its hdr after a write occurs.
 243  *
 244  * When the L2ARC is in use, it will also take advantage of the b_pabd. The
 245  * L2ARC will always write the contents of b_pabd to the L2ARC. This means
 246  * that when compressed ARC is enabled that the L2ARC blocks are identical
 247  * to the on-disk block in the main data pool. This provides a significant
 248  * advantage since the ARC can leverage the bp's checksum when reading from the
 249  * L2ARC to determine if the contents are valid. However, if the compressed
 250  * ARC is disabled, then the L2ARC's block must be transformed to look
 251  * like the physical block in the main data pool before comparing the
 252  * checksum and determining its validity.
 253  */
 254 
 255 #include <sys/spa.h>
 256 #include <sys/zio.h>
 257 #include <sys/spa_impl.h>
 258 #include <sys/zio_compress.h>
 259 #include <sys/zio_checksum.h>
 260 #include <sys/zfs_context.h>
 261 #include <sys/arc.h>
 262 #include <sys/refcount.h>
 263 #include <sys/vdev.h>
 264 #include <sys/vdev_impl.h>
 265 #include <sys/dsl_pool.h>
 266 #include <sys/zio_checksum.h>
 267 #include <sys/multilist.h>
 268 #include <sys/abd.h>
 269 #ifdef _KERNEL
 270 #include <sys/vmsystm.h>
 271 #include <vm/anon.h>
 272 #include <sys/fs/swapnode.h>
 273 #include <sys/dnlc.h>
 274 #endif
 275 #include <sys/callb.h>
 276 #include <sys/kstat.h>
 277 #include <zfs_fletcher.h>
 278 #include <sys/aggsum.h>
 279 #include <sys/cityhash.h>
 280 
 281 #ifndef _KERNEL
 282 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 283 boolean_t arc_watch = B_FALSE;
 284 int arc_procfd;
 285 #endif
 286 
 287 static kmutex_t         arc_reclaim_lock;
 288 static kcondvar_t       arc_reclaim_thread_cv;
 289 static boolean_t        arc_reclaim_thread_exit;
 290 static kcondvar_t       arc_reclaim_waiters_cv;
 291 
 292 uint_t arc_reduce_dnlc_percent = 3;
 293 
 294 /*
 295  * The number of headers to evict in arc_evict_state_impl() before
 296  * dropping the sublist lock and evicting from another sublist. A lower
 297  * value means we're more likely to evict the "correct" header (i.e. the
 298  * oldest header in the arc state), but comes with higher overhead
 299  * (i.e. more invocations of arc_evict_state_impl()).

 340 
 341 static int arc_dead;
 342 
 343 /*
 344  * The arc has filled available memory and has now warmed up.
 345  */
 346 static boolean_t arc_warm;
 347 
 348 /*
 349  * log2 fraction of the zio arena to keep free.
 350  */
 351 int arc_zio_arena_free_shift = 2;
 352 
 353 /*
 354  * These tunables are for performance analysis.
 355  */
 356 uint64_t zfs_arc_max;
 357 uint64_t zfs_arc_min;
 358 uint64_t zfs_arc_meta_limit = 0;
 359 uint64_t zfs_arc_meta_min = 0;
 360 int zfs_arc_grow_retry = 0;
 361 int zfs_arc_shrink_shift = 0;
 362 int zfs_arc_p_min_shift = 0;
 363 int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 364 
 365 boolean_t zfs_compressed_arc_enabled = B_TRUE;
 366 
 367 /*
 368  * Note that buffers can be in one of 6 states:
 369  *      ARC_anon        - anonymous (discussed below)
 370  *      ARC_mru         - recently used, currently cached
 371  *      ARC_mru_ghost   - recently used, no longer in cache
 372  *      ARC_mfu         - frequently used, currently cached
 373  *      ARC_mfu_ghost   - frequently used, no longer in cache
 374  *      ARC_l2c_only    - exists in L2ARC but not other states
 375  * When there are no active references to the buffer, they are
 376  * linked onto a list in one of these arc states.  These are
 377  * the only buffers that can be evicted or deleted.  Within each
 378  * state there are multiple lists, one for meta-data and one for
 379  * non-meta-data.  Meta-data (indirect blocks, blocks of dnodes,
 380  * etc.) is tracked separately so that it can be managed more
 381  * explicitly: favored over data, limited explicitly.
 382  *
 383  * Anonymous buffers are buffers that are not associated with
 384  * a DVA.  These are buffers that hold dirty block copies

 390  * The ARC_l2c_only state is for buffers that are in the second
 391  * level ARC but no longer in any of the ARC_m* lists.  The second
 392  * level ARC itself may also contain buffers that are in any of
 393  * the ARC_m* states - meaning that a buffer can exist in two
 394  * places.  The reason for the ARC_l2c_only state is to keep the
 395  * buffer header in the hash table, so that reads that hit the
 396  * second level ARC benefit from these fast lookups.
 397  */
 398 
 399 typedef struct arc_state {
 400         /*
 401          * list of evictable buffers
 402          */
 403         multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
 404         /*
 405          * total amount of evictable data in this state
 406          */
 407         refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 408         /*
 409          * total amount of data in this state; this includes: evictable,
 410          * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
 411          */
 412         refcount_t arcs_size;
 413 } arc_state_t;
 414 
 415 /* The 6 states: */
 416 static arc_state_t ARC_anon;
 417 static arc_state_t ARC_mru;
 418 static arc_state_t ARC_mru_ghost;
 419 static arc_state_t ARC_mfu;
 420 static arc_state_t ARC_mfu_ghost;
 421 static arc_state_t ARC_l2c_only;
 422 
 423 typedef struct arc_stats {
 424         kstat_named_t arcstat_hits;
 425         kstat_named_t arcstat_misses;
 426         kstat_named_t arcstat_demand_data_hits;
 427         kstat_named_t arcstat_demand_data_misses;
 428         kstat_named_t arcstat_demand_metadata_hits;
 429         kstat_named_t arcstat_demand_metadata_misses;
 430         kstat_named_t arcstat_prefetch_data_hits;
 431         kstat_named_t arcstat_prefetch_data_misses;
 432         kstat_named_t arcstat_prefetch_metadata_hits;
 433         kstat_named_t arcstat_prefetch_metadata_misses;
 434         kstat_named_t arcstat_mru_hits;
 435         kstat_named_t arcstat_mru_ghost_hits;
 436         kstat_named_t arcstat_mfu_hits;
 437         kstat_named_t arcstat_mfu_ghost_hits;
 438         kstat_named_t arcstat_deleted;
 439         /*
 440          * Number of buffers that could not be evicted because the hash lock
 441          * was held by another thread.  The lock may not necessarily be held
 442          * by something using the same buffer, since hash locks are shared
 443          * by multiple buffers.
 444          */
 445         kstat_named_t arcstat_mutex_miss;
 446         /*
 447          * Number of buffers skipped because they have I/O in progress, are
 448          * indirect prefetch buffers that have not lived long enough, or are
 449          * not from the spa we're trying to evict from.
 450          */
 451         kstat_named_t arcstat_evict_skip;
 452         /*
 453          * Number of times arc_evict_state() was unable to evict enough
 454          * buffers to reach its target amount.
 455          */
 456         kstat_named_t arcstat_evict_not_enough;
 457         kstat_named_t arcstat_evict_l2_cached;
 458         kstat_named_t arcstat_evict_l2_eligible;
 459         kstat_named_t arcstat_evict_l2_ineligible;
 460         kstat_named_t arcstat_evict_l2_skip;
 461         kstat_named_t arcstat_hash_elements;
 462         kstat_named_t arcstat_hash_elements_max;
 463         kstat_named_t arcstat_hash_collisions;
 464         kstat_named_t arcstat_hash_chains;
 465         kstat_named_t arcstat_hash_chain_max;
 466         kstat_named_t arcstat_p;
 467         kstat_named_t arcstat_c;
 468         kstat_named_t arcstat_c_min;
 469         kstat_named_t arcstat_c_max;
 470         /* Not updated directly; only synced in arc_kstat_update. */
 471         kstat_named_t arcstat_size;
 472         /*
 473          * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
 474          * Note that the compressed bytes may match the uncompressed bytes
 475          * if the block is either not compressed or compressed arc is disabled.
 476          */
 477         kstat_named_t arcstat_compressed_size;
 478         /*
 479          * Uncompressed size of the data stored in b_pabd. If compressed
 480          * arc is disabled then this value will be identical to the stat
 481          * above.
 482          */
 483         kstat_named_t arcstat_uncompressed_size;
 484         /*
 485          * Number of bytes stored in all the arc_buf_t's. This is classified
 486          * as "overhead" since this data is typically short-lived and will
 487          * be evicted from the arc when it becomes unreferenced unless the
 488          * zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level
 489          * values have been set (see comment in dbuf.c for more information).
 490          */
 491         kstat_named_t arcstat_overhead_size;
 492         /*
 493          * Number of bytes consumed by internal ARC structures necessary
 494          * for tracking purposes; these structures are not actually
 495          * backed by ARC buffers. This includes arc_buf_hdr_t structures
 496          * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
 497          * caches), and arc_buf_t structures (allocated via arc_buf_t
 498          * cache).
 499          * Not updated directly; only synced in arc_kstat_update.
 500          */
 501         kstat_named_t arcstat_hdr_size;
 502         /*
 503          * Number of bytes consumed by ARC buffers of type equal to
 504          * ARC_BUFC_DATA. This is generally consumed by buffers backing
 505          * on disk user data (e.g. plain file contents).
 506          * Not updated directly; only synced in arc_kstat_update.
 507          */
 508         kstat_named_t arcstat_data_size;
 509         /*
 510          * Number of bytes consumed by ARC buffers of type equal to
 511          * ARC_BUFC_METADATA. This is generally consumed by buffers
 512          * backing on disk data that is used for internal ZFS
 513          * structures (e.g. ZAP, dnode, indirect blocks, etc).
 514          * Not updated directly; only synced in arc_kstat_update.
 515          */
 516         kstat_named_t arcstat_metadata_size;
 517         /*
 518          * Number of bytes consumed by various buffers and structures
 519          * not actually backed with ARC buffers. This includes bonus
 520          * buffers (allocated directly via zio_buf_* functions),
 521          * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
 522          * cache), and dnode_t structures (allocated via dnode_t cache).
 523          * Not updated directly; only synced in arc_kstat_update.
 524          */
 525         kstat_named_t arcstat_other_size;
 526         /*
 527          * Total number of bytes consumed by ARC buffers residing in the
 528          * arc_anon state. This includes *all* buffers in the arc_anon
 529          * state; e.g. data, metadata, evictable, and unevictable buffers
 530          * are all included in this value.
 531          * Not updated directly; only synced in arc_kstat_update.
 532          */
 533         kstat_named_t arcstat_anon_size;
 534         /*
 535          * Number of bytes consumed by ARC buffers that meet the
 536          * following criteria: backing buffers of type ARC_BUFC_DATA,
 537          * residing in the arc_anon state, and are eligible for eviction
 538          * (e.g. have no outstanding holds on the buffer).
 539          * Not updated directly; only synced in arc_kstat_update.
 540          */
 541         kstat_named_t arcstat_anon_evictable_data;
 542         /*
 543          * Number of bytes consumed by ARC buffers that meet the
 544          * following criteria: backing buffers of type ARC_BUFC_METADATA,
 545          * residing in the arc_anon state, and are eligible for eviction
 546          * (e.g. have no outstanding holds on the buffer).
 547          * Not updated directly; only synced in arc_kstat_update.
 548          */
 549         kstat_named_t arcstat_anon_evictable_metadata;
 550         /*
 551          * Total number of bytes consumed by ARC buffers residing in the
 552          * arc_mru state. This includes *all* buffers in the arc_mru
 553          * state; e.g. data, metadata, evictable, and unevictable buffers
 554          * are all included in this value.
 555          * Not updated directly; only synced in arc_kstat_update.
 556          */
 557         kstat_named_t arcstat_mru_size;
 558         /*
 559          * Number of bytes consumed by ARC buffers that meet the
 560          * following criteria: backing buffers of type ARC_BUFC_DATA,
 561          * residing in the arc_mru state, and are eligible for eviction
 562          * (e.g. have no outstanding holds on the buffer).
 563          * Not updated directly; only synced in arc_kstat_update.
 564          */
 565         kstat_named_t arcstat_mru_evictable_data;
 566         /*
 567          * Number of bytes consumed by ARC buffers that meet the
 568          * following criteria: backing buffers of type ARC_BUFC_METADATA,
 569          * residing in the arc_mru state, and are eligible for eviction
 570          * (e.g. have no outstanding holds on the buffer).
 571          * Not updated directly; only synced in arc_kstat_update.
 572          */
 573         kstat_named_t arcstat_mru_evictable_metadata;
 574         /*
 575          * Total number of bytes that *would have been* consumed by ARC
 576          * buffers in the arc_mru_ghost state. The key thing to note
 577          * here, is the fact that this size doesn't actually indicate
 578          * RAM consumption. The ghost lists only consist of headers and
 579          * don't actually have ARC buffers linked off of these headers.
 580          * Thus, *if* the headers had associated ARC buffers, these
 581          * buffers *would have* consumed this number of bytes.
 582          * Not updated directly; only synced in arc_kstat_update.
 583          */
 584         kstat_named_t arcstat_mru_ghost_size;
 585         /*
 586          * Number of bytes that *would have been* consumed by ARC
 587          * buffers that are eligible for eviction, of type
 588          * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
 589          * Not updated directly; only synced in arc_kstat_update.
 590          */
 591         kstat_named_t arcstat_mru_ghost_evictable_data;
 592         /*
 593          * Number of bytes that *would have been* consumed by ARC
 594          * buffers that are eligible for eviction, of type
 595          * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 596          * Not updated directly; only synced in arc_kstat_update.
 597          */
 598         kstat_named_t arcstat_mru_ghost_evictable_metadata;
 599         /*
 600          * Total number of bytes consumed by ARC buffers residing in the
 601          * arc_mfu state. This includes *all* buffers in the arc_mfu
 602          * state; e.g. data, metadata, evictable, and unevictable buffers
 603          * are all included in this value.
 604          * Not updated directly; only synced in arc_kstat_update.
 605          */
 606         kstat_named_t arcstat_mfu_size;
 607         /*
 608          * Number of bytes consumed by ARC buffers that are eligible for
 609          * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
 610          * state.
 611          * Not updated directly; only synced in arc_kstat_update.
 612          */
 613         kstat_named_t arcstat_mfu_evictable_data;
 614         /*
 615          * Number of bytes consumed by ARC buffers that are eligible for
 616          * eviction, of type ARC_BUFC_METADATA, and reside in the
 617          * arc_mfu state.
 618          * Not updated directly; only synced in arc_kstat_update.
 619          */
 620         kstat_named_t arcstat_mfu_evictable_metadata;
 621         /*
 622          * Total number of bytes that *would have been* consumed by ARC
 623          * buffers in the arc_mfu_ghost state. See the comment above
 624          * arcstat_mru_ghost_size for more details.
 625          * Not updated directly; only synced in arc_kstat_update.
 626          */
 627         kstat_named_t arcstat_mfu_ghost_size;
 628         /*
 629          * Number of bytes that *would have been* consumed by ARC
 630          * buffers that are eligible for eviction, of type
 631          * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
 632          * Not updated directly; only synced in arc_kstat_update.
 633          */
 634         kstat_named_t arcstat_mfu_ghost_evictable_data;
 635         /*
 636          * Number of bytes that *would have been* consumed by ARC
 637          * buffers that are eligible for eviction, of type
 638          * ARC_BUFC_METADATA, and linked off the arc_mfu_ghost state.
 639          * Not updated directly; only synced in arc_kstat_update.
 640          */
 641         kstat_named_t arcstat_mfu_ghost_evictable_metadata;
 642         kstat_named_t arcstat_l2_hits;

 643         kstat_named_t arcstat_l2_misses;
 644         kstat_named_t arcstat_l2_feeds;
 645         kstat_named_t arcstat_l2_rw_clash;
 646         kstat_named_t arcstat_l2_read_bytes;

 647         kstat_named_t arcstat_l2_write_bytes;

 648         kstat_named_t arcstat_l2_writes_sent;
 649         kstat_named_t arcstat_l2_writes_done;
 650         kstat_named_t arcstat_l2_writes_error;
 651         kstat_named_t arcstat_l2_writes_lock_retry;
 652         kstat_named_t arcstat_l2_evict_lock_retry;
 653         kstat_named_t arcstat_l2_evict_reading;
 654         kstat_named_t arcstat_l2_evict_l1cached;
 655         kstat_named_t arcstat_l2_free_on_write;
 656         kstat_named_t arcstat_l2_abort_lowmem;
 657         kstat_named_t arcstat_l2_cksum_bad;
 658         kstat_named_t arcstat_l2_io_error;
 659         kstat_named_t arcstat_l2_lsize;
 660         kstat_named_t arcstat_l2_psize;
 661         /* Not updated directly; only synced in arc_kstat_update. */
 662         kstat_named_t arcstat_l2_hdr_size;
 663         kstat_named_t arcstat_memory_throttle_count;
 664         /* Not updated directly; only synced in arc_kstat_update. */
 665         kstat_named_t arcstat_meta_used;
 666         kstat_named_t arcstat_meta_limit;
 667         kstat_named_t arcstat_meta_max;
 668         kstat_named_t arcstat_meta_min;

 669         kstat_named_t arcstat_sync_wait_for_async;
 670         kstat_named_t arcstat_demand_hit_predictive_prefetch;
 671 } arc_stats_t;
 672 
 673 static arc_stats_t arc_stats = {
 674         { "hits",                       KSTAT_DATA_UINT64 },

 675         { "misses",                     KSTAT_DATA_UINT64 },
 676         { "demand_data_hits",           KSTAT_DATA_UINT64 },
 677         { "demand_data_misses",         KSTAT_DATA_UINT64 },
 678         { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 679         { "demand_metadata_misses",     KSTAT_DATA_UINT64 },


 680         { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 681         { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 682         { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 683         { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },


 684         { "mru_hits",                   KSTAT_DATA_UINT64 },
 685         { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 686         { "mfu_hits",                   KSTAT_DATA_UINT64 },
 687         { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 688         { "deleted",                    KSTAT_DATA_UINT64 },
 689         { "mutex_miss",                 KSTAT_DATA_UINT64 },

 690         { "evict_skip",                 KSTAT_DATA_UINT64 },
 691         { "evict_not_enough",           KSTAT_DATA_UINT64 },
 692         { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 693         { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 694         { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 695         { "evict_l2_skip",              KSTAT_DATA_UINT64 },
 696         { "hash_elements",              KSTAT_DATA_UINT64 },
 697         { "hash_elements_max",          KSTAT_DATA_UINT64 },
 698         { "hash_collisions",            KSTAT_DATA_UINT64 },
 699         { "hash_chains",                KSTAT_DATA_UINT64 },
 700         { "hash_chain_max",             KSTAT_DATA_UINT64 },
 701         { "p",                          KSTAT_DATA_UINT64 },
 702         { "c",                          KSTAT_DATA_UINT64 },
 703         { "c_min",                      KSTAT_DATA_UINT64 },
 704         { "c_max",                      KSTAT_DATA_UINT64 },
 705         { "size",                       KSTAT_DATA_UINT64 },
 706         { "compressed_size",            KSTAT_DATA_UINT64 },
 707         { "uncompressed_size",          KSTAT_DATA_UINT64 },
 708         { "overhead_size",              KSTAT_DATA_UINT64 },
 709         { "hdr_size",                   KSTAT_DATA_UINT64 },
 710         { "data_size",                  KSTAT_DATA_UINT64 },
 711         { "metadata_size",              KSTAT_DATA_UINT64 },

 712         { "other_size",                 KSTAT_DATA_UINT64 },
 713         { "anon_size",                  KSTAT_DATA_UINT64 },
 714         { "anon_evictable_data",        KSTAT_DATA_UINT64 },
 715         { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },

 716         { "mru_size",                   KSTAT_DATA_UINT64 },
 717         { "mru_evictable_data",         KSTAT_DATA_UINT64 },
 718         { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },

 719         { "mru_ghost_size",             KSTAT_DATA_UINT64 },
 720         { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 721         { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },

 722         { "mfu_size",                   KSTAT_DATA_UINT64 },
 723         { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
 724         { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },

 725         { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
 726         { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 727         { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },

 728         { "l2_hits",                    KSTAT_DATA_UINT64 },

 729         { "l2_misses",                  KSTAT_DATA_UINT64 },
 730         { "l2_feeds",                   KSTAT_DATA_UINT64 },
 731         { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 732         { "l2_read_bytes",              KSTAT_DATA_UINT64 },

 733         { "l2_write_bytes",             KSTAT_DATA_UINT64 },

 734         { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 735         { "l2_writes_done",             KSTAT_DATA_UINT64 },
 736         { "l2_writes_error",            KSTAT_DATA_UINT64 },
 737         { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
 738         { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 739         { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 740         { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
 741         { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 742         { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 743         { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 744         { "l2_io_error",                KSTAT_DATA_UINT64 },
 745         { "l2_size",                    KSTAT_DATA_UINT64 },
 746         { "l2_asize",                   KSTAT_DATA_UINT64 },
 747         { "l2_hdr_size",                KSTAT_DATA_UINT64 },
 748         { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 749         { "arc_meta_used",              KSTAT_DATA_UINT64 },
 750         { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 751         { "arc_meta_max",               KSTAT_DATA_UINT64 },
 752         { "arc_meta_min",               KSTAT_DATA_UINT64 },

 753         { "sync_wait_for_async",        KSTAT_DATA_UINT64 },
 754         { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
 755 };
 756 
 757 #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 758 
 759 #define ARCSTAT_INCR(stat, val) \
 760         atomic_add_64(&arc_stats.stat.value.ui64, (val))
 761 
 762 #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
 763 #define ARCSTAT_BUMPDOWN(stat)  ARCSTAT_INCR(stat, -1)
 764 
 765 #define ARCSTAT_MAX(stat, val) {                                        \
 766         uint64_t m;                                                     \
 767         while ((val) > (m = arc_stats.stat.value.ui64) &&            \
 768             (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val))))     \
 769                 continue;                                               \
 770 }
 771 
 772 #define ARCSTAT_MAXSTAT(stat) \
 773         ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
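
The compare-and-swap loop in ARCSTAT_MAX can be tried outside the kernel. The following is a minimal userland sketch, assuming C11 <stdatomic.h> in place of the illumos atomic_cas_64(); the function name stat_max_update is hypothetical and does not appear in arc.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical userland analogue of ARCSTAT_MAX: raise *stat to val
 * unless the stored value is already at least as large.
 */
static void
stat_max_update(_Atomic uint64_t *stat, uint64_t val)
{
	uint64_t cur = atomic_load(stat);

	/*
	 * When the exchange fails, cur is reloaded with the value
	 * currently in *stat, so the loop re-checks val against it.
	 */
	while (val > cur &&
	    !atomic_compare_exchange_weak(stat, &cur, val))
		continue;
}
```

The loop exits either because the swap succeeded or because a concurrent writer already stored a value no smaller than val, which is the same termination condition as the macro above.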
 774 
 775 /*
 776  * We define a macro to allow ARC hits/misses to be easily broken down by
 777  * two separate conditions, giving a total of four different subtypes for
 778  * each of hits and misses (so eight statistics total).
 779  */
 780 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 781         if (cond1) {                                                    \
 782                 if (cond2) {                                            \
 783                         ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
 784                 } else {                                                \
 785                         ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
 786                 }                                                       \
 787         } else {                                                        \
 788                 if (cond2) {                                            \
 789                         ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
 790                 } else {                                                \
 791                         ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
 792                 }                                                       \
 793         }
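
To see how the token pasting in ARCSTAT_CONDSTAT selects one of four counters, here is a small userland sketch. The stat_* counters, the CONDSTAT macro, and record_hit() are hypothetical stand-ins; the argument order (demand/prefetch, then data/metadata) is an assumption about how callers typically use the macro. Unlike the original, this version is wrapped in do/while (0) so it behaves as a single statement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for four of the arcstat_* hit counters. */
static uint64_t stat_demand_data_hits;
static uint64_t stat_demand_metadata_hits;
static uint64_t stat_prefetch_data_hits;
static uint64_t stat_prefetch_metadata_hits;

/* Paste the two selected name fragments and the suffix into one counter. */
#define	CONDSTAT(cond1, s1, nots1, cond2, s2, nots2, stat)	\
	do {								\
		if (cond1) {						\
			if (cond2)					\
				stat_##s1##_##s2##_##stat++;		\
			else						\
				stat_##s1##_##nots2##_##stat++;		\
		} else {						\
			if (cond2)					\
				stat_##nots1##_##s2##_##stat++;		\
			else						\
				stat_##nots1##_##nots2##_##stat++;	\
		}							\
	} while (0)

/* Bump exactly one of the four counters for a single cache hit. */
static void
record_hit(int is_prefetch, int is_metadata)
{
	CONDSTAT(!is_prefetch, demand, prefetch,
	    !is_metadata, data, metadata, hits);
}
```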
 794 
 795 kstat_t                 *arc_ksp;
 796 static arc_state_t      *arc_anon;
 797 static arc_state_t      *arc_mru;
 798 static arc_state_t      *arc_mru_ghost;
 799 static arc_state_t      *arc_mfu;
 800 static arc_state_t      *arc_mfu_ghost;
 801 static arc_state_t      *arc_l2c_only;
 802 
 803 /*
 804  * There are several ARC variables that are critical to export as kstats --
 805  * but we don't want to have to grovel around in the kstat whenever we wish to
 806  * manipulate them.  For these variables, we therefore define them to be in
 807  * terms of the statistic variable.  This assures that we are not introducing
 808  * the possibility of inconsistency by having shadow copies of the variables,
 809  * while still allowing the code to be readable.
 810  */

 811 #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 812 #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 813 #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 814 #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 815 #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 816 #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */

 817 #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */


 818 
 819 /* compressed size of entire arc */
 820 #define arc_compressed_size     ARCSTAT(arcstat_compressed_size)
 821 /* uncompressed size of entire arc */
 822 #define arc_uncompressed_size   ARCSTAT(arcstat_uncompressed_size)
 823 /* number of bytes in the arc from arc_buf_t's */
 824 #define arc_overhead_size       ARCSTAT(arcstat_overhead_size)
 825 
 826 /*
 827  * There are also some ARC variables that we want to export, but that are
 828  * updated so often that having the canonical representation be the statistic
 829  * variable causes a performance bottleneck. We want to use aggsum_t's for these
 830  * instead, but still be able to export the kstat in the same way as before.
 831  * The solution is to always use the aggsum version, except in the kstat update
 832  * callback.
 833  */
 834 aggsum_t arc_size;
 835 aggsum_t arc_meta_used;
 836 aggsum_t astat_data_size;
 837 aggsum_t astat_metadata_size;
 838 aggsum_t astat_hdr_size;
 839 aggsum_t astat_other_size;
 840 aggsum_t astat_l2_hdr_size;
 841 
 842 static int              arc_no_grow;    /* Don't try to grow cache size */
 843 static uint64_t         arc_tempreserve;
 844 static uint64_t         arc_loaned_bytes;
 845 
 846 typedef struct arc_callback arc_callback_t;
 847 
 848 struct arc_callback {
 849         void                    *acb_private;
 850         arc_done_func_t         *acb_done;
 851         arc_buf_t               *acb_buf;
 852         boolean_t               acb_compressed;
 853         zio_t                   *acb_zio_dummy;
 854         arc_callback_t          *acb_next;
 855 };
 856 
 857 typedef struct arc_write_callback arc_write_callback_t;
 858 
 859 struct arc_write_callback {
 860         void            *awcb_private;

 881  *    | l2arc_buf_hdr_t        |          | l2arc_buf_hdr_t        |
 882  *    | (undefined if L1-only) |          |                        |
 883  *    +------------------------+          +------------------------+
 884  *    | l1arc_buf_hdr_t        |
 885  *    |                        |
 886  *    |                        |
 887  *    |                        |
 888  *    |                        |
 889  *    +------------------------+
 890  *
 891  * Because it's possible for the L2ARC to become extremely large, we can wind
 892  * up eating a lot of memory in L2ARC buffer headers, so the size of a header
 893  * is minimized by only allocating the fields necessary for an L1-cached buffer
 894  * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
 895  * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
 896  * words in pointers. arc_hdr_realloc() is used to switch a header between
 897  * these two allocation states.
 898  */
 899 typedef struct l1arc_buf_hdr {
 900         kmutex_t                b_freeze_lock;
 901         zio_cksum_t             *b_freeze_cksum;
 902 #ifdef ZFS_DEBUG
 903         /*
 904          * Used for debugging with kmem_flags - by allocating and freeing
 905          * b_thawed when the buffer is thawed, we get a record of the stack
 906          * trace that thawed it.
 907          */
 908         void                    *b_thawed;
 909 #endif
 910 
 911         arc_buf_t               *b_buf;
 912         uint32_t                b_bufcnt;
 913         /* for waiting on writes to complete */
 914         kcondvar_t              b_cv;
 915         uint8_t                 b_byteswap;
 916 
 917         /* protected by arc state mutex */
 918         arc_state_t             *b_state;
 919         multilist_node_t        b_arc_node;
 920 
 921         /* updated atomically */
 922         clock_t                 b_arc_access;
 923 
 924         /* self protecting */
 925         refcount_t              b_refcnt;
 926 
 927         arc_callback_t          *b_acb;
 928         abd_t                   *b_pabd;
 929 } l1arc_buf_hdr_t;
 930 
 931 typedef struct l2arc_dev l2arc_dev_t;
 932 
 933 typedef struct l2arc_buf_hdr {
 934         /* protected by arc_buf_hdr mutex */
 935         l2arc_dev_t             *b_dev;         /* L2ARC device */
 936         uint64_t                b_daddr;        /* disk address, offset byte */
 937 
 938         list_node_t             b_l2node;
 939 } l2arc_buf_hdr_t;
 940 
 941 struct arc_buf_hdr {
 942         /* protected by hash lock */
 943         dva_t                   b_dva;
 944         uint64_t                b_birth;
 945 
 946         arc_buf_contents_t      b_type;
 947         arc_buf_hdr_t           *b_hash_next;
 948         arc_flags_t             b_flags;
 949 
 950         /*
 951          * This field stores the size of the data buffer after
 952          * compression, and is set in the arc's zio completion handlers.
 953          * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
 954          *
 955          * While the block pointers can store up to 32MB in their psize
 956          * field, we can only store up to 32MB minus 512B. This is due
 957          * to the bp using a bias of 1, whereas we use a bias of 0 (i.e.
 958          * a field of zeros represents 512B in the bp). We can't use a
 959          * bias of 1 since we need to reserve a psize of zero, here, to
 960          * represent holes and embedded blocks.
 961          *
 962          * This isn't a problem in practice, since the maximum size of a
 963          * buffer is limited to 16MB, so we never need to store 32MB in
 964          * this field. Even in the upstream illumos code base, the
 965          * maximum size of a buffer is limited to 16MB.

 983 #define GHOST_STATE(state)      \
 984         ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
 985         (state) == arc_l2c_only)
 986 
 987 #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
 988 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
 989 #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
 990 #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_FLAG_PREFETCH)
 991 #define HDR_COMPRESSION_ENABLED(hdr)    \
 992         ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
 993 
 994 #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
 995 #define HDR_L2_READING(hdr)     \
 996         (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&   \
 997         ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
 998 #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
 999 #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
1000 #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
1001 #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
1002 


1003 #define HDR_ISTYPE_METADATA(hdr)        \
1004         ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
1005 #define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr))

1006 
1007 #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
1008 #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
1009 
1010 /* For storing compression mode in b_flags */
1011 #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
1012 
1013 #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
1014         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
1015 #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
1016         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp))
1017 
1018 #define ARC_BUF_LAST(buf)       ((buf)->b_next == NULL)
1019 #define ARC_BUF_SHARED(buf)     ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
1020 #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
1021 
1022 /*
1023  * Other sizes
1024  */
1025 
1026 #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
1027 #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
1028 
1029 /*
1030  * Hash table routines
1031  */
1032 
1033 #define HT_LOCK_PAD     64
1034 
1035 struct ht_lock {
1036         kmutex_t        ht_lock;
1037 #ifdef _KERNEL
1038         unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
1039 #endif
1040 };
1041 
1042 #define BUF_LOCKS 256
1043 typedef struct buf_hash_table {
1044         uint64_t ht_mask;
1045         arc_buf_hdr_t **ht_table;
1046         struct ht_lock ht_locks[BUF_LOCKS];
1047 } buf_hash_table_t;
1048 

1049 static buf_hash_table_t buf_hash_table;
1050 
1051 #define BUF_HASH_INDEX(spa, dva, birth) \
1052         (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1053 #define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
1054 #define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
1055 #define HDR_LOCK(hdr) \
1056         (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
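
The macros above map a potentially very large number of hash buckets onto a fixed array of BUF_LOCKS mutexes by masking the bucket index. A minimal sketch of that mapping follows; NLOCKS and lock_stripe() are hypothetical names, and the mask only works because the lock count is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define	NLOCKS	256	/* must be a power of two, like BUF_LOCKS */

/* Hypothetical helper: map a hash-bucket index to its lock stripe. */
static inline uint64_t
lock_stripe(uint64_t bucket_idx)
{
	return (bucket_idx & (NLOCKS - 1));
}
```

Buckets whose indices differ by a multiple of NLOCKS share a mutex, so two unrelated headers can contend on the same lock even when they live in different hash chains.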
1057 
1058 uint64_t zfs_crc64_table[256];
1059 
1060 /*
1061  * Level 2 ARC
1062  */
1063 
1064 #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
1065 #define L2ARC_HEADROOM          2                       /* num of writes */
1066 /*
1067  * If we discover during ARC scan any buffers to be compressed, we boost
1068  * our headroom for the next scanning cycle by this percentage multiple.
1069  */
1070 #define L2ARC_HEADROOM_BOOST    200
1071 #define L2ARC_FEED_SECS         1               /* caching interval secs */
1072 #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
1073 
1074 #define l2arc_writes_sent       ARCSTAT(arcstat_l2_writes_sent)
1075 #define l2arc_writes_done       ARCSTAT(arcstat_l2_writes_done)
1076 
1077 /* L2ARC Performance Tunables */
1078 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
1079 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
1080 uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
1081 uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1082 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
1083 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1084 boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
1085 boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
1086 boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
1087 
1088 /*
1089  * L2ARC Internals
1090  */
1091 struct l2arc_dev {
1092         vdev_t                  *l2ad_vdev;     /* vdev */
1093         spa_t                   *l2ad_spa;      /* spa */
1094         uint64_t                l2ad_hand;      /* next write location */
1095         uint64_t                l2ad_start;     /* first addr on device */
1096         uint64_t                l2ad_end;       /* last addr on device */
1097         boolean_t               l2ad_first;     /* first sweep through */
1098         boolean_t               l2ad_writing;   /* currently writing */
1099         kmutex_t                l2ad_mtx;       /* lock for buffer list */
1100         list_t                  l2ad_buflist;   /* buffer list */
1101         list_node_t             l2ad_node;      /* device list node */
1102         refcount_t              l2ad_alloc;     /* allocated bytes */
1103 };
1104 
1105 static list_t L2ARC_dev_list;                   /* device list */
1106 static list_t *l2arc_dev_list;                  /* device list pointer */
1107 static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
1108 static l2arc_dev_t *l2arc_dev_last;             /* last device used */

1109 static list_t L2ARC_free_on_write;              /* free after write buf list */
1110 static list_t *l2arc_free_on_write;             /* free after write list ptr */
1111 static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
1112 static uint64_t l2arc_ndev;                     /* number of devices */
1113 
1114 typedef struct l2arc_read_callback {
1115         arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
1116         blkptr_t                l2rcb_bp;               /* original blkptr */
1117         zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
1118         int                     l2rcb_flags;            /* original flags */
1119         abd_t                   *l2rcb_abd;             /* temporary buffer */
1120 } l2arc_read_callback_t;
1121 
1122 typedef struct l2arc_write_callback {
1123         l2arc_dev_t     *l2wcb_dev;             /* device info */
1124         arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */

1125 } l2arc_write_callback_t;
1126 
1127 typedef struct l2arc_data_free {
1128         /* protected by l2arc_free_on_write_mtx */
1129         abd_t           *l2df_abd;
1130         size_t          l2df_size;
1131         arc_buf_contents_t l2df_type;
1132         list_node_t     l2df_list_node;
1133 } l2arc_data_free_t;
1134 
1135 static kmutex_t l2arc_feed_thr_lock;
1136 static kcondvar_t l2arc_feed_thr_cv;
1137 static uint8_t l2arc_thread_exit;
1138 
1139 static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *);
1140 static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1141 static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1142 static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1143 static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1144 static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1145 static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1146 static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1147 static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1148 static boolean_t arc_is_overflowing(void);
1149 static void arc_buf_watch(arc_buf_t *);

1150 
1151 static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1152 static uint32_t arc_bufc_to_flags(arc_buf_contents_t);

1153 static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1154 static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1155 
1156 static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1157 static void l2arc_read_done(zio_t *);
1158 
1159 
1160 /*
1161  * We use Cityhash for this. It's fast, and has good hash properties without
1162  * requiring any large static buffers.
1163  */
1164 static uint64_t
1165 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1166 {
1167         return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
1168 }
1169 
1170 #define HDR_EMPTY(hdr)                                          \
1171         ((hdr)->b_dva.dva_word[0] == 0 &&                    \
1172         (hdr)->b_dva.dva_word[1] == 0)
1173 
1174 #define HDR_EQUAL(spa, dva, birth, hdr)                         \
1175         (((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&      \
1176         ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&       \
1177         ((hdr)->b_birth == (birth)) && ((hdr)->b_spa == (spa)))
1178 
1179 static void
1180 buf_discard_identity(arc_buf_hdr_t *hdr)
1181 {
1182         hdr->b_dva.dva_word[0] = 0;
1183         hdr->b_dva.dva_word[1] = 0;
1184         hdr->b_birth = 0;
1185 }
1186 
1187 static arc_buf_hdr_t *
1188 buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1189 {
1190         const dva_t *dva = BP_IDENTITY(bp);
1191         uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1192         uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1193         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1194         arc_buf_hdr_t *hdr;
1195 
1196         mutex_enter(hash_lock);
1197         for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
1198             hdr = hdr->b_hash_next) {
1199                 if (HDR_EQUAL(spa, dva, birth, hdr)) {
1200                         *lockp = hash_lock;
1201                         return (hdr);
1202                 }
1203         }
1204         mutex_exit(hash_lock);
1205         *lockp = NULL;
1206         return (NULL);
1207 }
1208 
1209 /*
1210  * Insert an entry into the hash table.  If there is already an element
1211  * equal to hdr in the hash table, then the already existing element
1212  * will be returned and the new element will not be inserted.
1213  * Otherwise returns NULL.
1214  * If lockp == NULL, the caller is assumed to already hold the hash lock.
1215  */
1216 static arc_buf_hdr_t *
1217 buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
1218 {
1219         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1220         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1221         arc_buf_hdr_t *fhdr;
1222         uint32_t i;
1223 
1224         ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
1225         ASSERT(hdr->b_birth != 0);
1226         ASSERT(!HDR_IN_HASH_TABLE(hdr));
1227 
1228         if (lockp != NULL) {
1229                 *lockp = hash_lock;
1230                 mutex_enter(hash_lock);
1231         } else {
1232                 ASSERT(MUTEX_HELD(hash_lock));
1233         }
1234 
1235         for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
1236             fhdr = fhdr->b_hash_next, i++) {
1237                 if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1238                         return (fhdr);
1239         }
1240 
1241         hdr->b_hash_next = buf_hash_table.ht_table[idx];
1242         buf_hash_table.ht_table[idx] = hdr;
1243         arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1244 
1245         /* collect some hash table performance data */
1246         if (i > 0) {
1247                 ARCSTAT_BUMP(arcstat_hash_collisions);
1248                 if (i == 1)
1249                         ARCSTAT_BUMP(arcstat_hash_chains);
1250 
1251                 ARCSTAT_MAX(arcstat_hash_chain_max, i);
1252         }
1253 
1254         ARCSTAT_BUMP(arcstat_hash_elements);
1255         ARCSTAT_MAXSTAT(arcstat_hash_elements);
1256 
1257         return (NULL);
1258 }
1259 
1260 static void
1261 buf_hash_remove(arc_buf_hdr_t *hdr)
1262 {
1263         arc_buf_hdr_t *fhdr, **hdrp;
1264         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1265 
1266         ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1267         ASSERT(HDR_IN_HASH_TABLE(hdr));
1268 
1269         hdrp = &buf_hash_table.ht_table[idx];
1270         while ((fhdr = *hdrp) != hdr) {
1271                 ASSERT3P(fhdr, !=, NULL);
1272                 hdrp = &fhdr->b_hash_next;
1273         }
1274         *hdrp = hdr->b_hash_next;
1275         hdr->b_hash_next = NULL;
1276         arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1277 
1278         /* collect some hash table performance data */
1279         ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1280 
1281         if (buf_hash_table.ht_table[idx] &&
1282             buf_hash_table.ht_table[idx]->b_hash_next == NULL)
1283                 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1284 }
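
buf_hash_remove() unlinks a header by walking a pointer to the "next" slot rather than tracking a previous node. The sketch below shows the same idiom on a generic singly linked chain; node_t and chain_remove() are hypothetical and only illustrate the technique, not the ARC data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
	int		n_key;
	struct node	*n_next;
} node_t;

/* Unlink target from the chain rooted at *headp; target must be present. */
static void
chain_remove(node_t **headp, node_t *target)
{
	node_t **pp = headp;

	/* Advance pp until it is the slot that points at target. */
	while (*pp != target) {
		assert(*pp != NULL);
		pp = &(*pp)->n_next;
	}
	*pp = target->n_next;
	target->n_next = NULL;
}
```

Because pp always addresses the slot to be rewritten, removing the first element needs no special case: the slot is simply the chain head itself.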
1285 
1286 /*
1287  * Global data structures and functions for the buf kmem cache.
1288  */
1289 static kmem_cache_t *hdr_full_cache;
1290 static kmem_cache_t *hdr_l2only_cache;
1291 static kmem_cache_t *buf_cache;
1292 
1293 static void
1294 buf_fini(void)
1295 {
1296         int i;
1297 


1298         kmem_free(buf_hash_table.ht_table,
1299             (buf_hash_table.ht_mask + 1) * sizeof (void *));
1300         for (i = 0; i < BUF_LOCKS; i++)
1301                 mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
1302         kmem_cache_destroy(hdr_full_cache);
1303         kmem_cache_destroy(hdr_l2only_cache);
1304         kmem_cache_destroy(buf_cache);
1305 }
1306 
1307 /*
1308  * Constructor callback - called when the cache is empty
1309  * and a new buf is requested.
1310  */
1311 /* ARGSUSED */
1312 static int
1313 hdr_full_cons(void *vbuf, void *unused, int kmflag)
1314 {
1315         arc_buf_hdr_t *hdr = vbuf;
1316 
1317         bzero(hdr, HDR_FULL_SIZE);
1318         cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
1319         refcount_create(&hdr->b_l1hdr.b_refcnt);
1320         mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
1321         multilist_link_init(&hdr->b_l1hdr.b_arc_node);

1404 }
1405 
1406 static void
1407 buf_init(void)
1408 {
1409         uint64_t *ct;
1410         uint64_t hsize = 1ULL << 12;
1411         int i, j;
1412 
1413         /*
1414          * The hash table is big enough to fill all of physical memory
1415          * with an average block size of zfs_arc_average_blocksize (default 8K).
1416          * By default, the table will take up
1417          * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1418          */
1419         while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1420                 hsize <<= 1;
1421 retry:
1422         buf_hash_table.ht_mask = hsize - 1;
1423         buf_hash_table.ht_table =
1424             kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
1425         if (buf_hash_table.ht_table == NULL) {
1426                 ASSERT(hsize > (1ULL << 8));
1427                 hsize >>= 1;
1428                 goto retry;
1429         }
1430 
1431         hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1432             0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1433         hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1434             HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1435             NULL, NULL, 0);
1436         buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1437             0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1438 
1439         for (i = 0; i < 256; i++)
1440                 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1441                         *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1442 
1443         for (i = 0; i < BUF_LOCKS; i++) {
1444                 mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
1445                     NULL, MUTEX_DEFAULT, NULL);
1446         }
1447 }
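
The sizing comment in buf_init() can be checked with a small standalone sketch. The helper names below (sketch_hash_entries, sketch_hash_table_bytes) are hypothetical and not part of arc.c; the sketch assumes the caller passes physical memory in bytes rather than physmem * PAGESIZE, and it ignores the allocation-failure retry path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in arc.c): start at 2^12 entries and double
 * until entries * avg_blocksize covers all of physical memory, mirroring
 * the while loop in buf_init().
 */
static uint64_t
sketch_hash_entries(uint64_t phys_bytes, uint64_t avg_blocksize)
{
	uint64_t hsize = 1ULL << 12;

	assert(avg_blocksize > 0);
	while (hsize * avg_blocksize < phys_bytes)
		hsize <<= 1;
	return (hsize);
}

/*
 * Hypothetical helper: bytes taken by the bucket array, one pointer
 * per entry. ptr_size stands in for sizeof (void *).
 */
static uint64_t
sketch_hash_table_bytes(uint64_t phys_bytes, uint64_t avg_blocksize,
    uint64_t ptr_size)
{
	return (sketch_hash_entries(phys_bytes, avg_blocksize) * ptr_size);
}
```

With 1 GiB of memory, an 8K average block size and 8-byte pointers, the sketch yields 2^17 entries and a 1 MiB table, which agrees with the "1MB per GB" figure in the comment. Because the entry count is always a power of two, ht_mask can be hsize - 1.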
1448 
1449 /*
1450  * This is the size that the buf occupies in memory. If the buf is compressed,
1451  * it will correspond to the compressed size. You should use this method of
1452  * getting the buf size unless you explicitly need the logical size.
1453  */
1454 int32_t
1455 arc_buf_size(arc_buf_t *buf)
1456 {
1457         return (ARC_BUF_COMPRESSED(buf) ?
1458             HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
1459 }
1460 
1461 int32_t
1462 arc_buf_lsize(arc_buf_t *buf)
1463 {
1464         return (HDR_GET_LSIZE(buf->b_hdr));
1465 }
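
The difference between arc_buf_size() and arc_buf_lsize() can be illustrated with a simplified model. The sketch_* types below are hypothetical stand-ins for arc_buf_t and arc_buf_hdr_t; the real code reads the sizes through the HDR_GET_PSIZE and HDR_GET_LSIZE macros and the compressed bit through ARC_BUF_COMPRESSED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for arc_buf_hdr_t. */
typedef struct sketch_hdr {
	int32_t psize;	/* physical (possibly compressed) size */
	int32_t lsize;	/* logical (uncompressed) size */
} sketch_hdr_t;

/* Hypothetical, simplified stand-in for arc_buf_t. */
typedef struct sketch_buf {
	const sketch_hdr_t *hdr;
	bool compressed;
} sketch_buf_t;

/* In-memory size: psize for a compressed buf, lsize otherwise. */
static int32_t
sketch_buf_size(const sketch_buf_t *buf)
{
	return (buf->compressed ? buf->hdr->psize : buf->hdr->lsize);
}

/* Logical size, regardless of how the buf is stored. */
static int32_t
sketch_buf_lsize(const sketch_buf_t *buf)
{
	return (buf->hdr->lsize);
}
```

For a hdr with psize 4096 and lsize 16384, a compressed buf reports 4096 from sketch_buf_size() while an uncompressed buf on the same hdr reports 16384; both report 16384 from sketch_buf_lsize().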
1466 
1467 enum zio_compress
1468 arc_get_compression(arc_buf_t *buf)

1484         IMPLY(shared, ARC_BUF_SHARED(buf));
1485         IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
1486 
1487         /*
1488          * It would be nice to assert arc_can_share() too, but the "hdr isn't
1489          * already being shared" requirement prevents us from doing that.
1490          */
1491 
1492         return (shared);
1493 }
1494 
1495 /*
1496  * Free the checksum associated with this header. If there is no checksum, this
1497  * is a no-op.
1498  */
1499 static inline void
1500 arc_cksum_free(arc_buf_hdr_t *hdr)
1501 {
1502         ASSERT(HDR_HAS_L1HDR(hdr));
1503         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1504         if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1505                 kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
1506                 hdr->b_l1hdr.b_freeze_cksum = NULL;
1507         }
1508         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1509 }
1510 
1511 /*
1512  * Return true iff at least one of the bufs on hdr is not compressed.
1513  */
1514 static boolean_t
1515 arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1516 {
1517         for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
1518                 if (!ARC_BUF_COMPRESSED(b)) {
1519                         return (B_TRUE);
1520                 }
1521         }
1522         return (B_FALSE);
1523 }
1524 
1525 /*
1526  * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
1527  * matches the checksum that is stored in the hdr. If there is no checksum,
1528  * or if the buf is compressed, this is a no-op.
1529  */
1530 static void
1531 arc_cksum_verify(arc_buf_t *buf)
1532 {
1533         arc_buf_hdr_t *hdr = buf->b_hdr;
1534         zio_cksum_t zc;
1535 
1536         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1537                 return;
1538 
1539         if (ARC_BUF_COMPRESSED(buf)) {
1540                 ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1541                     arc_hdr_has_uncompressed_buf(hdr));
1542                 return;
1543         }
1544 
1545         ASSERT(HDR_HAS_L1HDR(hdr));
1546 
1547         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1548         if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1549                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1550                 return;
1551         }
1552 
1553         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1554         if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
1555                 panic("buffer modified while frozen!");
1556         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1557 }
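
The freeze/verify idea behind arc_cksum_compute() and arc_cksum_verify() can be sketched outside the kernel. Everything named sketch_* below is hypothetical: the checksum is a Fletcher-2 style sum written for this example (not the kernel's fletcher_2_native), the b_freeze_lock locking is omitted, and a mismatch is reported through a return value instead of panic().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct sketch_cksum {
	uint64_t word[4];
} sketch_cksum_t;

/* Hypothetical stand-in for the hdr's b_freeze_cksum slot. */
typedef struct sketch_frozen {
	bool has_cksum;
	sketch_cksum_t cksum;
} sketch_frozen_t;

/* Fletcher-2 style sum over pairs of 64-bit words; size % 16 == 0. */
static void
sketch_fletcher2(const void *data, size_t size, sketch_cksum_t *out)
{
	const unsigned char *p = data;
	uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

	assert(size % 16 == 0);
	for (size_t off = 0; off < size; off += 16) {
		uint64_t w[2];

		memcpy(w, p + off, sizeof (w));
		a0 += w[0];
		a1 += w[1];
		b0 += a0;
		b1 += a1;
	}
	out->word[0] = a0;
	out->word[1] = a1;
	out->word[2] = b0;
	out->word[3] = b1;
}

/* Record a checksum unless one is already attached (a no-op then). */
static void
sketch_freeze(sketch_frozen_t *f, const void *data, size_t size)
{
	if (f->has_cksum)
		return;
	sketch_fletcher2(data, size, &f->cksum);
	f->has_cksum = true;
}

/* Returns false if the data changed while frozen; no checksum passes. */
static bool
sketch_verify(const sketch_frozen_t *f, const void *data, size_t size)
{
	sketch_cksum_t now;

	if (!f->has_cksum)
		return (true);
	sketch_fletcher2(data, size, &now);
	return (memcmp(&now, &f->cksum, sizeof (now)) == 0);
}

/* Thawing drops the checksum so the data may be modified again. */
static void
sketch_thaw(sketch_frozen_t *f)
{
	f->has_cksum = false;
}
```

A caller would freeze a buffer, verify it before each later use, and thaw it before any intentional modification; a write that happens between freeze and verify makes sketch_verify() return false.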
1558 
1559 static boolean_t
1560 arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1561 {
1562         enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1563         boolean_t valid_cksum;
1564 
1565         ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
1566         VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
1567 
1568         /*
1569          * We rely on the blkptr's checksum to determine if the block
1570          * is valid or not. When compressed arc is enabled, the l2arc
1571          * writes the block to the l2arc just as it appears in the pool.
1572          * This allows us to use the blkptr's checksum to validate the
1573          * data that we just read off of the l2arc without having to store
1574          * a separate checksum in the arc_buf_hdr_t. However, if compressed
1575          * arc is disabled, then the data written to the l2arc is always
1576          * uncompressed and won't match the block as it exists in the main
1577          * pool. When this is the case, we must first compress it if it is
1578          * compressed on the main pool before we can validate the checksum.
1579          */
1580         if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
1581                 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1582                 uint64_t lsize = HDR_GET_LSIZE(hdr);
1583                 uint64_t csize;
1584 
1585                 abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
1586                 csize = zio_compress_data(compress, zio->io_abd,
1587                     abd_to_buf(cdata), lsize);

1588 
1589                 ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
1590                 if (csize < HDR_GET_PSIZE(hdr)) {
1591                         /*
1592                          * Compressed blocks are always a multiple of the
1593                          * smallest ashift in the pool. Ideally, we would
1594                          * like to round up the csize to the next
1595                          * spa_min_ashift but that value may have changed
1596                          * since the block was last written. Instead,
1597                          * we rely on the fact that the hdr's psize
1598                          * was set to the psize of the block when it was
1599                          * last written. We set the csize to that value
1600                          * and zero out any part that should not contain
1601                          * data.
1602                          */
1603                         abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize);
1604                         csize = HDR_GET_PSIZE(hdr);
1605                 }
1606                 zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL);
1607         }

1626         return (valid_cksum);
1627 }
1628 
1629 /*
1630  * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
1631  * checksum and attaches it to the buf's hdr so that we can ensure that the buf
1632  * isn't modified later on. If buf is compressed or there is already a checksum
1633  * on the hdr, this is a no-op (we only checksum uncompressed bufs).
1634  */
1635 static void
1636 arc_cksum_compute(arc_buf_t *buf)
1637 {
1638         arc_buf_hdr_t *hdr = buf->b_hdr;
1639 
1640         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1641                 return;
1642 
1643         ASSERT(HDR_HAS_L1HDR(hdr));
1644 
1645         mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
1646         if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1647                 ASSERT(arc_hdr_has_uncompressed_buf(hdr));
1648                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1649                 return;
1650         } else if (ARC_BUF_COMPRESSED(buf)) {
1651                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1652                 return;
1653         }
1654 
1655         ASSERT(!ARC_BUF_COMPRESSED(buf));
1656         hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
1657             KM_SLEEP);
1658         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
1659             hdr->b_l1hdr.b_freeze_cksum);
1660         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1661         arc_buf_watch(buf);
1662 }
1663 
1664 #ifndef _KERNEL
1665 typedef struct procctl {
1666         long cmd;
1667         prwatch_t prwatch;
1668 } procctl_t;
1669 #endif
1670 
1671 /* ARGSUSED */
1672 static void
1673 arc_buf_unwatch(arc_buf_t *buf)
1674 {
1675 #ifndef _KERNEL
1676         if (arc_watch) {
1677                 int result;
1678                 procctl_t ctl;
1679                 ctl.cmd = PCWATCH;

1691 arc_buf_watch(arc_buf_t *buf)
1692 {
1693 #ifndef _KERNEL
1694         if (arc_watch) {
1695                 int result;
1696                 procctl_t ctl;
1697                 ctl.cmd = PCWATCH;
1698                 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1699                 ctl.prwatch.pr_size = arc_buf_size(buf);
1700                 ctl.prwatch.pr_wflags = WA_WRITE;
1701                 result = write(arc_procfd, &ctl, sizeof (ctl));
1702                 ASSERT3U(result, ==, sizeof (ctl));
1703         }
1704 #endif
1705 }
1706 
1707 static arc_buf_contents_t
1708 arc_buf_type(arc_buf_hdr_t *hdr)
1709 {
1710         arc_buf_contents_t type;

1711         if (HDR_ISTYPE_METADATA(hdr)) {
1712                 type = ARC_BUFC_METADATA;
1713         } else {
1714                 type = ARC_BUFC_DATA;
1715         }
1716         VERIFY3U(hdr->b_type, ==, type);
1717         return (type);
1718 }
1719 
1720 boolean_t
1721 arc_is_metadata(arc_buf_t *buf)
1722 {
1723         return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
1724 }
1725 
1726 static uint32_t
1727 arc_bufc_to_flags(arc_buf_contents_t type)
1728 {
1729         switch (type) {
1730         case ARC_BUFC_DATA:
1731                 /* metadata field is 0 if buffer contains normal data */
1732                 return (0);
1733         case ARC_BUFC_METADATA:
1734                 return (ARC_FLAG_BUFC_METADATA);
1735         default:
1736                 break;
1737         }
1738         panic("undefined ARC buffer type!");
1739         return ((uint32_t)-1);
1740 }
1741 
1742 void
1743 arc_buf_thaw(arc_buf_t *buf)
1744 {
1745         arc_buf_hdr_t *hdr = buf->b_hdr;
1746 
1747         ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
1748         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1749 
1750         arc_cksum_verify(buf);
1751 
1752         /*
1753          * Compressed buffers do not manipulate the b_freeze_cksum or
1754          * allocate b_thawed.
1755          */
1756         if (ARC_BUF_COMPRESSED(buf)) {
1757                 ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1758                     arc_hdr_has_uncompressed_buf(hdr));
1759                 return;
1760         }
1761 
1762         ASSERT(HDR_HAS_L1HDR(hdr));
1763         arc_cksum_free(hdr);
1764 
1765         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1766 #ifdef ZFS_DEBUG
1767         if (zfs_flags & ZFS_DEBUG_MODIFY) {
1768                 if (hdr->b_l1hdr.b_thawed != NULL)
1769                         kmem_free(hdr->b_l1hdr.b_thawed, 1);
1770                 hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
1771         }
1772 #endif
1773 
1774         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1775 
1776         arc_buf_unwatch(buf);
1777 }
1778 
1779 void
1780 arc_buf_freeze(arc_buf_t *buf)
1781 {
1782         arc_buf_hdr_t *hdr = buf->b_hdr;
1783         kmutex_t *hash_lock;
1784 
1785         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1786                 return;
1787 
1788         if (ARC_BUF_COMPRESSED(buf)) {
1789                 ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1790                     arc_hdr_has_uncompressed_buf(hdr));
1791                 return;
1792         }
1793 
1794         hash_lock = HDR_LOCK(hdr);
1795         mutex_enter(hash_lock);
1796 
1797         ASSERT(HDR_HAS_L1HDR(hdr));
1798         ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
1799             hdr->b_l1hdr.b_state == arc_anon);
1800         arc_cksum_compute(buf);
1801         mutex_exit(hash_lock);
1802 }
1803 
1804 /*
1805  * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
1806  * the following functions should be used to ensure that the flags are
1807  * updated in a thread-safe way. When manipulating the flags either
1808  * the hash_lock must be held or the hdr must be undiscoverable. This
1809  * ensures that we're not racing with any other threads when updating
1810  * the flags.
1811  */
1812 static inline void
1813 arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
1814 {
1815         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1816         hdr->b_flags |= flags;
1817 }
1818

1870         ASSERT(!ARC_BUF_COMPRESSED(buf));
1871 
1872         for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
1873             from = from->b_next) {
1874                 /* can't use our own data buffer */
1875                 if (from == buf) {
1876                         continue;
1877                 }
1878 
1879                 if (!ARC_BUF_COMPRESSED(from)) {
1880                         bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
1881                         copied = B_TRUE;
1882                         break;
1883                 }
1884         }
1885 
1886         /*
1887          * There were no decompressed bufs, so there should not be a
1888          * checksum on the hdr either.
1889          */
1890         EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
1891 
1892         return (copied);
1893 }
1894 
1895 /*
1896  * Given a buf that has a data buffer attached to it, this function will
1897  * efficiently fill the buf with data of the specified compression setting from
1898  * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
1899  * are already sharing a data buf, no copy is performed.
1900  *
1901  * If the buf is marked as compressed but uncompressed data was requested, this
1902  * will allocate a new data buffer for the buf, remove that flag, and fill the
1903  * buf with uncompressed data. You can't request a compressed buf on a hdr with
1904  * uncompressed data, and (since we haven't added support for it yet) if you
1905  * want compressed data your buf must already be marked as compressed and have
1906  * the correct-sized data buffer.
1907  */
1908 static int
1909 arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
1910 {

1949                             arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
1950 
1951                         /* We increased the size of b_data; update overhead */
1952                         ARCSTAT_INCR(arcstat_overhead_size,
1953                             HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
1954                 }
1955 
1956                 /*
1957                  * Regardless of the buf's previous compression settings, it
1958                  * should not be compressed at the end of this function.
1959                  */
1960                 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
1961 
1962                 /*
1963                  * Try copying the data from another buf which already has a
1964                  * decompressed version. If that's not possible, it's time to
1965                  * bite the bullet and decompress the data from the hdr.
1966                  */
1967                 if (arc_buf_try_copy_decompressed_data(buf)) {
1968                         /* Skip byteswapping and checksumming (already done) */
1969                         ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
1970                         return (0);
1971                 } else {
1972                         int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
1973                             hdr->b_l1hdr.b_pabd, buf->b_data,
1974                             HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1975 
1976                         /*
1977                          * Absent hardware errors or software bugs, this should
1978                          * be impossible, but log it anyway so we can debug it.
1979                          */
1980                         if (error != 0) {
1981                                 zfs_dbgmsg(
1982                                     "hdr %p, compress %d, psize %d, lsize %d",
1983                                     hdr, HDR_GET_COMPRESS(hdr),
1984                                     HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1985                                 return (SET_ERROR(EIO));
1986                         }
1987                 }
1988         }
1989

2212 
2213                         /*
2214                          * An L1 header always exists here, since if we're
2215                          * moving to some L1-cached state (i.e. not l2c_only or
2216                          * anonymous), we realloc the header to add an L1hdr
2217                          * beforehand.
2218                          */
2219                         ASSERT(HDR_HAS_L1HDR(hdr));
2220                         multilist_insert(new_state->arcs_list[buftype], hdr);
2221 
2222                         if (GHOST_STATE(new_state)) {
2223                                 ASSERT0(bufcnt);
2224                                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2225                                 update_new = B_TRUE;
2226                         }
2227                         arc_evictable_space_increment(hdr, new_state);
2228                 }
2229         }
2230 
2231         ASSERT(!HDR_EMPTY(hdr));
2232         if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))

2233                 buf_hash_remove(hdr);

2234 
2235         /* adjust state sizes (ignore arc_l2c_only) */
2236 
2237         if (update_new && new_state != arc_l2c_only) {
2238                 ASSERT(HDR_HAS_L1HDR(hdr));
2239                 if (GHOST_STATE(new_state)) {
2240                         ASSERT0(bufcnt);
2241 
2242                         /*
2243                          * When moving a header to a ghost state, we first
2244                          * remove all arc buffers. Thus, we'll have a
2245                          * bufcnt of zero, and no arc buffer to use for
2246                          * the reference. As a result, we use the arc
2247                          * header pointer for the reference.
2248                          */
2249                         (void) refcount_add_many(&new_state->arcs_size,
2250                             HDR_GET_LSIZE(hdr), hdr);
2251                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2252                 } else {
2253                         uint32_t buffers = 0;

2326                                         continue;
2327 
2328                                 (void) refcount_remove_many(
2329                                     &old_state->arcs_size, arc_buf_size(buf),
2330                                     buf);
2331                         }
2332                         ASSERT3U(bufcnt, ==, buffers);
2333                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2334                         (void) refcount_remove_many(
2335                             &old_state->arcs_size, arc_hdr_size(hdr), hdr);
2336                 }
2337         }
2338 
2339         if (HDR_HAS_L1HDR(hdr))
2340                 hdr->b_l1hdr.b_state = new_state;
2341 
2342         /*
2343          * L2 headers should never be on the L2 state list since they don't
2344          * have L1 headers allocated.
2345          */
2346         ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
2347             multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));

2348 }
2349 
2350 void
2351 arc_space_consume(uint64_t space, arc_space_type_t type)
2352 {
2353         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2354 
2355         switch (type) {
2356         case ARC_SPACE_DATA:
2357                 aggsum_add(&astat_data_size, space);
2358                 break;
2359         case ARC_SPACE_META:
2360                 aggsum_add(&astat_metadata_size, space);
2361                 break;
2362         case ARC_SPACE_OTHER:
2363                 aggsum_add(&astat_other_size, space);
2364                 break;
2365         case ARC_SPACE_HDRS:
2366                 aggsum_add(&astat_hdr_size, space);
2367                 break;
2368         case ARC_SPACE_L2HDRS:
2369                 aggsum_add(&astat_l2_hdr_size, space);
2370                 break;
2371         }
2372 
2373         if (type != ARC_SPACE_DATA)
2374                 aggsum_add(&arc_meta_used, space);
2375 
2376         aggsum_add(&arc_size, space);
2377 }
2378 
2379 void
2380 arc_space_return(uint64_t space, arc_space_type_t type)
2381 {
2382         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2383 
2384         switch (type) {
2385         case ARC_SPACE_DATA:
2386                 aggsum_add(&astat_data_size, -space);
2387                 break;
2388         case ARC_SPACE_META:
2389                 aggsum_add(&astat_metadata_size, -space);
2390                 break;
2391         case ARC_SPACE_OTHER:
2392                 aggsum_add(&astat_other_size, -space);
2393                 break;
2394         case ARC_SPACE_HDRS:
2395                 aggsum_add(&astat_hdr_size, -space);
2396                 break;
2397         case ARC_SPACE_L2HDRS:
2398                 aggsum_add(&astat_l2_hdr_size, -space);
2399                 break;
2400         }
2401 
2402         if (type != ARC_SPACE_DATA) {
2403                 ASSERT(aggsum_compare(&arc_meta_used, space) >= 0);
2404                 /*
2405                  * We use the upper bound here rather than the precise value
2406                  * because the arc_meta_max value doesn't need to be
2407                  * precise. It's only consumed by humans via arcstats.
2408                  */
2409                 if (arc_meta_max < aggsum_upper_bound(&arc_meta_used))
2410                         arc_meta_max = aggsum_upper_bound(&arc_meta_used);
2411                 aggsum_add(&arc_meta_used, -space);
2412         }
2413 
2414         ASSERT(aggsum_compare(&arc_size, space) >= 0);
2415         aggsum_add(&arc_size, -space);
2416 }
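
The bookkeeping done by arc_space_consume() and arc_space_return() can be modeled with plain counters. The sk_* names below are hypothetical; the sketch assumes single-threaded use with int64_t fields in place of the aggsum counters, and it leaves out the arc_meta_max high-water mark.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of arc_space_type_t. */
typedef enum sk_space_type {
	SK_SPACE_DATA,
	SK_SPACE_META,
	SK_SPACE_OTHER,
	SK_SPACE_HDRS,
	SK_SPACE_L2HDRS,
	SK_SPACE_NUMTYPES
} sk_space_type_t;

/* Hypothetical accounting state: per-type, metadata total, overall. */
typedef struct sk_acct {
	int64_t by_type[SK_SPACE_NUMTYPES];
	int64_t meta_used;
	int64_t size;
} sk_acct_t;

/* Charge space to its type; every non-data type counts as metadata. */
static void
sk_space_consume(sk_acct_t *a, int64_t space, sk_space_type_t type)
{
	assert(type >= 0 && type < SK_SPACE_NUMTYPES);
	a->by_type[type] += space;
	if (type != SK_SPACE_DATA)
		a->meta_used += space;
	a->size += space;
}

/* Undo a previous charge; totals must never go negative. */
static void
sk_space_return(sk_acct_t *a, int64_t space, sk_space_type_t type)
{
	assert(type >= 0 && type < SK_SPACE_NUMTYPES);
	a->by_type[type] -= space;
	if (type != SK_SPACE_DATA) {
		assert(a->meta_used >= space);
		a->meta_used -= space;
	}
	assert(a->size >= space);
	a->size -= space;
}
```

The point of the model is the invariant: the overall size is the sum of all per-type counters, and the metadata total is that sum minus the data counter.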
2417 
2418 /*
2419  * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2420  * with the hdr's b_pabd.
2421  */
2422 static boolean_t
2423 arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2424 {
2425         /*
2426          * The criteria for sharing a hdr's data are:
2427          * 1. the hdr's compression matches the buf's compression
2428          * 2. the hdr doesn't need to be byteswapped
2429          * 3. the hdr isn't already being shared
2430          * 4. the buf is either compressed or it is the last buf in the hdr list
2431          *
2432          * Criterion #4 maintains the invariant that shared uncompressed
2433          * bufs must be the final buf in the hdr's b_buf list. Reading this, you
2434          * might ask, "if a compressed buf is allocated first, won't that be the
2435          * last thing in the list?", but in that case it's impossible to create

2449         return (buf_compressed == hdr_compressed &&
2450             hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
2451             !HDR_SHARED_DATA(hdr) &&
2452             (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
2453 }
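
The four sharing criteria listed in arc_can_share() reduce to a single boolean predicate. The sk_* structs below are hypothetical simplifications: each macro test in the real code (HDR_SHARED_DATA, ARC_BUF_LAST, ARC_BUF_COMPRESSED, the b_byteswap comparison) is replaced by a plain bool field, and how the hdr's compressed state is derived is assumed rather than taken from the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the hdr state that matters here. */
typedef struct sk_hdr {
	bool compressed;	/* hdr data is stored compressed */
	bool needs_byteswap;	/* b_byteswap != DMU_BSWAP_NUMFUNCS */
	bool already_shared;	/* some buf already shares b_pabd */
} sk_hdr_t;

/* Hypothetical, simplified view of the buf state that matters here. */
typedef struct sk_buf {
	bool compressed;	/* buf holds compressed data */
	bool is_last;		/* buf is the final entry in b_buf list */
} sk_buf_t;

/* All four criteria from the comment must hold at once. */
static bool
sk_can_share(const sk_hdr_t *hdr, const sk_buf_t *buf)
{
	return (buf->compressed == hdr->compressed &&	/* criterion 1 */
	    !hdr->needs_byteswap &&			/* criterion 2 */
	    !hdr->already_shared &&			/* criterion 3 */
	    (buf->is_last || buf->compressed));		/* criterion 4 */
}
```

Under this model an uncompressed buf in the middle of the list can never share, even when the first three criteria hold, which is how criterion #4 keeps shared uncompressed bufs at the end of the list.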
2454 
2455 /*
2456  * Allocate a buf for this hdr. If you care about the data that's in the hdr,
2457  * or if you want a compressed buffer, pass those flags in. Returns 0 if the
2458  * copy was made successfully, or an error code otherwise.
2459  */
2460 static int
2461 arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2462     boolean_t fill, arc_buf_t **ret)
2463 {
2464         arc_buf_t *buf;
2465 
2466         ASSERT(HDR_HAS_L1HDR(hdr));
2467         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2468         VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2469             hdr->b_type == ARC_BUFC_METADATA);

2470         ASSERT3P(ret, !=, NULL);
2471         ASSERT3P(*ret, ==, NULL);
2472 
2473         buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2474         buf->b_hdr = hdr;
2475         buf->b_data = NULL;
2476         buf->b_next = hdr->b_l1hdr.b_buf;
2477         buf->b_flags = 0;
2478 
2479         add_reference(hdr, tag);
2480 
2481         /*
2482          * We're about to change the hdr's b_flags. We must either
2483          * hold the hash_lock or be undiscoverable.
2484          */
2485         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2486 
2487         /*
2488          * Only honor requests for compressed bufs if the hdr is actually
2489          * compressed.

2529          */
2530         if (fill) {
2531                 return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
2532         }
2533 
2534         return (0);
2535 }
2536 
2537 static char *arc_onloan_tag = "onloan";
2538 
2539 static inline void
2540 arc_loaned_bytes_update(int64_t delta)
2541 {
2542         atomic_add_64(&arc_loaned_bytes, delta);
2543 
2544         /* assert that it did not wrap around */
2545         ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2546 }
2547 
2548 /*
2549  * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
2550  * flight data by arc_tempreserve_space() until they are "returned". Loaned
2551  * buffers must be returned to the arc before they can be used by the DMU or
2552  * freed.
2553  */
2554 arc_buf_t *
2555 arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
2556 {
2557         arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
2558             is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
2559 
2560         arc_loaned_bytes_update(size);
2561 
2562         return (buf);
2563 }
2564 
2565 arc_buf_t *
2566 arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
2567     enum zio_compress compression_type)
2568 {

2617         list_insert_head(l2arc_free_on_write, df);
2618         mutex_exit(&l2arc_free_on_write_mtx);
2619 }
2620 
2621 static void
2622 arc_hdr_free_on_write(arc_buf_hdr_t *hdr)
2623 {
2624         arc_state_t *state = hdr->b_l1hdr.b_state;
2625         arc_buf_contents_t type = arc_buf_type(hdr);
2626         uint64_t size = arc_hdr_size(hdr);
2627 
2628         /* protected by hash lock, if in the hash table */
2629         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
2630                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2631                 ASSERT(state != arc_anon && state != arc_l2c_only);
2632 
2633                 (void) refcount_remove_many(&state->arcs_esize[type],
2634                     size, hdr);
2635         }
2636         (void) refcount_remove_many(&state->arcs_size, size, hdr);
2637         if (type == ARC_BUFC_METADATA) {
2638                 arc_space_return(size, ARC_SPACE_META);
2639         } else {
2640                 ASSERT(type == ARC_BUFC_DATA);
2641                 arc_space_return(size, ARC_SPACE_DATA);
2642         }
2643 
2644         l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
2645 }
2646 
2647 /*
2648  * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
2649  * data buffer, we transfer the refcount ownership to the hdr and update
2650  * the appropriate kstats.
2651  */
2652 static void
2653 arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2654 {
2655         arc_state_t *state = hdr->b_l1hdr.b_state;
2656 
2657         ASSERT(arc_can_share(hdr, buf));
2658         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2659         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2660 
2661         /*
2662          * Start sharing the data buffer. We transfer the
2663          * refcount ownership to the hdr since it always owns
2664          * the refcount whenever an arc_buf_t is shared.
2665          */
2666         refcount_transfer_ownership(&state->arcs_size, buf, hdr);
2667         hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
2668         abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
2669             HDR_ISTYPE_METADATA(hdr));
2670         arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2671         buf->b_flags |= ARC_BUF_FLAG_SHARED;
2672 
2673         /*
2674          * Since we've transferred ownership to the hdr we need
2675          * to increment its compressed and uncompressed kstats and
2676          * decrement the overhead size.
2677          */
2678         ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2679         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
2680         ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
2681 }
2682 
2683 static void
2684 arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2685 {
2686         arc_state_t *state = hdr->b_l1hdr.b_state;
2687 
2688         ASSERT(arc_buf_is_shared(buf));
2689         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);

2842 
2843         /* clean up the buf */
2844         buf->b_hdr = NULL;
2845         kmem_cache_free(buf_cache, buf);
2846 }
2847 
2848 static void
2849 arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr)
2850 {
2851         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2852         ASSERT(HDR_HAS_L1HDR(hdr));
2853         ASSERT(!HDR_SHARED_DATA(hdr));
2854 
2855         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2856         hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
2857         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2858         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2859 
2860         ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2861         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));

2862 }
2863 
2864 static void
2865 arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
2866 {
2867         ASSERT(HDR_HAS_L1HDR(hdr));
2868         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2869 
2870         /*
2871          * If the hdr is currently being written to the l2arc then
2872          * we defer freeing the data by adding it to the l2arc_free_on_write
2873          * list. The l2arc will free the data once it's finished
2874          * writing it to the l2arc device.
2875          */
2876         if (HDR_L2_WRITING(hdr)) {
2877                 arc_hdr_free_on_write(hdr);
2878                 ARCSTAT_BUMP(arcstat_l2_free_on_write);
2879         } else {
2880                 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
2881                     arc_hdr_size(hdr), hdr);
2882         }
2883         hdr->b_l1hdr.b_pabd = NULL;
2884         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2885 
2886         ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2887         ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2888 }
2889 
2890 static arc_buf_hdr_t *
2891 arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
2892     enum zio_compress compression_type, arc_buf_contents_t type)
2893 {
2894         arc_buf_hdr_t *hdr;
2895 
2896         VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
2897 
2898         hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
2899         ASSERT(HDR_EMPTY(hdr));
2900         ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
2901         ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
2902         HDR_SET_PSIZE(hdr, psize);
2903         HDR_SET_LSIZE(hdr, lsize);
2904         hdr->b_spa = spa;
2905         hdr->b_type = type;
2906         hdr->b_flags = 0;
2907         arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
2908         arc_hdr_set_compress(hdr, compression_type);
2909 
2910         hdr->b_l1hdr.b_state = arc_anon;
2911         hdr->b_l1hdr.b_arc_access = 0;
2912         hdr->b_l1hdr.b_bufcnt = 0;
2913         hdr->b_l1hdr.b_buf = NULL;
2914 
2915         /*
2916          * Allocate the hdr's buffer. This will contain either
2917          * the compressed or uncompressed data depending on the block
2918          * it references and compressed arc enablement.
2919          */
2920         arc_hdr_alloc_pabd(hdr);

2945 
2946         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
2947         buf_hash_remove(hdr);
2948 
2949         bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);
2950 
2951         if (new == hdr_full_cache) {
2952                 arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
2953                 /*
2954                  * arc_access and arc_change_state need to be aware that a
2955                  * header has just come out of L2ARC, so we set its state to
2956                  * l2c_only even though it's about to change.
2957                  */
2958                 nhdr->b_l1hdr.b_state = arc_l2c_only;
2959 
2960                 /* Verify previous threads set b_pabd to NULL before freeing */
2961                 ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
2962         } else {
2963                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2964                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
2965                 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
2966 
2967                 /*
2968                  * If we've reached here, we must have been called from
2969                  * arc_evict_hdr(). As such, we should have already been
2970                  * removed from any ghost list we were previously on
2971                  * (which protects us from racing with arc_evict_state()),
2972                  * so no locking is needed during this check.
2973                  */
2974                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
2975 
2976                 /*
2977                  * A buffer must not be moved into the arc_l2c_only
2978                  * state if it's not finished being written out to the
2979                  * l2arc device. Otherwise, the b_l1hdr.b_pabd field
2980                  * might be accessed even though it has been removed.
2981                  */
2982                 VERIFY(!HDR_L2_WRITING(hdr));
2983                 VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2984 
2985 #ifdef ZFS_DEBUG

3050 /*
3051  * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
3052  * for bufs containing metadata.
3053  */
3054 arc_buf_t *
3055 arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
3056     enum zio_compress compression_type)
3057 {
3058         ASSERT3U(lsize, >, 0);
3059         ASSERT3U(lsize, >=, psize);
3060         ASSERT(compression_type > ZIO_COMPRESS_OFF);
3061         ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3062 
3063         arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3064             compression_type, ARC_BUFC_DATA);
3065         ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3066 
3067         arc_buf_t *buf = NULL;
3068         VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3069         arc_buf_thaw(buf);
3070         ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
3071 
3072         if (!arc_buf_is_shared(buf)) {
3073                 /*
3074                  * To ensure that the hdr has the correct data in it if we call
3075                  * arc_decompress() on this buf before it's been written to
3076                  * disk, it's easiest if we just set up sharing between the
3077                  * buf and the hdr.
3078                  */
3079                 ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3080                 arc_hdr_free_pabd(hdr);
3081                 arc_share_buf(hdr, buf);
3082         }
3083 
3084         return (buf);
3085 }
3086 
3087 static void
3088 arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
3089 {
3090         l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
3091         l2arc_dev_t *dev = l2hdr->b_dev;
3092         uint64_t psize = arc_hdr_size(hdr);
3093 
3094         ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3095         ASSERT(HDR_HAS_L2HDR(hdr));
3096 
3097         list_remove(&dev->l2ad_buflist, hdr);
3098 
3099         ARCSTAT_INCR(arcstat_l2_psize, -psize);
3100         ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3101 
3102         vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3103 
3104         (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3105         arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3106 }
3107 
3108 static void
3109 arc_hdr_destroy(arc_buf_hdr_t *hdr)
3110 {
3111         if (HDR_HAS_L1HDR(hdr)) {
3112                 ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3113                     hdr->b_l1hdr.b_bufcnt > 0);
3114                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3115                 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3116         }
3117         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3118         ASSERT(!HDR_IN_HASH_TABLE(hdr));
3119 
3120         if (!HDR_EMPTY(hdr))
3121                 buf_discard_identity(hdr);
3122 
3123         if (HDR_HAS_L2HDR(hdr)) {
3124                 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3125                 boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3126 
3127                 if (!buflist_held)
3128                         mutex_enter(&dev->l2ad_mtx);
3129 
3130                 /*
3131                  * Even though we checked this conditional above, we
3132                  * need to check this again now that we have the
3133                  * l2ad_mtx. This is because we could be racing with
3134                  * another thread calling l2arc_evict() which might have
3135                  * destroyed this header's L2 portion as we were waiting
3136                  * to acquire the l2ad_mtx. If that happens, we don't
3137                  * want to re-destroy the header's L2 portion.
3138                  */
3139                 if (HDR_HAS_L2HDR(hdr))
3140                         arc_hdr_l2hdr_destroy(hdr);
3141 
3142                 if (!buflist_held)
3143                         mutex_exit(&dev->l2ad_mtx);
3144         }
3145 
3146         if (HDR_HAS_L1HDR(hdr)) {
3147                 arc_cksum_free(hdr);
3148 
3149                 while (hdr->b_l1hdr.b_buf != NULL)
3150                         arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3151 
3152 #ifdef ZFS_DEBUG
3153                 if (hdr->b_l1hdr.b_thawed != NULL) {
3154                         kmem_free(hdr->b_l1hdr.b_thawed, 1);
3155                         hdr->b_l1hdr.b_thawed = NULL;
3156                 }
3157 #endif
3158 
3159                 if (hdr->b_l1hdr.b_pabd != NULL) {
3160                         arc_hdr_free_pabd(hdr);
3161                 }
3162         }
3163 
3164         ASSERT3P(hdr->b_hash_next, ==, NULL);
3165         if (HDR_HAS_L1HDR(hdr)) {

3201  * Evict the arc_buf_hdr that is provided as a parameter. The resultant
3202  * state of the header is dependent on its state prior to entering this
3203  * function. The following transitions are possible:
3204  *
3205  *    - arc_mru -> arc_mru_ghost
3206  *    - arc_mfu -> arc_mfu_ghost
3207  *    - arc_mru_ghost -> arc_l2c_only
3208  *    - arc_mru_ghost -> deleted
3209  *    - arc_mfu_ghost -> arc_l2c_only
3210  *    - arc_mfu_ghost -> deleted
3211  */
3212 static int64_t
3213 arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3214 {
3215         arc_state_t *evicted_state, *state;
3216         int64_t bytes_evicted = 0;
3217 
3218         ASSERT(MUTEX_HELD(hash_lock));
3219         ASSERT(HDR_HAS_L1HDR(hdr));
3220 
3221         state = hdr->b_l1hdr.b_state;
3222         if (GHOST_STATE(state)) {
3223                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3224                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3225 
3226                 /*
3227                  * l2arc_write_buffers() relies on a header's L1 portion
3228                  * (i.e. its b_pabd field) during its write phase.
3229                  * Thus, we cannot push a header onto the arc_l2c_only
3230                  * state (removing its L1 piece) until the header is
3231                  * done being written to the l2arc.
3232                  */
3233                 if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
3234                         ARCSTAT_BUMP(arcstat_evict_l2_skip);
3235                         return (bytes_evicted);
3236                 }
3237 
3238                 ARCSTAT_BUMP(arcstat_deleted);
3239                 bytes_evicted += HDR_GET_LSIZE(hdr);
3240

3589  * prevents us from trying to evict more from a state's list than
3590  * is "evictable", and to skip evicting altogether when passed a
3591  * negative value for "bytes". In contrast, arc_evict_state() will
3592  * evict everything it can, when passed a negative value for "bytes".
3593  */
3594 static uint64_t
3595 arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
3596     arc_buf_contents_t type)
3597 {
3598         int64_t delta;
3599 
3600         if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
3601                 delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
3602                 return (arc_evict_state(state, spa, delta, type));
3603         }
3604 
3605         return (0);
3606 }
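The clamp in arc_adjust_impl() can be illustrated in isolation. The following is a minimal sketch, not the kernel code: `sk_min64` and `sk_adjust_delta` are hypothetical names, and `evictable` stands in for `refcount_count(&state->arcs_esize[type])`. It only shows how the byte count passed on to arc_evict_state() is bounded, and that a non-positive request evicts nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's MIN() macro. */
static int64_t
sk_min64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/*
 * Hypothetical model of the clamp in arc_adjust_impl().
 * "evictable" plays the role of refcount_count(&state->arcs_esize[type]).
 * Returns the byte count that would be requested from arc_evict_state(),
 * or 0 when nothing should be evicted.
 */
static int64_t
sk_adjust_delta(int64_t evictable, int64_t bytes)
{
	if (bytes > 0 && evictable > 0)
		return (sk_min64(evictable, bytes));
	return (0);
}
```

Under these assumptions, a request larger than the evictable amount is trimmed to the evictable amount, while a negative request is ignored rather than treated as "evict everything".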
3607 
3608 /*
3609  * Evict metadata buffers from the cache, such that arc_meta_used is
3610  * capped by the arc_meta_limit tunable.
3611  */
3612 static uint64_t
3613 arc_adjust_meta(uint64_t meta_used)
3614 {
3615         uint64_t total_evicted = 0;
3616         int64_t target;
3617 
3618         /*
3619          * If we're over the meta limit, we want to evict enough
3620          * metadata to get back under the meta limit. We don't want to
3621          * evict so much that we drop the MRU below arc_p, though. If
3622          * we're over the meta limit more than we're over arc_p, we
3623          * evict some from the MRU here, and some from the MFU below.
3624          */
3625         target = MIN((int64_t)(meta_used - arc_meta_limit),
3626             (int64_t)(refcount_count(&arc_anon->arcs_size) +
3627             refcount_count(&arc_mru->arcs_size) - arc_p));
3628 
3629         total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3630 
3631         /*
3632          * Similar to the above, we want to evict enough bytes to get us
3633          * below the meta limit, but not so much as to drop us below the
3634          * space allotted to the MFU (which is defined as arc_c - arc_p).
3635          */
3636         target = MIN((int64_t)(meta_used - arc_meta_limit),
3637             (int64_t)(refcount_count(&arc_mfu->arcs_size) -
3638             (arc_c - arc_p)));
3639 
3640         total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3641 
3642         return (total_evicted);
3643 }
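The two targets computed in arc_adjust_meta() can be worked through with plain integers. This is a sketch under stated assumptions: `sk_meta_state_t` and the `sk_*` functions are hypothetical, the struct fields stand in for the refcount and tunable reads in the function above, and the unsigned-to-signed casts mirror the original (they rely on the usual two's-complement conversion).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the values that arc_adjust_meta() reads. */
typedef struct sk_meta_state {
	uint64_t meta_used;	/* the meta_used argument */
	uint64_t meta_limit;	/* arc_meta_limit */
	uint64_t anon_size;	/* size of arc_anon */
	uint64_t mru_size;	/* size of arc_mru */
	uint64_t mfu_size;	/* size of arc_mfu */
	uint64_t c;		/* arc_c */
	uint64_t p;		/* arc_p */
} sk_meta_state_t;

static int64_t
sk_min64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/* MRU target: overage above the meta limit, capped by MRU overage above p. */
static int64_t
sk_meta_target_mru(const sk_meta_state_t *s)
{
	return (sk_min64((int64_t)(s->meta_used - s->meta_limit),
	    (int64_t)(s->anon_size + s->mru_size - s->p)));
}

/* MFU target: same overage, capped by MFU overage above (c - p). */
static int64_t
sk_meta_target_mfu(const sk_meta_state_t *s)
{
	return (sk_min64((int64_t)(s->meta_used - s->meta_limit),
	    (int64_t)(s->mfu_size - (s->c - s->p))));
}
```

With meta_used = 120, meta_limit = 100, anon = 10, mru = 50, mfu = 30, c = 100 and p = 40, the MRU target is 20 and the MFU target is -30. A negative target is then ignored by arc_adjust_impl(), which only evicts when bytes is positive.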
3644 
3645 /*
3646  * Return the type of the oldest buffer in the given arc state
3647  *
3648  * This function will select a random sublist of type ARC_BUFC_DATA and
3649  * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
3650  * is compared, and the type which contains the "older" buffer will be
3651  * returned.
3652  */
3653 static arc_buf_contents_t
3654 arc_adjust_type(arc_state_t *state)
3655 {
3656         multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
3657         multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
3658         int data_idx = multilist_get_random_index(data_ml);
3659         int meta_idx = multilist_get_random_index(meta_ml);
3660         multilist_sublist_t *data_mls;
3661         multilist_sublist_t *meta_mls;
3662         arc_buf_contents_t type;
3663         arc_buf_hdr_t *data_hdr;
3664         arc_buf_hdr_t *meta_hdr;
3665 
3666         /*
3667          * We keep the sublist lock until we're finished, to prevent
3668          * the headers from being destroyed via arc_evict_state().
3669          */
3670         data_mls = multilist_sublist_lock(data_ml, data_idx);
3671         meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
3672 
3673         /*
3674          * These two loops are to ensure we skip any markers that
3675          * might be at the tail of the lists due to arc_evict_state().
3676          */
3677 
3678         for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
3679             data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
3680                 if (data_hdr->b_spa != 0)
3681                         break;
3682         }
3683 
3684         for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
3685             meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
3686                 if (meta_hdr->b_spa != 0)
3687                         break;
3688         }
3689 
3690         if (data_hdr == NULL && meta_hdr == NULL) {
3691                 type = ARC_BUFC_DATA;
3692         } else if (data_hdr == NULL) {
3693                 ASSERT3P(meta_hdr, !=, NULL);
3694                 type = ARC_BUFC_METADATA;
3695         } else if (meta_hdr == NULL) {
3696                 ASSERT3P(data_hdr, !=, NULL);
3697                 type = ARC_BUFC_DATA;
3698         } else {
3699                 ASSERT3P(data_hdr, !=, NULL);
3700                 ASSERT3P(meta_hdr, !=, NULL);
3701 
3702                 /* The headers can't be on the sublist without an L1 header */
3703                 ASSERT(HDR_HAS_L1HDR(data_hdr));
3704                 ASSERT(HDR_HAS_L1HDR(meta_hdr));
3705 
3706                 if (data_hdr->b_l1hdr.b_arc_access <
3707                     meta_hdr->b_l1hdr.b_arc_access) {
3708                         type = ARC_BUFC_DATA;
3709                 } else {
3710                         type = ARC_BUFC_METADATA;
3711                 }
3712         }
3713 
3714         multilist_sublist_unlock(meta_mls);
3715         multilist_sublist_unlock(data_mls);
3716 
3717         return (type);
3718 }
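The decision made at the end of arc_adjust_type() reduces to a small comparison once the two tail headers are known. The sketch below assumes hypothetical types `sk_hdr_t` and `sk_type_t`; a NULL pointer stands for an empty sublist, and `access` stands in for `b_l1hdr.b_arc_access`. It omits the sublist locking and marker skipping performed by the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for arc_buf_contents_t and arc_buf_hdr_t. */
typedef enum { SK_DATA, SK_METADATA } sk_type_t;

typedef struct sk_hdr {
	int64_t access;		/* plays the role of b_l1hdr.b_arc_access */
} sk_hdr_t;

/*
 * Pick the type whose tail header is older. A NULL header means the
 * sampled sublist held no real (non-marker) headers.
 */
static sk_type_t
sk_older_type(const sk_hdr_t *data_hdr, const sk_hdr_t *meta_hdr)
{
	if (data_hdr == NULL && meta_hdr == NULL)
		return (SK_DATA);
	if (data_hdr == NULL)
		return (SK_METADATA);
	if (meta_hdr == NULL)
		return (SK_DATA);
	/* Strictly older data wins; a tie goes to metadata. */
	return (data_hdr->access < meta_hdr->access ? SK_DATA : SK_METADATA);
}
```

Note that, as in the original comparison, equal access times select metadata, and two empty sublists default to data.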
3719 
3720 /*
3721  * Evict buffers from the cache, such that arc_size is capped by arc_c.
3722  */
3723 static uint64_t
3724 arc_adjust(void)
3725 {
3726         uint64_t total_evicted = 0;
3727         uint64_t bytes;
3728         int64_t target;
3729         uint64_t asize = aggsum_value(&arc_size);
3730         uint64_t ameta = aggsum_value(&arc_meta_used);
3731 
3732         /*
3733          * If we're over arc_meta_limit, we want to correct that before
3734          * potentially evicting data buffers below.
3735          */
3736         total_evicted += arc_adjust_meta(ameta);
3737 
3738         /*
3739          * Adjust MRU size
3740          *
3741          * If we're over the target cache size, we want to evict enough
3742          * from the list to get back to our target size. We don't want
3743          * to evict too much from the MRU, such that it drops below
3744          * arc_p. So, if we're over our target cache size more than
3745          * the MRU is over arc_p, we'll evict enough to get back to
3746          * arc_p here, and then evict more from the MFU below.
3747          */
3748         target = MIN((int64_t)(asize - arc_c),
3749             (int64_t)(refcount_count(&arc_anon->arcs_size) +
3750             refcount_count(&arc_mru->arcs_size) + ameta - arc_p));
3751 
3752         /*
3753          * If we're below arc_meta_min, always prefer to evict data.
3754          * Otherwise, try to satisfy the requested number of bytes to
3755          * evict from the type which contains older buffers, in an
3756          * effort to keep newer buffers in the cache regardless of their
3757          * type. If we cannot satisfy the number of bytes from this
3758          * type, spill over into the next type.
3759          */
3760         if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
3761             ameta > arc_meta_min) {
3762                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3763                 total_evicted += bytes;
3764 
3765                 /*
3766                  * If we couldn't evict our target number of bytes from
3767                  * metadata, we try to get the rest from data.
3768                  */
3769                 target -= bytes;
3770 
3771                 total_evicted +=
3772                     arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3773         } else {
3774                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3775                 total_evicted += bytes;
3776 
3777                 /*
3778                  * If we couldn't evict our target number of bytes from
3779                  * data, we try to get the rest from metadata.
3780                  */
3781                 target -= bytes;
3782 
3783                 total_evicted +=
3784                     arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3785         }
3786 
3787         /*
3788          * Adjust MFU size
3789          *
3790          * Now that we've tried to evict enough from the MRU to get its
3791          * size back to arc_p, if we're still above the target cache
3792          * size, we evict the rest from the MFU.
3793          */
3794         target = asize - arc_c;
3795 
3796         if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
3797             ameta > arc_meta_min) {
3798                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3799                 total_evicted += bytes;
3800 
3801                 /*
3802                  * If we couldn't evict our target number of bytes from
3803                  * metadata, we try to get the rest from data.
3804                  */
3805                 target -= bytes;
3806 
3807                 total_evicted +=
3808                     arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3809         } else {
3810                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3811                 total_evicted += bytes;
3812 
3813                 /*
3814                  * If we couldn't evict our target number of bytes from
3815                  * data, we try to get the rest from metadata.
3816                  */
3817                 target -= bytes;
3818 
3819                 total_evicted +=
3820                     arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3821         }
3822 
3823         /*
3824          * Adjust ghost lists
3825          *
3826          * In addition to the above, the ARC also defines target values
3827          * for the ghost lists. The sum of the mru list and mru ghost
3828          * list should never exceed the target size of the cache, and
3829          * the sum of the mru list, mfu list, mru ghost list, and mfu
3830          * ghost list should never exceed twice the target size of the
3831          * cache. The following logic enforces these limits on the ghost
3832          * caches, and evicts from them as needed.
3833          */
3834         target = refcount_count(&arc_mru->arcs_size) +
3835             refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
3836 
3837         bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
3838         total_evicted += bytes;
3839 
3840         target -= bytes;
3841 
3842         total_evicted +=
3843             arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
3844 
3845         /*
3846          * We assume the sum of the mru list and mfu list is less than
3847          * or equal to arc_c (we enforced this above), which means we
3848          * can use the simpler of the two equations below:
3849          *
3850          *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
3851          *                  mru ghost + mfu ghost <= arc_c
3852          */
3853         target = refcount_count(&arc_mru_ghost->arcs_size) +
3854             refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
3855 
3856         bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
3857         total_evicted += bytes;
3858 
3859         target -= bytes;
3860 
3861         total_evicted +=
3862             arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
3863 
3864         return (total_evicted);
3865 }
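arc_adjust() repeats one pattern for every state: evict from a preferred type first, subtract what was evicted from the target, then try the other type with the remainder. The sketch below models that spill-over with plain counters. All names are hypothetical, and it assumes the eviction step always frees exactly the clamped amount, which the real arc_evict_state() does not guarantee.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical eviction step: frees min(*evictable, bytes) when both are
 * positive, mirroring the guard in arc_adjust_impl().
 */
static int64_t
sk_evict(int64_t *evictable, int64_t bytes)
{
	int64_t delta;

	if (bytes <= 0 || *evictable <= 0)
		return (0);
	delta = (*evictable < bytes) ? *evictable : bytes;
	*evictable -= delta;
	return (delta);
}

/* Evict from the first type, then spill the remainder into the second. */
static int64_t
sk_evict_spill(int64_t *first, int64_t *second, int64_t target)
{
	int64_t bytes = sk_evict(first, target);
	int64_t total = bytes;

	target -= bytes;
	total += sk_evict(second, target);
	return (total);
}
```

For example, with 30 evictable bytes in the first type, 100 in the second and a target of 50, the sketch takes 30 from the first and the remaining 20 from the second.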
3866 
3867 void
3868 arc_flush(spa_t *spa, boolean_t retry)
3869 {
3870         uint64_t guid = 0;
3871 
3872         /*
3873          * If retry is B_TRUE, a spa must not be specified since we have
3874          * no good way to determine if all of a spa's buffers have been
3875          * evicted from an arc state.
3876          */
3877         ASSERT(!retry || spa == NULL);
3878 
3879         if (spa != NULL)
3880                 guid = spa_load_guid(spa);

3881 
3882         (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
3883         (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
3884 
3885         (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
3886         (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
3887 
3888         (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
3889         (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
3890 
3891         (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
3892         (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
3893 }
3894 
3895 void
3896 arc_shrink(int64_t to_free)
3897 {
3898         uint64_t asize = aggsum_value(&arc_size);
3899         if (arc_c > arc_c_min) {
3900 
3901                 if (arc_c > arc_c_min + to_free)
3902                         atomic_add_64(&arc_c, -to_free);
3903                 else
3904                         arc_c = arc_c_min;
3905 
3906                 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
3907                 if (asize < arc_c)
3908                         arc_c = MAX(asize, arc_c_min);
3909                 if (arc_p > arc_c)
3910                         arc_p = (arc_c >> 1);
3911                 ASSERT(arc_c >= arc_c_min);
3912                 ASSERT((int64_t)arc_p >= 0);
3913         }
3914 
3915         if (asize > arc_c)
3916                 (void) arc_adjust();
3917 }
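The arithmetic in arc_shrink() is easier to follow with concrete numbers. This sketch copies the updates to arc_c and arc_p into a hypothetical `sk_arc_t` struct, without atomics or assertions; the shift value of 7 used in the examples is an assumption about arc_shrink_shift, not something shown in this listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the tunables that arc_shrink() touches. */
typedef struct sk_arc {
	uint64_t c;		/* arc_c: target cache size */
	uint64_t c_min;		/* arc_c_min */
	uint64_t p;		/* arc_p: target MRU size */
	uint64_t size;		/* aggsum_value(&arc_size) */
	int shrink_shift;	/* arc_shrink_shift (assumed value) */
} sk_arc_t;

/* Returns nonzero when arc_adjust() would be called afterwards. */
static int
sk_shrink(sk_arc_t *a, uint64_t to_free)
{
	if (a->c > a->c_min) {
		if (a->c > a->c_min + to_free)
			a->c -= to_free;
		else
			a->c = a->c_min;

		a->p -= a->p >> a->shrink_shift;
		if (a->size < a->c)
			a->c = (a->size > a->c_min) ? a->size : a->c_min;
		if (a->p > a->c)
			a->p = a->c >> 1;
	}
	return (a->size > a->c);
}
```

Starting from c = 1000, c_min = 100, p = 800 and size = 900, freeing 300 leaves c = 700. Because p (794 after the shift) now exceeds c, it is reset to c / 2 = 350, and the cache is still over target, so eviction would follow.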
3918 
3919 typedef enum free_memory_reason_t {
3920         FMR_UNKNOWN,
3921         FMR_NEEDFREE,
3922         FMR_LOTSFREE,
3923         FMR_SWAPFS_MINFREE,
3924         FMR_PAGES_PP_MAXIMUM,
3925         FMR_HEAP_ARENA,
3926         FMR_ZIO_ARENA,
3927 } free_memory_reason_t;
3928 
3929 int64_t last_free_memory;
3930 free_memory_reason_t last_free_reason;
3931 
3932 /*
3933  * Additional reserve of pages for pp_reserve.
3934  */
3935 int64_t arc_pages_pp_reserve = 64;

4059  * is under memory pressure and that the arc should adjust accordingly.
4060  */
4061 static boolean_t
4062 arc_reclaim_needed(void)
4063 {
4064         return (arc_available_memory() < 0);
4065 }
4066 
4067 static void
4068 arc_kmem_reap_now(void)
4069 {
4070         size_t                  i;
4071         kmem_cache_t            *prev_cache = NULL;
4072         kmem_cache_t            *prev_data_cache = NULL;
4073         extern kmem_cache_t     *zio_buf_cache[];
4074         extern kmem_cache_t     *zio_data_buf_cache[];
4075         extern kmem_cache_t     *range_seg_cache;
4076         extern kmem_cache_t     *abd_chunk_cache;
4077 
4078 #ifdef _KERNEL
4079         if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) {
4080                 /*
4081                  * We are exceeding our meta-data cache limit.
4082                  * Purge some DNLC entries to release holds on meta-data.
4083                  */
4084                 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4085         }
4086 #if defined(__i386)
4087         /*
4088          * Reclaim unused memory from all kmem caches.
4089          */
4090         kmem_reap();
4091 #endif
4092 #endif
4093 
4094         /*
4095          * If a kmem reap is already active, don't schedule more.  We must
4096          * check for this because kmem_cache_reap_soon() won't actually
4097          * block on the cache being reaped (this is to prevent callers from
4098          * becoming implicitly blocked by a system-wide kmem reap -- which,
4099          * on a system with many, many full magazines, can take minutes).
4100          */
4101         if (kmem_cache_reap_active())
4102                 return;

4218 #endif
4219                                 arc_shrink(to_free);
4220                         }
4221                 } else if (free_memory < arc_c >> arc_no_grow_shift) {
4222                         arc_no_grow = B_TRUE;
4223                 } else if (gethrtime() >= growtime) {
4224                         arc_no_grow = B_FALSE;
4225                 }
4226 
4227                 mutex_enter(&arc_reclaim_lock);
4228 
4229                 /*
4230                  * If evicted is zero, we couldn't evict anything via
4231                  * arc_adjust(). This could be due to hash lock
4232                  * collisions, but more likely due to the majority of
4233                  * arc buffers being unevictable. Therefore, even if
4234                  * arc_size is above arc_c, another pass is unlikely to
4235                  * be helpful and could potentially cause us to enter an
4236                  * infinite loop.
4237                  */
4238                 if (aggsum_compare(&arc_size, arc_c) <= 0 || evicted == 0) {
4239                         /*
4240                          * We're either no longer overflowing, or we
4241                          * can't evict anything more, so we should wake
4242                          * up any threads before we go to sleep.
4243                          */
4244                         cv_broadcast(&arc_reclaim_waiters_cv);
4245 
4246                         /*
4247                          * Block until signaled, or after one second (we
4248                          * might need to perform arc_kmem_reap_now()
4249                          * even if we aren't being signaled).
4250                          */
4251                         CALLB_CPR_SAFE_BEGIN(&cpr);
4252                         (void) cv_timedwait_hires(&arc_reclaim_thread_cv,
4253                             &arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
4254                         CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock);
4255                 }
4256         }
4257 
4258         arc_reclaim_thread_exit = B_FALSE;

4300                 delta = MIN(bytes * mult, arc_p);
4301                 arc_p = MAX(arc_p_min, arc_p - delta);
4302         }
4303         ASSERT((int64_t)arc_p >= 0);
4304 
4305         if (arc_reclaim_needed()) {
4306                 cv_signal(&arc_reclaim_thread_cv);
4307                 return;
4308         }
4309 
4310         if (arc_no_grow)
4311                 return;
4312 
4313         if (arc_c >= arc_c_max)
4314                 return;
4315 
4316         /*
4317          * If we're within (2 * maxblocksize) bytes of the target
4318          * cache size, increment the target cache size.
4319          */
4320         if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) >
4321             0) {
4322                 atomic_add_64(&arc_c, (int64_t)bytes);
4323                 if (arc_c > arc_c_max)
4324                         arc_c = arc_c_max;
4325                 else if (state == arc_anon)
4326                         atomic_add_64(&arc_p, (int64_t)bytes);
4327                 if (arc_p > arc_c)
4328                         arc_p = arc_c;
4329         }
4330         ASSERT((int64_t)arc_p >= 0);
4331 }
4332 
4333 /*
4334  * Check if arc_size has grown past our upper threshold, determined by
4335  * zfs_arc_overflow_shift.
4336  */
4337 static boolean_t
4338 arc_is_overflowing(void)
4339 {
4340         /* Always allow at least one block of overflow */
4341         uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
4342             arc_c >> zfs_arc_overflow_shift);
4343 
4344         /*
4345          * We just compare the lower bound here for performance reasons. Our
4346          * primary goals are to make sure that the arc never grows without
4347          * bound, and that it can reach its maximum size. This check
4348          * accomplishes both goals. The maximum amount we could run over by is
4349          * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
4350          * in the ARC. In practice, that's in the tens of MB, which is low
4351          * enough to be safe.
4352          */
4353         return (aggsum_lower_bound(&arc_size) >= arc_c + overflow);
4354 }
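The overflow threshold in arc_is_overflowing() can be checked with a few numbers. The sketch below is a hypothetical model: `size_lower_bound` stands in for `aggsum_lower_bound(&arc_size)`, and the block size and shift are passed as parameters. The 16 MiB block size and shift of 8 used in the examples are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's MAX() macro. */
static uint64_t
sk_max_u64(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/*
 * Model of arc_is_overflowing(): the cache overflows once its size
 * reaches arc_c plus an allowance of at least one maximum-sized block.
 */
static int
sk_is_overflowing(uint64_t size_lower_bound, uint64_t c,
    uint64_t max_block, int overflow_shift)
{
	uint64_t overflow = sk_max_u64(max_block, c >> overflow_shift);

	return (size_lower_bound >= c + overflow);
}
```

With a 1 GiB target, c >> 8 is only 4 MiB, so the single-block allowance of 16 MiB sets the threshold. With an 8 GiB target, c >> 8 is 32 MiB and the shift-based allowance takes over.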
4355 
4356 static abd_t *
4357 arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4358 {
4359         arc_buf_contents_t type = arc_buf_type(hdr);
4360 
4361         arc_get_data_impl(hdr, size, tag);
4362         if (type == ARC_BUFC_METADATA) {
4363                 return (abd_alloc(size, B_TRUE));
4364         } else {
4365                 ASSERT(type == ARC_BUFC_DATA);
4366                 return (abd_alloc(size, B_FALSE));
4367         }
4368 }
4369 
4370 static void *
4371 arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4372 {
4373         arc_buf_contents_t type = arc_buf_type(hdr);
4374 
4375         arc_get_data_impl(hdr, size, tag);
4376         if (type == ARC_BUFC_METADATA) {
4377                 return (zio_buf_alloc(size));
4378         } else {
4379                 ASSERT(type == ARC_BUFC_DATA);
4380                 return (zio_data_buf_alloc(size));
4381         }
4382 }
4383 
4384 /*
4385  * Allocate a block and return it to the caller. If we are hitting the
4386  * hard limit for the cache size, we must sleep, waiting for the eviction
4387  * thread to catch up. If we're past the target size but below the hard
4388  * limit, we'll only signal the reclaim thread and continue on.
4389  */
4390 static void
4391 arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4392 {
4393         arc_state_t *state = hdr->b_l1hdr.b_state;
4394         arc_buf_contents_t type = arc_buf_type(hdr);
4395 
4396         arc_adapt(size, state);

4415                 /*
4416                  * Now that we've acquired the lock, we may no longer be
4417                  * over the overflow limit, let's check.
4418                  *
4419                  * We're ignoring the case of spurious wake ups. If that
4420                  * were to happen, it'd let this thread consume an ARC
4421                  * buffer before it should have (i.e. before we're under
4422                  * the overflow limit and were signalled by the reclaim
4423                  * thread). As long as that is a rare occurrence, it
4424                  * shouldn't cause any harm.
4425                  */
4426                 if (arc_is_overflowing()) {
4427                         cv_signal(&arc_reclaim_thread_cv);
4428                         cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
4429                 }
4430 
4431                 mutex_exit(&arc_reclaim_lock);
4432         }
4433 
4434         VERIFY3U(hdr->b_type, ==, type);
4435         if (type == ARC_BUFC_METADATA) {
4436                 arc_space_consume(size, ARC_SPACE_META);
4437         } else {
4438                 arc_space_consume(size, ARC_SPACE_DATA);
4439         }
4440 
4441         /*
4442          * Update the state size.  Note that ghost states have a
4443          * "ghost size" and so don't need to be updated.
4444          */
4445         if (!GHOST_STATE(state)) {
4446 
4447                 (void) refcount_add_many(&state->arcs_size, size, tag);
4448 
4449                 /*
4450                  * If this is reached via arc_read, the link is
4451                  * protected by the hash lock. If reached via
4452                  * arc_buf_alloc, the header should not be accessed by
4453                  * any other thread. And, if reached via arc_read_done,
4454                  * the hash lock will protect it if it's found in the
4455                  * hash table; otherwise no other thread should be
4456                  * trying to [add|remove]_reference it.
4457                  */
4458                 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4459                         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4460                         (void) refcount_add_many(&state->arcs_esize[type],
4461                             size, tag);
4462                 }
4463 
4464                 /*
4465                  * If we are growing the cache, and we are adding anonymous
4466                  * data, and we have outgrown arc_p, update arc_p
4467                  */
4468                 if (aggsum_compare(&arc_size, arc_c) < 0 &&
4469                     hdr->b_l1hdr.b_state == arc_anon &&
4470                     (refcount_count(&arc_anon->arcs_size) +
4471                     refcount_count(&arc_mru->arcs_size) > arc_p))
4472                         arc_p = MIN(arc_c, arc_p + size);
4473         }
4474 }
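
A minimal userland sketch of the arc_p adjustment at the end of the function above, assuming plain uint64_t counters in place of the kernel's aggsum and refcount types. The name grow_arc_p and all of its parameters are hypothetical and do not exist in arc.c.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of arc.c): returns the new arc_p.
 * arc_p only grows when the cache is below its target size (arc_c),
 * the buffer is anonymous, and anon + MRU have outgrown arc_p.
 * The result is capped at arc_c, like MIN(arc_c, arc_p + size).
 */
static uint64_t
grow_arc_p(uint64_t arc_size, uint64_t arc_c, uint64_t arc_p,
    uint64_t anon_size, uint64_t mru_size, uint64_t size, int is_anon)
{
	if (arc_size < arc_c && is_anon && anon_size + mru_size > arc_p) {
		uint64_t grown = arc_p + size;

		return (grown < arc_c ? grown : arc_c);
	}
	return (arc_p);
}
```

For example, with arc_c = 100 and arc_p = 95, adding an 8-byte anonymous buffer while anon + MRU exceed arc_p yields 100 rather than 103, because the result is clamped to arc_c.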
4475 
4476 static void
4477 arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
4478 {
4479         arc_free_data_impl(hdr, size, tag);
4480         abd_free(abd);
4481 }
4482 
4483 static void
4484 arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
4485 {
4486         arc_buf_contents_t type = arc_buf_type(hdr);
4487 
4488         arc_free_data_impl(hdr, size, tag);
4489         if (type == ARC_BUFC_METADATA) {
4490                 zio_buf_free(buf, size);
4491         } else {
4492                 ASSERT(type == ARC_BUFC_DATA);
4493                 zio_data_buf_free(buf, size);
4494         }
4495 }
4496 
4497 /*
4498  * Free the arc data buffer.
4499  */
4500 static void
4501 arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4502 {
4503         arc_state_t *state = hdr->b_l1hdr.b_state;
4504         arc_buf_contents_t type = arc_buf_type(hdr);
4505 
4506         /* protected by hash lock, if in the hash table */
4507         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4508                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4509                 ASSERT(state != arc_anon && state != arc_l2c_only);
4510 
4511                 (void) refcount_remove_many(&state->arcs_esize[type],
4512                     size, tag);
4513         }
4514         (void) refcount_remove_many(&state->arcs_size, size, tag);
4515 
4516         VERIFY3U(hdr->b_type, ==, type);
4517         if (type == ARC_BUFC_METADATA) {
4518                 arc_space_return(size, ARC_SPACE_META);
4519         } else {
4520                 ASSERT(type == ARC_BUFC_DATA);
4521                 arc_space_return(size, ARC_SPACE_DATA);
4522         }
4523 }
4524 
4525 /*
4526  * This routine is called whenever a buffer is accessed.
4527  * NOTE: the hash lock is dropped in this function.
4528  */
4529 static void
4530 arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
4531 {
4532         clock_t now;
4533 
4534         ASSERT(MUTEX_HELD(hash_lock));
4535         ASSERT(HDR_HAS_L1HDR(hdr));
4536 
4537         if (hdr->b_l1hdr.b_state == arc_anon) {

4643                 }
4644 
4645                 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4646                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4647                 arc_change_state(new_state, hdr, hash_lock);
4648 
4649                 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
4650         } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
4651                 /*
4652                  * This buffer is on the 2nd Level ARC.
4653                  */
4654 
4655                 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4656                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4657                 arc_change_state(arc_mfu, hdr, hash_lock);
4658         } else {
4659                 ASSERT(!"invalid arc state");
4660         }
4661 }
4662 
4663 /* a generic arc_done_func_t which you can use */
4664 /* ARGSUSED */
4665 void
4666 arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
4667 {
4668         if (zio == NULL || zio->io_error == 0)
4669                 bcopy(buf->b_data, arg, arc_buf_size(buf));
4670         arc_buf_destroy(buf, arg);
4671 }
4672 
4673 /* a generic arc_done_func_t */
4674 void
4675 arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
4676 {
4677         arc_buf_t **bufp = arg;
4678         if (zio && zio->io_error) {
4679                 arc_buf_destroy(buf, arg);
4680                 *bufp = NULL;
4681         } else {
4682                 *bufp = buf;

4785                         zio->io_error = error;
4786                 }
4787         }
4788         hdr->b_l1hdr.b_acb = NULL;
4789         arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
4790         if (callback_cnt == 0) {
4791                 ASSERT(HDR_PREFETCH(hdr));
4792                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
4793                 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
4794         }
4795 
4796         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
4797             callback_list != NULL);
4798 
4799         if (no_zio_error) {
4800                 arc_hdr_verify(hdr, zio->io_bp);
4801         } else {
4802                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
4803                 if (hdr->b_l1hdr.b_state != arc_anon)
4804                         arc_change_state(arc_anon, hdr, hash_lock);
4805                 if (HDR_IN_HASH_TABLE(hdr))
4806                         buf_hash_remove(hdr);
4807                 freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4808         }
4809 
4810         /*
4811          * Broadcast before we drop the hash_lock to avoid the possibility
4812          * that the hdr (and hence the cv) might be freed before we get to
4813          * the cv_broadcast().
4814          */
4815         cv_broadcast(&hdr->b_l1hdr.b_cv);
4816 
4817         if (hash_lock != NULL) {
4818                 mutex_exit(hash_lock);
4819         } else {
4820                 /*
4821                  * This block was freed while we waited for the read to
4822                  * complete.  It has been removed from the hash table and
4823                  * moved to the anonymous state (so that it won't show up
4824                  * in the cache).
4825                  */
4826                 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);

4829 
4830         /* execute each callback and free its structure */
4831         while ((acb = callback_list) != NULL) {
4832                 if (acb->acb_done)
4833                         acb->acb_done(zio, acb->acb_buf, acb->acb_private);
4834 
4835                 if (acb->acb_zio_dummy != NULL) {
4836                         acb->acb_zio_dummy->io_error = zio->io_error;
4837                         zio_nowait(acb->acb_zio_dummy);
4838                 }
4839 
4840                 callback_list = acb->acb_next;
4841                 kmem_free(acb, sizeof (arc_callback_t));
4842         }
4843 
4844         if (freeable)
4845                 arc_hdr_destroy(hdr);
4846 }
4847 
4848 /*
4849  * "Read" the block at the specified DVA (in bp) via the
4850  * cache.  If the block is found in the cache, invoke the provided
4851  * callback immediately and return.  Note that the `zio' parameter
4852  * in the callback will be NULL in this case, since no IO was
4853  * required.  If the block is not in the cache pass the read request
4854  * on to the spa with a substitute callback function, so that the
4855  * requested block will be added to the cache.
4856  *
4857  * If a read request arrives for a block that has a read in-progress,
4858  * either wait for the in-progress read to complete (and return the
4859  * results); or, if this is a read with a "done" func, add a record
4860  * to the read to invoke the "done" func when the read completes,
4861  * and return; or just return.
4862  *
4863  * arc_read_done() will invoke all the requested "done" functions
4864  * for readers of this block.
4865  */
4866 int
4867 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
4868     void *private, zio_priority_t priority, int zio_flags,

4968                                 ARCSTAT_BUMP(
4969                                     arcstat_demand_hit_predictive_prefetch);
4970                                 arc_hdr_clear_flags(hdr,
4971                                     ARC_FLAG_PREDICTIVE_PREFETCH);
4972                         }
4973                         ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));
4974 
4975                         /* Get a buf with the desired data in it. */
4976                         VERIFY0(arc_buf_alloc_impl(hdr, private,
4977                             compressed_read, B_TRUE, &buf));
4978                 } else if (*arc_flags & ARC_FLAG_PREFETCH &&
4979                     refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4980                         arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4981                 }
4982                 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
4983                 arc_access(hdr, hash_lock);
4984                 if (*arc_flags & ARC_FLAG_L2CACHE)
4985                         arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4986                 mutex_exit(hash_lock);
4987                 ARCSTAT_BUMP(arcstat_hits);
4988                 ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
4989                     demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
4990                     data, metadata, hits);
4991 
4992                 if (done)
4993                         done(NULL, buf, private);
4994         } else {
4995                 uint64_t lsize = BP_GET_LSIZE(bp);
4996                 uint64_t psize = BP_GET_PSIZE(bp);
4997                 arc_callback_t *acb;
4998                 vdev_t *vd = NULL;
4999                 uint64_t addr = 0;
5000                 boolean_t devw = B_FALSE;
5001                 uint64_t size;
5002 
5003                 if (hdr == NULL) {
5004                         /* this block is not in the cache */
5005                         arc_buf_hdr_t *exists = NULL;
5006                         arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
5007                         hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
5008                             BP_GET_COMPRESS(bp), type);
5009 
5010                         if (!BP_IS_EMBEDDED(bp)) {
5011                                 hdr->b_dva = *BP_IDENTITY(bp);
5012                                 hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
5013                                 exists = buf_hash_insert(hdr, &hash_lock);
5014                         }
5015                         if (exists != NULL) {
5016                                 /* somebody beat us to the hash insert */
5017                                 mutex_exit(hash_lock);
5018                                 buf_discard_identity(hdr);
5019                                 arc_hdr_destroy(hdr);
5020                                 goto top; /* restart the IO request */
5021                         }
5022                 } else {
5023                         /*
5024                          * This block is in the ghost cache. If it was L2-only
5025                          * (and thus didn't have an L1 hdr), we realloc the
5026                          * header to add an L1 hdr.
5027                          */
5028                         if (!HDR_HAS_L1HDR(hdr)) {
5029                                 hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
5030                                     hdr_full_cache);
5031                         }
5032                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5033                         ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
5034                         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5035                         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5036                         ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
5037                         ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
5038 
5039                         /*
5040                          * This is a delicate dance that we play here.
5041                          * This hdr is in the ghost list so we access it
5042                          * to move it out of the ghost list before we
5043                          * initiate the read. If it's a prefetch then
5044                          * it won't have a callback so we'll remove the
5045                          * reference that arc_buf_alloc_impl() created. We
5046                          * do this after we've called arc_access() to
5047                          * avoid hitting an assert in remove_reference().
5048                          */
5049                         arc_access(hdr, hash_lock);
5050                         arc_hdr_alloc_pabd(hdr);
5051                 }
5052                 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5053                 size = arc_hdr_size(hdr);
5054 
5055                 /*
5056                  * If compression is enabled on the hdr, then we will do
5057                  * RAW I/O and will store the compressed data in the hdr's

5069                 if (BP_GET_LEVEL(bp) > 0)
5070                         arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
5071                 if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
5072                         arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
5073                 ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
5074 
5075                 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
5076                 acb->acb_done = done;
5077                 acb->acb_private = private;
5078                 acb->acb_compressed = compressed_read;
5079 
5080                 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5081                 hdr->b_l1hdr.b_acb = acb;
5082                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5083 
5084                 if (HDR_HAS_L2HDR(hdr) &&
5085                     (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5086                         devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5087                         addr = hdr->b_l2hdr.b_daddr;
5088                         /*
5089                          * Lock out L2ARC device removal.
5090                          */
5091                         if (vdev_is_dead(vd) ||
5092                             !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5093                                 vd = NULL;
5094                 }
5095 
5096                 if (priority == ZIO_PRIORITY_ASYNC_READ)
5097                         arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5098                 else
5099                         arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5100 
5101                 if (hash_lock != NULL)
5102                         mutex_exit(hash_lock);
5103 
5104                 /*
5105                  * At this point, we have a level 1 cache miss.  Try again in
5106                  * L2ARC if possible.
5107                  */
5108                 ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5109 
5110                 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5111                     uint64_t, lsize, zbookmark_phys_t *, zb);
5112                 ARCSTAT_BUMP(arcstat_misses);
5113                 ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
5114                     demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
5115                     data, metadata, misses);
5116 
5117                 if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5118                         /*
5119                          * Read from the L2ARC if the following are true:
5120                          * 1. The L2ARC vdev was previously cached.
5121                          * 2. This buffer still has L2ARC metadata.
5122                          * 3. This buffer isn't currently writing to the L2ARC.
5123                          * 4. The L2ARC entry wasn't evicted, which may
5124                          *    also have invalidated the vdev.
5125                          * 5. This isn't a prefetch while l2arc_noprefetch is set.
5126                          */
5127                         if (HDR_HAS_L2HDR(hdr) &&
5128                             !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5129                             !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5130                                 l2arc_read_callback_t *cb;
5131                                 abd_t *abd;
5132                                 uint64_t asize;
5133 
5134                                 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5135                                 ARCSTAT_BUMP(arcstat_l2_hits);
5136 
5137                                 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5138                                     KM_SLEEP);
5139                                 cb->l2rcb_hdr = hdr;
5140                                 cb->l2rcb_bp = *bp;
5141                                 cb->l2rcb_zb = *zb;
5142                                 cb->l2rcb_flags = zio_flags;
5143 
5144                                 asize = vdev_psize_to_asize(vd, size);
5145                                 if (asize != size) {
5146                                         abd = abd_alloc_for_io(asize,
5147                                             HDR_ISTYPE_METADATA(hdr));
5148                                         cb->l2rcb_abd = abd;
5149                                 } else {
5150                                         abd = hdr->b_l1hdr.b_pabd;
5151                                 }
5152 
5153                                 ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5154                                     addr + asize <= vd->vdev_psize -
5155                                     VDEV_LABEL_END_SIZE);
5156 
5157                                 /*
5158                                  * l2arc read.  The SCL_L2ARC lock will be
5159                                  * released by l2arc_read_done().
5160                                  * Issue a null zio if the underlying buffer
5161                                  * was squashed to zero size by compression.
5162                                  */
5163                                 ASSERT3U(HDR_GET_COMPRESS(hdr), !=,
5164                                     ZIO_COMPRESS_EMPTY);
5165                                 rzio = zio_read_phys(pio, vd, addr,
5166                                     asize, abd,
5167                                     ZIO_CHECKSUM_OFF,
5168                                     l2arc_read_done, cb, priority,
5169                                     zio_flags | ZIO_FLAG_DONT_CACHE |
5170                                     ZIO_FLAG_CANFAIL |
5171                                     ZIO_FLAG_DONT_PROPAGATE |
5172                                     ZIO_FLAG_DONT_RETRY, B_FALSE);
5173                                 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5174                                     zio_t *, rzio);
5175                                 ARCSTAT_INCR(arcstat_l2_read_bytes, size);
5176 
5177                                 if (*arc_flags & ARC_FLAG_NOWAIT) {
5178                                         zio_nowait(rzio);
5179                                         return (0);
5180                                 }
5181 
5182                                 ASSERT(*arc_flags & ARC_FLAG_WAIT);
5183                                 if (zio_wait(rzio) == 0)
5184                                         return (0);
5185 
5186                                 /* l2arc read error; goto zio_read() */
5187                         } else {
5188                                 DTRACE_PROBE1(l2arc__miss,
5189                                     arc_buf_hdr_t *, hdr);
5190                                 ARCSTAT_BUMP(arcstat_l2_misses);
5191                                 if (HDR_L2_WRITING(hdr))
5192                                         ARCSTAT_BUMP(arcstat_l2_rw_clash);
5193                                 spa_config_exit(spa, SCL_L2ARC, vd);
5194                         }
5195                 } else {

5427                 hdr->b_l1hdr.b_bufcnt -= 1;
5428                 arc_cksum_verify(buf);
5429                 arc_buf_unwatch(buf);
5430 
5431                 mutex_exit(hash_lock);
5432 
5433                 /*
5434                  * Allocate a new hdr. The new hdr will contain a b_pabd
5435                  * buffer which will be freed in arc_write().
5436                  */
5437                 nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
5438                 ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
5439                 ASSERT0(nhdr->b_l1hdr.b_bufcnt);
5440                 ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
5441                 VERIFY3U(nhdr->b_type, ==, type);
5442                 ASSERT(!HDR_SHARED_DATA(nhdr));
5443 
5444                 nhdr->b_l1hdr.b_buf = buf;
5445                 nhdr->b_l1hdr.b_bufcnt = 1;
5446                 (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
5447                 buf->b_hdr = nhdr;
5448 
5449                 mutex_exit(&buf->b_evict_lock);
5450                 (void) refcount_add_many(&arc_anon->arcs_size,
5451                     arc_buf_size(buf), buf);
5452         } else {
5453                 mutex_exit(&buf->b_evict_lock);
5454                 ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
5455                 /* protected by hash lock, or hdr is on arc_anon */
5456                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
5457                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5458                 arc_change_state(arc_anon, hdr, hash_lock);
5459                 hdr->b_l1hdr.b_arc_access = 0;
5460                 mutex_exit(hash_lock);
5461 
5462                 buf_discard_identity(hdr);
5463                 arc_buf_thaw(buf);
5464         }
5465 }
5466

5637                 kmutex_t *hash_lock;
5638 
5639                 ASSERT3U(zio->io_error, ==, 0);
5640 
5641                 arc_cksum_verify(buf);
5642 
5643                 exists = buf_hash_insert(hdr, &hash_lock);
5644                 if (exists != NULL) {
5645                         /*
5646                          * This can only happen if we overwrite for
5647                          * sync-to-convergence, because we remove
5648                          * buffers from the hash table when we arc_free().
5649                          */
5650                         if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
5651                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5652                                         panic("bad overwrite, hdr=%p exists=%p",
5653                                             (void *)hdr, (void *)exists);
5654                                 ASSERT(refcount_is_zero(
5655                                     &exists->b_l1hdr.b_refcnt));
5656                                 arc_change_state(arc_anon, exists, hash_lock);
5657                                 mutex_exit(hash_lock);
5658                                 arc_hdr_destroy(exists);
5659                                 exists = buf_hash_insert(hdr, &hash_lock);
5660                                 ASSERT3P(exists, ==, NULL);
5661                         } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
5662                                 /* nopwrite */
5663                                 ASSERT(zio->io_prop.zp_nopwrite);
5664                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5665                                         panic("bad nopwrite, hdr=%p exists=%p",
5666                                             (void *)hdr, (void *)exists);
5667                         } else {
5668                                 /* Dedup */
5669                                 ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
5670                                 ASSERT(hdr->b_l1hdr.b_state == arc_anon);
5671                                 ASSERT(BP_GET_DEDUP(zio->io_bp));
5672                                 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
5673                         }
5674                 }
5675                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5676                 /* if it's not anon, we are doing a scrub */
5677                 if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
5678                         arc_access(hdr, hash_lock);
5679                 mutex_exit(hash_lock);
5680         } else {
5681                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5682         }
5683 
5684         ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5685         callback->awcb_done(zio, buf, callback->awcb_private);
5686 
5687         abd_put(zio->io_abd);
5688         kmem_free(callback, sizeof (arc_write_callback_t));
5689 }
5690 
5691 zio_t *
5692 arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
5693     boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
5694     arc_done_func_t *children_ready, arc_done_func_t *physdone,
5695     arc_done_func_t *done, void *private, zio_priority_t priority,
5696     int zio_flags, const zbookmark_phys_t *zb)
5697 {
5698         arc_buf_hdr_t *hdr = buf->b_hdr;
5699         arc_write_callback_t *callback;
5700         zio_t *zio;
5701         zio_prop_t localprop = *zp;
5702 
5703         ASSERT3P(ready, !=, NULL);
5704         ASSERT3P(done, !=, NULL);
5705         ASSERT(!HDR_IO_ERROR(hdr));
5706         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5707         ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5708         ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
5709         if (l2arc)
5710                 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
5711         if (ARC_BUF_COMPRESSED(buf)) {
5712                 /*
5713                  * We're writing a pre-compressed buffer.  Make the
5714                  * compression algorithm requested by the zio_prop_t match
5715                  * the pre-compressed buffer's compression algorithm.
5716                  */

5737                  * the hdr then we need to break that relationship here.
5738                  * The hdr will remain with a NULL data pointer and the
5739                  * buf will take sole ownership of the block.
5740                  */
5741                 if (arc_buf_is_shared(buf)) {
5742                         arc_unshare_buf(hdr, buf);
5743                 } else {
5744                         arc_hdr_free_pabd(hdr);
5745                 }
5746                 VERIFY3P(buf->b_data, !=, NULL);
5747                 arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
5748         }
5749         ASSERT(!arc_buf_is_shared(buf));
5750         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5751 
5752         zio = zio_write(pio, spa, txg, bp,
5753             abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
5754             HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
5755             (children_ready != NULL) ? arc_write_children_ready : NULL,
5756             arc_write_physdone, arc_write_done, callback,
5757             priority, zio_flags, zb);
5758 
5759         return (zio);
5760 }
5761 
5762 static int
5763 arc_memory_throttle(uint64_t reserve, uint64_t txg)
5764 {
5765 #ifdef _KERNEL
5766         uint64_t available_memory = ptob(freemem);
5767         static uint64_t page_load = 0;
5768         static uint64_t last_txg = 0;
5769 
5770 #if defined(__i386)
5771         available_memory =
5772             MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
5773 #endif
5774 
5775         if (freemem > physmem * arc_lotsfree_percent / 100)
5776                 return (0);
5777

5829 
5830         anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) -
5831             arc_loaned_bytes), 0);
5832 
5833         /*
5834          * Writes will, almost always, require additional memory allocations
5835          * in order to compress/encrypt/etc the data.  We therefore need to
5836          * make sure that there is sufficient available memory for this.
5837          */
5838         error = arc_memory_throttle(reserve, txg);
5839         if (error != 0)
5840                 return (error);
5841 
5842         /*
5843          * Throttle writes when the amount of dirty data in the cache
5844          * gets too large.  We try to keep the cache less than half full
5845          * of dirty blocks so that our sync times don't grow too large.
5846          * Note: if two requests come in concurrently, we might let them
5847          * both succeed, when one of them should fail.  Not a huge deal.
5848          */
5849 
5850         if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
5851             anon_size > arc_c / 4) {
5852                 uint64_t meta_esize =
5853                     refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5854                 uint64_t data_esize =
5855                     refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5856                 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
5857                     "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
5858                     arc_tempreserve >> 10, meta_esize >> 10,
5859                     data_esize >> 10, reserve >> 10, arc_c >> 10);
5860                 return (SET_ERROR(ERESTART));
5861         }
5862         atomic_add_64(&arc_tempreserve, reserve);
5863         return (0);
5864 }
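
A minimal userland sketch of the dirty-data throttle test used above, assuming plain uint64_t values rather than the kernel's globals and refcounts. The name dirty_throttle_hit is hypothetical; it only restates the two-part condition and does not model the arc_memory_throttle() check or the atomic update of arc_tempreserve.

```c
#include <stdint.h>

/*
 * Hypothetical predicate (not part of arc.c): returns 1 when the
 * reservation would be refused with ERESTART, 0 when it would be
 * granted. Both conditions must hold: the total of pending and
 * anonymous (dirty) data exceeds half the cache target, and the
 * anonymous data alone exceeds a quarter of it.
 */
static int
dirty_throttle_hit(uint64_t reserve, uint64_t tempreserve,
    uint64_t anon_size, uint64_t arc_c)
{
	return (reserve + tempreserve + anon_size > arc_c / 2 &&
	    anon_size > arc_c / 4);
}
```

With arc_c = 1000, a request of 400 on top of 200 bytes of anonymous data passes (the total of 600 exceeds 500, but 200 does not exceed 250), whereas 100 + 200 reserved on top of 300 anonymous bytes is refused.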
5865 
5866 static void
5867 arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
5868     kstat_named_t *evict_data, kstat_named_t *evict_metadata)
5869 {
5870         size->value.ui64 = refcount_count(&state->arcs_size);
5871         evict_data->value.ui64 =
5872             refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
5873         evict_metadata->value.ui64 =
5874             refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
5875 }
5876 
5877 static int
5878 arc_kstat_update(kstat_t *ksp, int rw)
5879 {
5880         arc_stats_t *as = ksp->ks_data;
5881 
5882         if (rw == KSTAT_WRITE) {
5883                 return (EACCES);
5884         } else {
5885                 arc_kstat_update_state(arc_anon,
5886                     &as->arcstat_anon_size,
5887                     &as->arcstat_anon_evictable_data,
5888                     &as->arcstat_anon_evictable_metadata);
5889                 arc_kstat_update_state(arc_mru,
5890                     &as->arcstat_mru_size,
5891                     &as->arcstat_mru_evictable_data,
5892                     &as->arcstat_mru_evictable_metadata);
5893                 arc_kstat_update_state(arc_mru_ghost,
5894                     &as->arcstat_mru_ghost_size,
5895                     &as->arcstat_mru_ghost_evictable_data,
5896                     &as->arcstat_mru_ghost_evictable_metadata);
5897                 arc_kstat_update_state(arc_mfu,
5898                     &as->arcstat_mfu_size,
5899                     &as->arcstat_mfu_evictable_data,
5900                     &as->arcstat_mfu_evictable_metadata);
5901                 arc_kstat_update_state(arc_mfu_ghost,
5902                     &as->arcstat_mfu_ghost_size,
5903                     &as->arcstat_mfu_ghost_evictable_data,
5904                     &as->arcstat_mfu_ghost_evictable_metadata);
5905 
5906                 ARCSTAT(arcstat_size) = aggsum_value(&arc_size);
5907                 ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used);
5908                 ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size);
5909                 ARCSTAT(arcstat_metadata_size) =
5910                     aggsum_value(&astat_metadata_size);
5911                 ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size);
5912                 ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size);
5913                 ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size);
5914         }
5915 
5916         return (0);
5917 }
5918 
5919 /*
5920  * This function *must* return indices evenly distributed between all
5921  * sublists of the multilist. This is needed due to how the ARC eviction
5922  * code is laid out; arc_evict_state() assumes ARC buffers are evenly
5923  * distributed between all sublists and uses this assumption when
5924  * deciding which sublist to evict from and how much to evict from it.
5925  */
5926 unsigned int
5927 arc_state_multilist_index_func(multilist_t *ml, void *obj)
5928 {
5929         arc_buf_hdr_t *hdr = obj;
5930 
5931         /*
5932          * We rely on b_dva to generate evenly distributed index
5933          * numbers using buf_hash below. So, as an added precaution,

5943          * on insertion, as this index can be recalculated on removal.
5944          *
5945          * Also, the low order bits of the hash value are thought to be
5946          * distributed evenly. Otherwise, in the case that the multilist
5947          * has a power of two number of sublists, each sublist's usage
5948          * would not be evenly distributed.
5949          */
5950         return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
5951             multilist_get_num_sublists(ml));
5952 }
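
The "hash, then take the remainder" shape described in the comment above can be sketched outside the kernel as follows. This is a minimal illustration only: toy_hash() is a hypothetical stand-in (a splitmix64-style mixer), not the real buf_hash(), and the parameter names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for buf_hash(): a 64-bit mixer in the style of
 * the splitmix64 finalizer.  The real ARC hash is different; this only
 * illustrates hashing an identity and reducing it modulo the sublist count.
 */
static uint64_t
toy_hash(uint64_t spa, uint64_t dva0, uint64_t dva1, uint64_t birth)
{
	uint64_t x = spa ^ (dva0 * 0x9E3779B97F4A7C15ULL) ^
	    (dva1 << 1) ^ birth;

	x ^= x >> 30;
	x *= 0xBF58476D1CE4E5B9ULL;
	x ^= x >> 27;
	x *= 0x94D049BB133111EBULL;
	x ^= x >> 31;
	return (x);
}

/* Map a header identity to a sublist index in [0, nsublists). */
static unsigned int
toy_sublist_index(uint64_t spa, uint64_t dva0, uint64_t dva1,
    uint64_t birth, unsigned int nsublists)
{
	assert(nsublists > 0);
	return ((unsigned int)(toy_hash(spa, dva0, dva1, birth) % nsublists));
}
```

Because the index depends only on fields that do not change while the header is on a list, the same index can be recomputed on removal, which is the property the original comment relies on.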
5953 
5954 static void
5955 arc_state_init(void)
5956 {
5957         arc_anon = &ARC_anon;
5958         arc_mru = &ARC_mru;
5959         arc_mru_ghost = &ARC_mru_ghost;
5960         arc_mfu = &ARC_mfu;
5961         arc_mfu_ghost = &ARC_mfu_ghost;
5962         arc_l2c_only = &ARC_l2c_only;

5963 
5964         arc_mru->arcs_list[ARC_BUFC_METADATA] =

5965             multilist_create(sizeof (arc_buf_hdr_t),
5966             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5967             arc_state_multilist_index_func);
5968         arc_mru->arcs_list[ARC_BUFC_DATA] =
5969             multilist_create(sizeof (arc_buf_hdr_t),
5970             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5971             arc_state_multilist_index_func);
5972         arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
5973             multilist_create(sizeof (arc_buf_hdr_t),
5974             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5975             arc_state_multilist_index_func);
5976         arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
5977             multilist_create(sizeof (arc_buf_hdr_t),
5978             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5979             arc_state_multilist_index_func);
5980         arc_mfu->arcs_list[ARC_BUFC_METADATA] =
5981             multilist_create(sizeof (arc_buf_hdr_t),
5982             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5983             arc_state_multilist_index_func);
5984         arc_mfu->arcs_list[ARC_BUFC_DATA] =
5985             multilist_create(sizeof (arc_buf_hdr_t),
5986             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5987             arc_state_multilist_index_func);
5988         arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
5989             multilist_create(sizeof (arc_buf_hdr_t),
5990             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5991             arc_state_multilist_index_func);
5992         arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
5993             multilist_create(sizeof (arc_buf_hdr_t),
5994             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5995             arc_state_multilist_index_func);
5996         arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
5997             multilist_create(sizeof (arc_buf_hdr_t),
5998             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5999             arc_state_multilist_index_func);
6000         arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
6001             multilist_create(sizeof (arc_buf_hdr_t),
6002             offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6003             arc_state_multilist_index_func);
6004 
6005         refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6006         refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6007         refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6008         refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6009         refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6010         refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6011         refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6012         refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6013         refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6014         refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6015         refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6016         refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
6017 



6018         refcount_create(&arc_anon->arcs_size);
6019         refcount_create(&arc_mru->arcs_size);
6020         refcount_create(&arc_mru_ghost->arcs_size);
6021         refcount_create(&arc_mfu->arcs_size);
6022         refcount_create(&arc_mfu_ghost->arcs_size);
6023         refcount_create(&arc_l2c_only->arcs_size);
6024 
6025         aggsum_init(&arc_meta_used, 0);
6026         aggsum_init(&arc_size, 0);
6027         aggsum_init(&astat_data_size, 0);
6028         aggsum_init(&astat_metadata_size, 0);
6029         aggsum_init(&astat_hdr_size, 0);
6030         aggsum_init(&astat_other_size, 0);
6031         aggsum_init(&astat_l2_hdr_size, 0);
6032 }
6033 
6034 static void
6035 arc_state_fini(void)
6036 {
6037         refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6038         refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6039         refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6040         refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6041         refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6042         refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6043         refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6044         refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6045         refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6046         refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6047         refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6048         refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
6049 
6050         refcount_destroy(&arc_anon->arcs_size);
6051         refcount_destroy(&arc_mru->arcs_size);
6052         refcount_destroy(&arc_mru_ghost->arcs_size);
6053         refcount_destroy(&arc_mfu->arcs_size);
6054         refcount_destroy(&arc_mfu_ghost->arcs_size);
6055         refcount_destroy(&arc_l2c_only->arcs_size);
6056 
6057         multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
6058         multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
6059         multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
6060         multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
6061         multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
6062         multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
6063         multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
6064         multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);






6065 }
6066 
6067 uint64_t
6068 arc_max_bytes(void)
6069 {
6070         return (arc_c_max);
6071 }
6072 
6073 void
6074 arc_init(void)
6075 {
6076         /*
6077          * allmem is "all memory that we could possibly use".
6078          */
6079 #ifdef _KERNEL
6080         uint64_t allmem = ptob(physmem - swapfs_minfree);
6081 #else
6082         uint64_t allmem = (physmem * PAGESIZE) / 2;
6083 #endif
6084

6104          * small, because it can cause transactions to be larger than
6105          * arc_c, causing arc_tempreserve_space() to fail.
6106          */
6107 #ifndef _KERNEL
6108         arc_c_min = arc_c_max / 2;
6109 #endif
6110 
6111         /*
6112          * Allow the tunables to override our calculations if they are
6113          * reasonable (i.e., over 64MB)
6114          */
6115         if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6116                 arc_c_max = zfs_arc_max;
6117                 arc_c_min = MIN(arc_c_min, arc_c_max);
6118         }
6119         if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6120                 arc_c_min = zfs_arc_min;
6121 
6122         arc_c = arc_c_max;
6123         arc_p = (arc_c >> 1);

6124 


6125         /* limit metadata to 1/4 of the arc capacity */
6126         arc_meta_limit = arc_c_max / 4;
6127 
6128 #ifdef _KERNEL
6129         /*
6130          * Metadata is stored in the kernel's heap.  Don't let us
6131          * use more than half the heap for the ARC.
6132          */
6133         arc_meta_limit = MIN(arc_meta_limit,
6134             vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6135 #endif
6136 
6137         /* Allow the tunable to override if it is reasonable */






6138         if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6139                 arc_meta_limit = zfs_arc_meta_limit;
6140 
6141         if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6142                 arc_c_min = arc_meta_limit / 2;
6143 
6144         if (zfs_arc_meta_min > 0) {
6145                 arc_meta_min = zfs_arc_meta_min;
6146         } else {
6147                 arc_meta_min = arc_c_min / 2;
6148         }
6149 
6150         if (zfs_arc_grow_retry > 0)
6151                 arc_grow_retry = zfs_arc_grow_retry;
6152 
6153         if (zfs_arc_shrink_shift > 0)
6154                 arc_shrink_shift = zfs_arc_shrink_shift;
6155 
6156         /*
6157          * Ensure that arc_no_grow_shift is less than arc_shrink_shift.

6212         /*
6213          * The reclaim thread will set arc_reclaim_thread_exit back to
6214          * B_FALSE when it is finished exiting; we're waiting for that.
6215          */
6216         while (arc_reclaim_thread_exit) {
6217                 cv_signal(&arc_reclaim_thread_cv);
6218                 cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock);
6219         }
6220         mutex_exit(&arc_reclaim_lock);
6221 
6222         /* Use B_TRUE to ensure *all* buffers are evicted */
6223         arc_flush(NULL, B_TRUE);
6224 
6225         arc_dead = B_TRUE;
6226 
6227         if (arc_ksp != NULL) {
6228                 kstat_delete(arc_ksp);
6229                 arc_ksp = NULL;
6230         }
6231 


6232         mutex_destroy(&arc_reclaim_lock);
6233         cv_destroy(&arc_reclaim_thread_cv);
6234         cv_destroy(&arc_reclaim_waiters_cv);
6235 
6236         arc_state_fini();
6237         buf_fini();
6238 
6239         ASSERT0(arc_loaned_bytes);
6240 }
6241 
6242 /*
6243  * Level 2 ARC
6244  *
6245  * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
6246  * It uses dedicated storage devices to hold cached data, which are populated
6247  * using large infrequent writes.  The main role of this cache is to boost
6248  * the performance of random read workloads.  The intended L2ARC devices
6249  * include short-stroked disks, solid state disks, and other media with
6250  * substantially faster read latency than disk.
6251  *

6365  *      l2arc_noprefetch        skip caching prefetched buffers
6366  *      l2arc_headroom          number of max device writes to precache
6367  *      l2arc_headroom_boost    when we find compressed buffers during ARC
6368  *                              scanning, we multiply headroom by this
6369  *                              percentage factor for the next scan cycle,
6370  *                              since more compressed buffers are likely to
6371  *                              be present
6372  *      l2arc_feed_secs         seconds between L2ARC writing
6373  *
6374  * Tunables may be removed or added as future performance improvements are
6375  * integrated, and also may become zpool properties.
6376  *
6377  * There are three key functions that control how the L2ARC warms up:
6378  *
6379  *      l2arc_write_eligible()  check if a buffer is eligible to cache
6380  *      l2arc_write_size()      calculate how much to write
6381  *      l2arc_write_interval()  calculate sleep delay between writes
6382  *
6383  * These three functions determine what to write, how much, and how quickly
6384  * to send writes.
6385  */
6386 
6387 static boolean_t
6388 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
6389 {
6390         /*
6391          * A buffer is *not* eligible for the L2ARC if it:
6392          * 1. belongs to a different spa.
6393          * 2. is already cached on the L2ARC.
6394          * 3. has an I/O in progress (it may be an incomplete read).
6395          * 4. is flagged not eligible (zfs property).
6396          */
6397         if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
6398             HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
6399                 return (B_FALSE);
6400 
6401         return (B_TRUE);
6402 }
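
The four exclusion rules listed in l2arc_write_eligible() amount to a single predicate. The sketch below restates it over a simplified, hypothetical header type (toy_hdr_t and its fields are assumptions for illustration; the real code tests flag macros on arc_buf_hdr_t).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of the header state the check consults. */
typedef struct toy_hdr {
	uint64_t th_spa;		/* owning pool guid */
	bool th_has_l2hdr;		/* already cached on the L2ARC */
	bool th_io_in_progress;		/* may be an incomplete read */
	bool th_l2cache;		/* flagged eligible by property */
} toy_hdr_t;

/* A buffer is eligible only if none of the four exclusions applies. */
static bool
toy_write_eligible(uint64_t spa_guid, const toy_hdr_t *hdr)
{
	assert(hdr != NULL);
	if (hdr->th_spa != spa_guid || hdr->th_has_l2hdr ||
	    hdr->th_io_in_progress || !hdr->th_l2cache)
		return (false);

	return (true);
}
```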
6403 
6404 static uint64_t

6430 {
6431         clock_t interval, next, now;
6432 
6433         /*
6434          * If the ARC lists are busy, increase our write rate; if the
6435          * lists are stale, idle back.  This is achieved by checking
6436          * how much we previously wrote - if it was more than half of
6437          * what we wanted, schedule the next write much sooner.
6438          */
6439         if (l2arc_feed_again && wrote > (wanted / 2))
6440                 interval = (hz * l2arc_feed_min_ms) / 1000;
6441         else
6442                 interval = hz * l2arc_feed_secs;
6443 
6444         now = ddi_get_lbolt();
6445         next = MAX(now, MIN(now + interval, began + interval));
6446 
6447         return (next);
6448 }
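
The scheduling arithmetic in l2arc_write_interval() can be tried in isolation. The sketch below passes the tunables in as parameters rather than reading globals; ticks_t and the parameter names are assumptions for the example, standing in for clock_t, hz, l2arc_feed_secs and l2arc_feed_min_ms.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ticks_t;	/* stand-in for clock_t (lbolt ticks) */

/*
 * Compute the tick at which the next feed should start.  A short interval
 * is used when the previous pass wrote more than half of what it wanted.
 */
static ticks_t
toy_write_interval(ticks_t now, ticks_t began, uint64_t wanted,
    uint64_t wrote, int feed_again, ticks_t hz, ticks_t feed_secs,
    ticks_t feed_min_ms)
{
	ticks_t interval, soonest;

	assert(hz > 0);
	if (feed_again && wrote > (wanted / 2))
		interval = (hz * feed_min_ms) / 1000;
	else
		interval = hz * feed_secs;

	/* MIN(now + interval, began + interval) */
	soonest = (now < began) ? now + interval : began + interval;

	/* MAX(now, soonest): never schedule in the past. */
	return ((soonest > now) ? soonest : now);
}
```

With hz = 100, a one second feed period and a 200 ms minimum, a busy pass that began at tick 1000 and finished at tick 1005 is rescheduled for tick 1020, while an idle pass that finished at tick 1050 waits until tick 1100.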
6449 






6450 /*
6451  * Cycle through L2ARC devices.  This is how L2ARC load balances.
6452  * If a device is returned, this also returns holding the spa config lock.
6453  */
6454 static l2arc_dev_t *
6455 l2arc_dev_get_next(void)
6456 {
6457         l2arc_dev_t *first, *next = NULL;
6458 
6459         /*
6460          * Lock out the removal of spas (spa_namespace_lock), then removal
6461          * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
6462          * both locks will be dropped and a spa config lock held instead.
6463          */
6464         mutex_enter(&spa_namespace_lock);
6465         mutex_enter(&l2arc_dev_mtx);
6466 
6467         /* if there are no vdevs, there is nothing to do */
6468         if (l2arc_ndev == 0)
6469                 goto out;
6470 
6471         first = NULL;


6472         next = l2arc_dev_last;
6473         do {
6474                 /* loop around the list looking for a non-faulted vdev */
6475                 if (next == NULL) {
6476                         next = list_head(l2arc_dev_list);
6477                 } else {


6478                         next = list_next(l2arc_dev_list, next);
6479                         if (next == NULL)





6480                                 next = list_head(l2arc_dev_list);
6481                 }
6482 
6483                 /* if we have come back to the start, bail out */
6484                 if (first == NULL)
6485                         first = next;
6486                 else if (next == first)




6487                         break;


6488 
6489         } while (vdev_is_dead(next->l2ad_vdev));
6490 
6491         /* if we were unable to find any usable vdevs, return NULL */
6492         if (vdev_is_dead(next->l2ad_vdev))
6493                 next = NULL;
6494 
6495         l2arc_dev_last = next;
6496 
6497 out:
6498         mutex_exit(&l2arc_dev_mtx);
6499 
6500         /*
6501          * Grab the config lock to prevent the 'next' device from being
6502          * removed while we are writing to it.
6503          */
6504         if (next != NULL)
6505                 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
6506         mutex_exit(&spa_namespace_lock);
6507 
6508         return (next);
6509 }
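
The round-robin selection in l2arc_dev_get_next() can be illustrated without the locking or the kernel list API. The sketch below uses a plain array and an index in place of l2arc_dev_list and l2arc_dev_last; toy_dev_t and its td_dead field are hypothetical stand-ins for l2arc_dev_t and vdev_is_dead().

```c
#include <assert.h>

typedef struct toy_dev {
	int td_dead;	/* nonzero stands in for vdev_is_dead() */
} toy_dev_t;

/*
 * Return the index of the next live device after 'last' (or from the
 * start when last is -1), wrapping around the array.  Every device is
 * visited at most once, including 'last' itself; -1 means none is usable.
 */
static int
toy_dev_get_next(const toy_dev_t *devs, int ndev, int last)
{
	assert(ndev >= 0 && last >= -1 && last < ndev);

	/* if there are no devices, there is nothing to do */
	if (ndev == 0)
		return (-1);

	for (int step = 1; step <= ndev; step++) {
		int idx = (last + step) % ndev;

		if (!devs[idx].td_dead)
			return (idx);
	}

	/* came back to the start without finding a usable device */
	return (-1);
}
```

The caller would store the returned index as the new "last" value, so that successive calls spread writes across the live devices.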
6510 
6511 /*
6512  * Free buffers that were tagged for destruction.
6513  */
6514 static void

6527                 list_remove(buflist, df);
6528                 kmem_free(df, sizeof (l2arc_data_free_t));
6529         }
6530 
6531         mutex_exit(&l2arc_free_on_write_mtx);
6532 }
6533 
6534 /*
6535  * A write to a cache device has completed.  Update all headers to allow
6536  * reads from these buffers to begin.
6537  */
6538 static void
6539 l2arc_write_done(zio_t *zio)
6540 {
6541         l2arc_write_callback_t *cb;
6542         l2arc_dev_t *dev;
6543         list_t *buflist;
6544         arc_buf_hdr_t *head, *hdr, *hdr_prev;
6545         kmutex_t *hash_lock;
6546         int64_t bytes_dropped = 0;

6547 
6548         cb = zio->io_private;
6549         ASSERT3P(cb, !=, NULL);
6550         dev = cb->l2wcb_dev;
6551         ASSERT3P(dev, !=, NULL);
6552         head = cb->l2wcb_head;
6553         ASSERT3P(head, !=, NULL);
6554         buflist = &dev->l2ad_buflist;
6555         ASSERT3P(buflist, !=, NULL);
6556         DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
6557             l2arc_write_callback_t *, cb);
6558 
6559         if (zio->io_error != 0)
6560                 ARCSTAT_BUMP(arcstat_l2_writes_error);
6561 
6562         /*
6563          * All writes completed, or an error was hit.
6564          */
6565 top:
6566         mutex_enter(&dev->l2ad_mtx);

6623                         bytes_dropped += arc_hdr_size(hdr);
6624                         (void) refcount_remove_many(&dev->l2ad_alloc,
6625                             arc_hdr_size(hdr), hdr);
6626                 }
6627 
6628                 /*
6629                  * Allow ARC to begin reads and ghost list evictions to
6630                  * this L2ARC entry.
6631                  */
6632                 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
6633 
6634                 mutex_exit(hash_lock);
6635         }
6636 
6637         atomic_inc_64(&l2arc_writes_done);
6638         list_remove(buflist, head);
6639         ASSERT(!HDR_HAS_L1HDR(head));
6640         kmem_cache_free(hdr_l2only_cache, head);
6641         mutex_exit(&dev->l2ad_mtx);
6642 

6643         vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
6644 
6645         l2arc_do_free_on_write();
6646 



6647         kmem_free(cb, sizeof (l2arc_write_callback_t));
6648 }
6649 
6650 /*
6651  * A read to a cache device completed.  Validate buffer contents before
6652  * handing over to the regular ARC routines.
6653  */
6654 static void
6655 l2arc_read_done(zio_t *zio)
6656 {
6657         l2arc_read_callback_t *cb;
6658         arc_buf_hdr_t *hdr;
6659         kmutex_t *hash_lock;
6660         boolean_t valid_cksum;
6661 
6662         ASSERT3P(zio->io_vd, !=, NULL);
6663         ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
6664 
6665         spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
6666

6733                  * storage now.  If there *is* a waiter, the caller must
6734                  * issue the i/o in a context where it's OK to block.
6735                  */
6736                 if (zio->io_waiter == NULL) {
6737                         zio_t *pio = zio_unique_parent(zio);
6738 
6739                         ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
6740 
6741                         zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
6742                             hdr->b_l1hdr.b_pabd, zio->io_size, arc_read_done,
6743                             hdr, zio->io_priority, cb->l2rcb_flags,
6744                             &cb->l2rcb_zb));
6745                 }
6746         }
6747 
6748         kmem_free(cb, sizeof (l2arc_read_callback_t));
6749 }
6750 
6751 /*
6752  * This is the list priority from which the L2ARC will search for pages to
6753  * cache.  This is used within loops (0..3) to cycle through lists in the
6754  * desired order.  This order can have a significant effect on cache
6755  * performance.
6756  *
6757  * Currently the metadata lists are hit first, MFU then MRU, followed by
6758  * the data lists.  This function returns a locked list, and also returns
6759  * the lock pointer.
6760  */
6761 static multilist_sublist_t *
6762 l2arc_sublist_lock(int list_num)
6763 {
6764         multilist_t *ml = NULL;
6765         unsigned int idx;
6766 
6767         ASSERT(list_num >= 0 && list_num <= 3);

6768 
6769         switch (list_num) {
6770         case 0:






6771                 ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
6772                 break;
6773         case 1:
6774                 ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
6775                 break;
6776         case 2:
6777                 ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
6778                 break;
6779         case 3:
6780                 ml = arc_mru->arcs_list[ARC_BUFC_DATA];
6781                 break;
6782         }
6783 
6784         /*
6785          * Return a randomly-selected sublist. This is acceptable
6786          * because the caller feeds only a little bit of data for each
6787          * call (8MB). Subsequent calls will result in different
6788          * sublists being selected.
6789          */
6790         idx = multilist_get_random_index(ml);
6791         return (multilist_sublist_lock(ml, idx));
6792 }
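
The list priority that l2arc_sublist_lock() encodes in its switch statement can also be written as a lookup table, which makes the order easier to see at a glance. This is only an illustrative restatement; the toy_* enum and struct names are assumptions, not types from arc.c.

```c
#include <assert.h>

typedef enum { TOY_MRU, TOY_MFU } toy_state_t;
typedef enum { TOY_DATA, TOY_METADATA } toy_type_t;

typedef struct toy_list_id {
	toy_state_t tl_state;
	toy_type_t tl_type;
} toy_list_id_t;

/* Metadata lists first (MFU then MRU), followed by the data lists. */
static toy_list_id_t
toy_list_priority(int list_num)
{
	static const toy_list_id_t order[] = {
		{ TOY_MFU, TOY_METADATA },
		{ TOY_MRU, TOY_METADATA },
		{ TOY_MFU, TOY_DATA },
		{ TOY_MRU, TOY_DATA },
	};

	assert(list_num >= 0 && list_num <= 3);
	return (order[list_num]);
}
```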
6793 
6794 /*
6795  * Evict buffers from the device write hand to the distance specified in
6796  * bytes.  This distance may span populated buffers, or it may span nothing.
6797  * This is clearing a region on the L2ARC device ready for writing.
6798  * If the 'all' boolean is set, every buffer is evicted.
6799  */
6800 static void
6801 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
6802 {
6803         list_t *buflist;
6804         arc_buf_hdr_t *hdr, *hdr_prev;
6805         kmutex_t *hash_lock;
6806         uint64_t taddr;
6807 
6808         buflist = &dev->l2ad_buflist;
6809 
6810         if (!all && dev->l2ad_first) {
6811                 /*
6812                  * This is the first sweep through the device.  There is
6813                  * nothing to evict.
6814                  */
6815                 return;
6816         }
6817 




6818         if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
6819                 /*
6820                  * When nearing the end of the device, evict to the end
6821                  * before the device write hand jumps to the start.
6822                  */
6823                 taddr = dev->l2ad_end;
6824         } else {
6825                 taddr = dev->l2ad_hand + distance;
6826         }
6827         DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
6828             uint64_t, taddr, boolean_t, all);
6829 
6830 top:
6831         mutex_enter(&dev->l2ad_mtx);
6832         for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
6833                 hdr_prev = list_prev(buflist, hdr);
6834 
6835                 hash_lock = HDR_LOCK(hdr);
6836 
6837                 /*

6881                 } else {
6882                         ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
6883                         ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
6884                         /*
6885                          * Invalidate issued or about to be issued
6886                          * reads, since we may be about to write
6887                          * over this location.
6888                          */
6889                         if (HDR_L2_READING(hdr)) {
6890                                 ARCSTAT_BUMP(arcstat_l2_evict_reading);
6891                                 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
6892                         }
6893 
6894                         arc_hdr_l2hdr_destroy(hdr);
6895                 }
6896                 mutex_exit(hash_lock);
6897         }
6898         mutex_exit(&dev->l2ad_mtx);
6899 }
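
The choice of eviction target address near the top of l2arc_evict() is a small piece of arithmetic that can be checked on its own. The sketch below takes the device fields as plain parameters; the function name and the precondition assert are assumptions added for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the address up to which buffers should be evicted.  When the
 * write hand is within two 'distance' units of the end of the device,
 * evict all the way to the end, since the hand is about to wrap.
 */
static uint64_t
toy_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	/* Guard the unsigned subtraction below (an added assumption). */
	assert(end >= 2 * distance);

	if (hand >= (end - (2 * distance)))
		return (end);

	return (hand + distance);
}
```

For a device ending at 1000 with a distance of 50, a hand at 100 yields a target of 150, while a hand at 900 or beyond yields 1000.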
6900 
6901 /*
6902  * Find and write ARC buffers to the L2ARC device.
6903  *
6904  * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
6905  * for reading until they have completed writing.
6906  * The headroom_boost is an in-out parameter used to maintain headroom boost
6907  * state between calls to this function.
6908  *
6909  * Returns the number of bytes actually written (which may be smaller than
6910  * the delta by which the device hand has changed due to alignment).
6911  */
6912 static uint64_t
6913 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)

6914 {
6915         arc_buf_hdr_t *hdr, *hdr_prev, *head;
6916         uint64_t write_asize, write_psize, write_lsize, headroom;
6917         boolean_t full;
6918         l2arc_write_callback_t *cb;
6919         zio_t *pio, *wzio;

6920         uint64_t guid = spa_load_guid(spa);

6921 
6922         ASSERT3P(dev->l2ad_vdev, !=, NULL);
6923 
6924         pio = NULL;

6925         write_lsize = write_asize = write_psize = 0;
6926         full = B_FALSE;
6927         head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
6928         arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
6929 
6930         /*
6931          * Copy buffers for L2ARC writing.
6932          */
6933         for (int try = 0; try <= 3; try++) {
6934                 multilist_sublist_t *mls = l2arc_sublist_lock(try);
6935                 uint64_t passed_sz = 0;
6936 
6937                 /*
6938                  * L2ARC fast warmup.
6939                  *
6940                  * Until the ARC is warm and starts to evict, read from the
6941                  * head of the ARC lists rather than the tail.
6942                  */
6943                 if (arc_warm == B_FALSE)
6944                         hdr = multilist_sublist_head(mls);
6945                 else
6946                         hdr = multilist_sublist_tail(mls);
6947 
6948                 headroom = target_sz * l2arc_headroom;
6949                 if (zfs_compressed_arc_enabled)
6950                         headroom = (headroom * l2arc_headroom_boost) / 100;
6951 
6952                 for (; hdr; hdr = hdr_prev) {
6953                         kmutex_t *hash_lock;

6983                          * We rely on the L1 portion of the header below, so
6984                          * it's invalid for this header to have been evicted out
6985                          * of the ghost cache, prior to being written out. The
6986                          * ARC_FLAG_L2_WRITING bit ensures this won't happen.
6987                          */
6988                         ASSERT(HDR_HAS_L1HDR(hdr));
6989 
6990                         ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
6991                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
6992                         ASSERT3U(arc_hdr_size(hdr), >, 0);
6993                         uint64_t psize = arc_hdr_size(hdr);
6994                         uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
6995                             psize);
6996 
6997                         if ((write_asize + asize) > target_sz) {
6998                                 full = B_TRUE;
6999                                 mutex_exit(hash_lock);
7000                                 break;
7001                         }
7002 









7003                         if (pio == NULL) {
7004                                 /*
7005                                  * Insert a dummy header on the buflist so
7006                                  * l2arc_write_done() can find where the
7007                                  * write buffers begin without searching.
7008                                  */
7009                                 mutex_enter(&dev->l2ad_mtx);
7010                                 list_insert_head(&dev->l2ad_buflist, head);
7011                                 mutex_exit(&dev->l2ad_mtx);
7012 
7013                                 cb = kmem_alloc(
7014                                     sizeof (l2arc_write_callback_t), KM_SLEEP);
7015                                 cb->l2wcb_dev = dev;
7016                                 cb->l2wcb_head = head;



7017                                 pio = zio_root(spa, l2arc_write_done, cb,
7018                                     ZIO_FLAG_CANFAIL);
7019                         }
7020 
7021                         hdr->b_l2hdr.b_dev = dev;
7022                         hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
7023                         arc_hdr_set_flags(hdr,
7024                             ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
7025 
7026                         mutex_enter(&dev->l2ad_mtx);
7027                         list_insert_head(&dev->l2ad_buflist, hdr);
7028                         mutex_exit(&dev->l2ad_mtx);
7029 
7030                         (void) refcount_add_many(&dev->l2ad_alloc, psize, hdr);
7031 
7032                         /*
7033                          * Normally the L2ARC can use the hdr's data, but if
7034                          * we're sharing data between the hdr and one of its
7035                          * bufs, L2ARC needs its own copy of the data so that
7036                          * the ZIO below can't race with the buf consumer.
7037                          * Another case where we need to create a copy of the
7038                          * data is when the buffer size is not device-aligned
7039                          * and we need to pad the block to make it such.
7040                          * That also keeps the clock hand suitably aligned.
7041                          *
7042                          * To ensure that the copy will be available for the
7043                          * lifetime of the ZIO and be cleaned up afterwards, we
7044                          * add it to the l2arc_free_on_write queue.
7045                          */
7046                         abd_t *to_write;
7047                         if (!HDR_SHARED_DATA(hdr) && psize == asize) {
7048                                 to_write = hdr->b_l1hdr.b_pabd;
7049                         } else {
7050                                 to_write = abd_alloc_for_io(asize,
7051                                     HDR_ISTYPE_METADATA(hdr));
7052                                 abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
7053                                 if (asize != psize) {
7054                                         abd_zero_off(to_write, psize,
7055                                             asize - psize);
7056                                 }
7057                                 l2arc_free_abd_on_write(to_write, asize,
7058                                     arc_buf_type(hdr));
7059                         }
7060                         wzio = zio_write_phys(pio, dev->l2ad_vdev,
7061                             hdr->b_l2hdr.b_daddr, asize, to_write,
7062                             ZIO_CHECKSUM_OFF, NULL, hdr,
7063                             ZIO_PRIORITY_ASYNC_WRITE,
7064                             ZIO_FLAG_CANFAIL, B_FALSE);
7065 
7066                         write_lsize += HDR_GET_LSIZE(hdr);
7067                         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
7068                             zio_t *, wzio);
7069 
7070                         write_psize += psize;
7071                         write_asize += asize;
7072                         dev->l2ad_hand += asize;
7073 
7074                         mutex_exit(hash_lock);
7075 
7076                         (void) zio_nowait(wzio);








7077                 }

7078 
7079                 multilist_sublist_unlock(mls);
7080 
7081                 if (full == B_TRUE)
7082                         break;
7083         }
7084 
7085         /* No buffers selected for writing? */
7086         if (pio == NULL) {
7087                 ASSERT0(write_lsize);
7088                 ASSERT(!HDR_HAS_L1HDR(head));
7089                 kmem_cache_free(hdr_l2only_cache, head);
7090                 return (0);
7091         }
7092 







7093         ASSERT3U(write_asize, <=, target_sz);
7094         ARCSTAT_BUMP(arcstat_l2_writes_sent);
7095         ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);


7096         ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
7097         ARCSTAT_INCR(arcstat_l2_psize, write_psize);
7098         vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
7099 
7100         /*
7101          * Bump device hand to the device start if it is approaching the end.
7102          * l2arc_evict() will already have evicted ahead for this case.
7103          */
7104         if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {

7105                 dev->l2ad_hand = dev->l2ad_start;
7106                 dev->l2ad_first = B_FALSE;
7107         }
7108 
7109         dev->l2ad_writing = B_TRUE;
7110         (void) zio_wait(pio);
7111         dev->l2ad_writing = B_FALSE;
7112 
7113         return (write_asize);
7114 }
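
The end of l2arc_write_buffers() advances the device hand and wraps it when the next feed might not fit. The following is a minimal userland sketch of that arithmetic only, not the kernel code. The hand_model_t type and hand_advance() are hypothetical names introduced for illustration, and the sketch assumes a single wrap check per feed cycle, as in the listing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the l2ad_* fields of l2arc_dev_t. */
typedef struct hand_model {
	uint64_t start;	/* first usable byte (models l2ad_start) */
	uint64_t end;	/* one past the last usable byte (models l2ad_end) */
	uint64_t hand;	/* next write offset (models l2ad_hand) */
	bool first;	/* still on the first sweep (models l2ad_first) */
} hand_model_t;

/*
 * Account for one feed cycle that wrote `wrote` bytes against a target of
 * `target_sz` bytes, then wrap the hand to the device start if another
 * full-sized feed might not fit before the end.
 * Returns true if the hand wrapped.
 */
static bool
hand_advance(hand_model_t *m, uint64_t wrote, uint64_t target_sz)
{
	assert(m->start < m->end);
	assert(wrote <= target_sz);
	assert(target_sz <= m->end - m->start);

	m->hand += wrote;
	if (m->hand >= m->end - target_sz) {
		m->hand = m->start;
		m->first = false;
		return (true);
	}
	return (false);
}
```

In this model the wrap threshold depends on the target size rather than on the bytes actually written, which mirrors the comparison against (dev->l2ad_end - target_sz) in the listing.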
7115 
7116 /*
7117  * This thread feeds the L2ARC at regular intervals.  This is the beating
7118  * heart of the L2ARC.
7119  */
7120 /* ARGSUSED */
7121 static void
7122 l2arc_feed_thread(void *unused)
7123 {
7124         callb_cpr_t cpr;
7125         l2arc_dev_t *dev;
7126         spa_t *spa;
7127         uint64_t size, wrote;
7128         clock_t begin, next = ddi_get_lbolt();

7129 
7130         CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
7131 
7132         mutex_enter(&l2arc_feed_thr_lock);
7133 
7134         while (l2arc_thread_exit == 0) {
7135                 CALLB_CPR_SAFE_BEGIN(&cpr);
7136                 (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
7137                     next);
7138                 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
7139                 next = ddi_get_lbolt() + hz;
7140 
7141                 /*
7142                  * Quick check for L2ARC devices.
7143                  */
7144                 mutex_enter(&l2arc_dev_mtx);
7145                 if (l2arc_ndev == 0) {
7146                         mutex_exit(&l2arc_dev_mtx);
7147                         continue;
7148                 }
7149                 mutex_exit(&l2arc_dev_mtx);
7150                 begin = ddi_get_lbolt();
7151 
7152                 /*
7153                  * This selects the next l2arc device to write to, and in
7154                  * doing so the next spa to feed from: dev->l2ad_spa.   This
7155                  * will return NULL if there are now no l2arc devices or if
7156                  * they are all faulted.
7157                  *
7158                  * If a device is returned, its spa's config lock is also
7159                  * held to prevent device removal.  l2arc_dev_get_next()
7160                  * will grab and release l2arc_dev_mtx.
7161                  */
7162                 if ((dev = l2arc_dev_get_next()) == NULL)
7163                         continue;
7164 
7165                 spa = dev->l2ad_spa;
7166                 ASSERT3P(spa, !=, NULL);
7167 
7168                 /*
7169                  * If the pool is read-only then force the feed thread to
7170                  * sleep a little longer.
7171                  */
7172                 if (!spa_writeable(spa)) {
7173                         next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
7174                         spa_config_exit(spa, SCL_L2ARC, dev);
7175                         continue;
7176                 }
7177 
7178                 /*
7179                  * Avoid contributing to memory pressure.
7180                  */
7181                 if (arc_reclaim_needed()) {
7182                         ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
7183                         spa_config_exit(spa, SCL_L2ARC, dev);
7184                         continue;
7185                 }
7186 
7187                 ARCSTAT_BUMP(arcstat_l2_feeds);




7188 
7189                 size = l2arc_write_size();





7190 
7191                 /*
7192                  * Evict L2ARC buffers that will be overwritten.
7193                  */
7194                 l2arc_evict(dev, size, B_FALSE);
7195 
7196                 /*
7197                  * Write ARC buffers.
7198                  */
7199                 wrote = l2arc_write_buffers(spa, dev, size);
7200 
7201                 /*
7202                  * Calculate interval between writes.
7203                  */
7204                 next = l2arc_write_interval(begin, size, wrote);
7205                 spa_config_exit(spa, SCL_L2ARC, dev);


7206         }
7207 
7208         l2arc_thread_exit = 0;
7209         cv_broadcast(&l2arc_feed_thr_cv);
7210         CALLB_CPR_EXIT(&cpr);               /* drops l2arc_feed_thr_lock */
7211         thread_exit();
7212 }
7213 
7214 boolean_t
7215 l2arc_vdev_present(vdev_t *vd)
7216 {
7217         l2arc_dev_t *dev;

7218 

7219         mutex_enter(&l2arc_dev_mtx);
7220         for (dev = list_head(l2arc_dev_list); dev != NULL;
7221             dev = list_next(l2arc_dev_list, dev)) {
7222                 if (dev->l2ad_vdev == vd)
7223                         break;
7224         }

7225         mutex_exit(&l2arc_dev_mtx);
7226 
7227         return (dev != NULL);
7228 }
7229 
7230 /*
7231  * Add a vdev for use by the L2ARC.  By this point the spa has already
7232  * validated the vdev and opened it.

7233  */
7234 void
7235 l2arc_add_vdev(spa_t *spa, vdev_t *vd)
7236 {
7237         l2arc_dev_t *adddev;
7238 
7239         ASSERT(!l2arc_vdev_present(vd));
7240 
7241         /*
7242          * Create a new l2arc device entry.
7243          */
7244         adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
7245         adddev->l2ad_spa = spa;
7246         adddev->l2ad_vdev = vd;
7247         adddev->l2ad_start = VDEV_LABEL_START_SIZE;



7248         adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);

7249         adddev->l2ad_hand = adddev->l2ad_start;
7250         adddev->l2ad_first = B_TRUE;
7251         adddev->l2ad_writing = B_FALSE;


7252 
7253         mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
7254         /*
7255          * This is a list of all ARC buffers that are still valid on the
7256          * device.
7257          */
7258         list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
7259             offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
7260 
7261         vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
7262         refcount_create(&adddev->l2ad_alloc);
7263 
7264         /*
7265          * Add device to global list
7266          */
7267         mutex_enter(&l2arc_dev_mtx);
7268         list_insert_head(l2arc_dev_list, adddev);
7269         atomic_inc_64(&l2arc_ndev);
7270         mutex_exit(&l2arc_dev_mtx);
7271 }
7272 
7273 /*
7274  * Remove a vdev from the L2ARC.
7275  */
7276 void
7277 l2arc_remove_vdev(vdev_t *vd)
7278 {
7279         l2arc_dev_t *dev, *nextdev, *remdev = NULL;
7280 
7281         /*
7282          * Find the device by vdev
7283          */
7284         mutex_enter(&l2arc_dev_mtx);
7285         for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
7286                 nextdev = list_next(l2arc_dev_list, dev);
7287                 if (vd == dev->l2ad_vdev) {
7288                         remdev = dev;
7289                         break;
7290                 }
7291         }
7292         ASSERT3P(remdev, !=, NULL);
7293 
7294         /*
7295          * Remove device from global list
7296          */
7297         list_remove(l2arc_dev_list, remdev);
7298         l2arc_dev_last = NULL;          /* may have been invalidated */

7299         atomic_dec_64(&l2arc_ndev);
7300         mutex_exit(&l2arc_dev_mtx);
7301 




7302         /*
7303          * Clear all buflists and ARC references.  L2ARC device flush.
7304          */
7305         l2arc_evict(remdev, 0, B_TRUE);




7306         list_destroy(&remdev->l2ad_buflist);
7307         mutex_destroy(&remdev->l2ad_mtx);
7308         refcount_destroy(&remdev->l2ad_alloc);
7309         kmem_free(remdev, sizeof (l2arc_dev_t));

7310 }
7311 
7312 void
7313 l2arc_init(void)
7314 {
7315         l2arc_thread_exit = 0;
7316         l2arc_ndev = 0;
7317         l2arc_writes_sent = 0;
7318         l2arc_writes_done = 0;
7319 
7320         mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
7321         cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
7322         mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
7323         mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
7324 
7325         l2arc_dev_list = &L2ARC_dev_list;
7326         l2arc_free_on_write = &L2ARC_free_on_write;
7327         list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
7328             offsetof(l2arc_dev_t, l2ad_node));
7329         list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),

7355 {
7356         if (!(spa_mode_global & FWRITE))
7357                 return;
7358 
7359         (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
7360             TS_RUN, minclsyspri);
7361 }
7362 
7363 void
7364 l2arc_stop(void)
7365 {
7366         if (!(spa_mode_global & FWRITE))
7367                 return;
7368 
7369         mutex_enter(&l2arc_feed_thr_lock);
7370         cv_signal(&l2arc_feed_thr_cv);      /* kick thread out of startup */
7371         l2arc_thread_exit = 1;
7372         while (l2arc_thread_exit != 0)
7373                 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
7374         mutex_exit(&l2arc_feed_thr_lock);
7375 }
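
The shutdown handshake between l2arc_stop() and l2arc_feed_thread() can be modeled in userland with POSIX threads. This is a sketch under stated assumptions, not the kernel implementation: feed_lock, feed_cv, thread_exit_flag, feed_thread() and feed_stop() are hypothetical names, and pthread_cond_wait() stands in for the kernel's cv_timedwait() and cv_wait().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t feed_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t feed_cv = PTHREAD_COND_INITIALIZER;
static int thread_exit_flag;	/* models l2arc_thread_exit */

/* Models the feed thread: loop until asked to exit, then acknowledge. */
static void *
feed_thread(void *unused)
{
	(void) unused;
	pthread_mutex_lock(&feed_lock);
	while (thread_exit_flag == 0) {
		/* The kernel thread uses a timed wait and then feeds. */
		pthread_cond_wait(&feed_cv, &feed_lock);
	}
	/* Acknowledge the request by clearing the flag and waking the stopper. */
	thread_exit_flag = 0;
	pthread_cond_broadcast(&feed_cv);
	pthread_mutex_unlock(&feed_lock);
	return (NULL);
}

/* Models the stopper: request exit, then wait for the acknowledgement. */
static void
feed_stop(void)
{
	pthread_mutex_lock(&feed_lock);
	pthread_cond_signal(&feed_cv);	/* wake the thread if it is sleeping */
	thread_exit_flag = 1;
	while (thread_exit_flag != 0)
		pthread_cond_wait(&feed_cv, &feed_lock);
	assert(thread_exit_flag == 0);
	pthread_mutex_unlock(&feed_lock);
}
```

Because the flag is only read and written with the lock held, the sketch behaves the same whether the thread is already sleeping or has not yet entered its loop when feed_stop() runs.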

   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 /*
  22  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23  * Copyright (c) 2018, Joyent, Inc.
  24  * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25  * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
  26  * Copyright 2019 Nexenta Systems, Inc.  All rights reserved.
  27  */
  28 
  29 /*
  30  * DVA-based Adjustable Replacement Cache
  31  *
  32  * While much of the theory of operation used here is
  33  * based on the self-tuning, low overhead replacement cache
  34  * presented by Megiddo and Modha at FAST 2003, there are some
  35  * significant differences:
  36  *
  37  * 1. The Megiddo and Modha model assumes any page is evictable.
  38  * Pages in its cache cannot be "locked" into memory.  This makes
  39  * the eviction algorithm simple: evict the last page in the list.
  40  * This also makes the performance characteristics easy to reason
  41  * about.  Our cache is not so simple.  At any given moment, some
  42  * subset of the blocks in the cache are un-evictable because we
  43  * have handed out a reference to them.  Blocks are only evictable
  44  * when there are no external references active.  This makes
  45  * eviction far more problematic:  we choose to evict the evictable
  46  * blocks that are the "lowest" in the list.

 236  * it may compress the data before writing it to disk. The ARC will be called
 237  * with the transformed data and will bcopy the transformed on-disk block into
 238  * a newly allocated b_pabd. Writes are always done into buffers which have
 239  * either been loaned (and hence are new and don't have other readers) or
 240  * buffers which have been released (and hence have their own hdr, if there
 241  * were originally other readers of the buf's original hdr). This ensures that
 242  * the ARC only needs to update a single buf and its hdr after a write occurs.
 243  *
 244  * When the L2ARC is in use, it will also take advantage of the b_pabd. The
 245  * L2ARC will always write the contents of b_pabd to the L2ARC. This means
 246  * that when compressed ARC is enabled, the L2ARC blocks are identical
 247  * to the on-disk block in the main data pool. This provides a significant
 248  * advantage since the ARC can leverage the bp's checksum when reading from the
 249  * L2ARC to determine if the contents are valid. However, if the compressed
 250  * ARC is disabled, then the L2ARC's block must be transformed to look
 251  * like the physical block in the main data pool before comparing the
 252  * checksum and determining its validity.
 253  */
 254 
 255 #include <sys/spa.h>
 256 #include <sys/spa_impl.h>
 257 #include <sys/zio.h>
 258 #include <sys/spa_impl.h>
 259 #include <sys/zio_compress.h>
 260 #include <sys/zio_checksum.h>
 261 #include <sys/zfs_context.h>
 262 #include <sys/arc.h>
 263 #include <sys/refcount.h>
 264 #include <sys/vdev.h>
 265 #include <sys/vdev_impl.h>
 266 #include <sys/dsl_pool.h>
 267 #include <sys/zio_checksum.h>
 268 #include <sys/multilist.h>
 269 #include <sys/abd.h>
 270 #ifdef _KERNEL
 271 #include <sys/vmsystm.h>
 272 #include <vm/anon.h>
 273 #include <sys/fs/swapnode.h>
 274 #include <sys/dnlc.h>
 275 #endif
 276 #include <sys/callb.h>
 277 #include <sys/kstat.h>
 278 #include <zfs_fletcher.h>
 279 #include <sys/byteorder.h>
 280 #include <sys/spa_impl.h>
 281 
 282 #ifndef _KERNEL
 283 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 284 boolean_t arc_watch = B_FALSE;
 285 int arc_procfd;
 286 #endif
 287 
 288 static kmutex_t         arc_reclaim_lock;
 289 static kcondvar_t       arc_reclaim_thread_cv;
 290 static boolean_t        arc_reclaim_thread_exit;
 291 static kcondvar_t       arc_reclaim_waiters_cv;
 292 
 293 uint_t arc_reduce_dnlc_percent = 3;
 294 
 295 /*
 296  * The number of headers to evict in arc_evict_state_impl() before
 297  * dropping the sublist lock and evicting from another sublist. A lower
 298  * value means we're more likely to evict the "correct" header (i.e. the
 299  * oldest header in the arc state), but comes with higher overhead
 300  * (i.e. more invocations of arc_evict_state_impl()).

 341 
 342 static int arc_dead;
 343 
 344 /*
 345  * The arc has filled available memory and has now warmed up.
 346  */
 347 static boolean_t arc_warm;
 348 
 349 /*
 350  * log2 fraction of the zio arena to keep free.
 351  */
 352 int arc_zio_arena_free_shift = 2;
 353 
 354 /*
 355  * These tunables are for performance analysis.
 356  */
 357 uint64_t zfs_arc_max;
 358 uint64_t zfs_arc_min;
 359 uint64_t zfs_arc_meta_limit = 0;
 360 uint64_t zfs_arc_meta_min = 0;
 361 uint64_t zfs_arc_ddt_limit = 0;
 362 /*
 363  * Tunable to control "dedup ceiling"
 364  * Possible values:
 365  *  DDT_NO_LIMIT        - default behaviour, ie no ceiling
 366  *  DDT_LIMIT_TO_ARC    - stop DDT growth if DDT is bigger than its "ARC space"
 367  *  DDT_LIMIT_TO_L2ARC  - stop DDT growth when DDT size is bigger than the
 368  *                        L2ARC DDT dev(s) for that pool
 369  */
 370 zfs_ddt_limit_t zfs_ddt_limit_type = DDT_LIMIT_TO_ARC;
 371 /*
 372  * Alternative to the above way of controlling "dedup ceiling":
 373  * Stop DDT growth when in core DDTs size is above the below tunable.
 374  * This tunable overrides the zfs_ddt_limit_type tunable.
 375  */
 376 uint64_t zfs_ddt_byte_ceiling = 0;
 377 boolean_t zfs_arc_segregate_ddt = B_TRUE;
 378 int zfs_arc_grow_retry = 0;
 379 int zfs_arc_shrink_shift = 0;
 380 int zfs_arc_p_min_shift = 0;
 381 int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 382 
 383 /* Tunable, default is 64, which is essentially arbitrary */
 384 int zfs_flush_ntasks = 64;
 385 
 386 boolean_t zfs_compressed_arc_enabled = B_TRUE;
 387 
 388 /*
 389  * Note that buffers can be in one of 6 states:
 390  *      ARC_anon        - anonymous (discussed below)
 391  *      ARC_mru         - recently used, currently cached
 392  *      ARC_mru_ghost   - recently used, no longer in cache
 393  *      ARC_mfu         - frequently used, currently cached
 394  *      ARC_mfu_ghost   - frequently used, no longer in cache
 395  *      ARC_l2c_only    - exists in L2ARC but not other states
 396  * When there are no active references to a buffer, it is
 397  * linked onto a list in one of these arc states.  These are
 398  * the only buffers that can be evicted or deleted.  Within each
 399  * state there are multiple lists, one for meta-data and one for
 400  * non-meta-data.  Meta-data (indirect blocks, blocks of dnodes,
 401  * etc.) is tracked separately so that it can be managed more
 402  * explicitly: favored over data, limited explicitly.
 403  *
 404  * Anonymous buffers are buffers that are not associated with
 405  * a DVA.  These are buffers that hold dirty block copies

 411  * The ARC_l2c_only state is for buffers that are in the second
 412  * level ARC but no longer in any of the ARC_m* lists.  The second
 413  * level ARC itself may also contain buffers that are in any of
 414  * the ARC_m* states - meaning that a buffer can exist in two
 415  * places.  The reason for the ARC_l2c_only state is to keep the
 416  * buffer header in the hash table, so that reads that hit the
 417  * second level ARC benefit from these fast lookups.
 418  */
 419 
 420 typedef struct arc_state {
 421         /*
 422          * list of evictable buffers
 423          */
 424         multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
 425         /*
 426          * total amount of evictable data in this state
 427          */
 428         refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 429         /*
 430          * total amount of data in this state; this includes: evictable,
 431          * non-evictable, ARC_BUFC_DATA, ARC_BUFC_METADATA and ARC_BUFC_DDT.
 432          * ARC_BUFC_DDT list is only populated when zfs_arc_segregate_ddt is
 433          * true.
 434          */
 435         refcount_t arcs_size;
 436 } arc_state_t;
 437 
 438 /*
 439  * We loop through these in l2arc_write_buffers() starting from
 440  * PRIORITY_MFU_DDT until we reach PRIORITY_NUMTYPES or the buffer that we
 441  * will be writing to L2ARC dev gets full.
 442  */
 443 enum l2arc_priorities {
 444         PRIORITY_MFU_DDT,
 445         PRIORITY_MRU_DDT,
 446         PRIORITY_MFU_META,
 447         PRIORITY_MRU_META,
 448         PRIORITY_MFU_DATA,
 449         PRIORITY_MRU_DATA,
 450         PRIORITY_NUMTYPES,
 451 };
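
The priority walk described in the comment above can be sketched as a budgeted loop. This is a simplified model under stated assumptions, not the logic of l2arc_write_buffers(): select_by_priority() and the pending/taken arrays are hypothetical, and the model accounts in bytes per priority class, whereas the real code selects individual buffer headers.

```c
#include <assert.h>
#include <stdint.h>

enum l2arc_priorities {
	PRIORITY_MFU_DDT,
	PRIORITY_MRU_DDT,
	PRIORITY_MFU_META,
	PRIORITY_MRU_META,
	PRIORITY_MFU_DATA,
	PRIORITY_MRU_DATA,
	PRIORITY_NUMTYPES,
};

/*
 * Hypothetical model: pending[p] is the number of bytes eligible at
 * priority p.  Bytes are taken in priority order until target_sz is
 * used up; taken[p] receives the bytes selected from each class.
 * Returns the total number of bytes selected.
 */
static uint64_t
select_by_priority(const uint64_t pending[PRIORITY_NUMTYPES],
    uint64_t taken[PRIORITY_NUMTYPES], uint64_t target_sz)
{
	uint64_t total = 0;
	int p;

	for (p = PRIORITY_MFU_DDT; p < PRIORITY_NUMTYPES; p++)
		taken[p] = 0;

	for (p = PRIORITY_MFU_DDT; p < PRIORITY_NUMTYPES; p++) {
		uint64_t room = target_sz - total;

		taken[p] = pending[p] < room ? pending[p] : room;
		total += taken[p];
		if (total == target_sz)
			break;	/* full: lower priorities get nothing */
	}
	assert(total <= target_sz);
	return (total);
}
```

With pending sizes of 10, 0, 30, 5, 100 and 100 bytes and a 60-byte target, the model takes everything from the DDT and metadata classes, 15 bytes from PRIORITY_MFU_DATA, and nothing from PRIORITY_MRU_DATA.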
 452 
 453 /* The 6 states: */
 454 static arc_state_t ARC_anon;
 455 static arc_state_t ARC_mru;
 456 static arc_state_t ARC_mru_ghost;
 457 static arc_state_t ARC_mfu;
 458 static arc_state_t ARC_mfu_ghost;
 459 static arc_state_t ARC_l2c_only;
 460 
 461 typedef struct arc_stats {
 462         kstat_named_t arcstat_hits;
 463         kstat_named_t arcstat_ddt_hits;
 464         kstat_named_t arcstat_misses;
 465         kstat_named_t arcstat_demand_data_hits;
 466         kstat_named_t arcstat_demand_data_misses;
 467         kstat_named_t arcstat_demand_metadata_hits;
 468         kstat_named_t arcstat_demand_metadata_misses;
 469         kstat_named_t arcstat_demand_ddt_hits;
 470         kstat_named_t arcstat_demand_ddt_misses;
 471         kstat_named_t arcstat_prefetch_data_hits;
 472         kstat_named_t arcstat_prefetch_data_misses;
 473         kstat_named_t arcstat_prefetch_metadata_hits;
 474         kstat_named_t arcstat_prefetch_metadata_misses;
 475         kstat_named_t arcstat_prefetch_ddt_hits;
 476         kstat_named_t arcstat_prefetch_ddt_misses;
 477         kstat_named_t arcstat_mru_hits;
 478         kstat_named_t arcstat_mru_ghost_hits;
 479         kstat_named_t arcstat_mfu_hits;
 480         kstat_named_t arcstat_mfu_ghost_hits;
 481         kstat_named_t arcstat_deleted;
 482         /*
 483          * Number of buffers that could not be evicted because the hash lock
 484          * was held by another thread.  The lock may not necessarily be held
 485          * by something using the same buffer, since hash locks are shared
 486          * by multiple buffers.
 487          */
 488         kstat_named_t arcstat_mutex_miss;
 489         /*
 490          * Number of buffers skipped when updating the access state due to the
 491          * header having already been released after acquiring the hash lock.
 492          */
 493         kstat_named_t arcstat_access_skip;
 494         /*
 495          * Number of buffers skipped because they have I/O in progress, are
 496          * indirect prefetch buffers that have not lived long enough, or are
 497          * not from the spa we're trying to evict from.
 498          */
 499         kstat_named_t arcstat_evict_skip;
 500         /*
 501          * Number of times arc_evict_state() was unable to evict enough
 502          * buffers to reach its target amount.
 503          */
 504         kstat_named_t arcstat_evict_not_enough;
 505         kstat_named_t arcstat_evict_l2_cached;
 506         kstat_named_t arcstat_evict_l2_eligible;
 507         kstat_named_t arcstat_evict_l2_ineligible;
 508         kstat_named_t arcstat_evict_l2_skip;
 509         kstat_named_t arcstat_hash_elements;
 510         kstat_named_t arcstat_hash_elements_max;
 511         kstat_named_t arcstat_hash_collisions;
 512         kstat_named_t arcstat_hash_chains;
 513         kstat_named_t arcstat_hash_chain_max;
 514         kstat_named_t arcstat_p;
 515         kstat_named_t arcstat_c;
 516         kstat_named_t arcstat_c_min;
 517         kstat_named_t arcstat_c_max;

 518         kstat_named_t arcstat_size;
 519         /*
 520          * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
 521          * Note that the compressed bytes may match the uncompressed bytes
 522          * if the block is either not compressed or compressed arc is disabled.
 523          */
 524         kstat_named_t arcstat_compressed_size;
 525         /*
 526          * Uncompressed size of the data stored in b_pabd. If compressed
 527          * arc is disabled then this value will be identical to the stat
 528          * above.
 529          */
 530         kstat_named_t arcstat_uncompressed_size;
 531         /*
 532          * Number of bytes stored in all the arc_buf_t's. This is classified
 533          * as "overhead" since this data is typically short-lived and will
 534          * be evicted from the arc when it becomes unreferenced unless the
 535          * zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level
 536          * values have been set (see comment in dbuf.c for more information).
 537          */
 538         kstat_named_t arcstat_overhead_size;
 539         /*
 540          * Number of bytes consumed by internal ARC structures necessary
 541          * for tracking purposes; these structures are not actually
 542          * backed by ARC buffers. This includes arc_buf_hdr_t structures
 543          * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
 544          * caches), and arc_buf_t structures (allocated via arc_buf_t
 545          * cache).

 546          */
 547         kstat_named_t arcstat_hdr_size;
 548         /*
 549          * Number of bytes consumed by ARC buffers of type equal to
 550          * ARC_BUFC_DATA. This is generally consumed by buffers backing
 551          * on disk user data (e.g. plain file contents).

 552          */
 553         kstat_named_t arcstat_data_size;
 554         /*
 555          * Number of bytes consumed by ARC buffers of type equal to
 556          * ARC_BUFC_METADATA. This is generally consumed by buffers
 557          * backing on disk data that is used for internal ZFS
 558          * structures (e.g. ZAP, dnode, indirect blocks, etc).

 559          */
 560         kstat_named_t arcstat_metadata_size;
 561         /*
 562          * Number of bytes consumed by ARC buffers of type equal to
 563          * ARC_BUFC_DDT. This is consumed by buffers backing on disk data
 564          * that is used to store DDT (ZAP, ddt stats).
 565          * Only used if zfs_arc_segregate_ddt is true.
 566          */
 567         kstat_named_t arcstat_ddt_size;
 568         /*
 569          * Number of bytes consumed by various buffers and structures
 570          * not actually backed with ARC buffers. This includes bonus
 571          * buffers (allocated directly via zio_buf_* functions),
 572          * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
 573          * cache), and dnode_t structures (allocated via dnode_t cache).

 574          */
 575         kstat_named_t arcstat_other_size;
 576         /*
 577          * Total number of bytes consumed by ARC buffers residing in the
 578          * arc_anon state. This includes *all* buffers in the arc_anon
 579          * state; e.g. data, metadata, evictable, and unevictable buffers
 580          * are all included in this value.

 581          */
 582         kstat_named_t arcstat_anon_size;
 583         /*
 584          * Number of bytes consumed by ARC buffers that meet the
 585          * following criteria: backing buffers of type ARC_BUFC_DATA,
 586          * residing in the arc_anon state, and are eligible for eviction
 587          * (e.g. have no outstanding holds on the buffer).

 588          */
 589         kstat_named_t arcstat_anon_evictable_data;
 590         /*
 591          * Number of bytes consumed by ARC buffers that meet the
 592          * following criteria: backing buffers of type ARC_BUFC_METADATA,
 593          * residing in the arc_anon state, and are eligible for eviction
 594          * (e.g. have no outstanding holds on the buffer).

 595          */
 596         kstat_named_t arcstat_anon_evictable_metadata;
 597         /*
 598          * Number of bytes consumed by ARC buffers that meet the
 599          * following criteria: backing buffers of type ARC_BUFC_DDT,
 600          * residing in the arc_anon state, and are eligible for eviction.
 601          * Only used if zfs_arc_segregate_ddt is true.
 602          */
 603         kstat_named_t arcstat_anon_evictable_ddt;
 604         /*
 605          * Total number of bytes consumed by ARC buffers residing in the
 606          * arc_mru state. This includes *all* buffers in the arc_mru
 607          * state; e.g. data, metadata, evictable, and unevictable buffers
 608          * are all included in this value.

 609          */
 610         kstat_named_t arcstat_mru_size;
 611         /*
 612          * Number of bytes consumed by ARC buffers that meet the
 613          * following criteria: backing buffers of type ARC_BUFC_DATA,
 614          * residing in the arc_mru state, and are eligible for eviction
 615          * (e.g. have no outstanding holds on the buffer).

 616          */
 617         kstat_named_t arcstat_mru_evictable_data;
 618         /*
 619          * Number of bytes consumed by ARC buffers that meet the
 620          * following criteria: backing buffers of type ARC_BUFC_METADATA,
 621          * residing in the arc_mru state, and are eligible for eviction
 622          * (e.g. have no outstanding holds on the buffer).

 623          */
 624         kstat_named_t arcstat_mru_evictable_metadata;
 625         /*
 626          * Number of bytes consumed by ARC buffers that meet the
 627          * following criteria: backing buffers of type ARC_BUFC_DDT,
 628          * residing in the arc_mru state, and are eligible for eviction
 629          * (e.g. have no outstanding holds on the buffer).
 630          * Only used if zfs_arc_segregate_ddt is true.
 631          */
 632         kstat_named_t arcstat_mru_evictable_ddt;
 633         /*
 634          * Total number of bytes that *would have been* consumed by ARC
 635          * buffers in the arc_mru_ghost state. The key thing to note
 636          * here, is the fact that this size doesn't actually indicate
 637          * RAM consumption. The ghost lists only consist of headers and
 638          * don't actually have ARC buffers linked off of these headers.
 639          * Thus, *if* the headers had associated ARC buffers, these
 640          * buffers *would have* consumed this number of bytes.

 641          */
 642         kstat_named_t arcstat_mru_ghost_size;
 643         /*
 644          * Number of bytes that *would have been* consumed by ARC
 645          * buffers that are eligible for eviction, of type
 646          * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.

 647          */
 648         kstat_named_t arcstat_mru_ghost_evictable_data;
 649         /*
 650          * Number of bytes that *would have been* consumed by ARC
 651          * buffers that are eligible for eviction, of type
 652          * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.

 653          */
 654         kstat_named_t arcstat_mru_ghost_evictable_metadata;
 655         /*
 656          * Number of bytes that *would have been* consumed by ARC
 657          * buffers that are eligible for eviction, of type
 658          * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
 659          * Only used if zfs_arc_segregate_ddt is true.
 660          */
 661         kstat_named_t arcstat_mru_ghost_evictable_ddt;
 662         /*
 663          * Total number of bytes consumed by ARC buffers residing in the
 664          * arc_mfu state. This includes *all* buffers in the arc_mfu
 665          * state; e.g. data, metadata, evictable, and unevictable buffers
 666          * are all included in this value.

 667          */
 668         kstat_named_t arcstat_mfu_size;
 669         /*
 670          * Number of bytes consumed by ARC buffers that are eligible for
 671          * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
 672          * state.

 673          */
 674         kstat_named_t arcstat_mfu_evictable_data;
 675         /*
 676          * Number of bytes consumed by ARC buffers that are eligible for
 677          * eviction, of type ARC_BUFC_METADATA, and reside in the
 678          * arc_mfu state.

 679          */
 680         kstat_named_t arcstat_mfu_evictable_metadata;
 681         /*
 682          * Number of bytes consumed by ARC buffers that are eligible for
 683          * eviction, of type ARC_BUFC_DDT, and reside in the
 684          * arc_mfu state.
 685          * Only used if zfs_arc_segregate_ddt is true.
 686          */
 687         kstat_named_t arcstat_mfu_evictable_ddt;
 688         /*
 689          * Total number of bytes that *would have been* consumed by ARC
 690          * buffers in the arc_mfu_ghost state. See the comment above
 691          * arcstat_mru_ghost_size for more details.

 692          */
 693         kstat_named_t arcstat_mfu_ghost_size;
 694         /*
 695          * Number of bytes that *would have been* consumed by ARC
 696          * buffers that are eligible for eviction, of type
 697          * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.

 698          */
 699         kstat_named_t arcstat_mfu_ghost_evictable_data;
 700         /*
 701          * Number of bytes that *would have been* consumed by ARC
 702          * buffers that are eligible for eviction, of type
 703          * ARC_BUFC_METADATA, and linked off the arc_mfu_ghost state.
 704          */
 705         kstat_named_t arcstat_mfu_ghost_evictable_metadata;
 706         /*
 707          * Number of bytes that *would have been* consumed by ARC
 708          * buffers that are eligible for eviction, of type
 709          * ARC_BUFC_DDT, and linked off the arc_mfu_ghost state.
 710          * Only used if zfs_arc_segregate_ddt is true.
 711          */
 712         kstat_named_t arcstat_mfu_ghost_evictable_ddt;
 713         kstat_named_t arcstat_l2_hits;
 714         kstat_named_t arcstat_l2_ddt_hits;
 715         kstat_named_t arcstat_l2_misses;
 716         kstat_named_t arcstat_l2_feeds;
 717         kstat_named_t arcstat_l2_rw_clash;
 718         kstat_named_t arcstat_l2_read_bytes;
 719         kstat_named_t arcstat_l2_ddt_read_bytes;
 720         kstat_named_t arcstat_l2_write_bytes;
 721         kstat_named_t arcstat_l2_ddt_write_bytes;
 722         kstat_named_t arcstat_l2_writes_sent;
 723         kstat_named_t arcstat_l2_writes_done;
 724         kstat_named_t arcstat_l2_writes_error;
 725         kstat_named_t arcstat_l2_writes_lock_retry;
 726         kstat_named_t arcstat_l2_evict_lock_retry;
 727         kstat_named_t arcstat_l2_evict_reading;
 728         kstat_named_t arcstat_l2_evict_l1cached;
 729         kstat_named_t arcstat_l2_free_on_write;
 730         kstat_named_t arcstat_l2_abort_lowmem;
 731         kstat_named_t arcstat_l2_cksum_bad;
 732         kstat_named_t arcstat_l2_io_error;
 733         kstat_named_t arcstat_l2_lsize;
 734         kstat_named_t arcstat_l2_psize;

 735         kstat_named_t arcstat_l2_hdr_size;
 736         kstat_named_t arcstat_l2_log_blk_writes;
 737         kstat_named_t arcstat_l2_log_blk_avg_size;
 738         kstat_named_t arcstat_l2_data_to_meta_ratio;
 739         kstat_named_t arcstat_l2_rebuild_successes;
 740         kstat_named_t arcstat_l2_rebuild_abort_unsupported;
 741         kstat_named_t arcstat_l2_rebuild_abort_io_errors;
 742         kstat_named_t arcstat_l2_rebuild_abort_cksum_errors;
 743         kstat_named_t arcstat_l2_rebuild_abort_loop_errors;
 744         kstat_named_t arcstat_l2_rebuild_abort_lowmem;
 745         kstat_named_t arcstat_l2_rebuild_size;
 746         kstat_named_t arcstat_l2_rebuild_bufs;
 747         kstat_named_t arcstat_l2_rebuild_bufs_precached;
 748         kstat_named_t arcstat_l2_rebuild_psize;
 749         kstat_named_t arcstat_l2_rebuild_log_blks;
 750         kstat_named_t arcstat_memory_throttle_count;

 751         kstat_named_t arcstat_meta_used;
 752         kstat_named_t arcstat_meta_limit;
 753         kstat_named_t arcstat_meta_max;
 754         kstat_named_t arcstat_meta_min;
 755         kstat_named_t arcstat_ddt_limit;
 756         kstat_named_t arcstat_sync_wait_for_async;
 757         kstat_named_t arcstat_demand_hit_predictive_prefetch;
 758 } arc_stats_t;
 759 
 760 static arc_stats_t arc_stats = {
 761         { "hits",                       KSTAT_DATA_UINT64 },
 762         { "ddt_hits",                   KSTAT_DATA_UINT64 },
 763         { "misses",                     KSTAT_DATA_UINT64 },
 764         { "demand_data_hits",           KSTAT_DATA_UINT64 },
 765         { "demand_data_misses",         KSTAT_DATA_UINT64 },
 766         { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 767         { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
 768         { "demand_ddt_hits",            KSTAT_DATA_UINT64 },
 769         { "demand_ddt_misses",          KSTAT_DATA_UINT64 },
 770         { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 771         { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 772         { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 773         { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
 774         { "prefetch_ddt_hits",          KSTAT_DATA_UINT64 },
 775         { "prefetch_ddt_misses",        KSTAT_DATA_UINT64 },
 776         { "mru_hits",                   KSTAT_DATA_UINT64 },
 777         { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 778         { "mfu_hits",                   KSTAT_DATA_UINT64 },
 779         { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 780         { "deleted",                    KSTAT_DATA_UINT64 },
 781         { "mutex_miss",                 KSTAT_DATA_UINT64 },
 782         { "access_skip",                KSTAT_DATA_UINT64 },
 783         { "evict_skip",                 KSTAT_DATA_UINT64 },
 784         { "evict_not_enough",           KSTAT_DATA_UINT64 },
 785         { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 786         { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 787         { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 788         { "evict_l2_skip",              KSTAT_DATA_UINT64 },
 789         { "hash_elements",              KSTAT_DATA_UINT64 },
 790         { "hash_elements_max",          KSTAT_DATA_UINT64 },
 791         { "hash_collisions",            KSTAT_DATA_UINT64 },
 792         { "hash_chains",                KSTAT_DATA_UINT64 },
 793         { "hash_chain_max",             KSTAT_DATA_UINT64 },
 794         { "p",                          KSTAT_DATA_UINT64 },
 795         { "c",                          KSTAT_DATA_UINT64 },
 796         { "c_min",                      KSTAT_DATA_UINT64 },
 797         { "c_max",                      KSTAT_DATA_UINT64 },
 798         { "size",                       KSTAT_DATA_UINT64 },
 799         { "compressed_size",            KSTAT_DATA_UINT64 },
 800         { "uncompressed_size",          KSTAT_DATA_UINT64 },
 801         { "overhead_size",              KSTAT_DATA_UINT64 },
 802         { "hdr_size",                   KSTAT_DATA_UINT64 },
 803         { "data_size",                  KSTAT_DATA_UINT64 },
 804         { "metadata_size",              KSTAT_DATA_UINT64 },
 805         { "ddt_size",                   KSTAT_DATA_UINT64 },
 806         { "other_size",                 KSTAT_DATA_UINT64 },
 807         { "anon_size",                  KSTAT_DATA_UINT64 },
 808         { "anon_evictable_data",        KSTAT_DATA_UINT64 },
 809         { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },
 810         { "anon_evictable_ddt",         KSTAT_DATA_UINT64 },
 811         { "mru_size",                   KSTAT_DATA_UINT64 },
 812         { "mru_evictable_data",         KSTAT_DATA_UINT64 },
 813         { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },
 814         { "mru_evictable_ddt",          KSTAT_DATA_UINT64 },
 815         { "mru_ghost_size",             KSTAT_DATA_UINT64 },
 816         { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 817         { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
 818         { "mru_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 819         { "mfu_size",                   KSTAT_DATA_UINT64 },
 820         { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
 821         { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },
 822         { "mfu_evictable_ddt",          KSTAT_DATA_UINT64 },
 823         { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
 824         { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 825         { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
 826         { "mfu_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 827         { "l2_hits",                    KSTAT_DATA_UINT64 },
 828         { "l2_ddt_hits",                KSTAT_DATA_UINT64 },
 829         { "l2_misses",                  KSTAT_DATA_UINT64 },
 830         { "l2_feeds",                   KSTAT_DATA_UINT64 },
 831         { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 832         { "l2_read_bytes",              KSTAT_DATA_UINT64 },
 833         { "l2_ddt_read_bytes",          KSTAT_DATA_UINT64 },
 834         { "l2_write_bytes",             KSTAT_DATA_UINT64 },
 835         { "l2_ddt_write_bytes",         KSTAT_DATA_UINT64 },
 836         { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 837         { "l2_writes_done",             KSTAT_DATA_UINT64 },
 838         { "l2_writes_error",            KSTAT_DATA_UINT64 },
 839         { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
 840         { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 841         { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 842         { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
 843         { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 844         { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 845         { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 846         { "l2_io_error",                KSTAT_DATA_UINT64 },
 847         { "l2_size",                    KSTAT_DATA_UINT64 },
 848         { "l2_asize",                   KSTAT_DATA_UINT64 },
 849         { "l2_hdr_size",                KSTAT_DATA_UINT64 },
 850         { "l2_log_blk_writes",          KSTAT_DATA_UINT64 },
 851         { "l2_log_blk_avg_size",        KSTAT_DATA_UINT64 },
 852         { "l2_data_to_meta_ratio",      KSTAT_DATA_UINT64 },
 853         { "l2_rebuild_successes",       KSTAT_DATA_UINT64 },
 854         { "l2_rebuild_unsupported",     KSTAT_DATA_UINT64 },
 855         { "l2_rebuild_io_errors",       KSTAT_DATA_UINT64 },
 856         { "l2_rebuild_cksum_errors",    KSTAT_DATA_UINT64 },
 857         { "l2_rebuild_loop_errors",     KSTAT_DATA_UINT64 },
 858         { "l2_rebuild_lowmem",          KSTAT_DATA_UINT64 },
 859         { "l2_rebuild_size",            KSTAT_DATA_UINT64 },
 860         { "l2_rebuild_bufs",            KSTAT_DATA_UINT64 },
 861         { "l2_rebuild_bufs_precached",  KSTAT_DATA_UINT64 },
 862         { "l2_rebuild_psize",           KSTAT_DATA_UINT64 },
 863         { "l2_rebuild_log_blks",        KSTAT_DATA_UINT64 },
 864         { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 865         { "arc_meta_used",              KSTAT_DATA_UINT64 },
 866         { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 867         { "arc_meta_max",               KSTAT_DATA_UINT64 },
 868         { "arc_meta_min",               KSTAT_DATA_UINT64 },
 869         { "arc_ddt_limit",              KSTAT_DATA_UINT64 },
 870         { "sync_wait_for_async",        KSTAT_DATA_UINT64 },
 871         { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
 872 };
 873 
 874 #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 875 
 876 #define ARCSTAT_INCR(stat, val) \
 877         atomic_add_64(&arc_stats.stat.value.ui64, (val))
 878 
 879 #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
 880 #define ARCSTAT_BUMPDOWN(stat)  ARCSTAT_INCR(stat, -1)
 881 
 882 #define ARCSTAT_MAX(stat, val) {                                        \
 883         uint64_t m;                                                     \
 884         while ((val) > (m = arc_stats.stat.value.ui64) &&            \
 885             (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val))))     \
 886                 continue;                                               \
 887 }
 888 
 889 #define ARCSTAT_MAXSTAT(stat) \
 890         ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
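/*
 * Illustration only (not part of arc.c): a minimal sketch of the
 * compare-and-swap "raise to maximum" loop that ARCSTAT_MAX uses, written
 * with C11 <stdatomic.h> instead of the kernel's atomic_cas_64(). The names
 * demo_stat_max and demo_stat_update_max are hypothetical.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one kstat counter. */
static _Atomic uint64_t demo_stat_max;

/*
 * Raise *stat to val if val is larger. The loop retries only when another
 * thread changed *stat between our load and our compare-and-swap.
 */
static void
demo_stat_update_max(_Atomic uint64_t *stat, uint64_t val)
{
	uint64_t cur = atomic_load(stat);

	while (val > cur &&
	    !atomic_compare_exchange_weak(stat, &cur, val)) {
		/* A failed exchange reloads cur; the loop re-tests it. */
	}
}
```

/*
 * As in ARCSTAT_MAX, a smaller value never overwrites a larger one, so the
 * counter only ever moves upward.
 */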
 891 
 892 /*
 893  * We define a macro to allow ARC hits/misses to be easily broken down by
 894  * two separate conditions, giving a total of four different subtypes for
 895  * each of hits and misses (so eight statistics total).
 896  */
 897 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 898         if (cond1) {                                                    \
 899                 if (cond2) {                                            \
 900                         ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##stat2); \
 901                 } else {                                                \
 902                         ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##notstat2); \
 903                 }                                                       \
 904         } else {                                                        \
 905                 if (cond2) {                                            \
 906                         ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##stat2); \
 907                 } else {                                                \
 908                         ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##notstat2);\
 909                 }                                                       \
 910         }
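/*
 * Illustration only (not part of arc.c): a small sketch of how the token
 * pasting in ARCSTAT_CONDSTAT selects one of four counters. The demo_*
 * names and the plain (non-atomic) increment are assumptions made to keep
 * the example self-contained.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters, named <who>_<what>_<outcome> as in arc_stats. */
static struct {
	uint64_t	demand_data_hits;
	uint64_t	demand_data_misses;
	uint64_t	prefetch_data_hits;
	uint64_t	prefetch_data_misses;
} demo_cond_stats;

#define	DEMO_BUMP(name)	(demo_cond_stats.name++)

/* Same shape as ARCSTAT_CONDSTAT, wrapped in do/while for statement safety. */
#define	DEMO_CONDSTAT(c1, s1, ns1, c2, s2, ns2, stat)			\
	do {								\
		if (c1) {						\
			if (c2)						\
				DEMO_BUMP(s1##_##stat##_##s2);		\
			else						\
				DEMO_BUMP(s1##_##stat##_##ns2);		\
		} else {						\
			if (c2)						\
				DEMO_BUMP(ns1##_##stat##_##s2);		\
			else						\
				DEMO_BUMP(ns1##_##stat##_##ns2);	\
		}							\
	} while (0)

/* Record one data access; the two flags pick exactly one counter. */
static void
demo_record(int is_demand, int is_hit)
{
	DEMO_CONDSTAT(is_demand, demand, prefetch, is_hit, hits, misses, data);
}
```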
 911 
 912 /*
 913  * This macro allows us to use kstats as floating averages. Each time we
 914  * update this kstat, we first divide both it and the update value by
 915  * ARCSTAT_F_AVG_FACTOR to shrink the new value's contribution to the overall
 916  * average. This macro assumes that integer loads and stores are atomic, but
 917  * is not safe for multiple writers updating the kstat in parallel (only the
 918  * last writer's update will remain).
 919  */
 920 #define ARCSTAT_F_AVG_FACTOR    3
 921 #define ARCSTAT_F_AVG(stat, value) \
 922         do { \
 923                 uint64_t x = ARCSTAT(stat); \
 924                 x = x - x / ARCSTAT_F_AVG_FACTOR + \
 925                     (value) / ARCSTAT_F_AVG_FACTOR; \
 926                 ARCSTAT(stat) = x; \
 927                 _NOTE(CONSTCOND) \
 928         } while (0)
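/*
 * Illustration only (not part of arc.c): the update step of ARCSTAT_F_AVG
 * as a pure function, so the integer arithmetic can be followed by hand.
 * demo_f_avg and DEMO_AVG_FACTOR are hypothetical names; the factor of 3
 * mirrors ARCSTAT_F_AVG_FACTOR above.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_AVG_FACTOR	3	/* mirrors ARCSTAT_F_AVG_FACTOR */

/*
 * One update step: keep roughly (FACTOR - 1) / FACTOR of the old average
 * and add 1 / FACTOR of the new sample. All divisions truncate.
 */
static uint64_t
demo_f_avg(uint64_t avg, uint64_t sample)
{
	return (avg - avg / DEMO_AVG_FACTOR + sample / DEMO_AVG_FACTOR);
}
```

/*
 * Feeding a constant sample of 300 into an average that starts at 0 gives
 * 100, then 167, then 212, converging toward 300. Because the divisions
 * truncate, a sample smaller than the factor contributes nothing.
 */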
 929 
 930 kstat_t                 *arc_ksp;
 931 static arc_state_t      *arc_anon;
 932 static arc_state_t      *arc_mru;
 933 static arc_state_t      *arc_mru_ghost;
 934 static arc_state_t      *arc_mfu;
 935 static arc_state_t      *arc_mfu_ghost;
 936 static arc_state_t      *arc_l2c_only;
 937 
 938 /*
 939  * There are several ARC variables that are critical to export as kstats --
 940  * but we don't want to have to grovel around in the kstat whenever we wish to
 941  * manipulate them.  For these variables, we therefore define them to be in
 942  * terms of the statistic variable.  This assures that we are not introducing
 943  * the possibility of inconsistency by having shadow copies of the variables,
 944  * while still allowing the code to be readable.
 945  */
 946 #define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
 947 #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 948 #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 949 #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 950 #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 951 #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 952 #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */
 953 #define arc_meta_used   ARCSTAT(arcstat_meta_used) /* size of metadata */
 954 #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */
 955 #define arc_ddt_size    ARCSTAT(arcstat_ddt_size) /* ddt size in arc */
 956 #define arc_ddt_limit   ARCSTAT(arcstat_ddt_limit) /* ddt in arc size limit */
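/*
 * Illustration only (not part of arc.c): a sketch of the "define the
 * variable in terms of the statistic" pattern described above. The demo_*
 * names are hypothetical; the point is that the short name expands to the
 * exported field itself, so there is no shadow copy to fall out of sync.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical exported statistics block; the single source of truth. */
static struct {
	uint64_t	demo_stat_size;
	uint64_t	demo_stat_c;
} demo_kstats;

#define	DEMO_STAT(stat)	(demo_kstats.stat)

/* Readable names that expand to the statistic itself, not to a copy. */
#define	demo_size	DEMO_STAT(demo_stat_size)
#define	demo_c		DEMO_STAT(demo_stat_c)

/* Returns nonzero when the cache is over its target size. */
static int
demo_over_target(void)
{
	return (demo_size > demo_c);
}
```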
 957 
 958 /*
 959  * Used in zio.c to optionally keep DDT cached in ARC
 960  */
 961 uint64_t const *arc_ddt_evict_threshold;
 962 
 963 /* compressed size of entire arc */
 964 #define arc_compressed_size     ARCSTAT(arcstat_compressed_size)
 965 /* uncompressed size of entire arc */
 966 #define arc_uncompressed_size   ARCSTAT(arcstat_uncompressed_size)
 967 /* number of bytes in the arc from arc_buf_t's */
 968 #define arc_overhead_size       ARCSTAT(arcstat_overhead_size)
 969 
 970 
 971 static int              arc_no_grow;    /* Don't try to grow cache size */
 972 static uint64_t         arc_tempreserve;
 973 static uint64_t         arc_loaned_bytes;
 974 
 975 typedef struct arc_callback arc_callback_t;
 976 
 977 struct arc_callback {
 978         void                    *acb_private;
 979         arc_done_func_t         *acb_done;
 980         arc_buf_t               *acb_buf;
 981         boolean_t               acb_compressed;
 982         zio_t                   *acb_zio_dummy;
 983         arc_callback_t          *acb_next;
 984 };
 985 
 986 typedef struct arc_write_callback arc_write_callback_t;
 987 
 988 struct arc_write_callback {
 989         void            *awcb_private;

1010  *    | l2arc_buf_hdr_t        |          | l2arc_buf_hdr_t        |
1011  *    | (undefined if L1-only) |          |                        |
1012  *    +------------------------+          +------------------------+
1013  *    | l1arc_buf_hdr_t        |
1014  *    |                        |
1015  *    |                        |
1016  *    |                        |
1017  *    |                        |
1018  *    +------------------------+
1019  *
1020  * Because it's possible for the L2ARC to become extremely large, we can wind
1021  * up eating a lot of memory in L2ARC buffer headers, so the size of a header
1022  * is minimized by only allocating the fields necessary for an L1-cached buffer
1023  * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
1024  * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
1025  * words in pointers. arc_hdr_realloc() is used to switch a header between
1026  * these two allocation states.
1027  */
1028 typedef struct l1arc_buf_hdr {
1029         kmutex_t                b_freeze_lock;

1030 #ifdef ZFS_DEBUG
1031         /*
1032          * Used for debugging with kmem_flags - by allocating and freeing
1033          * b_thawed when the buffer is thawed, we get a record of the stack
1034          * trace that thawed it.
1035          */
1036         void                    *b_thawed;
1037 #endif
1038 
1039         /* number of krrp tasks using this buffer */
1040         uint64_t                b_krrp;
1041 
1042         arc_buf_t               *b_buf;
1043         uint32_t                b_bufcnt;
1044         /* for waiting on writes to complete */
1045         kcondvar_t              b_cv;
1046         uint8_t                 b_byteswap;
1047 
1048         /* protected by arc state mutex */
1049         arc_state_t             *b_state;
1050         multilist_node_t        b_arc_node;
1051 
1052         /* updated atomically */
1053         clock_t                 b_arc_access;
1054 
1055         /* self protecting */
1056         refcount_t              b_refcnt;
1057 
1058         arc_callback_t          *b_acb;
1059         abd_t                   *b_pabd;
1060 } l1arc_buf_hdr_t;
1061 
1062 typedef struct l2arc_dev l2arc_dev_t;
1063 
1064 typedef struct l2arc_buf_hdr {
1065         /* protected by arc_buf_hdr mutex */
1066         l2arc_dev_t             *b_dev;         /* L2ARC device */
1067         uint64_t                b_daddr;        /* disk address, offset byte */
1068 
1069         list_node_t             b_l2node;
1070 } l2arc_buf_hdr_t;
1071 
1072 struct arc_buf_hdr {
1073         /* protected by hash lock */
1074         dva_t                   b_dva;
1075         uint64_t                b_birth;
1076 
1077         /*
1078          * Even though this checksum is only set/verified when a buffer is in
1079          * the L1 cache, it needs to be in the set of common fields because it
1080          * must be preserved from the time before a buffer is written out to
1081          * L2ARC until after it is read back in.
1082          */
1083         zio_cksum_t             *b_freeze_cksum;
1084 
1085         arc_buf_contents_t      b_type;
1086         arc_buf_hdr_t           *b_hash_next;
1087         arc_flags_t             b_flags;
1088 
1089         /*
1090          * This field stores the size of the data buffer after
1091          * compression, and is set in the arc's zio completion handlers.
1092          * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
1093          *
1094          * While the block pointers can store up to 32MB in their psize
1095          * field, we can only store up to 32MB minus 512B. This is due
1096          * to the bp using a bias of 1, whereas we use a bias of 0 (i.e.
1097          * a field of zeros represents 512B in the bp). We can't use a
1098          * bias of 1 since we need to reserve a psize of zero, here, to
1099          * represent holes and embedded blocks.
1100          *
1101          * This isn't a problem in practice, since the maximum size of a
1102          * buffer is limited to 16MB, so we never need to store 32MB in
1103          * this field. Even in the upstream illumos code base, the
1104          * maximum size of a buffer is limited to 16MB.

1122 #define GHOST_STATE(state)      \
1123         ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
1124         (state) == arc_l2c_only)
1125 
1126 #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
1127 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
1128 #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
1129 #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_FLAG_PREFETCH)
1130 #define HDR_COMPRESSION_ENABLED(hdr)    \
1131         ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
1132 
1133 #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
1134 #define HDR_L2_READING(hdr)     \
1135         (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&   \
1136         ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
1137 #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
1138 #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
1139 #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
1140 #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
1141 
1142 #define HDR_ISTYPE_DDT(hdr)     \
1143             ((hdr)->b_flags & ARC_FLAG_BUFC_DDT)
1144 #define HDR_ISTYPE_METADATA(hdr)        \
1145         ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
1146 #define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr) && \
1147         !HDR_ISTYPE_DDT(hdr))
1148 
1149 #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
1150 #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
1151 
1152 /* For storing compression mode in b_flags */
1153 #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
1154 
1155 #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
1156         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
1157 #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
1158         HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
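/*
 * Illustration only (not part of arc.c): a sketch of storing a small
 * enumerated value in a bit range of a 32-bit flags word, which is what
 * HDR_GET_COMPRESS and HDR_SET_COMPRESS do through BF32_GET and BF32_SET.
 * The helpers below are hypothetical, and the shift of 24 and width of 7
 * are assumptions chosen for the example, not values taken from this file.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_COMPRESS_SHIFT	24	/* assumed position of the field */
#define	DEMO_COMPRESS_BITS	7	/* assumed width of the field */

/* Extract len bits starting at bit low (len must be less than 32). */
static uint32_t
demo_get_bits(uint32_t flags, unsigned low, unsigned len)
{
	return ((flags >> low) & ((1U << len) - 1));
}

/* Return flags with the len-bit field at bit low replaced by val. */
static uint32_t
demo_set_bits(uint32_t flags, unsigned low, unsigned len, uint32_t val)
{
	uint32_t mask = ((1U << len) - 1) << low;

	assert(val < (1U << len));
	return ((flags & ~mask) | (val << low));
}
```

/*
 * Setting the field leaves every flag bit outside the range untouched,
 * which is why the compression mode can share b_flags with the ARC_FLAG_*
 * bits.
 */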
1159 
1160 #define ARC_BUF_LAST(buf)       ((buf)->b_next == NULL)
1161 #define ARC_BUF_SHARED(buf)     ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
1162 #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
1163 
1164 /*
1165  * Other sizes
1166  */
1167 
1168 #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
1169 #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
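/*
 * Illustration only (not part of arc.c): a sketch of why HDR_L2ONLY_SIZE
 * can be written as an offsetof(). If the L1 sub-header is the last member
 * of the header, an L2-only header is just the leading part of the full
 * structure. The demo_* types are hypothetical and much smaller than the
 * real ones.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct demo_l2hdr {
	uint64_t	dev;
	uint64_t	daddr;
} demo_l2hdr_t;

typedef struct demo_l1hdr {
	uint64_t	state;
	uint64_t	refcnt;
	void		*buf;
} demo_l1hdr_t;

typedef struct demo_hdr {
	uint64_t	birth;
	uint32_t	flags;
	demo_l2hdr_t	l2hdr;
	demo_l1hdr_t	l1hdr;	/* must stay last for the trick to work */
} demo_hdr_t;

/* Full header: common fields plus both sub-headers. */
#define	DEMO_HDR_FULL_SIZE	sizeof (demo_hdr_t)
/* L2-only header: everything up to, but not including, the L1 part. */
#define	DEMO_HDR_L2ONLY_SIZE	offsetof(demo_hdr_t, l1hdr)
```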
1170 
1171 /*
1172  * Hash table routines
1173  */
1174 
1175 struct ht_table {
1176         arc_buf_hdr_t   *hdr;
1177         kmutex_t        lock;




1178 };
1179 

1180 typedef struct buf_hash_table {
1181         uint64_t ht_mask;
1182         struct ht_table *ht_table;

1183 } buf_hash_table_t;
1184 
1185 #pragma align 64(buf_hash_table)
1186 static buf_hash_table_t buf_hash_table;
1187 
1188 #define BUF_HASH_INDEX(spa, dva, birth) \
1189         (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1190 #define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)

1191 #define HDR_LOCK(hdr) \
1192         (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
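/*
 * Illustration only (not part of arc.c): a sketch of the bucket layout
 * above, where each hash bucket carries its own lock and the index is the
 * hash masked by a power-of-two table size minus one. The demo_* names are
 * hypothetical and an int stands in for kmutex_t.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_HT_SIZE	8	/* must be a power of two */

typedef struct demo_bucket {
	void	*head;	/* first header in this chain */
	int	lock;	/* placeholder for the per-bucket mutex */
} demo_bucket_t;

static demo_bucket_t demo_table[DEMO_HT_SIZE];
static const uint64_t demo_ht_mask = DEMO_HT_SIZE - 1;

/* Masking replaces a modulo because the table size is a power of two. */
#define	DEMO_HASH_INDEX(hash)	((hash) & demo_ht_mask)
#define	DEMO_HASH_LOCK(idx)	(&demo_table[idx].lock)
```

/*
 * Two keys that hash to the same bucket resolve to the same lock, so
 * holding that lock protects the whole chain.
 */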
1193 
1194 uint64_t zfs_crc64_table[256];
1195 
1196 /*
1197  * Level 2 ARC
1198  */
1199 
1200 #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
1201 #define L2ARC_HEADROOM          2                       /* num of writes */
1202 /*
1203  * If we discover during ARC scan any buffers to be compressed, we boost
1204  * our headroom for the next scanning cycle by this percentage multiple.
1205  */
1206 #define L2ARC_HEADROOM_BOOST    200
1207 #define L2ARC_FEED_SECS         1               /* caching interval secs */
1208 #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
1209 
1210 #define l2arc_writes_sent       ARCSTAT(arcstat_l2_writes_sent)
1211 #define l2arc_writes_done       ARCSTAT(arcstat_l2_writes_done)
1212 
1213 /* L2ARC Performance Tunables */
1214 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
1215 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
1216 uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
1217 uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1218 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
1219 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1220 boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
1221 boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
1222 boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
1223 
1224 static list_t L2ARC_dev_list;                   /* device list */
1225 static list_t *l2arc_dev_list;                  /* device list pointer */
1226 static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
1227 static l2arc_dev_t *l2arc_dev_last;             /* last device used */
1228 static l2arc_dev_t *l2arc_ddt_dev_last;         /* last DDT device used */
1229 static list_t L2ARC_free_on_write;              /* free after write buf list */
1230 static list_t *l2arc_free_on_write;             /* free after write list ptr */
1231 static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
1232 static uint64_t l2arc_ndev;                     /* number of devices */
1233 
1234 typedef struct l2arc_read_callback {
1235         arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
1236         blkptr_t                l2rcb_bp;               /* original blkptr */
1237         zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
1238         int                     l2rcb_flags;            /* original flags */
1239         abd_t                   *l2rcb_abd;             /* temporary buffer */
1240 } l2arc_read_callback_t;
1241 
1242 typedef struct l2arc_write_callback {
1243         l2arc_dev_t     *l2wcb_dev;             /* device info */
1244         arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
1245         list_t          l2wcb_log_blk_buflist;  /* in-flight log blocks */
1246 } l2arc_write_callback_t;
1247 
1248 typedef struct l2arc_data_free {
1249         /* protected by l2arc_free_on_write_mtx */
1250         abd_t           *l2df_abd;
1251         size_t          l2df_size;
1252         arc_buf_contents_t l2df_type;
1253         list_node_t     l2df_list_node;
1254 } l2arc_data_free_t;
1255 
1256 static kmutex_t l2arc_feed_thr_lock;
1257 static kcondvar_t l2arc_feed_thr_cv;
1258 static uint8_t l2arc_thread_exit;
1259 
1260 static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *);
1261 static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1262 static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1263 static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1264 static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1265 static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1266 static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1267 static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1268 static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1269 static boolean_t arc_is_overflowing(void);
1270 static void arc_buf_watch(arc_buf_t *);
1271 static l2arc_dev_t *l2arc_vdev_get(vdev_t *vd);
1272 
1273 static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1274 static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
1275 static arc_buf_contents_t arc_flags_to_bufc(uint32_t);
1276 static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1277 static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1278 
1279 static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1280 static void l2arc_read_done(zio_t *);
1281 
1282 static void
1283 arc_update_hit_stat(arc_buf_hdr_t *hdr, boolean_t hit)
1284 {
1285         boolean_t pf = !HDR_PREFETCH(hdr);
1286         switch (arc_buf_type(hdr)) {
1287         case ARC_BUFC_DATA:
1288                 ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, data);
1289                 break;
1290         case ARC_BUFC_METADATA:
1291                 ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses,
1292                     metadata);
1293                 break;
1294         case ARC_BUFC_DDT:
1295                 ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, ddt);
1296                 break;
1297         default:
1298                 break;
1299         }
1300 }
1301 
1302 enum {
1303         L2ARC_DEV_HDR_EVICT_FIRST = (1 << 0)      /* mirror of l2ad_first */
1304 };
1305 
1306 /*
1307  * Pointer used in persistent L2ARC (for pointing to log blocks & ARC buffers).

1308  */
1309 typedef struct l2arc_log_blkptr {
1310         uint64_t        lbp_daddr;      /* device address of log */
1311         /*
1312          * lbp_prop is the same format as the blk_prop in blkptr_t:
1313          *      * logical size (in sectors)
1314          *      * physical size (in sectors)
1315          *      * checksum algorithm (used for lbp_cksum)
1316          *      * object type & level (unused for now)
1317          */
1318         uint64_t        lbp_prop;
1319         zio_cksum_t     lbp_cksum;      /* fletcher4 of log */
1320 } l2arc_log_blkptr_t;
1321 
1322 /*
1323  * The persistent L2ARC device header.
1324  * Byte order of magic determines whether 64-bit bswap of fields is necessary.
1325  */
1326 typedef struct l2arc_dev_hdr_phys {
1327         uint64_t        dh_magic;       /* L2ARC_DEV_HDR_MAGIC_Vx */
1328         zio_cksum_t     dh_self_cksum;  /* fletcher4 of fields below */
1329 
1330         /*
1331          * Global L2ARC device state and metadata.
1332          */
1333         uint64_t        dh_spa_guid;
1334         uint64_t        dh_alloc_space;         /* vdev space alloc status */
1335         uint64_t        dh_flags;               /* l2arc_dev_hdr_flags_t */
1336 
1337         /*
1338          * Start of log block chain. [0] -> newest log, [1] -> one older (used
1339          * for initiating prefetch).
1340          */
1341         l2arc_log_blkptr_t      dh_start_lbps[2];
1342 
1343         const uint64_t  dh_pad[44];             /* pad to 512 bytes */
1344 } l2arc_dev_hdr_phys_t;
1345 CTASSERT(sizeof (l2arc_dev_hdr_phys_t) == SPA_MINBLOCKSIZE);
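/*
 * Illustration only (not part of arc.c): a sketch of the size check that
 * CTASSERT performs on the device header, using C11 _Static_assert and
 * stand-in types. The demo_* types assume that a checksum is four 64-bit
 * words; with that assumption the fields add up to 8 + 32 + 24 + 96 + 352
 * = 512 bytes.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_MINBLOCKSIZE	512

typedef struct demo_cksum {
	uint64_t	word[4];			/* 32 bytes */
} demo_cksum_t;

typedef struct demo_lbp {
	uint64_t	daddr;				/* 8 bytes */
	uint64_t	prop;				/* 8 bytes */
	demo_cksum_t	cksum;				/* 32 bytes */
} demo_lbp_t;

typedef struct demo_dev_hdr {
	uint64_t	magic;				/* 8 bytes */
	demo_cksum_t	self_cksum;			/* 32 bytes */
	uint64_t	spa_guid;			/* 8 bytes */
	uint64_t	alloc_space;			/* 8 bytes */
	uint64_t	flags;				/* 8 bytes */
	demo_lbp_t	start_lbps[2];			/* 96 bytes */
	uint64_t	pad[44];			/* 352 bytes */
} demo_dev_hdr_t;

/* Fails at compile time if a field is added without shrinking the pad. */
_Static_assert(sizeof (demo_dev_hdr_t) == DEMO_MINBLOCKSIZE,
    "device header must be exactly one minimum-size block");
```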
1346 
1347 /*
1348  * A single ARC buffer header entry in a l2arc_log_blk_phys_t.
1349  */
1350 typedef struct l2arc_log_ent_phys {
1351         dva_t                   le_dva; /* dva of buffer */
1352         uint64_t                le_birth;       /* birth txg of buffer */
1353         zio_cksum_t             le_freeze_cksum;
1354         /*
1355          * le_prop is the same format as the blk_prop in blkptr_t:
1356          *      * logical size (in sectors)
1357          *      * physical size (in sectors)
1358          *      * checksum algorithm (used for b_freeze_cksum)
1359          *      * object type & level (used to restore arc_buf_contents_t)
1360          */
1361         uint64_t                le_prop;
1362         uint64_t                le_daddr;       /* buf location on l2dev */
1363         const uint64_t          le_pad[7];      /* resv'd for future use */
1364 } l2arc_log_ent_phys_t;
1365 
1366 /*
1367  * These design limits give us the following metadata overhead (before
1368  * compression):
1369  *      avg_blk_sz      overhead
1370  *      1k              12.51 %
1371  *      2k               6.26 %
1372  *      4k               3.13 %
1373  *      8k               1.56 %
1374  *      16k              0.78 %
1375  *      32k              0.39 %
1376  *      64k              0.20 %
1377  *      128k             0.10 %
1378  * Compression should be able to squeeze these down by about a factor of 2.
1379  */
1380 #define L2ARC_LOG_BLK_SIZE                      (128 * 1024)    /* 128k */
1381 #define L2ARC_LOG_BLK_HEADER_LEN                (128)
1382 #define L2ARC_LOG_BLK_ENTRIES                   /* 1023 entries */      \
1383         ((L2ARC_LOG_BLK_SIZE - L2ARC_LOG_BLK_HEADER_LEN) /              \
1384         sizeof (l2arc_log_ent_phys_t))
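/*
 * Illustrative sketch only, not part of arc.c: it recomputes the overhead
 * table above. The EX_ names are hypothetical stand-ins. The 128-byte entry
 * size is an assumption derived from the l2arc_log_ent_phys_t layout shown
 * above (16-byte dva_t, 32-byte zio_cksum_t, three uint64_t fields and
 * seven uint64_t of padding).
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the L2ARC_LOG_BLK_* limits. */
#define	EX_LOG_BLK_SIZE		(128 * 1024)
#define	EX_LOG_BLK_HEADER_LEN	128
#define	EX_LOG_ENT_SIZE		128	/* 16 + 8 + 32 + 8 + 8 + 7 * 8 bytes */
#define	EX_LOG_BLK_ENTRIES	\
	((EX_LOG_BLK_SIZE - EX_LOG_BLK_HEADER_LEN) / EX_LOG_ENT_SIZE)

/*
 * Metadata overhead in percent: one full log block is spent to describe
 * EX_LOG_BLK_ENTRIES buffers of avg_blk_sz bytes each.
 */
static double
ex_log_overhead_pct(uint64_t avg_blk_sz)
{
	assert(avg_blk_sz > 0);
	return (100.0 * EX_LOG_BLK_SIZE /
	    ((double)EX_LOG_BLK_ENTRIES * (double)avg_blk_sz));
}
```

/*
 * For example, ex_log_overhead_pct(1024) evaluates to roughly 12.51 and
 * ex_log_overhead_pct(128 * 1024) to roughly 0.10, matching the first and
 * last rows of the table.
 */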
1385 /*
1386  * Maximum amount of data in an l2arc log block (used to terminate rebuilding
1387  * before we hit the write head and restore potentially corrupted blocks).
1388  */
1389 #define L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE  \
1390         (SPA_MAXBLOCKSIZE * L2ARC_LOG_BLK_ENTRIES)
1391 /*
1392  * For the persistency and rebuild algorithms to operate reliably we need
1393  * the L2ARC device to at least be able to hold 3 full log blocks (otherwise
1394  * excessive log block looping might confuse the log chain end detection).
1395  * Under normal circumstances this is not a problem, since this is somewhere
1396  * around only 400 MB.
1397  */
1398 #define L2ARC_PERSIST_MIN_SIZE  (3 * L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE)
1399 
1400 /*
1401  * A log block of up to 1023 ARC buffer log entries, chained into the
1402  * persistent L2ARC metadata linked list. Byte order of magic determines
1403  * whether 64-bit bswap of fields is necessary.
1404  */
1405 typedef struct l2arc_log_blk_phys {
1406         /* Header - see L2ARC_LOG_BLK_HEADER_LEN above */
1407         uint64_t                lb_magic;       /* L2ARC_LOG_BLK_MAGIC */
1408         l2arc_log_blkptr_t      lb_back2_lbp;   /* back 2 steps in chain */
1409         uint64_t                lb_pad[9];      /* resv'd for future use */
1410         /* Payload */
1411         l2arc_log_ent_phys_t    lb_entries[L2ARC_LOG_BLK_ENTRIES];
1412 } l2arc_log_blk_phys_t;
1413 
1414 CTASSERT(sizeof (l2arc_log_blk_phys_t) == L2ARC_LOG_BLK_SIZE);
1415 CTASSERT(offsetof(l2arc_log_blk_phys_t, lb_entries) -
1416     offsetof(l2arc_log_blk_phys_t, lb_magic) == L2ARC_LOG_BLK_HEADER_LEN);
1417 
1418 /*
1419  * These structures hold in-flight l2arc_log_blk_phys_t's as they're being
1420  * written to the L2ARC device. They may be compressed, hence the uint8_t[].
1421  */
1422 typedef struct l2arc_log_blk_buf {
1423         uint8_t         lbb_log_blk[sizeof (l2arc_log_blk_phys_t)];
1424         list_node_t     lbb_node;
1425 } l2arc_log_blk_buf_t;
1426 
1427 /* Macros for manipulating fields in the blk_prop format of blkptr_t */
1428 #define BLKPROP_GET_LSIZE(_obj, _field)         \
1429         BF64_GET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1)
1430 #define BLKPROP_SET_LSIZE(_obj, _field, x)      \
1431         BF64_SET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
1432 #define BLKPROP_GET_PSIZE(_obj, _field)         \
1433         BF64_GET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0)
1434 #define BLKPROP_SET_PSIZE(_obj, _field, x)      \
1435         BF64_SET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0, x)
1436 #define BLKPROP_GET_COMPRESS(_obj, _field)      \
1437         BF64_GET((_obj)->_field, 32, 7)
1438 #define BLKPROP_SET_COMPRESS(_obj, _field, x)   \
1439         BF64_SET((_obj)->_field, 32, 7, x)
1440 #define BLKPROP_GET_ARC_COMPRESS(_obj, _field)  \
1441         BF64_GET((_obj)->_field, 39, 1)
1442 #define BLKPROP_SET_ARC_COMPRESS(_obj, _field, x)       \
1443         BF64_SET((_obj)->_field, 39, 1, x)
1444 #define BLKPROP_GET_CHECKSUM(_obj, _field)      \
1445         BF64_GET((_obj)->_field, 40, 8)
1446 #define BLKPROP_SET_CHECKSUM(_obj, _field, x)   \
1447         BF64_SET((_obj)->_field, 40, 8, x)
1448 #define BLKPROP_GET_TYPE(_obj, _field)          \
1449         BF64_GET((_obj)->_field, 48, 8)
1450 #define BLKPROP_SET_TYPE(_obj, _field, x)       \
1451         BF64_SET((_obj)->_field, 48, 8, x)
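/*
 * Illustrative sketch only, not part of arc.c: a plain-C model of how the
 * BLKPROP_* macros above pack fields into one 64-bit word. The ex_ helpers
 * are hypothetical and only approximate the BF64_GET/BF64_SET family from
 * the SPA headers. EX_MINBLOCKSHIFT = 9 (512-byte sectors) is an assumption.
 * Sizes are stored as (bytes >> shift) - bias, with bias 1 for the logical
 * size and bias 0 for the physical size, as in the macros above.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	EX_MINBLOCKSHIFT	9	/* assumed value of SPA_MINBLOCKSHIFT */

/* Extract the len-bit field of x that starts at bit low. */
static uint64_t
ex_bf64_get(uint64_t x, int low, int len)
{
	return ((x >> low) & ((1ULL << len) - 1));
}

/* Return x with the len-bit field at bit low replaced by val. */
static uint64_t
ex_bf64_set(uint64_t x, int low, int len, uint64_t val)
{
	uint64_t mask = ((1ULL << len) - 1) << low;

	assert(val < (1ULL << len));
	return ((x & ~mask) | (val << low));
}

/* Logical size: bits 0..15, in sectors, bias 1. */
static uint64_t
ex_prop_set_lsize(uint64_t prop, uint64_t bytes)
{
	return (ex_bf64_set(prop, 0, 16, (bytes >> EX_MINBLOCKSHIFT) - 1));
}

static uint64_t
ex_prop_get_lsize(uint64_t prop)
{
	return ((ex_bf64_get(prop, 0, 16) + 1) << EX_MINBLOCKSHIFT);
}

/* Physical size: bits 16..31, in sectors, bias 0. */
static uint64_t
ex_prop_set_psize(uint64_t prop, uint64_t bytes)
{
	return (ex_bf64_set(prop, 16, 16, bytes >> EX_MINBLOCKSHIFT));
}

static uint64_t
ex_prop_get_psize(uint64_t prop)
{
	return (ex_bf64_get(prop, 16, 16) << EX_MINBLOCKSHIFT);
}

/* Checksum algorithm: bits 40..47, stored without shift or bias. */
static uint64_t
ex_prop_set_checksum(uint64_t prop, uint64_t cksum)
{
	return (ex_bf64_set(prop, 40, 8, cksum));
}

static uint64_t
ex_prop_get_checksum(uint64_t prop)
{
	return (ex_bf64_get(prop, 40, 8));
}
```

/*
 * Unlike the macros, these helpers return the updated word instead of
 * modifying their argument in place, which keeps the sketch easy to test.
 */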
1452 
1453 /* Macros for manipulating an l2arc_log_blkptr_t->lbp_prop field */
1454 #define LBP_GET_LSIZE(_add)             BLKPROP_GET_LSIZE(_add, lbp_prop)
1455 #define LBP_SET_LSIZE(_add, x)          BLKPROP_SET_LSIZE(_add, lbp_prop, x)
1456 #define LBP_GET_PSIZE(_add)             BLKPROP_GET_PSIZE(_add, lbp_prop)
1457 #define LBP_SET_PSIZE(_add, x)          BLKPROP_SET_PSIZE(_add, lbp_prop, x)
1458 #define LBP_GET_COMPRESS(_add)          BLKPROP_GET_COMPRESS(_add, lbp_prop)
1459 #define LBP_SET_COMPRESS(_add, x)       BLKPROP_SET_COMPRESS(_add, lbp_prop, x)
1460 #define LBP_GET_CHECKSUM(_add)          BLKPROP_GET_CHECKSUM(_add, lbp_prop)
1461 #define LBP_SET_CHECKSUM(_add, x)       BLKPROP_SET_CHECKSUM(_add, lbp_prop, x)
1462 #define LBP_GET_TYPE(_add)              BLKPROP_GET_TYPE(_add, lbp_prop)
1463 #define LBP_SET_TYPE(_add, x)           BLKPROP_SET_TYPE(_add, lbp_prop, x)
1464 
1465 /* Macros for manipulating an l2arc_log_ent_phys_t->le_prop field */
1466 #define LE_GET_LSIZE(_le)       BLKPROP_GET_LSIZE(_le, le_prop)
1467 #define LE_SET_LSIZE(_le, x)    BLKPROP_SET_LSIZE(_le, le_prop, x)
1468 #define LE_GET_PSIZE(_le)       BLKPROP_GET_PSIZE(_le, le_prop)
1469 #define LE_SET_PSIZE(_le, x)    BLKPROP_SET_PSIZE(_le, le_prop, x)
1470 #define LE_GET_COMPRESS(_le)    BLKPROP_GET_COMPRESS(_le, le_prop)
1471 #define LE_SET_COMPRESS(_le, x) BLKPROP_SET_COMPRESS(_le, le_prop, x)
1472 #define LE_GET_ARC_COMPRESS(_le)        BLKPROP_GET_ARC_COMPRESS(_le, le_prop)
1473 #define LE_SET_ARC_COMPRESS(_le, x)     BLKPROP_SET_ARC_COMPRESS(_le, le_prop, x)
1474 #define LE_GET_CHECKSUM(_le)    BLKPROP_GET_CHECKSUM(_le, le_prop)
1475 #define LE_SET_CHECKSUM(_le, x) BLKPROP_SET_CHECKSUM(_le, le_prop, x)
1476 #define LE_GET_TYPE(_le)        BLKPROP_GET_TYPE(_le, le_prop)
1477 #define LE_SET_TYPE(_le, x)     BLKPROP_SET_TYPE(_le, le_prop, x)
1478 
1479 #define PTR_SWAP(x, y)          \
1480         do {                    \
1481                 void *tmp = (x);\
1482                 x = y;          \
1483                 y = tmp;        \
1484                 _NOTE(CONSTCOND)\
1485         } while (0)
1486 
1487 /*
1488  * Sadly, after compressed ARC integration older kernels would panic
1489  * when trying to rebuild persistent L2ARC created by the new code.
1490  */
1491 #define L2ARC_DEV_HDR_MAGIC_V1  0x4c32415243763031LLU   /* ASCII: "L2ARCv01" */
1492 #define L2ARC_LOG_BLK_MAGIC     0x4c4f47424c4b4844LLU   /* ASCII: "LOGBLKHD" */
1493 
1494 /*
1495  * Performance tuning of L2ARC persistency:
1496  *
1497  * l2arc_rebuild_enabled : Controls whether L2ARC device adds (either at
1498  *              pool import or when adding one manually later) will attempt
1499  *              to rebuild L2ARC buffer contents. In special circumstances,
1500  *              the administrator may want to set this to B_FALSE, if they
1501  *              are having trouble importing a pool or attaching an L2ARC
1502  *              device (e.g. the L2ARC device is slow to read in stored log
1503  *              metadata, or the metadata has become somehow
1504  *              fragmented/unusable).
1505  */
1506 boolean_t l2arc_rebuild_enabled = B_TRUE;
1507 
1508 /* L2ARC persistency rebuild control routines. */
1509 static void l2arc_dev_rebuild_start(l2arc_dev_t *dev);
1510 static int l2arc_rebuild(l2arc_dev_t *dev);
1511 
1512 /* L2ARC persistency read I/O routines. */
1513 static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
1514 static int l2arc_log_blk_read(l2arc_dev_t *dev,
1515     const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
1516     l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
1517     uint8_t *this_lb_buf, uint8_t *next_lb_buf,
1518     zio_t *this_io, zio_t **next_io);
1519 static zio_t *l2arc_log_blk_prefetch(vdev_t *vd,
1520     const l2arc_log_blkptr_t *lp, uint8_t *lb_buf);
1521 static void l2arc_log_blk_prefetch_abort(zio_t *zio);
1522 
1523 /* L2ARC persistency block restoration routines. */
1524 static void l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
1525     const l2arc_log_blk_phys_t *lb, uint64_t lb_psize);
1526 static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
1527     l2arc_dev_t *dev, uint64_t guid);
1528 
1529 /* L2ARC persistency write I/O routines. */
1530 static void l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio);
1531 static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
1532     l2arc_write_callback_t *cb);
1533 
1534 /* L2ARC persistency auxiliary routines. */
1535 static boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
1536     const l2arc_log_blkptr_t *lp);
1537 static void l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr,
1538     zio_cksum_t *cksum);
1539 static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
1540     const arc_buf_hdr_t *ab);
1541 static inline boolean_t l2arc_range_check_overlap(uint64_t bottom,
1542     uint64_t top, uint64_t check);
1543 
1544 /*
1545  * L2ARC Internals
1546  */
1547 struct l2arc_dev {
1548         vdev_t                  *l2ad_vdev;     /* vdev */
1549         spa_t                   *l2ad_spa;      /* spa */
1550         uint64_t                l2ad_hand;      /* next write location */
1551         uint64_t                l2ad_start;     /* first addr on device */
1552         uint64_t                l2ad_end;       /* last addr on device */
1553         boolean_t               l2ad_first;     /* first sweep through */
1554         boolean_t               l2ad_writing;   /* currently writing */
1555         kmutex_t                l2ad_mtx;       /* lock for buffer list */
1556         list_t                  l2ad_buflist;   /* buffer list */
1557         list_node_t             l2ad_node;      /* device list node */
1558         refcount_t              l2ad_alloc;     /* allocated bytes */
1559         l2arc_dev_hdr_phys_t    *l2ad_dev_hdr;  /* persistent device header */
1560         uint64_t                l2ad_dev_hdr_asize; /* aligned hdr size */
1561         l2arc_log_blk_phys_t    l2ad_log_blk;   /* currently open log block */
1562         int                     l2ad_log_ent_idx; /* index into cur log blk */
1563         /* number of bytes in current log block's payload */
1564         uint64_t                l2ad_log_blk_payload_asize;
1565         /* flag indicating whether a rebuild is scheduled or is going on */
1566         boolean_t               l2ad_rebuild;
1567         boolean_t               l2ad_rebuild_cancel;
1568         kt_did_t                l2ad_rebuild_did;
1569 };
1570 
1571 static inline uint64_t
1572 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1573 {
1574         uint8_t *vdva = (uint8_t *)dva;
1575         uint64_t crc = -1ULL;
1576         int i;
1577 
1578         ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
1579 
1580         for (i = 0; i < sizeof (dva_t); i++)
1581                 crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
1582 
1583         crc ^= (spa>>8) ^ birth;
1584 
1585         return (crc);
1586 }
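/*
 * Illustrative sketch only, not part of arc.c: a user-space model of the
 * hash computed by buf_hash() above, together with the table construction
 * performed in buf_init(). The ex_ names are hypothetical. The polynomial
 * value is an assumption about ZFS_CRC64_POLY, and the DVA is passed as a
 * raw byte array rather than a dva_t.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	EX_CRC64_POLY	0xC96C5795D7870F42ULL	/* assumed ZFS_CRC64_POLY */

static uint64_t ex_crc64_table[256];

/* Build the reflected CRC-64 lookup table, one entry per byte value. */
static void
ex_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t ct = (uint64_t)i;

		for (int j = 8; j > 0; j--)
			ct = (ct >> 1) ^ (-(ct & 1) & EX_CRC64_POLY);
		ex_crc64_table[i] = ct;
	}
}

/* CRC the DVA bytes, then fold in the pool identifier and birth txg. */
static uint64_t
ex_buf_hash(uint64_t spa, const uint8_t *dva, size_t dva_len, uint64_t birth)
{
	uint64_t crc = -1ULL;

	for (size_t i = 0; i < dva_len; i++)
		crc = (crc >> 8) ^ ex_crc64_table[(crc ^ dva[i]) & 0xFF];

	return (crc ^ (spa >> 8) ^ birth);
}
```

/*
 * Entry 128 of the table equals the polynomial itself, which is what the
 * ASSERT in buf_hash() relies on to detect an uninitialized table.
 */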
1587 
1588 #define HDR_EMPTY(hdr)                                          \
1589         ((hdr)->b_dva.dva_word[0] == 0 &&                    \
1590         (hdr)->b_dva.dva_word[1] == 0)
1591 
1592 #define HDR_EQUAL(spa, dva, birth, hdr)                         \
1593         ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&       \
1594         ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&       \
1595         ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
1596 
1597 static void
1598 buf_discard_identity(arc_buf_hdr_t *hdr)
1599 {
1600         hdr->b_dva.dva_word[0] = 0;
1601         hdr->b_dva.dva_word[1] = 0;
1602         hdr->b_birth = 0;
1603 }
1604 
1605 static arc_buf_hdr_t *
1606 buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1607 {
1608         const dva_t *dva = BP_IDENTITY(bp);
1609         uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1610         uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1611         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1612         arc_buf_hdr_t *hdr;
1613 
1614         mutex_enter(hash_lock);
1615         for (hdr = buf_hash_table.ht_table[idx].hdr; hdr != NULL;
1616             hdr = hdr->b_hash_next) {
1617                 if (HDR_EQUAL(spa, dva, birth, hdr)) {
1618                         *lockp = hash_lock;
1619                         return (hdr);
1620                 }
1621         }
1622         mutex_exit(hash_lock);
1623         *lockp = NULL;
1624         return (NULL);
1625 }
1626 
1627 /*
1628  * Insert an entry into the hash table.  If there is already an element
1629  * equal to elem in the hash table, then the already existing element
1630  * will be returned and the new element will not be inserted.
1631  * Otherwise returns NULL.
1632  * If lockp == NULL, the caller is assumed to already hold the hash lock.
1633  */
1634 static arc_buf_hdr_t *
1635 buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
1636 {
1637         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1638         kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1639         arc_buf_hdr_t *fhdr;
1640         uint32_t i;
1641 
1642         ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
1643         ASSERT(hdr->b_birth != 0);
1644         ASSERT(!HDR_IN_HASH_TABLE(hdr));
1645 
1646         if (lockp != NULL) {
1647                 *lockp = hash_lock;
1648                 mutex_enter(hash_lock);
1649         } else {
1650                 ASSERT(MUTEX_HELD(hash_lock));
1651         }
1652 
1653         for (fhdr = buf_hash_table.ht_table[idx].hdr, i = 0; fhdr != NULL;
1654             fhdr = fhdr->b_hash_next, i++) {
1655                 if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1656                         return (fhdr);
1657         }
1658 
1659         hdr->b_hash_next = buf_hash_table.ht_table[idx].hdr;
1660         buf_hash_table.ht_table[idx].hdr = hdr;
1661         arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1662 
1663         /* collect some hash table performance data */
1664         if (i > 0) {
1665                 ARCSTAT_BUMP(arcstat_hash_collisions);
1666                 if (i == 1)
1667                         ARCSTAT_BUMP(arcstat_hash_chains);
1668 
1669                 ARCSTAT_MAX(arcstat_hash_chain_max, i);
1670         }
1671 
1672         ARCSTAT_BUMP(arcstat_hash_elements);
1673         ARCSTAT_MAXSTAT(arcstat_hash_elements);
1674 
1675         return (NULL);
1676 }
1677 
1678 static void
1679 buf_hash_remove(arc_buf_hdr_t *hdr)
1680 {
1681         arc_buf_hdr_t *fhdr, **hdrp;
1682         uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1683 
1684         ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1685         ASSERT(HDR_IN_HASH_TABLE(hdr));
1686 
1687         hdrp = &buf_hash_table.ht_table[idx].hdr;
1688         while ((fhdr = *hdrp) != hdr) {
1689                 ASSERT3P(fhdr, !=, NULL);
1690                 hdrp = &fhdr->b_hash_next;
1691         }
1692         *hdrp = hdr->b_hash_next;
1693         hdr->b_hash_next = NULL;
1694         arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1695 
1696         /* collect some hash table performance data */
1697         ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1698 
1699         if (buf_hash_table.ht_table[idx].hdr &&
1700             buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
1701                 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1702 }
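/*
 * Illustrative sketch only, not part of arc.c: the chaining scheme used by
 * buf_hash_insert() and buf_hash_remove() above, reduced to a single integer
 * key. All ex_ names are hypothetical. Hash locks, ARC flags and statistics
 * are deliberately left out, so this is not safe for concurrent use.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ex_hdr {
	uint64_t	key;	/* stands in for the (spa, dva, birth) identity */
	struct ex_hdr	*next;	/* next header in the same bucket */
} ex_hdr_t;

#define	EX_BUCKETS	8	/* power of two, so index = key & (size - 1) */

static ex_hdr_t *ex_table[EX_BUCKETS];

/*
 * Insert hdr at the head of its bucket. If an entry with the same key is
 * already present, return it and leave the table unchanged; otherwise
 * return NULL.
 */
static ex_hdr_t *
ex_hash_insert(ex_hdr_t *hdr)
{
	uint64_t idx = hdr->key & (EX_BUCKETS - 1);

	for (ex_hdr_t *f = ex_table[idx]; f != NULL; f = f->next) {
		if (f->key == hdr->key)
			return (f);
	}
	hdr->next = ex_table[idx];
	ex_table[idx] = hdr;
	return (NULL);
}

/* Unlink hdr, which must be in the table, by walking pointer-to-pointer. */
static void
ex_hash_remove(ex_hdr_t *hdr)
{
	ex_hdr_t **hdrp = &ex_table[hdr->key & (EX_BUCKETS - 1)];

	while (*hdrp != hdr) {
		assert(*hdrp != NULL);
		hdrp = &(*hdrp)->next;
	}
	*hdrp = hdr->next;
	hdr->next = NULL;
}
```

/*
 * Walking a pointer to the link, rather than a pointer to the element,
 * lets removal treat the bucket head and interior elements the same way.
 */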
1703 
1704 /*
1705  * Global data structures and functions for the buf kmem cache.
1706  */
1707 static kmem_cache_t *hdr_full_cache;
1708 static kmem_cache_t *hdr_l2only_cache;
1709 static kmem_cache_t *buf_cache;
1710 
1711 static void
1712 buf_fini(void)
1713 {
1714         int i;
1715 
1716         for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
1717                 mutex_destroy(&buf_hash_table.ht_table[i].lock);
1718         kmem_free(buf_hash_table.ht_table,
1719             (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));


1720         kmem_cache_destroy(hdr_full_cache);
1721         kmem_cache_destroy(hdr_l2only_cache);
1722         kmem_cache_destroy(buf_cache);
1723 }
1724 
1725 /*
1726  * Constructor callback - called when the cache is empty
1727  * and a new buf is requested.
1728  */
1729 /* ARGSUSED */
1730 static int
1731 hdr_full_cons(void *vbuf, void *unused, int kmflag)
1732 {
1733         arc_buf_hdr_t *hdr = vbuf;
1734 
1735         bzero(hdr, HDR_FULL_SIZE);
1736         cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
1737         refcount_create(&hdr->b_l1hdr.b_refcnt);
1738         mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
1739         multilist_link_init(&hdr->b_l1hdr.b_arc_node);

1822 }
1823 
1824 static void
1825 buf_init(void)
1826 {
1827         uint64_t *ct;
1828         uint64_t hsize = 1ULL << 12;
1829         int i, j;
1830 
1831         /*
1832          * The hash table is big enough to fill all of physical memory
1833          * with an average block size of zfs_arc_average_blocksize (default 8K).
1834          * By default, the table will take up
1835          * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1836          */
1837         while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1838                 hsize <<= 1;
1839 retry:
1840         buf_hash_table.ht_mask = hsize - 1;
1841         buf_hash_table.ht_table =
1842             kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
1843         if (buf_hash_table.ht_table == NULL) {
1844                 ASSERT(hsize > (1ULL << 8));
1845                 hsize >>= 1;
1846                 goto retry;
1847         }
1848 
1849         hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1850             0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1851         hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1852             HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1853             NULL, NULL, 0);
1854         buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1855             0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1856 
1857         for (i = 0; i < 256; i++)
1858                 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1859                         *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1860 
1861         for (i = 0; i < hsize; i++) {
1862                 mutex_init(&buf_hash_table.ht_table[i].lock,
1863                     NULL, MUTEX_DEFAULT, NULL);
1864         }
1865 }
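/*
 * Illustrative sketch only, not part of arc.c: the bucket-count sizing loop
 * from buf_init() above as a pure function. ex_hash_table_size() and its
 * parameters are hypothetical; phys_bytes stands in for physmem * PAGESIZE.
 * The sketch does not guard against overflow of the multiplication for
 * extremely large inputs.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Start at 4096 buckets and double until there is at least one bucket per
 * avg_blocksize bytes of physical memory.
 */
static uint64_t
ex_hash_table_size(uint64_t phys_bytes, uint64_t avg_blocksize)
{
	uint64_t hsize = 1ULL << 12;

	assert(avg_blocksize > 0);
	while (hsize * avg_blocksize < phys_bytes)
		hsize <<= 1;
	return (hsize);
}
```

/*
 * With 1 GiB of memory and an 8K average block size this yields 131072
 * buckets, or 1 MiB if each bucket were a single 8-byte pointer. The
 * buckets in this file also carry a lock, so the real footprint is
 * presumably larger than that figure.
 */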
1866 
1867 /* wait until krrp releases the buffer */
1868 static inline void
1869 arc_wait_for_krrp(arc_buf_hdr_t *hdr)
1870 {
1871         while (HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_krrp != 0)
1872                 cv_wait(&hdr->b_l1hdr.b_cv, HDR_LOCK(hdr));
1873 }
1874 
1875 /*
1876  * This is the size that the buf occupies in memory. If the buf is compressed,
1877  * it will correspond to the compressed size. You should use this method of
1878  * getting the buf size unless you explicitly need the logical size.
1879  */
1880 int32_t
1881 arc_buf_size(arc_buf_t *buf)
1882 {
1883         return (ARC_BUF_COMPRESSED(buf) ?
1884             HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
1885 }
1886 
1887 int32_t
1888 arc_buf_lsize(arc_buf_t *buf)
1889 {
1890         return (HDR_GET_LSIZE(buf->b_hdr));
1891 }
1892 
1893 enum zio_compress
1894 arc_get_compression(arc_buf_t *buf)

1910         IMPLY(shared, ARC_BUF_SHARED(buf));
1911         IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
1912 
1913         /*
1914          * It would be nice to assert arc_can_share() too, but the "hdr isn't
1915          * already being shared" requirement prevents us from doing that.
1916          */
1917 
1918         return (shared);
1919 }
1920 
1921 /*
1922  * Free the checksum associated with this header. If there is no checksum, this
1923  * is a no-op.
1924  */
1925 static inline void
1926 arc_cksum_free(arc_buf_hdr_t *hdr)
1927 {
1928         ASSERT(HDR_HAS_L1HDR(hdr));
1929         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1930         if (hdr->b_freeze_cksum != NULL) {
1931                 kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1932                 hdr->b_freeze_cksum = NULL;
1933         }
1934         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1935 }
1936 
1937 /*
1938  * Return true iff at least one of the bufs on hdr is not compressed.
1939  */
1940 static boolean_t
1941 arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1942 {
1943         for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
1944                 if (!ARC_BUF_COMPRESSED(b)) {
1945                         return (B_TRUE);
1946                 }
1947         }
1948         return (B_FALSE);
1949 }
1950 
1951 /*
1952  * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
1953  * matches the checksum that is stored in the hdr. If there is no checksum,
1954  * or if the buf is compressed, this is a no-op.
1955  */
1956 static void
1957 arc_cksum_verify(arc_buf_t *buf)
1958 {
1959         arc_buf_hdr_t *hdr = buf->b_hdr;
1960         zio_cksum_t zc;
1961 
1962         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1963                 return;
1964 
1965         if (ARC_BUF_COMPRESSED(buf)) {
1966                 ASSERT(hdr->b_freeze_cksum == NULL ||
1967                     arc_hdr_has_uncompressed_buf(hdr));
1968                 return;
1969         }
1970 
1971         ASSERT(HDR_HAS_L1HDR(hdr));
1972 
1973         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1974         if (hdr->b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1975                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1976                 return;
1977         }
1978 
1979         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1980         if (!ZIO_CHECKSUM_EQUAL(*hdr->b_freeze_cksum, zc))
1981                 panic("buffer modified while frozen!");
1982         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1983 }
1984 
1985 static boolean_t
1986 arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1987 {
1988         enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1989         boolean_t valid_cksum;
1990 
1991         ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
1992         VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
1993 
1994         /*
1995          * We rely on the blkptr's checksum to determine if the block
1996          * is valid or not. When compressed arc is enabled, the l2arc
1997          * writes the block to the l2arc just as it appears in the pool.
1998          * This allows us to use the blkptr's checksum to validate the
1999          * data that we just read off of the l2arc without having to store
2000          * a separate checksum in the arc_buf_hdr_t. However, if compressed
2001          * arc is disabled, then the data written to the l2arc is always
2002          * uncompressed and won't match the block as it exists in the main
2003          * pool. When this is the case, we must first compress it if it is
2004          * compressed on the main pool before we can validate the checksum.
2005          */
2006         if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
2007                 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
2008                 uint64_t lsize = HDR_GET_LSIZE(hdr);
2009                 uint64_t csize;
2010 
2011                 void *cbuf = zio_buf_alloc(HDR_GET_PSIZE(hdr));
2012                 csize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
2013                 abd_t *cdata = abd_get_from_buf(cbuf, HDR_GET_PSIZE(hdr));
2014                 abd_take_ownership_of_buf(cdata, B_TRUE);
2015 
2016                 ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
2017                 if (csize < HDR_GET_PSIZE(hdr)) {
2018                         /*
2019                          * Compressed blocks are always a multiple of the
2020                          * smallest ashift in the pool. Ideally, we would
2021                          * like to round up the csize to the next
2022                          * spa_min_ashift but that value may have changed
2023                          * since the block was last written. Instead,
2024                          * we rely on the fact that the hdr's psize
2025                          * was set to the psize of the block when it was
2026                          * last written. We set the csize to that value
2027                          * and zero out any part that should not contain
2028                          * data.
2029                          */
2030                         abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize);
2031                         csize = HDR_GET_PSIZE(hdr);
2032                 }
2033                 zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL);
2034         }

2053         return (valid_cksum);
2054 }
2055 
2056 /*
2057  * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
2058  * checksum and attaches it to the buf's hdr so that we can ensure that the buf
2059  * isn't modified later on. If buf is compressed or there is already a checksum
2060  * on the hdr, this is a no-op (we only checksum uncompressed bufs).
2061  */
2062 static void
2063 arc_cksum_compute(arc_buf_t *buf)
2064 {
2065         arc_buf_hdr_t *hdr = buf->b_hdr;
2066 
2067         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
2068                 return;
2069 
2070         ASSERT(HDR_HAS_L1HDR(hdr));
2071 
2072         mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
2073         if (hdr->b_freeze_cksum != NULL) {
2074                 ASSERT(arc_hdr_has_uncompressed_buf(hdr));
2075                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
2076                 return;
2077         } else if (ARC_BUF_COMPRESSED(buf)) {
2078                 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
2079                 return;
2080         }
2081 
2082         ASSERT(!ARC_BUF_COMPRESSED(buf));
2083         hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
2084             KM_SLEEP);
2085         fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
2086             hdr->b_freeze_cksum);
2087         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
2088         arc_buf_watch(buf);
2089 }
2090 
2091 #ifndef _KERNEL
2092 typedef struct procctl {
2093         long cmd;
2094         prwatch_t prwatch;
2095 } procctl_t;
2096 #endif
2097 
2098 /* ARGSUSED */
2099 static void
2100 arc_buf_unwatch(arc_buf_t *buf)
2101 {
2102 #ifndef _KERNEL
2103         if (arc_watch) {
2104                 int result;
2105                 procctl_t ctl;
2106                 ctl.cmd = PCWATCH;

2118 arc_buf_watch(arc_buf_t *buf)
2119 {
2120 #ifndef _KERNEL
2121         if (arc_watch) {
2122                 int result;
2123                 procctl_t ctl;
2124                 ctl.cmd = PCWATCH;
2125                 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
2126                 ctl.prwatch.pr_size = arc_buf_size(buf);
2127                 ctl.prwatch.pr_wflags = WA_WRITE;
2128                 result = write(arc_procfd, &ctl, sizeof (ctl));
2129                 ASSERT3U(result, ==, sizeof (ctl));
2130         }
2131 #endif
2132 }
2133 
2134 static arc_buf_contents_t
2135 arc_buf_type(arc_buf_hdr_t *hdr)
2136 {
2137         arc_buf_contents_t type;
2138 
2139         if (HDR_ISTYPE_METADATA(hdr)) {
2140                 type = ARC_BUFC_METADATA;
2141         } else if (HDR_ISTYPE_DDT(hdr)) {
2142                 type = ARC_BUFC_DDT;
2143         } else {
2144                 type = ARC_BUFC_DATA;
2145         }
2146         VERIFY3U(hdr->b_type, ==, type);
2147         return (type);
2148 }
2149 
2150 boolean_t
2151 arc_is_metadata(arc_buf_t *buf)
2152 {
2153         return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
2154 }
2155 
2156 static uint32_t
2157 arc_bufc_to_flags(arc_buf_contents_t type)
2158 {
2159         switch (type) {
2160         case ARC_BUFC_DATA:
2161                 /* metadata field is 0 if buffer contains normal data */
2162                 return (0);
2163         case ARC_BUFC_METADATA:
2164                 return (ARC_FLAG_BUFC_METADATA);
2165         case ARC_BUFC_DDT:
2166                 return (ARC_FLAG_BUFC_DDT);
2167         default:
2168                 break;
2169         }
2170         panic("undefined ARC buffer type!");
2171         return ((uint32_t)-1);
2172 }
2173 
2174 static arc_buf_contents_t
2175 arc_flags_to_bufc(uint32_t flags)
2176 {
2177         if (flags & ARC_FLAG_BUFC_DDT)
2178                 return (ARC_BUFC_DDT);
2179         if (flags & ARC_FLAG_BUFC_METADATA)
2180                 return (ARC_BUFC_METADATA);
2181         return (ARC_BUFC_DATA);
2182 }
2183 
2184 void
2185 arc_buf_thaw(arc_buf_t *buf)
2186 {
2187         arc_buf_hdr_t *hdr = buf->b_hdr;
2188 
2189         ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
2190         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
2191 
2192         arc_cksum_verify(buf);
2193 
2194         /*
2195          * Compressed buffers do not manipulate the b_freeze_cksum or
2196          * allocate b_thawed.
2197          */
2198         if (ARC_BUF_COMPRESSED(buf)) {
2199                 ASSERT(hdr->b_freeze_cksum == NULL ||
2200                     arc_hdr_has_uncompressed_buf(hdr));
2201                 return;
2202         }
2203 
2204         ASSERT(HDR_HAS_L1HDR(hdr));
2205         arc_cksum_free(hdr);
2206 
2207         mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
2208 #ifdef ZFS_DEBUG
2209         if (zfs_flags & ZFS_DEBUG_MODIFY) {
2210                 if (hdr->b_l1hdr.b_thawed != NULL)
2211                         kmem_free(hdr->b_l1hdr.b_thawed, 1);
2212                 hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
2213         }
2214 #endif
2215 
2216         mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
2217 
2218         arc_buf_unwatch(buf);
2219 }
2220 
2221 void
2222 arc_buf_freeze(arc_buf_t *buf)
2223 {
2224         arc_buf_hdr_t *hdr = buf->b_hdr;
2225         kmutex_t *hash_lock;
2226 
2227         if (!(zfs_flags & ZFS_DEBUG_MODIFY))
2228                 return;
2229 
2230         if (ARC_BUF_COMPRESSED(buf)) {
2231                 ASSERT(hdr->b_freeze_cksum == NULL ||
2232                     arc_hdr_has_uncompressed_buf(hdr));
2233                 return;
2234         }
2235 
2236         hash_lock = HDR_LOCK(hdr);
2237         mutex_enter(hash_lock);
2238 
2239         ASSERT(HDR_HAS_L1HDR(hdr));
2240         ASSERT(hdr->b_freeze_cksum != NULL ||
2241             hdr->b_l1hdr.b_state == arc_anon);
2242         arc_cksum_compute(buf);
2243         mutex_exit(hash_lock);
2244 }
2245 
2246 /*
2247  * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
2248  * the following functions should be used to ensure that the flags are
2249  * updated in a thread-safe way. When manipulating the flags either
2250  * the hash_lock must be held or the hdr must be undiscoverable. This
2251  * ensures that we're not racing with any other threads when updating
2252  * the flags.
2253  */
2254 static inline void
2255 arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
2256 {
2257         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2258         hdr->b_flags |= flags;
2259 }
2260

2312         ASSERT(!ARC_BUF_COMPRESSED(buf));
2313 
2314         for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
2315             from = from->b_next) {
2316                 /* can't use our own data buffer */
2317                 if (from == buf) {
2318                         continue;
2319                 }
2320 
2321                 if (!ARC_BUF_COMPRESSED(from)) {
2322                         bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
2323                         copied = B_TRUE;
2324                         break;
2325                 }
2326         }
2327 
2328         /*
2329          * There were no decompressed bufs, so there should not be a
2330          * checksum on the hdr either.
2331          */
2332         EQUIV(!copied, hdr->b_freeze_cksum == NULL);
2333 
2334         return (copied);
2335 }
2336 
2337 /*
2338  * Given a buf that has a data buffer attached to it, this function will
2339  * efficiently fill the buf with data of the specified compression setting from
2340  * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
2341  * are already sharing a data buf, no copy is performed.
2342  *
2343  * If the buf is marked as compressed but uncompressed data was requested, this
2344  * will allocate a new data buffer for the buf, remove that flag, and fill the
2345  * buf with uncompressed data. You can't request a compressed buf on a hdr with
2346  * uncompressed data, and (since we haven't added support for it yet) if you
2347  * want compressed data your buf must already be marked as compressed and have
2348  * the correct-sized data buffer.
2349  */
2350 static int
2351 arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
2352 {

2391                             arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
2392 
2393                         /* We increased the size of b_data; update overhead */
2394                         ARCSTAT_INCR(arcstat_overhead_size,
2395                             HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
2396                 }
2397 
2398                 /*
2399                  * Regardless of the buf's previous compression settings, it
2400                  * should not be compressed at the end of this function.
2401                  */
2402                 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
2403 
2404                 /*
2405                  * Try copying the data from another buf which already has a
2406                  * decompressed version. If that's not possible, it's time to
2407                  * bite the bullet and decompress the data from the hdr.
2408                  */
2409                 if (arc_buf_try_copy_decompressed_data(buf)) {
2410                         /* Skip byteswapping and checksumming (already done) */
2411                         ASSERT3P(hdr->b_freeze_cksum, !=, NULL);
2412                         return (0);
2413                 } else {
2414                         int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
2415                             hdr->b_l1hdr.b_pabd, buf->b_data,
2416                             HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
2417 
2418                         /*
2419                          * Absent hardware errors or software bugs, this should
2420                          * be impossible, but log it anyway so we can debug it.
2421                          */
2422                         if (error != 0) {
2423                                 zfs_dbgmsg(
2424                                     "hdr %p, compress %d, psize %d, lsize %d",
2425                                     hdr, HDR_GET_COMPRESS(hdr),
2426                                     HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
2427                                 return (SET_ERROR(EIO));
2428                         }
2429                 }
2430         }
2431

2654 
2655                         /*
2656                          * An L1 header always exists here, since if we're
2657                          * moving to some L1-cached state (i.e. not l2c_only or
2658                          * anonymous), we realloc the header to add an L1hdr
2659                          * beforehand.
2660                          */
2661                         ASSERT(HDR_HAS_L1HDR(hdr));
2662                         multilist_insert(new_state->arcs_list[buftype], hdr);
2663 
2664                         if (GHOST_STATE(new_state)) {
2665                                 ASSERT0(bufcnt);
2666                                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2667                                 update_new = B_TRUE;
2668                         }
2669                         arc_evictable_space_increment(hdr, new_state);
2670                 }
2671         }
2672 
2673         ASSERT(!HDR_EMPTY(hdr));
2674         if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) {
2675                 arc_wait_for_krrp(hdr);
2676                 buf_hash_remove(hdr);
2677         }
2678 
2679         /* adjust state sizes (ignore arc_l2c_only) */
2680 
2681         if (update_new && new_state != arc_l2c_only) {
2682                 ASSERT(HDR_HAS_L1HDR(hdr));
2683                 if (GHOST_STATE(new_state)) {
2684                         ASSERT0(bufcnt);
2685 
2686                         /*
2687                          * When moving a header to a ghost state, we first
2688                          * remove all arc buffers. Thus, we'll have a
2689                          * bufcnt of zero, and no arc buffer to use for
2690                          * the reference. As a result, we use the arc
2691                          * header pointer for the reference.
2692                          */
2693                         (void) refcount_add_many(&new_state->arcs_size,
2694                             HDR_GET_LSIZE(hdr), hdr);
2695                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2696                 } else {
2697                         uint32_t buffers = 0;

2770                                         continue;
2771 
2772                                 (void) refcount_remove_many(
2773                                     &old_state->arcs_size, arc_buf_size(buf),
2774                                     buf);
2775                         }
2776                         ASSERT3U(bufcnt, ==, buffers);
2777                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2778                         (void) refcount_remove_many(
2779                             &old_state->arcs_size, arc_hdr_size(hdr), hdr);
2780                 }
2781         }
2782 
2783         if (HDR_HAS_L1HDR(hdr))
2784                 hdr->b_l1hdr.b_state = new_state;
2785 
2786         /*
2787          * L2 headers should never be on the L2 state list since they don't
2788          * have L1 headers allocated.
2789          */
2790         ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]));
2791         ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
2792         ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DDT]));
2793 }
2794 
2795 void
2796 arc_space_consume(uint64_t space, arc_space_type_t type)
2797 {
2798         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2799 
2800         switch (type) {
2801         case ARC_SPACE_DATA:
2802                 ARCSTAT_INCR(arcstat_data_size, space);
2803                 break;
2804         case ARC_SPACE_META:
2805                 ARCSTAT_INCR(arcstat_metadata_size, space);
2806                 break;
2807         case ARC_SPACE_DDT:
2808                 ARCSTAT_INCR(arcstat_ddt_size, space);
2809                 break;
2810         case ARC_SPACE_OTHER:
2811                 ARCSTAT_INCR(arcstat_other_size, space);
2812                 break;
2813         case ARC_SPACE_HDRS:
2814                 ARCSTAT_INCR(arcstat_hdr_size, space);
2815                 break;
2816         case ARC_SPACE_L2HDRS:
2817                 ARCSTAT_INCR(arcstat_l2_hdr_size, space);
2818                 break;
2819         }
2820 
2821         if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT)
2822                 ARCSTAT_INCR(arcstat_meta_used, space);
2823 
2824         atomic_add_64(&arc_size, space);
2825 }
2826 
2827 void
2828 arc_space_return(uint64_t space, arc_space_type_t type)
2829 {
2830         ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2831 
2832         switch (type) {
2833         case ARC_SPACE_DATA:
2834                 ARCSTAT_INCR(arcstat_data_size, -space);
2835                 break;
2836         case ARC_SPACE_META:
2837                 ARCSTAT_INCR(arcstat_metadata_size, -space);
2838                 break;
2839         case ARC_SPACE_DDT:
2840                 ARCSTAT_INCR(arcstat_ddt_size, -space);
2841                 break;
2842         case ARC_SPACE_OTHER:
2843                 ARCSTAT_INCR(arcstat_other_size, -space);
2844                 break;
2845         case ARC_SPACE_HDRS:
2846                 ARCSTAT_INCR(arcstat_hdr_size, -space);
2847                 break;
2848         case ARC_SPACE_L2HDRS:
2849                 ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
2850                 break;
2851         }
2852 
2853         if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT) {
2854                 ASSERT(arc_meta_used >= space);
2855                 if (arc_meta_max < arc_meta_used)
2856                         arc_meta_max = arc_meta_used;
2857                 ARCSTAT_INCR(arcstat_meta_used, -space);
2858         }
2859 
2860         ASSERT(arc_size >= space);
2861         atomic_add_64(&arc_size, -space);
2862 }
2863 
2864 /*
2865  * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2866  * with the hdr's b_pabd.
2867  */
2868 static boolean_t
2869 arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2870 {
2871         /*
2872          * The criteria for sharing a hdr's data are:
2873          * 1. the hdr's compression matches the buf's compression
2874          * 2. the hdr doesn't need to be byteswapped
2875          * 3. the hdr isn't already being shared
2876          * 4. the buf is either compressed or it is the last buf in the hdr list
2877          *
2878          * Criterion #4 maintains the invariant that shared uncompressed
2879          * bufs must be the final buf in the hdr's b_buf list. Reading this, you
2880          * might ask, "if a compressed buf is allocated first, won't that be the
2881          * last thing in the list?", but in that case it's impossible to create

2895         return (buf_compressed == hdr_compressed &&
2896             hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
2897             !HDR_SHARED_DATA(hdr) &&
2898             (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
2899 }
2900 
2901 /*
2902  * Allocate a buf for this hdr. If you care about the data that's in the hdr,
2903  * or if you want a compressed buffer, pass those flags in. Returns 0 if the
2904  * copy was made successfully, or an error code otherwise.
2905  */
2906 static int
2907 arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2908     boolean_t fill, arc_buf_t **ret)
2909 {
2910         arc_buf_t *buf;
2911 
2912         ASSERT(HDR_HAS_L1HDR(hdr));
2913         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2914         VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2915             hdr->b_type == ARC_BUFC_METADATA ||
2916             hdr->b_type == ARC_BUFC_DDT);
2917         ASSERT3P(ret, !=, NULL);
2918         ASSERT3P(*ret, ==, NULL);
2919 
2920         buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2921         buf->b_hdr = hdr;
2922         buf->b_data = NULL;
2923         buf->b_next = hdr->b_l1hdr.b_buf;
2924         buf->b_flags = 0;
2925 
2926         add_reference(hdr, tag);
2927 
2928         /*
2929          * We're about to change the hdr's b_flags. We must either
2930          * hold the hash_lock or be undiscoverable.
2931          */
2932         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2933 
2934         /*
2935          * Only honor requests for compressed bufs if the hdr is actually
2936          * compressed.

2976          */
2977         if (fill) {
2978                 return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
2979         }
2980 
2981         return (0);
2982 }
2983 
2984 static char *arc_onloan_tag = "onloan";
2985 
2986 static inline void
2987 arc_loaned_bytes_update(int64_t delta)
2988 {
2989         atomic_add_64(&arc_loaned_bytes, delta);
2990 
2991         /* assert that it did not wrap around */
2992         ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2993 }
2994 
2995 /*
2996  * Allocates an ARC buf header that's in an evicted & L2-cached state.
2997  * This is used during l2arc reconstruction to make empty ARC buffers
2998  * which circumvent the regular disk->arc->l2arc path and instead come
2999  * into being in the reverse order, i.e. l2arc->arc.
3000  */
3001 static arc_buf_hdr_t *
3002 arc_buf_alloc_l2only(uint64_t load_guid, arc_buf_contents_t type,
3003     l2arc_dev_t *dev, dva_t dva, uint64_t daddr, uint64_t lsize,
3004     uint64_t psize, uint64_t birth, zio_cksum_t cksum, int checksum_type,
3005     enum zio_compress compress, boolean_t arc_compress)
3006 {
3007         arc_buf_hdr_t *hdr;
3008 
3009         if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
3010                 type = ARC_BUFC_METADATA;
3011 
3012         ASSERT(lsize != 0);
3013         hdr = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
3014         ASSERT(HDR_EMPTY(hdr));
3015         ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3016 
3017         hdr->b_spa = load_guid;
3018         hdr->b_type = type;
3019         hdr->b_flags = 0;
3020 
3021         if (arc_compress)
3022                 arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
3023         else
3024                 arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
3025 
3026         HDR_SET_COMPRESS(hdr, compress);
3027 
3028         arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
3029         hdr->b_dva = dva;
3030         hdr->b_birth = birth;
3031         if (checksum_type != ZIO_CHECKSUM_OFF) {
3032                 hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
3033                 bcopy(&cksum, hdr->b_freeze_cksum, sizeof (cksum));
3034         }
3035 
3036         HDR_SET_PSIZE(hdr, psize);
3037         HDR_SET_LSIZE(hdr, lsize);
3038 
3039         hdr->b_l2hdr.b_dev = dev;
3040         hdr->b_l2hdr.b_daddr = daddr;
3041 
3042         return (hdr);
3043 }
3044 
3045 /*
3046  * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
3047  * flight data by arc_tempreserve_space() until they are "returned". Loaned
3048  * buffers must be returned to the arc before they can be used by the DMU or
3049  * freed.
3050  */
3051 arc_buf_t *
3052 arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
3053 {
3054         arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
3055             is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
3056 
3057         arc_loaned_bytes_update(size);
3058 
3059         return (buf);
3060 }
3061 
3062 arc_buf_t *
3063 arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
3064     enum zio_compress compression_type)
3065 {

3114         list_insert_head(l2arc_free_on_write, df);
3115         mutex_exit(&l2arc_free_on_write_mtx);
3116 }
3117 
3118 static void
3119 arc_hdr_free_on_write(arc_buf_hdr_t *hdr)
3120 {
3121         arc_state_t *state = hdr->b_l1hdr.b_state;
3122         arc_buf_contents_t type = arc_buf_type(hdr);
3123         uint64_t size = arc_hdr_size(hdr);
3124 
3125         /* protected by hash lock, if in the hash table */
3126         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
3127                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3128                 ASSERT(state != arc_anon && state != arc_l2c_only);
3129 
3130                 (void) refcount_remove_many(&state->arcs_esize[type],
3131                     size, hdr);
3132         }
3133         (void) refcount_remove_many(&state->arcs_size, size, hdr);
3134         if (type == ARC_BUFC_DDT) {
3135                 arc_space_return(size, ARC_SPACE_DDT);
3136         } else if (type == ARC_BUFC_METADATA) {
3137                 arc_space_return(size, ARC_SPACE_META);
3138         } else {
3139                 ASSERT(type == ARC_BUFC_DATA);
3140                 arc_space_return(size, ARC_SPACE_DATA);
3141         }
3142 
3143         l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
3144 }
3145 
3146 /*
3147  * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
3148  * data buffer, we transfer the refcount ownership to the hdr and update
3149  * the appropriate kstats.
3150  */
3151 static void
3152 arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
3153 {
3154         arc_state_t *state = hdr->b_l1hdr.b_state;
3155 
3156         ASSERT(arc_can_share(hdr, buf));
3157         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
3158         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
3159 
3160         /*
3161          * Start sharing the data buffer. We transfer the
3162          * refcount ownership to the hdr since it always owns
3163          * the refcount whenever an arc_buf_t is shared.
3164          */
3165         refcount_transfer_ownership(&state->arcs_size, buf, hdr);
3166         hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
3167         abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
3168             !HDR_ISTYPE_DATA(hdr));
3169         arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
3170         buf->b_flags |= ARC_BUF_FLAG_SHARED;
3171 
3172         /*
3173          * Since we've transferred ownership to the hdr we need
3174          * to increment its compressed and uncompressed kstats and
3175          * decrement the overhead size.
3176          */
3177         ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
3178         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
3179         ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
3180 }
3181 
3182 static void
3183 arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
3184 {
3185         arc_state_t *state = hdr->b_l1hdr.b_state;
3186 
3187         ASSERT(arc_buf_is_shared(buf));
3188         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);

3341 
3342         /* clean up the buf */
3343         buf->b_hdr = NULL;
3344         kmem_cache_free(buf_cache, buf);
3345 }
3346 
3347 static void
3348 arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr)
3349 {
3350         ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
3351         ASSERT(HDR_HAS_L1HDR(hdr));
3352         ASSERT(!HDR_SHARED_DATA(hdr));
3353 
3354         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
3355         hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
3356         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
3357         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
3358 
3359         ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
3360         ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
3361         arc_update_hit_stat(hdr, B_TRUE);
3362 }
3363 
3364 static void
3365 arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
3366 {
3367         ASSERT(HDR_HAS_L1HDR(hdr));
3368         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
3369 
3370         /*
3371          * If the hdr is currently being written to the l2arc then
3372          * we defer freeing the data by adding it to the l2arc_free_on_write
3373          * list. The l2arc will free the data once it's finished
3374          * writing it to the l2arc device.
3375          */
3376         if (HDR_L2_WRITING(hdr)) {
3377                 arc_hdr_free_on_write(hdr);
3378                 ARCSTAT_BUMP(arcstat_l2_free_on_write);
3379         } else {
3380                 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
3381                     arc_hdr_size(hdr), hdr);
3382         }
3383         hdr->b_l1hdr.b_pabd = NULL;
3384         hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
3385 
3386         ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
3387         ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
3388 }
3389 
3390 static arc_buf_hdr_t *
3391 arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
3392     enum zio_compress compression_type, arc_buf_contents_t type)
3393 {
3394         arc_buf_hdr_t *hdr;
3395 
3396         ASSERT3U(lsize, >, 0);
3397 
3398         if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
3399                 type = ARC_BUFC_METADATA;
3400         VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA ||
3401             type == ARC_BUFC_DDT);
3402 
3403         hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
3404         ASSERT(HDR_EMPTY(hdr));
3405         ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3406         ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
3407         HDR_SET_PSIZE(hdr, psize);
3408         HDR_SET_LSIZE(hdr, lsize);
3409         hdr->b_spa = spa;
3410         hdr->b_type = type;
3411         hdr->b_flags = 0;
3412         arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
3413         arc_hdr_set_compress(hdr, compression_type);
3414 
3415         hdr->b_l1hdr.b_state = arc_anon;
3416         hdr->b_l1hdr.b_arc_access = 0;
3417         hdr->b_l1hdr.b_bufcnt = 0;
3418         hdr->b_l1hdr.b_buf = NULL;
3419 
3420         /*
3421          * Allocate the hdr's buffer. This will contain either
3422          * the compressed or uncompressed data depending on the block
3423          * it references and compressed arc enablement.
3424          */
3425         arc_hdr_alloc_pabd(hdr);

3450 
3451         ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
3452         buf_hash_remove(hdr);
3453 
3454         bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);
3455 
3456         if (new == hdr_full_cache) {
3457                 arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
3458                 /*
3459                  * arc_access and arc_change_state need to be aware that a
3460                  * header has just come out of L2ARC, so we set its state to
3461                  * l2c_only even though it's about to change.
3462                  */
3463                 nhdr->b_l1hdr.b_state = arc_l2c_only;
3464 
3465                 /* Verify previous threads set b_pabd to NULL before freeing */
3466                 ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
3467         } else {
3468                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3469                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
3470                 ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3471 
3472                 /*
3473                  * If we've reached here, we must have been called from
3474                  * arc_evict_hdr(); as such, we should have already been
3475                  * removed from any ghost list we were previously on
3476                  * (which protects us from racing with arc_evict_state),
3477                  * thus no locking is needed during this check.
3478                  */
3479                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
3480 
3481                 /*
3482                  * A buffer must not be moved into the arc_l2c_only
3483                  * state if it's not finished being written out to the
3484                  * l2arc device. Otherwise, the b_l1hdr.b_pabd field
3485                  * might try to be accessed, even though it was removed.
3486                  */
3487                 VERIFY(!HDR_L2_WRITING(hdr));
3488                 VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
3489 
3490 #ifdef ZFS_DEBUG

3555 /*
3556  * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
3557  * for bufs containing metadata.
3558  */
3559 arc_buf_t *
3560 arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
3561     enum zio_compress compression_type)
3562 {
3563         ASSERT3U(lsize, >, 0);
3564         ASSERT3U(lsize, >=, psize);
3565         ASSERT(compression_type > ZIO_COMPRESS_OFF);
3566         ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3567 
3568         arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3569             compression_type, ARC_BUFC_DATA);
3570         ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3571 
3572         arc_buf_t *buf = NULL;
3573         VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3574         arc_buf_thaw(buf);
3575         ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3576 
3577         if (!arc_buf_is_shared(buf)) {
3578                 /*
3579                  * To ensure that the hdr has the correct data in it if we call
3580                  * arc_decompress() on this buf before it's been written to
3581                  * disk, it's easiest if we just set up sharing between the
3582                  * buf and the hdr.
3583                  */
3584                 ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3585                 arc_hdr_free_pabd(hdr);
3586                 arc_share_buf(hdr, buf);
3587         }
3588 
3589         return (buf);
3590 }
3591 
3592 static void
3593 arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
3594 {
3595         l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
3596         l2arc_dev_t *dev = l2hdr->b_dev;
3597         uint64_t psize = arc_hdr_size(hdr);
3598 
3599         ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3600         ASSERT(HDR_HAS_L2HDR(hdr));
3601 
3602         list_remove(&dev->l2ad_buflist, hdr);
3603 
3604         ARCSTAT_INCR(arcstat_l2_psize, -psize);
3605         ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3606 
3607         /*
3608          * l2ad_vdev can be NULL here if we asynchronously evicted it
3609          */
3610         if (dev->l2ad_vdev != NULL)
3611                 vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3612 
3613         (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3614         arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3615 }
3616 
3617 static void
3618 arc_hdr_destroy(arc_buf_hdr_t *hdr)
3619 {
3620         if (HDR_HAS_L1HDR(hdr)) {
3621                 ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3622                     hdr->b_l1hdr.b_bufcnt > 0);
3623                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3624                 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3625         }
3626         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3627         ASSERT(!HDR_IN_HASH_TABLE(hdr));
3628 
3629         if (HDR_HAS_L2HDR(hdr)) {
3630                 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3631                 boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3632 
3633                 /* To avoid racing with L2ARC the header needs to be locked */
3634                 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
3635 
3636                 if (!buflist_held)
3637                         mutex_enter(&dev->l2ad_mtx);
3638 
3639                 /*
3640                  * The L2ARC buflist is now held, so we can safely discard the
3641                  * identity. Discarding it without the buflist held could make
3642                  * L2ARC lock the wrong mutex for the hdr and panic, because
3643                  * the mutex is selected according to the identity.
3644                  */
3645                 if (!HDR_EMPTY(hdr))
3646                         buf_discard_identity(hdr);
3647 
3648                 /*
3649                  * Even though we checked this conditional above, we
3650                  * need to check this again now that we have the
3651                  * l2ad_mtx. This is because we could be racing with
3652                  * another thread calling l2arc_evict() which might have
3653                  * destroyed this header's L2 portion as we were waiting
3654                  * to acquire the l2ad_mtx. If that happens, we don't
3655                  * want to re-destroy the header's L2 portion.
3656                  */
3657                 if (HDR_HAS_L2HDR(hdr))
3658                         arc_hdr_l2hdr_destroy(hdr);
3659 
3660                 if (!buflist_held)
3661                         mutex_exit(&dev->l2ad_mtx);
3662         }
3663 
3664         if (!HDR_EMPTY(hdr))
3665                 buf_discard_identity(hdr);
3666 
3667         if (HDR_HAS_L1HDR(hdr)) {
3668                 arc_cksum_free(hdr);
3669 
3670                 while (hdr->b_l1hdr.b_buf != NULL)
3671                         arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3672 
3673 #ifdef ZFS_DEBUG
3674                 if (hdr->b_l1hdr.b_thawed != NULL) {
3675                         kmem_free(hdr->b_l1hdr.b_thawed, 1);
3676                         hdr->b_l1hdr.b_thawed = NULL;
3677                 }
3678 #endif
3679 
3680                 if (hdr->b_l1hdr.b_pabd != NULL) {
3681                         arc_hdr_free_pabd(hdr);
3682                 }
3683         }
3684 
3685         ASSERT3P(hdr->b_hash_next, ==, NULL);
3686         if (HDR_HAS_L1HDR(hdr)) {

3722  * Evict the arc_buf_hdr that is provided as a parameter. The resultant
3723  * state of the header is dependent on its state prior to entering this
3724  * function. The following transitions are possible:
3725  *
3726  *    - arc_mru -> arc_mru_ghost
3727  *    - arc_mfu -> arc_mfu_ghost
3728  *    - arc_mru_ghost -> arc_l2c_only
3729  *    - arc_mru_ghost -> deleted
3730  *    - arc_mfu_ghost -> arc_l2c_only
3731  *    - arc_mfu_ghost -> deleted
3732  */
3733 static int64_t
3734 arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3735 {
3736         arc_state_t *evicted_state, *state;
3737         int64_t bytes_evicted = 0;
3738 
3739         ASSERT(MUTEX_HELD(hash_lock));
3740         ASSERT(HDR_HAS_L1HDR(hdr));
3741 
3742         arc_wait_for_krrp(hdr);
3743 
3744         state = hdr->b_l1hdr.b_state;
3745         if (GHOST_STATE(state)) {
3746                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3747                 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3748 
3749                 /*
3750                  * l2arc_write_buffers() relies on a header's L1 portion
3751                  * (i.e. its b_pabd field) during its write phase.
3752                  * Thus, we cannot push a header onto the arc_l2c_only
3753                  * state (removing its L1 piece) until the header is
3754                  * done being written to the l2arc.
3755                  */
3756                 if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
3757                         ARCSTAT_BUMP(arcstat_evict_l2_skip);
3758                         return (bytes_evicted);
3759                 }
3760 
3761                 ARCSTAT_BUMP(arcstat_deleted);
3762                 bytes_evicted += HDR_GET_LSIZE(hdr);
3763

4112  * prevents us from trying to evict more from a state's list than
4113  * is "evictable", and to skip evicting altogether when passed a
4114  * negative value for "bytes". In contrast, arc_evict_state() will
4115  * evict everything it can, when passed a negative value for "bytes".
4116  */
4117 static uint64_t
4118 arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
4119     arc_buf_contents_t type)
4120 {
4121         int64_t delta;
4122 
4123         if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
4124                 delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
4125                 return (arc_evict_state(state, spa, delta, type));
4126         }
4127 
4128         return (0);
4129 }
4130 
4131 /*
4132  * Depending on the value of the adjust_ddt arg, evict either DDT (B_TRUE)
4133  * or metadata (B_FALSE) buffers.
4134  * Evict metadata or DDT buffers from the cache, such that arc_meta_used or
4135  * arc_ddt_size is capped by the arc_meta_limit or arc_ddt_limit tunable.
4136  */
4137 static uint64_t
4138 arc_adjust_meta_or_ddt(boolean_t adjust_ddt)
4139 {
4140         uint64_t total_evicted = 0;
4141         int64_t target, over_limit;
4142         arc_buf_contents_t type;
4143 
4144         if (adjust_ddt) {
4145                 over_limit = arc_ddt_size - arc_ddt_limit;
4146                 type = ARC_BUFC_DDT;
4147         } else {
4148                 over_limit = arc_meta_used - arc_meta_limit;
4149                 type = ARC_BUFC_METADATA;
4150         }
4151 
4152         /*
4153          * If we're over the limit, we want to evict enough
4154          * to get back under the limit. We don't want to
4155          * evict so much that we drop the MRU below arc_p, though. If
4156          * we're over the meta limit more than we're over arc_p, we
4157          * evict some from the MRU here, and some from the MFU below.
4158          */
4159         target = MIN(over_limit,
4160             (int64_t)(refcount_count(&arc_anon->arcs_size) +
4161             refcount_count(&arc_mru->arcs_size) - arc_p));
4162 
4163         total_evicted += arc_adjust_impl(arc_mru, 0, target, type);
4164 
4165         over_limit = adjust_ddt ? arc_ddt_size - arc_ddt_limit :
4166             arc_meta_used - arc_meta_limit;
4167 
4168         /*
4169          * Similar to the above, we want to evict enough bytes to get us
4170          * below the meta limit, but not so much as to drop us below the
4171          * space allotted to the MFU (which is defined as arc_c - arc_p).
4172          */
4173         target = MIN(over_limit,
4174             (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
4175 
4176         total_evicted += arc_adjust_impl(arc_mfu, 0, target, type);
4177 
4178         return (total_evicted);
4179 }
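The two eviction targets computed in arc_adjust_meta_or_ddt() can be sketched as plain arithmetic. This is an illustration under assumptions: mru_target() and mfu_target() are hypothetical names, and the sizes are passed in as signed integers instead of being read from the ARC state. A non-positive result means nothing would be evicted from that list.

```c
#include <stdint.h>

static int64_t
min64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/* MRU: get back under the limit, but do not drop the MRU below arc_p. */
static int64_t
mru_target(int64_t over_limit, int64_t anon_size, int64_t mru_size,
    int64_t arc_p)
{
	return (min64(over_limit, anon_size + mru_size - arc_p));
}

/* MFU: get back under the limit, but keep at least (arc_c - arc_p). */
static int64_t
mfu_target(int64_t over_limit, int64_t mfu_size, int64_t arc_c,
    int64_t arc_p)
{
	return (min64(over_limit, mfu_size - (arc_c - arc_p)));
}
```

With over_limit = 300, anon = 100, mru = 500 and arc_p = 450, the MRU target is 150, so the remaining excess has to come from the MFU pass.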
4180 
4181 /*
4182  * Return the type of the oldest buffer in the given arc state
4183  *
4184  * This function will select a random sublist of each type (ARC_BUFC_DATA,
4185  * ARC_BUFC_METADATA, and ARC_BUFC_DDT). The tail of each sublist
4186  * is compared, and the type which contains the "older" buffer will be
4187  * returned.
4188  */
4189 static arc_buf_contents_t
4190 arc_adjust_type(arc_state_t *state)
4191 {
4192         multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
4193         multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
4194         multilist_t *ddt_ml = state->arcs_list[ARC_BUFC_DDT];
4195         int data_idx = multilist_get_random_index(data_ml);
4196         int meta_idx = multilist_get_random_index(meta_ml);
4197         int ddt_idx = multilist_get_random_index(ddt_ml);
4198         multilist_sublist_t *data_mls;
4199         multilist_sublist_t *meta_mls;
4200         multilist_sublist_t *ddt_mls;
4201         arc_buf_contents_t type = ARC_BUFC_DATA; /* silence compiler warning */
4202         arc_buf_hdr_t *data_hdr;
4203         arc_buf_hdr_t *meta_hdr;
4204         arc_buf_hdr_t *ddt_hdr;
4205         clock_t oldest;
4206 
4207         /*
4208          * We keep the sublist lock until we're finished, to prevent
4209          * the headers from being destroyed via arc_evict_state().
4210          */
4211         data_mls = multilist_sublist_lock(data_ml, data_idx);
4212         meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
4213         ddt_mls = multilist_sublist_lock(ddt_ml, ddt_idx);
4214 
4215         /*
4216          * These three loops are to ensure we skip any markers that
4217          * might be at the tail of the lists due to arc_evict_state().
4218          */
4219 
4220         for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
4221             data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
4222                 if (data_hdr->b_spa != 0)
4223                         break;
4224         }
4225 
4226         for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
4227             meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
4228                 if (meta_hdr->b_spa != 0)
4229                         break;
4230         }
4231 
4232         for (ddt_hdr = multilist_sublist_tail(ddt_mls); ddt_hdr != NULL;
4233             ddt_hdr = multilist_sublist_prev(ddt_mls, ddt_hdr)) {
4234                 if (ddt_hdr->b_spa != 0)
4235                         break;
4236         }
4237 
4238         if (data_hdr == NULL && meta_hdr == NULL && ddt_hdr == NULL) {
4239                 type = ARC_BUFC_DATA;
4240         } else if (data_hdr != NULL && meta_hdr != NULL && ddt_hdr != NULL) {
4241                 /* The headers can't be on the sublist without an L1 header */
4242                 ASSERT(HDR_HAS_L1HDR(data_hdr));
4243                 ASSERT(HDR_HAS_L1HDR(meta_hdr));
4244                 ASSERT(HDR_HAS_L1HDR(ddt_hdr));
4245 
4246                 oldest = data_hdr->b_l1hdr.b_arc_access;
4247                 type = ARC_BUFC_DATA;
4248                 if (oldest > meta_hdr->b_l1hdr.b_arc_access) {
4249                         oldest = meta_hdr->b_l1hdr.b_arc_access;
4250                         type = ARC_BUFC_METADATA;
4251                 }
4252                 if (oldest > ddt_hdr->b_l1hdr.b_arc_access) {
4253                         type = ARC_BUFC_DDT;
4254                 }
4255         } else if (data_hdr == NULL && ddt_hdr == NULL) {
4256                 ASSERT3P(meta_hdr, !=, NULL);
4257                 type = ARC_BUFC_METADATA;
4258         } else if (meta_hdr == NULL && ddt_hdr == NULL) {
4259                 ASSERT3P(data_hdr, !=, NULL);
4260                 type = ARC_BUFC_DATA;
4261         } else if (meta_hdr == NULL && data_hdr == NULL) {
4262                 ASSERT3P(ddt_hdr, !=, NULL);
4263                 type = ARC_BUFC_DDT;
4264         } else if (data_hdr != NULL && ddt_hdr != NULL) {
4265                 ASSERT3P(meta_hdr, ==, NULL);
4266 
4267                 /* The headers can't be on the sublist without an L1 header */
4268                 ASSERT(HDR_HAS_L1HDR(data_hdr));
4269                 ASSERT(HDR_HAS_L1HDR(ddt_hdr));
4270 
4271                 if (data_hdr->b_l1hdr.b_arc_access <
4272                     ddt_hdr->b_l1hdr.b_arc_access) {
4273                         type = ARC_BUFC_DATA;
4274                 } else {
4275                         type = ARC_BUFC_DDT;
4276                 }
4277         } else if (meta_hdr != NULL && ddt_hdr != NULL) {
4278                 ASSERT3P(data_hdr, ==, NULL);
4279 
4280                 /* The headers can't be on the sublist without an L1 header */
4281                 ASSERT(HDR_HAS_L1HDR(meta_hdr));
4282                 ASSERT(HDR_HAS_L1HDR(ddt_hdr));
4283 
4284                 if (meta_hdr->b_l1hdr.b_arc_access <
4285                     ddt_hdr->b_l1hdr.b_arc_access) {
4286                         type = ARC_BUFC_METADATA;
4287                 } else {
4288                         type = ARC_BUFC_DDT;
4289                 }
4290         } else if (meta_hdr != NULL && data_hdr != NULL) {
4291                 ASSERT3P(ddt_hdr, ==, NULL);
4292 
4293                 /* The headers can't be on the sublist without an L1 header */
4294                 ASSERT(HDR_HAS_L1HDR(data_hdr));
4295                 ASSERT(HDR_HAS_L1HDR(meta_hdr));
4296 
4297                 if (data_hdr->b_l1hdr.b_arc_access <
4298                     meta_hdr->b_l1hdr.b_arc_access) {
4299                         type = ARC_BUFC_DATA;
4300                 } else {
4301                         type = ARC_BUFC_METADATA;
4302                 }
4303         } else {
4304                 /* should never get here */
4305                 ASSERT(0);
4306         }
4307 
4308         multilist_sublist_unlock(ddt_mls);
4309         multilist_sublist_unlock(meta_mls);
4310         multilist_sublist_unlock(data_mls);
4311 
4312         return (type);
4313 }
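The selection performed by arc_adjust_type() amounts to "pick the type whose tail buffer has the earliest access time, ignoring empty sublists". The sketch below expresses that as a loop. It is a simplification under assumptions: bufc_t and oldest_type() are hypothetical names, NULL marks an empty sublist, and ties go to the lower-numbered type, which may differ from the if/else chain above when two access times are equal.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { BUFC_DATA, BUFC_METADATA, BUFC_DDT, BUFC_NUMTYPES } bufc_t;

/*
 * access[t] points at the access time of the tail buffer for type t,
 * or is NULL when that sublist has no real (non-marker) buffer.
 */
static bufc_t
oldest_type(const int64_t *access[BUFC_NUMTYPES])
{
	bufc_t best = BUFC_DATA;	/* default when everything is empty */
	const int64_t *oldest = NULL;

	for (int t = BUFC_DATA; t < BUFC_NUMTYPES; t++) {
		if (access[t] == NULL)
			continue;
		/* A strictly smaller timestamp means an older buffer. */
		if (oldest == NULL || *access[t] < *oldest) {
			oldest = access[t];
			best = (bufc_t)t;
		}
	}

	return (best);
}
```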
4314 
4315 /*
4316  * Evict buffers from the cache, such that arc_size is capped by arc_c.
4317  */
4318 static uint64_t
4319 arc_adjust(void)
4320 {
4321         uint64_t total_evicted = 0;
4322         uint64_t bytes;
4323         int64_t target;
4324 
4325         /*
4326          * If we're over arc_meta_limit, we want to correct that before
4327          * potentially evicting data buffers below.
4328          */
4329         total_evicted += arc_adjust_meta_or_ddt(B_FALSE);
4330 
4331         /*
4332          * If we're over arc_ddt_limit, we want to correct that before
4333          * potentially evicting data buffers below.
4334          */
4335         total_evicted += arc_adjust_meta_or_ddt(B_TRUE);
4336 
4337         /*
4338          * Adjust MRU size
4339          *
4340          * If we're over the target cache size, we want to evict enough
4341          * from the list to get back to our target size. We don't want
4342          * to evict too much from the MRU, such that it drops below
4343          * arc_p. So, if we're over our target cache size more than
4344          * the MRU is over arc_p, we'll evict enough to get back to
4345          * arc_p here, and then evict more from the MFU below.
4346          */
4347         target = MIN((int64_t)(arc_size - arc_c),
4348             (int64_t)(refcount_count(&arc_anon->arcs_size) +
4349             refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
4350 
4351         /*
4352          * If we're below arc_meta_min, always prefer to evict data.
4353          * Otherwise, try to satisfy the requested number of bytes to
4354          * evict from the type which contains older buffers; in an
4355          * effort to keep newer buffers in the cache regardless of their
4356          * type. If we cannot satisfy the number of bytes from this
4357          * type, spill over into the next type.
4358          */
4359         if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
4360             arc_meta_used > arc_meta_min) {
4361                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
4362                 total_evicted += bytes;
4363 
4364                 /*
4365                  * If we couldn't evict our target number of bytes from
4366                  * metadata, we try to get the rest from data.
4367                  */
4368                 target -= bytes;
4369 
4370                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
4371                 total_evicted += bytes;
4372         } else {
4373                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
4374                 total_evicted += bytes;
4375 
4376                 /*
4377                  * If we couldn't evict our target number of bytes from
4378                  * data, we try to get the rest from metadata.
4379                  */
4380                 target -= bytes;
4381 
4382                 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
4383                 total_evicted += bytes;
4384         }
4385 
4386         /*
4387          * If we couldn't evict our target number of bytes from
4388          * data and metadata, we try to get the rest from ddt.
4389          */
4390         target -= bytes;
4391         total_evicted +=
4392             arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DDT);
4393 
4394         /*
4395          * Adjust MFU size
4396          *
4397          * Now that we've tried to evict enough from the MRU to get its
4398          * size back to arc_p, if we're still above the target cache
4399          * size, we evict the rest from the MFU.
4400          */
4401         target = arc_size - arc_c;
4402 
4403         if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
4404             arc_meta_used > arc_meta_min) {
4405                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
4406                 total_evicted += bytes;
4407 
4408                 /*
4409                  * If we couldn't evict our target number of bytes from
4410                  * metadata, we try to get the rest from data.
4411                  */
4412                 target -= bytes;
4413 
4414                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
4415                 total_evicted += bytes;
4416         } else {
4417                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
4418                 total_evicted += bytes;
4419 
4420                 /*
4421                  * If we couldn't evict our target number of bytes from
4422                  * data, we try to get the rest from metadata.
4423                  */
4424                 target -= bytes;
4425 
4426                 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
4427                 total_evicted += bytes;
4428         }
4429 
4430         /*
4431          * If we couldn't evict our target number of bytes from
4432          * data and metadata, we try to get the rest from ddt.
4433          */
4434         target -= bytes;
4435         total_evicted +=
4436             arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DDT);
4437 
4438         /*
4439          * Adjust ghost lists
4440          *
4441          * In addition to the above, the ARC also defines target values
4442          * for the ghost lists. The sum of the mru list and mru ghost
4443          * list should never exceed the target size of the cache, and
4444          * the sum of the mru list, mfu list, mru ghost list, and mfu
4445          * ghost list should never exceed twice the target size of the
4446          * cache. The following logic enforces these limits on the ghost
4447          * caches, and evicts from them as needed.
4448          */
4449         target = refcount_count(&arc_mru->arcs_size) +
4450             refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
4451 
4452         bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
4453         total_evicted += bytes;
4454 
4455         target -= bytes;
4456 
4457         bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
4458         total_evicted += bytes;
4459 
4460         target -= bytes;
4461 
4462         total_evicted +=
4463             arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DDT);
4464 
4465         /*
4466          * We assume the sum of the mru list and mfu list is less than
4467          * or equal to arc_c (we enforced this above), which means we
4468          * can use the simpler of the two equations below:
4469          *
4470          *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
4471          *                  mru ghost + mfu ghost <= arc_c
4472          */
4473         target = refcount_count(&arc_mru_ghost->arcs_size) +
4474             refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
4475 
4476         bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
4477         total_evicted += bytes;
4478 
4479         target -= bytes;
4480 
4481         bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
4482         total_evicted += bytes;
4483 
4484         target -= bytes;
4485 
4486         total_evicted +=
4487             arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DDT);
4488 
4489         return (total_evicted);
4490 }
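The "spill over into the next type" pattern used throughout arc_adjust() can be sketched independently of the ARC. The example below is an illustration under assumptions: evict_from() and evict_cascade() are hypothetical helpers, and each pool is just a byte counter. The intended accounting is that every step adds only its own evicted bytes to the total and subtracts only its own bytes from the remaining target.

```c
#include <stdint.h>

#define	NTYPES	3

/* Evict up to "want" bytes from one pool; return the bytes evicted. */
static uint64_t
evict_from(uint64_t *pool, int64_t want)
{
	uint64_t got;

	if (want <= 0 || *pool == 0)
		return (0);
	got = ((uint64_t)want < *pool) ? (uint64_t)want : *pool;
	*pool -= got;
	return (got);
}

/*
 * Try each pool in the given order, carrying the unmet part of the
 * target over to the next pool.
 */
static uint64_t
evict_cascade(uint64_t pools[NTYPES], const int order[NTYPES],
    int64_t target)
{
	uint64_t total = 0;

	for (int i = 0; i < NTYPES; i++) {
		uint64_t bytes = evict_from(&pools[order[i]], target);

		total += bytes;			/* count each step once */
		target -= (int64_t)bytes;	/* only what is still owed */
	}

	return (total);
}
```

With pools of 100, 50 and 200 bytes, the order {1, 0, 2} and a target of 220, the cascade takes 50, then 100, then 70 bytes, and the third pool is left with 130.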
4491 
4492 typedef struct arc_async_flush_data {
4493         uint64_t        aaf_guid;
4494         boolean_t       aaf_retry;
4495 } arc_async_flush_data_t;
4496 
4497 static taskq_t *arc_flush_taskq;
4498 
4499 static void
4500 arc_flush_impl(uint64_t guid, boolean_t retry)
4501 {
4502         arc_buf_contents_t arcs;
4503 
4504         for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
4505                 (void) arc_flush_state(arc_mru, guid, arcs, retry);
4506                 (void) arc_flush_state(arc_mfu, guid, arcs, retry);
4507                 (void) arc_flush_state(arc_mru_ghost, guid, arcs, retry);
4508                 (void) arc_flush_state(arc_mfu_ghost, guid, arcs, retry);
4509         }
4510 }
4511 
4512 static void
4513 arc_flush_task(void *arg)
4514 {
4515         arc_async_flush_data_t *aaf = (arc_async_flush_data_t *)arg;
4516         arc_flush_impl(aaf->aaf_guid, aaf->aaf_retry);
4517         kmem_free(aaf, sizeof (arc_async_flush_data_t));
4518 }
4519 
4520 boolean_t zfs_fastflush = B_TRUE;
4521 
4522 void
4523 arc_flush(spa_t *spa, boolean_t retry)
4524 {
4525         uint64_t guid = 0;
4526         boolean_t async_flush = (spa != NULL ? zfs_fastflush : FALSE);
4527         arc_async_flush_data_t *aaf = NULL;
4528 
4529         /*
4530          * If retry is B_TRUE, a spa must not be specified since we have
4531          * no good way to determine if all of a spa's buffers have been
4532          * evicted from an arc state.
4533          */
4534         ASSERT(!retry || spa == NULL);
4535 
4536         if (spa != NULL) {
4537                 guid = spa_load_guid(spa);
4538                 if (async_flush) {
4539                         aaf = kmem_alloc(sizeof (arc_async_flush_data_t),
4540                             KM_SLEEP);
4541                         aaf->aaf_guid = guid;
4542                         aaf->aaf_retry = retry;
4543                 }
4544         }
4545 
4546         /*
4547          * Try to flush per-spa remaining ARC ghost buffers asynchronously
4548          * while a pool is being closed.
4549          * An ARC buffer is bound to a spa only by guid, so a buffer can
4550          * exist even after the pool is gone. If asynchronous flushing
4551          * fails, we fall back to a regular (synchronous) one.
4552          * NOTE: If asynchronous flushing has not yet finished when the pool
4553          * is imported again, it is not a problem, even when the guids before
4554          * and after export/import are the same. We can evict only unreferenced
4555          * buffers; others are skipped.
4556          */
4557         if (!async_flush || (taskq_dispatch(arc_flush_taskq, arc_flush_task,
4558             aaf, TQ_NOSLEEP) == NULL)) {
4559                 arc_flush_impl(guid, retry);
4560                 if (async_flush)
4561                         kmem_free(aaf, sizeof (arc_async_flush_data_t));
4562         }
4563 }
4564 
4565 void
4566 arc_shrink(int64_t to_free)
4567 {
4568         if (arc_c > arc_c_min) {
4569 
4570                 if (arc_c > arc_c_min + to_free)
4571                         atomic_add_64(&arc_c, -to_free);
4572                 else
4573                         arc_c = arc_c_min;
4574 
4575                 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
4576                 if (arc_c > arc_size)
4577                         arc_c = MAX(arc_size, arc_c_min);
4578                 if (arc_p > arc_c)
4579                         arc_p = (arc_c >> 1);
4580                 ASSERT(arc_c >= arc_c_min);
4581                 ASSERT((int64_t)arc_p >= 0);
4582         }
4583 
4584         if (arc_size > arc_c)
4585                 (void) arc_adjust();
4586 }
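The target-size arithmetic in arc_shrink() can be followed with small numbers. The sketch below is a single-threaded illustration under assumptions: arc_sizes_t and shrink_targets() are hypothetical, the atomic updates are replaced by plain assignments, and the final call to arc_adjust() is left out.

```c
#include <stdint.h>

typedef struct {
	uint64_t c;	/* target cache size (arc_c) */
	uint64_t c_min;	/* lower bound for c (arc_c_min) */
	uint64_t p;	/* target MRU size (arc_p) */
	uint64_t size;	/* current cache size (arc_size) */
} arc_sizes_t;

static void
shrink_targets(arc_sizes_t *a, uint64_t to_free, int shrink_shift)
{
	if (a->c <= a->c_min)
		return;		/* already at the floor */

	/* Lower c by to_free, but never below c_min. */
	if (a->c > a->c_min + to_free)
		a->c -= to_free;
	else
		a->c = a->c_min;

	/* Shrink p by a fixed fraction of itself. */
	a->p -= a->p >> shrink_shift;

	/* Do not leave c above what is actually cached. */
	if (a->c > a->size)
		a->c = (a->size > a->c_min) ? a->size : a->c_min;

	/* Keep p within c. */
	if (a->p > a->c)
		a->p = a->c >> 1;
}
```

Starting from c = 1000, c_min = 200, p = 640 and size = 900, freeing 300 with a shift of 7 gives c = 700 and p = 635.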
4587 
4588 typedef enum free_memory_reason_t {
4589         FMR_UNKNOWN,
4590         FMR_NEEDFREE,
4591         FMR_LOTSFREE,
4592         FMR_SWAPFS_MINFREE,
4593         FMR_PAGES_PP_MAXIMUM,
4594         FMR_HEAP_ARENA,
4595         FMR_ZIO_ARENA,
4596 } free_memory_reason_t;
4597 
4598 int64_t last_free_memory;
4599 free_memory_reason_t last_free_reason;
4600 
4601 /*
4602  * Additional reserve of pages for pp_reserve.
4603  */
4604 int64_t arc_pages_pp_reserve = 64;

4728  * is under memory pressure and that the arc should adjust accordingly.
4729  */
4730 static boolean_t
4731 arc_reclaim_needed(void)
4732 {
4733         return (arc_available_memory() < 0);
4734 }
4735 
4736 static void
4737 arc_kmem_reap_now(void)
4738 {
4739         size_t                  i;
4740         kmem_cache_t            *prev_cache = NULL;
4741         kmem_cache_t            *prev_data_cache = NULL;
4742         extern kmem_cache_t     *zio_buf_cache[];
4743         extern kmem_cache_t     *zio_data_buf_cache[];
4744         extern kmem_cache_t     *range_seg_cache;
4745         extern kmem_cache_t     *abd_chunk_cache;
4746 
4747 #ifdef _KERNEL
4748         if (arc_meta_used >= arc_meta_limit || arc_ddt_size >= arc_ddt_limit) {
4749                 /*
4750                  * We are exceeding our meta-data or DDT cache limit.
4751                  * Purge some DNLC entries to release holds on meta-data/DDT.
4752                  */
4753                 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4754         }
4755 #if defined(__i386)
4756         /*
4757          * Reclaim unused memory from all kmem caches.
4758          */
4759         kmem_reap();
4760 #endif
4761 #endif
4762 
4763         /*
4764          * If a kmem reap is already active, don't schedule more.  We must
4765          * check for this because kmem_cache_reap_soon() won't actually
4766          * block on the cache being reaped (this is to prevent callers from
4767          * becoming implicitly blocked by a system-wide kmem reap -- which,
4768          * on a system with many, many full magazines, can take minutes).
4769          */
4770         if (kmem_cache_reap_active())
4771                 return;

4887 #endif
4888                                 arc_shrink(to_free);
4889                         }
4890                 } else if (free_memory < arc_c >> arc_no_grow_shift) {
4891                         arc_no_grow = B_TRUE;
4892                 } else if (gethrtime() >= growtime) {
4893                         arc_no_grow = B_FALSE;
4894                 }
4895 
4896                 mutex_enter(&arc_reclaim_lock);
4897 
4898                 /*
4899                  * If evicted is zero, we couldn't evict anything via
4900                  * arc_adjust(). This could be due to hash lock
4901                  * collisions, but more likely due to the majority of
4902                  * arc buffers being unevictable. Therefore, even if
4903                  * arc_size is above arc_c, another pass is unlikely to
4904                  * be helpful and could potentially cause us to enter an
4905                  * infinite loop.
4906                  */
4907                 if (arc_size <= arc_c || evicted == 0) {
4908                         /*
4909                          * We're either no longer overflowing, or we
4910                          * can't evict anything more, so we should wake
4911                          * up any threads before we go to sleep.
4912                          */
4913                         cv_broadcast(&arc_reclaim_waiters_cv);
4914 
4915                         /*
4916                          * Block until signaled, or until one second has
4917                          * elapsed (we might need to perform
4918                          * arc_kmem_reap_now() even if we aren't signaled).
4919                          */
4920                         CALLB_CPR_SAFE_BEGIN(&cpr);
4921                         (void) cv_timedwait_hires(&arc_reclaim_thread_cv,
4922                             &arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
4923                         CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock);
4924                 }
4925         }
4926 
4927         arc_reclaim_thread_exit = B_FALSE;

4969                 delta = MIN(bytes * mult, arc_p);
4970                 arc_p = MAX(arc_p_min, arc_p - delta);
4971         }
4972         ASSERT((int64_t)arc_p >= 0);
4973 
4974         if (arc_reclaim_needed()) {
4975                 cv_signal(&arc_reclaim_thread_cv);
4976                 return;
4977         }
4978 
4979         if (arc_no_grow)
4980                 return;
4981 
4982         if (arc_c >= arc_c_max)
4983                 return;
4984 
4985         /*
4986          * If we're within (2 * maxblocksize) bytes of the target
4987          * cache size, increment the target cache size
4988          */
4989         if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
4990                 atomic_add_64(&arc_c, (int64_t)bytes);
4991                 if (arc_c > arc_c_max)
4992                         arc_c = arc_c_max;
4993                 else if (state == arc_anon)
4994                         atomic_add_64(&arc_p, (int64_t)bytes);
4995                 if (arc_p > arc_c)
4996                         arc_p = arc_c;
4997         }
4998         ASSERT((int64_t)arc_p >= 0);
4999 }
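The growth step at the end of arc_adapt() can be sketched as follows. This is a reduced illustration under assumptions: arc_targets_t and grow_targets() are hypothetical, updates are not atomic, and the earlier checks (reclaim needed, arc_no_grow) are omitted. The proximity test is written as an addition so the sketch cannot underflow when c is smaller than two blocks.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	uint64_t c;	/* target cache size (arc_c) */
	uint64_t c_max;	/* upper bound for c (arc_c_max) */
	uint64_t p;	/* target MRU size (arc_p) */
	uint64_t size;	/* current cache size (arc_size) */
} arc_targets_t;

static void
grow_targets(arc_targets_t *a, uint64_t bytes, bool anon,
    uint64_t max_block)
{
	if (a->c >= a->c_max)
		return;

	/* Grow only when size is within two blocks of the target. */
	if (a->size + 2 * max_block > a->c) {
		a->c += bytes;
		if (a->c > a->c_max)
			a->c = a->c_max;
		else if (anon)
			a->p += bytes;
		if (a->p > a->c)
			a->p = a->c;
	}
}
```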
5000 
5001 /*
5002  * Check if arc_size has grown past our upper threshold, determined by
5003  * zfs_arc_overflow_shift.
5004  */
5005 static boolean_t
5006 arc_is_overflowing(void)
5007 {
5008         /* Always allow at least one block of overflow */
5009         uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
5010             arc_c >> zfs_arc_overflow_shift);
5011 
5012         return (arc_size >= arc_c + overflow);
5013 }
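The overflow threshold in arc_is_overflowing() is the larger of one maximum-size block and a fraction of the target size. A minimal sketch, assuming a hypothetical is_overflowing() that takes the block size and shift as parameters instead of reading SPA_MAXBLOCKSIZE and zfs_arc_overflow_shift:

```c
#include <stdbool.h>
#include <stdint.h>

static bool
is_overflowing(uint64_t size, uint64_t c, uint64_t max_block, int shift)
{
	uint64_t by_shift = c >> shift;

	/* Always allow at least one block of overflow. */
	uint64_t overflow = (max_block > by_shift) ? max_block : by_shift;

	return (size >= c + overflow);
}
```

For a 1 GiB target with a shift of 8 and a 16 MiB block size (values chosen for illustration), the fraction is only 4 MiB, so the block size sets the threshold and the cache counts as overflowing once it reaches 1 GiB + 16 MiB.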
5014 
5015 static abd_t *
5016 arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
5017 {
5018         arc_buf_contents_t type = arc_buf_type(hdr);
5019 
5020         arc_get_data_impl(hdr, size, tag);
5021         if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
5022                 return (abd_alloc(size, B_TRUE));
5023         } else {
5024                 ASSERT(type == ARC_BUFC_DATA);
5025                 return (abd_alloc(size, B_FALSE));
5026         }
5027 }
5028 
5029 static void *
5030 arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
5031 {
5032         arc_buf_contents_t type = arc_buf_type(hdr);
5033 
5034         arc_get_data_impl(hdr, size, tag);
5035         if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
5036                 return (zio_buf_alloc(size));
5037         } else {
5038                 ASSERT(type == ARC_BUFC_DATA);
5039                 return (zio_data_buf_alloc(size));
5040         }
5041 }
5042 
5043 /*
5044  * Allocate a block and return it to the caller. If we are hitting the
5045  * hard limit for the cache size, we must sleep, waiting for the eviction
5046  * thread to catch up. If we're past the target size but below the hard
5047  * limit, we'll only signal the reclaim thread and continue on.
5048  */
5049 static void
5050 arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
5051 {
5052         arc_state_t *state = hdr->b_l1hdr.b_state;
5053         arc_buf_contents_t type = arc_buf_type(hdr);
5054 
5055         arc_adapt(size, state);

5074                 /*
5075                  * Now that we've acquired the lock, we may no longer be
5076                  * over the overflow limit, let's check.
5077                  *
5078                  * We're ignoring the case of spurious wake ups. If that
5079                  * were to happen, it'd let this thread consume an ARC
5080                  * buffer before it should have (i.e. before we're under
5081                  * the overflow limit and were signalled by the reclaim
5082                  * thread). As long as that is a rare occurrence, it
5083                  * shouldn't cause any harm.
5084                  */
5085                 if (arc_is_overflowing()) {
5086                         cv_signal(&arc_reclaim_thread_cv);
5087                         cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
5088                 }
5089 
5090                 mutex_exit(&arc_reclaim_lock);
5091         }
5092 
5093         VERIFY3U(hdr->b_type, ==, type);
5094         if (type == ARC_BUFC_DDT) {
5095                 arc_space_consume(size, ARC_SPACE_DDT);
5096         } else if (type == ARC_BUFC_METADATA) {
5097                 arc_space_consume(size, ARC_SPACE_META);
5098         } else {
5099                 arc_space_consume(size, ARC_SPACE_DATA);
5100         }
5101 
5102         /*
5103          * Update the state size.  Note that ghost states have a
5104          * "ghost size" and so don't need to be updated.
5105          */
5106         if (!GHOST_STATE(state)) {
5107 
5108                 (void) refcount_add_many(&state->arcs_size, size, tag);
5109 
5110                 /*
5111                  * If this is reached via arc_read, the link is
5112                  * protected by the hash lock. If reached via
5113                  * arc_buf_alloc, the header should not be accessed by
5114                  * any other thread. And, if reached via arc_read_done,
5115                  * the hash lock will protect it if it's found in the
5116                  * hash table; otherwise no other thread should be
5117                  * trying to [add|remove]_reference it.
5118                  */
5119                 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
5120                         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5121                         (void) refcount_add_many(&state->arcs_esize[type],
5122                             size, tag);
5123                 }
5124 
5125                 /*
5126                  * If we are growing the cache, and we are adding anonymous
5127                  * data, and we have outgrown arc_p, update arc_p
5128                  */
5129                 if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
5130                     (refcount_count(&arc_anon->arcs_size) +
5131                     refcount_count(&arc_mru->arcs_size) > arc_p))
5132                         arc_p = MIN(arc_c, arc_p + size);
5133         }
5134 }
5135 
5136 static void
5137 arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
5138 {
5139         arc_free_data_impl(hdr, size, tag);
5140         abd_free(abd);
5141 }
5142 
5143 static void
5144 arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
5145 {
5146         arc_buf_contents_t type = arc_buf_type(hdr);
5147 
5148         arc_free_data_impl(hdr, size, tag);
5149         if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
5150                 zio_buf_free(buf, size);
5151         } else {
5152                 ASSERT(type == ARC_BUFC_DATA);
5153                 zio_data_buf_free(buf, size);
5154         }
5155 }
5156 
5157 /*
5158  * Free the arc data buffer.
5159  */
5160 static void
5161 arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
5162 {
5163         arc_state_t *state = hdr->b_l1hdr.b_state;
5164         arc_buf_contents_t type = arc_buf_type(hdr);
5165 
5166         /* protected by hash lock, if in the hash table */
5167         if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
5168                 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5169                 ASSERT(state != arc_anon && state != arc_l2c_only);
5170 
5171                 (void) refcount_remove_many(&state->arcs_esize[type],
5172                     size, tag);
5173         }
5174         (void) refcount_remove_many(&state->arcs_size, size, tag);
5175 
5176         VERIFY3U(hdr->b_type, ==, type);
5177         if (type == ARC_BUFC_DDT) {
5178                 arc_space_return(size, ARC_SPACE_DDT);
5179         } else if (type == ARC_BUFC_METADATA) {
5180                 arc_space_return(size, ARC_SPACE_META);
5181         } else {
5182                 ASSERT(type == ARC_BUFC_DATA);
5183                 arc_space_return(size, ARC_SPACE_DATA);
5184         }
5185 }
5186 
5187 /*
5188  * This routine is called whenever a buffer is accessed.
5189  * NOTE: the hash lock is dropped in this function.
5190  */
5191 static void
5192 arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
5193 {
5194         clock_t now;
5195 
5196         ASSERT(MUTEX_HELD(hash_lock));
5197         ASSERT(HDR_HAS_L1HDR(hdr));
5198 
5199         if (hdr->b_l1hdr.b_state == arc_anon) {

5305                 }
5306 
5307                 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
5308                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
5309                 arc_change_state(new_state, hdr, hash_lock);
5310 
5311                 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
5312         } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
5313                 /*
5314                  * This buffer is on the 2nd Level ARC.
5315                  */
5316 
5317                 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
5318                 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
5319                 arc_change_state(arc_mfu, hdr, hash_lock);
5320         } else {
5321                 ASSERT(!"invalid arc state");
5322         }
5323 }
5324 
5325 /*
5326  * This routine is called by dbuf_hold() to update the arc_access() state
5327  * which otherwise would be skipped for entries in the dbuf cache.
5328  */
5329 void
5330 arc_buf_access(arc_buf_t *buf)
5331 {
5332         mutex_enter(&buf->b_evict_lock);
5333         arc_buf_hdr_t *hdr = buf->b_hdr;
5334 
5335         /*
5336          * Avoid taking the hash_lock when possible as an optimization.
5337          * The header must be checked again under the hash_lock in order
5338          * to handle the case where it is concurrently being released.
5339          */
5340         if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
5341                 mutex_exit(&buf->b_evict_lock);
5342                 return;
5343         }
5344 
5345         kmutex_t *hash_lock = HDR_LOCK(hdr);
5346         mutex_enter(hash_lock);
5347 
5348         if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
5349                 mutex_exit(hash_lock);
5350                 mutex_exit(&buf->b_evict_lock);
5351                 ARCSTAT_BUMP(arcstat_access_skip);
5352                 return;
5353         }
5354 
5355         mutex_exit(&buf->b_evict_lock);
5356 
5357         ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
5358             hdr->b_l1hdr.b_state == arc_mfu);
5359 
5360         DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
5361         arc_access(hdr, hash_lock);
5362         mutex_exit(hash_lock);
5363 
5364         ARCSTAT_BUMP(arcstat_hits);
5365         /*
5366          * Upstream used the ARCSTAT_CONDSTAT macro here, but they changed
5367          * the argument format for that macro, which would require that we
5368          * go and modify all other uses of it. So it's easier to just expand
5369          * this one invocation of the macro to do the right thing.
5370          */
5371         if (!HDR_PREFETCH(hdr)) {
5372                 if (!HDR_ISTYPE_METADATA(hdr))
5373                         ARCSTAT_BUMP(arcstat_demand_data_hits);
5374                 else
5375                         ARCSTAT_BUMP(arcstat_demand_metadata_hits);
5376         } else {
5377                 if (!HDR_ISTYPE_METADATA(hdr))
5378                         ARCSTAT_BUMP(arcstat_prefetch_data_hits);
5379                 else
5380                         ARCSTAT_BUMP(arcstat_prefetch_metadata_hits);
5381         }
5382 }
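
The expanded ARCSTAT_CONDSTAT logic above picks one of four hit counters from two booleans: whether the header is a prefetch, and whether it holds metadata. A minimal standalone sketch of that selection follows. The names `hit_kind_t` and `classify_hit` are hypothetical and are not part of arc.c; the real code bumps kstat counters directly.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the four arcstat hit counters. */
typedef enum {
	HIT_DEMAND_DATA,
	HIT_DEMAND_METADATA,
	HIT_PREFETCH_DATA,
	HIT_PREFETCH_METADATA
} hit_kind_t;

/* Mirror the nested if/else: test the prefetch flag, then the type. */
static hit_kind_t
classify_hit(bool is_prefetch, bool is_metadata)
{
	if (!is_prefetch)
		return (is_metadata ? HIT_DEMAND_METADATA : HIT_DEMAND_DATA);
	return (is_metadata ? HIT_PREFETCH_METADATA : HIT_PREFETCH_DATA);
}
```

A caller would pass the equivalents of HDR_PREFETCH(hdr) and HDR_ISTYPE_METADATA(hdr) and bump the counter that matches the returned value.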
5383 
5384 /* a generic arc_done_func_t which you can use */
5385 /* ARGSUSED */
5386 void
5387 arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
5388 {
5389         if (zio == NULL || zio->io_error == 0)
5390                 bcopy(buf->b_data, arg, arc_buf_size(buf));
5391         arc_buf_destroy(buf, arg);
5392 }
5393 
5394 /* a generic arc_done_func_t */
5395 void
5396 arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
5397 {
5398         arc_buf_t **bufp = arg;
5399         if (zio && zio->io_error) {
5400                 arc_buf_destroy(buf, arg);
5401                 *bufp = NULL;
5402         } else {
5403                 *bufp = buf;

5506                         zio->io_error = error;
5507                 }
5508         }
5509         hdr->b_l1hdr.b_acb = NULL;
5510         arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5511         if (callback_cnt == 0) {
5512                 ASSERT(HDR_PREFETCH(hdr));
5513                 ASSERT0(hdr->b_l1hdr.b_bufcnt);
5514                 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5515         }
5516 
5517         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
5518             callback_list != NULL);
5519 
5520         if (no_zio_error) {
5521                 arc_hdr_verify(hdr, zio->io_bp);
5522         } else {
5523                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
5524                 if (hdr->b_l1hdr.b_state != arc_anon)
5525                         arc_change_state(arc_anon, hdr, hash_lock);
5526                 if (HDR_IN_HASH_TABLE(hdr)) {
5527                         if (hash_lock)
5528                                 arc_wait_for_krrp(hdr);
5529                         buf_hash_remove(hdr);
5530                 }
5531                 freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
5532         }
5533 
5534         /*
5535          * Broadcast before we drop the hash_lock to avoid the possibility
5536          * that the hdr (and hence the cv) might be freed before we get to
5537          * the cv_broadcast().
5538          */
5539         cv_broadcast(&hdr->b_l1hdr.b_cv);
5540 
5541         if (hash_lock != NULL) {
5542                 mutex_exit(hash_lock);
5543         } else {
5544                 /*
5545                  * This block was freed while we waited for the read to
5546                  * complete.  It has been removed from the hash table and
5547                  * moved to the anonymous state (so that it won't show up
5548                  * in the cache).
5549                  */
5550                 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);

5553 
5554         /* execute each callback and free its structure */
5555         while ((acb = callback_list) != NULL) {
5556                 if (acb->acb_done)
5557                         acb->acb_done(zio, acb->acb_buf, acb->acb_private);
5558 
5559                 if (acb->acb_zio_dummy != NULL) {
5560                         acb->acb_zio_dummy->io_error = zio->io_error;
5561                         zio_nowait(acb->acb_zio_dummy);
5562                 }
5563 
5564                 callback_list = acb->acb_next;
5565                 kmem_free(acb, sizeof (arc_callback_t));
5566         }
5567 
5568         if (freeable)
5569                 arc_hdr_destroy(hdr);
5570 }
5571 
5572 /*
5573  * Process cached data in place through a caller-supplied callback.
5574  * The main purpose is to copy data directly from the ARC to a target buffer.
5575  */
5576 int
5577 arc_io_bypass(spa_t *spa, const blkptr_t *bp,
5578     arc_bypass_io_func func, void *arg)
5579 {
5580         arc_buf_hdr_t *hdr;
5581         kmutex_t *hash_lock = NULL;
5582         int error = 0;
5583         uint64_t guid = spa_load_guid(spa);
5584 
5585 top:
5586         hdr = buf_hash_find(guid, bp, &hash_lock);
5587         if (hdr && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_bufcnt > 0 &&
5588             hdr->b_l1hdr.b_buf->b_data) {
5589                 if (HDR_IO_IN_PROGRESS(hdr)) {
5590                         cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
5591                         mutex_exit(hash_lock);
5592                         DTRACE_PROBE(arc_bypass_wait);
5593                         goto top;
5594                 }
5595 
5596                 /*
5597                  * func is an arbitrary callback that may block, so the
5598                  * hash lock is dropped while it runs to avoid stalling
5599                  * other threads. The b_krrp counter records that krrp
5600                  * still holds a reference to this block.
5601                  */
5602 
5603                 hdr->b_l1hdr.b_krrp++;
5604                 mutex_exit(hash_lock);
5605 
5606                 error = func(hdr->b_l1hdr.b_buf->b_data, hdr->b_lsize, arg);
5607 
5608                 mutex_enter(hash_lock);
5609                 hdr->b_l1hdr.b_krrp--;
5610                 cv_broadcast(&hdr->b_l1hdr.b_cv);
5611                 mutex_exit(hash_lock);
5612 
5613                 return (error);
5614         } else {
5615                 if (hash_lock)
5616                         mutex_exit(hash_lock);
5617                 return (ENODATA);
5618         }
5619 }
5620 
5621 /*
5622  * "Read" the block at the specified DVA (in bp) via the
5623  * cache.  If the block is found in the cache, invoke the provided
5624  * callback immediately and return.  Note that the `zio' parameter
5625  * in the callback will be NULL in this case, since no IO was
5626  * required.  If the block is not in the cache pass the read request
5627  * on to the spa with a substitute callback function, so that the
5628  * requested block will be added to the cache.
5629  *
5630  * If a read request arrives for a block that has a read in-progress,
5631  * either wait for the in-progress read to complete (and return the
5632  * results); or, if this is a read with a "done" func, add a record
5633  * to the read to invoke the "done" func when the read completes,
5634  * and return; or just return.
5635  *
5636  * arc_read_done() will invoke all the requested "done" functions
5637  * for readers of this block.
5638  */
5639 int
5640 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
5641     void *private, zio_priority_t priority, int zio_flags,

5741                                 ARCSTAT_BUMP(
5742                                     arcstat_demand_hit_predictive_prefetch);
5743                                 arc_hdr_clear_flags(hdr,
5744                                     ARC_FLAG_PREDICTIVE_PREFETCH);
5745                         }
5746                         ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));
5747 
5748                         /* Get a buf with the desired data in it. */
5749                         VERIFY0(arc_buf_alloc_impl(hdr, private,
5750                             compressed_read, B_TRUE, &buf));
5751                 } else if (*arc_flags & ARC_FLAG_PREFETCH &&
5752                     refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
5753                         arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
5754                 }
5755                 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
5756                 arc_access(hdr, hash_lock);
5757                 if (*arc_flags & ARC_FLAG_L2CACHE)
5758                         arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
5759                 mutex_exit(hash_lock);
5760                 ARCSTAT_BUMP(arcstat_hits);
5761                 if (HDR_ISTYPE_DDT(hdr))
5762                         ARCSTAT_BUMP(arcstat_ddt_hits);
5763                 arc_update_hit_stat(hdr, B_TRUE);
5764 
5765                 if (done)
5766                         done(NULL, buf, private);
5767         } else {
5768                 uint64_t lsize = BP_GET_LSIZE(bp);
5769                 uint64_t psize = BP_GET_PSIZE(bp);
5770                 arc_callback_t *acb;
5771                 vdev_t *vd = NULL;
5772                 uint64_t addr = 0;
5773                 boolean_t devw = B_FALSE;
5774                 uint64_t size;
5775 
5776                 if (hdr == NULL) {
5777                         /* this block is not in the cache */
5778                         arc_buf_hdr_t *exists = NULL;
5779                         arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
5780                         hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
5781                             BP_GET_COMPRESS(bp), type);
5782 
5783                         if (!BP_IS_EMBEDDED(bp)) {
5784                                 hdr->b_dva = *BP_IDENTITY(bp);
5785                                 hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
5786                                 exists = buf_hash_insert(hdr, &hash_lock);
5787                         }
5788                         if (exists != NULL) {
5789                                 /* somebody beat us to the hash insert */
5790                                 arc_hdr_destroy(hdr);
5791                                 mutex_exit(hash_lock);
5792                                 goto top; /* restart the IO request */
5793                         }
5794                 } else {
5795                         /*
5796                          * This block is in the ghost cache. If it was L2-only
5797                          * (and thus didn't have an L1 hdr), we realloc the
5798                          * header to add an L1 hdr.
5799                          */
5800                         if (!HDR_HAS_L1HDR(hdr)) {
5801                                 hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
5802                                     hdr_full_cache);
5803                         }
5804                         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5805                         ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
5806                         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5807                         ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5808                         ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
5809                         ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
5810 
5811                         /*
5812                          * This is a delicate dance that we play here.
5813                          * This hdr is in the ghost list so we access it
5814                          * to move it out of the ghost list before we
5815                          * initiate the read. If it's a prefetch then
5816                          * it won't have a callback so we'll remove the
5817                          * reference that arc_buf_alloc_impl() created. We
5818                          * do this after we've called arc_access() to
5819                          * avoid hitting an assert in remove_reference().
5820                          */
5821                         arc_access(hdr, hash_lock);
5822                         arc_hdr_alloc_pabd(hdr);
5823                 }
5824                 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5825                 size = arc_hdr_size(hdr);
5826 
5827                 /*
5828                  * If compression is enabled on the hdr, then we will do
5829                  * RAW I/O and will store the compressed data in the hdr's

5841                 if (BP_GET_LEVEL(bp) > 0)
5842                         arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
5843                 if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
5844                         arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
5845                 ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
5846 
5847                 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
5848                 acb->acb_done = done;
5849                 acb->acb_private = private;
5850                 acb->acb_compressed = compressed_read;
5851 
5852                 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5853                 hdr->b_l1hdr.b_acb = acb;
5854                 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5855 
5856                 if (HDR_HAS_L2HDR(hdr) &&
5857                     (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5858                         devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5859                         addr = hdr->b_l2hdr.b_daddr;
5860                         /*
5861                          * Lock out device removal.
5862                          */
5863                         if (vdev_is_dead(vd) ||
5864                             !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5865                                 vd = NULL;
5866                 }
5867 
5868                 if (priority == ZIO_PRIORITY_ASYNC_READ)
5869                         arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5870                 else
5871                         arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5872 
5873                 if (hash_lock != NULL)
5874                         mutex_exit(hash_lock);
5875 
5876                 /*
5877                  * At this point, we have a level 1 cache miss.  Try again in
5878                  * L2ARC if possible.
5879                  */
5880                 ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5881 
5882                 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5883                     uint64_t, lsize, zbookmark_phys_t *, zb);
5884                 ARCSTAT_BUMP(arcstat_misses);
5885                 arc_update_hit_stat(hdr, B_FALSE);
5886 
5887                 if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5888                         /*
5889                          * Read from the L2ARC if the following are true:
5890                          * 1. The L2ARC vdev was previously cached.
5891                          * 2. This buffer still has L2ARC metadata.
5892                          * 3. This buffer isn't currently writing to the L2ARC.
5893                          * 4. The L2ARC entry wasn't evicted, which may
5894                          *    also have invalidated the vdev.
5895                          * 5. This isn't a prefetch with l2arc_noprefetch set.
5896                          */
5897                         if (HDR_HAS_L2HDR(hdr) &&
5898                             !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5899                             !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5900                                 l2arc_read_callback_t *cb;
5901                                 abd_t *abd;
5902                                 uint64_t asize;
5903 
5904                                 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5905                                 ARCSTAT_BUMP(arcstat_l2_hits);
5906                                 if (vdev_type_is_ddt(vd))
5907                                         ARCSTAT_BUMP(arcstat_l2_ddt_hits);
5908 
5909                                 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5910                                     KM_SLEEP);
5911                                 cb->l2rcb_hdr = hdr;
5912                                 cb->l2rcb_bp = *bp;
5913                                 cb->l2rcb_zb = *zb;
5914                                 cb->l2rcb_flags = zio_flags;
5915 
5916                                 asize = vdev_psize_to_asize(vd, size);
5917                                 if (asize != size) {
5918                                         abd = abd_alloc_for_io(asize,
5919                                             !HDR_ISTYPE_DATA(hdr));
5920                                         cb->l2rcb_abd = abd;
5921                                 } else {
5922                                         abd = hdr->b_l1hdr.b_pabd;
5923                                 }
5924 
5925                                 ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5926                                     addr + asize <= vd->vdev_psize -
5927                                     VDEV_LABEL_END_SIZE);
5928 
5929                                 /*
5930                                  * l2arc read.  The SCL_L2ARC lock will be
5931                                  * released by l2arc_read_done().
5932                                  * Issue a null zio if the underlying buffer
5933                                  * was squashed to zero size by compression.
5934                                  */
5935                                 ASSERT3U(HDR_GET_COMPRESS(hdr), !=,
5936                                     ZIO_COMPRESS_EMPTY);
5937                                 rzio = zio_read_phys(pio, vd, addr,
5938                                     asize, abd,
5939                                     ZIO_CHECKSUM_OFF,
5940                                     l2arc_read_done, cb, priority,
5941                                     zio_flags | ZIO_FLAG_DONT_CACHE |
5942                                     ZIO_FLAG_CANFAIL |
5943                                     ZIO_FLAG_DONT_PROPAGATE |
5944                                     ZIO_FLAG_DONT_RETRY, B_FALSE);
5945                                 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5946                                     zio_t *, rzio);
5947 
5948                                 ARCSTAT_INCR(arcstat_l2_read_bytes, size);
5949                                 if (vdev_type_is_ddt(vd))
5950                                         ARCSTAT_INCR(arcstat_l2_ddt_read_bytes,
5951                                             size);
5952 
5953                                 if (*arc_flags & ARC_FLAG_NOWAIT) {
5954                                         zio_nowait(rzio);
5955                                         return (0);
5956                                 }
5957 
5958                                 ASSERT(*arc_flags & ARC_FLAG_WAIT);
5959                                 if (zio_wait(rzio) == 0)
5960                                         return (0);
5961 
5962                                 /* l2arc read error; goto zio_read() */
5963                         } else {
5964                                 DTRACE_PROBE1(l2arc__miss,
5965                                     arc_buf_hdr_t *, hdr);
5966                                 ARCSTAT_BUMP(arcstat_l2_misses);
5967                                 if (HDR_L2_WRITING(hdr))
5968                                         ARCSTAT_BUMP(arcstat_l2_rw_clash);
5969                                 spa_config_exit(spa, SCL_L2ARC, vd);
5970                         }
5971                 } else {

6203                 hdr->b_l1hdr.b_bufcnt -= 1;
6204                 arc_cksum_verify(buf);
6205                 arc_buf_unwatch(buf);
6206 
6207                 mutex_exit(hash_lock);
6208 
6209                 /*
6210                  * Allocate a new hdr. The new hdr will contain a b_pabd
6211                  * buffer which will be freed in arc_write().
6212                  */
6213                 nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
6214                 ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
6215                 ASSERT0(nhdr->b_l1hdr.b_bufcnt);
6216                 ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
6217                 VERIFY3U(nhdr->b_type, ==, type);
6218                 ASSERT(!HDR_SHARED_DATA(nhdr));
6219 
6220                 nhdr->b_l1hdr.b_buf = buf;
6221                 nhdr->b_l1hdr.b_bufcnt = 1;
6222                 (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
6223                 nhdr->b_l1hdr.b_krrp = 0;
6224 
6225                 buf->b_hdr = nhdr;
6226 
6227                 mutex_exit(&buf->b_evict_lock);
6228                 (void) refcount_add_many(&arc_anon->arcs_size,
6229                     arc_buf_size(buf), buf);
6230         } else {
6231                 mutex_exit(&buf->b_evict_lock);
6232                 ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
6233                 /* protected by hash lock, or hdr is on arc_anon */
6234                 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
6235                 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
6236                 arc_change_state(arc_anon, hdr, hash_lock);
6237                 hdr->b_l1hdr.b_arc_access = 0;
6238                 mutex_exit(hash_lock);
6239 
6240                 buf_discard_identity(hdr);
6241                 arc_buf_thaw(buf);
6242         }
6243 }
6244

6415                 kmutex_t *hash_lock;
6416 
6417                 ASSERT3U(zio->io_error, ==, 0);
6418 
6419                 arc_cksum_verify(buf);
6420 
6421                 exists = buf_hash_insert(hdr, &hash_lock);
6422                 if (exists != NULL) {
6423                         /*
6424                          * This can only happen if we overwrite for
6425                          * sync-to-convergence, because we remove
6426                          * buffers from the hash table when we arc_free().
6427                          */
6428                         if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
6429                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
6430                                         panic("bad overwrite, hdr=%p exists=%p",
6431                                             (void *)hdr, (void *)exists);
6432                                 ASSERT(refcount_is_zero(
6433                                     &exists->b_l1hdr.b_refcnt));
6434                                 arc_change_state(arc_anon, exists, hash_lock);
6435                                 arc_wait_for_krrp(exists);
6436                                 arc_hdr_destroy(exists);
6437                                 mutex_exit(hash_lock);
6438                                 exists = buf_hash_insert(hdr, &hash_lock);
6439                                 ASSERT3P(exists, ==, NULL);
6440                         } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
6441                                 /* nopwrite */
6442                                 ASSERT(zio->io_prop.zp_nopwrite);
6443                                 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
6444                                         panic("bad nopwrite, hdr=%p exists=%p",
6445                                             (void *)hdr, (void *)exists);
6446                         } else {
6447                                 /* Dedup */
6448                                 ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
6449                                 ASSERT(hdr->b_l1hdr.b_state == arc_anon);
6450                                 ASSERT(BP_GET_DEDUP(zio->io_bp));
6451                                 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
6452                         }
6453                 }
6454                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
6455                 /* if it's not anon, we are doing a scrub */
6456                 if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
6457                         arc_access(hdr, hash_lock);
6458                 mutex_exit(hash_lock);
6459         } else {
6460                 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
6461         }
6462 
6463         ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
6464         callback->awcb_done(zio, buf, callback->awcb_private);
6465 
6466         abd_put(zio->io_abd);
6467         kmem_free(callback, sizeof (arc_write_callback_t));
6468 }
6469 
6470 zio_t *
6471 arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
6472     boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
6473     arc_done_func_t *children_ready, arc_done_func_t *physdone,
6474     arc_done_func_t *done, void *private, zio_priority_t priority,
6475     int zio_flags, const zbookmark_phys_t *zb,
6476     const zio_smartcomp_info_t *smartcomp)
6477 {
6478         arc_buf_hdr_t *hdr = buf->b_hdr;
6479         arc_write_callback_t *callback;
6480         zio_t *zio;
6481         zio_prop_t localprop = *zp;
6482 
6483         ASSERT3P(ready, !=, NULL);
6484         ASSERT3P(done, !=, NULL);
6485         ASSERT(!HDR_IO_ERROR(hdr));
6486         ASSERT(!HDR_IO_IN_PROGRESS(hdr));
6487         ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
6488         ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
6489         if (l2arc)
6490                 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
6491         if (ARC_BUF_COMPRESSED(buf)) {
6492                 /*
6493                  * We're writing a pre-compressed buffer.  Make the
6494                  * compression algorithm requested by the zio_prop_t match
6495                  * the pre-compressed buffer's compression algorithm.
6496                  */

6517                  * the hdr then we need to break that relationship here.
6518                  * The hdr will remain with a NULL data pointer and the
6519                  * buf will take sole ownership of the block.
6520                  */
6521                 if (arc_buf_is_shared(buf)) {
6522                         arc_unshare_buf(hdr, buf);
6523                 } else {
6524                         arc_hdr_free_pabd(hdr);
6525                 }
6526                 VERIFY3P(buf->b_data, !=, NULL);
6527                 arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
6528         }
6529         ASSERT(!arc_buf_is_shared(buf));
6530         ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
6531 
6532         zio = zio_write(pio, spa, txg, bp,
6533             abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
6534             HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
6535             (children_ready != NULL) ? arc_write_children_ready : NULL,
6536             arc_write_physdone, arc_write_done, callback,
6537             priority, zio_flags, zb, smartcomp);
6538 
6539         return (zio);
6540 }
6541 
6542 static int
6543 arc_memory_throttle(uint64_t reserve, uint64_t txg)
6544 {
6545 #ifdef _KERNEL
6546         uint64_t available_memory = ptob(freemem);
6547         static uint64_t page_load = 0;
6548         static uint64_t last_txg = 0;
6549 
6550 #if defined(__i386)
6551         available_memory =
6552             MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
6553 #endif
6554 
6555         if (freemem > physmem * arc_lotsfree_percent / 100)
6556                 return (0);
6557

6609 
6610         anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) -
6611             arc_loaned_bytes), 0);
6612 
6613         /*
6614          * Writes will, almost always, require additional memory allocations
6615          * in order to compress/encrypt/etc the data.  We therefore need to
6616          * make sure that there is sufficient available memory for this.
6617          */
6618         error = arc_memory_throttle(reserve, txg);
6619         if (error != 0)
6620                 return (error);
6621 
6622         /*
6623          * Throttle writes when the amount of dirty data in the cache
6624          * gets too large.  We try to keep the cache less than half full
6625          * of dirty blocks so that our sync times don't grow too large.
6626          * Note: if two requests come in concurrently, we might let them
6627          * both succeed, when one of them should fail.  Not a huge deal.
6628          */

6629         if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
6630             anon_size > arc_c / 4) {
6631                 DTRACE_PROBE4(arc__tempreserve__space__throttle, uint64_t,
6632                     arc_tempreserve, arc_state_t *, arc_anon, uint64_t,
6633                     reserve, uint64_t, arc_c);
6634 
6635                 uint64_t meta_esize =
6636                     refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6637                 uint64_t data_esize =
6638                     refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6639                 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
6640                     "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
6641                     arc_tempreserve >> 10, meta_esize >> 10,
6642                     data_esize >> 10, reserve >> 10, arc_c >> 10);
6643                 return (SET_ERROR(ERESTART));
6644         }
6645         atomic_add_64(&arc_tempreserve, reserve);
6646         return (0);
6647 }
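
The dirty-data throttle above reduces to a two-part predicate: the new reservation plus outstanding reservations plus anonymous data must exceed half of arc_c, and anonymous data alone must exceed a quarter of arc_c. A minimal standalone sketch of just that check follows. The helper name `dirty_throttle` is hypothetical; the real function also calls arc_memory_throttle() first and returns ERESTART on refusal.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a reservation would be refused.
 * All sizes are in bytes; arc_c is the ARC target size.
 */
static bool
dirty_throttle(uint64_t reserve, uint64_t tempreserve, uint64_t anon_size,
    uint64_t arc_c)
{
	return (reserve + tempreserve + anon_size > arc_c / 2 &&
	    anon_size > arc_c / 4);
}
```

With arc_c = 1000, a request of 100 on top of 200 already reserved and 300 of anonymous data is refused (600 > 500 and 300 > 250). The same total with only 200 of anonymous data passes, because the second condition fails.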
6648 
6649 static void
6650 arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
6651     kstat_named_t *evict_data, kstat_named_t *evict_metadata,
6652     kstat_named_t *evict_ddt)
6653 {
6654         size->value.ui64 = refcount_count(&state->arcs_size);
6655         evict_data->value.ui64 =
6656             refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
6657         evict_metadata->value.ui64 =
6658             refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
6659         evict_ddt->value.ui64 =
6660             refcount_count(&state->arcs_esize[ARC_BUFC_DDT]);
6661 }
6662 
6663 static int
6664 arc_kstat_update(kstat_t *ksp, int rw)
6665 {
6666         arc_stats_t *as = ksp->ks_data;
6667 
6668         if (rw == KSTAT_WRITE) {
6669                 return (EACCES);
6670         } else {
6671                 arc_kstat_update_state(arc_anon,
6672                     &as->arcstat_anon_size,
6673                     &as->arcstat_anon_evictable_data,
6674                     &as->arcstat_anon_evictable_metadata,
6675                     &as->arcstat_anon_evictable_ddt);
6676                 arc_kstat_update_state(arc_mru,
6677                     &as->arcstat_mru_size,
6678                     &as->arcstat_mru_evictable_data,
6679                     &as->arcstat_mru_evictable_metadata,
6680                     &as->arcstat_mru_evictable_ddt);
6681                 arc_kstat_update_state(arc_mru_ghost,
6682                     &as->arcstat_mru_ghost_size,
6683                     &as->arcstat_mru_ghost_evictable_data,
6684                     &as->arcstat_mru_ghost_evictable_metadata,
6685                     &as->arcstat_mru_ghost_evictable_ddt);
6686                 arc_kstat_update_state(arc_mfu,
6687                     &as->arcstat_mfu_size,
6688                     &as->arcstat_mfu_evictable_data,
6689                     &as->arcstat_mfu_evictable_metadata,
6690                     &as->arcstat_mfu_evictable_ddt);
6691                 arc_kstat_update_state(arc_mfu_ghost,
6692                     &as->arcstat_mfu_ghost_size,
6693                     &as->arcstat_mfu_ghost_evictable_data,
6694                     &as->arcstat_mfu_ghost_evictable_metadata,
6695                     &as->arcstat_mfu_ghost_evictable_ddt);
6696         }
6697 
6698         return (0);
6699 }
6700 
6701 /*
6702  * This function *must* return indices evenly distributed between all
6703  * sublists of the multilist. This is needed due to how the ARC eviction
6704  * code is laid out; arc_evict_state() assumes ARC buffers are evenly
6705  * distributed between all sublists and uses this assumption when
6706  * deciding which sublist to evict from and how much to evict from it.
6707  */
6708 unsigned int
6709 arc_state_multilist_index_func(multilist_t *ml, void *obj)
6710 {
6711         arc_buf_hdr_t *hdr = obj;
6712 
6713         /*
6714          * We rely on b_dva to generate evenly distributed index
6715          * numbers using buf_hash below. So, as an added precaution,

6725          * on insertion, as this index can be recalculated on removal.
6726          *
6727          * Also, the low order bits of the hash value are thought to be
6728          * distributed evenly. Otherwise, in the case that the multilist
6729          * has a power of two number of sublists, each sublist's usage
6730          * would not be evenly distributed.
6731          */
6732         return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
6733             multilist_get_num_sublists(ml));
6734 }
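
The even-distribution requirement described above can be sanity-checked in user space. What follows is a minimal sketch under stated assumptions, not the ARC's code: `toy_hash()` is a stand-in mixer (the splitmix64 finalizer), not `buf_hash()`, and `sublist_index()` and `count_per_sublist()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (splitmix64 finalizer); NOT the ARC's buf_hash(). */
static uint64_t
toy_hash(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return (x);
}

/* Map a key to a sublist the same way the listing does: hash % nsublists. */
static unsigned int
sublist_index(uint64_t key, unsigned int nsublists)
{
	assert(nsublists > 0);
	return ((unsigned int)(toy_hash(key) % nsublists));
}

/* Count how many of the keys 0..nkeys-1 land in each sublist. */
static void
count_per_sublist(uint64_t nkeys, unsigned int nsublists, uint64_t *counts)
{
	for (unsigned int i = 0; i < nsublists; i++)
		counts[i] = 0;
	for (uint64_t k = 0; k < nkeys; k++)
		counts[sublist_index(k, nsublists)]++;
}
```

Calling `count_per_sublist(4096, 8, counts)` and comparing the eight counters gives a quick view of how evenly the low-order bits spread keys when the number of sublists is a power of two.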
6735 
6736 static void
6737 arc_state_init(void)
6738 {
6739         arc_anon = &ARC_anon;
6740         arc_mru = &ARC_mru;
6741         arc_mru_ghost = &ARC_mru_ghost;
6742         arc_mfu = &ARC_mfu;
6743         arc_mfu_ghost = &ARC_mfu_ghost;
6744         arc_l2c_only = &ARC_l2c_only;
6745         arc_buf_contents_t arcs;
6746 
6747         for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
6748                 arc_mru->arcs_list[arcs] =
6749                     multilist_create(sizeof (arc_buf_hdr_t),
6750                     offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6751                     arc_state_multilist_index_func);
6752                 arc_mru_ghost->arcs_list[arcs] =
6753                     multilist_create(sizeof (arc_buf_hdr_t),
6754                     offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6755                     arc_state_multilist_index_func);
6756                 arc_mfu->arcs_list[arcs] =
6757                     multilist_create(sizeof (arc_buf_hdr_t),
6758                     offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6759                     arc_state_multilist_index_func);
6760                 arc_mfu_ghost->arcs_list[arcs] =
6761                     multilist_create(sizeof (arc_buf_hdr_t),
6762                     offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6763                     arc_state_multilist_index_func);
6764                 arc_l2c_only->arcs_list[arcs] =
6765                     multilist_create(sizeof (arc_buf_hdr_t),
6766                     offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6767                     arc_state_multilist_index_func);
6768 
6769                 refcount_create(&arc_anon->arcs_esize[arcs]);
6770                 refcount_create(&arc_mru->arcs_esize[arcs]);
6771                 refcount_create(&arc_mru_ghost->arcs_esize[arcs]);
6772                 refcount_create(&arc_mfu->arcs_esize[arcs]);
6773                 refcount_create(&arc_mfu_ghost->arcs_esize[arcs]);
6774                 refcount_create(&arc_l2c_only->arcs_esize[arcs]);
6775         }
6776 
6777         arc_flush_taskq = taskq_create("arc_flush_tq",
6778             max_ncpus, minclsyspri, 1, zfs_flush_ntasks, TASKQ_DYNAMIC);
6779 
6780         refcount_create(&arc_anon->arcs_size);
6781         refcount_create(&arc_mru->arcs_size);
6782         refcount_create(&arc_mru_ghost->arcs_size);
6783         refcount_create(&arc_mfu->arcs_size);
6784         refcount_create(&arc_mfu_ghost->arcs_size);
6785         refcount_create(&arc_l2c_only->arcs_size);
6786 }
6787 
6788 static void
6789 arc_state_fini(void)
6790 {
6791         arc_buf_contents_t arcs;
6792 
6793         refcount_destroy(&arc_anon->arcs_size);
6794         refcount_destroy(&arc_mru->arcs_size);
6795         refcount_destroy(&arc_mru_ghost->arcs_size);
6796         refcount_destroy(&arc_mfu->arcs_size);
6797         refcount_destroy(&arc_mfu_ghost->arcs_size);
6798         refcount_destroy(&arc_l2c_only->arcs_size);
6799 
6800         for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
6801                 multilist_destroy(arc_mru->arcs_list[arcs]);
6802                 multilist_destroy(arc_mru_ghost->arcs_list[arcs]);
6803                 multilist_destroy(arc_mfu->arcs_list[arcs]);
6804                 multilist_destroy(arc_mfu_ghost->arcs_list[arcs]);
6805                 multilist_destroy(arc_l2c_only->arcs_list[arcs]);
6806 
6807                 refcount_destroy(&arc_anon->arcs_esize[arcs]);
6808                 refcount_destroy(&arc_mru->arcs_esize[arcs]);
6809                 refcount_destroy(&arc_mru_ghost->arcs_esize[arcs]);
6810                 refcount_destroy(&arc_mfu->arcs_esize[arcs]);
6811                 refcount_destroy(&arc_mfu_ghost->arcs_esize[arcs]);
6812                 refcount_destroy(&arc_l2c_only->arcs_esize[arcs]);
6813         }
6814 }
6815 
6816 uint64_t
6817 arc_max_bytes(void)
6818 {
6819         return (arc_c_max);
6820 }
6821 
6822 void
6823 arc_init(void)
6824 {
6825         /*
6826          * allmem is "all memory that we could possibly use".
6827          */
6828 #ifdef _KERNEL
6829         uint64_t allmem = ptob(physmem - swapfs_minfree);
6830 #else
6831         uint64_t allmem = (physmem * PAGESIZE) / 2;
6832 #endif
6833

6853          * small, because it can cause transactions to be larger than
6854          * arc_c, causing arc_tempreserve_space() to fail.
6855          */
6856 #ifndef _KERNEL
6857         arc_c_min = arc_c_max / 2;
6858 #endif
6859 
6860         /*
6861          * Allow the tunables to override our calculations if they are
6862          * reasonable (i.e. over 64MB)
6863          */
6864         if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6865                 arc_c_max = zfs_arc_max;
6866                 arc_c_min = MIN(arc_c_min, arc_c_max);
6867         }
6868         if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6869                 arc_c_min = zfs_arc_min;
6870 
6871         arc_c = arc_c_max;
6872         arc_p = (arc_c >> 1);
6873         arc_size = 0;
6874 
6875         /* limit ddt meta-data to 1/4 of the arc capacity */
6876         arc_ddt_limit = arc_c_max / 4;
6877         /* limit meta-data to 1/4 of the arc capacity */
6878         arc_meta_limit = arc_c_max / 4;
6879 
6880 #ifdef _KERNEL
6881         /*
6882          * Metadata is stored in the kernel's heap.  Don't let us
6883          * use more than half the heap for the ARC.
6884          */
6885         arc_meta_limit = MIN(arc_meta_limit,
6886             vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6887 #endif
6888 
6889         /* Allow the tunable to override if it is reasonable */
6890         if (zfs_arc_ddt_limit > 0 && zfs_arc_ddt_limit <= arc_c_max)
6891                 arc_ddt_limit = zfs_arc_ddt_limit;
6892         arc_ddt_evict_threshold =
6893             zfs_arc_segregate_ddt ? &arc_ddt_limit : &arc_meta_limit;
6894 
6895         /* Allow the tunable to override if it is reasonable */
6896         if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6897                 arc_meta_limit = zfs_arc_meta_limit;
6898 
6899         if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6900                 arc_c_min = arc_meta_limit / 2;
6901 
6902         if (zfs_arc_meta_min > 0) {
6903                 arc_meta_min = zfs_arc_meta_min;
6904         } else {
6905                 arc_meta_min = arc_c_min / 2;
6906         }
6907 
6908         if (zfs_arc_grow_retry > 0)
6909                 arc_grow_retry = zfs_arc_grow_retry;
6910 
6911         if (zfs_arc_shrink_shift > 0)
6912                 arc_shrink_shift = zfs_arc_shrink_shift;
6913 
6914         /*
6915          * Ensure that arc_no_grow_shift is less than arc_shrink_shift.

6970         /*
6971          * The reclaim thread will set arc_reclaim_thread_exit back to
6972          * B_FALSE when it is finished exiting; we're waiting for that.
6973          */
6974         while (arc_reclaim_thread_exit) {
6975                 cv_signal(&arc_reclaim_thread_cv);
6976                 cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock);
6977         }
6978         mutex_exit(&arc_reclaim_lock);
6979 
6980         /* Use B_TRUE to ensure *all* buffers are evicted */
6981         arc_flush(NULL, B_TRUE);
6982 
6983         arc_dead = B_TRUE;
6984 
6985         if (arc_ksp != NULL) {
6986                 kstat_delete(arc_ksp);
6987                 arc_ksp = NULL;
6988         }
6989 
6990         taskq_destroy(arc_flush_taskq);
6991 
6992         mutex_destroy(&arc_reclaim_lock);
6993         cv_destroy(&arc_reclaim_thread_cv);
6994         cv_destroy(&arc_reclaim_waiters_cv);
6995 
6996         arc_state_fini();
6997         buf_fini();
6998 
6999         ASSERT0(arc_loaned_bytes);
7000 }
7001 
7002 /*
7003  * Level 2 ARC
7004  *
7005  * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
7006  * It uses dedicated storage devices to hold cached data, which are populated
7007  * using large infrequent writes.  The main role of this cache is to boost
7008  * the performance of random read workloads.  The intended L2ARC devices
7009  * include short-stroked disks, solid state disks, and other media with
7010  * substantially faster read latency than disk.
7011  *

7125  *      l2arc_noprefetch        skip caching prefetched buffers
7126  *      l2arc_headroom          number of max device writes to precache
7127  *      l2arc_headroom_boost    when we find compressed buffers during ARC
7128  *                              scanning, we multiply headroom by this
7129  *                              percentage factor for the next scan cycle,
7130  *                              since more compressed buffers are likely to
7131  *                              be present
7132  *      l2arc_feed_secs         seconds between L2ARC writing
7133  *
7134  * Tunables may be removed or added as future performance improvements are
7135  * integrated, and also may become zpool properties.
7136  *
7137  * There are three key functions that control how the L2ARC warms up:
7138  *
7139  *      l2arc_write_eligible()  check if a buffer is eligible to cache
7140  *      l2arc_write_size()      calculate how much to write
7141  *      l2arc_write_interval()  calculate sleep delay between writes
7142  *
7143  * These three functions determine what to write, how much, and how quickly
7144  * to send writes.
7145  *
7146  * L2ARC persistency:
7147  *
7148  * When writing buffers to L2ARC, we periodically add some metadata to
7149  * make sure we can pick them up after reboot, thus dramatically reducing
7150  * the impact that any downtime has on the performance of storage systems
7151  * with large caches.
7152  *
7153  * The implementation works fairly simply by integrating the following two
7154  * modifications:
7155  *
7156  * *) Every now and then we mix in a piece of metadata (called a log block)
7157  *    into the L2ARC write. This allows us to understand what's been written,
7158  *    so that we can rebuild the arc_buf_hdr_t structures of the main ARC
7159  *    buffers. The log block also includes a "2-back-reference" pointer to
7160  *    the second-to-previous block, forming a back-linked list of blocks on
7161  *    the L2ARC device.
7162  *
7163  * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
7164  *    for our header bookkeeping purposes. This contains a device header,
7165  *    which contains our top-level reference structures. We update it each
7166  *    time we write a new log block, so that we're able to locate it in the
7167  *    L2ARC device. If this write results in an inconsistent device header
7168  *    (e.g. due to power failure), we detect this by verifying the header's
7169  *    checksum and simply drop the entries from L2ARC.
7170  *
7171  * Implementation diagram:
7172  *
7173  * +=== L2ARC device (not to scale) ======================================+
7174  * |       ___two newest log block pointers__.__________                  |
7175  * |      /                                   \1 back   \latest           |
7176  * |.____/_.                                   V         V                |
7177  * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
7178  * ||   hdr|      ^         /^       /^        /         /                |
7179  * |+------+  ...--\-------/  \-----/--\------/         /                 |
7180  * |                \--------------/    \--------------/                  |
7181  * +======================================================================+
7182  *
7183  * As can be seen on the diagram, rather than using a simple linked list,
7184  * we use a pair of linked lists with alternating elements. This is a
7185  * performance enhancement due to the fact that we only find out the
7186  * address of the next log block once the current block has been
7187  * completely read in. With a single list this would hurt performance,
7188  * because we'd be keeping the device's I/O queue only 1 operation deep, thus
7189  * incurring a large amount of I/O round-trip latency. Having two lists
7190  * allows us to "prefetch" two log blocks ahead of where we are currently
7191  * rebuilding L2ARC buffers.
7192  *
7193  * On-device data structures:
7194  *
7195  * L2ARC device header: l2arc_dev_hdr_phys_t
7196  * L2ARC log block:     l2arc_log_blk_phys_t
7197  *
7198  * L2ARC reconstruction:
7199  *
7200  * When writing data, we simply write in the standard rotary fashion,
7201  * evicting buffers as we go and simply writing new data over them (writing
7202  * a new log block every now and then). This obviously means that once we
7203  * loop around the end of the device, we will start cutting into an already
7204  * committed log block (and its referenced data buffers), like so:
7205  *
7206  *    current write head__       __old tail
7207  *                        \     /
7208  *                        V    V
7209  * <--|bufs |lb |bufs |lb |    |bufs |lb |bufs |lb |-->
7210  *                         ^    ^^^^^^^^^___________________________________
7211  *                         |                                                \
7212  *                   <<nextwrite>> may overwrite this blk and/or its bufs --'
7213  *
7214  * When importing the pool, we detect this situation and use it to stop
7215  * our scanning process (see l2arc_rebuild).
7216  *
7217  * There is one significant caveat to consider when rebuilding ARC contents
7218  * from an L2ARC device: what about invalidated buffers? Given the above
7219  * construction, we cannot update blocks which we've already written to amend
7220  * them to remove buffers which were invalidated. Thus, during reconstruction,
7221  * we might be populating the cache with buffers for data that's not on the
7222  * main pool anymore, or may have been overwritten!
7223  *
7224  * As it turns out, this isn't a problem. Every arc_read request includes
7225  * both the DVA and, crucially, the birth TXG of the BP the caller is
7226  * looking for. So even if the cache were populated by completely rotten
7227  * blocks for data that had been long deleted and/or overwritten, we'll
7228  * never actually return bad data from the cache, since the DVA with the
7229  * birth TXG uniquely identify a block in space and time - once created,
7230  * a block is immutable on disk. The worst we have done is waste
7231  * some time and memory at l2arc rebuild to reconstruct outdated ARC
7232  * entries that will get dropped from the l2arc as it is being updated
7233  * with new blocks.
7234  */
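
The pair of alternating back-linked lists described in the comment above can be illustrated with a small user-space model. This is only a sketch under assumptions: `toy_log_blk_t`, its `prev2` field, and `toy_walk_log_blks()` are hypothetical names, array indices stand in for on-device addresses, and no real I/O or checksum verification is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy log block: prev2 is the index of the block two writes earlier, or -1. */
typedef struct toy_log_blk {
	int prev2;
} toy_log_blk_t;

/*
 * Walk the two interleaved chains, starting from the two newest blocks
 * (as a device header would record them) and alternating between chains.
 * Stores the visited indices, newest first, in out[] and returns the count.
 */
static size_t
toy_walk_log_blks(const toy_log_blk_t *blks, int newest, int second_newest,
    int *out, size_t cap)
{
	int cur[2] = { newest, second_newest };
	size_t n = 0;
	int which = 0;

	assert(blks != NULL && out != NULL);
	while (cur[which] >= 0 && n < cap) {
		out[n++] = cur[which];
		/* The next block on this chain is known only after "reading". */
		cur[which] = blks[cur[which]].prev2;
		which ^= 1;	/* switch to the other chain */
	}
	return (n);
}
```

Because each chain's next address is already known while the other chain's block is still being processed, a real rebuild could keep two reads in flight rather than one.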
7235 
7236 static boolean_t
7237 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
7238 {
7239         /*
7240          * A buffer is *not* eligible for the L2ARC if it:
7241          * 1. belongs to a different spa.
7242          * 2. is already cached on the L2ARC.
7243          * 3. has an I/O in progress (it may be an incomplete read).
7244          * 4. is flagged not eligible (zfs property).
7245          */
7246         if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
7247             HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
7248                 return (B_FALSE);
7249 
7250         return (B_TRUE);
7251 }
7252 
7253 static uint64_t

7279 {
7280         clock_t interval, next, now;
7281 
7282         /*
7283          * If the ARC lists are busy, increase our write rate; if the
7284          * lists are stale, idle back.  This is achieved by checking
7285          * how much we previously wrote - if it was more than half of
7286          * what we wanted, schedule the next write much sooner.
7287          */
7288         if (l2arc_feed_again && wrote > (wanted / 2))
7289                 interval = (hz * l2arc_feed_min_ms) / 1000;
7290         else
7291                 interval = hz * l2arc_feed_secs;
7292 
7293         now = ddi_get_lbolt();
7294         next = MAX(now, MIN(now + interval, began + interval));
7295 
7296         return (next);
7297 }
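
The scheduling rule in the function above can be tried with concrete tick values. The following is a user-space sketch, assuming the tunables are passed in as plain parameters; `toy_next_feed_tick()` and its parameter names are hypothetical, and `int64_t` stands in for `clock_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_MAX(a, b)	((a) > (b) ? (a) : (b))
#define	TOY_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Return the tick at which the next feed should start.  If the last pass
 * wrote more than half of what was wanted (and feed_again is set), use the
 * short interval; otherwise use the regular one.  The result is never in
 * the past and never later than one interval after "now".
 */
static int64_t
toy_next_feed_tick(int64_t now, int64_t began, uint64_t wanted,
    uint64_t wrote, int64_t hz, int64_t feed_secs, int64_t feed_min_ms,
    bool feed_again)
{
	int64_t interval;

	assert(hz > 0);
	if (feed_again && wrote > (wanted / 2))
		interval = (hz * feed_min_ms) / 1000;
	else
		interval = hz * feed_secs;

	return (TOY_MAX(now, TOY_MIN(now + interval, began + interval)));
}
```

With `hz = 100`, a 1 second regular interval and a 200 ms minimum, a pass that began at tick 1000 and finished at tick 1010 is rescheduled for tick 1020 when it was busy and for tick 1100 when it was not.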
7298 
7299 typedef enum l2ad_feed {
7300         L2ARC_FEED_ALL = 1,
7301         L2ARC_FEED_DDT_DEV,
7302         L2ARC_FEED_NON_DDT_DEV,
7303 } l2ad_feed_t;
7304 
7305 /*
7306  * Cycle through L2ARC devices.  This is how L2ARC load balances.
7307  * If a device is returned, this also returns holding the spa config lock.
7308  */
7309 static l2arc_dev_t *
7310 l2arc_dev_get_next(l2ad_feed_t feed_type)
7311 {
7312         l2arc_dev_t *start = NULL, *next = NULL;
7313 
7314         /*
7315          * Lock out the removal of spas (spa_namespace_lock), then removal
7316          * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
7317          * both locks will be dropped and a spa config lock held instead.
7318          */
7319         mutex_enter(&spa_namespace_lock);
7320         mutex_enter(&l2arc_dev_mtx);
7321 
7322         /* if there are no vdevs, there is nothing to do */
7323         if (l2arc_ndev == 0)
7324                 goto out;
7325 
7326         if (feed_type == L2ARC_FEED_DDT_DEV)
7327                 next = l2arc_ddt_dev_last;
7328         else
7329                 next = l2arc_dev_last;
7330 
7331         /* figure out what the next device we look at should be */
7332         if (next == NULL)
7333                 next = list_head(l2arc_dev_list);
7334         else if (list_next(l2arc_dev_list, next) == NULL)
7335                 next = list_head(l2arc_dev_list);
7336         else
7337                 next = list_next(l2arc_dev_list, next);
7338         ASSERT(next);
7339 
7340         /* loop through L2ARC devs looking for the one we need */
7341         /* LINTED(E_CONSTANT_CONDITION) */
7342         while (1) {
7343                 if (next == NULL) /* reached list end, start from beginning */
7344                         next = list_head(l2arc_dev_list);
7345 
7346                 if (start == NULL) { /* save starting dev */
7347                         start = next;
7348                 } else if (start == next) { /* full loop completed - stop now */
7349                         next = NULL;
7350                         if (feed_type == L2ARC_FEED_DDT_DEV) {
7351                                 l2arc_ddt_dev_last = NULL;
7352                                 goto out;
7353                         } else {
7354                                 break;
7355                         }
7356                 }
7357 
7358                 if (!vdev_is_dead(next->l2ad_vdev) && !next->l2ad_rebuild) {
7359                         if (feed_type == L2ARC_FEED_DDT_DEV) {
7360                                 if (vdev_type_is_ddt(next->l2ad_vdev)) {
7361                                         l2arc_ddt_dev_last = next;
7362                                         goto out;
7363                                 }
7364                         } else if (feed_type == L2ARC_FEED_NON_DDT_DEV) {
7365                                 if (!vdev_type_is_ddt(next->l2ad_vdev)) {
7366                                         break;
7367                                 }
7368                         } else {
7369                                 ASSERT(feed_type == L2ARC_FEED_ALL);
7370                                 break;
7371                         }
7372                 }
7373                 next = list_next(l2arc_dev_list, next);
7374         }
7375         l2arc_dev_last = next;
7376 
7377 out:
7378         mutex_exit(&l2arc_dev_mtx);
7379 
7380         /*
7381          * Grab the config lock to prevent the 'next' device from being
7382          * removed while we are writing to it.
7383          */
7384         if (next != NULL)
7385                 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
7386         mutex_exit(&spa_namespace_lock);
7387 
7388         return (next);
7389 }
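
The round-robin selection performed by the function above reduces to a circular scan that starts just after the previously used device and gives up after one full loop. The sketch below models only that scan, assuming an array of usability flags in place of the device list; `toy_dev_get_next()` is a hypothetical name, and locking and the DDT/non-DDT feed types are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return the index of the next usable device after "last" (or from the
 * start if last is negative), wrapping around at most once.  Returns -1
 * when there are no devices or none of them is usable.
 */
static int
toy_dev_get_next(const bool *usable, int ndev, int last)
{
	assert(ndev >= 0 && last < ndev);
	if (ndev == 0)
		return (-1);

	int start = (last < 0) ? 0 : (last + 1) % ndev;
	for (int i = 0; i < ndev; i++) {
		int idx = (start + i) % ndev;
		if (usable[idx])
			return (idx);
	}
	return (-1);	/* full loop completed without a match */
}
```

Feeding the returned index back in as `last` on the next call spreads successive writes across all usable devices.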
7390 
7391 /*
7392  * Free buffers that were tagged for destruction.
7393  */
7394 static void

7407                 list_remove(buflist, df);
7408                 kmem_free(df, sizeof (l2arc_data_free_t));
7409         }
7410 
7411         mutex_exit(&l2arc_free_on_write_mtx);
7412 }
7413 
7414 /*
7415  * A write to a cache device has completed.  Update all headers to allow
7416  * reads from these buffers to begin.
7417  */
7418 static void
7419 l2arc_write_done(zio_t *zio)
7420 {
7421         l2arc_write_callback_t *cb;
7422         l2arc_dev_t *dev;
7423         list_t *buflist;
7424         arc_buf_hdr_t *head, *hdr, *hdr_prev;
7425         kmutex_t *hash_lock;
7426         int64_t bytes_dropped = 0;
7427         l2arc_log_blk_buf_t *lb_buf;
7428 
7429         cb = zio->io_private;
7430         ASSERT3P(cb, !=, NULL);
7431         dev = cb->l2wcb_dev;
7432         ASSERT3P(dev, !=, NULL);
7433         head = cb->l2wcb_head;
7434         ASSERT3P(head, !=, NULL);
7435         buflist = &dev->l2ad_buflist;
7436         ASSERT3P(buflist, !=, NULL);
7437         DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
7438             l2arc_write_callback_t *, cb);
7439 
7440         if (zio->io_error != 0)
7441                 ARCSTAT_BUMP(arcstat_l2_writes_error);
7442 
7443         /*
7444          * All writes completed, or an error was hit.
7445          */
7446 top:
7447         mutex_enter(&dev->l2ad_mtx);

7504                         bytes_dropped += arc_hdr_size(hdr);
7505                         (void) refcount_remove_many(&dev->l2ad_alloc,
7506                             arc_hdr_size(hdr), hdr);
7507                 }
7508 
7509                 /*
7510                  * Allow ARC to begin reads and ghost list evictions to
7511                  * this L2ARC entry.
7512                  */
7513                 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
7514 
7515                 mutex_exit(hash_lock);
7516         }
7517 
7518         atomic_inc_64(&l2arc_writes_done);
7519         list_remove(buflist, head);
7520         ASSERT(!HDR_HAS_L1HDR(head));
7521         kmem_cache_free(hdr_l2only_cache, head);
7522         mutex_exit(&dev->l2ad_mtx);
7523 
7524         ASSERT(dev->l2ad_vdev != NULL);
7525         vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
7526 
7527         l2arc_do_free_on_write();
7528 
7529         while ((lb_buf = list_remove_tail(&cb->l2wcb_log_blk_buflist)) != NULL)
7530                 kmem_free(lb_buf, sizeof (*lb_buf));
7531         list_destroy(&cb->l2wcb_log_blk_buflist);
7532         kmem_free(cb, sizeof (l2arc_write_callback_t));
7533 }
7534 
7535 /*
7536  * A read to a cache device completed.  Validate buffer contents before
7537  * handing over to the regular ARC routines.
7538  */
7539 static void
7540 l2arc_read_done(zio_t *zio)
7541 {
7542         l2arc_read_callback_t *cb;
7543         arc_buf_hdr_t *hdr;
7544         kmutex_t *hash_lock;
7545         boolean_t valid_cksum;
7546 
7547         ASSERT3P(zio->io_vd, !=, NULL);
7548         ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
7549 
7550         spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
7551

7618                  * storage now.  If there *is* a waiter, the caller must
7619                  * issue the i/o in a context where it's OK to block.
7620                  */
7621                 if (zio->io_waiter == NULL) {
7622                         zio_t *pio = zio_unique_parent(zio);
7623 
7624                         ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
7625 
7626                         zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
7627                             hdr->b_l1hdr.b_pabd, zio->io_size, arc_read_done,
7628                             hdr, zio->io_priority, cb->l2rcb_flags,
7629                             &cb->l2rcb_zb));
7630                 }
7631         }
7632 
7633         kmem_free(cb, sizeof (l2arc_read_callback_t));
7634 }
7635 
7636 /*
7637  * This is the list priority from which the L2ARC will search for pages to
7638  * cache.  This is used within loops to cycle through lists in the
7639  * desired order.  This order can have a significant effect on cache
7640  * performance.
7641  *
7642  * Currently the ddt lists are hit first (MFU then MRU),
7643  * followed by metadata then by the data lists.
7644  * This function returns a locked list, and also returns the lock pointer.
7645  */
7646 static multilist_sublist_t *
7647 l2arc_sublist_lock(enum l2arc_priorities prio)
7648 {
7649         multilist_t *ml = NULL;
7650         unsigned int idx;
7651 
7652         ASSERT(prio >= PRIORITY_MFU_DDT);
7653         ASSERT(prio < PRIORITY_NUMTYPES);
7654 
7655         switch (prio) {
7656         case PRIORITY_MFU_DDT:
7657                 ml = arc_mfu->arcs_list[ARC_BUFC_DDT];
7658                 break;
7659         case PRIORITY_MRU_DDT:
7660                 ml = arc_mru->arcs_list[ARC_BUFC_DDT];
7661                 break;
7662         case PRIORITY_MFU_META:
7663                 ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
7664                 break;
7665         case PRIORITY_MRU_META:
7666                 ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
7667                 break;
7668         case PRIORITY_MFU_DATA:
7669                 ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
7670                 break;
7671         case PRIORITY_MRU_DATA:
7672                 ml = arc_mru->arcs_list[ARC_BUFC_DATA];
7673                 break;
7674         }
7675 
7676         /*
7677          * Return a randomly-selected sublist. This is acceptable
7678          * because the caller feeds only a little bit of data for each
7679          * call (8MB). Subsequent calls will result in different
7680          * sublists being selected.
7681          */
7682         idx = multilist_get_random_index(ml);
7683         return (multilist_sublist_lock(ml, idx));
7684 }
7685 
7686 /*
7687  * Calculates the maximum overhead of L2ARC metadata log blocks for a given
7688  * L2ARC write size. l2arc_evict and l2arc_write_buffers need to include this
7689  * overhead in processing to make sure there is enough headroom available
7690  * when writing buffers.
7691  */
7692 static inline uint64_t
7693 l2arc_log_blk_overhead(uint64_t write_sz)
7694 {
7695         return ((write_sz / SPA_MINBLOCKSIZE / L2ARC_LOG_BLK_ENTRIES) + 1) *
7696             L2ARC_LOG_BLK_SIZE;
7697 }
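
A worked instance of the overhead formula above may help. The constants in this sketch are assumptions chosen for illustration (512-byte minimum block, 1023 entries per log block, 128 KiB log block); the real values come from the headers and may differ. `toy_log_blk_overhead()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_MINBLOCKSIZE	512ULL		/* assumed SPA_MINBLOCKSIZE */
#define	TOY_LOG_BLK_ENTRIES	1023ULL		/* assumed placeholder */
#define	TOY_LOG_BLK_SIZE	(128ULL * 1024)	/* assumed placeholder */

static_assert(TOY_LOG_BLK_ENTRIES > 0, "entries per log block must be nonzero");

/*
 * Worst case: every buffer is of minimum size, so the write holds
 * write_sz / TOY_MINBLOCKSIZE buffers; one extra log block is added to
 * cover a partially filled one.
 */
static uint64_t
toy_log_blk_overhead(uint64_t write_sz)
{
	return (((write_sz / TOY_MINBLOCKSIZE / TOY_LOG_BLK_ENTRIES) + 1) *
	    TOY_LOG_BLK_SIZE);
}
```

Under these assumed constants, an 8 MiB write holds at most 16384 minimum-size buffers, which need 16 full log blocks plus one, so the reserved overhead is 17 * 128 KiB = 2228224 bytes.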
7698 
7699 /*
7700  * Evict buffers from the device write hand to the distance specified in
7701  * bytes.  This distance may span populated buffers, it may span nothing.
7702  * This is clearing a region on the L2ARC device ready for writing.
7703  * If the 'all' boolean is set, every buffer is evicted.
7704  */
7705 static void
7706 l2arc_evict_impl(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
7707 {
7708         list_t *buflist;
7709         arc_buf_hdr_t *hdr, *hdr_prev;
7710         kmutex_t *hash_lock;
7711         uint64_t taddr;
7712 
7713         buflist = &dev->l2ad_buflist;
7714 
7715         if (!all && dev->l2ad_first) {
7716                 /*
7717                  * This is the first sweep through the device.  There is
7718                  * nothing to evict.
7719                  */
7720                 return;
7721         }
7722 
7723         /*
7724          * We need to add in the worst case scenario of log block overhead.
7725          */
7726         distance += l2arc_log_blk_overhead(distance);
7727         if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
7728                 /*
7729                  * When nearing the end of the device, evict to the end
7730                  * before the device write hand jumps to the start.
7731                  */
7732                 taddr = dev->l2ad_end;
7733         } else {
7734                 taddr = dev->l2ad_hand + distance;
7735         }
7736         DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
7737             uint64_t, taddr, boolean_t, all);
7738 
7739 top:
7740         mutex_enter(&dev->l2ad_mtx);
7741         for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
7742                 hdr_prev = list_prev(buflist, hdr);
7743 
7744                 hash_lock = HDR_LOCK(hdr);
7745 
7746                 /*

7790                 } else {
7791                         ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
7792                         ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
7793                         /*
7794                          * Invalidate issued or about to be issued
7795                          * reads, since we may be about to write
7796                          * over this location.
7797                          */
7798                         if (HDR_L2_READING(hdr)) {
7799                                 ARCSTAT_BUMP(arcstat_l2_evict_reading);
7800                                 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
7801                         }
7802 
7803                         arc_hdr_l2hdr_destroy(hdr);
7804                 }
7805                 mutex_exit(hash_lock);
7806         }
7807         mutex_exit(&dev->l2ad_mtx);
7808 }
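
The choice of eviction target address near the start of the function above can be isolated as a small pure function. This is a sketch that mirrors the visible comparison, with the hypothetical name `toy_evict_target()`; it assumes `distance` already includes the log block overhead and that `2 * distance` does not exceed `end`, since the unsigned subtraction would otherwise wrap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the address up to which buffers should be evicted.  When the
 * write hand is within two distances of the device end, evict all the
 * way to the end, because the hand is about to jump back to the start.
 */
static uint64_t
toy_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	/* Assumption of this sketch: no unsigned wraparound below. */
	assert(distance <= end / 2);
	if (hand >= end - (2 * distance))
		return (end);
	return (hand + distance);
}
```

For example, with `end = 1000` and `distance = 50`, a hand at 100 yields a target of 150, while a hand at 950 yields the device end, 1000.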
7809 
7810 static void
7811 l2arc_evict_task(void *arg)
7812 {
7813         l2arc_dev_t *dev = arg;
7814         ASSERT(dev);
7815 
7816         /*
7817          * Evict l2arc buffers asynchronously; we need to keep the device
7818          * around until we are sure there aren't any buffers referencing it.
7819          * We do not need to hold any config locks, etc. because at this point,
7820          * we are the only ones who know about this device (the in-core
7821          * structure), so no new buffers can be created (e.g. if the pool is
7822          * re-imported while the asynchronous eviction is in progress) that
7823          * reference this same in-core structure. Also remove the vdev link
7824          * since further use of it as l2arc device is prohibited.
7825          */
7826         dev->l2ad_vdev = NULL;
7827         l2arc_evict_impl(dev, 0LL, B_TRUE);
7828 
7829         /* Same cleanup as in the synchronous path */
7830         list_destroy(&dev->l2ad_buflist);
7831         mutex_destroy(&dev->l2ad_mtx);
7832         refcount_destroy(&dev->l2ad_alloc);
7833         kmem_free(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
7834         kmem_free(dev, sizeof (l2arc_dev_t));
7835 }
7836 
7837 boolean_t zfs_l2arc_async_evict = B_TRUE;
7838 
7839 /*
7840  * Perform l2arc eviction for buffers associated with this device.
7841  * If evicting all buffers (done at pool export time), try to evict
7842  * asynchronously, and fall back to synchronous eviction in case of error.
7843  * Tell the caller whether to clean up the device:
7844  *  - B_TRUE means "asynchronous eviction, do not clean up"
7845  *  - B_FALSE means "synchronous eviction, done, please clean up"
7846  */
7847 static boolean_t
7848 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
7849 {
7850         /*
7851          *  If we are evicting all the buffers for this device, which happens
7852          *  at pool export time, schedule an asynchronous task.
7853          */
7854         if (all && zfs_l2arc_async_evict) {
7855                 if ((taskq_dispatch(arc_flush_taskq, l2arc_evict_task,
7856                     dev, TQ_NOSLEEP) == NULL)) {
7857                         /*
7858                          * Failed to dispatch the asynchronous task;
7859                          * evict synchronously and let the caller clean up.
7860                          */
7861                         l2arc_evict_impl(dev, distance, all);
7862                 } else {
7863                         /*
7864                          * Successful dispatch, vdev space updated
7865                          */
7866                         return (B_TRUE);
7867                 }
7868         } else {
7869                 /* Evict synchronously */
7870                 l2arc_evict_impl(dev, distance, all);
7871         }
7872 
7873         return (B_FALSE);
7874 }
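
The ownership hand-off described above (the asynchronous task cleans up after a successful dispatch, the caller cleans up otherwise) can be sketched in user space. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical, the dispatcher is a toy that stores one pending task rather than a real taskq, and nothing here is the arc.c API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for l2arc_dev_t; only tracks what happened. */
typedef struct demo_dev {
	int evicted;	/* set once the buffers have been evicted */
	int freed;	/* set once the structure has been torn down */
} demo_dev_t;

typedef void (*demo_task_fn)(void *);

/* Toy dispatcher state: one pending task, and a switch to force failure. */
static bool demo_dispatch_ok = true;
static demo_task_fn demo_pending_fn = NULL;
static void *demo_pending_arg = NULL;

/* Models a non-sleeping dispatch that may fail instead of blocking. */
static bool
demo_dispatch(demo_task_fn fn, void *arg)
{
	if (!demo_dispatch_ok)
		return (false);
	demo_pending_fn = fn;
	demo_pending_arg = arg;
	return (true);
}

/* Runs the pending task, as a worker thread would do later. */
static void
demo_run_pending(void)
{
	if (demo_pending_fn != NULL) {
		demo_task_fn fn = demo_pending_fn;
		demo_pending_fn = NULL;
		fn(demo_pending_arg);
	}
}

static void
demo_evict_impl(demo_dev_t *dev)
{
	dev->evicted = 1;
}

static void
demo_cleanup(demo_dev_t *dev)
{
	dev->freed = 1;
}

/* Asynchronous path: the task evicts and then owns the cleanup. */
static void
demo_evict_task(void *arg)
{
	demo_dev_t *dev = arg;

	demo_evict_impl(dev);
	demo_cleanup(dev);
}

/*
 * Returns true when an asynchronous task now owns the device (the caller
 * must not clean up), false when eviction already ran synchronously.
 */
static bool
demo_evict(demo_dev_t *dev, bool all)
{
	if (all && demo_dispatch(demo_evict_task, dev))
		return (true);
	demo_evict_impl(dev);
	return (false);
}

/* Caller side: clean up only if eviction was synchronous. */
static void
demo_remove(demo_dev_t *dev)
{
	if (!demo_evict(dev, true))
		demo_cleanup(dev);
}
```

Calling `demo_remove()` with `demo_dispatch_ok` set leaves the device untouched until `demo_run_pending()` runs; with it cleared, eviction and cleanup both finish before `demo_remove()` returns. Either way the cleanup happens exactly once.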
7875 
7876 /*
7877  * Find and write ARC buffers to the L2ARC device.
7878  *
7879  * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
7880  * for reading until they have completed writing.
7881  * The feed_type argument restricts selection to DDT or non-DDT buffers,
7882  * matching the kind of device being fed.
7883  *
7884  * Returns the number of bytes actually written (which may be smaller than
7885  * the delta by which the device hand has changed due to alignment).
7886  */
7887 static uint64_t
7888 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
7889     l2ad_feed_t feed_type)
7890 {
7891         arc_buf_hdr_t *hdr, *hdr_prev, *head;
7892         /*
7893          * We must carefully track the space we deal with here:
7894          * - write_lsize: sum of the logical sizes (HDR_GET_LSIZE) of all
7895          *      buffers to be written, without compression or alignment
7896          *      applied. This size is added to arcstat_l2_lsize.
7897          * - write_psize: sum of the sizes of all buffers as held in the
7898          *      ARC (arc_hdr_size). This size is added to arcstat_l2_psize
7899          *      and charged to the vdev via vdev_space_update().
7900          * - write_asize: sum of the sizes of all buffers to be written with
7901          *      inter-buffer alignment applied. This is the most we could take
7902          *      up on the device, so it is used to gauge how close we are to
7903          *      hitting target_sz.
7904          */
7905         uint64_t write_asize, write_psize, write_lsize, headroom;
7906         boolean_t full;
7907         l2arc_write_callback_t *cb;
7908         zio_t *pio, *wzio;
7909         enum l2arc_priorities try;
7910         uint64_t guid = spa_load_guid(spa);
7911         boolean_t dev_hdr_update = B_FALSE;
7912 
7913         ASSERT3P(dev->l2ad_vdev, !=, NULL);
7914 
7915         pio = NULL;
7916         cb = NULL;
7917         write_lsize = write_asize = write_psize = 0;
7918         full = B_FALSE;
7919         head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
7920         arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
7921 
7922         /*
7923          * Copy buffers for L2ARC writing.
7924          */
7925         for (try = PRIORITY_MFU_DDT; try < PRIORITY_NUMTYPES; try++) {
7926                 multilist_sublist_t *mls = l2arc_sublist_lock(try);
7927                 uint64_t passed_sz = 0;
7928 
7929                 /*
7930                  * L2ARC fast warmup.
7931                  *
7932                  * Until the ARC is warm and starts to evict, read from the
7933                  * head of the ARC lists rather than the tail.
7934                  */
7935                 if (arc_warm == B_FALSE)
7936                         hdr = multilist_sublist_head(mls);
7937                 else
7938                         hdr = multilist_sublist_tail(mls);
7939 
7940                 headroom = target_sz * l2arc_headroom;
7941                 if (zfs_compressed_arc_enabled)
7942                         headroom = (headroom * l2arc_headroom_boost) / 100;
7943 
7944                 for (; hdr; hdr = hdr_prev) {
7945                         kmutex_t *hash_lock;

7975                          * We rely on the L1 portion of the header below, so
7976                          * it's invalid for this header to have been evicted out
7977                          * of the ghost cache, prior to being written out. The
7978                          * ARC_FLAG_L2_WRITING bit ensures this won't happen.
7979                          */
7980                         ASSERT(HDR_HAS_L1HDR(hdr));
7981 
7982                         ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
7983                         ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
7984                         ASSERT3U(arc_hdr_size(hdr), >, 0);
7985                         uint64_t psize = arc_hdr_size(hdr);
7986                         uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
7987                             psize);
7988 
7989                         if ((write_asize + asize) > target_sz) {
7990                                 full = B_TRUE;
7991                                 mutex_exit(hash_lock);
7992                                 break;
7993                         }
7994 
7995                         /* make sure buf we select corresponds to feed_type */
7996                         if ((feed_type == L2ARC_FEED_DDT_DEV &&
7997                             arc_buf_type(hdr) != ARC_BUFC_DDT) ||
7998                             (feed_type == L2ARC_FEED_NON_DDT_DEV &&
7999                             arc_buf_type(hdr) == ARC_BUFC_DDT)) {
8000                                         mutex_exit(hash_lock);
8001                                         continue;
8002                         }
8003 
8004                         if (pio == NULL) {
8005                                 /*
8006                                  * Insert a dummy header on the buflist so
8007                                  * l2arc_write_done() can find where the
8008                                  * write buffers begin without searching.
8009                                  */
8010                                 mutex_enter(&dev->l2ad_mtx);
8011                                 list_insert_head(&dev->l2ad_buflist, head);
8012                                 mutex_exit(&dev->l2ad_mtx);
8013 
8014                                 cb = kmem_zalloc(
8015                                     sizeof (l2arc_write_callback_t), KM_SLEEP);
8016                                 cb->l2wcb_dev = dev;
8017                                 cb->l2wcb_head = head;
8018                                 list_create(&cb->l2wcb_log_blk_buflist,
8019                                     sizeof (l2arc_log_blk_buf_t),
8020                                     offsetof(l2arc_log_blk_buf_t, lbb_node));
8021                                 pio = zio_root(spa, l2arc_write_done, cb,
8022                                     ZIO_FLAG_CANFAIL);
8023                         }
8024 
8025                         hdr->b_l2hdr.b_dev = dev;
8026                         hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
8027                         arc_hdr_set_flags(hdr,
8028                             ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
8029 
8030                         mutex_enter(&dev->l2ad_mtx);
8031                         list_insert_head(&dev->l2ad_buflist, hdr);
8032                         mutex_exit(&dev->l2ad_mtx);
8033 
8034                         (void) refcount_add_many(&dev->l2ad_alloc, psize, hdr);
8035 
8036                         /*
8037                          * Normally the L2ARC can use the hdr's data, but if
8038                          * we're sharing data between the hdr and one of its
8039                          * bufs, L2ARC needs its own copy of the data so that
8040                          * the ZIO below can't race with the buf consumer.
8041                          * Another case where we need to create a copy of the
8042                          * data is when the buffer size is not device-aligned
8043                          * and we need to pad the block to make it such.
8044                          * That also keeps the clock hand suitably aligned.
8045                          *
8046                          * To ensure that the copy will be available for the
8047                          * lifetime of the ZIO and be cleaned up afterwards, we
8048                          * add it to the l2arc_free_on_write queue.
8049                          */
8050                         abd_t *to_write;
8051                         if (!HDR_SHARED_DATA(hdr) && psize == asize) {
8052                                 to_write = hdr->b_l1hdr.b_pabd;
8053                         } else {
8054                                 to_write = abd_alloc_for_io(asize,
8055                                     !HDR_ISTYPE_DATA(hdr));
8056                                 abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
8057                                 if (asize != psize) {
8058                                         abd_zero_off(to_write, psize,
8059                                             asize - psize);
8060                                 }
8061                                 l2arc_free_abd_on_write(to_write, asize,
8062                                     arc_buf_type(hdr));
8063                         }
8064                         wzio = zio_write_phys(pio, dev->l2ad_vdev,
8065                             hdr->b_l2hdr.b_daddr, asize, to_write,
8066                             ZIO_CHECKSUM_OFF, NULL, hdr,
8067                             ZIO_PRIORITY_ASYNC_WRITE,
8068                             ZIO_FLAG_CANFAIL, B_FALSE);
8069 
8070                         write_lsize += HDR_GET_LSIZE(hdr);
8071                         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
8072                             zio_t *, wzio);
8073 
8074                         write_psize += psize;
8075                         write_asize += asize;
8076                         dev->l2ad_hand += asize;
8077 
8078                         mutex_exit(hash_lock);
8079 
8080                         (void) zio_nowait(wzio);
8081 
8082                         /*
8083                          * Append buf info to current log and commit if full.
8084                          * arcstat_l2_{size,asize} kstats are updated internally.
8085                          */
8086                         if (l2arc_log_blk_insert(dev, hdr)) {
8087                                 l2arc_log_blk_commit(dev, pio, cb);
8088                                 dev_hdr_update = B_TRUE;
8089                         }
8090                 }
8091 
8092                 multilist_sublist_unlock(mls);
8093 
8094                 if (full == B_TRUE)
8095                         break;
8096         }
8097 
8098         /* No buffers selected for writing? */
8099         if (pio == NULL) {
8100                 ASSERT0(write_lsize);
8101                 ASSERT(!HDR_HAS_L1HDR(head));
8102                 kmem_cache_free(hdr_l2only_cache, head);
8103                 return (0);
8104         }
8105 
8106         /*
8107          * If we wrote any logs as part of this write, update dev hdr
8108          * to point to it.
8109          */
8110         if (dev_hdr_update)
8111                 l2arc_dev_hdr_update(dev, pio);
8112 
8113         ASSERT3U(write_asize, <=, target_sz);
8114         ARCSTAT_BUMP(arcstat_l2_writes_sent);
8115         ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
8116         if (feed_type == L2ARC_FEED_DDT_DEV)
8117                 ARCSTAT_INCR(arcstat_l2_ddt_write_bytes, write_psize);
8118         ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
8119         ARCSTAT_INCR(arcstat_l2_psize, write_psize);
8120         vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
8121 
8122         /*
8123          * Bump device hand to the device start if it is approaching the end.
8124          * l2arc_evict() will already have evicted ahead for this case.
8125          */
8126         if (dev->l2ad_hand + target_sz + l2arc_log_blk_overhead(target_sz) >=
8127             dev->l2ad_end) {
8128                 dev->l2ad_hand = dev->l2ad_start;
8129                 dev->l2ad_first = B_FALSE;
8130         }
8131 
8132         dev->l2ad_writing = B_TRUE;
8133         (void) zio_wait(pio);
8134         dev->l2ad_writing = B_FALSE;
8135 
8136         return (write_asize);
8137 }
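
The padding step in l2arc_write_buffers() (copy the buffer when psize != asize and zero the tail so the write is device-aligned) can be sketched as follows. This is a minimal user-space illustration: `demo_psize_to_asize()` assumes a plain round-up to a multiple of `1 << ashift`, which is a simplification of vdev_psize_to_asize(), and `demo_pad_for_write()` is a hypothetical helper that uses malloc() in place of the ABD routines.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Round psize up to the next multiple of (1 << ashift).
 * Assumption: a simple leaf-device round-up, not the real vdev op.
 */
static uint64_t
demo_psize_to_asize(uint64_t psize, unsigned ashift)
{
	uint64_t align = 1ULL << ashift;

	return ((psize + align - 1) & ~(align - 1));
}

/*
 * Return a newly allocated asize-byte copy of src with the bytes past
 * psize zeroed. The caller frees the result once the write completes
 * (the kernel code queues its copy on l2arc_free_on_write instead).
 */
static uint8_t *
demo_pad_for_write(const uint8_t *src, uint64_t psize, uint64_t asize)
{
	uint8_t *buf = malloc(asize);

	if (buf == NULL)
		return (NULL);
	memcpy(buf, src, psize);
	if (asize != psize)
		memset(buf + psize, 0, asize - psize);
	return (buf);
}
```

Because each write advances the hand by asize rather than psize, the next buffer also starts on an aligned offset, which is the "keeps the clock hand suitably aligned" point in the comment above.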
8138 
8139 static boolean_t
8140 l2arc_feed_dev(l2ad_feed_t feed_type, uint64_t *wrote)
8141 {
8142         spa_t *spa;
8143         l2arc_dev_t *dev;
8144         uint64_t size;
8145 
8146         /*
8147          * This selects the next l2arc device to write to, and in
8148          * doing so the next spa to feed from: dev->l2ad_spa.   This
8149          * will return NULL if there are now no l2arc devices or if
8150          * they are all faulted.
8151          *
8152          * If a device is returned, its spa's config lock is also
8153          * held to prevent device removal.  l2arc_dev_get_next()
8154          * will grab and release l2arc_dev_mtx.
8155          */
8156         if ((dev = l2arc_dev_get_next(feed_type)) == NULL)
8157                 return (B_FALSE);
8158 
8159         spa = dev->l2ad_spa;
8160         ASSERT(spa != NULL);
8161 
8162         /*
8163          * If the pool is read-only, skip it.
8164          */
8165         if (!spa_writeable(spa)) {
8166                 spa_config_exit(spa, SCL_L2ARC, dev);
8167                 return (B_FALSE);
8168         }
8169 
8170         ARCSTAT_BUMP(arcstat_l2_feeds);
8171         size = l2arc_write_size();
8172 
8173         /*
8174          * Evict L2ARC buffers that will be overwritten.
8175          * B_FALSE guarantees synchronous eviction.
8176          */
8177         (void) l2arc_evict(dev, size, B_FALSE);
8178 
8179         /*
8180          * Write ARC buffers.
8181          */
8182         *wrote = l2arc_write_buffers(spa, dev, size, feed_type);
8183 
8184         spa_config_exit(spa, SCL_L2ARC, dev);
8185 
8186         return (B_TRUE);
8187 }
8188 
8189 /*
8190  * This thread feeds the L2ARC at regular intervals.  This is the beating
8191  * heart of the L2ARC.
8192  */
8193 /* ARGSUSED */
8194 static void
8195 l2arc_feed_thread(void *unused)
8196 {
8197         callb_cpr_t cpr;
8198         uint64_t size, total_written = 0;
8199         clock_t begin, next = ddi_get_lbolt();
8200         l2ad_feed_t feed_type = L2ARC_FEED_ALL;
8201 
8202         CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
8203 
8204         mutex_enter(&l2arc_feed_thr_lock);
8205 
8206         while (l2arc_thread_exit == 0) {
8207                 CALLB_CPR_SAFE_BEGIN(&cpr);
8208                 (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
8209                     next);
8210                 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
8211                 next = ddi_get_lbolt() + hz;
8212 
8213                 /*
8214                  * Quick check for L2ARC devices.
8215                  */
8216                 mutex_enter(&l2arc_dev_mtx);
8217                 if (l2arc_ndev == 0) {
8218                         mutex_exit(&l2arc_dev_mtx);
8219                         continue;
8220                 }
8221                 mutex_exit(&l2arc_dev_mtx);
8222                 begin = ddi_get_lbolt();
8223 
8224                 /*
8225                  * Avoid contributing to memory pressure.
8226                  */
8227                 if (arc_reclaim_needed()) {
8228                         ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
8229                         continue;
8230                 }
8231 
8232                 /* try to write to DDT L2ARC device if any */
8233                 if (l2arc_feed_dev(L2ARC_FEED_DDT_DEV, &size)) {
8234                         total_written += size;
8235                         feed_type = L2ARC_FEED_NON_DDT_DEV;
8236                 }
8237 
8238                 /* try to write to the regular L2ARC device if any */
8239                 if (l2arc_feed_dev(feed_type, &size)) {
8240                         total_written += size;
8241                         if (feed_type == L2ARC_FEED_NON_DDT_DEV)
8242                                 total_written /= 2; /* avg written per device */
8243                 }
8244 
8245                 /*
8246                  * Calculate interval between writes.
8247                  */
8248                 next = l2arc_write_interval(begin, l2arc_write_size(),
8249                     total_written);
8250 
8251                 total_written = 0;
8252         }
8253 
8254         l2arc_thread_exit = 0;
8255         cv_broadcast(&l2arc_feed_thr_cv);
8256         CALLB_CPR_EXIT(&cpr);               /* drops l2arc_feed_thr_lock */
8257         thread_exit();
8258 }
8259 
8260 boolean_t
8261 l2arc_vdev_present(vdev_t *vd)
8262 {
8263         return (l2arc_vdev_get(vd) != NULL);
8264 }
8265 
8266 /*
8267  * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
8268  * the vdev_t isn't an L2ARC device.
8269  */
8270 static l2arc_dev_t *
8271 l2arc_vdev_get(vdev_t *vd)
8272 {
8273         l2arc_dev_t     *dev;
8274         boolean_t       held = MUTEX_HELD(&l2arc_dev_mtx);
8275 
8276         if (!held)
8277                 mutex_enter(&l2arc_dev_mtx);
8278         for (dev = list_head(l2arc_dev_list); dev != NULL;
8279             dev = list_next(l2arc_dev_list, dev)) {
8280                 if (dev->l2ad_vdev == vd)
8281                         break;
8282         }
8283         if (!held)
8284                 mutex_exit(&l2arc_dev_mtx);
8285 
8286         return (dev);
8287 }
8288 
8289 /*
8290  * Add a vdev for use by the L2ARC.  By this point the spa has already
8291  * validated the vdev and opened it. The `rebuild' flag indicates whether
8292  * we should attempt an L2ARC persistency rebuild.
8293  */
8294 void
8295 l2arc_add_vdev(spa_t *spa, vdev_t *vd, boolean_t rebuild)
8296 {
8297         l2arc_dev_t *adddev;
8298 
8299         ASSERT(!l2arc_vdev_present(vd));
8300 
8301         /*
8302          * Create a new l2arc device entry.
8303          */
8304         adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
8305         adddev->l2ad_spa = spa;
8306         adddev->l2ad_vdev = vd;
8307         /* leave extra size for an l2arc device header */
8308         adddev->l2ad_dev_hdr_asize = MAX(sizeof (*adddev->l2ad_dev_hdr),
8309             1 << vd->vdev_ashift);
8310         adddev->l2ad_start = VDEV_LABEL_START_SIZE + adddev->l2ad_dev_hdr_asize;
8311         adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
8312         ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
8313         adddev->l2ad_hand = adddev->l2ad_start;
8314         adddev->l2ad_first = B_TRUE;
8315         adddev->l2ad_writing = B_FALSE;
8316         adddev->l2ad_dev_hdr = kmem_zalloc(adddev->l2ad_dev_hdr_asize,
8317             KM_SLEEP);
8318 
8319         mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
8320         /*
8321          * This is a list of all ARC buffers that are still valid on the
8322          * device.
8323          */
8324         list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
8325             offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
8326 
8327         vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
8328         refcount_create(&adddev->l2ad_alloc);
8329 
8330         /*
8331          * Add device to global list
8332          */
8333         mutex_enter(&l2arc_dev_mtx);
8334         list_insert_head(l2arc_dev_list, adddev);
8335         atomic_inc_64(&l2arc_ndev);
8336         if (rebuild && l2arc_rebuild_enabled &&
8337             adddev->l2ad_end - adddev->l2ad_start > L2ARC_PERSIST_MIN_SIZE) {
8338                 /*
8339                  * Just mark the device as pending for a rebuild. We won't
8340                  * be starting a rebuild in line here as it would block pool
8341                  * import. Instead spa_load_impl will hand that off to an
8342                  * async task which will call l2arc_spa_rebuild_start.
8343                  */
8344                 adddev->l2ad_rebuild = B_TRUE;
8345         }
8346         mutex_exit(&l2arc_dev_mtx);
8347 }
8348 
8349 /*
8350  * Remove a vdev from the L2ARC.
8351  */
8352 void
8353 l2arc_remove_vdev(vdev_t *vd)
8354 {
8355         l2arc_dev_t *dev, *nextdev, *remdev = NULL;
8356 
8357         /*
8358          * Find the device by vdev
8359          */
8360         mutex_enter(&l2arc_dev_mtx);
8361         for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
8362                 nextdev = list_next(l2arc_dev_list, dev);
8363                 if (vd == dev->l2ad_vdev) {
8364                         remdev = dev;
8365                         break;
8366                 }
8367         }
8368         ASSERT3P(remdev, !=, NULL);
8369 
8370         /*
8371          * Cancel any ongoing or scheduled rebuild (race protection with
8372          * l2arc_spa_rebuild_start provided via l2arc_dev_mtx).
8373          */
8374         remdev->l2ad_rebuild_cancel = B_TRUE;
8375         if (remdev->l2ad_rebuild_did != 0) {
8376                 /*
8377                  * N.B. it should be safe to thread_join with the rebuild
8378                  * thread while holding l2arc_dev_mtx because it is not
8379                  * accessed from anywhere in the l2arc rebuild code below
8380                  * (except for l2arc_spa_rebuild_start, which is ok).
8381                  */
8382                 thread_join(remdev->l2ad_rebuild_did);
8383         }
8384 
8385         /*
8386          * Remove device from global list
8387          */
8388         list_remove(l2arc_dev_list, remdev);
8389         l2arc_dev_last = NULL;          /* may have been invalidated */
8390         l2arc_ddt_dev_last = NULL;      /* may have been invalidated */
8391         atomic_dec_64(&l2arc_ndev);
8392         mutex_exit(&l2arc_dev_mtx);
8393 
8394         if (vdev_type_is_ddt(remdev->l2ad_vdev))
8395                 atomic_add_64(&remdev->l2ad_spa->spa_l2arc_ddt_devs_size,
8396                     -(vdev_get_min_asize(remdev->l2ad_vdev)));
8397 
8398         /*
8399          * Clear all buflists and ARC references.  L2ARC device flush.
8400          */
8401         if (l2arc_evict(remdev, 0, B_TRUE) == B_FALSE) {
8402                 /*
8403                  * The eviction was done synchronously; clean up here.
8404                  * Otherwise, the asynchronous task will clean up.
8405                  */
8406                 list_destroy(&remdev->l2ad_buflist);
8407                 mutex_destroy(&remdev->l2ad_mtx);
8408                 kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
8409                 kmem_free(remdev, sizeof (l2arc_dev_t));
8410         }
8411 }
8412 
8413 void
8414 l2arc_init(void)
8415 {
8416         l2arc_thread_exit = 0;
8417         l2arc_ndev = 0;
8418         l2arc_writes_sent = 0;
8419         l2arc_writes_done = 0;
8420 
8421         mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
8422         cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
8423         mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
8424         mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
8425 
8426         l2arc_dev_list = &L2ARC_dev_list;
8427         l2arc_free_on_write = &L2ARC_free_on_write;
8428         list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
8429             offsetof(l2arc_dev_t, l2ad_node));
8430         list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),

8456 {
8457         if (!(spa_mode_global & FWRITE))
8458                 return;
8459 
8460         (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
8461             TS_RUN, minclsyspri);
8462 }
8463 
8464 void
8465 l2arc_stop(void)
8466 {
8467         if (!(spa_mode_global & FWRITE))
8468                 return;
8469 
8470         mutex_enter(&l2arc_feed_thr_lock);
8471         cv_signal(&l2arc_feed_thr_cv);      /* kick thread out of startup */
8472         l2arc_thread_exit = 1;
8473         while (l2arc_thread_exit != 0)
8474                 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
8475         mutex_exit(&l2arc_feed_thr_lock);
8476 }
8477 
8478 /*
8479  * Punches out rebuild threads for the L2ARC devices in a spa. This should
8480  * be called after pool import from the spa async thread, since starting
8481  * these threads directly from spa_import() will make them part of the
8482  * "zpool import" context and delay process exit (and thus pool import).
8483  */
8484 void
8485 l2arc_spa_rebuild_start(spa_t *spa)
8486 {
8487         /*
8488          * Locate the spa's l2arc devices and kick off rebuild threads.
8489          */
8490         mutex_enter(&l2arc_dev_mtx);
8491         for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
8492                 l2arc_dev_t *dev =
8493                     l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
8494                 if (dev == NULL) {
8495                         /* Don't attempt a rebuild if the vdev is UNAVAIL */
8496                         continue;
8497                 }
8498                 if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
8499                         VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
8500 #ifdef  _KERNEL
8501                         dev->l2ad_rebuild_did = thread_create(NULL, 0,
8502                             l2arc_dev_rebuild_start, dev, 0, &p0, TS_RUN,
8503                             minclsyspri)->t_did;
8504 #endif
8505                 }
8506         }
8507         mutex_exit(&l2arc_dev_mtx);
8508 }
8509 
8510 /*
8511  * Main entry point for L2ARC rebuilding.
8512  */
8513 static void
8514 l2arc_dev_rebuild_start(l2arc_dev_t *dev)
8515 {
8516         if (!dev->l2ad_rebuild_cancel) {
8517                 VERIFY(dev->l2ad_rebuild);
8518                 (void) l2arc_rebuild(dev);
8519                 dev->l2ad_rebuild = B_FALSE;
8520         }
8521 }
8522 
8523 /*
8524  * This function implements the actual L2ARC metadata rebuild. It:
8525  *
8526  * 1) reads the device's header
8527  * 2) if a good device header is found, starts reading the log block chain
8528  * 3) restores each block's contents to memory (reconstructing arc_buf_hdr_t's)
8529  *
8530  * Operation stops under any of the following conditions:
8531  *
8532  * 1) We reach the end of the log blk chain (the back-reference in the blk is
8533  *    invalid or loops over our starting point).
8534  * 2) We encounter *any* error condition (cksum errors, io errors, looped
8535  *    blocks, etc.).
8536  */
8537 static int
8538 l2arc_rebuild(l2arc_dev_t *dev)
8539 {
8540         vdev_t                  *vd = dev->l2ad_vdev;
8541         spa_t                   *spa = vd->vdev_spa;
8542         int                     err;
8543         l2arc_log_blk_phys_t    *this_lb, *next_lb;
8544         uint8_t                 *this_lb_buf, *next_lb_buf;
8545         zio_t                   *this_io = NULL, *next_io = NULL;
8546         l2arc_log_blkptr_t      lb_ptrs[2];
8547         boolean_t               first_pass, lock_held;
8548         uint64_t                load_guid;
8549 
8550         this_lb = kmem_zalloc(sizeof (*this_lb), KM_SLEEP);
8551         next_lb = kmem_zalloc(sizeof (*next_lb), KM_SLEEP);
8552         this_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
8553         next_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
8554 
8555         /*
8556          * We prevent device removal while issuing reads to the device,
8557          * then during the rebuilding phases we drop this lock again so
8558          * that a spa_unload or device remove can be initiated - this is
8559          * safe, because the spa will signal us to stop before removing
8560          * our device and wait for us to stop.
8561          */
8562         spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
8563         lock_held = B_TRUE;
8564 
8565         load_guid = spa_load_guid(dev->l2ad_vdev->vdev_spa);
8566         /*
8567          * Device header processing phase.
8568          */
8569         if ((err = l2arc_dev_hdr_read(dev)) != 0) {
8570                 /* device header corrupted, start a new one */
8571                 bzero(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
8572                 goto out;
8573         }
8574 
8575         /* Retrieve the persistent L2ARC device state */
8576         dev->l2ad_hand = vdev_psize_to_asize(dev->l2ad_vdev,
8577             dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr +
8578             LBP_GET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0]));
8579         dev->l2ad_first = !!(dev->l2ad_dev_hdr->dh_flags &
8580             L2ARC_DEV_HDR_EVICT_FIRST);
8581 
8582         /* Prepare the rebuild processing state */
8583         bcopy(dev->l2ad_dev_hdr->dh_start_lbps, lb_ptrs, sizeof (lb_ptrs));
8584         first_pass = B_TRUE;
8585 
8586         /* Start the rebuild process */
8587         for (;;) {
8588                 if (!l2arc_log_blkptr_valid(dev, &lb_ptrs[0]))
8589                         /* We hit an invalid block address, end the rebuild. */
8590                         break;
8591 
8592                 if ((err = l2arc_log_blk_read(dev, &lb_ptrs[0], &lb_ptrs[1],
8593                     this_lb, next_lb, this_lb_buf, next_lb_buf,
8594                     this_io, &next_io)) != 0)
8595                         break;
8596 
8597                 spa_config_exit(spa, SCL_L2ARC, vd);
8598                 lock_held = B_FALSE;
8599 
8600                 /* Protection against infinite loops of log blocks. */
8601                 if (l2arc_range_check_overlap(lb_ptrs[1].lbp_daddr,
8602                     lb_ptrs[0].lbp_daddr,
8603                     dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
8604                     !first_pass) {
8605                         ARCSTAT_BUMP(arcstat_l2_rebuild_abort_loop_errors);
8606                         err = SET_ERROR(ELOOP);
8607                         break;
8608                 }
8609 
8610                 /*
8611                  * Our memory pressure valve. If the system is running low
8612                  * on memory, rather than swamping memory with new ARC buf
8613                  * hdrs, we opt not to rebuild the L2ARC. At this point,
8614                  * however, we have already set up our L2ARC dev to chain in
8615                  * new metadata log blk, so the user may choose to re-add the
8616                  * L2ARC dev at a later time to reconstruct it (when there's
8617                  * less memory pressure).
8618                  */
8619                 if (arc_reclaim_needed()) {
8620                         ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
8621                         cmn_err(CE_NOTE, "System running low on memory, "
8622                             "aborting L2ARC rebuild.");
8623                         err = SET_ERROR(ENOMEM);
8624                         break;
8625                 }
8626 
8627                 /*
8628                  * Now that we know that the next_lb checks out alright, we
8629                  * can start reconstruction from this lb - we can be sure
8630                  * that the L2ARC write hand has not yet reached any of our
8631                  * buffers.
8632                  */
8633                 l2arc_log_blk_restore(dev, load_guid, this_lb,
8634                     LBP_GET_PSIZE(&lb_ptrs[0]));
8635 
8636                 /*
8637                  * End of list detection. We can look ahead two steps in the
8638                  * blk chain and if the 2nd blk from this_lb dips below the
8639                  * initial chain starting point, then we know two things:
8640                  *      1) it can't be valid, and
8641                  *      2) the next_lb's ARC entries might have already been
8642                  *      partially overwritten and so we should stop before
8643                  *      we restore it
8644                  */
8645                 if (l2arc_range_check_overlap(
8646                     this_lb->lb_back2_lbp.lbp_daddr, lb_ptrs[0].lbp_daddr,
8647                     dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
8648                     !first_pass)
8649                         break;
8650 
8651                 /* log blk restored, continue with next one in the list */
8652                 lb_ptrs[0] = lb_ptrs[1];
8653                 lb_ptrs[1] = this_lb->lb_back2_lbp;
8654                 PTR_SWAP(this_lb, next_lb);
8655                 PTR_SWAP(this_lb_buf, next_lb_buf);
8656                 this_io = next_io;
8657                 next_io = NULL;
8658                 first_pass = B_FALSE;
8659 
8660                 for (;;) {
8661                         if (dev->l2ad_rebuild_cancel) {
8662                                 err = SET_ERROR(ECANCELED);
8663                                 goto out;
8664                         }
8665                         if (spa_config_tryenter(spa, SCL_L2ARC, vd,
8666                             RW_READER)) {
8667                                 lock_held = B_TRUE;
8668                                 break;
8669                         }
8670                         /*
8671                          * L2ARC config lock held by somebody in writer,
8672                          * possibly due to them trying to remove us. They'll
8673                          * likely want us to shut down, so after a short
8674                          * delay, we check l2ad_rebuild_cancel and retry
8675                          * the lock.
8676                          */
8677                         delay(1);
8678                 }
8679         }
8680 out:
8681         if (next_io != NULL)
8682                 l2arc_log_blk_prefetch_abort(next_io);
8683         kmem_free(this_lb, sizeof (*this_lb));
8684         kmem_free(next_lb, sizeof (*next_lb));
8685         kmem_free(this_lb_buf, sizeof (l2arc_log_blk_phys_t));
8686         kmem_free(next_lb_buf, sizeof (l2arc_log_blk_phys_t));
8687         if (err == 0)
8688                 ARCSTAT_BUMP(arcstat_l2_rebuild_successes);
8689 
8690         if (lock_held)
8691                 spa_config_exit(spa, SCL_L2ARC, vd);
8692 
8693         return (err);
8694 }
8695 
8696 /*
8697  * Attempts to read the device header on the provided L2ARC device and writes
8698  * it to dev->l2ad_dev_hdr. On success, this function returns 0, otherwise the
8699  * appropriate error code is returned.
8700  */
8701 static int
8702 l2arc_dev_hdr_read(l2arc_dev_t *dev)
8703 {
8704         int                     err;
8705         uint64_t                guid;
8706         zio_cksum_t             cksum;
8707         l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
8708         const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
8709         abd_t                   *abd;
8710 
8711         guid = spa_guid(dev->l2ad_vdev->vdev_spa);
8712 
8713         abd = abd_get_from_buf(hdr, hdr_asize);
8714         err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
8715             VDEV_LABEL_START_SIZE, hdr_asize, abd,
8716             ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
8717             ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
8718             ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
8719         abd_put(abd);
8720         if (err != 0) {
8721                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
8722                 return (err);
8723         }
8724 
8725         if (hdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC_V1))
8726                 byteswap_uint64_array(hdr, sizeof (*hdr));
8727 
8728         if (hdr->dh_magic != L2ARC_DEV_HDR_MAGIC_V1 ||
8729             hdr->dh_spa_guid != guid) {
8730                 /*
8731                  * Attempt to rebuild a device containing no actual dev hdr
8732                  * or containing a header from some other pool.
8733                  */
8734                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
8735                 return (SET_ERROR(ENOTSUP));
8736         }
8737 
8738         l2arc_dev_hdr_checksum(hdr, &cksum);
8739         if (!ZIO_CHECKSUM_EQUAL(hdr->dh_self_cksum, cksum)) {
8740                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
8741                 return (SET_ERROR(EINVAL));
8742         }
8743 
8744         return (0);
8745 }
8746 
8747 /*
8748  * Reads L2ARC log blocks from storage and validates their contents.
8749  *
8750  * This function implements a simple prefetcher to make sure that while
8751  * we're processing one buffer the L2ARC is already prefetching the next
8752  * one in the chain.
8753  *
8754  * The arguments this_lbp and next_lbp point to the current and next log blk
8755  * address in the block chain. Similarly, this_lb and next_lb hold the
8756  * l2arc_log_blk_phys_t's of the current and next L2ARC blk. The this_lb_buf
8757  * and next_lb_buf must be buffers of appropriate size to hold a raw
8758  * l2arc_log_blk_phys_t (they are used as catch buffers for read ops prior
8759  * to buffer decompression).
8760  *
8761  * The `this_io' and `next_io' arguments are used for block prefetching.
8762  * When issuing the first blk IO during rebuild, you should pass NULL for
8763  * `this_io'. This function will then issue a sync IO to read the block and
8764  * also issue an async IO to fetch the next block in the block chain. The
8765  * prefetch IO is returned in `next_io'. On subsequent calls to this
8766  * function, pass the value returned in `next_io' from the previous call
8767  * as `this_io' and a fresh `next_io' pointer to hold the next prefetch IO.
8768  * Prior to the call, you should initialize your `next_io' pointer to be
8769  * NULL. If no prefetch IO was issued, the pointer is left set at NULL.
8770  *
8771  * On success, this function returns 0, otherwise it returns an appropriate
8772  * error code. On error the prefetching IO is aborted and cleared before
8773  * returning from this function. Therefore, if we return an error, the
8774  * caller can assume that we have taken care of cleanup of prefetch IOs.
8775  */
8776 static int
8777 l2arc_log_blk_read(l2arc_dev_t *dev,
8778     const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
8779     l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
8780     uint8_t *this_lb_buf, uint8_t *next_lb_buf,
8781     zio_t *this_io, zio_t **next_io)
8782 {
8783         int             err = 0;
8784         zio_cksum_t     cksum;
8785 
8786         ASSERT(this_lbp != NULL && next_lbp != NULL);
8787         ASSERT(this_lb != NULL && next_lb != NULL);
8788         ASSERT(this_lb_buf != NULL && next_lb_buf != NULL);
8789         ASSERT(next_io != NULL && *next_io == NULL);
8790         ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
8791 
8792         /*
8793          * Check to see if we have issued the IO for this log blk in a
8794          * previous run. If not, this is the first call, so issue it now.
8795          */
8796         if (this_io == NULL) {
8797                 this_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, this_lbp,
8798                     this_lb_buf);
8799         }
8800 
8801         /*
8802          * Peek to see if we can start issuing the next IO immediately.
8803          */
8804         if (l2arc_log_blkptr_valid(dev, next_lbp)) {
8805                 /*
8806                  * Start issuing IO for the next log blk early - this
8807                  * should help keep the L2ARC device busy while we
8808                  * decompress and restore this log blk.
8809                  */
8810                 *next_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, next_lbp,
8811                     next_lb_buf);
8812         }
8813 
8814         /* Wait for the IO to read this log block to complete */
8815         if ((err = zio_wait(this_io)) != 0) {
8816                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
8817                 goto cleanup;
8818         }
8819 
8820         /* Make sure the buffer checks out */
8821         fletcher_4_native(this_lb_buf, LBP_GET_PSIZE(this_lbp), NULL, &cksum);
8822         if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
8823                 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
8824                 err = SET_ERROR(EINVAL);
8825                 goto cleanup;
8826         }
8827 
8828         /* Now we can take our time decoding this buffer */
8829         switch (LBP_GET_COMPRESS(this_lbp)) {
8830         case ZIO_COMPRESS_OFF:
8831                 bcopy(this_lb_buf, this_lb, sizeof (*this_lb));
8832                 break;
8833         case ZIO_COMPRESS_LZ4:
8834                 err = zio_decompress_data_buf(LBP_GET_COMPRESS(this_lbp),
8835                     this_lb_buf, this_lb, LBP_GET_PSIZE(this_lbp),
8836                     sizeof (*this_lb));
8837                 if (err != 0) {
8838                         err = SET_ERROR(EINVAL);
8839                         goto cleanup;
8840                 }
8841 
8842                 break;
8843         default:
8844                 err = SET_ERROR(EINVAL);
8845                 goto cleanup;
8846         }
8847 
8848         if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
8849                 byteswap_uint64_array(this_lb, sizeof (*this_lb));
8850 
8851         if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
8852                 err = SET_ERROR(EINVAL);
8853                 goto cleanup;
8854         }
8855 
8856 cleanup:
8857         /* Abort an in-flight prefetch I/O in case of error */
8858         if (err != 0 && *next_io != NULL) {
8859                 l2arc_log_blk_prefetch_abort(*next_io);
8860                 *next_io = NULL;
8861         }
8862         return (err);
8863 }
8864 
8865 /*
8866  * Restores the payload of a log blk to ARC. This creates empty ARC hdr
8867  * entries which only contain an l2arc hdr, essentially restoring the
8868  * buffers to their L2ARC evicted state. This function also updates space
8869  * usage on the L2ARC vdev to make sure it tracks restored buffers.
8870  */
8871 static void
8872 l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
8873     const l2arc_log_blk_phys_t *lb, uint64_t lb_psize)
8874 {
8875         uint64_t        size = 0, psize = 0;
8876 
8877         for (int i = L2ARC_LOG_BLK_ENTRIES - 1; i >= 0; i--) {
8878                 /*
8879                  * Restore goes in the reverse temporal direction to preserve
8880                  * correct temporal ordering of buffers in the l2ad_buflist.
8881                  * l2arc_hdr_restore also does a list_insert_tail instead of
8882                  * list_insert_head on the l2ad_buflist:
8883                  *
8884                  *              LIST    l2ad_buflist            LIST
8885                  *              HEAD  <------ (time) ------  TAIL
8886                  * direction    +-----+-----+-----+-----+-----+    direction
8887                  * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
8888                  * fill         +-----+-----+-----+-----+-----+
8889                  *              ^                               ^
8890                  *              |                               |
8891                  *              |                               |
8892                  *      l2arc_fill_thread               l2arc_rebuild
8893                  *      places new bufs here            restores bufs here
8894                  *
8895                  * This also works when the restored bufs get evicted at any
8896                  * point during the rebuild.
8897                  */
8898                 l2arc_hdr_restore(&lb->lb_entries[i], dev, load_guid);
8899                 size += LE_GET_LSIZE(&lb->lb_entries[i]);
8900                 psize += LE_GET_PSIZE(&lb->lb_entries[i]);
8901         }
8902 
8903         /*
8904          * Record rebuild stats:
8905          *      size            In-memory size of restored buffer data in ARC
8906          *      psize           Physical size of restored buffers in the L2ARC
8907          *      bufs            # of ARC buffer headers restored
8908          *      log_blks        # of L2ARC log blocks processed during restore
8909          */
8910         ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
8911         ARCSTAT_INCR(arcstat_l2_rebuild_psize, psize);
8912         ARCSTAT_INCR(arcstat_l2_rebuild_bufs, L2ARC_LOG_BLK_ENTRIES);
8913         ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
8914         ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, lb_psize);
8915         ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, psize / lb_psize);
8916         vdev_space_update(dev->l2ad_vdev, psize, 0, 0);
8917 }
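
The effect of walking a log block backwards while inserting at the list tail can be modelled in userland. The following is a minimal sketch, not the kernel code: it assumes a hypothetical 3-entry log block and an array-backed list, and the names sketch_list_t, sketch_insert_tail and sketch_restore_blk are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define	ENTRIES	3	/* hypothetical entries per log block */
#define	MAXBUFS	16	/* hypothetical list capacity */

/* Array-backed stand-in for l2ad_buflist; items[0] is the list head. */
typedef struct sketch_list {
	int	items[MAXBUFS];
	size_t	len;
} sketch_list_t;

static void
sketch_insert_tail(sketch_list_t *l, int v)
{
	assert(l->len < MAXBUFS);
	l->items[l->len++] = v;
}

/*
 * Restore one log block. entries[0] was written first and
 * entries[ENTRIES - 1] last, so we walk from the newest entry
 * to the oldest and append each one at the tail.
 */
static void
sketch_restore_blk(sketch_list_t *l, const int entries[ENTRIES])
{
	for (int i = ENTRIES - 1; i >= 0; i--)
		sketch_insert_tail(l, entries[i]);
}
```

Restoring a newer block {4, 5, 6} followed by an older block {1, 2, 3} leaves the list as 6, 5, 4, 3, 2, 1 from head to tail. That is the same order that head insertion of 1 through 6 in write order would have produced.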
8918 
8919 /*
8920  * Restores a single ARC buf hdr from a log block. The ARC buffer is put
8921  * into a state indicating that it has been evicted to L2ARC.
8922  */
8923 static void
8924 l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev,
8925     uint64_t load_guid)
8926 {
8927         arc_buf_hdr_t           *hdr, *exists;
8928         kmutex_t                *hash_lock;
8929         arc_buf_contents_t      type = LE_GET_TYPE(le);
8930 
8931         /*
8932          * Do all the allocation before grabbing any locks, this lets us
8933          * sleep if memory is full and we don't have to deal with failed
8934          * allocations.
8935          */
8936         hdr = arc_buf_alloc_l2only(load_guid, type, dev, le->le_dva,
8937             le->le_daddr, LE_GET_LSIZE(le), LE_GET_PSIZE(le),
8938             le->le_birth, le->le_freeze_cksum, LE_GET_CHECKSUM(le),
8939             LE_GET_COMPRESS(le), LE_GET_ARC_COMPRESS(le));
8940 
8941         ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr));
8942         ARCSTAT_INCR(arcstat_l2_psize, arc_hdr_size(hdr));
8943 
8944         mutex_enter(&dev->l2ad_mtx);
8945         /*
8946          * Link the hdr into the device's buflist and account for its
8947          * space while holding l2ad_mtx. Note that this happens before
8948          * the hdr is inserted into the hash table below.
8949          */
8950         list_insert_tail(&dev->l2ad_buflist, hdr);
8951         (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
8952         mutex_exit(&dev->l2ad_mtx);
8953 
8954         exists = buf_hash_insert(hdr, &hash_lock);
8955         if (exists) {
8956                 /* Buffer was already cached, no need to restore it. */
8957                 arc_hdr_destroy(hdr);
8958                 mutex_exit(hash_lock);
8959                 ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
8960                 return;
8961         }
8962 
8963         mutex_exit(hash_lock);
8964 }
8965 
8966 /*
8967  * I/O completion callback used by PL2ARC functions that issue async
8968  * reads/writes. Releases the abd wrapper passed via io_private.
8969  */
8970 static void
8971 pl2arc_io_done(zio_t *zio)
8972 {
8973         abd_put(zio->io_private);
8974         zio->io_private = NULL;
8975 }
8976 
8977 /*
8978  * Starts an asynchronous read IO to read a log block. This is used in log
8979  * block reconstruction to start reading the next block before we are done
8980  * decoding and reconstructing the current block, to keep the l2arc device
8981  * nice and hot with read IO to process.
8982  * The returned zio will contain newly allocated memory buffers for the IO
8983  * data which should then be freed by the caller once the zio is no longer
8984  * needed (i.e. due to it having completed). If you wish to abort this
8985  * zio, you should do so using l2arc_log_blk_prefetch_abort, which takes
8986  * care of disposing of the allocated buffers correctly.
8987  */
8988 static zio_t *
8989 l2arc_log_blk_prefetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
8990     uint8_t *lb_buf)
8991 {
8992         uint32_t        psize;
8993         zio_t           *pio;
8994         abd_t           *abd;
8995 
8996         psize = LBP_GET_PSIZE(lbp);
8997         ASSERT(psize <= sizeof (l2arc_log_blk_phys_t));
8998         pio = zio_root(vd->vdev_spa, NULL, NULL, ZIO_FLAG_DONT_CACHE |
8999             ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
9000             ZIO_FLAG_DONT_RETRY);
9001         abd = abd_get_from_buf(lb_buf, psize);
9002         (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, psize,
9003             abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9004             ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
9005             ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
9006 
9007         return (pio);
9008 }
9009 
9010 /*
9011  * Aborts a zio returned from l2arc_log_blk_prefetch and frees the data
9012  * buffers allocated for it.
9013  */
9014 static void
9015 l2arc_log_blk_prefetch_abort(zio_t *zio)
9016 {
9017         (void) zio_wait(zio);
9018 }
9019 
9020 /*
9021  * Creates a zio to update the device header on an l2arc device. The zio is
9022  * initiated as a child of `pio'.
9023  */
9024 static void
9025 l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio)
9026 {
9027         zio_t                   *wzio;
9028         abd_t                   *abd;
9029         l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
9030         const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
9031 
9032         hdr->dh_magic = L2ARC_DEV_HDR_MAGIC_V1;
9033         hdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
9034         hdr->dh_alloc_space = refcount_count(&dev->l2ad_alloc);
9035         hdr->dh_flags = 0;
9036         if (dev->l2ad_first)
9037                 hdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
9038 
9039         /* checksum operation goes last */
9040         l2arc_dev_hdr_checksum(hdr, &hdr->dh_self_cksum);
9041 
9042         abd = abd_get_from_buf(hdr, hdr_asize);
9043         wzio = zio_write_phys(pio, dev->l2ad_vdev, VDEV_LABEL_START_SIZE,
9044             hdr_asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9045             ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
9046         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
9047         (void) zio_nowait(wzio);
9048 }
9049 
9050 /*
9051  * Commits a log block to the L2ARC device. This routine is invoked from
9052  * l2arc_write_buffers when the log block fills up.
9053  * This function allocates some memory to temporarily hold the serialized
9054  * buffer to be written. This is then released in l2arc_write_done.
9055  */
9056 static void
9057 l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
9058     l2arc_write_callback_t *cb)
9059 {
9060         l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
9061         uint64_t                psize, asize;
9062         l2arc_log_blk_buf_t     *lb_buf;
9063         abd_t                   *abd;
9064         zio_t                   *wzio;
9065 
9066         VERIFY(dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
9067 
9068         /* link the buffer into the block chain */
9069         lb->lb_back2_lbp = dev->l2ad_dev_hdr->dh_start_lbps[1];
9070         lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
9071 
9072         /* try to compress the buffer */
9073         lb_buf = kmem_zalloc(sizeof (*lb_buf), KM_SLEEP);
9074         list_insert_tail(&cb->l2wcb_log_blk_buflist, lb_buf);
9075         abd = abd_get_from_buf(lb, sizeof (*lb));
9076         psize = zio_compress_data(ZIO_COMPRESS_LZ4, abd, lb_buf->lbb_log_blk,
9077             sizeof (*lb));
9078         abd_put(abd);
9079         /* a log block is never entirely zero */
9080         ASSERT(psize != 0);
9081         asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
9082         ASSERT(asize <= sizeof (lb_buf->lbb_log_blk));
9083 
9084         /*
9085          * Update the start log blk pointer in the device header to point
9086          * to the log block we're about to write.
9087          */
9088         dev->l2ad_dev_hdr->dh_start_lbps[1] =
9089             dev->l2ad_dev_hdr->dh_start_lbps[0];
9090         dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
9091         _NOTE(CONSTCOND)
9092         LBP_SET_LSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], sizeof (*lb));
9093         LBP_SET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], asize);
9094         LBP_SET_CHECKSUM(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9095             ZIO_CHECKSUM_FLETCHER_4);
9096         LBP_SET_TYPE(&dev->l2ad_dev_hdr->dh_start_lbps[0], 0);
9097 
9098         if (asize < sizeof (*lb)) {
9099                 /* compression succeeded */
9100                 bzero(lb_buf->lbb_log_blk + psize, asize - psize);
9101                 LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9102                     ZIO_COMPRESS_LZ4);
9103         } else {
9104                 /* compression failed */
9105                 bcopy(lb, lb_buf->lbb_log_blk, sizeof (*lb));
9106                 LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9107                     ZIO_COMPRESS_OFF);
9108         }
9109 
9110         /* checksum what we're about to write */
9111         fletcher_4_native(lb_buf->lbb_log_blk, asize,
9112             NULL, &dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_cksum);
9113 
9114         /* perform the write itself */
9115         CTASSERT(L2ARC_LOG_BLK_SIZE >= SPA_MINBLOCKSIZE &&
9116             L2ARC_LOG_BLK_SIZE <= SPA_MAXBLOCKSIZE);
9117         abd = abd_get_from_buf(lb_buf->lbb_log_blk, asize);
9118         wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
9119             asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9120             ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
9121         DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
9122         (void) zio_nowait(wzio);
9123 
9124         dev->l2ad_hand += asize;
9125         vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
9126 
9127         /* bump the kstats */
9128         ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
9129         ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
9130         ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, asize);
9131         ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
9132             dev->l2ad_log_blk_payload_asize / asize);
9133 
9134         /* start a new log block */
9135         dev->l2ad_log_ent_idx = 0;
9136         dev->l2ad_log_blk_payload_asize = 0;
9137 }
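
The pointer shuffle performed at commit time (the new block's back pointer takes the old dh_start_lbps[1], then the two start pointers shift down) is what lets the rebuild loop know the next block's address before it has decoded the current one. A minimal userland sketch of that chain follows. It assumes blocks are identified by small array indices rather than device addresses, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define	NBLK	8	/* hypothetical number of log blocks */
#define	NONE	(-1)	/* stands in for an invalid block pointer */

/* Hypothetical log block: only the back pointer matters here. */
typedef struct sketch_blk {
	int	back2;		/* block written two commits earlier */
} sketch_blk_t;

/* Hypothetical device header: [0] is the newest block, [1] the one before. */
typedef struct sketch_hdr {
	int	start[2];
} sketch_hdr_t;

static void
sketch_init(sketch_hdr_t *hdr)
{
	hdr->start[0] = hdr->start[1] = NONE;
}

/* Mirrors the ordering of the pointer updates in the commit path. */
static void
sketch_commit(sketch_hdr_t *hdr, sketch_blk_t *blks, int idx)
{
	assert(idx >= 0 && idx < NBLK);
	blks[idx].back2 = hdr->start[1];
	hdr->start[1] = hdr->start[0];
	hdr->start[0] = idx;
}

/*
 * Walks the chain newest-first and records the visited indices in `out'.
 * Returns the number of blocks visited.
 */
static int
sketch_walk(const sketch_hdr_t *hdr, const sketch_blk_t *blks, int *out)
{
	int	ptrs[2] = { hdr->start[0], hdr->start[1] };
	int	n = 0;

	while (ptrs[0] != NONE) {
		int cur = ptrs[0];

		/* ptrs[1] is already known: a prefetch could start here. */
		out[n++] = cur;
		ptrs[0] = ptrs[1];
		ptrs[1] = blks[cur].back2;
	}
	return (n);
}
```

After committing blocks 0 through 3 in order, block 3 points back at block 1 and the walk visits 3, 2, 1, 0. The real rebuild additionally stops on checksum failures, memory pressure and the overlap test, none of which this sketch models.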
9138 
9139 /*
9140  * Validates an L2ARC log blk address to make sure that it can be read
9141  * from the provided L2ARC device. Returns B_TRUE if the address is
9142  * within the device's bounds, or B_FALSE if not.
9143  */
9144 static boolean_t
9145 l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
9146 {
9147         uint64_t psize = LBP_GET_PSIZE(lbp);
9148         uint64_t end = lbp->lbp_daddr + psize;
9149 
9150         /*
9151          * A log block is valid if all of the following conditions are true:
9152          * - it fits entirely between l2ad_start and l2ad_end
9153          * - it has a valid size
9154          */
9155         return (lbp->lbp_daddr >= dev->l2ad_start && end <= dev->l2ad_end &&
9156             psize > 0 && psize <= sizeof (l2arc_log_blk_phys_t));
9157 }
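
The bounds test above can be tried outside the kernel. This is a minimal sketch under stated assumptions: sketch_dev_t and SKETCH_LOG_BLK_MAX are hypothetical stand-ins for the device bounds and sizeof (l2arc_log_blk_phys_t). The `end >= daddr' term is an extra guard against 64-bit wraparound that the function above does not have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_LOG_BLK_MAX	(128 * 1024)	/* hypothetical size limit */

/* Hypothetical device: usable region is [start, end). */
typedef struct sketch_dev {
	uint64_t	start;
	uint64_t	end;
} sketch_dev_t;

static bool
sketch_blkptr_valid(const sketch_dev_t *dev, uint64_t daddr, uint64_t psize)
{
	uint64_t end = daddr + psize;

	assert(dev->start <= dev->end);
	return (daddr >= dev->start && end <= dev->end &&
	    end >= daddr &&	/* added guard, not in the original */
	    psize > 0 && psize <= SKETCH_LOG_BLK_MAX);
}
```

A pointer is rejected if it starts before the usable region, runs past its end, has a zero size, or is larger than a log block can be.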
9158 
9159 /*
9160  * Computes the checksum of `hdr' and stores it in `cksum'.
9161  */
9162 static void
9163 l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr, zio_cksum_t *cksum)
9164 {
9165         fletcher_4_native((uint8_t *)hdr +
9166             offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
9167             sizeof (*hdr) - offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
9168             NULL, cksum);
9169 }
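
Checksumming from offsetof(..., dh_spa_guid) to the end of the structure means that the fields placed before dh_spa_guid do not influence the result. A minimal sketch of that idea follows. The struct layout is an assumption (a magic word, then the stored checksum, then the covered fields), and sketch_fletcher4 is a plain userland rendering of the Fletcher-4 running sums, not the kernel's fletcher_4_native.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout: only fields from spa_guid on are covered. */
typedef struct sketch_hdr {
	uint64_t	magic;
	uint64_t	self_cksum[4];
	uint64_t	spa_guid;
	uint64_t	alloc_space;
} sketch_hdr_t;

/* Fletcher-4: four cascading 64-bit sums over 32-bit words. */
static void
sketch_fletcher4(const uint8_t *buf, size_t size, uint64_t ck[4])
{
	uint64_t a = 0, b = 0, c = 0, d = 0;

	assert(size % sizeof (uint32_t) == 0);
	for (size_t i = 0; i < size; i += sizeof (uint32_t)) {
		uint32_t w;

		memcpy(&w, buf + i, sizeof (w));
		a += w;
		b += a;
		c += b;
		d += c;
	}
	ck[0] = a;
	ck[1] = b;
	ck[2] = c;
	ck[3] = d;
}

/* Checksum everything from spa_guid to the end of the structure. */
static void
sketch_hdr_checksum(const sketch_hdr_t *hdr, uint64_t ck[4])
{
	const size_t off = offsetof(sketch_hdr_t, spa_guid);

	sketch_fletcher4((const uint8_t *)hdr + off, sizeof (*hdr) - off, ck);
}
```

With this layout, changing magic or self_cksum leaves the computed checksum unchanged, while changing spa_guid or alloc_space alters it. This is why the update path can store the result into the header after computing it.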
9170 
9171 /*
9172  * Inserts ARC buffer `ab' into the current L2ARC log blk on the device.
9173  * The buffer being inserted must be present in L2ARC.
9174  * Returns B_TRUE if the L2ARC log blk is full and needs to be committed
9175  * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
9176  */
9177 static boolean_t
9178 l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *ab)
9179 {
9180         l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
9181         l2arc_log_ent_phys_t    *le;
9182         int                     index = dev->l2ad_log_ent_idx++;
9183 
9184         ASSERT(index < L2ARC_LOG_BLK_ENTRIES);
9185 
9186         le = &lb->lb_entries[index];
9187         bzero(le, sizeof (*le));
9188         le->le_dva = ab->b_dva;
9189         le->le_birth = ab->b_birth;
9190         le->le_daddr = ab->b_l2hdr.b_daddr;
9191         LE_SET_LSIZE(le, HDR_GET_LSIZE(ab));
9192         LE_SET_PSIZE(le, HDR_GET_PSIZE(ab));
9193 
9194         if ((ab->b_flags & ARC_FLAG_COMPRESSED_ARC) != 0) {
9195                 LE_SET_ARC_COMPRESS(le, 1);
9196                 LE_SET_COMPRESS(le, HDR_GET_COMPRESS(ab));
9197         } else {
9198                 ASSERT3U(HDR_GET_COMPRESS(ab), ==, ZIO_COMPRESS_OFF);
9199                 LE_SET_ARC_COMPRESS(le, 0);
9200                 LE_SET_COMPRESS(le, ZIO_COMPRESS_OFF);
9201         }
9202 
9203         if (ab->b_freeze_cksum != NULL) {
9204                 le->le_freeze_cksum = *ab->b_freeze_cksum;
9205                 LE_SET_CHECKSUM(le, ZIO_CHECKSUM_FLETCHER_2);
9206         } else {
9207                 LE_SET_CHECKSUM(le, ZIO_CHECKSUM_OFF);
9208         }
9209 
9210         LE_SET_TYPE(le, arc_flags_to_bufc(ab->b_flags));
9211         dev->l2ad_log_blk_payload_asize += arc_hdr_size((arc_buf_hdr_t *)ab);
9212 
9213         return (dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
9214 }
9215 
9216 /*
9217  * Checks whether a given L2ARC device address sits in a time-sequential
9218  * range. The trick here is that the L2ARC is a rotary buffer, so we can't
9219  * just do a range comparison, we need to handle the situation in which the
9220  * range wraps around the end of the L2ARC device. Arguments:
9221  *      bottom  Lower end of the range to check (written to earlier).
9222  *      top     Upper end of the range to check (written to later).
9223  *      check   The address for which we want to determine if it sits in
9224  *              between the top and bottom.
9225  *
9226  * The 3-way conditional below represents the following cases:
9227  *
9228  *      bottom < top : Sequentially ordered case:
9229  *        <check>--------+-------------------+
9230  *                       |  (overlap here?)  |
9231  *       L2ARC dev       V                   V
9232  *       |---------------<bottom>============<top>--------------|
9233  *
9234  *      bottom > top: Looped-around case:
9235  *                            <check>--------+------------------+
9236  *                                           |  (overlap here?) |
9237  *       L2ARC dev                           V                  V
9238  *       |===============<top>---------------<bottom>===========|
9239  *       ^               ^
9240  *       |  (or here?)   |
9241  *       +---------------+---------<check>
9242  *
9243  *      top == bottom : Just a single address comparison.
9244  */
9245 static inline boolean_t
9246 l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
9247 {
9248         if (bottom < top)
9249                 return (bottom <= check && check <= top);
9250         else if (bottom > top)
9251                 return (check <= top || bottom <= check);
9252         else
9253                 return (check == top);
9254 }
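
The three cases in the diagram can be exercised with a userland copy of the same comparison. This is a sketch for experimentation only: sketch_range_check_overlap repeats the logic shown above using standard C types, and the addresses in sketch_range_examples are made-up values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if `check' lies in the time-ordered range [bottom, top]
 * on a circular device.
 */
static bool
sketch_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
{
	if (bottom < top)
		return (bottom <= check && check <= top);
	else if (bottom > top)
		return (check <= top || bottom <= check);
	else
		return (check == top);
}

/* One probe per case from the diagram, using made-up addresses. */
static void
sketch_range_examples(void)
{
	/* bottom < top: plain interval test */
	assert(sketch_range_check_overlap(100, 200, 150));
	assert(!sketch_range_check_overlap(100, 200, 250));

	/* bottom > top: the range wraps past the end of the device */
	assert(sketch_range_check_overlap(900, 100, 950));
	assert(sketch_range_check_overlap(900, 100, 50));
	assert(!sketch_range_check_overlap(900, 100, 500));

	/* top == bottom: single address comparison */
	assert(sketch_range_check_overlap(300, 300, 300));
}
```

In the wrapped case the range covers both ends of the device, so an address in the middle (500 here) is the only kind that falls outside it.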