NEX-19742 A race between ARC and L2ARC causes system panic
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrect when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrect when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-8057 renaming of mount points should not be allowed (redo)
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5785 zdb: assertion failed for thread 0xf8a20240, thread-id 130: mp->initialized == B_TRUE, file ../common/kernel.c, line 162
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-4228 dedup arcstats are redundant
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-7317 Getting assert !refcount_is_zero(&scl->scl_count) when trying to import pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "Merge pull request #520 in OS/nza-kernel from ~SASO.KISELKOV/nza-kernel:NEX-5671-pl2arc-le_psize to master"
This reverts commit b63e91b939886744224854ea365d70e05ddd6077, reversing
changes made to a6e3a0255c8b22f65343bf641ffefaf9ae948fd4.
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6421 Add missing multilist_destroy calls to arc_fini
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
5219 l2arc_write_buffers() may write beyond target_sz
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Justin Gibbs <gibbs@FreeBSD.org>
Approved by: Matthew Ahrens <mahrens@delphix.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6220 memleak in l2arc on debug build
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
5847 libzfs_diff should check zfs_prop_get() return
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-3879 L2ARC evict task allocates a useless struct
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption (fix pL2ARC integration)
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3979 fix arc_mru/mfu typo
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3961 arc_meta_max is not counted correctly
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3946 Port Illumos 5983 to release-5.0
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
NEX-3945 file-backed cache devices considered harmful
Reviewed by: Alek Pinchuk <alek@nexenta.com>
NEX-3541 Implement persistent L2ARC - fix build breakage in libzpool (v2).
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3630 Backport illumos #5701 from master to 5.0
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3558 KRRP Integration
NEX-3387 ARC stats appear to be in wrong/weird order
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3296 turn on DDT limit by default
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3079 port illumos ARC improvements
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0 (part 2)
NEX-2704 smbstat man page needs update
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0
3995 Memory leak of compressed buffers in l2arc_write_done
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
NEX-463: bumped max queue size for L2ARC async evict
The maximum length of the taskq used for async ARC and L2ARC flushes is
now a tunable (zfs_flush_ntasks), initialized to 64.
That number is equally arbitrary, yet higher than the original 4.
The real fix should rework L2ARC eviction per OS-53, but for now a
longer queue should suffice.
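As a rough sketch of how such a tunable might be wired up (the function name
arc_flush_taskq_init, the variable arc_flush_taskq and the taskq name are
made up for illustration; only zfs_flush_ntasks comes from the change), the
value can simply be passed as the entry limits of a DDI taskq:

/* Tunable: maximum number of queued async ARC/L2ARC flush tasks. */
int zfs_flush_ntasks = 64;

static taskq_t *arc_flush_taskq;	/* hypothetical name */

static void
arc_flush_taskq_init(void)
{
	/* Size the flush taskq from the tunable rather than the old hard-coded 4. */
	arc_flush_taskq = taskq_create("arc_flush_tq", max_ncpus,
	    minclsyspri, zfs_flush_ntasks, zfs_flush_ntasks,
	    TASKQ_PREPOPULATE);
}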
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (one cache line) and place each kmutex_t on a cache line by itself to avoid false sharing
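For illustration, a minimal sketch of that layout (the struct and array names
below are hypothetical, not the actual arc.c/dbuf.c symbols): each kmutex_t is
padded out to a full 64-byte cache line, and the array itself is cache-line
aligned with the Studio pragma used elsewhere in this code:

#define	CACHE_LINE_SIZE	64

typedef struct padded_mutex {
	kmutex_t	pm_lock;
	/* pad so that adjacent locks never share a cache line */
	char		pm_pad[CACHE_LINE_SIZE - sizeof (kmutex_t)];
} padded_mutex_t;

#pragma align 64(example_mutexes)
static padded_mutex_t example_mutexes[256];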
re #14119 BAD-TRAP panic under load
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
re #13729 assign each ARC hash bucket its own mutex
In the ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM, but the number of locks
protecting the headers in those buckets is fixed at 256.
Hence, on systems with large memory (>= 128GB), too many unrelated buffer
headers are protected by the same mutex.
When memory in the system is fragmented, this may cause a deadlock:
- An arc_read thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread that is supposed to evict some buffers calls
an evict callback on one of the buffers.
- Before freeing the memory, the callback attempts to take the lock on the
buffer header.
- Incidentally, that buffer header may be protected by the same lock as
the one held by the arc_read() thread, and the two threads deadlock.
The solution in this patch is not perfect - all headers in a hash bucket
are still protected by the same lock.
However, the probability of collision is very low and does not depend on
memory size.
By the same argument, padding the locks to a cache line looks like a waste
of memory here, since the probability of contention on a cache line is
quite low, given the number of buckets, the number of locks per cache line
(4), and the fact that the hash function (crc64 % hash table size) is
supposed to be a very good randomizer.
The effect on memory usage is as follows. For a hash table of size n:
- the original code uses 16K + 16 + n * 8 bytes of memory;
- this fix uses 2 * n * 8 + 8 bytes of memory;
- the net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size.
For 128GB of physical memory it is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller,
and for larger memory configurations it is proportionally bigger.
The patch has been tested for 30+ hours using a vdbench script that
reproduces the hang with the original code 100% of the time within
20-30 minutes.
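In outline, the fix gives every hash bucket its own embedded lock instead of
sharing 256 padded locks across all buckets; the sketch below is consistent
with the ht_table and buf_hash_table definitions in the patch further down
(field comments added here for explanation only):

/* One bucket: the head of its collision chain and the lock protecting it. */
struct ht_table {
	arc_buf_hdr_t	*hdr;		/* first header in this bucket */
	kmutex_t	lock;		/* protects only this bucket */
};

typedef struct buf_hash_table {
	uint64_t	ht_mask;	/* number of buckets - 1 */
	struct ht_table	*ht_table;	/* sized from physical memory */
} buf_hash_table_t;

/* A header's lock is now found through its bucket index. */
#define	BUF_HASH_INDEX(spa, dva, birth) \
	(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define	BUF_HASH_LOCK(idx)	(&buf_hash_table.ht_table[idx].lock)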
re #10054 rb4467 Support for asynchronous ARC/L2ARC eviction
re #13165 rb4265 zfs-monitor should fallback to using DEV_BSIZE
re #10054 rb4249 Long export time causes failover to fail

          --- old/usr/src/uts/common/fs/zfs/arc.c
          +++ new/usr/src/uts/common/fs/zfs/arc.c
↓ open down ↓ 15 lines elided ↑ open up ↑
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   23   * Copyright (c) 2018, Joyent, Inc.
  24   24   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25   25   * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
  26      - * Copyright 2017 Nexenta Systems, Inc.  All rights reserved.
       26 + * Copyright 2019 Nexenta Systems, Inc.  All rights reserved.
  27   27   */
  28   28  
  29   29  /*
  30   30   * DVA-based Adjustable Replacement Cache
  31   31   *
  32   32   * While much of the theory of operation used here is
  33   33   * based on the self-tuning, low overhead replacement cache
  34   34   * presented by Megiddo and Modha at FAST 2003, there are some
  35   35   * significant differences:
  36   36   *
↓ open down ↓ 209 lines elided ↑ open up ↑
 246  246   * that when compressed ARC is enabled that the L2ARC blocks are identical
 247  247   * to the on-disk block in the main data pool. This provides a significant
 248  248   * advantage since the ARC can leverage the bp's checksum when reading from the
 249  249   * L2ARC to determine if the contents are valid. However, if the compressed
 250  250   * ARC is disabled, then the L2ARC's block must be transformed to look
 251  251   * like the physical block in the main data pool before comparing the
 252  252   * checksum and determining its validity.
 253  253   */
 254  254  
 255  255  #include <sys/spa.h>
      256 +#include <sys/spa_impl.h>
 256  257  #include <sys/zio.h>
 257  258  #include <sys/spa_impl.h>
 258  259  #include <sys/zio_compress.h>
 259  260  #include <sys/zio_checksum.h>
 260  261  #include <sys/zfs_context.h>
 261  262  #include <sys/arc.h>
 262  263  #include <sys/refcount.h>
 263  264  #include <sys/vdev.h>
 264  265  #include <sys/vdev_impl.h>
 265  266  #include <sys/dsl_pool.h>
↓ open down ↓ 2 lines elided ↑ open up ↑
 268  269  #include <sys/abd.h>
 269  270  #ifdef _KERNEL
 270  271  #include <sys/vmsystm.h>
 271  272  #include <vm/anon.h>
 272  273  #include <sys/fs/swapnode.h>
 273  274  #include <sys/dnlc.h>
 274  275  #endif
 275  276  #include <sys/callb.h>
 276  277  #include <sys/kstat.h>
 277  278  #include <zfs_fletcher.h>
 278      -#include <sys/aggsum.h>
 279      -#include <sys/cityhash.h>
      279 +#include <sys/byteorder.h>
      280 +#include <sys/spa_impl.h>
 280  281  
 281  282  #ifndef _KERNEL
 282  283  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 283  284  boolean_t arc_watch = B_FALSE;
 284  285  int arc_procfd;
 285  286  #endif
 286  287  
 287  288  static kmutex_t         arc_reclaim_lock;
 288  289  static kcondvar_t       arc_reclaim_thread_cv;
 289  290  static boolean_t        arc_reclaim_thread_exit;
↓ open down ↓ 60 lines elided ↑ open up ↑
 350  351   */
 351  352  int arc_zio_arena_free_shift = 2;
 352  353  
 353  354  /*
 354  355   * These tunables are for performance analysis.
 355  356   */
 356  357  uint64_t zfs_arc_max;
 357  358  uint64_t zfs_arc_min;
 358  359  uint64_t zfs_arc_meta_limit = 0;
 359  360  uint64_t zfs_arc_meta_min = 0;
      361 +uint64_t zfs_arc_ddt_limit = 0;
      362 +/*
      363 + * Tunable to control "dedup ceiling"
      364 + * Possible values:
       365 + *  DDT_NO_LIMIT        - default behaviour, i.e. no ceiling
       366 + *  DDT_LIMIT_TO_ARC    - stop DDT growth if DDT is bigger than its "ARC space"
      367 + *  DDT_LIMIT_TO_L2ARC  - stop DDT growth when DDT size is bigger than the
      368 + *                        L2ARC DDT dev(s) for that pool
      369 + */
      370 +zfs_ddt_limit_t zfs_ddt_limit_type = DDT_LIMIT_TO_ARC;
      371 +/*
      372 + * Alternative to the above way of controlling "dedup ceiling":
       373 + * Stop DDT growth when the in-core DDT size is above the tunable below.
      374 + * This tunable overrides the zfs_ddt_limit_type tunable.
      375 + */
      376 +uint64_t zfs_ddt_byte_ceiling = 0;
      377 +boolean_t zfs_arc_segregate_ddt = B_TRUE;
 360  378  int zfs_arc_grow_retry = 0;
 361  379  int zfs_arc_shrink_shift = 0;
 362  380  int zfs_arc_p_min_shift = 0;
 363  381  int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 364  382  
      383 +/* Tuneable, default is 64, which is essentially arbitrary */
      384 +int zfs_flush_ntasks = 64;
      385 +
 365  386  boolean_t zfs_compressed_arc_enabled = B_TRUE;
 366  387  
 367  388  /*
 368  389   * Note that buffers can be in one of 6 states:
 369  390   *      ARC_anon        - anonymous (discussed below)
 370  391   *      ARC_mru         - recently used, currently cached
 371  392   *      ARC_mru_ghost   - recently used, no longer in cache
 372  393   *      ARC_mfu         - frequently used, currently cached
 373  394   *      ARC_mfu_ghost   - frequently used, no longer in cache
 374  395   *      ARC_l2c_only    - exists in L2ARC but not other states
↓ open down ↓ 25 lines elided ↑ open up ↑
 400  421          /*
 401  422           * list of evictable buffers
 402  423           */
 403  424          multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
 404  425          /*
 405  426           * total amount of evictable data in this state
 406  427           */
 407  428          refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 408  429          /*
 409  430           * total amount of data in this state; this includes: evictable,
 410      -         * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
      431 +         * non-evictable, ARC_BUFC_DATA, ARC_BUFC_METADATA and ARC_BUFC_DDT.
      432 +         * ARC_BUFC_DDT list is only populated when zfs_arc_segregate_ddt is
      433 +         * true.
 411  434           */
 412  435          refcount_t arcs_size;
 413  436  } arc_state_t;
 414  437  
      438 +/*
      439 + * We loop through these in l2arc_write_buffers() starting from
      440 + * PRIORITY_MFU_DDT until we reach PRIORITY_NUMTYPES or the buffer that we
      441 + * will be writing to L2ARC dev gets full.
      442 + */
      443 +enum l2arc_priorities {
      444 +        PRIORITY_MFU_DDT,
      445 +        PRIORITY_MRU_DDT,
      446 +        PRIORITY_MFU_META,
      447 +        PRIORITY_MRU_META,
      448 +        PRIORITY_MFU_DATA,
      449 +        PRIORITY_MRU_DATA,
      450 +        PRIORITY_NUMTYPES,
      451 +};
      452 +
 415  453  /* The 6 states: */
 416  454  static arc_state_t ARC_anon;
 417  455  static arc_state_t ARC_mru;
 418  456  static arc_state_t ARC_mru_ghost;
 419  457  static arc_state_t ARC_mfu;
 420  458  static arc_state_t ARC_mfu_ghost;
 421  459  static arc_state_t ARC_l2c_only;
 422  460  
 423  461  typedef struct arc_stats {
 424  462          kstat_named_t arcstat_hits;
      463 +        kstat_named_t arcstat_ddt_hits;
 425  464          kstat_named_t arcstat_misses;
 426  465          kstat_named_t arcstat_demand_data_hits;
 427  466          kstat_named_t arcstat_demand_data_misses;
 428  467          kstat_named_t arcstat_demand_metadata_hits;
 429  468          kstat_named_t arcstat_demand_metadata_misses;
      469 +        kstat_named_t arcstat_demand_ddt_hits;
      470 +        kstat_named_t arcstat_demand_ddt_misses;
 430  471          kstat_named_t arcstat_prefetch_data_hits;
 431  472          kstat_named_t arcstat_prefetch_data_misses;
 432  473          kstat_named_t arcstat_prefetch_metadata_hits;
 433  474          kstat_named_t arcstat_prefetch_metadata_misses;
      475 +        kstat_named_t arcstat_prefetch_ddt_hits;
      476 +        kstat_named_t arcstat_prefetch_ddt_misses;
 434  477          kstat_named_t arcstat_mru_hits;
 435  478          kstat_named_t arcstat_mru_ghost_hits;
 436  479          kstat_named_t arcstat_mfu_hits;
 437  480          kstat_named_t arcstat_mfu_ghost_hits;
 438  481          kstat_named_t arcstat_deleted;
 439  482          /*
 440  483           * Number of buffers that could not be evicted because the hash lock
 441  484           * was held by another thread.  The lock may not necessarily be held
 442  485           * by something using the same buffer, since hash locks are shared
 443  486           * by multiple buffers.
 444  487           */
 445  488          kstat_named_t arcstat_mutex_miss;
 446  489          /*
      490 +         * Number of buffers skipped when updating the access state due to the
      491 +         * header having already been released after acquiring the hash lock.
      492 +         */
      493 +        kstat_named_t arcstat_access_skip;
      494 +        /*
 447  495           * Number of buffers skipped because they have I/O in progress, are
 448      -         * indrect prefetch buffers that have not lived long enough, or are
      496 +         * indirect prefetch buffers that have not lived long enough, or are
 449  497           * not from the spa we're trying to evict from.
 450  498           */
 451  499          kstat_named_t arcstat_evict_skip;
 452  500          /*
 453  501           * Number of times arc_evict_state() was unable to evict enough
 454  502           * buffers to reach its target amount.
 455  503           */
 456  504          kstat_named_t arcstat_evict_not_enough;
 457  505          kstat_named_t arcstat_evict_l2_cached;
 458  506          kstat_named_t arcstat_evict_l2_eligible;
↓ open down ↓ 1 lines elided ↑ open up ↑
 460  508          kstat_named_t arcstat_evict_l2_skip;
 461  509          kstat_named_t arcstat_hash_elements;
 462  510          kstat_named_t arcstat_hash_elements_max;
 463  511          kstat_named_t arcstat_hash_collisions;
 464  512          kstat_named_t arcstat_hash_chains;
 465  513          kstat_named_t arcstat_hash_chain_max;
 466  514          kstat_named_t arcstat_p;
 467  515          kstat_named_t arcstat_c;
 468  516          kstat_named_t arcstat_c_min;
 469  517          kstat_named_t arcstat_c_max;
 470      -        /* Not updated directly; only synced in arc_kstat_update. */
 471  518          kstat_named_t arcstat_size;
 472  519          /*
 473  520           * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
 474  521           * Note that the compressed bytes may match the uncompressed bytes
 475  522           * if the block is either not compressed or compressed arc is disabled.
 476  523           */
 477  524          kstat_named_t arcstat_compressed_size;
 478  525          /*
 479  526           * Uncompressed size of the data stored in b_pabd. If compressed
 480  527           * arc is disabled then this value will be identical to the stat
↓ open down ↓ 8 lines elided ↑ open up ↑
 489  536           * values have been set (see comment in dbuf.c for more information).
 490  537           */
 491  538          kstat_named_t arcstat_overhead_size;
 492  539          /*
 493  540           * Number of bytes consumed by internal ARC structures necessary
 494  541           * for tracking purposes; these structures are not actually
 495  542           * backed by ARC buffers. This includes arc_buf_hdr_t structures
 496  543           * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
 497  544           * caches), and arc_buf_t structures (allocated via arc_buf_t
 498  545           * cache).
 499      -         * Not updated directly; only synced in arc_kstat_update.
 500  546           */
 501  547          kstat_named_t arcstat_hdr_size;
 502  548          /*
 503  549           * Number of bytes consumed by ARC buffers of type equal to
 504  550           * ARC_BUFC_DATA. This is generally consumed by buffers backing
 505  551           * on disk user data (e.g. plain file contents).
 506      -         * Not updated directly; only synced in arc_kstat_update.
 507  552           */
 508  553          kstat_named_t arcstat_data_size;
 509  554          /*
 510  555           * Number of bytes consumed by ARC buffers of type equal to
 511  556           * ARC_BUFC_METADATA. This is generally consumed by buffers
 512  557           * backing on disk data that is used for internal ZFS
 513  558           * structures (e.g. ZAP, dnode, indirect blocks, etc).
 514      -         * Not updated directly; only synced in arc_kstat_update.
 515  559           */
 516  560          kstat_named_t arcstat_metadata_size;
 517  561          /*
      562 +         * Number of bytes consumed by ARC buffers of type equal to
      563 +         * ARC_BUFC_DDT. This is consumed by buffers backing on disk data
      564 +         * that is used to store DDT (ZAP, ddt stats).
      565 +         * Only used if zfs_arc_segregate_ddt is true.
      566 +         */
      567 +        kstat_named_t arcstat_ddt_size;
      568 +        /*
 518  569           * Number of bytes consumed by various buffers and structures
 519  570           * not actually backed with ARC buffers. This includes bonus
 520  571           * buffers (allocated directly via zio_buf_* functions),
 521  572           * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
 522  573           * cache), and dnode_t structures (allocated via dnode_t cache).
 523      -         * Not updated directly; only synced in arc_kstat_update.
 524  574           */
 525  575          kstat_named_t arcstat_other_size;
 526  576          /*
 527  577           * Total number of bytes consumed by ARC buffers residing in the
 528  578           * arc_anon state. This includes *all* buffers in the arc_anon
 529  579           * state; e.g. data, metadata, evictable, and unevictable buffers
 530  580           * are all included in this value.
 531      -         * Not updated directly; only synced in arc_kstat_update.
 532  581           */
 533  582          kstat_named_t arcstat_anon_size;
 534  583          /*
 535  584           * Number of bytes consumed by ARC buffers that meet the
 536  585           * following criteria: backing buffers of type ARC_BUFC_DATA,
 537  586           * residing in the arc_anon state, and are eligible for eviction
 538  587           * (e.g. have no outstanding holds on the buffer).
 539      -         * Not updated directly; only synced in arc_kstat_update.
 540  588           */
 541  589          kstat_named_t arcstat_anon_evictable_data;
 542  590          /*
 543  591           * Number of bytes consumed by ARC buffers that meet the
 544  592           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 545  593           * residing in the arc_anon state, and are eligible for eviction
 546  594           * (e.g. have no outstanding holds on the buffer).
 547      -         * Not updated directly; only synced in arc_kstat_update.
 548  595           */
 549  596          kstat_named_t arcstat_anon_evictable_metadata;
 550  597          /*
      598 +         * Number of bytes consumed by ARC buffers that meet the
      599 +         * following criteria: backing buffers of type ARC_BUFC_DDT,
       600 +         * residing in the arc_anon state, and are eligible for eviction.
      601 +         * Only used if zfs_arc_segregate_ddt is true.
      602 +         */
      603 +        kstat_named_t arcstat_anon_evictable_ddt;
      604 +        /*
 551  605           * Total number of bytes consumed by ARC buffers residing in the
 552  606           * arc_mru state. This includes *all* buffers in the arc_mru
 553  607           * state; e.g. data, metadata, evictable, and unevictable buffers
 554  608           * are all included in this value.
 555      -         * Not updated directly; only synced in arc_kstat_update.
 556  609           */
 557  610          kstat_named_t arcstat_mru_size;
 558  611          /*
 559  612           * Number of bytes consumed by ARC buffers that meet the
 560  613           * following criteria: backing buffers of type ARC_BUFC_DATA,
 561  614           * residing in the arc_mru state, and are eligible for eviction
 562  615           * (e.g. have no outstanding holds on the buffer).
 563      -         * Not updated directly; only synced in arc_kstat_update.
 564  616           */
 565  617          kstat_named_t arcstat_mru_evictable_data;
 566  618          /*
 567  619           * Number of bytes consumed by ARC buffers that meet the
 568  620           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 569  621           * residing in the arc_mru state, and are eligible for eviction
 570  622           * (e.g. have no outstanding holds on the buffer).
 571      -         * Not updated directly; only synced in arc_kstat_update.
 572  623           */
 573  624          kstat_named_t arcstat_mru_evictable_metadata;
 574  625          /*
      626 +         * Number of bytes consumed by ARC buffers that meet the
      627 +         * following criteria: backing buffers of type ARC_BUFC_DDT,
      628 +         * residing in the arc_mru state, and are eligible for eviction
      629 +         * (e.g. have no outstanding holds on the buffer).
      630 +         * Only used if zfs_arc_segregate_ddt is true.
      631 +         */
      632 +        kstat_named_t arcstat_mru_evictable_ddt;
      633 +        /*
 575  634           * Total number of bytes that *would have been* consumed by ARC
 576  635           * buffers in the arc_mru_ghost state. The key thing to note
 577  636           * here, is the fact that this size doesn't actually indicate
 578  637           * RAM consumption. The ghost lists only consist of headers and
 579  638           * don't actually have ARC buffers linked off of these headers.
 580  639           * Thus, *if* the headers had associated ARC buffers, these
 581  640           * buffers *would have* consumed this number of bytes.
 582      -         * Not updated directly; only synced in arc_kstat_update.
 583  641           */
 584  642          kstat_named_t arcstat_mru_ghost_size;
 585  643          /*
 586  644           * Number of bytes that *would have been* consumed by ARC
 587  645           * buffers that are eligible for eviction, of type
 588  646           * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
 589      -         * Not updated directly; only synced in arc_kstat_update.
 590  647           */
 591  648          kstat_named_t arcstat_mru_ghost_evictable_data;
 592  649          /*
 593  650           * Number of bytes that *would have been* consumed by ARC
 594  651           * buffers that are eligible for eviction, of type
 595  652           * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 596      -         * Not updated directly; only synced in arc_kstat_update.
 597  653           */
 598  654          kstat_named_t arcstat_mru_ghost_evictable_metadata;
 599  655          /*
      656 +         * Number of bytes that *would have been* consumed by ARC
      657 +         * buffers that are eligible for eviction, of type
      658 +         * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
      659 +         * Only used if zfs_arc_segregate_ddt is true.
      660 +         */
      661 +        kstat_named_t arcstat_mru_ghost_evictable_ddt;
      662 +        /*
 600  663           * Total number of bytes consumed by ARC buffers residing in the
 601  664           * arc_mfu state. This includes *all* buffers in the arc_mfu
 602  665           * state; e.g. data, metadata, evictable, and unevictable buffers
 603  666           * are all included in this value.
 604      -         * Not updated directly; only synced in arc_kstat_update.
 605  667           */
 606  668          kstat_named_t arcstat_mfu_size;
 607  669          /*
 608  670           * Number of bytes consumed by ARC buffers that are eligible for
 609  671           * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
 610  672           * state.
 611      -         * Not updated directly; only synced in arc_kstat_update.
 612  673           */
 613  674          kstat_named_t arcstat_mfu_evictable_data;
 614  675          /*
 615  676           * Number of bytes consumed by ARC buffers that are eligible for
 616  677           * eviction, of type ARC_BUFC_METADATA, and reside in the
 617  678           * arc_mfu state.
 618      -         * Not updated directly; only synced in arc_kstat_update.
 619  679           */
 620  680          kstat_named_t arcstat_mfu_evictable_metadata;
 621  681          /*
      682 +         * Number of bytes consumed by ARC buffers that are eligible for
      683 +         * eviction, of type ARC_BUFC_DDT, and reside in the
      684 +         * arc_mfu state.
      685 +         * Only used if zfs_arc_segregate_ddt is true.
      686 +         */
      687 +        kstat_named_t arcstat_mfu_evictable_ddt;
      688 +        /*
 622  689           * Total number of bytes that *would have been* consumed by ARC
 623  690           * buffers in the arc_mfu_ghost state. See the comment above
 624  691           * arcstat_mru_ghost_size for more details.
 625      -         * Not updated directly; only synced in arc_kstat_update.
 626  692           */
 627  693          kstat_named_t arcstat_mfu_ghost_size;
 628  694          /*
 629  695           * Number of bytes that *would have been* consumed by ARC
 630  696           * buffers that are eligible for eviction, of type
 631  697           * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
 632      -         * Not updated directly; only synced in arc_kstat_update.
 633  698           */
 634  699          kstat_named_t arcstat_mfu_ghost_evictable_data;
 635  700          /*
 636  701           * Number of bytes that *would have been* consumed by ARC
 637  702           * buffers that are eligible for eviction, of type
 638  703           * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 639      -         * Not updated directly; only synced in arc_kstat_update.
 640  704           */
 641  705          kstat_named_t arcstat_mfu_ghost_evictable_metadata;
      706 +        /*
      707 +         * Number of bytes that *would have been* consumed by ARC
      708 +         * buffers that are eligible for eviction, of type
      709 +         * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
      710 +         * Only used if zfs_arc_segregate_ddt is true.
      711 +         */
      712 +        kstat_named_t arcstat_mfu_ghost_evictable_ddt;
 642  713          kstat_named_t arcstat_l2_hits;
      714 +        kstat_named_t arcstat_l2_ddt_hits;
 643  715          kstat_named_t arcstat_l2_misses;
 644  716          kstat_named_t arcstat_l2_feeds;
 645  717          kstat_named_t arcstat_l2_rw_clash;
 646  718          kstat_named_t arcstat_l2_read_bytes;
      719 +        kstat_named_t arcstat_l2_ddt_read_bytes;
 647  720          kstat_named_t arcstat_l2_write_bytes;
      721 +        kstat_named_t arcstat_l2_ddt_write_bytes;
 648  722          kstat_named_t arcstat_l2_writes_sent;
 649  723          kstat_named_t arcstat_l2_writes_done;
 650  724          kstat_named_t arcstat_l2_writes_error;
 651  725          kstat_named_t arcstat_l2_writes_lock_retry;
 652  726          kstat_named_t arcstat_l2_evict_lock_retry;
 653  727          kstat_named_t arcstat_l2_evict_reading;
 654  728          kstat_named_t arcstat_l2_evict_l1cached;
 655  729          kstat_named_t arcstat_l2_free_on_write;
 656  730          kstat_named_t arcstat_l2_abort_lowmem;
 657  731          kstat_named_t arcstat_l2_cksum_bad;
 658  732          kstat_named_t arcstat_l2_io_error;
 659  733          kstat_named_t arcstat_l2_lsize;
 660  734          kstat_named_t arcstat_l2_psize;
 661      -        /* Not updated directly; only synced in arc_kstat_update. */
 662  735          kstat_named_t arcstat_l2_hdr_size;
      736 +        kstat_named_t arcstat_l2_log_blk_writes;
      737 +        kstat_named_t arcstat_l2_log_blk_avg_size;
      738 +        kstat_named_t arcstat_l2_data_to_meta_ratio;
      739 +        kstat_named_t arcstat_l2_rebuild_successes;
      740 +        kstat_named_t arcstat_l2_rebuild_abort_unsupported;
      741 +        kstat_named_t arcstat_l2_rebuild_abort_io_errors;
      742 +        kstat_named_t arcstat_l2_rebuild_abort_cksum_errors;
      743 +        kstat_named_t arcstat_l2_rebuild_abort_loop_errors;
      744 +        kstat_named_t arcstat_l2_rebuild_abort_lowmem;
      745 +        kstat_named_t arcstat_l2_rebuild_size;
      746 +        kstat_named_t arcstat_l2_rebuild_bufs;
      747 +        kstat_named_t arcstat_l2_rebuild_bufs_precached;
      748 +        kstat_named_t arcstat_l2_rebuild_psize;
      749 +        kstat_named_t arcstat_l2_rebuild_log_blks;
 663  750          kstat_named_t arcstat_memory_throttle_count;
 664      -        /* Not updated directly; only synced in arc_kstat_update. */
 665  751          kstat_named_t arcstat_meta_used;
 666  752          kstat_named_t arcstat_meta_limit;
 667  753          kstat_named_t arcstat_meta_max;
 668  754          kstat_named_t arcstat_meta_min;
      755 +        kstat_named_t arcstat_ddt_limit;
 669  756          kstat_named_t arcstat_sync_wait_for_async;
 670  757          kstat_named_t arcstat_demand_hit_predictive_prefetch;
 671  758  } arc_stats_t;
 672  759  
 673  760  static arc_stats_t arc_stats = {
 674  761          { "hits",                       KSTAT_DATA_UINT64 },
      762 +        { "ddt_hits",                   KSTAT_DATA_UINT64 },
 675  763          { "misses",                     KSTAT_DATA_UINT64 },
 676  764          { "demand_data_hits",           KSTAT_DATA_UINT64 },
 677  765          { "demand_data_misses",         KSTAT_DATA_UINT64 },
 678  766          { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 679  767          { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
      768 +        { "demand_ddt_hits",            KSTAT_DATA_UINT64 },
      769 +        { "demand_ddt_misses",          KSTAT_DATA_UINT64 },
 680  770          { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 681  771          { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 682  772          { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 683  773          { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
      774 +        { "prefetch_ddt_hits",          KSTAT_DATA_UINT64 },
      775 +        { "prefetch_ddt_misses",        KSTAT_DATA_UINT64 },
 684  776          { "mru_hits",                   KSTAT_DATA_UINT64 },
 685  777          { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 686  778          { "mfu_hits",                   KSTAT_DATA_UINT64 },
 687  779          { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 688  780          { "deleted",                    KSTAT_DATA_UINT64 },
 689  781          { "mutex_miss",                 KSTAT_DATA_UINT64 },
      782 +        { "access_skip",                KSTAT_DATA_UINT64 },
 690  783          { "evict_skip",                 KSTAT_DATA_UINT64 },
 691  784          { "evict_not_enough",           KSTAT_DATA_UINT64 },
 692  785          { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 693  786          { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 694  787          { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 695  788          { "evict_l2_skip",              KSTAT_DATA_UINT64 },
 696  789          { "hash_elements",              KSTAT_DATA_UINT64 },
 697  790          { "hash_elements_max",          KSTAT_DATA_UINT64 },
 698  791          { "hash_collisions",            KSTAT_DATA_UINT64 },
 699  792          { "hash_chains",                KSTAT_DATA_UINT64 },
↓ open down ↓ 2 lines elided ↑ open up ↑
 702  795          { "c",                          KSTAT_DATA_UINT64 },
 703  796          { "c_min",                      KSTAT_DATA_UINT64 },
 704  797          { "c_max",                      KSTAT_DATA_UINT64 },
 705  798          { "size",                       KSTAT_DATA_UINT64 },
 706  799          { "compressed_size",            KSTAT_DATA_UINT64 },
 707  800          { "uncompressed_size",          KSTAT_DATA_UINT64 },
 708  801          { "overhead_size",              KSTAT_DATA_UINT64 },
 709  802          { "hdr_size",                   KSTAT_DATA_UINT64 },
 710  803          { "data_size",                  KSTAT_DATA_UINT64 },
 711  804          { "metadata_size",              KSTAT_DATA_UINT64 },
      805 +        { "ddt_size",                   KSTAT_DATA_UINT64 },
 712  806          { "other_size",                 KSTAT_DATA_UINT64 },
 713  807          { "anon_size",                  KSTAT_DATA_UINT64 },
 714  808          { "anon_evictable_data",        KSTAT_DATA_UINT64 },
 715  809          { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },
      810 +        { "anon_evictable_ddt",         KSTAT_DATA_UINT64 },
 716  811          { "mru_size",                   KSTAT_DATA_UINT64 },
 717  812          { "mru_evictable_data",         KSTAT_DATA_UINT64 },
 718  813          { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },
      814 +        { "mru_evictable_ddt",          KSTAT_DATA_UINT64 },
 719  815          { "mru_ghost_size",             KSTAT_DATA_UINT64 },
 720  816          { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 721  817          { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
      818 +        { "mru_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 722  819          { "mfu_size",                   KSTAT_DATA_UINT64 },
 723  820          { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
 724  821          { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },
      822 +        { "mfu_evictable_ddt",          KSTAT_DATA_UINT64 },
 725  823          { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
 726  824          { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 727  825          { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
      826 +        { "mfu_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 728  827          { "l2_hits",                    KSTAT_DATA_UINT64 },
      828 +        { "l2_ddt_hits",                KSTAT_DATA_UINT64 },
 729  829          { "l2_misses",                  KSTAT_DATA_UINT64 },
 730  830          { "l2_feeds",                   KSTAT_DATA_UINT64 },
 731  831          { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 732  832          { "l2_read_bytes",              KSTAT_DATA_UINT64 },
      833 +        { "l2_ddt_read_bytes",          KSTAT_DATA_UINT64 },
 733  834          { "l2_write_bytes",             KSTAT_DATA_UINT64 },
      835 +        { "l2_ddt_write_bytes",         KSTAT_DATA_UINT64 },
 734  836          { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 735  837          { "l2_writes_done",             KSTAT_DATA_UINT64 },
 736  838          { "l2_writes_error",            KSTAT_DATA_UINT64 },
 737  839          { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
 738  840          { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 739  841          { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 740  842          { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
 741  843          { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 742  844          { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 743  845          { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 744  846          { "l2_io_error",                KSTAT_DATA_UINT64 },
 745  847          { "l2_size",                    KSTAT_DATA_UINT64 },
 746  848          { "l2_asize",                   KSTAT_DATA_UINT64 },
 747  849          { "l2_hdr_size",                KSTAT_DATA_UINT64 },
      850 +        { "l2_log_blk_writes",          KSTAT_DATA_UINT64 },
      851 +        { "l2_log_blk_avg_size",        KSTAT_DATA_UINT64 },
      852 +        { "l2_data_to_meta_ratio",      KSTAT_DATA_UINT64 },
      853 +        { "l2_rebuild_successes",       KSTAT_DATA_UINT64 },
      854 +        { "l2_rebuild_unsupported",     KSTAT_DATA_UINT64 },
      855 +        { "l2_rebuild_io_errors",       KSTAT_DATA_UINT64 },
      856 +        { "l2_rebuild_cksum_errors",    KSTAT_DATA_UINT64 },
      857 +        { "l2_rebuild_loop_errors",     KSTAT_DATA_UINT64 },
      858 +        { "l2_rebuild_lowmem",          KSTAT_DATA_UINT64 },
      859 +        { "l2_rebuild_size",            KSTAT_DATA_UINT64 },
      860 +        { "l2_rebuild_bufs",            KSTAT_DATA_UINT64 },
      861 +        { "l2_rebuild_bufs_precached",  KSTAT_DATA_UINT64 },
      862 +        { "l2_rebuild_psize",           KSTAT_DATA_UINT64 },
      863 +        { "l2_rebuild_log_blks",        KSTAT_DATA_UINT64 },
 748  864          { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 749  865          { "arc_meta_used",              KSTAT_DATA_UINT64 },
 750  866          { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 751  867          { "arc_meta_max",               KSTAT_DATA_UINT64 },
 752  868          { "arc_meta_min",               KSTAT_DATA_UINT64 },
      869 +        { "arc_ddt_limit",              KSTAT_DATA_UINT64 },
 753  870          { "sync_wait_for_async",        KSTAT_DATA_UINT64 },
 754  871          { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
 755  872  };
 756  873  
 757  874  #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 758  875  
 759  876  #define ARCSTAT_INCR(stat, val) \
 760  877          atomic_add_64(&arc_stats.stat.value.ui64, (val))
 761  878  
 762  879  #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
↓ open down ↓ 10 lines elided ↑ open up ↑
 773  890          ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
 774  891  
 775  892  /*
 776  893   * We define a macro to allow ARC hits/misses to be easily broken down by
 777  894   * two separate conditions, giving a total of four different subtypes for
 778  895   * each of hits and misses (so eight statistics total).
 779  896   */
 780  897  #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 781  898          if (cond1) {                                                    \
 782  899                  if (cond2) {                                            \
 783      -                        ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
      900 +                        ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##stat2); \
 784  901                  } else {                                                \
 785      -                        ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
      902 +                        ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##notstat2); \
 786  903                  }                                                       \
 787  904          } else {                                                        \
 788  905                  if (cond2) {                                            \
 789      -                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
      906 +                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##stat2); \
 790  907                  } else {                                                \
 791      -                        ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
      908 +                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##notstat2);\
 792  909                  }                                                       \
 793  910          }
 794  911  
      912 +/*
      913 + * This macro allows us to use kstats as floating averages. Each time we
      914 + * update this kstat, we first factor it and the update value by
       915 + * ARCSTAT_F_AVG_FACTOR to shrink the new value's contribution to the overall
      916 + * average. This macro assumes that integer loads and stores are atomic, but
      917 + * is not safe for multiple writers updating the kstat in parallel (only the
      918 + * last writer's update will remain).
      919 + */
      920 +#define ARCSTAT_F_AVG_FACTOR    3
      921 +#define ARCSTAT_F_AVG(stat, value) \
      922 +        do { \
      923 +                uint64_t x = ARCSTAT(stat); \
      924 +                x = x - x / ARCSTAT_F_AVG_FACTOR + \
      925 +                    (value) / ARCSTAT_F_AVG_FACTOR; \
      926 +                ARCSTAT(stat) = x; \
      927 +                _NOTE(CONSTCOND) \
      928 +        } while (0)
      929 +
 795  930  kstat_t                 *arc_ksp;
 796  931  static arc_state_t      *arc_anon;
 797  932  static arc_state_t      *arc_mru;
 798  933  static arc_state_t      *arc_mru_ghost;
 799  934  static arc_state_t      *arc_mfu;
 800  935  static arc_state_t      *arc_mfu_ghost;
 801  936  static arc_state_t      *arc_l2c_only;
 802  937  
 803  938  /*
 804  939   * There are several ARC variables that are critical to export as kstats --
 805  940   * but we don't want to have to grovel around in the kstat whenever we wish to
 806  941   * manipulate them.  For these variables, we therefore define them to be in
 807  942   * terms of the statistic variable.  This assures that we are not introducing
 808  943   * the possibility of inconsistency by having shadow copies of the variables,
 809  944   * while still allowing the code to be readable.
 810  945   */
      946 +#define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
 811  947  #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 812  948  #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 813  949  #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 814  950  #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 815  951  #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 816  952  #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */
      953 +#define arc_meta_used   ARCSTAT(arcstat_meta_used) /* size of metadata */
 817  954  #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */
      955 +#define arc_ddt_size    ARCSTAT(arcstat_ddt_size) /* ddt size in arc */
      956 +#define arc_ddt_limit   ARCSTAT(arcstat_ddt_limit) /* ddt in arc size limit */
 818  957  
      958 +/*
       959 + * Used in zio.c to optionally keep DDT cached in ARC
      960 + */
      961 +uint64_t const *arc_ddt_evict_threshold;
      962 +
 819  963  /* compressed size of entire arc */
 820  964  #define arc_compressed_size     ARCSTAT(arcstat_compressed_size)
 821  965  /* uncompressed size of entire arc */
 822  966  #define arc_uncompressed_size   ARCSTAT(arcstat_uncompressed_size)
 823  967  /* number of bytes in the arc from arc_buf_t's */
 824  968  #define arc_overhead_size       ARCSTAT(arcstat_overhead_size)
 825  969  
 826      -/*
 827      - * There are also some ARC variables that we want to export, but that are
 828      - * updated so often that having the canonical representation be the statistic
 829      - * variable causes a performance bottleneck. We want to use aggsum_t's for these
 830      - * instead, but still be able to export the kstat in the same way as before.
 831      - * The solution is to always use the aggsum version, except in the kstat update
 832      - * callback.
 833      - */
 834      -aggsum_t arc_size;
 835      -aggsum_t arc_meta_used;
 836      -aggsum_t astat_data_size;
 837      -aggsum_t astat_metadata_size;
 838      -aggsum_t astat_hdr_size;
 839      -aggsum_t astat_other_size;
 840      -aggsum_t astat_l2_hdr_size;
 841  970  
 842  971  static int              arc_no_grow;    /* Don't try to grow cache size */
 843  972  static uint64_t         arc_tempreserve;
 844  973  static uint64_t         arc_loaned_bytes;
 845  974  
 846  975  typedef struct arc_callback arc_callback_t;
 847  976  
 848  977  struct arc_callback {
 849  978          void                    *acb_private;
 850  979          arc_done_func_t         *acb_done;
↓ open down ↓ 40 lines elided ↑ open up ↑
 891 1020   * Because it's possible for the L2ARC to become extremely large, we can wind
 892 1021   * up eating a lot of memory in L2ARC buffer headers, so the size of a header
 893 1022   * is minimized by only allocating the fields necessary for an L1-cached buffer
 894 1023   * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
 895 1024   * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
 896 1025   * words in pointers. arc_hdr_realloc() is used to switch a header between
 897 1026   * these two allocation states.
 898 1027   */
 899 1028  typedef struct l1arc_buf_hdr {
 900 1029          kmutex_t                b_freeze_lock;
 901      -        zio_cksum_t             *b_freeze_cksum;
 902 1030  #ifdef ZFS_DEBUG
 903 1031          /*
 904 1032           * Used for debugging with kmem_flags - by allocating and freeing
 905 1033           * b_thawed when the buffer is thawed, we get a record of the stack
 906 1034           * trace that thawed it.
 907 1035           */
 908 1036          void                    *b_thawed;
 909 1037  #endif
 910 1038  
     1039 +        /* number of krrp tasks using this buffer */
     1040 +        uint64_t                b_krrp;
     1041 +
 911 1042          arc_buf_t               *b_buf;
 912 1043          uint32_t                b_bufcnt;
 913 1044          /* for waiting on writes to complete */
 914 1045          kcondvar_t              b_cv;
 915 1046          uint8_t                 b_byteswap;
 916 1047  
 917 1048          /* protected by arc state mutex */
 918 1049          arc_state_t             *b_state;
 919 1050          multilist_node_t        b_arc_node;
 920 1051  
↓ open down ↓ 15 lines elided ↑ open up ↑
 936 1067          uint64_t                b_daddr;        /* disk address, offset byte */
 937 1068  
 938 1069          list_node_t             b_l2node;
 939 1070  } l2arc_buf_hdr_t;
 940 1071  
 941 1072  struct arc_buf_hdr {
 942 1073          /* protected by hash lock */
 943 1074          dva_t                   b_dva;
 944 1075          uint64_t                b_birth;
 945 1076  
     1077 +        /*
     1078 +         * Even though this checksum is only set/verified when a buffer is in
     1079 +         * the L1 cache, it needs to be in the set of common fields because it
     1080 +         * must be preserved from the time before a buffer is written out to
     1081 +         * L2ARC until after it is read back in.
     1082 +         */
     1083 +        zio_cksum_t             *b_freeze_cksum;
     1084 +
 946 1085          arc_buf_contents_t      b_type;
 947 1086          arc_buf_hdr_t           *b_hash_next;
 948 1087          arc_flags_t             b_flags;
 949 1088  
 950 1089          /*
 951 1090           * This field stores the size of the data buffer after
 952 1091           * compression, and is set in the arc's zio completion handlers.
 953 1092           * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
 954 1093           *
 955 1094           * While the block pointers can store up to 32MB in their psize
↓ open down ↓ 37 lines elided ↑ open up ↑
 993 1132  
 994 1133  #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
 995 1134  #define HDR_L2_READING(hdr)     \
 996 1135          (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&  \
 997 1136          ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
 998 1137  #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
 999 1138  #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
1000 1139  #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
1001 1140  #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
1002 1141  
     1142 +#define HDR_ISTYPE_DDT(hdr)     \
     1143 +            ((hdr)->b_flags & ARC_FLAG_BUFC_DDT)
1003 1144  #define HDR_ISTYPE_METADATA(hdr)        \
1004 1145          ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
1005      -#define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr))
     1146 +#define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr) && \
     1147 +        !HDR_ISTYPE_DDT(hdr))
1006 1148  
1007 1149  #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
1008 1150  #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
1009 1151  
1010 1152  /* For storing compression mode in b_flags */
1011 1153  #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
1012 1154  
1013 1155  #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
1014 1156          HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
1015 1157  #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
↓ open down ↓ 7 lines elided ↑ open up ↑
1023 1165   * Other sizes
1024 1166   */
1025 1167  
1026 1168  #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
1027 1169  #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
1028 1170  
1029 1171  /*
1030 1172   * Hash table routines
1031 1173   */
1032 1174  
1033      -#define HT_LOCK_PAD     64
1034      -
1035      -struct ht_lock {
1036      -        kmutex_t        ht_lock;
1037      -#ifdef _KERNEL
1038      -        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
1039      -#endif
     1175 +struct ht_table {
     1176 +        arc_buf_hdr_t   *hdr;
     1177 +        kmutex_t        lock;
1040 1178  };
1041 1179  
1042      -#define BUF_LOCKS 256
1043 1180  typedef struct buf_hash_table {
1044 1181          uint64_t ht_mask;
1045      -        arc_buf_hdr_t **ht_table;
1046      -        struct ht_lock ht_locks[BUF_LOCKS];
     1182 +        struct ht_table *ht_table;
1047 1183  } buf_hash_table_t;
1048 1184  
     1185 +#pragma align 64(buf_hash_table)
1049 1186  static buf_hash_table_t buf_hash_table;
1050 1187  
1051 1188  #define BUF_HASH_INDEX(spa, dva, birth) \
1052 1189          (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1053      -#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
1054      -#define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
     1190 +#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)
1055 1191  #define HDR_LOCK(hdr) \
1056 1192          (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
1057 1193  
1058 1194  uint64_t zfs_crc64_table[256];
1059 1195  
1060 1196  /*
1061 1197   * Level 2 ARC
1062 1198   */
1063 1199  
1064 1200  #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
↓ open down ↓ 13 lines elided ↑ open up ↑
1078 1214  uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
1079 1215  uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
1080 1216  uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
1081 1217  uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1082 1218  uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
1083 1219  uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1084 1220  boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
1085 1221  boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
1086 1222  boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
1087 1223  
1088      -/*
1089      - * L2ARC Internals
1090      - */
1091      -struct l2arc_dev {
1092      -        vdev_t                  *l2ad_vdev;     /* vdev */
1093      -        spa_t                   *l2ad_spa;      /* spa */
1094      -        uint64_t                l2ad_hand;      /* next write location */
1095      -        uint64_t                l2ad_start;     /* first addr on device */
1096      -        uint64_t                l2ad_end;       /* last addr on device */
1097      -        boolean_t               l2ad_first;     /* first sweep through */
1098      -        boolean_t               l2ad_writing;   /* currently writing */
1099      -        kmutex_t                l2ad_mtx;       /* lock for buffer list */
1100      -        list_t                  l2ad_buflist;   /* buffer list */
1101      -        list_node_t             l2ad_node;      /* device list node */
1102      -        refcount_t              l2ad_alloc;     /* allocated bytes */
1103      -};
1104      -
1105 1224  static list_t L2ARC_dev_list;                   /* device list */
1106 1225  static list_t *l2arc_dev_list;                  /* device list pointer */
1107 1226  static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
1108 1227  static l2arc_dev_t *l2arc_dev_last;             /* last device used */
     1228 +static l2arc_dev_t *l2arc_ddt_dev_last;         /* last DDT device used */
1109 1229  static list_t L2ARC_free_on_write;              /* free after write buf list */
1110 1230  static list_t *l2arc_free_on_write;             /* free after write list ptr */
1111 1231  static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
1112 1232  static uint64_t l2arc_ndev;                     /* number of devices */
1113 1233  
1114 1234  typedef struct l2arc_read_callback {
1115 1235          arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
1116 1236          blkptr_t                l2rcb_bp;               /* original blkptr */
1117 1237          zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
1118 1238          int                     l2rcb_flags;            /* original flags */
1119 1239          abd_t                   *l2rcb_abd;             /* temporary buffer */
1120 1240  } l2arc_read_callback_t;
1121 1241  
1122 1242  typedef struct l2arc_write_callback {
1123 1243          l2arc_dev_t     *l2wcb_dev;             /* device info */
1124 1244          arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
     1245 +        list_t          l2wcb_log_blk_buflist;  /* in-flight log blocks */
1125 1246  } l2arc_write_callback_t;
1126 1247  
1127 1248  typedef struct l2arc_data_free {
1128 1249          /* protected by l2arc_free_on_write_mtx */
1129 1250          abd_t           *l2df_abd;
1130 1251          size_t          l2df_size;
1131 1252          arc_buf_contents_t l2df_type;
1132 1253          list_node_t     l2df_list_node;
1133 1254  } l2arc_data_free_t;
1134 1255  
↓ open down ↓ 5 lines elided ↑ open up ↑
1140 1261  static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1141 1262  static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1142 1263  static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1143 1264  static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1144 1265  static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1145 1266  static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1146 1267  static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1147 1268  static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1148 1269  static boolean_t arc_is_overflowing();
1149 1270  static void arc_buf_watch(arc_buf_t *);
     1271 +static l2arc_dev_t *l2arc_vdev_get(vdev_t *vd);
1150 1272  
1151 1273  static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1152 1274  static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
     1275 +static arc_buf_contents_t arc_flags_to_bufc(uint32_t);
1153 1276  static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1154 1277  static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1155 1278  
1156 1279  static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1157 1280  static void l2arc_read_done(zio_t *);
1158 1281  
     1282 +static void
     1283 +arc_update_hit_stat(arc_buf_hdr_t *hdr, boolean_t hit)
     1284 +{
     1285 +        boolean_t pf = !HDR_PREFETCH(hdr);
     1286 +        switch (arc_buf_type(hdr)) {
     1287 +        case ARC_BUFC_DATA:
     1288 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, data);
     1289 +                break;
     1290 +        case ARC_BUFC_METADATA:
     1291 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses,
     1292 +                    metadata);
     1293 +                break;
     1294 +        case ARC_BUFC_DDT:
     1295 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, ddt);
     1296 +                break;
     1297 +        default:
     1298 +                break;
     1299 +        }
     1300 +}
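
ARCSTAT_CONDSTAT is not visible in this hunk; conceptually it bumps one of four per-type counters depending on whether the access was a demand or prefetch request and whether it hit or missed. A rough userland sketch of that selection (placeholder counter names and type codes, not the real arcstat kstats or ARC_BUFC_* values) might look like:

        #include <stdio.h>
        #include <stdint.h>

        /* Illustrative stand-ins only; not part of the kernel source. */
        enum { HIT_DEMAND, HIT_PREFETCH, MISS_DEMAND, MISS_PREFETCH };

        static uint64_t data_stat[4], metadata_stat[4], ddt_stat[4];

        static void
        example_update_hit_stat(int is_prefetch, int is_hit, int type)
        {
                uint64_t *stats;
                int idx;

                switch (type) {
                case 0:                 /* stands in for ARC_BUFC_DATA */
                        stats = data_stat;
                        break;
                case 1:                 /* stands in for ARC_BUFC_METADATA */
                        stats = metadata_stat;
                        break;
                default:                /* stands in for ARC_BUFC_DDT */
                        stats = ddt_stat;
                        break;
                }

                /* Four-way split: demand vs. prefetch, hit vs. miss. */
                if (is_hit)
                        idx = is_prefetch ? HIT_PREFETCH : HIT_DEMAND;
                else
                        idx = is_prefetch ? MISS_PREFETCH : MISS_DEMAND;
                stats[idx]++;
        }

        int
        main(void)
        {
                example_update_hit_stat(0, 1, 0);       /* demand data hit */
                example_update_hit_stat(1, 0, 2);       /* prefetch ddt miss */
                printf("demand data hits: %llu\n",
                    (unsigned long long)data_stat[HIT_DEMAND]);
                return (0);
        }
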
1159 1301  
     1302 +enum {
     1303 +        L2ARC_DEV_HDR_EVICT_FIRST = (1 << 0)    /* mirror of l2ad_first */
     1304 +};
     1305 +
1160 1306  /*
1161      - * We use Cityhash for this. It's fast, and has good hash properties without
1162      - * requiring any large static buffers.
      1307 + * Pointer used in persistent L2ARC to reference log blocks & ARC buffers.
1163 1308   */
1164      -static uint64_t
     1309 +typedef struct l2arc_log_blkptr {
     1310 +        uint64_t        lbp_daddr;      /* device address of log */
     1311 +        /*
     1312 +         * lbp_prop is the same format as the blk_prop in blkptr_t:
     1313 +         *      * logical size (in sectors)
     1314 +         *      * physical size (in sectors)
     1315 +         *      * checksum algorithm (used for lbp_cksum)
     1316 +         *      * object type & level (unused for now)
     1317 +         */
     1318 +        uint64_t        lbp_prop;
     1319 +        zio_cksum_t     lbp_cksum;      /* fletcher4 of log */
     1320 +} l2arc_log_blkptr_t;
     1321 +
     1322 +/*
     1323 + * The persistent L2ARC device header.
     1324 + * Byte order of magic determines whether 64-bit bswap of fields is necessary.
     1325 + */
     1326 +typedef struct l2arc_dev_hdr_phys {
     1327 +        uint64_t        dh_magic;       /* L2ARC_DEV_HDR_MAGIC_Vx */
     1328 +        zio_cksum_t     dh_self_cksum;  /* fletcher4 of fields below */
     1329 +
     1330 +        /*
     1331 +         * Global L2ARC device state and metadata.
     1332 +         */
     1333 +        uint64_t        dh_spa_guid;
     1334 +        uint64_t        dh_alloc_space;         /* vdev space alloc status */
     1335 +        uint64_t        dh_flags;               /* l2arc_dev_hdr_flags_t */
     1336 +
     1337 +        /*
     1338 +         * Start of log block chain. [0] -> newest log, [1] -> one older (used
     1339 +         * for initiating prefetch).
     1340 +         */
     1341 +        l2arc_log_blkptr_t      dh_start_lbps[2];
     1342 +
     1343 +        const uint64_t  dh_pad[44];             /* pad to 512 bytes */
     1344 +} l2arc_dev_hdr_phys_t;
     1345 +CTASSERT(sizeof (l2arc_dev_hdr_phys_t) == SPA_MINBLOCKSIZE);
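
The CTASSERT holds because the fields sum to exactly one 512-byte sector with no compiler padding: 8 (magic) + 32 (self cksum) + 8 + 8 + 8 (guid/alloc/flags) + 2 * 48 (start lbps) + 44 * 8 (pad) = 512, and every member is 8-byte aligned. A small userland mirror of the layout (stand-in types, not the kernel definitions) demonstrates the same arithmetic:

        #include <stdio.h>
        #include <stdint.h>

        /*
         * Userland stand-ins only; mirrors the field sizes of
         * l2arc_dev_hdr_phys_t above to show how dh_pad[44] brings the
         * structure to exactly 512 bytes (SPA_MINBLOCKSIZE).
         */
        typedef struct { uint64_t zc_word[4]; } cksum_t;        /* 32 bytes */

        typedef struct {
                uint64_t        daddr;
                uint64_t        prop;
                cksum_t         cksum;
        } log_blkptr_t;                                         /* 48 bytes */

        typedef struct {
                uint64_t        magic;          /*   8 */
                cksum_t         self_cksum;     /*  32 */
                uint64_t        spa_guid;       /*   8 */
                uint64_t        alloc_space;    /*   8 */
                uint64_t        flags;          /*   8 */
                log_blkptr_t    start_lbps[2];  /*  96 */
                uint64_t        pad[44];        /* 352 */
        } dev_hdr_t;                                            /* 512 total */

        int
        main(void)
        {
                printf("sizeof (dev_hdr_t) = %zu\n", sizeof (dev_hdr_t));
                return (0);
        }
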
     1346 +
     1347 +/*
     1348 + * A single ARC buffer header entry in a l2arc_log_blk_phys_t.
     1349 + */
     1350 +typedef struct l2arc_log_ent_phys {
     1351 +        dva_t                   le_dva; /* dva of buffer */
     1352 +        uint64_t                le_birth;       /* birth txg of buffer */
     1353 +        zio_cksum_t             le_freeze_cksum;
     1354 +        /*
     1355 +         * le_prop is the same format as the blk_prop in blkptr_t:
     1356 +         *      * logical size (in sectors)
     1357 +         *      * physical size (in sectors)
     1358 +         *      * checksum algorithm (used for b_freeze_cksum)
     1359 +         *      * object type & level (used to restore arc_buf_contents_t)
     1360 +         */
     1361 +        uint64_t                le_prop;
     1362 +        uint64_t                le_daddr;       /* buf location on l2dev */
     1363 +        const uint64_t          le_pad[7];      /* resv'd for future use */
     1364 +} l2arc_log_ent_phys_t;
     1365 +
     1366 +/*
     1367 + * These design limits give us the following metadata overhead (before
     1368 + * compression):
     1369 + *      avg_blk_sz      overhead
     1370 + *      1k              12.51 %
     1371 + *      2k               6.26 %
     1372 + *      4k               3.13 %
     1373 + *      8k               1.56 %
     1374 + *      16k              0.78 %
     1375 + *      32k              0.39 %
     1376 + *      64k              0.20 %
     1377 + *      128k             0.10 %
      1378 + * Compression should be able to squeeze these down by about a factor of 2.
     1379 + */
     1380 +#define L2ARC_LOG_BLK_SIZE                      (128 * 1024)    /* 128k */
     1381 +#define L2ARC_LOG_BLK_HEADER_LEN                (128)
     1382 +#define L2ARC_LOG_BLK_ENTRIES                   /* 1023 entries */      \
     1383 +        ((L2ARC_LOG_BLK_SIZE - L2ARC_LOG_BLK_HEADER_LEN) /              \
     1384 +        sizeof (l2arc_log_ent_phys_t))
     1385 +/*
     1386 + * Maximum amount of data in an l2arc log block (used to terminate rebuilding
     1387 + * before we hit the write head and restore potentially corrupted blocks).
     1388 + */
     1389 +#define L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE  \
     1390 +        (SPA_MAXBLOCKSIZE * L2ARC_LOG_BLK_ENTRIES)
     1391 +/*
     1392 + * For the persistency and rebuild algorithms to operate reliably we need
     1393 + * the L2ARC device to at least be able to hold 3 full log blocks (otherwise
     1394 + * excessive log block looping might confuse the log chain end detection).
      1395 + * Under normal circumstances this is not a problem, since this amounts to
      1396 + * only around 400 MB.
     1397 + */
     1398 +#define L2ARC_PERSIST_MIN_SIZE  (3 * L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE)
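
The overhead table and the "around 400 MB" figure follow directly from these constants. As a rough userland illustration (assuming 128-byte log entries and a 128 KiB SPA_MAXBLOCKSIZE, which is what the 400 MB figure implies; not part of the kernel build), the arithmetic works out as follows:

        #include <stdio.h>
        #include <stdint.h>

        /* Stand-in values mirroring the constants above. */
        #define LOG_BLK_SIZE    (128 * 1024)
        #define LOG_BLK_HDR     128
        #define LOG_ENT_SIZE    128             /* sizeof (l2arc_log_ent_phys_t) */
        #define MAX_BLK_SIZE    (128 * 1024)    /* assumed SPA_MAXBLOCKSIZE */

        int
        main(void)
        {
                uint64_t entries = (LOG_BLK_SIZE - LOG_BLK_HDR) / LOG_ENT_SIZE;
                uint64_t max_payload = (uint64_t)MAX_BLK_SIZE * entries;
                uint64_t persist_min = 3 * max_payload;

                printf("entries per log block: %llu\n",
                    (unsigned long long)entries);           /* 1023 */

                /* Metadata overhead before compression, as in the table. */
                for (uint64_t avg = 1024; avg <= 128 * 1024; avg *= 2)
                        printf("avg_blk_sz %4lluk overhead %5.2f %%\n",
                            (unsigned long long)(avg >> 10),
                            100.0 * LOG_BLK_SIZE / (double)(entries * avg));

                printf("min device size: %llu MB\n",
                    (unsigned long long)(persist_min / 1000000));  /* ~400 MB */
                return (0);
        }
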
     1399 +
     1400 +/*
     1401 + * A log block of up to 1023 ARC buffer log entries, chained into the
     1402 + * persistent L2ARC metadata linked list. Byte order of magic determines
     1403 + * whether 64-bit bswap of fields is necessary.
     1404 + */
     1405 +typedef struct l2arc_log_blk_phys {
     1406 +        /* Header - see L2ARC_LOG_BLK_HEADER_LEN above */
     1407 +        uint64_t                lb_magic;       /* L2ARC_LOG_BLK_MAGIC */
     1408 +        l2arc_log_blkptr_t      lb_back2_lbp;   /* back 2 steps in chain */
     1409 +        uint64_t                lb_pad[9];      /* resv'd for future use */
     1410 +        /* Payload */
     1411 +        l2arc_log_ent_phys_t    lb_entries[L2ARC_LOG_BLK_ENTRIES];
     1412 +} l2arc_log_blk_phys_t;
     1413 +
     1414 +CTASSERT(sizeof (l2arc_log_blk_phys_t) == L2ARC_LOG_BLK_SIZE);
     1415 +CTASSERT(offsetof(l2arc_log_blk_phys_t, lb_entries) -
     1416 +    offsetof(l2arc_log_blk_phys_t, lb_magic) == L2ARC_LOG_BLK_HEADER_LEN);
     1417 +
     1418 +/*
     1419 + * These structures hold in-flight l2arc_log_blk_phys_t's as they're being
     1420 + * written to the L2ARC device. They may be compressed, hence the uint8_t[].
     1421 + */
     1422 +typedef struct l2arc_log_blk_buf {
     1423 +        uint8_t         lbb_log_blk[sizeof (l2arc_log_blk_phys_t)];
     1424 +        list_node_t     lbb_node;
     1425 +} l2arc_log_blk_buf_t;
     1426 +
      1427 +/* Macros for manipulating fields in the blk_prop format of blkptr_t */
     1428 +#define BLKPROP_GET_LSIZE(_obj, _field)         \
     1429 +        BF64_GET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1)
     1430 +#define BLKPROP_SET_LSIZE(_obj, _field, x)      \
     1431 +        BF64_SET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
     1432 +#define BLKPROP_GET_PSIZE(_obj, _field)         \
     1433 +        BF64_GET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0)
     1434 +#define BLKPROP_SET_PSIZE(_obj, _field, x)      \
     1435 +        BF64_SET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0, x)
     1436 +#define BLKPROP_GET_COMPRESS(_obj, _field)      \
     1437 +        BF64_GET((_obj)->_field, 32, 7)
     1438 +#define BLKPROP_SET_COMPRESS(_obj, _field, x)   \
     1439 +        BF64_SET((_obj)->_field, 32, 7, x)
     1440 +#define BLKPROP_GET_ARC_COMPRESS(_obj, _field)  \
     1441 +        BF64_GET((_obj)->_field, 39, 1)
     1442 +#define BLKPROP_SET_ARC_COMPRESS(_obj, _field, x)       \
     1443 +        BF64_SET((_obj)->_field, 39, 1, x)
     1444 +#define BLKPROP_GET_CHECKSUM(_obj, _field)      \
     1445 +        BF64_GET((_obj)->_field, 40, 8)
     1446 +#define BLKPROP_SET_CHECKSUM(_obj, _field, x)   \
     1447 +        BF64_SET((_obj)->_field, 40, 8, x)
     1448 +#define BLKPROP_GET_TYPE(_obj, _field)          \
     1449 +        BF64_GET((_obj)->_field, 48, 8)
     1450 +#define BLKPROP_SET_TYPE(_obj, _field, x)       \
     1451 +        BF64_SET((_obj)->_field, 48, 8, x)
     1452 +
     1453 +/* Macros for manipulating a l2arc_log_blkptr_t->lbp_prop field */
     1454 +#define LBP_GET_LSIZE(_add)             BLKPROP_GET_LSIZE(_add, lbp_prop)
     1455 +#define LBP_SET_LSIZE(_add, x)          BLKPROP_SET_LSIZE(_add, lbp_prop, x)
     1456 +#define LBP_GET_PSIZE(_add)             BLKPROP_GET_PSIZE(_add, lbp_prop)
     1457 +#define LBP_SET_PSIZE(_add, x)          BLKPROP_SET_PSIZE(_add, lbp_prop, x)
     1458 +#define LBP_GET_COMPRESS(_add)          BLKPROP_GET_COMPRESS(_add, lbp_prop)
     1459 +#define LBP_SET_COMPRESS(_add, x)       BLKPROP_SET_COMPRESS(_add, lbp_prop, x)
     1460 +#define LBP_GET_CHECKSUM(_add)          BLKPROP_GET_CHECKSUM(_add, lbp_prop)
     1461 +#define LBP_SET_CHECKSUM(_add, x)       BLKPROP_SET_CHECKSUM(_add, lbp_prop, x)
     1462 +#define LBP_GET_TYPE(_add)              BLKPROP_GET_TYPE(_add, lbp_prop)
     1463 +#define LBP_SET_TYPE(_add, x)           BLKPROP_SET_TYPE(_add, lbp_prop, x)
     1464 +
     1465 +/* Macros for manipulating a l2arc_log_ent_phys_t->le_prop field */
     1466 +#define LE_GET_LSIZE(_le)       BLKPROP_GET_LSIZE(_le, le_prop)
     1467 +#define LE_SET_LSIZE(_le, x)    BLKPROP_SET_LSIZE(_le, le_prop, x)
     1468 +#define LE_GET_PSIZE(_le)       BLKPROP_GET_PSIZE(_le, le_prop)
     1469 +#define LE_SET_PSIZE(_le, x)    BLKPROP_SET_PSIZE(_le, le_prop, x)
     1470 +#define LE_GET_COMPRESS(_le)    BLKPROP_GET_COMPRESS(_le, le_prop)
     1471 +#define LE_SET_COMPRESS(_le, x) BLKPROP_SET_COMPRESS(_le, le_prop, x)
     1472 +#define LE_GET_ARC_COMPRESS(_le)        BLKPROP_GET_ARC_COMPRESS(_le, le_prop)
     1473 +#define LE_SET_ARC_COMPRESS(_le, x)     BLKPROP_SET_ARC_COMPRESS(_le, le_prop, x)
     1474 +#define LE_GET_CHECKSUM(_le)    BLKPROP_GET_CHECKSUM(_le, le_prop)
     1475 +#define LE_SET_CHECKSUM(_le, x) BLKPROP_SET_CHECKSUM(_le, le_prop, x)
     1476 +#define LE_GET_TYPE(_le)        BLKPROP_GET_TYPE(_le, le_prop)
     1477 +#define LE_SET_TYPE(_le, x)     BLKPROP_SET_TYPE(_le, le_prop, x)
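
The LBP_*/LE_* accessors pack logical size, physical size, compression, checksum algorithm and object type into a single 64-bit prop word, reusing the blk_prop bit layout of blkptr_t (sizes stored as 512-byte sector counts, with lsize biased by one). A self-contained sketch of that packing, using hand-rolled stand-ins for the BF64 helpers and arbitrary example values, could look like:

        #include <stdio.h>
        #include <stdint.h>

        /* Illustrative BF64 stand-ins; field positions follow BLKPROP_*. */
        static uint64_t
        bf64_get(uint64_t x, int low, int len)
        {
                return ((x >> low) & ((1ULL << len) - 1));
        }

        static void
        bf64_set(uint64_t *x, int low, int len, uint64_t val)
        {
                uint64_t mask = ((1ULL << len) - 1) << low;

                *x = (*x & ~mask) | ((val << low) & mask);
        }

        int
        main(void)
        {
                uint64_t prop = 0;
                uint64_t lsize = 128 * 1024, psize = 4096;

                bf64_set(&prop, 0, 16, (lsize >> 9) - 1);  /* LE_SET_LSIZE */
                bf64_set(&prop, 16, 16, psize >> 9);       /* LE_SET_PSIZE */
                bf64_set(&prop, 32, 7, 2);    /* example compression id */
                bf64_set(&prop, 40, 8, 1);    /* example checksum id */
                bf64_set(&prop, 48, 8, 10);   /* example object type */

                printf("lsize %llu psize %llu compress %llu cksum %llu "
                    "type %llu\n",
                    (unsigned long long)((bf64_get(prop, 0, 16) + 1) << 9),
                    (unsigned long long)(bf64_get(prop, 16, 16) << 9),
                    (unsigned long long)bf64_get(prop, 32, 7),
                    (unsigned long long)bf64_get(prop, 40, 8),
                    (unsigned long long)bf64_get(prop, 48, 8));
                return (0);
        }
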
     1478 +
     1479 +#define PTR_SWAP(x, y)          \
     1480 +        do {                    \
     1481 +                void *tmp = (x);\
     1482 +                x = y;          \
     1483 +                y = tmp;        \
     1484 +                _NOTE(CONSTCOND)\
     1485 +        } while (0)
     1486 +
     1487 +/*
     1488 + * Sadly, after compressed ARC integration older kernels would panic
     1489 + * when trying to rebuild persistent L2ARC created by the new code.
     1490 + */
     1491 +#define L2ARC_DEV_HDR_MAGIC_V1  0x4c32415243763031LLU   /* ASCII: "L2ARCv01" */
     1492 +#define L2ARC_LOG_BLK_MAGIC     0x4c4f47424c4b4844LLU   /* ASCII: "LOGBLKHD" */
     1493 +
     1494 +/*
     1495 + * Performance tuning of L2ARC persistency:
     1496 + *
      1497 + * l2arc_rebuild_enabled : Controls whether adding an L2ARC device (either
      1498 + *              at pool import or manually later) will attempt to rebuild
      1499 + *              the L2ARC buffer contents. In special circumstances,
      1500 + *              the administrator may want to set this to B_FALSE if they
      1501 + *              are having trouble importing a pool or attaching an L2ARC
      1502 + *              device (e.g. the L2ARC device is slow to read in stored log
      1503 + *              metadata, or the metadata has somehow become
      1504 + *              fragmented/unusable).
     1505 + */
     1506 +boolean_t l2arc_rebuild_enabled = B_TRUE;
     1507 +
     1508 +/* L2ARC persistency rebuild control routines. */
     1509 +static void l2arc_dev_rebuild_start(l2arc_dev_t *dev);
     1510 +static int l2arc_rebuild(l2arc_dev_t *dev);
     1511 +
     1512 +/* L2ARC persistency read I/O routines. */
     1513 +static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
     1514 +static int l2arc_log_blk_read(l2arc_dev_t *dev,
     1515 +    const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
     1516 +    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
     1517 +    uint8_t *this_lb_buf, uint8_t *next_lb_buf,
     1518 +    zio_t *this_io, zio_t **next_io);
     1519 +static zio_t *l2arc_log_blk_prefetch(vdev_t *vd,
     1520 +    const l2arc_log_blkptr_t *lp, uint8_t *lb_buf);
     1521 +static void l2arc_log_blk_prefetch_abort(zio_t *zio);
     1522 +
     1523 +/* L2ARC persistency block restoration routines. */
     1524 +static void l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
     1525 +    const l2arc_log_blk_phys_t *lb, uint64_t lb_psize);
     1526 +static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
     1527 +    l2arc_dev_t *dev, uint64_t guid);
     1528 +
     1529 +/* L2ARC persistency write I/O routines. */
     1530 +static void l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio);
     1531 +static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
     1532 +    l2arc_write_callback_t *cb);
     1533 +
      1534 +/* L2ARC persistency auxiliary routines. */
     1535 +static boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
     1536 +    const l2arc_log_blkptr_t *lp);
     1537 +static void l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr,
     1538 +    zio_cksum_t *cksum);
     1539 +static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
     1540 +    const arc_buf_hdr_t *ab);
     1541 +static inline boolean_t l2arc_range_check_overlap(uint64_t bottom,
     1542 +    uint64_t top, uint64_t check);
     1543 +
     1544 +/*
     1545 + * L2ARC Internals
     1546 + */
     1547 +struct l2arc_dev {
     1548 +        vdev_t                  *l2ad_vdev;     /* vdev */
     1549 +        spa_t                   *l2ad_spa;      /* spa */
     1550 +        uint64_t                l2ad_hand;      /* next write location */
     1551 +        uint64_t                l2ad_start;     /* first addr on device */
     1552 +        uint64_t                l2ad_end;       /* last addr on device */
     1553 +        boolean_t               l2ad_first;     /* first sweep through */
     1554 +        boolean_t               l2ad_writing;   /* currently writing */
     1555 +        kmutex_t                l2ad_mtx;       /* lock for buffer list */
     1556 +        list_t                  l2ad_buflist;   /* buffer list */
     1557 +        list_node_t             l2ad_node;      /* device list node */
     1558 +        refcount_t              l2ad_alloc;     /* allocated bytes */
     1559 +        l2arc_dev_hdr_phys_t    *l2ad_dev_hdr;  /* persistent device header */
     1560 +        uint64_t                l2ad_dev_hdr_asize; /* aligned hdr size */
     1561 +        l2arc_log_blk_phys_t    l2ad_log_blk;   /* currently open log block */
     1562 +        int                     l2ad_log_ent_idx; /* index into cur log blk */
     1563 +        /* number of bytes in current log block's payload */
     1564 +        uint64_t                l2ad_log_blk_payload_asize;
     1565 +        /* flag indicating whether a rebuild is scheduled or is going on */
     1566 +        boolean_t               l2ad_rebuild;
     1567 +        boolean_t               l2ad_rebuild_cancel;
     1568 +        kt_did_t                l2ad_rebuild_did;
     1569 +};
     1570 +
     1571 +static inline uint64_t
1165 1572  buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1166 1573  {
1167      -        return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
     1574 +        uint8_t *vdva = (uint8_t *)dva;
     1575 +        uint64_t crc = -1ULL;
     1576 +        int i;
     1577 +
     1578 +        ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
     1579 +
     1580 +        for (i = 0; i < sizeof (dva_t); i++)
     1581 +                crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
     1582 +
     1583 +        crc ^= (spa>>8) ^ birth;
     1584 +
     1585 +        return (crc);
1168 1586  }
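
This replaces the previous hash with the same CRC64 folding that buf_init() seeds into zfs_crc64_table further down in this file. A userland sketch of the full path, from table seeding through masking with ht_mask to pick a bucket (the polynomial value here is an assumption for the sketch), is:

        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        /* Assumed ZFS_CRC64_POLY value; illustrative only. */
        #define CRC64_POLY      0xC96C5795D7870F42ULL

        static uint64_t crc64_table[256];

        static void
        crc64_init(void)        /* mirrors the seeding loop in buf_init() */
        {
                for (int i = 0; i < 256; i++) {
                        uint64_t *ct = &crc64_table[i];
                        *ct = i;
                        for (int j = 8; j > 0; j--)
                                *ct = (*ct >> 1) ^ (-(*ct & 1) & CRC64_POLY);
                }
        }

        static uint64_t
        example_buf_hash(uint64_t spa, const uint64_t dva[2], uint64_t birth)
        {
                const uint8_t *vdva = (const uint8_t *)dva;
                uint64_t crc = -1ULL;

                /* Fold the 16 DVA bytes through the table, as above. */
                for (size_t i = 0; i < 2 * sizeof (uint64_t); i++)
                        crc = (crc >> 8) ^ crc64_table[(crc ^ vdva[i]) & 0xFF];

                return (crc ^ (spa >> 8) ^ birth);
        }

        int
        main(void)
        {
                uint64_t dva[2] = { 0x1234, 0x5678 };
                uint64_t ht_mask = (1ULL << 20) - 1;    /* example table size */

                crc64_init();
                printf("bucket %llu\n", (unsigned long long)
                    (example_buf_hash(42, dva, 7) & ht_mask));
                return (0);
        }
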
1169 1587  
1170 1588  #define HDR_EMPTY(hdr)                                          \
1171 1589          ((hdr)->b_dva.dva_word[0] == 0 &&                       \
1172 1590          (hdr)->b_dva.dva_word[1] == 0)
1173 1591  
1174 1592  #define HDR_EQUAL(spa, dva, birth, hdr)                         \
1175 1593          ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&     \
1176 1594          ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
1177 1595          ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
↓ open down ↓ 9 lines elided ↑ open up ↑
1187 1605  static arc_buf_hdr_t *
1188 1606  buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1189 1607  {
1190 1608          const dva_t *dva = BP_IDENTITY(bp);
1191 1609          uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1192 1610          uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1193 1611          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1194 1612          arc_buf_hdr_t *hdr;
1195 1613  
1196 1614          mutex_enter(hash_lock);
1197      -        for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
     1615 +        for (hdr = buf_hash_table.ht_table[idx].hdr; hdr != NULL;
1198 1616              hdr = hdr->b_hash_next) {
1199 1617                  if (HDR_EQUAL(spa, dva, birth, hdr)) {
1200 1618                          *lockp = hash_lock;
1201 1619                          return (hdr);
1202 1620                  }
1203 1621          }
1204 1622          mutex_exit(hash_lock);
1205 1623          *lockp = NULL;
1206 1624          return (NULL);
1207 1625  }
↓ open down ↓ 17 lines elided ↑ open up ↑
1225 1643          ASSERT(hdr->b_birth != 0);
1226 1644          ASSERT(!HDR_IN_HASH_TABLE(hdr));
1227 1645  
1228 1646          if (lockp != NULL) {
1229 1647                  *lockp = hash_lock;
1230 1648                  mutex_enter(hash_lock);
1231 1649          } else {
1232 1650                  ASSERT(MUTEX_HELD(hash_lock));
1233 1651          }
1234 1652  
1235      -        for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
     1653 +        for (fhdr = buf_hash_table.ht_table[idx].hdr, i = 0; fhdr != NULL;
1236 1654              fhdr = fhdr->b_hash_next, i++) {
1237 1655                  if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1238 1656                          return (fhdr);
1239 1657          }
1240 1658  
1241      -        hdr->b_hash_next = buf_hash_table.ht_table[idx];
1242      -        buf_hash_table.ht_table[idx] = hdr;
     1659 +        hdr->b_hash_next = buf_hash_table.ht_table[idx].hdr;
     1660 +        buf_hash_table.ht_table[idx].hdr = hdr;
1243 1661          arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1244 1662  
1245 1663          /* collect some hash table performance data */
1246 1664          if (i > 0) {
1247 1665                  ARCSTAT_BUMP(arcstat_hash_collisions);
1248 1666                  if (i == 1)
1249 1667                          ARCSTAT_BUMP(arcstat_hash_chains);
1250 1668  
1251 1669                  ARCSTAT_MAX(arcstat_hash_chain_max, i);
1252 1670          }
↓ open down ↓ 6 lines elided ↑ open up ↑
1259 1677  
1260 1678  static void
1261 1679  buf_hash_remove(arc_buf_hdr_t *hdr)
1262 1680  {
1263 1681          arc_buf_hdr_t *fhdr, **hdrp;
1264 1682          uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1265 1683  
1266 1684          ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1267 1685          ASSERT(HDR_IN_HASH_TABLE(hdr));
1268 1686  
1269      -        hdrp = &buf_hash_table.ht_table[idx];
     1687 +        hdrp = &buf_hash_table.ht_table[idx].hdr;
1270 1688          while ((fhdr = *hdrp) != hdr) {
1271 1689                  ASSERT3P(fhdr, !=, NULL);
1272 1690                  hdrp = &fhdr->b_hash_next;
1273 1691          }
1274 1692          *hdrp = hdr->b_hash_next;
1275 1693          hdr->b_hash_next = NULL;
1276 1694          arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1277 1695  
1278 1696          /* collect some hash table performance data */
1279 1697          ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1280 1698  
1281      -        if (buf_hash_table.ht_table[idx] &&
1282      -            buf_hash_table.ht_table[idx]->b_hash_next == NULL)
     1699 +        if (buf_hash_table.ht_table[idx].hdr &&
     1700 +            buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
1283 1701                  ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1284 1702  }
1285 1703  
1286 1704  /*
1287 1705   * Global data structures and functions for the buf kmem cache.
1288 1706   */
1289 1707  static kmem_cache_t *hdr_full_cache;
1290 1708  static kmem_cache_t *hdr_l2only_cache;
1291 1709  static kmem_cache_t *buf_cache;
1292 1710  
1293 1711  static void
1294 1712  buf_fini(void)
1295 1713  {
1296 1714          int i;
1297 1715  
     1716 +        for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
     1717 +                mutex_destroy(&buf_hash_table.ht_table[i].lock);
1298 1718          kmem_free(buf_hash_table.ht_table,
1299      -            (buf_hash_table.ht_mask + 1) * sizeof (void *));
1300      -        for (i = 0; i < BUF_LOCKS; i++)
1301      -                mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
     1719 +            (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));
1302 1720          kmem_cache_destroy(hdr_full_cache);
1303 1721          kmem_cache_destroy(hdr_l2only_cache);
1304 1722          kmem_cache_destroy(buf_cache);
1305 1723  }
1306 1724  
1307 1725  /*
1308 1726   * Constructor callback - called when the cache is empty
1309 1727   * and a new buf is requested.
1310 1728   */
1311 1729  /* ARGSUSED */
↓ open down ↓ 102 lines elided ↑ open up ↑
1414 1832           * The hash table is big enough to fill all of physical memory
1415 1833           * with an average block size of zfs_arc_average_blocksize (default 8K).
1416 1834           * By default, the table will take up
1417 1835           * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1418 1836           */
1419 1837          while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1420 1838                  hsize <<= 1;
1421 1839  retry:
1422 1840          buf_hash_table.ht_mask = hsize - 1;
1423 1841          buf_hash_table.ht_table =
1424      -            kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
     1842 +            kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
1425 1843          if (buf_hash_table.ht_table == NULL) {
1426 1844                  ASSERT(hsize > (1ULL << 8));
1427 1845                  hsize >>= 1;
1428 1846                  goto retry;
1429 1847          }
1430 1848  
1431 1849          hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1432 1850              0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1433 1851          hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1434 1852              HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1435 1853              NULL, NULL, 0);
1436 1854          buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1437 1855              0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1438 1856  
1439 1857          for (i = 0; i < 256; i++)
1440 1858                  for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1441 1859                          *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1442 1860  
1443      -        for (i = 0; i < BUF_LOCKS; i++) {
1444      -                mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
     1861 +        for (i = 0; i < hsize; i++) {
     1862 +                mutex_init(&buf_hash_table.ht_table[i].lock,
1445 1863                      NULL, MUTEX_DEFAULT, NULL);
1446 1864          }
1447 1865  }
1448 1866  
     1867 +/* wait until krrp releases the buffer */
     1868 +static inline void
     1869 +arc_wait_for_krrp(arc_buf_hdr_t *hdr)
     1870 +{
     1871 +        while (HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_krrp != 0)
     1872 +                cv_wait(&hdr->b_l1hdr.b_cv, HDR_LOCK(hdr));
     1873 +}
     1874 +
1449 1875  /*
1450 1876   * This is the size that the buf occupies in memory. If the buf is compressed,
1451 1877   * it will correspond to the compressed size. You should use this method of
1452 1878   * getting the buf size unless you explicitly need the logical size.
1453 1879   */
1454 1880  int32_t
1455 1881  arc_buf_size(arc_buf_t *buf)
1456 1882  {
1457 1883          return (ARC_BUF_COMPRESSED(buf) ?
1458 1884              HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
↓ open down ↓ 35 lines elided ↑ open up ↑
1494 1920  
1495 1921  /*
1496 1922   * Free the checksum associated with this header. If there is no checksum, this
1497 1923   * is a no-op.
1498 1924   */
1499 1925  static inline void
1500 1926  arc_cksum_free(arc_buf_hdr_t *hdr)
1501 1927  {
1502 1928          ASSERT(HDR_HAS_L1HDR(hdr));
1503 1929          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1504      -        if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1505      -                kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
1506      -                hdr->b_l1hdr.b_freeze_cksum = NULL;
     1930 +        if (hdr->b_freeze_cksum != NULL) {
     1931 +                kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
     1932 +                hdr->b_freeze_cksum = NULL;
1507 1933          }
1508 1934          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1509 1935  }
1510 1936  
1511 1937  /*
1512 1938   * Return true iff at least one of the bufs on hdr is not compressed.
1513 1939   */
1514 1940  static boolean_t
1515 1941  arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1516 1942  {
↓ open down ↓ 13 lines elided ↑ open up ↑
1530 1956  static void
1531 1957  arc_cksum_verify(arc_buf_t *buf)
1532 1958  {
1533 1959          arc_buf_hdr_t *hdr = buf->b_hdr;
1534 1960          zio_cksum_t zc;
1535 1961  
1536 1962          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1537 1963                  return;
1538 1964  
1539 1965          if (ARC_BUF_COMPRESSED(buf)) {
1540      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     1966 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1541 1967                      arc_hdr_has_uncompressed_buf(hdr));
1542 1968                  return;
1543 1969          }
1544 1970  
1545 1971          ASSERT(HDR_HAS_L1HDR(hdr));
1546 1972  
1547 1973          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1548      -        if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
     1974 +        if (hdr->b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1549 1975                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1550 1976                  return;
1551 1977          }
1552 1978  
1553 1979          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1554      -        if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
     1980 +        if (!ZIO_CHECKSUM_EQUAL(*hdr->b_freeze_cksum, zc))
1555 1981                  panic("buffer modified while frozen!");
1556 1982          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1557 1983  }
1558 1984  
1559 1985  static boolean_t
1560 1986  arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1561 1987  {
1562 1988          enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1563 1989          boolean_t valid_cksum;
1564 1990  
↓ open down ↓ 10 lines elided ↑ open up ↑
1575 2001           * arc is disabled, then the data written to the l2arc is always
1576 2002           * uncompressed and won't match the block as it exists in the main
1577 2003           * pool. When this is the case, we must first compress it if it is
1578 2004           * compressed on the main pool before we can validate the checksum.
1579 2005           */
1580 2006          if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
1581 2007                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1582 2008                  uint64_t lsize = HDR_GET_LSIZE(hdr);
1583 2009                  uint64_t csize;
1584 2010  
1585      -                abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
1586      -                csize = zio_compress_data(compress, zio->io_abd,
1587      -                    abd_to_buf(cdata), lsize);
     2011 +                void *cbuf = zio_buf_alloc(HDR_GET_PSIZE(hdr));
     2012 +                csize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
     2013 +                abd_t *cdata = abd_get_from_buf(cbuf, HDR_GET_PSIZE(hdr));
     2014 +                abd_take_ownership_of_buf(cdata, B_TRUE);
1588 2015  
1589 2016                  ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
1590 2017                  if (csize < HDR_GET_PSIZE(hdr)) {
1591 2018                          /*
1592 2019                           * Compressed blocks are always a multiple of the
1593 2020                           * smallest ashift in the pool. Ideally, we would
1594 2021                           * like to round up the csize to the next
1595 2022                           * spa_min_ashift but that value may have changed
1596 2023                           * since the block was last written. Instead,
1597 2024                           * we rely on the fact that the hdr's psize
↓ open down ↓ 38 lines elided ↑ open up ↑
1636 2063  arc_cksum_compute(arc_buf_t *buf)
1637 2064  {
1638 2065          arc_buf_hdr_t *hdr = buf->b_hdr;
1639 2066  
1640 2067          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1641 2068                  return;
1642 2069  
1643 2070          ASSERT(HDR_HAS_L1HDR(hdr));
1644 2071  
1645 2072          mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
1646      -        if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
     2073 +        if (hdr->b_freeze_cksum != NULL) {
1647 2074                  ASSERT(arc_hdr_has_uncompressed_buf(hdr));
1648 2075                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1649 2076                  return;
1650 2077          } else if (ARC_BUF_COMPRESSED(buf)) {
1651 2078                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1652 2079                  return;
1653 2080          }
1654 2081  
1655 2082          ASSERT(!ARC_BUF_COMPRESSED(buf));
1656      -        hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
     2083 +        hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
1657 2084              KM_SLEEP);
1658 2085          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
1659      -            hdr->b_l1hdr.b_freeze_cksum);
     2086 +            hdr->b_freeze_cksum);
1660 2087          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1661 2088          arc_buf_watch(buf);
1662 2089  }
1663 2090  
1664 2091  #ifndef _KERNEL
1665 2092  typedef struct procctl {
1666 2093          long cmd;
1667 2094          prwatch_t prwatch;
1668 2095  } procctl_t;
1669 2096  #endif
↓ open down ↓ 31 lines elided ↑ open up ↑
1701 2128                  result = write(arc_procfd, &ctl, sizeof (ctl));
1702 2129                  ASSERT3U(result, ==, sizeof (ctl));
1703 2130          }
1704 2131  #endif
1705 2132  }
1706 2133  
1707 2134  static arc_buf_contents_t
1708 2135  arc_buf_type(arc_buf_hdr_t *hdr)
1709 2136  {
1710 2137          arc_buf_contents_t type;
     2138 +
1711 2139          if (HDR_ISTYPE_METADATA(hdr)) {
1712 2140                  type = ARC_BUFC_METADATA;
     2141 +        } else if (HDR_ISTYPE_DDT(hdr)) {
     2142 +                type = ARC_BUFC_DDT;
1713 2143          } else {
1714 2144                  type = ARC_BUFC_DATA;
1715 2145          }
1716 2146          VERIFY3U(hdr->b_type, ==, type);
1717 2147          return (type);
1718 2148  }
1719 2149  
1720 2150  boolean_t
1721 2151  arc_is_metadata(arc_buf_t *buf)
1722 2152  {
↓ open down ↓ 2 lines elided ↑ open up ↑
1725 2155  
1726 2156  static uint32_t
1727 2157  arc_bufc_to_flags(arc_buf_contents_t type)
1728 2158  {
1729 2159          switch (type) {
1730 2160          case ARC_BUFC_DATA:
1731 2161                  /* metadata field is 0 if buffer contains normal data */
1732 2162                  return (0);
1733 2163          case ARC_BUFC_METADATA:
1734 2164                  return (ARC_FLAG_BUFC_METADATA);
     2165 +        case ARC_BUFC_DDT:
     2166 +                return (ARC_FLAG_BUFC_DDT);
1735 2167          default:
1736 2168                  break;
1737 2169          }
1738 2170          panic("undefined ARC buffer type!");
1739 2171          return ((uint32_t)-1);
1740 2172  }
1741 2173  
     2174 +static arc_buf_contents_t
     2175 +arc_flags_to_bufc(uint32_t flags)
     2176 +{
     2177 +        if (flags & ARC_FLAG_BUFC_DDT)
     2178 +                return (ARC_BUFC_DDT);
     2179 +        if (flags & ARC_FLAG_BUFC_METADATA)
     2180 +                return (ARC_BUFC_METADATA);
     2181 +        return (ARC_BUFC_DATA);
     2182 +}
     2183 +
1742 2184  void
1743 2185  arc_buf_thaw(arc_buf_t *buf)
1744 2186  {
1745 2187          arc_buf_hdr_t *hdr = buf->b_hdr;
1746 2188  
1747 2189          ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
1748 2190          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1749 2191  
1750 2192          arc_cksum_verify(buf);
1751 2193  
1752 2194          /*
1753 2195           * Compressed buffers do not manipulate the b_freeze_cksum or
1754 2196           * allocate b_thawed.
1755 2197           */
1756 2198          if (ARC_BUF_COMPRESSED(buf)) {
1757      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     2199 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1758 2200                      arc_hdr_has_uncompressed_buf(hdr));
1759 2201                  return;
1760 2202          }
1761 2203  
1762 2204          ASSERT(HDR_HAS_L1HDR(hdr));
1763 2205          arc_cksum_free(hdr);
1764 2206  
1765 2207          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1766 2208  #ifdef ZFS_DEBUG
1767 2209          if (zfs_flags & ZFS_DEBUG_MODIFY) {
↓ open down ↓ 11 lines elided ↑ open up ↑
1779 2221  void
1780 2222  arc_buf_freeze(arc_buf_t *buf)
1781 2223  {
1782 2224          arc_buf_hdr_t *hdr = buf->b_hdr;
1783 2225          kmutex_t *hash_lock;
1784 2226  
1785 2227          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1786 2228                  return;
1787 2229  
1788 2230          if (ARC_BUF_COMPRESSED(buf)) {
1789      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     2231 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1790 2232                      arc_hdr_has_uncompressed_buf(hdr));
1791 2233                  return;
1792 2234          }
1793 2235  
1794 2236          hash_lock = HDR_LOCK(hdr);
1795 2237          mutex_enter(hash_lock);
1796 2238  
1797 2239          ASSERT(HDR_HAS_L1HDR(hdr));
1798      -        ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
     2240 +        ASSERT(hdr->b_freeze_cksum != NULL ||
1799 2241              hdr->b_l1hdr.b_state == arc_anon);
1800 2242          arc_cksum_compute(buf);
1801 2243          mutex_exit(hash_lock);
1802 2244  }
1803 2245  
1804 2246  /*
1805 2247   * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
1806 2248   * the following functions should be used to ensure that the flags are
1807 2249   * updated in a thread-safe way. When manipulating the flags either
1808 2250   * the hash_lock must be held or the hdr must be undiscoverable. This
↓ open down ↓ 71 lines elided ↑ open up ↑
1880 2322                          bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
1881 2323                          copied = B_TRUE;
1882 2324                          break;
1883 2325                  }
1884 2326          }
1885 2327  
1886 2328          /*
1887 2329           * There were no decompressed bufs, so there should not be a
1888 2330           * checksum on the hdr either.
1889 2331           */
1890      -        EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
     2332 +        EQUIV(!copied, hdr->b_freeze_cksum == NULL);
1891 2333  
1892 2334          return (copied);
1893 2335  }
1894 2336  
1895 2337  /*
1896 2338   * Given a buf that has a data buffer attached to it, this function will
1897 2339   * efficiently fill the buf with data of the specified compression setting from
1898 2340   * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
1899 2341   * are already sharing a data buf, no copy is performed.
1900 2342   *
↓ open down ↓ 58 lines elided ↑ open up ↑
1959 2401                   */
1960 2402                  buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
1961 2403  
1962 2404                  /*
1963 2405                   * Try copying the data from another buf which already has a
1964 2406                   * decompressed version. If that's not possible, it's time to
1965 2407                   * bite the bullet and decompress the data from the hdr.
1966 2408                   */
1967 2409                  if (arc_buf_try_copy_decompressed_data(buf)) {
1968 2410                          /* Skip byteswapping and checksumming (already done) */
1969      -                        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
     2411 +                        ASSERT3P(hdr->b_freeze_cksum, !=, NULL);
1970 2412                          return (0);
1971 2413                  } else {
1972 2414                          int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
1973 2415                              hdr->b_l1hdr.b_pabd, buf->b_data,
1974 2416                              HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1975 2417  
1976 2418                          /*
1977 2419                           * Absent hardware errors or software bugs, this should
1978 2420                           * be impossible, but log it anyway so we can debug it.
1979 2421                           */
↓ open down ↓ 242 lines elided ↑ open up ↑
2222 2664                          if (GHOST_STATE(new_state)) {
2223 2665                                  ASSERT0(bufcnt);
2224 2666                                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2225 2667                                  update_new = B_TRUE;
2226 2668                          }
2227 2669                          arc_evictable_space_increment(hdr, new_state);
2228 2670                  }
2229 2671          }
2230 2672  
2231 2673          ASSERT(!HDR_EMPTY(hdr));
2232      -        if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
     2674 +        if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) {
     2675 +                arc_wait_for_krrp(hdr);
2233 2676                  buf_hash_remove(hdr);
     2677 +        }
2234 2678  
2235 2679          /* adjust state sizes (ignore arc_l2c_only) */
2236 2680  
2237 2681          if (update_new && new_state != arc_l2c_only) {
2238 2682                  ASSERT(HDR_HAS_L1HDR(hdr));
2239 2683                  if (GHOST_STATE(new_state)) {
2240 2684                          ASSERT0(bufcnt);
2241 2685  
2242 2686                          /*
2243 2687                           * When moving a header to a ghost state, we first
↓ open down ↓ 92 lines elided ↑ open up ↑
2336 2780                  }
2337 2781          }
2338 2782  
2339 2783          if (HDR_HAS_L1HDR(hdr))
2340 2784                  hdr->b_l1hdr.b_state = new_state;
2341 2785  
2342 2786          /*
2343 2787           * L2 headers should never be on the L2 state list since they don't
2344 2788           * have L1 headers allocated.
2345 2789           */
2346      -        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
2347      -            multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
     2790 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]));
     2791 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
     2792 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DDT]));
2348 2793  }
2349 2794  
2350 2795  void
2351 2796  arc_space_consume(uint64_t space, arc_space_type_t type)
2352 2797  {
2353 2798          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2354 2799  
2355 2800          switch (type) {
2356 2801          case ARC_SPACE_DATA:
2357      -                aggsum_add(&astat_data_size, space);
     2802 +                ARCSTAT_INCR(arcstat_data_size, space);
2358 2803                  break;
2359 2804          case ARC_SPACE_META:
2360      -                aggsum_add(&astat_metadata_size, space);
     2805 +                ARCSTAT_INCR(arcstat_metadata_size, space);
2361 2806                  break;
     2807 +        case ARC_SPACE_DDT:
     2808 +                ARCSTAT_INCR(arcstat_ddt_size, space);
     2809 +                break;
2362 2810          case ARC_SPACE_OTHER:
2363      -                aggsum_add(&astat_other_size, space);
     2811 +                ARCSTAT_INCR(arcstat_other_size, space);
2364 2812                  break;
2365 2813          case ARC_SPACE_HDRS:
2366      -                aggsum_add(&astat_hdr_size, space);
     2814 +                ARCSTAT_INCR(arcstat_hdr_size, space);
2367 2815                  break;
2368 2816          case ARC_SPACE_L2HDRS:
2369      -                aggsum_add(&astat_l2_hdr_size, space);
     2817 +                ARCSTAT_INCR(arcstat_l2_hdr_size, space);
2370 2818                  break;
2371 2819          }
2372 2820  
2373      -        if (type != ARC_SPACE_DATA)
2374      -                aggsum_add(&arc_meta_used, space);
     2821 +        if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT)
     2822 +                ARCSTAT_INCR(arcstat_meta_used, space);
2375 2823  
2376      -        aggsum_add(&arc_size, space);
     2824 +        atomic_add_64(&arc_size, space);
2377 2825  }
2378 2826  
2379 2827  void
2380 2828  arc_space_return(uint64_t space, arc_space_type_t type)
2381 2829  {
2382 2830          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2383 2831  
2384 2832          switch (type) {
2385 2833          case ARC_SPACE_DATA:
2386      -                aggsum_add(&astat_data_size, -space);
     2834 +                ARCSTAT_INCR(arcstat_data_size, -space);
2387 2835                  break;
2388 2836          case ARC_SPACE_META:
2389      -                aggsum_add(&astat_metadata_size, -space);
     2837 +                ARCSTAT_INCR(arcstat_metadata_size, -space);
2390 2838                  break;
     2839 +        case ARC_SPACE_DDT:
     2840 +                ARCSTAT_INCR(arcstat_ddt_size, -space);
     2841 +                break;
2391 2842          case ARC_SPACE_OTHER:
2392      -                aggsum_add(&astat_other_size, -space);
     2843 +                ARCSTAT_INCR(arcstat_other_size, -space);
2393 2844                  break;
2394 2845          case ARC_SPACE_HDRS:
2395      -                aggsum_add(&astat_hdr_size, -space);
     2846 +                ARCSTAT_INCR(arcstat_hdr_size, -space);
2396 2847                  break;
2397 2848          case ARC_SPACE_L2HDRS:
2398      -                aggsum_add(&astat_l2_hdr_size, -space);
     2849 +                ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
2399 2850                  break;
2400 2851          }
2401 2852  
2402      -        if (type != ARC_SPACE_DATA) {
2403      -                ASSERT(aggsum_compare(&arc_meta_used, space) >= 0);
2404      -                /*
2405      -                 * We use the upper bound here rather than the precise value
2406      -                 * because the arc_meta_max value doesn't need to be
2407      -                 * precise. It's only consumed by humans via arcstats.
2408      -                 */
2409      -                if (arc_meta_max < aggsum_upper_bound(&arc_meta_used))
2410      -                        arc_meta_max = aggsum_upper_bound(&arc_meta_used);
2411      -                aggsum_add(&arc_meta_used, -space);
     2853 +        if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT) {
     2854 +                ASSERT(arc_meta_used >= space);
     2855 +                if (arc_meta_max < arc_meta_used)
     2856 +                        arc_meta_max = arc_meta_used;
     2857 +                ARCSTAT_INCR(arcstat_meta_used, -space);
2412 2858          }
2413 2859  
2414      -        ASSERT(aggsum_compare(&arc_size, space) >= 0);
2415      -        aggsum_add(&arc_size, -space);
     2860 +        ASSERT(arc_size >= space);
     2861 +        atomic_add_64(&arc_size, -space);
2416 2862  }
2417 2863  
2418 2864  /*
2419 2865   * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2420 2866   * with the hdr's b_pabd.
2421 2867   */
2422 2868  static boolean_t
2423 2869  arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2424 2870  {
2425 2871          /*
↓ open down ↓ 33 lines elided ↑ open up ↑
2459 2905   */
2460 2906  static int
2461 2907  arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2462 2908      boolean_t fill, arc_buf_t **ret)
2463 2909  {
2464 2910          arc_buf_t *buf;
2465 2911  
2466 2912          ASSERT(HDR_HAS_L1HDR(hdr));
2467 2913          ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2468 2914          VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2469      -            hdr->b_type == ARC_BUFC_METADATA);
     2915 +            hdr->b_type == ARC_BUFC_METADATA ||
     2916 +            hdr->b_type == ARC_BUFC_DDT);
2470 2917          ASSERT3P(ret, !=, NULL);
2471 2918          ASSERT3P(*ret, ==, NULL);
2472 2919  
2473 2920          buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2474 2921          buf->b_hdr = hdr;
2475 2922          buf->b_data = NULL;
2476 2923          buf->b_next = hdr->b_l1hdr.b_buf;
2477 2924          buf->b_flags = 0;
2478 2925  
2479 2926          add_reference(hdr, tag);
↓ open down ↓ 59 lines elided ↑ open up ↑
2539 2986  static inline void
2540 2987  arc_loaned_bytes_update(int64_t delta)
2541 2988  {
2542 2989          atomic_add_64(&arc_loaned_bytes, delta);
2543 2990  
2544 2991          /* assert that it did not wrap around */
2545 2992          ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2546 2993  }
2547 2994  
2548 2995  /*
     2996 + * Allocates an ARC buf header that's in an evicted & L2-cached state.
     2997 + * This is used during l2arc reconstruction to make empty ARC buffers
     2998 + * which circumvent the regular disk->arc->l2arc path and instead come
     2999 + * into being in the reverse order, i.e. l2arc->arc.
     3000 + */
     3001 +static arc_buf_hdr_t *
     3002 +arc_buf_alloc_l2only(uint64_t load_guid, arc_buf_contents_t type,
     3003 +    l2arc_dev_t *dev, dva_t dva, uint64_t daddr, uint64_t lsize,
     3004 +    uint64_t psize, uint64_t birth, zio_cksum_t cksum, int checksum_type,
     3005 +    enum zio_compress compress, boolean_t arc_compress)
     3006 +{
     3007 +        arc_buf_hdr_t *hdr;
     3008 +
     3009 +        if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
     3010 +                type = ARC_BUFC_METADATA;
     3011 +
     3012 +        ASSERT(lsize != 0);
     3013 +        hdr = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
     3014 +        ASSERT(HDR_EMPTY(hdr));
     3015 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
     3016 +
     3017 +        hdr->b_spa = load_guid;
     3018 +        hdr->b_type = type;
     3019 +        hdr->b_flags = 0;
     3020 +
     3021 +        if (arc_compress)
     3022 +                arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
     3023 +        else
     3024 +                arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
     3025 +
     3026 +        HDR_SET_COMPRESS(hdr, compress);
     3027 +
     3028 +        arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
     3029 +        hdr->b_dva = dva;
     3030 +        hdr->b_birth = birth;
     3031 +        if (checksum_type != ZIO_CHECKSUM_OFF) {
     3032 +                hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
     3033 +                bcopy(&cksum, hdr->b_freeze_cksum, sizeof (cksum));
     3034 +        }
     3035 +
     3036 +        HDR_SET_PSIZE(hdr, psize);
     3037 +        HDR_SET_LSIZE(hdr, lsize);
     3038 +
     3039 +        hdr->b_l2hdr.b_dev = dev;
     3040 +        hdr->b_l2hdr.b_daddr = daddr;
     3041 +
     3042 +        return (hdr);
     3043 +}
     3044 +
     3045 +/*
2549 3046   * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
2550 3047   * flight data by arc_tempreserve_space() until they are "returned". Loaned
2551 3048   * buffers must be returned to the arc before they can be used by the DMU or
2552 3049   * freed.
2553 3050   */
2554 3051  arc_buf_t *
2555 3052  arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
2556 3053  {
2557 3054          arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
2558 3055              is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
↓ open down ↓ 68 lines elided ↑ open up ↑
2627 3124  
2628 3125          /* protected by hash lock, if in the hash table */
2629 3126          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
2630 3127                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2631 3128                  ASSERT(state != arc_anon && state != arc_l2c_only);
2632 3129  
2633 3130                  (void) refcount_remove_many(&state->arcs_esize[type],
2634 3131                      size, hdr);
2635 3132          }
2636 3133          (void) refcount_remove_many(&state->arcs_size, size, hdr);
2637      -        if (type == ARC_BUFC_METADATA) {
     3134 +        if (type == ARC_BUFC_DDT) {
     3135 +                arc_space_return(size, ARC_SPACE_DDT);
     3136 +        } else if (type == ARC_BUFC_METADATA) {
2638 3137                  arc_space_return(size, ARC_SPACE_META);
2639 3138          } else {
2640 3139                  ASSERT(type == ARC_BUFC_DATA);
2641 3140                  arc_space_return(size, ARC_SPACE_DATA);
2642 3141          }
2643 3142  
2644 3143          l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
2645 3144  }
2646 3145  
2647 3146  /*
↓ open down ↓ 11 lines elided ↑ open up ↑
2659 3158          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2660 3159  
2661 3160          /*
2662 3161           * Start sharing the data buffer. We transfer the
2663 3162           * refcount ownership to the hdr since it always owns
2664 3163           * the refcount whenever an arc_buf_t is shared.
2665 3164           */
2666 3165          refcount_transfer_ownership(&state->arcs_size, buf, hdr);
2667 3166          hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
2668 3167          abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
2669      -            HDR_ISTYPE_METADATA(hdr));
     3168 +            !HDR_ISTYPE_DATA(hdr));
2670 3169          arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2671 3170          buf->b_flags |= ARC_BUF_FLAG_SHARED;
2672 3171  
2673 3172          /*
2674 3173           * Since we've transferred ownership to the hdr we need
2675 3174           * to increment its compressed and uncompressed kstats and
2676 3175           * decrement the overhead size.
2677 3176           */
2678 3177          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2679 3178          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
↓ open down ↓ 172 lines elided ↑ open up ↑
2852 3351          ASSERT(HDR_HAS_L1HDR(hdr));
2853 3352          ASSERT(!HDR_SHARED_DATA(hdr));
2854 3353  
2855 3354          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2856 3355          hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
2857 3356          hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2858 3357          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2859 3358  
2860 3359          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2861 3360          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
     3361 +        arc_update_hit_stat(hdr, B_TRUE);
2862 3362  }
2863 3363  
2864 3364  static void
2865 3365  arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
2866 3366  {
2867 3367          ASSERT(HDR_HAS_L1HDR(hdr));
2868 3368          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2869 3369  
2870 3370          /*
2871 3371           * If the hdr is currently being written to the l2arc then
↓ open down ↓ 14 lines elided ↑ open up ↑
2886 3386          ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2887 3387          ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2888 3388  }
2889 3389  
2890 3390  static arc_buf_hdr_t *
2891 3391  arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
2892 3392      enum zio_compress compression_type, arc_buf_contents_t type)
2893 3393  {
2894 3394          arc_buf_hdr_t *hdr;
2895 3395  
2896      -        VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
     3396 +        ASSERT3U(lsize, >, 0);
2897 3397  
     3398 +        if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
     3399 +                type = ARC_BUFC_METADATA;
     3400 +        VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA ||
     3401 +            type == ARC_BUFC_DDT);
     3402 +
2898 3403          hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
2899 3404          ASSERT(HDR_EMPTY(hdr));
2900      -        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3405 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2901 3406          ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
2902 3407          HDR_SET_PSIZE(hdr, psize);
2903 3408          HDR_SET_LSIZE(hdr, lsize);
2904 3409          hdr->b_spa = spa;
2905 3410          hdr->b_type = type;
2906 3411          hdr->b_flags = 0;
2907 3412          arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
2908 3413          arc_hdr_set_compress(hdr, compression_type);
2909 3414  
2910 3415          hdr->b_l1hdr.b_state = arc_anon;
↓ open down ↓ 44 lines elided ↑ open up ↑
2955 3460                   * header has just come out of L2ARC, so we set its state to
2956 3461                   * l2c_only even though it's about to change.
2957 3462                   */
2958 3463                  nhdr->b_l1hdr.b_state = arc_l2c_only;
2959 3464  
2960 3465                  /* Verify previous threads set to NULL before freeing */
2961 3466                  ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
2962 3467          } else {
2963 3468                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2964 3469                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
2965      -                ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3470 +                ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2966 3471  
2967 3472                  /*
2968 3473                   * If we've reached here, we must have been called from
2969 3474                   * arc_evict_hdr(), as such we should have already been
2970 3475                   * removed from any ghost list we were previously on
2971 3476                   * (which protects us from racing with arc_evict_state),
2972 3477                   * thus no locking is needed during this check.
2973 3478                   */
2974 3479                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
2975 3480  
↓ open down ↓ 84 lines elided ↑ open up ↑
3060 3565          ASSERT(compression_type > ZIO_COMPRESS_OFF);
3061 3566          ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3062 3567  
3063 3568          arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3064 3569              compression_type, ARC_BUFC_DATA);
3065 3570          ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3066 3571  
3067 3572          arc_buf_t *buf = NULL;
3068 3573          VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3069 3574          arc_buf_thaw(buf);
3070      -        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3575 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3071 3576  
3072 3577          if (!arc_buf_is_shared(buf)) {
3073 3578                  /*
3074 3579                   * To ensure that the hdr has the correct data in it if we call
3075 3580                   * arc_decompress() on this buf before it's been written to
3076 3581                   * disk, it's easiest if we just set up sharing between the
3077 3582                   * buf and the hdr.
3078 3583                   */
3079 3584                  ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3080 3585                  arc_hdr_free_pabd(hdr);
↓ open down ↓ 11 lines elided ↑ open up ↑
3092 3597          uint64_t psize = arc_hdr_size(hdr);
3093 3598  
3094 3599          ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3095 3600          ASSERT(HDR_HAS_L2HDR(hdr));
3096 3601  
3097 3602          list_remove(&dev->l2ad_buflist, hdr);
3098 3603  
3099 3604          ARCSTAT_INCR(arcstat_l2_psize, -psize);
3100 3605          ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3101 3606  
3102      -        vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
     3607 +        /*
     3608 +         * l2ad_vdev can be NULL here if the device was evicted asynchronously.
     3609 +         */
     3610 +        if (dev->l2ad_vdev != NULL)
     3611 +                vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3103 3612  
3104 3613          (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3105 3614          arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3106 3615  }
3107 3616  
3108 3617  static void
3109 3618  arc_hdr_destroy(arc_buf_hdr_t *hdr)
3110 3619  {
3111 3620          if (HDR_HAS_L1HDR(hdr)) {
3112 3621                  ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3113 3622                      hdr->b_l1hdr.b_bufcnt > 0);
3114 3623                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3115 3624                  ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3116 3625          }
3117 3626          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3118 3627          ASSERT(!HDR_IN_HASH_TABLE(hdr));
3119 3628  
3120      -        if (!HDR_EMPTY(hdr))
3121      -                buf_discard_identity(hdr);
3122      -
3123 3629          if (HDR_HAS_L2HDR(hdr)) {
3124 3630                  l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3125 3631                  boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3126 3632  
     3633 +                /* To avoid racing with L2ARC the header needs to be locked */
     3634 +                ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
     3635 +
3127 3636                  if (!buflist_held)
3128 3637                          mutex_enter(&dev->l2ad_mtx);
3129 3638  
3130 3639                  /*
     3640 +                 * The L2ARC buflist is now held, so it is safe to discard
     3641 +                 * the identity. Discarding it without the buflist held could
     3642 +                 * let L2ARC pick the wrong mutex for this hdr and panic,
     3643 +                 * because the hash mutex is selected based on the identity.
     3644 +                 */
     3645 +                if (!HDR_EMPTY(hdr))
     3646 +                        buf_discard_identity(hdr);
     3647 +
     3648 +                /*
3131 3649                   * Even though we checked this conditional above, we
3132 3650                   * need to check this again now that we have the
3133 3651                   * l2ad_mtx. This is because we could be racing with
3134 3652                   * another thread calling l2arc_evict() which might have
3135 3653                   * destroyed this header's L2 portion as we were waiting
3136 3654                   * to acquire the l2ad_mtx. If that happens, we don't
3137 3655                   * want to re-destroy the header's L2 portion.
3138 3656                   */
3139 3657                  if (HDR_HAS_L2HDR(hdr))
3140 3658                          arc_hdr_l2hdr_destroy(hdr);
3141 3659  
3142 3660                  if (!buflist_held)
3143 3661                          mutex_exit(&dev->l2ad_mtx);
3144 3662          }
3145 3663  
     3664 +        if (!HDR_EMPTY(hdr))
     3665 +                buf_discard_identity(hdr);
     3666 +
3146 3667          if (HDR_HAS_L1HDR(hdr)) {
3147 3668                  arc_cksum_free(hdr);
3148 3669  
3149 3670                  while (hdr->b_l1hdr.b_buf != NULL)
3150 3671                          arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3151 3672  
3152 3673  #ifdef ZFS_DEBUG
3153 3674                  if (hdr->b_l1hdr.b_thawed != NULL) {
3154 3675                          kmem_free(hdr->b_l1hdr.b_thawed, 1);
3155 3676                          hdr->b_l1hdr.b_thawed = NULL;
↓ open down ↓ 55 lines elided ↑ open up ↑
3211 3732   */
3212 3733  static int64_t
3213 3734  arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3214 3735  {
3215 3736          arc_state_t *evicted_state, *state;
3216 3737          int64_t bytes_evicted = 0;
3217 3738  
3218 3739          ASSERT(MUTEX_HELD(hash_lock));
3219 3740          ASSERT(HDR_HAS_L1HDR(hdr));
3220 3741  
     3742 +        arc_wait_for_krrp(hdr);
     3743 +
3221 3744          state = hdr->b_l1hdr.b_state;
3222 3745          if (GHOST_STATE(state)) {
3223 3746                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3224 3747                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3225 3748  
3226 3749                  /*
3227 3750                   * l2arc_write_buffers() relies on a header's L1 portion
3228 3751                   * (i.e. its b_pabd field) during its write phase.
3229 3752                   * Thus, we cannot push a header onto the arc_l2c_only
3230 3753                   * state (removing its L1 piece) until the header is
↓ open down ↓ 368 lines elided ↑ open up ↑
3599 4122  
3600 4123          if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
3601 4124                  delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
3602 4125                  return (arc_evict_state(state, spa, delta, type));
3603 4126          }
3604 4127  
3605 4128          return (0);
3606 4129  }
3607 4130  
3608 4131  /*
3609      - * Evict metadata buffers from the cache, such that arc_meta_used is
3610      - * capped by the arc_meta_limit tunable.
     4132 + * Depending on the adjust_ddt argument, evict either DDT (B_TRUE) or
     4133 + * metadata (B_FALSE) buffers from the cache, such that arc_ddt_size is
     4134 + * capped by the arc_ddt_limit tunable or arc_meta_used is capped by the
     4135 + * arc_meta_limit tunable, respectively.
3611 4136   */
3612 4137  static uint64_t
3613      -arc_adjust_meta(uint64_t meta_used)
     4138 +arc_adjust_meta_or_ddt(boolean_t adjust_ddt)
3614 4139  {
3615 4140          uint64_t total_evicted = 0;
3616      -        int64_t target;
     4141 +        int64_t target, over_limit;
     4142 +        arc_buf_contents_t type;
3617 4143  
     4144 +        if (adjust_ddt) {
     4145 +                over_limit = arc_ddt_size - arc_ddt_limit;
     4146 +                type = ARC_BUFC_DDT;
     4147 +        } else {
     4148 +                over_limit = arc_meta_used - arc_meta_limit;
     4149 +                type = ARC_BUFC_METADATA;
     4150 +        }
     4151 +
3618 4152          /*
3619      -         * If we're over the meta limit, we want to evict enough
3620      -         * metadata to get back under the meta limit. We don't want to
     4153 +         * If we're over the limit, we want to evict enough
     4154 +         * to get back under the limit. We don't want to
3621 4155           * evict so much that we drop the MRU below arc_p, though. If
3622 4156           * we're over the limit more than we're over arc_p, we
3623 4157           * evict some from the MRU here, and some from the MFU below.
3624 4158           */
3625      -        target = MIN((int64_t)(meta_used - arc_meta_limit),
     4159 +        target = MIN(over_limit,
3626 4160              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3627 4161              refcount_count(&arc_mru->arcs_size) - arc_p));
3628 4162  
3629      -        total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4163 +        total_evicted += arc_adjust_impl(arc_mru, 0, target, type);
3630 4164  
     4165 +        over_limit = adjust_ddt ? arc_ddt_size - arc_ddt_limit :
     4166 +            arc_meta_used - arc_meta_limit;
     4167 +
3631 4168          /*
3632 4169           * Similar to the above, we want to evict enough bytes to get us
3633 4170           * below the limit, but not so much as to drop us below the
3634 4171           * space allotted to the MFU (which is defined as arc_c - arc_p).
3635 4172           */
3636      -        target = MIN((int64_t)(meta_used - arc_meta_limit),
3637      -            (int64_t)(refcount_count(&arc_mfu->arcs_size) -
3638      -            (arc_c - arc_p)));
     4173 +        target = MIN(over_limit,
     4174 +            (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
3639 4175  
3640      -        total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4176 +        total_evicted += arc_adjust_impl(arc_mfu, 0, target, type);
3641 4177  
3642 4178          return (total_evicted);
3643 4179  }
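To make the two eviction passes concrete (numbers are purely illustrative): suppose adjust_ddt is B_TRUE, arc_ddt_size exceeds arc_ddt_limit by 512 MB, and the anonymous plus MRU lists exceed arc_p by only 128 MB. The first pass then targets MIN(512 MB, 128 MB) = 128 MB of DDT buffers from the MRU; after over_limit is recomputed, the second pass targets the remaining overage (up to 384 MB) of DDT buffers from the MFU, bounded by how far the MFU is above arc_c - arc_p.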
3644 4180  
3645 4181  /*
3646 4182   * Return the type of the oldest buffer in the given arc state
3647 4183   *
3648      - * This function will select a random sublist of type ARC_BUFC_DATA and
3649      - * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
     4184 + * This function will select one random sublist of each type (ARC_BUFC_DATA,
     4185 + * ARC_BUFC_METADATA, and ARC_BUFC_DDT). The tail of each sublist
3650 4186   * is compared, and the type which contains the "older" buffer will be
3651 4187   * returned.
3652 4188   */
3653 4189  static arc_buf_contents_t
3654 4190  arc_adjust_type(arc_state_t *state)
3655 4191  {
3656 4192          multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
3657 4193          multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
     4194 +        multilist_t *ddt_ml = state->arcs_list[ARC_BUFC_DDT];
3658 4195          int data_idx = multilist_get_random_index(data_ml);
3659 4196          int meta_idx = multilist_get_random_index(meta_ml);
     4197 +        int ddt_idx = multilist_get_random_index(ddt_ml);
3660 4198          multilist_sublist_t *data_mls;
3661 4199          multilist_sublist_t *meta_mls;
3662      -        arc_buf_contents_t type;
     4200 +        multilist_sublist_t *ddt_mls;
     4201 +        arc_buf_contents_t type = ARC_BUFC_DATA; /* silence compiler warning */
3663 4202          arc_buf_hdr_t *data_hdr;
3664 4203          arc_buf_hdr_t *meta_hdr;
     4204 +        arc_buf_hdr_t *ddt_hdr;
     4205 +        clock_t oldest;
3665 4206  
3666 4207          /*
3667 4208           * We keep the sublist lock until we're finished, to prevent
3668 4209           * the headers from being destroyed via arc_evict_state().
3669 4210           */
3670 4211          data_mls = multilist_sublist_lock(data_ml, data_idx);
3671 4212          meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
     4213 +        ddt_mls = multilist_sublist_lock(ddt_ml, ddt_idx);
3672 4214  
3673 4215          /*
3674 4216           * These loops are to ensure we skip any markers that
3675 4217           * might be at the tail of the lists due to arc_evict_state().
3676 4218           */
3677 4219  
3678 4220          for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
3679 4221              data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
3680 4222                  if (data_hdr->b_spa != 0)
3681 4223                          break;
3682 4224          }
3683 4225  
3684 4226          for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
3685 4227              meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
3686 4228                  if (meta_hdr->b_spa != 0)
3687 4229                          break;
3688 4230          }
3689 4231  
3690      -        if (data_hdr == NULL && meta_hdr == NULL) {
     4232 +        for (ddt_hdr = multilist_sublist_tail(ddt_mls); ddt_hdr != NULL;
     4233 +            ddt_hdr = multilist_sublist_prev(ddt_mls, ddt_hdr)) {
     4234 +                if (ddt_hdr->b_spa != 0)
     4235 +                        break;
     4236 +        }
     4237 +
     4238 +        if (data_hdr == NULL && meta_hdr == NULL && ddt_hdr == NULL) {
3691 4239                  type = ARC_BUFC_DATA;
3692      -        } else if (data_hdr == NULL) {
     4240 +        } else if (data_hdr != NULL && meta_hdr != NULL && ddt_hdr != NULL) {
     4241 +                /* The headers can't be on the sublist without an L1 header */
     4242 +                ASSERT(HDR_HAS_L1HDR(data_hdr));
     4243 +                ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4244 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
     4245 +
     4246 +                oldest = data_hdr->b_l1hdr.b_arc_access;
     4247 +                type = ARC_BUFC_DATA;
     4248 +                if (oldest > meta_hdr->b_l1hdr.b_arc_access) {
     4249 +                        oldest = meta_hdr->b_l1hdr.b_arc_access;
     4250 +                        type = ARC_BUFC_METADATA;
     4251 +                }
     4252 +                if (oldest > ddt_hdr->b_l1hdr.b_arc_access) {
     4253 +                        type = ARC_BUFC_DDT;
     4254 +                }
     4255 +        } else if (data_hdr == NULL && ddt_hdr == NULL) {
3693 4256                  ASSERT3P(meta_hdr, !=, NULL);
3694 4257                  type = ARC_BUFC_METADATA;
3695      -        } else if (meta_hdr == NULL) {
     4258 +        } else if (meta_hdr == NULL && ddt_hdr == NULL) {
3696 4259                  ASSERT3P(data_hdr, !=, NULL);
3697 4260                  type = ARC_BUFC_DATA;
3698      -        } else {
3699      -                ASSERT3P(data_hdr, !=, NULL);
3700      -                ASSERT3P(meta_hdr, !=, NULL);
     4261 +        } else if (meta_hdr == NULL && data_hdr == NULL) {
     4262 +                ASSERT3P(ddt_hdr, !=, NULL);
     4263 +                type = ARC_BUFC_DDT;
     4264 +        } else if (data_hdr != NULL && ddt_hdr != NULL) {
     4265 +                ASSERT3P(meta_hdr, ==, NULL);
3701 4266  
3702 4267                  /* The headers can't be on the sublist without an L1 header */
3703 4268                  ASSERT(HDR_HAS_L1HDR(data_hdr));
     4269 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
     4270 +
     4271 +                if (data_hdr->b_l1hdr.b_arc_access <
     4272 +                    ddt_hdr->b_l1hdr.b_arc_access) {
     4273 +                        type = ARC_BUFC_DATA;
     4274 +                } else {
     4275 +                        type = ARC_BUFC_DDT;
     4276 +                }
     4277 +        } else if (meta_hdr != NULL && ddt_hdr != NULL) {
     4278 +                ASSERT3P(data_hdr, ==, NULL);
     4279 +
     4280 +                /* The headers can't be on the sublist without an L1 header */
3704 4281                  ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4282 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
3705 4283  
     4284 +                if (meta_hdr->b_l1hdr.b_arc_access <
     4285 +                    ddt_hdr->b_l1hdr.b_arc_access) {
     4286 +                        type = ARC_BUFC_METADATA;
     4287 +                } else {
     4288 +                        type = ARC_BUFC_DDT;
     4289 +                }
     4290 +        } else if (meta_hdr != NULL && data_hdr != NULL) {
     4291 +                ASSERT3P(ddt_hdr, ==, NULL);
     4292 +
     4293 +                /* The headers can't be on the sublist without an L1 header */
     4294 +                ASSERT(HDR_HAS_L1HDR(data_hdr));
     4295 +                ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4296 +
3706 4297                  if (data_hdr->b_l1hdr.b_arc_access <
3707 4298                      meta_hdr->b_l1hdr.b_arc_access) {
3708 4299                          type = ARC_BUFC_DATA;
3709 4300                  } else {
3710 4301                          type = ARC_BUFC_METADATA;
3711 4302                  }
     4303 +        } else {
     4304 +                /* should never get here */
     4305 +                ASSERT(0);
3712 4306          }
3713 4307  
     4308 +        multilist_sublist_unlock(ddt_mls);
3714 4309          multilist_sublist_unlock(meta_mls);
3715 4310          multilist_sublist_unlock(data_mls);
3716 4311  
3717 4312          return (type);
3718 4313  }
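Purely as a reading aid, the three-way selection above is equivalent to keeping the minimum b_arc_access over the non-NULL tails; a compact sketch (not the committed code) of that selection:

        /* Illustrative only: equivalent "pick the oldest tail" selection. */
        arc_buf_hdr_t *tails[3] = { data_hdr, meta_hdr, ddt_hdr };
        arc_buf_contents_t types[3] =
            { ARC_BUFC_DATA, ARC_BUFC_METADATA, ARC_BUFC_DDT };
        boolean_t found = B_FALSE;
        clock_t oldest_access = 0;

        type = ARC_BUFC_DATA;           /* default when all tails are NULL */
        for (int i = 0; i < 3; i++) {
                if (tails[i] == NULL)
                        continue;
                /* Headers can't be on the sublist without an L1 header. */
                ASSERT(HDR_HAS_L1HDR(tails[i]));
                if (!found ||
                    tails[i]->b_l1hdr.b_arc_access < oldest_access) {
                        found = B_TRUE;
                        oldest_access = tails[i]->b_l1hdr.b_arc_access;
                        type = types[i];
                }
        }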
3719 4314  
3720 4315  /*
3721 4316   * Evict buffers from the cache, such that arc_size is capped by arc_c.
3722 4317   */
3723 4318  static uint64_t
3724 4319  arc_adjust(void)
3725 4320  {
3726 4321          uint64_t total_evicted = 0;
3727 4322          uint64_t bytes;
3728 4323          int64_t target;
3729      -        uint64_t asize = aggsum_value(&arc_size);
3730      -        uint64_t ameta = aggsum_value(&arc_meta_used);
3731 4324  
3732 4325          /*
3733 4326           * If we're over arc_meta_limit, we want to correct that before
3734 4327           * potentially evicting data buffers below.
3735 4328           */
3736      -        total_evicted += arc_adjust_meta(ameta);
     4329 +        total_evicted += arc_adjust_meta_or_ddt(B_FALSE);
3737 4330  
3738 4331          /*
     4332 +         * If we're over arc_ddt_limit, we want to correct that before
     4333 +         * potentially evicting data buffers below.
     4334 +         */
     4335 +        total_evicted += arc_adjust_meta_or_ddt(B_TRUE);
     4336 +
     4337 +        /*
3739 4338           * Adjust MRU size
3740 4339           *
3741 4340           * If we're over the target cache size, we want to evict enough
3742 4341           * from the list to get back to our target size. We don't want
3743 4342           * to evict too much from the MRU, such that it drops below
3744 4343           * arc_p. So, if we're over our target cache size more than
3745 4344           * the MRU is over arc_p, we'll evict enough to get back to
3746 4345           * arc_p here, and then evict more from the MFU below.
3747 4346           */
3748      -        target = MIN((int64_t)(asize - arc_c),
     4347 +        target = MIN((int64_t)(arc_size - arc_c),
3749 4348              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3750      -            refcount_count(&arc_mru->arcs_size) + ameta - arc_p));
     4349 +            refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
3751 4350  
3752 4351          /*
3753 4352           * If we're below arc_meta_min, always prefer to evict data.
3754 4353           * Otherwise, try to satisfy the requested number of bytes to
3755 4354           * evict from the type which contains older buffers; in an
3756 4355           * effort to keep newer buffers in the cache regardless of their
3757 4356           * type. If we cannot satisfy the number of bytes from this
3758 4357           * type, spill over into the next type.
3759 4358           */
3760 4359          if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
3761      -            ameta > arc_meta_min) {
     4360 +            arc_meta_used > arc_meta_min) {
3762 4361                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3763 4362                  total_evicted += bytes;
3764 4363  
3765 4364                  /*
3766 4365                   * If we couldn't evict our target number of bytes from
3767 4366                   * metadata, we try to get the rest from data.
3768 4367                   */
3769 4368                  target -= bytes;
3770 4369  
3771      -                total_evicted +=
3772      -                    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
     4370 +                bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
     4371 +                total_evicted += bytes;
3773 4372          } else {
3774 4373                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3775 4374                  total_evicted += bytes;
3776 4375  
3777 4376                  /*
3778 4377                   * If we couldn't evict our target number of bytes from
3779 4378                   * data, we try to get the rest from metadata.
3780 4379                   */
3781 4380                  target -= bytes;
3782 4381  
3783      -                total_evicted +=
3784      -                    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4382 +                bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4383 +                total_evicted += bytes;
3785 4384          }
3786 4385  
3787 4386          /*
     4387 +         * If we couldn't evict our target number of bytes from
     4388 +         * data and metadata, we try to get the rest from ddt.
     4389 +         */
     4390 +        target -= bytes;
     4391 +        total_evicted +=
     4392 +            arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DDT);
     4393 +
     4394 +        /*
3788 4395           * Adjust MFU size
3789 4396           *
3790 4397           * Now that we've tried to evict enough from the MRU to get its
3791 4398           * size back to arc_p, if we're still above the target cache
3792 4399           * size, we evict the rest from the MFU.
3793 4400           */
3794      -        target = asize - arc_c;
     4401 +        target = arc_size - arc_c;
3795 4402  
3796 4403          if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
3797      -            ameta > arc_meta_min) {
     4404 +            arc_meta_used > arc_meta_min) {
3798 4405                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3799 4406                  total_evicted += bytes;
3800 4407  
3801 4408                  /*
3802 4409                   * If we couldn't evict our target number of bytes from
3803 4410                   * metadata, we try to get the rest from data.
3804 4411                   */
3805 4412                  target -= bytes;
3806 4413  
3807      -                total_evicted +=
3808      -                    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
     4414 +                bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
     4415 +                total_evicted += bytes;
3809 4416          } else {
3810 4417                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3811 4418                  total_evicted += bytes;
3812 4419  
3813 4420                  /*
3814 4421                   * If we couldn't evict our target number of bytes from
3815 4422                   * data, we try to get the rest from metadata.
3816 4423                   */
3817 4424                  target -= bytes;
3818 4425  
3819      -                total_evicted +=
3820      -                    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4426 +                bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4427 +                total_evicted += bytes;
3821 4428          }
3822 4429  
3823 4430          /*
     4431 +         * If we couldn't evict our target number of bytes from
     4432 +         * data and metadata, we try to get the rest from ddt.
     4433 +         */
     4434 +        target -= bytes;
     4435 +        total_evicted +=
     4436 +            arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DDT);
     4437 +
     4438 +        /*
3824 4439           * Adjust ghost lists
3825 4440           *
3826 4441           * In addition to the above, the ARC also defines target values
3827 4442           * for the ghost lists. The sum of the mru list and mru ghost
3828 4443           * list should never exceed the target size of the cache, and
3829 4444           * the sum of the mru list, mfu list, mru ghost list, and mfu
3830 4445           * ghost list should never exceed twice the target size of the
3831 4446           * cache. The following logic enforces these limits on the ghost
3832 4447           * caches, and evicts from them as needed.
3833 4448           */
3834 4449          target = refcount_count(&arc_mru->arcs_size) +
3835 4450              refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
3836 4451  
3837 4452          bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
3838 4453          total_evicted += bytes;
3839 4454  
3840 4455          target -= bytes;
3841 4456  
     4457 +        bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
     4458 +        total_evicted += bytes;
     4459 +
     4460 +        target -= bytes;
     4461 +
3842 4462          total_evicted +=
3843      -            arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
     4463 +            arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DDT);
3844 4464  
3845 4465          /*
3846 4466           * We assume the sum of the mru list and mfu list is less than
3847 4467           * or equal to arc_c (we enforced this above), which means we
3848 4468           * can use the simpler of the two equations below:
3849 4469           *
3850 4470           *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
3851 4471           *                  mru ghost + mfu ghost <= arc_c
3852 4472           */
3853 4473          target = refcount_count(&arc_mru_ghost->arcs_size) +
3854 4474              refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
3855 4475  
3856 4476          bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
3857 4477          total_evicted += bytes;
3858 4478  
3859 4479          target -= bytes;
3860 4480  
     4481 +        bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
     4482 +        total_evicted += bytes;
     4483 +
     4484 +        target -= bytes;
     4485 +
3861 4486          total_evicted +=
3862      -            arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
     4487 +            arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DDT);
3863 4488  
3864 4489          return (total_evicted);
3865 4490  }
3866 4491  
     4492 +typedef struct arc_async_flush_data {
     4493 +        uint64_t        aaf_guid;
     4494 +        boolean_t       aaf_retry;
     4495 +} arc_async_flush_data_t;
     4496 +
     4497 +static taskq_t *arc_flush_taskq;
     4498 +
     4499 +static void
     4500 +arc_flush_impl(uint64_t guid, boolean_t retry)
     4501 +{
     4502 +        arc_buf_contents_t arcs;
     4503 +
     4504 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     4505 +                (void) arc_flush_state(arc_mru, guid, arcs, retry);
     4506 +                (void) arc_flush_state(arc_mfu, guid, arcs, retry);
     4507 +                (void) arc_flush_state(arc_mru_ghost, guid, arcs, retry);
     4508 +                (void) arc_flush_state(arc_mfu_ghost, guid, arcs, retry);
     4509 +        }
     4510 +}
     4511 +
     4512 +static void
     4513 +arc_flush_task(void *arg)
     4514 +{
     4515 +        arc_async_flush_data_t *aaf = (arc_async_flush_data_t *)arg;
     4516 +        arc_flush_impl(aaf->aaf_guid, aaf->aaf_retry);
     4517 +        kmem_free(aaf, sizeof (arc_async_flush_data_t));
     4518 +}
     4519 +
     4520 +boolean_t zfs_fastflush = B_TRUE;
     4521 +
3867 4522  void
3868 4523  arc_flush(spa_t *spa, boolean_t retry)
3869 4524  {
3870 4525          uint64_t guid = 0;
     4526 +        boolean_t async_flush = (spa != NULL ? zfs_fastflush : B_FALSE);
     4527 +        arc_async_flush_data_t *aaf = NULL;
3871 4528  
3872 4529          /*
3873 4530           * If retry is B_TRUE, a spa must not be specified since we have
3874 4531           * no good way to determine if all of a spa's buffers have been
3875 4532           * evicted from an arc state.
3876 4533           */
3877      -        ASSERT(!retry || spa == 0);
     4534 +        ASSERT(!retry || spa == NULL);
3878 4535  
3879      -        if (spa != NULL)
     4536 +        if (spa != NULL) {
3880 4537                  guid = spa_load_guid(spa);
     4538 +                if (async_flush) {
     4539 +                        aaf = kmem_alloc(sizeof (arc_async_flush_data_t),
     4540 +                            KM_SLEEP);
     4541 +                        aaf->aaf_guid = guid;
     4542 +                        aaf->aaf_retry = retry;
     4543 +                }
     4544 +        }
3881 4545  
3882      -        (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
3883      -        (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
3884      -
3885      -        (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
3886      -        (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
3887      -
3888      -        (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
3889      -        (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
3890      -
3891      -        (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
3892      -        (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
     4546 +        /*
     4547 +         * Try to flush the spa's remaining ARC buffers asynchronously
     4548 +         * while the pool is being closed. An ARC buffer is bound to a
     4549 +         * spa only by its guid, so a buffer can still exist after the
     4550 +         * pool is gone. If dispatching the asynchronous flush fails, we
     4551 +         * fall back to the regular (synchronous) flush.
     4552 +         * NOTE: An asynchronous flush that has not finished by the time
     4553 +         * the pool is imported again is harmless, even if the guid is
     4554 +         * the same before and after export/import: only unreferenced
     4555 +         * buffers are evicted; all others are skipped.
     4556 +         */
     4557 +        if (!async_flush || (taskq_dispatch(arc_flush_taskq, arc_flush_task,
     4558 +            aaf, TQ_NOSLEEP) == NULL)) {
     4559 +                arc_flush_impl(guid, retry);
     4560 +                if (async_flush)
     4561 +                        kmem_free(aaf, sizeof (arc_async_flush_data_t));
     4562 +        }
3893 4563  }
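arc_flush_taskq is assumed to be created and torn down elsewhere in this change (presumably arc_init()/arc_fini(), which are outside this hunk). For reference, a minimal sketch of such setup; the thread count, priority, and flags are illustrative only:

        /*
         * Illustrative sketch only: the actual thread count, priority and
         * flags for arc_flush_taskq are whatever arc_init() in this change
         * uses (the init/fini hunks are not shown here).
         */
        arc_flush_taskq = taskq_create("arc_flush_tq", 1, minclsyspri,
            1, 4, TASKQ_DYNAMIC);

        /* matching teardown, e.g. in arc_fini() */
        taskq_wait(arc_flush_taskq);
        taskq_destroy(arc_flush_taskq);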
3894 4564  
3895 4565  void
3896 4566  arc_shrink(int64_t to_free)
3897 4567  {
3898      -        uint64_t asize = aggsum_value(&arc_size);
3899 4568          if (arc_c > arc_c_min) {
3900 4569  
3901 4570                  if (arc_c > arc_c_min + to_free)
3902 4571                          atomic_add_64(&arc_c, -to_free);
3903 4572                  else
3904 4573                          arc_c = arc_c_min;
3905 4574  
3906 4575                  atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
3907      -                if (asize < arc_c)
3908      -                        arc_c = MAX(asize, arc_c_min);
     4576 +                if (arc_c > arc_size)
     4577 +                        arc_c = MAX(arc_size, arc_c_min);
3909 4578                  if (arc_p > arc_c)
3910 4579                          arc_p = (arc_c >> 1);
3911 4580                  ASSERT(arc_c >= arc_c_min);
3912 4581                  ASSERT((int64_t)arc_p >= 0);
3913 4582          }
3914 4583  
3915      -        if (asize > arc_c)
     4584 +        if (arc_size > arc_c)
3916 4585                  (void) arc_adjust();
3917 4586  }
3918 4587  
3919 4588  typedef enum free_memory_reason_t {
3920 4589          FMR_UNKNOWN,
3921 4590          FMR_NEEDFREE,
3922 4591          FMR_LOTSFREE,
3923 4592          FMR_SWAPFS_MINFREE,
3924 4593          FMR_PAGES_PP_MAXIMUM,
3925 4594          FMR_HEAP_ARENA,
↓ open down ↓ 143 lines elided ↑ open up ↑
4069 4738  {
4070 4739          size_t                  i;
4071 4740          kmem_cache_t            *prev_cache = NULL;
4072 4741          kmem_cache_t            *prev_data_cache = NULL;
4073 4742          extern kmem_cache_t     *zio_buf_cache[];
4074 4743          extern kmem_cache_t     *zio_data_buf_cache[];
4075 4744          extern kmem_cache_t     *range_seg_cache;
4076 4745          extern kmem_cache_t     *abd_chunk_cache;
4077 4746  
4078 4747  #ifdef _KERNEL
4079      -        if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) {
     4748 +        if (arc_meta_used >= arc_meta_limit || arc_ddt_size >= arc_ddt_limit) {
4080 4749                  /*
4081      -                 * We are exceeding our meta-data cache limit.
4082      -                 * Purge some DNLC entries to release holds on meta-data.
     4750 +                 * We are exceeding our meta-data or DDT cache limit.
     4751 +                 * Purge some DNLC entries to release holds on meta-data/DDT.
4083 4752                   */
4084 4753                  dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4085 4754          }
4086 4755  #if defined(__i386)
4087 4756          /*
4088 4757           * Reclaim unused memory from all kmem caches.
4089 4758           */
4090 4759          kmem_reap();
4091 4760  #endif
4092 4761  #endif
↓ open down ↓ 135 lines elided ↑ open up ↑
4228 4897  
4229 4898                  /*
4230 4899                   * If evicted is zero, we couldn't evict anything via
4231 4900                   * arc_adjust(). This could be due to hash lock
4232 4901                   * collisions, but more likely due to the majority of
4233 4902                   * arc buffers being unevictable. Therefore, even if
4234 4903                   * arc_size is above arc_c, another pass is unlikely to
4235 4904                   * be helpful and could potentially cause us to enter an
4236 4905                   * infinite loop.
4237 4906                   */
4238      -                if (aggsum_compare(&arc_size, arc_c) <= 0|| evicted == 0) {
     4907 +                if (arc_size <= arc_c || evicted == 0) {
4239 4908                          /*
4240 4909                           * We're either no longer overflowing, or we
4241 4910                           * can't evict anything more, so we should wake
4242 4911                           * up any threads before we go to sleep.
4243 4912                           */
4244 4913                          cv_broadcast(&arc_reclaim_waiters_cv);
4245 4914  
4246 4915                          /*
4247 4916                           * Block until signaled, or after one second (we
4248 4917                           * might need to perform arc_kmem_reap_now()
↓ open down ↓ 61 lines elided ↑ open up ↑
4310 4979          if (arc_no_grow)
4311 4980                  return;
4312 4981  
4313 4982          if (arc_c >= arc_c_max)
4314 4983                  return;
4315 4984  
4316 4985          /*
4317 4986           * If we're within (2 * maxblocksize) bytes of the target
4318 4987           * cache size, increment the target cache size
4319 4988           */
4320      -        if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) >
4321      -            0) {
     4989 +        if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
4322 4990                  atomic_add_64(&arc_c, (int64_t)bytes);
4323 4991                  if (arc_c > arc_c_max)
4324 4992                          arc_c = arc_c_max;
4325 4993                  else if (state == arc_anon)
4326 4994                          atomic_add_64(&arc_p, (int64_t)bytes);
4327 4995                  if (arc_p > arc_c)
4328 4996                          arc_p = arc_c;
4329 4997          }
4330 4998          ASSERT((int64_t)arc_p >= 0);
4331 4999  }
↓ open down ↓ 2 lines elided ↑ open up ↑
4334 5002   * Check if arc_size has grown past our upper threshold, determined by
4335 5003   * zfs_arc_overflow_shift.
4336 5004   */
4337 5005  static boolean_t
4338 5006  arc_is_overflowing(void)
4339 5007  {
4340 5008          /* Always allow at least one block of overflow */
4341 5009          uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
4342 5010              arc_c >> zfs_arc_overflow_shift);
4343 5011  
4344      -        /*
4345      -         * We just compare the lower bound here for performance reasons. Our
4346      -         * primary goals are to make sure that the arc never grows without
4347      -         * bound, and that it can reach its maximum size. This check
4348      -         * accomplishes both goals. The maximum amount we could run over by is
4349      -         * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
4350      -         * in the ARC. In practice, that's in the tens of MB, which is low
4351      -         * enough to be safe.
4352      -         */
4353      -        return (aggsum_lower_bound(&arc_size) >= arc_c + overflow);
     5012 +        return (arc_size >= arc_c + overflow);
4354 5013  }
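For example, with arc_c = 64 GB and the default zfs_arc_overflow_shift of 8, the allowed slack is MAX(SPA_MAXBLOCKSIZE, 64 GB >> 8) = 256 MB, so arc_is_overflowing() reports overflow once arc_size reaches arc_c + 256 MB.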
4355 5014  
4356 5015  static abd_t *
4357 5016  arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4358 5017  {
4359 5018          arc_buf_contents_t type = arc_buf_type(hdr);
4360 5019  
4361 5020          arc_get_data_impl(hdr, size, tag);
4362      -        if (type == ARC_BUFC_METADATA) {
     5021 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4363 5022                  return (abd_alloc(size, B_TRUE));
4364 5023          } else {
4365 5024                  ASSERT(type == ARC_BUFC_DATA);
4366 5025                  return (abd_alloc(size, B_FALSE));
4367 5026          }
4368 5027  }
4369 5028  
4370 5029  static void *
4371 5030  arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4372 5031  {
4373 5032          arc_buf_contents_t type = arc_buf_type(hdr);
4374 5033  
4375 5034          arc_get_data_impl(hdr, size, tag);
4376      -        if (type == ARC_BUFC_METADATA) {
     5035 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4377 5036                  return (zio_buf_alloc(size));
4378 5037          } else {
4379 5038                  ASSERT(type == ARC_BUFC_DATA);
4380 5039                  return (zio_data_buf_alloc(size));
4381 5040          }
4382 5041  }
4383 5042  
4384 5043  /*
4385 5044   * Allocate a block and return it to the caller. If we are hitting the
4386 5045   * hard limit for the cache size, we must sleep, waiting for the eviction
↓ open down ↓ 38 lines elided ↑ open up ↑
4425 5084                   */
4426 5085                  if (arc_is_overflowing()) {
4427 5086                          cv_signal(&arc_reclaim_thread_cv);
4428 5087                          cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
4429 5088                  }
4430 5089  
4431 5090                  mutex_exit(&arc_reclaim_lock);
4432 5091          }
4433 5092  
4434 5093          VERIFY3U(hdr->b_type, ==, type);
4435      -        if (type == ARC_BUFC_METADATA) {
     5094 +        if (type == ARC_BUFC_DDT) {
     5095 +                arc_space_consume(size, ARC_SPACE_DDT);
     5096 +        } else if (type == ARC_BUFC_METADATA) {
4436 5097                  arc_space_consume(size, ARC_SPACE_META);
4437 5098          } else {
4438 5099                  arc_space_consume(size, ARC_SPACE_DATA);
4439 5100          }
4440 5101  
4441 5102          /*
4442 5103           * Update the state size.  Note that ghost states have a
4443 5104           * "ghost size" and so don't need to be updated.
4444 5105           */
4445 5106          if (!GHOST_STATE(state)) {
↓ open down ↓ 12 lines elided ↑ open up ↑
4458 5119                  if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4459 5120                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4460 5121                          (void) refcount_add_many(&state->arcs_esize[type],
4461 5122                              size, tag);
4462 5123                  }
4463 5124  
4464 5125                  /*
4465 5126                   * If we are growing the cache, and we are adding anonymous
4466 5127                   * data, and we have outgrown arc_p, update arc_p
4467 5128                   */
4468      -                if (aggsum_compare(&arc_size, arc_c) < 0 &&
4469      -                    hdr->b_l1hdr.b_state == arc_anon &&
     5129 +                if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
4470 5130                      (refcount_count(&arc_anon->arcs_size) +
4471 5131                      refcount_count(&arc_mru->arcs_size) > arc_p))
4472 5132                          arc_p = MIN(arc_c, arc_p + size);
4473 5133          }
4474 5134  }
4475 5135  
4476 5136  static void
4477 5137  arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
4478 5138  {
4479 5139          arc_free_data_impl(hdr, size, tag);
4480 5140          abd_free(abd);
4481 5141  }
4482 5142  
4483 5143  static void
4484 5144  arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
4485 5145  {
4486 5146          arc_buf_contents_t type = arc_buf_type(hdr);
4487 5147  
4488 5148          arc_free_data_impl(hdr, size, tag);
4489      -        if (type == ARC_BUFC_METADATA) {
     5149 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4490 5150                  zio_buf_free(buf, size);
4491 5151          } else {
4492 5152                  ASSERT(type == ARC_BUFC_DATA);
4493 5153                  zio_data_buf_free(buf, size);
4494 5154          }
4495 5155  }
4496 5156  
4497 5157  /*
4498 5158   * Free the arc data buffer.
4499 5159   */
↓ open down ↓ 7 lines elided ↑ open up ↑
4507 5167          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4508 5168                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4509 5169                  ASSERT(state != arc_anon && state != arc_l2c_only);
4510 5170  
4511 5171                  (void) refcount_remove_many(&state->arcs_esize[type],
4512 5172                      size, tag);
4513 5173          }
4514 5174          (void) refcount_remove_many(&state->arcs_size, size, tag);
4515 5175  
4516 5176          VERIFY3U(hdr->b_type, ==, type);
4517      -        if (type == ARC_BUFC_METADATA) {
     5177 +        if (type == ARC_BUFC_DDT) {
     5178 +                arc_space_return(size, ARC_SPACE_DDT);
     5179 +        } else if (type == ARC_BUFC_METADATA) {
4518 5180                  arc_space_return(size, ARC_SPACE_META);
4519 5181          } else {
4520 5182                  ASSERT(type == ARC_BUFC_DATA);
4521 5183                  arc_space_return(size, ARC_SPACE_DATA);
4522 5184          }
4523 5185  }
4524 5186  
4525 5187  /*
4526 5188   * This routine is called whenever a buffer is accessed.
4527 5189   * NOTE: the hash lock is dropped in this function.
↓ open down ↓ 125 lines elided ↑ open up ↑
4653 5315                   */
4654 5316  
4655 5317                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4656 5318                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4657 5319                  arc_change_state(arc_mfu, hdr, hash_lock);
4658 5320          } else {
4659 5321                  ASSERT(!"invalid arc state");
4660 5322          }
4661 5323  }
4662 5324  
     5325 +/*
     5326 + * This routine is called by dbuf_hold() to update the arc_access() state
     5327 + * which otherwise would be skipped for entries in the dbuf cache.
     5328 + */
     5329 +void
     5330 +arc_buf_access(arc_buf_t *buf)
     5331 +{
     5332 +        mutex_enter(&buf->b_evict_lock);
     5333 +        arc_buf_hdr_t *hdr = buf->b_hdr;
     5334 +
     5335 +        /*
     5336 +         * Avoid taking the hash_lock when possible as an optimization.
     5337 +         * The header must be checked again under the hash_lock in order
     5338 +         * to handle the case where it is concurrently being released.
     5339 +         */
     5340 +        if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
     5341 +                mutex_exit(&buf->b_evict_lock);
     5342 +                return;
     5343 +        }
     5344 +
     5345 +        kmutex_t *hash_lock = HDR_LOCK(hdr);
     5346 +        mutex_enter(hash_lock);
     5347 +
     5348 +        if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
     5349 +                mutex_exit(hash_lock);
     5350 +                mutex_exit(&buf->b_evict_lock);
     5351 +                ARCSTAT_BUMP(arcstat_access_skip);
     5352 +                return;
     5353 +        }
     5354 +
     5355 +        mutex_exit(&buf->b_evict_lock);
     5356 +
     5357 +        ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
     5358 +            hdr->b_l1hdr.b_state == arc_mfu);
     5359 +
     5360 +        DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
     5361 +        arc_access(hdr, hash_lock);
     5362 +        mutex_exit(hash_lock);
     5363 +
     5364 +        ARCSTAT_BUMP(arcstat_hits);
     5365 +        /*
     5366 +         * Upstream used the ARCSTAT_CONDSTAT macro here, but they changed
     5367 + * the argument format for that macro, which would require that we
     5368 +         * go and modify all other uses of it. So it's easier to just expand
     5369 +         * this one invocation of the macro to do the right thing.
     5370 +         */
     5371 +        if (!HDR_PREFETCH(hdr)) {
     5372 +                if (!HDR_ISTYPE_METADATA(hdr))
     5373 +                        ARCSTAT_BUMP(arcstat_demand_data_hits);
     5374 +                else
     5375 +                        ARCSTAT_BUMP(arcstat_demand_metadata_hits);
     5376 +        } else {
     5377 +                if (!HDR_ISTYPE_METADATA(hdr))
     5378 +                        ARCSTAT_BUMP(arcstat_prefetch_data_hits);
     5379 +                else
     5380 +                        ARCSTAT_BUMP(arcstat_prefetch_metadata_hits);
     5381 +        }
     5382 +}
     5383 +
4663 5384  /* a generic arc_done_func_t which you can use */
4664 5385  /* ARGSUSED */
4665 5386  void
4666 5387  arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
4667 5388  {
4668 5389          if (zio == NULL || zio->io_error == 0)
4669 5390                  bcopy(buf->b_data, arg, arc_buf_size(buf));
4670 5391          arc_buf_destroy(buf, arg);
4671 5392  }
4672 5393  
↓ open down ↓ 122 lines elided ↑ open up ↑
4795 5516  
4796 5517          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
4797 5518              callback_list != NULL);
4798 5519  
4799 5520          if (no_zio_error) {
4800 5521                  arc_hdr_verify(hdr, zio->io_bp);
4801 5522          } else {
4802 5523                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
4803 5524                  if (hdr->b_l1hdr.b_state != arc_anon)
4804 5525                          arc_change_state(arc_anon, hdr, hash_lock);
4805      -                if (HDR_IN_HASH_TABLE(hdr))
     5526 +                if (HDR_IN_HASH_TABLE(hdr)) {
     5527 +                        if (hash_lock)
     5528 +                                arc_wait_for_krrp(hdr);
4806 5529                          buf_hash_remove(hdr);
     5530 +                }
4807 5531                  freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4808 5532          }
4809 5533  
4810 5534          /*
4811 5535           * Broadcast before we drop the hash_lock to avoid the possibility
4812 5536           * that the hdr (and hence the cv) might be freed before we get to
4813 5537           * the cv_broadcast().
4814 5538           */
4815 5539          cv_broadcast(&hdr->b_l1hdr.b_cv);
4816 5540  
↓ open down ↓ 22 lines elided ↑ open up ↑
4839 5563  
4840 5564                  callback_list = acb->acb_next;
4841 5565                  kmem_free(acb, sizeof (arc_callback_t));
4842 5566          }
4843 5567  
4844 5568          if (freeable)
4845 5569                  arc_hdr_destroy(hdr);
4846 5570  }
4847 5571  
4848 5572  /*
     5573 + * The function to process data from arc by a callback
     5574 + * Process a cached block's data via a caller-supplied callback.
     5575 + * The main purpose is to copy data directly from the ARC to a target buffer.
     5576 +int
     5577 +arc_io_bypass(spa_t *spa, const blkptr_t *bp,
     5578 +    arc_bypass_io_func func, void *arg)
     5579 +{
     5580 +        arc_buf_hdr_t *hdr;
     5581 +        kmutex_t *hash_lock = NULL;
     5582 +        int error = 0;
     5583 +        uint64_t guid = spa_load_guid(spa);
     5584 +
     5585 +top:
     5586 +        hdr = buf_hash_find(guid, bp, &hash_lock);
     5587 +        if (hdr && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_bufcnt > 0 &&
     5588 +            hdr->b_l1hdr.b_buf->b_data) {
     5589 +                if (HDR_IO_IN_PROGRESS(hdr)) {
     5590 +                        cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
     5591 +                        mutex_exit(hash_lock);
     5592 +                        DTRACE_PROBE(arc_bypass_wait);
     5593 +                        goto top;
     5594 +                }
     5595 +
     5596 +                /*
     5597 +                 * Since func is an arbitrary callback that may block, the
     5598 +                 * hash lock is dropped around the call so that other
     5599 +                 * threads are not held up. A per-header counter keeps a
     5600 +                 * reference on the block while it is in use by krrp.
     5601 +                 */
     5602 +
     5603 +                hdr->b_l1hdr.b_krrp++;
     5604 +                mutex_exit(hash_lock);
     5605 +
     5606 +                error = func(hdr->b_l1hdr.b_buf->b_data, hdr->b_lsize, arg);
     5607 +
     5608 +                mutex_enter(hash_lock);
     5609 +                hdr->b_l1hdr.b_krrp--;
     5610 +                cv_broadcast(&hdr->b_l1hdr.b_cv);
     5611 +                mutex_exit(hash_lock);
     5612 +
     5613 +                return (error);
     5614 +        } else {
     5615 +                if (hash_lock)
     5616 +                        mutex_exit(hash_lock);
     5617 +                return (ENODATA);
     5618 +        }
     5619 +}
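
/*
 * Illustrative usage sketch only (not part of this file): a hypothetical
 * consumer copies a cached block straight out of the ARC into its own
 * buffer. The callback signature is assumed to be (void *data, uint64_t
 * size, void *arg), matching the call to func() in arc_io_bypass() above;
 * my_target_t, my_copy_cb() and my_read_cached() are invented names.
 */
#if 0
typedef struct my_target {
	void		*mt_buf;	/* destination buffer */
	uint64_t	mt_size;	/* destination buffer size */
} my_target_t;

static int
my_copy_cb(void *data, uint64_t size, void *arg)
{
	my_target_t *mt = arg;

	if (size > mt->mt_size)
		return (EOVERFLOW);
	bcopy(data, mt->mt_buf, size);
	return (0);
}

static int
my_read_cached(spa_t *spa, const blkptr_t *bp, my_target_t *mt)
{
	/* ENODATA means the block is not in the ARC; fall back to arc_read() */
	return (arc_io_bypass(spa, bp, my_copy_cb, mt));
}
#endif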
     5620 +
     5621 +/*
4849 5622   * "Read" the block at the specified DVA (in bp) via the
4850 5623   * cache.  If the block is found in the cache, invoke the provided
4851 5624   * callback immediately and return.  Note that the `zio' parameter
4852 5625   * in the callback will be NULL in this case, since no IO was
4853 5626   * required.  If the block is not in the cache pass the read request
4854 5627   * on to the spa with a substitute callback function, so that the
4855 5628   * requested block will be added to the cache.
4856 5629   *
4857 5630   * If a read request arrives for a block that has a read in-progress,
4858 5631   * either wait for the in-progress read to complete (and return the
(119 lines elided)
4978 5751                  } else if (*arc_flags & ARC_FLAG_PREFETCH &&
4979 5752                      refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4980 5753                          arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4981 5754                  }
4982 5755                  DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
4983 5756                  arc_access(hdr, hash_lock);
4984 5757                  if (*arc_flags & ARC_FLAG_L2CACHE)
4985 5758                          arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4986 5759                  mutex_exit(hash_lock);
4987 5760                  ARCSTAT_BUMP(arcstat_hits);
4988      -                ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
4989      -                    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
4990      -                    data, metadata, hits);
     5761 +                if (HDR_ISTYPE_DDT(hdr))
     5762 +                        ARCSTAT_BUMP(arcstat_ddt_hits);
     5763 +                arc_update_hit_stat(hdr, B_TRUE);
4991 5764  
4992 5765                  if (done)
4993 5766                          done(NULL, buf, private);
4994 5767          } else {
4995 5768                  uint64_t lsize = BP_GET_LSIZE(bp);
4996 5769                  uint64_t psize = BP_GET_PSIZE(bp);
4997 5770                  arc_callback_t *acb;
4998 5771                  vdev_t *vd = NULL;
4999 5772                  uint64_t addr = 0;
5000 5773                  boolean_t devw = B_FALSE;
(6 lines elided)
5007 5780                          hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
5008 5781                              BP_GET_COMPRESS(bp), type);
5009 5782  
5010 5783                          if (!BP_IS_EMBEDDED(bp)) {
5011 5784                                  hdr->b_dva = *BP_IDENTITY(bp);
5012 5785                                  hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
5013 5786                                  exists = buf_hash_insert(hdr, &hash_lock);
5014 5787                          }
5015 5788                          if (exists != NULL) {
5016 5789                                  /* somebody beat us to the hash insert */
5017      -                                mutex_exit(hash_lock);
5018      -                                buf_discard_identity(hdr);
5019 5790                                  arc_hdr_destroy(hdr);
     5791 +                                mutex_exit(hash_lock);
5020 5792                                  goto top; /* restart the IO request */
5021 5793                          }
5022 5794                  } else {
5023 5795                          /*
5024 5796                           * This block is in the ghost cache. If it was L2-only
5025 5797                           * (and thus didn't have an L1 hdr), we realloc the
5026 5798                           * header to add an L1 hdr.
5027 5799                           */
5028 5800                          if (!HDR_HAS_L1HDR(hdr)) {
5029 5801                                  hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
5030 5802                                      hdr_full_cache);
5031 5803                          }
5032 5804                          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5033 5805                          ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
5034 5806                          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5035 5807                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5036 5808                          ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
5037      -                        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     5809 +                        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
5038 5810  
5039 5811                          /*
5040 5812                           * This is a delicate dance that we play here.
5041 5813                           * This hdr is in the ghost list so we access it
5042 5814                           * to move it out of the ghost list before we
5043 5815                           * initiate the read. If it's a prefetch then
5044 5816                           * it won't have a callback so we'll remove the
5045 5817                           * reference that arc_buf_alloc_impl() created. We
5046 5818                           * do this after we've called arc_access() to
5047 5819                           * avoid hitting an assert in remove_reference().
(31 lines elided)
5079 5851  
5080 5852                  ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5081 5853                  hdr->b_l1hdr.b_acb = acb;
5082 5854                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5083 5855  
5084 5856                  if (HDR_HAS_L2HDR(hdr) &&
5085 5857                      (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5086 5858                          devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5087 5859                          addr = hdr->b_l2hdr.b_daddr;
5088 5860                          /*
5089      -                         * Lock out L2ARC device removal.
     5861 +                         * Lock out device removal.
5090 5862                           */
5091 5863                          if (vdev_is_dead(vd) ||
5092 5864                              !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5093 5865                                  vd = NULL;
5094 5866                  }
5095 5867  
5096 5868                  if (priority == ZIO_PRIORITY_ASYNC_READ)
5097 5869                          arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5098 5870                  else
5099 5871                          arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
(3 lines elided)
5103 5875  
5104 5876                  /*
5105 5877                   * At this point, we have a level 1 cache miss.  Try again in
5106 5878                   * L2ARC if possible.
5107 5879                   */
5108 5880                  ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5109 5881  
5110 5882                  DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5111 5883                      uint64_t, lsize, zbookmark_phys_t *, zb);
5112 5884                  ARCSTAT_BUMP(arcstat_misses);
5113      -                ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
5114      -                    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
5115      -                    data, metadata, misses);
     5885 +                arc_update_hit_stat(hdr, B_FALSE);
5116 5886  
5117 5887                  if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5118 5888                          /*
5119 5889                           * Read from the L2ARC if the following are true:
5120 5890                           * 1. The L2ARC vdev was previously cached.
5121 5891                           * 2. This buffer still has L2ARC metadata.
5122 5892                           * 3. This buffer isn't currently writing to the L2ARC.
5123 5893                           * 4. The L2ARC entry wasn't evicted, which may
5124 5894                           *    also have invalidated the vdev.
5125 5895                           * 5. This isn't prefetch and l2arc_noprefetch is set.
5126 5896                           */
5127 5897                          if (HDR_HAS_L2HDR(hdr) &&
5128 5898                              !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5129 5899                              !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5130 5900                                  l2arc_read_callback_t *cb;
5131 5901                                  abd_t *abd;
5132 5902                                  uint64_t asize;
5133 5903  
5134 5904                                  DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5135 5905                                  ARCSTAT_BUMP(arcstat_l2_hits);
     5906 +                                if (vdev_type_is_ddt(vd))
     5907 +                                        ARCSTAT_BUMP(arcstat_l2_ddt_hits);
5136 5908  
5137 5909                                  cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5138 5910                                      KM_SLEEP);
5139 5911                                  cb->l2rcb_hdr = hdr;
5140 5912                                  cb->l2rcb_bp = *bp;
5141 5913                                  cb->l2rcb_zb = *zb;
5142 5914                                  cb->l2rcb_flags = zio_flags;
5143 5915  
5144 5916                                  asize = vdev_psize_to_asize(vd, size);
5145 5917                                  if (asize != size) {
5146 5918                                          abd = abd_alloc_for_io(asize,
5147      -                                            HDR_ISTYPE_METADATA(hdr));
     5919 +                                            !HDR_ISTYPE_DATA(hdr));
5148 5920                                          cb->l2rcb_abd = abd;
5149 5921                                  } else {
5150 5922                                          abd = hdr->b_l1hdr.b_pabd;
5151 5923                                  }
5152 5924  
5153 5925                                  ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5154 5926                                      addr + asize <= vd->vdev_psize -
5155 5927                                      VDEV_LABEL_END_SIZE);
5156 5928  
5157 5929                                  /*
(7 lines elided)
5165 5937                                  rzio = zio_read_phys(pio, vd, addr,
5166 5938                                      asize, abd,
5167 5939                                      ZIO_CHECKSUM_OFF,
5168 5940                                      l2arc_read_done, cb, priority,
5169 5941                                      zio_flags | ZIO_FLAG_DONT_CACHE |
5170 5942                                      ZIO_FLAG_CANFAIL |
5171 5943                                      ZIO_FLAG_DONT_PROPAGATE |
5172 5944                                      ZIO_FLAG_DONT_RETRY, B_FALSE);
5173 5945                                  DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5174 5946                                      zio_t *, rzio);
     5947 +
5175 5948                                  ARCSTAT_INCR(arcstat_l2_read_bytes, size);
     5949 +                                if (vdev_type_is_ddt(vd))
     5950 +                                        ARCSTAT_INCR(arcstat_l2_ddt_read_bytes,
     5951 +                                            size);
5176 5952  
5177 5953                                  if (*arc_flags & ARC_FLAG_NOWAIT) {
5178 5954                                          zio_nowait(rzio);
5179 5955                                          return (0);
5180 5956                                  }
5181 5957  
5182 5958                                  ASSERT(*arc_flags & ARC_FLAG_WAIT);
5183 5959                                  if (zio_wait(rzio) == 0)
5184 5960                                          return (0);
5185 5961  
(251 lines elided)
5437 6213                  nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
5438 6214                  ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
5439 6215                  ASSERT0(nhdr->b_l1hdr.b_bufcnt);
5440 6216                  ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
5441 6217                  VERIFY3U(nhdr->b_type, ==, type);
5442 6218                  ASSERT(!HDR_SHARED_DATA(nhdr));
5443 6219  
5444 6220                  nhdr->b_l1hdr.b_buf = buf;
5445 6221                  nhdr->b_l1hdr.b_bufcnt = 1;
5446 6222                  (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
     6223 +                nhdr->b_l1hdr.b_krrp = 0;
     6224 +
5447 6225                  buf->b_hdr = nhdr;
5448 6226  
5449 6227                  mutex_exit(&buf->b_evict_lock);
5450 6228                  (void) refcount_add_many(&arc_anon->arcs_size,
5451 6229                      arc_buf_size(buf), buf);
5452 6230          } else {
5453 6231                  mutex_exit(&buf->b_evict_lock);
5454 6232                  ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
5455 6233                  /* protected by hash lock, or hdr is on arc_anon */
5456 6234                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
(190 lines elided)
5647 6425                           * sync-to-convergence, because we remove
5648 6426                           * buffers from the hash table when we arc_free().
5649 6427                           */
5650 6428                          if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
5651 6429                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5652 6430                                          panic("bad overwrite, hdr=%p exists=%p",
5653 6431                                              (void *)hdr, (void *)exists);
5654 6432                                  ASSERT(refcount_is_zero(
5655 6433                                      &exists->b_l1hdr.b_refcnt));
5656 6434                                  arc_change_state(arc_anon, exists, hash_lock);
5657      -                                mutex_exit(hash_lock);
     6435 +                                arc_wait_for_krrp(exists);
5658 6436                                  arc_hdr_destroy(exists);
     6437 +                                mutex_exit(hash_lock);
5659 6438                                  exists = buf_hash_insert(hdr, &hash_lock);
5660 6439                                  ASSERT3P(exists, ==, NULL);
5661 6440                          } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
5662 6441                                  /* nopwrite */
5663 6442                                  ASSERT(zio->io_prop.zp_nopwrite);
5664 6443                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5665 6444                                          panic("bad nopwrite, hdr=%p exists=%p",
5666 6445                                              (void *)hdr, (void *)exists);
5667 6446                          } else {
5668 6447                                  /* Dedup */
(17 lines elided)
5686 6465  
5687 6466          abd_put(zio->io_abd);
5688 6467          kmem_free(callback, sizeof (arc_write_callback_t));
5689 6468  }
5690 6469  
5691 6470  zio_t *
5692 6471  arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
5693 6472      boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
5694 6473      arc_done_func_t *children_ready, arc_done_func_t *physdone,
5695 6474      arc_done_func_t *done, void *private, zio_priority_t priority,
5696      -    int zio_flags, const zbookmark_phys_t *zb)
     6475 +    int zio_flags, const zbookmark_phys_t *zb,
     6476 +    const zio_smartcomp_info_t *smartcomp)
5697 6477  {
5698 6478          arc_buf_hdr_t *hdr = buf->b_hdr;
5699 6479          arc_write_callback_t *callback;
5700 6480          zio_t *zio;
5701 6481          zio_prop_t localprop = *zp;
5702 6482  
5703 6483          ASSERT3P(ready, !=, NULL);
5704 6484          ASSERT3P(done, !=, NULL);
5705 6485          ASSERT(!HDR_IO_ERROR(hdr));
5706 6486          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
(40 lines elided)
5747 6527                  arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
5748 6528          }
5749 6529          ASSERT(!arc_buf_is_shared(buf));
5750 6530          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5751 6531  
5752 6532          zio = zio_write(pio, spa, txg, bp,
5753 6533              abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
5754 6534              HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
5755 6535              (children_ready != NULL) ? arc_write_children_ready : NULL,
5756 6536              arc_write_physdone, arc_write_done, callback,
5757      -            priority, zio_flags, zb);
     6537 +            priority, zio_flags, zb, smartcomp);
5758 6538  
5759 6539          return (zio);
5760 6540  }
5761 6541  
5762 6542  static int
5763 6543  arc_memory_throttle(uint64_t reserve, uint64_t txg)
5764 6544  {
5765 6545  #ifdef _KERNEL
5766 6546          uint64_t available_memory = ptob(freemem);
5767 6547          static uint64_t page_load = 0;
(71 lines elided)
5839 6619          if (error != 0)
5840 6620                  return (error);
5841 6621  
5842 6622          /*
5843 6623           * Throttle writes when the amount of dirty data in the cache
5844 6624           * gets too large.  We try to keep the cache less than half full
5845 6625           * of dirty blocks so that our sync times don't grow too large.
5846 6626           * Note: if two requests come in concurrently, we might let them
5847 6627           * both succeed, when one of them should fail.  Not a huge deal.
5848 6628           */
5849      -
5850 6629          if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
5851 6630              anon_size > arc_c / 4) {
     6631 +                DTRACE_PROBE4(arc__tempreserve__space__throttle, uint64_t,
     6632 +                    arc_tempreserve, arc_state_t *, arc_anon, uint64_t,
     6633 +                    reserve, uint64_t, arc_c);
     6634 +
5852 6635                  uint64_t meta_esize =
5853 6636                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5854 6637                  uint64_t data_esize =
5855 6638                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5856 6639                  dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
5857 6640                      "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
5858 6641                      arc_tempreserve >> 10, meta_esize >> 10,
5859 6642                      data_esize >> 10, reserve >> 10, arc_c >> 10);
5860 6643                  return (SET_ERROR(ERESTART));
5861 6644          }
5862 6645          atomic_add_64(&arc_tempreserve, reserve);
5863 6646          return (0);
5864 6647  }
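
/*
 * Worked example of the dirty-data throttle above (illustrative numbers
 * only): with arc_c at 1 GiB, a reservation fails with ERESTART when
 * anonymous (dirty) data alone exceeds 256 MiB (arc_c / 4) AND
 * reserve + arc_tempreserve + anon_size together exceed 512 MiB
 * (arc_c / 2); otherwise the reserve is added to arc_tempreserve and the
 * write is allowed to proceed.
 */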
5865 6648  
5866 6649  static void
5867 6650  arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
5868      -    kstat_named_t *evict_data, kstat_named_t *evict_metadata)
     6651 +    kstat_named_t *evict_data, kstat_named_t *evict_metadata,
     6652 +    kstat_named_t *evict_ddt)
5869 6653  {
5870 6654          size->value.ui64 = refcount_count(&state->arcs_size);
5871 6655          evict_data->value.ui64 =
5872 6656              refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
5873 6657          evict_metadata->value.ui64 =
5874 6658              refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
     6659 +        evict_ddt->value.ui64 =
     6660 +            refcount_count(&state->arcs_esize[ARC_BUFC_DDT]);
5875 6661  }
5876 6662  
5877 6663  static int
5878 6664  arc_kstat_update(kstat_t *ksp, int rw)
5879 6665  {
5880 6666          arc_stats_t *as = ksp->ks_data;
5881 6667  
5882 6668          if (rw == KSTAT_WRITE) {
5883 6669                  return (EACCES);
5884 6670          } else {
5885 6671                  arc_kstat_update_state(arc_anon,
5886 6672                      &as->arcstat_anon_size,
5887 6673                      &as->arcstat_anon_evictable_data,
5888      -                    &as->arcstat_anon_evictable_metadata);
     6674 +                    &as->arcstat_anon_evictable_metadata,
     6675 +                    &as->arcstat_anon_evictable_ddt);
5889 6676                  arc_kstat_update_state(arc_mru,
5890 6677                      &as->arcstat_mru_size,
5891 6678                      &as->arcstat_mru_evictable_data,
5892      -                    &as->arcstat_mru_evictable_metadata);
     6679 +                    &as->arcstat_mru_evictable_metadata,
     6680 +                    &as->arcstat_mru_evictable_ddt);
5893 6681                  arc_kstat_update_state(arc_mru_ghost,
5894 6682                      &as->arcstat_mru_ghost_size,
5895 6683                      &as->arcstat_mru_ghost_evictable_data,
5896      -                    &as->arcstat_mru_ghost_evictable_metadata);
     6684 +                    &as->arcstat_mru_ghost_evictable_metadata,
     6685 +                    &as->arcstat_mru_ghost_evictable_ddt);
5897 6686                  arc_kstat_update_state(arc_mfu,
5898 6687                      &as->arcstat_mfu_size,
5899 6688                      &as->arcstat_mfu_evictable_data,
5900      -                    &as->arcstat_mfu_evictable_metadata);
     6689 +                    &as->arcstat_mfu_evictable_metadata,
     6690 +                    &as->arcstat_mfu_evictable_ddt);
5901 6691                  arc_kstat_update_state(arc_mfu_ghost,
5902 6692                      &as->arcstat_mfu_ghost_size,
5903 6693                      &as->arcstat_mfu_ghost_evictable_data,
5904      -                    &as->arcstat_mfu_ghost_evictable_metadata);
5905      -
5906      -                ARCSTAT(arcstat_size) = aggsum_value(&arc_size);
5907      -                ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used);
5908      -                ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size);
5909      -                ARCSTAT(arcstat_metadata_size) =
5910      -                    aggsum_value(&astat_metadata_size);
5911      -                ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size);
5912      -                ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size);
5913      -                ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size);
     6694 +                    &as->arcstat_mfu_ghost_evictable_metadata,
     6695 +                    &as->arcstat_mfu_ghost_evictable_ddt);
5914 6696          }
5915 6697  
5916 6698          return (0);
5917 6699  }
5918 6700  
5919 6701  /*
5920 6702   * This function *must* return indices evenly distributed between all
5921 6703   * sublists of the multilist. This is needed due to how the ARC eviction
5922 6704   * code is laid out; arc_evict_state() assumes ARC buffers are evenly
5923 6705   * distributed between all sublists and uses this assumption when
(29 lines elided)
5953 6735  
5954 6736  static void
5955 6737  arc_state_init(void)
5956 6738  {
5957 6739          arc_anon = &ARC_anon;
5958 6740          arc_mru = &ARC_mru;
5959 6741          arc_mru_ghost = &ARC_mru_ghost;
5960 6742          arc_mfu = &ARC_mfu;
5961 6743          arc_mfu_ghost = &ARC_mfu_ghost;
5962 6744          arc_l2c_only = &ARC_l2c_only;
     6745 +        arc_buf_contents_t arcs;
5963 6746  
5964      -        arc_mru->arcs_list[ARC_BUFC_METADATA] =
5965      -            multilist_create(sizeof (arc_buf_hdr_t),
5966      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5967      -            arc_state_multilist_index_func);
5968      -        arc_mru->arcs_list[ARC_BUFC_DATA] =
5969      -            multilist_create(sizeof (arc_buf_hdr_t),
5970      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5971      -            arc_state_multilist_index_func);
5972      -        arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
5973      -            multilist_create(sizeof (arc_buf_hdr_t),
5974      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5975      -            arc_state_multilist_index_func);
5976      -        arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
5977      -            multilist_create(sizeof (arc_buf_hdr_t),
5978      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5979      -            arc_state_multilist_index_func);
5980      -        arc_mfu->arcs_list[ARC_BUFC_METADATA] =
5981      -            multilist_create(sizeof (arc_buf_hdr_t),
5982      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5983      -            arc_state_multilist_index_func);
5984      -        arc_mfu->arcs_list[ARC_BUFC_DATA] =
5985      -            multilist_create(sizeof (arc_buf_hdr_t),
5986      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5987      -            arc_state_multilist_index_func);
5988      -        arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
5989      -            multilist_create(sizeof (arc_buf_hdr_t),
5990      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5991      -            arc_state_multilist_index_func);
5992      -        arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
5993      -            multilist_create(sizeof (arc_buf_hdr_t),
5994      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5995      -            arc_state_multilist_index_func);
5996      -        arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
5997      -            multilist_create(sizeof (arc_buf_hdr_t),
5998      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5999      -            arc_state_multilist_index_func);
6000      -        arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
6001      -            multilist_create(sizeof (arc_buf_hdr_t),
6002      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6003      -            arc_state_multilist_index_func);
     6747 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     6748 +                arc_mru->arcs_list[arcs] =
     6749 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6750 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6751 +                    arc_state_multilist_index_func);
     6752 +                arc_mru_ghost->arcs_list[arcs] =
     6753 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6754 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6755 +                    arc_state_multilist_index_func);
     6756 +                arc_mfu->arcs_list[arcs] =
     6757 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6758 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6759 +                    arc_state_multilist_index_func);
     6760 +                arc_mfu_ghost->arcs_list[arcs] =
     6761 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6762 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6763 +                    arc_state_multilist_index_func);
     6764 +                arc_l2c_only->arcs_list[arcs] =
     6765 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6766 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6767 +                    arc_state_multilist_index_func);
6004 6768  
6005      -        refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6006      -        refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6007      -        refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6008      -        refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6009      -        refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6010      -        refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6011      -        refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6012      -        refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6013      -        refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6014      -        refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6015      -        refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6016      -        refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
     6769 +                refcount_create(&arc_anon->arcs_esize[arcs]);
     6770 +                refcount_create(&arc_mru->arcs_esize[arcs]);
     6771 +                refcount_create(&arc_mru_ghost->arcs_esize[arcs]);
     6772 +                refcount_create(&arc_mfu->arcs_esize[arcs]);
     6773 +                refcount_create(&arc_mfu_ghost->arcs_esize[arcs]);
     6774 +                refcount_create(&arc_l2c_only->arcs_esize[arcs]);
     6775 +        }
6017 6776  
     6777 +        arc_flush_taskq = taskq_create("arc_flush_tq",
     6778 +            max_ncpus, minclsyspri, 1, zfs_flush_ntasks, TASKQ_DYNAMIC);
     6779 +
6018 6780          refcount_create(&arc_anon->arcs_size);
6019 6781          refcount_create(&arc_mru->arcs_size);
6020 6782          refcount_create(&arc_mru_ghost->arcs_size);
6021 6783          refcount_create(&arc_mfu->arcs_size);
6022 6784          refcount_create(&arc_mfu_ghost->arcs_size);
6023 6785          refcount_create(&arc_l2c_only->arcs_size);
6024      -
6025      -        aggsum_init(&arc_meta_used, 0);
6026      -        aggsum_init(&arc_size, 0);
6027      -        aggsum_init(&astat_data_size, 0);
6028      -        aggsum_init(&astat_metadata_size, 0);
6029      -        aggsum_init(&astat_hdr_size, 0);
6030      -        aggsum_init(&astat_other_size, 0);
6031      -        aggsum_init(&astat_l2_hdr_size, 0);
6032 6786  }
6033 6787  
6034 6788  static void
6035 6789  arc_state_fini(void)
6036 6790  {
6037      -        refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6038      -        refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6039      -        refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6040      -        refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6041      -        refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6042      -        refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6043      -        refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6044      -        refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6045      -        refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6046      -        refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6047      -        refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6048      -        refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
     6791 +        arc_buf_contents_t arcs;
6049 6792  
6050 6793          refcount_destroy(&arc_anon->arcs_size);
6051 6794          refcount_destroy(&arc_mru->arcs_size);
6052 6795          refcount_destroy(&arc_mru_ghost->arcs_size);
6053 6796          refcount_destroy(&arc_mfu->arcs_size);
6054 6797          refcount_destroy(&arc_mfu_ghost->arcs_size);
6055 6798          refcount_destroy(&arc_l2c_only->arcs_size);
6056 6799  
6057      -        multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
6058      -        multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
6059      -        multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
6060      -        multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
6061      -        multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
6062      -        multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
6063      -        multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
6064      -        multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
     6800 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     6801 +                multilist_destroy(arc_mru->arcs_list[arcs]);
     6802 +                multilist_destroy(arc_mru_ghost->arcs_list[arcs]);
     6803 +                multilist_destroy(arc_mfu->arcs_list[arcs]);
     6804 +                multilist_destroy(arc_mfu_ghost->arcs_list[arcs]);
     6805 +                multilist_destroy(arc_l2c_only->arcs_list[arcs]);
     6806 +
     6807 +                refcount_destroy(&arc_anon->arcs_esize[arcs]);
     6808 +                refcount_destroy(&arc_mru->arcs_esize[arcs]);
     6809 +                refcount_destroy(&arc_mru_ghost->arcs_esize[arcs]);
     6810 +                refcount_destroy(&arc_mfu->arcs_esize[arcs]);
     6811 +                refcount_destroy(&arc_mfu_ghost->arcs_esize[arcs]);
     6812 +                refcount_destroy(&arc_l2c_only->arcs_esize[arcs]);
     6813 +        }
6065 6814  }
6066 6815  
6067 6816  uint64_t
6068 6817  arc_max_bytes(void)
6069 6818  {
6070 6819          return (arc_c_max);
6071 6820  }
6072 6821  
6073 6822  void
6074 6823  arc_init(void)
(39 lines elided)
6114 6863           */
6115 6864          if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6116 6865                  arc_c_max = zfs_arc_max;
6117 6866                  arc_c_min = MIN(arc_c_min, arc_c_max);
6118 6867          }
6119 6868          if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6120 6869                  arc_c_min = zfs_arc_min;
6121 6870  
6122 6871          arc_c = arc_c_max;
6123 6872          arc_p = (arc_c >> 1);
     6873 +        arc_size = 0;
6124 6874  
     6875 +        /* limit ddt meta-data to 1/4 of the arc capacity */
     6876 +        arc_ddt_limit = arc_c_max / 4;
6125 6877          /* limit meta-data to 1/4 of the arc capacity */
6126 6878          arc_meta_limit = arc_c_max / 4;
6127 6879  
6128 6880  #ifdef _KERNEL
6129 6881          /*
6130 6882           * Metadata is stored in the kernel's heap.  Don't let us
6131 6883           * use more than half the heap for the ARC.
6132 6884           */
6133 6885          arc_meta_limit = MIN(arc_meta_limit,
6134 6886              vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6135 6887  #endif
6136 6888  
6137 6889          /* Allow the tunable to override if it is reasonable */
     6890 +        if (zfs_arc_ddt_limit > 0 && zfs_arc_ddt_limit <= arc_c_max)
     6891 +                arc_ddt_limit = zfs_arc_ddt_limit;
     6892 +        arc_ddt_evict_threshold =
     6893 +            zfs_arc_segregate_ddt ? &arc_ddt_limit : &arc_meta_limit;
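        /*
         * Example (illustrative numbers): with arc_c_max of 64 GiB,
         * arc_ddt_limit defaults to 16 GiB (arc_c_max / 4) unless
         * zfs_arc_ddt_limit overrides it. With zfs_arc_segregate_ddt unset,
         * arc_ddt_evict_threshold points at arc_meta_limit instead, so DDT
         * buffers are presumably evicted against the general metadata limit
         * rather than a separate DDT limit.
         */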
     6894 +
     6895 +        /* Allow the tunable to override if it is reasonable */
6138 6896          if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6139 6897                  arc_meta_limit = zfs_arc_meta_limit;
6140 6898  
6141 6899          if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6142 6900                  arc_c_min = arc_meta_limit / 2;
6143 6901  
6144 6902          if (zfs_arc_meta_min > 0) {
6145 6903                  arc_meta_min = zfs_arc_meta_min;
6146 6904          } else {
6147 6905                  arc_meta_min = arc_c_min / 2;
(74 lines elided)
6222 6980          /* Use B_TRUE to ensure *all* buffers are evicted */
6223 6981          arc_flush(NULL, B_TRUE);
6224 6982  
6225 6983          arc_dead = B_TRUE;
6226 6984  
6227 6985          if (arc_ksp != NULL) {
6228 6986                  kstat_delete(arc_ksp);
6229 6987                  arc_ksp = NULL;
6230 6988          }
6231 6989  
     6990 +        taskq_destroy(arc_flush_taskq);
     6991 +
6232 6992          mutex_destroy(&arc_reclaim_lock);
6233 6993          cv_destroy(&arc_reclaim_thread_cv);
6234 6994          cv_destroy(&arc_reclaim_waiters_cv);
6235 6995  
6236 6996          arc_state_fini();
6237 6997          buf_fini();
6238 6998  
6239 6999          ASSERT0(arc_loaned_bytes);
6240 7000  }
6241 7001  
(133 lines elided)
6375 7135   * integrated, and also may become zpool properties.
6376 7136   *
6377 7137   * There are three key functions that control how the L2ARC warms up:
6378 7138   *
6379 7139   *      l2arc_write_eligible()  check if a buffer is eligible to cache
6380 7140   *      l2arc_write_size()      calculate how much to write
6381 7141   *      l2arc_write_interval()  calculate sleep delay between writes
6382 7142   *
6383 7143   * These three functions determine what to write, how much, and how quickly
6384 7144   * to send writes.
     7145 + *
     7146 + * L2ARC persistency:
     7147 + *
     7148 + * When writing buffers to L2ARC, we periodically add some metadata to
     7149 + * make sure we can pick them up after reboot, thus dramatically reducing
     7150 + * the impact that any downtime has on the performance of storage systems
     7151 + * with large caches.
     7152 + *
     7153 + * The implementation works fairly simply by integrating the following two
     7154 + * modifications:
     7155 + *
     7156 + * *) Every now and then we mix in a piece of metadata (called a log block)
     7157 + *    into the L2ARC write. This allows us to understand what's been written,
     7158 + *    so that we can rebuild the arc_buf_hdr_t structures of the main ARC
     7159 + *    buffers. The log block also includes a "2-back-reference" pointer to
     7160 + *    the second-to-previous block, forming a back-linked list of blocks on
     7161 + *    the L2ARC device.
     7162 + *
     7163 + * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
     7164 + *    for our header bookkeeping purposes. This contains a device header,
     7165 + *    which contains our top-level reference structures. We update it each
     7166 + *    time we write a new log block, so that we're able to locate it in the
     7167 + *    L2ARC device. If this write results in an inconsistent device header
     7168 + *    (e.g. due to power failure), we detect this by verifying the header's
     7169 + *    checksum and simply drop the entries from L2ARC.
     7170 + *
     7171 + * Implementation diagram:
     7172 + *
     7173 + * +=== L2ARC device (not to scale) ======================================+
     7174 + * |       ___two newest log block pointers__.__________                  |
     7175 + * |      /                                   \1 back   \latest           |
     7176 + * |.____/_.                                   V         V                |
     7177 + * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
     7178 + * ||   hdr|      ^         /^       /^        /         /                |
     7179 + * |+------+  ...--\-------/  \-----/--\------/         /                 |
     7180 + * |                \--------------/    \--------------/                  |
     7181 + * +======================================================================+
     7182 + *
     7183 + * As the diagram shows, rather than using a simple linked list, we
     7184 + * use a pair of linked lists with alternating elements. This is a
     7185 + * performance enhancement: we only learn the address of the next log
     7186 + * block once the current block has been completely read in. With a
     7187 + * single list this would hurt performance, because the device's I/O
     7188 + * queue would be kept only one operation deep, incurring a large
     7189 + * amount of I/O round-trip latency. Having two lists allows us to
     7190 + * "prefetch" two log blocks ahead of where we are currently
     7191 + * rebuilding L2ARC buffers.
     7192 + *
     7193 + * On-device data structures:
     7194 + *
     7195 + * L2ARC device header: l2arc_dev_hdr_phys_t
     7196 + * L2ARC log block:     l2arc_log_blk_phys_t
     7197 + *
     7198 + * L2ARC reconstruction:
     7199 + *
     7200 + * When writing data, we simply write in the standard rotary fashion,
     7201 + * evicting buffers as we go and writing new data over them (writing
     7202 + * a new log block every now and then). This obviously means that once we
     7203 + * loop around the end of the device, we will start cutting into an already
     7204 + * committed log block (and its referenced data buffers), like so:
     7205 + *
     7206 + *    current write head__       __old tail
     7207 + *                        \     /
     7208 + *                        V    V
     7209 + * <--|bufs |lb |bufs |lb |    |bufs |lb |bufs |lb |-->
     7210 + *                         ^    ^^^^^^^^^___________________________________
     7211 + *                         |                                                \
     7212 + *                   <<nextwrite>> may overwrite this blk and/or its bufs --'
     7213 + *
     7214 + * When importing the pool, we detect this situation and use it to stop
     7215 + * our scanning process (see l2arc_rebuild).
     7216 + *
     7217 + * There is one significant caveat to consider when rebuilding ARC contents
     7218 + * from an L2ARC device: what about invalidated buffers? Given the above
     7219 + * construction, we cannot update blocks which we've already written to amend
     7220 + * them to remove buffers which were invalidated. Thus, during reconstruction,
     7221 + * we might be populating the cache with buffers for data that's not on the
     7222 + * main pool anymore, or may have been overwritten!
     7223 + *
     7224 + * As it turns out, this isn't a problem. Every arc_read request includes
     7225 + * both the DVA and, crucially, the birth TXG of the BP the caller is
     7226 + * looking for. So even if the cache were populated by completely rotten
     7227 + * blocks for data that had been long deleted and/or overwritten, we'll
     7228 + * never actually return bad data from the cache, since the DVA together
     7229 + * with the birth TXG uniquely identifies a block in space and time - once
     7230 + * created, a block is immutable on disk. The worst we can do is waste
     7231 + * some time and memory during l2arc rebuild reconstructing outdated ARC
     7232 + * entries that will get dropped from the l2arc as it is being updated
     7233 + * with new blocks.
6385 7234   */
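
/*
 * Illustrative sketch only (not part of this file) of the two-list walk
 * described above. All type and helper names here (example_log_blk_t,
 * sync_read_log_blk(), async_read_log_blk(), wait_log_blk(),
 * restore_arc_hdrs_from()) are hypothetical; the real on-disk structures
 * are l2arc_dev_hdr_phys_t and l2arc_log_blk_phys_t.
 */
#if 0
typedef struct example_log_blk {
	example_log_blkptr_t	lb_prev_2;	/* the "2-back-reference" */
	/* entries describing the ARC buffers written just before this block */
} example_log_blk_t;

static void
example_rebuild(example_dev_hdr_t *hdr)
{
	example_log_blk_t *cur;
	example_io_t *next_io, *prev2_io;

	/*
	 * Walk backwards from the two newest log block pointers kept in the
	 * device header. While one block is being processed, the read of
	 * the next one (from the other list) is already in flight, keeping
	 * the device queue two operations deep instead of one.
	 */
	cur = sync_read_log_blk(&hdr->dh_newest);	/* newest log block */
	next_io = async_read_log_blk(&hdr->dh_one_back); /* second newest */

	/* wait_log_blk() is assumed to return NULL at an invalid pointer */
	while (cur != NULL) {
		restore_arc_hdrs_from(cur);
		/* start reading the block two back while next_io completes */
		prev2_io = async_read_log_blk(&cur->lb_prev_2);
		cur = wait_log_blk(next_io);
		next_io = prev2_io;
	}
}
#endif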
6386 7235  
6387 7236  static boolean_t
6388 7237  l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
6389 7238  {
6390 7239          /*
6391 7240           * A buffer is *not* eligible for the L2ARC if it:
6392 7241           * 1. belongs to a different spa.
6393 7242           * 2. is already cached on the L2ARC.
6394 7243           * 3. has an I/O in progress (it may be an incomplete read).
(45 lines elided)
6440 7289                  interval = (hz * l2arc_feed_min_ms) / 1000;
6441 7290          else
6442 7291                  interval = hz * l2arc_feed_secs;
6443 7292  
6444 7293          now = ddi_get_lbolt();
6445 7294          next = MAX(now, MIN(now + interval, began + interval));
6446 7295  
6447 7296          return (next);
6448 7297  }
6449 7298  
     7299 +typedef enum l2ad_feed {
     7300 +        L2ARC_FEED_ALL = 1,
     7301 +        L2ARC_FEED_DDT_DEV,
     7302 +        L2ARC_FEED_NON_DDT_DEV,
     7303 +} l2ad_feed_t;
     7304 +
6450 7305  /*
6451 7306   * Cycle through L2ARC devices.  This is how L2ARC load balances.
6452 7307   * If a device is returned, this also returns holding the spa config lock.
6453 7308   */
6454 7309  static l2arc_dev_t *
6455      -l2arc_dev_get_next(void)
     7310 +l2arc_dev_get_next(l2ad_feed_t feed_type)
6456 7311  {
6457      -        l2arc_dev_t *first, *next = NULL;
     7312 +        l2arc_dev_t *start = NULL, *next = NULL;
6458 7313  
6459 7314          /*
6460 7315           * Lock out the removal of spas (spa_namespace_lock), then removal
6461 7316           * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
6462 7317           * both locks will be dropped and a spa config lock held instead.
6463 7318           */
6464 7319          mutex_enter(&spa_namespace_lock);
6465 7320          mutex_enter(&l2arc_dev_mtx);
6466 7321  
6467 7322          /* if there are no vdevs, there is nothing to do */
6468 7323          if (l2arc_ndev == 0)
6469 7324                  goto out;
6470 7325  
6471      -        first = NULL;
6472      -        next = l2arc_dev_last;
6473      -        do {
6474      -                /* loop around the list looking for a non-faulted vdev */
6475      -                if (next == NULL) {
6476      -                        next = list_head(l2arc_dev_list);
6477      -                } else {
6478      -                        next = list_next(l2arc_dev_list, next);
6479      -                        if (next == NULL)
6480      -                                next = list_head(l2arc_dev_list);
6481      -                }
     7326 +        if (feed_type == L2ARC_FEED_DDT_DEV)
     7327 +                next = l2arc_ddt_dev_last;
     7328 +        else
     7329 +                next = l2arc_dev_last;
6482 7330  
6483      -                /* if we have come back to the start, bail out */
6484      -                if (first == NULL)
6485      -                        first = next;
6486      -                else if (next == first)
6487      -                        break;
     7331 +        /* figure out what the next device we look at should be */
     7332 +        if (next == NULL)
     7333 +                next = list_head(l2arc_dev_list);
     7334 +        else if (list_next(l2arc_dev_list, next) == NULL)
     7335 +                next = list_head(l2arc_dev_list);
     7336 +        else
     7337 +                next = list_next(l2arc_dev_list, next);
     7338 +        ASSERT(next);
6488 7339  
6489      -        } while (vdev_is_dead(next->l2ad_vdev));
     7340 +        /* loop through L2ARC devs looking for the one we need */
     7341 +        /* LINTED(E_CONSTANT_CONDITION) */
     7342 +        while (1) {
     7343 +                if (next == NULL) /* reached list end, start from beginning */
     7344 +                        next = list_head(l2arc_dev_list);
6490 7345  
6491      -        /* if we were unable to find any usable vdevs, return NULL */
6492      -        if (vdev_is_dead(next->l2ad_vdev))
6493      -                next = NULL;
     7346 +                if (start == NULL) { /* save starting dev */
     7347 +                        start = next;
     7348 +                } else if (start == next) { /* full loop completed - stop now */
     7349 +                        next = NULL;
     7350 +                        if (feed_type == L2ARC_FEED_DDT_DEV) {
     7351 +                                l2arc_ddt_dev_last = NULL;
     7352 +                                goto out;
     7353 +                        } else {
     7354 +                                break;
     7355 +                        }
     7356 +                }
6494 7357  
     7358 +                if (!vdev_is_dead(next->l2ad_vdev) && !next->l2ad_rebuild) {
     7359 +                        if (feed_type == L2ARC_FEED_DDT_DEV) {
     7360 +                                if (vdev_type_is_ddt(next->l2ad_vdev)) {
     7361 +                                        l2arc_ddt_dev_last = next;
     7362 +                                        goto out;
     7363 +                                }
     7364 +                        } else if (feed_type == L2ARC_FEED_NON_DDT_DEV) {
     7365 +                                if (!vdev_type_is_ddt(next->l2ad_vdev)) {
     7366 +                                        break;
     7367 +                                }
     7368 +                        } else {
     7369 +                                ASSERT(feed_type == L2ARC_FEED_ALL);
     7370 +                                break;
     7371 +                        }
     7372 +                }
     7373 +                next = list_next(l2arc_dev_list, next);
     7374 +        }
6495 7375          l2arc_dev_last = next;
6496 7376  
6497 7377  out:
6498 7378          mutex_exit(&l2arc_dev_mtx);
6499 7379  
6500 7380          /*
6501 7381           * Grab the config lock to prevent the 'next' device from being
6502 7382           * removed while we are writing to it.
6503 7383           */
6504 7384          if (next != NULL)
(32 lines elided)
6537 7417   */
6538 7418  static void
6539 7419  l2arc_write_done(zio_t *zio)
6540 7420  {
6541 7421          l2arc_write_callback_t *cb;
6542 7422          l2arc_dev_t *dev;
6543 7423          list_t *buflist;
6544 7424          arc_buf_hdr_t *head, *hdr, *hdr_prev;
6545 7425          kmutex_t *hash_lock;
6546 7426          int64_t bytes_dropped = 0;
     7427 +        l2arc_log_blk_buf_t *lb_buf;
6547 7428  
6548 7429          cb = zio->io_private;
6549 7430          ASSERT3P(cb, !=, NULL);
6550 7431          dev = cb->l2wcb_dev;
6551 7432          ASSERT3P(dev, !=, NULL);
6552 7433          head = cb->l2wcb_head;
6553 7434          ASSERT3P(head, !=, NULL);
6554 7435          buflist = &dev->l2ad_buflist;
6555 7436          ASSERT3P(buflist, !=, NULL);
6556 7437          DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
(76 lines elided)
6633 7514  
6634 7515                  mutex_exit(hash_lock);
6635 7516          }
6636 7517  
6637 7518          atomic_inc_64(&l2arc_writes_done);
6638 7519          list_remove(buflist, head);
6639 7520          ASSERT(!HDR_HAS_L1HDR(head));
6640 7521          kmem_cache_free(hdr_l2only_cache, head);
6641 7522          mutex_exit(&dev->l2ad_mtx);
6642 7523  
     7524 +        ASSERT(dev->l2ad_vdev != NULL);
6643 7525          vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
6644 7526  
6645 7527          l2arc_do_free_on_write();
6646 7528  
     7529 +        while ((lb_buf = list_remove_tail(&cb->l2wcb_log_blk_buflist)) != NULL)
     7530 +                kmem_free(lb_buf, sizeof (*lb_buf));
     7531 +        list_destroy(&cb->l2wcb_log_blk_buflist);
6647 7532          kmem_free(cb, sizeof (l2arc_write_callback_t));
6648 7533  }
6649 7534  
6650 7535  /*
6651 7536   * A read to a cache device completed.  Validate buffer contents before
6652 7537   * handing over to the regular ARC routines.
6653 7538   */
6654 7539  static void
6655 7540  l2arc_read_done(zio_t *zio)
6656 7541  {
(86 lines elided)
6743 7628                              hdr, zio->io_priority, cb->l2rcb_flags,
6744 7629                              &cb->l2rcb_zb));
6745 7630                  }
6746 7631          }
6747 7632  
6748 7633          kmem_free(cb, sizeof (l2arc_read_callback_t));
6749 7634  }
6750 7635  
6751 7636  /*
6752 7637   * This is the list priority from which the L2ARC will search for pages to
6753      - * cache.  This is used within loops (0..3) to cycle through lists in the
     7638 + * cache.  This is used within loops to cycle through lists in the
6754 7639   * desired order.  This order can have a significant effect on cache
6755 7640   * performance.
6756 7641   *
6757      - * Currently the metadata lists are hit first, MFU then MRU, followed by
6758      - * the data lists.  This function returns a locked list, and also returns
6759      - * the lock pointer.
     7642 + * Currently the ddt lists are hit first (MFU then MRU),
     7643 + * followed by the metadata lists and then the data lists.
     7644 + * This function returns a locked list, and also returns the lock pointer.
6760 7645   */
6761 7646  static multilist_sublist_t *
6762      -l2arc_sublist_lock(int list_num)
     7647 +l2arc_sublist_lock(enum l2arc_priorities prio)
6763 7648  {
6764 7649          multilist_t *ml = NULL;
6765 7650          unsigned int idx;
6766 7651  
6767      -        ASSERT(list_num >= 0 && list_num <= 3);
     7652 +        ASSERT(prio >= PRIORITY_MFU_DDT);
     7653 +        ASSERT(prio < PRIORITY_NUMTYPES);
6768 7654  
6769      -        switch (list_num) {
6770      -        case 0:
     7655 +        switch (prio) {
     7656 +        case PRIORITY_MFU_DDT:
     7657 +                ml = arc_mfu->arcs_list[ARC_BUFC_DDT];
     7658 +                break;
     7659 +        case PRIORITY_MRU_DDT:
     7660 +                ml = arc_mru->arcs_list[ARC_BUFC_DDT];
     7661 +                break;
     7662 +        case PRIORITY_MFU_META:
6771 7663                  ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
6772 7664                  break;
6773      -        case 1:
     7665 +        case PRIORITY_MRU_META:
6774 7666                  ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
6775 7667                  break;
6776      -        case 2:
     7668 +        case PRIORITY_MFU_DATA:
6777 7669                  ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
6778 7670                  break;
6779      -        case 3:
     7671 +        case PRIORITY_MRU_DATA:
6780 7672                  ml = arc_mru->arcs_list[ARC_BUFC_DATA];
6781 7673                  break;
6782 7674          }
6783 7675  
6784 7676          /*
6785 7677           * Return a randomly-selected sublist. This is acceptable
6786 7678           * because the caller feeds only a little bit of data for each
6787 7679           * call (8MB). Subsequent calls will result in different
6788 7680           * sublists being selected.
6789 7681           */
6790 7682          idx = multilist_get_random_index(ml);
6791 7683          return (multilist_sublist_lock(ml, idx));
6792 7684  }
6793 7685  
6794 7686  /*
     7687 + * Calculates the maximum overhead of L2ARC metadata log blocks for a given
     7688 + * L2ARC write size. l2arc_evict and l2arc_write_buffers need to include this
     7689 + * overhead in processing to make sure there is enough headroom available
     7690 + * when writing buffers.
     7691 + */
     7692 +static inline uint64_t
     7693 +l2arc_log_blk_overhead(uint64_t write_sz)
     7694 +{
     7695 +        return ((write_sz / SPA_MINBLOCKSIZE / L2ARC_LOG_BLK_ENTRIES) + 1) *
     7696 +            L2ARC_LOG_BLK_SIZE;
     7697 +}
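
/*
 * Worked example of the calculation above (illustrative constants):
 * SPA_MINBLOCKSIZE is 512 bytes, so an 8 MiB write can describe at most
 * 8 MiB / 512 = 16384 minimum-size buffers. If, purely for illustration,
 * L2ARC_LOG_BLK_ENTRIES were 1024 and L2ARC_LOG_BLK_SIZE were 128 KiB,
 * that write would need 16384 / 1024 + 1 = 17 log blocks, i.e. about
 * 2.1 MiB of extra headroom.
 */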
     7698 +
     7699 +/*
6795 7700   * Evict buffers from the device write hand to the distance specified in
6796 7701   * bytes.  This distance may span populated buffers, it may span nothing.
6797 7702   * This is clearing a region on the L2ARC device ready for writing.
6798 7703   * If the 'all' boolean is set, every buffer is evicted.
6799 7704   */
6800 7705  static void
6801      -l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
     7706 +l2arc_evict_impl(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
6802 7707  {
6803 7708          list_t *buflist;
6804 7709          arc_buf_hdr_t *hdr, *hdr_prev;
6805 7710          kmutex_t *hash_lock;
6806 7711          uint64_t taddr;
6807 7712  
6808 7713          buflist = &dev->l2ad_buflist;
6809 7714  
6810 7715          if (!all && dev->l2ad_first) {
6811 7716                  /*
6812 7717                   * This is the first sweep through the device.  There is
6813 7718                   * nothing to evict.
6814 7719                   */
6815 7720                  return;
6816 7721          }
6817 7722  
     7723 +        /*
     7724 +         * We need to add in the worst case scenario of log block overhead.
     7725 +         */
     7726 +        distance += l2arc_log_blk_overhead(distance);
6818 7727          if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
6819 7728                  /*
6820 7729                   * When nearing the end of the device, evict to the end
6821 7730                   * before the device write hand jumps to the start.
6822 7731                   */
6823 7732                  taddr = dev->l2ad_end;
6824 7733          } else {
6825 7734                  taddr = dev->l2ad_hand + distance;
6826 7735          }
6827 7736          DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
(63 lines elided)
6891 7800                                  arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
6892 7801                          }
6893 7802  
6894 7803                          arc_hdr_l2hdr_destroy(hdr);
6895 7804                  }
6896 7805                  mutex_exit(hash_lock);
6897 7806          }
6898 7807          mutex_exit(&dev->l2ad_mtx);
6899 7808  }
6900 7809  
     7810 +static void
     7811 +l2arc_evict_task(void *arg)
     7812 +{
     7813 +        l2arc_dev_t *dev = arg;
     7814 +        ASSERT(dev);
     7815 +
     7816 +        /*
     7817 +         * Evict l2arc buffers asynchronously; we need to keep the device
     7818 +         * around until we are sure there aren't any buffers referencing it.
     7819 +         * We do not need to hold any config locks, etc. because at this point,
     7820 +         * we are the only ones who knows about this device (the in-core
     7821 +         * structure), so no new buffers can be created (e.g. if the pool is
     7822 +         * re-imported while the asynchronous eviction is in progress) that
     7823 +         * reference this same in-core structure. Also remove the vdev link
     7824 +         * since further use of it as l2arc device is prohibited.
     7825 +         */
     7826 +        dev->l2ad_vdev = NULL;
     7827 +        l2arc_evict_impl(dev, 0LL, B_TRUE);
     7828 +
     7829 +        /* Same cleanup as in the synchronous path */
     7830 +        list_destroy(&dev->l2ad_buflist);
     7831 +        mutex_destroy(&dev->l2ad_mtx);
     7832 +        refcount_destroy(&dev->l2ad_alloc);
     7833 +        kmem_free(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
     7834 +        kmem_free(dev, sizeof (l2arc_dev_t));
     7835 +}
     7836 +
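           +/*
           + * Tunable: when B_TRUE, evicting all buffers (done at pool export or cache
           + * device removal) is handed off to arc_flush_taskq so the caller does not
           + * block; the async task then frees the l2arc_dev_t. Set to B_FALSE to force
           + * synchronous eviction.
           + */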
     7837 +boolean_t zfs_l2arc_async_evict = B_TRUE;
     7838 +
6901 7839  /*
      7840 + * Perform l2arc eviction for the buffers associated with this device.
      7841 + * If evicting all buffers (done at pool export time), try to evict
      7842 + * asynchronously, and fall back to synchronous eviction on error.
      7843 + * Tell the caller whether to clean up the device:
      7844 + *  - B_TRUE means "asynchronous eviction, do not clean up"
      7845 + *  - B_FALSE means "synchronous eviction, done, please clean up"
     7846 + */
     7847 +static boolean_t
     7848 +l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
     7849 +{
     7850 +        /*
      7851 +         * If we are evicting all the buffers for this device, which happens
      7852 +         * at pool export time, schedule an asynchronous task.
     7853 +         */
     7854 +        if (all && zfs_l2arc_async_evict) {
      7855 +                if (taskq_dispatch(arc_flush_taskq, l2arc_evict_task,
      7856 +                    dev, TQ_NOSLEEP) == NULL) {
     7857 +                        /*
      7858 +                         * Failed to dispatch the asynchronous task;
      7859 +                         * fall back to synchronous eviction.
     7860 +                         */
     7861 +                        l2arc_evict_impl(dev, distance, all);
     7862 +                } else {
     7863 +                        /*
     7864 +                         * Successful dispatch, vdev space updated
     7865 +                         */
     7866 +                        return (B_TRUE);
     7867 +                }
     7868 +        } else {
     7869 +                /* Evict synchronously */
     7870 +                l2arc_evict_impl(dev, distance, all);
     7871 +        }
     7872 +
     7873 +        return (B_FALSE);
     7874 +}
     7875 +
     7876 +/*
6902 7877   * Find and write ARC buffers to the L2ARC device.
6903 7878   *
6904 7879   * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
6905 7880   * for reading until they have completed writing.
6906 7881   * The headroom_boost is an in-out parameter used to maintain headroom boost
6907 7882   * state between calls to this function.
6908 7883   *
6909 7884   * Returns the number of bytes actually written (which may be smaller than
6910 7885   * the delta by which the device hand has changed due to alignment).
6911 7886   */
6912 7887  static uint64_t
6913      -l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
     7888 +l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
     7889 +    l2ad_feed_t feed_type)
6914 7890  {
6915 7891          arc_buf_hdr_t *hdr, *hdr_prev, *head;
     7892 +        /*
     7893 +         * We must carefully track the space we deal with here:
      7894 +         * - write_lsize: sum of the size of all buffers to be written
      7895 +         *      without compression or inter-buffer alignment applied.
      7896 +         *      This size is added to arcstat_l2_lsize, because subsequent
      7897 +         *      eviction of buffers decrements this kstat by only the
      7898 +         *      buffer's b_lsize (which doesn't take alignment into account).
     7899 +         * - write_asize: sum of the size of all buffers to be written
     7900 +         *      with inter-buffer alignment applied.
     7901 +         *      This size is used to estimate the maximum number of bytes
     7902 +         *      we could take up on the device and is thus used to gauge how
     7903 +         *      close we are to hitting target_sz.
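           +         * - write_psize: sum of the physical (on-disk, possibly compressed)
           +         *      size of all buffers to be written. This is what is counted in
           +         *      arcstat_l2_write_bytes and charged to the cache vdev via
           +         *      vdev_space_update().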
     7904 +         */
6916 7905          uint64_t write_asize, write_psize, write_lsize, headroom;
6917 7906          boolean_t full;
6918 7907          l2arc_write_callback_t *cb;
6919 7908          zio_t *pio, *wzio;
     7909 +        enum l2arc_priorities try;
6920 7910          uint64_t guid = spa_load_guid(spa);
     7911 +        boolean_t dev_hdr_update = B_FALSE;
6921 7912  
6922 7913          ASSERT3P(dev->l2ad_vdev, !=, NULL);
6923 7914  
6924 7915          pio = NULL;
     7916 +        cb = NULL;
6925 7917          write_lsize = write_asize = write_psize = 0;
6926 7918          full = B_FALSE;
6927 7919          head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
6928 7920          arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
6929 7921  
6930 7922          /*
6931 7923           * Copy buffers for L2ARC writing.
6932 7924           */
6933      -        for (int try = 0; try <= 3; try++) {
     7925 +        for (try = PRIORITY_MFU_DDT; try < PRIORITY_NUMTYPES; try++) {
6934 7926                  multilist_sublist_t *mls = l2arc_sublist_lock(try);
6935 7927                  uint64_t passed_sz = 0;
6936 7928  
6937 7929                  /*
6938 7930                   * L2ARC fast warmup.
6939 7931                   *
6940 7932                   * Until the ARC is warm and starts to evict, read from the
6941 7933                   * head of the ARC lists rather than the tail.
6942 7934                   */
6943 7935                  if (arc_warm == B_FALSE)
↓ open down ↓ 49 lines elided ↑ open up ↑
6993 7985                          uint64_t psize = arc_hdr_size(hdr);
6994 7986                          uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
6995 7987                              psize);
6996 7988  
6997 7989                          if ((write_asize + asize) > target_sz) {
6998 7990                                  full = B_TRUE;
6999 7991                                  mutex_exit(hash_lock);
7000 7992                                  break;
7001 7993                          }
7002 7994  
     7995 +                        /* make sure buf we select corresponds to feed_type */
     7996 +                        if ((feed_type == L2ARC_FEED_DDT_DEV &&
     7997 +                            arc_buf_type(hdr) != ARC_BUFC_DDT) ||
     7998 +                            (feed_type == L2ARC_FEED_NON_DDT_DEV &&
     7999 +                            arc_buf_type(hdr) == ARC_BUFC_DDT)) {
      8000 +                                mutex_exit(hash_lock);
      8001 +                                continue;
     8002 +                        }
     8003 +
7003 8004                          if (pio == NULL) {
7004 8005                                  /*
7005 8006                                   * Insert a dummy header on the buflist so
7006 8007                                   * l2arc_write_done() can find where the
7007 8008                                   * write buffers begin without searching.
7008 8009                                   */
7009 8010                                  mutex_enter(&dev->l2ad_mtx);
7010 8011                                  list_insert_head(&dev->l2ad_buflist, head);
7011 8012                                  mutex_exit(&dev->l2ad_mtx);
7012 8013  
7013      -                                cb = kmem_alloc(
     8014 +                                cb = kmem_zalloc(
7014 8015                                      sizeof (l2arc_write_callback_t), KM_SLEEP);
7015 8016                                  cb->l2wcb_dev = dev;
7016 8017                                  cb->l2wcb_head = head;
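           +                                /*
           +                                 * Temporary buffers holding serialized
           +                                 * log blocks are tracked on this list and
           +                                 * freed in l2arc_write_done().
           +                                 */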
     8018 +                                list_create(&cb->l2wcb_log_blk_buflist,
     8019 +                                    sizeof (l2arc_log_blk_buf_t),
     8020 +                                    offsetof(l2arc_log_blk_buf_t, lbb_node));
7017 8021                                  pio = zio_root(spa, l2arc_write_done, cb,
7018 8022                                      ZIO_FLAG_CANFAIL);
7019 8023                          }
7020 8024  
7021 8025                          hdr->b_l2hdr.b_dev = dev;
7022 8026                          hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
7023 8027                          arc_hdr_set_flags(hdr,
7024 8028                              ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
7025 8029  
7026 8030                          mutex_enter(&dev->l2ad_mtx);
↓ open down ↓ 14 lines elided ↑ open up ↑
7041 8045                           *
7042 8046                           * To ensure that the copy will be available for the
7043 8047                           * lifetime of the ZIO and be cleaned up afterwards, we
7044 8048                           * add it to the l2arc_free_on_write queue.
7045 8049                           */
7046 8050                          abd_t *to_write;
7047 8051                          if (!HDR_SHARED_DATA(hdr) && psize == asize) {
7048 8052                                  to_write = hdr->b_l1hdr.b_pabd;
7049 8053                          } else {
7050 8054                                  to_write = abd_alloc_for_io(asize,
7051      -                                    HDR_ISTYPE_METADATA(hdr));
     8055 +                                    !HDR_ISTYPE_DATA(hdr));
7052 8056                                  abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
7053 8057                                  if (asize != psize) {
7054 8058                                          abd_zero_off(to_write, psize,
7055 8059                                              asize - psize);
7056 8060                                  }
7057 8061                                  l2arc_free_abd_on_write(to_write, asize,
7058 8062                                      arc_buf_type(hdr));
7059 8063                          }
7060 8064                          wzio = zio_write_phys(pio, dev->l2ad_vdev,
7061 8065                              hdr->b_l2hdr.b_daddr, asize, to_write,
↓ open down ↓ 5 lines elided ↑ open up ↑
7067 8071                          DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
7068 8072                              zio_t *, wzio);
7069 8073  
7070 8074                          write_psize += psize;
7071 8075                          write_asize += asize;
7072 8076                          dev->l2ad_hand += asize;
7073 8077  
7074 8078                          mutex_exit(hash_lock);
7075 8079  
7076 8080                          (void) zio_nowait(wzio);
     8081 +
     8082 +                        /*
     8083 +                         * Append buf info to current log and commit if full.
     8084 +                         * arcstat_l2_{size,asize} kstats are updated internally.
     8085 +                         */
     8086 +                        if (l2arc_log_blk_insert(dev, hdr)) {
     8087 +                                l2arc_log_blk_commit(dev, pio, cb);
     8088 +                                dev_hdr_update = B_TRUE;
     8089 +                        }
7077 8090                  }
7078 8091  
7079 8092                  multilist_sublist_unlock(mls);
7080 8093  
7081 8094                  if (full == B_TRUE)
7082 8095                          break;
7083 8096          }
7084 8097  
7085 8098          /* No buffers selected for writing? */
7086 8099          if (pio == NULL) {
7087 8100                  ASSERT0(write_lsize);
7088 8101                  ASSERT(!HDR_HAS_L1HDR(head));
7089 8102                  kmem_cache_free(hdr_l2only_cache, head);
7090 8103                  return (0);
7091 8104          }
7092 8105  
     8106 +        /*
      8107 +         * If we wrote any log blocks as part of this write, update the
      8108 +         * device header to point to the most recent one.
     8109 +         */
     8110 +        if (dev_hdr_update)
     8111 +                l2arc_dev_hdr_update(dev, pio);
     8112 +
7093 8113          ASSERT3U(write_asize, <=, target_sz);
7094 8114          ARCSTAT_BUMP(arcstat_l2_writes_sent);
7095 8115          ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
     8116 +        if (feed_type == L2ARC_FEED_DDT_DEV)
     8117 +                ARCSTAT_INCR(arcstat_l2_ddt_write_bytes, write_psize);
7096 8118          ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
7097 8119          ARCSTAT_INCR(arcstat_l2_psize, write_psize);
7098 8120          vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
7099 8121  
7100 8122          /*
7101 8123           * Bump device hand to the device start if it is approaching the end.
7102 8124           * l2arc_evict() will already have evicted ahead for this case.
7103 8125           */
7104      -        if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
     8126 +        if (dev->l2ad_hand + target_sz + l2arc_log_blk_overhead(target_sz) >=
     8127 +            dev->l2ad_end) {
7105 8128                  dev->l2ad_hand = dev->l2ad_start;
7106 8129                  dev->l2ad_first = B_FALSE;
7107 8130          }
7108 8131  
7109 8132          dev->l2ad_writing = B_TRUE;
7110 8133          (void) zio_wait(pio);
7111 8134          dev->l2ad_writing = B_FALSE;
7112 8135  
7113 8136          return (write_asize);
7114 8137  }
7115 8138  
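           +/*
           + * Feed one L2ARC device of the given feed_type: select the next eligible
           + * device, evict ahead of its write hand, and write eligible ARC buffers
           + * to it. Returns B_TRUE and stores the number of bytes written in *wrote
           + * if a device was fed; returns B_FALSE if no suitable device was found or
           + * its pool is read-only.
           + */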
     8139 +static boolean_t
     8140 +l2arc_feed_dev(l2ad_feed_t feed_type, uint64_t *wrote)
     8141 +{
     8142 +        spa_t *spa;
     8143 +        l2arc_dev_t *dev;
     8144 +        uint64_t size;
     8145 +
     8146 +        /*
     8147 +         * This selects the next l2arc device to write to, and in
     8148 +         * doing so the next spa to feed from: dev->l2ad_spa.   This
     8149 +         * will return NULL if there are now no l2arc devices or if
     8150 +         * they are all faulted.
     8151 +         *
     8152 +         * If a device is returned, its spa's config lock is also
     8153 +         * held to prevent device removal.  l2arc_dev_get_next()
     8154 +         * will grab and release l2arc_dev_mtx.
     8155 +         */
     8156 +        if ((dev = l2arc_dev_get_next(feed_type)) == NULL)
     8157 +                return (B_FALSE);
     8158 +
     8159 +        spa = dev->l2ad_spa;
     8160 +        ASSERT(spa != NULL);
     8161 +
     8162 +        /*
      8163 +         * If the pool is read-only, skip it.
     8164 +         */
     8165 +        if (!spa_writeable(spa)) {
     8166 +                spa_config_exit(spa, SCL_L2ARC, dev);
     8167 +                return (B_FALSE);
     8168 +        }
     8169 +
     8170 +        ARCSTAT_BUMP(arcstat_l2_feeds);
     8171 +        size = l2arc_write_size();
     8172 +
     8173 +        /*
     8174 +         * Evict L2ARC buffers that will be overwritten.
     8175 +         * B_FALSE guarantees synchronous eviction.
     8176 +         */
     8177 +        (void) l2arc_evict(dev, size, B_FALSE);
     8178 +
     8179 +        /*
     8180 +         * Write ARC buffers.
     8181 +         */
     8182 +        *wrote = l2arc_write_buffers(spa, dev, size, feed_type);
     8183 +
     8184 +        spa_config_exit(spa, SCL_L2ARC, dev);
     8185 +
     8186 +        return (B_TRUE);
     8187 +}
     8188 +
7116 8189  /*
7117 8190   * This thread feeds the L2ARC at regular intervals.  This is the beating
7118 8191   * heart of the L2ARC.
7119 8192   */
7120 8193  /* ARGSUSED */
7121 8194  static void
7122 8195  l2arc_feed_thread(void *unused)
7123 8196  {
7124 8197          callb_cpr_t cpr;
7125      -        l2arc_dev_t *dev;
7126      -        spa_t *spa;
7127      -        uint64_t size, wrote;
     8198 +        uint64_t size, total_written = 0;
7128 8199          clock_t begin, next = ddi_get_lbolt();
     8200 +        l2ad_feed_t feed_type = L2ARC_FEED_ALL;
7129 8201  
7130 8202          CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
7131 8203  
7132 8204          mutex_enter(&l2arc_feed_thr_lock);
7133 8205  
7134 8206          while (l2arc_thread_exit == 0) {
7135 8207                  CALLB_CPR_SAFE_BEGIN(&cpr);
7136 8208                  (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
7137 8209                      next);
7138 8210                  CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
↓ open down ↓ 4 lines elided ↑ open up ↑
7143 8215                   */
7144 8216                  mutex_enter(&l2arc_dev_mtx);
7145 8217                  if (l2arc_ndev == 0) {
7146 8218                          mutex_exit(&l2arc_dev_mtx);
7147 8219                          continue;
7148 8220                  }
7149 8221                  mutex_exit(&l2arc_dev_mtx);
7150 8222                  begin = ddi_get_lbolt();
7151 8223  
7152 8224                  /*
7153      -                 * This selects the next l2arc device to write to, and in
7154      -                 * doing so the next spa to feed from: dev->l2ad_spa.   This
7155      -                 * will return NULL if there are now no l2arc devices or if
7156      -                 * they are all faulted.
7157      -                 *
7158      -                 * If a device is returned, its spa's config lock is also
7159      -                 * held to prevent device removal.  l2arc_dev_get_next()
7160      -                 * will grab and release l2arc_dev_mtx.
7161      -                 */
7162      -                if ((dev = l2arc_dev_get_next()) == NULL)
7163      -                        continue;
7164      -
7165      -                spa = dev->l2ad_spa;
7166      -                ASSERT3P(spa, !=, NULL);
7167      -
7168      -                /*
7169      -                 * If the pool is read-only then force the feed thread to
7170      -                 * sleep a little longer.
7171      -                 */
7172      -                if (!spa_writeable(spa)) {
7173      -                        next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
7174      -                        spa_config_exit(spa, SCL_L2ARC, dev);
7175      -                        continue;
7176      -                }
7177      -
7178      -                /*
7179 8225                   * Avoid contributing to memory pressure.
7180 8226                   */
7181 8227                  if (arc_reclaim_needed()) {
7182 8228                          ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
7183      -                        spa_config_exit(spa, SCL_L2ARC, dev);
7184 8229                          continue;
7185 8230                  }
7186 8231  
7187      -                ARCSTAT_BUMP(arcstat_l2_feeds);
     8232 +                /* try to write to DDT L2ARC device if any */
     8233 +                if (l2arc_feed_dev(L2ARC_FEED_DDT_DEV, &size)) {
     8234 +                        total_written += size;
     8235 +                        feed_type = L2ARC_FEED_NON_DDT_DEV;
     8236 +                }
7188 8237  
7189      -                size = l2arc_write_size();
     8238 +                /* try to write to the regular L2ARC device if any */
     8239 +                if (l2arc_feed_dev(feed_type, &size)) {
     8240 +                        total_written += size;
     8241 +                        if (feed_type == L2ARC_FEED_NON_DDT_DEV)
     8242 +                                total_written /= 2; /* avg written per device */
     8243 +                }
7190 8244  
7191 8245                  /*
7192      -                 * Evict L2ARC buffers that will be overwritten.
7193      -                 */
7194      -                l2arc_evict(dev, size, B_FALSE);
7195      -
7196      -                /*
7197      -                 * Write ARC buffers.
7198      -                 */
7199      -                wrote = l2arc_write_buffers(spa, dev, size);
7200      -
7201      -                /*
7202 8246                   * Calculate interval between writes.
7203 8247                   */
7204      -                next = l2arc_write_interval(begin, size, wrote);
7205      -                spa_config_exit(spa, SCL_L2ARC, dev);
     8248 +                next = l2arc_write_interval(begin, l2arc_write_size(),
     8249 +                    total_written);
     8250 +
     8251 +                total_written = 0;
7206 8252          }
7207 8253  
7208 8254          l2arc_thread_exit = 0;
7209 8255          cv_broadcast(&l2arc_feed_thr_cv);
7210 8256          CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
7211 8257          thread_exit();
7212 8258  }
7213 8259  
7214 8260  boolean_t
7215 8261  l2arc_vdev_present(vdev_t *vd)
7216 8262  {
7217      -        l2arc_dev_t *dev;
     8263 +        return (l2arc_vdev_get(vd) != NULL);
     8264 +}
7218 8265  
7219      -        mutex_enter(&l2arc_dev_mtx);
     8266 +/*
     8267 + * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
     8268 + * the vdev_t isn't an L2ARC device.
     8269 + */
     8270 +static l2arc_dev_t *
     8271 +l2arc_vdev_get(vdev_t *vd)
     8272 +{
     8273 +        l2arc_dev_t     *dev;
     8274 +        boolean_t       held = MUTEX_HELD(&l2arc_dev_mtx);
     8275 +
     8276 +        if (!held)
     8277 +                mutex_enter(&l2arc_dev_mtx);
7220 8278          for (dev = list_head(l2arc_dev_list); dev != NULL;
7221 8279              dev = list_next(l2arc_dev_list, dev)) {
7222 8280                  if (dev->l2ad_vdev == vd)
7223 8281                          break;
7224 8282          }
7225      -        mutex_exit(&l2arc_dev_mtx);
     8283 +        if (!held)
     8284 +                mutex_exit(&l2arc_dev_mtx);
7226 8285  
7227      -        return (dev != NULL);
     8286 +        return (dev);
7228 8287  }
7229 8288  
7230 8289  /*
7231 8290   * Add a vdev for use by the L2ARC.  By this point the spa has already
7232      - * validated the vdev and opened it.
     8291 + * validated the vdev and opened it. The `rebuild' flag indicates whether
     8292 + * we should attempt an L2ARC persistency rebuild.
7233 8293   */
7234 8294  void
7235      -l2arc_add_vdev(spa_t *spa, vdev_t *vd)
     8295 +l2arc_add_vdev(spa_t *spa, vdev_t *vd, boolean_t rebuild)
7236 8296  {
7237 8297          l2arc_dev_t *adddev;
7238 8298  
7239 8299          ASSERT(!l2arc_vdev_present(vd));
7240 8300  
7241 8301          /*
7242 8302           * Create a new l2arc device entry.
7243 8303           */
7244 8304          adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
7245 8305          adddev->l2ad_spa = spa;
7246 8306          adddev->l2ad_vdev = vd;
7247      -        adddev->l2ad_start = VDEV_LABEL_START_SIZE;
     8307 +        /* leave extra size for an l2arc device header */
     8308 +        adddev->l2ad_dev_hdr_asize = MAX(sizeof (*adddev->l2ad_dev_hdr),
     8309 +            1 << vd->vdev_ashift);
     8310 +        adddev->l2ad_start = VDEV_LABEL_START_SIZE + adddev->l2ad_dev_hdr_asize;
7248 8311          adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
     8312 +        ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
7249 8313          adddev->l2ad_hand = adddev->l2ad_start;
7250 8314          adddev->l2ad_first = B_TRUE;
7251 8315          adddev->l2ad_writing = B_FALSE;
     8316 +        adddev->l2ad_dev_hdr = kmem_zalloc(adddev->l2ad_dev_hdr_asize,
     8317 +            KM_SLEEP);
7252 8318  
7253 8319          mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
7254 8320          /*
7255 8321           * This is a list of all ARC buffers that are still valid on the
7256 8322           * device.
7257 8323           */
7258 8324          list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
7259 8325              offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
7260 8326  
7261 8327          vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
7262 8328          refcount_create(&adddev->l2ad_alloc);
7263 8329  
7264 8330          /*
7265 8331           * Add device to global list
7266 8332           */
7267 8333          mutex_enter(&l2arc_dev_mtx);
7268 8334          list_insert_head(l2arc_dev_list, adddev);
7269 8335          atomic_inc_64(&l2arc_ndev);
     8336 +        if (rebuild && l2arc_rebuild_enabled &&
     8337 +            adddev->l2ad_end - adddev->l2ad_start > L2ARC_PERSIST_MIN_SIZE) {
     8338 +                /*
     8339 +                 * Just mark the device as pending for a rebuild. We won't
     8340 +                 * be starting a rebuild in line here as it would block pool
     8341 +                 * import. Instead spa_load_impl will hand that off to an
     8342 +                 * async task which will call l2arc_spa_rebuild_start.
     8343 +                 */
     8344 +                adddev->l2ad_rebuild = B_TRUE;
     8345 +        }
7270 8346          mutex_exit(&l2arc_dev_mtx);
7271 8347  }
7272 8348  
7273 8349  /*
7274 8350   * Remove a vdev from the L2ARC.
7275 8351   */
7276 8352  void
7277 8353  l2arc_remove_vdev(vdev_t *vd)
7278 8354  {
7279 8355          l2arc_dev_t *dev, *nextdev, *remdev = NULL;
↓ open down ↓ 5 lines elided ↑ open up ↑
7285 8361          for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
7286 8362                  nextdev = list_next(l2arc_dev_list, dev);
7287 8363                  if (vd == dev->l2ad_vdev) {
7288 8364                          remdev = dev;
7289 8365                          break;
7290 8366                  }
7291 8367          }
7292 8368          ASSERT3P(remdev, !=, NULL);
7293 8369  
7294 8370          /*
     8371 +         * Cancel any ongoing or scheduled rebuild (race protection with
     8372 +         * l2arc_spa_rebuild_start provided via l2arc_dev_mtx).
     8373 +         */
     8374 +        remdev->l2ad_rebuild_cancel = B_TRUE;
     8375 +        if (remdev->l2ad_rebuild_did != 0) {
     8376 +                /*
      8377 +                 * N.B. it should be safe to thread_join with the rebuild
      8378 +                 * thread while holding l2arc_dev_mtx, because that mutex is
      8379 +                 * not acquired anywhere in the l2arc rebuild code below
      8380 +                 * (except for l2arc_spa_rebuild_start, which is ok).
     8381 +                 */
     8382 +                thread_join(remdev->l2ad_rebuild_did);
     8383 +        }
     8384 +
     8385 +        /*
7295 8386           * Remove device from global list
7296 8387           */
7297 8388          list_remove(l2arc_dev_list, remdev);
7298 8389          l2arc_dev_last = NULL;          /* may have been invalidated */
     8390 +        l2arc_ddt_dev_last = NULL;      /* may have been invalidated */
7299 8391          atomic_dec_64(&l2arc_ndev);
7300 8392          mutex_exit(&l2arc_dev_mtx);
7301 8393  
     8394 +        if (vdev_type_is_ddt(remdev->l2ad_vdev))
     8395 +                atomic_add_64(&remdev->l2ad_spa->spa_l2arc_ddt_devs_size,
     8396 +                    -(vdev_get_min_asize(remdev->l2ad_vdev)));
     8397 +
7302 8398          /*
7303 8399           * Clear all buflists and ARC references.  L2ARC device flush.
7304 8400           */
7305      -        l2arc_evict(remdev, 0, B_TRUE);
7306      -        list_destroy(&remdev->l2ad_buflist);
7307      -        mutex_destroy(&remdev->l2ad_mtx);
7308      -        refcount_destroy(&remdev->l2ad_alloc);
7309      -        kmem_free(remdev, sizeof (l2arc_dev_t));
     8401 +        if (l2arc_evict(remdev, 0, B_TRUE) == B_FALSE) {
     8402 +                /*
      8403 +                 * The eviction was done synchronously; clean up here.
      8404 +                 * Otherwise, the asynchronous task will do the cleanup.
      8405 +                 */
      8406 +                list_destroy(&remdev->l2ad_buflist);
      8407 +                mutex_destroy(&remdev->l2ad_mtx);
           +                refcount_destroy(&remdev->l2ad_alloc);
      8408 +                kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
     8409 +                kmem_free(remdev, sizeof (l2arc_dev_t));
     8410 +        }
7310 8411  }
7311 8412  
7312 8413  void
7313 8414  l2arc_init(void)
7314 8415  {
7315 8416          l2arc_thread_exit = 0;
7316 8417          l2arc_ndev = 0;
7317 8418          l2arc_writes_sent = 0;
7318 8419          l2arc_writes_done = 0;
7319 8420  
↓ open down ↓ 45 lines elided ↑ open up ↑
7365 8466  {
7366 8467          if (!(spa_mode_global & FWRITE))
7367 8468                  return;
7368 8469  
7369 8470          mutex_enter(&l2arc_feed_thr_lock);
7370 8471          cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
7371 8472          l2arc_thread_exit = 1;
7372 8473          while (l2arc_thread_exit != 0)
7373 8474                  cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
7374 8475          mutex_exit(&l2arc_feed_thr_lock);
     8476 +}
     8477 +
     8478 +/*
     8479 + * Punches out rebuild threads for the L2ARC devices in a spa. This should
     8480 + * be called after pool import from the spa async thread, since starting
     8481 + * these threads directly from spa_import() will make them part of the
     8482 + * "zpool import" context and delay process exit (and thus pool import).
     8483 + */
     8484 +void
     8485 +l2arc_spa_rebuild_start(spa_t *spa)
     8486 +{
     8487 +        /*
     8488 +         * Locate the spa's l2arc devices and kick off rebuild threads.
     8489 +         */
     8490 +        mutex_enter(&l2arc_dev_mtx);
     8491 +        for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
     8492 +                l2arc_dev_t *dev =
     8493 +                    l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
     8494 +                if (dev == NULL) {
     8495 +                        /* Don't attempt a rebuild if the vdev is UNAVAIL */
     8496 +                        continue;
     8497 +                }
     8498 +                if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
     8499 +                        VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
     8500 +#ifdef  _KERNEL
     8501 +                        dev->l2ad_rebuild_did = thread_create(NULL, 0,
     8502 +                            l2arc_dev_rebuild_start, dev, 0, &p0, TS_RUN,
     8503 +                            minclsyspri)->t_did;
     8504 +#endif
     8505 +                }
     8506 +        }
     8507 +        mutex_exit(&l2arc_dev_mtx);
     8508 +}
     8509 +
     8510 +/*
     8511 + * Main entry point for L2ARC rebuilding.
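           + * Runs in a separate kernel thread created by l2arc_spa_rebuild_start().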
     8512 + */
     8513 +static void
     8514 +l2arc_dev_rebuild_start(l2arc_dev_t *dev)
     8515 +{
     8516 +        if (!dev->l2ad_rebuild_cancel) {
     8517 +                VERIFY(dev->l2ad_rebuild);
     8518 +                (void) l2arc_rebuild(dev);
     8519 +                dev->l2ad_rebuild = B_FALSE;
     8520 +        }
     8521 +}
     8522 +
     8523 +/*
     8524 + * This function implements the actual L2ARC metadata rebuild. It:
     8525 + *
     8526 + * 1) reads the device's header
     8527 + * 2) if a good device header is found, starts reading the log block chain
     8528 + * 3) restores each block's contents to memory (reconstructing arc_buf_hdr_t's)
     8529 + *
     8530 + * Operation stops under any of the following conditions:
     8531 + *
     8532 + * 1) We reach the end of the log blk chain (the back-reference in the blk is
     8533 + *    invalid or loops over our starting point).
     8534 + * 2) We encounter *any* error condition (cksum errors, io errors, looped
     8535 + *    blocks, etc.).
     8536 + */
     8537 +static int
     8538 +l2arc_rebuild(l2arc_dev_t *dev)
     8539 +{
     8540 +        vdev_t                  *vd = dev->l2ad_vdev;
     8541 +        spa_t                   *spa = vd->vdev_spa;
     8542 +        int                     err;
     8543 +        l2arc_log_blk_phys_t    *this_lb, *next_lb;
     8544 +        uint8_t                 *this_lb_buf, *next_lb_buf;
     8545 +        zio_t                   *this_io = NULL, *next_io = NULL;
     8546 +        l2arc_log_blkptr_t      lb_ptrs[2];
     8547 +        boolean_t               first_pass, lock_held;
     8548 +        uint64_t                load_guid;
     8549 +
     8550 +        this_lb = kmem_zalloc(sizeof (*this_lb), KM_SLEEP);
     8551 +        next_lb = kmem_zalloc(sizeof (*next_lb), KM_SLEEP);
     8552 +        this_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
     8553 +        next_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
     8554 +
     8555 +        /*
     8556 +         * We prevent device removal while issuing reads to the device,
     8557 +         * then during the rebuilding phases we drop this lock again so
      8558 +         * that a spa_unload or device remove can be initiated. This is
      8559 +         * safe because the spa will signal us to stop, and wait for us to
      8560 +         * stop, before removing our device.
     8561 +         */
     8562 +        spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
     8563 +        lock_held = B_TRUE;
     8564 +
     8565 +        load_guid = spa_load_guid(dev->l2ad_vdev->vdev_spa);
     8566 +        /*
     8567 +         * Device header processing phase.
     8568 +         */
     8569 +        if ((err = l2arc_dev_hdr_read(dev)) != 0) {
     8570 +                /* device header corrupted, start a new one */
     8571 +                bzero(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
     8572 +                goto out;
     8573 +        }
     8574 +
     8575 +        /* Retrieve the persistent L2ARC device state */
     8576 +        dev->l2ad_hand = vdev_psize_to_asize(dev->l2ad_vdev,
     8577 +            dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr +
     8578 +            LBP_GET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0]));
     8579 +        dev->l2ad_first = !!(dev->l2ad_dev_hdr->dh_flags &
     8580 +            L2ARC_DEV_HDR_EVICT_FIRST);
     8581 +
     8582 +        /* Prepare the rebuild processing state */
     8583 +        bcopy(dev->l2ad_dev_hdr->dh_start_lbps, lb_ptrs, sizeof (lb_ptrs));
     8584 +        first_pass = B_TRUE;
     8585 +
     8586 +        /* Start the rebuild process */
     8587 +        for (;;) {
     8588 +                if (!l2arc_log_blkptr_valid(dev, &lb_ptrs[0]))
     8589 +                        /* We hit an invalid block address, end the rebuild. */
     8590 +                        break;
     8591 +
     8592 +                if ((err = l2arc_log_blk_read(dev, &lb_ptrs[0], &lb_ptrs[1],
     8593 +                    this_lb, next_lb, this_lb_buf, next_lb_buf,
     8594 +                    this_io, &next_io)) != 0)
     8595 +                        break;
     8596 +
     8597 +                spa_config_exit(spa, SCL_L2ARC, vd);
     8598 +                lock_held = B_FALSE;
     8599 +
     8600 +                /* Protection against infinite loops of log blocks. */
     8601 +                if (l2arc_range_check_overlap(lb_ptrs[1].lbp_daddr,
     8602 +                    lb_ptrs[0].lbp_daddr,
     8603 +                    dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
     8604 +                    !first_pass) {
     8605 +                        ARCSTAT_BUMP(arcstat_l2_rebuild_abort_loop_errors);
     8606 +                        err = SET_ERROR(ELOOP);
     8607 +                        break;
     8608 +                }
     8609 +
     8610 +                /*
     8611 +                 * Our memory pressure valve. If the system is running low
     8612 +                 * on memory, rather than swamping memory with new ARC buf
     8613 +                 * hdrs, we opt not to rebuild the L2ARC. At this point,
     8614 +                 * however, we have already set up our L2ARC dev to chain in
     8615 +                 * new metadata log blk, so the user may choose to re-add the
     8616 +                 * L2ARC dev at a later time to reconstruct it (when there's
     8617 +                 * less memory pressure).
     8618 +                 */
     8619 +                if (arc_reclaim_needed()) {
     8620 +                        ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
     8621 +                        cmn_err(CE_NOTE, "System running low on memory, "
     8622 +                            "aborting L2ARC rebuild.");
     8623 +                        err = SET_ERROR(ENOMEM);
     8624 +                        break;
     8625 +                }
     8626 +
     8627 +                /*
     8628 +                 * Now that we know that the next_lb checks out alright, we
     8629 +                 * can start reconstruction from this lb - we can be sure
     8630 +                 * that the L2ARC write hand has not yet reached any of our
     8631 +                 * buffers.
     8632 +                 */
     8633 +                l2arc_log_blk_restore(dev, load_guid, this_lb,
     8634 +                    LBP_GET_PSIZE(&lb_ptrs[0]));
     8635 +
     8636 +                /*
     8637 +                 * End of list detection. We can look ahead two steps in the
     8638 +                 * blk chain and if the 2nd blk from this_lb dips below the
     8639 +                 * initial chain starting point, then we know two things:
     8640 +                 *      1) it can't be valid, and
     8641 +                 *      2) the next_lb's ARC entries might have already been
     8642 +                 *      partially overwritten and so we should stop before
     8643 +                 *      we restore it
     8644 +                 */
     8645 +                if (l2arc_range_check_overlap(
     8646 +                    this_lb->lb_back2_lbp.lbp_daddr, lb_ptrs[0].lbp_daddr,
     8647 +                    dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
     8648 +                    !first_pass)
     8649 +                        break;
     8650 +
     8651 +                /* log blk restored, continue with next one in the list */
     8652 +                lb_ptrs[0] = lb_ptrs[1];
     8653 +                lb_ptrs[1] = this_lb->lb_back2_lbp;
     8654 +                PTR_SWAP(this_lb, next_lb);
     8655 +                PTR_SWAP(this_lb_buf, next_lb_buf);
     8656 +                this_io = next_io;
     8657 +                next_io = NULL;
     8658 +                first_pass = B_FALSE;
     8659 +
     8660 +                for (;;) {
     8661 +                        if (dev->l2ad_rebuild_cancel) {
     8662 +                                err = SET_ERROR(ECANCELED);
     8663 +                                goto out;
     8664 +                        }
     8665 +                        if (spa_config_tryenter(spa, SCL_L2ARC, vd,
     8666 +                            RW_READER)) {
     8667 +                                lock_held = B_TRUE;
     8668 +                                break;
     8669 +                        }
     8670 +                        /*
      8671 +                         * The L2ARC config lock is held by somebody as
      8672 +                         * writer, possibly because they are trying to
      8673 +                         * remove us. They likely want us to shut down, so
      8674 +                         * after a little delay, we check
      8675 +                         * l2ad_rebuild_cancel and retry the lock.
     8676 +                         */
     8677 +                        delay(1);
     8678 +                }
     8679 +        }
     8680 +out:
     8681 +        if (next_io != NULL)
     8682 +                l2arc_log_blk_prefetch_abort(next_io);
     8683 +        kmem_free(this_lb, sizeof (*this_lb));
     8684 +        kmem_free(next_lb, sizeof (*next_lb));
     8685 +        kmem_free(this_lb_buf, sizeof (l2arc_log_blk_phys_t));
     8686 +        kmem_free(next_lb_buf, sizeof (l2arc_log_blk_phys_t));
     8687 +        if (err == 0)
     8688 +                ARCSTAT_BUMP(arcstat_l2_rebuild_successes);
     8689 +
     8690 +        if (lock_held)
     8691 +                spa_config_exit(spa, SCL_L2ARC, vd);
     8692 +
     8693 +        return (err);
     8694 +}
     8695 +
     8696 +/*
     8697 + * Attempts to read the device header on the provided L2ARC device and writes
      8698 + * it to dev->l2ad_dev_hdr. On success, this function returns 0; otherwise an
      8699 + * appropriate error code is returned.
     8700 + */
     8701 +static int
     8702 +l2arc_dev_hdr_read(l2arc_dev_t *dev)
     8703 +{
     8704 +        int                     err;
     8705 +        uint64_t                guid;
     8706 +        zio_cksum_t             cksum;
     8707 +        l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
     8708 +        const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
      8709 +        abd_t                   *abd;
     8710 +
     8711 +        guid = spa_guid(dev->l2ad_vdev->vdev_spa);
     8712 +
     8713 +        abd = abd_get_from_buf(hdr, hdr_asize);
     8714 +        err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
     8715 +            VDEV_LABEL_START_SIZE, hdr_asize, abd,
     8716 +            ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
     8717 +            ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
     8718 +            ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
     8719 +        abd_put(abd);
     8720 +        if (err != 0) {
     8721 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
     8722 +                return (err);
     8723 +        }
     8724 +
     8725 +        if (hdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC_V1))
     8726 +                byteswap_uint64_array(hdr, sizeof (*hdr));
     8727 +
     8728 +        if (hdr->dh_magic != L2ARC_DEV_HDR_MAGIC_V1 ||
     8729 +            hdr->dh_spa_guid != guid) {
     8730 +                /*
     8731 +                 * Attempt to rebuild a device containing no actual dev hdr
     8732 +                 * or containing a header from some other pool.
     8733 +                 */
     8734 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
     8735 +                return (SET_ERROR(ENOTSUP));
     8736 +        }
     8737 +
     8738 +        l2arc_dev_hdr_checksum(hdr, &cksum);
     8739 +        if (!ZIO_CHECKSUM_EQUAL(hdr->dh_self_cksum, cksum)) {
     8740 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
     8741 +                return (SET_ERROR(EINVAL));
     8742 +        }
     8743 +
     8744 +        return (0);
     8745 +}
     8746 +
     8747 +/*
     8748 + * Reads L2ARC log blocks from storage and validates their contents.
     8749 + *
     8750 + * This function implements a simple prefetcher to make sure that while
     8751 + * we're processing one buffer the L2ARC is already prefetching the next
     8752 + * one in the chain.
     8753 + *
     8754 + * The arguments this_lp and next_lp point to the current and next log blk
     8755 + * address in the block chain. Similarly, this_lb and next_lb hold the
     8756 + * l2arc_log_blk_phys_t's of the current and next L2ARC blk. The this_lb_buf
     8757 + * and next_lb_buf must be buffers of appropriate to hold a raw
      8758 + * and next_lb_buf must be buffers of appropriate size to hold a raw
     8759 + * to buffer decompression).
     8760 + *
     8761 + * The `this_io' and `next_io' arguments are used for block prefetching.
     8762 + * When issuing the first blk IO during rebuild, you should pass NULL for
     8763 + * `this_io'. This function will then issue a sync IO to read the block and
     8764 + * also issue an async IO to fetch the next block in the block chain. The
     8765 + * prefetch IO is returned in `next_io'. On subsequent calls to this
     8766 + * function, pass the value returned in `next_io' from the previous call
     8767 + * as `this_io' and a fresh `next_io' pointer to hold the next prefetch IO.
     8768 + * Prior to the call, you should initialize your `next_io' pointer to be
     8769 + * NULL. If no prefetch IO was issued, the pointer is left set at NULL.
     8770 + *
     8771 + * On success, this function returns 0, otherwise it returns an appropriate
     8772 + * error code. On error the prefetching IO is aborted and cleared before
     8773 + * returning from this function. Therefore, if we return `success', the
     8774 + * caller can assume that we have taken care of cleanup of prefetch IOs.
     8775 + */
     8776 +static int
     8777 +l2arc_log_blk_read(l2arc_dev_t *dev,
     8778 +    const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
     8779 +    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
     8780 +    uint8_t *this_lb_buf, uint8_t *next_lb_buf,
     8781 +    zio_t *this_io, zio_t **next_io)
     8782 +{
     8783 +        int             err = 0;
     8784 +        zio_cksum_t     cksum;
     8785 +
     8786 +        ASSERT(this_lbp != NULL && next_lbp != NULL);
     8787 +        ASSERT(this_lb != NULL && next_lb != NULL);
     8788 +        ASSERT(this_lb_buf != NULL && next_lb_buf != NULL);
     8789 +        ASSERT(next_io != NULL && *next_io == NULL);
     8790 +        ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
     8791 +
     8792 +        /*
     8793 +         * Check to see if we have issued the IO for this log blk in a
     8794 +         * previous run. If not, this is the first call, so issue it now.
     8795 +         */
     8796 +        if (this_io == NULL) {
     8797 +                this_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, this_lbp,
     8798 +                    this_lb_buf);
     8799 +        }
     8800 +
     8801 +        /*
     8802 +         * Peek to see if we can start issuing the next IO immediately.
     8803 +         */
     8804 +        if (l2arc_log_blkptr_valid(dev, next_lbp)) {
     8805 +                /*
     8806 +                 * Start issuing IO for the next log blk early - this
     8807 +                 * should help keep the L2ARC device busy while we
     8808 +                 * decompress and restore this log blk.
     8809 +                 */
     8810 +                *next_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, next_lbp,
     8811 +                    next_lb_buf);
     8812 +        }
     8813 +
     8814 +        /* Wait for the IO to read this log block to complete */
     8815 +        if ((err = zio_wait(this_io)) != 0) {
     8816 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
     8817 +                goto cleanup;
     8818 +        }
     8819 +
     8820 +        /* Make sure the buffer checks out */
     8821 +        fletcher_4_native(this_lb_buf, LBP_GET_PSIZE(this_lbp), NULL, &cksum);
     8822 +        if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
     8823 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
     8824 +                err = SET_ERROR(EINVAL);
     8825 +                goto cleanup;
     8826 +        }
     8827 +
     8828 +        /* Now we can take our time decoding this buffer */
     8829 +        switch (LBP_GET_COMPRESS(this_lbp)) {
     8830 +        case ZIO_COMPRESS_OFF:
     8831 +                bcopy(this_lb_buf, this_lb, sizeof (*this_lb));
     8832 +                break;
     8833 +        case ZIO_COMPRESS_LZ4:
     8834 +                err = zio_decompress_data_buf(LBP_GET_COMPRESS(this_lbp),
     8835 +                    this_lb_buf, this_lb, LBP_GET_PSIZE(this_lbp),
     8836 +                    sizeof (*this_lb));
     8837 +                if (err != 0) {
     8838 +                        err = SET_ERROR(EINVAL);
     8839 +                        goto cleanup;
     8840 +                }
     8841 +
     8842 +                break;
     8843 +        default:
     8844 +                err = SET_ERROR(EINVAL);
     8845 +                break;
     8846 +        }
     8847 +
     8848 +        if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
     8849 +                byteswap_uint64_array(this_lb, sizeof (*this_lb));
     8850 +
     8851 +        if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
     8852 +                err = SET_ERROR(EINVAL);
     8853 +                goto cleanup;
     8854 +        }
     8855 +
     8856 +cleanup:
     8857 +        /* Abort an in-flight prefetch I/O in case of error */
     8858 +        if (err != 0 && *next_io != NULL) {
     8859 +                l2arc_log_blk_prefetch_abort(*next_io);
     8860 +                *next_io = NULL;
     8861 +        }
     8862 +        return (err);
     8863 +}
     8864 +
     8865 +/*
     8866 + * Restores the payload of a log blk to ARC. This creates empty ARC hdr
     8867 + * entries which only contain an l2arc hdr, essentially restoring the
     8868 + * buffers to their L2ARC evicted state. This function also updates space
     8869 + * usage on the L2ARC vdev to make sure it tracks restored buffers.
     8870 + */
     8871 +static void
     8872 +l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
     8873 +    const l2arc_log_blk_phys_t *lb, uint64_t lb_psize)
     8874 +{
     8875 +        uint64_t        size = 0, psize = 0;
     8876 +
     8877 +        for (int i = L2ARC_LOG_BLK_ENTRIES - 1; i >= 0; i--) {
     8878 +                /*
     8879 +                 * Restore goes in the reverse temporal direction to preserve
     8880 +                 * correct temporal ordering of buffers in the l2ad_buflist.
     8881 +                 * l2arc_hdr_restore also does a list_insert_tail instead of
     8882 +                 * list_insert_head on the l2ad_buflist:
     8883 +                 *
     8884 +                 *              LIST    l2ad_buflist            LIST
     8885 +                 *              HEAD  <------ (time) ------     TAIL
     8886 +                 * direction    +-----+-----+-----+-----+-----+    direction
     8887 +                 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
     8888 +                 * fill         +-----+-----+-----+-----+-----+
     8889 +                 *              ^                               ^
     8890 +                 *              |                               |
     8891 +                 *              |                               |
     8892 +                 *      l2arc_fill_thread               l2arc_rebuild
     8893 +                 *      places new bufs here            restores bufs here
     8894 +                 *
     8895 +                 * This also works when the restored bufs get evicted at any
     8896 +                 * point during the rebuild.
     8897 +                 */
     8898 +                l2arc_hdr_restore(&lb->lb_entries[i], dev, load_guid);
     8899 +                size += LE_GET_LSIZE(&lb->lb_entries[i]);
     8900 +                psize += LE_GET_PSIZE(&lb->lb_entries[i]);
     8901 +        }
     8902 +
     8903 +        /*
     8904 +         * Record rebuild stats:
     8905 +         *      size            In-memory size of restored buffer data in ARC
     8906 +         *      psize           Physical size of restored buffers in the L2ARC
     8907 +         *      bufs            # of ARC buffer headers restored
     8908 +         *      log_blks        # of L2ARC log entries processed during restore
     8909 +         */
     8910 +        ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
     8911 +        ARCSTAT_INCR(arcstat_l2_rebuild_psize, psize);
     8912 +        ARCSTAT_INCR(arcstat_l2_rebuild_bufs, L2ARC_LOG_BLK_ENTRIES);
     8913 +        ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
     8914 +        ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, lb_psize);
     8915 +        ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, psize / lb_psize);
     8916 +        vdev_space_update(dev->l2ad_vdev, psize, 0, 0);
     8917 +}
     8918 +
     8919 +/*
     8920 + * Restores a single ARC buf hdr from a log block. The ARC buffer is put
     8921 + * into a state indicating that it has been evicted to L2ARC.
     8922 + */
     8923 +static void
     8924 +l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev,
     8925 +    uint64_t load_guid)
     8926 +{
     8927 +        arc_buf_hdr_t           *hdr, *exists;
     8928 +        kmutex_t                *hash_lock;
     8929 +        arc_buf_contents_t      type = LE_GET_TYPE(le);
     8930 +
     8931 +        /*
      8932 +         * Do all the allocation before grabbing any locks; this lets us
      8933 +         * sleep if memory is full, and we don't have to deal with failed
     8934 +         * allocations.
     8935 +         */
     8936 +        hdr = arc_buf_alloc_l2only(load_guid, type, dev, le->le_dva,
     8937 +            le->le_daddr, LE_GET_LSIZE(le), LE_GET_PSIZE(le),
     8938 +            le->le_birth, le->le_freeze_cksum, LE_GET_CHECKSUM(le),
     8939 +            LE_GET_COMPRESS(le), LE_GET_ARC_COMPRESS(le));
     8940 +
     8941 +        ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr));
     8942 +        ARCSTAT_INCR(arcstat_l2_psize, arc_hdr_size(hdr));
     8943 +
     8944 +        mutex_enter(&dev->l2ad_mtx);
     8945 +        /*
     8946 +         * We connect the l2hdr to the hdr only after the hdr is in the hash
     8947 +         * table, otherwise the rest of the arc hdr manipulation machinery
     8948 +         * might get confused.
     8949 +         */
     8950 +        list_insert_tail(&dev->l2ad_buflist, hdr);
     8951 +        (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
     8952 +        mutex_exit(&dev->l2ad_mtx);
     8953 +
     8954 +        exists = buf_hash_insert(hdr, &hash_lock);
     8955 +        if (exists) {
     8956 +                /* Buffer was already cached, no need to restore it. */
     8957 +                arc_hdr_destroy(hdr);
     8958 +                mutex_exit(hash_lock);
     8959 +                ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
     8960 +                return;
     8961 +        }
     8962 +
     8963 +        mutex_exit(hash_lock);
     8964 +}
     8965 +
     8966 +/*
      8967 + * Completion callback for PL2ARC functions that do asynchronous physical
      8968 + * reads and writes; it releases the abd wrapper passed in as io_private.
     8969 + */
     8970 +static void
     8971 +pl2arc_io_done(zio_t *zio)
     8972 +{
     8973 +        abd_put(zio->io_private);
     8974 +        zio->io_private = NULL;
     8975 +}
     8976 +
     8977 +/*
     8978 + * Starts an asynchronous read IO to read a log block. This is used in log
     8979 + * block reconstruction to start reading the next block before we are done
     8980 + * decoding and reconstructing the current block, to keep the l2arc device
     8981 + * nice and hot with read IO to process.
      8982 + * The data is read into the caller-supplied lb_buf; the read zio's done
      8983 + * callback (pl2arc_io_done) releases the abd wrapper once the IO completes.
      8984 + * If you wish to abort this zio, you should do so using
      8985 + * l2arc_log_blk_prefetch_abort, which takes care of waiting for the IO and
      8986 + * disposing of the abd correctly.
     8987 + */
     8988 +static zio_t *
     8989 +l2arc_log_blk_prefetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
     8990 +    uint8_t *lb_buf)
     8991 +{
     8992 +        uint32_t        psize;
     8993 +        zio_t           *pio;
     8994 +        abd_t           *abd;
     8995 +
     8996 +        psize = LBP_GET_PSIZE(lbp);
     8997 +        ASSERT(psize <= sizeof (l2arc_log_blk_phys_t));
     8998 +        pio = zio_root(vd->vdev_spa, NULL, NULL, ZIO_FLAG_DONT_CACHE |
     8999 +            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
     9000 +            ZIO_FLAG_DONT_RETRY);
     9001 +        abd = abd_get_from_buf(lb_buf, psize);
     9002 +        (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, psize,
     9003 +            abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
      9004 +            ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
     9005 +            ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
     9006 +
     9007 +        return (pio);
     9008 +}
     9009 +
     9010 +/*
     9011 + * Aborts a zio returned from l2arc_log_blk_prefetch and frees the data
     9012 + * buffers allocated for it.
     9013 + */
     9014 +static void
     9015 +l2arc_log_blk_prefetch_abort(zio_t *zio)
     9016 +{
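           +        /*
           +         * There is no mechanism to cancel an in-flight physical read here,
           +         * so simply wait for it to complete; the read's done callback
           +         * (pl2arc_io_done) releases the abd wrapping the caller's buffer.
           +         */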
     9017 +        (void) zio_wait(zio);
     9018 +}
     9019 +
     9020 +/*
     9021 + * Creates a zio to update the device header on an l2arc device. The zio is
     9022 + * initiated as a child of `pio'.
     9023 + */
     9024 +static void
     9025 +l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio)
     9026 +{
     9027 +        zio_t                   *wzio;
     9028 +        abd_t                   *abd;
     9029 +        l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
     9030 +        const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
     9031 +
     9032 +        hdr->dh_magic = L2ARC_DEV_HDR_MAGIC_V1;
     9033 +        hdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
     9034 +        hdr->dh_alloc_space = refcount_count(&dev->l2ad_alloc);
     9035 +        hdr->dh_flags = 0;
     9036 +        if (dev->l2ad_first)
     9037 +                hdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
     9038 +
     9039 +        /* checksum operation goes last */
     9040 +        l2arc_dev_hdr_checksum(hdr, &hdr->dh_self_cksum);
     9041 +
     9042 +        abd = abd_get_from_buf(hdr, hdr_asize);
     9043 +        wzio = zio_write_phys(pio, dev->l2ad_vdev, VDEV_LABEL_START_SIZE,
     9044 +            hdr_asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
     9045 +            ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
     9046 +        DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
     9047 +        (void) zio_nowait(wzio);
     9048 +}
     9049 +
     9050 +/*
     9051 + * Commits a log block to the L2ARC device. This routine is invoked from
     9052 + * l2arc_write_buffers when the log block fills up.
     9053 + * This function allocates some memory to temporarily hold the serialized
     9054 + * buffer to be written. This is then released in l2arc_write_done.
     9055 + */
     9056 +static void
     9057 +l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
     9058 +    l2arc_write_callback_t *cb)
     9059 +{
     9060 +        l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
     9061 +        uint64_t                psize, asize;
     9062 +        l2arc_log_blk_buf_t     *lb_buf;
     9063 +        abd_t                   *abd;
     9064 +        zio_t                   *wzio;
     9065 +
     9066 +        VERIFY(dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
     9067 +
     9068 +        /* link the buffer into the block chain */
     9069 +        lb->lb_back2_lbp = dev->l2ad_dev_hdr->dh_start_lbps[1];
     9070 +        lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
     9071 +
     9072 +        /* try to compress the buffer */
     9073 +        lb_buf = kmem_zalloc(sizeof (*lb_buf), KM_SLEEP);
     9074 +        list_insert_tail(&cb->l2wcb_log_blk_buflist, lb_buf);
     9075 +        abd = abd_get_from_buf(lb, sizeof (*lb));
     9076 +        psize = zio_compress_data(ZIO_COMPRESS_LZ4, abd, lb_buf->lbb_log_blk,
     9077 +            sizeof (*lb));
     9078 +        abd_put(abd);
     9079 +        /* a log block is never entirely zero */
     9080 +        ASSERT(psize != 0);
     9081 +        asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
     9082 +        ASSERT(asize <= sizeof (lb_buf->lbb_log_blk));
     9083 +
     9084 +        /*
     9085 +         * Update the start log blk pointer in the device header to point
     9086 +         * to the log block we're about to write.
     9087 +         */
     9088 +        dev->l2ad_dev_hdr->dh_start_lbps[1] =
     9089 +            dev->l2ad_dev_hdr->dh_start_lbps[0];
     9090 +        dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
     9091 +        _NOTE(CONSTCOND)
     9092 +        LBP_SET_LSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], sizeof (*lb));
     9093 +        LBP_SET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], asize);
     9094 +        LBP_SET_CHECKSUM(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9095 +            ZIO_CHECKSUM_FLETCHER_4);
     9096 +        LBP_SET_TYPE(&dev->l2ad_dev_hdr->dh_start_lbps[0], 0);
     9097 +
     9098 +        if (asize < sizeof (*lb)) {
     9099 +                /* compression succeeded */
     9100 +                bzero(lb_buf->lbb_log_blk + psize, asize - psize);
     9101 +                LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9102 +                    ZIO_COMPRESS_LZ4);
     9103 +        } else {
     9104 +                /* compression failed */
     9105 +                bcopy(lb, lb_buf->lbb_log_blk, sizeof (*lb));
     9106 +                LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9107 +                    ZIO_COMPRESS_OFF);
     9108 +        }
     9109 +
     9110 +        /* checksum what we're about to write */
     9111 +        fletcher_4_native(lb_buf->lbb_log_blk, asize,
     9112 +            NULL, &dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_cksum);
     9113 +
     9114 +        /* perform the write itself */
     9115 +        CTASSERT(L2ARC_LOG_BLK_SIZE >= SPA_MINBLOCKSIZE &&
     9116 +            L2ARC_LOG_BLK_SIZE <= SPA_MAXBLOCKSIZE);
     9117 +        abd = abd_get_from_buf(lb_buf->lbb_log_blk, asize);
     9118 +        wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
     9119 +            asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
     9120 +            ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
     9121 +        DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
     9122 +        (void) zio_nowait(wzio);
     9123 +
     9124 +        dev->l2ad_hand += asize;
     9125 +        vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
     9126 +
     9127 +        /* bump the kstats */
     9128 +        ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
     9129 +        ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
     9130 +        ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, asize);
     9131 +        ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
     9132 +            dev->l2ad_log_blk_payload_asize / asize);
     9133 +
     9134 +        /* start a new log block */
     9135 +        dev->l2ad_log_ent_idx = 0;
     9136 +        dev->l2ad_log_blk_payload_asize = 0;
     9137 +}
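
As a worked example of the size bookkeeping above (the numbers are illustrative, not measured): if zio_compress_data() reduces a full log block to psize = 40,000 bytes on a vdev with ashift = 12 (4 KiB allocation units), vdev_psize_to_asize() rounds that up to asize = 40,960 bytes. The bzero() in the compression-succeeded branch clears the padding bytes in [40,000, 40,960), the Fletcher-4 checksum then covers the full 40,960 bytes that hit the device, and l2ad_hand advances by the same 40,960 bytes.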
     9138 +
     9139 +/*
     9140 + * Validates an L2ARC log blk address to make sure that it can be read
     9141 + * from the provided L2ARC device. Returns B_TRUE if the address is
     9142 + * within the device's bounds, or B_FALSE if not.
     9143 + */
     9144 +static boolean_t
     9145 +l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
     9146 +{
     9147 +        uint64_t psize = LBP_GET_PSIZE(lbp);
     9148 +        uint64_t end = lbp->lbp_daddr + psize;
     9149 +
     9150 +        /*
     9151 +         * A log block is valid if all of the following conditions are true:
     9152 +         * - it fits entirely between l2ad_start and l2ad_end
     9153 +         * - it has a valid size
     9154 +         */
     9155 +        return (lbp->lbp_daddr >= dev->l2ad_start && end <= dev->l2ad_end &&
     9156 +            psize > 0 && psize <= sizeof (l2arc_log_blk_phys_t));
     9157 +}
     9158 +
     9159 +/*
     9160 + * Computes the checksum of `hdr' and stores it in `cksum'.
     9161 + */
     9162 +static void
     9163 +l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr, zio_cksum_t *cksum)
     9164 +{
     9165 +        fletcher_4_native((uint8_t *)hdr +
     9166 +            offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
     9167 +            sizeof (*hdr) - offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
     9168 +            NULL, cksum);
     9169 +}
     9170 +
     9171 +/*
     9172 + * Inserts ARC buffer `ab' into the current L2ARC log blk on the device.
     9173 + * The buffer being inserted must be present in L2ARC.
     9174 + * Returns B_TRUE if the L2ARC log blk is full and needs to be committed
     9175 + * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
     9176 + */
     9177 +static boolean_t
     9178 +l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *ab)
     9179 +{
     9180 +        l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
     9181 +        l2arc_log_ent_phys_t    *le;
     9182 +        int                     index = dev->l2ad_log_ent_idx++;
     9183 +
     9184 +        ASSERT(index < L2ARC_LOG_BLK_ENTRIES);
     9185 +
     9186 +        le = &lb->lb_entries[index];
     9187 +        bzero(le, sizeof (*le));
     9188 +        le->le_dva = ab->b_dva;
     9189 +        le->le_birth = ab->b_birth;
     9190 +        le->le_daddr = ab->b_l2hdr.b_daddr;
     9191 +        LE_SET_LSIZE(le, HDR_GET_LSIZE(ab));
     9192 +        LE_SET_PSIZE(le, HDR_GET_PSIZE(ab));
     9193 +
     9194 +        if ((ab->b_flags & ARC_FLAG_COMPRESSED_ARC) != 0) {
     9195 +                LE_SET_ARC_COMPRESS(le, 1);
     9196 +                LE_SET_COMPRESS(le, HDR_GET_COMPRESS(ab));
     9197 +        } else {
     9198 +                ASSERT3U(HDR_GET_COMPRESS(ab), ==, ZIO_COMPRESS_OFF);
     9199 +                LE_SET_ARC_COMPRESS(le, 0);
     9200 +                LE_SET_COMPRESS(le, ZIO_COMPRESS_OFF);
     9201 +        }
     9202 +
     9203 +        if (ab->b_freeze_cksum != NULL) {
     9204 +                le->le_freeze_cksum = *ab->b_freeze_cksum;
     9205 +                LE_SET_CHECKSUM(le, ZIO_CHECKSUM_FLETCHER_2);
     9206 +        } else {
     9207 +                LE_SET_CHECKSUM(le, ZIO_CHECKSUM_OFF);
     9208 +        }
     9209 +
     9210 +        LE_SET_TYPE(le, arc_flags_to_bufc(ab->b_flags));
     9211 +        dev->l2ad_log_blk_payload_asize += arc_hdr_size((arc_buf_hdr_t *)ab);
     9212 +
     9213 +        return (dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
     9214 +}
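
A minimal sketch of how a writer consumes the return value; the surrounding loop lives in l2arc_write_buffers, which is outside this hunk, so treat the snippet as illustrative of the calling convention rather than the exact code:

	/* after hdr has been queued for writing to the L2ARC device */
	if (l2arc_log_blk_insert(dev, hdr)) {
		/*
		 * The log block is full: compress it, chain it to the
		 * previously committed block and write it out.
		 */
		l2arc_log_blk_commit(dev, pio, cb);
	}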
     9215 +
     9216 +/*
     9217 + * Checks whether a given L2ARC device address sits in a time-sequential
     9218 + * range. The trick here is that the L2ARC is a rotary buffer, so we can't
     9219 + * just do a range comparison, we need to handle the situation in which the
     9220 + * range wraps around the end of the L2ARC device. Arguments:
     9221 + *      bottom  Lower end of the range to check (written to earlier).
     9222 + *      top     Upper end of the range to check (written to later).
     9223 + *      check   The address for which we want to determine if it sits in
     9224 + *              between the top and bottom.
     9225 + *
     9226 + * The 3-way conditional below represents the following cases:
     9227 + *
     9228 + *      bottom < top : Sequentially ordered case:
     9229 + *        <check>--------+-------------------+
     9230 + *                       |  (overlap here?)  |
     9231 + *       L2ARC dev       V                   V
     9232 + *       |---------------<bottom>============<top>--------------|
     9233 + *
     9234 + *      bottom > top: Looped-around case:
     9235 + *                            <check>--------+------------------+
     9236 + *                                           |  (overlap here?) |
     9237 + *       L2ARC dev                           V                  V
     9238 + *       |===============<top>---------------<bottom>===========|
     9239 + *       ^               ^
     9240 + *       |  (or here?)   |
     9241 + *       +---------------+---------<check>
     9242 + *
     9243 + *      top == bottom : Just a single address comparison.
     9244 + */
     9245 +static inline boolean_t
     9246 +l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
     9247 +{
     9248 +        if (bottom < top)
     9249 +                return (bottom <= check && check <= top);
     9250 +        else if (bottom > top)
     9251 +                return (check <= top || bottom <= check);
     9252 +        else
     9253 +                return (check == top);
7375 9254  }
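
A few illustrative cases for the overlap check (the offsets are arbitrary device byte addresses, not taken from a real pool):

	/* sequential case: bottom < top */
	ASSERT(l2arc_range_check_overlap(0x1000, 0x4000, 0x2000));   /* inside */
	ASSERT(!l2arc_range_check_overlap(0x1000, 0x4000, 0x5000));  /* outside */

	/* wrapped-around case: bottom > top */
	ASSERT(l2arc_range_check_overlap(0x9000, 0x2000, 0x9500));   /* after bottom */
	ASSERT(l2arc_range_check_overlap(0x9000, 0x2000, 0x1000));   /* before top */
	ASSERT(!l2arc_range_check_overlap(0x9000, 0x2000, 0x5000));  /* in the gap */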
    