NEX-19742 A race between ARC and L2ARC causes system panic
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrect when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrect when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-8057 renaming of mount points should not be allowed (redo)
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5785 zdb: assertion failed for thread 0xf8a20240, thread-id 130: mp->initialized == B_TRUE, file ../common/kernel.c, line 162
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-4228 dedup arcstats are redundant
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-7317 Getting assert !refcount_is_zero(&scl->scl_count) when trying to import pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "Merge pull request #520 in OS/nza-kernel from ~SASO.KISELKOV/nza-kernel:NEX-5671-pl2arc-le_psize to master"
This reverts commit b63e91b939886744224854ea365d70e05ddd6077, reversing
changes made to a6e3a0255c8b22f65343bf641ffefaf9ae948fd4.
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6421 Add missing multilist_destroy calls to arc_fini
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
5219 l2arc_write_buffers() may write beyond target_sz
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Justin Gibbs <gibbs@FreeBSD.org>
Approved by: Matthew Ahrens <mahrens@delphix.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6220 memleak in l2arc on debug build
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
5847 libzfs_diff should check zfs_prop_get() return
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-3879 L2ARC evict task allocates a useless struct
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption (fix pL2ARC integration)
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3979 fix arc_mru/mfu typo
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3961 arc_meta_max is not counted correctly
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3946 Port Illumos 5983 to release-5.0
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
NEX-3945 file-backed cache devices considered harmful
Reviewed by: Alek Pinchuk <alek@nexenta.com>
NEX-3541 Implement persistent L2ARC - fix build breakage in libzpool (v2).
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3630 Backport illumos #5701 from master to 5.0
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3558 KRRP Integration
NEX-3387 ARC stats appear to be in wrong/weird order
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3296 turn on DDT limit by default
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3079 port illumos ARC improvements
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0 (part 2)
NEX-2704 smbstat man page needs update
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0
3995 Memory leak of compressed buffers in l2arc_write_done
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
NEX-463: bumped max queue size for L2ARC async evict
The maximum length of the taskq used for async ARC and L2ARC flushes is
now a tunable (zfs_flush_ntasks), initialized to 64.
That number is equally arbitrary, yet higher than the original 4.
The real fix should rework L2ARC eviction per OS-53, but for now a
longer queue should suffice.
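As a rough sketch of how such a tunable might be wired up (the function name
arc_flush_taskq_init, the variable arc_flush_taskq and the taskq name are
made up for illustration; only zfs_flush_ntasks comes from the change), the
value can simply be passed as the entry limits of a DDI taskq:

/* Tunable: maximum number of queued async ARC/L2ARC flush tasks. */
int zfs_flush_ntasks = 64;

static taskq_t *arc_flush_taskq;	/* hypothetical name */

static void
arc_flush_taskq_init(void)
{
	/* Size the flush taskq from the tunable rather than the old hard-coded 4. */
	arc_flush_taskq = taskq_create("arc_flush_tq", max_ncpus,
	    minclsyspri, zfs_flush_ntasks, zfs_flush_ntasks,
	    TASKQ_PREPOPULATE);
}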
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (one cache line) and place each kmutex_t on a cache line by itself to avoid false sharing
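For illustration, a minimal sketch of that layout (the struct and array names
below are hypothetical, not the actual arc.c/dbuf.c symbols): each kmutex_t is
padded out to a full 64-byte cache line, and the array itself is cache-line
aligned with the Studio pragma used elsewhere in this code:

#define	CACHE_LINE_SIZE	64

typedef struct padded_mutex {
	kmutex_t	pm_lock;
	/* pad so that adjacent locks never share a cache line */
	char		pm_pad[CACHE_LINE_SIZE - sizeof (kmutex_t)];
} padded_mutex_t;

#pragma align 64(example_mutexes)
static padded_mutex_t example_mutexes[256];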
re #14119 BAD-TRAP panic under load
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
re #13729 assign each ARC hash bucket its own mutex
In the ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM, but the number of locks
protecting the headers in those buckets is fixed at 256.
Hence, on systems with large memory (>= 128GB), too many unrelated buffer
headers are protected by the same mutex.
When memory in the system is fragmented, this may cause a deadlock:
- An arc_read thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread that is supposed to evict some buffers calls
an evict callback on one of the buffers.
- Before freeing the memory, the callback attempts to take the lock on the
buffer header.
- Incidentally, that buffer header may be protected by the same lock as
the one held by the arc_read() thread, and the two threads deadlock.
The solution in this patch is not perfect - all headers in a hash bucket
are still protected by the same lock.
However, the probability of collision is very low and does not depend on
memory size.
By the same argument, padding the locks to a cache line looks like a waste
of memory here, since the probability of contention on a cache line is
quite low, given the number of buckets, the number of locks per cache line
(4), and the fact that the hash function (crc64 % hash table size) is
supposed to be a very good randomizer.
The effect on memory usage is as follows. For a hash table of size n:
- the original code uses 16K + 16 + n * 8 bytes of memory;
- this fix uses 2 * n * 8 + 8 bytes of memory;
- the net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size.
For 128GB of physical memory it is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller,
and for larger memory configurations it is proportionally bigger.
The patch has been tested for 30+ hours using a vdbench script that
reproduces the hang with the original code 100% of the time within
20-30 minutes.
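In outline, the fix gives every hash bucket its own embedded lock instead of
sharing 256 padded locks across all buckets; the sketch below is consistent
with the ht_table and buf_hash_table definitions in the patch further down
(field comments added here for explanation only):

/* One bucket: the head of its collision chain and the lock protecting it. */
struct ht_table {
	arc_buf_hdr_t	*hdr;		/* first header in this bucket */
	kmutex_t	lock;		/* protects only this bucket */
};

typedef struct buf_hash_table {
	uint64_t	ht_mask;	/* number of buckets - 1 */
	struct ht_table	*ht_table;	/* sized from physical memory */
} buf_hash_table_t;

/* A header's lock is now found through its bucket index. */
#define	BUF_HASH_INDEX(spa, dva, birth) \
	(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define	BUF_HASH_LOCK(idx)	(&buf_hash_table.ht_table[idx].lock)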
re #10054 rb4467 Support for asynchronous ARC/L2ARC eviction
re #13165 rb4265 zfs-monitor should fallback to using DEV_BSIZE
re #10054 rb4249 Long export time causes failover to fail

          --- old/usr/src/uts/common/fs/zfs/arc.c
          +++ new/usr/src/uts/common/fs/zfs/arc.c
↓ open down ↓ 15 lines elided ↑ open up ↑
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   23   * Copyright (c) 2018, Joyent, Inc.
  24   24   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25   25   * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
  26      - * Copyright 2017 Nexenta Systems, Inc.  All rights reserved.
       26 + * Copyright 2019 Nexenta Systems, Inc.  All rights reserved.
  27   27   */
  28   28  
  29   29  /*
  30   30   * DVA-based Adjustable Replacement Cache
  31   31   *
  32   32   * While much of the theory of operation used here is
  33   33   * based on the self-tuning, low overhead replacement cache
  34   34   * presented by Megiddo and Modha at FAST 2003, there are some
  35   35   * significant differences:
  36   36   *
↓ open down ↓ 209 lines elided ↑ open up ↑
 246  246   * that when compressed ARC is enabled that the L2ARC blocks are identical
 247  247   * to the on-disk block in the main data pool. This provides a significant
 248  248   * advantage since the ARC can leverage the bp's checksum when reading from the
 249  249   * L2ARC to determine if the contents are valid. However, if the compressed
 250  250   * ARC is disabled, then the L2ARC's block must be transformed to look
 251  251   * like the physical block in the main data pool before comparing the
 252  252   * checksum and determining its validity.
 253  253   */
 254  254  
 255  255  #include <sys/spa.h>
      256 +#include <sys/spa_impl.h>
 256  257  #include <sys/zio.h>
 257  258  #include <sys/spa_impl.h>
 258  259  #include <sys/zio_compress.h>
 259  260  #include <sys/zio_checksum.h>
 260  261  #include <sys/zfs_context.h>
 261  262  #include <sys/arc.h>
 262  263  #include <sys/refcount.h>
 263  264  #include <sys/vdev.h>
 264  265  #include <sys/vdev_impl.h>
 265  266  #include <sys/dsl_pool.h>
↓ open down ↓ 2 lines elided ↑ open up ↑
 268  269  #include <sys/abd.h>
 269  270  #ifdef _KERNEL
 270  271  #include <sys/vmsystm.h>
 271  272  #include <vm/anon.h>
 272  273  #include <sys/fs/swapnode.h>
 273  274  #include <sys/dnlc.h>
 274  275  #endif
 275  276  #include <sys/callb.h>
 276  277  #include <sys/kstat.h>
 277  278  #include <zfs_fletcher.h>
 278      -#include <sys/aggsum.h>
 279      -#include <sys/cityhash.h>
      279 +#include <sys/byteorder.h>
      280 +#include <sys/spa_impl.h>
 280  281  
 281  282  #ifndef _KERNEL
 282  283  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 283  284  boolean_t arc_watch = B_FALSE;
 284  285  int arc_procfd;
 285  286  #endif
 286  287  
 287  288  static kmutex_t         arc_reclaim_lock;
 288  289  static kcondvar_t       arc_reclaim_thread_cv;
 289  290  static boolean_t        arc_reclaim_thread_exit;
↓ open down ↓ 60 lines elided ↑ open up ↑
 350  351   */
 351  352  int arc_zio_arena_free_shift = 2;
 352  353  
 353  354  /*
 354  355   * These tunables are for performance analysis.
 355  356   */
 356  357  uint64_t zfs_arc_max;
 357  358  uint64_t zfs_arc_min;
 358  359  uint64_t zfs_arc_meta_limit = 0;
 359  360  uint64_t zfs_arc_meta_min = 0;
      361 +uint64_t zfs_arc_ddt_limit = 0;
      362 +/*
      363 + * Tunable to control "dedup ceiling"
      364 + * Possible values:
       365 + *  DDT_NO_LIMIT        - default behaviour, i.e. no ceiling
       366 + *  DDT_LIMIT_TO_ARC    - stop DDT growth if DDT is bigger than its "ARC space"
      367 + *  DDT_LIMIT_TO_L2ARC  - stop DDT growth when DDT size is bigger than the
      368 + *                        L2ARC DDT dev(s) for that pool
      369 + */
      370 +zfs_ddt_limit_t zfs_ddt_limit_type = DDT_LIMIT_TO_ARC;
      371 +/*
      372 + * Alternative to the above way of controlling "dedup ceiling":
       373 + * Stop DDT growth when the in-core DDT size is above the tunable below.
      374 + * This tunable overrides the zfs_ddt_limit_type tunable.
      375 + */
      376 +uint64_t zfs_ddt_byte_ceiling = 0;
      377 +boolean_t zfs_arc_segregate_ddt = B_TRUE;
 360  378  int zfs_arc_grow_retry = 0;
 361  379  int zfs_arc_shrink_shift = 0;
 362  380  int zfs_arc_p_min_shift = 0;
 363  381  int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 364  382  
      383 +/* Tuneable, default is 64, which is essentially arbitrary */
      384 +int zfs_flush_ntasks = 64;
      385 +
 365  386  boolean_t zfs_compressed_arc_enabled = B_TRUE;
 366  387  
 367  388  /*
 368  389   * Note that buffers can be in one of 6 states:
 369  390   *      ARC_anon        - anonymous (discussed below)
 370  391   *      ARC_mru         - recently used, currently cached
 371  392   *      ARC_mru_ghost   - recently used, no longer in cache
 372  393   *      ARC_mfu         - frequently used, currently cached
 373  394   *      ARC_mfu_ghost   - frequently used, no longer in cache
 374  395   *      ARC_l2c_only    - exists in L2ARC but not other states
↓ open down ↓ 25 lines elided ↑ open up ↑
 400  421          /*
 401  422           * list of evictable buffers
 402  423           */
 403  424          multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
 404  425          /*
 405  426           * total amount of evictable data in this state
 406  427           */
 407  428          refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 408  429          /*
 409  430           * total amount of data in this state; this includes: evictable,
 410      -         * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
      431 +         * non-evictable, ARC_BUFC_DATA, ARC_BUFC_METADATA and ARC_BUFC_DDT.
      432 +         * ARC_BUFC_DDT list is only populated when zfs_arc_segregate_ddt is
      433 +         * true.
 411  434           */
 412  435          refcount_t arcs_size;
 413  436  } arc_state_t;
 414  437  
      438 +/*
      439 + * We loop through these in l2arc_write_buffers() starting from
      440 + * PRIORITY_MFU_DDT until we reach PRIORITY_NUMTYPES or the buffer that we
      441 + * will be writing to L2ARC dev gets full.
      442 + */
      443 +enum l2arc_priorities {
      444 +        PRIORITY_MFU_DDT,
      445 +        PRIORITY_MRU_DDT,
      446 +        PRIORITY_MFU_META,
      447 +        PRIORITY_MRU_META,
      448 +        PRIORITY_MFU_DATA,
      449 +        PRIORITY_MRU_DATA,
      450 +        PRIORITY_NUMTYPES,
      451 +};
      452 +
 415  453  /* The 6 states: */
 416  454  static arc_state_t ARC_anon;
 417  455  static arc_state_t ARC_mru;
 418  456  static arc_state_t ARC_mru_ghost;
 419  457  static arc_state_t ARC_mfu;
 420  458  static arc_state_t ARC_mfu_ghost;
 421  459  static arc_state_t ARC_l2c_only;
 422  460  
 423  461  typedef struct arc_stats {
 424  462          kstat_named_t arcstat_hits;
      463 +        kstat_named_t arcstat_ddt_hits;
 425  464          kstat_named_t arcstat_misses;
 426  465          kstat_named_t arcstat_demand_data_hits;
 427  466          kstat_named_t arcstat_demand_data_misses;
 428  467          kstat_named_t arcstat_demand_metadata_hits;
 429  468          kstat_named_t arcstat_demand_metadata_misses;
      469 +        kstat_named_t arcstat_demand_ddt_hits;
      470 +        kstat_named_t arcstat_demand_ddt_misses;
 430  471          kstat_named_t arcstat_prefetch_data_hits;
 431  472          kstat_named_t arcstat_prefetch_data_misses;
 432  473          kstat_named_t arcstat_prefetch_metadata_hits;
 433  474          kstat_named_t arcstat_prefetch_metadata_misses;
      475 +        kstat_named_t arcstat_prefetch_ddt_hits;
      476 +        kstat_named_t arcstat_prefetch_ddt_misses;
 434  477          kstat_named_t arcstat_mru_hits;
 435  478          kstat_named_t arcstat_mru_ghost_hits;
 436  479          kstat_named_t arcstat_mfu_hits;
 437  480          kstat_named_t arcstat_mfu_ghost_hits;
 438  481          kstat_named_t arcstat_deleted;
 439  482          /*
 440  483           * Number of buffers that could not be evicted because the hash lock
 441  484           * was held by another thread.  The lock may not necessarily be held
 442  485           * by something using the same buffer, since hash locks are shared
 443  486           * by multiple buffers.
 444  487           */
 445  488          kstat_named_t arcstat_mutex_miss;
 446  489          /*
      490 +         * Number of buffers skipped when updating the access state due to the
      491 +         * header having already been released after acquiring the hash lock.
      492 +         */
      493 +        kstat_named_t arcstat_access_skip;
      494 +        /*
 447  495           * Number of buffers skipped because they have I/O in progress, are
 448      -         * indrect prefetch buffers that have not lived long enough, or are
      496 +         * indirect prefetch buffers that have not lived long enough, or are
 449  497           * not from the spa we're trying to evict from.
 450  498           */
 451  499          kstat_named_t arcstat_evict_skip;
 452  500          /*
 453  501           * Number of times arc_evict_state() was unable to evict enough
 454  502           * buffers to reach its target amount.
 455  503           */
 456  504          kstat_named_t arcstat_evict_not_enough;
 457  505          kstat_named_t arcstat_evict_l2_cached;
 458  506          kstat_named_t arcstat_evict_l2_eligible;
↓ open down ↓ 1 lines elided ↑ open up ↑
 460  508          kstat_named_t arcstat_evict_l2_skip;
 461  509          kstat_named_t arcstat_hash_elements;
 462  510          kstat_named_t arcstat_hash_elements_max;
 463  511          kstat_named_t arcstat_hash_collisions;
 464  512          kstat_named_t arcstat_hash_chains;
 465  513          kstat_named_t arcstat_hash_chain_max;
 466  514          kstat_named_t arcstat_p;
 467  515          kstat_named_t arcstat_c;
 468  516          kstat_named_t arcstat_c_min;
 469  517          kstat_named_t arcstat_c_max;
 470      -        /* Not updated directly; only synced in arc_kstat_update. */
 471  518          kstat_named_t arcstat_size;
 472  519          /*
 473  520           * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
 474  521           * Note that the compressed bytes may match the uncompressed bytes
 475  522           * if the block is either not compressed or compressed arc is disabled.
 476  523           */
 477  524          kstat_named_t arcstat_compressed_size;
 478  525          /*
 479  526           * Uncompressed size of the data stored in b_pabd. If compressed
 480  527           * arc is disabled then this value will be identical to the stat
↓ open down ↓ 8 lines elided ↑ open up ↑
 489  536           * values have been set (see comment in dbuf.c for more information).
 490  537           */
 491  538          kstat_named_t arcstat_overhead_size;
 492  539          /*
 493  540           * Number of bytes consumed by internal ARC structures necessary
 494  541           * for tracking purposes; these structures are not actually
 495  542           * backed by ARC buffers. This includes arc_buf_hdr_t structures
 496  543           * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
 497  544           * caches), and arc_buf_t structures (allocated via arc_buf_t
 498  545           * cache).
 499      -         * Not updated directly; only synced in arc_kstat_update.
 500  546           */
 501  547          kstat_named_t arcstat_hdr_size;
 502  548          /*
 503  549           * Number of bytes consumed by ARC buffers of type equal to
 504  550           * ARC_BUFC_DATA. This is generally consumed by buffers backing
 505  551           * on disk user data (e.g. plain file contents).
 506      -         * Not updated directly; only synced in arc_kstat_update.
 507  552           */
 508  553          kstat_named_t arcstat_data_size;
 509  554          /*
 510  555           * Number of bytes consumed by ARC buffers of type equal to
 511  556           * ARC_BUFC_METADATA. This is generally consumed by buffers
 512  557           * backing on disk data that is used for internal ZFS
 513  558           * structures (e.g. ZAP, dnode, indirect blocks, etc).
 514      -         * Not updated directly; only synced in arc_kstat_update.
 515  559           */
 516  560          kstat_named_t arcstat_metadata_size;
 517  561          /*
      562 +         * Number of bytes consumed by ARC buffers of type equal to
      563 +         * ARC_BUFC_DDT. This is consumed by buffers backing on disk data
      564 +         * that is used to store DDT (ZAP, ddt stats).
      565 +         * Only used if zfs_arc_segregate_ddt is true.
      566 +         */
      567 +        kstat_named_t arcstat_ddt_size;
      568 +        /*
 518  569           * Number of bytes consumed by various buffers and structures
 519  570           * not actually backed with ARC buffers. This includes bonus
 520  571           * buffers (allocated directly via zio_buf_* functions),
 521  572           * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
 522  573           * cache), and dnode_t structures (allocated via dnode_t cache).
 523      -         * Not updated directly; only synced in arc_kstat_update.
 524  574           */
 525  575          kstat_named_t arcstat_other_size;
 526  576          /*
 527  577           * Total number of bytes consumed by ARC buffers residing in the
 528  578           * arc_anon state. This includes *all* buffers in the arc_anon
 529  579           * state; e.g. data, metadata, evictable, and unevictable buffers
 530  580           * are all included in this value.
 531      -         * Not updated directly; only synced in arc_kstat_update.
 532  581           */
 533  582          kstat_named_t arcstat_anon_size;
 534  583          /*
 535  584           * Number of bytes consumed by ARC buffers that meet the
 536  585           * following criteria: backing buffers of type ARC_BUFC_DATA,
 537  586           * residing in the arc_anon state, and are eligible for eviction
 538  587           * (e.g. have no outstanding holds on the buffer).
 539      -         * Not updated directly; only synced in arc_kstat_update.
 540  588           */
 541  589          kstat_named_t arcstat_anon_evictable_data;
 542  590          /*
 543  591           * Number of bytes consumed by ARC buffers that meet the
 544  592           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 545  593           * residing in the arc_anon state, and are eligible for eviction
 546  594           * (e.g. have no outstanding holds on the buffer).
 547      -         * Not updated directly; only synced in arc_kstat_update.
 548  595           */
 549  596          kstat_named_t arcstat_anon_evictable_metadata;
 550  597          /*
      598 +         * Number of bytes consumed by ARC buffers that meet the
      599 +         * following criteria: backing buffers of type ARC_BUFC_DDT,
       600 +         * residing in the arc_anon state, and are eligible for eviction.
      601 +         * Only used if zfs_arc_segregate_ddt is true.
      602 +         */
      603 +        kstat_named_t arcstat_anon_evictable_ddt;
      604 +        /*
 551  605           * Total number of bytes consumed by ARC buffers residing in the
 552  606           * arc_mru state. This includes *all* buffers in the arc_mru
 553  607           * state; e.g. data, metadata, evictable, and unevictable buffers
 554  608           * are all included in this value.
 555      -         * Not updated directly; only synced in arc_kstat_update.
 556  609           */
 557  610          kstat_named_t arcstat_mru_size;
 558  611          /*
 559  612           * Number of bytes consumed by ARC buffers that meet the
 560  613           * following criteria: backing buffers of type ARC_BUFC_DATA,
 561  614           * residing in the arc_mru state, and are eligible for eviction
 562  615           * (e.g. have no outstanding holds on the buffer).
 563      -         * Not updated directly; only synced in arc_kstat_update.
 564  616           */
 565  617          kstat_named_t arcstat_mru_evictable_data;
 566  618          /*
 567  619           * Number of bytes consumed by ARC buffers that meet the
 568  620           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 569  621           * residing in the arc_mru state, and are eligible for eviction
 570  622           * (e.g. have no outstanding holds on the buffer).
 571      -         * Not updated directly; only synced in arc_kstat_update.
 572  623           */
 573  624          kstat_named_t arcstat_mru_evictable_metadata;
 574  625          /*
      626 +         * Number of bytes consumed by ARC buffers that meet the
      627 +         * following criteria: backing buffers of type ARC_BUFC_DDT,
      628 +         * residing in the arc_mru state, and are eligible for eviction
      629 +         * (e.g. have no outstanding holds on the buffer).
      630 +         * Only used if zfs_arc_segregate_ddt is true.
      631 +         */
      632 +        kstat_named_t arcstat_mru_evictable_ddt;
      633 +        /*
 575  634           * Total number of bytes that *would have been* consumed by ARC
 576  635           * buffers in the arc_mru_ghost state. The key thing to note
 577  636           * here, is the fact that this size doesn't actually indicate
 578  637           * RAM consumption. The ghost lists only consist of headers and
 579  638           * don't actually have ARC buffers linked off of these headers.
 580  639           * Thus, *if* the headers had associated ARC buffers, these
 581  640           * buffers *would have* consumed this number of bytes.
 582      -         * Not updated directly; only synced in arc_kstat_update.
 583  641           */
 584  642          kstat_named_t arcstat_mru_ghost_size;
 585  643          /*
 586  644           * Number of bytes that *would have been* consumed by ARC
 587  645           * buffers that are eligible for eviction, of type
 588  646           * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
 589      -         * Not updated directly; only synced in arc_kstat_update.
 590  647           */
 591  648          kstat_named_t arcstat_mru_ghost_evictable_data;
 592  649          /*
 593  650           * Number of bytes that *would have been* consumed by ARC
 594  651           * buffers that are eligible for eviction, of type
 595  652           * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 596      -         * Not updated directly; only synced in arc_kstat_update.
 597  653           */
 598  654          kstat_named_t arcstat_mru_ghost_evictable_metadata;
 599  655          /*
      656 +         * Number of bytes that *would have been* consumed by ARC
      657 +         * buffers that are eligible for eviction, of type
      658 +         * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
      659 +         * Only used if zfs_arc_segregate_ddt is true.
      660 +         */
      661 +        kstat_named_t arcstat_mru_ghost_evictable_ddt;
      662 +        /*
 600  663           * Total number of bytes consumed by ARC buffers residing in the
 601  664           * arc_mfu state. This includes *all* buffers in the arc_mfu
 602  665           * state; e.g. data, metadata, evictable, and unevictable buffers
 603  666           * are all included in this value.
 604      -         * Not updated directly; only synced in arc_kstat_update.
 605  667           */
 606  668          kstat_named_t arcstat_mfu_size;
 607  669          /*
 608  670           * Number of bytes consumed by ARC buffers that are eligible for
 609  671           * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
 610  672           * state.
 611      -         * Not updated directly; only synced in arc_kstat_update.
 612  673           */
 613  674          kstat_named_t arcstat_mfu_evictable_data;
 614  675          /*
 615  676           * Number of bytes consumed by ARC buffers that are eligible for
 616  677           * eviction, of type ARC_BUFC_METADATA, and reside in the
 617  678           * arc_mfu state.
 618      -         * Not updated directly; only synced in arc_kstat_update.
 619  679           */
 620  680          kstat_named_t arcstat_mfu_evictable_metadata;
 621  681          /*
      682 +         * Number of bytes consumed by ARC buffers that are eligible for
      683 +         * eviction, of type ARC_BUFC_DDT, and reside in the
      684 +         * arc_mfu state.
      685 +         * Only used if zfs_arc_segregate_ddt is true.
      686 +         */
      687 +        kstat_named_t arcstat_mfu_evictable_ddt;
      688 +        /*
 622  689           * Total number of bytes that *would have been* consumed by ARC
 623  690           * buffers in the arc_mfu_ghost state. See the comment above
 624  691           * arcstat_mru_ghost_size for more details.
 625      -         * Not updated directly; only synced in arc_kstat_update.
 626  692           */
 627  693          kstat_named_t arcstat_mfu_ghost_size;
 628  694          /*
 629  695           * Number of bytes that *would have been* consumed by ARC
 630  696           * buffers that are eligible for eviction, of type
 631  697           * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
 632      -         * Not updated directly; only synced in arc_kstat_update.
 633  698           */
 634  699          kstat_named_t arcstat_mfu_ghost_evictable_data;
 635  700          /*
 636  701           * Number of bytes that *would have been* consumed by ARC
 637  702           * buffers that are eligible for eviction, of type
 638  703           * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 639      -         * Not updated directly; only synced in arc_kstat_update.
 640  704           */
 641  705          kstat_named_t arcstat_mfu_ghost_evictable_metadata;
      706 +        /*
      707 +         * Number of bytes that *would have been* consumed by ARC
      708 +         * buffers that are eligible for eviction, of type
      709 +         * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
      710 +         * Only used if zfs_arc_segregate_ddt is true.
      711 +         */
      712 +        kstat_named_t arcstat_mfu_ghost_evictable_ddt;
 642  713          kstat_named_t arcstat_l2_hits;
      714 +        kstat_named_t arcstat_l2_ddt_hits;
 643  715          kstat_named_t arcstat_l2_misses;
 644  716          kstat_named_t arcstat_l2_feeds;
 645  717          kstat_named_t arcstat_l2_rw_clash;
 646  718          kstat_named_t arcstat_l2_read_bytes;
      719 +        kstat_named_t arcstat_l2_ddt_read_bytes;
 647  720          kstat_named_t arcstat_l2_write_bytes;
      721 +        kstat_named_t arcstat_l2_ddt_write_bytes;
 648  722          kstat_named_t arcstat_l2_writes_sent;
 649  723          kstat_named_t arcstat_l2_writes_done;
 650  724          kstat_named_t arcstat_l2_writes_error;
 651  725          kstat_named_t arcstat_l2_writes_lock_retry;
 652  726          kstat_named_t arcstat_l2_evict_lock_retry;
 653  727          kstat_named_t arcstat_l2_evict_reading;
 654  728          kstat_named_t arcstat_l2_evict_l1cached;
 655  729          kstat_named_t arcstat_l2_free_on_write;
 656  730          kstat_named_t arcstat_l2_abort_lowmem;
 657  731          kstat_named_t arcstat_l2_cksum_bad;
 658  732          kstat_named_t arcstat_l2_io_error;
 659  733          kstat_named_t arcstat_l2_lsize;
 660  734          kstat_named_t arcstat_l2_psize;
 661      -        /* Not updated directly; only synced in arc_kstat_update. */
 662  735          kstat_named_t arcstat_l2_hdr_size;
      736 +        kstat_named_t arcstat_l2_log_blk_writes;
      737 +        kstat_named_t arcstat_l2_log_blk_avg_size;
      738 +        kstat_named_t arcstat_l2_data_to_meta_ratio;
      739 +        kstat_named_t arcstat_l2_rebuild_successes;
      740 +        kstat_named_t arcstat_l2_rebuild_abort_unsupported;
      741 +        kstat_named_t arcstat_l2_rebuild_abort_io_errors;
      742 +        kstat_named_t arcstat_l2_rebuild_abort_cksum_errors;
      743 +        kstat_named_t arcstat_l2_rebuild_abort_loop_errors;
      744 +        kstat_named_t arcstat_l2_rebuild_abort_lowmem;
      745 +        kstat_named_t arcstat_l2_rebuild_size;
      746 +        kstat_named_t arcstat_l2_rebuild_bufs;
      747 +        kstat_named_t arcstat_l2_rebuild_bufs_precached;
      748 +        kstat_named_t arcstat_l2_rebuild_psize;
      749 +        kstat_named_t arcstat_l2_rebuild_log_blks;
 663  750          kstat_named_t arcstat_memory_throttle_count;
 664      -        /* Not updated directly; only synced in arc_kstat_update. */
 665  751          kstat_named_t arcstat_meta_used;
 666  752          kstat_named_t arcstat_meta_limit;
 667  753          kstat_named_t arcstat_meta_max;
 668  754          kstat_named_t arcstat_meta_min;
      755 +        kstat_named_t arcstat_ddt_limit;
 669  756          kstat_named_t arcstat_sync_wait_for_async;
 670  757          kstat_named_t arcstat_demand_hit_predictive_prefetch;
 671  758  } arc_stats_t;
 672  759  
 673  760  static arc_stats_t arc_stats = {
 674  761          { "hits",                       KSTAT_DATA_UINT64 },
      762 +        { "ddt_hits",                   KSTAT_DATA_UINT64 },
 675  763          { "misses",                     KSTAT_DATA_UINT64 },
 676  764          { "demand_data_hits",           KSTAT_DATA_UINT64 },
 677  765          { "demand_data_misses",         KSTAT_DATA_UINT64 },
 678  766          { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 679  767          { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
      768 +        { "demand_ddt_hits",            KSTAT_DATA_UINT64 },
      769 +        { "demand_ddt_misses",          KSTAT_DATA_UINT64 },
 680  770          { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 681  771          { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 682  772          { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 683  773          { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
      774 +        { "prefetch_ddt_hits",          KSTAT_DATA_UINT64 },
      775 +        { "prefetch_ddt_misses",        KSTAT_DATA_UINT64 },
 684  776          { "mru_hits",                   KSTAT_DATA_UINT64 },
 685  777          { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 686  778          { "mfu_hits",                   KSTAT_DATA_UINT64 },
 687  779          { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 688  780          { "deleted",                    KSTAT_DATA_UINT64 },
 689  781          { "mutex_miss",                 KSTAT_DATA_UINT64 },
      782 +        { "access_skip",                KSTAT_DATA_UINT64 },
 690  783          { "evict_skip",                 KSTAT_DATA_UINT64 },
 691  784          { "evict_not_enough",           KSTAT_DATA_UINT64 },
 692  785          { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 693  786          { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 694  787          { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 695  788          { "evict_l2_skip",              KSTAT_DATA_UINT64 },
 696  789          { "hash_elements",              KSTAT_DATA_UINT64 },
 697  790          { "hash_elements_max",          KSTAT_DATA_UINT64 },
 698  791          { "hash_collisions",            KSTAT_DATA_UINT64 },
 699  792          { "hash_chains",                KSTAT_DATA_UINT64 },
↓ open down ↓ 2 lines elided ↑ open up ↑
 702  795          { "c",                          KSTAT_DATA_UINT64 },
 703  796          { "c_min",                      KSTAT_DATA_UINT64 },
 704  797          { "c_max",                      KSTAT_DATA_UINT64 },
 705  798          { "size",                       KSTAT_DATA_UINT64 },
 706  799          { "compressed_size",            KSTAT_DATA_UINT64 },
 707  800          { "uncompressed_size",          KSTAT_DATA_UINT64 },
 708  801          { "overhead_size",              KSTAT_DATA_UINT64 },
 709  802          { "hdr_size",                   KSTAT_DATA_UINT64 },
 710  803          { "data_size",                  KSTAT_DATA_UINT64 },
 711  804          { "metadata_size",              KSTAT_DATA_UINT64 },
      805 +        { "ddt_size",                   KSTAT_DATA_UINT64 },
 712  806          { "other_size",                 KSTAT_DATA_UINT64 },
 713  807          { "anon_size",                  KSTAT_DATA_UINT64 },
 714  808          { "anon_evictable_data",        KSTAT_DATA_UINT64 },
 715  809          { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },
      810 +        { "anon_evictable_ddt",         KSTAT_DATA_UINT64 },
 716  811          { "mru_size",                   KSTAT_DATA_UINT64 },
 717  812          { "mru_evictable_data",         KSTAT_DATA_UINT64 },
 718  813          { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },
      814 +        { "mru_evictable_ddt",          KSTAT_DATA_UINT64 },
 719  815          { "mru_ghost_size",             KSTAT_DATA_UINT64 },
 720  816          { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 721  817          { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
      818 +        { "mru_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 722  819          { "mfu_size",                   KSTAT_DATA_UINT64 },
 723  820          { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
 724  821          { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },
      822 +        { "mfu_evictable_ddt",          KSTAT_DATA_UINT64 },
 725  823          { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
 726  824          { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 727  825          { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
      826 +        { "mfu_ghost_evictable_ddt",    KSTAT_DATA_UINT64 },
 728  827          { "l2_hits",                    KSTAT_DATA_UINT64 },
      828 +        { "l2_ddt_hits",                KSTAT_DATA_UINT64 },
 729  829          { "l2_misses",                  KSTAT_DATA_UINT64 },
 730  830          { "l2_feeds",                   KSTAT_DATA_UINT64 },
 731  831          { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 732  832          { "l2_read_bytes",              KSTAT_DATA_UINT64 },
      833 +        { "l2_ddt_read_bytes",          KSTAT_DATA_UINT64 },
 733  834          { "l2_write_bytes",             KSTAT_DATA_UINT64 },
      835 +        { "l2_ddt_write_bytes",         KSTAT_DATA_UINT64 },
 734  836          { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 735  837          { "l2_writes_done",             KSTAT_DATA_UINT64 },
 736  838          { "l2_writes_error",            KSTAT_DATA_UINT64 },
 737  839          { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
 738  840          { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 739  841          { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 740  842          { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
 741  843          { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 742  844          { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 743  845          { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 744  846          { "l2_io_error",                KSTAT_DATA_UINT64 },
 745  847          { "l2_size",                    KSTAT_DATA_UINT64 },
 746  848          { "l2_asize",                   KSTAT_DATA_UINT64 },
 747  849          { "l2_hdr_size",                KSTAT_DATA_UINT64 },
      850 +        { "l2_log_blk_writes",          KSTAT_DATA_UINT64 },
      851 +        { "l2_log_blk_avg_size",        KSTAT_DATA_UINT64 },
      852 +        { "l2_data_to_meta_ratio",      KSTAT_DATA_UINT64 },
      853 +        { "l2_rebuild_successes",       KSTAT_DATA_UINT64 },
      854 +        { "l2_rebuild_unsupported",     KSTAT_DATA_UINT64 },
      855 +        { "l2_rebuild_io_errors",       KSTAT_DATA_UINT64 },
      856 +        { "l2_rebuild_cksum_errors",    KSTAT_DATA_UINT64 },
      857 +        { "l2_rebuild_loop_errors",     KSTAT_DATA_UINT64 },
      858 +        { "l2_rebuild_lowmem",          KSTAT_DATA_UINT64 },
      859 +        { "l2_rebuild_size",            KSTAT_DATA_UINT64 },
      860 +        { "l2_rebuild_bufs",            KSTAT_DATA_UINT64 },
      861 +        { "l2_rebuild_bufs_precached",  KSTAT_DATA_UINT64 },
      862 +        { "l2_rebuild_psize",           KSTAT_DATA_UINT64 },
      863 +        { "l2_rebuild_log_blks",        KSTAT_DATA_UINT64 },
 748  864          { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 749  865          { "arc_meta_used",              KSTAT_DATA_UINT64 },
 750  866          { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 751  867          { "arc_meta_max",               KSTAT_DATA_UINT64 },
 752  868          { "arc_meta_min",               KSTAT_DATA_UINT64 },
      869 +        { "arc_ddt_limit",              KSTAT_DATA_UINT64 },
 753  870          { "sync_wait_for_async",        KSTAT_DATA_UINT64 },
 754  871          { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
 755  872  };
 756  873  
 757  874  #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 758  875  
 759  876  #define ARCSTAT_INCR(stat, val) \
 760  877          atomic_add_64(&arc_stats.stat.value.ui64, (val))
 761  878  
 762  879  #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
↓ open down ↓ 10 lines elided ↑ open up ↑
 773  890          ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
 774  891  
 775  892  /*
 776  893   * We define a macro to allow ARC hits/misses to be easily broken down by
 777  894   * two separate conditions, giving a total of four different subtypes for
 778  895   * each of hits and misses (so eight statistics total).
 779  896   */
 780  897  #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 781  898          if (cond1) {                                                    \
 782  899                  if (cond2) {                                            \
 783      -                        ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
      900 +                        ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##stat2); \
 784  901                  } else {                                                \
 785      -                        ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
      902 +                        ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##notstat2); \
 786  903                  }                                                       \
 787  904          } else {                                                        \
 788  905                  if (cond2) {                                            \
 789      -                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
      906 +                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##stat2); \
 790  907                  } else {                                                \
 791      -                        ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
      908 +                        ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##notstat2);\
 792  909                  }                                                       \
 793  910          }
 794  911  
      912 +/*
      913 + * This macro allows us to use kstats as floating averages. Each time we
      914 + * update this kstat, we first factor it and the update value by
       915 + * ARCSTAT_F_AVG_FACTOR to shrink the new value's contribution to the overall
      916 + * average. This macro assumes that integer loads and stores are atomic, but
      917 + * is not safe for multiple writers updating the kstat in parallel (only the
      918 + * last writer's update will remain).
      919 + */
      920 +#define ARCSTAT_F_AVG_FACTOR    3
      921 +#define ARCSTAT_F_AVG(stat, value) \
      922 +        do { \
      923 +                uint64_t x = ARCSTAT(stat); \
      924 +                x = x - x / ARCSTAT_F_AVG_FACTOR + \
      925 +                    (value) / ARCSTAT_F_AVG_FACTOR; \
      926 +                ARCSTAT(stat) = x; \
      927 +                _NOTE(CONSTCOND) \
      928 +        } while (0)
      929 +
 795  930  kstat_t                 *arc_ksp;
 796  931  static arc_state_t      *arc_anon;
 797  932  static arc_state_t      *arc_mru;
 798  933  static arc_state_t      *arc_mru_ghost;
 799  934  static arc_state_t      *arc_mfu;
 800  935  static arc_state_t      *arc_mfu_ghost;
 801  936  static arc_state_t      *arc_l2c_only;
 802  937  
 803  938  /*
 804  939   * There are several ARC variables that are critical to export as kstats --
 805  940   * but we don't want to have to grovel around in the kstat whenever we wish to
 806  941   * manipulate them.  For these variables, we therefore define them to be in
 807  942   * terms of the statistic variable.  This assures that we are not introducing
 808  943   * the possibility of inconsistency by having shadow copies of the variables,
 809  944   * while still allowing the code to be readable.
 810  945   */
      946 +#define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
 811  947  #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 812  948  #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 813  949  #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 814  950  #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 815  951  #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 816  952  #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */
      953 +#define arc_meta_used   ARCSTAT(arcstat_meta_used) /* size of metadata */
 817  954  #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */
      955 +#define arc_ddt_size    ARCSTAT(arcstat_ddt_size) /* ddt size in arc */
      956 +#define arc_ddt_limit   ARCSTAT(arcstat_ddt_limit) /* ddt in arc size limit */
 818  957  
      958 +/*
       959 + * Used in zio.c to optionally keep DDT cached in ARC
      960 + */
      961 +uint64_t const *arc_ddt_evict_threshold;
      962 +
 819  963  /* compressed size of entire arc */
 820  964  #define arc_compressed_size     ARCSTAT(arcstat_compressed_size)
 821  965  /* uncompressed size of entire arc */
 822  966  #define arc_uncompressed_size   ARCSTAT(arcstat_uncompressed_size)
 823  967  /* number of bytes in the arc from arc_buf_t's */
 824  968  #define arc_overhead_size       ARCSTAT(arcstat_overhead_size)
 825  969  
 826      -/*
 827      - * There are also some ARC variables that we want to export, but that are
 828      - * updated so often that having the canonical representation be the statistic
 829      - * variable causes a performance bottleneck. We want to use aggsum_t's for these
 830      - * instead, but still be able to export the kstat in the same way as before.
 831      - * The solution is to always use the aggsum version, except in the kstat update
 832      - * callback.
 833      - */
 834      -aggsum_t arc_size;
 835      -aggsum_t arc_meta_used;
 836      -aggsum_t astat_data_size;
 837      -aggsum_t astat_metadata_size;
 838      -aggsum_t astat_hdr_size;
 839      -aggsum_t astat_other_size;
 840      -aggsum_t astat_l2_hdr_size;
 841  970  
 842  971  static int              arc_no_grow;    /* Don't try to grow cache size */
 843  972  static uint64_t         arc_tempreserve;
 844  973  static uint64_t         arc_loaned_bytes;
 845  974  
 846  975  typedef struct arc_callback arc_callback_t;
 847  976  
 848  977  struct arc_callback {
 849  978          void                    *acb_private;
 850  979          arc_done_func_t         *acb_done;
↓ open down ↓ 40 lines elided ↑ open up ↑
 891 1020   * Because it's possible for the L2ARC to become extremely large, we can wind
 892 1021   * up eating a lot of memory in L2ARC buffer headers, so the size of a header
 893 1022   * is minimized by only allocating the fields necessary for an L1-cached buffer
 894 1023   * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
 895 1024   * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
 896 1025   * words in pointers. arc_hdr_realloc() is used to switch a header between
 897 1026   * these two allocation states.
 898 1027   */
 899 1028  typedef struct l1arc_buf_hdr {
 900 1029          kmutex_t                b_freeze_lock;
 901      -        zio_cksum_t             *b_freeze_cksum;
 902 1030  #ifdef ZFS_DEBUG
 903 1031          /*
 904 1032           * Used for debugging with kmem_flags - by allocating and freeing
 905 1033           * b_thawed when the buffer is thawed, we get a record of the stack
 906 1034           * trace that thawed it.
 907 1035           */
 908 1036          void                    *b_thawed;
 909 1037  #endif
 910 1038  
     1039 +        /* number of krrp tasks using this buffer */
     1040 +        uint64_t                b_krrp;
     1041 +
 911 1042          arc_buf_t               *b_buf;
 912 1043          uint32_t                b_bufcnt;
 913 1044          /* for waiting on writes to complete */
 914 1045          kcondvar_t              b_cv;
 915 1046          uint8_t                 b_byteswap;
 916 1047  
 917 1048          /* protected by arc state mutex */
 918 1049          arc_state_t             *b_state;
 919 1050          multilist_node_t        b_arc_node;
 920 1051  
↓ open down ↓ 15 lines elided ↑ open up ↑
 936 1067          uint64_t                b_daddr;        /* disk address, offset byte */
 937 1068  
 938 1069          list_node_t             b_l2node;
 939 1070  } l2arc_buf_hdr_t;
 940 1071  
 941 1072  struct arc_buf_hdr {
 942 1073          /* protected by hash lock */
 943 1074          dva_t                   b_dva;
 944 1075          uint64_t                b_birth;
 945 1076  
     1077 +        /*
     1078 +         * Even though this checksum is only set/verified when a buffer is in
     1079 +         * the L1 cache, it needs to be in the set of common fields because it
     1080 +         * must be preserved from the time before a buffer is written out to
     1081 +         * L2ARC until after it is read back in.
     1082 +         */
     1083 +        zio_cksum_t             *b_freeze_cksum;
     1084 +
 946 1085          arc_buf_contents_t      b_type;
 947 1086          arc_buf_hdr_t           *b_hash_next;
 948 1087          arc_flags_t             b_flags;
 949 1088  
 950 1089          /*
 951 1090           * This field stores the size of the data buffer after
 952 1091           * compression, and is set in the arc's zio completion handlers.
 953 1092           * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
 954 1093           *
 955 1094           * While the block pointers can store up to 32MB in their psize
↓ open down ↓ 37 lines elided ↑ open up ↑
 993 1132  
 994 1133  #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
 995 1134  #define HDR_L2_READING(hdr)     \
 996 1135          (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&  \
 997 1136          ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
 998 1137  #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
 999 1138  #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
1000 1139  #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
1001 1140  #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
1002 1141  
     1142 +#define HDR_ISTYPE_DDT(hdr)     \
     1143 +            ((hdr)->b_flags & ARC_FLAG_BUFC_DDT)
1003 1144  #define HDR_ISTYPE_METADATA(hdr)        \
1004 1145          ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
1005      -#define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr))
     1146 +#define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr) && \
     1147 +        !HDR_ISTYPE_DDT(hdr))
1006 1148  
1007 1149  #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
1008 1150  #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
1009 1151  
1010 1152  /* For storing compression mode in b_flags */
1011 1153  #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
1012 1154  
1013 1155  #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
1014 1156          HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
1015 1157  #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
↓ open down ↓ 7 lines elided ↑ open up ↑
1023 1165   * Other sizes
1024 1166   */
1025 1167  
1026 1168  #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
1027 1169  #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
1028 1170  
1029 1171  /*
1030 1172   * Hash table routines
1031 1173   */
1032 1174  
1033      -#define HT_LOCK_PAD     64
1034      -
1035      -struct ht_lock {
1036      -        kmutex_t        ht_lock;
1037      -#ifdef _KERNEL
1038      -        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
1039      -#endif
     1175 +struct ht_table {
     1176 +        arc_buf_hdr_t   *hdr;
     1177 +        kmutex_t        lock;
1040 1178  };
1041 1179  
1042      -#define BUF_LOCKS 256
1043 1180  typedef struct buf_hash_table {
1044 1181          uint64_t ht_mask;
1045      -        arc_buf_hdr_t **ht_table;
1046      -        struct ht_lock ht_locks[BUF_LOCKS];
     1182 +        struct ht_table *ht_table;
1047 1183  } buf_hash_table_t;
1048 1184  
     1185 +#pragma align 64(buf_hash_table)
1049 1186  static buf_hash_table_t buf_hash_table;
1050 1187  
1051 1188  #define BUF_HASH_INDEX(spa, dva, birth) \
1052 1189          (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1053      -#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
1054      -#define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
     1190 +#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)
1055 1191  #define HDR_LOCK(hdr) \
1056 1192          (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
1057 1193  
1058 1194  uint64_t zfs_crc64_table[256];
1059 1195  
1060 1196  /*
1061 1197   * Level 2 ARC
1062 1198   */
1063 1199  
1064 1200  #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
↓ open down ↓ 13 lines elided ↑ open up ↑
1078 1214  uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
1079 1215  uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
1080 1216  uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
1081 1217  uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1082 1218  uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
1083 1219  uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1084 1220  boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
1085 1221  boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
1086 1222  boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
1087 1223  
1088      -/*
1089      - * L2ARC Internals
1090      - */
1091      -struct l2arc_dev {
1092      -        vdev_t                  *l2ad_vdev;     /* vdev */
1093      -        spa_t                   *l2ad_spa;      /* spa */
1094      -        uint64_t                l2ad_hand;      /* next write location */
1095      -        uint64_t                l2ad_start;     /* first addr on device */
1096      -        uint64_t                l2ad_end;       /* last addr on device */
1097      -        boolean_t               l2ad_first;     /* first sweep through */
1098      -        boolean_t               l2ad_writing;   /* currently writing */
1099      -        kmutex_t                l2ad_mtx;       /* lock for buffer list */
1100      -        list_t                  l2ad_buflist;   /* buffer list */
1101      -        list_node_t             l2ad_node;      /* device list node */
1102      -        refcount_t              l2ad_alloc;     /* allocated bytes */
1103      -};
1104      -
1105 1224  static list_t L2ARC_dev_list;                   /* device list */
1106 1225  static list_t *l2arc_dev_list;                  /* device list pointer */
1107 1226  static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
1108 1227  static l2arc_dev_t *l2arc_dev_last;             /* last device used */
     1228 +static l2arc_dev_t *l2arc_ddt_dev_last;         /* last DDT device used */
1109 1229  static list_t L2ARC_free_on_write;              /* free after write buf list */
1110 1230  static list_t *l2arc_free_on_write;             /* free after write list ptr */
1111 1231  static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
1112 1232  static uint64_t l2arc_ndev;                     /* number of devices */
1113 1233  
1114 1234  typedef struct l2arc_read_callback {
1115 1235          arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
1116 1236          blkptr_t                l2rcb_bp;               /* original blkptr */
1117 1237          zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
1118 1238          int                     l2rcb_flags;            /* original flags */
1119 1239          abd_t                   *l2rcb_abd;             /* temporary buffer */
1120 1240  } l2arc_read_callback_t;
1121 1241  
1122 1242  typedef struct l2arc_write_callback {
1123 1243          l2arc_dev_t     *l2wcb_dev;             /* device info */
1124 1244          arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
     1245 +        list_t          l2wcb_log_blk_buflist;  /* in-flight log blocks */
1125 1246  } l2arc_write_callback_t;
1126 1247  
1127 1248  typedef struct l2arc_data_free {
1128 1249          /* protected by l2arc_free_on_write_mtx */
1129 1250          abd_t           *l2df_abd;
1130 1251          size_t          l2df_size;
1131 1252          arc_buf_contents_t l2df_type;
1132 1253          list_node_t     l2df_list_node;
1133 1254  } l2arc_data_free_t;
1134 1255  
↓ open down ↓ 5 lines elided ↑ open up ↑
1140 1261  static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1141 1262  static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1142 1263  static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1143 1264  static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1144 1265  static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1145 1266  static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1146 1267  static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1147 1268  static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1148 1269  static boolean_t arc_is_overflowing();
1149 1270  static void arc_buf_watch(arc_buf_t *);
     1271 +static l2arc_dev_t *l2arc_vdev_get(vdev_t *vd);
1150 1272  
1151 1273  static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1152 1274  static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
     1275 +static arc_buf_contents_t arc_flags_to_bufc(uint32_t);
1153 1276  static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1154 1277  static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1155 1278  
1156 1279  static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1157 1280  static void l2arc_read_done(zio_t *);
1158 1281  
     1282 +static void
     1283 +arc_update_hit_stat(arc_buf_hdr_t *hdr, boolean_t hit)
     1284 +{
     1285 +        boolean_t pf = !HDR_PREFETCH(hdr);
     1286 +        switch (arc_buf_type(hdr)) {
     1287 +        case ARC_BUFC_DATA:
     1288 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, data);
     1289 +                break;
     1290 +        case ARC_BUFC_METADATA:
     1291 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses,
     1292 +                    metadata);
     1293 +                break;
     1294 +        case ARC_BUFC_DDT:
     1295 +                ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, ddt);
     1296 +                break;
     1297 +        default:
     1298 +                break;
     1299 +        }
     1300 +}
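
ARCSTAT_CONDSTAT is not visible in this hunk; conceptually it bumps one of four per-type counters depending on whether the access was a demand or prefetch request and whether it hit or missed. A rough userland sketch of that selection (placeholder counter names and type codes, not the real arcstat kstats or ARC_BUFC_* values) might look like:

        #include <stdio.h>
        #include <stdint.h>

        /* Illustrative stand-ins only; not part of the kernel source. */
        enum { HIT_DEMAND, HIT_PREFETCH, MISS_DEMAND, MISS_PREFETCH };

        static uint64_t data_stat[4], metadata_stat[4], ddt_stat[4];

        static void
        example_update_hit_stat(int is_prefetch, int is_hit, int type)
        {
                uint64_t *stats;
                int idx;

                switch (type) {
                case 0:                 /* stands in for ARC_BUFC_DATA */
                        stats = data_stat;
                        break;
                case 1:                 /* stands in for ARC_BUFC_METADATA */
                        stats = metadata_stat;
                        break;
                default:                /* stands in for ARC_BUFC_DDT */
                        stats = ddt_stat;
                        break;
                }

                /* Four-way split: demand vs. prefetch, hit vs. miss. */
                if (is_hit)
                        idx = is_prefetch ? HIT_PREFETCH : HIT_DEMAND;
                else
                        idx = is_prefetch ? MISS_PREFETCH : MISS_DEMAND;
                stats[idx]++;
        }

        int
        main(void)
        {
                example_update_hit_stat(0, 1, 0);       /* demand data hit */
                example_update_hit_stat(1, 0, 2);       /* prefetch ddt miss */
                printf("demand data hits: %llu\n",
                    (unsigned long long)data_stat[HIT_DEMAND]);
                return (0);
        }
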
1159 1301  
     1302 +enum {
     1303 +        L2ARC_DEV_HDR_EVICT_FIRST = (1 << 0)    /* mirror of l2ad_first */
     1304 +};
     1305 +
1160 1306  /*
1161      - * We use Cityhash for this. It's fast, and has good hash properties without
1162      - * requiring any large static buffers.
      1307 + * Pointer used in persistent L2ARC to reference log blocks & ARC buffers.
1163 1308   */
1164      -static uint64_t
     1309 +typedef struct l2arc_log_blkptr {
     1310 +        uint64_t        lbp_daddr;      /* device address of log */
     1311 +        /*
     1312 +         * lbp_prop is the same format as the blk_prop in blkptr_t:
     1313 +         *      * logical size (in sectors)
     1314 +         *      * physical size (in sectors)
     1315 +         *      * checksum algorithm (used for lbp_cksum)
     1316 +         *      * object type & level (unused for now)
     1317 +         */
     1318 +        uint64_t        lbp_prop;
     1319 +        zio_cksum_t     lbp_cksum;      /* fletcher4 of log */
     1320 +} l2arc_log_blkptr_t;
     1321 +
     1322 +/*
     1323 + * The persistent L2ARC device header.
     1324 + * Byte order of magic determines whether 64-bit bswap of fields is necessary.
     1325 + */
     1326 +typedef struct l2arc_dev_hdr_phys {
     1327 +        uint64_t        dh_magic;       /* L2ARC_DEV_HDR_MAGIC_Vx */
     1328 +        zio_cksum_t     dh_self_cksum;  /* fletcher4 of fields below */
     1329 +
     1330 +        /*
     1331 +         * Global L2ARC device state and metadata.
     1332 +         */
     1333 +        uint64_t        dh_spa_guid;
     1334 +        uint64_t        dh_alloc_space;         /* vdev space alloc status */
     1335 +        uint64_t        dh_flags;               /* l2arc_dev_hdr_flags_t */
     1336 +
     1337 +        /*
     1338 +         * Start of log block chain. [0] -> newest log, [1] -> one older (used
     1339 +         * for initiating prefetch).
     1340 +         */
     1341 +        l2arc_log_blkptr_t      dh_start_lbps[2];
     1342 +
     1343 +        const uint64_t  dh_pad[44];             /* pad to 512 bytes */
     1344 +} l2arc_dev_hdr_phys_t;
     1345 +CTASSERT(sizeof (l2arc_dev_hdr_phys_t) == SPA_MINBLOCKSIZE);
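
The CTASSERT holds because the fields sum to exactly one 512-byte sector with no compiler padding: 8 (magic) + 32 (self cksum) + 8 + 8 + 8 (guid/alloc/flags) + 2 * 48 (start lbps) + 44 * 8 (pad) = 512, and every member is 8-byte aligned. A small userland mirror of the layout (stand-in types, not the kernel definitions) demonstrates the same arithmetic:

        #include <stdio.h>
        #include <stdint.h>

        /*
         * Userland stand-ins only; mirrors the field sizes of
         * l2arc_dev_hdr_phys_t above to show how dh_pad[44] brings the
         * structure to exactly 512 bytes (SPA_MINBLOCKSIZE).
         */
        typedef struct { uint64_t zc_word[4]; } cksum_t;        /* 32 bytes */

        typedef struct {
                uint64_t        daddr;
                uint64_t        prop;
                cksum_t         cksum;
        } log_blkptr_t;                                         /* 48 bytes */

        typedef struct {
                uint64_t        magic;          /*   8 */
                cksum_t         self_cksum;     /*  32 */
                uint64_t        spa_guid;       /*   8 */
                uint64_t        alloc_space;    /*   8 */
                uint64_t        flags;          /*   8 */
                log_blkptr_t    start_lbps[2];  /*  96 */
                uint64_t        pad[44];        /* 352 */
        } dev_hdr_t;                                            /* 512 total */

        int
        main(void)
        {
                printf("sizeof (dev_hdr_t) = %zu\n", sizeof (dev_hdr_t));
                return (0);
        }
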
     1346 +
     1347 +/*
     1348 + * A single ARC buffer header entry in a l2arc_log_blk_phys_t.
     1349 + */
     1350 +typedef struct l2arc_log_ent_phys {
     1351 +        dva_t                   le_dva; /* dva of buffer */
     1352 +        uint64_t                le_birth;       /* birth txg of buffer */
     1353 +        zio_cksum_t             le_freeze_cksum;
     1354 +        /*
     1355 +         * le_prop is the same format as the blk_prop in blkptr_t:
     1356 +         *      * logical size (in sectors)
     1357 +         *      * physical size (in sectors)
     1358 +         *      * checksum algorithm (used for b_freeze_cksum)
     1359 +         *      * object type & level (used to restore arc_buf_contents_t)
     1360 +         */
     1361 +        uint64_t                le_prop;
     1362 +        uint64_t                le_daddr;       /* buf location on l2dev */
     1363 +        const uint64_t          le_pad[7];      /* resv'd for future use */
     1364 +} l2arc_log_ent_phys_t;
     1365 +
     1366 +/*
     1367 + * These design limits give us the following metadata overhead (before
     1368 + * compression):
     1369 + *      avg_blk_sz      overhead
     1370 + *      1k              12.51 %
     1371 + *      2k               6.26 %
     1372 + *      4k               3.13 %
     1373 + *      8k               1.56 %
     1374 + *      16k              0.78 %
     1375 + *      32k              0.39 %
     1376 + *      64k              0.20 %
     1377 + *      128k             0.10 %
      1378 + * Compression should be able to squeeze these down by about a factor of 2.
     1379 + */
     1380 +#define L2ARC_LOG_BLK_SIZE                      (128 * 1024)    /* 128k */
     1381 +#define L2ARC_LOG_BLK_HEADER_LEN                (128)
     1382 +#define L2ARC_LOG_BLK_ENTRIES                   /* 1023 entries */      \
     1383 +        ((L2ARC_LOG_BLK_SIZE - L2ARC_LOG_BLK_HEADER_LEN) /              \
     1384 +        sizeof (l2arc_log_ent_phys_t))
     1385 +/*
     1386 + * Maximum amount of data in an l2arc log block (used to terminate rebuilding
     1387 + * before we hit the write head and restore potentially corrupted blocks).
     1388 + */
     1389 +#define L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE  \
     1390 +        (SPA_MAXBLOCKSIZE * L2ARC_LOG_BLK_ENTRIES)
     1391 +/*
     1392 + * For the persistency and rebuild algorithms to operate reliably we need
     1393 + * the L2ARC device to at least be able to hold 3 full log blocks (otherwise
     1394 + * excessive log block looping might confuse the log chain end detection).
      1395 + * Under normal circumstances this is not a problem, since this amounts to
      1396 + * only around 400 MB.
     1397 + */
     1398 +#define L2ARC_PERSIST_MIN_SIZE  (3 * L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE)
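
The overhead table and the "around 400 MB" figure follow directly from these constants. As a rough userland illustration (assuming 128-byte log entries and a 128 KiB SPA_MAXBLOCKSIZE, which is what the 400 MB figure implies; not part of the kernel build), the arithmetic works out as follows:

        #include <stdio.h>
        #include <stdint.h>

        /* Stand-in values mirroring the constants above. */
        #define LOG_BLK_SIZE    (128 * 1024)
        #define LOG_BLK_HDR     128
        #define LOG_ENT_SIZE    128             /* sizeof (l2arc_log_ent_phys_t) */
        #define MAX_BLK_SIZE    (128 * 1024)    /* assumed SPA_MAXBLOCKSIZE */

        int
        main(void)
        {
                uint64_t entries = (LOG_BLK_SIZE - LOG_BLK_HDR) / LOG_ENT_SIZE;
                uint64_t max_payload = (uint64_t)MAX_BLK_SIZE * entries;
                uint64_t persist_min = 3 * max_payload;

                printf("entries per log block: %llu\n",
                    (unsigned long long)entries);           /* 1023 */

                /* Metadata overhead before compression, as in the table. */
                for (uint64_t avg = 1024; avg <= 128 * 1024; avg *= 2)
                        printf("avg_blk_sz %4lluk overhead %5.2f %%\n",
                            (unsigned long long)(avg >> 10),
                            100.0 * LOG_BLK_SIZE / (double)(entries * avg));

                printf("min device size: %llu MB\n",
                    (unsigned long long)(persist_min / 1000000));  /* ~400 MB */
                return (0);
        }
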
     1399 +
     1400 +/*
     1401 + * A log block of up to 1023 ARC buffer log entries, chained into the
     1402 + * persistent L2ARC metadata linked list. Byte order of magic determines
     1403 + * whether 64-bit bswap of fields is necessary.
     1404 + */
     1405 +typedef struct l2arc_log_blk_phys {
     1406 +        /* Header - see L2ARC_LOG_BLK_HEADER_LEN above */
     1407 +        uint64_t                lb_magic;       /* L2ARC_LOG_BLK_MAGIC */
     1408 +        l2arc_log_blkptr_t      lb_back2_lbp;   /* back 2 steps in chain */
     1409 +        uint64_t                lb_pad[9];      /* resv'd for future use */
     1410 +        /* Payload */
     1411 +        l2arc_log_ent_phys_t    lb_entries[L2ARC_LOG_BLK_ENTRIES];
     1412 +} l2arc_log_blk_phys_t;
     1413 +
     1414 +CTASSERT(sizeof (l2arc_log_blk_phys_t) == L2ARC_LOG_BLK_SIZE);
     1415 +CTASSERT(offsetof(l2arc_log_blk_phys_t, lb_entries) -
     1416 +    offsetof(l2arc_log_blk_phys_t, lb_magic) == L2ARC_LOG_BLK_HEADER_LEN);
     1417 +
     1418 +/*
     1419 + * These structures hold in-flight l2arc_log_blk_phys_t's as they're being
     1420 + * written to the L2ARC device. They may be compressed, hence the uint8_t[].
     1421 + */
     1422 +typedef struct l2arc_log_blk_buf {
     1423 +        uint8_t         lbb_log_blk[sizeof (l2arc_log_blk_phys_t)];
     1424 +        list_node_t     lbb_node;
     1425 +} l2arc_log_blk_buf_t;
     1426 +
      1427 +/* Macros for manipulating fields in the blk_prop format of blkptr_t */
     1428 +#define BLKPROP_GET_LSIZE(_obj, _field)         \
     1429 +        BF64_GET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1)
     1430 +#define BLKPROP_SET_LSIZE(_obj, _field, x)      \
     1431 +        BF64_SET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
     1432 +#define BLKPROP_GET_PSIZE(_obj, _field)         \
     1433 +        BF64_GET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0)
     1434 +#define BLKPROP_SET_PSIZE(_obj, _field, x)      \
     1435 +        BF64_SET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0, x)
     1436 +#define BLKPROP_GET_COMPRESS(_obj, _field)      \
     1437 +        BF64_GET((_obj)->_field, 32, 7)
     1438 +#define BLKPROP_SET_COMPRESS(_obj, _field, x)   \
     1439 +        BF64_SET((_obj)->_field, 32, 7, x)
     1440 +#define BLKPROP_GET_ARC_COMPRESS(_obj, _field)  \
     1441 +        BF64_GET((_obj)->_field, 39, 1)
     1442 +#define BLKPROP_SET_ARC_COMPRESS(_obj, _field, x)       \
     1443 +        BF64_SET((_obj)->_field, 39, 1, x)
     1444 +#define BLKPROP_GET_CHECKSUM(_obj, _field)      \
     1445 +        BF64_GET((_obj)->_field, 40, 8)
     1446 +#define BLKPROP_SET_CHECKSUM(_obj, _field, x)   \
     1447 +        BF64_SET((_obj)->_field, 40, 8, x)
     1448 +#define BLKPROP_GET_TYPE(_obj, _field)          \
     1449 +        BF64_GET((_obj)->_field, 48, 8)
     1450 +#define BLKPROP_SET_TYPE(_obj, _field, x)       \
     1451 +        BF64_SET((_obj)->_field, 48, 8, x)
     1452 +
     1453 +/* Macros for manipulating a l2arc_log_blkptr_t->lbp_prop field */
     1454 +#define LBP_GET_LSIZE(_add)             BLKPROP_GET_LSIZE(_add, lbp_prop)
     1455 +#define LBP_SET_LSIZE(_add, x)          BLKPROP_SET_LSIZE(_add, lbp_prop, x)
     1456 +#define LBP_GET_PSIZE(_add)             BLKPROP_GET_PSIZE(_add, lbp_prop)
     1457 +#define LBP_SET_PSIZE(_add, x)          BLKPROP_SET_PSIZE(_add, lbp_prop, x)
     1458 +#define LBP_GET_COMPRESS(_add)          BLKPROP_GET_COMPRESS(_add, lbp_prop)
     1459 +#define LBP_SET_COMPRESS(_add, x)       BLKPROP_SET_COMPRESS(_add, lbp_prop, x)
     1460 +#define LBP_GET_CHECKSUM(_add)          BLKPROP_GET_CHECKSUM(_add, lbp_prop)
     1461 +#define LBP_SET_CHECKSUM(_add, x)       BLKPROP_SET_CHECKSUM(_add, lbp_prop, x)
     1462 +#define LBP_GET_TYPE(_add)              BLKPROP_GET_TYPE(_add, lbp_prop)
     1463 +#define LBP_SET_TYPE(_add, x)           BLKPROP_SET_TYPE(_add, lbp_prop, x)
     1464 +
     1465 +/* Macros for manipulating a l2arc_log_ent_phys_t->le_prop field */
     1466 +#define LE_GET_LSIZE(_le)       BLKPROP_GET_LSIZE(_le, le_prop)
     1467 +#define LE_SET_LSIZE(_le, x)    BLKPROP_SET_LSIZE(_le, le_prop, x)
     1468 +#define LE_GET_PSIZE(_le)       BLKPROP_GET_PSIZE(_le, le_prop)
     1469 +#define LE_SET_PSIZE(_le, x)    BLKPROP_SET_PSIZE(_le, le_prop, x)
     1470 +#define LE_GET_COMPRESS(_le)    BLKPROP_GET_COMPRESS(_le, le_prop)
     1471 +#define LE_SET_COMPRESS(_le, x) BLKPROP_SET_COMPRESS(_le, le_prop, x)
     1472 +#define LE_GET_ARC_COMPRESS(_le)        BLKPROP_GET_ARC_COMPRESS(_le, le_prop)
     1473 +#define LE_SET_ARC_COMPRESS(_le, x)     BLKPROP_SET_ARC_COMPRESS(_le, le_prop, x)
     1474 +#define LE_GET_CHECKSUM(_le)    BLKPROP_GET_CHECKSUM(_le, le_prop)
     1475 +#define LE_SET_CHECKSUM(_le, x) BLKPROP_SET_CHECKSUM(_le, le_prop, x)
     1476 +#define LE_GET_TYPE(_le)        BLKPROP_GET_TYPE(_le, le_prop)
     1477 +#define LE_SET_TYPE(_le, x)     BLKPROP_SET_TYPE(_le, le_prop, x)
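
The LBP_*/LE_* accessors pack logical size, physical size, compression, checksum algorithm and object type into a single 64-bit prop word, reusing the blk_prop bit layout of blkptr_t (sizes stored as 512-byte sector counts, with lsize biased by one). A self-contained sketch of that packing, using hand-rolled stand-ins for the BF64 helpers and arbitrary example values, could look like:

        #include <stdio.h>
        #include <stdint.h>

        /* Illustrative BF64 stand-ins; field positions follow BLKPROP_*. */
        static uint64_t
        bf64_get(uint64_t x, int low, int len)
        {
                return ((x >> low) & ((1ULL << len) - 1));
        }

        static void
        bf64_set(uint64_t *x, int low, int len, uint64_t val)
        {
                uint64_t mask = ((1ULL << len) - 1) << low;

                *x = (*x & ~mask) | ((val << low) & mask);
        }

        int
        main(void)
        {
                uint64_t prop = 0;
                uint64_t lsize = 128 * 1024, psize = 4096;

                bf64_set(&prop, 0, 16, (lsize >> 9) - 1);  /* LE_SET_LSIZE */
                bf64_set(&prop, 16, 16, psize >> 9);       /* LE_SET_PSIZE */
                bf64_set(&prop, 32, 7, 2);    /* example compression id */
                bf64_set(&prop, 40, 8, 1);    /* example checksum id */
                bf64_set(&prop, 48, 8, 10);   /* example object type */

                printf("lsize %llu psize %llu compress %llu cksum %llu "
                    "type %llu\n",
                    (unsigned long long)((bf64_get(prop, 0, 16) + 1) << 9),
                    (unsigned long long)(bf64_get(prop, 16, 16) << 9),
                    (unsigned long long)bf64_get(prop, 32, 7),
                    (unsigned long long)bf64_get(prop, 40, 8),
                    (unsigned long long)bf64_get(prop, 48, 8));
                return (0);
        }
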
     1478 +
     1479 +#define PTR_SWAP(x, y)          \
     1480 +        do {                    \
     1481 +                void *tmp = (x);\
     1482 +                x = y;          \
     1483 +                y = tmp;        \
     1484 +                _NOTE(CONSTCOND)\
     1485 +        } while (0)
     1486 +
     1487 +/*
     1488 + * Sadly, after compressed ARC integration older kernels would panic
     1489 + * when trying to rebuild persistent L2ARC created by the new code.
     1490 + */
     1491 +#define L2ARC_DEV_HDR_MAGIC_V1  0x4c32415243763031LLU   /* ASCII: "L2ARCv01" */
     1492 +#define L2ARC_LOG_BLK_MAGIC     0x4c4f47424c4b4844LLU   /* ASCII: "LOGBLKHD" */
     1493 +
     1494 +/*
     1495 + * Performance tuning of L2ARC persistency:
     1496 + *
      1497 + * l2arc_rebuild_enabled : Controls whether adding an L2ARC device (either
      1498 + *              at pool import or manually later) will attempt to rebuild
      1499 + *              the L2ARC buffer contents. In special circumstances,
      1500 + *              the administrator may want to set this to B_FALSE if they
      1501 + *              are having trouble importing a pool or attaching an L2ARC
      1502 + *              device (e.g. the L2ARC device is slow to read in stored log
      1503 + *              metadata, or the metadata has somehow become
      1504 + *              fragmented/unusable).
     1505 + */
     1506 +boolean_t l2arc_rebuild_enabled = B_TRUE;
     1507 +
     1508 +/* L2ARC persistency rebuild control routines. */
     1509 +static void l2arc_dev_rebuild_start(l2arc_dev_t *dev);
     1510 +static int l2arc_rebuild(l2arc_dev_t *dev);
     1511 +
     1512 +/* L2ARC persistency read I/O routines. */
     1513 +static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
     1514 +static int l2arc_log_blk_read(l2arc_dev_t *dev,
     1515 +    const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
     1516 +    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
     1517 +    uint8_t *this_lb_buf, uint8_t *next_lb_buf,
     1518 +    zio_t *this_io, zio_t **next_io);
     1519 +static zio_t *l2arc_log_blk_prefetch(vdev_t *vd,
     1520 +    const l2arc_log_blkptr_t *lp, uint8_t *lb_buf);
     1521 +static void l2arc_log_blk_prefetch_abort(zio_t *zio);
     1522 +
     1523 +/* L2ARC persistency block restoration routines. */
     1524 +static void l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
     1525 +    const l2arc_log_blk_phys_t *lb, uint64_t lb_psize);
     1526 +static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
     1527 +    l2arc_dev_t *dev, uint64_t guid);
     1528 +
     1529 +/* L2ARC persistency write I/O routines. */
     1530 +static void l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio);
     1531 +static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
     1532 +    l2arc_write_callback_t *cb);
     1533 +
      1534 +/* L2ARC persistency auxiliary routines. */
     1535 +static boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
     1536 +    const l2arc_log_blkptr_t *lp);
     1537 +static void l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr,
     1538 +    zio_cksum_t *cksum);
     1539 +static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
     1540 +    const arc_buf_hdr_t *ab);
     1541 +static inline boolean_t l2arc_range_check_overlap(uint64_t bottom,
     1542 +    uint64_t top, uint64_t check);
     1543 +
     1544 +/*
     1545 + * L2ARC Internals
     1546 + */
     1547 +struct l2arc_dev {
     1548 +        vdev_t                  *l2ad_vdev;     /* vdev */
     1549 +        spa_t                   *l2ad_spa;      /* spa */
     1550 +        uint64_t                l2ad_hand;      /* next write location */
     1551 +        uint64_t                l2ad_start;     /* first addr on device */
     1552 +        uint64_t                l2ad_end;       /* last addr on device */
     1553 +        boolean_t               l2ad_first;     /* first sweep through */
     1554 +        boolean_t               l2ad_writing;   /* currently writing */
     1555 +        kmutex_t                l2ad_mtx;       /* lock for buffer list */
     1556 +        list_t                  l2ad_buflist;   /* buffer list */
     1557 +        list_node_t             l2ad_node;      /* device list node */
     1558 +        refcount_t              l2ad_alloc;     /* allocated bytes */
     1559 +        l2arc_dev_hdr_phys_t    *l2ad_dev_hdr;  /* persistent device header */
     1560 +        uint64_t                l2ad_dev_hdr_asize; /* aligned hdr size */
     1561 +        l2arc_log_blk_phys_t    l2ad_log_blk;   /* currently open log block */
     1562 +        int                     l2ad_log_ent_idx; /* index into cur log blk */
     1563 +        /* number of bytes in current log block's payload */
     1564 +        uint64_t                l2ad_log_blk_payload_asize;
     1565 +        /* flag indicating whether a rebuild is scheduled or is going on */
     1566 +        boolean_t               l2ad_rebuild;
     1567 +        boolean_t               l2ad_rebuild_cancel;
     1568 +        kt_did_t                l2ad_rebuild_did;
     1569 +};
     1570 +
     1571 +static inline uint64_t
1165 1572  buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1166 1573  {
1167      -        return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
     1574 +        uint8_t *vdva = (uint8_t *)dva;
     1575 +        uint64_t crc = -1ULL;
     1576 +        int i;
     1577 +
     1578 +        ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
     1579 +
     1580 +        for (i = 0; i < sizeof (dva_t); i++)
     1581 +                crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
     1582 +
     1583 +        crc ^= (spa>>8) ^ birth;
     1584 +
     1585 +        return (crc);
1168 1586  }
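
This replaces the previous hash with the same CRC64 folding that buf_init() seeds into zfs_crc64_table further down in this file. A userland sketch of the full path, from table seeding through masking with ht_mask to pick a bucket (the polynomial value here is an assumption for the sketch), is:

        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        /* Assumed ZFS_CRC64_POLY value; illustrative only. */
        #define CRC64_POLY      0xC96C5795D7870F42ULL

        static uint64_t crc64_table[256];

        static void
        crc64_init(void)        /* mirrors the seeding loop in buf_init() */
        {
                for (int i = 0; i < 256; i++) {
                        uint64_t *ct = &crc64_table[i];
                        *ct = i;
                        for (int j = 8; j > 0; j--)
                                *ct = (*ct >> 1) ^ (-(*ct & 1) & CRC64_POLY);
                }
        }

        static uint64_t
        example_buf_hash(uint64_t spa, const uint64_t dva[2], uint64_t birth)
        {
                const uint8_t *vdva = (const uint8_t *)dva;
                uint64_t crc = -1ULL;

                /* Fold the 16 DVA bytes through the table, as above. */
                for (size_t i = 0; i < 2 * sizeof (uint64_t); i++)
                        crc = (crc >> 8) ^ crc64_table[(crc ^ vdva[i]) & 0xFF];

                return (crc ^ (spa >> 8) ^ birth);
        }

        int
        main(void)
        {
                uint64_t dva[2] = { 0x1234, 0x5678 };
                uint64_t ht_mask = (1ULL << 20) - 1;    /* example table size */

                crc64_init();
                printf("bucket %llu\n", (unsigned long long)
                    (example_buf_hash(42, dva, 7) & ht_mask));
                return (0);
        }
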
1169 1587  
1170 1588  #define HDR_EMPTY(hdr)                                          \
1171 1589          ((hdr)->b_dva.dva_word[0] == 0 &&                       \
1172 1590          (hdr)->b_dva.dva_word[1] == 0)
1173 1591  
1174 1592  #define HDR_EQUAL(spa, dva, birth, hdr)                         \
1175 1593          ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&     \
1176 1594          ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
1177 1595          ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
↓ open down ↓ 9 lines elided ↑ open up ↑
1187 1605  static arc_buf_hdr_t *
1188 1606  buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1189 1607  {
1190 1608          const dva_t *dva = BP_IDENTITY(bp);
1191 1609          uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1192 1610          uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1193 1611          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1194 1612          arc_buf_hdr_t *hdr;
1195 1613  
1196 1614          mutex_enter(hash_lock);
1197      -        for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
     1615 +        for (hdr = buf_hash_table.ht_table[idx].hdr; hdr != NULL;
1198 1616              hdr = hdr->b_hash_next) {
1199 1617                  if (HDR_EQUAL(spa, dva, birth, hdr)) {
1200 1618                          *lockp = hash_lock;
1201 1619                          return (hdr);
1202 1620                  }
1203 1621          }
1204 1622          mutex_exit(hash_lock);
1205 1623          *lockp = NULL;
1206 1624          return (NULL);
1207 1625  }
↓ open down ↓ 17 lines elided ↑ open up ↑
1225 1643          ASSERT(hdr->b_birth != 0);
1226 1644          ASSERT(!HDR_IN_HASH_TABLE(hdr));
1227 1645  
1228 1646          if (lockp != NULL) {
1229 1647                  *lockp = hash_lock;
1230 1648                  mutex_enter(hash_lock);
1231 1649          } else {
1232 1650                  ASSERT(MUTEX_HELD(hash_lock));
1233 1651          }
1234 1652  
1235      -        for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
     1653 +        for (fhdr = buf_hash_table.ht_table[idx].hdr, i = 0; fhdr != NULL;
1236 1654              fhdr = fhdr->b_hash_next, i++) {
1237 1655                  if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1238 1656                          return (fhdr);
1239 1657          }
1240 1658  
1241      -        hdr->b_hash_next = buf_hash_table.ht_table[idx];
1242      -        buf_hash_table.ht_table[idx] = hdr;
     1659 +        hdr->b_hash_next = buf_hash_table.ht_table[idx].hdr;
     1660 +        buf_hash_table.ht_table[idx].hdr = hdr;
1243 1661          arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1244 1662  
1245 1663          /* collect some hash table performance data */
1246 1664          if (i > 0) {
1247 1665                  ARCSTAT_BUMP(arcstat_hash_collisions);
1248 1666                  if (i == 1)
1249 1667                          ARCSTAT_BUMP(arcstat_hash_chains);
1250 1668  
1251 1669                  ARCSTAT_MAX(arcstat_hash_chain_max, i);
1252 1670          }
↓ open down ↓ 6 lines elided ↑ open up ↑
1259 1677  
1260 1678  static void
1261 1679  buf_hash_remove(arc_buf_hdr_t *hdr)
1262 1680  {
1263 1681          arc_buf_hdr_t *fhdr, **hdrp;
1264 1682          uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1265 1683  
1266 1684          ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1267 1685          ASSERT(HDR_IN_HASH_TABLE(hdr));
1268 1686  
1269      -        hdrp = &buf_hash_table.ht_table[idx];
     1687 +        hdrp = &buf_hash_table.ht_table[idx].hdr;
1270 1688          while ((fhdr = *hdrp) != hdr) {
1271 1689                  ASSERT3P(fhdr, !=, NULL);
1272 1690                  hdrp = &fhdr->b_hash_next;
1273 1691          }
1274 1692          *hdrp = hdr->b_hash_next;
1275 1693          hdr->b_hash_next = NULL;
1276 1694          arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1277 1695  
1278 1696          /* collect some hash table performance data */
1279 1697          ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1280 1698  
1281      -        if (buf_hash_table.ht_table[idx] &&
1282      -            buf_hash_table.ht_table[idx]->b_hash_next == NULL)
     1699 +        if (buf_hash_table.ht_table[idx].hdr &&
     1700 +            buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
1283 1701                  ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1284 1702  }
1285 1703  
1286 1704  /*
1287 1705   * Global data structures and functions for the buf kmem cache.
1288 1706   */
1289 1707  static kmem_cache_t *hdr_full_cache;
1290 1708  static kmem_cache_t *hdr_l2only_cache;
1291 1709  static kmem_cache_t *buf_cache;
1292 1710  
1293 1711  static void
1294 1712  buf_fini(void)
1295 1713  {
1296 1714          int i;
1297 1715  
     1716 +        for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
     1717 +                mutex_destroy(&buf_hash_table.ht_table[i].lock);
1298 1718          kmem_free(buf_hash_table.ht_table,
1299      -            (buf_hash_table.ht_mask + 1) * sizeof (void *));
1300      -        for (i = 0; i < BUF_LOCKS; i++)
1301      -                mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
     1719 +            (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));
1302 1720          kmem_cache_destroy(hdr_full_cache);
1303 1721          kmem_cache_destroy(hdr_l2only_cache);
1304 1722          kmem_cache_destroy(buf_cache);
1305 1723  }
1306 1724  
1307 1725  /*
1308 1726   * Constructor callback - called when the cache is empty
1309 1727   * and a new buf is requested.
1310 1728   */
1311 1729  /* ARGSUSED */
↓ open down ↓ 102 lines elided ↑ open up ↑
1414 1832           * The hash table is big enough to fill all of physical memory
1415 1833           * with an average block size of zfs_arc_average_blocksize (default 8K).
1416 1834           * By default, the table will take up
1417 1835           * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1418 1836           */
1419 1837          while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1420 1838                  hsize <<= 1;
1421 1839  retry:
1422 1840          buf_hash_table.ht_mask = hsize - 1;
1423 1841          buf_hash_table.ht_table =
1424      -            kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
     1842 +            kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
1425 1843          if (buf_hash_table.ht_table == NULL) {
1426 1844                  ASSERT(hsize > (1ULL << 8));
1427 1845                  hsize >>= 1;
1428 1846                  goto retry;
1429 1847          }
1430 1848  
1431 1849          hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1432 1850              0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1433 1851          hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1434 1852              HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1435 1853              NULL, NULL, 0);
1436 1854          buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1437 1855              0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1438 1856  
1439 1857          for (i = 0; i < 256; i++)
1440 1858                  for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1441 1859                          *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1442 1860  
1443      -        for (i = 0; i < BUF_LOCKS; i++) {
1444      -                mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
     1861 +        for (i = 0; i < hsize; i++) {
     1862 +                mutex_init(&buf_hash_table.ht_table[i].lock,
1445 1863                      NULL, MUTEX_DEFAULT, NULL);
1446 1864          }
1447 1865  }
1448 1866  
     1867 +/* wait until krrp releases the buffer */
     1868 +static inline void
     1869 +arc_wait_for_krrp(arc_buf_hdr_t *hdr)
     1870 +{
     1871 +        while (HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_krrp != 0)
     1872 +                cv_wait(&hdr->b_l1hdr.b_cv, HDR_LOCK(hdr));
     1873 +}
     1874 +
1449 1875  /*
1450 1876   * This is the size that the buf occupies in memory. If the buf is compressed,
1451 1877   * it will correspond to the compressed size. You should use this method of
1452 1878   * getting the buf size unless you explicitly need the logical size.
1453 1879   */
1454 1880  int32_t
1455 1881  arc_buf_size(arc_buf_t *buf)
1456 1882  {
1457 1883          return (ARC_BUF_COMPRESSED(buf) ?
1458 1884              HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
↓ open down ↓ 35 lines elided ↑ open up ↑
1494 1920  
1495 1921  /*
1496 1922   * Free the checksum associated with this header. If there is no checksum, this
1497 1923   * is a no-op.
1498 1924   */
1499 1925  static inline void
1500 1926  arc_cksum_free(arc_buf_hdr_t *hdr)
1501 1927  {
1502 1928          ASSERT(HDR_HAS_L1HDR(hdr));
1503 1929          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1504      -        if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1505      -                kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
1506      -                hdr->b_l1hdr.b_freeze_cksum = NULL;
     1930 +        if (hdr->b_freeze_cksum != NULL) {
     1931 +                kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
     1932 +                hdr->b_freeze_cksum = NULL;
1507 1933          }
1508 1934          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1509 1935  }
1510 1936  
1511 1937  /*
1512 1938   * Return true iff at least one of the bufs on hdr is not compressed.
1513 1939   */
1514 1940  static boolean_t
1515 1941  arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1516 1942  {
↓ open down ↓ 13 lines elided ↑ open up ↑
1530 1956  static void
1531 1957  arc_cksum_verify(arc_buf_t *buf)
1532 1958  {
1533 1959          arc_buf_hdr_t *hdr = buf->b_hdr;
1534 1960          zio_cksum_t zc;
1535 1961  
1536 1962          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1537 1963                  return;
1538 1964  
1539 1965          if (ARC_BUF_COMPRESSED(buf)) {
1540      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     1966 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1541 1967                      arc_hdr_has_uncompressed_buf(hdr));
1542 1968                  return;
1543 1969          }
1544 1970  
1545 1971          ASSERT(HDR_HAS_L1HDR(hdr));
1546 1972  
1547 1973          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1548      -        if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
     1974 +        if (hdr->b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1549 1975                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1550 1976                  return;
1551 1977          }
1552 1978  
1553 1979          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1554      -        if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
     1980 +        if (!ZIO_CHECKSUM_EQUAL(*hdr->b_freeze_cksum, zc))
1555 1981                  panic("buffer modified while frozen!");
1556 1982          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1557 1983  }
1558 1984  
1559 1985  static boolean_t
1560 1986  arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1561 1987  {
1562 1988          enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1563 1989          boolean_t valid_cksum;
1564 1990  
↓ open down ↓ 10 lines elided ↑ open up ↑
1575 2001           * arc is disabled, then the data written to the l2arc is always
1576 2002           * uncompressed and won't match the block as it exists in the main
1577 2003           * pool. When this is the case, we must first compress it if it is
1578 2004           * compressed on the main pool before we can validate the checksum.
1579 2005           */
1580 2006          if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
1581 2007                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1582 2008                  uint64_t lsize = HDR_GET_LSIZE(hdr);
1583 2009                  uint64_t csize;
1584 2010  
1585      -                abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
1586      -                csize = zio_compress_data(compress, zio->io_abd,
1587      -                    abd_to_buf(cdata), lsize);
     2011 +                void *cbuf = zio_buf_alloc(HDR_GET_PSIZE(hdr));
     2012 +                csize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
     2013 +                abd_t *cdata = abd_get_from_buf(cbuf, HDR_GET_PSIZE(hdr));
     2014 +                abd_take_ownership_of_buf(cdata, B_TRUE);
1588 2015  
1589 2016                  ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
1590 2017                  if (csize < HDR_GET_PSIZE(hdr)) {
1591 2018                          /*
1592 2019                           * Compressed blocks are always a multiple of the
1593 2020                           * smallest ashift in the pool. Ideally, we would
1594 2021                           * like to round up the csize to the next
1595 2022                           * spa_min_ashift but that value may have changed
1596 2023                           * since the block was last written. Instead,
1597 2024                           * we rely on the fact that the hdr's psize
↓ open down ↓ 38 lines elided ↑ open up ↑
1636 2063  arc_cksum_compute(arc_buf_t *buf)
1637 2064  {
1638 2065          arc_buf_hdr_t *hdr = buf->b_hdr;
1639 2066  
1640 2067          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1641 2068                  return;
1642 2069  
1643 2070          ASSERT(HDR_HAS_L1HDR(hdr));
1644 2071  
1645 2072          mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
1646      -        if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
     2073 +        if (hdr->b_freeze_cksum != NULL) {
1647 2074                  ASSERT(arc_hdr_has_uncompressed_buf(hdr));
1648 2075                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1649 2076                  return;
1650 2077          } else if (ARC_BUF_COMPRESSED(buf)) {
1651 2078                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1652 2079                  return;
1653 2080          }
1654 2081  
1655 2082          ASSERT(!ARC_BUF_COMPRESSED(buf));
1656      -        hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
     2083 +        hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
1657 2084              KM_SLEEP);
1658 2085          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
1659      -            hdr->b_l1hdr.b_freeze_cksum);
     2086 +            hdr->b_freeze_cksum);
1660 2087          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1661 2088          arc_buf_watch(buf);
1662 2089  }
1663 2090  
1664 2091  #ifndef _KERNEL
1665 2092  typedef struct procctl {
1666 2093          long cmd;
1667 2094          prwatch_t prwatch;
1668 2095  } procctl_t;
1669 2096  #endif
↓ open down ↓ 31 lines elided ↑ open up ↑
1701 2128                  result = write(arc_procfd, &ctl, sizeof (ctl));
1702 2129                  ASSERT3U(result, ==, sizeof (ctl));
1703 2130          }
1704 2131  #endif
1705 2132  }
1706 2133  
1707 2134  static arc_buf_contents_t
1708 2135  arc_buf_type(arc_buf_hdr_t *hdr)
1709 2136  {
1710 2137          arc_buf_contents_t type;
     2138 +
1711 2139          if (HDR_ISTYPE_METADATA(hdr)) {
1712 2140                  type = ARC_BUFC_METADATA;
     2141 +        } else if (HDR_ISTYPE_DDT(hdr)) {
     2142 +                type = ARC_BUFC_DDT;
1713 2143          } else {
1714 2144                  type = ARC_BUFC_DATA;
1715 2145          }
1716 2146          VERIFY3U(hdr->b_type, ==, type);
1717 2147          return (type);
1718 2148  }
1719 2149  
1720 2150  boolean_t
1721 2151  arc_is_metadata(arc_buf_t *buf)
1722 2152  {
↓ open down ↓ 2 lines elided ↑ open up ↑
1725 2155  
1726 2156  static uint32_t
1727 2157  arc_bufc_to_flags(arc_buf_contents_t type)
1728 2158  {
1729 2159          switch (type) {
1730 2160          case ARC_BUFC_DATA:
1731 2161                  /* metadata field is 0 if buffer contains normal data */
1732 2162                  return (0);
1733 2163          case ARC_BUFC_METADATA:
1734 2164                  return (ARC_FLAG_BUFC_METADATA);
     2165 +        case ARC_BUFC_DDT:
     2166 +                return (ARC_FLAG_BUFC_DDT);
1735 2167          default:
1736 2168                  break;
1737 2169          }
1738 2170          panic("undefined ARC buffer type!");
1739 2171          return ((uint32_t)-1);
1740 2172  }
1741 2173  
     2174 +static arc_buf_contents_t
     2175 +arc_flags_to_bufc(uint32_t flags)
     2176 +{
     2177 +        if (flags & ARC_FLAG_BUFC_DDT)
     2178 +                return (ARC_BUFC_DDT);
     2179 +        if (flags & ARC_FLAG_BUFC_METADATA)
     2180 +                return (ARC_BUFC_METADATA);
     2181 +        return (ARC_BUFC_DATA);
     2182 +}
     2183 +
1742 2184  void
1743 2185  arc_buf_thaw(arc_buf_t *buf)
1744 2186  {
1745 2187          arc_buf_hdr_t *hdr = buf->b_hdr;
1746 2188  
1747 2189          ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
1748 2190          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1749 2191  
1750 2192          arc_cksum_verify(buf);
1751 2193  
1752 2194          /*
1753 2195           * Compressed buffers do not manipulate the b_freeze_cksum or
1754 2196           * allocate b_thawed.
1755 2197           */
1756 2198          if (ARC_BUF_COMPRESSED(buf)) {
1757      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     2199 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1758 2200                      arc_hdr_has_uncompressed_buf(hdr));
1759 2201                  return;
1760 2202          }
1761 2203  
1762 2204          ASSERT(HDR_HAS_L1HDR(hdr));
1763 2205          arc_cksum_free(hdr);
1764 2206  
1765 2207          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1766 2208  #ifdef ZFS_DEBUG
1767 2209          if (zfs_flags & ZFS_DEBUG_MODIFY) {
↓ open down ↓ 11 lines elided ↑ open up ↑
1779 2221  void
1780 2222  arc_buf_freeze(arc_buf_t *buf)
1781 2223  {
1782 2224          arc_buf_hdr_t *hdr = buf->b_hdr;
1783 2225          kmutex_t *hash_lock;
1784 2226  
1785 2227          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1786 2228                  return;
1787 2229  
1788 2230          if (ARC_BUF_COMPRESSED(buf)) {
1789      -                ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
     2231 +                ASSERT(hdr->b_freeze_cksum == NULL ||
1790 2232                      arc_hdr_has_uncompressed_buf(hdr));
1791 2233                  return;
1792 2234          }
1793 2235  
1794 2236          hash_lock = HDR_LOCK(hdr);
1795 2237          mutex_enter(hash_lock);
1796 2238  
1797 2239          ASSERT(HDR_HAS_L1HDR(hdr));
1798      -        ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
     2240 +        ASSERT(hdr->b_freeze_cksum != NULL ||
1799 2241              hdr->b_l1hdr.b_state == arc_anon);
1800 2242          arc_cksum_compute(buf);
1801 2243          mutex_exit(hash_lock);
1802 2244  }
1803 2245  
1804 2246  /*
1805 2247   * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
1806 2248   * the following functions should be used to ensure that the flags are
1807 2249   * updated in a thread-safe way. When manipulating the flags either
1808 2250   * the hash_lock must be held or the hdr must be undiscoverable. This
↓ open down ↓ 71 lines elided ↑ open up ↑
1880 2322                          bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
1881 2323                          copied = B_TRUE;
1882 2324                          break;
1883 2325                  }
1884 2326          }
1885 2327  
1886 2328          /*
1887 2329           * There were no decompressed bufs, so there should not be a
1888 2330           * checksum on the hdr either.
1889 2331           */
1890      -        EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
     2332 +        EQUIV(!copied, hdr->b_freeze_cksum == NULL);
1891 2333  
1892 2334          return (copied);
1893 2335  }
1894 2336  
1895 2337  /*
1896 2338   * Given a buf that has a data buffer attached to it, this function will
1897 2339   * efficiently fill the buf with data of the specified compression setting from
1898 2340   * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
1899 2341   * are already sharing a data buf, no copy is performed.
1900 2342   *
↓ open down ↓ 58 lines elided ↑ open up ↑
1959 2401                   */
1960 2402                  buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
1961 2403  
1962 2404                  /*
1963 2405                   * Try copying the data from another buf which already has a
1964 2406                   * decompressed version. If that's not possible, it's time to
1965 2407                   * bite the bullet and decompress the data from the hdr.
1966 2408                   */
1967 2409                  if (arc_buf_try_copy_decompressed_data(buf)) {
1968 2410                          /* Skip byteswapping and checksumming (already done) */
1969      -                        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
     2411 +                        ASSERT3P(hdr->b_freeze_cksum, !=, NULL);
1970 2412                          return (0);
1971 2413                  } else {
1972 2414                          int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
1973 2415                              hdr->b_l1hdr.b_pabd, buf->b_data,
1974 2416                              HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1975 2417  
1976 2418                          /*
1977 2419                           * Absent hardware errors or software bugs, this should
1978 2420                           * be impossible, but log it anyway so we can debug it.
1979 2421                           */
↓ open down ↓ 242 lines elided ↑ open up ↑
2222 2664                          if (GHOST_STATE(new_state)) {
2223 2665                                  ASSERT0(bufcnt);
2224 2666                                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2225 2667                                  update_new = B_TRUE;
2226 2668                          }
2227 2669                          arc_evictable_space_increment(hdr, new_state);
2228 2670                  }
2229 2671          }
2230 2672  
2231 2673          ASSERT(!HDR_EMPTY(hdr));
2232      -        if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
     2674 +        if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) {
     2675 +                arc_wait_for_krrp(hdr);
2233 2676                  buf_hash_remove(hdr);
     2677 +        }
2234 2678  
2235 2679          /* adjust state sizes (ignore arc_l2c_only) */
2236 2680  
2237 2681          if (update_new && new_state != arc_l2c_only) {
2238 2682                  ASSERT(HDR_HAS_L1HDR(hdr));
2239 2683                  if (GHOST_STATE(new_state)) {
2240 2684                          ASSERT0(bufcnt);
2241 2685  
2242 2686                          /*
2243 2687                           * When moving a header to a ghost state, we first
↓ open down ↓ 92 lines elided ↑ open up ↑
2336 2780                  }
2337 2781          }
2338 2782  
2339 2783          if (HDR_HAS_L1HDR(hdr))
2340 2784                  hdr->b_l1hdr.b_state = new_state;
2341 2785  
2342 2786          /*
2343 2787           * L2 headers should never be on the L2 state list since they don't
2344 2788           * have L1 headers allocated.
2345 2789           */
2346      -        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
2347      -            multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
     2790 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]));
     2791 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
     2792 +        ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DDT]));
2348 2793  }
2349 2794  
2350 2795  void
2351 2796  arc_space_consume(uint64_t space, arc_space_type_t type)
2352 2797  {
2353 2798          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2354 2799  
2355 2800          switch (type) {
2356 2801          case ARC_SPACE_DATA:
2357      -                aggsum_add(&astat_data_size, space);
     2802 +                ARCSTAT_INCR(arcstat_data_size, space);
2358 2803                  break;
2359 2804          case ARC_SPACE_META:
2360      -                aggsum_add(&astat_metadata_size, space);
     2805 +                ARCSTAT_INCR(arcstat_metadata_size, space);
2361 2806                  break;
     2807 +        case ARC_SPACE_DDT:
     2808 +                ARCSTAT_INCR(arcstat_ddt_size, space);
     2809 +                break;
2362 2810          case ARC_SPACE_OTHER:
2363      -                aggsum_add(&astat_other_size, space);
     2811 +                ARCSTAT_INCR(arcstat_other_size, space);
2364 2812                  break;
2365 2813          case ARC_SPACE_HDRS:
2366      -                aggsum_add(&astat_hdr_size, space);
     2814 +                ARCSTAT_INCR(arcstat_hdr_size, space);
2367 2815                  break;
2368 2816          case ARC_SPACE_L2HDRS:
2369      -                aggsum_add(&astat_l2_hdr_size, space);
     2817 +                ARCSTAT_INCR(arcstat_l2_hdr_size, space);
2370 2818                  break;
2371 2819          }
2372 2820  
2373      -        if (type != ARC_SPACE_DATA)
2374      -                aggsum_add(&arc_meta_used, space);
     2821 +        if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT)
     2822 +                ARCSTAT_INCR(arcstat_meta_used, space);
2375 2823  
2376      -        aggsum_add(&arc_size, space);
     2824 +        atomic_add_64(&arc_size, space);
2377 2825  }
2378 2826  
2379 2827  void
2380 2828  arc_space_return(uint64_t space, arc_space_type_t type)
2381 2829  {
2382 2830          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2383 2831  
2384 2832          switch (type) {
2385 2833          case ARC_SPACE_DATA:
2386      -                aggsum_add(&astat_data_size, -space);
     2834 +                ARCSTAT_INCR(arcstat_data_size, -space);
2387 2835                  break;
2388 2836          case ARC_SPACE_META:
2389      -                aggsum_add(&astat_metadata_size, -space);
     2837 +                ARCSTAT_INCR(arcstat_metadata_size, -space);
2390 2838                  break;
     2839 +        case ARC_SPACE_DDT:
     2840 +                ARCSTAT_INCR(arcstat_ddt_size, -space);
     2841 +                break;
2391 2842          case ARC_SPACE_OTHER:
2392      -                aggsum_add(&astat_other_size, -space);
     2843 +                ARCSTAT_INCR(arcstat_other_size, -space);
2393 2844                  break;
2394 2845          case ARC_SPACE_HDRS:
2395      -                aggsum_add(&astat_hdr_size, -space);
     2846 +                ARCSTAT_INCR(arcstat_hdr_size, -space);
2396 2847                  break;
2397 2848          case ARC_SPACE_L2HDRS:
2398      -                aggsum_add(&astat_l2_hdr_size, -space);
     2849 +                ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
2399 2850                  break;
2400 2851          }
2401 2852  
2402      -        if (type != ARC_SPACE_DATA) {
2403      -                ASSERT(aggsum_compare(&arc_meta_used, space) >= 0);
2404      -                /*
2405      -                 * We use the upper bound here rather than the precise value
2406      -                 * because the arc_meta_max value doesn't need to be
2407      -                 * precise. It's only consumed by humans via arcstats.
2408      -                 */
2409      -                if (arc_meta_max < aggsum_upper_bound(&arc_meta_used))
2410      -                        arc_meta_max = aggsum_upper_bound(&arc_meta_used);
2411      -                aggsum_add(&arc_meta_used, -space);
     2853 +        if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT) {
     2854 +                ASSERT(arc_meta_used >= space);
     2855 +                if (arc_meta_max < arc_meta_used)
     2856 +                        arc_meta_max = arc_meta_used;
     2857 +                ARCSTAT_INCR(arcstat_meta_used, -space);
2412 2858          }
2413 2859  
2414      -        ASSERT(aggsum_compare(&arc_size, space) >= 0);
2415      -        aggsum_add(&arc_size, -space);
     2860 +        ASSERT(arc_size >= space);
     2861 +        atomic_add_64(&arc_size, -space);
2416 2862  }
2417 2863  
2418 2864  /*
2419 2865   * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2420 2866   * with the hdr's b_pabd.
2421 2867   */
2422 2868  static boolean_t
2423 2869  arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2424 2870  {
2425 2871          /*
↓ open down ↓ 33 lines elided ↑ open up ↑
2459 2905   */
2460 2906  static int
2461 2907  arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2462 2908      boolean_t fill, arc_buf_t **ret)
2463 2909  {
2464 2910          arc_buf_t *buf;
2465 2911  
2466 2912          ASSERT(HDR_HAS_L1HDR(hdr));
2467 2913          ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2468 2914          VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2469      -            hdr->b_type == ARC_BUFC_METADATA);
     2915 +            hdr->b_type == ARC_BUFC_METADATA ||
     2916 +            hdr->b_type == ARC_BUFC_DDT);
2470 2917          ASSERT3P(ret, !=, NULL);
2471 2918          ASSERT3P(*ret, ==, NULL);
2472 2919  
2473 2920          buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2474 2921          buf->b_hdr = hdr;
2475 2922          buf->b_data = NULL;
2476 2923          buf->b_next = hdr->b_l1hdr.b_buf;
2477 2924          buf->b_flags = 0;
2478 2925  
2479 2926          add_reference(hdr, tag);
↓ open down ↓ 59 lines elided ↑ open up ↑
2539 2986  static inline void
2540 2987  arc_loaned_bytes_update(int64_t delta)
2541 2988  {
2542 2989          atomic_add_64(&arc_loaned_bytes, delta);
2543 2990  
2544 2991          /* assert that it did not wrap around */
2545 2992          ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2546 2993  }
2547 2994  
2548 2995  /*
     2996 + * Allocates an ARC buf header that's in an evicted & L2-cached state.
     2997 + * This is used during l2arc reconstruction to make empty ARC buffers
     2998 + * which circumvent the regular disk->arc->l2arc path and instead come
     2999 + * into being in the reverse order, i.e. l2arc->arc.
     3000 + */
     3001 +static arc_buf_hdr_t *
     3002 +arc_buf_alloc_l2only(uint64_t load_guid, arc_buf_contents_t type,
     3003 +    l2arc_dev_t *dev, dva_t dva, uint64_t daddr, uint64_t lsize,
     3004 +    uint64_t psize, uint64_t birth, zio_cksum_t cksum, int checksum_type,
     3005 +    enum zio_compress compress, boolean_t arc_compress)
     3006 +{
     3007 +        arc_buf_hdr_t *hdr;
     3008 +
     3009 +        if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
     3010 +                type = ARC_BUFC_METADATA;
     3011 +
     3012 +        ASSERT(lsize != 0);
     3013 +        hdr = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
     3014 +        ASSERT(HDR_EMPTY(hdr));
     3015 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
     3016 +
     3017 +        hdr->b_spa = load_guid;
     3018 +        hdr->b_type = type;
     3019 +        hdr->b_flags = 0;
     3020 +
     3021 +        if (arc_compress)
     3022 +                arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
     3023 +        else
     3024 +                arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
     3025 +
     3026 +        HDR_SET_COMPRESS(hdr, compress);
     3027 +
     3028 +        arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
     3029 +        hdr->b_dva = dva;
     3030 +        hdr->b_birth = birth;
     3031 +        if (checksum_type != ZIO_CHECKSUM_OFF) {
     3032 +                hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
     3033 +                bcopy(&cksum, hdr->b_freeze_cksum, sizeof (cksum));
     3034 +        }
     3035 +
     3036 +        HDR_SET_PSIZE(hdr, psize);
     3037 +        HDR_SET_LSIZE(hdr, lsize);
     3038 +
     3039 +        hdr->b_l2hdr.b_dev = dev;
     3040 +        hdr->b_l2hdr.b_daddr = daddr;
     3041 +
     3042 +        return (hdr);
     3043 +}
     3044 +
     3045 +/*
2549 3046   * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
2550 3047   * flight data by arc_tempreserve_space() until they are "returned". Loaned
2551 3048   * buffers must be returned to the arc before they can be used by the DMU or
2552 3049   * freed.
2553 3050   */
2554 3051  arc_buf_t *
2555 3052  arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
2556 3053  {
2557 3054          arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
2558 3055              is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
↓ open down ↓ 68 lines elided ↑ open up ↑
2627 3124  
2628 3125          /* protected by hash lock, if in the hash table */
2629 3126          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
2630 3127                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2631 3128                  ASSERT(state != arc_anon && state != arc_l2c_only);
2632 3129  
2633 3130                  (void) refcount_remove_many(&state->arcs_esize[type],
2634 3131                      size, hdr);
2635 3132          }
2636 3133          (void) refcount_remove_many(&state->arcs_size, size, hdr);
2637      -        if (type == ARC_BUFC_METADATA) {
     3134 +        if (type == ARC_BUFC_DDT) {
     3135 +                arc_space_return(size, ARC_SPACE_DDT);
     3136 +        } else if (type == ARC_BUFC_METADATA) {
2638 3137                  arc_space_return(size, ARC_SPACE_META);
2639 3138          } else {
2640 3139                  ASSERT(type == ARC_BUFC_DATA);
2641 3140                  arc_space_return(size, ARC_SPACE_DATA);
2642 3141          }
2643 3142  
2644 3143          l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
2645 3144  }
2646 3145  
2647 3146  /*
↓ open down ↓ 11 lines elided ↑ open up ↑
2659 3158          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2660 3159  
2661 3160          /*
2662 3161           * Start sharing the data buffer. We transfer the
2663 3162           * refcount ownership to the hdr since it always owns
2664 3163           * the refcount whenever an arc_buf_t is shared.
2665 3164           */
2666 3165          refcount_transfer_ownership(&state->arcs_size, buf, hdr);
2667 3166          hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
2668 3167          abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
2669      -            HDR_ISTYPE_METADATA(hdr));
     3168 +            !HDR_ISTYPE_DATA(hdr));
2670 3169          arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2671 3170          buf->b_flags |= ARC_BUF_FLAG_SHARED;
2672 3171  
2673 3172          /*
2674 3173           * Since we've transferred ownership to the hdr we need
2675 3174           * to increment its compressed and uncompressed kstats and
2676 3175           * decrement the overhead size.
2677 3176           */
2678 3177          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2679 3178          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
↓ open down ↓ 172 lines elided ↑ open up ↑
2852 3351          ASSERT(HDR_HAS_L1HDR(hdr));
2853 3352          ASSERT(!HDR_SHARED_DATA(hdr));
2854 3353  
2855 3354          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2856 3355          hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
2857 3356          hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2858 3357          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2859 3358  
2860 3359          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2861 3360          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
     3361 +        arc_update_hit_stat(hdr, B_TRUE);
2862 3362  }
2863 3363  
2864 3364  static void
2865 3365  arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
2866 3366  {
2867 3367          ASSERT(HDR_HAS_L1HDR(hdr));
2868 3368          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2869 3369  
2870 3370          /*
2871 3371           * If the hdr is currently being written to the l2arc then
↓ open down ↓ 14 lines elided ↑ open up ↑
2886 3386          ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2887 3387          ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2888 3388  }
2889 3389  
2890 3390  static arc_buf_hdr_t *
2891 3391  arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
2892 3392      enum zio_compress compression_type, arc_buf_contents_t type)
2893 3393  {
2894 3394          arc_buf_hdr_t *hdr;
2895 3395  
2896      -        VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
     3396 +        ASSERT3U(lsize, >, 0);
2897 3397  
     3398 +        if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
     3399 +                type = ARC_BUFC_METADATA;
     3400 +        VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA ||
     3401 +            type == ARC_BUFC_DDT);
     3402 +
2898 3403          hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
2899 3404          ASSERT(HDR_EMPTY(hdr));
2900      -        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3405 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2901 3406          ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
2902 3407          HDR_SET_PSIZE(hdr, psize);
2903 3408          HDR_SET_LSIZE(hdr, lsize);
2904 3409          hdr->b_spa = spa;
2905 3410          hdr->b_type = type;
2906 3411          hdr->b_flags = 0;
2907 3412          arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
2908 3413          arc_hdr_set_compress(hdr, compression_type);
2909 3414  
2910 3415          hdr->b_l1hdr.b_state = arc_anon;
↓ open down ↓ 44 lines elided ↑ open up ↑
2955 3460                   * header has just come out of L2ARC, so we set its state to
2956 3461                   * l2c_only even though it's about to change.
2957 3462                   */
2958 3463                  nhdr->b_l1hdr.b_state = arc_l2c_only;
2959 3464  
2960 3465                  /* Verify previous threads set to NULL before freeing */
2961 3466                  ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
2962 3467          } else {
2963 3468                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2964 3469                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
2965      -                ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3470 +                ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2966 3471  
2967 3472                  /*
2968 3473                   * If we've reached here, we must have been called from
2969 3474                   * arc_evict_hdr(), as such we should have already been
2970 3475                   * removed from any ghost list we were previously on
2971 3476                   * (which protects us from racing with arc_evict_state),
2972 3477                   * thus no locking is needed during this check.
2973 3478                   */
2974 3479                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
2975 3480  
↓ open down ↓ 84 lines elided ↑ open up ↑
3060 3565          ASSERT(compression_type > ZIO_COMPRESS_OFF);
3061 3566          ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3062 3567  
3063 3568          arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3064 3569              compression_type, ARC_BUFC_DATA);
3065 3570          ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3066 3571  
3067 3572          arc_buf_t *buf = NULL;
3068 3573          VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3069 3574          arc_buf_thaw(buf);
3070      -        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     3575 +        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3071 3576  
3072 3577          if (!arc_buf_is_shared(buf)) {
3073 3578                  /*
3074 3579                   * To ensure that the hdr has the correct data in it if we call
3075 3580                   * arc_decompress() on this buf before it's been written to
3076 3581                   * disk, it's easiest if we just set up sharing between the
3077 3582                   * buf and the hdr.
3078 3583                   */
3079 3584                  ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3080 3585                  arc_hdr_free_pabd(hdr);
↓ open down ↓ 11 lines elided ↑ open up ↑
3092 3597          uint64_t psize = arc_hdr_size(hdr);
3093 3598  
3094 3599          ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3095 3600          ASSERT(HDR_HAS_L2HDR(hdr));
3096 3601  
3097 3602          list_remove(&dev->l2ad_buflist, hdr);
3098 3603  
3099 3604          ARCSTAT_INCR(arcstat_l2_psize, -psize);
3100 3605          ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3101 3606  
3102      -        vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
     3607 +        /*
     3608 +         * l2ad_vdev can be NULL here if the device was evicted asynchronously.
     3609 +         */
     3610 +        if (dev->l2ad_vdev != NULL)
     3611 +                vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3103 3612  
3104 3613          (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3105 3614          arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3106 3615  }
3107 3616  
3108 3617  static void
3109 3618  arc_hdr_destroy(arc_buf_hdr_t *hdr)
3110 3619  {
3111 3620          if (HDR_HAS_L1HDR(hdr)) {
3112 3621                  ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3113 3622                      hdr->b_l1hdr.b_bufcnt > 0);
3114 3623                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3115 3624                  ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3116 3625          }
3117 3626          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3118 3627          ASSERT(!HDR_IN_HASH_TABLE(hdr));
3119 3628  
3120      -        if (!HDR_EMPTY(hdr))
3121      -                buf_discard_identity(hdr);
3122      -
3123 3629          if (HDR_HAS_L2HDR(hdr)) {
3124 3630                  l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3125 3631                  boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3126 3632  
     3633 +                /* To avoid racing with L2ARC the header needs to be locked */
     3634 +                ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
     3635 +
3127 3636                  if (!buflist_held)
3128 3637                          mutex_enter(&dev->l2ad_mtx);
3129 3638  
3130 3639                  /*
     3640 +                 * The L2ARC buflist is now held, so it is safe to discard
     3641 +                 * the identity. Discarding it without the buflist held could
     3642 +                 * let L2ARC pick the wrong mutex for this hdr and panic,
     3643 +                 * because the hash mutex is selected based on the identity.
     3644 +                 */
     3645 +                if (!HDR_EMPTY(hdr))
     3646 +                        buf_discard_identity(hdr);
     3647 +
     3648 +                /*
3131 3649                   * Even though we checked this conditional above, we
3132 3650                   * need to check this again now that we have the
3133 3651                   * l2ad_mtx. This is because we could be racing with
3134 3652                   * another thread calling l2arc_evict() which might have
3135 3653                   * destroyed this header's L2 portion as we were waiting
3136 3654                   * to acquire the l2ad_mtx. If that happens, we don't
3137 3655                   * want to re-destroy the header's L2 portion.
3138 3656                   */
3139 3657                  if (HDR_HAS_L2HDR(hdr))
3140 3658                          arc_hdr_l2hdr_destroy(hdr);
3141 3659  
3142 3660                  if (!buflist_held)
3143 3661                          mutex_exit(&dev->l2ad_mtx);
3144 3662          }
3145 3663  
     3664 +        if (!HDR_EMPTY(hdr))
     3665 +                buf_discard_identity(hdr);
     3666 +
3146 3667          if (HDR_HAS_L1HDR(hdr)) {
3147 3668                  arc_cksum_free(hdr);
3148 3669  
3149 3670                  while (hdr->b_l1hdr.b_buf != NULL)
3150 3671                          arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3151 3672  
3152 3673  #ifdef ZFS_DEBUG
3153 3674                  if (hdr->b_l1hdr.b_thawed != NULL) {
3154 3675                          kmem_free(hdr->b_l1hdr.b_thawed, 1);
3155 3676                          hdr->b_l1hdr.b_thawed = NULL;
↓ open down ↓ 55 lines elided ↑ open up ↑
3211 3732   */
3212 3733  static int64_t
3213 3734  arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3214 3735  {
3215 3736          arc_state_t *evicted_state, *state;
3216 3737          int64_t bytes_evicted = 0;
3217 3738  
3218 3739          ASSERT(MUTEX_HELD(hash_lock));
3219 3740          ASSERT(HDR_HAS_L1HDR(hdr));
3220 3741  
     3742 +        arc_wait_for_krrp(hdr);
     3743 +
3221 3744          state = hdr->b_l1hdr.b_state;
3222 3745          if (GHOST_STATE(state)) {
3223 3746                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3224 3747                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3225 3748  
3226 3749                  /*
3227 3750                   * l2arc_write_buffers() relies on a header's L1 portion
3228 3751                   * (i.e. its b_pabd field) during its write phase.
3229 3752                   * Thus, we cannot push a header onto the arc_l2c_only
3230 3753                   * state (removing its L1 piece) until the header is
↓ open down ↓ 368 lines elided ↑ open up ↑
3599 4122  
3600 4123          if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
3601 4124                  delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
3602 4125                  return (arc_evict_state(state, spa, delta, type));
3603 4126          }
3604 4127  
3605 4128          return (0);
3606 4129  }
3607 4130  
3608 4131  /*
3609      - * Evict metadata buffers from the cache, such that arc_meta_used is
3610      - * capped by the arc_meta_limit tunable.
     4132 + * Depending on the adjust_ddt argument, evict either DDT (B_TRUE) or
     4133 + * metadata (B_FALSE) buffers from the cache, such that arc_ddt_size is
     4134 + * capped by the arc_ddt_limit tunable or arc_meta_used is capped by the
     4135 + * arc_meta_limit tunable, respectively.
3611 4136   */
3612 4137  static uint64_t
3613      -arc_adjust_meta(uint64_t meta_used)
     4138 +arc_adjust_meta_or_ddt(boolean_t adjust_ddt)
3614 4139  {
3615 4140          uint64_t total_evicted = 0;
3616      -        int64_t target;
     4141 +        int64_t target, over_limit;
     4142 +        arc_buf_contents_t type;
3617 4143  
     4144 +        if (adjust_ddt) {
     4145 +                over_limit = arc_ddt_size - arc_ddt_limit;
     4146 +                type = ARC_BUFC_DDT;
     4147 +        } else {
     4148 +                over_limit = arc_meta_used - arc_meta_limit;
     4149 +                type = ARC_BUFC_METADATA;
     4150 +        }
     4151 +
3618 4152          /*
3619      -         * If we're over the meta limit, we want to evict enough
3620      -         * metadata to get back under the meta limit. We don't want to
     4153 +         * If we're over the limit, we want to evict enough
     4154 +         * to get back under the limit. We don't want to
3621 4155           * evict so much that we drop the MRU below arc_p, though. If
3622 4156           * we're over the limit more than we're over arc_p, we
3623 4157           * evict some from the MRU here, and some from the MFU below.
3624 4158           */
3625      -        target = MIN((int64_t)(meta_used - arc_meta_limit),
     4159 +        target = MIN(over_limit,
3626 4160              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3627 4161              refcount_count(&arc_mru->arcs_size) - arc_p));
3628 4162  
3629      -        total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4163 +        total_evicted += arc_adjust_impl(arc_mru, 0, target, type);
3630 4164  
     4165 +        over_limit = adjust_ddt ? arc_ddt_size - arc_ddt_limit :
     4166 +            arc_meta_used - arc_meta_limit;
     4167 +
3631 4168          /*
3632 4169           * Similar to the above, we want to evict enough bytes to get us
3633 4170           * below the limit, but not so much as to drop us below the
3634 4171           * space allotted to the MFU (which is defined as arc_c - arc_p).
3635 4172           */
3636      -        target = MIN((int64_t)(meta_used - arc_meta_limit),
3637      -            (int64_t)(refcount_count(&arc_mfu->arcs_size) -
3638      -            (arc_c - arc_p)));
     4173 +        target = MIN(over_limit,
     4174 +            (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
3639 4175  
3640      -        total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4176 +        total_evicted += arc_adjust_impl(arc_mfu, 0, target, type);
3641 4177  
3642 4178          return (total_evicted);
3643 4179  }
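To make the two eviction passes concrete (numbers are purely illustrative): suppose adjust_ddt is B_TRUE, arc_ddt_size exceeds arc_ddt_limit by 512 MB, and the anonymous plus MRU lists exceed arc_p by only 128 MB. The first pass then targets MIN(512 MB, 128 MB) = 128 MB of DDT buffers from the MRU; after over_limit is recomputed, the second pass targets the remaining overage (up to 384 MB) of DDT buffers from the MFU, bounded by how far the MFU is above arc_c - arc_p.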
3644 4180  
3645 4181  /*
3646 4182   * Return the type of the oldest buffer in the given arc state
3647 4183   *
3648      - * This function will select a random sublist of type ARC_BUFC_DATA and
3649      - * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
     4184 + * This function will select one random sublist of each type (ARC_BUFC_DATA,
     4185 + * ARC_BUFC_METADATA, and ARC_BUFC_DDT). The tail of each sublist
3650 4186   * is compared, and the type which contains the "older" buffer will be
3651 4187   * returned.
3652 4188   */
3653 4189  static arc_buf_contents_t
3654 4190  arc_adjust_type(arc_state_t *state)
3655 4191  {
3656 4192          multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
3657 4193          multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
     4194 +        multilist_t *ddt_ml = state->arcs_list[ARC_BUFC_DDT];
3658 4195          int data_idx = multilist_get_random_index(data_ml);
3659 4196          int meta_idx = multilist_get_random_index(meta_ml);
     4197 +        int ddt_idx = multilist_get_random_index(ddt_ml);
3660 4198          multilist_sublist_t *data_mls;
3661 4199          multilist_sublist_t *meta_mls;
3662      -        arc_buf_contents_t type;
     4200 +        multilist_sublist_t *ddt_mls;
     4201 +        arc_buf_contents_t type = ARC_BUFC_DATA; /* silence compiler warning */
3663 4202          arc_buf_hdr_t *data_hdr;
3664 4203          arc_buf_hdr_t *meta_hdr;
     4204 +        arc_buf_hdr_t *ddt_hdr;
     4205 +        clock_t oldest;
3665 4206  
3666 4207          /*
3667 4208           * We keep the sublist lock until we're finished, to prevent
3668 4209           * the headers from being destroyed via arc_evict_state().
3669 4210           */
3670 4211          data_mls = multilist_sublist_lock(data_ml, data_idx);
3671 4212          meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
     4213 +        ddt_mls = multilist_sublist_lock(ddt_ml, ddt_idx);
3672 4214  
3673 4215          /*
3674 4216           * These loops are to ensure we skip any markers that
3675 4217           * might be at the tail of the lists due to arc_evict_state().
3676 4218           */
3677 4219  
3678 4220          for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
3679 4221              data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
3680 4222                  if (data_hdr->b_spa != 0)
3681 4223                          break;
3682 4224          }
3683 4225  
3684 4226          for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
3685 4227              meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
3686 4228                  if (meta_hdr->b_spa != 0)
3687 4229                          break;
3688 4230          }
3689 4231  
3690      -        if (data_hdr == NULL && meta_hdr == NULL) {
     4232 +        for (ddt_hdr = multilist_sublist_tail(ddt_mls); ddt_hdr != NULL;
     4233 +            ddt_hdr = multilist_sublist_prev(ddt_mls, ddt_hdr)) {
     4234 +                if (ddt_hdr->b_spa != 0)
     4235 +                        break;
     4236 +        }
     4237 +
     4238 +        if (data_hdr == NULL && meta_hdr == NULL && ddt_hdr == NULL) {
3691 4239                  type = ARC_BUFC_DATA;
3692      -        } else if (data_hdr == NULL) {
     4240 +        } else if (data_hdr != NULL && meta_hdr != NULL && ddt_hdr != NULL) {
     4241 +                /* The headers can't be on the sublist without an L1 header */
     4242 +                ASSERT(HDR_HAS_L1HDR(data_hdr));
     4243 +                ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4244 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
     4245 +
     4246 +                oldest = data_hdr->b_l1hdr.b_arc_access;
     4247 +                type = ARC_BUFC_DATA;
     4248 +                if (oldest > meta_hdr->b_l1hdr.b_arc_access) {
     4249 +                        oldest = meta_hdr->b_l1hdr.b_arc_access;
     4250 +                        type = ARC_BUFC_METADATA;
     4251 +                }
     4252 +                if (oldest > ddt_hdr->b_l1hdr.b_arc_access) {
     4253 +                        type = ARC_BUFC_DDT;
     4254 +                }
     4255 +        } else if (data_hdr == NULL && ddt_hdr == NULL) {
3693 4256                  ASSERT3P(meta_hdr, !=, NULL);
3694 4257                  type = ARC_BUFC_METADATA;
3695      -        } else if (meta_hdr == NULL) {
     4258 +        } else if (meta_hdr == NULL && ddt_hdr == NULL) {
3696 4259                  ASSERT3P(data_hdr, !=, NULL);
3697 4260                  type = ARC_BUFC_DATA;
3698      -        } else {
3699      -                ASSERT3P(data_hdr, !=, NULL);
3700      -                ASSERT3P(meta_hdr, !=, NULL);
     4261 +        } else if (meta_hdr == NULL && data_hdr == NULL) {
     4262 +                ASSERT3P(ddt_hdr, !=, NULL);
     4263 +                type = ARC_BUFC_DDT;
     4264 +        } else if (data_hdr != NULL && ddt_hdr != NULL) {
     4265 +                ASSERT3P(meta_hdr, ==, NULL);
3701 4266  
3702 4267                  /* The headers can't be on the sublist without an L1 header */
3703 4268                  ASSERT(HDR_HAS_L1HDR(data_hdr));
     4269 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
     4270 +
     4271 +                if (data_hdr->b_l1hdr.b_arc_access <
     4272 +                    ddt_hdr->b_l1hdr.b_arc_access) {
     4273 +                        type = ARC_BUFC_DATA;
     4274 +                } else {
     4275 +                        type = ARC_BUFC_DDT;
     4276 +                }
     4277 +        } else if (meta_hdr != NULL && ddt_hdr != NULL) {
     4278 +                ASSERT3P(data_hdr, ==, NULL);
     4279 +
     4280 +                /* The headers can't be on the sublist without an L1 header */
3704 4281                  ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4282 +                ASSERT(HDR_HAS_L1HDR(ddt_hdr));
3705 4283  
     4284 +                if (meta_hdr->b_l1hdr.b_arc_access <
     4285 +                    ddt_hdr->b_l1hdr.b_arc_access) {
     4286 +                        type = ARC_BUFC_METADATA;
     4287 +                } else {
     4288 +                        type = ARC_BUFC_DDT;
     4289 +                }
     4290 +        } else if (meta_hdr != NULL && data_hdr != NULL) {
     4291 +                ASSERT3P(ddt_hdr, ==, NULL);
     4292 +
     4293 +                /* The headers can't be on the sublist without an L1 header */
     4294 +                ASSERT(HDR_HAS_L1HDR(data_hdr));
     4295 +                ASSERT(HDR_HAS_L1HDR(meta_hdr));
     4296 +
3706 4297                  if (data_hdr->b_l1hdr.b_arc_access <
3707 4298                      meta_hdr->b_l1hdr.b_arc_access) {
3708 4299                          type = ARC_BUFC_DATA;
3709 4300                  } else {
3710 4301                          type = ARC_BUFC_METADATA;
3711 4302                  }
     4303 +        } else {
     4304 +                /* should never get here */
     4305 +                ASSERT(0);
3712 4306          }
3713 4307  
     4308 +        multilist_sublist_unlock(ddt_mls);
3714 4309          multilist_sublist_unlock(meta_mls);
3715 4310          multilist_sublist_unlock(data_mls);
3716 4311  
3717 4312          return (type);
3718 4313  }
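Purely as a reading aid, the three-way selection above is equivalent to keeping the minimum b_arc_access over the non-NULL tails; a compact sketch (not the committed code) of that selection:

        /* Illustrative only: equivalent "pick the oldest tail" selection. */
        arc_buf_hdr_t *tails[3] = { data_hdr, meta_hdr, ddt_hdr };
        arc_buf_contents_t types[3] =
            { ARC_BUFC_DATA, ARC_BUFC_METADATA, ARC_BUFC_DDT };
        boolean_t found = B_FALSE;
        clock_t oldest_access = 0;

        type = ARC_BUFC_DATA;           /* default when all tails are NULL */
        for (int i = 0; i < 3; i++) {
                if (tails[i] == NULL)
                        continue;
                /* Headers can't be on the sublist without an L1 header. */
                ASSERT(HDR_HAS_L1HDR(tails[i]));
                if (!found ||
                    tails[i]->b_l1hdr.b_arc_access < oldest_access) {
                        found = B_TRUE;
                        oldest_access = tails[i]->b_l1hdr.b_arc_access;
                        type = types[i];
                }
        }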
3719 4314  
3720 4315  /*
3721 4316   * Evict buffers from the cache, such that arc_size is capped by arc_c.
3722 4317   */
3723 4318  static uint64_t
3724 4319  arc_adjust(void)
3725 4320  {
3726 4321          uint64_t total_evicted = 0;
3727 4322          uint64_t bytes;
3728 4323          int64_t target;
3729      -        uint64_t asize = aggsum_value(&arc_size);
3730      -        uint64_t ameta = aggsum_value(&arc_meta_used);
3731 4324  
3732 4325          /*
3733 4326           * If we're over arc_meta_limit, we want to correct that before
3734 4327           * potentially evicting data buffers below.
3735 4328           */
3736      -        total_evicted += arc_adjust_meta(ameta);
     4329 +        total_evicted += arc_adjust_meta_or_ddt(B_FALSE);
3737 4330  
3738 4331          /*
     4332 +         * If we're over arc_ddt_limit, we want to correct that before
     4333 +         * potentially evicting data buffers below.
     4334 +         */
     4335 +        total_evicted += arc_adjust_meta_or_ddt(B_TRUE);
     4336 +
     4337 +        /*
3739 4338           * Adjust MRU size
3740 4339           *
3741 4340           * If we're over the target cache size, we want to evict enough
3742 4341           * from the list to get back to our target size. We don't want
3743 4342           * to evict too much from the MRU, such that it drops below
3744 4343           * arc_p. So, if we're over our target cache size more than
3745 4344           * the MRU is over arc_p, we'll evict enough to get back to
3746 4345           * arc_p here, and then evict more from the MFU below.
3747 4346           */
3748      -        target = MIN((int64_t)(asize - arc_c),
     4347 +        target = MIN((int64_t)(arc_size - arc_c),
3749 4348              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3750      -            refcount_count(&arc_mru->arcs_size) + ameta - arc_p));
     4349 +            refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
3751 4350  
3752 4351          /*
3753 4352           * If we're below arc_meta_min, always prefer to evict data.
3754 4353           * Otherwise, try to satisfy the requested number of bytes to
3755 4354           * evict from the type which contains older buffers; in an
3756 4355           * effort to keep newer buffers in the cache regardless of their
3757 4356           * type. If we cannot satisfy the number of bytes from this
3758 4357           * type, spill over into the next type.
3759 4358           */
3760 4359          if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
3761      -            ameta > arc_meta_min) {
     4360 +            arc_meta_used > arc_meta_min) {
3762 4361                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3763 4362                  total_evicted += bytes;
3764 4363  
3765 4364                  /*
3766 4365                   * If we couldn't evict our target number of bytes from
3767 4366                   * metadata, we try to get the rest from data.
3768 4367                   */
3769 4368                  target -= bytes;
3770 4369  
3771      -                total_evicted +=
3772      -                    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
     4370 +                bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
     4371 +                total_evicted += bytes;
3773 4372          } else {
3774 4373                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3775 4374                  total_evicted += bytes;
3776 4375  
3777 4376                  /*
3778 4377                   * If we couldn't evict our target number of bytes from
3779 4378                   * data, we try to get the rest from metadata.
3780 4379                   */
3781 4380                  target -= bytes;
3782 4381  
3783      -                total_evicted +=
3784      -                    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4382 +                bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
     4383 +                total_evicted += bytes;
3785 4384          }
3786 4385  
3787 4386          /*
     4387 +         * If we couldn't evict our target number of bytes from
     4388 +         * data and metadata, we try to get the rest from ddt.
     4389 +         */
     4390 +        target -= bytes;
     4391 +        total_evicted +=
     4392 +            arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DDT);
     4393 +
     4394 +        /*
3788 4395           * Adjust MFU size
3789 4396           *
3790 4397           * Now that we've tried to evict enough from the MRU to get its
3791 4398           * size back to arc_p, if we're still above the target cache
3792 4399           * size, we evict the rest from the MFU.
3793 4400           */
3794      -        target = asize - arc_c;
     4401 +        target = arc_size - arc_c;
3795 4402  
3796 4403          if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
3797      -            ameta > arc_meta_min) {
     4404 +            arc_meta_used > arc_meta_min) {
3798 4405                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3799 4406                  total_evicted += bytes;
3800 4407  
3801 4408                  /*
3802 4409                   * If we couldn't evict our target number of bytes from
3803 4410                   * metadata, we try to get the rest from data.
3804 4411                   */
3805 4412                  target -= bytes;
3806 4413  
3807      -                total_evicted +=
3808      -                    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
     4414 +                bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
     4415 +                total_evicted += bytes;
3809 4416          } else {
3810 4417                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3811 4418                  total_evicted += bytes;
3812 4419  
3813 4420                  /*
3814 4421                   * If we couldn't evict our target number of bytes from
3815 4422                   * data, we try to get the rest from metadata.
3816 4423                   */
3817 4424                  target -= bytes;
3818 4425  
3819      -                total_evicted +=
3820      -                    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4426 +                bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
     4427 +                total_evicted += bytes;
3821 4428          }
3822 4429  
3823 4430          /*
     4431 +         * If we couldn't evict our target number of bytes from
     4432 +         * data and metadata, we try to get the rest from ddt.
     4433 +         */
     4434 +        target -= bytes;
     4435 +        total_evicted +=
     4436 +            arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DDT);
     4437 +
     4438 +        /*
3824 4439           * Adjust ghost lists
3825 4440           *
3826 4441           * In addition to the above, the ARC also defines target values
3827 4442           * for the ghost lists. The sum of the mru list and mru ghost
3828 4443           * list should never exceed the target size of the cache, and
3829 4444           * the sum of the mru list, mfu list, mru ghost list, and mfu
3830 4445           * ghost list should never exceed twice the target size of the
3831 4446           * cache. The following logic enforces these limits on the ghost
3832 4447           * caches, and evicts from them as needed.
3833 4448           */
3834 4449          target = refcount_count(&arc_mru->arcs_size) +
3835 4450              refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
3836 4451  
3837 4452          bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
3838 4453          total_evicted += bytes;
3839 4454  
3840 4455          target -= bytes;
3841 4456  
     4457 +        bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
     4458 +        total_evicted += bytes;
     4459 +
     4460 +        target -= bytes;
     4461 +
3842 4462          total_evicted +=
3843      -            arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
     4463 +            arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DDT);
3844 4464  
3845 4465          /*
3846 4466           * We assume the sum of the mru list and mfu list is less than
3847 4467           * or equal to arc_c (we enforced this above), which means we
3848 4468           * can use the simpler of the two equations below:
3849 4469           *
3850 4470           *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
3851 4471           *                  mru ghost + mfu ghost <= arc_c
3852 4472           */
3853 4473          target = refcount_count(&arc_mru_ghost->arcs_size) +
3854 4474              refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
3855 4475  
3856 4476          bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
3857 4477          total_evicted += bytes;
3858 4478  
3859 4479          target -= bytes;
3860 4480  
     4481 +        bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
     4482 +        total_evicted += bytes;
     4483 +
     4484 +        target -= bytes;
     4485 +
3861 4486          total_evicted +=
3862      -            arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
     4487 +            arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DDT);
3863 4488  
3864 4489          return (total_evicted);
3865 4490  }
3866 4491  
     4492 +typedef struct arc_async_flush_data {
     4493 +        uint64_t        aaf_guid;
     4494 +        boolean_t       aaf_retry;
     4495 +} arc_async_flush_data_t;
     4496 +
     4497 +static taskq_t *arc_flush_taskq;
     4498 +
     4499 +static void
     4500 +arc_flush_impl(uint64_t guid, boolean_t retry)
     4501 +{
     4502 +        arc_buf_contents_t arcs;
     4503 +
     4504 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     4505 +                (void) arc_flush_state(arc_mru, guid, arcs, retry);
     4506 +                (void) arc_flush_state(arc_mfu, guid, arcs, retry);
     4507 +                (void) arc_flush_state(arc_mru_ghost, guid, arcs, retry);
     4508 +                (void) arc_flush_state(arc_mfu_ghost, guid, arcs, retry);
     4509 +        }
     4510 +}
     4511 +
     4512 +static void
     4513 +arc_flush_task(void *arg)
     4514 +{
     4515 +        arc_async_flush_data_t *aaf = (arc_async_flush_data_t *)arg;
     4516 +        arc_flush_impl(aaf->aaf_guid, aaf->aaf_retry);
     4517 +        kmem_free(aaf, sizeof (arc_async_flush_data_t));
     4518 +}
     4519 +
     4520 +boolean_t zfs_fastflush = B_TRUE;
     4521 +
3867 4522  void
3868 4523  arc_flush(spa_t *spa, boolean_t retry)
3869 4524  {
3870 4525          uint64_t guid = 0;
     4526 +        boolean_t async_flush = (spa != NULL ? zfs_fastflush : B_FALSE);
     4527 +        arc_async_flush_data_t *aaf = NULL;
3871 4528  
3872 4529          /*
3873 4530           * If retry is B_TRUE, a spa must not be specified since we have
3874 4531           * no good way to determine if all of a spa's buffers have been
3875 4532           * evicted from an arc state.
3876 4533           */
3877      -        ASSERT(!retry || spa == 0);
     4534 +        ASSERT(!retry || spa == NULL);
3878 4535  
3879      -        if (spa != NULL)
     4536 +        if (spa != NULL) {
3880 4537                  guid = spa_load_guid(spa);
     4538 +                if (async_flush) {
     4539 +                        aaf = kmem_alloc(sizeof (arc_async_flush_data_t),
     4540 +                            KM_SLEEP);
     4541 +                        aaf->aaf_guid = guid;
     4542 +                        aaf->aaf_retry = retry;
     4543 +                }
     4544 +        }
3881 4545  
3882      -        (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
3883      -        (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
3884      -
3885      -        (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
3886      -        (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
3887      -
3888      -        (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
3889      -        (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
3890      -
3891      -        (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
3892      -        (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
     4546 +        /*
     4547 +         * Try to flush the spa's remaining ARC buffers asynchronously
     4548 +         * while the pool is being closed. An ARC buffer is bound to a
     4549 +         * spa only by its guid, so a buffer can still exist after the
     4550 +         * pool is gone. If dispatching the asynchronous flush fails, we
     4551 +         * fall back to the regular (synchronous) flush.
     4552 +         * NOTE: An asynchronous flush that has not finished by the time
     4553 +         * the pool is imported again is harmless, even if the guid is
     4554 +         * the same before and after export/import: only unreferenced
     4555 +         * buffers are evicted; all others are skipped.
     4556 +         */
     4557 +        if (!async_flush || (taskq_dispatch(arc_flush_taskq, arc_flush_task,
     4558 +            aaf, TQ_NOSLEEP) == NULL)) {
     4559 +                arc_flush_impl(guid, retry);
     4560 +                if (async_flush)
     4561 +                        kmem_free(aaf, sizeof (arc_async_flush_data_t));
     4562 +        }
3893 4563  }
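arc_flush_taskq is assumed to be created and torn down elsewhere in this change (presumably arc_init()/arc_fini(), which are outside this hunk). For reference, a minimal sketch of such setup; the thread count, priority, and flags are illustrative only:

        /*
         * Illustrative sketch only: the actual thread count, priority and
         * flags for arc_flush_taskq are whatever arc_init() in this change
         * uses (the init/fini hunks are not shown here).
         */
        arc_flush_taskq = taskq_create("arc_flush_tq", 1, minclsyspri,
            1, 4, TASKQ_DYNAMIC);

        /* matching teardown, e.g. in arc_fini() */
        taskq_wait(arc_flush_taskq);
        taskq_destroy(arc_flush_taskq);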
3894 4564  
3895 4565  void
3896 4566  arc_shrink(int64_t to_free)
3897 4567  {
3898      -        uint64_t asize = aggsum_value(&arc_size);
3899 4568          if (arc_c > arc_c_min) {
3900 4569  
3901 4570                  if (arc_c > arc_c_min + to_free)
3902 4571                          atomic_add_64(&arc_c, -to_free);
3903 4572                  else
3904 4573                          arc_c = arc_c_min;
3905 4574  
3906 4575                  atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
3907      -                if (asize < arc_c)
3908      -                        arc_c = MAX(asize, arc_c_min);
     4576 +                if (arc_c > arc_size)
     4577 +                        arc_c = MAX(arc_size, arc_c_min);
3909 4578                  if (arc_p > arc_c)
3910 4579                          arc_p = (arc_c >> 1);
3911 4580                  ASSERT(arc_c >= arc_c_min);
3912 4581                  ASSERT((int64_t)arc_p >= 0);
3913 4582          }
3914 4583  
3915      -        if (asize > arc_c)
     4584 +        if (arc_size > arc_c)
3916 4585                  (void) arc_adjust();
3917 4586  }
3918 4587  
3919 4588  typedef enum free_memory_reason_t {
3920 4589          FMR_UNKNOWN,
3921 4590          FMR_NEEDFREE,
3922 4591          FMR_LOTSFREE,
3923 4592          FMR_SWAPFS_MINFREE,
3924 4593          FMR_PAGES_PP_MAXIMUM,
3925 4594          FMR_HEAP_ARENA,
↓ open down ↓ 143 lines elided ↑ open up ↑
4069 4738  {
4070 4739          size_t                  i;
4071 4740          kmem_cache_t            *prev_cache = NULL;
4072 4741          kmem_cache_t            *prev_data_cache = NULL;
4073 4742          extern kmem_cache_t     *zio_buf_cache[];
4074 4743          extern kmem_cache_t     *zio_data_buf_cache[];
4075 4744          extern kmem_cache_t     *range_seg_cache;
4076 4745          extern kmem_cache_t     *abd_chunk_cache;
4077 4746  
4078 4747  #ifdef _KERNEL
4079      -        if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) {
     4748 +        if (arc_meta_used >= arc_meta_limit || arc_ddt_size >= arc_ddt_limit) {
4080 4749                  /*
4081      -                 * We are exceeding our meta-data cache limit.
4082      -                 * Purge some DNLC entries to release holds on meta-data.
     4750 +                 * We are exceeding our meta-data or DDT cache limit.
     4751 +                 * Purge some DNLC entries to release holds on meta-data/DDT.
4083 4752                   */
4084 4753                  dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4085 4754          }
4086 4755  #if defined(__i386)
4087 4756          /*
4088 4757           * Reclaim unused memory from all kmem caches.
4089 4758           */
4090 4759          kmem_reap();
4091 4760  #endif
4092 4761  #endif
↓ open down ↓ 135 lines elided ↑ open up ↑
4228 4897  
4229 4898                  /*
4230 4899                   * If evicted is zero, we couldn't evict anything via
4231 4900                   * arc_adjust(). This could be due to hash lock
4232 4901                   * collisions, but more likely due to the majority of
4233 4902                   * arc buffers being unevictable. Therefore, even if
4234 4903                   * arc_size is above arc_c, another pass is unlikely to
4235 4904                   * be helpful and could potentially cause us to enter an
4236 4905                   * infinite loop.
4237 4906                   */
4238      -                if (aggsum_compare(&arc_size, arc_c) <= 0|| evicted == 0) {
     4907 +                if (arc_size <= arc_c || evicted == 0) {
4239 4908                          /*
4240 4909                           * We're either no longer overflowing, or we
4241 4910                           * can't evict anything more, so we should wake
4242 4911                           * up any threads before we go to sleep.
4243 4912                           */
4244 4913                          cv_broadcast(&arc_reclaim_waiters_cv);
4245 4914  
4246 4915                          /*
4247 4916                           * Block until signaled, or after one second (we
4248 4917                           * might need to perform arc_kmem_reap_now()
↓ open down ↓ 61 lines elided ↑ open up ↑
4310 4979          if (arc_no_grow)
4311 4980                  return;
4312 4981  
4313 4982          if (arc_c >= arc_c_max)
4314 4983                  return;
4315 4984  
4316 4985          /*
4317 4986           * If we're within (2 * maxblocksize) bytes of the target
4318 4987           * cache size, increment the target cache size
4319 4988           */
4320      -        if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) >
4321      -            0) {
     4989 +        if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
4322 4990                  atomic_add_64(&arc_c, (int64_t)bytes);
4323 4991                  if (arc_c > arc_c_max)
4324 4992                          arc_c = arc_c_max;
4325 4993                  else if (state == arc_anon)
4326 4994                          atomic_add_64(&arc_p, (int64_t)bytes);
4327 4995                  if (arc_p > arc_c)
4328 4996                          arc_p = arc_c;
4329 4997          }
4330 4998          ASSERT((int64_t)arc_p >= 0);
4331 4999  }
↓ open down ↓ 2 lines elided ↑ open up ↑
4334 5002   * Check if arc_size has grown past our upper threshold, determined by
4335 5003   * zfs_arc_overflow_shift.
4336 5004   */
4337 5005  static boolean_t
4338 5006  arc_is_overflowing(void)
4339 5007  {
4340 5008          /* Always allow at least one block of overflow */
4341 5009          uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
4342 5010              arc_c >> zfs_arc_overflow_shift);
4343 5011  
4344      -        /*
4345      -         * We just compare the lower bound here for performance reasons. Our
4346      -         * primary goals are to make sure that the arc never grows without
4347      -         * bound, and that it can reach its maximum size. This check
4348      -         * accomplishes both goals. The maximum amount we could run over by is
4349      -         * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
4350      -         * in the ARC. In practice, that's in the tens of MB, which is low
4351      -         * enough to be safe.
4352      -         */
4353      -        return (aggsum_lower_bound(&arc_size) >= arc_c + overflow);
     5012 +        return (arc_size >= arc_c + overflow);
4354 5013  }
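For example, with arc_c = 64 GB and the default zfs_arc_overflow_shift of 8, the allowed slack is MAX(SPA_MAXBLOCKSIZE, 64 GB >> 8) = 256 MB, so arc_is_overflowing() reports overflow once arc_size reaches arc_c + 256 MB.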
4355 5014  
4356 5015  static abd_t *
4357 5016  arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4358 5017  {
4359 5018          arc_buf_contents_t type = arc_buf_type(hdr);
4360 5019  
4361 5020          arc_get_data_impl(hdr, size, tag);
4362      -        if (type == ARC_BUFC_METADATA) {
     5021 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4363 5022                  return (abd_alloc(size, B_TRUE));
4364 5023          } else {
4365 5024                  ASSERT(type == ARC_BUFC_DATA);
4366 5025                  return (abd_alloc(size, B_FALSE));
4367 5026          }
4368 5027  }
4369 5028  
4370 5029  static void *
4371 5030  arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4372 5031  {
4373 5032          arc_buf_contents_t type = arc_buf_type(hdr);
4374 5033  
4375 5034          arc_get_data_impl(hdr, size, tag);
4376      -        if (type == ARC_BUFC_METADATA) {
     5035 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4377 5036                  return (zio_buf_alloc(size));
4378 5037          } else {
4379 5038                  ASSERT(type == ARC_BUFC_DATA);
4380 5039                  return (zio_data_buf_alloc(size));
4381 5040          }
4382 5041  }
4383 5042  
4384 5043  /*
4385 5044   * Allocate a block and return it to the caller. If we are hitting the
4386 5045   * hard limit for the cache size, we must sleep, waiting for the eviction
↓ open down ↓ 38 lines elided ↑ open up ↑
4425 5084                   */
4426 5085                  if (arc_is_overflowing()) {
4427 5086                          cv_signal(&arc_reclaim_thread_cv);
4428 5087                          cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
4429 5088                  }
4430 5089  
4431 5090                  mutex_exit(&arc_reclaim_lock);
4432 5091          }
4433 5092  
4434 5093          VERIFY3U(hdr->b_type, ==, type);
4435      -        if (type == ARC_BUFC_METADATA) {
     5094 +        if (type == ARC_BUFC_DDT) {
     5095 +                arc_space_consume(size, ARC_SPACE_DDT);
     5096 +        } else if (type == ARC_BUFC_METADATA) {
4436 5097                  arc_space_consume(size, ARC_SPACE_META);
4437 5098          } else {
4438 5099                  arc_space_consume(size, ARC_SPACE_DATA);
4439 5100          }
4440 5101  
4441 5102          /*
4442 5103           * Update the state size.  Note that ghost states have a
4443 5104           * "ghost size" and so don't need to be updated.
4444 5105           */
4445 5106          if (!GHOST_STATE(state)) {
↓ open down ↓ 12 lines elided ↑ open up ↑
4458 5119                  if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4459 5120                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4460 5121                          (void) refcount_add_many(&state->arcs_esize[type],
4461 5122                              size, tag);
4462 5123                  }
4463 5124  
4464 5125                  /*
4465 5126                   * If we are growing the cache, and we are adding anonymous
4466 5127                   * data, and we have outgrown arc_p, update arc_p
4467 5128                   */
4468      -                if (aggsum_compare(&arc_size, arc_c) < 0 &&
4469      -                    hdr->b_l1hdr.b_state == arc_anon &&
     5129 +                if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
4470 5130                      (refcount_count(&arc_anon->arcs_size) +
4471 5131                      refcount_count(&arc_mru->arcs_size) > arc_p))
4472 5132                          arc_p = MIN(arc_c, arc_p + size);
4473 5133          }
4474 5134  }
4475 5135  
4476 5136  static void
4477 5137  arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
4478 5138  {
4479 5139          arc_free_data_impl(hdr, size, tag);
4480 5140          abd_free(abd);
4481 5141  }
4482 5142  
4483 5143  static void
4484 5144  arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
4485 5145  {
4486 5146          arc_buf_contents_t type = arc_buf_type(hdr);
4487 5147  
4488 5148          arc_free_data_impl(hdr, size, tag);
4489      -        if (type == ARC_BUFC_METADATA) {
     5149 +        if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4490 5150                  zio_buf_free(buf, size);
4491 5151          } else {
4492 5152                  ASSERT(type == ARC_BUFC_DATA);
4493 5153                  zio_data_buf_free(buf, size);
4494 5154          }
4495 5155  }
4496 5156  
4497 5157  /*
4498 5158   * Free the arc data buffer.
4499 5159   */
↓ open down ↓ 7 lines elided ↑ open up ↑
4507 5167          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4508 5168                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4509 5169                  ASSERT(state != arc_anon && state != arc_l2c_only);
4510 5170  
4511 5171                  (void) refcount_remove_many(&state->arcs_esize[type],
4512 5172                      size, tag);
4513 5173          }
4514 5174          (void) refcount_remove_many(&state->arcs_size, size, tag);
4515 5175  
4516 5176          VERIFY3U(hdr->b_type, ==, type);
4517      -        if (type == ARC_BUFC_METADATA) {
     5177 +        if (type == ARC_BUFC_DDT) {
     5178 +                arc_space_return(size, ARC_SPACE_DDT);
     5179 +        } else if (type == ARC_BUFC_METADATA) {
4518 5180                  arc_space_return(size, ARC_SPACE_META);
4519 5181          } else {
4520 5182                  ASSERT(type == ARC_BUFC_DATA);
4521 5183                  arc_space_return(size, ARC_SPACE_DATA);
4522 5184          }
4523 5185  }
4524 5186  
4525 5187  /*
4526 5188   * This routine is called whenever a buffer is accessed.
4527 5189   * NOTE: the hash lock is dropped in this function.
↓ open down ↓ 125 lines elided ↑ open up ↑
4653 5315                   */
4654 5316  
4655 5317                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4656 5318                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4657 5319                  arc_change_state(arc_mfu, hdr, hash_lock);
4658 5320          } else {
4659 5321                  ASSERT(!"invalid arc state");
4660 5322          }
4661 5323  }
4662 5324  
     5325 +/*
     5326 + * This routine is called by dbuf_hold() to update the arc_access() state
     5327 + * which otherwise would be skipped for entries in the dbuf cache.
     5328 + */
     5329 +void
     5330 +arc_buf_access(arc_buf_t *buf)
     5331 +{
     5332 +        mutex_enter(&buf->b_evict_lock);
     5333 +        arc_buf_hdr_t *hdr = buf->b_hdr;
     5334 +
     5335 +        /*
     5336 +         * Avoid taking the hash_lock when possible as an optimization.
     5337 +         * The header must be checked again under the hash_lock in order
     5338 +         * to handle the case where it is concurrently being released.
     5339 +         */
     5340 +        if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
     5341 +                mutex_exit(&buf->b_evict_lock);
     5342 +                return;
     5343 +        }
     5344 +
     5345 +        kmutex_t *hash_lock = HDR_LOCK(hdr);
     5346 +        mutex_enter(hash_lock);
     5347 +
     5348 +        if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
     5349 +                mutex_exit(hash_lock);
     5350 +                mutex_exit(&buf->b_evict_lock);
     5351 +                ARCSTAT_BUMP(arcstat_access_skip);
     5352 +                return;
     5353 +        }
     5354 +
     5355 +        mutex_exit(&buf->b_evict_lock);
     5356 +
     5357 +        ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
     5358 +            hdr->b_l1hdr.b_state == arc_mfu);
     5359 +
     5360 +        DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
     5361 +        arc_access(hdr, hash_lock);
     5362 +        mutex_exit(hash_lock);
     5363 +
     5364 +        ARCSTAT_BUMP(arcstat_hits);
     5365 +        /*
     5366 +         * Upstream used the ARCSTAT_CONDSTAT macro here, but they changed
     5367 + * the argument format for that macro, which would require that we
     5368 +         * go and modify all other uses of it. So it's easier to just expand
     5369 +         * this one invocation of the macro to do the right thing.
     5370 +         */
     5371 +        if (!HDR_PREFETCH(hdr)) {
     5372 +                if (!HDR_ISTYPE_METADATA(hdr))
     5373 +                        ARCSTAT_BUMP(arcstat_demand_data_hits);
     5374 +                else
     5375 +                        ARCSTAT_BUMP(arcstat_demand_metadata_hits);
     5376 +        } else {
     5377 +                if (!HDR_ISTYPE_METADATA(hdr))
     5378 +                        ARCSTAT_BUMP(arcstat_prefetch_data_hits);
     5379 +                else
     5380 +                        ARCSTAT_BUMP(arcstat_prefetch_metadata_hits);
     5381 +        }
     5382 +}
     5383 +
4663 5384  /* a generic arc_done_func_t which you can use */
4664 5385  /* ARGSUSED */
4665 5386  void
4666 5387  arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
4667 5388  {
4668 5389          if (zio == NULL || zio->io_error == 0)
4669 5390                  bcopy(buf->b_data, arg, arc_buf_size(buf));
4670 5391          arc_buf_destroy(buf, arg);
4671 5392  }
4672 5393  
↓ open down ↓ 122 lines elided ↑ open up ↑
4795 5516  
4796 5517          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
4797 5518              callback_list != NULL);
4798 5519  
4799 5520          if (no_zio_error) {
4800 5521                  arc_hdr_verify(hdr, zio->io_bp);
4801 5522          } else {
4802 5523                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
4803 5524                  if (hdr->b_l1hdr.b_state != arc_anon)
4804 5525                          arc_change_state(arc_anon, hdr, hash_lock);
4805      -                if (HDR_IN_HASH_TABLE(hdr))
     5526 +                if (HDR_IN_HASH_TABLE(hdr)) {
     5527 +                        if (hash_lock)
     5528 +                                arc_wait_for_krrp(hdr);
4806 5529                          buf_hash_remove(hdr);
     5530 +                }
4807 5531                  freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4808 5532          }
4809 5533  
4810 5534          /*
4811 5535           * Broadcast before we drop the hash_lock to avoid the possibility
4812 5536           * that the hdr (and hence the cv) might be freed before we get to
4813 5537           * the cv_broadcast().
4814 5538           */
4815 5539          cv_broadcast(&hdr->b_l1hdr.b_cv);
4816 5540  
↓ open down ↓ 22 lines elided ↑ open up ↑
4839 5563  
4840 5564                  callback_list = acb->acb_next;
4841 5565                  kmem_free(acb, sizeof (arc_callback_t));
4842 5566          }
4843 5567  
4844 5568          if (freeable)
4845 5569                  arc_hdr_destroy(hdr);
4846 5570  }
4847 5571  
4848 5572  /*
     5573 + * The function to process data from arc by a callback
     5574 + * Process a cached block's data via a caller-supplied callback.
     5575 + * The main purpose is to copy data directly from the ARC to a target buffer.
     5576 +int
     5577 +arc_io_bypass(spa_t *spa, const blkptr_t *bp,
     5578 +    arc_bypass_io_func func, void *arg)
     5579 +{
     5580 +        arc_buf_hdr_t *hdr;
     5581 +        kmutex_t *hash_lock = NULL;
     5582 +        int error = 0;
     5583 +        uint64_t guid = spa_load_guid(spa);
     5584 +
     5585 +top:
     5586 +        hdr = buf_hash_find(guid, bp, &hash_lock);
     5587 +        if (hdr && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_bufcnt > 0 &&
     5588 +            hdr->b_l1hdr.b_buf->b_data) {
     5589 +                if (HDR_IO_IN_PROGRESS(hdr)) {
     5590 +                        cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
     5591 +                        mutex_exit(hash_lock);
     5592 +                        DTRACE_PROBE(arc_bypass_wait);
     5593 +                        goto top;
     5594 +                }
     5595 +
     5596 +                /*
     5597 +                 * Since func is an arbitrary callback that may block, the
     5598 +                 * hash lock is dropped around the call so that other
     5599 +                 * threads are not held up. A per-header counter keeps a
     5600 +                 * reference on the block while it is in use by krrp.
     5601 +                 */
     5602 +
     5603 +                hdr->b_l1hdr.b_krrp++;
     5604 +                mutex_exit(hash_lock);
     5605 +
     5606 +                error = func(hdr->b_l1hdr.b_buf->b_data, hdr->b_lsize, arg);
     5607 +
     5608 +                mutex_enter(hash_lock);
     5609 +                hdr->b_l1hdr.b_krrp--;
     5610 +                cv_broadcast(&hdr->b_l1hdr.b_cv);
     5611 +                mutex_exit(hash_lock);
     5612 +
     5613 +                return (error);
     5614 +        } else {
     5615 +                if (hash_lock)
     5616 +                        mutex_exit(hash_lock);
     5617 +                return (ENODATA);
     5618 +        }
     5619 +}
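
/*
 * Illustrative usage sketch only (not part of this file): a hypothetical
 * consumer copies a cached block straight out of the ARC into its own
 * buffer. The callback signature is assumed to be (void *data, uint64_t
 * size, void *arg), matching the call to func() in arc_io_bypass() above;
 * my_target_t, my_copy_cb() and my_read_cached() are invented names.
 */
#if 0
typedef struct my_target {
	void		*mt_buf;	/* destination buffer */
	uint64_t	mt_size;	/* destination buffer size */
} my_target_t;

static int
my_copy_cb(void *data, uint64_t size, void *arg)
{
	my_target_t *mt = arg;

	if (size > mt->mt_size)
		return (EOVERFLOW);
	bcopy(data, mt->mt_buf, size);
	return (0);
}

static int
my_read_cached(spa_t *spa, const blkptr_t *bp, my_target_t *mt)
{
	/* ENODATA means the block is not in the ARC; fall back to arc_read() */
	return (arc_io_bypass(spa, bp, my_copy_cb, mt));
}
#endif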
     5620 +
     5621 +/*
4849 5622   * "Read" the block at the specified DVA (in bp) via the
4850 5623   * cache.  If the block is found in the cache, invoke the provided
4851 5624   * callback immediately and return.  Note that the `zio' parameter
4852 5625   * in the callback will be NULL in this case, since no IO was
4853 5626   * required.  If the block is not in the cache pass the read request
4854 5627   * on to the spa with a substitute callback function, so that the
4855 5628   * requested block will be added to the cache.
4856 5629   *
4857 5630   * If a read request arrives for a block that has a read in-progress,
4858 5631   * either wait for the in-progress read to complete (and return the
(119 lines elided)
4978 5751                  } else if (*arc_flags & ARC_FLAG_PREFETCH &&
4979 5752                      refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4980 5753                          arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4981 5754                  }
4982 5755                  DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
4983 5756                  arc_access(hdr, hash_lock);
4984 5757                  if (*arc_flags & ARC_FLAG_L2CACHE)
4985 5758                          arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4986 5759                  mutex_exit(hash_lock);
4987 5760                  ARCSTAT_BUMP(arcstat_hits);
4988      -                ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
4989      -                    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
4990      -                    data, metadata, hits);
     5761 +                if (HDR_ISTYPE_DDT(hdr))
     5762 +                        ARCSTAT_BUMP(arcstat_ddt_hits);
     5763 +                arc_update_hit_stat(hdr, B_TRUE);
4991 5764  
4992 5765                  if (done)
4993 5766                          done(NULL, buf, private);
4994 5767          } else {
4995 5768                  uint64_t lsize = BP_GET_LSIZE(bp);
4996 5769                  uint64_t psize = BP_GET_PSIZE(bp);
4997 5770                  arc_callback_t *acb;
4998 5771                  vdev_t *vd = NULL;
4999 5772                  uint64_t addr = 0;
5000 5773                  boolean_t devw = B_FALSE;
(6 lines elided)
5007 5780                          hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
5008 5781                              BP_GET_COMPRESS(bp), type);
5009 5782  
5010 5783                          if (!BP_IS_EMBEDDED(bp)) {
5011 5784                                  hdr->b_dva = *BP_IDENTITY(bp);
5012 5785                                  hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
5013 5786                                  exists = buf_hash_insert(hdr, &hash_lock);
5014 5787                          }
5015 5788                          if (exists != NULL) {
5016 5789                                  /* somebody beat us to the hash insert */
5017      -                                mutex_exit(hash_lock);
5018      -                                buf_discard_identity(hdr);
5019 5790                                  arc_hdr_destroy(hdr);
     5791 +                                mutex_exit(hash_lock);
5020 5792                                  goto top; /* restart the IO request */
5021 5793                          }
5022 5794                  } else {
5023 5795                          /*
5024 5796                           * This block is in the ghost cache. If it was L2-only
5025 5797                           * (and thus didn't have an L1 hdr), we realloc the
5026 5798                           * header to add an L1 hdr.
5027 5799                           */
5028 5800                          if (!HDR_HAS_L1HDR(hdr)) {
5029 5801                                  hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
5030 5802                                      hdr_full_cache);
5031 5803                          }
5032 5804                          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5033 5805                          ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
5034 5806                          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5035 5807                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5036 5808                          ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
5037      -                        ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
     5809 +                        ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
5038 5810  
5039 5811                          /*
5040 5812                           * This is a delicate dance that we play here.
5041 5813                           * This hdr is in the ghost list so we access it
5042 5814                           * to move it out of the ghost list before we
5043 5815                           * initiate the read. If it's a prefetch then
5044 5816                           * it won't have a callback so we'll remove the
5045 5817                           * reference that arc_buf_alloc_impl() created. We
5046 5818                           * do this after we've called arc_access() to
5047 5819                           * avoid hitting an assert in remove_reference().
(31 lines elided)
5079 5851  
5080 5852                  ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5081 5853                  hdr->b_l1hdr.b_acb = acb;
5082 5854                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5083 5855  
5084 5856                  if (HDR_HAS_L2HDR(hdr) &&
5085 5857                      (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5086 5858                          devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5087 5859                          addr = hdr->b_l2hdr.b_daddr;
5088 5860                          /*
5089      -                         * Lock out L2ARC device removal.
     5861 +                         * Lock out device removal.
5090 5862                           */
5091 5863                          if (vdev_is_dead(vd) ||
5092 5864                              !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5093 5865                                  vd = NULL;
5094 5866                  }
5095 5867  
5096 5868                  if (priority == ZIO_PRIORITY_ASYNC_READ)
5097 5869                          arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5098 5870                  else
5099 5871                          arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
(3 lines elided)
5103 5875  
5104 5876                  /*
5105 5877                   * At this point, we have a level 1 cache miss.  Try again in
5106 5878                   * L2ARC if possible.
5107 5879                   */
5108 5880                  ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5109 5881  
5110 5882                  DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5111 5883                      uint64_t, lsize, zbookmark_phys_t *, zb);
5112 5884                  ARCSTAT_BUMP(arcstat_misses);
5113      -                ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
5114      -                    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
5115      -                    data, metadata, misses);
     5885 +                arc_update_hit_stat(hdr, B_FALSE);
5116 5886  
5117 5887                  if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5118 5888                          /*
5119 5889                           * Read from the L2ARC if the following are true:
5120 5890                           * 1. The L2ARC vdev was previously cached.
5121 5891                           * 2. This buffer still has L2ARC metadata.
5122 5892                           * 3. This buffer isn't currently writing to the L2ARC.
5123 5893                           * 4. The L2ARC entry wasn't evicted, which may
5124 5894                           *    also have invalidated the vdev.
5125 5895                           * 5. This isn't prefetch and l2arc_noprefetch is set.
5126 5896                           */
5127 5897                          if (HDR_HAS_L2HDR(hdr) &&
5128 5898                              !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5129 5899                              !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5130 5900                                  l2arc_read_callback_t *cb;
5131 5901                                  abd_t *abd;
5132 5902                                  uint64_t asize;
5133 5903  
5134 5904                                  DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5135 5905                                  ARCSTAT_BUMP(arcstat_l2_hits);
     5906 +                                if (vdev_type_is_ddt(vd))
     5907 +                                        ARCSTAT_BUMP(arcstat_l2_ddt_hits);
5136 5908  
5137 5909                                  cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5138 5910                                      KM_SLEEP);
5139 5911                                  cb->l2rcb_hdr = hdr;
5140 5912                                  cb->l2rcb_bp = *bp;
5141 5913                                  cb->l2rcb_zb = *zb;
5142 5914                                  cb->l2rcb_flags = zio_flags;
5143 5915  
5144 5916                                  asize = vdev_psize_to_asize(vd, size);
5145 5917                                  if (asize != size) {
5146 5918                                          abd = abd_alloc_for_io(asize,
5147      -                                            HDR_ISTYPE_METADATA(hdr));
     5919 +                                            !HDR_ISTYPE_DATA(hdr));
5148 5920                                          cb->l2rcb_abd = abd;
5149 5921                                  } else {
5150 5922                                          abd = hdr->b_l1hdr.b_pabd;
5151 5923                                  }
5152 5924  
5153 5925                                  ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5154 5926                                      addr + asize <= vd->vdev_psize -
5155 5927                                      VDEV_LABEL_END_SIZE);
5156 5928  
5157 5929                                  /*
(7 lines elided)
5165 5937                                  rzio = zio_read_phys(pio, vd, addr,
5166 5938                                      asize, abd,
5167 5939                                      ZIO_CHECKSUM_OFF,
5168 5940                                      l2arc_read_done, cb, priority,
5169 5941                                      zio_flags | ZIO_FLAG_DONT_CACHE |
5170 5942                                      ZIO_FLAG_CANFAIL |
5171 5943                                      ZIO_FLAG_DONT_PROPAGATE |
5172 5944                                      ZIO_FLAG_DONT_RETRY, B_FALSE);
5173 5945                                  DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5174 5946                                      zio_t *, rzio);
     5947 +
5175 5948                                  ARCSTAT_INCR(arcstat_l2_read_bytes, size);
     5949 +                                if (vdev_type_is_ddt(vd))
     5950 +                                        ARCSTAT_INCR(arcstat_l2_ddt_read_bytes,
     5951 +                                            size);
5176 5952  
5177 5953                                  if (*arc_flags & ARC_FLAG_NOWAIT) {
5178 5954                                          zio_nowait(rzio);
5179 5955                                          return (0);
5180 5956                                  }
5181 5957  
5182 5958                                  ASSERT(*arc_flags & ARC_FLAG_WAIT);
5183 5959                                  if (zio_wait(rzio) == 0)
5184 5960                                          return (0);
5185 5961  
(251 lines elided)
5437 6213                  nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
5438 6214                  ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
5439 6215                  ASSERT0(nhdr->b_l1hdr.b_bufcnt);
5440 6216                  ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
5441 6217                  VERIFY3U(nhdr->b_type, ==, type);
5442 6218                  ASSERT(!HDR_SHARED_DATA(nhdr));
5443 6219  
5444 6220                  nhdr->b_l1hdr.b_buf = buf;
5445 6221                  nhdr->b_l1hdr.b_bufcnt = 1;
5446 6222                  (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
     6223 +                nhdr->b_l1hdr.b_krrp = 0;
     6224 +
5447 6225                  buf->b_hdr = nhdr;
5448 6226  
5449 6227                  mutex_exit(&buf->b_evict_lock);
5450 6228                  (void) refcount_add_many(&arc_anon->arcs_size,
5451 6229                      arc_buf_size(buf), buf);
5452 6230          } else {
5453 6231                  mutex_exit(&buf->b_evict_lock);
5454 6232                  ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
5455 6233                  /* protected by hash lock, or hdr is on arc_anon */
5456 6234                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
(190 lines elided)
5647 6425                           * sync-to-convergence, because we remove
5648 6426                           * buffers from the hash table when we arc_free().
5649 6427                           */
5650 6428                          if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
5651 6429                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5652 6430                                          panic("bad overwrite, hdr=%p exists=%p",
5653 6431                                              (void *)hdr, (void *)exists);
5654 6432                                  ASSERT(refcount_is_zero(
5655 6433                                      &exists->b_l1hdr.b_refcnt));
5656 6434                                  arc_change_state(arc_anon, exists, hash_lock);
5657      -                                mutex_exit(hash_lock);
     6435 +                                arc_wait_for_krrp(exists);
5658 6436                                  arc_hdr_destroy(exists);
     6437 +                                mutex_exit(hash_lock);
5659 6438                                  exists = buf_hash_insert(hdr, &hash_lock);
5660 6439                                  ASSERT3P(exists, ==, NULL);
5661 6440                          } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
5662 6441                                  /* nopwrite */
5663 6442                                  ASSERT(zio->io_prop.zp_nopwrite);
5664 6443                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5665 6444                                          panic("bad nopwrite, hdr=%p exists=%p",
5666 6445                                              (void *)hdr, (void *)exists);
5667 6446                          } else {
5668 6447                                  /* Dedup */
(17 lines elided)
5686 6465  
5687 6466          abd_put(zio->io_abd);
5688 6467          kmem_free(callback, sizeof (arc_write_callback_t));
5689 6468  }
5690 6469  
5691 6470  zio_t *
5692 6471  arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
5693 6472      boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
5694 6473      arc_done_func_t *children_ready, arc_done_func_t *physdone,
5695 6474      arc_done_func_t *done, void *private, zio_priority_t priority,
5696      -    int zio_flags, const zbookmark_phys_t *zb)
     6475 +    int zio_flags, const zbookmark_phys_t *zb,
     6476 +    const zio_smartcomp_info_t *smartcomp)
5697 6477  {
5698 6478          arc_buf_hdr_t *hdr = buf->b_hdr;
5699 6479          arc_write_callback_t *callback;
5700 6480          zio_t *zio;
5701 6481          zio_prop_t localprop = *zp;
5702 6482  
5703 6483          ASSERT3P(ready, !=, NULL);
5704 6484          ASSERT3P(done, !=, NULL);
5705 6485          ASSERT(!HDR_IO_ERROR(hdr));
5706 6486          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
(40 lines elided)
5747 6527                  arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
5748 6528          }
5749 6529          ASSERT(!arc_buf_is_shared(buf));
5750 6530          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5751 6531  
5752 6532          zio = zio_write(pio, spa, txg, bp,
5753 6533              abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
5754 6534              HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
5755 6535              (children_ready != NULL) ? arc_write_children_ready : NULL,
5756 6536              arc_write_physdone, arc_write_done, callback,
5757      -            priority, zio_flags, zb);
     6537 +            priority, zio_flags, zb, smartcomp);
5758 6538  
5759 6539          return (zio);
5760 6540  }
5761 6541  
5762 6542  static int
5763 6543  arc_memory_throttle(uint64_t reserve, uint64_t txg)
5764 6544  {
5765 6545  #ifdef _KERNEL
5766 6546          uint64_t available_memory = ptob(freemem);
5767 6547          static uint64_t page_load = 0;
(71 lines elided)
5839 6619          if (error != 0)
5840 6620                  return (error);
5841 6621  
5842 6622          /*
5843 6623           * Throttle writes when the amount of dirty data in the cache
5844 6624           * gets too large.  We try to keep the cache less than half full
5845 6625           * of dirty blocks so that our sync times don't grow too large.
5846 6626           * Note: if two requests come in concurrently, we might let them
5847 6627           * both succeed, when one of them should fail.  Not a huge deal.
5848 6628           */
5849      -
5850 6629          if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
5851 6630              anon_size > arc_c / 4) {
     6631 +                DTRACE_PROBE4(arc__tempreserve__space__throttle, uint64_t,
     6632 +                    arc_tempreserve, arc_state_t *, arc_anon, uint64_t,
     6633 +                    reserve, uint64_t, arc_c);
     6634 +
5852 6635                  uint64_t meta_esize =
5853 6636                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5854 6637                  uint64_t data_esize =
5855 6638                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5856 6639                  dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
5857 6640                      "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
5858 6641                      arc_tempreserve >> 10, meta_esize >> 10,
5859 6642                      data_esize >> 10, reserve >> 10, arc_c >> 10);
5860 6643                  return (SET_ERROR(ERESTART));
5861 6644          }
5862 6645          atomic_add_64(&arc_tempreserve, reserve);
5863 6646          return (0);
5864 6647  }
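
/*
 * Worked example of the dirty-data throttle above (illustrative numbers
 * only): with arc_c at 1 GiB, a reservation fails with ERESTART when
 * anonymous (dirty) data alone exceeds 256 MiB (arc_c / 4) AND
 * reserve + arc_tempreserve + anon_size together exceed 512 MiB
 * (arc_c / 2); otherwise the reserve is added to arc_tempreserve and the
 * write is allowed to proceed.
 */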
5865 6648  
5866 6649  static void
5867 6650  arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
5868      -    kstat_named_t *evict_data, kstat_named_t *evict_metadata)
     6651 +    kstat_named_t *evict_data, kstat_named_t *evict_metadata,
     6652 +    kstat_named_t *evict_ddt)
5869 6653  {
5870 6654          size->value.ui64 = refcount_count(&state->arcs_size);
5871 6655          evict_data->value.ui64 =
5872 6656              refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
5873 6657          evict_metadata->value.ui64 =
5874 6658              refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
     6659 +        evict_ddt->value.ui64 =
     6660 +            refcount_count(&state->arcs_esize[ARC_BUFC_DDT]);
5875 6661  }
5876 6662  
5877 6663  static int
5878 6664  arc_kstat_update(kstat_t *ksp, int rw)
5879 6665  {
5880 6666          arc_stats_t *as = ksp->ks_data;
5881 6667  
5882 6668          if (rw == KSTAT_WRITE) {
5883 6669                  return (EACCES);
5884 6670          } else {
5885 6671                  arc_kstat_update_state(arc_anon,
5886 6672                      &as->arcstat_anon_size,
5887 6673                      &as->arcstat_anon_evictable_data,
5888      -                    &as->arcstat_anon_evictable_metadata);
     6674 +                    &as->arcstat_anon_evictable_metadata,
     6675 +                    &as->arcstat_anon_evictable_ddt);
5889 6676                  arc_kstat_update_state(arc_mru,
5890 6677                      &as->arcstat_mru_size,
5891 6678                      &as->arcstat_mru_evictable_data,
5892      -                    &as->arcstat_mru_evictable_metadata);
     6679 +                    &as->arcstat_mru_evictable_metadata,
     6680 +                    &as->arcstat_mru_evictable_ddt);
5893 6681                  arc_kstat_update_state(arc_mru_ghost,
5894 6682                      &as->arcstat_mru_ghost_size,
5895 6683                      &as->arcstat_mru_ghost_evictable_data,
5896      -                    &as->arcstat_mru_ghost_evictable_metadata);
     6684 +                    &as->arcstat_mru_ghost_evictable_metadata,
     6685 +                    &as->arcstat_mru_ghost_evictable_ddt);
5897 6686                  arc_kstat_update_state(arc_mfu,
5898 6687                      &as->arcstat_mfu_size,
5899 6688                      &as->arcstat_mfu_evictable_data,
5900      -                    &as->arcstat_mfu_evictable_metadata);
     6689 +                    &as->arcstat_mfu_evictable_metadata,
     6690 +                    &as->arcstat_mfu_evictable_ddt);
5901 6691                  arc_kstat_update_state(arc_mfu_ghost,
5902 6692                      &as->arcstat_mfu_ghost_size,
5903 6693                      &as->arcstat_mfu_ghost_evictable_data,
5904      -                    &as->arcstat_mfu_ghost_evictable_metadata);
5905      -
5906      -                ARCSTAT(arcstat_size) = aggsum_value(&arc_size);
5907      -                ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used);
5908      -                ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size);
5909      -                ARCSTAT(arcstat_metadata_size) =
5910      -                    aggsum_value(&astat_metadata_size);
5911      -                ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size);
5912      -                ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size);
5913      -                ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size);
     6694 +                    &as->arcstat_mfu_ghost_evictable_metadata,
     6695 +                    &as->arcstat_mfu_ghost_evictable_ddt);
5914 6696          }
5915 6697  
5916 6698          return (0);
5917 6699  }
5918 6700  
5919 6701  /*
5920 6702   * This function *must* return indices evenly distributed between all
5921 6703   * sublists of the multilist. This is needed due to how the ARC eviction
5922 6704   * code is laid out; arc_evict_state() assumes ARC buffers are evenly
5923 6705   * distributed between all sublists and uses this assumption when
(29 lines elided)
5953 6735  
5954 6736  static void
5955 6737  arc_state_init(void)
5956 6738  {
5957 6739          arc_anon = &ARC_anon;
5958 6740          arc_mru = &ARC_mru;
5959 6741          arc_mru_ghost = &ARC_mru_ghost;
5960 6742          arc_mfu = &ARC_mfu;
5961 6743          arc_mfu_ghost = &ARC_mfu_ghost;
5962 6744          arc_l2c_only = &ARC_l2c_only;
     6745 +        arc_buf_contents_t arcs;
5963 6746  
5964      -        arc_mru->arcs_list[ARC_BUFC_METADATA] =
5965      -            multilist_create(sizeof (arc_buf_hdr_t),
5966      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5967      -            arc_state_multilist_index_func);
5968      -        arc_mru->arcs_list[ARC_BUFC_DATA] =
5969      -            multilist_create(sizeof (arc_buf_hdr_t),
5970      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5971      -            arc_state_multilist_index_func);
5972      -        arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
5973      -            multilist_create(sizeof (arc_buf_hdr_t),
5974      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5975      -            arc_state_multilist_index_func);
5976      -        arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
5977      -            multilist_create(sizeof (arc_buf_hdr_t),
5978      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5979      -            arc_state_multilist_index_func);
5980      -        arc_mfu->arcs_list[ARC_BUFC_METADATA] =
5981      -            multilist_create(sizeof (arc_buf_hdr_t),
5982      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5983      -            arc_state_multilist_index_func);
5984      -        arc_mfu->arcs_list[ARC_BUFC_DATA] =
5985      -            multilist_create(sizeof (arc_buf_hdr_t),
5986      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5987      -            arc_state_multilist_index_func);
5988      -        arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
5989      -            multilist_create(sizeof (arc_buf_hdr_t),
5990      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5991      -            arc_state_multilist_index_func);
5992      -        arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
5993      -            multilist_create(sizeof (arc_buf_hdr_t),
5994      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5995      -            arc_state_multilist_index_func);
5996      -        arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
5997      -            multilist_create(sizeof (arc_buf_hdr_t),
5998      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5999      -            arc_state_multilist_index_func);
6000      -        arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
6001      -            multilist_create(sizeof (arc_buf_hdr_t),
6002      -            offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6003      -            arc_state_multilist_index_func);
     6747 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     6748 +                arc_mru->arcs_list[arcs] =
     6749 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6750 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6751 +                    arc_state_multilist_index_func);
     6752 +                arc_mru_ghost->arcs_list[arcs] =
     6753 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6754 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6755 +                    arc_state_multilist_index_func);
     6756 +                arc_mfu->arcs_list[arcs] =
     6757 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6758 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6759 +                    arc_state_multilist_index_func);
     6760 +                arc_mfu_ghost->arcs_list[arcs] =
     6761 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6762 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6763 +                    arc_state_multilist_index_func);
     6764 +                arc_l2c_only->arcs_list[arcs] =
     6765 +                    multilist_create(sizeof (arc_buf_hdr_t),
     6766 +                    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
     6767 +                    arc_state_multilist_index_func);
6004 6768  
6005      -        refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6006      -        refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6007      -        refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6008      -        refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6009      -        refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6010      -        refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6011      -        refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6012      -        refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6013      -        refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6014      -        refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6015      -        refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6016      -        refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
     6769 +                refcount_create(&arc_anon->arcs_esize[arcs]);
     6770 +                refcount_create(&arc_mru->arcs_esize[arcs]);
     6771 +                refcount_create(&arc_mru_ghost->arcs_esize[arcs]);
     6772 +                refcount_create(&arc_mfu->arcs_esize[arcs]);
     6773 +                refcount_create(&arc_mfu_ghost->arcs_esize[arcs]);
     6774 +                refcount_create(&arc_l2c_only->arcs_esize[arcs]);
     6775 +        }
6017 6776  
     6777 +        arc_flush_taskq = taskq_create("arc_flush_tq",
     6778 +            max_ncpus, minclsyspri, 1, zfs_flush_ntasks, TASKQ_DYNAMIC);
     6779 +
6018 6780          refcount_create(&arc_anon->arcs_size);
6019 6781          refcount_create(&arc_mru->arcs_size);
6020 6782          refcount_create(&arc_mru_ghost->arcs_size);
6021 6783          refcount_create(&arc_mfu->arcs_size);
6022 6784          refcount_create(&arc_mfu_ghost->arcs_size);
6023 6785          refcount_create(&arc_l2c_only->arcs_size);
6024      -
6025      -        aggsum_init(&arc_meta_used, 0);
6026      -        aggsum_init(&arc_size, 0);
6027      -        aggsum_init(&astat_data_size, 0);
6028      -        aggsum_init(&astat_metadata_size, 0);
6029      -        aggsum_init(&astat_hdr_size, 0);
6030      -        aggsum_init(&astat_other_size, 0);
6031      -        aggsum_init(&astat_l2_hdr_size, 0);
6032 6786  }
6033 6787  
6034 6788  static void
6035 6789  arc_state_fini(void)
6036 6790  {
6037      -        refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6038      -        refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6039      -        refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6040      -        refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6041      -        refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6042      -        refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6043      -        refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6044      -        refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6045      -        refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6046      -        refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6047      -        refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6048      -        refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
     6791 +        arc_buf_contents_t arcs;
6049 6792  
6050 6793          refcount_destroy(&arc_anon->arcs_size);
6051 6794          refcount_destroy(&arc_mru->arcs_size);
6052 6795          refcount_destroy(&arc_mru_ghost->arcs_size);
6053 6796          refcount_destroy(&arc_mfu->arcs_size);
6054 6797          refcount_destroy(&arc_mfu_ghost->arcs_size);
6055 6798          refcount_destroy(&arc_l2c_only->arcs_size);
6056 6799  
6057      -        multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
6058      -        multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
6059      -        multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
6060      -        multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
6061      -        multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
6062      -        multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
6063      -        multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
6064      -        multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
     6800 +        for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
     6801 +                multilist_destroy(arc_mru->arcs_list[arcs]);
     6802 +                multilist_destroy(arc_mru_ghost->arcs_list[arcs]);
     6803 +                multilist_destroy(arc_mfu->arcs_list[arcs]);
     6804 +                multilist_destroy(arc_mfu_ghost->arcs_list[arcs]);
     6805 +                multilist_destroy(arc_l2c_only->arcs_list[arcs]);
     6806 +
     6807 +                refcount_destroy(&arc_anon->arcs_esize[arcs]);
     6808 +                refcount_destroy(&arc_mru->arcs_esize[arcs]);
     6809 +                refcount_destroy(&arc_mru_ghost->arcs_esize[arcs]);
     6810 +                refcount_destroy(&arc_mfu->arcs_esize[arcs]);
     6811 +                refcount_destroy(&arc_mfu_ghost->arcs_esize[arcs]);
     6812 +                refcount_destroy(&arc_l2c_only->arcs_esize[arcs]);
     6813 +        }
6065 6814  }
6066 6815  
6067 6816  uint64_t
6068 6817  arc_max_bytes(void)
6069 6818  {
6070 6819          return (arc_c_max);
6071 6820  }
6072 6821  
6073 6822  void
6074 6823  arc_init(void)
(39 lines elided)
6114 6863           */
6115 6864          if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6116 6865                  arc_c_max = zfs_arc_max;
6117 6866                  arc_c_min = MIN(arc_c_min, arc_c_max);
6118 6867          }
6119 6868          if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6120 6869                  arc_c_min = zfs_arc_min;
6121 6870  
6122 6871          arc_c = arc_c_max;
6123 6872          arc_p = (arc_c >> 1);
     6873 +        arc_size = 0;
6124 6874  
     6875 +        /* limit ddt meta-data to 1/4 of the arc capacity */
     6876 +        arc_ddt_limit = arc_c_max / 4;
6125 6877          /* limit meta-data to 1/4 of the arc capacity */
6126 6878          arc_meta_limit = arc_c_max / 4;
6127 6879  
6128 6880  #ifdef _KERNEL
6129 6881          /*
6130 6882           * Metadata is stored in the kernel's heap.  Don't let us
6131 6883           * use more than half the heap for the ARC.
6132 6884           */
6133 6885          arc_meta_limit = MIN(arc_meta_limit,
6134 6886              vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6135 6887  #endif
6136 6888  
6137 6889          /* Allow the tunable to override if it is reasonable */
     6890 +        if (zfs_arc_ddt_limit > 0 && zfs_arc_ddt_limit <= arc_c_max)
     6891 +                arc_ddt_limit = zfs_arc_ddt_limit;
     6892 +        arc_ddt_evict_threshold =
     6893 +            zfs_arc_segregate_ddt ? &arc_ddt_limit : &arc_meta_limit;
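        /*
         * Example (illustrative numbers): with arc_c_max of 64 GiB,
         * arc_ddt_limit defaults to 16 GiB (arc_c_max / 4) unless
         * zfs_arc_ddt_limit overrides it. With zfs_arc_segregate_ddt unset,
         * arc_ddt_evict_threshold points at arc_meta_limit instead, so DDT
         * buffers are presumably evicted against the general metadata limit
         * rather than a separate DDT limit.
         */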
     6894 +
     6895 +        /* Allow the tunable to override if it is reasonable */
6138 6896          if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6139 6897                  arc_meta_limit = zfs_arc_meta_limit;
6140 6898  
6141 6899          if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6142 6900                  arc_c_min = arc_meta_limit / 2;
6143 6901  
6144 6902          if (zfs_arc_meta_min > 0) {
6145 6903                  arc_meta_min = zfs_arc_meta_min;
6146 6904          } else {
6147 6905                  arc_meta_min = arc_c_min / 2;
(74 lines elided)
6222 6980          /* Use B_TRUE to ensure *all* buffers are evicted */
6223 6981          arc_flush(NULL, B_TRUE);
6224 6982  
6225 6983          arc_dead = B_TRUE;
6226 6984  
6227 6985          if (arc_ksp != NULL) {
6228 6986                  kstat_delete(arc_ksp);
6229 6987                  arc_ksp = NULL;
6230 6988          }
6231 6989  
     6990 +        taskq_destroy(arc_flush_taskq);
     6991 +
6232 6992          mutex_destroy(&arc_reclaim_lock);
6233 6993          cv_destroy(&arc_reclaim_thread_cv);
6234 6994          cv_destroy(&arc_reclaim_waiters_cv);
6235 6995  
6236 6996          arc_state_fini();
6237 6997          buf_fini();
6238 6998  
6239 6999          ASSERT0(arc_loaned_bytes);
6240 7000  }
6241 7001  
(133 lines elided)
6375 7135   * integrated, and also may become zpool properties.
6376 7136   *
6377 7137   * There are three key functions that control how the L2ARC warms up:
6378 7138   *
6379 7139   *      l2arc_write_eligible()  check if a buffer is eligible to cache
6380 7140   *      l2arc_write_size()      calculate how much to write
6381 7141   *      l2arc_write_interval()  calculate sleep delay between writes
6382 7142   *
6383 7143   * These three functions determine what to write, how much, and how quickly
6384 7144   * to send writes.
     7145 + *
     7146 + * L2ARC persistency:
     7147 + *
     7148 + * When writing buffers to L2ARC, we periodically add some metadata to
     7149 + * make sure we can pick them up after reboot, thus dramatically reducing
     7150 + * the impact that any downtime has on the performance of storage systems
     7151 + * with large caches.
     7152 + *
     7153 + * The implementation works fairly simply by integrating the following two
     7154 + * modifications:
     7155 + *
     7156 + * *) Every now and then we mix in a piece of metadata (called a log block)
     7157 + *    into the L2ARC write. This allows us to understand what's been written,
     7158 + *    so that we can rebuild the arc_buf_hdr_t structures of the main ARC
     7159 + *    buffers. The log block also includes a "2-back-reference" pointer to
     7160 + *    the second-to-previous block, forming a back-linked list of blocks on
     7161 + *    the L2ARC device.
     7162 + *
     7163 + * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
     7164 + *    for our header bookkeeping purposes. This contains a device header,
     7165 + *    which contains our top-level reference structures. We update it each
     7166 + *    time we write a new log block, so that we're able to locate it in the
     7167 + *    L2ARC device. If this write results in an inconsistent device header
     7168 + *    (e.g. due to power failure), we detect this by verifying the header's
     7169 + *    checksum and simply drop the entries from L2ARC.
     7170 + *
     7171 + * Implementation diagram:
     7172 + *
     7173 + * +=== L2ARC device (not to scale) ======================================+
     7174 + * |       ___two newest log block pointers__.__________                  |
     7175 + * |      /                                   \1 back   \latest           |
     7176 + * |.____/_.                                   V         V                |
     7177 + * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
     7178 + * ||   hdr|      ^         /^       /^        /         /                |
     7179 + * |+------+  ...--\-------/  \-----/--\------/         /                 |
     7180 + * |                \--------------/    \--------------/                  |
     7181 + * +======================================================================+
     7182 + *
     7183 + * As the diagram shows, rather than using a simple linked list, we
     7184 + * use a pair of linked lists with alternating elements. This is a
     7185 + * performance enhancement: we only learn the address of the next log
     7186 + * block once the current block has been completely read in. With a
     7187 + * single list this would hurt performance, because the device's I/O
     7188 + * queue would be kept only one operation deep, incurring a large
     7189 + * amount of I/O round-trip latency. Having two lists allows us to
     7190 + * "prefetch" two log blocks ahead of where we are currently
     7191 + * rebuilding L2ARC buffers.
     7192 + *
     7193 + * On-device data structures:
     7194 + *
     7195 + * L2ARC device header: l2arc_dev_hdr_phys_t
     7196 + * L2ARC log block:     l2arc_log_blk_phys_t
     7197 + *
     7198 + * L2ARC reconstruction:
     7199 + *
     7200 + * When writing data, we simply write in the standard rotary fashion,
     7201 + * evicting buffers as we go and writing new data over them (writing
     7202 + * a new log block every now and then). This obviously means that once we
     7203 + * loop around the end of the device, we will start cutting into an already
     7204 + * committed log block (and its referenced data buffers), like so:
     7205 + *
     7206 + *    current write head__       __old tail
     7207 + *                        \     /
     7208 + *                        V    V
     7209 + * <--|bufs |lb |bufs |lb |    |bufs |lb |bufs |lb |-->
     7210 + *                         ^    ^^^^^^^^^___________________________________
     7211 + *                         |                                                \
     7212 + *                   <<nextwrite>> may overwrite this blk and/or its bufs --'
     7213 + *
     7214 + * When importing the pool, we detect this situation and use it to stop
     7215 + * our scanning process (see l2arc_rebuild).
     7216 + *
     7217 + * There is one significant caveat to consider when rebuilding ARC contents
     7218 + * from an L2ARC device: what about invalidated buffers? Given the above
     7219 + * construction, we cannot update blocks which we've already written to amend
     7220 + * them to remove buffers which were invalidated. Thus, during reconstruction,
     7221 + * we might be populating the cache with buffers for data that's not on the
     7222 + * main pool anymore, or may have been overwritten!
     7223 + *
     7224 + * As it turns out, this isn't a problem. Every arc_read request includes
     7225 + * both the DVA and, crucially, the birth TXG of the BP the caller is
     7226 + * looking for. So even if the cache were populated by completely rotten
     7227 + * blocks for data that had been long deleted and/or overwritten, we'll
     7228 + * never actually return bad data from the cache, since the DVA together
     7229 + * with the birth TXG uniquely identifies a block in space and time - once
     7230 + * created, a block is immutable on disk. The worst we can do is waste
     7231 + * some time and memory during l2arc rebuild reconstructing outdated ARC
     7232 + * entries that will get dropped from the l2arc as it is being updated
     7233 + * with new blocks.
6385 7234   */
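
/*
 * Illustrative sketch only (not part of this file) of the two-list walk
 * described above. All type and helper names here (example_log_blk_t,
 * sync_read_log_blk(), async_read_log_blk(), wait_log_blk(),
 * restore_arc_hdrs_from()) are hypothetical; the real on-disk structures
 * are l2arc_dev_hdr_phys_t and l2arc_log_blk_phys_t.
 */
#if 0
typedef struct example_log_blk {
	example_log_blkptr_t	lb_prev_2;	/* the "2-back-reference" */
	/* entries describing the ARC buffers written just before this block */
} example_log_blk_t;

static void
example_rebuild(example_dev_hdr_t *hdr)
{
	example_log_blk_t *cur;
	example_io_t *next_io, *prev2_io;

	/*
	 * Walk backwards from the two newest log block pointers kept in the
	 * device header. While one block is being processed, the read of
	 * the next one (from the other list) is already in flight, keeping
	 * the device queue two operations deep instead of one.
	 */
	cur = sync_read_log_blk(&hdr->dh_newest);	/* newest log block */
	next_io = async_read_log_blk(&hdr->dh_one_back); /* second newest */

	/* wait_log_blk() is assumed to return NULL at an invalid pointer */
	while (cur != NULL) {
		restore_arc_hdrs_from(cur);
		/* start reading the block two back while next_io completes */
		prev2_io = async_read_log_blk(&cur->lb_prev_2);
		cur = wait_log_blk(next_io);
		next_io = prev2_io;
	}
}
#endif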
6386 7235  
6387 7236  static boolean_t
6388 7237  l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
6389 7238  {
6390 7239          /*
6391 7240           * A buffer is *not* eligible for the L2ARC if it:
6392 7241           * 1. belongs to a different spa.
6393 7242           * 2. is already cached on the L2ARC.
6394 7243           * 3. has an I/O in progress (it may be an incomplete read).
(45 lines elided)
6440 7289                  interval = (hz * l2arc_feed_min_ms) / 1000;
6441 7290          else
6442 7291                  interval = hz * l2arc_feed_secs;
6443 7292  
6444 7293          now = ddi_get_lbolt();
6445 7294          next = MAX(now, MIN(now + interval, began + interval));
6446 7295  
6447 7296          return (next);
6448 7297  }
6449 7298  
     7299 +typedef enum l2ad_feed {
     7300 +        L2ARC_FEED_ALL = 1,
     7301 +        L2ARC_FEED_DDT_DEV,
     7302 +        L2ARC_FEED_NON_DDT_DEV,
     7303 +} l2ad_feed_t;
     7304 +
6450 7305  /*
6451 7306   * Cycle through L2ARC devices.  This is how L2ARC load balances.
6452 7307   * If a device is returned, this also returns holding the spa config lock.
6453 7308   */
6454 7309  static l2arc_dev_t *
6455      -l2arc_dev_get_next(void)
     7310 +l2arc_dev_get_next(l2ad_feed_t feed_type)
6456 7311  {
6457      -        l2arc_dev_t *first, *next = NULL;
     7312 +        l2arc_dev_t *start = NULL, *next = NULL;
6458 7313  
6459 7314          /*
6460 7315           * Lock out the removal of spas (spa_namespace_lock), then removal
6461 7316           * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
6462 7317           * both locks will be dropped and a spa config lock held instead.
6463 7318           */
6464 7319          mutex_enter(&spa_namespace_lock);
6465 7320          mutex_enter(&l2arc_dev_mtx);
6466 7321  
6467 7322          /* if there are no vdevs, there is nothing to do */
6468 7323          if (l2arc_ndev == 0)
6469 7324                  goto out;
6470 7325  
6471      -        first = NULL;
6472      -        next = l2arc_dev_last;
6473      -        do {
6474      -                /* loop around the list looking for a non-faulted vdev */
6475      -                if (next == NULL) {
6476      -                        next = list_head(l2arc_dev_list);
6477      -                } else {
6478      -                        next = list_next(l2arc_dev_list, next);
6479      -                        if (next == NULL)
6480      -                                next = list_head(l2arc_dev_list);
6481      -                }
     7326 +        if (feed_type == L2ARC_FEED_DDT_DEV)
     7327 +                next = l2arc_ddt_dev_last;
     7328 +        else
     7329 +                next = l2arc_dev_last;
6482 7330  
6483      -                /* if we have come back to the start, bail out */
6484      -                if (first == NULL)
6485      -                        first = next;
6486      -                else if (next == first)
6487      -                        break;
     7331 +        /* figure out what the next device we look at should be */
     7332 +        if (next == NULL)
     7333 +                next = list_head(l2arc_dev_list);
     7334 +        else if (list_next(l2arc_dev_list, next) == NULL)
     7335 +                next = list_head(l2arc_dev_list);
     7336 +        else
     7337 +                next = list_next(l2arc_dev_list, next);
     7338 +        ASSERT(next);
6488 7339  
6489      -        } while (vdev_is_dead(next->l2ad_vdev));
     7340 +        /* loop through L2ARC devs looking for the one we need */
     7341 +        /* LINTED(E_CONSTANT_CONDITION) */
     7342 +        while (1) {
     7343 +                if (next == NULL) /* reached list end, start from beginning */
     7344 +                        next = list_head(l2arc_dev_list);
6490 7345  
6491      -        /* if we were unable to find any usable vdevs, return NULL */
6492      -        if (vdev_is_dead(next->l2ad_vdev))
6493      -                next = NULL;
     7346 +                if (start == NULL) { /* save starting dev */
     7347 +                        start = next;
     7348 +                } else if (start == next) { /* full loop completed - stop now */
     7349 +                        next = NULL;
     7350 +                        if (feed_type == L2ARC_FEED_DDT_DEV) {
     7351 +                                l2arc_ddt_dev_last = NULL;
     7352 +                                goto out;
     7353 +                        } else {
     7354 +                                break;
     7355 +                        }
     7356 +                }
6494 7357  
     7358 +                if (!vdev_is_dead(next->l2ad_vdev) && !next->l2ad_rebuild) {
     7359 +                        if (feed_type == L2ARC_FEED_DDT_DEV) {
     7360 +                                if (vdev_type_is_ddt(next->l2ad_vdev)) {
     7361 +                                        l2arc_ddt_dev_last = next;
     7362 +                                        goto out;
     7363 +                                }
     7364 +                        } else if (feed_type == L2ARC_FEED_NON_DDT_DEV) {
     7365 +                                if (!vdev_type_is_ddt(next->l2ad_vdev)) {
     7366 +                                        break;
     7367 +                                }
     7368 +                        } else {
     7369 +                                ASSERT(feed_type == L2ARC_FEED_ALL);
     7370 +                                break;
     7371 +                        }
     7372 +                }
     7373 +                next = list_next(l2arc_dev_list, next);
     7374 +        }
6495 7375          l2arc_dev_last = next;
6496 7376  
6497 7377  out:
6498 7378          mutex_exit(&l2arc_dev_mtx);
6499 7379  
6500 7380          /*
6501 7381           * Grab the config lock to prevent the 'next' device from being
6502 7382           * removed while we are writing to it.
6503 7383           */
6504 7384          if (next != NULL)
(32 lines elided)
6537 7417   */
6538 7418  static void
6539 7419  l2arc_write_done(zio_t *zio)
6540 7420  {
6541 7421          l2arc_write_callback_t *cb;
6542 7422          l2arc_dev_t *dev;
6543 7423          list_t *buflist;
6544 7424          arc_buf_hdr_t *head, *hdr, *hdr_prev;
6545 7425          kmutex_t *hash_lock;
6546 7426          int64_t bytes_dropped = 0;
     7427 +        l2arc_log_blk_buf_t *lb_buf;
6547 7428  
6548 7429          cb = zio->io_private;
6549 7430          ASSERT3P(cb, !=, NULL);
6550 7431          dev = cb->l2wcb_dev;
6551 7432          ASSERT3P(dev, !=, NULL);
6552 7433          head = cb->l2wcb_head;
6553 7434          ASSERT3P(head, !=, NULL);
6554 7435          buflist = &dev->l2ad_buflist;
6555 7436          ASSERT3P(buflist, !=, NULL);
6556 7437          DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
(76 lines elided)
6633 7514  
6634 7515                  mutex_exit(hash_lock);
6635 7516          }
6636 7517  
6637 7518          atomic_inc_64(&l2arc_writes_done);
6638 7519          list_remove(buflist, head);
6639 7520          ASSERT(!HDR_HAS_L1HDR(head));
6640 7521          kmem_cache_free(hdr_l2only_cache, head);
6641 7522          mutex_exit(&dev->l2ad_mtx);
6642 7523  
     7524 +        ASSERT(dev->l2ad_vdev != NULL);
6643 7525          vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
6644 7526  
6645 7527          l2arc_do_free_on_write();
6646 7528  
     7529 +        while ((lb_buf = list_remove_tail(&cb->l2wcb_log_blk_buflist)) != NULL)
     7530 +                kmem_free(lb_buf, sizeof (*lb_buf));
     7531 +        list_destroy(&cb->l2wcb_log_blk_buflist);
6647 7532          kmem_free(cb, sizeof (l2arc_write_callback_t));
6648 7533  }
6649 7534  
6650 7535  /*
6651 7536   * A read to a cache device completed.  Validate buffer contents before
6652 7537   * handing over to the regular ARC routines.
6653 7538   */
6654 7539  static void
6655 7540  l2arc_read_done(zio_t *zio)
6656 7541  {
(86 lines elided)
6743 7628                              hdr, zio->io_priority, cb->l2rcb_flags,
6744 7629                              &cb->l2rcb_zb));
6745 7630                  }
6746 7631          }
6747 7632  
6748 7633          kmem_free(cb, sizeof (l2arc_read_callback_t));
6749 7634  }
6750 7635  
6751 7636  /*
6752 7637   * This is the list priority from which the L2ARC will search for pages to
6753      - * cache.  This is used within loops (0..3) to cycle through lists in the
     7638 + * cache.  This is used within loops to cycle through lists in the
6754 7639   * desired order.  This order can have a significant effect on cache
6755 7640   * performance.
6756 7641   *
6757      - * Currently the metadata lists are hit first, MFU then MRU, followed by
6758      - * the data lists.  This function returns a locked list, and also returns
6759      - * the lock pointer.
     7642 + * Currently the ddt lists are hit first (MFU then MRU),
     7643 + * followed by the metadata lists and then the data lists.
     7644 + * This function returns a locked list, and also returns the lock pointer.
6760 7645   */
6761 7646  static multilist_sublist_t *
6762      -l2arc_sublist_lock(int list_num)
     7647 +l2arc_sublist_lock(enum l2arc_priorities prio)
6763 7648  {
6764 7649          multilist_t *ml = NULL;
6765 7650          unsigned int idx;
6766 7651  
6767      -        ASSERT(list_num >= 0 && list_num <= 3);
     7652 +        ASSERT(prio >= PRIORITY_MFU_DDT);
     7653 +        ASSERT(prio < PRIORITY_NUMTYPES);
6768 7654  
6769      -        switch (list_num) {
6770      -        case 0:
     7655 +        switch (prio) {
     7656 +        case PRIORITY_MFU_DDT:
     7657 +                ml = arc_mfu->arcs_list[ARC_BUFC_DDT];
     7658 +                break;
     7659 +        case PRIORITY_MRU_DDT:
     7660 +                ml = arc_mru->arcs_list[ARC_BUFC_DDT];
     7661 +                break;
     7662 +        case PRIORITY_MFU_META:
6771 7663                  ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
6772 7664                  break;
6773      -        case 1:
     7665 +        case PRIORITY_MRU_META:
6774 7666                  ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
6775 7667                  break;
6776      -        case 2:
     7668 +        case PRIORITY_MFU_DATA:
6777 7669                  ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
6778 7670                  break;
6779      -        case 3:
     7671 +        case PRIORITY_MRU_DATA:
6780 7672                  ml = arc_mru->arcs_list[ARC_BUFC_DATA];
6781 7673                  break;
6782 7674          }
6783 7675  
6784 7676          /*
6785 7677           * Return a randomly-selected sublist. This is acceptable
6786 7678           * because the caller feeds only a little bit of data for each
6787 7679           * call (8MB). Subsequent calls will result in different
6788 7680           * sublists being selected.
6789 7681           */
6790 7682          idx = multilist_get_random_index(ml);
6791 7683          return (multilist_sublist_lock(ml, idx));
6792 7684  }
6793 7685  
6794 7686  /*
     7687 + * Calculates the maximum overhead of L2ARC metadata log blocks for a given
     7688 + * L2ARC write size. l2arc_evict and l2arc_write_buffers need to include this
     7689 + * overhead in processing to make sure there is enough headroom available
     7690 + * when writing buffers.
     7691 + */
     7692 +static inline uint64_t
     7693 +l2arc_log_blk_overhead(uint64_t write_sz)
     7694 +{
     7695 +        return ((write_sz / SPA_MINBLOCKSIZE / L2ARC_LOG_BLK_ENTRIES) + 1) *
     7696 +            L2ARC_LOG_BLK_SIZE;
     7697 +}
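
/*
 * Worked example of the calculation above (illustrative constants):
 * SPA_MINBLOCKSIZE is 512 bytes, so an 8 MiB write can describe at most
 * 8 MiB / 512 = 16384 minimum-size buffers. If, purely for illustration,
 * L2ARC_LOG_BLK_ENTRIES were 1024 and L2ARC_LOG_BLK_SIZE were 128 KiB,
 * that write would need 16384 / 1024 + 1 = 17 log blocks, i.e. about
 * 2.1 MiB of extra headroom.
 */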
     7698 +
     7699 +/*
6795 7700   * Evict buffers from the device write hand to the distance specified in
6796 7701   * bytes.  This distance may span populated buffers, it may span nothing.
6797 7702   * This is clearing a region on the L2ARC device ready for writing.
6798 7703   * If the 'all' boolean is set, every buffer is evicted.
6799 7704   */
6800 7705  static void
6801      -l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
     7706 +l2arc_evict_impl(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
6802 7707  {
6803 7708          list_t *buflist;
6804 7709          arc_buf_hdr_t *hdr, *hdr_prev;
6805 7710          kmutex_t *hash_lock;
6806 7711          uint64_t taddr;
6807 7712  
6808 7713          buflist = &dev->l2ad_buflist;
6809 7714  
6810 7715          if (!all && dev->l2ad_first) {
6811 7716                  /*
6812 7717                   * This is the first sweep through the device.  There is
6813 7718                   * nothing to evict.
6814 7719                   */
6815 7720                  return;
6816 7721          }
6817 7722  
     7723 +        /*
     7724 +         * We need to add in the worst case scenario of log block overhead.
     7725 +         */
     7726 +        distance += l2arc_log_blk_overhead(distance);
6818 7727          if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
6819 7728                  /*
6820 7729                   * When nearing the end of the device, evict to the end
6821 7730                   * before the device write hand jumps to the start.
6822 7731                   */
6823 7732                  taddr = dev->l2ad_end;
6824 7733          } else {
6825 7734                  taddr = dev->l2ad_hand + distance;
6826 7735          }
6827 7736          DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
(63 lines elided)
6891 7800                                  arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
6892 7801                          }
6893 7802  
6894 7803                          arc_hdr_l2hdr_destroy(hdr);
6895 7804                  }
6896 7805                  mutex_exit(hash_lock);
6897 7806          }
6898 7807          mutex_exit(&dev->l2ad_mtx);
6899 7808  }
6900 7809  
     7810 +static void
     7811 +l2arc_evict_task(void *arg)
     7812 +{
     7813 +        l2arc_dev_t *dev = arg;
     7814 +        ASSERT(dev);
     7815 +
     7816 +        /*
     7817 +         * Evict l2arc buffers asynchronously; we need to keep the device
     7818 +         * around until we are sure there aren't any buffers referencing it.
     7819 +         * We do not need to hold any config locks, etc. because at this point,
     7820 +         * we are the only ones who knows about this device (the in-core
     7821 +         * structure), so no new buffers can be created (e.g. if the pool is
     7822 +         * re-imported while the asynchronous eviction is in progress) that
     7823 +         * reference this same in-core structure. Also remove the vdev link
     7824 +         * since further use of it as l2arc device is prohibited.
     7825 +         */
     7826 +        dev->l2ad_vdev = NULL;
     7827 +        l2arc_evict_impl(dev, 0LL, B_TRUE);
     7828 +
     7829 +        /* Same cleanup as in the synchronous path */
     7830 +        list_destroy(&dev->l2ad_buflist);
     7831 +        mutex_destroy(&dev->l2ad_mtx);
     7832 +        refcount_destroy(&dev->l2ad_alloc);
     7833 +        kmem_free(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
     7834 +        kmem_free(dev, sizeof (l2arc_dev_t));
     7835 +}
     7836 +
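           +/*
           + * Tunable: when B_TRUE, evicting all buffers (done at pool export or cache
           + * device removal) is handed off to arc_flush_taskq so the caller does not
           + * block; the async task then frees the l2arc_dev_t. Set to B_FALSE to force
           + * synchronous eviction.
           + */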
     7837 +boolean_t zfs_l2arc_async_evict = B_TRUE;
     7838 +
6901 7839  /*
      7840 + * Perform l2arc eviction for the buffers associated with this device.
      7841 + * If evicting all buffers (done at pool export time), try to evict
      7842 + * asynchronously, and fall back to synchronous eviction on error.
      7843 + * Tell the caller whether to clean up the device:
      7844 + *  - B_TRUE means "asynchronous eviction, do not clean up"
      7845 + *  - B_FALSE means "synchronous eviction, done, please clean up"
     7846 + */
     7847 +static boolean_t
     7848 +l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
     7849 +{
     7850 +        /*
      7851 +         * If we are evicting all the buffers for this device, which happens
      7852 +         * at pool export time, schedule an asynchronous task.
     7853 +         */
     7854 +        if (all && zfs_l2arc_async_evict) {
      7855 +                if (taskq_dispatch(arc_flush_taskq, l2arc_evict_task,
      7856 +                    dev, TQ_NOSLEEP) == NULL) {
     7857 +                        /*
      7858 +                         * Failed to dispatch the asynchronous task;
      7859 +                         * fall back to synchronous eviction.
     7860 +                         */
     7861 +                        l2arc_evict_impl(dev, distance, all);
     7862 +                } else {
     7863 +                        /*
     7864 +                         * Successful dispatch, vdev space updated
     7865 +                         */
     7866 +                        return (B_TRUE);
     7867 +                }
     7868 +        } else {
     7869 +                /* Evict synchronously */
     7870 +                l2arc_evict_impl(dev, distance, all);
     7871 +        }
     7872 +
     7873 +        return (B_FALSE);
     7874 +}
     7875 +
     7876 +/*
6902 7877   * Find and write ARC buffers to the L2ARC device.
6903 7878   *
6904 7879   * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
6905 7880   * for reading until they have completed writing.
6906 7881   * The headroom_boost is an in-out parameter used to maintain headroom boost
6907 7882   * state between calls to this function.
6908 7883   *
6909 7884   * Returns the number of bytes actually written (which may be smaller than
6910 7885   * the delta by which the device hand has changed due to alignment).
6911 7886   */
6912 7887  static uint64_t
6913      -l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
     7888 +l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
     7889 +    l2ad_feed_t feed_type)
6914 7890  {
6915 7891          arc_buf_hdr_t *hdr, *hdr_prev, *head;
     7892 +        /*
     7893 +         * We must carefully track the space we deal with here:
      7894 +         * - write_lsize: sum of the size of all buffers to be written
      7895 +         *      without compression or inter-buffer alignment applied.
      7896 +         *      This size is added to arcstat_l2_lsize, because subsequent
      7897 +         *      eviction of buffers decrements this kstat by only the
      7898 +         *      buffer's b_lsize (which doesn't take alignment into account).
     7899 +         * - write_asize: sum of the size of all buffers to be written
     7900 +         *      with inter-buffer alignment applied.
     7901 +         *      This size is used to estimate the maximum number of bytes
     7902 +         *      we could take up on the device and is thus used to gauge how
     7903 +         *      close we are to hitting target_sz.
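           +         * - write_psize: sum of the physical (on-disk, possibly compressed)
           +         *      size of all buffers to be written. This is what is counted in
           +         *      arcstat_l2_write_bytes and charged to the cache vdev via
           +         *      vdev_space_update().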
     7904 +         */
6916 7905          uint64_t write_asize, write_psize, write_lsize, headroom;
6917 7906          boolean_t full;
6918 7907          l2arc_write_callback_t *cb;
6919 7908          zio_t *pio, *wzio;
     7909 +        enum l2arc_priorities try;
6920 7910          uint64_t guid = spa_load_guid(spa);
     7911 +        boolean_t dev_hdr_update = B_FALSE;
6921 7912  
6922 7913          ASSERT3P(dev->l2ad_vdev, !=, NULL);
6923 7914  
6924 7915          pio = NULL;
     7916 +        cb = NULL;
6925 7917          write_lsize = write_asize = write_psize = 0;
6926 7918          full = B_FALSE;
6927 7919          head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
6928 7920          arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
6929 7921  
6930 7922          /*
6931 7923           * Copy buffers for L2ARC writing.
6932 7924           */
6933      -        for (int try = 0; try <= 3; try++) {
     7925 +        for (try = PRIORITY_MFU_DDT; try < PRIORITY_NUMTYPES; try++) {
6934 7926                  multilist_sublist_t *mls = l2arc_sublist_lock(try);
6935 7927                  uint64_t passed_sz = 0;
6936 7928  
6937 7929                  /*
6938 7930                   * L2ARC fast warmup.
6939 7931                   *
6940 7932                   * Until the ARC is warm and starts to evict, read from the
6941 7933                   * head of the ARC lists rather than the tail.
6942 7934                   */
6943 7935                  if (arc_warm == B_FALSE)
↓ open down ↓ 49 lines elided ↑ open up ↑
6993 7985                          uint64_t psize = arc_hdr_size(hdr);
6994 7986                          uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
6995 7987                              psize);
6996 7988  
6997 7989                          if ((write_asize + asize) > target_sz) {
6998 7990                                  full = B_TRUE;
6999 7991                                  mutex_exit(hash_lock);
7000 7992                                  break;
7001 7993                          }
7002 7994  
     7995 +                        /* make sure buf we select corresponds to feed_type */
     7996 +                        if ((feed_type == L2ARC_FEED_DDT_DEV &&
     7997 +                            arc_buf_type(hdr) != ARC_BUFC_DDT) ||
     7998 +                            (feed_type == L2ARC_FEED_NON_DDT_DEV &&
     7999 +                            arc_buf_type(hdr) == ARC_BUFC_DDT)) {
      8000 +                                mutex_exit(hash_lock);
      8001 +                                continue;
     8002 +                        }
     8003 +
7003 8004                          if (pio == NULL) {
7004 8005                                  /*
7005 8006                                   * Insert a dummy header on the buflist so
7006 8007                                   * l2arc_write_done() can find where the
7007 8008                                   * write buffers begin without searching.
7008 8009                                   */
7009 8010                                  mutex_enter(&dev->l2ad_mtx);
7010 8011                                  list_insert_head(&dev->l2ad_buflist, head);
7011 8012                                  mutex_exit(&dev->l2ad_mtx);
7012 8013  
7013      -                                cb = kmem_alloc(
     8014 +                                cb = kmem_zalloc(
7014 8015                                      sizeof (l2arc_write_callback_t), KM_SLEEP);
7015 8016                                  cb->l2wcb_dev = dev;
7016 8017                                  cb->l2wcb_head = head;
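           +                                /*
           +                                 * Temporary buffers holding serialized
           +                                 * log blocks are tracked on this list and
           +                                 * freed in l2arc_write_done().
           +                                 */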
     8018 +                                list_create(&cb->l2wcb_log_blk_buflist,
     8019 +                                    sizeof (l2arc_log_blk_buf_t),
     8020 +                                    offsetof(l2arc_log_blk_buf_t, lbb_node));
7017 8021                                  pio = zio_root(spa, l2arc_write_done, cb,
7018 8022                                      ZIO_FLAG_CANFAIL);
7019 8023                          }
7020 8024  
7021 8025                          hdr->b_l2hdr.b_dev = dev;
7022 8026                          hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
7023 8027                          arc_hdr_set_flags(hdr,
7024 8028                              ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
7025 8029  
7026 8030                          mutex_enter(&dev->l2ad_mtx);
↓ open down ↓ 14 lines elided ↑ open up ↑
7041 8045                           *
7042 8046                           * To ensure that the copy will be available for the
7043 8047                           * lifetime of the ZIO and be cleaned up afterwards, we
7044 8048                           * add it to the l2arc_free_on_write queue.
7045 8049                           */
7046 8050                          abd_t *to_write;
7047 8051                          if (!HDR_SHARED_DATA(hdr) && psize == asize) {
7048 8052                                  to_write = hdr->b_l1hdr.b_pabd;
7049 8053                          } else {
7050 8054                                  to_write = abd_alloc_for_io(asize,
7051      -                                    HDR_ISTYPE_METADATA(hdr));
     8055 +                                    !HDR_ISTYPE_DATA(hdr));
7052 8056                                  abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
7053 8057                                  if (asize != psize) {
7054 8058                                          abd_zero_off(to_write, psize,
7055 8059                                              asize - psize);
7056 8060                                  }
7057 8061                                  l2arc_free_abd_on_write(to_write, asize,
7058 8062                                      arc_buf_type(hdr));
7059 8063                          }
7060 8064                          wzio = zio_write_phys(pio, dev->l2ad_vdev,
7061 8065                              hdr->b_l2hdr.b_daddr, asize, to_write,
↓ open down ↓ 5 lines elided ↑ open up ↑
7067 8071                          DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
7068 8072                              zio_t *, wzio);
7069 8073  
7070 8074                          write_psize += psize;
7071 8075                          write_asize += asize;
7072 8076                          dev->l2ad_hand += asize;
7073 8077  
7074 8078                          mutex_exit(hash_lock);
7075 8079  
7076 8080                          (void) zio_nowait(wzio);
     8081 +
     8082 +                        /*
     8083 +                         * Append buf info to current log and commit if full.
     8084 +                         * arcstat_l2_{size,asize} kstats are updated internally.
     8085 +                         */
     8086 +                        if (l2arc_log_blk_insert(dev, hdr)) {
     8087 +                                l2arc_log_blk_commit(dev, pio, cb);
     8088 +                                dev_hdr_update = B_TRUE;
     8089 +                        }
7077 8090                  }
7078 8091  
7079 8092                  multilist_sublist_unlock(mls);
7080 8093  
7081 8094                  if (full == B_TRUE)
7082 8095                          break;
7083 8096          }
7084 8097  
7085 8098          /* No buffers selected for writing? */
7086 8099          if (pio == NULL) {
7087 8100                  ASSERT0(write_lsize);
7088 8101                  ASSERT(!HDR_HAS_L1HDR(head));
7089 8102                  kmem_cache_free(hdr_l2only_cache, head);
7090 8103                  return (0);
7091 8104          }
7092 8105  
     8106 +        /*
      8107 +         * If we wrote any log blocks as part of this write, update the
      8108 +         * device header to point to the most recent one.
     8109 +         */
     8110 +        if (dev_hdr_update)
     8111 +                l2arc_dev_hdr_update(dev, pio);
     8112 +
7093 8113          ASSERT3U(write_asize, <=, target_sz);
7094 8114          ARCSTAT_BUMP(arcstat_l2_writes_sent);
7095 8115          ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
     8116 +        if (feed_type == L2ARC_FEED_DDT_DEV)
     8117 +                ARCSTAT_INCR(arcstat_l2_ddt_write_bytes, write_psize);
7096 8118          ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
7097 8119          ARCSTAT_INCR(arcstat_l2_psize, write_psize);
7098 8120          vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
7099 8121  
7100 8122          /*
7101 8123           * Bump device hand to the device start if it is approaching the end.
7102 8124           * l2arc_evict() will already have evicted ahead for this case.
7103 8125           */
7104      -        if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
     8126 +        if (dev->l2ad_hand + target_sz + l2arc_log_blk_overhead(target_sz) >=
     8127 +            dev->l2ad_end) {
7105 8128                  dev->l2ad_hand = dev->l2ad_start;
7106 8129                  dev->l2ad_first = B_FALSE;
7107 8130          }
7108 8131  
7109 8132          dev->l2ad_writing = B_TRUE;
7110 8133          (void) zio_wait(pio);
7111 8134          dev->l2ad_writing = B_FALSE;
7112 8135  
7113 8136          return (write_asize);
7114 8137  }
7115 8138  
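           +/*
           + * Feed one L2ARC device of the given feed_type: select the next eligible
           + * device, evict ahead of its write hand, and write eligible ARC buffers
           + * to it. Returns B_TRUE and stores the number of bytes written in *wrote
           + * if a device was fed; returns B_FALSE if no suitable device was found or
           + * its pool is read-only.
           + */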
     8139 +static boolean_t
     8140 +l2arc_feed_dev(l2ad_feed_t feed_type, uint64_t *wrote)
     8141 +{
     8142 +        spa_t *spa;
     8143 +        l2arc_dev_t *dev;
     8144 +        uint64_t size;
     8145 +
     8146 +        /*
     8147 +         * This selects the next l2arc device to write to, and in
     8148 +         * doing so the next spa to feed from: dev->l2ad_spa.   This
     8149 +         * will return NULL if there are now no l2arc devices or if
     8150 +         * they are all faulted.
     8151 +         *
     8152 +         * If a device is returned, its spa's config lock is also
     8153 +         * held to prevent device removal.  l2arc_dev_get_next()
     8154 +         * will grab and release l2arc_dev_mtx.
     8155 +         */
     8156 +        if ((dev = l2arc_dev_get_next(feed_type)) == NULL)
     8157 +                return (B_FALSE);
     8158 +
     8159 +        spa = dev->l2ad_spa;
     8160 +        ASSERT(spa != NULL);
     8161 +
     8162 +        /*
      8163 +         * If the pool is read-only, skip it.
     8164 +         */
     8165 +        if (!spa_writeable(spa)) {
     8166 +                spa_config_exit(spa, SCL_L2ARC, dev);
     8167 +                return (B_FALSE);
     8168 +        }
     8169 +
     8170 +        ARCSTAT_BUMP(arcstat_l2_feeds);
     8171 +        size = l2arc_write_size();
     8172 +
     8173 +        /*
     8174 +         * Evict L2ARC buffers that will be overwritten.
     8175 +         * B_FALSE guarantees synchronous eviction.
     8176 +         */
     8177 +        (void) l2arc_evict(dev, size, B_FALSE);
     8178 +
     8179 +        /*
     8180 +         * Write ARC buffers.
     8181 +         */
     8182 +        *wrote = l2arc_write_buffers(spa, dev, size, feed_type);
     8183 +
     8184 +        spa_config_exit(spa, SCL_L2ARC, dev);
     8185 +
     8186 +        return (B_TRUE);
     8187 +}
     8188 +
7116 8189  /*
7117 8190   * This thread feeds the L2ARC at regular intervals.  This is the beating
7118 8191   * heart of the L2ARC.
7119 8192   */
7120 8193  /* ARGSUSED */
7121 8194  static void
7122 8195  l2arc_feed_thread(void *unused)
7123 8196  {
7124 8197          callb_cpr_t cpr;
7125      -        l2arc_dev_t *dev;
7126      -        spa_t *spa;
7127      -        uint64_t size, wrote;
     8198 +        uint64_t size, total_written = 0;
7128 8199          clock_t begin, next = ddi_get_lbolt();
     8200 +        l2ad_feed_t feed_type = L2ARC_FEED_ALL;
7129 8201  
7130 8202          CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
7131 8203  
7132 8204          mutex_enter(&l2arc_feed_thr_lock);
7133 8205  
7134 8206          while (l2arc_thread_exit == 0) {
7135 8207                  CALLB_CPR_SAFE_BEGIN(&cpr);
7136 8208                  (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
7137 8209                      next);
7138 8210                  CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
↓ open down ↓ 4 lines elided ↑ open up ↑
7143 8215                   */
7144 8216                  mutex_enter(&l2arc_dev_mtx);
7145 8217                  if (l2arc_ndev == 0) {
7146 8218                          mutex_exit(&l2arc_dev_mtx);
7147 8219                          continue;
7148 8220                  }
7149 8221                  mutex_exit(&l2arc_dev_mtx);
7150 8222                  begin = ddi_get_lbolt();
7151 8223  
7152 8224                  /*
7153      -                 * This selects the next l2arc device to write to, and in
7154      -                 * doing so the next spa to feed from: dev->l2ad_spa.   This
7155      -                 * will return NULL if there are now no l2arc devices or if
7156      -                 * they are all faulted.
7157      -                 *
7158      -                 * If a device is returned, its spa's config lock is also
7159      -                 * held to prevent device removal.  l2arc_dev_get_next()
7160      -                 * will grab and release l2arc_dev_mtx.
7161      -                 */
7162      -                if ((dev = l2arc_dev_get_next()) == NULL)
7163      -                        continue;
7164      -
7165      -                spa = dev->l2ad_spa;
7166      -                ASSERT3P(spa, !=, NULL);
7167      -
7168      -                /*
7169      -                 * If the pool is read-only then force the feed thread to
7170      -                 * sleep a little longer.
7171      -                 */
7172      -                if (!spa_writeable(spa)) {
7173      -                        next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
7174      -                        spa_config_exit(spa, SCL_L2ARC, dev);
7175      -                        continue;
7176      -                }
7177      -
7178      -                /*
7179 8225                   * Avoid contributing to memory pressure.
7180 8226                   */
7181 8227                  if (arc_reclaim_needed()) {
7182 8228                          ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
7183      -                        spa_config_exit(spa, SCL_L2ARC, dev);
7184 8229                          continue;
7185 8230                  }
7186 8231  
7187      -                ARCSTAT_BUMP(arcstat_l2_feeds);
     8232 +                /* try to write to DDT L2ARC device if any */
     8233 +                if (l2arc_feed_dev(L2ARC_FEED_DDT_DEV, &size)) {
     8234 +                        total_written += size;
     8235 +                        feed_type = L2ARC_FEED_NON_DDT_DEV;
     8236 +                }
7188 8237  
7189      -                size = l2arc_write_size();
     8238 +                /* try to write to the regular L2ARC device if any */
     8239 +                if (l2arc_feed_dev(feed_type, &size)) {
     8240 +                        total_written += size;
     8241 +                        if (feed_type == L2ARC_FEED_NON_DDT_DEV)
     8242 +                                total_written /= 2; /* avg written per device */
     8243 +                }
7190 8244  
7191 8245                  /*
7192      -                 * Evict L2ARC buffers that will be overwritten.
7193      -                 */
7194      -                l2arc_evict(dev, size, B_FALSE);
7195      -
7196      -                /*
7197      -                 * Write ARC buffers.
7198      -                 */
7199      -                wrote = l2arc_write_buffers(spa, dev, size);
7200      -
7201      -                /*
7202 8246                   * Calculate interval between writes.
7203 8247                   */
7204      -                next = l2arc_write_interval(begin, size, wrote);
7205      -                spa_config_exit(spa, SCL_L2ARC, dev);
     8248 +                next = l2arc_write_interval(begin, l2arc_write_size(),
     8249 +                    total_written);
     8250 +
     8251 +                total_written = 0;
7206 8252          }
7207 8253  
7208 8254          l2arc_thread_exit = 0;
7209 8255          cv_broadcast(&l2arc_feed_thr_cv);
7210 8256          CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
7211 8257          thread_exit();
7212 8258  }
7213 8259  
7214 8260  boolean_t
7215 8261  l2arc_vdev_present(vdev_t *vd)
7216 8262  {
7217      -        l2arc_dev_t *dev;
     8263 +        return (l2arc_vdev_get(vd) != NULL);
     8264 +}
7218 8265  
7219      -        mutex_enter(&l2arc_dev_mtx);
     8266 +/*
     8267 + * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
     8268 + * the vdev_t isn't an L2ARC device.
     8269 + */
     8270 +static l2arc_dev_t *
     8271 +l2arc_vdev_get(vdev_t *vd)
     8272 +{
     8273 +        l2arc_dev_t     *dev;
     8274 +        boolean_t       held = MUTEX_HELD(&l2arc_dev_mtx);
     8275 +
     8276 +        if (!held)
     8277 +                mutex_enter(&l2arc_dev_mtx);
7220 8278          for (dev = list_head(l2arc_dev_list); dev != NULL;
7221 8279              dev = list_next(l2arc_dev_list, dev)) {
7222 8280                  if (dev->l2ad_vdev == vd)
7223 8281                          break;
7224 8282          }
7225      -        mutex_exit(&l2arc_dev_mtx);
     8283 +        if (!held)
     8284 +                mutex_exit(&l2arc_dev_mtx);
7226 8285  
7227      -        return (dev != NULL);
     8286 +        return (dev);
7228 8287  }
7229 8288  
7230 8289  /*
7231 8290   * Add a vdev for use by the L2ARC.  By this point the spa has already
7232      - * validated the vdev and opened it.
     8291 + * validated the vdev and opened it. The `rebuild' flag indicates whether
     8292 + * we should attempt an L2ARC persistency rebuild.
7233 8293   */
7234 8294  void
7235      -l2arc_add_vdev(spa_t *spa, vdev_t *vd)
     8295 +l2arc_add_vdev(spa_t *spa, vdev_t *vd, boolean_t rebuild)
7236 8296  {
7237 8297          l2arc_dev_t *adddev;
7238 8298  
7239 8299          ASSERT(!l2arc_vdev_present(vd));
7240 8300  
7241 8301          /*
7242 8302           * Create a new l2arc device entry.
7243 8303           */
7244 8304          adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
7245 8305          adddev->l2ad_spa = spa;
7246 8306          adddev->l2ad_vdev = vd;
7247      -        adddev->l2ad_start = VDEV_LABEL_START_SIZE;
     8307 +        /* leave extra size for an l2arc device header */
     8308 +        adddev->l2ad_dev_hdr_asize = MAX(sizeof (*adddev->l2ad_dev_hdr),
     8309 +            1 << vd->vdev_ashift);
     8310 +        adddev->l2ad_start = VDEV_LABEL_START_SIZE + adddev->l2ad_dev_hdr_asize;
7248 8311          adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
     8312 +        ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
7249 8313          adddev->l2ad_hand = adddev->l2ad_start;
7250 8314          adddev->l2ad_first = B_TRUE;
7251 8315          adddev->l2ad_writing = B_FALSE;
     8316 +        adddev->l2ad_dev_hdr = kmem_zalloc(adddev->l2ad_dev_hdr_asize,
     8317 +            KM_SLEEP);
7252 8318  
7253 8319          mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
7254 8320          /*
7255 8321           * This is a list of all ARC buffers that are still valid on the
7256 8322           * device.
7257 8323           */
7258 8324          list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
7259 8325              offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
7260 8326  
7261 8327          vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
7262 8328          refcount_create(&adddev->l2ad_alloc);
7263 8329  
7264 8330          /*
7265 8331           * Add device to global list
7266 8332           */
7267 8333          mutex_enter(&l2arc_dev_mtx);
7268 8334          list_insert_head(l2arc_dev_list, adddev);
7269 8335          atomic_inc_64(&l2arc_ndev);
     8336 +        if (rebuild && l2arc_rebuild_enabled &&
     8337 +            adddev->l2ad_end - adddev->l2ad_start > L2ARC_PERSIST_MIN_SIZE) {
     8338 +                /*
     8339 +                 * Just mark the device as pending for a rebuild. We won't
     8340 +                 * be starting a rebuild in line here as it would block pool
     8341 +                 * import. Instead spa_load_impl will hand that off to an
     8342 +                 * async task which will call l2arc_spa_rebuild_start.
     8343 +                 */
     8344 +                adddev->l2ad_rebuild = B_TRUE;
     8345 +        }
7270 8346          mutex_exit(&l2arc_dev_mtx);
7271 8347  }
7272 8348  
7273 8349  /*
7274 8350   * Remove a vdev from the L2ARC.
7275 8351   */
7276 8352  void
7277 8353  l2arc_remove_vdev(vdev_t *vd)
7278 8354  {
7279 8355          l2arc_dev_t *dev, *nextdev, *remdev = NULL;
↓ open down ↓ 5 lines elided ↑ open up ↑
7285 8361          for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
7286 8362                  nextdev = list_next(l2arc_dev_list, dev);
7287 8363                  if (vd == dev->l2ad_vdev) {
7288 8364                          remdev = dev;
7289 8365                          break;
7290 8366                  }
7291 8367          }
7292 8368          ASSERT3P(remdev, !=, NULL);
7293 8369  
7294 8370          /*
     8371 +         * Cancel any ongoing or scheduled rebuild (race protection with
     8372 +         * l2arc_spa_rebuild_start provided via l2arc_dev_mtx).
     8373 +         */
     8374 +        remdev->l2ad_rebuild_cancel = B_TRUE;
     8375 +        if (remdev->l2ad_rebuild_did != 0) {
     8376 +                /*
      8377 +                 * N.B. it should be safe to thread_join with the rebuild
      8378 +                 * thread while holding l2arc_dev_mtx, because that mutex is
      8379 +                 * not acquired anywhere in the l2arc rebuild code below
      8380 +                 * (except for l2arc_spa_rebuild_start, which is ok).
     8381 +                 */
     8382 +                thread_join(remdev->l2ad_rebuild_did);
     8383 +        }
     8384 +
     8385 +        /*
7295 8386           * Remove device from global list
7296 8387           */
7297 8388          list_remove(l2arc_dev_list, remdev);
7298 8389          l2arc_dev_last = NULL;          /* may have been invalidated */
     8390 +        l2arc_ddt_dev_last = NULL;      /* may have been invalidated */
7299 8391          atomic_dec_64(&l2arc_ndev);
7300 8392          mutex_exit(&l2arc_dev_mtx);
7301 8393  
     8394 +        if (vdev_type_is_ddt(remdev->l2ad_vdev))
     8395 +                atomic_add_64(&remdev->l2ad_spa->spa_l2arc_ddt_devs_size,
     8396 +                    -(vdev_get_min_asize(remdev->l2ad_vdev)));
     8397 +
7302 8398          /*
7303 8399           * Clear all buflists and ARC references.  L2ARC device flush.
7304 8400           */
7305      -        l2arc_evict(remdev, 0, B_TRUE);
7306      -        list_destroy(&remdev->l2ad_buflist);
7307      -        mutex_destroy(&remdev->l2ad_mtx);
7308      -        refcount_destroy(&remdev->l2ad_alloc);
7309      -        kmem_free(remdev, sizeof (l2arc_dev_t));
     8401 +        if (l2arc_evict(remdev, 0, B_TRUE) == B_FALSE) {
     8402 +                /*
      8403 +                 * The eviction was done synchronously; clean up here.
      8404 +                 * Otherwise, the asynchronous task will do the cleanup.
      8405 +                 */
      8406 +                list_destroy(&remdev->l2ad_buflist);
      8407 +                mutex_destroy(&remdev->l2ad_mtx);
           +                refcount_destroy(&remdev->l2ad_alloc);
      8408 +                kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
     8409 +                kmem_free(remdev, sizeof (l2arc_dev_t));
     8410 +        }
7310 8411  }
7311 8412  
7312 8413  void
7313 8414  l2arc_init(void)
7314 8415  {
7315 8416          l2arc_thread_exit = 0;
7316 8417          l2arc_ndev = 0;
7317 8418          l2arc_writes_sent = 0;
7318 8419          l2arc_writes_done = 0;
7319 8420  
↓ open down ↓ 45 lines elided ↑ open up ↑
7365 8466  {
7366 8467          if (!(spa_mode_global & FWRITE))
7367 8468                  return;
7368 8469  
7369 8470          mutex_enter(&l2arc_feed_thr_lock);
7370 8471          cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
7371 8472          l2arc_thread_exit = 1;
7372 8473          while (l2arc_thread_exit != 0)
7373 8474                  cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
7374 8475          mutex_exit(&l2arc_feed_thr_lock);
     8476 +}
     8477 +
     8478 +/*
     8479 + * Punches out rebuild threads for the L2ARC devices in a spa. This should
     8480 + * be called after pool import from the spa async thread, since starting
     8481 + * these threads directly from spa_import() will make them part of the
     8482 + * "zpool import" context and delay process exit (and thus pool import).
     8483 + */
     8484 +void
     8485 +l2arc_spa_rebuild_start(spa_t *spa)
     8486 +{
     8487 +        /*
     8488 +         * Locate the spa's l2arc devices and kick off rebuild threads.
     8489 +         */
     8490 +        mutex_enter(&l2arc_dev_mtx);
     8491 +        for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
     8492 +                l2arc_dev_t *dev =
     8493 +                    l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
     8494 +                if (dev == NULL) {
     8495 +                        /* Don't attempt a rebuild if the vdev is UNAVAIL */
     8496 +                        continue;
     8497 +                }
     8498 +                if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
     8499 +                        VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
     8500 +#ifdef  _KERNEL
     8501 +                        dev->l2ad_rebuild_did = thread_create(NULL, 0,
     8502 +                            l2arc_dev_rebuild_start, dev, 0, &p0, TS_RUN,
     8503 +                            minclsyspri)->t_did;
     8504 +#endif
     8505 +                }
     8506 +        }
     8507 +        mutex_exit(&l2arc_dev_mtx);
     8508 +}
     8509 +
     8510 +/*
     8511 + * Main entry point for L2ARC rebuilding.
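           + * Runs in a separate kernel thread created by l2arc_spa_rebuild_start().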
     8512 + */
     8513 +static void
     8514 +l2arc_dev_rebuild_start(l2arc_dev_t *dev)
     8515 +{
     8516 +        if (!dev->l2ad_rebuild_cancel) {
     8517 +                VERIFY(dev->l2ad_rebuild);
     8518 +                (void) l2arc_rebuild(dev);
     8519 +                dev->l2ad_rebuild = B_FALSE;
     8520 +        }
     8521 +}
     8522 +
     8523 +/*
     8524 + * This function implements the actual L2ARC metadata rebuild. It:
     8525 + *
     8526 + * 1) reads the device's header
     8527 + * 2) if a good device header is found, starts reading the log block chain
     8528 + * 3) restores each block's contents to memory (reconstructing arc_buf_hdr_t's)
     8529 + *
     8530 + * Operation stops under any of the following conditions:
     8531 + *
     8532 + * 1) We reach the end of the log blk chain (the back-reference in the blk is
     8533 + *    invalid or loops over our starting point).
     8534 + * 2) We encounter *any* error condition (cksum errors, io errors, looped
     8535 + *    blocks, etc.).
     8536 + */
     8537 +static int
     8538 +l2arc_rebuild(l2arc_dev_t *dev)
     8539 +{
     8540 +        vdev_t                  *vd = dev->l2ad_vdev;
     8541 +        spa_t                   *spa = vd->vdev_spa;
     8542 +        int                     err;
     8543 +        l2arc_log_blk_phys_t    *this_lb, *next_lb;
     8544 +        uint8_t                 *this_lb_buf, *next_lb_buf;
     8545 +        zio_t                   *this_io = NULL, *next_io = NULL;
     8546 +        l2arc_log_blkptr_t      lb_ptrs[2];
     8547 +        boolean_t               first_pass, lock_held;
     8548 +        uint64_t                load_guid;
     8549 +
     8550 +        this_lb = kmem_zalloc(sizeof (*this_lb), KM_SLEEP);
     8551 +        next_lb = kmem_zalloc(sizeof (*next_lb), KM_SLEEP);
     8552 +        this_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
     8553 +        next_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
     8554 +
     8555 +        /*
     8556 +         * We prevent device removal while issuing reads to the device,
     8557 +         * then during the rebuilding phases we drop this lock again so
      8558 +         * that a spa_unload or device remove can be initiated. This is
      8559 +         * safe because the spa will signal us to stop, and wait for us to
      8560 +         * stop, before removing our device.
     8561 +         */
     8562 +        spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
     8563 +        lock_held = B_TRUE;
     8564 +
     8565 +        load_guid = spa_load_guid(dev->l2ad_vdev->vdev_spa);
     8566 +        /*
     8567 +         * Device header processing phase.
     8568 +         */
     8569 +        if ((err = l2arc_dev_hdr_read(dev)) != 0) {
     8570 +                /* device header corrupted, start a new one */
     8571 +                bzero(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
     8572 +                goto out;
     8573 +        }
     8574 +
     8575 +        /* Retrieve the persistent L2ARC device state */
     8576 +        dev->l2ad_hand = vdev_psize_to_asize(dev->l2ad_vdev,
     8577 +            dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr +
     8578 +            LBP_GET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0]));
     8579 +        dev->l2ad_first = !!(dev->l2ad_dev_hdr->dh_flags &
     8580 +            L2ARC_DEV_HDR_EVICT_FIRST);
     8581 +
     8582 +        /* Prepare the rebuild processing state */
     8583 +        bcopy(dev->l2ad_dev_hdr->dh_start_lbps, lb_ptrs, sizeof (lb_ptrs));
     8584 +        first_pass = B_TRUE;
     8585 +
     8586 +        /* Start the rebuild process */
     8587 +        for (;;) {
     8588 +                if (!l2arc_log_blkptr_valid(dev, &lb_ptrs[0]))
     8589 +                        /* We hit an invalid block address, end the rebuild. */
     8590 +                        break;
     8591 +
     8592 +                if ((err = l2arc_log_blk_read(dev, &lb_ptrs[0], &lb_ptrs[1],
     8593 +                    this_lb, next_lb, this_lb_buf, next_lb_buf,
     8594 +                    this_io, &next_io)) != 0)
     8595 +                        break;
     8596 +
     8597 +                spa_config_exit(spa, SCL_L2ARC, vd);
     8598 +                lock_held = B_FALSE;
     8599 +
     8600 +                /* Protection against infinite loops of log blocks. */
     8601 +                if (l2arc_range_check_overlap(lb_ptrs[1].lbp_daddr,
     8602 +                    lb_ptrs[0].lbp_daddr,
     8603 +                    dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
     8604 +                    !first_pass) {
     8605 +                        ARCSTAT_BUMP(arcstat_l2_rebuild_abort_loop_errors);
     8606 +                        err = SET_ERROR(ELOOP);
     8607 +                        break;
     8608 +                }
     8609 +
     8610 +                /*
     8611 +                 * Our memory pressure valve. If the system is running low
     8612 +                 * on memory, rather than swamping memory with new ARC buf
     8613 +                 * hdrs, we opt not to rebuild the L2ARC. At this point,
     8614 +                 * however, we have already set up our L2ARC dev to chain in
     8615 +                 * new metadata log blk, so the user may choose to re-add the
     8616 +                 * L2ARC dev at a later time to reconstruct it (when there's
     8617 +                 * less memory pressure).
     8618 +                 */
     8619 +                if (arc_reclaim_needed()) {
     8620 +                        ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
     8621 +                        cmn_err(CE_NOTE, "System running low on memory, "
     8622 +                            "aborting L2ARC rebuild.");
     8623 +                        err = SET_ERROR(ENOMEM);
     8624 +                        break;
     8625 +                }
     8626 +
     8627 +                /*
     8628 +                 * Now that we know that the next_lb checks out alright, we
     8629 +                 * can start reconstruction from this lb - we can be sure
     8630 +                 * that the L2ARC write hand has not yet reached any of our
     8631 +                 * buffers.
     8632 +                 */
     8633 +                l2arc_log_blk_restore(dev, load_guid, this_lb,
     8634 +                    LBP_GET_PSIZE(&lb_ptrs[0]));
     8635 +
     8636 +                /*
     8637 +                 * End of list detection. We can look ahead two steps in the
     8638 +                 * blk chain and if the 2nd blk from this_lb dips below the
     8639 +                 * initial chain starting point, then we know two things:
     8640 +                 *      1) it can't be valid, and
     8641 +                 *      2) the next_lb's ARC entries might have already been
     8642 +                 *      partially overwritten and so we should stop before
     8643 +                 *      we restore it
     8644 +                 */
     8645 +                if (l2arc_range_check_overlap(
     8646 +                    this_lb->lb_back2_lbp.lbp_daddr, lb_ptrs[0].lbp_daddr,
     8647 +                    dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
     8648 +                    !first_pass)
     8649 +                        break;
     8650 +
     8651 +                /* log blk restored, continue with next one in the list */
     8652 +                lb_ptrs[0] = lb_ptrs[1];
     8653 +                lb_ptrs[1] = this_lb->lb_back2_lbp;
     8654 +                PTR_SWAP(this_lb, next_lb);
     8655 +                PTR_SWAP(this_lb_buf, next_lb_buf);
     8656 +                this_io = next_io;
     8657 +                next_io = NULL;
     8658 +                first_pass = B_FALSE;
     8659 +
     8660 +                for (;;) {
     8661 +                        if (dev->l2ad_rebuild_cancel) {
     8662 +                                err = SET_ERROR(ECANCELED);
     8663 +                                goto out;
     8664 +                        }
     8665 +                        if (spa_config_tryenter(spa, SCL_L2ARC, vd,
     8666 +                            RW_READER)) {
     8667 +                                lock_held = B_TRUE;
     8668 +                                break;
     8669 +                        }
     8670 +                        /*
      8671 +                         * The L2ARC config lock is held by somebody as
      8672 +                         * writer, possibly because they are trying to
      8673 +                         * remove us. They likely want us to shut down, so
      8674 +                         * after a little delay, we check
      8675 +                         * l2ad_rebuild_cancel and retry the lock.
     8676 +                         */
     8677 +                        delay(1);
     8678 +                }
     8679 +        }
     8680 +out:
     8681 +        if (next_io != NULL)
     8682 +                l2arc_log_blk_prefetch_abort(next_io);
     8683 +        kmem_free(this_lb, sizeof (*this_lb));
     8684 +        kmem_free(next_lb, sizeof (*next_lb));
     8685 +        kmem_free(this_lb_buf, sizeof (l2arc_log_blk_phys_t));
     8686 +        kmem_free(next_lb_buf, sizeof (l2arc_log_blk_phys_t));
     8687 +        if (err == 0)
     8688 +                ARCSTAT_BUMP(arcstat_l2_rebuild_successes);
     8689 +
     8690 +        if (lock_held)
     8691 +                spa_config_exit(spa, SCL_L2ARC, vd);
     8692 +
     8693 +        return (err);
     8694 +}
     8695 +
     8696 +/*
     8697 + * Attempts to read the device header on the provided L2ARC device and writes
      8698 + * it to dev->l2ad_dev_hdr. On success, this function returns 0; otherwise an
      8699 + * appropriate error code is returned.
     8700 + */
     8701 +static int
     8702 +l2arc_dev_hdr_read(l2arc_dev_t *dev)
     8703 +{
     8704 +        int                     err;
     8705 +        uint64_t                guid;
     8706 +        zio_cksum_t             cksum;
     8707 +        l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
     8708 +        const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
      8709 +        abd_t                   *abd;
     8710 +
     8711 +        guid = spa_guid(dev->l2ad_vdev->vdev_spa);
     8712 +
     8713 +        abd = abd_get_from_buf(hdr, hdr_asize);
     8714 +        err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
     8715 +            VDEV_LABEL_START_SIZE, hdr_asize, abd,
     8716 +            ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
     8717 +            ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
     8718 +            ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
     8719 +        abd_put(abd);
     8720 +        if (err != 0) {
     8721 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
     8722 +                return (err);
     8723 +        }
     8724 +
     8725 +        if (hdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC_V1))
     8726 +                byteswap_uint64_array(hdr, sizeof (*hdr));
     8727 +
     8728 +        if (hdr->dh_magic != L2ARC_DEV_HDR_MAGIC_V1 ||
     8729 +            hdr->dh_spa_guid != guid) {
     8730 +                /*
     8731 +                 * Attempt to rebuild a device containing no actual dev hdr
     8732 +                 * or containing a header from some other pool.
     8733 +                 */
     8734 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
     8735 +                return (SET_ERROR(ENOTSUP));
     8736 +        }
     8737 +
     8738 +        l2arc_dev_hdr_checksum(hdr, &cksum);
     8739 +        if (!ZIO_CHECKSUM_EQUAL(hdr->dh_self_cksum, cksum)) {
     8740 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
     8741 +                return (SET_ERROR(EINVAL));
     8742 +        }
     8743 +
     8744 +        return (0);
     8745 +}
     8746 +
     8747 +/*
     8748 + * Reads L2ARC log blocks from storage and validates their contents.
     8749 + *
     8750 + * This function implements a simple prefetcher to make sure that while
     8751 + * we're processing one buffer the L2ARC is already prefetching the next
     8752 + * one in the chain.
     8753 + *
     8754 + * The arguments this_lp and next_lp point to the current and next log blk
     8755 + * address in the block chain. Similarly, this_lb and next_lb hold the
     8756 + * l2arc_log_blk_phys_t's of the current and next L2ARC blk. The this_lb_buf
     8757 + * and next_lb_buf must be buffers of appropriate to hold a raw
      8758 + * and next_lb_buf must be buffers of appropriate size to hold a raw
     8759 + * to buffer decompression).
     8760 + *
     8761 + * The `this_io' and `next_io' arguments are used for block prefetching.
     8762 + * When issuing the first blk IO during rebuild, you should pass NULL for
     8763 + * `this_io'. This function will then issue a sync IO to read the block and
     8764 + * also issue an async IO to fetch the next block in the block chain. The
     8765 + * prefetch IO is returned in `next_io'. On subsequent calls to this
     8766 + * function, pass the value returned in `next_io' from the previous call
     8767 + * as `this_io' and a fresh `next_io' pointer to hold the next prefetch IO.
     8768 + * Prior to the call, you should initialize your `next_io' pointer to be
     8769 + * NULL. If no prefetch IO was issued, the pointer is left set at NULL.
     8770 + *
     8771 + * On success, this function returns 0, otherwise it returns an appropriate
     8772 + * error code. On error the prefetching IO is aborted and cleared before
     8773 + * returning from this function. Therefore, if we return `success', the
     8774 + * caller can assume that we have taken care of cleanup of prefetch IOs.
     8775 + */
     8776 +static int
     8777 +l2arc_log_blk_read(l2arc_dev_t *dev,
     8778 +    const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
     8779 +    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
     8780 +    uint8_t *this_lb_buf, uint8_t *next_lb_buf,
     8781 +    zio_t *this_io, zio_t **next_io)
     8782 +{
     8783 +        int             err = 0;
     8784 +        zio_cksum_t     cksum;
     8785 +
     8786 +        ASSERT(this_lbp != NULL && next_lbp != NULL);
     8787 +        ASSERT(this_lb != NULL && next_lb != NULL);
     8788 +        ASSERT(this_lb_buf != NULL && next_lb_buf != NULL);
     8789 +        ASSERT(next_io != NULL && *next_io == NULL);
     8790 +        ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
     8791 +
     8792 +        /*
     8793 +         * Check to see if we have issued the IO for this log blk in a
     8794 +         * previous run. If not, this is the first call, so issue it now.
     8795 +         */
     8796 +        if (this_io == NULL) {
     8797 +                this_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, this_lbp,
     8798 +                    this_lb_buf);
     8799 +        }
     8800 +
     8801 +        /*
     8802 +         * Peek to see if we can start issuing the next IO immediately.
     8803 +         */
     8804 +        if (l2arc_log_blkptr_valid(dev, next_lbp)) {
     8805 +                /*
     8806 +                 * Start issuing IO for the next log blk early - this
     8807 +                 * should help keep the L2ARC device busy while we
     8808 +                 * decompress and restore this log blk.
     8809 +                 */
     8810 +                *next_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, next_lbp,
     8811 +                    next_lb_buf);
     8812 +        }
     8813 +
     8814 +        /* Wait for the IO to read this log block to complete */
     8815 +        if ((err = zio_wait(this_io)) != 0) {
     8816 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
     8817 +                goto cleanup;
     8818 +        }
     8819 +
     8820 +        /* Make sure the buffer checks out */
     8821 +        fletcher_4_native(this_lb_buf, LBP_GET_PSIZE(this_lbp), NULL, &cksum);
     8822 +        if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
     8823 +                ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
     8824 +                err = SET_ERROR(EINVAL);
     8825 +                goto cleanup;
     8826 +        }
     8827 +
     8828 +        /* Now we can take our time decoding this buffer */
     8829 +        switch (LBP_GET_COMPRESS(this_lbp)) {
     8830 +        case ZIO_COMPRESS_OFF:
     8831 +                bcopy(this_lb_buf, this_lb, sizeof (*this_lb));
     8832 +                break;
     8833 +        case ZIO_COMPRESS_LZ4:
     8834 +                err = zio_decompress_data_buf(LBP_GET_COMPRESS(this_lbp),
     8835 +                    this_lb_buf, this_lb, LBP_GET_PSIZE(this_lbp),
     8836 +                    sizeof (*this_lb));
     8837 +                if (err != 0) {
     8838 +                        err = SET_ERROR(EINVAL);
     8839 +                        goto cleanup;
     8840 +                }
     8841 +
     8842 +                break;
     8843 +        default:
     8844 +                err = SET_ERROR(EINVAL);
     8845 +                break;
     8846 +        }
     8847 +
     8848 +        if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
     8849 +                byteswap_uint64_array(this_lb, sizeof (*this_lb));
     8850 +
     8851 +        if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
     8852 +                err = SET_ERROR(EINVAL);
     8853 +                goto cleanup;
     8854 +        }
     8855 +
     8856 +cleanup:
     8857 +        /* Abort an in-flight prefetch I/O in case of error */
     8858 +        if (err != 0 && *next_io != NULL) {
     8859 +                l2arc_log_blk_prefetch_abort(*next_io);
     8860 +                *next_io = NULL;
     8861 +        }
     8862 +        return (err);
     8863 +}
     8864 +
     8865 +/*
     8866 + * Restores the payload of a log blk to ARC. This creates empty ARC hdr
     8867 + * entries which only contain an l2arc hdr, essentially restoring the
     8868 + * buffers to their L2ARC evicted state. This function also updates space
     8869 + * usage on the L2ARC vdev to make sure it tracks restored buffers.
     8870 + */
     8871 +static void
     8872 +l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
     8873 +    const l2arc_log_blk_phys_t *lb, uint64_t lb_psize)
     8874 +{
     8875 +        uint64_t        size = 0, psize = 0;
     8876 +
     8877 +        for (int i = L2ARC_LOG_BLK_ENTRIES - 1; i >= 0; i--) {
     8878 +                /*
     8879 +                 * Restore goes in the reverse temporal direction to preserve
     8880 +                 * correct temporal ordering of buffers in the l2ad_buflist.
     8881 +                 * l2arc_hdr_restore also does a list_insert_tail instead of
     8882 +                 * list_insert_head on the l2ad_buflist:
     8883 +                 *
     8884 +                 *              LIST    l2ad_buflist            LIST
     8885 +                 *              HEAD  <------ (time) ------     TAIL
     8886 +                 * direction    +-----+-----+-----+-----+-----+    direction
     8887 +                 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
     8888 +                 * fill         +-----+-----+-----+-----+-----+
     8889 +                 *              ^                               ^
     8890 +                 *              |                               |
     8891 +                 *              |                               |
     8892 +                 *      l2arc_fill_thread               l2arc_rebuild
     8893 +                 *      places new bufs here            restores bufs here
     8894 +                 *
     8895 +                 * This also works when the restored bufs get evicted at any
     8896 +                 * point during the rebuild.
     8897 +                 */
     8898 +                l2arc_hdr_restore(&lb->lb_entries[i], dev, load_guid);
     8899 +                size += LE_GET_LSIZE(&lb->lb_entries[i]);
     8900 +                psize += LE_GET_PSIZE(&lb->lb_entries[i]);
     8901 +        }
     8902 +
     8903 +        /*
     8904 +         * Record rebuild stats:
     8905 +         *      size            In-memory size of restored buffer data in ARC
     8906 +         *      psize           Physical size of restored buffers in the L2ARC
     8907 +         *      bufs            # of ARC buffer headers restored
     8908 +         *      log_blks        # of L2ARC log entries processed during restore
     8909 +         */
     8910 +        ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
     8911 +        ARCSTAT_INCR(arcstat_l2_rebuild_psize, psize);
     8912 +        ARCSTAT_INCR(arcstat_l2_rebuild_bufs, L2ARC_LOG_BLK_ENTRIES);
     8913 +        ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
     8914 +        ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, lb_psize);
     8915 +        ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, psize / lb_psize);
     8916 +        vdev_space_update(dev->l2ad_vdev, psize, 0, 0);
     8917 +}
     8918 +
     8919 +/*
     8920 + * Restores a single ARC buf hdr from a log block. The ARC buffer is put
     8921 + * into a state indicating that it has been evicted to L2ARC.
     8922 + */
     8923 +static void
     8924 +l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev,
     8925 +    uint64_t load_guid)
     8926 +{
     8927 +        arc_buf_hdr_t           *hdr, *exists;
     8928 +        kmutex_t                *hash_lock;
     8929 +        arc_buf_contents_t      type = LE_GET_TYPE(le);
     8930 +
     8931 +        /*
      8932 +         * Do all the allocation before grabbing any locks; this lets us
      8933 +         * sleep if memory is full, and we don't have to deal with failed
     8934 +         * allocations.
     8935 +         */
     8936 +        hdr = arc_buf_alloc_l2only(load_guid, type, dev, le->le_dva,
     8937 +            le->le_daddr, LE_GET_LSIZE(le), LE_GET_PSIZE(le),
     8938 +            le->le_birth, le->le_freeze_cksum, LE_GET_CHECKSUM(le),
     8939 +            LE_GET_COMPRESS(le), LE_GET_ARC_COMPRESS(le));
     8940 +
     8941 +        ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr));
     8942 +        ARCSTAT_INCR(arcstat_l2_psize, arc_hdr_size(hdr));
     8943 +
     8944 +        mutex_enter(&dev->l2ad_mtx);
     8945 +        /*
     8946 +         * We connect the l2hdr to the hdr only after the hdr is in the hash
     8947 +         * table, otherwise the rest of the arc hdr manipulation machinery
     8948 +         * might get confused.
     8949 +         */
     8950 +        list_insert_tail(&dev->l2ad_buflist, hdr);
     8951 +        (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
     8952 +        mutex_exit(&dev->l2ad_mtx);
     8953 +
     8954 +        exists = buf_hash_insert(hdr, &hash_lock);
     8955 +        if (exists) {
     8956 +                /* Buffer was already cached, no need to restore it. */
     8957 +                arc_hdr_destroy(hdr);
     8958 +                mutex_exit(hash_lock);
     8959 +                ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
     8960 +                return;
     8961 +        }
     8962 +
     8963 +        mutex_exit(hash_lock);
     8964 +}
     8965 +
     8966 +/*
      8967 + * Completion callback for PL2ARC functions that do asynchronous physical
      8968 + * reads and writes; it releases the abd wrapper passed in as io_private.
     8969 + */
     8970 +static void
     8971 +pl2arc_io_done(zio_t *zio)
     8972 +{
     8973 +        abd_put(zio->io_private);
     8974 +        zio->io_private = NULL;
     8975 +}
     8976 +
     8977 +/*
     8978 + * Starts an asynchronous read IO to read a log block. This is used in log
     8979 + * block reconstruction to start reading the next block before we are done
     8980 + * decoding and reconstructing the current block, to keep the l2arc device
     8981 + * nice and hot with read IO to process.
      8982 + * The data is read into the caller-supplied lb_buf; the read zio's done
      8983 + * callback (pl2arc_io_done) releases the abd wrapper once the IO completes.
      8984 + * If you wish to abort this zio, you should do so using
      8985 + * l2arc_log_blk_prefetch_abort, which takes care of waiting for the IO and
      8986 + * disposing of the abd correctly.
     8987 + */
     8988 +static zio_t *
     8989 +l2arc_log_blk_prefetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
     8990 +    uint8_t *lb_buf)
     8991 +{
     8992 +        uint32_t        psize;
     8993 +        zio_t           *pio;
     8994 +        abd_t           *abd;
     8995 +
     8996 +        psize = LBP_GET_PSIZE(lbp);
     8997 +        ASSERT(psize <= sizeof (l2arc_log_blk_phys_t));
     8998 +        pio = zio_root(vd->vdev_spa, NULL, NULL, ZIO_FLAG_DONT_CACHE |
     8999 +            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
     9000 +            ZIO_FLAG_DONT_RETRY);
     9001 +        abd = abd_get_from_buf(lb_buf, psize);
     9002 +        (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, psize,
     9003 +            abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
      9004 +            ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
     9005 +            ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
     9006 +
     9007 +        return (pio);
     9008 +}
     9009 +
     9010 +/*
     9011 + * Aborts a zio returned from l2arc_log_blk_prefetch and frees the data
     9012 + * buffers allocated for it.
     9013 + */
     9014 +static void
     9015 +l2arc_log_blk_prefetch_abort(zio_t *zio)
     9016 +{
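           +        /*
           +         * There is no mechanism to cancel an in-flight physical read here,
           +         * so simply wait for it to complete; the read's done callback
           +         * (pl2arc_io_done) releases the abd wrapping the caller's buffer.
           +         */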
     9017 +        (void) zio_wait(zio);
     9018 +}
     9019 +
     9020 +/*
     9021 + * Creates a zio to update the device header on an l2arc device. The zio is
     9022 + * initiated as a child of `pio'.
     9023 + */
     9024 +static void
     9025 +l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio)
     9026 +{
     9027 +        zio_t                   *wzio;
     9028 +        abd_t                   *abd;
     9029 +        l2arc_dev_hdr_phys_t    *hdr = dev->l2ad_dev_hdr;
     9030 +        const uint64_t          hdr_asize = dev->l2ad_dev_hdr_asize;
     9031 +
     9032 +        hdr->dh_magic = L2ARC_DEV_HDR_MAGIC_V1;
     9033 +        hdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
     9034 +        hdr->dh_alloc_space = refcount_count(&dev->l2ad_alloc);
     9035 +        hdr->dh_flags = 0;
     9036 +        if (dev->l2ad_first)
     9037 +                hdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
     9038 +
     9039 +        /* checksum operation goes last */
     9040 +        l2arc_dev_hdr_checksum(hdr, &hdr->dh_self_cksum);
     9041 +
     9042 +        abd = abd_get_from_buf(hdr, hdr_asize);
     9043 +        wzio = zio_write_phys(pio, dev->l2ad_vdev, VDEV_LABEL_START_SIZE,
     9044 +            hdr_asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
     9045 +            ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
     9046 +        DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
     9047 +        (void) zio_nowait(wzio);
     9048 +}
     9049 +
     9050 +/*
     9051 + * Commits a log block to the L2ARC device. This routine is invoked from
     9052 + * l2arc_write_buffers when the log block fills up.
     9053 + * This function allocates some memory to temporarily hold the serialized
     9054 + * buffer to be written. This is then released in l2arc_write_done.
     9055 + */
     9056 +static void
     9057 +l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
     9058 +    l2arc_write_callback_t *cb)
     9059 +{
     9060 +        l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
     9061 +        uint64_t                psize, asize;
     9062 +        l2arc_log_blk_buf_t     *lb_buf;
     9063 +        abd_t                   *abd;
     9064 +        zio_t                   *wzio;
     9065 +
     9066 +        VERIFY(dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
     9067 +
     9068 +        /* link the buffer into the block chain */
     9069 +        lb->lb_back2_lbp = dev->l2ad_dev_hdr->dh_start_lbps[1];
     9070 +        lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
     9071 +
     9072 +        /* try to compress the buffer */
     9073 +        lb_buf = kmem_zalloc(sizeof (*lb_buf), KM_SLEEP);
     9074 +        list_insert_tail(&cb->l2wcb_log_blk_buflist, lb_buf);
     9075 +        abd = abd_get_from_buf(lb, sizeof (*lb));
     9076 +        psize = zio_compress_data(ZIO_COMPRESS_LZ4, abd, lb_buf->lbb_log_blk,
     9077 +            sizeof (*lb));
     9078 +        abd_put(abd);
     9079 +        /* a log block is never entirely zero */
     9080 +        ASSERT(psize != 0);
     9081 +        asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
     9082 +        ASSERT(asize <= sizeof (lb_buf->lbb_log_blk));
     9083 +
     9084 +        /*
     9085 +         * Update the start log blk pointer in the device header to point
     9086 +         * to the log block we're about to write.
     9087 +         */
     9088 +        dev->l2ad_dev_hdr->dh_start_lbps[1] =
     9089 +            dev->l2ad_dev_hdr->dh_start_lbps[0];
     9090 +        dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
     9091 +        _NOTE(CONSTCOND)
     9092 +        LBP_SET_LSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], sizeof (*lb));
     9093 +        LBP_SET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], asize);
     9094 +        LBP_SET_CHECKSUM(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9095 +            ZIO_CHECKSUM_FLETCHER_4);
     9096 +        LBP_SET_TYPE(&dev->l2ad_dev_hdr->dh_start_lbps[0], 0);
     9097 +
     9098 +        if (asize < sizeof (*lb)) {
     9099 +                /* compression succeeded */
     9100 +                bzero(lb_buf->lbb_log_blk + psize, asize - psize);
     9101 +                LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9102 +                    ZIO_COMPRESS_LZ4);
     9103 +        } else {
     9104 +                /* compression failed */
     9105 +                bcopy(lb, lb_buf->lbb_log_blk, sizeof (*lb));
     9106 +                LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
     9107 +                    ZIO_COMPRESS_OFF);
     9108 +        }
     9109 +
     9110 +        /* checksum what we're about to write */
     9111 +        fletcher_4_native(lb_buf->lbb_log_blk, asize,
     9112 +            NULL, &dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_cksum);
     9113 +
     9114 +        /* perform the write itself */
     9115 +        CTASSERT(L2ARC_LOG_BLK_SIZE >= SPA_MINBLOCKSIZE &&
     9116 +            L2ARC_LOG_BLK_SIZE <= SPA_MAXBLOCKSIZE);
     9117 +        abd = abd_get_from_buf(lb_buf->lbb_log_blk, asize);
     9118 +        wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
     9119 +            asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
     9120 +            ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
     9121 +        DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
     9122 +        (void) zio_nowait(wzio);
     9123 +
     9124 +        dev->l2ad_hand += asize;
     9125 +        vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
     9126 +
     9127 +        /* bump the kstats */
     9128 +        ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
     9129 +        ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
     9130 +        ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, asize);
     9131 +        ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
     9132 +            dev->l2ad_log_blk_payload_asize / asize);
     9133 +
     9134 +        /* start a new log block */
     9135 +        dev->l2ad_log_ent_idx = 0;
     9136 +        dev->l2ad_log_blk_payload_asize = 0;
     9137 +}
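
As a worked example of the size bookkeeping above (the numbers are illustrative, not measured): if zio_compress_data() reduces a full log block to psize = 40,000 bytes on a vdev with ashift = 12 (4 KiB allocation units), vdev_psize_to_asize() rounds that up to asize = 40,960 bytes. The bzero() in the compression-succeeded branch clears the padding bytes in [40,000, 40,960), the Fletcher-4 checksum then covers the full 40,960 bytes that hit the device, and l2ad_hand advances by the same 40,960 bytes.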
     9138 +
     9139 +/*
     9140 + * Validates an L2ARC log blk address to make sure that it can be read
     9141 + * from the provided L2ARC device. Returns B_TRUE if the address is
     9142 + * within the device's bounds, or B_FALSE if not.
     9143 + */
     9144 +static boolean_t
     9145 +l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
     9146 +{
     9147 +        uint64_t psize = LBP_GET_PSIZE(lbp);
     9148 +        uint64_t end = lbp->lbp_daddr + psize;
     9149 +
     9150 +        /*
     9151 +         * A log block is valid if all of the following conditions are true:
     9152 +         * - it fits entirely between l2ad_start and l2ad_end
     9153 +         * - it has a valid size
     9154 +         */
     9155 +        return (lbp->lbp_daddr >= dev->l2ad_start && end <= dev->l2ad_end &&
     9156 +            psize > 0 && psize <= sizeof (l2arc_log_blk_phys_t));
     9157 +}
     9158 +
     9159 +/*
     9160 + * Computes the checksum of `hdr' and stores it in `cksum'.
     9161 + */
     9162 +static void
     9163 +l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr, zio_cksum_t *cksum)
     9164 +{
     9165 +        fletcher_4_native((uint8_t *)hdr +
     9166 +            offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
     9167 +            sizeof (*hdr) - offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
     9168 +            NULL, cksum);
     9169 +}
     9170 +
     9171 +/*
     9172 + * Inserts ARC buffer `ab' into the current L2ARC log blk on the device.
     9173 + * The buffer being inserted must be present in L2ARC.
     9174 + * Returns B_TRUE if the L2ARC log blk is full and needs to be committed
     9175 + * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
     9176 + */
     9177 +static boolean_t
     9178 +l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *ab)
     9179 +{
     9180 +        l2arc_log_blk_phys_t    *lb = &dev->l2ad_log_blk;
     9181 +        l2arc_log_ent_phys_t    *le;
     9182 +        int                     index = dev->l2ad_log_ent_idx++;
     9183 +
     9184 +        ASSERT(index < L2ARC_LOG_BLK_ENTRIES);
     9185 +
     9186 +        le = &lb->lb_entries[index];
     9187 +        bzero(le, sizeof (*le));
     9188 +        le->le_dva = ab->b_dva;
     9189 +        le->le_birth = ab->b_birth;
     9190 +        le->le_daddr = ab->b_l2hdr.b_daddr;
     9191 +        LE_SET_LSIZE(le, HDR_GET_LSIZE(ab));
     9192 +        LE_SET_PSIZE(le, HDR_GET_PSIZE(ab));
     9193 +
     9194 +        if ((ab->b_flags & ARC_FLAG_COMPRESSED_ARC) != 0) {
     9195 +                LE_SET_ARC_COMPRESS(le, 1);
     9196 +                LE_SET_COMPRESS(le, HDR_GET_COMPRESS(ab));
     9197 +        } else {
     9198 +                ASSERT3U(HDR_GET_COMPRESS(ab), ==, ZIO_COMPRESS_OFF);
     9199 +                LE_SET_ARC_COMPRESS(le, 0);
     9200 +                LE_SET_COMPRESS(le, ZIO_COMPRESS_OFF);
     9201 +        }
     9202 +
     9203 +        if (ab->b_freeze_cksum != NULL) {
     9204 +                le->le_freeze_cksum = *ab->b_freeze_cksum;
     9205 +                LE_SET_CHECKSUM(le, ZIO_CHECKSUM_FLETCHER_2);
     9206 +        } else {
     9207 +                LE_SET_CHECKSUM(le, ZIO_CHECKSUM_OFF);
     9208 +        }
     9209 +
     9210 +        LE_SET_TYPE(le, arc_flags_to_bufc(ab->b_flags));
     9211 +        dev->l2ad_log_blk_payload_asize += arc_hdr_size((arc_buf_hdr_t *)ab);
     9212 +
     9213 +        return (dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
     9214 +}
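
A minimal sketch of how a writer consumes the return value; the surrounding loop lives in l2arc_write_buffers, which is outside this hunk, so treat the snippet as illustrative of the calling convention rather than the exact code:

	/* after hdr has been queued for writing to the L2ARC device */
	if (l2arc_log_blk_insert(dev, hdr)) {
		/*
		 * The log block is full: compress it, chain it to the
		 * previously committed block and write it out.
		 */
		l2arc_log_blk_commit(dev, pio, cb);
	}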
     9215 +
     9216 +/*
     9217 + * Checks whether a given L2ARC device address sits in a time-sequential
     9218 + * range. The trick here is that the L2ARC is a rotary buffer, so we can't
     9219 + * just do a range comparison, we need to handle the situation in which the
     9220 + * range wraps around the end of the L2ARC device. Arguments:
     9221 + *      bottom  Lower end of the range to check (written to earlier).
     9222 + *      top     Upper end of the range to check (written to later).
     9223 + *      check   The address for which we want to determine if it sits in
     9224 + *              between the top and bottom.
     9225 + *
     9226 + * The 3-way conditional below represents the following cases:
     9227 + *
     9228 + *      bottom < top : Sequentially ordered case:
     9229 + *        <check>--------+-------------------+
     9230 + *                       |  (overlap here?)  |
     9231 + *       L2ARC dev       V                   V
     9232 + *       |---------------<bottom>============<top>--------------|
     9233 + *
     9234 + *      bottom > top: Looped-around case:
     9235 + *                            <check>--------+------------------+
     9236 + *                                           |  (overlap here?) |
     9237 + *       L2ARC dev                           V                  V
     9238 + *       |===============<top>---------------<bottom>===========|
     9239 + *       ^               ^
     9240 + *       |  (or here?)   |
     9241 + *       +---------------+---------<check>
     9242 + *
     9243 + *      top == bottom : Just a single address comparison.
     9244 + */
     9245 +static inline boolean_t
     9246 +l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
     9247 +{
     9248 +        if (bottom < top)
     9249 +                return (bottom <= check && check <= top);
     9250 +        else if (bottom > top)
     9251 +                return (check <= top || bottom <= check);
     9252 +        else
     9253 +                return (check == top);
7375 9254  }
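
A few illustrative cases for the overlap check (the offsets are arbitrary device byte addresses, not taken from a real pool):

	/* sequential case: bottom < top */
	ASSERT(l2arc_range_check_overlap(0x1000, 0x4000, 0x2000));   /* inside */
	ASSERT(!l2arc_range_check_overlap(0x1000, 0x4000, 0x5000));  /* outside */

	/* wrapped-around case: bottom > top */
	ASSERT(l2arc_range_check_overlap(0x9000, 0x2000, 0x9500));   /* after bottom */
	ASSERT(l2arc_range_check_overlap(0x9000, 0x2000, 0x1000));   /* before top */
	ASSERT(!l2arc_range_check_overlap(0x9000, 0x2000, 0x5000));  /* in the gap */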
    