no-reap-now Wdiff usr/src/uts/common/fs/zfs/arc.c


OS-6363 system went to dark side of moon for ~467 seconds
OS-6404 ARC reclaim should throttle its calls to arc_kmem_reap_now()
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>


          --- old/usr/src/uts/common/fs/zfs/arc.c
          +++ new/usr/src/uts/common/fs/zfs/arc.c

   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *


  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23      - * Copyright (c) 2012, Joyent, Inc. All rights reserved.
       23 + * Copyright (c) 2017, Joyent, Inc. All rights reserved.
  24   24   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  25   25   * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
  26   26   * Copyright 2017 Nexenta Systems, Inc.  All rights reserved.
  27   27   */
  28   28  
  29   29  /*
  30   30   * DVA-based Adjustable Replacement Cache
  31   31   *
  32   32   * While much of the theory of operation used here is
  33   33   * based on the self-tuning, low overhead replacement cache
  34   34   * presented by Megiddo and Modha at FAST 2003, there are some
  35   35   * significant differences:
  36   36   *
  37   37   * 1. The Megiddo and Modha model assumes any page is evictable.
  38   38   * Pages in its cache cannot be "locked" into memory.  This makes
  39   39   * the eviction algorithm simple: evict the last page in the list.
  40   40   * This also makes the performance characteristics easy to reason
  41   41   * about.  Our cache is not so simple.  At any given moment, some
  42   42   * subset of the blocks in the cache are un-evictable because we
  43   43   * have handed out a reference to them.  Blocks are only evictable
  44   44   * when there are no external references active.  This makes
  45   45   * eviction far more problematic:  we choose to evict the evictable
  46   46   * blocks that are the "lowest" in the list.
  47   47   *
  48   48   * There are times when it is not possible to evict the requested
  49   49   * space.  In these circumstances we are unable to adjust the cache
  50   50   * size.  To prevent the cache growing unbounded at these times we
  51   51   * implement a "cache throttle" that slows the flow of new data
  52   52   * into the cache until we can make space available.
  53   53   *
  54   54   * 2. The Megiddo and Modha model assumes a fixed cache size.
  55   55   * Pages are evicted when the cache is full and there is a cache
  56   56   * miss.  Our model has a variable sized cache.  It grows with
  57   57   * high use, but also tries to react to memory pressure from the
  58   58   * operating system: decreasing its size when system memory is
  59   59   * tight.
  60   60   *
  61   61   * 3. The Megiddo and Modha model assumes a fixed page size. All
  62   62   * elements of the cache are therefore exactly the same size.  So
  63   63   * when adjusting the cache size following a cache miss, it's simply
  64   64   * a matter of choosing a single page to evict.  In our model, we
  65   65   * have variable sized cache blocks (ranging from 512 bytes to
  66   66   * 128K bytes).  We therefore choose a set of blocks to evict to make
  67   67   * space for a cache miss that approximates as closely as possible
  68   68   * the space used by the new block.
  69   69   *
  70   70   * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"
  71   71   * by N. Megiddo & D. Modha, FAST 2003
  72   72   */
  73   73  
  74   74  /*
  75   75   * The locking model:
  76   76   *
  77   77   * A new reference to a cache buffer can be obtained in two
  78   78   * ways: 1) via a hash table lookup using the DVA as a key,
  79   79   * or 2) via one of the ARC lists.  The arc_read() interface
  80   80   * uses method 1, while the internal ARC algorithms for
  81   81   * adjusting the cache use method 2.  We therefore provide two
  82   82   * types of locks: 1) the hash table lock array, and 2) the
  83   83   * ARC list locks.
  84   84   *
  85   85   * Buffers do not have their own mutexes, rather they rely on the
  86   86   * hash table mutexes for the bulk of their protection (i.e. most
  87   87   * fields in the arc_buf_hdr_t are protected by these mutexes).
  88   88   *
  89   89   * buf_hash_find() returns the appropriate mutex (held) when it
  90   90   * locates the requested buffer in the hash table.  It returns
  91   91   * NULL for the mutex if the buffer was not in the table.
  92   92   *
  93   93   * buf_hash_remove() expects the appropriate hash mutex to be
  94   94   * already held before it is invoked.
  95   95   *
  96   96   * Each ARC state also has a mutex which is used to protect the
  97   97   * buffer list associated with the state.  When attempting to
  98   98   * obtain a hash table lock while holding an ARC list lock you
  99   99   * must use: mutex_tryenter() to avoid deadlock.  Also note that
 100  100   * the active state mutex must be held before the ghost state mutex.
 101  101   *
 102  102   * Note that the majority of the performance stats are manipulated
 103  103   * with atomic operations.
 104  104   *
 105  105   * The L2ARC uses the l2ad_mtx on each vdev for the following:
 106  106   *
 107  107   *      - L2ARC buflist creation
 108  108   *      - L2ARC buflist eviction
 109  109   *      - L2ARC write completion, which walks L2ARC buflists
 110  110   *      - ARC header destruction, as it removes from L2ARC buflists
 111  111   *      - ARC header release, as it removes from L2ARC buflists
 112  112   */
 113  113  
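The locking rule above (take the hash lock only with mutex_tryenter() while an ARC list lock is held) can be illustrated with a small userland sketch. It is not the code in arc.c: a C11 atomic_flag stands in for the kernel mutex, and `try_lock_t`, `try_lock_enter()`, `try_lock_exit()`, `try_evict_one()` and the `mutex_miss` counter are hypothetical names. The counter plays the role that the arcstat_mutex_miss comment later in this file describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel mutex: flag set means "held". */
typedef struct try_lock {
	atomic_flag tl_held;
} try_lock_t;

static bool
try_lock_enter(try_lock_t *tl)
{
	/* test_and_set returns the previous value; false means we got it. */
	return (!atomic_flag_test_and_set(&tl->tl_held));
}

static void
try_lock_exit(try_lock_t *tl)
{
	atomic_flag_clear(&tl->tl_held);
}

/* Illustrative counter, analogous in spirit to arcstat_mutex_miss. */
static uint64_t mutex_miss;

/*
 * Assumed to be called with the list lock already held.  It never
 * blocks on the hash lock: if the lock is busy, the buffer is skipped
 * and the miss is counted, so the lock-order inversion cannot deadlock.
 */
static bool
try_evict_one(try_lock_t *hash_lock)
{
	if (!try_lock_enter(hash_lock)) {
		mutex_miss++;
		return (false);
	}
	/* ... eviction work would happen here ... */
	try_lock_exit(hash_lock);
	return (true);
}
```

The point of the design is that the thread holding the list lock gives up on a contended buffer instead of waiting, so a thread that holds the hash lock and wants the list lock can always make progress.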
 114  114  /*
 115  115   * ARC operation:
 116  116   *
 117  117   * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
 118  118   * This structure can point either to a block that is still in the cache or to
 119  119   * one that is only accessible in an L2 ARC device, or it can provide
 120  120   * information about a block that was recently evicted. If a block is
 121  121   * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
 122  122   * information to retrieve it from the L2ARC device. This information is
 123  123   * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
 124  124   * that is in this state cannot access the data directly.
 125  125   *
 126  126   * Blocks that are actively being referenced or have not been evicted
 127  127   * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
 128  128   * the arc_buf_hdr_t that will point to the data block in memory. A block can
 129  129   * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
 130  130   * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
 131  131   * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
 132  132   *
 133  133   * The L1ARC's data pointer may or may not be uncompressed. The ARC has the
 134  134   * ability to store the physical data (b_pabd) associated with the DVA of the
 135  135   * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
 136  136   * it will match its on-disk compression characteristics. This behavior can be
 137  137   * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
 138  138   * compressed ARC functionality is disabled, the b_pabd will point to an
 139  139   * uncompressed version of the on-disk data.
 140  140   *
 141  141   * Data in the L1ARC is not accessed by consumers of the ARC directly. Each
 142  142   * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
 143  143   * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
 144  144   * consumer. The ARC will provide references to this data and will keep it
 145  145   * cached until it is no longer in use. The ARC caches only the L1ARC's physical
 146  146   * data block and will evict any arc_buf_t that is no longer referenced. The
 147  147   * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
 148  148   * "overhead_size" kstat.
 149  149   *
 150  150   * Depending on the consumer, an arc_buf_t can be requested in uncompressed or
 151  151   * compressed form. The typical case is that consumers will want uncompressed
 152  152   * data, and when that happens a new data buffer is allocated where the data is
 153  153   * decompressed for them to use. Currently the only consumer who wants
 154  154   * compressed arc_buf_t's is "zfs send", when it streams data exactly as it
 155  155   * exists on disk. When this happens, the arc_buf_t's data buffer is shared
 156  156   * with the arc_buf_hdr_t.
 157  157   *
 158  158   * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
 159  159   * first one is owned by a compressed send consumer (and therefore references
 160  160   * the same compressed data buffer as the arc_buf_hdr_t) and the second could be
 161  161   * used by any other consumer (and has its own uncompressed copy of the data
 162  162   * buffer).
 163  163   *
 164  164   *   arc_buf_hdr_t
 165  165   *   +-----------+
 166  166   *   | fields    |
 167  167   *   | common to |
 168  168   *   | L1- and   |
 169  169   *   | L2ARC     |
 170  170   *   +-----------+
 171  171   *   | l2arc_buf_hdr_t
 172  172   *   |           |
 173  173   *   +-----------+
 174  174   *   | l1arc_buf_hdr_t
 175  175   *   |           |              arc_buf_t
 176  176   *   | b_buf     +------------>+-----------+      arc_buf_t
 177  177   *   | b_pabd    +-+           |b_next     +---->+-----------+
 178  178   *   +-----------+ |           |-----------|     |b_next     +-->NULL
 179  179   *                 |           |b_comp = T |     +-----------+
 180  180   *                 |           |b_data     +-+   |b_comp = F |
 181  181   *                 |           +-----------+ |   |b_data     +-+
 182  182   *                 +->+------+               |   +-----------+ |
 183  183   *        compressed  |      |               |                 |
 184  184   *           data     |      |<--------------+                 | uncompressed
 185  185   *                    +------+          compressed,            |     data
 186  186   *                                        shared               +-->+------+
 187  187   *                                         data                    |      |
 188  188   *                                                                 |      |
 189  189   *                                                                 +------+
 190  190   *
 191  191   * When a consumer reads a block, the ARC must first look to see if the
 192  192   * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
 193  193   * arc_buf_t and either copies uncompressed data into a new data buffer from an
 194  194   * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
 195  195   * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
 196  196   * hdr is compressed and the desired compression characteristics of the
 197  197   * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
 198  198   * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
 199  199   * the last buffer in the hdr's b_buf list, however a shared compressed buf can
 200  200   * be anywhere in the hdr's list.
 201  201   *
 202  202   * The diagram below shows an example of an uncompressed ARC hdr that is
 203  203   * sharing its data with an arc_buf_t (note that the shared uncompressed buf is
 204  204   * the last element in the buf list):
 205  205   *
 206  206   *                arc_buf_hdr_t
 207  207   *                +-----------+
 208  208   *                |           |
 209  209   *                |           |
 210  210   *                |           |
 211  211   *                +-----------+
 212  212   * l2arc_buf_hdr_t|           |
 213  213   *                |           |
 214  214   *                +-----------+
 215  215   * l1arc_buf_hdr_t|           |
 216  216   *                |           |                 arc_buf_t    (shared)
 217  217   *                |    b_buf  +------------>+---------+      arc_buf_t
 218  218   *                |           |             |b_next   +---->+---------+
 219  219   *                |  b_pabd   +-+           |---------|     |b_next   +-->NULL
 220  220   *                +-----------+ |           |         |     +---------+
 221  221   *                              |           |b_data   +-+   |         |
 222  222   *                              |           +---------+ |   |b_data   +-+
 223  223   *                              +->+------+             |   +---------+ |
 224  224   *                                 |      |             |               |
 225  225   *                   uncompressed  |      |             |               |
 226  226   *                        data     +------+             |               |
 227  227   *                                    ^                 +->+------+     |
 228  228   *                                    |       uncompressed |      |     |
 229  229   *                                    |           data     |      |     |
 230  230   *                                    |                    +------+     |
 231  231   *                                    +---------------------------------+
 232  232   *
 233  233   * Writing to the ARC requires that the ARC first discard the hdr's b_pabd
 234  234   * since the physical block is about to be rewritten. The new data contents
 235  235   * will be contained in the arc_buf_t. As the I/O pipeline performs the write,
 236  236   * it may compress the data before writing it to disk. The ARC will be called
 237  237   * with the transformed data and will bcopy the transformed on-disk block into
 238  238   * a newly allocated b_pabd. Writes are always done into buffers which have
 239  239   * either been loaned (and hence are new and don't have other readers) or
 240  240   * buffers which have been released (and hence have their own hdr, if there
 241  241   * were originally other readers of the buf's original hdr). This ensures that
 242  242   * the ARC only needs to update a single buf and its hdr after a write occurs.
 243  243   *
 244  244   * When the L2ARC is in use, it will also take advantage of the b_pabd. The
 245  245   * L2ARC will always write the contents of b_pabd to the L2ARC. This means
 246  246   * that when compressed ARC is enabled, the L2ARC blocks are identical
 247  247   * to the on-disk block in the main data pool. This provides a significant
 248  248   * advantage since the ARC can leverage the bp's checksum when reading from the
 249  249   * L2ARC to determine if the contents are valid. However, if the compressed
 250  250   * ARC is disabled, then the L2ARC's block must be transformed to look
 251  251   * like the physical block in the main data pool before comparing the
 252  252   * checksum and determining its validity.
 253  253   */
 254  254  
 255  255  #include <sys/spa.h>
 256  256  #include <sys/zio.h>
 257  257  #include <sys/spa_impl.h>
 258  258  #include <sys/zio_compress.h>
 259  259  #include <sys/zio_checksum.h>
 260  260  #include <sys/zfs_context.h>
 261  261  #include <sys/arc.h>
 262  262  #include <sys/refcount.h>
 263  263  #include <sys/vdev.h>
 264  264  #include <sys/vdev_impl.h>
 265  265  #include <sys/dsl_pool.h>
 266  266  #include <sys/zio_checksum.h>
 267  267  #include <sys/multilist.h>
 268  268  #include <sys/abd.h>
 269  269  #ifdef _KERNEL
 270  270  #include <sys/vmsystm.h>
 271  271  #include <vm/anon.h>
 272  272  #include <sys/fs/swapnode.h>
 273  273  #include <sys/dnlc.h>
 274  274  #endif
 275  275  #include <sys/callb.h>
 276  276  #include <sys/kstat.h>
 277  277  #include <zfs_fletcher.h>
 278  278  
 279  279  #ifndef _KERNEL
 280  280  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 281  281  boolean_t arc_watch = B_FALSE;
 282  282  int arc_procfd;
 283  283  #endif
 284  284  
 285  285  static kmutex_t         arc_reclaim_lock;
 286  286  static kcondvar_t       arc_reclaim_thread_cv;
 287  287  static boolean_t        arc_reclaim_thread_exit;
 288  288  static kcondvar_t       arc_reclaim_waiters_cv;
 289  289  
 290  290  uint_t arc_reduce_dnlc_percent = 3;
 291  291  
 292  292  /*
 293  293   * The number of headers to evict in arc_evict_state_impl() before


 294  294   * dropping the sublist lock and evicting from another sublist. A lower
 295  295   * value means we're more likely to evict the "correct" header (i.e. the
 296  296   * oldest header in the arc state), but comes with higher overhead
 297  297   * (i.e. more invocations of arc_evict_state_impl()).
 298  298   */
 299  299  int zfs_arc_evict_batch_limit = 10;
 300  300  
 301  301  /* number of seconds before growing cache again */
 302  302  static int              arc_grow_retry = 60;
 303  303  
      304 +/* number of milliseconds before attempting a kmem-cache-reap */
      305 +static int              arc_kmem_cache_reap_retry_ms = 1000;
      306 +
 304  307  /* shift of arc_c for calculating overflow limit in arc_get_data_impl */
 305  308  int             zfs_arc_overflow_shift = 8;
 306  309  
 307  310  /* shift of arc_c for calculating both min and max arc_p */
 308  311  static int              arc_p_min_shift = 4;
 309  312  
 310  313  /* log2(fraction of arc to reclaim) */
 311  314  static int              arc_shrink_shift = 7;
 312  315  
 313  316  /*
 314  317   * log2(fraction of ARC which must be free to allow growing).
 315  318   * I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
 316  319   * when reading a new block into the ARC, we will evict an equal-sized block
 317  320   * from the ARC.
 318  321   *
 319  322   * This must be less than arc_shrink_shift, so that when we shrink the ARC,
 320  323   * we will still not allow it to grow.
 321  324   */
 322  325  int                     arc_no_grow_shift = 5;
 323  326  
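The relationship between arc_no_grow_shift and arc_shrink_shift is easier to see with numbers. The sketch below only restates the arithmetic from the comment above. The helper names `no_grow_threshold()`, `shrink_bytes()` and `may_grow()` are hypothetical, and the 1 GiB value for arc_c used in the example is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default values as they appear in this file. */
static int arc_shrink_shift = 7;
static int arc_no_grow_shift = 5;

/* Free memory below this many bytes means the ARC may not grow. */
static uint64_t
no_grow_threshold(uint64_t arc_c)
{
	return (arc_c >> arc_no_grow_shift);
}

/* Fraction of the ARC reclaimed by one shrink. */
static uint64_t
shrink_bytes(uint64_t arc_c)
{
	return (arc_c >> arc_shrink_shift);
}

/* Hypothetical predicate restating the comment's condition. */
static bool
may_grow(uint64_t arc_c, uint64_t free_bytes)
{
	return (free_bytes >= no_grow_threshold(arc_c));
}
```

For an arc_c of 1 GiB, the no-grow threshold is 32 MiB (1 GiB >> 5) while one shrink reclaims 8 MiB (1 GiB >> 7). Because the smaller shift gives the larger quantity, memory freed by a single shrink is not by itself enough to cross the no-grow threshold.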
 324  327  
 325  328  /*
 326  329   * minimum lifespan of a prefetch block in clock ticks
 327  330   * (initialized in arc_init())
 328  331   */
 329  332  static int              arc_min_prefetch_lifespan;
 330  333  
 331  334  /*
 332  335   * If this percent of memory is free, don't throttle.
 333  336   */
 334  337  int arc_lotsfree_percent = 10;
 335  338  
 336  339  static int arc_dead;
 337  340  
 338  341  /*
 339  342   * The arc has filled available memory and has now warmed up.
 340  343   */
 341  344  static boolean_t arc_warm;
 342  345  
 343  346  /*
 344  347   * log2 fraction of the zio arena to keep free.
 345  348   */
 346  349  int arc_zio_arena_free_shift = 2;
 347  350  
 348  351  /*
 349  352   * These tunables are for performance analysis.
 350  353   */
 351  354  uint64_t zfs_arc_max;
 352  355  uint64_t zfs_arc_min;
 353  356  uint64_t zfs_arc_meta_limit = 0;
 354  357  uint64_t zfs_arc_meta_min = 0;
 355  358  int zfs_arc_grow_retry = 0;
 356  359  int zfs_arc_shrink_shift = 0;
 357  360  int zfs_arc_p_min_shift = 0;
 358  361  int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 359  362  
 360  363  boolean_t zfs_compressed_arc_enabled = B_TRUE;
 361  364  
 362  365  /*
 363  366   * Note that buffers can be in one of 6 states:
 364  367   *      ARC_anon        - anonymous (discussed below)
 365  368   *      ARC_mru         - recently used, currently cached
 366  369   *      ARC_mru_ghost   - recently used, no longer in cache
 367  370   *      ARC_mfu         - frequently used, currently cached
 368  371   *      ARC_mfu_ghost   - frequently used, no longer in cache
 369  372   *      ARC_l2c_only    - exists in L2ARC but not other states
 370  373   * When there are no active references to the buffer, they are
 371  374   * linked onto a list in one of these arc states.  These are
 372  375   * the only buffers that can be evicted or deleted.  Within each
 373  376   * state there are multiple lists, one for meta-data and one for
 374  377   * non-meta-data.  Meta-data (indirect blocks, blocks of dnodes,
 375  378   * etc.) is tracked separately so that it can be managed more
 376  379   * explicitly: favored over data, limited explicitly.
 377  380   *
 378  381   * Anonymous buffers are buffers that are not associated with
 379  382   * a DVA.  These are buffers that hold dirty block copies
 380  383   * before they are written to stable storage.  By definition,
 381  384   * they are "ref'd" and are considered part of arc_mru
 382  385   * that cannot be freed.  Generally, they will acquire a DVA
 383  386   * as they are written and migrate onto the arc_mru list.
 384  387   *
 385  388   * The ARC_l2c_only state is for buffers that are in the second
 386  389   * level ARC but no longer in any of the ARC_m* lists.  The second
 387  390   * level ARC itself may also contain buffers that are in any of
 388  391   * the ARC_m* states - meaning that a buffer can exist in two
 389  392   * places.  The reason for the ARC_l2c_only state is to keep the
 390  393   * buffer header in the hash table, so that reads that hit the
 391  394   * second level ARC benefit from these fast lookups.
 392  395   */
 393  396  
 394  397  typedef struct arc_state {
 395  398          /*
 396  399           * list of evictable buffers
 397  400           */
 398  401          multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
 399  402          /*
 400  403           * total amount of evictable data in this state
 401  404           */
 402  405          refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
 403  406          /*
 404  407           * total amount of data in this state; this includes: evictable,
 405  408           * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
 406  409           */
 407  410          refcount_t arcs_size;
 408  411  } arc_state_t;
 409  412  
 410  413  /* The 6 states: */
 411  414  static arc_state_t ARC_anon;
 412  415  static arc_state_t ARC_mru;
 413  416  static arc_state_t ARC_mru_ghost;
 414  417  static arc_state_t ARC_mfu;
 415  418  static arc_state_t ARC_mfu_ghost;
 416  419  static arc_state_t ARC_l2c_only;
 417  420  
 418  421  typedef struct arc_stats {
 419  422          kstat_named_t arcstat_hits;
 420  423          kstat_named_t arcstat_misses;
 421  424          kstat_named_t arcstat_demand_data_hits;
 422  425          kstat_named_t arcstat_demand_data_misses;
 423  426          kstat_named_t arcstat_demand_metadata_hits;
 424  427          kstat_named_t arcstat_demand_metadata_misses;
 425  428          kstat_named_t arcstat_prefetch_data_hits;
 426  429          kstat_named_t arcstat_prefetch_data_misses;
 427  430          kstat_named_t arcstat_prefetch_metadata_hits;
 428  431          kstat_named_t arcstat_prefetch_metadata_misses;
 429  432          kstat_named_t arcstat_mru_hits;
 430  433          kstat_named_t arcstat_mru_ghost_hits;
 431  434          kstat_named_t arcstat_mfu_hits;
 432  435          kstat_named_t arcstat_mfu_ghost_hits;
 433  436          kstat_named_t arcstat_deleted;
 434  437          /*
 435  438           * Number of buffers that could not be evicted because the hash lock
 436  439           * was held by another thread.  The lock may not necessarily be held
 437  440           * by something using the same buffer, since hash locks are shared
 438  441           * by multiple buffers.
 439  442           */
 440  443          kstat_named_t arcstat_mutex_miss;
 441  444          /*
 442  445           * Number of buffers skipped because they have I/O in progress, are
 443  446           * indirect prefetch buffers that have not lived long enough, or are
 444  447           * not from the spa we're trying to evict from.
 445  448           */
 446  449          kstat_named_t arcstat_evict_skip;
 447  450          /*
 448  451           * Number of times arc_evict_state() was unable to evict enough
 449  452           * buffers to reach its target amount.
 450  453           */
 451  454          kstat_named_t arcstat_evict_not_enough;
 452  455          kstat_named_t arcstat_evict_l2_cached;
 453  456          kstat_named_t arcstat_evict_l2_eligible;
 454  457          kstat_named_t arcstat_evict_l2_ineligible;
 455  458          kstat_named_t arcstat_evict_l2_skip;
 456  459          kstat_named_t arcstat_hash_elements;
 457  460          kstat_named_t arcstat_hash_elements_max;
 458  461          kstat_named_t arcstat_hash_collisions;
 459  462          kstat_named_t arcstat_hash_chains;
 460  463          kstat_named_t arcstat_hash_chain_max;
 461  464          kstat_named_t arcstat_p;
 462  465          kstat_named_t arcstat_c;
 463  466          kstat_named_t arcstat_c_min;
 464  467          kstat_named_t arcstat_c_max;
 465  468          kstat_named_t arcstat_size;
 466  469          /*
 467  470           * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
 468  471           * Note that the compressed bytes may match the uncompressed bytes
 469  472           * if the block is either not compressed or compressed arc is disabled.
 470  473           */
 471  474          kstat_named_t arcstat_compressed_size;
 472  475          /*
 473  476           * Uncompressed size of the data stored in b_pabd. If compressed
 474  477           * arc is disabled then this value will be identical to the stat
 475  478           * above.
 476  479           */
 477  480          kstat_named_t arcstat_uncompressed_size;
 478  481          /*
 479  482           * Number of bytes stored in all the arc_buf_t's. This is classified
 480  483           * as "overhead" since this data is typically short-lived and will
 481  484           * be evicted from the arc when it becomes unreferenced unless the
 482  485           * zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level
 483  486           * values have been set (see comment in dbuf.c for more information).
 484  487           */
 485  488          kstat_named_t arcstat_overhead_size;
 486  489          /*
 487  490           * Number of bytes consumed by internal ARC structures necessary
 488  491           * for tracking purposes; these structures are not actually
 489  492           * backed by ARC buffers. This includes arc_buf_hdr_t structures
 490  493           * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
 491  494           * caches), and arc_buf_t structures (allocated via arc_buf_t
 492  495           * cache).
 493  496           */
 494  497          kstat_named_t arcstat_hdr_size;
 495  498          /*
 496  499           * Number of bytes consumed by ARC buffers of type equal to
 497  500           * ARC_BUFC_DATA. This is generally consumed by buffers backing
 498  501           * on disk user data (e.g. plain file contents).
 499  502           */
 500  503          kstat_named_t arcstat_data_size;
 501  504          /*
 502  505           * Number of bytes consumed by ARC buffers of type equal to
 503  506           * ARC_BUFC_METADATA. This is generally consumed by buffers
 504  507           * backing on disk data that is used for internal ZFS
 505  508           * structures (e.g. ZAP, dnode, indirect blocks, etc).
 506  509           */
 507  510          kstat_named_t arcstat_metadata_size;
 508  511          /*
 509  512           * Number of bytes consumed by various buffers and structures
 510  513           * not actually backed with ARC buffers. This includes bonus
 511  514           * buffers (allocated directly via zio_buf_* functions),
 512  515           * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
 513  516           * cache), and dnode_t structures (allocated via dnode_t cache).
 514  517           */
 515  518          kstat_named_t arcstat_other_size;
 516  519          /*
 517  520           * Total number of bytes consumed by ARC buffers residing in the
 518  521           * arc_anon state. This includes *all* buffers in the arc_anon
 519  522           * state; e.g. data, metadata, evictable, and unevictable buffers
 520  523           * are all included in this value.
 521  524           */
 522  525          kstat_named_t arcstat_anon_size;
 523  526          /*
 524  527           * Number of bytes consumed by ARC buffers that meet the
 525  528           * following criteria: backing buffers of type ARC_BUFC_DATA,
 526  529           * residing in the arc_anon state, and are eligible for eviction
 527  530           * (e.g. have no outstanding holds on the buffer).
 528  531           */
 529  532          kstat_named_t arcstat_anon_evictable_data;
 530  533          /*
 531  534           * Number of bytes consumed by ARC buffers that meet the
 532  535           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 533  536           * residing in the arc_anon state, and are eligible for eviction
 534  537           * (e.g. have no outstanding holds on the buffer).
 535  538           */
 536  539          kstat_named_t arcstat_anon_evictable_metadata;
 537  540          /*
 538  541           * Total number of bytes consumed by ARC buffers residing in the
 539  542           * arc_mru state. This includes *all* buffers in the arc_mru
 540  543           * state; e.g. data, metadata, evictable, and unevictable buffers
 541  544           * are all included in this value.
 542  545           */
 543  546          kstat_named_t arcstat_mru_size;
 544  547          /*
 545  548           * Number of bytes consumed by ARC buffers that meet the
 546  549           * following criteria: backing buffers of type ARC_BUFC_DATA,
 547  550           * residing in the arc_mru state, and are eligible for eviction
 548  551           * (e.g. have no outstanding holds on the buffer).
 549  552           */
 550  553          kstat_named_t arcstat_mru_evictable_data;
 551  554          /*
 552  555           * Number of bytes consumed by ARC buffers that meet the
 553  556           * following criteria: backing buffers of type ARC_BUFC_METADATA,
 554  557           * residing in the arc_mru state, and are eligible for eviction
 555  558           * (e.g. have no outstanding holds on the buffer).
 556  559           */
 557  560          kstat_named_t arcstat_mru_evictable_metadata;
 558  561          /*
 559  562           * Total number of bytes that *would have been* consumed by ARC
 560  563           * buffers in the arc_mru_ghost state. The key thing to note
 561  564           * here, is the fact that this size doesn't actually indicate
 562  565           * RAM consumption. The ghost lists only consist of headers and
 563  566           * don't actually have ARC buffers linked off of these headers.
 564  567           * Thus, *if* the headers had associated ARC buffers, these
 565  568           * buffers *would have* consumed this number of bytes.
 566  569           */
 567  570          kstat_named_t arcstat_mru_ghost_size;
 568  571          /*
 569  572           * Number of bytes that *would have been* consumed by ARC
 570  573           * buffers that are eligible for eviction, of type
 571  574           * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
 572  575           */
 573  576          kstat_named_t arcstat_mru_ghost_evictable_data;
 574  577          /*
 575  578           * Number of bytes that *would have been* consumed by ARC
 576  579           * buffers that are eligible for eviction, of type
 577  580           * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
 578  581           */
 579  582          kstat_named_t arcstat_mru_ghost_evictable_metadata;
 580  583          /*
 581  584           * Total number of bytes consumed by ARC buffers residing in the
 582  585           * arc_mfu state. This includes *all* buffers in the arc_mfu
 583  586           * state; e.g. data, metadata, evictable, and unevictable buffers
 584  587           * are all included in this value.
 585  588           */
 586  589          kstat_named_t arcstat_mfu_size;
 587  590          /*
 588  591           * Number of bytes consumed by ARC buffers that are eligible for
 589  592           * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
 590  593           * state.
 591  594           */
 592  595          kstat_named_t arcstat_mfu_evictable_data;
 593  596          /*
 594  597           * Number of bytes consumed by ARC buffers that are eligible for
 595  598           * eviction, of type ARC_BUFC_METADATA, and reside in the
 596  599           * arc_mfu state.
 597  600           */
 598  601          kstat_named_t arcstat_mfu_evictable_metadata;
 599  602          /*
 600  603           * Total number of bytes that *would have been* consumed by ARC
 601  604           * buffers in the arc_mfu_ghost state. See the comment above
 602  605           * arcstat_mru_ghost_size for more details.
 603  606           */
 604  607          kstat_named_t arcstat_mfu_ghost_size;
 605  608          /*
 606  609           * Number of bytes that *would have been* consumed by ARC
 607  610           * buffers that are eligible for eviction, of type
 608  611           * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
 609  612           */
 610  613          kstat_named_t arcstat_mfu_ghost_evictable_data;
 611  614          /*
 612  615           * Number of bytes that *would have been* consumed by ARC
 613  616           * buffers that are eligible for eviction, of type
 614  617           * ARC_BUFC_METADATA, and linked off the arc_mfu_ghost state.
 615  618           */
 616  619          kstat_named_t arcstat_mfu_ghost_evictable_metadata;
 617  620          kstat_named_t arcstat_l2_hits;
 618  621          kstat_named_t arcstat_l2_misses;
 619  622          kstat_named_t arcstat_l2_feeds;
 620  623          kstat_named_t arcstat_l2_rw_clash;
 621  624          kstat_named_t arcstat_l2_read_bytes;
 622  625          kstat_named_t arcstat_l2_write_bytes;
 623  626          kstat_named_t arcstat_l2_writes_sent;
 624  627          kstat_named_t arcstat_l2_writes_done;
 625  628          kstat_named_t arcstat_l2_writes_error;
 626  629          kstat_named_t arcstat_l2_writes_lock_retry;
 627  630          kstat_named_t arcstat_l2_evict_lock_retry;
 628  631          kstat_named_t arcstat_l2_evict_reading;
 629  632          kstat_named_t arcstat_l2_evict_l1cached;
 630  633          kstat_named_t arcstat_l2_free_on_write;
 631  634          kstat_named_t arcstat_l2_abort_lowmem;
 632  635          kstat_named_t arcstat_l2_cksum_bad;
 633  636          kstat_named_t arcstat_l2_io_error;
 634  637          kstat_named_t arcstat_l2_lsize;
 635  638          kstat_named_t arcstat_l2_psize;
 636  639          kstat_named_t arcstat_l2_hdr_size;
 637  640          kstat_named_t arcstat_memory_throttle_count;
 638  641          kstat_named_t arcstat_meta_used;
 639  642          kstat_named_t arcstat_meta_limit;
 640  643          kstat_named_t arcstat_meta_max;
 641  644          kstat_named_t arcstat_meta_min;
 642  645          kstat_named_t arcstat_sync_wait_for_async;
 643  646          kstat_named_t arcstat_demand_hit_predictive_prefetch;
 644  647  } arc_stats_t;
 645  648  
 646  649  static arc_stats_t arc_stats = {
 647  650          { "hits",                       KSTAT_DATA_UINT64 },
 648  651          { "misses",                     KSTAT_DATA_UINT64 },
 649  652          { "demand_data_hits",           KSTAT_DATA_UINT64 },
 650  653          { "demand_data_misses",         KSTAT_DATA_UINT64 },
 651  654          { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 652  655          { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
 653  656          { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 654  657          { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 655  658          { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 656  659          { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
 657  660          { "mru_hits",                   KSTAT_DATA_UINT64 },
 658  661          { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 659  662          { "mfu_hits",                   KSTAT_DATA_UINT64 },
 660  663          { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 661  664          { "deleted",                    KSTAT_DATA_UINT64 },
 662  665          { "mutex_miss",                 KSTAT_DATA_UINT64 },
 663  666          { "evict_skip",                 KSTAT_DATA_UINT64 },
 664  667          { "evict_not_enough",           KSTAT_DATA_UINT64 },
 665  668          { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 666  669          { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 667  670          { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 668  671          { "evict_l2_skip",              KSTAT_DATA_UINT64 },
 669  672          { "hash_elements",              KSTAT_DATA_UINT64 },
 670  673          { "hash_elements_max",          KSTAT_DATA_UINT64 },
 671  674          { "hash_collisions",            KSTAT_DATA_UINT64 },
 672  675          { "hash_chains",                KSTAT_DATA_UINT64 },
 673  676          { "hash_chain_max",             KSTAT_DATA_UINT64 },
 674  677          { "p",                          KSTAT_DATA_UINT64 },
 675  678          { "c",                          KSTAT_DATA_UINT64 },
 676  679          { "c_min",                      KSTAT_DATA_UINT64 },
 677  680          { "c_max",                      KSTAT_DATA_UINT64 },
 678  681          { "size",                       KSTAT_DATA_UINT64 },
 679  682          { "compressed_size",            KSTAT_DATA_UINT64 },
 680  683          { "uncompressed_size",          KSTAT_DATA_UINT64 },
 681  684          { "overhead_size",              KSTAT_DATA_UINT64 },
 682  685          { "hdr_size",                   KSTAT_DATA_UINT64 },
 683  686          { "data_size",                  KSTAT_DATA_UINT64 },
 684  687          { "metadata_size",              KSTAT_DATA_UINT64 },
 685  688          { "other_size",                 KSTAT_DATA_UINT64 },
 686  689          { "anon_size",                  KSTAT_DATA_UINT64 },
 687  690          { "anon_evictable_data",        KSTAT_DATA_UINT64 },
 688  691          { "anon_evictable_metadata",    KSTAT_DATA_UINT64 },
 689  692          { "mru_size",                   KSTAT_DATA_UINT64 },
 690  693          { "mru_evictable_data",         KSTAT_DATA_UINT64 },
 691  694          { "mru_evictable_metadata",     KSTAT_DATA_UINT64 },
 692  695          { "mru_ghost_size",             KSTAT_DATA_UINT64 },
 693  696          { "mru_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 694  697          { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
 695  698          { "mfu_size",                   KSTAT_DATA_UINT64 },
 696  699          { "mfu_evictable_data",         KSTAT_DATA_UINT64 },
 697  700          { "mfu_evictable_metadata",     KSTAT_DATA_UINT64 },
 698  701          { "mfu_ghost_size",             KSTAT_DATA_UINT64 },
 699  702          { "mfu_ghost_evictable_data",   KSTAT_DATA_UINT64 },
 700  703          { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
 701  704          { "l2_hits",                    KSTAT_DATA_UINT64 },
 702  705          { "l2_misses",                  KSTAT_DATA_UINT64 },
 703  706          { "l2_feeds",                   KSTAT_DATA_UINT64 },
 704  707          { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 705  708          { "l2_read_bytes",              KSTAT_DATA_UINT64 },
 706  709          { "l2_write_bytes",             KSTAT_DATA_UINT64 },
 707  710          { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 708  711          { "l2_writes_done",             KSTAT_DATA_UINT64 },
 709  712          { "l2_writes_error",            KSTAT_DATA_UINT64 },
 710  713          { "l2_writes_lock_retry",       KSTAT_DATA_UINT64 },
 711  714          { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 712  715          { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 713  716          { "l2_evict_l1cached",          KSTAT_DATA_UINT64 },
 714  717          { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 715  718          { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 716  719          { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 717  720          { "l2_io_error",                KSTAT_DATA_UINT64 },
 718  721          { "l2_size",                    KSTAT_DATA_UINT64 },
 719  722          { "l2_asize",                   KSTAT_DATA_UINT64 },
 720  723          { "l2_hdr_size",                KSTAT_DATA_UINT64 },
 721  724          { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 722  725          { "arc_meta_used",              KSTAT_DATA_UINT64 },
 723  726          { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 724  727          { "arc_meta_max",               KSTAT_DATA_UINT64 },
 725  728          { "arc_meta_min",               KSTAT_DATA_UINT64 },
 726  729          { "sync_wait_for_async",        KSTAT_DATA_UINT64 },
 727  730          { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
 728  731  };
 729  732  
 730  733  #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 731  734  
 732  735  #define ARCSTAT_INCR(stat, val) \
 733  736          atomic_add_64(&arc_stats.stat.value.ui64, (val))
 734  737  
 735  738  #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
 736  739  #define ARCSTAT_BUMPDOWN(stat)  ARCSTAT_INCR(stat, -1)
 737  740  
 738  741  #define ARCSTAT_MAX(stat, val) {                                        \
 739  742          uint64_t m;                                                     \
 740  743          while ((val) > (m = arc_stats.stat.value.ui64) &&               \
 741  744              (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
 742  745                  continue;                                               \
 743  746  }
 744  747  
 745  748  #define ARCSTAT_MAXSTAT(stat) \
 746  749          ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
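
The ARCSTAT_MAX pattern above is a lock-free "raise to maximum" update: reload the current maximum and retry the compare-and-swap only if another thread changed it in between. The sketch below restates the idea with C11 atomics rather than the kernel's atomic_cas_64(). The demo_* names are hypothetical and not part of arc.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a kstat value and its recorded maximum. */
static _Atomic uint64_t demo_stat;
static _Atomic uint64_t demo_stat_max;

/*
 * Raise *maxp to val unless it is already at least val. The loop only
 * repeats when another thread modified *maxp between the load and the CAS.
 */
static void
demo_update_max(_Atomic uint64_t *maxp, uint64_t val)
{
    uint64_t m;

    assert(maxp != NULL);
    m = atomic_load(maxp);
    while (val > m) {
        /* On failure, m is refreshed with the value currently stored. */
        if (atomic_compare_exchange_weak(maxp, &m, val))
            break;
    }
}

/* Add delta to the stat, fold the result into the max, return the max. */
static uint64_t
demo_stat_add(uint64_t delta)
{
    uint64_t now = atomic_fetch_add(&demo_stat, delta) + delta;

    demo_update_max(&demo_stat_max, now);
    return (atomic_load(&demo_stat_max));
}
```

Passing a negative delta cast to uint64_t (as ARCSTAT_BUMPDOWN does with -1) lowers the stat through unsigned wraparound while leaving the recorded maximum untouched.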
 747  750  
 748  751  /*
 749  752   * We define a macro to allow ARC hits/misses to be easily broken down by
 750  753   * two separate conditions, giving a total of four different subtypes for
 751  754   * each of hits and misses (so eight statistics total).
 752  755   */
 753  756  #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 754  757          if (cond1) {                                                    \
 755  758                  if (cond2) {                                            \
 756  759                          ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
 757  760                  } else {                                                \
 758  761                          ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
 759  762                  }                                                       \
 760  763          } else {                                                        \
 761  764                  if (cond2) {                                            \
 762  765                          ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
 763  766                  } else {                                                \
 764  767                          ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
 765  768                  }                                                       \
 766  769          }
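
To illustrate how the token pasting in ARCSTAT_CONDSTAT selects one of four counters, here is a reduced sketch with hypothetical demo_* counters in place of the real arcstat_* fields. The demand/prefetch and data/metadata split is assumed from the kstat names in this file. The do/while wrapper is an addition for statement safety and is not in the original macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; the names mimic the arcstat_* hit fields. */
static struct {
    uint64_t demo_demand_data_hits;
    uint64_t demo_demand_metadata_hits;
    uint64_t demo_prefetch_data_hits;
    uint64_t demo_prefetch_metadata_hits;
} demo_stats;

#define DEMO_BUMP(stat) (demo_stats.stat++)

/* Same shape as ARCSTAT_CONDSTAT: two conditions pick one of four names. */
#define DEMO_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
    do { \
        if (cond1) { \
            if (cond2) \
                DEMO_BUMP(demo_##stat1##_##stat2##_##stat); \
            else \
                DEMO_BUMP(demo_##stat1##_##notstat2##_##stat); \
        } else { \
            if (cond2) \
                DEMO_BUMP(demo_##notstat1##_##stat2##_##stat); \
            else \
                DEMO_BUMP(demo_##notstat1##_##notstat2##_##stat); \
        } \
    } while (0)

/* Record one cache hit, classified by prefetch and metadata status. */
static void
demo_record_hit(int is_prefetch, int is_metadata)
{
    DEMO_CONDSTAT(!is_prefetch, demand, prefetch,
        !is_metadata, data, metadata, hits);
}
```

Calling demo_record_hit(1, 1) expands to a bump of demo_prefetch_metadata_hits. Passing "misses" as the last macro argument would select the corresponding miss counters if they were declared.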
 767  770  
 768  771  kstat_t                 *arc_ksp;
 769  772  static arc_state_t      *arc_anon;
 770  773  static arc_state_t      *arc_mru;
 771  774  static arc_state_t      *arc_mru_ghost;
 772  775  static arc_state_t      *arc_mfu;
 773  776  static arc_state_t      *arc_mfu_ghost;
 774  777  static arc_state_t      *arc_l2c_only;
 775  778  
 776  779  /*
 777  780   * There are several ARC variables that are critical to export as kstats --
 778  781   * but we don't want to have to grovel around in the kstat whenever we wish to
 779  782   * manipulate them.  For these variables, we therefore define them to be in
 780  783   * terms of the statistic variable.  This assures that we are not introducing
 781  784   * the possibility of inconsistency by having shadow copies of the variables,
 782  785   * while still allowing the code to be readable.
 783  786   */
 784  787  #define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
 785  788  #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 786  789  #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 787  790  #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 788  791  #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 789  792  #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 790  793  #define arc_meta_min    ARCSTAT(arcstat_meta_min) /* min size for metadata */
 791  794  #define arc_meta_used   ARCSTAT(arcstat_meta_used) /* size of metadata */
 792  795  #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */
 793  796  
 794  797  /* compressed size of entire arc */
 795  798  #define arc_compressed_size     ARCSTAT(arcstat_compressed_size)
 796  799  /* uncompressed size of entire arc */
 797  800  #define arc_uncompressed_size   ARCSTAT(arcstat_uncompressed_size)
 798  801  /* number of bytes in the arc from arc_buf_t's */
 799  802  #define arc_overhead_size       ARCSTAT(arcstat_overhead_size)
 800  803  
 801  804  static int              arc_no_grow;    /* Don't try to grow cache size */
 802  805  static uint64_t         arc_tempreserve;
 803  806  static uint64_t         arc_loaned_bytes;
 804  807  
 805  808  typedef struct arc_callback arc_callback_t;
 806  809  
 807  810  struct arc_callback {
 808  811          void                    *acb_private;
 809  812          arc_done_func_t         *acb_done;
 810  813          arc_buf_t               *acb_buf;
 811  814          boolean_t               acb_compressed;
 812  815          zio_t                   *acb_zio_dummy;
 813  816          arc_callback_t          *acb_next;
 814  817  };
 815  818  
 816  819  typedef struct arc_write_callback arc_write_callback_t;
 817  820  
 818  821  struct arc_write_callback {
 819  822          void            *awcb_private;
 820  823          arc_done_func_t *awcb_ready;
 821  824          arc_done_func_t *awcb_children_ready;
 822  825          arc_done_func_t *awcb_physdone;
 823  826          arc_done_func_t *awcb_done;
 824  827          arc_buf_t       *awcb_buf;
 825  828  };
 826  829  
 827  830  /*
 828  831   * ARC buffers are separated into multiple structs as a memory saving measure:
 829  832   *   - Common fields struct, always defined, and embedded within it:
 830  833   *       - L2-only fields, always allocated but undefined when not in L2ARC
 831  834   *       - L1-only fields, only allocated when in L1ARC
 832  835   *
 833  836   *           Buffer in L1                     Buffer only in L2
 834  837   *    +------------------------+          +------------------------+
 835  838   *    | arc_buf_hdr_t          |          | arc_buf_hdr_t          |
 836  839   *    |                        |          |                        |
 837  840   *    |                        |          |                        |
 838  841   *    |                        |          |                        |
 839  842   *    +------------------------+          +------------------------+
 840  843   *    | l2arc_buf_hdr_t        |          | l2arc_buf_hdr_t        |
 841  844   *    | (undefined if L1-only) |          |                        |
 842  845   *    +------------------------+          +------------------------+
 843  846   *    | l1arc_buf_hdr_t        |
 844  847   *    |                        |
 845  848   *    |                        |
 846  849   *    |                        |
 847  850   *    |                        |
 848  851   *    +------------------------+
 849  852   *
 850  853   * Because it's possible for the L2ARC to become extremely large, we can wind
 851  854   * up eating a lot of memory in L2ARC buffer headers, so the size of a header
 852  855   * is minimized by only allocating the fields necessary for an L1-cached buffer
 853  856   * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
 854  857   * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
 855  858   * words in pointers. arc_hdr_realloc() is used to switch a header between
 856  859   * these two allocation states.
 857  860   */
 858  861  typedef struct l1arc_buf_hdr {
 859  862          kmutex_t                b_freeze_lock;
 860  863          zio_cksum_t             *b_freeze_cksum;
 861  864  #ifdef ZFS_DEBUG
 862  865          /*
 863  866           * Used for debugging with kmem_flags - by allocating and freeing
 864  867           * b_thawed when the buffer is thawed, we get a record of the stack
 865  868           * trace that thawed it.
 866  869           */
 867  870          void                    *b_thawed;
 868  871  #endif
 869  872  
 870  873          arc_buf_t               *b_buf;
 871  874          uint32_t                b_bufcnt;
 872  875          /* for waiting on writes to complete */
 873  876          kcondvar_t              b_cv;
 874  877          uint8_t                 b_byteswap;
 875  878  
 876  879          /* protected by arc state mutex */
 877  880          arc_state_t             *b_state;
 878  881          multilist_node_t        b_arc_node;
 879  882  
 880  883          /* updated atomically */
 881  884          clock_t                 b_arc_access;
 882  885  
 883  886          /* self protecting */
 884  887          refcount_t              b_refcnt;
 885  888  
 886  889          arc_callback_t          *b_acb;
 887  890          abd_t                   *b_pabd;
 888  891  } l1arc_buf_hdr_t;
 889  892  
 890  893  typedef struct l2arc_dev l2arc_dev_t;
 891  894  
 892  895  typedef struct l2arc_buf_hdr {
 893  896          /* protected by arc_buf_hdr mutex */
 894  897          l2arc_dev_t             *b_dev;         /* L2ARC device */
 895  898          uint64_t                b_daddr;        /* disk address, offset byte */
 896  899  
 897  900          list_node_t             b_l2node;
 898  901  } l2arc_buf_hdr_t;
 899  902  
 900  903  struct arc_buf_hdr {
 901  904          /* protected by hash lock */
 902  905          dva_t                   b_dva;
 903  906          uint64_t                b_birth;
 904  907  
 905  908          arc_buf_contents_t      b_type;
 906  909          arc_buf_hdr_t           *b_hash_next;
 907  910          arc_flags_t             b_flags;
 908  911  
 909  912          /*
 910  913           * This field stores the size of the data buffer after
 911  914           * compression, and is set in the arc's zio completion handlers.
 912  915           * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
 913  916           *
 914  917           * While the block pointers can store up to 32MB in their psize
 915  918           * field, we can only store up to 32MB minus 512B. This is due
 916  919           * to the bp using a bias of 1, whereas we use a bias of 0 (i.e.
 917  920           * a field of zeros represents 512B in the bp). We can't use a
 918  921           * bias of 1 since we need to reserve a psize of zero, here, to
 919  922           * represent holes and embedded blocks.
 920  923           *
 921  924           * This isn't a problem in practice, since the maximum size of a
 922  925           * buffer is limited to 16MB, so we never need to store 32MB in
 923  926           * this field. Even in the upstream illumos code base, the
 924  927           * maximum size of a buffer is limited to 16MB.
 925  928           */
 926  929          uint16_t                b_psize;
 927  930  
 928  931          /*
 929  932           * This field stores the size of the data buffer before
 930  933           * compression, and cannot change once set. It is in units
 931  934           * of SPA_MINBLOCKSIZE (e.g. 2 == 1024 bytes)
 932  935           */
 933  936          uint16_t                b_lsize;        /* immutable */
 934  937          uint64_t                b_spa;          /* immutable */
 935  938  
 936  939          /* L2ARC fields. Undefined when not in L2ARC. */
 937  940          l2arc_buf_hdr_t         b_l2hdr;
 938  941          /* L1ARC fields. Undefined when in l2arc_only state */
 939  942          l1arc_buf_hdr_t         b_l1hdr;
 940  943  };
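
The arithmetic behind the b_psize and b_lsize comments can be checked with a small sketch. It assumes a 512-byte unit (as the comments state for SPA_MINBLOCKSIZE) and a bias of 0. The demo_* helpers are hypothetical and exist only to show the encoding.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MINBLOCKSHIFT 9 /* 512-byte units, per the comments above */
#define DEMO_MINBLOCKSIZE  (1ULL << DEMO_MINBLOCKSHIFT)

/* Encode a byte count into a 16-bit field using a bias of 0. */
static uint16_t
demo_size_encode(uint64_t bytes)
{
    /* The size must be a whole number of units and fit in 16 bits. */
    assert((bytes & (DEMO_MINBLOCKSIZE - 1)) == 0);
    assert((bytes >> DEMO_MINBLOCKSHIFT) <= UINT16_MAX);
    return ((uint16_t)(bytes >> DEMO_MINBLOCKSHIFT));
}

/* Decode a 16-bit unit count back into bytes. */
static uint64_t
demo_size_decode(uint16_t units)
{
    return ((uint64_t)units << DEMO_MINBLOCKSHIFT);
}
```

With a bias of 0, the largest encodable value is 65535 units, i.e. 32MB minus 512B, and a 16MB buffer encodes as 32768, well inside the field. A value of zero remains free to represent holes and embedded blocks.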
 941  944  
 942  945  #define GHOST_STATE(state)      \
 943  946          ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
 944  947          (state) == arc_l2c_only)
 945  948  
 946  949  #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
 947  950  #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
 948  951  #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
 949  952  #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_FLAG_PREFETCH)
 950  953  #define HDR_COMPRESSION_ENABLED(hdr)    \
 951  954          ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
 952  955  
 953  956  #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_FLAG_L2CACHE)
 954  957  #define HDR_L2_READING(hdr)     \
 955  958          (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) &&  \
 956  959          ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
 957  960  #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
 958  961  #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
 959  962  #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
 960  963  #define HDR_SHARED_DATA(hdr)    ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
 961  964  
 962  965  #define HDR_ISTYPE_METADATA(hdr)        \
 963  966          ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
 964  967  #define HDR_ISTYPE_DATA(hdr)    (!HDR_ISTYPE_METADATA(hdr))
 965  968  
 966  969  #define HDR_HAS_L1HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
 967  970  #define HDR_HAS_L2HDR(hdr)      ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
 968  971  
 969  972  /* For storing compression mode in b_flags */
 970  973  #define HDR_COMPRESS_OFFSET     (highbit64(ARC_FLAG_COMPRESS_0) - 1)
 971  974  
 972  975  #define HDR_GET_COMPRESS(hdr)   ((enum zio_compress)BF32_GET((hdr)->b_flags, \
 973  976          HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
 974  977  #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
 975  978          HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
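
HDR_GET_COMPRESS and HDR_SET_COMPRESS treat a run of bits inside b_flags as a small integer field. The sketch below shows the same shift-and-mask technique as plain functions. The offset of 24 and width of 7 are illustrative assumptions, not values taken from this listing, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_COMPRESS_OFFSET 24 /* assumed bit position of the field */
#define DEMO_COMPRESS_BITS   7  /* assumed width of the field */
#define DEMO_COMPRESS_MASK   ((1U << DEMO_COMPRESS_BITS) - 1)

/* Extract the compression field from a flags word. */
static uint32_t
demo_get_compress(uint32_t flags)
{
    return ((flags >> DEMO_COMPRESS_OFFSET) & DEMO_COMPRESS_MASK);
}

/* Return flags with the compression field replaced by cmp. */
static uint32_t
demo_set_compress(uint32_t flags, uint32_t cmp)
{
    assert(cmp <= DEMO_COMPRESS_MASK);
    flags &= ~(DEMO_COMPRESS_MASK << DEMO_COMPRESS_OFFSET);
    return (flags | (cmp << DEMO_COMPRESS_OFFSET));
}
```

Unlike the macro form, the setter here returns the new word instead of modifying its argument, which avoids the trailing-semicolon hazard of a statement-like macro.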
 976  979  
 977  980  #define ARC_BUF_LAST(buf)       ((buf)->b_next == NULL)
 978  981  #define ARC_BUF_SHARED(buf)     ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
 979  982  #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
 980  983  
 981  984  /*
 982  985   * Other sizes
 983  986   */
 984  987  
 985  988  #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
 986  989  #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
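
HDR_L2ONLY_SIZE relies on b_l1hdr being the last member of arc_buf_hdr_t: allocating only offsetof(..., b_l1hdr) bytes yields a header that simply lacks the L1 fields. A minimal sketch with much smaller, hypothetical structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for the real sub-headers. */
typedef struct demo_l2hdr {
    uint64_t daddr;
} demo_l2hdr_t;

typedef struct demo_l1hdr {
    void     *buf;
    uint32_t bufcnt;
    uint64_t access;
} demo_l1hdr_t;

typedef struct demo_hdr {
    uint64_t     birth;
    uint32_t     flags;
    demo_l2hdr_t l2hdr; /* always allocated */
    demo_l1hdr_t l1hdr; /* must stay last so it can be left off */
} demo_hdr_t;

#define DEMO_FULL_SIZE   ((int64_t)sizeof (demo_hdr_t))
#define DEMO_L2ONLY_SIZE ((int64_t)offsetof(demo_hdr_t, l1hdr))

/* Pick the allocation size for a header with or without L1 fields. */
static int64_t
demo_hdr_alloc_size(int has_l1hdr)
{
    return (has_l1hdr ? DEMO_FULL_SIZE : DEMO_L2ONLY_SIZE);
}
```

Code holding an L2-only header must never touch l1hdr, since that memory was not allocated. This is why the real code guards such accesses with HDR_HAS_L1HDR().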
 987  990  
 988  991  /*
 989  992   * Hash table routines
 990  993   */
 991  994  
 992  995  #define HT_LOCK_PAD     64
 993  996  
 994  997  struct ht_lock {
 995  998          kmutex_t        ht_lock;
 996  999  #ifdef _KERNEL
 997 1000          unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
 998 1001  #endif
 999 1002  };
1000 1003  
1001 1004  #define BUF_LOCKS 256
1002 1005  typedef struct buf_hash_table {
1003 1006          uint64_t ht_mask;
1004 1007          arc_buf_hdr_t **ht_table;
1005 1008          struct ht_lock ht_locks[BUF_LOCKS];
1006 1009  } buf_hash_table_t;
1007 1010  
1008 1011  static buf_hash_table_t buf_hash_table;
1009 1012  
1010 1013  #define BUF_HASH_INDEX(spa, dva, birth) \
1011 1014          (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1012 1015  #define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
1013 1016  #define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
1014 1017  #define HDR_LOCK(hdr) \
1015 1018          (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
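
The macros above implement lock striping: the hash table may have far more buckets than locks, so each bucket index is folded onto one of BUF_LOCKS mutexes by masking. The sketch below shows only the index arithmetic. The bucket count and the demo_* names are assumptions, and a plain struct stands in for kmutex_t.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOCKS 256 /* must be a power of two for the mask to work */

/* Placeholder for a real mutex; only its address matters here. */
typedef struct demo_lock {
    int held;
} demo_lock_t;

static demo_lock_t demo_locks[DEMO_LOCKS];

/* Hypothetical table of 2^20 buckets; the mask is (bucket count - 1). */
static const uint64_t demo_ht_mask = (1ULL << 20) - 1;

/* Map a full 64-bit hash to a bucket index. */
static uint64_t
demo_bucket_index(uint64_t hash)
{
    return (hash & demo_ht_mask);
}

/* Map a bucket index to the lock that protects its chain. */
static demo_lock_t *
demo_bucket_lock(uint64_t idx)
{
    assert(idx <= demo_ht_mask);
    return (&demo_locks[idx & (DEMO_LOCKS - 1)]);
}
```

Buckets whose indices differ by a multiple of DEMO_LOCKS share a lock, trading some contention for a fixed, small number of mutexes.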
1016 1019  
1017 1020  uint64_t zfs_crc64_table[256];
1018 1021  
1019 1022  /*
1020 1023   * Level 2 ARC
1021 1024   */
1022 1025  
1023 1026  #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
1024 1027  #define L2ARC_HEADROOM          2                       /* num of writes */
1025 1028  /*
1026 1029   * If we discover during ARC scan any buffers to be compressed, we boost
1027 1030   * our headroom for the next scanning cycle by this percentage multiple.
1028 1031   */
1029 1032  #define L2ARC_HEADROOM_BOOST    200
1030 1033  #define L2ARC_FEED_SECS         1               /* caching interval secs */
1031 1034  #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
1032 1035  
1033 1036  #define l2arc_writes_sent       ARCSTAT(arcstat_l2_writes_sent)
1034 1037  #define l2arc_writes_done       ARCSTAT(arcstat_l2_writes_done)
1035 1038  
1036 1039  /* L2ARC Performance Tunables */
1037 1040  uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
1038 1041  uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
1039 1042  uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
1040 1043  uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1041 1044  uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
1042 1045  uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1043 1046  boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
1044 1047  boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
1045 1048  boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
1046 1049  
1047 1050  /*
1048 1051   * L2ARC Internals
1049 1052   */
1050 1053  struct l2arc_dev {
1051 1054          vdev_t                  *l2ad_vdev;     /* vdev */
1052 1055          spa_t                   *l2ad_spa;      /* spa */
1053 1056          uint64_t                l2ad_hand;      /* next write location */
1054 1057          uint64_t                l2ad_start;     /* first addr on device */
1055 1058          uint64_t                l2ad_end;       /* last addr on device */
1056 1059          boolean_t               l2ad_first;     /* first sweep through */
1057 1060          boolean_t               l2ad_writing;   /* currently writing */
1058 1061          kmutex_t                l2ad_mtx;       /* lock for buffer list */
1059 1062          list_t                  l2ad_buflist;   /* buffer list */
1060 1063          list_node_t             l2ad_node;      /* device list node */
1061 1064          refcount_t              l2ad_alloc;     /* allocated bytes */
1062 1065  };
1063 1066  
1064 1067  static list_t L2ARC_dev_list;                   /* device list */
1065 1068  static list_t *l2arc_dev_list;                  /* device list pointer */
1066 1069  static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
1067 1070  static l2arc_dev_t *l2arc_dev_last;             /* last device used */
1068 1071  static list_t L2ARC_free_on_write;              /* free after write buf list */
1069 1072  static list_t *l2arc_free_on_write;             /* free after write list ptr */
1070 1073  static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
1071 1074  static uint64_t l2arc_ndev;                     /* number of devices */
1072 1075  
1073 1076  typedef struct l2arc_read_callback {
1074 1077          arc_buf_hdr_t           *l2rcb_hdr;             /* read header */
1075 1078          blkptr_t                l2rcb_bp;               /* original blkptr */
1076 1079          zbookmark_phys_t        l2rcb_zb;               /* original bookmark */
1077 1080          int                     l2rcb_flags;            /* original flags */
1078 1081          abd_t                   *l2rcb_abd;             /* temporary buffer */
1079 1082  } l2arc_read_callback_t;
1080 1083  
1081 1084  typedef struct l2arc_write_callback {
1082 1085          l2arc_dev_t     *l2wcb_dev;             /* device info */
1083 1086          arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
1084 1087  } l2arc_write_callback_t;
1085 1088  
1086 1089  typedef struct l2arc_data_free {
1087 1090          /* protected by l2arc_free_on_write_mtx */
1088 1091          abd_t           *l2df_abd;
1089 1092          size_t          l2df_size;
1090 1093          arc_buf_contents_t l2df_type;
1091 1094          list_node_t     l2df_list_node;
1092 1095  } l2arc_data_free_t;
1093 1096  
1094 1097  static kmutex_t l2arc_feed_thr_lock;
1095 1098  static kcondvar_t l2arc_feed_thr_cv;
1096 1099  static uint8_t l2arc_thread_exit;
1097 1100  
1098 1101  static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *);
1099 1102  static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1100 1103  static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1101 1104  static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1102 1105  static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1103 1106  static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1104 1107  static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1105 1108  static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1106 1109  static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1107 1110  static boolean_t arc_is_overflowing();

1108 1111  static void arc_buf_watch(arc_buf_t *);
1109 1112  
1110 1113  static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1111 1114  static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
1112 1115  static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1113 1116  static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1114 1117  
1115 1118  static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1116 1119  static void l2arc_read_done(zio_t *);
1117 1120  
1118 1121  static uint64_t
1119 1122  buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1120 1123  {
1121 1124          uint8_t *vdva = (uint8_t *)dva;
1122 1125          uint64_t crc = -1ULL;
1123 1126          int i;
1124 1127  
1125 1128          ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
1126 1129  
1127 1130          for (i = 0; i < sizeof (dva_t); i++)
1128 1131                  crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
1129 1132  
1130 1133          crc ^= (spa>>8) ^ birth;
1131 1134  
1132 1135          return (crc);
1133 1136  }
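
Reviewer's sketch (not part of arc.c): buf_hash() above folds the bytes of the DVA through a table-driven CRC-64 and then mixes in the spa and birth values. The standalone userland version below illustrates that, under stated assumptions. The table is built with the same loop that buf_init() uses. The polynomial value 0xC96C5795D7870F42 is assumed to match ZFS_CRC64_POLY, and sketch_dva_t, sketch_crc64_init() and sketch_buf_hash() are hypothetical stand-ins for the kernel types and functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of ZFS_CRC64_POLY (ECMA-182, reflected form). */
#define	SKETCH_CRC64_POLY	0xC96C5795D7870F42ULL

/* Hypothetical stand-in for dva_t: two 64-bit words. */
typedef struct sketch_dva {
	uint64_t dva_word[2];
} sketch_dva_t;

static uint64_t sketch_crc64_table[256];

/* Build the 256-entry table with the same recurrence as buf_init(). */
static void
sketch_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t ct = (uint64_t)i;

		for (int j = 8; j > 0; j--)
			ct = (ct >> 1) ^ (-(ct & 1) & SKETCH_CRC64_POLY);
		sketch_crc64_table[i] = ct;
	}
	/* The same sanity check that buf_hash() asserts. */
	assert(sketch_crc64_table[128] == SKETCH_CRC64_POLY);
}

/* Fold each DVA byte into the CRC, then mix in spa and birth. */
static uint64_t
sketch_buf_hash(uint64_t spa, const sketch_dva_t *dva, uint64_t birth)
{
	const uint8_t *vdva = (const uint8_t *)dva;
	uint64_t crc = -1ULL;

	for (size_t i = 0; i < sizeof (sketch_dva_t); i++)
		crc = (crc >> 8) ^ sketch_crc64_table[(crc ^ vdva[i]) & 0xFF];

	crc ^= (spa >> 8) ^ birth;

	return (crc);
}
```

Call sketch_crc64_init() once before hashing. Because birth is XORed in after the byte loop, two lookups that differ only in birth produce hashes that differ by exactly the XOR of the two birth values.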
1134 1137  
1135 1138  #define HDR_EMPTY(hdr)                                          \
1136 1139          ((hdr)->b_dva.dva_word[0] == 0 &&                       \
1137 1140          (hdr)->b_dva.dva_word[1] == 0)
1138 1141  
1139 1142  #define HDR_EQUAL(spa, dva, birth, hdr)                         \
1140 1143          ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&     \
1141 1144          ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
1142 1145          ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
1143 1146  
1144 1147  static void
1145 1148  buf_discard_identity(arc_buf_hdr_t *hdr)
1146 1149  {
1147 1150          hdr->b_dva.dva_word[0] = 0;
1148 1151          hdr->b_dva.dva_word[1] = 0;
1149 1152          hdr->b_birth = 0;
1150 1153  }
1151 1154  
1152 1155  static arc_buf_hdr_t *
1153 1156  buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1154 1157  {
1155 1158          const dva_t *dva = BP_IDENTITY(bp);
1156 1159          uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1157 1160          uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1158 1161          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1159 1162          arc_buf_hdr_t *hdr;
1160 1163  
1161 1164          mutex_enter(hash_lock);
1162 1165          for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
1163 1166              hdr = hdr->b_hash_next) {
1164 1167                  if (HDR_EQUAL(spa, dva, birth, hdr)) {
1165 1168                          *lockp = hash_lock;
1166 1169                          return (hdr);
1167 1170                  }
1168 1171          }
1169 1172          mutex_exit(hash_lock);
1170 1173          *lockp = NULL;
1171 1174          return (NULL);
1172 1175  }
1173 1176  
1174 1177  /*
1175 1178   * Insert an entry into the hash table.  If there is already an element
1176 1179   * equal to elem in the hash table, then the already existing element
1177 1180   * will be returned and the new element will not be inserted.
1178 1181   * Otherwise returns NULL.
1179 1182   * If lockp == NULL, the caller is assumed to already hold the hash lock.
1180 1183   */
1181 1184  static arc_buf_hdr_t *
1182 1185  buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
1183 1186  {
1184 1187          uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1185 1188          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1186 1189          arc_buf_hdr_t *fhdr;
1187 1190          uint32_t i;
1188 1191  
1189 1192          ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
1190 1193          ASSERT(hdr->b_birth != 0);
1191 1194          ASSERT(!HDR_IN_HASH_TABLE(hdr));
1192 1195  
1193 1196          if (lockp != NULL) {
1194 1197                  *lockp = hash_lock;
1195 1198                  mutex_enter(hash_lock);
1196 1199          } else {
1197 1200                  ASSERT(MUTEX_HELD(hash_lock));
1198 1201          }
1199 1202  
1200 1203          for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
1201 1204              fhdr = fhdr->b_hash_next, i++) {
1202 1205                  if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1203 1206                          return (fhdr);
1204 1207          }
1205 1208  
1206 1209          hdr->b_hash_next = buf_hash_table.ht_table[idx];
1207 1210          buf_hash_table.ht_table[idx] = hdr;
1208 1211          arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1209 1212  
1210 1213          /* collect some hash table performance data */
1211 1214          if (i > 0) {
1212 1215                  ARCSTAT_BUMP(arcstat_hash_collisions);
1213 1216                  if (i == 1)
1214 1217                          ARCSTAT_BUMP(arcstat_hash_chains);
1215 1218  
1216 1219                  ARCSTAT_MAX(arcstat_hash_chain_max, i);
1217 1220          }
1218 1221  
1219 1222          ARCSTAT_BUMP(arcstat_hash_elements);
1220 1223          ARCSTAT_MAXSTAT(arcstat_hash_elements);
1221 1224  
1222 1225          return (NULL);
1223 1226  }
1224 1227  
1225 1228  static void
1226 1229  buf_hash_remove(arc_buf_hdr_t *hdr)
1227 1230  {
1228 1231          arc_buf_hdr_t *fhdr, **hdrp;
1229 1232          uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1230 1233  
1231 1234          ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1232 1235          ASSERT(HDR_IN_HASH_TABLE(hdr));
1233 1236  
1234 1237          hdrp = &buf_hash_table.ht_table[idx];
1235 1238          while ((fhdr = *hdrp) != hdr) {
1236 1239                  ASSERT3P(fhdr, !=, NULL);
1237 1240                  hdrp = &fhdr->b_hash_next;
1238 1241          }
1239 1242          *hdrp = hdr->b_hash_next;
1240 1243          hdr->b_hash_next = NULL;
1241 1244          arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1242 1245  
1243 1246          /* collect some hash table performance data */
1244 1247          ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1245 1248  
1246 1249          if (buf_hash_table.ht_table[idx] &&
1247 1250              buf_hash_table.ht_table[idx]->b_hash_next == NULL)
1248 1251                  ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1249 1252  }
1250 1253  
1251 1254  /*
1252 1255   * Global data structures and functions for the buf kmem cache.
1253 1256   */
1254 1257  static kmem_cache_t *hdr_full_cache;
1255 1258  static kmem_cache_t *hdr_l2only_cache;
1256 1259  static kmem_cache_t *buf_cache;
1257 1260  
1258 1261  static void
1259 1262  buf_fini(void)
1260 1263  {
1261 1264          int i;
1262 1265  
1263 1266          kmem_free(buf_hash_table.ht_table,
1264 1267              (buf_hash_table.ht_mask + 1) * sizeof (void *));
1265 1268          for (i = 0; i < BUF_LOCKS; i++)
1266 1269                  mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
1267 1270          kmem_cache_destroy(hdr_full_cache);
1268 1271          kmem_cache_destroy(hdr_l2only_cache);
1269 1272          kmem_cache_destroy(buf_cache);
1270 1273  }
1271 1274  
1272 1275  /*
1273 1276   * Constructor callback - called when the cache is empty
1274 1277   * and a new buf is requested.
1275 1278   */
1276 1279  /* ARGSUSED */
1277 1280  static int
1278 1281  hdr_full_cons(void *vbuf, void *unused, int kmflag)
1279 1282  {
1280 1283          arc_buf_hdr_t *hdr = vbuf;
1281 1284  
1282 1285          bzero(hdr, HDR_FULL_SIZE);
1283 1286          cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
1284 1287          refcount_create(&hdr->b_l1hdr.b_refcnt);
1285 1288          mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
1286 1289          multilist_link_init(&hdr->b_l1hdr.b_arc_node);
1287 1290          arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);
1288 1291  
1289 1292          return (0);
1290 1293  }
1291 1294  
1292 1295  /* ARGSUSED */
1293 1296  static int
1294 1297  hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
1295 1298  {
1296 1299          arc_buf_hdr_t *hdr = vbuf;
1297 1300  
1298 1301          bzero(hdr, HDR_L2ONLY_SIZE);
1299 1302          arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
1300 1303  
1301 1304          return (0);
1302 1305  }
1303 1306  
1304 1307  /* ARGSUSED */
1305 1308  static int
1306 1309  buf_cons(void *vbuf, void *unused, int kmflag)
1307 1310  {
1308 1311          arc_buf_t *buf = vbuf;
1309 1312  
1310 1313          bzero(buf, sizeof (arc_buf_t));
1311 1314          mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
1312 1315          arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
1313 1316  
1314 1317          return (0);
1315 1318  }
1316 1319  
1317 1320  /*
1318 1321   * Destructor callback - called when a cached buf is
1319 1322   * no longer required.
1320 1323   */
1321 1324  /* ARGSUSED */
1322 1325  static void
1323 1326  hdr_full_dest(void *vbuf, void *unused)
1324 1327  {
1325 1328          arc_buf_hdr_t *hdr = vbuf;
1326 1329  
1327 1330          ASSERT(HDR_EMPTY(hdr));
1328 1331          cv_destroy(&hdr->b_l1hdr.b_cv);
1329 1332          refcount_destroy(&hdr->b_l1hdr.b_refcnt);
1330 1333          mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
1331 1334          ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
1332 1335          arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
1333 1336  }
1334 1337  
1335 1338  /* ARGSUSED */
1336 1339  static void
1337 1340  hdr_l2only_dest(void *vbuf, void *unused)
1338 1341  {
1339 1342          arc_buf_hdr_t *hdr = vbuf;
1340 1343  
1341 1344          ASSERT(HDR_EMPTY(hdr));
1342 1345          arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
1343 1346  }
1344 1347  
1345 1348  /* ARGSUSED */
1346 1349  static void
1347 1350  buf_dest(void *vbuf, void *unused)
1348 1351  {
1349 1352          arc_buf_t *buf = vbuf;
1350 1353  
1351 1354          mutex_destroy(&buf->b_evict_lock);
1352 1355          arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
1353 1356  }
1354 1357  
1355 1358  /*
1356 1359   * Reclaim callback -- invoked when memory is low.
1357 1360   */
1358 1361  /* ARGSUSED */
1359 1362  static void
1360 1363  hdr_recl(void *unused)
1361 1364  {
1362 1365          dprintf("hdr_recl called\n");
1363 1366          /*
1364 1367           * umem calls the reclaim func when we destroy the buf cache,
1365 1368           * which is after we do arc_fini().
1366 1369           */
1367 1370          if (!arc_dead)
1368 1371                  cv_signal(&arc_reclaim_thread_cv);
1369 1372  }
1370 1373  
1371 1374  static void
1372 1375  buf_init(void)
1373 1376  {
1374 1377          uint64_t *ct;
1375 1378          uint64_t hsize = 1ULL << 12;
1376 1379          int i, j;
1377 1380  
1378 1381          /*
1379 1382           * The hash table is big enough to fill all of physical memory
1380 1383           * with an average block size of zfs_arc_average_blocksize (default 8K).
1381 1384           * By default, the table will take up
1382 1385           * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1383 1386           */
1384 1387          while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1385 1388                  hsize <<= 1;
1386 1389  retry:
1387 1390          buf_hash_table.ht_mask = hsize - 1;
1388 1391          buf_hash_table.ht_table =
1389 1392              kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
1390 1393          if (buf_hash_table.ht_table == NULL) {
1391 1394                  ASSERT(hsize > (1ULL << 8));
1392 1395                  hsize >>= 1;
1393 1396                  goto retry;
1394 1397          }
1395 1398  
1396 1399          hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1397 1400              0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1398 1401          hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1399 1402              HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1400 1403              NULL, NULL, 0);
1401 1404          buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1402 1405              0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1403 1406  
1404 1407          for (i = 0; i < 256; i++)
1405 1408                  for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1406 1409                          *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1407 1410  
1408 1411          for (i = 0; i < BUF_LOCKS; i++) {
1409 1412                  mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
1410 1413                      NULL, MUTEX_DEFAULT, NULL);
1411 1414          }
1412 1415  }
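
Reviewer's sketch (not part of arc.c): the sizing comment in buf_init() above says the table takes about 1MB per GB of memory with 8-byte pointers. The helper below reproduces only the doubling loop so that figure can be checked in userland. sketch_hash_slots() is a hypothetical name, and the memory size is passed in bytes rather than derived from physmem and PAGESIZE.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the number of hash slots chosen by the doubling loop:
 * start at 2^12 and double until slots * avg_blocksize covers
 * mem_bytes.
 */
static uint64_t
sketch_hash_slots(uint64_t mem_bytes, uint64_t avg_blocksize)
{
	uint64_t hsize = 1ULL << 12;

	assert(avg_blocksize > 0);
	while (hsize * avg_blocksize < mem_bytes)
		hsize <<= 1;

	return (hsize);
}
```

For 1 GiB of memory and an 8K average block size this gives 2^17 slots, and 2^17 slots of 8 bytes each is 1 MiB, which agrees with the comment. Small systems never go below the 2^12 starting size in this loop; the real function can still shrink the table in its retry path if the allocation fails.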
1413 1416  
1414 1417  /*
1415 1418   * This is the size that the buf occupies in memory. If the buf is compressed,
1416 1419   * it will correspond to the compressed size. You should use this method of
1417 1420   * getting the buf size unless you explicitly need the logical size.
1418 1421   */
1419 1422  int32_t
1420 1423  arc_buf_size(arc_buf_t *buf)
1421 1424  {
1422 1425          return (ARC_BUF_COMPRESSED(buf) ?
1423 1426              HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
1424 1427  }
1425 1428  
1426 1429  int32_t
1427 1430  arc_buf_lsize(arc_buf_t *buf)
1428 1431  {
1429 1432          return (HDR_GET_LSIZE(buf->b_hdr));
1430 1433  }
1431 1434  
1432 1435  enum zio_compress
1433 1436  arc_get_compression(arc_buf_t *buf)
1434 1437  {
1435 1438          return (ARC_BUF_COMPRESSED(buf) ?
1436 1439              HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
1437 1440  }
1438 1441  
1439 1442  #define ARC_MINTIME     (hz>>4) /* 62 ms */
1440 1443  
1441 1444  static inline boolean_t
1442 1445  arc_buf_is_shared(arc_buf_t *buf)
1443 1446  {
1444 1447          boolean_t shared = (buf->b_data != NULL &&
1445 1448              buf->b_hdr->b_l1hdr.b_pabd != NULL &&
1446 1449              abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
1447 1450              buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
1448 1451          IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
1449 1452          IMPLY(shared, ARC_BUF_SHARED(buf));
1450 1453          IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
1451 1454  
1452 1455          /*
1453 1456           * It would be nice to assert arc_can_share() too, but the "hdr isn't
1454 1457           * already being shared" requirement prevents us from doing that.
1455 1458           */
1456 1459  
1457 1460          return (shared);
1458 1461  }
1459 1462  
1460 1463  /*
1461 1464   * Free the checksum associated with this header. If there is no checksum, this
1462 1465   * is a no-op.
1463 1466   */
1464 1467  static inline void
1465 1468  arc_cksum_free(arc_buf_hdr_t *hdr)
1466 1469  {
1467 1470          ASSERT(HDR_HAS_L1HDR(hdr));
1468 1471          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1469 1472          if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1470 1473                  kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
1471 1474                  hdr->b_l1hdr.b_freeze_cksum = NULL;
1472 1475          }
1473 1476          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1474 1477  }
1475 1478  
1476 1479  /*
1477 1480   * Return true iff at least one of the bufs on hdr is not compressed.
1478 1481   */
1479 1482  static boolean_t
1480 1483  arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1481 1484  {
1482 1485          for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
1483 1486                  if (!ARC_BUF_COMPRESSED(b)) {
1484 1487                          return (B_TRUE);
1485 1488                  }
1486 1489          }
1487 1490          return (B_FALSE);
1488 1491  }
1489 1492  
1490 1493  /*
1491 1494   * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
1492 1495   * matches the checksum that is stored in the hdr. If there is no checksum,
1493 1496   * or if the buf is compressed, this is a no-op.
1494 1497   */
1495 1498  static void
1496 1499  arc_cksum_verify(arc_buf_t *buf)
1497 1500  {
1498 1501          arc_buf_hdr_t *hdr = buf->b_hdr;
1499 1502          zio_cksum_t zc;
1500 1503  
1501 1504          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1502 1505                  return;
1503 1506  
1504 1507          if (ARC_BUF_COMPRESSED(buf)) {
1505 1508                  ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1506 1509                      arc_hdr_has_uncompressed_buf(hdr));
1507 1510                  return;
1508 1511          }
1509 1512  
1510 1513          ASSERT(HDR_HAS_L1HDR(hdr));
1511 1514  
1512 1515          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1513 1516          if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1514 1517                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1515 1518                  return;
1516 1519          }
1517 1520  
1518 1521          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1519 1522          if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
1520 1523                  panic("buffer modified while frozen!");
1521 1524          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1522 1525  }
1523 1526  
1524 1527  static boolean_t
1525 1528  arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1526 1529  {
1527 1530          enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1528 1531          boolean_t valid_cksum;
1529 1532  
1530 1533          ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
1531 1534          VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
1532 1535  
1533 1536          /*
1534 1537           * We rely on the blkptr's checksum to determine if the block
1535 1538           * is valid or not. When compressed arc is enabled, the l2arc
1536 1539           * writes the block to the l2arc just as it appears in the pool.
1537 1540           * This allows us to use the blkptr's checksum to validate the
1538 1541           * data that we just read off of the l2arc without having to store
1539 1542           * a separate checksum in the arc_buf_hdr_t. However, if compressed
1540 1543           * arc is disabled, then the data written to the l2arc is always
1541 1544           * uncompressed and won't match the block as it exists in the main
1542 1545           * pool. When this is the case, we must first compress it if it is
1543 1546           * compressed on the main pool before we can validate the checksum.
1544 1547           */
1545 1548          if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
1546 1549                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1547 1550                  uint64_t lsize = HDR_GET_LSIZE(hdr);
1548 1551                  uint64_t csize;
1549 1552  
1550 1553                  abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
1551 1554                  csize = zio_compress_data(compress, zio->io_abd,
1552 1555                      abd_to_buf(cdata), lsize);
1553 1556  
1554 1557                  ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
1555 1558                  if (csize < HDR_GET_PSIZE(hdr)) {
1556 1559                          /*
1557 1560                           * Compressed blocks are always a multiple of the
1558 1561                           * smallest ashift in the pool. Ideally, we would
1559 1562                           * like to round up the csize to the next
1560 1563                           * spa_min_ashift but that value may have changed
1561 1564                           * since the block was last written. Instead,
1562 1565                           * we rely on the fact that the hdr's psize
1563 1566                           * was set to the psize of the block when it was
1564 1567                           * last written. We set the csize to that value
1565 1568                           * and zero out any part that should not contain
1566 1569                           * data.
1567 1570                           */
1568 1571                          abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize);
1569 1572                          csize = HDR_GET_PSIZE(hdr);
1570 1573                  }
1571 1574                  zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL);
1572 1575          }
1573 1576  
1574 1577          /*
1575 1578           * Block pointers always store the checksum for the logical data.
1576 1579           * If the block pointer has the gang bit set, then the checksum
1577 1580           * it represents is for the reconstituted data and not for an
1578 1581           * individual gang member. The zio pipeline, however, must be able to
1579 1582           * determine the checksum of each of the gang constituents so it
1580 1583           * treats the checksum comparison differently than what we need
1581 1584           * for l2arc blocks. This prevents us from using the
1582 1585           * zio_checksum_error() interface directly. Instead we must call the
1583 1586           * zio_checksum_error_impl() so that we can ensure the checksum is
1584 1587           * generated using the correct checksum algorithm and accounts for the
1585 1588           * logical I/O size and not just a gang fragment.
1586 1589           */
1587 1590          valid_cksum = (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
1588 1591              BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
1589 1592              zio->io_offset, NULL) == 0);
1590 1593          zio_pop_transforms(zio);
1591 1594          return (valid_cksum);
1592 1595  }
1593 1596  
1594 1597  /*
1595 1598   * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
1596 1599   * checksum and attaches it to the buf's hdr so that we can ensure that the buf
1597 1600   * isn't modified later on. If buf is compressed or there is already a checksum
1598 1601   * on the hdr, this is a no-op (we only checksum uncompressed bufs).
1599 1602   */
1600 1603  static void
1601 1604  arc_cksum_compute(arc_buf_t *buf)
1602 1605  {
1603 1606          arc_buf_hdr_t *hdr = buf->b_hdr;
1604 1607  
1605 1608          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1606 1609                  return;
1607 1610  
1608 1611          ASSERT(HDR_HAS_L1HDR(hdr));
1609 1612  
1610 1613          mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
1611 1614          if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1612 1615                  ASSERT(arc_hdr_has_uncompressed_buf(hdr));
1613 1616                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1614 1617                  return;
1615 1618          } else if (ARC_BUF_COMPRESSED(buf)) {
1616 1619                  mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1617 1620                  return;
1618 1621          }
1619 1622  
1620 1623          ASSERT(!ARC_BUF_COMPRESSED(buf));
1621 1624          hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
1622 1625              KM_SLEEP);
1623 1626          fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
1624 1627              hdr->b_l1hdr.b_freeze_cksum);
1625 1628          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1626 1629          arc_buf_watch(buf);
1627 1630  }
1628 1631  
1629 1632  #ifndef _KERNEL
1630 1633  typedef struct procctl {
1631 1634          long cmd;
1632 1635          prwatch_t prwatch;
1633 1636  } procctl_t;
1634 1637  #endif
1635 1638  
1636 1639  /* ARGSUSED */
1637 1640  static void
1638 1641  arc_buf_unwatch(arc_buf_t *buf)
1639 1642  {
1640 1643  #ifndef _KERNEL
1641 1644          if (arc_watch) {
1642 1645                  int result;
1643 1646                  procctl_t ctl;
1644 1647                  ctl.cmd = PCWATCH;
1645 1648                  ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1646 1649                  ctl.prwatch.pr_size = 0;
1647 1650                  ctl.prwatch.pr_wflags = 0;
1648 1651                  result = write(arc_procfd, &ctl, sizeof (ctl));
1649 1652                  ASSERT3U(result, ==, sizeof (ctl));
1650 1653          }
1651 1654  #endif
1652 1655  }
1653 1656  
1654 1657  /* ARGSUSED */
1655 1658  static void
1656 1659  arc_buf_watch(arc_buf_t *buf)
1657 1660  {
1658 1661  #ifndef _KERNEL
1659 1662          if (arc_watch) {
1660 1663                  int result;
1661 1664                  procctl_t ctl;
1662 1665                  ctl.cmd = PCWATCH;
1663 1666                  ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1664 1667                  ctl.prwatch.pr_size = arc_buf_size(buf);
1665 1668                  ctl.prwatch.pr_wflags = WA_WRITE;
1666 1669                  result = write(arc_procfd, &ctl, sizeof (ctl));
1667 1670                  ASSERT3U(result, ==, sizeof (ctl));
1668 1671          }
1669 1672  #endif
1670 1673  }
1671 1674  
1672 1675  static arc_buf_contents_t
1673 1676  arc_buf_type(arc_buf_hdr_t *hdr)
1674 1677  {
1675 1678          arc_buf_contents_t type;
1676 1679          if (HDR_ISTYPE_METADATA(hdr)) {
1677 1680                  type = ARC_BUFC_METADATA;
1678 1681          } else {
1679 1682                  type = ARC_BUFC_DATA;
1680 1683          }
1681 1684          VERIFY3U(hdr->b_type, ==, type);
1682 1685          return (type);
1683 1686  }
1684 1687  
1685 1688  boolean_t
1686 1689  arc_is_metadata(arc_buf_t *buf)
1687 1690  {
1688 1691          return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
1689 1692  }
1690 1693  
1691 1694  static uint32_t
1692 1695  arc_bufc_to_flags(arc_buf_contents_t type)
1693 1696  {
1694 1697          switch (type) {
1695 1698          case ARC_BUFC_DATA:
1696 1699                  /* metadata field is 0 if buffer contains normal data */
1697 1700                  return (0);
1698 1701          case ARC_BUFC_METADATA:
1699 1702                  return (ARC_FLAG_BUFC_METADATA);
1700 1703          default:
1701 1704                  break;
1702 1705          }
1703 1706          panic("undefined ARC buffer type!");
1704 1707          return ((uint32_t)-1);
1705 1708  }
1706 1709  
1707 1710  void
1708 1711  arc_buf_thaw(arc_buf_t *buf)
1709 1712  {
1710 1713          arc_buf_hdr_t *hdr = buf->b_hdr;
1711 1714  
1712 1715          ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
1713 1716          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1714 1717  
1715 1718          arc_cksum_verify(buf);
1716 1719  
1717 1720          /*
1718 1721           * Compressed buffers do not manipulate the b_freeze_cksum or
1719 1722           * allocate b_thawed.
1720 1723           */
1721 1724          if (ARC_BUF_COMPRESSED(buf)) {
1722 1725                  ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1723 1726                      arc_hdr_has_uncompressed_buf(hdr));
1724 1727                  return;
1725 1728          }
1726 1729  
1727 1730          ASSERT(HDR_HAS_L1HDR(hdr));
1728 1731          arc_cksum_free(hdr);
1729 1732  
1730 1733          mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1731 1734  #ifdef ZFS_DEBUG
1732 1735          if (zfs_flags & ZFS_DEBUG_MODIFY) {
1733 1736                  if (hdr->b_l1hdr.b_thawed != NULL)
1734 1737                          kmem_free(hdr->b_l1hdr.b_thawed, 1);
1735 1738                  hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
1736 1739          }
1737 1740  #endif
1738 1741  
1739 1742          mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1740 1743  
1741 1744          arc_buf_unwatch(buf);
1742 1745  }
1743 1746  
1744 1747  void
1745 1748  arc_buf_freeze(arc_buf_t *buf)
1746 1749  {
1747 1750          arc_buf_hdr_t *hdr = buf->b_hdr;
1748 1751          kmutex_t *hash_lock;
1749 1752  
1750 1753          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1751 1754                  return;
1752 1755  
1753 1756          if (ARC_BUF_COMPRESSED(buf)) {
1754 1757                  ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1755 1758                      arc_hdr_has_uncompressed_buf(hdr));
1756 1759                  return;
1757 1760          }
1758 1761  
1759 1762          hash_lock = HDR_LOCK(hdr);
1760 1763          mutex_enter(hash_lock);
1761 1764  
1762 1765          ASSERT(HDR_HAS_L1HDR(hdr));
1763 1766          ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
1764 1767              hdr->b_l1hdr.b_state == arc_anon);
1765 1768          arc_cksum_compute(buf);
1766 1769          mutex_exit(hash_lock);
1767 1770  }
1768 1771  
1769 1772  /*
1770 1773   * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
1771 1774   * the following functions should be used to ensure that the flags are
1772 1775   * updated in a thread-safe way. When manipulating the flags either
1773 1776   * the hash_lock must be held or the hdr must be undiscoverable. This
1774 1777   * ensures that we're not racing with any other threads when updating
1775 1778   * the flags.
1776 1779   */
1777 1780  static inline void
1778 1781  arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
1779 1782  {
1780 1783          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1781 1784          hdr->b_flags |= flags;
1782 1785  }
1783 1786  
1784 1787  static inline void
1785 1788  arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
1786 1789  {
1787 1790          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1788 1791          hdr->b_flags &= ~flags;
1789 1792  }
1790 1793  
1791 1794  /*
1792 1795   * Setting the compression bits in the arc_buf_hdr_t's b_flags is
1793 1796   * done in a special way since we have to clear and set bits
1794 1797   * at the same time. Consumers that wish to set the compression bits
1795 1798   * must use this function to ensure that the flags are updated in
1796 1799   * a thread-safe manner.
1797 1800   */
1798 1801  static void
1799 1802  arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
1800 1803  {
1801 1804          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1802 1805  
1803 1806          /*
1804 1807           * Holes and embedded blocks will always have a psize = 0 so
1805 1808           * we ignore the compression of the blkptr and set the
1806 1809           * arc_buf_hdr_t's compression to ZIO_COMPRESS_OFF.
1807 1810           * Holes and embedded blocks remain anonymous so we don't
1808 1811           * want to uncompress them. Mark them as uncompressed.
1809 1812           */
1810 1813          if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
1811 1814                  arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
1812 1815                  HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
1813 1816                  ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
1814 1817                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1815 1818          } else {
1816 1819                  arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
1817 1820                  HDR_SET_COMPRESS(hdr, cmp);
1818 1821                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
1819 1822                  ASSERT(HDR_COMPRESSION_ENABLED(hdr));
1820 1823          }
1821 1824  }
1822 1825  
1823 1826  /*
1824 1827   * Looks for another buf on the same hdr which has the data decompressed, copies
1825 1828   * from it, and returns true. If no such buf exists, returns false.
1826 1829   */
1827 1830  static boolean_t
1828 1831  arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
1829 1832  {
1830 1833          arc_buf_hdr_t *hdr = buf->b_hdr;
1831 1834          boolean_t copied = B_FALSE;
1832 1835  
1833 1836          ASSERT(HDR_HAS_L1HDR(hdr));
1834 1837          ASSERT3P(buf->b_data, !=, NULL);
1835 1838          ASSERT(!ARC_BUF_COMPRESSED(buf));
1836 1839  
1837 1840          for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
1838 1841              from = from->b_next) {
1839 1842                  /* can't use our own data buffer */
1840 1843                  if (from == buf) {
1841 1844                          continue;
1842 1845                  }
1843 1846  
1844 1847                  if (!ARC_BUF_COMPRESSED(from)) {
1845 1848                          bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
1846 1849                          copied = B_TRUE;
1847 1850                          break;
1848 1851                  }
1849 1852          }
1850 1853  
1851 1854          /*
1852 1855           * There were no decompressed bufs, so there should not be a
1853 1856           * checksum on the hdr either.
1854 1857           */
1855 1858          EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
1856 1859  
1857 1860          return (copied);
1858 1861  }
1859 1862  
1860 1863  /*
1861 1864   * Given a buf that has a data buffer attached to it, this function will
1862 1865   * efficiently fill the buf with data of the specified compression setting from
1863 1866   * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
1864 1867   * are already sharing a data buf, no copy is performed.
1865 1868   *
1866 1869   * If the buf is marked as compressed but uncompressed data was requested, this
1867 1870   * will allocate a new data buffer for the buf, remove that flag, and fill the
1868 1871   * buf with uncompressed data. You can't request a compressed buf on a hdr with
1869 1872   * uncompressed data, and (since we haven't added support for it yet) if you
1870 1873   * want compressed data your buf must already be marked as compressed and have
1871 1874   * the correct-sized data buffer.
1872 1875   */
1873 1876  static int
1874 1877  arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
1875 1878  {
1876 1879          arc_buf_hdr_t *hdr = buf->b_hdr;
1877 1880          boolean_t hdr_compressed = (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
1878 1881          dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
1879 1882  
1880 1883          ASSERT3P(buf->b_data, !=, NULL);
1881 1884          IMPLY(compressed, hdr_compressed);
1882 1885          IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
1883 1886  
1884 1887          if (hdr_compressed == compressed) {
1885 1888                  if (!arc_buf_is_shared(buf)) {
1886 1889                          abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
1887 1890                              arc_buf_size(buf));
1888 1891                  }
1889 1892          } else {
1890 1893                  ASSERT(hdr_compressed);
1891 1894                  ASSERT(!compressed);
1892 1895                  ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr));
1893 1896  
1894 1897                  /*
1895 1898                   * If the buf is sharing its data with the hdr, unlink it and
1896 1899                   * allocate a new data buffer for the buf.
1897 1900                   */
1898 1901                  if (arc_buf_is_shared(buf)) {
1899 1902                          ASSERT(ARC_BUF_COMPRESSED(buf));
1900 1903  
1901 1904                          /* We need to give the buf its own b_data */
1902 1905                          buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
1903 1906                          buf->b_data =
1904 1907                              arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
1905 1908                          arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
1906 1909  
1907 1910                          /* Previously overhead was 0; just add new overhead */
1908 1911                          ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
1909 1912                  } else if (ARC_BUF_COMPRESSED(buf)) {
1910 1913                          /* We need to reallocate the buf's b_data */
1911 1914                          arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
1912 1915                              buf);
1913 1916                          buf->b_data =
1914 1917                              arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
1915 1918  
1916 1919                          /* We increased the size of b_data; update overhead */
1917 1920                          ARCSTAT_INCR(arcstat_overhead_size,
1918 1921                              HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
1919 1922                  }
1920 1923  
1921 1924                  /*
1922 1925                   * Regardless of the buf's previous compression settings, it
1923 1926                   * should not be compressed at the end of this function.
1924 1927                   */
1925 1928                  buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
1926 1929  
1927 1930                  /*
1928 1931                   * Try copying the data from another buf which already has a
1929 1932                   * decompressed version. If that's not possible, it's time to
1930 1933                   * bite the bullet and decompress the data from the hdr.
1931 1934                   */
1932 1935                  if (arc_buf_try_copy_decompressed_data(buf)) {
1933 1936                          /* Skip byteswapping and checksumming (already done) */
1934 1937                          ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
1935 1938                          return (0);
1936 1939                  } else {
1937 1940                          int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
1938 1941                              hdr->b_l1hdr.b_pabd, buf->b_data,
1939 1942                              HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1940 1943  
1941 1944                          /*
1942 1945                           * Absent hardware errors or software bugs, this should
1943 1946                           * be impossible, but log it anyway so we can debug it.
1944 1947                           */
1945 1948                          if (error != 0) {
1946 1949                                  zfs_dbgmsg(
1947 1950                                      "hdr %p, compress %d, psize %d, lsize %d",
1948 1951                                      hdr, HDR_GET_COMPRESS(hdr),
1949 1952                                      HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1950 1953                                  return (SET_ERROR(EIO));
1951 1954                          }
1952 1955                  }
1953 1956          }
1954 1957  
1955 1958          /* Byteswap the buf's data if necessary */
1956 1959          if (bswap != DMU_BSWAP_NUMFUNCS) {
1957 1960                  ASSERT(!HDR_SHARED_DATA(hdr));
1958 1961                  ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
1959 1962                  dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
1960 1963          }
1961 1964  
1962 1965          /* Compute the hdr's checksum if necessary */
1963 1966          arc_cksum_compute(buf);
1964 1967  
1965 1968          return (0);
1966 1969  }
1967 1970  
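The overhead bookkeeping in the decompression branch of arc_buf_fill() can be condensed into a small model. This is only a sketch under stated assumptions: `sketch_overhead_delta` and its parameters are hypothetical names, not part of arc.c, and it models only the amount passed to ARCSTAT_INCR(arcstat_overhead_size, ...) when an uncompressed copy is requested from a compressed hdr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: how much arcstat_overhead_size grows when a buf is
 * converted to hold uncompressed data from a compressed hdr.
 */
static uint64_t
sketch_overhead_delta(bool buf_shared, bool buf_compressed,
    uint64_t psize, uint64_t lsize)
{
	/* The source asserts lsize != psize; a compressed block is smaller. */
	assert(lsize > psize);

	if (buf_shared) {
		/* Shared bufs carried no overhead; the new copy is all new. */
		return (lsize);
	}
	if (buf_compressed) {
		/* A psize-sized buffer is swapped for an lsize-sized one. */
		return (lsize - psize);
	}
	/* The buf already owns an uncompressed-sized buffer. */
	return (0);
}
```

For a 4 KiB physical block that expands to 16 KiB, a shared buf would add 16 KiB of overhead, while an unshared compressed buf would add only the 12 KiB difference.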
1968 1971  int
1969 1972  arc_decompress(arc_buf_t *buf)
1970 1973  {
1971 1974          return (arc_buf_fill(buf, B_FALSE));
1972 1975  }
1973 1976  
1974 1977  /*
1975 1978   * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.
1976 1979   */
1977 1980  static uint64_t
1978 1981  arc_hdr_size(arc_buf_hdr_t *hdr)
1979 1982  {
1980 1983          uint64_t size;
1981 1984  
1982 1985          if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
1983 1986              HDR_GET_PSIZE(hdr) > 0) {
1984 1987                  size = HDR_GET_PSIZE(hdr);
1985 1988          } else {
1986 1989                  ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
1987 1990                  size = HDR_GET_LSIZE(hdr);
1988 1991          }
1989 1992          return (size);
1990 1993  }
1991 1994  
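The size selection in arc_hdr_size() can be illustrated in isolation. The following is a minimal sketch, assuming a simplified header: `sketch_size_hdr_t` and `sketch_hdr_size` are hypothetical stand-ins for arc_buf_hdr_t and the HDR_GET_* macros, which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields arc_hdr_size() consults. */
typedef struct sketch_size_hdr {
	bool compressed;	/* true when compression is not "off" */
	uint64_t psize;		/* physical (possibly compressed) size */
	uint64_t lsize;		/* logical (uncompressed) size */
} sketch_size_hdr_t;

static uint64_t
sketch_hdr_size(const sketch_size_hdr_t *hdr)
{
	/* A compressed hdr with a known physical size stores psize bytes. */
	if (hdr->compressed && hdr->psize > 0)
		return (hdr->psize);

	/* Otherwise the stored block is the full logical size. */
	assert(hdr->lsize != 0);
	return (hdr->lsize);
}
```

In this model a compressed header with psize 4096 and lsize 16384 reports 4096, while the same header with compression off reports 16384.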
1992 1995  /*
1993 1996   * Increment the amount of evictable space in the arc_state_t's refcount.
1994 1997   * We account for the space used by the hdr and the arc buf individually
1995 1998   * so that we can add and remove them from the refcount individually.
1996 1999   */
1997 2000  static void
1998 2001  arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
1999 2002  {
2000 2003          arc_buf_contents_t type = arc_buf_type(hdr);
2001 2004  
2002 2005          ASSERT(HDR_HAS_L1HDR(hdr));
2003 2006  
2004 2007          if (GHOST_STATE(state)) {
2005 2008                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
2006 2009                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2007 2010                  ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2008 2011                  (void) refcount_add_many(&state->arcs_esize[type],
2009 2012                      HDR_GET_LSIZE(hdr), hdr);
2010 2013                  return;
2011 2014          }
2012 2015  
2013 2016          ASSERT(!GHOST_STATE(state));
2014 2017          if (hdr->b_l1hdr.b_pabd != NULL) {
2015 2018                  (void) refcount_add_many(&state->arcs_esize[type],
2016 2019                      arc_hdr_size(hdr), hdr);
2017 2020          }
2018 2021          for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2019 2022              buf = buf->b_next) {
2020 2023                  if (arc_buf_is_shared(buf))
2021 2024                          continue;
2022 2025                  (void) refcount_add_many(&state->arcs_esize[type],
2023 2026                      arc_buf_size(buf), buf);
2024 2027          }
2025 2028  }
2026 2029  
2027 2030  /*
2028 2031   * Decrement the amount of evictable space in the arc_state_t's refcount.
2029 2032   * We account for the space used by the hdr and the arc buf individually
2030 2033   * so that we can add and remove them from the refcount individually.
2031 2034   */
2032 2035  static void
2033 2036  arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
2034 2037  {
2035 2038          arc_buf_contents_t type = arc_buf_type(hdr);
2036 2039  
2037 2040          ASSERT(HDR_HAS_L1HDR(hdr));
2038 2041  
2039 2042          if (GHOST_STATE(state)) {
2040 2043                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
2041 2044                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2042 2045                  ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2043 2046                  (void) refcount_remove_many(&state->arcs_esize[type],
2044 2047                      HDR_GET_LSIZE(hdr), hdr);
2045 2048                  return;
2046 2049          }
2047 2050  
2048 2051          ASSERT(!GHOST_STATE(state));
2049 2052          if (hdr->b_l1hdr.b_pabd != NULL) {
2050 2053                  (void) refcount_remove_many(&state->arcs_esize[type],
2051 2054                      arc_hdr_size(hdr), hdr);
2052 2055          }
2053 2056          for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2054 2057              buf = buf->b_next) {
2055 2058                  if (arc_buf_is_shared(buf))
2056 2059                          continue;
2057 2060                  (void) refcount_remove_many(&state->arcs_esize[type],
2058 2061                      arc_buf_size(buf), buf);
2059 2062          }
2060 2063  }
2061 2064  
2062 2065  /*
2063 2066   * Add a reference to this hdr indicating that someone is actively
2064 2067   * referencing that memory. When the refcount transitions from 0 to 1,
2065 2068   * we remove it from the respective arc_state_t list to indicate that
2066 2069   * it is not evictable.
2067 2070   */
2068 2071  static void
2069 2072  add_reference(arc_buf_hdr_t *hdr, void *tag)
2070 2073  {
2071 2074          ASSERT(HDR_HAS_L1HDR(hdr));
2072 2075          if (!MUTEX_HELD(HDR_LOCK(hdr))) {
2073 2076                  ASSERT(hdr->b_l1hdr.b_state == arc_anon);
2074 2077                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2075 2078                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2076 2079          }
2077 2080  
2078 2081          arc_state_t *state = hdr->b_l1hdr.b_state;
2079 2082  
2080 2083          if ((refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
2081 2084              (state != arc_anon)) {
2082 2085                  /* We don't use the L2-only state list. */
2083 2086                  if (state != arc_l2c_only) {
2084 2087                          multilist_remove(state->arcs_list[arc_buf_type(hdr)],
2085 2088                              hdr);
2086 2089                          arc_evictable_space_decrement(hdr, state);
2087 2090                  }
2088 2091                  /* remove the prefetch flag if we get a reference */
2089 2092                  arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
2090 2093          }
2091 2094  }
2092 2095  
2093 2096  /*
2094 2097   * Remove a reference from this hdr. When the reference count transitions from
2095 2098   * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
2096 2099   * list making it eligible for eviction.
2097 2100   */
2098 2101  static int
2099 2102  remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag)
2100 2103  {
2101 2104          int cnt;
2102 2105          arc_state_t *state = hdr->b_l1hdr.b_state;
2103 2106  
2104 2107          ASSERT(HDR_HAS_L1HDR(hdr));
2105 2108          ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
2106 2109          ASSERT(!GHOST_STATE(state));
2107 2110  
2108 2111          /*
2109 2112           * arc_l2c_only counts as a ghost state so we don't need to explicitly
2110 2113           * check to prevent usage of the arc_l2c_only list.
2111 2114           */
2112 2115          if (((cnt = refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) &&
2113 2116              (state != arc_anon)) {
2114 2117                  multilist_insert(state->arcs_list[arc_buf_type(hdr)], hdr);
2115 2118                  ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
2116 2119                  arc_evictable_space_increment(hdr, state);
2117 2120          }
2118 2121          return (cnt);
2119 2122  }
2120 2123  
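The 0-to-1 and 1-to-0 transitions handled by add_reference() and remove_reference() can be modelled with a plain counter. This is a simplified sketch, not the kernel implementation: `sketch_ref_hdr_t` and both functions are hypothetical, and the model leaves out tags, locking, the arc_l2c_only special case, the prefetch flag, and the evictable-space accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical header: a reference count plus eviction-list membership. */
typedef struct sketch_ref_hdr {
	int refcnt;
	bool anon;		/* anonymous hdrs never sit on a state list */
	bool on_evict_list;	/* true while the hdr is evictable */
} sketch_ref_hdr_t;

static void
sketch_add_reference(sketch_ref_hdr_t *hdr)
{
	/* The first reference makes a non-anonymous hdr non-evictable. */
	if (++hdr->refcnt == 1 && !hdr->anon)
		hdr->on_evict_list = false;
}

static int
sketch_remove_reference(sketch_ref_hdr_t *hdr)
{
	assert(hdr->refcnt > 0);

	/* Dropping the last reference makes the hdr evictable again. */
	if (--hdr->refcnt == 0 && !hdr->anon)
		hdr->on_evict_list = true;
	return (hdr->refcnt);
}
```

Only the boundary transitions touch the list in this model; going from one reference to two, or from two back to one, leaves list membership unchanged.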
2121 2124  /*
2122 2125   * Move the supplied buffer to the indicated state. The hash lock
2123 2126   * for the buffer must be held by the caller.
2124 2127   */
2125 2128  static void
2126 2129  arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr,
2127 2130      kmutex_t *hash_lock)
2128 2131  {
2129 2132          arc_state_t *old_state;
2130 2133          int64_t refcnt;
2131 2134          uint32_t bufcnt;
2132 2135          boolean_t update_old, update_new;
2133 2136          arc_buf_contents_t buftype = arc_buf_type(hdr);
2134 2137  
2135 2138          /*
2136 2139           * We almost always have an L1 hdr here, since we call arc_hdr_realloc()
2137 2140           * in arc_read() when bringing a buffer out of the L2ARC.  However, the
2138 2141           * L1 hdr doesn't always exist when we change state to arc_anon before
2139 2142           * destroying a header, in which case reallocating to add the L1 hdr is
2140 2143           * pointless.
2141 2144           */
2142 2145          if (HDR_HAS_L1HDR(hdr)) {
2143 2146                  old_state = hdr->b_l1hdr.b_state;
2144 2147                  refcnt = refcount_count(&hdr->b_l1hdr.b_refcnt);
2145 2148                  bufcnt = hdr->b_l1hdr.b_bufcnt;
2146 2149                  update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL);
2147 2150          } else {
2148 2151                  old_state = arc_l2c_only;
2149 2152                  refcnt = 0;
2150 2153                  bufcnt = 0;
2151 2154                  update_old = B_FALSE;
2152 2155          }
2153 2156          update_new = update_old;
2154 2157  
2155 2158          ASSERT(MUTEX_HELD(hash_lock));
2156 2159          ASSERT3P(new_state, !=, old_state);
2157 2160          ASSERT(!GHOST_STATE(new_state) || bufcnt == 0);
2158 2161          ASSERT(old_state != arc_anon || bufcnt <= 1);
2159 2162  
2160 2163          /*
2161 2164           * If this buffer is evictable, transfer it from the
2162 2165           * old state list to the new state list.
2163 2166           */
2164 2167          if (refcnt == 0) {
2165 2168                  if (old_state != arc_anon && old_state != arc_l2c_only) {
2166 2169                          ASSERT(HDR_HAS_L1HDR(hdr));
2167 2170                          multilist_remove(old_state->arcs_list[buftype], hdr);
2168 2171  
2169 2172                          if (GHOST_STATE(old_state)) {
2170 2173                                  ASSERT0(bufcnt);
2171 2174                                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2172 2175                                  update_old = B_TRUE;
2173 2176                          }
2174 2177                          arc_evictable_space_decrement(hdr, old_state);
2175 2178                  }
2176 2179                  if (new_state != arc_anon && new_state != arc_l2c_only) {
2177 2180  
2178 2181                          /*
2179 2182                           * An L1 header always exists here, since if we're
2180 2183                           * moving to some L1-cached state (i.e. not l2c_only or
2181 2184                           * anonymous), we realloc the header to add an L1hdr
2182 2185                           * beforehand.
2183 2186                           */
2184 2187                          ASSERT(HDR_HAS_L1HDR(hdr));
2185 2188                          multilist_insert(new_state->arcs_list[buftype], hdr);
2186 2189  
2187 2190                          if (GHOST_STATE(new_state)) {
2188 2191                                  ASSERT0(bufcnt);
2189 2192                                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2190 2193                                  update_new = B_TRUE;
2191 2194                          }
2192 2195                          arc_evictable_space_increment(hdr, new_state);
2193 2196                  }
2194 2197          }
2195 2198  
2196 2199          ASSERT(!HDR_EMPTY(hdr));
2197 2200          if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
2198 2201                  buf_hash_remove(hdr);
2199 2202  
2200 2203          /* adjust state sizes (ignore arc_l2c_only) */
2201 2204  
2202 2205          if (update_new && new_state != arc_l2c_only) {
2203 2206                  ASSERT(HDR_HAS_L1HDR(hdr));
2204 2207                  if (GHOST_STATE(new_state)) {
2205 2208                          ASSERT0(bufcnt);
2206 2209  
2207 2210                          /*
2208 2211                           * When moving a header to a ghost state, we first
2209 2212                           * remove all arc buffers. Thus, we'll have a
2210 2213                           * bufcnt of zero, and no arc buffer to use for
2211 2214                           * the reference. As a result, we use the arc
2212 2215                           * header pointer for the reference.
2213 2216                           */
2214 2217                          (void) refcount_add_many(&new_state->arcs_size,
2215 2218                              HDR_GET_LSIZE(hdr), hdr);
2216 2219                          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2217 2220                  } else {
2218 2221                          uint32_t buffers = 0;
2219 2222  
2220 2223                          /*
2221 2224                           * Each individual buffer holds a unique reference,
2222 2225                           * thus we must add each of these references one
2223 2226                           * at a time.
2224 2227                           */
2225 2228                          for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2226 2229                              buf = buf->b_next) {
2227 2230                                  ASSERT3U(bufcnt, !=, 0);
2228 2231                                  buffers++;
2229 2232  
2230 2233                                  /*
2231 2234                                   * When the arc_buf_t is sharing the data
2232 2235                                   * block with the hdr, the owner of the
2233 2236                                   * reference belongs to the hdr. Only
2234 2237                                   * add to the refcount if the arc_buf_t is
2235 2238                                   * not shared.
2236 2239                                   */
2237 2240                                  if (arc_buf_is_shared(buf))
2238 2241                                          continue;
2239 2242  
2240 2243                                  (void) refcount_add_many(&new_state->arcs_size,
2241 2244                                      arc_buf_size(buf), buf);
2242 2245                          }
2243 2246                          ASSERT3U(bufcnt, ==, buffers);
2244 2247  
2245 2248                          if (hdr->b_l1hdr.b_pabd != NULL) {
2246 2249                                  (void) refcount_add_many(&new_state->arcs_size,
2247 2250                                      arc_hdr_size(hdr), hdr);
2248 2251                          } else {
2249 2252                                  ASSERT(GHOST_STATE(old_state));
2250 2253                          }
2251 2254                  }
2252 2255          }
2253 2256  
2254 2257          if (update_old && old_state != arc_l2c_only) {
2255 2258                  ASSERT(HDR_HAS_L1HDR(hdr));
2256 2259                  if (GHOST_STATE(old_state)) {
2257 2260                          ASSERT0(bufcnt);
2258 2261                          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2259 2262  
2260 2263                          /*
2261 2264                           * When moving a header off of a ghost state,
2262 2265                           * the header will not contain any arc buffers.
2263 2266                           * We use the arc header pointer for the reference
2264 2267                           * which is exactly what we did when we put the
2265 2268                           * header on the ghost state.
2266 2269                           */
2267 2270  
2268 2271                          (void) refcount_remove_many(&old_state->arcs_size,
2269 2272                              HDR_GET_LSIZE(hdr), hdr);
2270 2273                  } else {
2271 2274                          uint32_t buffers = 0;
2272 2275  
2273 2276                          /*
2274 2277                           * Each individual buffer holds a unique reference,
2275 2278                           * thus we must remove each of these references one
2276 2279                           * at a time.
2277 2280                           */
2278 2281                          for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2279 2282                              buf = buf->b_next) {
2280 2283                                  ASSERT3U(bufcnt, !=, 0);
2281 2284                                  buffers++;
2282 2285  
2283 2286                                  /*
2284 2287                                   * When the arc_buf_t is sharing the data
2285 2288                                   * block with the hdr, the owner of the
2286 2289                                   * reference belongs to the hdr. Only
2287 2290                                   * remove from the refcount if the arc_buf_t is
2288 2291                                   * not shared.
2289 2292                                   */
2290 2293                                  if (arc_buf_is_shared(buf))
2291 2294                                          continue;
2292 2295  
2293 2296                                  (void) refcount_remove_many(
2294 2297                                      &old_state->arcs_size, arc_buf_size(buf),
2295 2298                                      buf);
2296 2299                          }
2297 2300                          ASSERT3U(bufcnt, ==, buffers);
2298 2301                          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2299 2302                          (void) refcount_remove_many(
2300 2303                              &old_state->arcs_size, arc_hdr_size(hdr), hdr);
2301 2304                  }
2302 2305          }
2303 2306  
2304 2307          if (HDR_HAS_L1HDR(hdr))
2305 2308                  hdr->b_l1hdr.b_state = new_state;
2306 2309  
2307 2310          /*
2308 2311           * L2 headers should never be on the L2 state list since they don't
2309 2312           * have L1 headers allocated.
2310 2313           */
2311 2314          ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
2312 2315              multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
2313 2316  }
2314 2317  
2315 2318  void
2316 2319  arc_space_consume(uint64_t space, arc_space_type_t type)
2317 2320  {
2318 2321          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2319 2322  
2320 2323          switch (type) {
2321 2324          case ARC_SPACE_DATA:
2322 2325                  ARCSTAT_INCR(arcstat_data_size, space);
2323 2326                  break;
2324 2327          case ARC_SPACE_META:
2325 2328                  ARCSTAT_INCR(arcstat_metadata_size, space);
2326 2329                  break;
2327 2330          case ARC_SPACE_OTHER:
2328 2331                  ARCSTAT_INCR(arcstat_other_size, space);
2329 2332                  break;
2330 2333          case ARC_SPACE_HDRS:
2331 2334                  ARCSTAT_INCR(arcstat_hdr_size, space);
2332 2335                  break;
2333 2336          case ARC_SPACE_L2HDRS:
2334 2337                  ARCSTAT_INCR(arcstat_l2_hdr_size, space);
2335 2338                  break;
2336 2339          }
2337 2340  
2338 2341          if (type != ARC_SPACE_DATA)
2339 2342                  ARCSTAT_INCR(arcstat_meta_used, space);
2340 2343  
2341 2344          atomic_add_64(&arc_size, space);
2342 2345  }
2343 2346  
2344 2347  void
2345 2348  arc_space_return(uint64_t space, arc_space_type_t type)
2346 2349  {
2347 2350          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2348 2351  
2349 2352          switch (type) {
2350 2353          case ARC_SPACE_DATA:
2351 2354                  ARCSTAT_INCR(arcstat_data_size, -space);
2352 2355                  break;
2353 2356          case ARC_SPACE_META:
2354 2357                  ARCSTAT_INCR(arcstat_metadata_size, -space);
2355 2358                  break;
2356 2359          case ARC_SPACE_OTHER:
2357 2360                  ARCSTAT_INCR(arcstat_other_size, -space);
2358 2361                  break;
2359 2362          case ARC_SPACE_HDRS:
2360 2363                  ARCSTAT_INCR(arcstat_hdr_size, -space);
2361 2364                  break;
2362 2365          case ARC_SPACE_L2HDRS:
2363 2366                  ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
2364 2367                  break;
2365 2368          }
2366 2369  
2367 2370          if (type != ARC_SPACE_DATA) {
2368 2371                  ASSERT(arc_meta_used >= space);
2369 2372                  if (arc_meta_max < arc_meta_used)
2370 2373                          arc_meta_max = arc_meta_used;
2371 2374                  ARCSTAT_INCR(arcstat_meta_used, -space);
2372 2375          }
2373 2376  
2374 2377          ASSERT(arc_size >= space);
2375 2378          atomic_add_64(&arc_size, -space);
2376 2379  }
2377 2380  
2378 2381  /*
2379 2382   * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2380 2383   * with the hdr's b_pabd.
2381 2384   */
2382 2385  static boolean_t
2383 2386  arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2384 2387  {
2385 2388          /*
2386 2389           * The criteria for sharing a hdr's data are:
2387 2390           * 1. the hdr's compression matches the buf's compression
2388 2391           * 2. the hdr doesn't need to be byteswapped
2389 2392           * 3. the hdr isn't already being shared
2390 2393           * 4. the buf is either compressed or it is the last buf in the hdr list
2391 2394           *
2392 2395           * Criterion #4 maintains the invariant that shared uncompressed
2393 2396           * bufs must be the final buf in the hdr's b_buf list. Reading this, you
2394 2397           * might ask, "if a compressed buf is allocated first, won't that be the
2395 2398           * last thing in the list?", but in that case it's impossible to create
2396 2399           * a shared uncompressed buf anyway (because the hdr must be compressed
2397 2400           * to have the compressed buf). You might also think that #3 is
2398 2401           * sufficient to make this guarantee, however it's possible
2399 2402           * (specifically in the rare L2ARC write race mentioned in
2400 2403           * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
2401 2404           * is sharable, but wasn't at the time of its allocation. Rather than
2402 2405           * allow a new shared uncompressed buf to be created and then shuffle
2403 2406           * the list around to make it the last element, this simply disallows
2404 2407           * sharing if the new buf isn't the first to be added.
2405 2408           */
2406 2409          ASSERT3P(buf->b_hdr, ==, hdr);
2407 2410          boolean_t hdr_compressed = HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF;
2408 2411          boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
2409 2412          return (buf_compressed == hdr_compressed &&
2410 2413              hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
2411 2414              !HDR_SHARED_DATA(hdr) &&
2412 2415              (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
2413 2416  }
2414 2417  
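The four sharing criteria listed in arc_can_share() reduce to a single boolean expression. The sketch below assumes a flattened view of the inputs: `sketch_share_t` and `sketch_can_share` are hypothetical names, and each field stands in for the corresponding macro test in the source (HDR_GET_COMPRESS, ARC_BUF_COMPRESSED, b_byteswap, HDR_SHARED_DATA, ARC_BUF_LAST).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened inputs to the sharing decision. */
typedef struct sketch_share {
	bool hdr_compressed;	/* hdr compression is not "off" */
	bool buf_compressed;	/* buf is flagged as compressed */
	bool needs_byteswap;	/* hdr data must be byteswapped */
	bool hdr_shared;	/* hdr already shares its data */
	bool buf_is_last;	/* buf is the final entry in the b_buf list */
} sketch_share_t;

static bool
sketch_can_share(const sketch_share_t *s)
{
	return (s->buf_compressed == s->hdr_compressed &&	/* #1 */
	    !s->needs_byteswap &&				/* #2 */
	    !s->hdr_shared &&					/* #3 */
	    (s->buf_is_last || s->buf_compressed));		/* #4 */
}
```

Under this model an uncompressed buf that is not last in the list is refused even when the first three criteria hold, which is the case criterion #4 exists to exclude.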
2415 2418  /*
2416 2419   * Allocate a buf for this hdr. If you care about the data that's in the hdr,
2417 2420   * or if you want a compressed buffer, pass those flags in. Returns 0 if the
2418 2421   * copy was made successfully, or an error code otherwise.
2419 2422   */
2420 2423  static int
2421 2424  arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2422 2425      boolean_t fill, arc_buf_t **ret)
2423 2426  {
2424 2427          arc_buf_t *buf;
2425 2428  
2426 2429          ASSERT(HDR_HAS_L1HDR(hdr));
2427 2430          ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2428 2431          VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2429 2432              hdr->b_type == ARC_BUFC_METADATA);
2430 2433          ASSERT3P(ret, !=, NULL);
2431 2434          ASSERT3P(*ret, ==, NULL);
2432 2435  
2433 2436          buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2434 2437          buf->b_hdr = hdr;
2435 2438          buf->b_data = NULL;
2436 2439          buf->b_next = hdr->b_l1hdr.b_buf;
2437 2440          buf->b_flags = 0;
2438 2441  
2439 2442          add_reference(hdr, tag);
2440 2443  
2441 2444          /*
2442 2445           * We're about to change the hdr's b_flags. We must either
2443 2446           * hold the hash_lock or be undiscoverable.
2444 2447           */
2445 2448          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2446 2449  
2447 2450          /*
2448 2451           * Only honor requests for compressed bufs if the hdr is actually
2449 2452           * compressed.
2450 2453           */
2451 2454          if (compressed && HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF)
2452 2455                  buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
2453 2456  
2454 2457          /*
2455 2458           * If the hdr's data can be shared then we share the data buffer and
2456 2459           * set the appropriate bit in the hdr's b_flags to indicate the hdr is
2457 2460           * sharing its b_pabd with the arc_buf_t. Otherwise, we allocate a new
2458 2461           * buffer to store the buf's data.
2459 2462           *
2460 2463           * There are two additional restrictions here because we're sharing
2461 2464           * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
2462 2465           * actively involved in an L2ARC write, because if this buf is used by
2463 2466           * an arc_write() then the hdr's data buffer will be released when the
2464 2467           * write completes, even though the L2ARC write might still be using it.
2465 2468           * Second, the hdr's ABD must be linear so that the buf's user doesn't
2466 2469           * need to be ABD-aware.
2467 2470           */
2468 2471          boolean_t can_share = arc_can_share(hdr, buf) && !HDR_L2_WRITING(hdr) &&
2469 2472              abd_is_linear(hdr->b_l1hdr.b_pabd);
2470 2473  
2471 2474          /* Set up b_data and sharing */
2472 2475          if (can_share) {
2473 2476                  buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
2474 2477                  buf->b_flags |= ARC_BUF_FLAG_SHARED;
2475 2478                  arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2476 2479          } else {
2477 2480                  buf->b_data =
2478 2481                      arc_get_data_buf(hdr, arc_buf_size(buf), buf);
2479 2482                  ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
2480 2483          }
2481 2484          VERIFY3P(buf->b_data, !=, NULL);
2482 2485  
2483 2486          hdr->b_l1hdr.b_buf = buf;
2484 2487          hdr->b_l1hdr.b_bufcnt += 1;
2485 2488  
2486 2489          /*
2487 2490           * If the user wants the data from the hdr, we need to either copy or
2488 2491           * decompress the data.
2489 2492           */
2490 2493          if (fill) {
2491 2494                  return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
2492 2495          }
2493 2496  
2494 2497          return (0);
2495 2498  }
2496 2499  
2497 2500  static char *arc_onloan_tag = "onloan";
2498 2501  
2499 2502  static inline void
2500 2503  arc_loaned_bytes_update(int64_t delta)
2501 2504  {
2502 2505          atomic_add_64(&arc_loaned_bytes, delta);
2503 2506  
2504 2507          /* assert that it did not wrap around */
2505 2508          ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2506 2509  }
2507 2510  
2508 2511  /*
2509 2512   * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
2510 2513   * flight data by arc_tempreserve_space() until they are "returned". Loaned
2511 2514   * buffers must be returned to the arc before they can be used by the DMU or
2512 2515   * freed.
2513 2516   */
2514 2517  arc_buf_t *
2515 2518  arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
2516 2519  {
2517 2520          arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
2518 2521              is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
2519 2522  
2520 2523          arc_loaned_bytes_update(size);
2521 2524  
2522 2525          return (buf);
2523 2526  }
2524 2527  
2525 2528  arc_buf_t *
2526 2529  arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
2527 2530      enum zio_compress compression_type)
2528 2531  {
2529 2532          arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
2530 2533              psize, lsize, compression_type);
2531 2534  
2532 2535          arc_loaned_bytes_update(psize);
2533 2536  
2534 2537          return (buf);
2535 2538  }
2536 2539  
2537 2540  
2538 2541  /*
2539 2542   * Return a loaned arc buffer to the arc.
2540 2543   */
2541 2544  void
2542 2545  arc_return_buf(arc_buf_t *buf, void *tag)
2543 2546  {
2544 2547          arc_buf_hdr_t *hdr = buf->b_hdr;
2545 2548  
2546 2549          ASSERT3P(buf->b_data, !=, NULL);
2547 2550          ASSERT(HDR_HAS_L1HDR(hdr));
2548 2551          (void) refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
2549 2552          (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
2550 2553  
2551 2554          arc_loaned_bytes_update(-arc_buf_size(buf));
2552 2555  }
2553 2556  
2554 2557  /* Detach an arc_buf from a dbuf (tag) */
2555 2558  void
2556 2559  arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
2557 2560  {
2558 2561          arc_buf_hdr_t *hdr = buf->b_hdr;
2559 2562  
2560 2563          ASSERT3P(buf->b_data, !=, NULL);
2561 2564          ASSERT(HDR_HAS_L1HDR(hdr));
2562 2565          (void) refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
2563 2566          (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);
2564 2567  
2565 2568          arc_loaned_bytes_update(arc_buf_size(buf));
2566 2569  }
2567 2570  
2568 2571  static void
2569 2572  l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)
2570 2573  {
2571 2574          l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);
2572 2575  
2573 2576          df->l2df_abd = abd;
2574 2577          df->l2df_size = size;
2575 2578          df->l2df_type = type;
2576 2579          mutex_enter(&l2arc_free_on_write_mtx);
2577 2580          list_insert_head(l2arc_free_on_write, df);
2578 2581          mutex_exit(&l2arc_free_on_write_mtx);
2579 2582  }
2580 2583  
2581 2584  static void
2582 2585  arc_hdr_free_on_write(arc_buf_hdr_t *hdr)
2583 2586  {
2584 2587          arc_state_t *state = hdr->b_l1hdr.b_state;
2585 2588          arc_buf_contents_t type = arc_buf_type(hdr);
2586 2589          uint64_t size = arc_hdr_size(hdr);
2587 2590  
2588 2591          /* protected by hash lock, if in the hash table */
2589 2592          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
2590 2593                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2591 2594                  ASSERT(state != arc_anon && state != arc_l2c_only);
2592 2595  
2593 2596                  (void) refcount_remove_many(&state->arcs_esize[type],
2594 2597                      size, hdr);
2595 2598          }
2596 2599          (void) refcount_remove_many(&state->arcs_size, size, hdr);
2597 2600          if (type == ARC_BUFC_METADATA) {
2598 2601                  arc_space_return(size, ARC_SPACE_META);
2599 2602          } else {
2600 2603                  ASSERT(type == ARC_BUFC_DATA);
2601 2604                  arc_space_return(size, ARC_SPACE_DATA);
2602 2605          }
2603 2606  
2604 2607          l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
2605 2608  }
2606 2609  
2607 2610  /*
2608 2611   * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
2609 2612   * data buffer, we transfer the refcount ownership to the hdr and update
2610 2613   * the appropriate kstats.
2611 2614   */
2612 2615  static void
2613 2616  arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2614 2617  {
2615 2618          arc_state_t *state = hdr->b_l1hdr.b_state;
2616 2619  
2617 2620          ASSERT(arc_can_share(hdr, buf));
2618 2621          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2619 2622          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2620 2623  
2621 2624          /*
2622 2625           * Start sharing the data buffer. We transfer the
2623 2626           * refcount ownership to the hdr since it always owns
2624 2627           * the refcount whenever an arc_buf_t is shared.
2625 2628           */
2626 2629          refcount_transfer_ownership(&state->arcs_size, buf, hdr);
2627 2630          hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
2628 2631          abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
2629 2632              HDR_ISTYPE_METADATA(hdr));
2630 2633          arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2631 2634          buf->b_flags |= ARC_BUF_FLAG_SHARED;
2632 2635  
2633 2636          /*
2634 2637           * Since we've transferred ownership to the hdr we need
2635 2638           * to increment its compressed and uncompressed kstats and
2636 2639           * decrement the overhead size.
2637 2640           */
2638 2641          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2639 2642          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
2640 2643          ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
2641 2644  }
2642 2645  
2643 2646  static void
2644 2647  arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2645 2648  {
2646 2649          arc_state_t *state = hdr->b_l1hdr.b_state;
2647 2650  
2648 2651          ASSERT(arc_buf_is_shared(buf));
2649 2652          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2650 2653          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2651 2654  
2652 2655          /*
2653 2656           * We are no longer sharing this buffer so we need
2654 2657           * to transfer its ownership to the rightful owner.
2655 2658           */
2656 2659          refcount_transfer_ownership(&state->arcs_size, hdr, buf);
2657 2660          arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
2658 2661          abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
2659 2662          abd_put(hdr->b_l1hdr.b_pabd);
2660 2663          hdr->b_l1hdr.b_pabd = NULL;
2661 2664          buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
2662 2665  
2663 2666          /*
2664 2667           * Since the buffer is no longer shared between
2665 2668           * the arc buf and the hdr, count it as overhead.
2666 2669           */
2667 2670          ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2668 2671          ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2669 2672          ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
2670 2673  }
2671 2674  
2672 2675  /*
2673 2676   * Remove an arc_buf_t from the hdr's buf list and return the last
2674 2677   * arc_buf_t on the list. If no buffers remain on the list then return
2675 2678   * NULL.
2676 2679   */
2677 2680  static arc_buf_t *
2678 2681  arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2679 2682  {
2680 2683          ASSERT(HDR_HAS_L1HDR(hdr));
2681 2684          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2682 2685  
2683 2686          arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
2684 2687          arc_buf_t *lastbuf = NULL;
2685 2688  
2686 2689          /*
2687 2690           * Remove the buf from the hdr list and locate the last
2688 2691           * remaining buffer on the list.
2689 2692           */
2690 2693          while (*bufp != NULL) {
2691 2694                  if (*bufp == buf)
2692 2695                          *bufp = buf->b_next;
2693 2696  
2694 2697                  /*
2695 2698                   * If we've removed a buffer in the middle of
2696 2699                   * the list then update the lastbuf and update
2697 2700                   * bufp.
2698 2701                   */
2699 2702                  if (*bufp != NULL) {
2700 2703                          lastbuf = *bufp;
2701 2704                          bufp = &(*bufp)->b_next;
2702 2705                  }
2703 2706          }
2704 2707          buf->b_next = NULL;
2705 2708          ASSERT3P(lastbuf, !=, buf);
2706 2709          IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
2707 2710          IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
2708 2711          IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));
2709 2712  
2710 2713          return (lastbuf);
2711 2714  }
2712 2715  
2713 2716  /*
2714 2717   * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's
2715 2718   * list and free it.
2716 2719   */
2717 2720  static void
2718 2721  arc_buf_destroy_impl(arc_buf_t *buf)
2719 2722  {
2720 2723          arc_buf_hdr_t *hdr = buf->b_hdr;
2721 2724  
2722 2725          /*
2723 2726           * Free up the data associated with the buf but only if we're not
2724 2727           * sharing this with the hdr. If we are sharing it with the hdr, the
2725 2728           * hdr is responsible for doing the free.
2726 2729           */
2727 2730          if (buf->b_data != NULL) {
2728 2731                  /*
2729 2732                   * We're about to change the hdr's b_flags. We must either
2730 2733                   * hold the hash_lock or be undiscoverable.
2731 2734                   */
2732 2735                  ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2733 2736  
2734 2737                  arc_cksum_verify(buf);
2735 2738                  arc_buf_unwatch(buf);
2736 2739  
2737 2740                  if (arc_buf_is_shared(buf)) {
2738 2741                          arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
2739 2742                  } else {
2740 2743                          uint64_t size = arc_buf_size(buf);
2741 2744                          arc_free_data_buf(hdr, buf->b_data, size, buf);
2742 2745                          ARCSTAT_INCR(arcstat_overhead_size, -size);
2743 2746                  }
2744 2747                  buf->b_data = NULL;
2745 2748  
2746 2749                  ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
2747 2750                  hdr->b_l1hdr.b_bufcnt -= 1;
2748 2751          }
2749 2752  
2750 2753          arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
2751 2754  
2752 2755          if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
2753 2756                  /*
2754 2757                   * If the current arc_buf_t is sharing its data buffer with the
2755 2758                   * hdr, then reassign the hdr's b_pabd to share it with the new
2756 2759                   * buffer at the end of the list. The shared buffer is always
2757 2760                   * the last one on the hdr's buffer list.
2758 2761                   *
2759 2762                   * There is an equivalent case for compressed bufs, but since
2760 2763                   * they aren't guaranteed to be the last buf in the list and
2761 2764                   * that is an exceedingly rare case, we just allow that space to
2762 2765                   * be wasted temporarily.
2763 2766                   */
2764 2767                  if (lastbuf != NULL) {
2765 2768                          /* Only one buf can be shared at once */
2766 2769                          VERIFY(!arc_buf_is_shared(lastbuf));
2767 2770                          /* hdr is uncompressed so can't have compressed buf */
2768 2771                          VERIFY(!ARC_BUF_COMPRESSED(lastbuf));
2769 2772  
2770 2773                          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2771 2774                          arc_hdr_free_pabd(hdr);
2772 2775  
2773 2776                          /*
2774 2777                           * We must set up a new shared block between the
2775 2778                           * last buffer and the hdr. The data would have
2776 2779                           * been allocated by the arc buf so we need to transfer
2777 2780                           * ownership to the hdr since it's now being shared.
2778 2781                           */
2779 2782                          arc_share_buf(hdr, lastbuf);
2780 2783                  }
2781 2784          } else if (HDR_SHARED_DATA(hdr)) {
2782 2785                  /*
2783 2786                   * Uncompressed shared buffers are always at the end
2784 2787                   * of the list. Compressed buffers don't have the
2785 2788                   * same requirements. This makes it hard to
2786 2789                   * simply assert that the lastbuf is shared so
2787 2790                   * we rely on the hdr's compression flags to determine
2788 2791                   * if we have a compressed, shared buffer.
2789 2792                   */
2790 2793                  ASSERT3P(lastbuf, !=, NULL);
2791 2794                  ASSERT(arc_buf_is_shared(lastbuf) ||
2792 2795                      HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
2793 2796          }
2794 2797  
2795 2798          /*
2796 2799           * Free the checksum if we're removing the last uncompressed buf from
2797 2800           * this hdr.
2798 2801           */
2799 2802          if (!arc_hdr_has_uncompressed_buf(hdr)) {
2800 2803                  arc_cksum_free(hdr);
2801 2804          }
2802 2805  
2803 2806          /* clean up the buf */
2804 2807          buf->b_hdr = NULL;
2805 2808          kmem_cache_free(buf_cache, buf);
2806 2809  }
2807 2810  
2808 2811  static void
2809 2812  arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr)
2810 2813  {
2811 2814          ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2812 2815          ASSERT(HDR_HAS_L1HDR(hdr));
2813 2816          ASSERT(!HDR_SHARED_DATA(hdr));
2814 2817  
2815 2818          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2816 2819          hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
2817 2820          hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2818 2821          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2819 2822  
2820 2823          ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2821 2824          ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
2822 2825  }
2823 2826  
2824 2827  static void
2825 2828  arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
2826 2829  {
2827 2830          ASSERT(HDR_HAS_L1HDR(hdr));
2828 2831          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2829 2832  
2830 2833          /*
2831 2834           * If the hdr is currently being written to the l2arc then
2832 2835           * we defer freeing the data by adding it to the l2arc_free_on_write
2833 2836           * list. The l2arc will free the data once it's finished
2834 2837           * writing it to the l2arc device.
2835 2838           */
2836 2839          if (HDR_L2_WRITING(hdr)) {
2837 2840                  arc_hdr_free_on_write(hdr);
2838 2841                  ARCSTAT_BUMP(arcstat_l2_free_on_write);
2839 2842          } else {
2840 2843                  arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
2841 2844                      arc_hdr_size(hdr), hdr);
2842 2845          }
2843 2846          hdr->b_l1hdr.b_pabd = NULL;
2844 2847          hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2845 2848  
2846 2849          ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2847 2850          ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2848 2851  }
2849 2852  
2850 2853  static arc_buf_hdr_t *
2851 2854  arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
2852 2855      enum zio_compress compression_type, arc_buf_contents_t type)
2853 2856  {
2854 2857          arc_buf_hdr_t *hdr;
2855 2858  
2856 2859          VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
2857 2860  
2858 2861          hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
2859 2862          ASSERT(HDR_EMPTY(hdr));
2860 2863          ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
2861 2864          ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
2862 2865          HDR_SET_PSIZE(hdr, psize);
2863 2866          HDR_SET_LSIZE(hdr, lsize);
2864 2867          hdr->b_spa = spa;
2865 2868          hdr->b_type = type;
2866 2869          hdr->b_flags = 0;
2867 2870          arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
2868 2871          arc_hdr_set_compress(hdr, compression_type);
2869 2872  
2870 2873          hdr->b_l1hdr.b_state = arc_anon;
2871 2874          hdr->b_l1hdr.b_arc_access = 0;
2872 2875          hdr->b_l1hdr.b_bufcnt = 0;
2873 2876          hdr->b_l1hdr.b_buf = NULL;
2874 2877  
2875 2878          /*
2876 2879           * Allocate the hdr's buffer. This will contain either
2877 2880           * the compressed or uncompressed data depending on the block
2878 2881           * it references and compressed arc enablement.
2879 2882           */
2880 2883          arc_hdr_alloc_pabd(hdr);
2881 2884          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2882 2885  
2883 2886          return (hdr);
2884 2887  }
2885 2888  
2886 2889  /*
2887 2890   * Transition between the two allocation states for the arc_buf_hdr struct.
2888 2891   * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
2889 2892   * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
2890 2893   * version is used when a cache buffer is only in the L2ARC in order to reduce
2891 2894   * memory usage.
2892 2895   */
2893 2896  static arc_buf_hdr_t *
2894 2897  arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
2895 2898  {
2896 2899          ASSERT(HDR_HAS_L2HDR(hdr));
2897 2900  
2898 2901          arc_buf_hdr_t *nhdr;
2899 2902          l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
2900 2903  
2901 2904          ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
2902 2905              (old == hdr_l2only_cache && new == hdr_full_cache));
2903 2906  
2904 2907          nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);
2905 2908  
2906 2909          ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
2907 2910          buf_hash_remove(hdr);
2908 2911  
2909 2912          bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);
2910 2913  
2911 2914          if (new == hdr_full_cache) {
2912 2915                  arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
2913 2916                  /*
2914 2917                   * arc_access and arc_change_state need to be aware that a
2915 2918                   * header has just come out of L2ARC, so we set its state to
2916 2919                   * l2c_only even though it's about to change.
2917 2920                   */
2918 2921                  nhdr->b_l1hdr.b_state = arc_l2c_only;
2919 2922  
2920 2923                  /* Verify previous threads set to NULL before freeing */
2921 2924                  ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
2922 2925          } else {
2923 2926                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2924 2927                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
2925 2928                  ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
2926 2929  
2927 2930                  /*
2928 2931                   * If we've reached here, we must have been called from
2929 2932                   * arc_evict_hdr(), as such we should have already been
2930 2933                   * removed from any ghost list we were previously on
2931 2934                   * (which protects us from racing with arc_evict_state),
2932 2935                   * thus no locking is needed during this check.
2933 2936                   */
2934 2937                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
2935 2938  
2936 2939                  /*
2937 2940                   * A buffer must not be moved into the arc_l2c_only
2938 2941                   * state if it's not finished being written out to the
2939 2942                   * l2arc device. Otherwise, the b_l1hdr.b_pabd field
2940 2943                   * might be accessed, even though it was removed.
2941 2944                   */
2942 2945                  VERIFY(!HDR_L2_WRITING(hdr));
2943 2946                  VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2944 2947  
2945 2948  #ifdef ZFS_DEBUG
2946 2949                  if (hdr->b_l1hdr.b_thawed != NULL) {
2947 2950                          kmem_free(hdr->b_l1hdr.b_thawed, 1);
2948 2951                          hdr->b_l1hdr.b_thawed = NULL;
2949 2952                  }
2950 2953  #endif
2951 2954  
2952 2955                  arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
2953 2956          }
2954 2957          /*
2955 2958           * The header has been reallocated so we need to re-insert it into any
2956 2959           * lists it was on.
2957 2960           */
2958 2961          (void) buf_hash_insert(nhdr, NULL);
2959 2962  
2960 2963          ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));
2961 2964  
2962 2965          mutex_enter(&dev->l2ad_mtx);
2963 2966  
2964 2967          /*
2965 2968           * We must place the realloc'ed header back into the list at
2966 2969           * the same spot. Otherwise, if it's placed earlier in the list,
2967 2970           * l2arc_write_buffers() could find it during the function's
2968 2971           * write phase, and try to write it out to the l2arc.
2969 2972           */
2970 2973          list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
2971 2974          list_remove(&dev->l2ad_buflist, hdr);
2972 2975  
2973 2976          mutex_exit(&dev->l2ad_mtx);
2974 2977  
2975 2978          /*
2976 2979           * Since we're using the pointer address as the tag when
2977 2980           * incrementing and decrementing the l2ad_alloc refcount, we
2978 2981           * must remove the old pointer (that we're about to destroy) and
2979 2982           * add the new pointer to the refcount. Otherwise we'd remove
2980 2983           * the wrong pointer address when calling arc_hdr_destroy() later.
2981 2984           */
2982 2985  
2983 2986          (void) refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
2984 2987          (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr), nhdr);
2985 2988  
2986 2989          buf_discard_identity(hdr);
2987 2990          kmem_cache_free(old, hdr);
2988 2991  
2989 2992          return (nhdr);
2990 2993  }
2991 2994  
2992 2995  /*
2993 2996   * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
2994 2997   * The buf is returned thawed since we expect the consumer to modify it.
2995 2998   */
2996 2999  arc_buf_t *
2997 3000  arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
2998 3001  {
2999 3002          arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
3000 3003              ZIO_COMPRESS_OFF, type);
3001 3004          ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3002 3005  
3003 3006          arc_buf_t *buf = NULL;
3004 3007          VERIFY0(arc_buf_alloc_impl(hdr, tag, B_FALSE, B_FALSE, &buf));
3005 3008          arc_buf_thaw(buf);
3006 3009  
3007 3010          return (buf);
3008 3011  }
3009 3012  
3010 3013  /*
3011 3014   * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
3012 3015   * for bufs containing metadata.
3013 3016   */
3014 3017  arc_buf_t *
3015 3018  arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
3016 3019      enum zio_compress compression_type)
3017 3020  {
3018 3021          ASSERT3U(lsize, >, 0);
3019 3022          ASSERT3U(lsize, >=, psize);
3020 3023          ASSERT(compression_type > ZIO_COMPRESS_OFF);
3021 3024          ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3022 3025  
3023 3026          arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3024 3027              compression_type, ARC_BUFC_DATA);
3025 3028          ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3026 3029  
3027 3030          arc_buf_t *buf = NULL;
3028 3031          VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3029 3032          arc_buf_thaw(buf);
3030 3033          ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
3031 3034  
3032 3035          if (!arc_buf_is_shared(buf)) {
3033 3036                  /*
3034 3037                   * To ensure that the hdr has the correct data in it if we call
3035 3038                   * arc_decompress() on this buf before it's been written to
3036 3039                   * disk, it's easiest if we just set up sharing between the
3037 3040                   * buf and the hdr.
3038 3041                   */
3039 3042                  ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3040 3043                  arc_hdr_free_pabd(hdr);
3041 3044                  arc_share_buf(hdr, buf);
3042 3045          }
3043 3046  
3044 3047          return (buf);
3045 3048  }
3046 3049  
3047 3050  static void
3048 3051  arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
3049 3052  {
3050 3053          l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
3051 3054          l2arc_dev_t *dev = l2hdr->b_dev;
3052 3055          uint64_t psize = arc_hdr_size(hdr);
3053 3056  
3054 3057          ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3055 3058          ASSERT(HDR_HAS_L2HDR(hdr));
3056 3059  
3057 3060          list_remove(&dev->l2ad_buflist, hdr);
3058 3061  
3059 3062          ARCSTAT_INCR(arcstat_l2_psize, -psize);
3060 3063          ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3061 3064  
3062 3065          vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3063 3066  
3064 3067          (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3065 3068          arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3066 3069  }
3067 3070  
3068 3071  static void
3069 3072  arc_hdr_destroy(arc_buf_hdr_t *hdr)
3070 3073  {
3071 3074          if (HDR_HAS_L1HDR(hdr)) {
3072 3075                  ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3073 3076                      hdr->b_l1hdr.b_bufcnt > 0);
3074 3077                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3075 3078                  ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3076 3079          }
3077 3080          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3078 3081          ASSERT(!HDR_IN_HASH_TABLE(hdr));
3079 3082  
3080 3083          if (!HDR_EMPTY(hdr))
3081 3084                  buf_discard_identity(hdr);
3082 3085  
3083 3086          if (HDR_HAS_L2HDR(hdr)) {
3084 3087                  l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3085 3088                  boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3086 3089  
3087 3090                  if (!buflist_held)
3088 3091                          mutex_enter(&dev->l2ad_mtx);
3089 3092  
3090 3093                  /*
3091 3094                   * Even though we checked this conditional above, we
3092 3095                   * need to check this again now that we have the
3093 3096                   * l2ad_mtx. This is because we could be racing with
3094 3097                   * another thread calling l2arc_evict() which might have
3095 3098                   * destroyed this header's L2 portion as we were waiting
3096 3099                   * to acquire the l2ad_mtx. If that happens, we don't
3097 3100                   * want to re-destroy the header's L2 portion.
3098 3101                   */
3099 3102                  if (HDR_HAS_L2HDR(hdr))
3100 3103                          arc_hdr_l2hdr_destroy(hdr);
3101 3104  
3102 3105                  if (!buflist_held)
3103 3106                          mutex_exit(&dev->l2ad_mtx);
3104 3107          }
3105 3108  
3106 3109          if (HDR_HAS_L1HDR(hdr)) {
3107 3110                  arc_cksum_free(hdr);
3108 3111  
3109 3112                  while (hdr->b_l1hdr.b_buf != NULL)
3110 3113                          arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3111 3114  
3112 3115  #ifdef ZFS_DEBUG
3113 3116                  if (hdr->b_l1hdr.b_thawed != NULL) {
3114 3117                          kmem_free(hdr->b_l1hdr.b_thawed, 1);
3115 3118                          hdr->b_l1hdr.b_thawed = NULL;
3116 3119                  }
3117 3120  #endif
3118 3121  
3119 3122                  if (hdr->b_l1hdr.b_pabd != NULL) {
3120 3123                          arc_hdr_free_pabd(hdr);
3121 3124                  }
3122 3125          }
3123 3126  
3124 3127          ASSERT3P(hdr->b_hash_next, ==, NULL);
3125 3128          if (HDR_HAS_L1HDR(hdr)) {
3126 3129                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
3127 3130                  ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
3128 3131                  kmem_cache_free(hdr_full_cache, hdr);
3129 3132          } else {
3130 3133                  kmem_cache_free(hdr_l2only_cache, hdr);
3131 3134          }
3132 3135  }
3133 3136  
3134 3137  void
3135 3138  arc_buf_destroy(arc_buf_t *buf, void *tag)
3136 3139  {
3137 3140          arc_buf_hdr_t *hdr = buf->b_hdr;
3138 3141          kmutex_t *hash_lock = HDR_LOCK(hdr);
3139 3142  
3140 3143          if (hdr->b_l1hdr.b_state == arc_anon) {
3141 3144                  ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
3142 3145                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3143 3146                  VERIFY0(remove_reference(hdr, NULL, tag));
3144 3147                  arc_hdr_destroy(hdr);
3145 3148                  return;
3146 3149          }
3147 3150  
3148 3151          mutex_enter(hash_lock);
3149 3152          ASSERT3P(hdr, ==, buf->b_hdr);
3150 3153          ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
3151 3154          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3152 3155          ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
3153 3156          ASSERT3P(buf->b_data, !=, NULL);
3154 3157  
3155 3158          (void) remove_reference(hdr, hash_lock, tag);
3156 3159          arc_buf_destroy_impl(buf);
3157 3160          mutex_exit(hash_lock);
3158 3161  }
3159 3162  
3160 3163  /*
3161 3164   * Evict the arc_buf_hdr that is provided as a parameter. The resultant
3162 3165   * state of the header is dependent on its state prior to entering this
3163 3166   * function. The following transitions are possible:
3164 3167   *
3165 3168   *    - arc_mru -> arc_mru_ghost
3166 3169   *    - arc_mfu -> arc_mfu_ghost
3167 3170   *    - arc_mru_ghost -> arc_l2c_only
3168 3171   *    - arc_mru_ghost -> deleted
3169 3172   *    - arc_mfu_ghost -> arc_l2c_only
3170 3173   *    - arc_mfu_ghost -> deleted
3171 3174   */
3172 3175  static int64_t
3173 3176  arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3174 3177  {
3175 3178          arc_state_t *evicted_state, *state;
3176 3179          int64_t bytes_evicted = 0;
3177 3180  
3178 3181          ASSERT(MUTEX_HELD(hash_lock));
3179 3182          ASSERT(HDR_HAS_L1HDR(hdr));
3180 3183  
3181 3184          state = hdr->b_l1hdr.b_state;
3182 3185          if (GHOST_STATE(state)) {
3183 3186                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3184 3187                  ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3185 3188  
3186 3189                  /*
3187 3190                   * l2arc_write_buffers() relies on a header's L1 portion
3188 3191                   * (i.e. its b_pabd field) during its write phase.
3189 3192                   * Thus, we cannot push a header onto the arc_l2c_only
3190 3193                   * state (removing its L1 piece) until the header is
3191 3194                   * done being written to the l2arc.
3192 3195                   */
3193 3196                  if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
3194 3197                          ARCSTAT_BUMP(arcstat_evict_l2_skip);
3195 3198                          return (bytes_evicted);
3196 3199                  }
3197 3200  
3198 3201                  ARCSTAT_BUMP(arcstat_deleted);
3199 3202                  bytes_evicted += HDR_GET_LSIZE(hdr);
3200 3203  
3201 3204                  DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);
3202 3205  
3203 3206                  ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
3204 3207                  if (HDR_HAS_L2HDR(hdr)) {
3205 3208                          /*
3206 3209                           * This buffer is cached on the 2nd Level ARC;
3207 3210                           * don't destroy the header.
3208 3211                           */
3209 3212                          arc_change_state(arc_l2c_only, hdr, hash_lock);
3210 3213                          /*
3211 3214                           * dropping from L1+L2 cached to L2-only,
3212 3215                           * realloc to remove the L1 header.
3213 3216                           */
3214 3217                          hdr = arc_hdr_realloc(hdr, hdr_full_cache,
3215 3218                              hdr_l2only_cache);
3216 3219                  } else {
3217 3220                          arc_change_state(arc_anon, hdr, hash_lock);
3218 3221                          arc_hdr_destroy(hdr);
3219 3222                  }
3220 3223                  return (bytes_evicted);
3221 3224          }
3222 3225  
3223 3226          ASSERT(state == arc_mru || state == arc_mfu);
3224 3227          evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
3225 3228  
3226 3229          /* prefetch buffers have a minimum lifespan */
3227 3230          if (HDR_IO_IN_PROGRESS(hdr) ||
3228 3231              ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
3229 3232              ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
3230 3233              arc_min_prefetch_lifespan)) {
3231 3234                  ARCSTAT_BUMP(arcstat_evict_skip);
3232 3235                  return (bytes_evicted);
3233 3236          }
3234 3237  
3235 3238          ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
3236 3239          while (hdr->b_l1hdr.b_buf) {
3237 3240                  arc_buf_t *buf = hdr->b_l1hdr.b_buf;
3238 3241                  if (!mutex_tryenter(&buf->b_evict_lock)) {
3239 3242                          ARCSTAT_BUMP(arcstat_mutex_miss);
3240 3243                          break;
3241 3244                  }
3242 3245                  if (buf->b_data != NULL)
3243 3246                          bytes_evicted += HDR_GET_LSIZE(hdr);
3244 3247                  mutex_exit(&buf->b_evict_lock);
3245 3248                  arc_buf_destroy_impl(buf);
3246 3249          }
3247 3250  
3248 3251          if (HDR_HAS_L2HDR(hdr)) {
3249 3252                  ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
3250 3253          } else {
3251 3254                  if (l2arc_write_eligible(hdr->b_spa, hdr)) {
3252 3255                          ARCSTAT_INCR(arcstat_evict_l2_eligible,
3253 3256                              HDR_GET_LSIZE(hdr));
3254 3257                  } else {
3255 3258                          ARCSTAT_INCR(arcstat_evict_l2_ineligible,
3256 3259                              HDR_GET_LSIZE(hdr));
3257 3260                  }
3258 3261          }
3259 3262  
3260 3263          if (hdr->b_l1hdr.b_bufcnt == 0) {
3261 3264                  arc_cksum_free(hdr);
3262 3265  
3263 3266                  bytes_evicted += arc_hdr_size(hdr);
3264 3267  
3265 3268                  /*
3266 3269                   * If this hdr is being evicted and has a compressed
3267 3270                   * buffer then we discard it here before we change states.
3268 3271                   * This ensures that the accounting is updated correctly
3269 3272                   * in arc_free_data_impl().
3270 3273                   */
3271 3274                  arc_hdr_free_pabd(hdr);
3272 3275  
3273 3276                  arc_change_state(evicted_state, hdr, hash_lock);
3274 3277                  ASSERT(HDR_IN_HASH_TABLE(hdr));
3275 3278                  arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
3276 3279                  DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
3277 3280          }
3278 3281  
3279 3282          return (bytes_evicted);
3280 3283  }
3281 3284  
3282 3285  static uint64_t
3283 3286  arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
3284 3287      uint64_t spa, int64_t bytes)
3285 3288  {
3286 3289          multilist_sublist_t *mls;
3287 3290          uint64_t bytes_evicted = 0;
3288 3291          arc_buf_hdr_t *hdr;
3289 3292          kmutex_t *hash_lock;
3290 3293          int evict_count = 0;
3291 3294  
3292 3295          ASSERT3P(marker, !=, NULL);
3293 3296          IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
3294 3297  
3295 3298          mls = multilist_sublist_lock(ml, idx);
3296 3299  
3297 3300          for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL;
3298 3301              hdr = multilist_sublist_prev(mls, marker)) {
3299 3302                  if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) ||
3300 3303                      (evict_count >= zfs_arc_evict_batch_limit))
3301 3304                          break;
3302 3305  
3303 3306                  /*
3304 3307                   * To keep our iteration location, move the marker
3305 3308                   * forward. Since we're not holding hdr's hash lock, we
3306 3309                   * must be very careful and not remove 'hdr' from the
3307 3310                   * sublist. Otherwise, other consumers might mistake the
3308 3311                   * 'hdr' as not being on a sublist when they call the
3309 3312                   * multilist_link_active() function (they all rely on
3310 3313                   * the hash lock protecting concurrent insertions and
3311 3314                   * removals). multilist_sublist_move_forward() was
3312 3315                   * specifically implemented to ensure this is the case
3313 3316                   * (only 'marker' will be removed and re-inserted).
3314 3317                   */
3315 3318                  multilist_sublist_move_forward(mls, marker);
3316 3319  
3317 3320                  /*
3318 3321                   * The only case where the b_spa field should ever be
3319 3322                   * zero, is the marker headers inserted by
3320 3323                   * arc_evict_state(). It's possible for multiple threads
3321 3324                   * to be calling arc_evict_state() concurrently (e.g.
3322 3325                   * dsl_pool_close() and zio_inject_fault()), so we must
3323 3326                   * skip any markers we see from these other threads.
3324 3327                   */
3325 3328                  if (hdr->b_spa == 0)
3326 3329                          continue;
3327 3330  
3328 3331                  /* we're only interested in evicting buffers of a certain spa */
3329 3332                  if (spa != 0 && hdr->b_spa != spa) {
3330 3333                          ARCSTAT_BUMP(arcstat_evict_skip);
3331 3334                          continue;
3332 3335                  }
3333 3336  
3334 3337                  hash_lock = HDR_LOCK(hdr);
3335 3338  
3336 3339                  /*
3337 3340                   * We aren't calling this function from any code path
3338 3341                   * that would already be holding a hash lock, so we're
3339 3342                   * asserting on this assumption to be defensive in case
3340 3343                   * this ever changes. Without this check, it would be
3341 3344                   * possible to incorrectly increment arcstat_mutex_miss
3342 3345                   * below (e.g. if the code changed such that we called
3343 3346                   * this function with a hash lock held).
3344 3347                   */
3345 3348                  ASSERT(!MUTEX_HELD(hash_lock));
3346 3349  
3347 3350                  if (mutex_tryenter(hash_lock)) {
3348 3351                          uint64_t evicted = arc_evict_hdr(hdr, hash_lock);
3349 3352                          mutex_exit(hash_lock);
3350 3353  
3351 3354                          bytes_evicted += evicted;
3352 3355  
3353 3356                          /*
3354 3357                           * If evicted is zero, arc_evict_hdr() must have
3355 3358                           * decided to skip this header, don't increment
3356 3359                           * evict_count in this case.
3357 3360                           */
3358 3361                          if (evicted != 0)
3359 3362                                  evict_count++;
3360 3363  
3361 3364                          /*
3362 3365                           * If arc_size isn't overflowing, signal any
3363 3366                           * threads that might happen to be waiting.
3364 3367                           *
3365 3368                           * For each header evicted, we wake up a single
3366 3369                           * thread. If we used cv_broadcast, we could
3367 3370                           * wake up "too many" threads causing arc_size
3368 3371                           * to significantly overflow arc_c; since
3369 3372                           * arc_get_data_impl() doesn't check for overflow
3370 3373                           * when it's woken up (it doesn't because it's
3371 3374                           * possible for the ARC to be overflowing while
3372 3375                           * full of un-evictable buffers, and the
3373 3376                           * function should proceed in this case).
3374 3377                           *
3375 3378                           * If threads are left sleeping, due to not
3376 3379                           * using cv_broadcast, they will be woken up
3377 3380                           * just before arc_reclaim_thread() sleeps.
3378 3381                           */
3379 3382                          mutex_enter(&arc_reclaim_lock);
3380 3383                          if (!arc_is_overflowing())
3381 3384                                  cv_signal(&arc_reclaim_waiters_cv);
3382 3385                          mutex_exit(&arc_reclaim_lock);
3383 3386                  } else {
3384 3387                          ARCSTAT_BUMP(arcstat_mutex_miss);
3385 3388                  }
3386 3389          }
3387 3390  
3388 3391          multilist_sublist_unlock(mls);
3389 3392  
3390 3393          return (bytes_evicted);
3391 3394  }
3392 3395  
3393 3396  /*
3394 3397   * Evict buffers from the given arc state, until we've removed the
3395 3398   * specified number of bytes. Move the removed buffers to the
3396 3399   * appropriate evict state.
3397 3400   *
3398 3401   * This function makes a "best effort". It skips over any buffers
3399 3402   * it can't get a hash_lock on, and so, may not catch all candidates.
3400 3403   * It may also return without evicting as much space as requested.
3401 3404   *
3402 3405   * If bytes is specified using the special value ARC_EVICT_ALL, this
3403 3406   * will evict all available (i.e. unlocked and evictable) buffers from
3404 3407   * the given arc state; which is used by arc_flush().
3405 3408   */
3406 3409  static uint64_t
3407 3410  arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
3408 3411      arc_buf_contents_t type)
3409 3412  {
3410 3413          uint64_t total_evicted = 0;
3411 3414          multilist_t *ml = state->arcs_list[type];
3412 3415          int num_sublists;
3413 3416          arc_buf_hdr_t **markers;
3414 3417  
3415 3418          IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
3416 3419  
3417 3420          num_sublists = multilist_get_num_sublists(ml);
3418 3421  
3419 3422          /*
3420 3423           * If we've tried to evict from each sublist, made some
3421 3424           * progress, but still have not hit the target number of bytes
3422 3425           * to evict, we want to keep trying. The markers allow us to
3423 3426           * pick up where we left off for each individual sublist, rather
3424 3427           * than starting from the tail each time.
3425 3428           */
3426 3429          markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
3427 3430          for (int i = 0; i < num_sublists; i++) {
3428 3431                  markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
3429 3432  
3430 3433                  /*
3431 3434                   * A b_spa of 0 is used to indicate that this header is
3432 3435                   * a marker. This fact is used in arc_adjust_type() and
3433 3436                   * arc_evict_state_impl().
3434 3437                   */
3435 3438                  markers[i]->b_spa = 0;
3436 3439  
3437 3440                  multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
3438 3441                  multilist_sublist_insert_tail(mls, markers[i]);
3439 3442                  multilist_sublist_unlock(mls);
3440 3443          }
3441 3444  
3442 3445          /*
3443 3446           * While we haven't hit our target number of bytes to evict, or
3444 3447           * we're evicting all available buffers.
3445 3448           */
3446 3449          while (total_evicted < bytes || bytes == ARC_EVICT_ALL) {
3447 3450                  /*
3448 3451                   * Start eviction using a randomly selected sublist,
3449 3452                   * this is to try and evenly balance eviction across all
3450 3453                   * sublists. Always starting at the same sublist
3451 3454                   * (e.g. index 0) would cause evictions to favor certain
3452 3455                   * sublists over others.
3453 3456                   */
3454 3457                  int sublist_idx = multilist_get_random_index(ml);
3455 3458                  uint64_t scan_evicted = 0;
3456 3459  
3457 3460                  for (int i = 0; i < num_sublists; i++) {
3458 3461                          uint64_t bytes_remaining;
3459 3462                          uint64_t bytes_evicted;
3460 3463  
3461 3464                          if (bytes == ARC_EVICT_ALL)
3462 3465                                  bytes_remaining = ARC_EVICT_ALL;
3463 3466                          else if (total_evicted < bytes)
3464 3467                                  bytes_remaining = bytes - total_evicted;
3465 3468                          else
3466 3469                                  break;
3467 3470  
3468 3471                          bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
3469 3472                              markers[sublist_idx], spa, bytes_remaining);
3470 3473  
3471 3474                          scan_evicted += bytes_evicted;
3472 3475                          total_evicted += bytes_evicted;
3473 3476  
3474 3477                          /* we've reached the end, wrap to the beginning */
3475 3478                          if (++sublist_idx >= num_sublists)
3476 3479                                  sublist_idx = 0;
3477 3480                  }
3478 3481  
3479 3482                  /*
3480 3483                   * If we didn't evict anything during this scan, we have
3481 3484                   * no reason to believe we'll evict more during another
3482 3485                   * scan, so break the loop.
3483 3486                   */
3484 3487                  if (scan_evicted == 0) {
3485 3488                          /* This isn't possible, let's make that obvious */
3486 3489                          ASSERT3S(bytes, !=, 0);
3487 3490  
3488 3491                          /*
3489 3492                           * When bytes is ARC_EVICT_ALL, the only way to
3490 3493                           * break the loop is when scan_evicted is zero.
3491 3494                           * In that case, we actually have evicted enough,
3492 3495                           * so we don't want to increment the kstat.
3493 3496                           */
3494 3497                          if (bytes != ARC_EVICT_ALL) {
3495 3498                                  ASSERT3S(total_evicted, <, bytes);
3496 3499                                  ARCSTAT_BUMP(arcstat_evict_not_enough);
3497 3500                          }
3498 3501  
3499 3502                          break;
3500 3503                  }
3501 3504          }
3502 3505  
3503 3506          for (int i = 0; i < num_sublists; i++) {
3504 3507                  multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
3505 3508                  multilist_sublist_remove(mls, markers[i]);
3506 3509                  multilist_sublist_unlock(mls);
3507 3510  
3508 3511                  kmem_cache_free(hdr_full_cache, markers[i]);
3509 3512          }
3510 3513          kmem_free(markers, sizeof (*markers) * num_sublists);
3511 3514  
3512 3515          return (total_evicted);
3513 3516  }
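
Reviewer's sketch (not part of the change): the scan loop in arc_evict_state() can be modelled in user space. The sketch below assumes each sublist is reduced to a plain byte count, and that a hypothetical `batch` parameter stands in for the per-call limit enforced in arc_evict_state_impl(). The names `evict_round_robin`, `sublists`, `start` and `batch` are invented for illustration. The real code picks a fresh random starting index on every scan, while this model takes a fixed `start` (which must be in the range 0..n-1) so that the result is deterministic.

```c
#include <stdint.h>

/*
 * Hypothetical model of the arc_evict_state() scan loop: visit every
 * sublist once per scan, starting at 'start' and wrapping around, and
 * stop when the target is met or a whole scan evicts nothing.
 */
static uint64_t
evict_round_robin(uint64_t *sublists, int n, int start, uint64_t target,
    uint64_t batch)
{
	uint64_t total = 0;

	while (total < target) {
		uint64_t scan = 0;
		int idx = start;

		for (int i = 0; i < n; i++) {
			uint64_t take;

			if (total >= target)
				break;

			/* Evict at most what remains, one batch at a time. */
			take = target - total;
			if (take > batch)
				take = batch;
			if (take > sublists[idx])
				take = sublists[idx];

			sublists[idx] -= take;
			scan += take;
			total += take;

			/* We've reached the end, wrap to the beginning. */
			if (++idx >= n)
				idx = 0;
		}

		/* No progress in a full scan: another scan will not help. */
		if (scan == 0)
			break;
	}

	return (total);
}
```

With sublists of 10, 0 and 5 bytes, a start index of 1, a target of 12 and a batch of 4, the model needs two scans to reach the target. A later request for 100 bytes returns only the 3 bytes that remain, which corresponds to the arcstat_evict_not_enough case above.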
3514 3517  
3515 3518  /*
3516 3519   * Flush all "evictable" data of the given type from the arc state
3517 3520   * specified. This will not evict any "active" buffers (i.e. referenced).
3518 3521   *
3519 3522   * When 'retry' is set to B_FALSE, the function will make a single pass
3520 3523   * over the state and evict any buffers that it can. Since it doesn't
3521 3524   * continually retry the eviction, it might end up leaving some buffers
3522 3525   * in the ARC due to lock misses.
3523 3526   *
3524 3527   * When 'retry' is set to B_TRUE, the function will continually retry the
3525 3528   * eviction until *all* evictable buffers have been removed from the
3526 3529   * state. As a result, if concurrent insertions into the state are
3527 3530   * allowed (e.g. if the ARC isn't shutting down), this function might
3528 3531   * wind up in an infinite loop, continually trying to evict buffers.
3529 3532   */
3530 3533  static uint64_t
3531 3534  arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
3532 3535      boolean_t retry)
3533 3536  {
3534 3537          uint64_t evicted = 0;
3535 3538  
3536 3539          while (refcount_count(&state->arcs_esize[type]) != 0) {
3537 3540                  evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type);
3538 3541  
3539 3542                  if (!retry)
3540 3543                          break;
3541 3544          }
3542 3545  
3543 3546          return (evicted);
3544 3547  }
3545 3548  
3546 3549  /*
3547 3550   * Evict the specified number of bytes from the state specified,
3548 3551   * restricting eviction to the spa and type given. This function
3549 3552   * prevents us from trying to evict more from a state's list than
3550 3553   * is "evictable", and skips evicting altogether when passed a
3551 3554   * negative value for "bytes". In contrast, arc_evict_state() will
3552 3555   * evict everything it can, when passed a negative value for "bytes".
3553 3556   */
3554 3557  static uint64_t
3555 3558  arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
3556 3559      arc_buf_contents_t type)
3557 3560  {
3558 3561          int64_t delta;
3559 3562  
3560 3563          if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
3561 3564                  delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
3562 3565                  return (arc_evict_state(state, spa, delta, type));
3563 3566          }
3564 3567  
3565 3568          return (0);
3566 3569  }
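
Reviewer's sketch (not part of the change): the clamp performed by arc_adjust_impl() reduces to a few lines once the refcount lookup is replaced by a plain value. The function name `clamp_evict_request` and its `evictable` parameter are hypothetical stand-ins for the arcs_esize count.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the clamp in arc_adjust_impl(): a request
 * that is zero or negative evicts nothing, and a positive request is
 * capped by the number of evictable bytes.
 */
static uint64_t
clamp_evict_request(uint64_t evictable, int64_t bytes)
{
	if (bytes > 0 && evictable > 0) {
		uint64_t want = (uint64_t)bytes;

		return (want < evictable ? want : evictable);
	}

	return (0);
}
```

This is why callers such as arc_adjust() can pass a target that has gone negative after subtracting the bytes already evicted: the request is simply ignored.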
3567 3570  
3568 3571  /*
3569 3572   * Evict metadata buffers from the cache, such that arc_meta_used is
3570 3573   * capped by the arc_meta_limit tunable.
3571 3574   */
3572 3575  static uint64_t
3573 3576  arc_adjust_meta(void)
3574 3577  {
3575 3578          uint64_t total_evicted = 0;
3576 3579          int64_t target;
3577 3580  
3578 3581          /*
3579 3582           * If we're over the meta limit, we want to evict enough
3580 3583           * metadata to get back under the meta limit. We don't want to
3581 3584           * evict so much that we drop the MRU below arc_p, though. If
3582 3585           * we're over the meta limit more than we're over arc_p, we
3583 3586           * evict some from the MRU here, and some from the MFU below.
3584 3587           */
3585 3588          target = MIN((int64_t)(arc_meta_used - arc_meta_limit),
3586 3589              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3587 3590              refcount_count(&arc_mru->arcs_size) - arc_p));
3588 3591  
3589 3592          total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3590 3593  
3591 3594          /*
3592 3595           * Similar to the above, we want to evict enough bytes to get us
3593 3596           * below the meta limit, but not so much as to drop us below the
3594 3597           * space allotted to the MFU (which is defined as arc_c - arc_p).
3595 3598           */
3596 3599          target = MIN((int64_t)(arc_meta_used - arc_meta_limit),
3597 3600              (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
3598 3601  
3599 3602          total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3600 3603  
3601 3604          return (total_evicted);
3602 3605  }
3603 3606  
3604 3607  /*
3605 3608   * Return the type of the oldest buffer in the given arc state.
3606 3609   *
3607 3610   * This function will select a random sublist of type ARC_BUFC_DATA and
3608 3611   * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
3609 3612   * is compared, and the type which contains the "older" buffer will be
3610 3613   * returned.
3611 3614   */
3612 3615  static arc_buf_contents_t
3613 3616  arc_adjust_type(arc_state_t *state)
3614 3617  {
3615 3618          multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
3616 3619          multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
3617 3620          int data_idx = multilist_get_random_index(data_ml);
3618 3621          int meta_idx = multilist_get_random_index(meta_ml);
3619 3622          multilist_sublist_t *data_mls;
3620 3623          multilist_sublist_t *meta_mls;
3621 3624          arc_buf_contents_t type;
3622 3625          arc_buf_hdr_t *data_hdr;
3623 3626          arc_buf_hdr_t *meta_hdr;
3624 3627  
3625 3628          /*
3626 3629           * We keep the sublist lock until we're finished, to prevent
3627 3630           * the headers from being destroyed via arc_evict_state().
3628 3631           */
3629 3632          data_mls = multilist_sublist_lock(data_ml, data_idx);
3630 3633          meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
3631 3634  
3632 3635          /*
3633 3636           * These two loops are to ensure we skip any markers that
3634 3637           * might be at the tail of the lists due to arc_evict_state().
3635 3638           */
3636 3639  
3637 3640          for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
3638 3641              data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
3639 3642                  if (data_hdr->b_spa != 0)
3640 3643                          break;
3641 3644          }
3642 3645  
3643 3646          for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
3644 3647              meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
3645 3648                  if (meta_hdr->b_spa != 0)
3646 3649                          break;
3647 3650          }
3648 3651  
3649 3652          if (data_hdr == NULL && meta_hdr == NULL) {
3650 3653                  type = ARC_BUFC_DATA;
3651 3654          } else if (data_hdr == NULL) {
3652 3655                  ASSERT3P(meta_hdr, !=, NULL);
3653 3656                  type = ARC_BUFC_METADATA;
3654 3657          } else if (meta_hdr == NULL) {
3655 3658                  ASSERT3P(data_hdr, !=, NULL);
3656 3659                  type = ARC_BUFC_DATA;
3657 3660          } else {
3658 3661                  ASSERT3P(data_hdr, !=, NULL);
3659 3662                  ASSERT3P(meta_hdr, !=, NULL);
3660 3663  
3661 3664                  /* The headers can't be on the sublist without an L1 header */
3662 3665                  ASSERT(HDR_HAS_L1HDR(data_hdr));
3663 3666                  ASSERT(HDR_HAS_L1HDR(meta_hdr));
3664 3667  
3665 3668                  if (data_hdr->b_l1hdr.b_arc_access <
3666 3669                      meta_hdr->b_l1hdr.b_arc_access) {
3667 3670                          type = ARC_BUFC_DATA;
3668 3671                  } else {
3669 3672                          type = ARC_BUFC_METADATA;
3670 3673                  }
3671 3674          }
3672 3675  
3673 3676          multilist_sublist_unlock(meta_mls);
3674 3677          multilist_sublist_unlock(data_mls);
3675 3678  
3676 3679          return (type);
3677 3680  }
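
Reviewer's sketch (not part of the change): the decision at the end of arc_adjust_type() can be isolated from the multilist walk. In the sketch below a NULL pointer plays the role of an empty sublist (no non-marker header found), and the pointed-to value plays the role of b_arc_access. The names `bufc_type_t`, `BUFC_DATA`, `BUFC_METADATA` and `older_type` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { BUFC_DATA, BUFC_METADATA } bufc_type_t;

/*
 * Hypothetical model of the type selection: prefer whichever sublist
 * has a tail header at all, otherwise the one accessed longer ago.
 * A tie goes to metadata, and two empty sublists default to data.
 */
static bufc_type_t
older_type(const int64_t *data_access, const int64_t *meta_access)
{
	if (data_access == NULL && meta_access == NULL)
		return (BUFC_DATA);
	if (data_access == NULL)
		return (BUFC_METADATA);
	if (meta_access == NULL)
		return (BUFC_DATA);

	return (*data_access < *meta_access ? BUFC_DATA : BUFC_METADATA);
}
```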
3678 3681  
3679 3682  /*
3680 3683   * Evict buffers from the cache, such that arc_size is capped by arc_c.
3681 3684   */
3682 3685  static uint64_t
3683 3686  arc_adjust(void)
3684 3687  {
3685 3688          uint64_t total_evicted = 0;
3686 3689          uint64_t bytes;
3687 3690          int64_t target;
3688 3691  
3689 3692          /*
3690 3693           * If we're over arc_meta_limit, we want to correct that before
3691 3694           * potentially evicting data buffers below.
3692 3695           */
3693 3696          total_evicted += arc_adjust_meta();
3694 3697  
3695 3698          /*
3696 3699           * Adjust MRU size
3697 3700           *
3698 3701           * If we're over the target cache size, we want to evict enough
3699 3702           * from the list to get back to our target size. We don't want
3700 3703           * to evict too much from the MRU, such that it drops below
3701 3704           * arc_p. So, if we're over our target cache size more than
3702 3705           * the MRU is over arc_p, we'll evict enough to get back to
3703 3706           * arc_p here, and then evict more from the MFU below.
3704 3707           */
3705 3708          target = MIN((int64_t)(arc_size - arc_c),
3706 3709              (int64_t)(refcount_count(&arc_anon->arcs_size) +
3707 3710              refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
3708 3711  
3709 3712          /*
3710 3713           * If we're below arc_meta_min, always prefer to evict data.
3711 3714           * Otherwise, try to satisfy the requested number of bytes to
3712 3715           * evict from the type which contains older buffers; in an
3713 3716           * effort to keep newer buffers in the cache regardless of their
3714 3717           * type. If we cannot satisfy the number of bytes from this
3715 3718           * type, spill over into the next type.
3716 3719           */
3717 3720          if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
3718 3721              arc_meta_used > arc_meta_min) {
3719 3722                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3720 3723                  total_evicted += bytes;
3721 3724  
3722 3725                  /*
3723 3726                   * If we couldn't evict our target number of bytes from
3724 3727                   * metadata, we try to get the rest from data.
3725 3728                   */
3726 3729                  target -= bytes;
3727 3730  
3728 3731                  total_evicted +=
3729 3732                      arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3730 3733          } else {
3731 3734                  bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3732 3735                  total_evicted += bytes;
3733 3736  
3734 3737                  /*
3735 3738                   * If we couldn't evict our target number of bytes from
3736 3739                   * data, we try to get the rest from metadata.
3737 3740                   */
3738 3741                  target -= bytes;
3739 3742  
3740 3743                  total_evicted +=
3741 3744                      arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3742 3745          }
3743 3746  
3744 3747          /*
3745 3748           * Adjust MFU size
3746 3749           *
3747 3750           * Now that we've tried to evict enough from the MRU to get its
3748 3751           * size back to arc_p, if we're still above the target cache
3749 3752           * size, we evict the rest from the MFU.
3750 3753           */
3751 3754          target = arc_size - arc_c;
3752 3755  
3753 3756          if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
3754 3757              arc_meta_used > arc_meta_min) {
3755 3758                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3756 3759                  total_evicted += bytes;
3757 3760  
3758 3761                  /*
3759 3762                   * If we couldn't evict our target number of bytes from
3760 3763                   * metadata, we try to get the rest from data.
3761 3764                   */
3762 3765                  target -= bytes;
3763 3766  
3764 3767                  total_evicted +=
3765 3768                      arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3766 3769          } else {
3767 3770                  bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3768 3771                  total_evicted += bytes;
3769 3772  
3770 3773                  /*
3771 3774                   * If we couldn't evict our target number of bytes from
3772 3775                   * data, we try to get the rest from metadata.
3773 3776                   */
3774 3777                  target -= bytes;
3775 3778  
3776 3779                  total_evicted +=
3777 3780                      arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3778 3781          }
3779 3782  
3780 3783          /*
3781 3784           * Adjust ghost lists
3782 3785           *
3783 3786           * In addition to the above, the ARC also defines target values
3784 3787           * for the ghost lists. The sum of the mru list and mru ghost
3785 3788           * list should never exceed the target size of the cache, and
3786 3789           * the sum of the mru list, mfu list, mru ghost list, and mfu
3787 3790           * ghost list should never exceed twice the target size of the
3788 3791           * cache. The following logic enforces these limits on the ghost
3789 3792           * caches, and evicts from them as needed.
3790 3793           */
3791 3794          target = refcount_count(&arc_mru->arcs_size) +
3792 3795              refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
3793 3796  
3794 3797          bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
3795 3798          total_evicted += bytes;
3796 3799  
3797 3800          target -= bytes;
3798 3801  
3799 3802          total_evicted +=
3800 3803              arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
3801 3804  
3802 3805          /*
3803 3806           * We assume the sum of the mru list and mfu list is less than
3804 3807           * or equal to arc_c (we enforced this above), which means we
3805 3808           * can use the simpler of the two equations below:
3806 3809           *
3807 3810           *      mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
3808 3811           *                  mru ghost + mfu ghost <= arc_c
3809 3812           */
3810 3813          target = refcount_count(&arc_mru_ghost->arcs_size) +
3811 3814              refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
3812 3815  
3813 3816          bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
3814 3817          total_evicted += bytes;
3815 3818  
3816 3819          target -= bytes;
3817 3820  
3818 3821          total_evicted +=
3819 3822              arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
3820 3823  
3821 3824          return (total_evicted);
3822 3825  }
3823 3826  
3824 3827  void
3825 3828  arc_flush(spa_t *spa, boolean_t retry)
3826 3829  {
3827 3830          uint64_t guid = 0;
3828 3831  
3829 3832          /*
3830 3833           * If retry is B_TRUE, a spa must not be specified since we have
3831 3834           * no good way to determine if all of a spa's buffers have been
3832 3835           * evicted from an arc state.
3833 3836           */
3834 3837          ASSERT(!retry || spa == 0);
3835 3838  
3836 3839          if (spa != NULL)
3837 3840                  guid = spa_load_guid(spa);
3838 3841  
3839 3842          (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
3840 3843          (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
3841 3844  
3842 3845          (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
3843 3846          (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
3844 3847  
3845 3848          (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
3846 3849          (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
3847 3850  
3848 3851          (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
3849 3852          (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
3850 3853  }
3851 3854  
3852 3855  void
3853 3856  arc_shrink(int64_t to_free)
3854 3857  {
3855 3858          if (arc_c > arc_c_min) {
3856 3859  
3857 3860                  if (arc_c > arc_c_min + to_free)
3858 3861                          atomic_add_64(&arc_c, -to_free);
3859 3862                  else
3860 3863                          arc_c = arc_c_min;
3861 3864  
3862 3865                  atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
3863 3866                  if (arc_c > arc_size)
3864 3867                          arc_c = MAX(arc_size, arc_c_min);
3865 3868                  if (arc_p > arc_c)
3866 3869                          arc_p = (arc_c >> 1);
3867 3870                  ASSERT(arc_c >= arc_c_min);
3868 3871                  ASSERT((int64_t)arc_p >= 0);
3869 3872          }
3870 3873  
3871 3874          if (arc_size > arc_c)
3872 3875                  (void) arc_adjust();
3873 3876  }
3874 3877  
3875 3878  typedef enum free_memory_reason_t {
3876 3879          FMR_UNKNOWN,
3877 3880          FMR_NEEDFREE,
3878 3881          FMR_LOTSFREE,
3879 3882          FMR_SWAPFS_MINFREE,
3880 3883          FMR_PAGES_PP_MAXIMUM,
3881 3884          FMR_HEAP_ARENA,
3882 3885          FMR_ZIO_ARENA,
3883 3886  } free_memory_reason_t;
3884 3887  
3885 3888  int64_t last_free_memory;
3886 3889  free_memory_reason_t last_free_reason;
3887 3890  
3888 3891  /*
3889 3892   * Additional reserve of pages for pp_reserve.
3890 3893   */
3891 3894  int64_t arc_pages_pp_reserve = 64;
3892 3895  
3893 3896  /*
3894 3897   * Additional reserve of pages for swapfs.
3895 3898   */
3896 3899  int64_t arc_swapfs_reserve = 64;
3897 3900  
3898 3901  /*
3899 3902   * Return the amount of memory that can be consumed before reclaim will be
3900 3903   * needed.  Positive if there is sufficient free memory, negative indicates
3901 3904   * the amount of memory that needs to be freed up.
3902 3905   */
3903 3906  static int64_t
3904 3907  arc_available_memory(void)
3905 3908  {
3906 3909          int64_t lowest = INT64_MAX;
3907 3910          int64_t n;
3908 3911          free_memory_reason_t r = FMR_UNKNOWN;
3909 3912  
3910 3913  #ifdef _KERNEL
3911 3914          if (needfree > 0) {
3912 3915                  n = PAGESIZE * (-needfree);
3913 3916                  if (n < lowest) {
3914 3917                          lowest = n;
3915 3918                          r = FMR_NEEDFREE;
3916 3919                  }
3917 3920          }
3918 3921  
3919 3922          /*
3920 3923           * check that we're out of range of the pageout scanner.  It starts to
3921 3924           * schedule paging if freemem is less than lotsfree and needfree.
3922 3925           * lotsfree is the high-water mark for pageout, and needfree is the
3923 3926           * number of needed free pages.  We add extra pages here to make sure
3924 3927           * the scanner doesn't start up while we're freeing memory.
3925 3928           */
3926 3929          n = PAGESIZE * (freemem - lotsfree - needfree - desfree);
3927 3930          if (n < lowest) {
3928 3931                  lowest = n;
3929 3932                  r = FMR_LOTSFREE;
3930 3933          }
3931 3934  
3932 3935          /*
3933 3936           * check to make sure that swapfs has enough space so that anon
3934 3937           * reservations can still succeed. anon_resvmem() checks that the
3935 3938           * availrmem is greater than swapfs_minfree, and the number of reserved
3936 3939           * swap pages.  We also add a bit of extra here just to prevent
3937 3940           * circumstances from getting really dire.
3938 3941           */
3939 3942          n = PAGESIZE * (availrmem - swapfs_minfree - swapfs_reserve -
3940 3943              desfree - arc_swapfs_reserve);
3941 3944          if (n < lowest) {
3942 3945                  lowest = n;
3943 3946                  r = FMR_SWAPFS_MINFREE;
3944 3947          }
3945 3948  
3946 3949  
3947 3950          /*
3948 3951           * Check that we have enough availrmem that memory locking (e.g., via
3949 3952           * mlock(3C) or memcntl(2)) can still succeed.  (pages_pp_maximum
3950 3953           * stores the number of pages that cannot be locked; when availrmem
3951 3954           * drops below pages_pp_maximum, page locking mechanisms such as
3952 3955           * page_pp_lock() will fail.)
3953 3956           */
3954 3957          n = PAGESIZE * (availrmem - pages_pp_maximum -
3955 3958              arc_pages_pp_reserve);
3956 3959          if (n < lowest) {
3957 3960                  lowest = n;
3958 3961                  r = FMR_PAGES_PP_MAXIMUM;
3959 3962          }
3960 3963  
3961 3964  #if defined(__i386)
3962 3965          /*
3963 3966           * If we're on an i386 platform, it's possible that we'll exhaust the
3964 3967           * kernel heap space before we ever run out of available physical
3965 3968           * memory.  Most checks of the size of the heap_area compare against
3966 3969           * tune.t_minarmem, which is the minimum available real memory that we
3967 3970           * can have in the system.  However, this is generally fixed at 25 pages
3968 3971           * which is so low that it's useless.  In this comparison, we seek to
3969 3972           * calculate the total heap-size, and reclaim if more than 3/4ths of the
3970 3973           * heap is allocated.  (Or, in the calculation, if less than 1/4th is
3971 3974           * free)
3972 3975           */
3973 3976          n = (int64_t)vmem_size(heap_arena, VMEM_FREE) -
3974 3977              (vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2);
3975 3978          if (n < lowest) {
3976 3979                  lowest = n;
3977 3980                  r = FMR_HEAP_ARENA;
3978 3981          }
3979 3982  #endif
3980 3983  
3981 3984          /*
3982 3985           * If zio data pages are being allocated out of a separate heap segment,
3983 3986           * then enforce that the size of available vmem for this arena remains
3984 3987           * above about 1/4th (1/(2^arc_zio_arena_free_shift)) free.
3985 3988           *
3986 3989           * Note that reducing the arc_zio_arena_free_shift keeps more virtual
3987 3990           * memory (in the zio_arena) free, which can avoid memory
3988 3991           * fragmentation issues.
3989 3992           */
3990 3993          if (zio_arena != NULL) {
3991 3994                  n = (int64_t)vmem_size(zio_arena, VMEM_FREE) -
3992 3995                      (vmem_size(zio_arena, VMEM_ALLOC) >>
3993 3996                      arc_zio_arena_free_shift);
3994 3997                  if (n < lowest) {
3995 3998                          lowest = n;
3996 3999                          r = FMR_ZIO_ARENA;
3997 4000                  }
3998 4001          }
3999 4002  #else
4000 4003          /* On average once per 100 calls, free a small amount */
4001 4004          if (spa_get_random(100) == 0)
4002 4005                  lowest = -1024;
4003 4006  #endif
4004 4007  
4005 4008          last_free_memory = lowest;
4006 4009          last_free_reason = r;
4007 4010  
4008 4011          return (lowest);
4009 4012  }
4010 4013  
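Review note: arc_available_memory() above takes the minimum over several independent headroom calculations and remembers which one was the tightest. The following is a minimal userland sketch of that pattern only. The names headroom_t and lowest_headroom() are hypothetical and do not exist in arc.c; the real function reads kernel globals such as freemem and availrmem rather than an array.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical pairing of a reason code with its computed headroom. */
typedef struct headroom {
	int	reason;		/* stand-in for free_memory_reason_t */
	int64_t	bytes;		/* positive: spare memory; negative: deficit */
} headroom_t;

/*
 * Return the smallest headroom among the checks and report which check
 * produced it.  With no checks, the result stays at INT64_MAX and the
 * reason stays at 0 (playing the role of FMR_UNKNOWN).
 */
static int64_t
lowest_headroom(const headroom_t *checks, size_t n, int *reason)
{
	int64_t lowest = INT64_MAX;

	*reason = 0;
	for (size_t i = 0; i < n; i++) {
		if (checks[i].bytes < lowest) {
			lowest = checks[i].bytes;
			*reason = checks[i].reason;
		}
	}
	return (lowest);
}
```

A negative result corresponds to the case where arc_reclaim_needed() returns B_TRUE.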
4011 4014  
4012 4015  /*
4013 4016   * Determine if the system is under memory pressure and is asking
4014 4017   * to reclaim memory. A return value of B_TRUE indicates that the system
4015 4018   * is under memory pressure and that the arc should adjust accordingly.
4016 4019   */
4017 4020  static boolean_t
4018 4021  arc_reclaim_needed(void)
4019 4022  {
4020 4023          return (arc_available_memory() < 0);
4021 4024  }
4022 4025  
4023 4026  static void
4024 4027  arc_kmem_reap_now(void)
4025 4028  {
4026 4029          size_t                  i;
4027 4030          kmem_cache_t            *prev_cache = NULL;
4028 4031          kmem_cache_t            *prev_data_cache = NULL;
4029 4032          extern kmem_cache_t     *zio_buf_cache[];
4030 4033          extern kmem_cache_t     *zio_data_buf_cache[];
4031 4034          extern kmem_cache_t     *range_seg_cache;
4032 4035          extern kmem_cache_t     *abd_chunk_cache;
4033 4036  
4034 4037  #ifdef _KERNEL
4035 4038          if (arc_meta_used >= arc_meta_limit) {
4036 4039                  /*
4037 4040                   * We are exceeding our meta-data cache limit.
4038 4041                   * Purge some DNLC entries to release holds on meta-data.
4039 4042                   */
4040 4043                  dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4041 4044          }
4042 4045  #if defined(__i386)
4043 4046          /*
4044 4047           * Reclaim unused memory from all kmem caches.
4045 4048           */
4046 4049          kmem_reap();
4047 4050  #endif
4048 4051  #endif
4049 4052  
     4053 +        /*
     4054 +         * If a kmem reap is already active, don't schedule more.  We must
     4055 +         * check for this because kmem_cache_reap_soon() won't actually
     4056 +         * block on the cache being reaped (this is to prevent callers from
     4057 +         * becoming implicitly blocked by a system-wide kmem reap -- which,
     4058 +         * on a system with many, many full magazines, can take minutes).
     4059 +         */
     4060 +        if (kmem_cache_reap_active())
     4061 +                return;
     4062 +
4050 4063          for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
4051 4064                  if (zio_buf_cache[i] != prev_cache) {
4052 4065                          prev_cache = zio_buf_cache[i];
4053      -                        kmem_cache_reap_now(zio_buf_cache[i]);
     4066 +                        kmem_cache_reap_soon(zio_buf_cache[i]);
4054 4067                  }
4055 4068                  if (zio_data_buf_cache[i] != prev_data_cache) {
4056 4069                          prev_data_cache = zio_data_buf_cache[i];
4057      -                        kmem_cache_reap_now(zio_data_buf_cache[i]);
     4070 +                        kmem_cache_reap_soon(zio_data_buf_cache[i]);
4058 4071                  }
4059 4072          }
4060      -        kmem_cache_reap_now(abd_chunk_cache);
4061      -        kmem_cache_reap_now(buf_cache);
4062      -        kmem_cache_reap_now(hdr_full_cache);
4063      -        kmem_cache_reap_now(hdr_l2only_cache);
4064      -        kmem_cache_reap_now(range_seg_cache);
     4073 +        kmem_cache_reap_soon(abd_chunk_cache);
     4074 +        kmem_cache_reap_soon(buf_cache);
     4075 +        kmem_cache_reap_soon(hdr_full_cache);
     4076 +        kmem_cache_reap_soon(hdr_l2only_cache);
     4077 +        kmem_cache_reap_soon(range_seg_cache);
4065 4078  
4066 4079          if (zio_arena != NULL) {
4067 4080                  /*
4068 4081                   * Ask the vmem arena to reclaim unused memory from its
4069 4082                   * quantum caches.
4070 4083                   */
4071 4084                  vmem_qcache_reap(zio_arena);
4072 4085          }
4073 4086  }
4074 4087
4075 4088  /*
4076 4089   * Threads can block in arc_get_data_impl() waiting for this thread to evict
4077 4090   * enough data and signal them to proceed. When this happens, the threads in
4078 4091   * arc_get_data_impl() are sleeping while holding the hash lock for their
4079 4092   * particular arc header. Thus, we must be careful to never sleep on a
4080 4093   * hash lock in this thread. This is to prevent the following deadlock:
4081 4094   *
4082 4095   *  - Thread A sleeps on CV in arc_get_data_impl() holding hash lock "L",
4083 4096   *    waiting for the reclaim thread to signal it.
4084 4097   *
4085 4098   *  - arc_reclaim_thread() tries to acquire hash lock "L" using mutex_enter,
4086 4099   *    fails, and goes to sleep forever.
4087 4100   *
4088 4101   * This possible deadlock is avoided by always acquiring a hash lock
4089 4102   * using mutex_tryenter() from arc_reclaim_thread().
4090 4103   */
4091 4104  /* ARGSUSED */
4092 4105  static void
4093 4106  arc_reclaim_thread(void *unused)
4094 4107  {
4095 4108          hrtime_t                growtime = 0;
     4109 +        hrtime_t                kmem_reap_time = 0;
4096 4110          callb_cpr_t             cpr;
4097 4111  
4098 4112          CALLB_CPR_INIT(&cpr, &arc_reclaim_lock, callb_generic_cpr, FTAG);
4099 4113  
4100 4114          mutex_enter(&arc_reclaim_lock);
4101 4115          while (!arc_reclaim_thread_exit) {
4102 4116                  uint64_t evicted = 0;
4103 4117  
4104 4118                  /*
4105 4119                   * This is necessary in order for the mdb ::arc dcmd to
4106 4120                   * show up to date information. Since the ::arc command
4107 4121                   * does not call the kstat's update function, without
4108 4122                   * this call, the command may show stale stats for the
4109 4123                   * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
4110 4124                   * with this change, the data might be up to 1 second
4111 4125                   * out of date; but that should suffice. The arc_state_t
4112 4126                   * structures can be queried directly if more accurate
4113 4127                   * information is needed.
4114 4128                   */
4115 4129                  if (arc_ksp != NULL)
4116 4130                          arc_ksp->ks_update(arc_ksp, KSTAT_READ);
4117 4131  
4118 4132                  mutex_exit(&arc_reclaim_lock);
4119 4133  
4120 4134                  /*
4121 4135                   * We call arc_adjust() before (possibly) calling
4122 4136                   * arc_kmem_reap_now(), so that we can wake up
4123 4137                   * arc_get_data_impl() sooner.
4124 4138                   */
4125 4139                  evicted = arc_adjust();
4126 4140  
4127 4141                  int64_t free_memory = arc_available_memory();
4128 4142                  if (free_memory < 0) {
4129      -
     4143 +                        hrtime_t curtime = gethrtime();
4130 4144                          arc_no_grow = B_TRUE;
4131 4145                          arc_warm = B_TRUE;
4132 4146  
4133 4147                          /*
4134 4148                           * Wait at least arc_grow_retry (default 60) seconds
4135 4149                           * before considering growing.
4136 4150                           */
4137      -                        growtime = gethrtime() + SEC2NSEC(arc_grow_retry);
     4151 +                        growtime = curtime + SEC2NSEC(arc_grow_retry);
4138 4152  
4139      -                        arc_kmem_reap_now();
     4153 +                        /*
     4154 +                         * Wait at least arc_kmem_cache_reap_retry_ms
     4155 +                         * between arc_kmem_reap_now() calls. Without
     4156 +                         * this check it is possible to end up in a
     4157 +                         * situation where we spend lots of time
     4158 +                         * reaping caches, while we're near arc_c_min.
     4159 +                         */
     4160 +                        if (curtime >= kmem_reap_time) {
     4161 +                                arc_kmem_reap_now();
     4162 +                                kmem_reap_time = gethrtime() +
     4163 +                                    MSEC2NSEC(arc_kmem_cache_reap_retry_ms);
     4164 +                        }
4140 4165  
4141 4166                          /*
4142 4167                           * If we are still low on memory, shrink the ARC
4143 4168                           * so that we have arc_shrink_min free space.
4144 4169                           */
4145 4170                          free_memory = arc_available_memory();
4146 4171  
4147 4172                          int64_t to_free =
4148 4173                              (arc_c >> arc_shrink_shift) - free_memory;
4149 4174                          if (to_free > 0) {
4150 4175  #ifdef _KERNEL
4151 4176                                  to_free = MAX(to_free, ptob(needfree));
4152 4177  #endif
4153 4178                                  arc_shrink(to_free);
4154 4179                          }
4155 4180                  } else if (free_memory < arc_c >> arc_no_grow_shift) {
4156 4181                          arc_no_grow = B_TRUE;
4157 4182                  } else if (gethrtime() >= growtime) {
4158 4183                          arc_no_grow = B_FALSE;
4159 4184                  }
4160 4185  
4161 4186                  mutex_enter(&arc_reclaim_lock);
4162 4187  
4163 4188                  /*
4164 4189                   * If evicted is zero, we couldn't evict anything via
4165 4190                   * arc_adjust(). This could be due to hash lock
4166 4191                   * collisions, but more likely due to the majority of
4167 4192                   * arc buffers being unevictable. Therefore, even if
4168 4193                   * arc_size is above arc_c, another pass is unlikely to
4169 4194                   * be helpful and could potentially cause us to enter an
4170 4195                   * infinite loop.
4171 4196                   */
4172 4197                  if (arc_size <= arc_c || evicted == 0) {
4173 4198                          /*
4174 4199                           * We're either no longer overflowing, or we
4175 4200                           * can't evict anything more, so we should wake
4176 4201                           * up any threads before we go to sleep.
4177 4202                           */
4178 4203                          cv_broadcast(&arc_reclaim_waiters_cv);
4179 4204  
4180 4205                          /*
4181 4206                           * Block until signaled, or after one second (we
4182 4207                           * might need to perform arc_kmem_reap_now()
4183 4208                           * even if we aren't being signalled)
4184 4209                           */
4185 4210                          CALLB_CPR_SAFE_BEGIN(&cpr);
4186 4211                          (void) cv_timedwait_hires(&arc_reclaim_thread_cv,
4187 4212                              &arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
4188 4213                          CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock);
4189 4214                  }
4190 4215          }
4191 4216  
4192 4217          arc_reclaim_thread_exit = B_FALSE;
4193 4218          cv_broadcast(&arc_reclaim_thread_cv);
4194 4219          CALLB_CPR_EXIT(&cpr);           /* drops arc_reclaim_lock */
4195 4220          thread_exit();
4196 4221  }
4197 4222  
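Review note: the change to arc_reclaim_thread() above gates arc_kmem_reap_now() behind a deadline (kmem_reap_time) so that repeated passes under memory pressure do not each schedule a reap. The following is a minimal userland sketch of that time-gated pattern. The type reap_throttle_t, the function reap_throttle_try(), and the MODEL_MSEC2NSEC macro are hypothetical names for illustration. Unlike the diff, which re-reads gethrtime() after the reap to compute the next deadline, this sketch reuses the caller-supplied timestamp to stay deterministic.

```c
#include <stdint.h>
#include <stdbool.h>

typedef int64_t model_hrtime_t;		/* nanoseconds; stand-in for hrtime_t */

#define	MODEL_MSEC2NSEC(ms)	((model_hrtime_t)(ms) * 1000000LL)

/* Hypothetical throttle state; the diff keeps this in a local variable. */
typedef struct reap_throttle {
	model_hrtime_t	next_reap;	/* earliest time a reap may run */
	model_hrtime_t	retry_ms;	/* plays arc_kmem_cache_reap_retry_ms */
	unsigned	reaps;		/* number of reaps actually issued */
} reap_throttle_t;

/*
 * Issue a reap only if the deadline has passed, then push the deadline
 * out by retry_ms.  Returns true when a reap was issued.
 */
static bool
reap_throttle_try(reap_throttle_t *t, model_hrtime_t now)
{
	if (now < t->next_reap)
		return (false);

	t->reaps++;	/* the real code calls arc_kmem_reap_now() here */
	t->next_reap = now + MODEL_MSEC2NSEC(t->retry_ms);
	return (true);
}
```

With retry_ms set to an assumed example value of 1000, a second call made 999 ms after a successful reap is skipped, and a call made at exactly 1000 ms goes through, matching the `curtime >= kmem_reap_time` comparison in the diff.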
4198 4223  /*
4199 4224   * Adapt arc info given the number of bytes we are trying to add and
4200 4225   * the state that we are coming from.  This function is only called
4201 4226   * when we are adding new content to the cache.
4202 4227   */
4203 4228  static void
4204 4229  arc_adapt(int bytes, arc_state_t *state)
4205 4230  {
4206 4231          int mult;
4207 4232          uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
4208 4233          int64_t mrug_size = refcount_count(&arc_mru_ghost->arcs_size);
4209 4234          int64_t mfug_size = refcount_count(&arc_mfu_ghost->arcs_size);
4210 4235  
4211 4236          if (state == arc_l2c_only)
4212 4237                  return;
4213 4238  
4214 4239          ASSERT(bytes > 0);
4215 4240          /*
4216 4241           * Adapt the target size of the MRU list:
4217 4242           *      - if we just hit in the MRU ghost list, then increase
4218 4243           *        the target size of the MRU list.
4219 4244           *      - if we just hit in the MFU ghost list, then increase
4220 4245           *        the target size of the MFU list by decreasing the
4221 4246           *        target size of the MRU list.
4222 4247           */
4223 4248          if (state == arc_mru_ghost) {
4224 4249                  mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size);
4225 4250                  mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
4226 4251  
4227 4252                  arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
4228 4253          } else if (state == arc_mfu_ghost) {
4229 4254                  uint64_t delta;
4230 4255  
4231 4256                  mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size);
4232 4257                  mult = MIN(mult, 10);
4233 4258  
4234 4259                  delta = MIN(bytes * mult, arc_p);
4235 4260                  arc_p = MAX(arc_p_min, arc_p - delta);
4236 4261          }
4237 4262          ASSERT((int64_t)arc_p >= 0);
4238 4263  
4239 4264          if (arc_reclaim_needed()) {
4240 4265                  cv_signal(&arc_reclaim_thread_cv);
4241 4266                  return;
4242 4267          }
4243 4268  
4244 4269          if (arc_no_grow)
4245 4270                  return;
4246 4271  
4247 4272          if (arc_c >= arc_c_max)
4248 4273                  return;
4249 4274  
4250 4275          /*
4251 4276           * If we're within (2 * maxblocksize) bytes of the target
4252 4277           * cache size, increment the target cache size
4253 4278           */
4254 4279          if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
4255 4280                  atomic_add_64(&arc_c, (int64_t)bytes);
4256 4281                  if (arc_c > arc_c_max)
4257 4282                          arc_c = arc_c_max;
4258 4283                  else if (state == arc_anon)
4259 4284                          atomic_add_64(&arc_p, (int64_t)bytes);
4260 4285                  if (arc_p > arc_c)
4261 4286                          arc_p = arc_c;
4262 4287          }
4263 4288          ASSERT((int64_t)arc_p >= 0);
4264 4289  }
4265 4290  
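Review note: the ghost-list branch of arc_adapt() above can be hard to follow inline. The following is a minimal userland sketch of just the arc_p adjustment, written as a pure function. The name model_adapt_p() and its parameters are hypothetical; the real code updates the global arc_p in place and reads the ghost sizes from refcounts. As in the original, the sketch assumes the ghost list that was hit is non-empty, so the divisions are safe.

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t
model_min(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
model_max(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/*
 * Return the new MRU target (p) after a ghost-list hit.
 *   mru_ghost_hit == true:  grow p, capped at c - p_min.
 *   mru_ghost_hit == false: (MFU ghost hit) shrink p, floored at p_min.
 * The multiplier favors the smaller ghost list and is capped at 10.
 */
static uint64_t
model_adapt_p(uint64_t p, uint64_t c, uint64_t p_min, uint64_t bytes,
    uint64_t mrug, uint64_t mfug, bool mru_ghost_hit)
{
	uint64_t mult;

	if (mru_ghost_hit) {
		mult = (mrug >= mfug) ? 1 : (mfug / mrug);
		mult = model_min(mult, 10);
		return (model_min(c - p_min, p + bytes * mult));
	}

	mult = (mfug >= mrug) ? 1 : (mrug / mfug);
	mult = model_min(mult, 10);
	/* Never subtract more than p itself, so p cannot wrap. */
	return (model_max(p_min, p - model_min(bytes * mult, p)));
}
```

For example, with c = 1000, p_min = 62, p = 500, and bytes = 10, an MRU ghost hit with ghost sizes 100 (MRU) and 300 (MFU) uses a multiplier of 3 and yields p = 530.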
4266 4291  /*
4267 4292   * Check if arc_size has grown past our upper threshold, determined by
4268 4293   * zfs_arc_overflow_shift.
4269 4294   */
4270 4295  static boolean_t
4271 4296  arc_is_overflowing(void)
4272 4297  {
4273 4298          /* Always allow at least one block of overflow */
4274 4299          uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
4275 4300              arc_c >> zfs_arc_overflow_shift);
4276 4301  
4277 4302          return (arc_size >= arc_c + overflow);
4278 4303  }
4279 4304  
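Review note: arc_is_overflowing() above allows arc_size to exceed arc_c by a slack of arc_c >> zfs_arc_overflow_shift, but never less than one maximum-sized block. The following is a minimal userland sketch of that arithmetic. The name model_is_overflowing() is hypothetical, and the 16 MiB value for MODEL_MAXBLOCKSIZE is an assumption standing in for SPA_MAXBLOCKSIZE, which is not shown in this part of the listing.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed stand-in for SPA_MAXBLOCKSIZE. */
#define	MODEL_MAXBLOCKSIZE	(16ULL << 20)

/*
 * Report whether size has reached the target c plus the permitted
 * overflow, where the overflow is the larger of one maximum block and
 * c >> shift.
 */
static bool
model_is_overflowing(uint64_t size, uint64_t c, unsigned shift)
{
	uint64_t slack = c >> shift;
	uint64_t overflow =
	    (slack > MODEL_MAXBLOCKSIZE) ? slack : MODEL_MAXBLOCKSIZE;

	return (size >= c + overflow);
}
```

With an assumed shift of 8, a 1 GiB target gives a computed slack of 4 MiB, so the one-block floor of 16 MiB applies; an 8 GiB target gives 32 MiB of slack, which exceeds the floor.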
4280 4305  static abd_t *
4281 4306  arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4282 4307  {
4283 4308          arc_buf_contents_t type = arc_buf_type(hdr);
4284 4309  
4285 4310          arc_get_data_impl(hdr, size, tag);
4286 4311          if (type == ARC_BUFC_METADATA) {
4287 4312                  return (abd_alloc(size, B_TRUE));
4288 4313          } else {
4289 4314                  ASSERT(type == ARC_BUFC_DATA);
4290 4315                  return (abd_alloc(size, B_FALSE));
4291 4316          }
4292 4317  }
4293 4318  
4294 4319  static void *
4295 4320  arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4296 4321  {
4297 4322          arc_buf_contents_t type = arc_buf_type(hdr);
4298 4323  
4299 4324          arc_get_data_impl(hdr, size, tag);
4300 4325          if (type == ARC_BUFC_METADATA) {
4301 4326                  return (zio_buf_alloc(size));
4302 4327          } else {
4303 4328                  ASSERT(type == ARC_BUFC_DATA);
4304 4329                  return (zio_data_buf_alloc(size));
4305 4330          }
4306 4331  }
4307 4332  
4308 4333  /*
4309 4334   * Allocate a block and return it to the caller. If we are hitting the
4310 4335   * hard limit for the cache size, we must sleep, waiting for the eviction
4311 4336   * thread to catch up. If we're past the target size but below the hard
4312 4337   * limit, we'll only signal the reclaim thread and continue on.
4313 4338   */
4314 4339  static void
4315 4340  arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4316 4341  {
4317 4342          arc_state_t *state = hdr->b_l1hdr.b_state;
4318 4343          arc_buf_contents_t type = arc_buf_type(hdr);
4319 4344  
4320 4345          arc_adapt(size, state);
4321 4346  
4322 4347          /*
4323 4348           * If arc_size is currently overflowing, and has grown past our
4324 4349           * upper limit, we must be adding data faster than the evict
4325 4350           * thread can evict. Thus, to ensure we don't compound the
4326 4351           * problem by adding more data and forcing arc_size to grow even
4327 4352           * further past its target size, we halt and wait for the
4328 4353           * eviction thread to catch up.
4329 4354           *
4330 4355           * It's also possible that the reclaim thread is unable to evict
4331 4356           * enough buffers to get arc_size below the overflow limit (e.g.
4332 4357           * due to buffers being un-evictable, or hash lock collisions).
4333 4358           * In this case, we want to proceed regardless if we're
4334 4359           * overflowing; thus we don't use a while loop here.
4335 4360           */
4336 4361          if (arc_is_overflowing()) {
4337 4362                  mutex_enter(&arc_reclaim_lock);
4338 4363  
4339 4364                  /*
4340 4365                   * Now that we've acquired the lock, we may no longer be
4341 4366                   * over the overflow limit, let's check.
4342 4367                   *
4343 4368                   * We're ignoring the case of spurious wake ups. If that
4344 4369                   * were to happen, it'd let this thread consume an ARC
4345 4370                   * buffer before it should have (i.e. before we're under
4346 4371                   * the overflow limit and were signalled by the reclaim
4347 4372                   * thread). As long as that is a rare occurrence, it
4348 4373                   * shouldn't cause any harm.
4349 4374                   */
4350 4375                  if (arc_is_overflowing()) {
4351 4376                          cv_signal(&arc_reclaim_thread_cv);
4352 4377                          cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
4353 4378                  }
4354 4379  
4355 4380                  mutex_exit(&arc_reclaim_lock);
4356 4381          }
4357 4382  
4358 4383          VERIFY3U(hdr->b_type, ==, type);
4359 4384          if (type == ARC_BUFC_METADATA) {
4360 4385                  arc_space_consume(size, ARC_SPACE_META);
4361 4386          } else {
4362 4387                  arc_space_consume(size, ARC_SPACE_DATA);
4363 4388          }
4364 4389  
4365 4390          /*
4366 4391           * Update the state size.  Note that ghost states have a
4367 4392           * "ghost size" and so don't need to be updated.
4368 4393           */
4369 4394          if (!GHOST_STATE(state)) {
4370 4395  
4371 4396                  (void) refcount_add_many(&state->arcs_size, size, tag);
4372 4397  
4373 4398                  /*
4374 4399                   * If this is reached via arc_read, the link is
4375 4400                   * protected by the hash lock. If reached via
4376 4401                   * arc_buf_alloc, the header should not be accessed by
4377 4402                   * any other thread. And, if reached via arc_read_done,
4378 4403                   * the hash lock will protect it if it's found in the
4379 4404                   * hash table; otherwise no other thread should be
4380 4405                   * trying to [add|remove]_reference it.
4381 4406                   */
4382 4407                  if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4383 4408                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4384 4409                          (void) refcount_add_many(&state->arcs_esize[type],
4385 4410                              size, tag);
4386 4411                  }
4387 4412  
4388 4413                  /*
4389 4414                   * If we are growing the cache, and we are adding anonymous
4390 4415                   * data, and we have outgrown arc_p, update arc_p
4391 4416                   */
4392 4417                  if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
4393 4418                      (refcount_count(&arc_anon->arcs_size) +
4394 4419                      refcount_count(&arc_mru->arcs_size) > arc_p))
4395 4420                          arc_p = MIN(arc_c, arc_p + size);
4396 4421          }
4397 4422  }
4398 4423  
4399 4424  static void
4400 4425  arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
4401 4426  {
4402 4427          arc_free_data_impl(hdr, size, tag);
4403 4428          abd_free(abd);
4404 4429  }
4405 4430  
4406 4431  static void
4407 4432  arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
4408 4433  {
4409 4434          arc_buf_contents_t type = arc_buf_type(hdr);
4410 4435  
4411 4436          arc_free_data_impl(hdr, size, tag);
4412 4437          if (type == ARC_BUFC_METADATA) {
4413 4438                  zio_buf_free(buf, size);
4414 4439          } else {
4415 4440                  ASSERT(type == ARC_BUFC_DATA);
4416 4441                  zio_data_buf_free(buf, size);
4417 4442          }
4418 4443  }
4419 4444  
4420 4445  /*
4421 4446   * Free the arc data buffer.
4422 4447   */
4423 4448  static void
4424 4449  arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4425 4450  {
4426 4451          arc_state_t *state = hdr->b_l1hdr.b_state;
4427 4452          arc_buf_contents_t type = arc_buf_type(hdr);
4428 4453  
4429 4454          /* protected by hash lock, if in the hash table */
4430 4455          if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4431 4456                  ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4432 4457                  ASSERT(state != arc_anon && state != arc_l2c_only);
4433 4458  
4434 4459                  (void) refcount_remove_many(&state->arcs_esize[type],
4435 4460                      size, tag);
4436 4461          }
4437 4462          (void) refcount_remove_many(&state->arcs_size, size, tag);
4438 4463  
4439 4464          VERIFY3U(hdr->b_type, ==, type);
4440 4465          if (type == ARC_BUFC_METADATA) {
4441 4466                  arc_space_return(size, ARC_SPACE_META);
4442 4467          } else {
4443 4468                  ASSERT(type == ARC_BUFC_DATA);
4444 4469                  arc_space_return(size, ARC_SPACE_DATA);
4445 4470          }
4446 4471  }
4447 4472  
4448 4473  /*
4449 4474   * This routine is called whenever a buffer is accessed.
4450 4475   * NOTE: the hash lock is dropped in this function.
4451 4476   */
4452 4477  static void
4453 4478  arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
4454 4479  {
4455 4480          clock_t now;
4456 4481  
4457 4482          ASSERT(MUTEX_HELD(hash_lock));
4458 4483          ASSERT(HDR_HAS_L1HDR(hdr));
4459 4484  
4460 4485          if (hdr->b_l1hdr.b_state == arc_anon) {
4461 4486                  /*
4462 4487                   * This buffer is not in the cache, and does not
4463 4488                   * appear in our "ghost" list.  Add the new buffer
4464 4489                   * to the MRU state.
4465 4490                   */
4466 4491  
4467 4492                  ASSERT0(hdr->b_l1hdr.b_arc_access);
4468 4493                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4469 4494                  DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
4470 4495                  arc_change_state(arc_mru, hdr, hash_lock);
4471 4496  
4472 4497          } else if (hdr->b_l1hdr.b_state == arc_mru) {
4473 4498                  now = ddi_get_lbolt();
4474 4499  
4475 4500                  /*
4476 4501                   * If this buffer is here because of a prefetch, then either:
4477 4502                   * - clear the flag if this is a "referencing" read
4478 4503                   *   (any subsequent access will bump this into the MFU state).
4479 4504                   * or
4480 4505                   * - move the buffer to the head of the list if this is
4481 4506                   *   another prefetch (to make it less likely to be evicted).
4482 4507                   */
4483 4508                  if (HDR_PREFETCH(hdr)) {
4484 4509                          if (refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4485 4510                                  /* link protected by hash lock */
4486 4511                                  ASSERT(multilist_link_active(
4487 4512                                      &hdr->b_l1hdr.b_arc_node));
4488 4513                          } else {
4489 4514                                  arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
4490 4515                                  ARCSTAT_BUMP(arcstat_mru_hits);
4491 4516                          }
4492 4517                          hdr->b_l1hdr.b_arc_access = now;
4493 4518                          return;
4494 4519                  }
4495 4520  
4496 4521                  /*
4497 4522                   * This buffer has been "accessed" only once so far,
4498 4523                   * but it is still in the cache. Move it to the MFU
4499 4524                   * state.
4500 4525                   */
4501 4526                  if (now > hdr->b_l1hdr.b_arc_access + ARC_MINTIME) {
4502 4527                          /*
4503 4528                           * More than ARC_MINTIME has passed since we
4504 4529                           * instantiated this buffer.  Move it to the
4505 4530                           * most frequently used state.
4506 4531                           */
4507 4532                          hdr->b_l1hdr.b_arc_access = now;
4508 4533                          DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4509 4534                          arc_change_state(arc_mfu, hdr, hash_lock);
4510 4535                  }
4511 4536                  ARCSTAT_BUMP(arcstat_mru_hits);
4512 4537          } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
4513 4538                  arc_state_t     *new_state;
4514 4539                  /*
4515 4540                   * This buffer has been "accessed" recently, but
4516 4541                   * was evicted from the cache.  Move it to the
4517 4542                   * MFU state, or to the MRU state if this is a prefetch.
4518 4543                   */
4519 4544  
4520 4545                  if (HDR_PREFETCH(hdr)) {
4521 4546                          new_state = arc_mru;
4522 4547                          if (refcount_count(&hdr->b_l1hdr.b_refcnt) > 0)
4523 4548                                  arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
4524 4549                          DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
4525 4550                  } else {
4526 4551                          new_state = arc_mfu;
4527 4552                          DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4528 4553                  }
4529 4554  
4530 4555                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4531 4556                  arc_change_state(new_state, hdr, hash_lock);
4532 4557  
4533 4558                  ARCSTAT_BUMP(arcstat_mru_ghost_hits);
4534 4559          } else if (hdr->b_l1hdr.b_state == arc_mfu) {
4535 4560                  /*
4536 4561                   * This buffer has been accessed more than once and is
4537 4562                   * still in the cache.  Keep it in the MFU state.
4538 4563                   *
4539 4564                   * NOTE: an add_reference() that occurred when we did
4540 4565                   * the arc_read() will have kicked this off the list.
4541 4566                   * If it was a prefetch, we will explicitly move it to
4542 4567                   * the head of the list now.
4543 4568                   */
4544 4569                  if ((HDR_PREFETCH(hdr)) != 0) {
4545 4570                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4546 4571                          /* link protected by hash_lock */
4547 4572                          ASSERT(multilist_link_active(&hdr->b_l1hdr.b_arc_node));
4548 4573                  }
4549 4574                  ARCSTAT_BUMP(arcstat_mfu_hits);
4550 4575                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4551 4576          } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
4552 4577                  arc_state_t     *new_state = arc_mfu;
4553 4578                  /*
4554 4579                   * This buffer has been accessed more than once but has
4555 4580                   * been evicted from the cache.  Move it back to the
4556 4581                   * MFU state.
4557 4582                   */
4558 4583  
4559 4584                  if (HDR_PREFETCH(hdr)) {
4560 4585                          /*
4561 4586                           * This is a prefetch access...
4562 4587                           * move this block back to the MRU state.
4563 4588                           */
4564 4589                          ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
4565 4590                          new_state = arc_mru;
4566 4591                  }
4567 4592  
4568 4593                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4569 4594                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4570 4595                  arc_change_state(new_state, hdr, hash_lock);
4571 4596  
4572 4597                  ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
4573 4598          } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
4574 4599                  /*
4575 4600                   * This buffer is on the 2nd Level ARC.
4576 4601                   */
4577 4602  
4578 4603                  hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4579 4604                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4580 4605                  arc_change_state(arc_mfu, hdr, hash_lock);
4581 4606          } else {
4582 4607                  ASSERT(!"invalid arc state");
4583 4608          }
4584 4609  }
4585 4610  
4586 4611  /* a generic arc_done_func_t which you can use */
4587 4612  /* ARGSUSED */
4588 4613  void
4589 4614  arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
4590 4615  {
4591 4616          if (zio == NULL || zio->io_error == 0)
4592 4617                  bcopy(buf->b_data, arg, arc_buf_size(buf));
4593 4618          arc_buf_destroy(buf, arg);
4594 4619  }
4595 4620  
4596 4621  /* a generic arc_done_func_t */
4597 4622  void
4598 4623  arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
4599 4624  {
4600 4625          arc_buf_t **bufp = arg;
4601 4626          if (zio && zio->io_error) {
4602 4627                  arc_buf_destroy(buf, arg);
4603 4628                  *bufp = NULL;
4604 4629          } else {
4605 4630                  *bufp = buf;
4606 4631                  ASSERT(buf->b_data);
4607 4632          }
4608 4633  }
4609 4634  
4610 4635  static void
4611 4636  arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)
4612 4637  {
4613 4638          if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
4614 4639                  ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
4615 4640                  ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
4616 4641          } else {
4617 4642                  if (HDR_COMPRESSION_ENABLED(hdr)) {
4618 4643                          ASSERT3U(HDR_GET_COMPRESS(hdr), ==,
4619 4644                              BP_GET_COMPRESS(bp));
4620 4645                  }
4621 4646                  ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
4622 4647                  ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
4623 4648          }
4624 4649  }
4625 4650  
4626 4651  static void
4627 4652  arc_read_done(zio_t *zio)
4628 4653  {
4629 4654          arc_buf_hdr_t   *hdr = zio->io_private;
4630 4655          kmutex_t        *hash_lock = NULL;
4631 4656          arc_callback_t  *callback_list;
4632 4657          arc_callback_t  *acb;
4633 4658          boolean_t       freeable = B_FALSE;
4634 4659          boolean_t       no_zio_error = (zio->io_error == 0);
4635 4660  
4636 4661          /*
4637 4662           * The hdr was inserted into the hash table and removed from lists
4638 4663           * prior to starting I/O.  We should find this header, since
4639 4664           * it's in the hash table, and it should be legit since it's
4640 4665           * not possible to evict it during the I/O.  The only possible
4641 4666           * reason for it not to be found is if we were freed during the
4642 4667           * read.
4643 4668           */
4644 4669          if (HDR_IN_HASH_TABLE(hdr)) {
4645 4670                  ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
4646 4671                  ASSERT3U(hdr->b_dva.dva_word[0], ==,
4647 4672                      BP_IDENTITY(zio->io_bp)->dva_word[0]);
4648 4673                  ASSERT3U(hdr->b_dva.dva_word[1], ==,
4649 4674                      BP_IDENTITY(zio->io_bp)->dva_word[1]);
4650 4675  
4651 4676                  arc_buf_hdr_t *found = buf_hash_find(hdr->b_spa, zio->io_bp,
4652 4677                      &hash_lock);
4653 4678  
4654 4679                  ASSERT((found == hdr &&
4655 4680                      DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
4656 4681                      (found == hdr && HDR_L2_READING(hdr)));
4657 4682                  ASSERT3P(hash_lock, !=, NULL);
4658 4683          }
4659 4684  
4660 4685          if (no_zio_error) {
4661 4686                  /* byteswap if necessary */
4662 4687                  if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
4663 4688                          if (BP_GET_LEVEL(zio->io_bp) > 0) {
4664 4689                                  hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
4665 4690                          } else {
4666 4691                                  hdr->b_l1hdr.b_byteswap =
4667 4692                                      DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
4668 4693                          }
4669 4694                  } else {
4670 4695                          hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
4671 4696                  }
4672 4697          }
4673 4698  
4674 4699          arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
4675 4700          if (l2arc_noprefetch && HDR_PREFETCH(hdr))
4676 4701                  arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
4677 4702  
4678 4703          callback_list = hdr->b_l1hdr.b_acb;
4679 4704          ASSERT3P(callback_list, !=, NULL);
4680 4705  
4681 4706          if (hash_lock && no_zio_error && hdr->b_l1hdr.b_state == arc_anon) {
4682 4707                  /*
4683 4708                   * Only call arc_access on anonymous buffers.  This is because
4684 4709                   * if we've issued an I/O for an evicted buffer, we've already
4685 4710                   * called arc_access (to prevent any simultaneous readers from
4686 4711                   * getting confused).
4687 4712                   */
4688 4713                  arc_access(hdr, hash_lock);
4689 4714          }
4690 4715  
4691 4716          /*
4692 4717           * If a read request has a callback (i.e. acb_done is not NULL), then we
4693 4718           * make a buf containing the data according to the parameters which were
4694 4719           * passed in. The implementation of arc_buf_alloc_impl() ensures that we
4695 4720           * aren't needlessly decompressing the data multiple times.
4696 4721           */
4697 4722          int callback_cnt = 0;
4698 4723          for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
4699 4724                  if (!acb->acb_done)
4700 4725                          continue;
4701 4726  
4702 4727                  /* This is a demand read since prefetches don't use callbacks */
4703 4728                  callback_cnt++;
4704 4729  
4705 4730                  int error = arc_buf_alloc_impl(hdr, acb->acb_private,
4706 4731                      acb->acb_compressed, no_zio_error, &acb->acb_buf);
4707 4732                  if (no_zio_error) {
4708 4733                          zio->io_error = error;
4709 4734                  }
4710 4735          }
4711 4736          hdr->b_l1hdr.b_acb = NULL;
4712 4737          arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
4713 4738          if (callback_cnt == 0) {
4714 4739                  ASSERT(HDR_PREFETCH(hdr));
4715 4740                  ASSERT0(hdr->b_l1hdr.b_bufcnt);
4716 4741                  ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
4717 4742          }
4718 4743  
4719 4744          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
4720 4745              callback_list != NULL);
4721 4746  
4722 4747          if (no_zio_error) {
4723 4748                  arc_hdr_verify(hdr, zio->io_bp);
4724 4749          } else {
4725 4750                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
4726 4751                  if (hdr->b_l1hdr.b_state != arc_anon)
4727 4752                          arc_change_state(arc_anon, hdr, hash_lock);
4728 4753                  if (HDR_IN_HASH_TABLE(hdr))
4729 4754                          buf_hash_remove(hdr);
4730 4755                  freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4731 4756          }
4732 4757  
4733 4758          /*
4734 4759           * Broadcast before we drop the hash_lock to avoid the possibility
4735 4760           * that the hdr (and hence the cv) might be freed before we get to
4736 4761           * the cv_broadcast().
4737 4762           */
4738 4763          cv_broadcast(&hdr->b_l1hdr.b_cv);
4739 4764  
4740 4765          if (hash_lock != NULL) {
4741 4766                  mutex_exit(hash_lock);
4742 4767          } else {
4743 4768                  /*
4744 4769                   * This block was freed while we waited for the read to
4745 4770                   * complete.  It has been removed from the hash table and
4746 4771                   * moved to the anonymous state (so that it won't show up
4747 4772                   * in the cache).
4748 4773                   */
4749 4774                  ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
4750 4775                  freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4751 4776          }
4752 4777  
4753 4778          /* execute each callback and free its structure */
4754 4779          while ((acb = callback_list) != NULL) {
4755 4780                  if (acb->acb_done)
4756 4781                          acb->acb_done(zio, acb->acb_buf, acb->acb_private);
4757 4782  
4758 4783                  if (acb->acb_zio_dummy != NULL) {
4759 4784                          acb->acb_zio_dummy->io_error = zio->io_error;
4760 4785                          zio_nowait(acb->acb_zio_dummy);
4761 4786                  }
4762 4787  
4763 4788                  callback_list = acb->acb_next;
4764 4789                  kmem_free(acb, sizeof (arc_callback_t));
4765 4790          }
4766 4791  
4767 4792          if (freeable)
4768 4793                  arc_hdr_destroy(hdr);
4769 4794  }
4770 4795  
4771 4796  /*
4772 4797   * "Read" the block at the specified DVA (in bp) via the
4773 4798   * cache.  If the block is found in the cache, invoke the provided
4774 4799   * callback immediately and return.  Note that the `zio' parameter
4775 4800   * in the callback will be NULL in this case, since no IO was
4776 4801   * required.  If the block is not in the cache pass the read request
4777 4802   * on to the spa with a substitute callback function, so that the
4778 4803   * requested block will be added to the cache.
4779 4804   *
4780 4805   * If a read request arrives for a block that has a read in-progress,
4781 4806   * either wait for the in-progress read to complete (and return the
4782 4807   * results); or, if this is a read with a "done" func, add a record
4783 4808   * to the read to invoke the "done" func when the read completes,
4784 4809   * and return; or just return.
4785 4810   *
4786 4811   * arc_read_done() will invoke all the requested "done" functions
4787 4812   * for readers of this block.
4788 4813   */
4789 4814  int
4790 4815  arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
4791 4816      void *private, zio_priority_t priority, int zio_flags,
4792 4817      arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
4793 4818  {
4794 4819          arc_buf_hdr_t *hdr = NULL;
4795 4820          kmutex_t *hash_lock = NULL;
4796 4821          zio_t *rzio;
4797 4822          uint64_t guid = spa_load_guid(spa);
4798 4823          boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW) != 0;
4799 4824  
4800 4825          ASSERT(!BP_IS_EMBEDDED(bp) ||
4801 4826              BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
4802 4827  
4803 4828  top:
4804 4829          if (!BP_IS_EMBEDDED(bp)) {
4805 4830                  /*
4806 4831                   * Embedded BP's have no DVA and require no I/O to "read".
4807 4832                   * Create an anonymous arc buf to back it.
4808 4833                   */
4809 4834                  hdr = buf_hash_find(guid, bp, &hash_lock);
4810 4835          }
4811 4836  
4812 4837          if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pabd != NULL) {
4813 4838                  arc_buf_t *buf = NULL;
4814 4839                  *arc_flags |= ARC_FLAG_CACHED;
4815 4840  
4816 4841                  if (HDR_IO_IN_PROGRESS(hdr)) {
4817 4842  
4818 4843                          if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
4819 4844                              priority == ZIO_PRIORITY_SYNC_READ) {
4820 4845                                  /*
4821 4846                                   * This sync read must wait for an
4822 4847                                   * in-progress async read (e.g. a predictive
4823 4848                                   * prefetch).  Async reads are queued
4824 4849                                   * separately at the vdev_queue layer, so
4825 4850                                   * this is a form of priority inversion.
4826 4851                                   * Ideally, we would "inherit" the demand
4827 4852                                   * i/o's priority by moving the i/o from
4828 4853                                   * the async queue to the synchronous queue,
4829 4854                                   * but there is currently no mechanism to do
4830 4855                                   * so.  Track this so that we can evaluate
4831 4856                                   * the magnitude of this potential performance
4832 4857                                   * problem.
4833 4858                                   *
4834 4859                                   * Note that if the prefetch i/o is already
4835 4860                                   * active (has been issued to the device),
4836 4861                                   * the prefetch improved performance, because
4837 4862                                   * we issued it sooner than we would have
4838 4863                                   * without the prefetch.
4839 4864                                   */
4840 4865                                  DTRACE_PROBE1(arc__sync__wait__for__async,
4841 4866                                      arc_buf_hdr_t *, hdr);
4842 4867                                  ARCSTAT_BUMP(arcstat_sync_wait_for_async);
4843 4868                          }
4844 4869                          if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
4845 4870                                  arc_hdr_clear_flags(hdr,
4846 4871                                      ARC_FLAG_PREDICTIVE_PREFETCH);
4847 4872                          }
4848 4873  
4849 4874                          if (*arc_flags & ARC_FLAG_WAIT) {
4850 4875                                  cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
4851 4876                                  mutex_exit(hash_lock);
4852 4877                                  goto top;
4853 4878                          }
4854 4879                          ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
4855 4880  
4856 4881                          if (done) {
4857 4882                                  arc_callback_t *acb = NULL;
4858 4883  
4859 4884                                  acb = kmem_zalloc(sizeof (arc_callback_t),
4860 4885                                      KM_SLEEP);
4861 4886                                  acb->acb_done = done;
4862 4887                                  acb->acb_private = private;
4863 4888                                  acb->acb_compressed = compressed_read;
4864 4889                                  if (pio != NULL)
4865 4890                                          acb->acb_zio_dummy = zio_null(pio,
4866 4891                                              spa, NULL, NULL, NULL, zio_flags);
4867 4892  
4868 4893                                  ASSERT3P(acb->acb_done, !=, NULL);
4869 4894                                  acb->acb_next = hdr->b_l1hdr.b_acb;
4870 4895                                  hdr->b_l1hdr.b_acb = acb;
4871 4896                                  mutex_exit(hash_lock);
4872 4897                                  return (0);
4873 4898                          }
4874 4899                          mutex_exit(hash_lock);
4875 4900                          return (0);
4876 4901                  }
4877 4902  
4878 4903                  ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
4879 4904                      hdr->b_l1hdr.b_state == arc_mfu);
4880 4905  
4881 4906                  if (done) {
4882 4907                          if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
4883 4908                                  /*
4884 4909                                   * This is a demand read which does not have to
4885 4910                                   * wait for i/o because we did a predictive
4886 4911                                   * prefetch i/o for it, which has completed.
4887 4912                                   */
4888 4913                                  DTRACE_PROBE1(
4889 4914                                      arc__demand__hit__predictive__prefetch,
4890 4915                                      arc_buf_hdr_t *, hdr);
4891 4916                                  ARCSTAT_BUMP(
4892 4917                                      arcstat_demand_hit_predictive_prefetch);
4893 4918                                  arc_hdr_clear_flags(hdr,
4894 4919                                      ARC_FLAG_PREDICTIVE_PREFETCH);
4895 4920                          }
4896 4921                          ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));
4897 4922  
4898 4923                          /* Get a buf with the desired data in it. */
4899 4924                          VERIFY0(arc_buf_alloc_impl(hdr, private,
4900 4925                              compressed_read, B_TRUE, &buf));
4901 4926                  } else if (*arc_flags & ARC_FLAG_PREFETCH &&
4902 4927                      refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4903 4928                          arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4904 4929                  }
4905 4930                  DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
4906 4931                  arc_access(hdr, hash_lock);
4907 4932                  if (*arc_flags & ARC_FLAG_L2CACHE)
4908 4933                          arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4909 4934                  mutex_exit(hash_lock);
4910 4935                  ARCSTAT_BUMP(arcstat_hits);
4911 4936                  ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
4912 4937                      demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
4913 4938                      data, metadata, hits);
4914 4939  
4915 4940                  if (done)
4916 4941                          done(NULL, buf, private);
4917 4942          } else {
4918 4943                  uint64_t lsize = BP_GET_LSIZE(bp);
4919 4944                  uint64_t psize = BP_GET_PSIZE(bp);
4920 4945                  arc_callback_t *acb;
4921 4946                  vdev_t *vd = NULL;
4922 4947                  uint64_t addr = 0;
4923 4948                  boolean_t devw = B_FALSE;
4924 4949                  uint64_t size;
4925 4950  
4926 4951                  if (hdr == NULL) {
4927 4952                          /* this block is not in the cache */
4928 4953                          arc_buf_hdr_t *exists = NULL;
4929 4954                          arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
4930 4955                          hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
4931 4956                              BP_GET_COMPRESS(bp), type);
4932 4957  
4933 4958                          if (!BP_IS_EMBEDDED(bp)) {
4934 4959                                  hdr->b_dva = *BP_IDENTITY(bp);
4935 4960                                  hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
4936 4961                                  exists = buf_hash_insert(hdr, &hash_lock);
4937 4962                          }
4938 4963                          if (exists != NULL) {
4939 4964                                  /* somebody beat us to the hash insert */
4940 4965                                  mutex_exit(hash_lock);
4941 4966                                  buf_discard_identity(hdr);
4942 4967                                  arc_hdr_destroy(hdr);
4943 4968                                  goto top; /* restart the IO request */
4944 4969                          }
4945 4970                  } else {
4946 4971                          /*
4947 4972                           * This block is in the ghost cache. If it was L2-only
4948 4973                           * (and thus didn't have an L1 hdr), we realloc the
4949 4974                           * header to add an L1 hdr.
4950 4975                           */
4951 4976                          if (!HDR_HAS_L1HDR(hdr)) {
4952 4977                                  hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
4953 4978                                      hdr_full_cache);
4954 4979                          }
4955 4980                          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
4956 4981                          ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
4957 4982                          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
4958 4983                          ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4959 4984                          ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
4960 4985                          ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
4961 4986  
4962 4987                          /*
4963 4988                           * This is a delicate dance that we play here.
4964 4989                           * This hdr is in the ghost list so we access it
4965 4990                           * to move it out of the ghost list before we
4966 4991                           * initiate the read. If it's a prefetch then
4967 4992                           * it won't have a callback so we'll remove the
4968 4993                           * reference that arc_buf_alloc_impl() created. We
4969 4994                           * do this after we've called arc_access() to
4970 4995                           * avoid hitting an assert in remove_reference().
4971 4996                           */
4972 4997                          arc_access(hdr, hash_lock);
4973 4998                          arc_hdr_alloc_pabd(hdr);
4974 4999                  }
4975 5000                  ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
4976 5001                  size = arc_hdr_size(hdr);
4977 5002  
4978 5003                  /*
4979 5004                   * If compression is enabled on the hdr, then we will do
4980 5005                   * RAW I/O and will store the compressed data in the hdr's
4981 5006                   * data block. Otherwise, the hdr's data block will contain
4982 5007                   * the uncompressed data.
4983 5008                   */
4984 5009                  if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
4985 5010                          zio_flags |= ZIO_FLAG_RAW;
4986 5011                  }
4987 5012  
4988 5013                  if (*arc_flags & ARC_FLAG_PREFETCH)
4989 5014                          arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4990 5015                  if (*arc_flags & ARC_FLAG_L2CACHE)
4991 5016                          arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4992 5017                  if (BP_GET_LEVEL(bp) > 0)
4993 5018                          arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
4994 5019                  if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
4995 5020                          arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
4996 5021                  ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
4997 5022  
4998 5023                  acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
4999 5024                  acb->acb_done = done;
5000 5025                  acb->acb_private = private;
5001 5026                  acb->acb_compressed = compressed_read;
5002 5027  
5003 5028                  ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5004 5029                  hdr->b_l1hdr.b_acb = acb;
5005 5030                  arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5006 5031  
5007 5032                  if (HDR_HAS_L2HDR(hdr) &&
5008 5033                      (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5009 5034                          devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5010 5035                          addr = hdr->b_l2hdr.b_daddr;
5011 5036                          /*
5012 5037                           * Lock out L2ARC device removal.
5013 5038                           */
5014 5039                          if (vdev_is_dead(vd) ||
5015 5040                              !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5016 5041                                  vd = NULL;
5017 5042                  }
5018 5043  
5019 5044                  if (priority == ZIO_PRIORITY_ASYNC_READ)
5020 5045                          arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5021 5046                  else
5022 5047                          arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5023 5048  
5024 5049                  if (hash_lock != NULL)
5025 5050                          mutex_exit(hash_lock);
5026 5051  
5027 5052                  /*
5028 5053                   * At this point, we have a level 1 cache miss.  Try again in
5029 5054                   * L2ARC if possible.
5030 5055                   */
5031 5056                  ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5032 5057  
5033 5058                  DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5034 5059                      uint64_t, lsize, zbookmark_phys_t *, zb);
5035 5060                  ARCSTAT_BUMP(arcstat_misses);
5036 5061                  ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
5037 5062                      demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
5038 5063                      data, metadata, misses);
5039 5064  
5040 5065                  if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5041 5066                          /*
5042 5067                           * Read from the L2ARC if the following are true:
5043 5068                           * 1. The L2ARC vdev was previously cached.
5044 5069                           * 2. This buffer still has L2ARC metadata.
5045 5070                           * 3. This buffer isn't currently writing to the L2ARC.
5046 5071                           * 4. The L2ARC entry wasn't evicted, which may
5047 5072                           *    also have invalidated the vdev.
5048 5073                           * 5. This isn't a prefetch while l2arc_noprefetch is set.
5049 5074                           */
5050 5075                          if (HDR_HAS_L2HDR(hdr) &&
5051 5076                              !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5052 5077                              !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5053 5078                                  l2arc_read_callback_t *cb;
5054 5079                                  abd_t *abd;
5055 5080                                  uint64_t asize;
5056 5081  
5057 5082                                  DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5058 5083                                  ARCSTAT_BUMP(arcstat_l2_hits);
5059 5084  
5060 5085                                  cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5061 5086                                      KM_SLEEP);
5062 5087                                  cb->l2rcb_hdr = hdr;
5063 5088                                  cb->l2rcb_bp = *bp;
5064 5089                                  cb->l2rcb_zb = *zb;
5065 5090                                  cb->l2rcb_flags = zio_flags;
5066 5091  
5067 5092                                  asize = vdev_psize_to_asize(vd, size);
5068 5093                                  if (asize != size) {
5069 5094                                          abd = abd_alloc_for_io(asize,
5070 5095                                              HDR_ISTYPE_METADATA(hdr));
5071 5096                                          cb->l2rcb_abd = abd;
5072 5097                                  } else {
5073 5098                                          abd = hdr->b_l1hdr.b_pabd;
5074 5099                                  }
5075 5100  
5076 5101                                  ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5077 5102                                      addr + asize <= vd->vdev_psize -
5078 5103                                      VDEV_LABEL_END_SIZE);
5079 5104  
5080 5105                                  /*
5081 5106                                   * l2arc read.  The SCL_L2ARC lock will be
5082 5107                                   * released by l2arc_read_done().
5083 5108                                   * Issue a null zio if the underlying buffer
5084 5109                                   * was squashed to zero size by compression.
5085 5110                                   */
5086 5111                                  ASSERT3U(HDR_GET_COMPRESS(hdr), !=,
5087 5112                                      ZIO_COMPRESS_EMPTY);
5088 5113                                  rzio = zio_read_phys(pio, vd, addr,
5089 5114                                      asize, abd,
5090 5115                                      ZIO_CHECKSUM_OFF,
5091 5116                                      l2arc_read_done, cb, priority,
5092 5117                                      zio_flags | ZIO_FLAG_DONT_CACHE |
5093 5118                                      ZIO_FLAG_CANFAIL |
5094 5119                                      ZIO_FLAG_DONT_PROPAGATE |
5095 5120                                      ZIO_FLAG_DONT_RETRY, B_FALSE);
5096 5121                                  DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5097 5122                                      zio_t *, rzio);
5098 5123                                  ARCSTAT_INCR(arcstat_l2_read_bytes, size);
5099 5124  
5100 5125                                  if (*arc_flags & ARC_FLAG_NOWAIT) {
5101 5126                                          zio_nowait(rzio);
5102 5127                                          return (0);
5103 5128                                  }
5104 5129  
5105 5130                                  ASSERT(*arc_flags & ARC_FLAG_WAIT);
5106 5131                                  if (zio_wait(rzio) == 0)
5107 5132                                          return (0);
5108 5133  
5109 5134                                  /* l2arc read error; fall back to zio_read() */
5110 5135                          } else {
5111 5136                                  DTRACE_PROBE1(l2arc__miss,
5112 5137                                      arc_buf_hdr_t *, hdr);
5113 5138                                  ARCSTAT_BUMP(arcstat_l2_misses);
5114 5139                                  if (HDR_L2_WRITING(hdr))
5115 5140                                          ARCSTAT_BUMP(arcstat_l2_rw_clash);
5116 5141                                  spa_config_exit(spa, SCL_L2ARC, vd);
5117 5142                          }
5118 5143                  } else {
5119 5144                          if (vd != NULL)
5120 5145                                  spa_config_exit(spa, SCL_L2ARC, vd);
5121 5146                          if (l2arc_ndev != 0) {
5122 5147                                  DTRACE_PROBE1(l2arc__miss,
5123 5148                                      arc_buf_hdr_t *, hdr);
5124 5149                                  ARCSTAT_BUMP(arcstat_l2_misses);
5125 5150                          }
5126 5151                  }
5127 5152  
5128 5153                  rzio = zio_read(pio, spa, bp, hdr->b_l1hdr.b_pabd, size,
5129 5154                      arc_read_done, hdr, priority, zio_flags, zb);
5130 5155  
5131 5156                  if (*arc_flags & ARC_FLAG_WAIT)
5132 5157                          return (zio_wait(rzio));
5133 5158  
5134 5159                  ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
5135 5160                  zio_nowait(rzio);
5136 5161          }
5137 5162          return (0);
5138 5163  }
5139 5164  
5140 5165  /*
5141 5166   * Notify the arc that a block was freed, and thus will never be used again.
5142 5167   */
5143 5168  void
5144 5169  arc_freed(spa_t *spa, const blkptr_t *bp)
5145 5170  {
5146 5171          arc_buf_hdr_t *hdr;
5147 5172          kmutex_t *hash_lock;
5148 5173          uint64_t guid = spa_load_guid(spa);
5149 5174  
5150 5175          ASSERT(!BP_IS_EMBEDDED(bp));
5151 5176  
5152 5177          hdr = buf_hash_find(guid, bp, &hash_lock);
5153 5178          if (hdr == NULL)
5154 5179                  return;
5155 5180  
5156 5181          /*
5157 5182           * We might be trying to free a block that is still doing I/O
5158 5183           * (e.g. prefetch) or has a reference (e.g. a dedup-ed,
5159 5184           * dmu_sync-ed block). If this block is being prefetched, then it
5160 5185           * would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr
5161 5186           * until the I/O completes. A block may also have a reference if it is
5162 5187           * part of a dedup-ed, dmu_synced write. The dmu_sync() function would
5163 5188           * have written the new block to its final resting place on disk but
5164 5189           * without the dedup flag set. This would have left the hdr in the MRU
5165 5190           * state and discoverable. When the txg finally syncs it detects that
5166 5191           * the block was overridden in open context and issues an override I/O.
5167 5192           * Since this is a dedup block, the override I/O will determine if the
5168 5193           * block is already in the DDT. If so, then it will replace the io_bp
5169 5194           * with the bp from the DDT and allow the I/O to finish. When the I/O
5170 5195           * reaches the done callback, dbuf_write_override_done, it will
5171 5196           * check to see if the io_bp and io_bp_override are identical.
5172 5197           * If they are not, then it indicates that the bp was replaced with
5173 5198           * the bp in the DDT and the override bp is freed. This allows
5174 5199           * us to arrive here with a reference on a block that is being
5175 5200           * freed. So if we have an I/O in progress, or a reference to
5176 5201           * this hdr, then we don't destroy the hdr.
5177 5202           */
5178 5203          if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) &&
5179 5204              refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) {
5180 5205                  arc_change_state(arc_anon, hdr, hash_lock);
5181 5206                  arc_hdr_destroy(hdr);
5182 5207                  mutex_exit(hash_lock);
5183 5208          } else {
5184 5209                  mutex_exit(hash_lock);
5185 5210          }
5186 5211  
5187 5212  }
5188 5213  
5189 5214  /*
5190 5215   * Release this buffer from the cache, making it an anonymous buffer.  This
5191 5216   * must be done after a read and prior to modifying the buffer contents.
5192 5217   * If the buffer has more than one reference, we must make
5193 5218   * a new hdr for the buffer.
5194 5219   */
5195 5220  void
5196 5221  arc_release(arc_buf_t *buf, void *tag)
5197 5222  {
5198 5223          arc_buf_hdr_t *hdr = buf->b_hdr;
5199 5224  
5200 5225          /*
5201 5226           * It would be nice to assert that if it's DMU metadata (level >
5202 5227           * 0 || it's the dnode file), then it must be syncing context.
5203 5228           * But we don't know that information at this level.
5204 5229           */
5205 5230  
5206 5231          mutex_enter(&buf->b_evict_lock);
5207 5232  
5208 5233          ASSERT(HDR_HAS_L1HDR(hdr));
5209 5234  
5210 5235          /*
5211 5236           * We don't grab the hash lock prior to this check, because if
5212 5237           * the buffer's header is in the arc_anon state, it won't be
5213 5238           * linked into the hash table.
5214 5239           */
5215 5240          if (hdr->b_l1hdr.b_state == arc_anon) {
5216 5241                  mutex_exit(&buf->b_evict_lock);
5217 5242                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5218 5243                  ASSERT(!HDR_IN_HASH_TABLE(hdr));
5219 5244                  ASSERT(!HDR_HAS_L2HDR(hdr));
5220 5245                  ASSERT(HDR_EMPTY(hdr));
5221 5246  
5222 5247                  ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
5223 5248                  ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
5224 5249                  ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node));
5225 5250  
5226 5251                  hdr->b_l1hdr.b_arc_access = 0;
5227 5252  
5228 5253                  /*
5229 5254                   * If the buf is being overridden then it may already
5230 5255                   * have a hdr that is not empty.
5231 5256                   */
5232 5257                  buf_discard_identity(hdr);
5233 5258                  arc_buf_thaw(buf);
5234 5259  
5235 5260                  return;
5236 5261          }
5237 5262  
5238 5263          kmutex_t *hash_lock = HDR_LOCK(hdr);
5239 5264          mutex_enter(hash_lock);
5240 5265  
5241 5266          /*
5242 5267           * This assignment is only valid as long as the hash_lock is
5243 5268           * held; we must be careful not to reference state or the
5244 5269           * b_state field after dropping the lock.
5245 5270           */
5246 5271          arc_state_t *state = hdr->b_l1hdr.b_state;
5247 5272          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
5248 5273          ASSERT3P(state, !=, arc_anon);
5249 5274  
5250 5275          /* this buffer is not on any list */
5251 5276          ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);
5252 5277  
5253 5278          if (HDR_HAS_L2HDR(hdr)) {
5254 5279                  mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);
5255 5280  
5256 5281                  /*
5257 5282                   * We have to recheck this conditional again now that
5258 5283                   * we're holding the l2ad_mtx to prevent a race with
5259 5284                   * another thread which might be concurrently calling
5260 5285                   * l2arc_evict(). In that case, l2arc_evict() might have
5261 5286                   * destroyed the header's L2 portion as we were waiting
5262 5287                   * to acquire the l2ad_mtx.
5263 5288                   */
5264 5289                  if (HDR_HAS_L2HDR(hdr))
5265 5290                          arc_hdr_l2hdr_destroy(hdr);
5266 5291  
5267 5292                  mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
5268 5293          }
5269 5294  
5270 5295          /*
5271 5296           * Do we have more than one buf?
5272 5297           */
5273 5298          if (hdr->b_l1hdr.b_bufcnt > 1) {
5274 5299                  arc_buf_hdr_t *nhdr;
5275 5300                  uint64_t spa = hdr->b_spa;
5276 5301                  uint64_t psize = HDR_GET_PSIZE(hdr);
5277 5302                  uint64_t lsize = HDR_GET_LSIZE(hdr);
5278 5303                  enum zio_compress compress = HDR_GET_COMPRESS(hdr);
5279 5304                  arc_buf_contents_t type = arc_buf_type(hdr);
5280 5305                  VERIFY3U(hdr->b_type, ==, type);
5281 5306  
5282 5307                  ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
5283 5308                  (void) remove_reference(hdr, hash_lock, tag);
5284 5309  
5285 5310                  if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) {
5286 5311                          ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
5287 5312                          ASSERT(ARC_BUF_LAST(buf));
5288 5313                  }
5289 5314  
5290 5315                  /*
5291 5316                   * Pull the data off of this hdr and attach it to
5292 5317                   * a new anonymous hdr. Also find the last buffer
5293 5318                   * in the hdr's buffer list.
5294 5319                   */
5295 5320                  arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
5296 5321                  ASSERT3P(lastbuf, !=, NULL);
5297 5322  
5298 5323                  /*
5299 5324                   * If the current arc_buf_t and the hdr are sharing their data
5300 5325                   * buffer, then we must stop sharing that block.
5301 5326                   */
5302 5327                  if (arc_buf_is_shared(buf)) {
5303 5328                          VERIFY(!arc_buf_is_shared(lastbuf));
5304 5329  
5305 5330                          /*
5306 5331                           * First, sever the block sharing relationship between
5307 5332                           * buf and the arc_buf_hdr_t.
5308 5333                           */
5309 5334                          arc_unshare_buf(hdr, buf);
5310 5335  
5311 5336                          /*
5312 5337                           * Now we need to recreate the hdr's b_pabd. Since we
5313 5338                           * have lastbuf handy, we try to share with it, but if
5314 5339                           * we can't then we allocate a new b_pabd and copy the
5315 5340                           * data from buf into it.
5316 5341                           */
5317 5342                          if (arc_can_share(hdr, lastbuf)) {
5318 5343                                  arc_share_buf(hdr, lastbuf);
5319 5344                          } else {
5320 5345                                  arc_hdr_alloc_pabd(hdr);
5321 5346                                  abd_copy_from_buf(hdr->b_l1hdr.b_pabd,
5322 5347                                      buf->b_data, psize);
5323 5348                          }
5324 5349                          VERIFY3P(lastbuf->b_data, !=, NULL);
5325 5350                  } else if (HDR_SHARED_DATA(hdr)) {
5326 5351                          /*
5327 5352                           * Uncompressed shared buffers are always at the end
5328 5353                           * of the list. Compressed buffers don't have the
5329 5354                           * same requirements. This makes it hard to
5330 5355                           * simply assert that the lastbuf is shared so
5331 5356                           * we rely on the hdr's compression flags to determine
5332 5357                           * if we have a compressed, shared buffer.
5333 5358                           */
5334 5359                          ASSERT(arc_buf_is_shared(lastbuf) ||
5335 5360                              HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
5336 5361                          ASSERT(!ARC_BUF_SHARED(buf));
5337 5362                  }
5338 5363                  ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5339 5364                  ASSERT3P(state, !=, arc_l2c_only);
5340 5365  
5341 5366                  (void) refcount_remove_many(&state->arcs_size,
5342 5367                      arc_buf_size(buf), buf);
5343 5368  
5344 5369                  if (refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
5345 5370                          ASSERT3P(state, !=, arc_l2c_only);
5346 5371                          (void) refcount_remove_many(&state->arcs_esize[type],
5347 5372                              arc_buf_size(buf), buf);
5348 5373                  }
5349 5374  
5350 5375                  hdr->b_l1hdr.b_bufcnt -= 1;
5351 5376                  arc_cksum_verify(buf);
5352 5377                  arc_buf_unwatch(buf);
5353 5378  
5354 5379                  mutex_exit(hash_lock);
5355 5380  
5356 5381                  /*
5357 5382                   * Allocate a new hdr. The new hdr will contain a b_pabd
5358 5383                   * buffer which will be freed in arc_write().
5359 5384                   */
5360 5385                  nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
5361 5386                  ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
5362 5387                  ASSERT0(nhdr->b_l1hdr.b_bufcnt);
5363 5388                  ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
5364 5389                  VERIFY3U(nhdr->b_type, ==, type);
5365 5390                  ASSERT(!HDR_SHARED_DATA(nhdr));
5366 5391  
5367 5392                  nhdr->b_l1hdr.b_buf = buf;
5368 5393                  nhdr->b_l1hdr.b_bufcnt = 1;
5369 5394                  (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
5370 5395                  buf->b_hdr = nhdr;
5371 5396  
5372 5397                  mutex_exit(&buf->b_evict_lock);
5373 5398                  (void) refcount_add_many(&arc_anon->arcs_size,
5374 5399                      arc_buf_size(buf), buf);
5375 5400          } else {
5376 5401                  mutex_exit(&buf->b_evict_lock);
5377 5402                  ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
5378 5403                  /* protected by hash lock, or hdr is on arc_anon */
5379 5404                  ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
5380 5405                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5381 5406                  arc_change_state(arc_anon, hdr, hash_lock);
5382 5407                  hdr->b_l1hdr.b_arc_access = 0;
5383 5408                  mutex_exit(hash_lock);
5384 5409  
5385 5410                  buf_discard_identity(hdr);
5386 5411                  arc_buf_thaw(buf);
5387 5412          }
5388 5413  }
5389 5414  
5390 5415  int
5391 5416  arc_released(arc_buf_t *buf)
5392 5417  {
5393 5418          int released;
5394 5419  
5395 5420          mutex_enter(&buf->b_evict_lock);
5396 5421          released = (buf->b_data != NULL &&
5397 5422              buf->b_hdr->b_l1hdr.b_state == arc_anon);
5398 5423          mutex_exit(&buf->b_evict_lock);
5399 5424          return (released);
5400 5425  }
5401 5426  
5402 5427  #ifdef ZFS_DEBUG
5403 5428  int
5404 5429  arc_referenced(arc_buf_t *buf)
5405 5430  {
5406 5431          int referenced;
5407 5432  
5408 5433          mutex_enter(&buf->b_evict_lock);
5409 5434          referenced = (refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
5410 5435          mutex_exit(&buf->b_evict_lock);
5411 5436          return (referenced);
5412 5437  }
5413 5438  #endif
5414 5439  
5415 5440  static void
5416 5441  arc_write_ready(zio_t *zio)
5417 5442  {
5418 5443          arc_write_callback_t *callback = zio->io_private;
5419 5444          arc_buf_t *buf = callback->awcb_buf;
5420 5445          arc_buf_hdr_t *hdr = buf->b_hdr;
5421 5446          uint64_t psize = BP_IS_HOLE(zio->io_bp) ? 0 : BP_GET_PSIZE(zio->io_bp);
5422 5447  
5423 5448          ASSERT(HDR_HAS_L1HDR(hdr));
5424 5449          ASSERT(!refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
5425 5450          ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
5426 5451  
5427 5452          /*
5428 5453           * If we're reexecuting this zio because the pool suspended, then
5429 5454           * clean up any state that was previously set the first time the
5430 5455           * callback was invoked.
5431 5456           */
5432 5457          if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
5433 5458                  arc_cksum_free(hdr);
5434 5459                  arc_buf_unwatch(buf);
5435 5460                  if (hdr->b_l1hdr.b_pabd != NULL) {
5436 5461                          if (arc_buf_is_shared(buf)) {
5437 5462                                  arc_unshare_buf(hdr, buf);
5438 5463                          } else {
5439 5464                                  arc_hdr_free_pabd(hdr);
5440 5465                          }
5441 5466                  }
5442 5467          }
5443 5468          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5444 5469          ASSERT(!HDR_SHARED_DATA(hdr));
5445 5470          ASSERT(!arc_buf_is_shared(buf));
5446 5471  
5447 5472          callback->awcb_ready(zio, buf, callback->awcb_private);
5448 5473  
5449 5474          if (HDR_IO_IN_PROGRESS(hdr))
5450 5475                  ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);
5451 5476  
5452 5477          arc_cksum_compute(buf);
5453 5478          arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5454 5479  
5455 5480          enum zio_compress compress;
5456 5481          if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
5457 5482                  compress = ZIO_COMPRESS_OFF;
5458 5483          } else {
5459 5484                  ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(zio->io_bp));
5460 5485                  compress = BP_GET_COMPRESS(zio->io_bp);
5461 5486          }
5462 5487          HDR_SET_PSIZE(hdr, psize);
5463 5488          arc_hdr_set_compress(hdr, compress);
5464 5489  
5465 5490  
5466 5491          /*
5467 5492           * Fill the hdr with data. If the hdr is compressed, the data we want
5468 5493           * is available from the zio, otherwise we can take it from the buf.
5469 5494           *
5470 5495           * We might be able to share the buf's data with the hdr here. However,
5471 5496           * doing so would cause the ARC to be full of linear ABDs if we write a
5472 5497           * lot of shareable data. As a compromise, we check whether scattered
5473 5498           * ABDs are allowed, and assume that if they are then the user wants
5474 5499           * the ARC to be primarily filled with them regardless of the data being
5475 5500           * written. Therefore, if they're allowed then we allocate one and copy
5476 5501           * the data into it; otherwise, we share the data directly if we can.
5477 5502           */
5478 5503          if (zfs_abd_scatter_enabled || !arc_can_share(hdr, buf)) {
5479 5504                  arc_hdr_alloc_pabd(hdr);
5480 5505  
5481 5506                  /*
5482 5507                   * Ideally, we would always copy the io_abd into b_pabd, but the
5483 5508                   * user may have disabled compressed ARC, thus we must check the
5484 5509                   * hdr's compression setting rather than the io_bp's.
5485 5510                   */
5486 5511                  if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
5487 5512                          ASSERT3U(BP_GET_COMPRESS(zio->io_bp), !=,
5488 5513                              ZIO_COMPRESS_OFF);
5489 5514                          ASSERT3U(psize, >, 0);
5490 5515  
5491 5516                          abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);
5492 5517                  } else {
5493 5518                          ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));
5494 5519  
5495 5520                          abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,
5496 5521                              arc_buf_size(buf));
5497 5522                  }
5498 5523          } else {
5499 5524                  ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));
5500 5525                  ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
5501 5526                  ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
5502 5527  
5503 5528                  arc_share_buf(hdr, buf);
5504 5529          }
5505 5530  
5506 5531          arc_hdr_verify(hdr, zio->io_bp);
5507 5532  }
5508 5533  
5509 5534  static void
5510 5535  arc_write_children_ready(zio_t *zio)
5511 5536  {
5512 5537          arc_write_callback_t *callback = zio->io_private;
5513 5538          arc_buf_t *buf = callback->awcb_buf;
5514 5539  
5515 5540          callback->awcb_children_ready(zio, buf, callback->awcb_private);
5516 5541  }
5517 5542  
5518 5543  /*
5519 5544   * The SPA calls this callback for each physical write that happens on behalf
5520 5545   * of a logical write.  See the comment in dbuf_write_physdone() for details.
5521 5546   */
5522 5547  static void
5523 5548  arc_write_physdone(zio_t *zio)
5524 5549  {
5525 5550          arc_write_callback_t *cb = zio->io_private;
5526 5551          if (cb->awcb_physdone != NULL)
5527 5552                  cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
5528 5553  }
5529 5554  
5530 5555  static void
5531 5556  arc_write_done(zio_t *zio)
5532 5557  {
5533 5558          arc_write_callback_t *callback = zio->io_private;
5534 5559          arc_buf_t *buf = callback->awcb_buf;
5535 5560          arc_buf_hdr_t *hdr = buf->b_hdr;
5536 5561  
5537 5562          ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5538 5563  
5539 5564          if (zio->io_error == 0) {
5540 5565                  arc_hdr_verify(hdr, zio->io_bp);
5541 5566  
5542 5567                  if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
5543 5568                          buf_discard_identity(hdr);
5544 5569                  } else {
5545 5570                          hdr->b_dva = *BP_IDENTITY(zio->io_bp);
5546 5571                          hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
5547 5572                  }
5548 5573          } else {
5549 5574                  ASSERT(HDR_EMPTY(hdr));
5550 5575          }
5551 5576  
5552 5577          /*
5553 5578           * If the block to be written was all-zero or compressed enough to be
5554 5579           * embedded in the BP, no write was performed so there will be no
5555 5580           * dva/birth/checksum.  The buffer must therefore remain anonymous
5556 5581           * (and uncached).
5557 5582           */
5558 5583          if (!HDR_EMPTY(hdr)) {
5559 5584                  arc_buf_hdr_t *exists;
5560 5585                  kmutex_t *hash_lock;
5561 5586  
5562 5587                  ASSERT3U(zio->io_error, ==, 0);
5563 5588  
5564 5589                  arc_cksum_verify(buf);
5565 5590  
5566 5591                  exists = buf_hash_insert(hdr, &hash_lock);
5567 5592                  if (exists != NULL) {
5568 5593                          /*
5569 5594                           * This can only happen if we overwrite for
5570 5595                           * sync-to-convergence, because we remove
5571 5596                           * buffers from the hash table when we arc_freed().
5572 5597                           */
5573 5598                          if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
5574 5599                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5575 5600                                          panic("bad overwrite, hdr=%p exists=%p",
5576 5601                                              (void *)hdr, (void *)exists);
5577 5602                                  ASSERT(refcount_is_zero(
5578 5603                                      &exists->b_l1hdr.b_refcnt));
5579 5604                                  arc_change_state(arc_anon, exists, hash_lock);
5580 5605                                  mutex_exit(hash_lock);
5581 5606                                  arc_hdr_destroy(exists);
5582 5607                                  exists = buf_hash_insert(hdr, &hash_lock);
5583 5608                                  ASSERT3P(exists, ==, NULL);
5584 5609                          } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
5585 5610                                  /* nopwrite */
5586 5611                                  ASSERT(zio->io_prop.zp_nopwrite);
5587 5612                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5588 5613                                          panic("bad nopwrite, hdr=%p exists=%p",
5589 5614                                              (void *)hdr, (void *)exists);
5590 5615                          } else {
5591 5616                                  /* Dedup */
5592 5617                                  ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
5593 5618                                  ASSERT(hdr->b_l1hdr.b_state == arc_anon);
5594 5619                                  ASSERT(BP_GET_DEDUP(zio->io_bp));
5595 5620                                  ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
5596 5621                          }
5597 5622                  }
5598 5623                  arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5599 5624                  /* if it's not anon, we are doing a scrub */
5600 5625                  if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
5601 5626                          arc_access(hdr, hash_lock);
5602 5627                  mutex_exit(hash_lock);
5603 5628          } else {
5604 5629                  arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5605 5630          }
5606 5631  
5607 5632          ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5608 5633          callback->awcb_done(zio, buf, callback->awcb_private);
5609 5634  
5610 5635          abd_put(zio->io_abd);
5611 5636          kmem_free(callback, sizeof (arc_write_callback_t));
5612 5637  }
5613 5638  
5614 5639  zio_t *
5615 5640  arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
5616 5641      boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
5617 5642      arc_done_func_t *children_ready, arc_done_func_t *physdone,
5618 5643      arc_done_func_t *done, void *private, zio_priority_t priority,
5619 5644      int zio_flags, const zbookmark_phys_t *zb)
5620 5645  {
5621 5646          arc_buf_hdr_t *hdr = buf->b_hdr;
5622 5647          arc_write_callback_t *callback;
5623 5648          zio_t *zio;
5624 5649          zio_prop_t localprop = *zp;
5625 5650  
5626 5651          ASSERT3P(ready, !=, NULL);
5627 5652          ASSERT3P(done, !=, NULL);
5628 5653          ASSERT(!HDR_IO_ERROR(hdr));
5629 5654          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5630 5655          ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5631 5656          ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
5632 5657          if (l2arc)
5633 5658                  arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
5634 5659          if (ARC_BUF_COMPRESSED(buf)) {
5635 5660                  /*
5636 5661                   * We're writing a pre-compressed buffer.  Make the
5637 5662                   * compression algorithm requested by the zio_prop_t match
5638 5663                   * the pre-compressed buffer's compression algorithm.
5639 5664                   */
5640 5665                  localprop.zp_compress = HDR_GET_COMPRESS(hdr);
5641 5666  
5642 5667                  ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
5643 5668                  zio_flags |= ZIO_FLAG_RAW;
5644 5669          }
5645 5670          callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
5646 5671          callback->awcb_ready = ready;
5647 5672          callback->awcb_children_ready = children_ready;
5648 5673          callback->awcb_physdone = physdone;
5649 5674          callback->awcb_done = done;
5650 5675          callback->awcb_private = private;
5651 5676          callback->awcb_buf = buf;
5652 5677  
5653 5678          /*
5654 5679           * The hdr's b_pabd is now stale; free it now. A new data block
5655 5680           * will be allocated when the zio pipeline calls arc_write_ready().
5656 5681           */
5657 5682          if (hdr->b_l1hdr.b_pabd != NULL) {
5658 5683                  /*
5659 5684                   * If the buf is currently sharing the data block with
5660 5685                   * the hdr then we need to break that relationship here.
5661 5686                   * The hdr will remain with a NULL data pointer and the
5662 5687                   * buf will take sole ownership of the block.
5663 5688                   */
5664 5689                  if (arc_buf_is_shared(buf)) {
5665 5690                          arc_unshare_buf(hdr, buf);
5666 5691                  } else {
5667 5692                          arc_hdr_free_pabd(hdr);
5668 5693                  }
5669 5694                  VERIFY3P(buf->b_data, !=, NULL);
5670 5695                  arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
5671 5696          }
5672 5697          ASSERT(!arc_buf_is_shared(buf));
5673 5698          ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5674 5699  
5675 5700          zio = zio_write(pio, spa, txg, bp,
5676 5701              abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
5677 5702              HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
5678 5703              (children_ready != NULL) ? arc_write_children_ready : NULL,
5679 5704              arc_write_physdone, arc_write_done, callback,
5680 5705              priority, zio_flags, zb);
5681 5706  
5682 5707          return (zio);
5683 5708  }
5684 5709  
5685 5710  static int
5686 5711  arc_memory_throttle(uint64_t reserve, uint64_t txg)
5687 5712  {
5688 5713  #ifdef _KERNEL
5689 5714          uint64_t available_memory = ptob(freemem);
5690 5715          static uint64_t page_load = 0;
5691 5716          static uint64_t last_txg = 0;
5692 5717  
5693 5718  #if defined(__i386)
5694 5719          available_memory =
5695 5720              MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
5696 5721  #endif
5697 5722  
5698 5723          if (freemem > physmem * arc_lotsfree_percent / 100)
5699 5724                  return (0);
5700 5725  
5701 5726          if (txg > last_txg) {
5702 5727                  last_txg = txg;
5703 5728                  page_load = 0;
5704 5729          }
5705 5730          /*
5706 5731           * If we are in pageout, we know that memory is already tight and
5707 5732           * the ARC is already going to be evicting, so we just want to
5708 5733           * continue to let page writes occur as quickly as possible.
5709 5734           */
5710 5735          if (curproc == proc_pageout) {
5711 5736                  if (page_load > MAX(ptob(minfree), available_memory) / 4)
5712 5737                          return (SET_ERROR(ERESTART));
5713 5738                  /* Note: reserve is inflated, so we deflate */
5714 5739                  page_load += reserve / 8;
5715 5740                  return (0);
5716 5741          } else if (page_load > 0 && arc_reclaim_needed()) {
5717 5742                  /* memory is low, delay before restarting */
5718 5743                  ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
5719 5744                  return (SET_ERROR(EAGAIN));
5720 5745          }
5721 5746          page_load = 0;
5722 5747  #endif
5723 5748          return (0);
5724 5749  }
5725 5750  
5726 5751  void
5727 5752  arc_tempreserve_clear(uint64_t reserve)
5728 5753  {
5729 5754          atomic_add_64(&arc_tempreserve, -reserve);
5730 5755          ASSERT((int64_t)arc_tempreserve >= 0);
5731 5756  }
5732 5757  
5733 5758  int
5734 5759  arc_tempreserve_space(uint64_t reserve, uint64_t txg)
5735 5760  {
5736 5761          int error;
5737 5762          uint64_t anon_size;
5738 5763  
5739 5764          if (reserve > arc_c/4 && !arc_no_grow)
5740 5765                  arc_c = MIN(arc_c_max, reserve * 4);
5741 5766          if (reserve > arc_c)
5742 5767                  return (SET_ERROR(ENOMEM));
5743 5768  
5744 5769          /*
5745 5770           * Don't count loaned bufs as in flight dirty data to prevent long
5746 5771           * network delays from blocking transactions that are ready to be
5747 5772           * assigned to a txg.
5748 5773           */
5749 5774  
5750 5775          /* assert that it has not wrapped around */
5751 5776          ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
5752 5777  
5753 5778          anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) -
5754 5779              arc_loaned_bytes), 0);
5755 5780  
5756 5781          /*
5757 5782           * Writes will, almost always, require additional memory allocations
5758 5783           * in order to compress/encrypt/etc the data.  We therefore need to
5759 5784           * make sure that there is sufficient available memory for this.
5760 5785           */
5761 5786          error = arc_memory_throttle(reserve, txg);
5762 5787          if (error != 0)
5763 5788                  return (error);
5764 5789  
5765 5790          /*
5766 5791           * Throttle writes when the amount of dirty data in the cache
5767 5792           * gets too large.  We try to keep the cache less than half full
5768 5793           * of dirty blocks so that our sync times don't grow too large.
5769 5794           * Note: if two requests come in concurrently, we might let them
5770 5795           * both succeed, when one of them should fail.  Not a huge deal.
5771 5796           */
5772 5797  
5773 5798          if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
5774 5799              anon_size > arc_c / 4) {
5775 5800                  uint64_t meta_esize =
5776 5801                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5777 5802                  uint64_t data_esize =
5778 5803                      refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5779 5804                  dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
5780 5805                      "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
5781 5806                      arc_tempreserve >> 10, meta_esize >> 10,
5782 5807                      data_esize >> 10, reserve >> 10, arc_c >> 10);
5783 5808                  return (SET_ERROR(ERESTART));
5784 5809          }
5785 5810          atomic_add_64(&arc_tempreserve, reserve);
5786 5811          return (0);
5787 5812  }
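
The dirty-data check at the end of arc_tempreserve_space() can be read in isolation. The following is a minimal userland sketch of just that predicate, assuming the caller has already sampled the counters; `dirty_throttle_trips` is a hypothetical name for illustration and does not exist in arc.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of arc.c).  Returns true when a new
 * reservation would be refused with ERESTART: the reservation plus the
 * dirty data already accounted for exceeds half of arc_c, and anonymous
 * (dirty) data alone exceeds a quarter of arc_c.
 */
bool
dirty_throttle_trips(uint64_t reserve, uint64_t tempreserve,
    uint64_t anon_size, uint64_t arc_c)
{
        return (reserve + tempreserve + anon_size > arc_c / 2 &&
            anon_size > arc_c / 4);
}
```

Both conditions must hold, so a large reservation against a cache that holds little anonymous data is still admitted.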
5788 5813  
5789 5814  static void
5790 5815  arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
5791 5816      kstat_named_t *evict_data, kstat_named_t *evict_metadata)
5792 5817  {
5793 5818          size->value.ui64 = refcount_count(&state->arcs_size);
5794 5819          evict_data->value.ui64 =
5795 5820              refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
5796 5821          evict_metadata->value.ui64 =
5797 5822              refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
5798 5823  }
5799 5824  
5800 5825  static int
5801 5826  arc_kstat_update(kstat_t *ksp, int rw)
5802 5827  {
5803 5828          arc_stats_t *as = ksp->ks_data;
5804 5829  
5805 5830          if (rw == KSTAT_WRITE) {
5806 5831                  return (EACCES);
5807 5832          } else {
5808 5833                  arc_kstat_update_state(arc_anon,
5809 5834                      &as->arcstat_anon_size,
5810 5835                      &as->arcstat_anon_evictable_data,
5811 5836                      &as->arcstat_anon_evictable_metadata);
5812 5837                  arc_kstat_update_state(arc_mru,
5813 5838                      &as->arcstat_mru_size,
5814 5839                      &as->arcstat_mru_evictable_data,
5815 5840                      &as->arcstat_mru_evictable_metadata);
5816 5841                  arc_kstat_update_state(arc_mru_ghost,
5817 5842                      &as->arcstat_mru_ghost_size,
5818 5843                      &as->arcstat_mru_ghost_evictable_data,
5819 5844                      &as->arcstat_mru_ghost_evictable_metadata);
5820 5845                  arc_kstat_update_state(arc_mfu,
5821 5846                      &as->arcstat_mfu_size,
5822 5847                      &as->arcstat_mfu_evictable_data,
5823 5848                      &as->arcstat_mfu_evictable_metadata);
5824 5849                  arc_kstat_update_state(arc_mfu_ghost,
5825 5850                      &as->arcstat_mfu_ghost_size,
5826 5851                      &as->arcstat_mfu_ghost_evictable_data,
5827 5852                      &as->arcstat_mfu_ghost_evictable_metadata);
5828 5853          }
5829 5854  
5830 5855          return (0);
5831 5856  }
5832 5857  
5833 5858  /*
5834 5859   * This function *must* return indices evenly distributed between all
5835 5860   * sublists of the multilist. This is needed due to how the ARC eviction
5836 5861   * code is laid out; arc_evict_state() assumes ARC buffers are evenly
5837 5862   * distributed between all sublists and uses this assumption when
5838 5863   * deciding which sublist to evict from and how much to evict from it.
5839 5864   */
5840 5865  unsigned int
5841 5866  arc_state_multilist_index_func(multilist_t *ml, void *obj)
5842 5867  {
5843 5868          arc_buf_hdr_t *hdr = obj;
5844 5869  
5845 5870          /*
5846 5871           * We rely on b_dva to generate evenly distributed index
5847 5872           * numbers using buf_hash below. So, as an added precaution,
5848 5873           * let's make sure we never add empty buffers to the arc lists.
5849 5874           */
5850 5875          ASSERT(!HDR_EMPTY(hdr));
5851 5876  
5852 5877          /*
5853 5878           * The assumption here is that the hash value for a given
5854 5879           * arc_buf_hdr_t will remain constant throughout its lifetime
5855 5880           * (i.e. its b_spa, b_dva, and b_birth fields don't change).
5856 5881           * Thus, we don't need to store the header's sublist index
5857 5882           * on insertion, as this index can be recalculated on removal.
5858 5883           *
5859 5884           * Also, the low order bits of the hash value are thought to be
5860 5885           * distributed evenly. Otherwise, in the case that the multilist
5861 5886           * has a power of two number of sublists, each sublist's usage
5862 5887           * would not be evenly distributed.
5863 5888           */
5864 5889          return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
5865 5890              multilist_get_num_sublists(ml));
5866 5891  }
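
The reason no sublist index needs to be stored can be shown with a small sketch: as long as the identity fields do not change, hashing them again yields the same index. The code below is illustrative only. `toy_hdr_t`, `toy_hash` and `toy_sublist_index` are hypothetical names, and FNV-1a is used as a stand-in for buf_hash(), whose implementation is not shown in this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the identity fields of arc_buf_hdr_t. */
typedef struct toy_hdr {
        uint64_t spa;
        uint64_t dva[2];
        uint64_t birth;
} toy_hdr_t;

/* FNV-1a over the identity fields; an assumption, not buf_hash(). */
uint64_t
toy_hash(const toy_hdr_t *hdr)
{
        const uint64_t words[4] = {
                hdr->spa, hdr->dva[0], hdr->dva[1], hdr->birth
        };
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */

        for (size_t i = 0; i < 4; i++) {
                for (int b = 0; b < 8; b++) {
                        h ^= (words[i] >> (8 * b)) & 0xff;
                        h *= 0x100000001b3ULL;  /* FNV prime */
                }
        }
        return (h);
}

/* The index is a pure function of the header's identity. */
unsigned int
toy_sublist_index(const toy_hdr_t *hdr, unsigned int num_sublists)
{
        return ((unsigned int)(toy_hash(hdr) % num_sublists));
}
```

Calling toy_sublist_index() at insertion and again at removal gives the same sublist, provided the three fields are untouched in between.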
5867 5892  
5868 5893  static void
5869 5894  arc_state_init(void)
5870 5895  {
5871 5896          arc_anon = &ARC_anon;
5872 5897          arc_mru = &ARC_mru;
5873 5898          arc_mru_ghost = &ARC_mru_ghost;
5874 5899          arc_mfu = &ARC_mfu;
5875 5900          arc_mfu_ghost = &ARC_mfu_ghost;
5876 5901          arc_l2c_only = &ARC_l2c_only;
5877 5902  
5878 5903          arc_mru->arcs_list[ARC_BUFC_METADATA] =
5879 5904              multilist_create(sizeof (arc_buf_hdr_t),
5880 5905              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5881 5906              arc_state_multilist_index_func);
5882 5907          arc_mru->arcs_list[ARC_BUFC_DATA] =
5883 5908              multilist_create(sizeof (arc_buf_hdr_t),
5884 5909              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5885 5910              arc_state_multilist_index_func);
5886 5911          arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
5887 5912              multilist_create(sizeof (arc_buf_hdr_t),
5888 5913              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5889 5914              arc_state_multilist_index_func);
5890 5915          arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
5891 5916              multilist_create(sizeof (arc_buf_hdr_t),
5892 5917              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5893 5918              arc_state_multilist_index_func);
5894 5919          arc_mfu->arcs_list[ARC_BUFC_METADATA] =
5895 5920              multilist_create(sizeof (arc_buf_hdr_t),
5896 5921              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5897 5922              arc_state_multilist_index_func);
5898 5923          arc_mfu->arcs_list[ARC_BUFC_DATA] =
5899 5924              multilist_create(sizeof (arc_buf_hdr_t),
5900 5925              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5901 5926              arc_state_multilist_index_func);
5902 5927          arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
5903 5928              multilist_create(sizeof (arc_buf_hdr_t),
5904 5929              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5905 5930              arc_state_multilist_index_func);
5906 5931          arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
5907 5932              multilist_create(sizeof (arc_buf_hdr_t),
5908 5933              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5909 5934              arc_state_multilist_index_func);
5910 5935          arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
5911 5936              multilist_create(sizeof (arc_buf_hdr_t),
5912 5937              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5913 5938              arc_state_multilist_index_func);
5914 5939          arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
5915 5940              multilist_create(sizeof (arc_buf_hdr_t),
5916 5941              offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5917 5942              arc_state_multilist_index_func);
5918 5943  
5919 5944          refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5920 5945          refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5921 5946          refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
5922 5947          refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
5923 5948          refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
5924 5949          refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
5925 5950          refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
5926 5951          refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
5927 5952          refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
5928 5953          refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
5929 5954          refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
5930 5955          refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
5931 5956  
5932 5957          refcount_create(&arc_anon->arcs_size);
5933 5958          refcount_create(&arc_mru->arcs_size);
5934 5959          refcount_create(&arc_mru_ghost->arcs_size);
5935 5960          refcount_create(&arc_mfu->arcs_size);
5936 5961          refcount_create(&arc_mfu_ghost->arcs_size);
5937 5962          refcount_create(&arc_l2c_only->arcs_size);
5938 5963  }
5939 5964  
5940 5965  static void
5941 5966  arc_state_fini(void)
5942 5967  {
5943 5968          refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5944 5969          refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5945 5970          refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
5946 5971          refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
5947 5972          refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
5948 5973          refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
5949 5974          refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
5950 5975          refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
5951 5976          refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
5952 5977          refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
5953 5978          refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
5954 5979          refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
5955 5980  
5956 5981          refcount_destroy(&arc_anon->arcs_size);
5957 5982          refcount_destroy(&arc_mru->arcs_size);
5958 5983          refcount_destroy(&arc_mru_ghost->arcs_size);
5959 5984          refcount_destroy(&arc_mfu->arcs_size);
5960 5985          refcount_destroy(&arc_mfu_ghost->arcs_size);
5961 5986          refcount_destroy(&arc_l2c_only->arcs_size);
5962 5987  
5963 5988          multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
5964 5989          multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
5965 5990          multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
5966 5991          multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
5967 5992          multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
5968 5993          multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
5969 5994          multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
5970 5995          multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
5971 5996  }
5972 5997  
5973 5998  uint64_t
5974 5999  arc_max_bytes(void)
5975 6000  {
5976 6001          return (arc_c_max);
5977 6002  }
5978 6003  
5979 6004  void
5980 6005  arc_init(void)
5981 6006  {
5982 6007          /*
5983 6008           * allmem is "all memory that we could possibly use".
5984 6009           */
5985 6010  #ifdef _KERNEL
5986 6011          uint64_t allmem = ptob(physmem - swapfs_minfree);
5987 6012  #else
5988 6013          uint64_t allmem = (physmem * PAGESIZE) / 2;
5989 6014  #endif
5990 6015  
5991 6016          mutex_init(&arc_reclaim_lock, NULL, MUTEX_DEFAULT, NULL);
5992 6017          cv_init(&arc_reclaim_thread_cv, NULL, CV_DEFAULT, NULL);
5993 6018          cv_init(&arc_reclaim_waiters_cv, NULL, CV_DEFAULT, NULL);
5994 6019  
5995 6020          /* Convert seconds to clock ticks */
5996 6021          arc_min_prefetch_lifespan = 1 * hz;
5997 6022  
5998 6023          /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
5999 6024          arc_c_min = MAX(allmem / 32, 64 << 20);
6000 6025          /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
6001 6026          if (allmem >= 1 << 30)
6002 6027                  arc_c_max = allmem - (1 << 30);
6003 6028          else
6004 6029                  arc_c_max = arc_c_min;
6005 6030          arc_c_max = MAX(allmem * 3 / 4, arc_c_max);
6006 6031  
6007 6032          /*
6008 6033           * In userland, there's only the memory pressure that we artificially
6009 6034           * create (see arc_available_memory()).  Don't let arc_c get too
6010 6035           * small, because it can cause transactions to be larger than
6011 6036           * arc_c, causing arc_tempreserve_space() to fail.
6012 6037           */
6013 6038  #ifndef _KERNEL
6014 6039          arc_c_min = arc_c_max / 2;
6015 6040  #endif
6016 6041  
6017 6042          /*
6018 6043           * Allow the tunables to override our calculations if they are
6019 6044           * reasonable (i.e. over 64MB)
6020 6045           */
6021 6046          if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6022 6047                  arc_c_max = zfs_arc_max;
6023 6048                  arc_c_min = MIN(arc_c_min, arc_c_max);
6024 6049          }
6025 6050          if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6026 6051                  arc_c_min = zfs_arc_min;
6027 6052  
6028 6053          arc_c = arc_c_max;
6029 6054          arc_p = (arc_c >> 1);
6030 6055          arc_size = 0;
6031 6056  
6032 6057          /* limit meta-data to 1/4 of the arc capacity */
6033 6058          arc_meta_limit = arc_c_max / 4;
6034 6059  
6035 6060  #ifdef _KERNEL
6036 6061          /*
6037 6062           * Metadata is stored in the kernel's heap.  Don't let us
6038 6063           * use more than half the heap for the ARC.
6039 6064           */
6040 6065          arc_meta_limit = MIN(arc_meta_limit,
6041 6066              vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6042 6067  #endif
6043 6068  
6044 6069          /* Allow the tunable to override if it is reasonable */
6045 6070          if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6046 6071                  arc_meta_limit = zfs_arc_meta_limit;
6047 6072  
6048 6073          if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6049 6074                  arc_c_min = arc_meta_limit / 2;
6050 6075  
6051 6076          if (zfs_arc_meta_min > 0) {
6052 6077                  arc_meta_min = zfs_arc_meta_min;
6053 6078          } else {
6054 6079                  arc_meta_min = arc_c_min / 2;
6055 6080          }
6056 6081  
6057 6082          if (zfs_arc_grow_retry > 0)
6058 6083                  arc_grow_retry = zfs_arc_grow_retry;
6059 6084  
6060 6085          if (zfs_arc_shrink_shift > 0)
6061 6086                  arc_shrink_shift = zfs_arc_shrink_shift;
6062 6087  
6063 6088          /*
6064 6089           * Ensure that arc_no_grow_shift is less than arc_shrink_shift.
6065 6090           */
6066 6091          if (arc_no_grow_shift >= arc_shrink_shift)
6067 6092                  arc_no_grow_shift = arc_shrink_shift - 1;
6068 6093  
6069 6094          if (zfs_arc_p_min_shift > 0)
6070 6095                  arc_p_min_shift = zfs_arc_p_min_shift;
6071 6096  
6072 6097          /* if kmem_flags are set, let's try to use less memory */
6073 6098          if (kmem_debugging())
6074 6099                  arc_c = arc_c / 2;
6075 6100          if (arc_c < arc_c_min)
6076 6101                  arc_c = arc_c_min;
6077 6102  
6078 6103          arc_state_init();
6079 6104          buf_init();
6080 6105  
6081 6106          arc_reclaim_thread_exit = B_FALSE;
6082 6107  
6083 6108          arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
6084 6109              sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
6085 6110  
6086 6111          if (arc_ksp != NULL) {
6087 6112                  arc_ksp->ks_data = &arc_stats;
6088 6113                  arc_ksp->ks_update = arc_kstat_update;
6089 6114                  kstat_install(arc_ksp);
6090 6115          }
6091 6116  
6092 6117          (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
6093 6118              TS_RUN, minclsyspri);
6094 6119  
6095 6120          arc_dead = B_FALSE;
6096 6121          arc_warm = B_FALSE;
6097 6122  
6098 6123          /*
6099 6124           * Calculate maximum amount of dirty data per pool.
6100 6125           *
6101 6126           * If it has been set by /etc/system, take that.
6102 6127           * Otherwise, use a percentage of physical memory defined by
6103 6128           * zfs_dirty_data_max_percent (default 10%) with a cap at
6104 6129           * zfs_dirty_data_max_max (default 4GB).
6105 6130           */
6106 6131          if (zfs_dirty_data_max == 0) {
6107 6132                  zfs_dirty_data_max = physmem * PAGESIZE *
6108 6133                      zfs_dirty_data_max_percent / 100;
6109 6134                  zfs_dirty_data_max = MIN(zfs_dirty_data_max,
6110 6135                      zfs_dirty_data_max_max);
6111 6136          }
6112 6137  }
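
As a worked example of the default sizing in arc_init(), the sketch below reproduces only the first step, before any tunable (zfs_arc_min, zfs_arc_max), the userland adjustment, or the metadata-limit adjustment is applied. `arc_limits_t`, `default_arc_limits` and `SKETCH_MAX` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

#define SKETCH_MAX(a, b)        ((a) > (b) ? (a) : (b))

/* Hypothetical container for the two computed limits. */
typedef struct arc_limits {
        uint64_t c_min;
        uint64_t c_max;
} arc_limits_t;

/*
 * Mirrors the default computation only: min is the larger of 1/32 of
 * memory and 64MB; max is the larger of 3/4 of memory and "all but 1GB".
 */
arc_limits_t
default_arc_limits(uint64_t allmem)
{
        arc_limits_t l;

        l.c_min = SKETCH_MAX(allmem / 32, (uint64_t)64 << 20);
        if (allmem >= ((uint64_t)1 << 30))
                l.c_max = allmem - ((uint64_t)1 << 30);
        else
                l.c_max = l.c_min;
        l.c_max = SKETCH_MAX(allmem * 3 / 4, l.c_max);
        return (l);
}
```

With allmem of 16GB this gives a minimum of 512MB and a maximum of 15GB; with 2GB, "all but 1GB" is only 1GB, so the 3/4 rule wins and the maximum is 1.5GB.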
6113 6138  
6114 6139  void
6115 6140  arc_fini(void)
6116 6141  {
6117 6142          mutex_enter(&arc_reclaim_lock);
6118 6143          arc_reclaim_thread_exit = B_TRUE;
6119 6144          /*
6120 6145           * The reclaim thread will set arc_reclaim_thread_exit back to
6121 6146           * B_FALSE when it is finished exiting; we're waiting for that.
6122 6147           */
6123 6148          while (arc_reclaim_thread_exit) {
6124 6149                  cv_signal(&arc_reclaim_thread_cv);
6125 6150                  cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock);
6126 6151          }
6127 6152          mutex_exit(&arc_reclaim_lock);
6128 6153  
6129 6154          /* Use B_TRUE to ensure *all* buffers are evicted */
6130 6155          arc_flush(NULL, B_TRUE);
6131 6156  
6132 6157          arc_dead = B_TRUE;
6133 6158  
6134 6159          if (arc_ksp != NULL) {
6135 6160                  kstat_delete(arc_ksp);
6136 6161                  arc_ksp = NULL;
6137 6162          }
6138 6163  
6139 6164          mutex_destroy(&arc_reclaim_lock);
6140 6165          cv_destroy(&arc_reclaim_thread_cv);
6141 6166          cv_destroy(&arc_reclaim_waiters_cv);
6142 6167  
6143 6168          arc_state_fini();
6144 6169          buf_fini();
6145 6170  
6146 6171          ASSERT0(arc_loaned_bytes);
6147 6172  }
6148 6173  
6149 6174  /*
6150 6175   * Level 2 ARC
6151 6176   *
6152 6177   * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
6153 6178   * It uses dedicated storage devices to hold cached data, which are populated
6154 6179   * using large infrequent writes.  The main role of this cache is to boost
6155 6180   * the performance of random read workloads.  The intended L2ARC devices
6156 6181   * include short-stroked disks, solid state disks, and other media with
6157 6182   * substantially lower read latency than disk.
6158 6183   *
6159 6184   *                 +-----------------------+
6160 6185   *                 |         ARC           |
6161 6186   *                 +-----------------------+
6162 6187   *                    |         ^     ^
6163 6188   *                    |         |     |
6164 6189   *      l2arc_feed_thread()    arc_read()
6165 6190   *                    |         |     |
6166 6191   *                    |  l2arc read   |
6167 6192   *                    V         |     |
6168 6193   *               +---------------+    |
6169 6194   *               |     L2ARC     |    |
6170 6195   *               +---------------+    |
6171 6196   *                   |    ^           |
6172 6197   *          l2arc_write() |           |
6173 6198   *                   |    |           |
6174 6199   *                   V    |           |
6175 6200   *                 +-------+      +-------+
6176 6201   *                 | vdev  |      | vdev  |
6177 6202   *                 | cache |      | cache |
6178 6203   *                 +-------+      +-------+
6179 6204   *                 +=========+     .-----.
6180 6205   *                 :  L2ARC  :    |-_____-|
6181 6206   *                 : devices :    | Disks |
6182 6207   *                 +=========+    `-_____-'
6183 6208   *
6184 6209   * Read requests are satisfied from the following sources, in order:
6185 6210   *
6186 6211   *      1) ARC
6187 6212   *      2) vdev cache of L2ARC devices
6188 6213   *      3) L2ARC devices
6189 6214   *      4) vdev cache of disks
6190 6215   *      5) disks
6191 6216   *
6192 6217   * Some L2ARC device types exhibit extremely slow write performance.
6193 6218   * To accommodate this, there are some significant differences between
6194 6219   * the L2ARC and traditional cache design:
6195 6220   *
6196 6221   * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
6197 6222   * the ARC behave as usual, freeing buffers and placing headers on ghost
6198 6223   * lists.  The ARC does not send buffers to the L2ARC during eviction as
6199 6224   * this would add inflated write latencies for all ARC memory pressure.
6200 6225   *
6201 6226   * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
6202 6227   * It does this by periodically scanning buffers from the eviction-end of
6203 6228   * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
6204 6229   * not already there. It scans until a headroom of buffers is satisfied,
6205 6230   * which itself is a buffer for ARC eviction. If a compressible buffer is
6206 6231   * found during scanning and selected for writing to an L2ARC device, we
6207 6232   * temporarily boost scanning headroom during the next scan cycle to make
6208 6233   * sure we adapt to compression effects (which might significantly reduce
6209 6234   * the data volume we write to L2ARC). The thread that does this is
6210 6235   * l2arc_feed_thread(), illustrated below; example sizes are included to
6211 6236   * provide a better sense of ratio than this diagram:
6212 6237   *
6213 6238   *             head -->                        tail
6214 6239   *              +---------------------+----------+
6215 6240   *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
6216 6241   *              +---------------------+----------+   |   o L2ARC eligible
6217 6242   *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
6218 6243   *              +---------------------+----------+   |
6219 6244   *                   15.9 Gbytes      ^ 32 Mbytes    |
6220 6245   *                                 headroom          |
6221 6246   *                                            l2arc_feed_thread()
6222 6247   *                                                   |
6223 6248   *                       l2arc write hand <--[oooo]--'
6224 6249   *                               |           8 Mbyte
6225 6250   *                               |          write max
6226 6251   *                               V
6227 6252   *                +==============================+
6228 6253   *      L2ARC dev |####|#|###|###|    |####| ... |
6229 6254   *                +==============================+
6230 6255   *                           32 Gbytes
6231 6256   *
6232 6257   * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
6233 6258   * evicted, then the L2ARC has cached a buffer much sooner than it probably
6234 6259   * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
6235 6260   * safe to say that this is an uncommon case, since buffers at the end of
6236 6261   * the ARC lists have moved there due to inactivity.
6237 6262   *
6238 6263   * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
6239 6264   * then the L2ARC simply misses copying some buffers.  This serves as a
6240 6265   * pressure valve to prevent heavy read workloads from both stalling the ARC
6241 6266   * with waits and clogging the L2ARC with writes.  This also helps prevent
6242 6267   * the potential for the L2ARC to churn if it attempts to cache content too
6243 6268   * quickly, such as during backups of the entire pool.
6244 6269   *
6245 6270   * 5. After system boot and before the ARC has filled main memory, there are
6246 6271   * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
6247 6272   * lists can remain mostly static.  Instead of searching from the tail of these
6248 6273   * lists as pictured, the l2arc_feed_thread() will search from the list heads
6249 6274   * for eligible buffers, greatly increasing its chance of finding them.
6250 6275   *
6251 6276   * The L2ARC device write speed is also boosted during this time so that
6252 6277   * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
6253 6278   * there are no L2ARC reads, and no fear of degrading read performance
6254 6279   * through increased writes.
6255 6280   *
6256 6281   * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
6257 6282   * the vdev queue can aggregate them into larger and fewer writes.  Each
6258 6283   * device is written to in a rotor fashion, sweeping writes through
6259 6284   * available space then repeating.
6260 6285   *
6261 6286   * 7. The L2ARC does not store dirty content.  It never needs to flush
6262 6287   * write buffers back to disk-based storage.
6263 6288   *
6264 6289   * 8. If an ARC buffer is written (and dirtied) which also exists in the
6265 6290   * L2ARC, the now stale L2ARC buffer is immediately dropped.
6266 6291   *
6267 6292   * The performance of the L2ARC can be tweaked by a number of tunables, which
6268 6293   * may be necessary for different workloads:
6269 6294   *
6270 6295   *      l2arc_write_max         max write bytes per interval
6271 6296   *      l2arc_write_boost       extra write bytes during device warmup
6272 6297   *      l2arc_noprefetch        skip caching prefetched buffers
6273 6298   *      l2arc_headroom          number of max device writes to precache
6274 6299   *      l2arc_headroom_boost    when we find compressed buffers during ARC
6275 6300   *                              scanning, we multiply headroom by this
6276 6301   *                              percentage factor for the next scan cycle,
6277 6302   *                              since more compressed buffers are likely to
6278 6303   *                              be present
6279 6304   *      l2arc_feed_secs         seconds between L2ARC writing
6280 6305   *
6281 6306   * Tunables may be removed or added as future performance improvements are
6282 6307   * integrated, and also may become zpool properties.
6283 6308   *
6284 6309   * There are three key functions that control how the L2ARC warms up:
6285 6310   *
6286 6311   *      l2arc_write_eligible()  check if a buffer is eligible to cache
6287 6312   *      l2arc_write_size()      calculate how much to write
6288 6313   *      l2arc_write_interval()  calculate sleep delay between writes
6289 6314   *
6290 6315   * These three functions determine what to write, how much, and how quickly
6291 6316   * to send writes.
6292 6317   */
6293 6318  
6294 6319  static boolean_t
6295 6320  l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
6296 6321  {
6297 6322          /*
6298 6323           * A buffer is *not* eligible for the L2ARC if it:
6299 6324           * 1. belongs to a different spa.
6300 6325           * 2. is already cached on the L2ARC.
6301 6326           * 3. has an I/O in progress (it may be an incomplete read).
6302 6327           * 4. is flagged not eligible (zfs property).
6303 6328           */
6304 6329          if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
6305 6330              HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
6306 6331                  return (B_FALSE);
6307 6332  
6308 6333          return (B_TRUE);
6309 6334  }
6310 6335  
6311 6336  static uint64_t
6312 6337  l2arc_write_size(void)
6313 6338  {
6314 6339          uint64_t size;
6315 6340  
6316 6341          /*
6317 6342           * Make sure our globals have meaningful values in case the user
6318 6343           * altered them.
6319 6344           */
6320 6345          size = l2arc_write_max;
6321 6346          if (size == 0) {
6322 6347                  cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
6323 6348                      "be greater than zero, resetting it to the default (%d)",
6324 6349                      L2ARC_WRITE_SIZE);
6325 6350                  size = l2arc_write_max = L2ARC_WRITE_SIZE;
6326 6351          }
6327 6352  
6328 6353          if (arc_warm == B_FALSE)
6329 6354                  size += l2arc_write_boost;
6330 6355  
6331 6356          return (size);
6332 6357  
6333 6358  }
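
The sizing rule in l2arc_write_size() can be tried outside the kernel with a small user-space sketch. Everything named sketch_* or SKETCH_* below is hypothetical and not part of arc.c; the 8MB default is an assumption standing in for L2ARC_WRITE_SIZE, and the tunables are passed as arguments instead of being read from globals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed default, standing in for L2ARC_WRITE_SIZE. */
#define SKETCH_WRITE_SIZE (8ULL * 1024 * 1024)

/*
 * Fall back to the default when the tunable is zero, then add the
 * boost while the ARC is still cold (not yet warm).
 */
static uint64_t
sketch_write_size(uint64_t write_max, uint64_t write_boost, bool arc_warm)
{
        uint64_t size = write_max;

        if (size == 0)
                size = SKETCH_WRITE_SIZE;
        assert(size != 0);

        if (!arc_warm)
                size += write_boost;

        return (size);
}
```

Unlike the kernel function, this sketch does not write the default back into the tunable or log a notice; it only shows how the returned size is derived.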
6334 6359  
6335 6360  static clock_t
6336 6361  l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
6337 6362  {
6338 6363          clock_t interval, next, now;
6339 6364  
6340 6365          /*
6341 6366           * If the ARC lists are busy, increase our write rate; if the
6342 6367           * lists are stale, idle back.  This is achieved by checking
6343 6368           * how much we previously wrote - if it was more than half of
6344 6369           * what we wanted, schedule the next write much sooner.
6345 6370           */
6346 6371          if (l2arc_feed_again && wrote > (wanted / 2))
6347 6372                  interval = (hz * l2arc_feed_min_ms) / 1000;
6348 6373          else
6349 6374                  interval = hz * l2arc_feed_secs;
6350 6375  
6351 6376          now = ddi_get_lbolt();
6352 6377          next = MAX(now, MIN(now + interval, began + interval));
6353 6378  
6354 6379          return (next);
6355 6380  }
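
The scheduling arithmetic in l2arc_write_interval() is easier to follow with concrete tick values. The following is a minimal user-space sketch, assuming the clock, hz and the feed tunables are passed in as plain arguments; sketch_clock_t, SK_MAX, SK_MIN and sketch_write_interval() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sketch_clock_t; /* stand-in for clock_t tick counts */

#define SK_MAX(a, b)    ((a) > (b) ? (a) : (b))
#define SK_MIN(a, b)    ((a) < (b) ? (a) : (b))

/*
 * Pick the short interval when more than half of the wanted bytes
 * were written, otherwise the long one, and never schedule the next
 * write in the past.
 */
static sketch_clock_t
sketch_write_interval(sketch_clock_t now, sketch_clock_t began,
    uint64_t wanted, uint64_t wrote, sketch_clock_t hz,
    sketch_clock_t feed_secs, sketch_clock_t feed_min_ms, bool feed_again)
{
        sketch_clock_t interval;

        assert(now >= began);

        if (feed_again && wrote > (wanted / 2))
                interval = (hz * feed_min_ms) / 1000;
        else
                interval = hz * feed_secs;

        return (SK_MAX(now, SK_MIN(now + interval, began + interval)));
}
```

With hz = 100, a feed that began at tick 1000 and finished at tick 1050 is rescheduled for tick 1100 after a small write, but for tick 1050 (immediately) after a write that exceeded half of the target, since began + 20 ticks is already in the past.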
6356 6381  
6357 6382  /*
6358 6383   * Cycle through L2ARC devices.  This is how L2ARC load balances.
6359 6384   * If a device is returned, this also returns holding the spa config lock.
6360 6385   */
6361 6386  static l2arc_dev_t *
6362 6387  l2arc_dev_get_next(void)
6363 6388  {
6364 6389          l2arc_dev_t *first, *next = NULL;
6365 6390  
6366 6391          /*
6367 6392           * Lock out the removal of spas (spa_namespace_lock), then removal
6368 6393           * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
6369 6394           * both locks will be dropped and a spa config lock held instead.
6370 6395           */
6371 6396          mutex_enter(&spa_namespace_lock);
6372 6397          mutex_enter(&l2arc_dev_mtx);
6373 6398  
6374 6399          /* if there are no vdevs, there is nothing to do */
6375 6400          if (l2arc_ndev == 0)
6376 6401                  goto out;
6377 6402  
6378 6403          first = NULL;
6379 6404          next = l2arc_dev_last;
6380 6405          do {
6381 6406                  /* loop around the list looking for a non-faulted vdev */
6382 6407                  if (next == NULL) {
6383 6408                          next = list_head(l2arc_dev_list);
6384 6409                  } else {
6385 6410                          next = list_next(l2arc_dev_list, next);
6386 6411                          if (next == NULL)
6387 6412                                  next = list_head(l2arc_dev_list);
6388 6413                  }
6389 6414  
6390 6415                  /* if we have come back to the start, bail out */
6391 6416                  if (first == NULL)
6392 6417                          first = next;
6393 6418                  else if (next == first)
6394 6419                          break;
6395 6420  
6396 6421          } while (vdev_is_dead(next->l2ad_vdev));
6397 6422  
6398 6423          /* if we were unable to find any usable vdevs, return NULL */
6399 6424          if (vdev_is_dead(next->l2ad_vdev))
6400 6425                  next = NULL;
6401 6426  
6402 6427          l2arc_dev_last = next;
6403 6428  
6404 6429  out:
6405 6430          mutex_exit(&l2arc_dev_mtx);
6406 6431  
6407 6432          /*
6408 6433           * Grab the config lock to prevent the 'next' device from being
6409 6434           * removed while we are writing to it.
6410 6435           */
6411 6436          if (next != NULL)
6412 6437                  spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
6413 6438          mutex_exit(&spa_namespace_lock);
6414 6439  
6415 6440          return (next);
6416 6441  }
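
The round-robin walk in l2arc_dev_get_next() can be modelled without any locking by replacing the device list with an array of "dead" flags. This is only a sketch under that assumption: sketch_dev_get_next() is a hypothetical helper, indices stand in for l2arc_dev_t pointers, and -1 stands in for NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the next live device after 'last' (starting at
 * slot 0 when last is -1), wrapping around the array.  Returns -1 if
 * there are no devices or every device is dead.
 */
static int
sketch_dev_get_next(const bool *dead, int ndev, int last)
{
        int first = -1;
        int next = last;

        assert(ndev >= 0 && last >= -1 && last < ndev);

        /* if there are no devices, there is nothing to do */
        if (ndev == 0)
                return (-1);

        do {
                next = (next + 1) % ndev;

                /* if we have come back to the start, bail out */
                if (first == -1)
                        first = next;
                else if (next == first)
                        break;
        } while (dead[next]);

        return (dead[next] ? -1 : next);
}
```

Feeding the returned index back in as 'last' on the next call reproduces the load balancing across devices; the real function additionally holds spa_namespace_lock and l2arc_dev_mtx during the walk and returns with the spa config lock held.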
6417 6442  
6418 6443  /*
6419 6444   * Free buffers that were tagged for destruction.
6420 6445   */
6421 6446  static void
6422 6447  l2arc_do_free_on_write(void)
6423 6448  {
6424 6449          list_t *buflist;
6425 6450          l2arc_data_free_t *df, *df_prev;
6426 6451  
6427 6452          mutex_enter(&l2arc_free_on_write_mtx);
6428 6453          buflist = l2arc_free_on_write;
6429 6454  
6430 6455          for (df = list_tail(buflist); df; df = df_prev) {
6431 6456                  df_prev = list_prev(buflist, df);
6432 6457                  ASSERT3P(df->l2df_abd, !=, NULL);
6433 6458                  abd_free(df->l2df_abd);
6434 6459                  list_remove(buflist, df);
6435 6460                  kmem_free(df, sizeof (l2arc_data_free_t));
6436 6461          }
6437 6462  
6438 6463          mutex_exit(&l2arc_free_on_write_mtx);
6439 6464  }
6440 6465  
6441 6466  /*
6442 6467   * A write to a cache device has completed.  Update all headers to allow
6443 6468   * reads from these buffers to begin.
6444 6469   */
6445 6470  static void
6446 6471  l2arc_write_done(zio_t *zio)
6447 6472  {
6448 6473          l2arc_write_callback_t *cb;
6449 6474          l2arc_dev_t *dev;
6450 6475          list_t *buflist;
6451 6476          arc_buf_hdr_t *head, *hdr, *hdr_prev;
6452 6477          kmutex_t *hash_lock;
6453 6478          int64_t bytes_dropped = 0;
6454 6479  
6455 6480          cb = zio->io_private;
6456 6481          ASSERT3P(cb, !=, NULL);
6457 6482          dev = cb->l2wcb_dev;
6458 6483          ASSERT3P(dev, !=, NULL);
6459 6484          head = cb->l2wcb_head;
6460 6485          ASSERT3P(head, !=, NULL);
6461 6486          buflist = &dev->l2ad_buflist;
6462 6487          ASSERT3P(buflist, !=, NULL);
6463 6488          DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
6464 6489              l2arc_write_callback_t *, cb);
6465 6490  
6466 6491          if (zio->io_error != 0)
6467 6492                  ARCSTAT_BUMP(arcstat_l2_writes_error);
6468 6493  
6469 6494          /*
6470 6495           * All writes completed, or an error was hit.
6471 6496           */
6472 6497  top:
6473 6498          mutex_enter(&dev->l2ad_mtx);
6474 6499          for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
6475 6500                  hdr_prev = list_prev(buflist, hdr);
6476 6501  
6477 6502                  hash_lock = HDR_LOCK(hdr);
6478 6503  
6479 6504                  /*
6480 6505                   * We cannot use mutex_enter or else we can deadlock
6481 6506                   * with l2arc_write_buffers (due to swapping the order
6482 6507                   * the hash lock and l2ad_mtx are taken).
6483 6508                   */
6484 6509                  if (!mutex_tryenter(hash_lock)) {
6485 6510                          /*
6486 6511                           * Missed the hash lock. We must retry so we
6487 6512                           * don't leave the ARC_FLAG_L2_WRITING bit set.
6488 6513                           */
6489 6514                          ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);
6490 6515  
6491 6516                          /*
6492 6517                           * We don't want to rescan the headers we've
6493 6518                           * already marked as having been written out, so
6494 6519                           * we reinsert the head node so we can pick up
6495 6520                           * where we left off.
6496 6521                           */
6497 6522                          list_remove(buflist, head);
6498 6523                          list_insert_after(buflist, hdr, head);
6499 6524  
6500 6525                          mutex_exit(&dev->l2ad_mtx);
6501 6526  
6502 6527                          /*
6503 6528                           * We wait for the hash lock to become available
6504 6529                           * to try to prevent busy waiting, and increase
6505 6530                           * the chance we'll be able to acquire the lock
6506 6531                           * the next time around.
6507 6532                           */
6508 6533                          mutex_enter(hash_lock);
6509 6534                          mutex_exit(hash_lock);
6510 6535                          goto top;
6511 6536                  }
6512 6537  
6513 6538                  /*
6514 6539                   * We could not have been moved into the arc_l2c_only
6515 6540                   * state while in-flight due to our ARC_FLAG_L2_WRITING
6516 6541                   * bit being set. Let's just ensure that's being enforced.
6517 6542                   */
6518 6543                  ASSERT(HDR_HAS_L1HDR(hdr));
6519 6544  
6520 6545                  if (zio->io_error != 0) {
6521 6546                          /*
6522 6547                           * Error - drop L2ARC entry.
6523 6548                           */
6524 6549                          list_remove(buflist, hdr);
6525 6550                          arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
6526 6551  
6527 6552                          ARCSTAT_INCR(arcstat_l2_psize, -arc_hdr_size(hdr));
6528 6553                          ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
6529 6554  
6530 6555                          bytes_dropped += arc_hdr_size(hdr);
6531 6556                          (void) refcount_remove_many(&dev->l2ad_alloc,
6532 6557                              arc_hdr_size(hdr), hdr);
6533 6558                  }
6534 6559  
6535 6560                  /*
6536 6561                   * Allow ARC to begin reads and ghost list evictions to
6537 6562                   * this L2ARC entry.
6538 6563                   */
6539 6564                  arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
6540 6565  
6541 6566                  mutex_exit(hash_lock);
6542 6567          }
6543 6568  
6544 6569          atomic_inc_64(&l2arc_writes_done);
6545 6570          list_remove(buflist, head);
6546 6571          ASSERT(!HDR_HAS_L1HDR(head));
6547 6572          kmem_cache_free(hdr_l2only_cache, head);
6548 6573          mutex_exit(&dev->l2ad_mtx);
6549 6574  
6550 6575          vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
6551 6576  
6552 6577          l2arc_do_free_on_write();
6553 6578  
6554 6579          kmem_free(cb, sizeof (l2arc_write_callback_t));
6555 6580  }
6556 6581  
6557 6582  /*
6558 6583   * A read to a cache device completed.  Validate buffer contents before
6559 6584   * handing over to the regular ARC routines.
6560 6585   */
6561 6586  static void
6562 6587  l2arc_read_done(zio_t *zio)
6563 6588  {
6564 6589          l2arc_read_callback_t *cb;
6565 6590          arc_buf_hdr_t *hdr;
6566 6591          kmutex_t *hash_lock;
6567 6592          boolean_t valid_cksum;
6568 6593  
6569 6594          ASSERT3P(zio->io_vd, !=, NULL);
6570 6595          ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
6571 6596  
6572 6597          spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
6573 6598  
6574 6599          cb = zio->io_private;
6575 6600          ASSERT3P(cb, !=, NULL);
6576 6601          hdr = cb->l2rcb_hdr;
6577 6602          ASSERT3P(hdr, !=, NULL);
6578 6603  
6579 6604          hash_lock = HDR_LOCK(hdr);
6580 6605          mutex_enter(hash_lock);
6581 6606          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
6582 6607  
6583 6608          /*
6584 6609           * If the data was read into a temporary buffer,
6585 6610           * move it and free the buffer.
6586 6611           */
6587 6612          if (cb->l2rcb_abd != NULL) {
6588 6613                  ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);
6589 6614                  if (zio->io_error == 0) {
6590 6615                          abd_copy(hdr->b_l1hdr.b_pabd, cb->l2rcb_abd,
6591 6616                              arc_hdr_size(hdr));
6592 6617                  }
6593 6618  
6594 6619                  /*
6595 6620                   * The following must be done regardless of whether
6596 6621                   * there was an error:
6597 6622                   * - free the temporary buffer
6598 6623                   * - point zio to the real ARC buffer
6599 6624                   * - set zio size accordingly
6600 6625                   * These are required because the zio is either
6601 6626                   * re-used for an I/O of the block in the case of an error,
6602 6627                   * or passed to arc_read_done(), which
6603 6628                   * needs real data.
6604 6629                   */
6605 6630                  abd_free(cb->l2rcb_abd);
6606 6631                  zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);
6607 6632                  zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;
6608 6633          }
6609 6634  
6610 6635          ASSERT3P(zio->io_abd, !=, NULL);
6611 6636  
6612 6637          /*
6613 6638           * Check this survived the L2ARC journey.
6614 6639           */
6615 6640          ASSERT3P(zio->io_abd, ==, hdr->b_l1hdr.b_pabd);
6616 6641          zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
6617 6642          zio->io_bp = &zio->io_bp_copy;  /* XXX fix in L2ARC 2.0 */
6618 6643  
6619 6644          valid_cksum = arc_cksum_is_equal(hdr, zio);
6620 6645          if (valid_cksum && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
6621 6646                  mutex_exit(hash_lock);
6622 6647                  zio->io_private = hdr;
6623 6648                  arc_read_done(zio);
6624 6649          } else {
6625 6650                  mutex_exit(hash_lock);
6626 6651                  /*
6627 6652                   * Buffer didn't survive caching.  Increment stats and
6628 6653                   * reissue to the original storage device.
6629 6654                   */
6630 6655                  if (zio->io_error != 0) {
6631 6656                          ARCSTAT_BUMP(arcstat_l2_io_error);
6632 6657                  } else {
6633 6658                          zio->io_error = SET_ERROR(EIO);
6634 6659                  }
6635 6660                  if (!valid_cksum)
6636 6661                          ARCSTAT_BUMP(arcstat_l2_cksum_bad);
6637 6662  
6638 6663                  /*
6639 6664                   * If there's no waiter, issue an async i/o to the primary
6640 6665                   * storage now.  If there *is* a waiter, the caller must
6641 6666                   * issue the i/o in a context where it's OK to block.
6642 6667                   */
6643 6668                  if (zio->io_waiter == NULL) {
6644 6669                          zio_t *pio = zio_unique_parent(zio);
6645 6670  
6646 6671                          ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
6647 6672  
6648 6673                          zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
6649 6674                              hdr->b_l1hdr.b_pabd, zio->io_size, arc_read_done,
6650 6675                              hdr, zio->io_priority, cb->l2rcb_flags,
6651 6676                              &cb->l2rcb_zb));
6652 6677                  }
6653 6678          }
6654 6679  
6655 6680          kmem_free(cb, sizeof (l2arc_read_callback_t));
6656 6681  }
6657 6682  
6658 6683  /*
6659 6684   * This is the list priority from which the L2ARC will search for pages to
6660 6685   * cache.  This is used within loops (0..3) to cycle through lists in the
6661 6686   * desired order.  This order can have a significant effect on cache
6662 6687   * performance.
6663 6688   *
6664 6689   * Currently the metadata lists are hit first, MFU then MRU, followed by
6665 6690   * the data lists.  This function returns a locked list, and also returns
6666 6691   * the lock pointer.
6667 6692   */
6668 6693  static multilist_sublist_t *
6669 6694  l2arc_sublist_lock(int list_num)
6670 6695  {
6671 6696          multilist_t *ml = NULL;
6672 6697          unsigned int idx;
6673 6698  
6674 6699          ASSERT(list_num >= 0 && list_num <= 3);
6675 6700  
6676 6701          switch (list_num) {
6677 6702          case 0:
6678 6703                  ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
6679 6704                  break;
6680 6705          case 1:
6681 6706                  ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
6682 6707                  break;
6683 6708          case 2:
6684 6709                  ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
6685 6710                  break;
6686 6711          case 3:
6687 6712                  ml = arc_mru->arcs_list[ARC_BUFC_DATA];
6688 6713                  break;
6689 6714          }
6690 6715  
6691 6716          /*
6692 6717           * Return a randomly-selected sublist. This is acceptable
6693 6718           * because the caller feeds only a little bit of data for each
6694 6719           * call (8MB). Subsequent calls will result in different
6695 6720           * sublists being selected.
6696 6721           */
6697 6722          idx = multilist_get_random_index(ml);
6698 6723          return (multilist_sublist_lock(ml, idx));
6699 6724  }
6700 6725  
6701 6726  /*
6702 6727   * Evict buffers from the device write hand to the distance specified in
6703 6728   * bytes.  This distance may span populated buffers, it may span nothing.
6704 6729   * This is clearing a region on the L2ARC device ready for writing.
6705 6730   * If the 'all' boolean is set, every buffer is evicted.
6706 6731   */
6707 6732  static void
6708 6733  l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
6709 6734  {
6710 6735          list_t *buflist;
6711 6736          arc_buf_hdr_t *hdr, *hdr_prev;
6712 6737          kmutex_t *hash_lock;
6713 6738          uint64_t taddr;
6714 6739  
6715 6740          buflist = &dev->l2ad_buflist;
6716 6741  
6717 6742          if (!all && dev->l2ad_first) {
6718 6743                  /*
6719 6744                   * This is the first sweep through the device.  There is
6720 6745                   * nothing to evict.
6721 6746                   */
6722 6747                  return;
6723 6748          }
6724 6749  
6725 6750          if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
6726 6751                  /*
6727 6752                   * When nearing the end of the device, evict to the end
6728 6753                   * before the device write hand jumps to the start.
6729 6754                   */
6730 6755                  taddr = dev->l2ad_end;
6731 6756          } else {
6732 6757                  taddr = dev->l2ad_hand + distance;
6733 6758          }
6734 6759          DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
6735 6760              uint64_t, taddr, boolean_t, all);
6736 6761  
6737 6762  top:
6738 6763          mutex_enter(&dev->l2ad_mtx);
6739 6764          for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
6740 6765                  hdr_prev = list_prev(buflist, hdr);
6741 6766  
6742 6767                  hash_lock = HDR_LOCK(hdr);
6743 6768  
6744 6769                  /*
6745 6770                   * We cannot use mutex_enter or else we can deadlock
6746 6771                   * with l2arc_write_buffers (due to swapping the order
6747 6772                   * the hash lock and l2ad_mtx are taken).
6748 6773                   */
6749 6774                  if (!mutex_tryenter(hash_lock)) {
6750 6775                          /*
6751 6776                           * Missed the hash lock.  Retry.
6752 6777                           */
6753 6778                          ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
6754 6779                          mutex_exit(&dev->l2ad_mtx);
6755 6780                          mutex_enter(hash_lock);
6756 6781                          mutex_exit(hash_lock);
6757 6782                          goto top;
6758 6783                  }
6759 6784  
6760 6785                  /*
6761 6786                   * A header can't be on this list if it doesn't have an L2 header.
6762 6787                   */
6763 6788                  ASSERT(HDR_HAS_L2HDR(hdr));
6764 6789  
6765 6790                  /* Ensure this header has finished being written. */
6766 6791                  ASSERT(!HDR_L2_WRITING(hdr));
6767 6792                  ASSERT(!HDR_L2_WRITE_HEAD(hdr));
6768 6793  
6769 6794                  if (!all && (hdr->b_l2hdr.b_daddr >= taddr ||
6770 6795                      hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {
6771 6796                          /*
6772 6797                           * We've evicted to the target address,
6773 6798                           * or the end of the device.
6774 6799                           */
6775 6800                          mutex_exit(hash_lock);
6776 6801                          break;
6777 6802                  }
6778 6803  
6779 6804                  if (!HDR_HAS_L1HDR(hdr)) {
6780 6805                          ASSERT(!HDR_L2_READING(hdr));
6781 6806                          /*
6782 6807                           * This doesn't exist in the ARC.  Destroy.
6783 6808                           * arc_hdr_destroy() will call list_remove()
6784 6809                           * and decrement arcstat_l2_lsize.
6785 6810                           */
6786 6811                          arc_change_state(arc_anon, hdr, hash_lock);
6787 6812                          arc_hdr_destroy(hdr);
6788 6813                  } else {
6789 6814                          ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
6790 6815                          ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
6791 6816                          /*
6792 6817                           * Invalidate issued or about to be issued
6793 6818                           * reads, since we may be about to write
6794 6819                           * over this location.
6795 6820                           */
6796 6821                          if (HDR_L2_READING(hdr)) {
6797 6822                                  ARCSTAT_BUMP(arcstat_l2_evict_reading);
6798 6823                                  arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
6799 6824                          }
6800 6825  
6801 6826                          arc_hdr_l2hdr_destroy(hdr);
6802 6827                  }
6803 6828                  mutex_exit(hash_lock);
6804 6829          }
6805 6830          mutex_exit(&dev->l2ad_mtx);
6806 6831  }
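
How l2arc_evict() chooses its target address can be checked with a few numbers. The sketch below assumes the hand, device end and distance are plain byte offsets passed as arguments; sketch_evict_target() is a hypothetical helper, not part of arc.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the address up to which buffers are evicted.  When the
 * write hand is within two distances of the device end, evict all
 * the way to the end, because the hand is about to wrap to the start.
 */
static uint64_t
sketch_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
        /* keep the unsigned subtraction below from wrapping */
        assert(hand <= end && 2 * distance <= end);

        if (hand >= end - (2 * distance))
                return (end);

        return (hand + distance);
}
```

For a device ending at offset 1000 and a distance of 50, a hand at 100 yields a target of 150, while a hand at 900 or beyond yields 1000.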
6807 6832  
6808 6833  /*
6809 6834   * Find and write ARC buffers to the L2ARC device.
6810 6835   *
6811 6836   * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
6812 6837   * for reading until they have completed writing.
6813 6838   * The scan headroom is target_sz * l2arc_headroom, scaled by the
6814 6839   * l2arc_headroom_boost percentage when the compressed ARC is enabled.
6815 6840   *
6816 6841   * Returns the number of bytes actually written (which may be smaller than
6817 6842   * the delta by which the device hand has changed due to alignment).
6818 6843   */
6819 6844  static uint64_t
6820 6845  l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
6821 6846  {
6822 6847          arc_buf_hdr_t *hdr, *hdr_prev, *head;
6823 6848          uint64_t write_asize, write_psize, write_lsize, headroom;
6824 6849          boolean_t full;
6825 6850          l2arc_write_callback_t *cb;
6826 6851          zio_t *pio, *wzio;
6827 6852          uint64_t guid = spa_load_guid(spa);
6828 6853  
6829 6854          ASSERT3P(dev->l2ad_vdev, !=, NULL);
6830 6855  
6831 6856          pio = NULL;
6832 6857          write_lsize = write_asize = write_psize = 0;
6833 6858          full = B_FALSE;
6834 6859          head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
6835 6860          arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
6836 6861  
6837 6862          /*
6838 6863           * Copy buffers for L2ARC writing.
6839 6864           */
6840 6865          for (int try = 0; try <= 3; try++) {
6841 6866                  multilist_sublist_t *mls = l2arc_sublist_lock(try);
6842 6867                  uint64_t passed_sz = 0;
6843 6868  
6844 6869                  /*
6845 6870                   * L2ARC fast warmup.
6846 6871                   *
6847 6872                   * Until the ARC is warm and starts to evict, read from the
6848 6873                   * head of the ARC lists rather than the tail.
6849 6874                   */
6850 6875                  if (arc_warm == B_FALSE)
6851 6876                          hdr = multilist_sublist_head(mls);
6852 6877                  else
6853 6878                          hdr = multilist_sublist_tail(mls);
6854 6879  
6855 6880                  headroom = target_sz * l2arc_headroom;
6856 6881                  if (zfs_compressed_arc_enabled)
6857 6882                          headroom = (headroom * l2arc_headroom_boost) / 100;
6858 6883  
6859 6884                  for (; hdr; hdr = hdr_prev) {
6860 6885                          kmutex_t *hash_lock;
6861 6886  
6862 6887                          if (arc_warm == B_FALSE)
6863 6888                                  hdr_prev = multilist_sublist_next(mls, hdr);
6864 6889                          else
6865 6890                                  hdr_prev = multilist_sublist_prev(mls, hdr);
6866 6891  
6867 6892                          hash_lock = HDR_LOCK(hdr);
6868 6893                          if (!mutex_tryenter(hash_lock)) {
6869 6894                                  /*
6870 6895                                   * Skip this buffer rather than waiting.
6871 6896                                   */
6872 6897                                  continue;
6873 6898                          }
6874 6899  
6875 6900                          passed_sz += HDR_GET_LSIZE(hdr);
6876 6901                          if (passed_sz > headroom) {
6877 6902                                  /*
6878 6903                                   * Searched too far.
6879 6904                                   */
6880 6905                                  mutex_exit(hash_lock);
6881 6906                                  break;
6882 6907                          }
6883 6908  
6884 6909                          if (!l2arc_write_eligible(guid, hdr)) {
6885 6910                                  mutex_exit(hash_lock);
6886 6911                                  continue;
6887 6912                          }
6888 6913  
6889 6914                          /*
6890 6915                           * We rely on the L1 portion of the header below, so
6891 6916                           * it's invalid for this header to have been evicted out
6892 6917                           * of the ghost cache, prior to being written out. The
6893 6918                           * ARC_FLAG_L2_WRITING bit ensures this won't happen.
6894 6919                           */
6895 6920                          ASSERT(HDR_HAS_L1HDR(hdr));
6896 6921  
6897 6922                          ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
6898 6923                          ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
6899 6924                          ASSERT3U(arc_hdr_size(hdr), >, 0);
6900 6925                          uint64_t psize = arc_hdr_size(hdr);
6901 6926                          uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
6902 6927                              psize);
6903 6928  
6904 6929                          if ((write_asize + asize) > target_sz) {
6905 6930                                  full = B_TRUE;
6906 6931                                  mutex_exit(hash_lock);
6907 6932                                  break;
6908 6933                          }
6909 6934  
6910 6935                          if (pio == NULL) {
6911 6936                                  /*
6912 6937                                   * Insert a dummy header on the buflist so
6913 6938                                   * l2arc_write_done() can find where the
6914 6939                                   * write buffers begin without searching.
6915 6940                                   */
6916 6941                                  mutex_enter(&dev->l2ad_mtx);
6917 6942                                  list_insert_head(&dev->l2ad_buflist, head);
6918 6943                                  mutex_exit(&dev->l2ad_mtx);
6919 6944  
6920 6945                                  cb = kmem_alloc(
6921 6946                                      sizeof (l2arc_write_callback_t), KM_SLEEP);
6922 6947                                  cb->l2wcb_dev = dev;
6923 6948                                  cb->l2wcb_head = head;
6924 6949                                  pio = zio_root(spa, l2arc_write_done, cb,
6925 6950                                      ZIO_FLAG_CANFAIL);
6926 6951                          }
6927 6952  
6928 6953                          hdr->b_l2hdr.b_dev = dev;
6929 6954                          hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
6930 6955                          arc_hdr_set_flags(hdr,
6931 6956                              ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
6932 6957  
6933 6958                          mutex_enter(&dev->l2ad_mtx);
6934 6959                          list_insert_head(&dev->l2ad_buflist, hdr);
6935 6960                          mutex_exit(&dev->l2ad_mtx);
6936 6961  
6937 6962                          (void) refcount_add_many(&dev->l2ad_alloc, psize, hdr);
6938 6963  
6939 6964                          /*
6940 6965                           * Normally the L2ARC can use the hdr's data, but if
6941 6966                           * we're sharing data between the hdr and one of its
6942 6967                           * bufs, L2ARC needs its own copy of the data so that
6943 6968                           * the ZIO below can't race with the buf consumer.
6944 6969                           * Another case where we need to create a copy of the
6945 6970                           * data is when the buffer size is not device-aligned
6946 6971                           * and we need to pad the block to make it such.
6947 6972                           * That also keeps the clock hand suitably aligned.
6948 6973                           *
6949 6974                           * To ensure that the copy will be available for the
6950 6975                           * lifetime of the ZIO and be cleaned up afterwards, we
6951 6976                           * add it to the l2arc_free_on_write queue.
6952 6977                           */
6953 6978                          abd_t *to_write;
6954 6979                          if (!HDR_SHARED_DATA(hdr) && psize == asize) {
6955 6980                                  to_write = hdr->b_l1hdr.b_pabd;
6956 6981                          } else {
6957 6982                                  to_write = abd_alloc_for_io(asize,
6958 6983                                      HDR_ISTYPE_METADATA(hdr));
6959 6984                                  abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
6960 6985                                  if (asize != psize) {
6961 6986                                          abd_zero_off(to_write, psize,
6962 6987                                              asize - psize);
6963 6988                                  }
6964 6989                                  l2arc_free_abd_on_write(to_write, asize,
6965 6990                                      arc_buf_type(hdr));
6966 6991                          }
6967 6992                          wzio = zio_write_phys(pio, dev->l2ad_vdev,
6968 6993                              hdr->b_l2hdr.b_daddr, asize, to_write,
6969 6994                              ZIO_CHECKSUM_OFF, NULL, hdr,
6970 6995                              ZIO_PRIORITY_ASYNC_WRITE,
6971 6996                              ZIO_FLAG_CANFAIL, B_FALSE);
6972 6997  
6973 6998                          write_lsize += HDR_GET_LSIZE(hdr);
6974 6999                          DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
6975 7000                              zio_t *, wzio);
6976 7001  
6977 7002                          write_psize += psize;
6978 7003                          write_asize += asize;
6979 7004                          dev->l2ad_hand += asize;
6980 7005  
6981 7006                          mutex_exit(hash_lock);
6982 7007  
6983 7008                          (void) zio_nowait(wzio);
6984 7009                  }
6985 7010  
6986 7011                  multilist_sublist_unlock(mls);
6987 7012  
6988 7013                  if (full == B_TRUE)
6989 7014                          break;
6990 7015          }
6991 7016  
6992 7017          /* No buffers selected for writing? */
6993 7018          if (pio == NULL) {
6994 7019                  ASSERT0(write_lsize);
6995 7020                  ASSERT(!HDR_HAS_L1HDR(head));
6996 7021                  kmem_cache_free(hdr_l2only_cache, head);
6997 7022                  return (0);
6998 7023          }
6999 7024  
7000 7025          ASSERT3U(write_asize, <=, target_sz);
7001 7026          ARCSTAT_BUMP(arcstat_l2_writes_sent);
7002 7027          ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
7003 7028          ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
7004 7029          ARCSTAT_INCR(arcstat_l2_psize, write_psize);
7005 7030          vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
7006 7031  
7007 7032          /*
7008 7033           * Bump device hand to the device start if it is approaching the end.
7009 7034           * l2arc_evict() will already have evicted ahead for this case.
7010 7035           */
7011 7036          if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
7012 7037                  dev->l2ad_hand = dev->l2ad_start;
7013 7038                  dev->l2ad_first = B_FALSE;
7014 7039          }
7015 7040  
7016 7041          dev->l2ad_writing = B_TRUE;
7017 7042          (void) zio_wait(pio);
7018 7043          dev->l2ad_writing = B_FALSE;
7019 7044  
7020 7045          return (write_asize);
7021 7046  }
7022 7047  
7023 7048  /*
7024 7049   * This thread feeds the L2ARC at regular intervals.  This is the beating
7025 7050   * heart of the L2ARC.
7026 7051   */
7027 7052  /* ARGSUSED */
7028 7053  static void
7029 7054  l2arc_feed_thread(void *unused)
7030 7055  {
7031 7056          callb_cpr_t cpr;
7032 7057          l2arc_dev_t *dev;
7033 7058          spa_t *spa;
7034 7059          uint64_t size, wrote;
7035 7060          clock_t begin, next = ddi_get_lbolt();
7036 7061  
7037 7062          CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
7038 7063  
7039 7064          mutex_enter(&l2arc_feed_thr_lock);
7040 7065  
7041 7066          while (l2arc_thread_exit == 0) {
7042 7067                  CALLB_CPR_SAFE_BEGIN(&cpr);
7043 7068                  (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
7044 7069                      next);
7045 7070                  CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
7046 7071                  next = ddi_get_lbolt() + hz;
7047 7072  
7048 7073                  /*
7049 7074                   * Quick check for L2ARC devices.
7050 7075                   */
7051 7076                  mutex_enter(&l2arc_dev_mtx);
7052 7077                  if (l2arc_ndev == 0) {
7053 7078                          mutex_exit(&l2arc_dev_mtx);
7054 7079                          continue;
7055 7080                  }
7056 7081                  mutex_exit(&l2arc_dev_mtx);
7057 7082                  begin = ddi_get_lbolt();
7058 7083  
7059 7084                  /*
7060 7085                   * This selects the next l2arc device to write to, and in
7061 7086                   * doing so the next spa to feed from: dev->l2ad_spa.   This
7062 7087                   * will return NULL if there are now no l2arc devices or if
7063 7088                   * they are all faulted.
7064 7089                   *
7065 7090                   * If a device is returned, its spa's config lock is also
7066 7091                   * held to prevent device removal.  l2arc_dev_get_next()
7067 7092                   * will grab and release l2arc_dev_mtx.
7068 7093                   */
7069 7094                  if ((dev = l2arc_dev_get_next()) == NULL)
7070 7095                          continue;
7071 7096  
7072 7097                  spa = dev->l2ad_spa;
7073 7098                  ASSERT3P(spa, !=, NULL);
7074 7099  
7075 7100                  /*
7076 7101                   * If the pool is read-only then force the feed thread to
7077 7102                   * sleep a little longer.
7078 7103                   */
7079 7104                  if (!spa_writeable(spa)) {
7080 7105                          next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
7081 7106                          spa_config_exit(spa, SCL_L2ARC, dev);
7082 7107                          continue;
7083 7108                  }
7084 7109  
7085 7110                  /*
7086 7111                   * Avoid contributing to memory pressure.
7087 7112                   */
7088 7113                  if (arc_reclaim_needed()) {
7089 7114                          ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
7090 7115                          spa_config_exit(spa, SCL_L2ARC, dev);
7091 7116                          continue;
7092 7117                  }
7093 7118  
7094 7119                  ARCSTAT_BUMP(arcstat_l2_feeds);
7095 7120  
7096 7121                  size = l2arc_write_size();
7097 7122  
7098 7123                  /*
7099 7124                   * Evict L2ARC buffers that will be overwritten.
7100 7125                   */
7101 7126                  l2arc_evict(dev, size, B_FALSE);
7102 7127  
7103 7128                  /*
7104 7129                   * Write ARC buffers.
7105 7130                   */
7106 7131                  wrote = l2arc_write_buffers(spa, dev, size);
7107 7132  
7108 7133                  /*
7109 7134                   * Calculate interval between writes.
7110 7135                   */
7111 7136                  next = l2arc_write_interval(begin, size, wrote);
7112 7137                  spa_config_exit(spa, SCL_L2ARC, dev);
7113 7138          }
7114 7139  
7115 7140          l2arc_thread_exit = 0;
7116 7141          cv_broadcast(&l2arc_feed_thr_cv);
7117 7142          CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
7118 7143          thread_exit();
7119 7144  }
7120 7145  
7121 7146  boolean_t
7122 7147  l2arc_vdev_present(vdev_t *vd)
7123 7148  {
7124 7149          l2arc_dev_t *dev;
7125 7150  
7126 7151          mutex_enter(&l2arc_dev_mtx);
7127 7152          for (dev = list_head(l2arc_dev_list); dev != NULL;
7128 7153              dev = list_next(l2arc_dev_list, dev)) {
7129 7154                  if (dev->l2ad_vdev == vd)
7130 7155                          break;
7131 7156          }
7132 7157          mutex_exit(&l2arc_dev_mtx);
7133 7158  
7134 7159          return (dev != NULL);
7135 7160  }
7136 7161  
7137 7162  /*
7138 7163   * Add a vdev for use by the L2ARC.  By this point the spa has already
7139 7164   * validated the vdev and opened it.
7140 7165   */
7141 7166  void
7142 7167  l2arc_add_vdev(spa_t *spa, vdev_t *vd)
7143 7168  {
7144 7169          l2arc_dev_t *adddev;
7145 7170  
7146 7171          ASSERT(!l2arc_vdev_present(vd));
7147 7172  
7148 7173          /*
7149 7174           * Create a new l2arc device entry.
7150 7175           */
7151 7176          adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
7152 7177          adddev->l2ad_spa = spa;
7153 7178          adddev->l2ad_vdev = vd;
7154 7179          adddev->l2ad_start = VDEV_LABEL_START_SIZE;
7155 7180          adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
7156 7181          adddev->l2ad_hand = adddev->l2ad_start;
7157 7182          adddev->l2ad_first = B_TRUE;
7158 7183          adddev->l2ad_writing = B_FALSE;
7159 7184  
7160 7185          mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
7161 7186          /*
7162 7187           * This is a list of all ARC buffers that are still valid on the
7163 7188           * device.
7164 7189           */
7165 7190          list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
7166 7191              offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
7167 7192  
7168 7193          vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
7169 7194          refcount_create(&adddev->l2ad_alloc);
7170 7195  
7171 7196          /*
7172 7197           * Add device to global list
7173 7198           */
7174 7199          mutex_enter(&l2arc_dev_mtx);
7175 7200          list_insert_head(l2arc_dev_list, adddev);
7176 7201          atomic_inc_64(&l2arc_ndev);
7177 7202          mutex_exit(&l2arc_dev_mtx);
7178 7203  }
7179 7204  
7180 7205  /*
7181 7206   * Remove a vdev from the L2ARC.
7182 7207   */
7183 7208  void
7184 7209  l2arc_remove_vdev(vdev_t *vd)
7185 7210  {
7186 7211          l2arc_dev_t *dev, *nextdev, *remdev = NULL;
7187 7212  
7188 7213          /*
7189 7214           * Find the device by vdev
7190 7215           */
7191 7216          mutex_enter(&l2arc_dev_mtx);
7192 7217          for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
7193 7218                  nextdev = list_next(l2arc_dev_list, dev);
7194 7219                  if (vd == dev->l2ad_vdev) {
7195 7220                          remdev = dev;
7196 7221                          break;
7197 7222                  }
7198 7223          }
7199 7224          ASSERT3P(remdev, !=, NULL);
7200 7225  
7201 7226          /*
7202 7227           * Remove device from global list
7203 7228           */
7204 7229          list_remove(l2arc_dev_list, remdev);
7205 7230          l2arc_dev_last = NULL;          /* may have been invalidated */
7206 7231          atomic_dec_64(&l2arc_ndev);
7207 7232          mutex_exit(&l2arc_dev_mtx);
7208 7233  
7209 7234          /*
7210 7235           * Clear all buflists and ARC references.  L2ARC device flush.
7211 7236           */
7212 7237          l2arc_evict(remdev, 0, B_TRUE);
7213 7238          list_destroy(&remdev->l2ad_buflist);
7214 7239          mutex_destroy(&remdev->l2ad_mtx);
7215 7240          refcount_destroy(&remdev->l2ad_alloc);
7216 7241          kmem_free(remdev, sizeof (l2arc_dev_t));
7217 7242  }
7218 7243  
7219 7244  void
7220 7245  l2arc_init(void)
7221 7246  {
7222 7247          l2arc_thread_exit = 0;
7223 7248          l2arc_ndev = 0;
7224 7249          l2arc_writes_sent = 0;
7225 7250          l2arc_writes_done = 0;
7226 7251  
7227 7252          mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
7228 7253          cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
7229 7254          mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
7230 7255          mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
7231 7256  
7232 7257          l2arc_dev_list = &L2ARC_dev_list;
7233 7258          l2arc_free_on_write = &L2ARC_free_on_write;
7234 7259          list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
7235 7260              offsetof(l2arc_dev_t, l2ad_node));
7236 7261          list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
7237 7262              offsetof(l2arc_data_free_t, l2df_list_node));
7238 7263  }
7239 7264  
7240 7265  void
7241 7266  l2arc_fini(void)
7242 7267  {
7243 7268          /*
7244 7269           * This is called from dmu_fini(), which is called from spa_fini();
7245 7270           * Because of this, we can assume that all l2arc devices have
7246 7271           * already been removed when the pools themselves were removed.
7247 7272           */
7248 7273  
7249 7274          l2arc_do_free_on_write();
7250 7275  
7251 7276          mutex_destroy(&l2arc_feed_thr_lock);
7252 7277          cv_destroy(&l2arc_feed_thr_cv);
7253 7278          mutex_destroy(&l2arc_dev_mtx);
7254 7279          mutex_destroy(&l2arc_free_on_write_mtx);
7255 7280  
7256 7281          list_destroy(l2arc_dev_list);
7257 7282          list_destroy(l2arc_free_on_write);
7258 7283  }
7259 7284  
7260 7285  void
7261 7286  l2arc_start(void)
7262 7287  {
7263 7288          if (!(spa_mode_global & FWRITE))
7264 7289                  return;
7265 7290  
7266 7291          (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
7267 7292              TS_RUN, minclsyspri);
7268 7293  }
7269 7294  
7270 7295  void
7271 7296  l2arc_stop(void)
7272 7297  {
7273 7298          if (!(spa_mode_global & FWRITE))
7274 7299                  return;
7275 7300  
7276 7301          mutex_enter(&l2arc_feed_thr_lock);
7277 7302          cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
7278 7303          l2arc_thread_exit = 1;
7279 7304          while (l2arc_thread_exit != 0)
7280 7305                  cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
7281 7306          mutex_exit(&l2arc_feed_thr_lock);
7282 7307  }

