re #13729 assign each ARC hash bucket its own mutex
In the ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM, but the number of locks protecting
the headers in those buckets is fixed at 256.
Hence, on systems with large memory (>= 128GB), too many unrelated buffer
headers are protected by the same mutex.
When system memory is fragmented this may cause a deadlock (see the sketch
after this list):
- An arc_read() thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread that is supposed to evict some buffers calls an
evict callback on one of the buffers.
- Before freeing the memory, the callback attempts to take the lock on the
buffer header.
- Incidentally, this buffer header is protected by the same lock as the one
held by the arc_read() thread.
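
As a rough illustration of why this happens (a standalone sketch, not ZFS
code; the bucket numbers are made up for the example), the old scheme stripes
a fixed pool of 256 locks across all hash buckets, so two completely
unrelated buckets easily map to the same mutex, which is the precondition for
the deadlock above:

    /* build with: cc lock_sharing.c && ./a.out */
    #include <stdio.h>
    #include <stdint.h>

    #define BUF_LOCKS       256     /* fixed lock count in the old code */

    int
    main(void)
    {
            uint64_t nbuckets = 2ULL << 20;           /* ~2M buckets at 128GB */
            uint64_t bucket_a = 1000;                 /* held by arc_read() */
            uint64_t bucket_b = bucket_a + BUF_LOCKS; /* unrelated bucket */

            /* old mapping: lock index = bucket index & (BUF_LOCKS - 1) */
            unsigned lock_a = bucket_a & (BUF_LOCKS - 1);
            unsigned lock_b = bucket_b & (BUF_LOCKS - 1);

            printf("buckets %llu and %llu share lock %u: %s\n",
                (unsigned long long)bucket_a, (unsigned long long)bucket_b,
                lock_a, lock_a == lock_b ? "yes" : "no");
            printf("about %llu buckets per lock on average\n",
                (unsigned long long)(nbuckets / BUF_LOCKS));
            return (0);
    }

With ~2M buckets and 256 locks, roughly 8192 buckets share each mutex, so the
eviction callback and the arc_read() thread need not touch related data at
all to collide on a lock.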
The solution in this patch is not perfect, in that it still protects all
headers in a hash bucket with the same lock.
However, the probability of a collision is very low and does not depend on
memory size.
By the same argument, padding the locks to a cache line looks like a waste of
memory here, since the probability of contention on a cache line is quite low
given the number of buckets, the number of locks per cache line (4), and the
fact that the hash function (crc64 % hash table size) is supposed to be a
very good randomizer.
The effect on memory usage is as follows.
For a hash table of size n:
- The original code uses 16K + 16 + n * 8 bytes of memory.
- This fix uses 2 * n * 8 + 8 bytes of memory.
- The net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size.
For 128GB of physical memory it is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller, and
for larger memory configurations it is proportionally bigger.
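
A back-of-the-envelope check of the figures above (a sketch that assumes
8-byte pointers and an 8-byte kmutex_t, matching the accounting used in this
description; exact structure sizes vary by platform):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            uint64_t n = 2ULL << 20;        /* 2M buckets for 128GB of RAM */

            /* old: 256 cacheline-padded locks + mask/pointer + pointer table */
            uint64_t old_bytes = 256 * 64 + 16 + n * 8;
            /* new: per-bucket header pointer and mutex, plus the mask */
            uint64_t new_bytes = 2 * n * 8 + 8;

            printf("old: %llu bytes, new: %llu bytes, overhead: %lld bytes\n",
                (unsigned long long)old_bytes, (unsigned long long)new_bytes,
                (long long)(new_bytes - old_bytes));
            return (0);
    }

This prints an overhead of 16,760,824 bytes, i.e. 16M - 16K - 8 as stated.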
The patch has been tested for 30+ hours using a vdbench script that
reproduces the hang with the original code 100% of the time within 20-30
minutes.
    
      
    
          --- old/usr/src/uts/common/fs/zfs/arc.c
          +++ new/usr/src/uts/common/fs/zfs/arc.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   23   * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
  24   24   * Copyright (c) 2013 by Delphix. All rights reserved.
  25   25   */
  26   26  
  27   27  /*
  28   28   * DVA-based Adjustable Replacement Cache
  29   29   *
  30   30   * While much of the theory of operation used here is
  31   31   * based on the self-tuning, low overhead replacement cache
  32   32   * presented by Megiddo and Modha at FAST 2003, there are some
  33   33   * significant differences:
  34   34   *
  35   35   * 1. The Megiddo and Modha model assumes any page is evictable.
  36   36   * Pages in its cache cannot be "locked" into memory.  This makes
  37   37   * the eviction algorithm simple: evict the last page in the list.
  38   38   * This also make the performance characteristics easy to reason
  39   39   * about.  Our cache is not so simple.  At any given moment, some
  40   40   * subset of the blocks in the cache are un-evictable because we
  41   41   * have handed out a reference to them.  Blocks are only evictable
  42   42   * when there are no external references active.  This makes
  43   43   * eviction far more problematic:  we choose to evict the evictable
  44   44   * blocks that are the "lowest" in the list.
  45   45   *
  46   46   * There are times when it is not possible to evict the requested
  47   47   * space.  In these circumstances we are unable to adjust the cache
  48   48   * size.  To prevent the cache growing unbounded at these times we
  49   49   * implement a "cache throttle" that slows the flow of new data
  50   50   * into the cache until we can make space available.
  51   51   *
  52   52   * 2. The Megiddo and Modha model assumes a fixed cache size.
  53   53   * Pages are evicted when the cache is full and there is a cache
  54   54   * miss.  Our model has a variable sized cache.  It grows with
  55   55   * high use, but also tries to react to memory pressure from the
  56   56   * operating system: decreasing its size when system memory is
  57   57   * tight.
  58   58   *
  59   59   * 3. The Megiddo and Modha model assumes a fixed page size. All
  60   60   * elements of the cache are therefor exactly the same size.  So
  61   61   * when adjusting the cache size following a cache miss, its simply
  62   62   * a matter of choosing a single page to evict.  In our model, we
  63   63   * have variable sized cache blocks (rangeing from 512 bytes to
  64   64   * 128K bytes).  We therefor choose a set of blocks to evict to make
  65   65   * space for a cache miss that approximates as closely as possible
  66   66   * the space used by the new block.
  67   67   *
  68   68   * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"
  69   69   * by N. Megiddo & D. Modha, FAST 2003
  70   70   */
  71   71  
  72   72  /*
  73   73   * The locking model:
  74   74   *
  75   75   * A new reference to a cache buffer can be obtained in two
  76   76   * ways: 1) via a hash table lookup using the DVA as a key,
  77   77   * or 2) via one of the ARC lists.  The arc_read() interface
  78   78   * uses method 1, while the internal arc algorithms for
  79   79   * adjusting the cache use method 2.  We therefor provide two
  80   80   * types of locks: 1) the hash table lock array, and 2) the
  81   81   * arc list locks.
  82   82   *
  83   83   * Buffers do not have their own mutexes, rather they rely on the
  84   84   * hash table mutexes for the bulk of their protection (i.e. most
  85   85   * fields in the arc_buf_hdr_t are protected by these mutexes).
  86   86   *
  87   87   * buf_hash_find() returns the appropriate mutex (held) when it
  88   88   * locates the requested buffer in the hash table.  It returns
  89   89   * NULL for the mutex if the buffer was not in the table.
  90   90   *
  91   91   * buf_hash_remove() expects the appropriate hash mutex to be
  92   92   * already held before it is invoked.
  93   93   *
  94   94   * Each arc state also has a mutex which is used to protect the
  95   95   * buffer list associated with the state.  When attempting to
  96   96   * obtain a hash table lock while holding an arc list lock you
  97   97   * must use: mutex_tryenter() to avoid deadlock.  Also note that
  98   98   * the active state mutex must be held before the ghost state mutex.
  99   99   *
 100  100   * Arc buffers may have an associated eviction callback function.
 101  101   * This function will be invoked prior to removing the buffer (e.g.
 102  102   * in arc_do_user_evicts()).  Note however that the data associated
 103  103   * with the buffer may be evicted prior to the callback.  The callback
 104  104   * must be made with *no locks held* (to prevent deadlock).  Additionally,
 105  105   * the users of callbacks must ensure that their private data is
 106  106   * protected from simultaneous callbacks from arc_buf_evict()
 107  107   * and arc_do_user_evicts().
 108  108   *
 109  109   * Note that the majority of the performance stats are manipulated
 110  110   * with atomic operations.
 111  111   *
 112  112   * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
 113  113   *
 114  114   *      - L2ARC buflist creation
 115  115   *      - L2ARC buflist eviction
 116  116   *      - L2ARC write completion, which walks L2ARC buflists
 117  117   *      - ARC header destruction, as it removes from L2ARC buflists
 118  118   *      - ARC header release, as it removes from L2ARC buflists
 119  119   */
 120  120  
 121  121  #include <sys/spa.h>
 122  122  #include <sys/zio.h>
 123  123  #include <sys/zfs_context.h>
 124  124  #include <sys/arc.h>
 125  125  #include <sys/refcount.h>
 126  126  #include <sys/vdev.h>
 127  127  #include <sys/vdev_impl.h>
 128  128  #ifdef _KERNEL
 129  129  #include <sys/vmsystm.h>
 130  130  #include <vm/anon.h>
 131  131  #include <sys/fs/swapnode.h>
 132  132  #include <sys/dnlc.h>
 133  133  #endif
 134  134  #include <sys/callb.h>
 135  135  #include <sys/kstat.h>
 136  136  #include <zfs_fletcher.h>
 137  137  
 138  138  #ifndef _KERNEL
 139  139  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 140  140  boolean_t arc_watch = B_FALSE;
 141  141  int arc_procfd;
 142  142  #endif
 143  143  
 144  144  static kmutex_t         arc_reclaim_thr_lock;
 145  145  static kcondvar_t       arc_reclaim_thr_cv;     /* used to signal reclaim thr */
 146  146  static uint8_t          arc_thread_exit;
 147  147  
 148  148  extern int zfs_write_limit_shift;
 149  149  extern uint64_t zfs_write_limit_max;
 150  150  extern kmutex_t zfs_write_limit_lock;
 151  151  
 152  152  #define ARC_REDUCE_DNLC_PERCENT 3
 153  153  uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;
 154  154  
 155  155  typedef enum arc_reclaim_strategy {
 156  156          ARC_RECLAIM_AGGR,               /* Aggressive reclaim strategy */
 157  157          ARC_RECLAIM_CONS                /* Conservative reclaim strategy */
 158  158  } arc_reclaim_strategy_t;
 159  159  
 160  160  /* number of seconds before growing cache again */
 161  161  static int              arc_grow_retry = 60;
 162  162  
 163  163  /* shift of arc_c for calculating both min and max arc_p */
 164  164  static int              arc_p_min_shift = 4;
 165  165  
 166  166  /* log2(fraction of arc to reclaim) */
 167  167  static int              arc_shrink_shift = 5;
 168  168  
 169  169  /*
 170  170   * minimum lifespan of a prefetch block in clock ticks
 171  171   * (initialized in arc_init())
 172  172   */
 173  173  static int              arc_min_prefetch_lifespan;
 174  174  
 175  175  static int arc_dead;
 176  176  
 177  177  /*
 178  178   * The arc has filled available memory and has now warmed up.
 179  179   */
 180  180  static boolean_t arc_warm;
 181  181  
 182  182  /*
 183  183   * These tunables are for performance analysis.
 184  184   */
 185  185  uint64_t zfs_arc_max;
 186  186  uint64_t zfs_arc_min;
 187  187  uint64_t zfs_arc_meta_limit = 0;
 188  188  int zfs_arc_grow_retry = 0;
 189  189  int zfs_arc_shrink_shift = 0;
 190  190  int zfs_arc_p_min_shift = 0;
 191  191  int zfs_disable_dup_eviction = 0;
 192  192  
 193  193  /*
 194  194   * Note that buffers can be in one of 6 states:
 195  195   *      ARC_anon        - anonymous (discussed below)
 196  196   *      ARC_mru         - recently used, currently cached
 197  197   *      ARC_mru_ghost   - recentely used, no longer in cache
 198  198   *      ARC_mfu         - frequently used, currently cached
 199  199   *      ARC_mfu_ghost   - frequently used, no longer in cache
 200  200   *      ARC_l2c_only    - exists in L2ARC but not other states
 201  201   * When there are no active references to the buffer, they are
 202  202   * are linked onto a list in one of these arc states.  These are
 203  203   * the only buffers that can be evicted or deleted.  Within each
 204  204   * state there are multiple lists, one for meta-data and one for
 205  205   * non-meta-data.  Meta-data (indirect blocks, blocks of dnodes,
 206  206   * etc.) is tracked separately so that it can be managed more
 207  207   * explicitly: favored over data, limited explicitly.
 208  208   *
 209  209   * Anonymous buffers are buffers that are not associated with
 210  210   * a DVA.  These are buffers that hold dirty block copies
 211  211   * before they are written to stable storage.  By definition,
 212  212   * they are "ref'd" and are considered part of arc_mru
 213  213   * that cannot be freed.  Generally, they will aquire a DVA
 214  214   * as they are written and migrate onto the arc_mru list.
 215  215   *
 216  216   * The ARC_l2c_only state is for buffers that are in the second
 217  217   * level ARC but no longer in any of the ARC_m* lists.  The second
 218  218   * level ARC itself may also contain buffers that are in any of
 219  219   * the ARC_m* states - meaning that a buffer can exist in two
 220  220   * places.  The reason for the ARC_l2c_only state is to keep the
 221  221   * buffer header in the hash table, so that reads that hit the
 222  222   * second level ARC benefit from these fast lookups.
 223  223   */
 224  224  
 225  225  typedef struct arc_state {
 226  226          list_t  arcs_list[ARC_BUFC_NUMTYPES];   /* list of evictable buffers */
 227  227          uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
 228  228          uint64_t arcs_size;     /* total amount of data in this state */
 229  229          kmutex_t arcs_mtx;
 230  230  } arc_state_t;
 231  231  
 232  232  /* The 6 states: */
 233  233  static arc_state_t ARC_anon;
 234  234  static arc_state_t ARC_mru;
 235  235  static arc_state_t ARC_mru_ghost;
 236  236  static arc_state_t ARC_mfu;
 237  237  static arc_state_t ARC_mfu_ghost;
 238  238  static arc_state_t ARC_l2c_only;
 239  239  
 240  240  typedef struct arc_stats {
 241  241          kstat_named_t arcstat_hits;
 242  242          kstat_named_t arcstat_misses;
 243  243          kstat_named_t arcstat_demand_data_hits;
 244  244          kstat_named_t arcstat_demand_data_misses;
 245  245          kstat_named_t arcstat_demand_metadata_hits;
 246  246          kstat_named_t arcstat_demand_metadata_misses;
 247  247          kstat_named_t arcstat_prefetch_data_hits;
 248  248          kstat_named_t arcstat_prefetch_data_misses;
 249  249          kstat_named_t arcstat_prefetch_metadata_hits;
 250  250          kstat_named_t arcstat_prefetch_metadata_misses;
 251  251          kstat_named_t arcstat_mru_hits;
 252  252          kstat_named_t arcstat_mru_ghost_hits;
 253  253          kstat_named_t arcstat_mfu_hits;
 254  254          kstat_named_t arcstat_mfu_ghost_hits;
 255  255          kstat_named_t arcstat_deleted;
 256  256          kstat_named_t arcstat_recycle_miss;
 257  257          kstat_named_t arcstat_mutex_miss;
 258  258          kstat_named_t arcstat_evict_skip;
 259  259          kstat_named_t arcstat_evict_l2_cached;
 260  260          kstat_named_t arcstat_evict_l2_eligible;
 261  261          kstat_named_t arcstat_evict_l2_ineligible;
 262  262          kstat_named_t arcstat_hash_elements;
 263  263          kstat_named_t arcstat_hash_elements_max;
 264  264          kstat_named_t arcstat_hash_collisions;
 265  265          kstat_named_t arcstat_hash_chains;
 266  266          kstat_named_t arcstat_hash_chain_max;
 267  267          kstat_named_t arcstat_p;
 268  268          kstat_named_t arcstat_c;
 269  269          kstat_named_t arcstat_c_min;
 270  270          kstat_named_t arcstat_c_max;
 271  271          kstat_named_t arcstat_size;
 272  272          kstat_named_t arcstat_hdr_size;
 273  273          kstat_named_t arcstat_data_size;
 274  274          kstat_named_t arcstat_other_size;
 275  275          kstat_named_t arcstat_l2_hits;
 276  276          kstat_named_t arcstat_l2_misses;
 277  277          kstat_named_t arcstat_l2_feeds;
 278  278          kstat_named_t arcstat_l2_rw_clash;
 279  279          kstat_named_t arcstat_l2_read_bytes;
 280  280          kstat_named_t arcstat_l2_write_bytes;
 281  281          kstat_named_t arcstat_l2_writes_sent;
 282  282          kstat_named_t arcstat_l2_writes_done;
 283  283          kstat_named_t arcstat_l2_writes_error;
 284  284          kstat_named_t arcstat_l2_writes_hdr_miss;
 285  285          kstat_named_t arcstat_l2_evict_lock_retry;
 286  286          kstat_named_t arcstat_l2_evict_reading;
 287  287          kstat_named_t arcstat_l2_free_on_write;
 288  288          kstat_named_t arcstat_l2_abort_lowmem;
 289  289          kstat_named_t arcstat_l2_cksum_bad;
 290  290          kstat_named_t arcstat_l2_io_error;
 291  291          kstat_named_t arcstat_l2_size;
 292  292          kstat_named_t arcstat_l2_hdr_size;
 293  293          kstat_named_t arcstat_memory_throttle_count;
 294  294          kstat_named_t arcstat_duplicate_buffers;
 295  295          kstat_named_t arcstat_duplicate_buffers_size;
 296  296          kstat_named_t arcstat_duplicate_reads;
 297  297          kstat_named_t arcstat_meta_used;
 298  298          kstat_named_t arcstat_meta_limit;
 299  299          kstat_named_t arcstat_meta_max;
 300  300  } arc_stats_t;
 301  301  
 302  302  static arc_stats_t arc_stats = {
 303  303          { "hits",                       KSTAT_DATA_UINT64 },
 304  304          { "misses",                     KSTAT_DATA_UINT64 },
 305  305          { "demand_data_hits",           KSTAT_DATA_UINT64 },
 306  306          { "demand_data_misses",         KSTAT_DATA_UINT64 },
 307  307          { "demand_metadata_hits",       KSTAT_DATA_UINT64 },
 308  308          { "demand_metadata_misses",     KSTAT_DATA_UINT64 },
 309  309          { "prefetch_data_hits",         KSTAT_DATA_UINT64 },
 310  310          { "prefetch_data_misses",       KSTAT_DATA_UINT64 },
 311  311          { "prefetch_metadata_hits",     KSTAT_DATA_UINT64 },
 312  312          { "prefetch_metadata_misses",   KSTAT_DATA_UINT64 },
 313  313          { "mru_hits",                   KSTAT_DATA_UINT64 },
 314  314          { "mru_ghost_hits",             KSTAT_DATA_UINT64 },
 315  315          { "mfu_hits",                   KSTAT_DATA_UINT64 },
 316  316          { "mfu_ghost_hits",             KSTAT_DATA_UINT64 },
 317  317          { "deleted",                    KSTAT_DATA_UINT64 },
 318  318          { "recycle_miss",               KSTAT_DATA_UINT64 },
 319  319          { "mutex_miss",                 KSTAT_DATA_UINT64 },
 320  320          { "evict_skip",                 KSTAT_DATA_UINT64 },
 321  321          { "evict_l2_cached",            KSTAT_DATA_UINT64 },
 322  322          { "evict_l2_eligible",          KSTAT_DATA_UINT64 },
 323  323          { "evict_l2_ineligible",        KSTAT_DATA_UINT64 },
 324  324          { "hash_elements",              KSTAT_DATA_UINT64 },
 325  325          { "hash_elements_max",          KSTAT_DATA_UINT64 },
 326  326          { "hash_collisions",            KSTAT_DATA_UINT64 },
 327  327          { "hash_chains",                KSTAT_DATA_UINT64 },
 328  328          { "hash_chain_max",             KSTAT_DATA_UINT64 },
 329  329          { "p",                          KSTAT_DATA_UINT64 },
 330  330          { "c",                          KSTAT_DATA_UINT64 },
 331  331          { "c_min",                      KSTAT_DATA_UINT64 },
 332  332          { "c_max",                      KSTAT_DATA_UINT64 },
 333  333          { "size",                       KSTAT_DATA_UINT64 },
 334  334          { "hdr_size",                   KSTAT_DATA_UINT64 },
 335  335          { "data_size",                  KSTAT_DATA_UINT64 },
 336  336          { "other_size",                 KSTAT_DATA_UINT64 },
 337  337          { "l2_hits",                    KSTAT_DATA_UINT64 },
 338  338          { "l2_misses",                  KSTAT_DATA_UINT64 },
 339  339          { "l2_feeds",                   KSTAT_DATA_UINT64 },
 340  340          { "l2_rw_clash",                KSTAT_DATA_UINT64 },
 341  341          { "l2_read_bytes",              KSTAT_DATA_UINT64 },
 342  342          { "l2_write_bytes",             KSTAT_DATA_UINT64 },
 343  343          { "l2_writes_sent",             KSTAT_DATA_UINT64 },
 344  344          { "l2_writes_done",             KSTAT_DATA_UINT64 },
 345  345          { "l2_writes_error",            KSTAT_DATA_UINT64 },
 346  346          { "l2_writes_hdr_miss",         KSTAT_DATA_UINT64 },
 347  347          { "l2_evict_lock_retry",        KSTAT_DATA_UINT64 },
 348  348          { "l2_evict_reading",           KSTAT_DATA_UINT64 },
 349  349          { "l2_free_on_write",           KSTAT_DATA_UINT64 },
 350  350          { "l2_abort_lowmem",            KSTAT_DATA_UINT64 },
 351  351          { "l2_cksum_bad",               KSTAT_DATA_UINT64 },
 352  352          { "l2_io_error",                KSTAT_DATA_UINT64 },
 353  353          { "l2_size",                    KSTAT_DATA_UINT64 },
 354  354          { "l2_hdr_size",                KSTAT_DATA_UINT64 },
 355  355          { "memory_throttle_count",      KSTAT_DATA_UINT64 },
 356  356          { "duplicate_buffers",          KSTAT_DATA_UINT64 },
 357  357          { "duplicate_buffers_size",     KSTAT_DATA_UINT64 },
 358  358          { "duplicate_reads",            KSTAT_DATA_UINT64 },
 359  359          { "arc_meta_used",              KSTAT_DATA_UINT64 },
 360  360          { "arc_meta_limit",             KSTAT_DATA_UINT64 },
 361  361          { "arc_meta_max",               KSTAT_DATA_UINT64 }
 362  362  };
 363  363  
 364  364  #define ARCSTAT(stat)   (arc_stats.stat.value.ui64)
 365  365  
 366  366  #define ARCSTAT_INCR(stat, val) \
 367  367          atomic_add_64(&arc_stats.stat.value.ui64, (val));
 368  368  
 369  369  #define ARCSTAT_BUMP(stat)      ARCSTAT_INCR(stat, 1)
 370  370  #define ARCSTAT_BUMPDOWN(stat)  ARCSTAT_INCR(stat, -1)
 371  371  
 372  372  #define ARCSTAT_MAX(stat, val) {                                        \
 373  373          uint64_t m;                                                     \
 374  374          while ((val) > (m = arc_stats.stat.value.ui64) &&               \
 375  375              (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
 376  376                  continue;                                               \
 377  377  }
 378  378  
 379  379  #define ARCSTAT_MAXSTAT(stat) \
 380  380          ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
 381  381  
 382  382  /*
 383  383   * We define a macro to allow ARC hits/misses to be easily broken down by
 384  384   * two separate conditions, giving a total of four different subtypes for
 385  385   * each of hits and misses (so eight statistics total).
 386  386   */
 387  387  #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
 388  388          if (cond1) {                                                    \
 389  389                  if (cond2) {                                            \
 390  390                          ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
 391  391                  } else {                                                \
 392  392                          ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
 393  393                  }                                                       \
 394  394          } else {                                                        \
 395  395                  if (cond2) {                                            \
 396  396                          ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
 397  397                  } else {                                                \
 398  398                          ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
 399  399                  }                                                       \
 400  400          }
 401  401  
 402  402  kstat_t                 *arc_ksp;
 403  403  static arc_state_t      *arc_anon;
 404  404  static arc_state_t      *arc_mru;
 405  405  static arc_state_t      *arc_mru_ghost;
 406  406  static arc_state_t      *arc_mfu;
 407  407  static arc_state_t      *arc_mfu_ghost;
 408  408  static arc_state_t      *arc_l2c_only;
 409  409  
 410  410  /*
 411  411   * There are several ARC variables that are critical to export as kstats --
 412  412   * but we don't want to have to grovel around in the kstat whenever we wish to
 413  413   * manipulate them.  For these variables, we therefore define them to be in
 414  414   * terms of the statistic variable.  This assures that we are not introducing
 415  415   * the possibility of inconsistency by having shadow copies of the variables,
 416  416   * while still allowing the code to be readable.
 417  417   */
 418  418  #define arc_size        ARCSTAT(arcstat_size)   /* actual total arc size */
 419  419  #define arc_p           ARCSTAT(arcstat_p)      /* target size of MRU */
 420  420  #define arc_c           ARCSTAT(arcstat_c)      /* target size of cache */
 421  421  #define arc_c_min       ARCSTAT(arcstat_c_min)  /* min target cache size */
 422  422  #define arc_c_max       ARCSTAT(arcstat_c_max)  /* max target cache size */
 423  423  #define arc_meta_limit  ARCSTAT(arcstat_meta_limit) /* max size for metadata */
 424  424  #define arc_meta_used   ARCSTAT(arcstat_meta_used) /* size of metadata */
 425  425  #define arc_meta_max    ARCSTAT(arcstat_meta_max) /* max size of metadata */
 426  426  
 427  427  static int              arc_no_grow;    /* Don't try to grow cache size */
 428  428  static uint64_t         arc_tempreserve;
 429  429  static uint64_t         arc_loaned_bytes;
 430  430  
 431  431  typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;
 432  432  
 433  433  typedef struct arc_callback arc_callback_t;
 434  434  
 435  435  struct arc_callback {
 436  436          void                    *acb_private;
 437  437          arc_done_func_t         *acb_done;
 438  438          arc_buf_t               *acb_buf;
 439  439          zio_t                   *acb_zio_dummy;
 440  440          arc_callback_t          *acb_next;
 441  441  };
 442  442  
 443  443  typedef struct arc_write_callback arc_write_callback_t;
 444  444  
 445  445  struct arc_write_callback {
 446  446          void            *awcb_private;
 447  447          arc_done_func_t *awcb_ready;
 448  448          arc_done_func_t *awcb_done;
 449  449          arc_buf_t       *awcb_buf;
 450  450  };
 451  451  
 452  452  struct arc_buf_hdr {
 453  453          /* protected by hash lock */
 454  454          dva_t                   b_dva;
 455  455          uint64_t                b_birth;
 456  456          uint64_t                b_cksum0;
 457  457  
 458  458          kmutex_t                b_freeze_lock;
 459  459          zio_cksum_t             *b_freeze_cksum;
 460  460          void                    *b_thawed;
 461  461  
 462  462          arc_buf_hdr_t           *b_hash_next;
 463  463          arc_buf_t               *b_buf;
 464  464          uint32_t                b_flags;
 465  465          uint32_t                b_datacnt;
 466  466  
 467  467          arc_callback_t          *b_acb;
 468  468          kcondvar_t              b_cv;
 469  469  
 470  470          /* immutable */
 471  471          arc_buf_contents_t      b_type;
 472  472          uint64_t                b_size;
 473  473          uint64_t                b_spa;
 474  474  
 475  475          /* protected by arc state mutex */
 476  476          arc_state_t             *b_state;
 477  477          list_node_t             b_arc_node;
 478  478  
 479  479          /* updated atomically */
 480  480          clock_t                 b_arc_access;
 481  481  
 482  482          /* self protecting */
 483  483          refcount_t              b_refcnt;
 484  484  
 485  485          l2arc_buf_hdr_t         *b_l2hdr;
 486  486          list_node_t             b_l2node;
 487  487  };
 488  488  
 489  489  static arc_buf_t *arc_eviction_list;
 490  490  static kmutex_t arc_eviction_mtx;
 491  491  static arc_buf_hdr_t arc_eviction_hdr;
 492  492  static void arc_get_data_buf(arc_buf_t *buf);
 493  493  static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
 494  494  static int arc_evict_needed(arc_buf_contents_t type);
 495  495  static void arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes);
 496  496  static void arc_buf_watch(arc_buf_t *buf);
 497  497  
 498  498  static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab);
 499  499  
 500  500  #define GHOST_STATE(state)      \
 501  501          ((state) == arc_mru_ghost || (state) == arc_mfu_ghost ||        \
 502  502          (state) == arc_l2c_only)
 503  503  
 504  504  /*
 505  505   * Private ARC flags.  These flags are private ARC only flags that will show up
 506  506   * in b_flags in the arc_hdr_buf_t.  Some flags are publicly declared, and can
 507  507   * be passed in as arc_flags in things like arc_read.  However, these flags
 508  508   * should never be passed and should only be set by ARC code.  When adding new
 509  509   * public flags, make sure not to smash the private ones.
 510  510   */
 511  511  
 512  512  #define ARC_IN_HASH_TABLE       (1 << 9)        /* this buffer is hashed */
 513  513  #define ARC_IO_IN_PROGRESS      (1 << 10)       /* I/O in progress for buf */
 514  514  #define ARC_IO_ERROR            (1 << 11)       /* I/O failed for buf */
 515  515  #define ARC_FREED_IN_READ       (1 << 12)       /* buf freed while in read */
 516  516  #define ARC_BUF_AVAILABLE       (1 << 13)       /* block not in active use */
 517  517  #define ARC_INDIRECT            (1 << 14)       /* this is an indirect block */
 518  518  #define ARC_FREE_IN_PROGRESS    (1 << 15)       /* hdr about to be freed */
 519  519  #define ARC_L2_WRITING          (1 << 16)       /* L2ARC write in progress */
 520  520  #define ARC_L2_EVICTED          (1 << 17)       /* evicted during I/O */
 521  521  #define ARC_L2_WRITE_HEAD       (1 << 18)       /* head of write list */
 522  522  
 523  523  #define HDR_IN_HASH_TABLE(hdr)  ((hdr)->b_flags & ARC_IN_HASH_TABLE)
 524  524  #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
 525  525  #define HDR_IO_ERROR(hdr)       ((hdr)->b_flags & ARC_IO_ERROR)
 526  526  #define HDR_PREFETCH(hdr)       ((hdr)->b_flags & ARC_PREFETCH)
 527  527  #define HDR_FREED_IN_READ(hdr)  ((hdr)->b_flags & ARC_FREED_IN_READ)
 528  528  #define HDR_BUF_AVAILABLE(hdr)  ((hdr)->b_flags & ARC_BUF_AVAILABLE)
 529  529  #define HDR_FREE_IN_PROGRESS(hdr)       ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
 530  530  #define HDR_L2CACHE(hdr)        ((hdr)->b_flags & ARC_L2CACHE)
 531  531  #define HDR_L2_READING(hdr)     ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
 532  532                                      (hdr)->b_l2hdr != NULL)
 533  533  #define HDR_L2_WRITING(hdr)     ((hdr)->b_flags & ARC_L2_WRITING)
 534  534  #define HDR_L2_EVICTED(hdr)     ((hdr)->b_flags & ARC_L2_EVICTED)
 535  535  #define HDR_L2_WRITE_HEAD(hdr)  ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
 536  536  
 537  537  /*
  
 538  538   * Other sizes
 539  539   */
 540  540  
 541  541  #define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
 542  542  #define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
 543  543  
 544  544  /*
 545  545   * Hash table routines
 546  546   */
 547  547  
 548      -#define HT_LOCK_PAD     64
 549      -
 550      -struct ht_lock {
 551      -        kmutex_t        ht_lock;
 552      -#ifdef _KERNEL
 553      -        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
 554      -#endif
      548 +struct ht_table {
      549 +        arc_buf_hdr_t   *hdr;
      550 +        kmutex_t        lock;
 555  551  };
 556  552  
 557      -#define BUF_LOCKS 256
 558  553  typedef struct buf_hash_table {
 559  554          uint64_t ht_mask;
 560      -        arc_buf_hdr_t **ht_table;
 561      -        struct ht_lock ht_locks[BUF_LOCKS];
      555 +        struct ht_table *ht_table;
 562  556  } buf_hash_table_t;
 563  557  
 564  558  static buf_hash_table_t buf_hash_table;
 565  559  
 566  560  #define BUF_HASH_INDEX(spa, dva, birth) \
 567  561          (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
 568      -#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
 569      -#define BUF_HASH_LOCK(idx)      (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
      562 +#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)
 570  563  #define HDR_LOCK(hdr) \
 571  564          (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
 572  565  
 573  566  uint64_t zfs_crc64_table[256];
 574  567  
 575  568  /*
 576  569   * Level 2 ARC
 577  570   */
 578  571  
 579  572  #define L2ARC_WRITE_SIZE        (8 * 1024 * 1024)       /* initial write max */
 580  573  #define L2ARC_HEADROOM          2               /* num of writes */
 581  574  #define L2ARC_FEED_SECS         1               /* caching interval secs */
 582  575  #define L2ARC_FEED_MIN_MS       200             /* min caching interval ms */
 583  576  
 584  577  #define l2arc_writes_sent       ARCSTAT(arcstat_l2_writes_sent)
 585  578  #define l2arc_writes_done       ARCSTAT(arcstat_l2_writes_done)
 586  579  
 587  580  /*
 588  581   * L2ARC Performance Tunables
 589  582   */
 590  583  uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;    /* default max write size */
 591  584  uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;  /* extra write during warmup */
 592  585  uint64_t l2arc_headroom = L2ARC_HEADROOM;       /* number of dev writes */
 593  586  uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;     /* interval seconds */
 594  587  uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
 595  588  boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetch bufs */
 596  589  boolean_t l2arc_feed_again = B_TRUE;            /* turbo warmup */
 597  590  boolean_t l2arc_norw = B_TRUE;                  /* no reads during writes */
 598  591  
 599  592  /*
 600  593   * L2ARC Internals
 601  594   */
 602  595  typedef struct l2arc_dev {
 603  596          vdev_t                  *l2ad_vdev;     /* vdev */
 604  597          spa_t                   *l2ad_spa;      /* spa */
 605  598          uint64_t                l2ad_hand;      /* next write location */
 606  599          uint64_t                l2ad_write;     /* desired write size, bytes */
 607  600          uint64_t                l2ad_boost;     /* warmup write boost, bytes */
 608  601          uint64_t                l2ad_start;     /* first addr on device */
 609  602          uint64_t                l2ad_end;       /* last addr on device */
 610  603          uint64_t                l2ad_evict;     /* last addr eviction reached */
 611  604          boolean_t               l2ad_first;     /* first sweep through */
 612  605          boolean_t               l2ad_writing;   /* currently writing */
 613  606          list_t                  *l2ad_buflist;  /* buffer list */
 614  607          list_node_t             l2ad_node;      /* device list node */
 615  608  } l2arc_dev_t;
 616  609  
 617  610  static list_t L2ARC_dev_list;                   /* device list */
 618  611  static list_t *l2arc_dev_list;                  /* device list pointer */
 619  612  static kmutex_t l2arc_dev_mtx;                  /* device list mutex */
 620  613  static l2arc_dev_t *l2arc_dev_last;             /* last device used */
 621  614  static kmutex_t l2arc_buflist_mtx;              /* mutex for all buflists */
 622  615  static list_t L2ARC_free_on_write;              /* free after write buf list */
 623  616  static list_t *l2arc_free_on_write;             /* free after write list ptr */
 624  617  static kmutex_t l2arc_free_on_write_mtx;        /* mutex for list */
 625  618  static uint64_t l2arc_ndev;                     /* number of devices */
 626  619  
 627  620  typedef struct l2arc_read_callback {
 628  621          arc_buf_t       *l2rcb_buf;             /* read buffer */
 629  622          spa_t           *l2rcb_spa;             /* spa */
 630  623          blkptr_t        l2rcb_bp;               /* original blkptr */
 631  624          zbookmark_t     l2rcb_zb;               /* original bookmark */
 632  625          int             l2rcb_flags;            /* original flags */
 633  626  } l2arc_read_callback_t;
 634  627  
 635  628  typedef struct l2arc_write_callback {
 636  629          l2arc_dev_t     *l2wcb_dev;             /* device info */
 637  630          arc_buf_hdr_t   *l2wcb_head;            /* head of write buflist */
 638  631  } l2arc_write_callback_t;
 639  632  
 640  633  struct l2arc_buf_hdr {
 641  634          /* protected by arc_buf_hdr  mutex */
 642  635          l2arc_dev_t     *b_dev;                 /* L2ARC device */
 643  636          uint64_t        b_daddr;                /* disk address, offset byte */
 644  637  };
 645  638  
 646  639  typedef struct l2arc_data_free {
 647  640          /* protected by l2arc_free_on_write_mtx */
 648  641          void            *l2df_data;
 649  642          size_t          l2df_size;
 650  643          void            (*l2df_func)(void *, size_t);
 651  644          list_node_t     l2df_list_node;
 652  645  } l2arc_data_free_t;
 653  646  
 654  647  static kmutex_t l2arc_feed_thr_lock;
 655  648  static kcondvar_t l2arc_feed_thr_cv;
 656  649  static uint8_t l2arc_thread_exit;
 657  650  
 658  651  static void l2arc_read_done(zio_t *zio);
 659  652  static void l2arc_hdr_stat_add(void);
 660  653  static void l2arc_hdr_stat_remove(void);
 661  654  
 662  655  static uint64_t
 663  656  buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
 664  657  {
 665  658          uint8_t *vdva = (uint8_t *)dva;
 666  659          uint64_t crc = -1ULL;
 667  660          int i;
 668  661  
 669  662          ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
 670  663  
 671  664          for (i = 0; i < sizeof (dva_t); i++)
 672  665                  crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
 673  666  
 674  667          crc ^= (spa>>8) ^ birth;
 675  668  
 676  669          return (crc);
 677  670  }
 678  671  
 679  672  #define BUF_EMPTY(buf)                                          \
 680  673          ((buf)->b_dva.dva_word[0] == 0 &&                       \
 681  674          (buf)->b_dva.dva_word[1] == 0 &&                        \
 682  675          (buf)->b_birth == 0)
 683  676  
 684  677  #define BUF_EQUAL(spa, dva, birth, buf)                         \
 685  678          ((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&     \
 686  679          ((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&     \
 687  680          ((buf)->b_birth == birth) && ((buf)->b_spa == spa)
 688  681  
 689  682  static void
 690  683  buf_discard_identity(arc_buf_hdr_t *hdr)
 691  684  {
 692  685          hdr->b_dva.dva_word[0] = 0;
 693  686          hdr->b_dva.dva_word[1] = 0;
 694  687          hdr->b_birth = 0;
 695  688          hdr->b_cksum0 = 0;
  
 696  689  }
 697  690  
 698  691  static arc_buf_hdr_t *
 699  692  buf_hash_find(uint64_t spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
 700  693  {
 701  694          uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
 702  695          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
 703  696          arc_buf_hdr_t *buf;
 704  697  
 705  698          mutex_enter(hash_lock);
 706      -        for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
      699 +        for (buf = buf_hash_table.ht_table[idx].hdr; buf != NULL;
 707  700              buf = buf->b_hash_next) {
 708  701                  if (BUF_EQUAL(spa, dva, birth, buf)) {
 709  702                          *lockp = hash_lock;
 710  703                          return (buf);
 711  704                  }
 712  705          }
 713  706          mutex_exit(hash_lock);
 714  707          *lockp = NULL;
 715  708          return (NULL);
 716  709  }
 717  710  
 718  711  /*
 719  712   * Insert an entry into the hash table.  If there is already an element
 720  713   * equal to elem in the hash table, then the already existing element
 721  714   * will be returned and the new element will not be inserted.
 722  715   * Otherwise returns NULL.
 723  716   */
 724  717  static arc_buf_hdr_t *
  
 725  718  buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
 726  719  {
 727  720          uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
 728  721          kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
 729  722          arc_buf_hdr_t *fbuf;
 730  723          uint32_t i;
 731  724  
 732  725          ASSERT(!HDR_IN_HASH_TABLE(buf));
 733  726          *lockp = hash_lock;
 734  727          mutex_enter(hash_lock);
 735      -        for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
      728 +        for (fbuf = buf_hash_table.ht_table[idx].hdr, i = 0; fbuf != NULL;
 736  729              fbuf = fbuf->b_hash_next, i++) {
 737  730                  if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
 738  731                          return (fbuf);
 739  732          }
 740  733  
 741      -        buf->b_hash_next = buf_hash_table.ht_table[idx];
 742      -        buf_hash_table.ht_table[idx] = buf;
      734 +        buf->b_hash_next = buf_hash_table.ht_table[idx].hdr;
      735 +        buf_hash_table.ht_table[idx].hdr = buf;
 743  736          buf->b_flags |= ARC_IN_HASH_TABLE;
 744  737  
 745  738          /* collect some hash table performance data */
 746  739          if (i > 0) {
 747  740                  ARCSTAT_BUMP(arcstat_hash_collisions);
 748  741                  if (i == 1)
 749  742                          ARCSTAT_BUMP(arcstat_hash_chains);
 750  743  
 751  744                  ARCSTAT_MAX(arcstat_hash_chain_max, i);
 752  745          }
 753  746  
 754  747          ARCSTAT_BUMP(arcstat_hash_elements);
 755  748          ARCSTAT_MAXSTAT(arcstat_hash_elements);
 756  749  
 757  750          return (NULL);
 758  751  }
  
 759  752  
 760  753  static void
 761  754  buf_hash_remove(arc_buf_hdr_t *buf)
 762  755  {
 763  756          arc_buf_hdr_t *fbuf, **bufp;
 764  757          uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
 765  758  
 766  759          ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
 767  760          ASSERT(HDR_IN_HASH_TABLE(buf));
 768  761  
 769      -        bufp = &buf_hash_table.ht_table[idx];
      762 +        bufp = &buf_hash_table.ht_table[idx].hdr;
 770  763          while ((fbuf = *bufp) != buf) {
 771  764                  ASSERT(fbuf != NULL);
 772  765                  bufp = &fbuf->b_hash_next;
 773  766          }
 774  767          *bufp = buf->b_hash_next;
 775  768          buf->b_hash_next = NULL;
 776  769          buf->b_flags &= ~ARC_IN_HASH_TABLE;
 777  770  
 778  771          /* collect some hash table performance data */
 779  772          ARCSTAT_BUMPDOWN(arcstat_hash_elements);
 780  773  
 781      -        if (buf_hash_table.ht_table[idx] &&
 782      -            buf_hash_table.ht_table[idx]->b_hash_next == NULL)
      774 +        if (buf_hash_table.ht_table[idx].hdr &&
      775 +            buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
 783  776                  ARCSTAT_BUMPDOWN(arcstat_hash_chains);
 784  777  }
 785  778  
 786  779  /*
 787  780   * Global data structures and functions for the buf kmem cache.
 788  781   */
 789  782  static kmem_cache_t *hdr_cache;
 790  783  static kmem_cache_t *buf_cache;
 791  784  
 792  785  static void
 793  786  buf_fini(void)
 794  787  {
 795  788          int i;
 796  789  
      790 +        for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
      791 +                mutex_destroy(&buf_hash_table.ht_table[i].lock);
 797  792          kmem_free(buf_hash_table.ht_table,
 798      -            (buf_hash_table.ht_mask + 1) * sizeof (void *));
 799      -        for (i = 0; i < BUF_LOCKS; i++)
 800      -                mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
      793 +            (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));
 801  794          kmem_cache_destroy(hdr_cache);
 802  795          kmem_cache_destroy(buf_cache);
 803  796  }
 804  797  
 805  798  /*
 806  799   * Constructor callback - called when the cache is empty
 807  800   * and a new buf is requested.
 808  801   */
 809  802  /* ARGSUSED */
 810  803  static int
 811  804  hdr_cons(void *vbuf, void *unused, int kmflag)
 812  805  {
 813  806          arc_buf_hdr_t *buf = vbuf;
 814  807  
 815  808          bzero(buf, sizeof (arc_buf_hdr_t));
 816  809          refcount_create(&buf->b_refcnt);
 817  810          cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
 818  811          mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
 819  812          arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
 820  813  
 821  814          return (0);
 822  815  }
 823  816  
 824  817  /* ARGSUSED */
 825  818  static int
 826  819  buf_cons(void *vbuf, void *unused, int kmflag)
 827  820  {
 828  821          arc_buf_t *buf = vbuf;
 829  822  
 830  823          bzero(buf, sizeof (arc_buf_t));
 831  824          mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
 832  825          arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
 833  826  
 834  827          return (0);
 835  828  }
 836  829  
 837  830  /*
 838  831   * Destructor callback - called when a cached buf is
 839  832   * no longer required.
 840  833   */
 841  834  /* ARGSUSED */
 842  835  static void
 843  836  hdr_dest(void *vbuf, void *unused)
 844  837  {
 845  838          arc_buf_hdr_t *buf = vbuf;
 846  839  
 847  840          ASSERT(BUF_EMPTY(buf));
 848  841          refcount_destroy(&buf->b_refcnt);
 849  842          cv_destroy(&buf->b_cv);
 850  843          mutex_destroy(&buf->b_freeze_lock);
 851  844          arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
 852  845  }
 853  846  
 854  847  /* ARGSUSED */
 855  848  static void
 856  849  buf_dest(void *vbuf, void *unused)
 857  850  {
 858  851          arc_buf_t *buf = vbuf;
 859  852  
 860  853          mutex_destroy(&buf->b_evict_lock);
 861  854          arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
 862  855  }
 863  856  
 864  857  /*
 865  858   * Reclaim callback -- invoked when memory is low.
 866  859   */
 867  860  /* ARGSUSED */
 868  861  static void
 869  862  hdr_recl(void *unused)
 870  863  {
 871  864          dprintf("hdr_recl called\n");
 872  865          /*
 873  866           * umem calls the reclaim func when we destroy the buf cache,
 874  867           * which is after we do arc_fini().
 875  868           */
 876  869          if (!arc_dead)
 877  870                  cv_signal(&arc_reclaim_thr_cv);
 878  871  }
 879  872  
 880  873  static void
 881  874  buf_init(void)
 882  875  {
 883  876          uint64_t *ct;
 884  877          uint64_t hsize = 1ULL << 12;
 885  878          int i, j;
 886  879  
  
 887  880          /*
 888  881           * The hash table is big enough to fill all of physical memory
 889  882           * with an average 64K block size.  The table will take up
 890  883           * totalmem*sizeof(void*)/64K (eg. 128KB/GB with 8-byte pointers).
 891  884           */
 892  885          while (hsize * 65536 < physmem * PAGESIZE)
 893  886                  hsize <<= 1;
 894  887  retry:
 895  888          buf_hash_table.ht_mask = hsize - 1;
 896  889          buf_hash_table.ht_table =
 897      -            kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
      890 +            kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
 898  891          if (buf_hash_table.ht_table == NULL) {
 899  892                  ASSERT(hsize > (1ULL << 8));
 900  893                  hsize >>= 1;
 901  894                  goto retry;
 902  895          }
 903  896  
 904  897          hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
 905  898              0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
 906  899          buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
 907  900              0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
 908  901  
 909  902          for (i = 0; i < 256; i++)
 910  903                  for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
 911  904                          *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
 912  905  
 913      -        for (i = 0; i < BUF_LOCKS; i++) {
 914      -                mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
      906 +        for (i = 0; i < hsize; i++) {
      907 +                mutex_init(&buf_hash_table.ht_table[i].lock,
 915  908                      NULL, MUTEX_DEFAULT, NULL);
 916  909          }
 917  910  }
 918  911  
 919  912  #define ARC_MINTIME     (hz>>4) /* 62 ms */
 920  913  
 921  914  static void
 922  915  arc_cksum_verify(arc_buf_t *buf)
 923  916  {
 924  917          zio_cksum_t zc;
 925  918  
 926  919          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
 927  920                  return;
 928  921  
 929  922          mutex_enter(&buf->b_hdr->b_freeze_lock);
 930  923          if (buf->b_hdr->b_freeze_cksum == NULL ||
 931  924              (buf->b_hdr->b_flags & ARC_IO_ERROR)) {
 932  925                  mutex_exit(&buf->b_hdr->b_freeze_lock);
 933  926                  return;
 934  927          }
 935  928          fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
 936  929          if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
 937  930                  panic("buffer modified while frozen!");
 938  931          mutex_exit(&buf->b_hdr->b_freeze_lock);
 939  932  }
 940  933  
 941  934  static int
 942  935  arc_cksum_equal(arc_buf_t *buf)
 943  936  {
 944  937          zio_cksum_t zc;
 945  938          int equal;
 946  939  
 947  940          mutex_enter(&buf->b_hdr->b_freeze_lock);
 948  941          fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
 949  942          equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
 950  943          mutex_exit(&buf->b_hdr->b_freeze_lock);
 951  944  
 952  945          return (equal);
 953  946  }
 954  947  
 955  948  static void
 956  949  arc_cksum_compute(arc_buf_t *buf, boolean_t force)
 957  950  {
 958  951          if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
 959  952                  return;
 960  953  
 961  954          mutex_enter(&buf->b_hdr->b_freeze_lock);
 962  955          if (buf->b_hdr->b_freeze_cksum != NULL) {
 963  956                  mutex_exit(&buf->b_hdr->b_freeze_lock);
 964  957                  return;
 965  958          }
 966  959          buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
 967  960          fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
 968  961              buf->b_hdr->b_freeze_cksum);
 969  962          mutex_exit(&buf->b_hdr->b_freeze_lock);
 970  963          arc_buf_watch(buf);
 971  964  }
 972  965  
 973  966  #ifndef _KERNEL
 974  967  typedef struct procctl {
 975  968          long cmd;
 976  969          prwatch_t prwatch;
 977  970  } procctl_t;
 978  971  #endif
 979  972  
 980  973  /* ARGSUSED */
 981  974  static void
 982  975  arc_buf_unwatch(arc_buf_t *buf)
 983  976  {
 984  977  #ifndef _KERNEL
 985  978          if (arc_watch) {
 986  979                  int result;
 987  980                  procctl_t ctl;
 988  981                  ctl.cmd = PCWATCH;
 989  982                  ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
 990  983                  ctl.prwatch.pr_size = 0;
 991  984                  ctl.prwatch.pr_wflags = 0;
 992  985                  result = write(arc_procfd, &ctl, sizeof (ctl));
 993  986                  ASSERT3U(result, ==, sizeof (ctl));
 994  987          }
 995  988  #endif
 996  989  }
 997  990  
 998  991  /* ARGSUSED */
 999  992  static void
1000  993  arc_buf_watch(arc_buf_t *buf)
1001  994  {
1002  995  #ifndef _KERNEL
1003  996          if (arc_watch) {
1004  997                  int result;
1005  998                  procctl_t ctl;
1006  999                  ctl.cmd = PCWATCH;
1007 1000                  ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1008 1001                  ctl.prwatch.pr_size = buf->b_hdr->b_size;
1009 1002                  ctl.prwatch.pr_wflags = WA_WRITE;
1010 1003                  result = write(arc_procfd, &ctl, sizeof (ctl));
1011 1004                  ASSERT3U(result, ==, sizeof (ctl));
1012 1005          }
1013 1006  #endif
1014 1007  }
1015 1008  
1016 1009  void
1017 1010  arc_buf_thaw(arc_buf_t *buf)
1018 1011  {
1019 1012          if (zfs_flags & ZFS_DEBUG_MODIFY) {
1020 1013                  if (buf->b_hdr->b_state != arc_anon)
1021 1014                          panic("modifying non-anon buffer!");
1022 1015                  if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
1023 1016                          panic("modifying buffer while i/o in progress!");
1024 1017                  arc_cksum_verify(buf);
1025 1018          }
1026 1019  
1027 1020          mutex_enter(&buf->b_hdr->b_freeze_lock);
1028 1021          if (buf->b_hdr->b_freeze_cksum != NULL) {
1029 1022                  kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1030 1023                  buf->b_hdr->b_freeze_cksum = NULL;
1031 1024          }
1032 1025  
1033 1026          if (zfs_flags & ZFS_DEBUG_MODIFY) {
1034 1027                  if (buf->b_hdr->b_thawed)
1035 1028                          kmem_free(buf->b_hdr->b_thawed, 1);
1036 1029                  buf->b_hdr->b_thawed = kmem_alloc(1, KM_SLEEP);
1037 1030          }
1038 1031  
1039 1032          mutex_exit(&buf->b_hdr->b_freeze_lock);
1040 1033  
1041 1034          arc_buf_unwatch(buf);
1042 1035  }
1043 1036  
1044 1037  void
1045 1038  arc_buf_freeze(arc_buf_t *buf)
1046 1039  {
1047 1040          kmutex_t *hash_lock;
1048 1041  
1049 1042          if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1050 1043                  return;
1051 1044  
1052 1045          hash_lock = HDR_LOCK(buf->b_hdr);
1053 1046          mutex_enter(hash_lock);
1054 1047  
1055 1048          ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
1056 1049              buf->b_hdr->b_state == arc_anon);
1057 1050          arc_cksum_compute(buf, B_FALSE);
1058 1051          mutex_exit(hash_lock);
1059 1052  
1060 1053  }
1061 1054  
1062 1055  static void
1063 1056  add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
1064 1057  {
1065 1058          ASSERT(MUTEX_HELD(hash_lock));
1066 1059  
1067 1060          if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
1068 1061              (ab->b_state != arc_anon)) {
1069 1062                  uint64_t delta = ab->b_size * ab->b_datacnt;
1070 1063                  list_t *list = &ab->b_state->arcs_list[ab->b_type];
1071 1064                  uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
1072 1065  
1073 1066                  ASSERT(!MUTEX_HELD(&ab->b_state->arcs_mtx));
1074 1067                  mutex_enter(&ab->b_state->arcs_mtx);
1075 1068                  ASSERT(list_link_active(&ab->b_arc_node));
1076 1069                  list_remove(list, ab);
1077 1070                  if (GHOST_STATE(ab->b_state)) {
1078 1071                          ASSERT0(ab->b_datacnt);
1079 1072                          ASSERT3P(ab->b_buf, ==, NULL);
1080 1073                          delta = ab->b_size;
1081 1074                  }
1082 1075                  ASSERT(delta > 0);
1083 1076                  ASSERT3U(*size, >=, delta);
1084 1077                  atomic_add_64(size, -delta);
1085 1078                  mutex_exit(&ab->b_state->arcs_mtx);
1086 1079                  /* remove the prefetch flag if we get a reference */
1087 1080                  if (ab->b_flags & ARC_PREFETCH)
1088 1081                          ab->b_flags &= ~ARC_PREFETCH;
1089 1082          }
1090 1083  }
1091 1084  
1092 1085  static int
1093 1086  remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
1094 1087  {
1095 1088          int cnt;
1096 1089          arc_state_t *state = ab->b_state;
1097 1090  
1098 1091          ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
1099 1092          ASSERT(!GHOST_STATE(state));
1100 1093  
1101 1094          if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
1102 1095              (state != arc_anon)) {
1103 1096                  uint64_t *size = &state->arcs_lsize[ab->b_type];
1104 1097  
1105 1098                  ASSERT(!MUTEX_HELD(&state->arcs_mtx));
1106 1099                  mutex_enter(&state->arcs_mtx);
1107 1100                  ASSERT(!list_link_active(&ab->b_arc_node));
1108 1101                  list_insert_head(&state->arcs_list[ab->b_type], ab);
1109 1102                  ASSERT(ab->b_datacnt > 0);
1110 1103                  atomic_add_64(size, ab->b_size * ab->b_datacnt);
1111 1104                  mutex_exit(&state->arcs_mtx);
1112 1105          }
1113 1106          return (cnt);
1114 1107  }
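
           /*
            * Reviewer note (not part of the patch): taking the first hold on a
            * non-anonymous header pulls it off its state's evictable list and
            * subtracts b_size * b_datacnt (just b_size for ghost headers) from
            * arcs_lsize[]; dropping the last hold re-inserts it at the head of
            * that list and adds the bytes back, making it evictable again.
            */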
1115 1108  
1116 1109  /*
1117 1110   * Move the supplied buffer to the indicated state.  The mutex
1118 1111   * for the buffer must be held by the caller.
1119 1112   */
1120 1113  static void
1121 1114  arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
1122 1115  {
1123 1116          arc_state_t *old_state = ab->b_state;
1124 1117          int64_t refcnt = refcount_count(&ab->b_refcnt);
1125 1118          uint64_t from_delta, to_delta;
1126 1119  
1127 1120          ASSERT(MUTEX_HELD(hash_lock));
1128 1121          ASSERT(new_state != old_state);
1129 1122          ASSERT(refcnt == 0 || ab->b_datacnt > 0);
1130 1123          ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
1131 1124          ASSERT(ab->b_datacnt <= 1 || old_state != arc_anon);
1132 1125  
1133 1126          from_delta = to_delta = ab->b_datacnt * ab->b_size;
1134 1127  
1135 1128          /*
1136 1129           * If this buffer is evictable, transfer it from the
1137 1130           * old state list to the new state list.
1138 1131           */
1139 1132          if (refcnt == 0) {
1140 1133                  if (old_state != arc_anon) {
1141 1134                          int use_mutex = !MUTEX_HELD(&old_state->arcs_mtx);
1142 1135                          uint64_t *size = &old_state->arcs_lsize[ab->b_type];
1143 1136  
1144 1137                          if (use_mutex)
1145 1138                                  mutex_enter(&old_state->arcs_mtx);
1146 1139  
1147 1140                          ASSERT(list_link_active(&ab->b_arc_node));
1148 1141                          list_remove(&old_state->arcs_list[ab->b_type], ab);
1149 1142  
1150 1143                          /*
1151 1144                           * If prefetching out of the ghost cache,
1152 1145                           * we will have a non-zero datacnt.
1153 1146                           */
1154 1147                          if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
1155 1148                                  /* ghost elements have a ghost size */
1156 1149                                  ASSERT(ab->b_buf == NULL);
1157 1150                                  from_delta = ab->b_size;
1158 1151                          }
1159 1152                          ASSERT3U(*size, >=, from_delta);
1160 1153                          atomic_add_64(size, -from_delta);
1161 1154  
1162 1155                          if (use_mutex)
1163 1156                                  mutex_exit(&old_state->arcs_mtx);
1164 1157                  }
1165 1158                  if (new_state != arc_anon) {
1166 1159                          int use_mutex = !MUTEX_HELD(&new_state->arcs_mtx);
1167 1160                          uint64_t *size = &new_state->arcs_lsize[ab->b_type];
1168 1161  
1169 1162                          if (use_mutex)
1170 1163                                  mutex_enter(&new_state->arcs_mtx);
1171 1164  
1172 1165                          list_insert_head(&new_state->arcs_list[ab->b_type], ab);
1173 1166  
1174 1167                          /* ghost elements have a ghost size */
1175 1168                          if (GHOST_STATE(new_state)) {
1176 1169                                  ASSERT(ab->b_datacnt == 0);
1177 1170                                  ASSERT(ab->b_buf == NULL);
1178 1171                                  to_delta = ab->b_size;
1179 1172                          }
1180 1173                          atomic_add_64(size, to_delta);
1181 1174  
1182 1175                          if (use_mutex)
1183 1176                                  mutex_exit(&new_state->arcs_mtx);
1184 1177                  }
1185 1178          }
1186 1179  
1187 1180          ASSERT(!BUF_EMPTY(ab));
1188 1181          if (new_state == arc_anon && HDR_IN_HASH_TABLE(ab))
1189 1182                  buf_hash_remove(ab);
1190 1183  
1191 1184          /* adjust state sizes */
1192 1185          if (to_delta)
1193 1186                  atomic_add_64(&new_state->arcs_size, to_delta);
1194 1187          if (from_delta) {
1195 1188                  ASSERT3U(old_state->arcs_size, >=, from_delta);
1196 1189                  atomic_add_64(&old_state->arcs_size, -from_delta);
1197 1190          }
1198 1191          ab->b_state = new_state;
1199 1192  
1200 1193          /* adjust l2arc hdr stats */
1201 1194          if (new_state == arc_l2c_only)
1202 1195                  l2arc_hdr_stat_add();
1203 1196          else if (old_state == arc_l2c_only)
1204 1197                  l2arc_hdr_stat_remove();
1205 1198  }
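
           /*
            * Reviewer example (not part of the patch): for a 128K header the
            * deltas start at b_datacnt * 128K.  A header entering a ghost
            * state has b_datacnt == 0 and b_buf == NULL, so to_delta falls
            * back to the 128K "ghost size"; the same ghost size is used for
            * from_delta when it later leaves that ghost state.  The ghost
            * lists therefore account for bytes described, not bytes resident.
            */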
1206 1199  
1207 1200  void
1208 1201  arc_space_consume(uint64_t space, arc_space_type_t type)
1209 1202  {
1210 1203          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
1211 1204  
1212 1205          switch (type) {
1213 1206          case ARC_SPACE_DATA:
1214 1207                  ARCSTAT_INCR(arcstat_data_size, space);
1215 1208                  break;
1216 1209          case ARC_SPACE_OTHER:
1217 1210                  ARCSTAT_INCR(arcstat_other_size, space);
1218 1211                  break;
1219 1212          case ARC_SPACE_HDRS:
1220 1213                  ARCSTAT_INCR(arcstat_hdr_size, space);
1221 1214                  break;
1222 1215          case ARC_SPACE_L2HDRS:
1223 1216                  ARCSTAT_INCR(arcstat_l2_hdr_size, space);
1224 1217                  break;
1225 1218          }
1226 1219  
1227 1220          ARCSTAT_INCR(arcstat_meta_used, space);
1228 1221          atomic_add_64(&arc_size, space);
1229 1222  }
1230 1223  
1231 1224  void
1232 1225  arc_space_return(uint64_t space, arc_space_type_t type)
1233 1226  {
1234 1227          ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
1235 1228  
1236 1229          switch (type) {
1237 1230          case ARC_SPACE_DATA:
1238 1231                  ARCSTAT_INCR(arcstat_data_size, -space);
1239 1232                  break;
1240 1233          case ARC_SPACE_OTHER:
1241 1234                  ARCSTAT_INCR(arcstat_other_size, -space);
1242 1235                  break;
1243 1236          case ARC_SPACE_HDRS:
1244 1237                  ARCSTAT_INCR(arcstat_hdr_size, -space);
1245 1238                  break;
1246 1239          case ARC_SPACE_L2HDRS:
1247 1240                  ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
1248 1241                  break;
1249 1242          }
1250 1243  
1251 1244          ASSERT(arc_meta_used >= space);
1252 1245          if (arc_meta_max < arc_meta_used)
1253 1246                  arc_meta_max = arc_meta_used;
1254 1247          ARCSTAT_INCR(arcstat_meta_used, -space);
1255 1248          ASSERT(arc_size >= space);
1256 1249          atomic_add_64(&arc_size, -space);
1257 1250  }
1258 1251  
1259 1252  void *
1260 1253  arc_data_buf_alloc(uint64_t size)
1261 1254  {
1262 1255          if (arc_evict_needed(ARC_BUFC_DATA))
1263 1256                  cv_signal(&arc_reclaim_thr_cv);
1264 1257          atomic_add_64(&arc_size, size);
1265 1258          return (zio_data_buf_alloc(size));
1266 1259  }
1267 1260  
1268 1261  void
1269 1262  arc_data_buf_free(void *buf, uint64_t size)
1270 1263  {
1271 1264          zio_data_buf_free(buf, size);
1272 1265          ASSERT(arc_size >= size);
1273 1266          atomic_add_64(&arc_size, -size);
1274 1267  }
1275 1268  
1276 1269  arc_buf_t *
1277 1270  arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
1278 1271  {
1279 1272          arc_buf_hdr_t *hdr;
1280 1273          arc_buf_t *buf;
1281 1274  
1282 1275          ASSERT3U(size, >, 0);
1283 1276          hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
1284 1277          ASSERT(BUF_EMPTY(hdr));
1285 1278          hdr->b_size = size;
1286 1279          hdr->b_type = type;
1287 1280          hdr->b_spa = spa_load_guid(spa);
1288 1281          hdr->b_state = arc_anon;
1289 1282          hdr->b_arc_access = 0;
1290 1283          buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
1291 1284          buf->b_hdr = hdr;
1292 1285          buf->b_data = NULL;
1293 1286          buf->b_efunc = NULL;
1294 1287          buf->b_private = NULL;
1295 1288          buf->b_next = NULL;
1296 1289          hdr->b_buf = buf;
1297 1290          arc_get_data_buf(buf);
1298 1291          hdr->b_datacnt = 1;
1299 1292          hdr->b_flags = 0;
1300 1293          ASSERT(refcount_is_zero(&hdr->b_refcnt));
1301 1294          (void) refcount_add(&hdr->b_refcnt, tag);
1302 1295  
1303 1296          return (buf);
1304 1297  }
1305 1298  
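           /*
            * Reviewer sketch (not part of the patch): minimal lifecycle of an
            * anonymous buffer using arc_buf_alloc()/arc_buf_free() above.  The
            * wrapper function and `my_tag' are hypothetical.
            */
           static void
           example_anon_buf(spa_t *spa, void *my_tag)
           {
                   arc_buf_t *buf;

                   /* allocates hdr + data and takes a reference for my_tag */
                   buf = arc_buf_alloc(spa, SPA_MINBLOCKSIZE, my_tag, ARC_BUFC_DATA);
                   /* ... fill buf->b_data ... */
                   /* anon header, last reference dropped => header is destroyed */
                   arc_buf_free(buf, my_tag);
           }
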
1306 1299  static char *arc_onloan_tag = "onloan";
1307 1300  
1308 1301  /*
1309 1302   * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
1310 1303   * flight data by arc_tempreserve_space() until they are "returned". Loaned
1311 1304   * buffers must be returned to the arc before they can be used by the DMU or
1312 1305   * freed.
1313 1306   */
1314 1307  arc_buf_t *
1315 1308  arc_loan_buf(spa_t *spa, int size)
1316 1309  {
1317 1310          arc_buf_t *buf;
1318 1311  
1319 1312          buf = arc_buf_alloc(spa, size, arc_onloan_tag, ARC_BUFC_DATA);
1320 1313  
1321 1314          atomic_add_64(&arc_loaned_bytes, size);
1322 1315          return (buf);
1323 1316  }
1324 1317  
1325 1318  /*
1326 1319   * Return a loaned arc buffer to the arc.
1327 1320   */
1328 1321  void
1329 1322  arc_return_buf(arc_buf_t *buf, void *tag)
1330 1323  {
1331 1324          arc_buf_hdr_t *hdr = buf->b_hdr;
1332 1325  
1333 1326          ASSERT(buf->b_data != NULL);
1334 1327          (void) refcount_add(&hdr->b_refcnt, tag);
1335 1328          (void) refcount_remove(&hdr->b_refcnt, arc_onloan_tag);
1336 1329  
1337 1330          atomic_add_64(&arc_loaned_bytes, -hdr->b_size);
1338 1331  }
1339 1332  
1340 1333  /* Detach an arc_buf from a dbuf (tag) */
1341 1334  void
1342 1335  arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
1343 1336  {
1344 1337          arc_buf_hdr_t *hdr;
1345 1338  
1346 1339          ASSERT(buf->b_data != NULL);
1347 1340          hdr = buf->b_hdr;
1348 1341          (void) refcount_add(&hdr->b_refcnt, arc_onloan_tag);
1349 1342          (void) refcount_remove(&hdr->b_refcnt, tag);
1350 1343          buf->b_efunc = NULL;
1351 1344          buf->b_private = NULL;
1352 1345  
1353 1346          atomic_add_64(&arc_loaned_bytes, hdr->b_size);
1354 1347  }
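
           /*
            * Reviewer sketch (not part of the patch): how the loan interfaces
            * above pair up.  Only arc_loan_buf()/arc_return_buf() and the
            * arc_loaned_bytes accounting come from the code; the wrapper and
            * `owner_tag' are hypothetical.
            */
           static void
           example_loan_cycle(spa_t *spa, void *owner_tag)
           {
                   /* loaned buffers are excluded from arc_tempreserve_space() */
                   arc_buf_t *buf = arc_loan_buf(spa, SPA_MAXBLOCKSIZE);

                   /* ... fill buf->b_data on behalf of the caller ... */

                   /* hand it back: the hold moves from "onloan" to owner_tag */
                   arc_return_buf(buf, owner_tag);
           }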
1355 1348  
1356 1349  static arc_buf_t *
1357 1350  arc_buf_clone(arc_buf_t *from)
1358 1351  {
1359 1352          arc_buf_t *buf;
1360 1353          arc_buf_hdr_t *hdr = from->b_hdr;
1361 1354          uint64_t size = hdr->b_size;
1362 1355  
1363 1356          ASSERT(hdr->b_state != arc_anon);
1364 1357  
1365 1358          buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
1366 1359          buf->b_hdr = hdr;
1367 1360          buf->b_data = NULL;
1368 1361          buf->b_efunc = NULL;
1369 1362          buf->b_private = NULL;
1370 1363          buf->b_next = hdr->b_buf;
1371 1364          hdr->b_buf = buf;
1372 1365          arc_get_data_buf(buf);
1373 1366          bcopy(from->b_data, buf->b_data, size);
1374 1367  
1375 1368          /*
1376 1369           * This buffer already exists in the arc so create a duplicate
1377 1370           * copy for the caller.  If the buffer is associated with user data
1378 1371           * then track the size and number of duplicates.  These stats will be
1379 1372           * updated as duplicate buffers are created and destroyed.
1380 1373           */
1381 1374          if (hdr->b_type == ARC_BUFC_DATA) {
1382 1375                  ARCSTAT_BUMP(arcstat_duplicate_buffers);
1383 1376                  ARCSTAT_INCR(arcstat_duplicate_buffers_size, size);
1384 1377          }
1385 1378          hdr->b_datacnt += 1;
1386 1379          return (buf);
1387 1380  }
1388 1381  
1389 1382  void
1390 1383  arc_buf_add_ref(arc_buf_t *buf, void* tag)
1391 1384  {
1392 1385          arc_buf_hdr_t *hdr;
1393 1386          kmutex_t *hash_lock;
1394 1387  
1395 1388          /*
1396 1389           * Check to see if this buffer is evicted.  Callers
1397 1390           * must verify b_data != NULL to know if the add_ref
1398 1391           * was successful.
1399 1392           */
1400 1393          mutex_enter(&buf->b_evict_lock);
1401 1394          if (buf->b_data == NULL) {
1402 1395                  mutex_exit(&buf->b_evict_lock);
1403 1396                  return;
1404 1397          }
1405 1398          hash_lock = HDR_LOCK(buf->b_hdr);
1406 1399          mutex_enter(hash_lock);
1407 1400          hdr = buf->b_hdr;
1408 1401          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1409 1402          mutex_exit(&buf->b_evict_lock);
1410 1403  
1411 1404          ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
1412 1405          add_reference(hdr, hash_lock, tag);
1413 1406          DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
1414 1407          arc_access(hdr, hash_lock);
1415 1408          mutex_exit(hash_lock);
1416 1409          ARCSTAT_BUMP(arcstat_hits);
1417 1410          ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
1418 1411              demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
1419 1412              data, metadata, hits);
1420 1413  }
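
           /*
            * Reviewer note (not part of the patch): the lock dance above --
            * compute HDR_LOCK() from b_hdr, acquire it, re-load b_hdr and
            * assert that it still maps to the same mutex -- appears to be the
            * pattern that keeps callers correct when HDR_LOCK() hands out
            * per-bucket mutexes: a header cannot change identity (and hence
            * bucket and lock) while the lock computed from it is held.
            */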
1421 1414  
1422 1415  /*
1423 1416   * Free the arc data buffer.  If it is an l2arc write in progress,
1424 1417   * the buffer is placed on l2arc_free_on_write to be freed later.
1425 1418   */
1426 1419  static void
1427 1420  arc_buf_data_free(arc_buf_t *buf, void (*free_func)(void *, size_t))
1428 1421  {
1429 1422          arc_buf_hdr_t *hdr = buf->b_hdr;
1430 1423  
1431 1424          if (HDR_L2_WRITING(hdr)) {
1432 1425                  l2arc_data_free_t *df;
1433 1426                  df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
1434 1427                  df->l2df_data = buf->b_data;
1435 1428                  df->l2df_size = hdr->b_size;
1436 1429                  df->l2df_func = free_func;
1437 1430                  mutex_enter(&l2arc_free_on_write_mtx);
1438 1431                  list_insert_head(l2arc_free_on_write, df);
1439 1432                  mutex_exit(&l2arc_free_on_write_mtx);
1440 1433                  ARCSTAT_BUMP(arcstat_l2_free_on_write);
1441 1434          } else {
1442 1435                  free_func(buf->b_data, hdr->b_size);
1443 1436          }
1444 1437  }
1445 1438  
1446 1439  static void
1447 1440  arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
1448 1441  {
1449 1442          arc_buf_t **bufp;
1450 1443  
1451 1444          /* free up data associated with the buf */
1452 1445          if (buf->b_data) {
1453 1446                  arc_state_t *state = buf->b_hdr->b_state;
1454 1447                  uint64_t size = buf->b_hdr->b_size;
1455 1448                  arc_buf_contents_t type = buf->b_hdr->b_type;
1456 1449  
1457 1450                  arc_cksum_verify(buf);
1458 1451                  arc_buf_unwatch(buf);
1459 1452  
1460 1453                  if (!recycle) {
1461 1454                          if (type == ARC_BUFC_METADATA) {
1462 1455                                  arc_buf_data_free(buf, zio_buf_free);
1463 1456                                  arc_space_return(size, ARC_SPACE_DATA);
1464 1457                          } else {
1465 1458                                  ASSERT(type == ARC_BUFC_DATA);
1466 1459                                  arc_buf_data_free(buf, zio_data_buf_free);
1467 1460                                  ARCSTAT_INCR(arcstat_data_size, -size);
1468 1461                                  atomic_add_64(&arc_size, -size);
1469 1462                          }
1470 1463                  }
1471 1464                  if (list_link_active(&buf->b_hdr->b_arc_node)) {
1472 1465                          uint64_t *cnt = &state->arcs_lsize[type];
1473 1466  
1474 1467                          ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
1475 1468                          ASSERT(state != arc_anon);
1476 1469  
1477 1470                          ASSERT3U(*cnt, >=, size);
1478 1471                          atomic_add_64(cnt, -size);
1479 1472                  }
1480 1473                  ASSERT3U(state->arcs_size, >=, size);
1481 1474                  atomic_add_64(&state->arcs_size, -size);
1482 1475                  buf->b_data = NULL;
1483 1476  
1484 1477                  /*
1485 1478                   * If we're destroying a duplicate buffer make sure
1486 1479                   * that the appropriate statistics are updated.
1487 1480                   */
1488 1481                  if (buf->b_hdr->b_datacnt > 1 &&
1489 1482                      buf->b_hdr->b_type == ARC_BUFC_DATA) {
1490 1483                          ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
1491 1484                          ARCSTAT_INCR(arcstat_duplicate_buffers_size, -size);
1492 1485                  }
1493 1486                  ASSERT(buf->b_hdr->b_datacnt > 0);
1494 1487                  buf->b_hdr->b_datacnt -= 1;
1495 1488          }
1496 1489  
1497 1490          /* only remove the buf if requested */
1498 1491          if (!all)
1499 1492                  return;
1500 1493  
1501 1494          /* remove the buf from the hdr list */
1502 1495          for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
1503 1496                  continue;
1504 1497          *bufp = buf->b_next;
1505 1498          buf->b_next = NULL;
1506 1499  
1507 1500          ASSERT(buf->b_efunc == NULL);
1508 1501  
1509 1502          /* clean up the buf */
1510 1503          buf->b_hdr = NULL;
1511 1504          kmem_cache_free(buf_cache, buf);
1512 1505  }
1513 1506  
1514 1507  static void
1515 1508  arc_hdr_destroy(arc_buf_hdr_t *hdr)
1516 1509  {
1517 1510          ASSERT(refcount_is_zero(&hdr->b_refcnt));
1518 1511          ASSERT3P(hdr->b_state, ==, arc_anon);
1519 1512          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1520 1513          l2arc_buf_hdr_t *l2hdr = hdr->b_l2hdr;
1521 1514  
1522 1515          if (l2hdr != NULL) {
1523 1516                  boolean_t buflist_held = MUTEX_HELD(&l2arc_buflist_mtx);
1524 1517                  /*
1525 1518                   * To prevent arc_free() and l2arc_evict() from
1526 1519                   * attempting to free the same buffer at the same time,
1527 1520                   * a FREE_IN_PROGRESS flag is given to arc_free() to
1528 1521                   * give it priority.  l2arc_evict() can't destroy this
1529 1522                   * header while we are waiting on l2arc_buflist_mtx.
1530 1523                   *
1531 1524                   * The hdr may be removed from l2ad_buflist before we
1532 1525                   * grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
1533 1526                   */
1534 1527                  if (!buflist_held) {
1535 1528                          mutex_enter(&l2arc_buflist_mtx);
1536 1529                          l2hdr = hdr->b_l2hdr;
1537 1530                  }
1538 1531  
1539 1532                  if (l2hdr != NULL) {
1540 1533                          list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
1541 1534                          ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
1542 1535                          kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
1543 1536                          if (hdr->b_state == arc_l2c_only)
1544 1537                                  l2arc_hdr_stat_remove();
1545 1538                          hdr->b_l2hdr = NULL;
1546 1539                  }
1547 1540  
1548 1541                  if (!buflist_held)
1549 1542                          mutex_exit(&l2arc_buflist_mtx);
1550 1543          }
1551 1544  
1552 1545          if (!BUF_EMPTY(hdr)) {
1553 1546                  ASSERT(!HDR_IN_HASH_TABLE(hdr));
1554 1547                  buf_discard_identity(hdr);
1555 1548          }
1556 1549          while (hdr->b_buf) {
1557 1550                  arc_buf_t *buf = hdr->b_buf;
1558 1551  
1559 1552                  if (buf->b_efunc) {
1560 1553                          mutex_enter(&arc_eviction_mtx);
1561 1554                          mutex_enter(&buf->b_evict_lock);
1562 1555                          ASSERT(buf->b_hdr != NULL);
1563 1556                          arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
1564 1557                          hdr->b_buf = buf->b_next;
1565 1558                          buf->b_hdr = &arc_eviction_hdr;
1566 1559                          buf->b_next = arc_eviction_list;
1567 1560                          arc_eviction_list = buf;
1568 1561                          mutex_exit(&buf->b_evict_lock);
1569 1562                          mutex_exit(&arc_eviction_mtx);
1570 1563                  } else {
1571 1564                          arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
1572 1565                  }
1573 1566          }
1574 1567          if (hdr->b_freeze_cksum != NULL) {
1575 1568                  kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1576 1569                  hdr->b_freeze_cksum = NULL;
1577 1570          }
1578 1571          if (hdr->b_thawed) {
1579 1572                  kmem_free(hdr->b_thawed, 1);
1580 1573                  hdr->b_thawed = NULL;
1581 1574          }
1582 1575  
1583 1576          ASSERT(!list_link_active(&hdr->b_arc_node));
1584 1577          ASSERT3P(hdr->b_hash_next, ==, NULL);
1585 1578          ASSERT3P(hdr->b_acb, ==, NULL);
1586 1579          kmem_cache_free(hdr_cache, hdr);
1587 1580  }
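
           /*
            * Reviewer note (not part of the patch): buffers that still carry an
            * eviction callback (b_efunc) are not freed here; they are
            * re-parented to arc_eviction_hdr and queued on arc_eviction_list so
            * that arc_do_user_evicts() can invoke the callback later without a
            * hash lock held.
            */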
1588 1581  
1589 1582  void
1590 1583  arc_buf_free(arc_buf_t *buf, void *tag)
1591 1584  {
1592 1585          arc_buf_hdr_t *hdr = buf->b_hdr;
1593 1586          int hashed = hdr->b_state != arc_anon;
1594 1587  
1595 1588          ASSERT(buf->b_efunc == NULL);
1596 1589          ASSERT(buf->b_data != NULL);
1597 1590  
1598 1591          if (hashed) {
1599 1592                  kmutex_t *hash_lock = HDR_LOCK(hdr);
1600 1593  
1601 1594                  mutex_enter(hash_lock);
1602 1595                  hdr = buf->b_hdr;
1603 1596                  ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1604 1597  
1605 1598                  (void) remove_reference(hdr, hash_lock, tag);
1606 1599                  if (hdr->b_datacnt > 1) {
1607 1600                          arc_buf_destroy(buf, FALSE, TRUE);
1608 1601                  } else {
1609 1602                          ASSERT(buf == hdr->b_buf);
1610 1603                          ASSERT(buf->b_efunc == NULL);
1611 1604                          hdr->b_flags |= ARC_BUF_AVAILABLE;
1612 1605                  }
1613 1606                  mutex_exit(hash_lock);
1614 1607          } else if (HDR_IO_IN_PROGRESS(hdr)) {
1615 1608                  int destroy_hdr;
1616 1609                  /*
1617 1610                   * We are in the middle of an async write.  Don't destroy
1618 1611                   * this buffer unless the write completes before we finish
1619 1612                   * decrementing the reference count.
1620 1613                   */
1621 1614                  mutex_enter(&arc_eviction_mtx);
1622 1615                  (void) remove_reference(hdr, NULL, tag);
1623 1616                  ASSERT(refcount_is_zero(&hdr->b_refcnt));
1624 1617                  destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
1625 1618                  mutex_exit(&arc_eviction_mtx);
1626 1619                  if (destroy_hdr)
1627 1620                          arc_hdr_destroy(hdr);
1628 1621          } else {
1629 1622                  if (remove_reference(hdr, NULL, tag) > 0)
1630 1623                          arc_buf_destroy(buf, FALSE, TRUE);
1631 1624                  else
1632 1625                          arc_hdr_destroy(hdr);
1633 1626          }
1634 1627  }
1635 1628  
1636 1629  boolean_t
1637 1630  arc_buf_remove_ref(arc_buf_t *buf, void* tag)
1638 1631  {
1639 1632          arc_buf_hdr_t *hdr = buf->b_hdr;
1640 1633          kmutex_t *hash_lock = HDR_LOCK(hdr);
1641 1634          boolean_t no_callback = (buf->b_efunc == NULL);
1642 1635  
1643 1636          if (hdr->b_state == arc_anon) {
1644 1637                  ASSERT(hdr->b_datacnt == 1);
1645 1638                  arc_buf_free(buf, tag);
1646 1639                  return (no_callback);
1647 1640          }
1648 1641  
1649 1642          mutex_enter(hash_lock);
1650 1643          hdr = buf->b_hdr;
1651 1644          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
1652 1645          ASSERT(hdr->b_state != arc_anon);
1653 1646          ASSERT(buf->b_data != NULL);
1654 1647  
1655 1648          (void) remove_reference(hdr, hash_lock, tag);
1656 1649          if (hdr->b_datacnt > 1) {
1657 1650                  if (no_callback)
1658 1651                          arc_buf_destroy(buf, FALSE, TRUE);
1659 1652          } else if (no_callback) {
1660 1653                  ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
1661 1654                  ASSERT(buf->b_efunc == NULL);
1662 1655                  hdr->b_flags |= ARC_BUF_AVAILABLE;
1663 1656          }
1664 1657          ASSERT(no_callback || hdr->b_datacnt > 1 ||
1665 1658              refcount_is_zero(&hdr->b_refcnt));
1666 1659          mutex_exit(hash_lock);
1667 1660          return (no_callback);
1668 1661  }
1669 1662  
1670 1663  int
1671 1664  arc_buf_size(arc_buf_t *buf)
1672 1665  {
1673 1666          return (buf->b_hdr->b_size);
1674 1667  }
1675 1668  
1676 1669  /*
1677 1670   * Called from the DMU to determine if the current buffer should be
1678 1671   * evicted. In order to ensure proper locking, the eviction must be initiated
1679 1672   * from the DMU. Return true if the buffer is associated with user data and
1680 1673   * duplicate buffers still exist.
1681 1674   */
1682 1675  boolean_t
1683 1676  arc_buf_eviction_needed(arc_buf_t *buf)
1684 1677  {
1685 1678          arc_buf_hdr_t *hdr;
1686 1679          boolean_t evict_needed = B_FALSE;
1687 1680  
1688 1681          if (zfs_disable_dup_eviction)
1689 1682                  return (B_FALSE);
1690 1683  
1691 1684          mutex_enter(&buf->b_evict_lock);
1692 1685          hdr = buf->b_hdr;
1693 1686          if (hdr == NULL) {
1694 1687                  /*
1695 1688                   * We are in arc_do_user_evicts(); let that function
1696 1689                   * perform the eviction.
1697 1690                   */
1698 1691                  ASSERT(buf->b_data == NULL);
1699 1692                  mutex_exit(&buf->b_evict_lock);
1700 1693                  return (B_FALSE);
1701 1694          } else if (buf->b_data == NULL) {
1702 1695                  /*
1703 1696                   * We have already been added to the arc eviction list;
1704 1697                   * recommend eviction.
1705 1698                   */
1706 1699                  ASSERT3P(hdr, ==, &arc_eviction_hdr);
1707 1700                  mutex_exit(&buf->b_evict_lock);
1708 1701                  return (B_TRUE);
1709 1702          }
1710 1703  
1711 1704          if (hdr->b_datacnt > 1 && hdr->b_type == ARC_BUFC_DATA)
1712 1705                  evict_needed = B_TRUE;
1713 1706  
1714 1707          mutex_exit(&buf->b_evict_lock);
1715 1708          return (evict_needed);
1716 1709  }
1717 1710  
1718 1711  /*
1719 1712   * Evict buffers from list until we've removed the specified number of
1720 1713   * bytes.  Move the removed buffers to the appropriate evict state.
1721 1714   * If the recycle flag is set, then attempt to "recycle" a buffer:
1722 1715   * - look for a buffer to evict that is `bytes' long.
1723 1716   * - return the data block from this buffer rather than freeing it.
1724 1717   * This flag is used by callers that are trying to make space for a
1725 1718   * new buffer in a full arc cache.
1726 1719   *
1727 1720   * This function makes a "best effort".  It skips over any buffers
1728 1721   * it can't get a hash_lock on, and so may not catch all candidates.
1729 1722   * It may also return without evicting as much space as requested.
1730 1723   */
1731 1724  static void *
1732 1725  arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
1733 1726      arc_buf_contents_t type)
1734 1727  {
1735 1728          arc_state_t *evicted_state;
1736 1729          uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
1737 1730          arc_buf_hdr_t *ab, *ab_prev = NULL;
1738 1731          list_t *list = &state->arcs_list[type];
1739 1732          kmutex_t *hash_lock;
1740 1733          boolean_t have_lock;
1741 1734          void *stolen = NULL;
1742 1735  
1743 1736          ASSERT(state == arc_mru || state == arc_mfu);
1744 1737  
1745 1738          evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
1746 1739  
1747 1740          mutex_enter(&state->arcs_mtx);
1748 1741          mutex_enter(&evicted_state->arcs_mtx);
1749 1742  
1750 1743          for (ab = list_tail(list); ab; ab = ab_prev) {
1751 1744                  ab_prev = list_prev(list, ab);
1752 1745                  /* prefetch buffers have a minimum lifespan */
1753 1746                  if (HDR_IO_IN_PROGRESS(ab) ||
1754 1747                      (spa && ab->b_spa != spa) ||
1755 1748                      (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
1756 1749                      ddi_get_lbolt() - ab->b_arc_access <
1757 1750                      arc_min_prefetch_lifespan)) {
1758 1751                          skipped++;
1759 1752                          continue;
1760 1753                  }
1761 1754                  /* "lookahead" for better eviction candidate */
1762 1755                  if (recycle && ab->b_size != bytes &&
1763 1756                      ab_prev && ab_prev->b_size == bytes)
1764 1757                          continue;
1765 1758                  hash_lock = HDR_LOCK(ab);
1766 1759                  have_lock = MUTEX_HELD(hash_lock);
1767 1760                  if (have_lock || mutex_tryenter(hash_lock)) {
1768 1761                          ASSERT0(refcount_count(&ab->b_refcnt));
1769 1762                          ASSERT(ab->b_datacnt > 0);
1770 1763                          while (ab->b_buf) {
1771 1764                                  arc_buf_t *buf = ab->b_buf;
1772 1765                                  if (!mutex_tryenter(&buf->b_evict_lock)) {
1773 1766                                          missed += 1;
1774 1767                                          break;
1775 1768                                  }
1776 1769                                  if (buf->b_data) {
1777 1770                                          bytes_evicted += ab->b_size;
1778 1771                                          if (recycle && ab->b_type == type &&
1779 1772                                              ab->b_size == bytes &&
1780 1773                                              !HDR_L2_WRITING(ab)) {
1781 1774                                                  stolen = buf->b_data;
1782 1775                                                  recycle = FALSE;
1783 1776                                          }
1784 1777                                  }
1785 1778                                  if (buf->b_efunc) {
1786 1779                                          mutex_enter(&arc_eviction_mtx);
1787 1780                                          arc_buf_destroy(buf,
1788 1781                                              buf->b_data == stolen, FALSE);
1789 1782                                          ab->b_buf = buf->b_next;
1790 1783                                          buf->b_hdr = &arc_eviction_hdr;
1791 1784                                          buf->b_next = arc_eviction_list;
1792 1785                                          arc_eviction_list = buf;
1793 1786                                          mutex_exit(&arc_eviction_mtx);
1794 1787                                          mutex_exit(&buf->b_evict_lock);
1795 1788                                  } else {
1796 1789                                          mutex_exit(&buf->b_evict_lock);
1797 1790                                          arc_buf_destroy(buf,
1798 1791                                              buf->b_data == stolen, TRUE);
1799 1792                                  }
1800 1793                          }
1801 1794  
1802 1795                          if (ab->b_l2hdr) {
1803 1796                                  ARCSTAT_INCR(arcstat_evict_l2_cached,
1804 1797                                      ab->b_size);
1805 1798                          } else {
1806 1799                                  if (l2arc_write_eligible(ab->b_spa, ab)) {
1807 1800                                          ARCSTAT_INCR(arcstat_evict_l2_eligible,
1808 1801                                              ab->b_size);
1809 1802                                  } else {
1810 1803                                          ARCSTAT_INCR(
1811 1804                                              arcstat_evict_l2_ineligible,
1812 1805                                              ab->b_size);
1813 1806                                  }
1814 1807                          }
1815 1808  
1816 1809                          if (ab->b_datacnt == 0) {
1817 1810                                  arc_change_state(evicted_state, ab, hash_lock);
1818 1811                                  ASSERT(HDR_IN_HASH_TABLE(ab));
1819 1812                                  ab->b_flags |= ARC_IN_HASH_TABLE;
1820 1813                                  ab->b_flags &= ~ARC_BUF_AVAILABLE;
1821 1814                                  DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
1822 1815                          }
1823 1816                          if (!have_lock)
1824 1817                                  mutex_exit(hash_lock);
1825 1818                          if (bytes >= 0 && bytes_evicted >= bytes)
1826 1819                                  break;
1827 1820                  } else {
1828 1821                          missed += 1;
1829 1822                  }
1830 1823          }
1831 1824  
1832 1825          mutex_exit(&evicted_state->arcs_mtx);
1833 1826          mutex_exit(&state->arcs_mtx);
1834 1827  
1835 1828          if (bytes_evicted < bytes)
1836 1829                  dprintf("only evicted %lld bytes from %x",
1837 1830                      (longlong_t)bytes_evicted, state);
1838 1831  
1839 1832          if (skipped)
1840 1833                  ARCSTAT_INCR(arcstat_evict_skip, skipped);
1841 1834  
1842 1835          if (missed)
1843 1836                  ARCSTAT_INCR(arcstat_mutex_miss, missed);
1844 1837  
1845 1838          /*
1846 1839           * We have just evicted some data into the ghost state, make
1847 1840           * sure we also adjust the ghost state size if necessary.
1848 1841           */
1849 1842          if (arc_no_grow &&
1850 1843              arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
1851 1844                  int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
1852 1845                      arc_mru_ghost->arcs_size - arc_c;
1853 1846  
1854 1847                  if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
1855 1848                          int64_t todelete =
1856 1849                              MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
1857 1850                          arc_evict_ghost(arc_mru_ghost, NULL, todelete);
1858 1851                  } else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
1859 1852                          int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
1860 1853                              arc_mru_ghost->arcs_size +
1861 1854                              arc_mfu_ghost->arcs_size - arc_c);
1862 1855                          arc_evict_ghost(arc_mfu_ghost, NULL, todelete);
1863 1856                  }
1864 1857          }
1865 1858  
1866 1859          return (stolen);
1867 1860  }
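
           /*
            * Reviewer note (not part of the patch): with recycle == TRUE a
            * caller that is making room for a `bytes'-sized block receives the
            * data pointer of the first evicted buffer whose b_size matches,
            * instead of that block being freed and reallocated; once one block
            * has been stolen, recycle is cleared and the remaining evictions
            * free their data normally.
            */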
1868 1861  
1869 1862  /*
1870 1863   * Remove buffers from list until we've removed the specified number of
1871 1864   * bytes.  Destroy the buffers that are removed.
1872 1865   */
1873 1866  static void
1874 1867  arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes)
1875 1868  {
1876 1869          arc_buf_hdr_t *ab, *ab_prev;
1877 1870          arc_buf_hdr_t marker = { 0 };
1878 1871          list_t *list = &state->arcs_list[ARC_BUFC_DATA];
1879 1872          kmutex_t *hash_lock;
1880 1873          uint64_t bytes_deleted = 0;
1881 1874          uint64_t bufs_skipped = 0;
1882 1875  
1883 1876          ASSERT(GHOST_STATE(state));
1884 1877  top:
1885 1878          mutex_enter(&state->arcs_mtx);
1886 1879          for (ab = list_tail(list); ab; ab = ab_prev) {
1887 1880                  ab_prev = list_prev(list, ab);
1888 1881                  if (spa && ab->b_spa != spa)
1889 1882                          continue;
1890 1883  
1891 1884                  /* ignore markers */
1892 1885                  if (ab->b_spa == 0)
1893 1886                          continue;
1894 1887  
1895 1888                  hash_lock = HDR_LOCK(ab);
1896 1889                  /* caller may be trying to modify this buffer, skip it */
1897 1890                  if (MUTEX_HELD(hash_lock))
1898 1891                          continue;
1899 1892                  if (mutex_tryenter(hash_lock)) {
1900 1893                          ASSERT(!HDR_IO_IN_PROGRESS(ab));
1901 1894                          ASSERT(ab->b_buf == NULL);
1902 1895                          ARCSTAT_BUMP(arcstat_deleted);
1903 1896                          bytes_deleted += ab->b_size;
1904 1897  
1905 1898                          if (ab->b_l2hdr != NULL) {
1906 1899                                  /*
1907 1900                                   * This buffer is cached on the 2nd Level ARC;
1908 1901                                   * don't destroy the header.
1909 1902                                   */
1910 1903                                  arc_change_state(arc_l2c_only, ab, hash_lock);
1911 1904                                  mutex_exit(hash_lock);
1912 1905                          } else {
1913 1906                                  arc_change_state(arc_anon, ab, hash_lock);
1914 1907                                  mutex_exit(hash_lock);
1915 1908                                  arc_hdr_destroy(ab);
1916 1909                          }
1917 1910  
1918 1911                          DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
1919 1912                          if (bytes >= 0 && bytes_deleted >= bytes)
1920 1913                                  break;
1921 1914                  } else if (bytes < 0) {
1922 1915                          /*
1923 1916                           * Insert a list marker and then wait for the
1924 1917                           * hash lock to become available. Once its
1925 1918                           * hash lock to become available. Once it's
1926 1919                           */
1927 1920                          list_insert_after(list, ab, &marker);
1928 1921                          mutex_exit(&state->arcs_mtx);
1929 1922                          mutex_enter(hash_lock);
1930 1923                          mutex_exit(hash_lock);
1931 1924                          mutex_enter(&state->arcs_mtx);
1932 1925                          ab_prev = list_prev(list, &marker);
1933 1926                          list_remove(list, &marker);
1934 1927                  } else
1935 1928                          bufs_skipped += 1;
1936 1929          }
1937 1930          mutex_exit(&state->arcs_mtx);
1938 1931  
1939 1932          if (list == &state->arcs_list[ARC_BUFC_DATA] &&
1940 1933              (bytes < 0 || bytes_deleted < bytes)) {
1941 1934                  list = &state->arcs_list[ARC_BUFC_METADATA];
1942 1935                  goto top;
1943 1936          }
1944 1937  
1945 1938          if (bufs_skipped) {
1946 1939                  ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
1947 1940                  ASSERT(bytes >= 0);
1948 1941          }
1949 1942  
1950 1943          if (bytes_deleted < bytes)
1951 1944                  dprintf("only deleted %lld bytes from %p",
1952 1945                      (longlong_t)bytes_deleted, state);
1953 1946  }
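
           /*
            * Reviewer note (not part of the patch): the zero-filled `marker'
            * header (b_spa == 0, skipped by the "ignore markers" check) pins
            * our position in the list so arcs_mtx can be dropped while we wait
            * for a contended hash lock; the scan then resumes from the marker
            * rather than restarting from the tail.
            */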
1954 1947  
1955 1948  static void
1956 1949  arc_adjust(void)
1957 1950  {
1958 1951          int64_t adjustment, delta;
1959 1952  
1960 1953          /*
1961 1954           * Adjust MRU size
1962 1955           */
1963 1956  
1964 1957          adjustment = MIN((int64_t)(arc_size - arc_c),
1965 1958              (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used -
1966 1959              arc_p));
1967 1960  
1968 1961          if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
1969 1962                  delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
1970 1963                  (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
1971 1964                  adjustment -= delta;
1972 1965          }
1973 1966  
1974 1967          if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
1975 1968                  delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
1976 1969                  (void) arc_evict(arc_mru, NULL, delta, FALSE,
1977 1970                      ARC_BUFC_METADATA);
1978 1971          }
1979 1972  
1980 1973          /*
1981 1974           * Adjust MFU size
1982 1975           */
1983 1976  
1984 1977          adjustment = arc_size - arc_c;
1985 1978  
1986 1979          if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
1987 1980                  delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
1988 1981                  (void) arc_evict(arc_mfu, NULL, delta, FALSE, ARC_BUFC_DATA);
1989 1982                  adjustment -= delta;
1990 1983          }
1991 1984  
1992 1985          if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
1993 1986                  int64_t delta = MIN(adjustment,
1994 1987                      arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
1995 1988                  (void) arc_evict(arc_mfu, NULL, delta, FALSE,
1996 1989                      ARC_BUFC_METADATA);
1997 1990          }
1998 1991  
1999 1992          /*
2000 1993           * Adjust ghost lists
2001 1994           */
2002 1995  
2003 1996          adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
2004 1997  
2005 1998          if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
2006 1999                  delta = MIN(arc_mru_ghost->arcs_size, adjustment);
2007 2000                  arc_evict_ghost(arc_mru_ghost, NULL, delta);
2008 2001          }
2009 2002  
2010 2003          adjustment =
2011 2004              arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
2012 2005  
2013 2006          if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
2014 2007                  delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
2015 2008                  arc_evict_ghost(arc_mfu_ghost, NULL, delta);
2016 2009          }
2017 2010  }
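
           /*
            * Reviewer example (not part of the patch), with made-up numbers for
            * the MRU pass above: arc_size = 10G, arc_c = 8G, arc_p = 4G and
            * arc_anon + arc_mru + arc_meta_used = 5G give
            * adjustment = MIN(10G - 8G, 5G - 4G) = 1G, taken first from MRU
            * evictable data and then, if anything is left over, from MRU
            * evictable metadata.
            */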
2018 2011  
2019 2012  static void
2020 2013  arc_do_user_evicts(void)
2021 2014  {
2022 2015          mutex_enter(&arc_eviction_mtx);
2023 2016          while (arc_eviction_list != NULL) {
2024 2017                  arc_buf_t *buf = arc_eviction_list;
2025 2018                  arc_eviction_list = buf->b_next;
2026 2019                  mutex_enter(&buf->b_evict_lock);
2027 2020                  buf->b_hdr = NULL;
2028 2021                  mutex_exit(&buf->b_evict_lock);
2029 2022                  mutex_exit(&arc_eviction_mtx);
2030 2023  
2031 2024                  if (buf->b_efunc != NULL)
2032 2025                          VERIFY(buf->b_efunc(buf) == 0);
2033 2026  
2034 2027                  buf->b_efunc = NULL;
2035 2028                  buf->b_private = NULL;
2036 2029                  kmem_cache_free(buf_cache, buf);
2037 2030                  mutex_enter(&arc_eviction_mtx);
2038 2031          }
2039 2032          mutex_exit(&arc_eviction_mtx);
2040 2033  }
2041 2034  
2042 2035  /*
2043 2036   * Flush all *evictable* data from the cache for the given spa.
2044 2037   * NOTE: this will not touch "active" (i.e. referenced) data.
2045 2038   */
2046 2039  void
2047 2040  arc_flush(spa_t *spa)
2048 2041  {
2049 2042          uint64_t guid = 0;
2050 2043  
2051 2044          if (spa)
2052 2045                  guid = spa_load_guid(spa);
2053 2046  
2054 2047          while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {
2055 2048                  (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);
2056 2049                  if (spa)
2057 2050                          break;
2058 2051          }
2059 2052          while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {
2060 2053                  (void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);
2061 2054                  if (spa)
2062 2055                          break;
2063 2056          }
2064 2057          while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {
2065 2058                  (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);
2066 2059                  if (spa)
2067 2060                          break;
2068 2061          }
2069 2062          while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {
2070 2063                  (void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);
2071 2064                  if (spa)
2072 2065                          break;
2073 2066          }
2074 2067  
2075 2068          arc_evict_ghost(arc_mru_ghost, guid, -1);
2076 2069          arc_evict_ghost(arc_mfu_ghost, guid, -1);
2077 2070  
2078 2071          mutex_enter(&arc_reclaim_thr_lock);
2079 2072          arc_do_user_evicts();
2080 2073          mutex_exit(&arc_reclaim_thr_lock);
2081 2074          ASSERT(spa || arc_eviction_list == NULL);
2082 2075  }
2083 2076  
2084 2077  void
2085 2078  arc_shrink(void)
2086 2079  {
2087 2080          if (arc_c > arc_c_min) {
2088 2081                  uint64_t to_free;
2089 2082  
2090 2083  #ifdef _KERNEL
2091 2084                  to_free = MAX(arc_c >> arc_shrink_shift, ptob(needfree));
2092 2085  #else
2093 2086                  to_free = arc_c >> arc_shrink_shift;
2094 2087  #endif
2095 2088                  if (arc_c > arc_c_min + to_free)
2096 2089                          atomic_add_64(&arc_c, -to_free);
2097 2090                  else
2098 2091                          arc_c = arc_c_min;
2099 2092  
2100 2093                  atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
2101 2094                  if (arc_c > arc_size)
2102 2095                          arc_c = MAX(arc_size, arc_c_min);
2103 2096                  if (arc_p > arc_c)
2104 2097                          arc_p = (arc_c >> 1);
2105 2098                  ASSERT(arc_c >= arc_c_min);
2106 2099                  ASSERT((int64_t)arc_p >= 0);
2107 2100          }
2108 2101  
2109 2102          if (arc_size > arc_c)
2110 2103                  arc_adjust();
2111 2104  }
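
           /*
            * Reviewer example (not part of the patch), taking arc_shrink_shift
            * as 5 purely for illustration: with arc_c = 64G a shrink frees
            * to_free = 64G >> 5 = 2G (in the kernel, the larger of that and
            * ptob(needfree)), and arc_p is likewise reduced by 1/32 of its own
            * value before being clamped against arc_c.
            */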
2112 2105  
2113 2106  /*
2114 2107   * Determine if the system is under memory pressure and is asking
2115 2108   * to reclaim memory. A return value of 1 indicates that the system
2116 2109   * is under memory pressure and that the arc should adjust accordingly.
2117 2110   */
2118 2111  static int
2119 2112  arc_reclaim_needed(void)
2120 2113  {
2121 2114          uint64_t extra;
2122 2115  
2123 2116  #ifdef _KERNEL
2124 2117  
2125 2118          if (needfree)
2126 2119                  return (1);
2127 2120  
2128 2121          /*
2129 2122           * take 'desfree' extra pages, so we reclaim sooner, rather than later
2130 2123           */
2131 2124          extra = desfree;
2132 2125  
2133 2126          /*
2134 2127           * check that we're out of range of the pageout scanner.  It starts to
2135 2128           * schedule paging if freemem is less than lotsfree and needfree.
2136 2129           * lotsfree is the high-water mark for pageout, and needfree is the
2137 2130           * number of needed free pages.  We add extra pages here to make sure
2138 2131           * the scanner doesn't start up while we're freeing memory.
2139 2132           */
2140 2133          if (freemem < lotsfree + needfree + extra)
2141 2134                  return (1);
2142 2135  
2143 2136          /*
2144 2137           * check to make sure that swapfs has enough space so that anon
2145 2138           * reservations can still succeed. anon_resvmem() checks that the
2146 2139           * availrmem is greater than swapfs_minfree, and the number of reserved
2147 2140           * swap pages.  We also add a bit of extra here just to prevent
2148 2141           * circumstances from getting really dire.
2149 2142           */
2150 2143          if (availrmem < swapfs_minfree + swapfs_reserve + extra)
2151 2144                  return (1);
2152 2145  
2153 2146  #if defined(__i386)
2154 2147          /*
2155 2148           * If we're on an i386 platform, it's possible that we'll exhaust the
2156 2149           * kernel heap space before we ever run out of available physical
2157 2150           * memory.  Most checks of the size of the heap_area compare against
2158 2151           * tune.t_minarmem, which is the minimum available real memory that we
2159 2152           * can have in the system.  However, this is generally fixed at 25 pages
2160 2153           * which is so low that it's useless.  In this comparison, we seek to
2161 2154           * calculate the total heap-size, and reclaim if more than 3/4ths of the
2162 2155           * heap is allocated.  (Or, in the calculation, if less than 1/4th is
2163 2156           * free)
2164 2157           */
2165 2158          if (vmem_size(heap_arena, VMEM_FREE) <
2166 2159              (vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2))
2167 2160                  return (1);
2168 2161  #endif
2169 2162  
2170 2163          /*
2171 2164           * If zio data pages are being allocated out of a separate heap segment,
2172 2165           * then enforce that the size of available vmem for this arena remains
2173 2166           * above about 1/16th free.
2174 2167           *
2175 2168           * Note: The 1/16th arena free requirement was put in place
2176 2169           * to aggressively evict memory from the arc in order to avoid
2177 2170           * memory fragmentation issues.
2178 2171           */
2179 2172          if (zio_arena != NULL &&
2180 2173              vmem_size(zio_arena, VMEM_FREE) <
2181 2174              (vmem_size(zio_arena, VMEM_ALLOC) >> 4))
2182 2175                  return (1);
2183 2176  #else
2184 2177          if (spa_get_random(100) == 0)
2185 2178                  return (1);
2186 2179  #endif
2187 2180          return (0);
2188 2181  }
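
           /*
            * Reviewer note (not part of the patch): in the kernel the checks
            * above amount to "reclaim if needfree is set, if freemem drops
            * below lotsfree + needfree plus a desfree cushion, if swapfs
            * reservations are getting tight, or if the i386 kernel heap / zio
            * arena is more than roughly 3/4, respectively 15/16, allocated";
            * in userland a 1-in-100 random result keeps the path exercised.
            */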
2189 2182  
2190 2183  static void
2191 2184  arc_kmem_reap_now(arc_reclaim_strategy_t strat)
2192 2185  {
2193 2186          size_t                  i;
2194 2187          kmem_cache_t            *prev_cache = NULL;
2195 2188          kmem_cache_t            *prev_data_cache = NULL;
2196 2189          extern kmem_cache_t     *zio_buf_cache[];
2197 2190          extern kmem_cache_t     *zio_data_buf_cache[];
2198 2191  
2199 2192  #ifdef _KERNEL
2200 2193          if (arc_meta_used >= arc_meta_limit) {
2201 2194                  /*
2202 2195                   * We are exceeding our meta-data cache limit.
2203 2196                   * Purge some DNLC entries to release holds on meta-data.
2204 2197                   */
2205 2198                  dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
2206 2199          }
2207 2200  #if defined(__i386)
2208 2201          /*
2209 2202           * Reclaim unused memory from all kmem caches.
2210 2203           */
2211 2204          kmem_reap();
2212 2205  #endif
2213 2206  #endif
2214 2207  
2215 2208          /*
2216 2209           * An aggressive reclamation will shrink the cache size as well as
2217 2210           * reap free buffers from the arc kmem caches.
2218 2211           */
2219 2212          if (strat == ARC_RECLAIM_AGGR)
2220 2213                  arc_shrink();
2221 2214  
2222 2215          for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
2223 2216                  if (zio_buf_cache[i] != prev_cache) {
2224 2217                          prev_cache = zio_buf_cache[i];
2225 2218                          kmem_cache_reap_now(zio_buf_cache[i]);
2226 2219                  }
2227 2220                  if (zio_data_buf_cache[i] != prev_data_cache) {
2228 2221                          prev_data_cache = zio_data_buf_cache[i];
2229 2222                          kmem_cache_reap_now(zio_data_buf_cache[i]);
2230 2223                  }
2231 2224          }
2232 2225          kmem_cache_reap_now(buf_cache);
2233 2226          kmem_cache_reap_now(hdr_cache);
2234 2227  
2235 2228          /*
2236 2229           * Ask the vmem arena to reclaim unused memory from its
2237 2230           * quantum caches.
2238 2231           */
2239 2232          if (zio_arena != NULL && strat == ARC_RECLAIM_AGGR)
2240 2233                  vmem_qcache_reap(zio_arena);
2241 2234  }
2242 2235  
2243 2236  static void
2244 2237  arc_reclaim_thread(void)
2245 2238  {
2246 2239          clock_t                 growtime = 0;
2247 2240          arc_reclaim_strategy_t  last_reclaim = ARC_RECLAIM_CONS;
2248 2241          callb_cpr_t             cpr;
2249 2242  
2250 2243          CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
2251 2244  
2252 2245          mutex_enter(&arc_reclaim_thr_lock);
2253 2246          while (arc_thread_exit == 0) {
2254 2247                  if (arc_reclaim_needed()) {
2255 2248  
2256 2249                          if (arc_no_grow) {
2257 2250                                  if (last_reclaim == ARC_RECLAIM_CONS) {
2258 2251                                          last_reclaim = ARC_RECLAIM_AGGR;
2259 2252                                  } else {
2260 2253                                          last_reclaim = ARC_RECLAIM_CONS;
2261 2254                                  }
2262 2255                          } else {
2263 2256                                  arc_no_grow = TRUE;
2264 2257                                  last_reclaim = ARC_RECLAIM_AGGR;
2265 2258                                  membar_producer();
2266 2259                          }
2267 2260  
2268 2261                          /* reset the growth delay for every reclaim */
2269 2262                          growtime = ddi_get_lbolt() + (arc_grow_retry * hz);
2270 2263  
2271 2264                          arc_kmem_reap_now(last_reclaim);
2272 2265                          arc_warm = B_TRUE;
2273 2266  
2274 2267                  } else if (arc_no_grow && ddi_get_lbolt() >= growtime) {
2275 2268                          arc_no_grow = FALSE;
2276 2269                  }
2277 2270  
2278 2271                  arc_adjust();
2279 2272  
2280 2273                  if (arc_eviction_list != NULL)
2281 2274                          arc_do_user_evicts();
2282 2275  
2283 2276                  /* block until needed, or one second, whichever is shorter */
2284 2277                  CALLB_CPR_SAFE_BEGIN(&cpr);
2285 2278                  (void) cv_timedwait(&arc_reclaim_thr_cv,
2286 2279                      &arc_reclaim_thr_lock, (ddi_get_lbolt() + hz));
2287 2280                  CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
2288 2281          }
2289 2282  
2290 2283          arc_thread_exit = 0;
2291 2284          cv_broadcast(&arc_reclaim_thr_cv);
2292 2285          CALLB_CPR_EXIT(&cpr);           /* drops arc_reclaim_thr_lock */
2293 2286          thread_exit();
2294 2287  }
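
           /*
            * Reviewer note (not part of the patch): while arc_reclaim_needed()
            * stays true the loop alternates ARC_RECLAIM_CONS and
            * ARC_RECLAIM_AGGR passes (only the aggressive pass shrinks arc_c),
            * re-arms the arc_grow_retry growth delay each time, and otherwise
            * wakes at least once per second to run arc_adjust() and any
            * pending user evictions.
            */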
2295 2288  
2296 2289  /*
2297 2290   * Adapt arc info given the number of bytes we are trying to add and
2298 2291   * the state that we are coming from.  This function is only called
2299 2292   * when we are adding new content to the cache.
2300 2293   */
2301 2294  static void
2302 2295  arc_adapt(int bytes, arc_state_t *state)
2303 2296  {
2304 2297          int mult;
2305 2298          uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
2306 2299  
2307 2300          if (state == arc_l2c_only)
2308 2301                  return;
2309 2302  
2310 2303          ASSERT(bytes > 0);
2311 2304          /*
2312 2305           * Adapt the target size of the MRU list:
2313 2306           *      - if we just hit in the MRU ghost list, then increase
2314 2307           *        the target size of the MRU list.
2315 2308           *      - if we just hit in the MFU ghost list, then increase
2316 2309           *        the target size of the MFU list by decreasing the
2317 2310           *        target size of the MRU list.
2318 2311           */
2319 2312          if (state == arc_mru_ghost) {
2320 2313                  mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
2321 2314                      1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
2322 2315                  mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
2323 2316  
2324 2317                  arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
2325 2318          } else if (state == arc_mfu_ghost) {
2326 2319                  uint64_t delta;
2327 2320  
2328 2321                  mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
2329 2322                      1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
2330 2323                  mult = MIN(mult, 10);
2331 2324  
2332 2325                  delta = MIN(bytes * mult, arc_p);
2333 2326                  arc_p = MAX(arc_p_min, arc_p - delta);
2334 2327          }
2335 2328          ASSERT((int64_t)arc_p >= 0);
2336 2329  
2337 2330          if (arc_reclaim_needed()) {
2338 2331                  cv_signal(&arc_reclaim_thr_cv);
2339 2332                  return;
2340 2333          }
2341 2334  
2342 2335          if (arc_no_grow)
2343 2336                  return;
2344 2337  
2345 2338          if (arc_c >= arc_c_max)
2346 2339                  return;
2347 2340  
2348 2341          /*
2349 2342           * If we're within (2 * maxblocksize) bytes of the target
2350 2343           * cache size, increment the target cache size
2351 2344           */
2352 2345          if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
2353 2346                  atomic_add_64(&arc_c, (int64_t)bytes);
2354 2347                  if (arc_c > arc_c_max)
2355 2348                          arc_c = arc_c_max;
2356 2349                  else if (state == arc_anon)
2357 2350                          atomic_add_64(&arc_p, (int64_t)bytes);
2358 2351                  if (arc_p > arc_c)
2359 2352                          arc_p = arc_c;
2360 2353          }
2361 2354          ASSERT((int64_t)arc_p >= 0);
2362 2355  }
2363 2356  
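To make the ghost-hit adaptation in arc_adapt() concrete: a hit in one ghost list scales the adjustment by the ratio of the other ghost list's size to the hit list's size, capped at 10, and arc_p then moves by bytes * mult without exceeding arc_c - arc_p_min (MRU-ghost hit) or dropping below arc_p_min (MFU-ghost hit). Here is a small standalone example of the MRU-ghost branch; the sizes are made up purely for illustration.

    /* Worked example of the MRU-ghost-hit adjustment in arc_adapt(). */
    #include <stdio.h>
    #include <stdint.h>

    #define MIN(a, b)   ((a) < (b) ? (a) : (b))

    int
    main(void)
    {
        /* Illustrative values only, not real tunables. */
        uint64_t arc_c = 64ULL << 30;           /* 64 GB target cache size */
        uint64_t arc_p = 16ULL << 30;           /* 16 GB currently for MRU */
        uint64_t arc_p_min = arc_c >> 4;        /* arc_p_min_shift == 4 */
        uint64_t mru_ghost = 2ULL << 30;
        uint64_t mfu_ghost = 10ULL << 30;
        uint64_t bytes = 128ULL << 10;          /* 128 KB hit in MRU ghost */
        uint64_t mult;

        /* Hit in the MRU ghost list: grow the MRU target. */
        mult = (mru_ghost >= mfu_ghost) ? 1 : (mfu_ghost / mru_ghost);
        mult = MIN(mult, 10);                   /* avoid wild arc_p swings */

        arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
        printf("mult = %llu, new arc_p = %llu bytes\n",
            (unsigned long long)mult, (unsigned long long)arc_p);
        return (0);
    }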
2364 2357  /*
2365 2358   * Check if the cache has reached its limits and eviction is required
2366 2359   * prior to insert.
2367 2360   */
2368 2361  static int
2369 2362  arc_evict_needed(arc_buf_contents_t type)
2370 2363  {
2371 2364          if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
2372 2365                  return (1);
2373 2366  
2374 2367          if (arc_reclaim_needed())
2375 2368                  return (1);
2376 2369  
2377 2370          return (arc_size > arc_c);
2378 2371  }
2379 2372  
2380 2373  /*
2381 2374   * The buffer, supplied as the first argument, needs a data block.
2382 2375   * So, if we are at cache max, determine which cache should be victimized.
2383 2376   * We have the following cases:
2384 2377   *
2385 2378   * 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
2386 2379   * In this situation if we're out of space, but the resident size of the MFU is
2387 2380   * under the limit, victimize the MFU cache to satisfy this insertion request.
2388 2381   *
2389 2382   * 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
2390 2383   * Here, we've used up all of the available space for the MRU, so we need to
2391 2384   * evict from our own cache instead.  Evict from the set of resident MRU
2392 2385   * entries.
2393 2386   *
2394 2387   * 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
2395 2388   * c minus p represents the MFU space in the cache, since p is the size of the
2396 2389   * cache that is dedicated to the MRU.  In this situation there's still space on
2397 2390   * the MFU side, so the MRU side needs to be victimized.
2398 2391   *
2399 2392   * 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
2400 2393   * MFU's resident set is consuming more space than it has been allotted.  In
2401 2394   * this situation, we must victimize our own cache, the MFU, for this insertion.
2402 2395   */
2403 2396  static void
2404 2397  arc_get_data_buf(arc_buf_t *buf)
2405 2398  {
2406 2399          arc_state_t             *state = buf->b_hdr->b_state;
2407 2400          uint64_t                size = buf->b_hdr->b_size;
2408 2401          arc_buf_contents_t      type = buf->b_hdr->b_type;
2409 2402  
2410 2403          arc_adapt(size, state);
2411 2404  
2412 2405          /*
2413 2406           * We have not yet reached cache maximum size,
2414 2407           * just allocate a new buffer.
2415 2408           */
2416 2409          if (!arc_evict_needed(type)) {
2417 2410                  if (type == ARC_BUFC_METADATA) {
2418 2411                          buf->b_data = zio_buf_alloc(size);
2419 2412                          arc_space_consume(size, ARC_SPACE_DATA);
2420 2413                  } else {
2421 2414                          ASSERT(type == ARC_BUFC_DATA);
2422 2415                          buf->b_data = zio_data_buf_alloc(size);
2423 2416                          ARCSTAT_INCR(arcstat_data_size, size);
2424 2417                          atomic_add_64(&arc_size, size);
2425 2418                  }
2426 2419                  goto out;
2427 2420          }
2428 2421  
2429 2422          /*
2430 2423           * If we are prefetching from the mfu ghost list, this buffer
2431 2424           * will end up on the mru list; so steal space from there.
2432 2425           */
2433 2426          if (state == arc_mfu_ghost)
2434 2427                  state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
2435 2428          else if (state == arc_mru_ghost)
2436 2429                  state = arc_mru;
2437 2430  
2438 2431          if (state == arc_mru || state == arc_anon) {
2439 2432                  uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
2440 2433                  state = (arc_mfu->arcs_lsize[type] >= size &&
2441 2434                      arc_p > mru_used) ? arc_mfu : arc_mru;
2442 2435          } else {
2443 2436                  /* MFU cases */
2444 2437                  uint64_t mfu_space = arc_c - arc_p;
2445 2438                  state =  (arc_mru->arcs_lsize[type] >= size &&
2446 2439                      mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
2447 2440          }
2448 2441          if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
2449 2442                  if (type == ARC_BUFC_METADATA) {
2450 2443                          buf->b_data = zio_buf_alloc(size);
2451 2444                          arc_space_consume(size, ARC_SPACE_DATA);
2452 2445                  } else {
2453 2446                          ASSERT(type == ARC_BUFC_DATA);
2454 2447                          buf->b_data = zio_data_buf_alloc(size);
2455 2448                          ARCSTAT_INCR(arcstat_data_size, size);
2456 2449                          atomic_add_64(&arc_size, size);
2457 2450                  }
2458 2451                  ARCSTAT_BUMP(arcstat_recycle_miss);
2459 2452          }
2460 2453          ASSERT(buf->b_data != NULL);
2461 2454  out:
2462 2455          /*
2463 2456           * Update the state size.  Note that ghost states have a
2464 2457           * "ghost size" and so don't need to be updated.
2465 2458           */
2466 2459          if (!GHOST_STATE(buf->b_hdr->b_state)) {
2467 2460                  arc_buf_hdr_t *hdr = buf->b_hdr;
2468 2461  
2469 2462                  atomic_add_64(&hdr->b_state->arcs_size, size);
2470 2463                  if (list_link_active(&hdr->b_arc_node)) {
2471 2464                          ASSERT(refcount_is_zero(&hdr->b_refcnt));
2472 2465                          atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
2473 2466                  }
2474 2467                  /*
2475 2468                   * If we are growing the cache, and we are adding anonymous
2476 2469                   * data, and we have outgrown arc_p, update arc_p
2477 2470                   */
2478 2471                  if (arc_size < arc_c && hdr->b_state == arc_anon &&
2479 2472                      arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
2480 2473                          arc_p = MIN(arc_c, arc_p + size);
2481 2474          }
2482 2475  }
2483 2476  
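The four cases described before arc_get_data_buf() reduce to a single comparison per side once the list bookkeeping is stripped away. The following standalone sketch captures just that decision; the types and parameters are simplified stand-ins, not the real arc_state_t machinery.

    /* Which list to victimize when a new buffer needs space (sketch). */
    #include <stdio.h>
    #include <stdint.h>

    typedef enum { EVICT_MRU, EVICT_MFU } victim_t;

    /*
     * inserting_mru: the new buffer is headed for the MRU (or anon) list.
     * All sizes are in bytes; "lsize" is the evictable portion of a list.
     */
    static victim_t
    choose_victim(int inserting_mru, uint64_t size,
        uint64_t arc_c, uint64_t arc_p,
        uint64_t anon_size, uint64_t mru_size, uint64_t mfu_size,
        uint64_t mru_lsize, uint64_t mfu_lsize)
    {
        if (inserting_mru) {
            /* Cases 1 and 2: MRU insert. */
            uint64_t mru_used = anon_size + mru_size;
            return ((mfu_lsize >= size && arc_p > mru_used) ?
                EVICT_MFU : EVICT_MRU);
        } else {
            /* Cases 3 and 4: MFU insert; c - p is the MFU's share. */
            uint64_t mfu_space = arc_c - arc_p;
            return ((mru_lsize >= size && mfu_space > mfu_size) ?
                EVICT_MRU : EVICT_MFU);
        }
    }

    int
    main(void)
    {
        /* MRU insert, MRU already over its target: steal from the MRU. */
        victim_t v = choose_victim(1, 131072,
            /* arc_c */ 1ULL << 30, /* arc_p */ 256ULL << 20,
            /* anon */ 64ULL << 20, /* mru */ 300ULL << 20,
            /* mfu */ 400ULL << 20,
            /* mru_lsize */ 280ULL << 20, /* mfu_lsize */ 350ULL << 20);
        printf("victimize %s\n", v == EVICT_MRU ? "MRU" : "MFU");
        return (0);
    }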
2484 2477  /*
2485 2478   * This routine is called whenever a buffer is accessed.
2486 2479   * NOTE: the hash lock is dropped in this function.
2487 2480   */
2488 2481  static void
2489 2482  arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
2490 2483  {
2491 2484          clock_t now;
2492 2485  
2493 2486          ASSERT(MUTEX_HELD(hash_lock));
2494 2487  
2495 2488          if (buf->b_state == arc_anon) {
2496 2489                  /*
2497 2490                   * This buffer is not in the cache, and does not
2498 2491                   * appear in our "ghost" list.  Add the new buffer
2499 2492                   * to the MRU state.
2500 2493                   */
2501 2494  
2502 2495                  ASSERT(buf->b_arc_access == 0);
2503 2496                  buf->b_arc_access = ddi_get_lbolt();
2504 2497                  DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
2505 2498                  arc_change_state(arc_mru, buf, hash_lock);
2506 2499  
2507 2500          } else if (buf->b_state == arc_mru) {
2508 2501                  now = ddi_get_lbolt();
2509 2502  
2510 2503                  /*
2511 2504                   * If this buffer is here because of a prefetch, then either:
2512 2505                   * - clear the flag if this is a "referencing" read
2513 2506                   *   (any subsequent access will bump this into the MFU state).
2514 2507                   * or
2515 2508                   * - move the buffer to the head of the list if this is
2516 2509                   *   another prefetch (to make it less likely to be evicted).
2517 2510                   */
2518 2511                  if ((buf->b_flags & ARC_PREFETCH) != 0) {
2519 2512                          if (refcount_count(&buf->b_refcnt) == 0) {
2520 2513                                  ASSERT(list_link_active(&buf->b_arc_node));
2521 2514                          } else {
2522 2515                                  buf->b_flags &= ~ARC_PREFETCH;
2523 2516                                  ARCSTAT_BUMP(arcstat_mru_hits);
2524 2517                          }
2525 2518                          buf->b_arc_access = now;
2526 2519                          return;
2527 2520                  }
2528 2521  
2529 2522                  /*
2530 2523                   * This buffer has been "accessed" only once so far,
2531 2524                   * but it is still in the cache. Move it to the MFU
2532 2525                   * state.
2533 2526                   */
2534 2527                  if (now > buf->b_arc_access + ARC_MINTIME) {
2535 2528                          /*
2536 2529                           * More than 125ms have passed since we
2537 2530                           * instantiated this buffer.  Move it to the
2538 2531                           * most frequently used state.
2539 2532                           */
2540 2533                          buf->b_arc_access = now;
2541 2534                          DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2542 2535                          arc_change_state(arc_mfu, buf, hash_lock);
2543 2536                  }
2544 2537                  ARCSTAT_BUMP(arcstat_mru_hits);
2545 2538          } else if (buf->b_state == arc_mru_ghost) {
2546 2539                  arc_state_t     *new_state;
2547 2540                  /*
2548 2541                   * This buffer has been "accessed" recently, but
2549 2542                   * was evicted from the cache.  Move it to the
2550 2543                   * MFU state.
2551 2544                   */
2552 2545  
2553 2546                  if (buf->b_flags & ARC_PREFETCH) {
2554 2547                          new_state = arc_mru;
2555 2548                          if (refcount_count(&buf->b_refcnt) > 0)
2556 2549                                  buf->b_flags &= ~ARC_PREFETCH;
2557 2550                          DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
2558 2551                  } else {
2559 2552                          new_state = arc_mfu;
2560 2553                          DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2561 2554                  }
2562 2555  
2563 2556                  buf->b_arc_access = ddi_get_lbolt();
2564 2557                  arc_change_state(new_state, buf, hash_lock);
2565 2558  
2566 2559                  ARCSTAT_BUMP(arcstat_mru_ghost_hits);
2567 2560          } else if (buf->b_state == arc_mfu) {
2568 2561                  /*
2569 2562                   * This buffer has been accessed more than once and is
2570 2563                   * still in the cache.  Keep it in the MFU state.
2571 2564                   *
2572 2565                   * NOTE: an add_reference() that occurred when we did
2573 2566                   * the arc_read() will have kicked this off the list.
2574 2567                   * If it was a prefetch, we will explicitly move it to
2575 2568                   * the head of the list now.
2576 2569                   */
2577 2570                  if ((buf->b_flags & ARC_PREFETCH) != 0) {
2578 2571                          ASSERT(refcount_count(&buf->b_refcnt) == 0);
2579 2572                          ASSERT(list_link_active(&buf->b_arc_node));
2580 2573                  }
2581 2574                  ARCSTAT_BUMP(arcstat_mfu_hits);
2582 2575                  buf->b_arc_access = ddi_get_lbolt();
2583 2576          } else if (buf->b_state == arc_mfu_ghost) {
2584 2577                  arc_state_t     *new_state = arc_mfu;
2585 2578                  /*
2586 2579                   * This buffer has been accessed more than once but has
2587 2580                   * been evicted from the cache.  Move it back to the
2588 2581                   * MFU state.
2589 2582                   */
2590 2583  
2591 2584                  if (buf->b_flags & ARC_PREFETCH) {
2592 2585                          /*
2593 2586                           * This is a prefetch access...
2594 2587                           * move this block back to the MRU state.
2595 2588                           */
2596 2589                          ASSERT0(refcount_count(&buf->b_refcnt));
2597 2590                          new_state = arc_mru;
2598 2591                  }
2599 2592  
2600 2593                  buf->b_arc_access = ddi_get_lbolt();
2601 2594                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2602 2595                  arc_change_state(new_state, buf, hash_lock);
2603 2596  
2604 2597                  ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
2605 2598          } else if (buf->b_state == arc_l2c_only) {
2606 2599                  /*
2607 2600                   * This buffer is on the 2nd Level ARC.
2608 2601                   */
2609 2602  
2610 2603                  buf->b_arc_access = ddi_get_lbolt();
2611 2604                  DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
2612 2605                  arc_change_state(arc_mfu, buf, hash_lock);
2613 2606          } else {
2614 2607                  ASSERT(!"invalid arc state");
2615 2608          }
2616 2609  }
2617 2610  
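The MRU branch of arc_access() is the subtle one: a prefetched buffer's first demand access only clears the prefetch flag, and an ordinary MRU buffer is promoted to the MFU list only when a further access arrives more than ARC_MINTIME ticks after the previous one. A simplified sketch of those two rules follows; the struct and field names are illustrative only.

    /* Sketch of the MRU promotion rules applied in arc_access(). */
    #include <stdio.h>

    typedef enum { STATE_MRU, STATE_MFU } bufstate_t;

    struct buf {
        bufstate_t  state;
        int         prefetch;       /* was brought in by a prefetch */
        int         refs;           /* active references */
        long        last_access;    /* tick of previous access */
    };

    /*
     * min_time stands in for ARC_MINTIME.  A demand access that arrives
     * more than min_time ticks after the previous one promotes the
     * buffer to the MFU list; a prefetched buffer's first real reader
     * only clears the prefetch flag.
     */
    static void
    on_access(struct buf *b, long now, long min_time)
    {
        if (b->state != STATE_MRU)
            return;

        if (b->prefetch) {
            if (b->refs > 0)
                b->prefetch = 0;        /* first real reader */
            b->last_access = now;
            return;
        }

        if (now > b->last_access + min_time) {
            b->state = STATE_MFU;       /* promote */
            b->last_access = now;
        }
    }

    int
    main(void)
    {
        struct buf b = { STATE_MRU, 0, 1, 100 };

        on_access(&b, 105, 12);     /* too soon: stays in MRU */
        on_access(&b, 120, 12);     /* later: promoted */
        printf("state = %s\n", b.state == STATE_MFU ? "MFU" : "MRU");
        return (0);
    }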
2618 2611  /* a generic arc_done_func_t which you can use */
2619 2612  /* ARGSUSED */
2620 2613  void
2621 2614  arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
2622 2615  {
2623 2616          if (zio == NULL || zio->io_error == 0)
2624 2617                  bcopy(buf->b_data, arg, buf->b_hdr->b_size);
2625 2618          VERIFY(arc_buf_remove_ref(buf, arg));
2626 2619  }
2627 2620  
2628 2621  /* a generic arc_done_func_t */
2629 2622  void
2630 2623  arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
2631 2624  {
2632 2625          arc_buf_t **bufp = arg;
2633 2626          if (zio && zio->io_error) {
2634 2627                  VERIFY(arc_buf_remove_ref(buf, arg));
2635 2628                  *bufp = NULL;
2636 2629          } else {
2637 2630                  *bufp = buf;
2638 2631                  ASSERT(buf->b_data);
2639 2632          }
2640 2633  }
2641 2634  
2642 2635  static void
2643 2636  arc_read_done(zio_t *zio)
2644 2637  {
2645 2638          arc_buf_hdr_t   *hdr, *found;
2646 2639          arc_buf_t       *buf;
2647 2640          arc_buf_t       *abuf;  /* buffer we're assigning to callback */
2648 2641          kmutex_t        *hash_lock;
2649 2642          arc_callback_t  *callback_list, *acb;
2650 2643          int             freeable = FALSE;
2651 2644  
2652 2645          buf = zio->io_private;
2653 2646          hdr = buf->b_hdr;
2654 2647  
2655 2648          /*
2656 2649           * The hdr was inserted into hash-table and removed from lists
2657 2650           * prior to starting I/O.  We should find this header, since
2658 2651           * it's in the hash table, and it should be legit since it's
2659 2652           * not possible to evict it during the I/O.  The only possible
2660 2653           * reason for it not to be found is if we were freed during the
2661 2654           * read.
2662 2655           */
2663 2656          found = buf_hash_find(hdr->b_spa, &hdr->b_dva, hdr->b_birth,
2664 2657              &hash_lock);
2665 2658  
2666 2659          ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
2667 2660              (found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
2668 2661              (found == hdr && HDR_L2_READING(hdr)));
2669 2662  
2670 2663          hdr->b_flags &= ~ARC_L2_EVICTED;
2671 2664          if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
2672 2665                  hdr->b_flags &= ~ARC_L2CACHE;
2673 2666  
2674 2667          /* byteswap if necessary */
2675 2668          callback_list = hdr->b_acb;
2676 2669          ASSERT(callback_list != NULL);
2677 2670          if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
2678 2671                  dmu_object_byteswap_t bswap =
2679 2672                      DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
2680 2673                  arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
2681 2674                      byteswap_uint64_array :
2682 2675                      dmu_ot_byteswap[bswap].ob_func;
2683 2676                  func(buf->b_data, hdr->b_size);
2684 2677          }
2685 2678  
2686 2679          arc_cksum_compute(buf, B_FALSE);
2687 2680          arc_buf_watch(buf);
2688 2681  
2689 2682          if (hash_lock && zio->io_error == 0 && hdr->b_state == arc_anon) {
2690 2683                  /*
2691 2684                   * Only call arc_access on anonymous buffers.  This is because
2692 2685                   * if we've issued an I/O for an evicted buffer, we've already
2693 2686                   * called arc_access (to prevent any simultaneous readers from
2694 2687                   * getting confused).
2695 2688                   */
2696 2689                  arc_access(hdr, hash_lock);
2697 2690          }
2698 2691  
2699 2692          /* create copies of the data buffer for the callers */
2700 2693          abuf = buf;
2701 2694          for (acb = callback_list; acb; acb = acb->acb_next) {
2702 2695                  if (acb->acb_done) {
2703 2696                          if (abuf == NULL) {
2704 2697                                  ARCSTAT_BUMP(arcstat_duplicate_reads);
2705 2698                                  abuf = arc_buf_clone(buf);
2706 2699                          }
2707 2700                          acb->acb_buf = abuf;
2708 2701                          abuf = NULL;
2709 2702                  }
2710 2703          }
2711 2704          hdr->b_acb = NULL;
2712 2705          hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
2713 2706          ASSERT(!HDR_BUF_AVAILABLE(hdr));
2714 2707          if (abuf == buf) {
2715 2708                  ASSERT(buf->b_efunc == NULL);
2716 2709                  ASSERT(hdr->b_datacnt == 1);
2717 2710                  hdr->b_flags |= ARC_BUF_AVAILABLE;
2718 2711          }
2719 2712  
2720 2713          ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);
2721 2714  
2722 2715          if (zio->io_error != 0) {
2723 2716                  hdr->b_flags |= ARC_IO_ERROR;
2724 2717                  if (hdr->b_state != arc_anon)
2725 2718                          arc_change_state(arc_anon, hdr, hash_lock);
2726 2719                  if (HDR_IN_HASH_TABLE(hdr))
2727 2720                          buf_hash_remove(hdr);
2728 2721                  freeable = refcount_is_zero(&hdr->b_refcnt);
2729 2722          }
2730 2723  
2731 2724          /*
2732 2725           * Broadcast before we drop the hash_lock to avoid the possibility
2733 2726           * that the hdr (and hence the cv) might be freed before we get to
2734 2727           * the cv_broadcast().
2735 2728           */
2736 2729          cv_broadcast(&hdr->b_cv);
2737 2730  
2738 2731          if (hash_lock) {
2739 2732                  mutex_exit(hash_lock);
2740 2733          } else {
2741 2734                  /*
2742 2735                   * This block was freed while we waited for the read to
2743 2736                   * complete.  It has been removed from the hash table and
2744 2737                   * moved to the anonymous state (so that it won't show up
2745 2738                   * in the cache).
2746 2739                   */
2747 2740                  ASSERT3P(hdr->b_state, ==, arc_anon);
2748 2741                  freeable = refcount_is_zero(&hdr->b_refcnt);
2749 2742          }
2750 2743  
2751 2744          /* execute each callback and free its structure */
2752 2745          while ((acb = callback_list) != NULL) {
2753 2746                  if (acb->acb_done)
2754 2747                          acb->acb_done(zio, acb->acb_buf, acb->acb_private);
2755 2748  
2756 2749                  if (acb->acb_zio_dummy != NULL) {
2757 2750                          acb->acb_zio_dummy->io_error = zio->io_error;
2758 2751                          zio_nowait(acb->acb_zio_dummy);
2759 2752                  }
2760 2753  
2761 2754                  callback_list = acb->acb_next;
2762 2755                  kmem_free(acb, sizeof (arc_callback_t));
2763 2756          }
2764 2757  
2765 2758          if (freeable)
2766 2759                  arc_hdr_destroy(hdr);
2767 2760  }
2768 2761  
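arc_read_done() hands the original buffer to the first waiter that registered a done callback and a clone to every additional one, so no two callers end up sharing a mutable buffer. The sketch below strips that fan-out down to its essentials; clone_buf() stands in for arc_buf_clone() and the callback type is simplified.

    /* Sketch of the per-callback buffer fan-out done by arc_read_done(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct buf {
        char    *data;
        size_t  size;
    };

    typedef void (*done_func_t)(struct buf *, void *);

    struct callback {
        done_func_t     done;
        void            *arg;
        struct buf      *buf;
        struct callback *next;
    };

    /* Stand-in for arc_buf_clone(): a private copy of the data
     * (clones are leaked; fine for a one-shot sketch). */
    static struct buf *
    clone_buf(const struct buf *src)
    {
        struct buf *nb = malloc(sizeof (*nb));

        nb->size = src->size;
        nb->data = malloc(src->size);
        memcpy(nb->data, src->data, src->size);
        return (nb);
    }

    /*
     * Give the original buffer to the first callback that wants one and
     * a clone to each later callback, then run them all.
     */
    static void
    fan_out(struct buf *orig, struct callback *list)
    {
        struct buf *abuf = orig;
        struct callback *cb;

        for (cb = list; cb != NULL; cb = cb->next) {
            if (cb->done == NULL)
                continue;
            if (abuf == NULL)
                abuf = clone_buf(orig);
            cb->buf = abuf;
            abuf = NULL;
        }
        for (cb = list; cb != NULL; cb = cb->next)
            if (cb->done != NULL)
                cb->done(cb->buf, cb->arg);
    }

    static void
    print_done(struct buf *b, void *arg)
    {
        printf("%s got: %s\n", (char *)arg, b->data);
    }

    int
    main(void)
    {
        struct buf orig = { "hello", 6 };
        struct callback c2 = { print_done, "reader2", NULL, NULL };
        struct callback c1 = { print_done, "reader1", NULL, &c2 };

        fan_out(&orig, &c1);
        return (0);
    }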
2769 2762  /*
2770 2763   * "Read" the block at the specified DVA (in bp) via the
2771 2764   * cache.  If the block is found in the cache, invoke the provided
2772 2765   * callback immediately and return.  Note that the `zio' parameter
2773 2766   * in the callback will be NULL in this case, since no IO was
2774 2767   * required.  If the block is not in the cache pass the read request
2775 2768   * on to the spa with a substitute callback function, so that the
2776 2769   * requested block will be added to the cache.
2777 2770   *
2778 2771   * If a read request arrives for a block that has a read in-progress,
2779 2772   * either wait for the in-progress read to complete (and return the
2780 2773   * results); or, if this is a read with a "done" func, add a record
2781 2774   * to the read to invoke the "done" func when the read completes,
2782 2775   * and return; or just return.
2783 2776   *
2784 2777   * arc_read_done() will invoke all the requested "done" functions
2785 2778   * for readers of this block.
2786 2779   */
2787 2780  int
2788 2781  arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
2789 2782      void *private, int priority, int zio_flags, uint32_t *arc_flags,
2790 2783      const zbookmark_t *zb)
2791 2784  {
2792 2785          arc_buf_hdr_t *hdr;
2793 2786          arc_buf_t *buf = NULL;
2794 2787          kmutex_t *hash_lock;
2795 2788          zio_t *rzio;
2796 2789          uint64_t guid = spa_load_guid(spa);
2797 2790  
2798 2791  top:
2799 2792          hdr = buf_hash_find(guid, BP_IDENTITY(bp), BP_PHYSICAL_BIRTH(bp),
2800 2793              &hash_lock);
2801 2794          if (hdr && hdr->b_datacnt > 0) {
2802 2795  
2803 2796                  *arc_flags |= ARC_CACHED;
2804 2797  
2805 2798                  if (HDR_IO_IN_PROGRESS(hdr)) {
2806 2799  
2807 2800                          if (*arc_flags & ARC_WAIT) {
2808 2801                                  cv_wait(&hdr->b_cv, hash_lock);
2809 2802                                  mutex_exit(hash_lock);
2810 2803                                  goto top;
2811 2804                          }
2812 2805                          ASSERT(*arc_flags & ARC_NOWAIT);
2813 2806  
2814 2807                          if (done) {
2815 2808                                  arc_callback_t  *acb = NULL;
2816 2809  
2817 2810                                  acb = kmem_zalloc(sizeof (arc_callback_t),
2818 2811                                      KM_SLEEP);
2819 2812                                  acb->acb_done = done;
2820 2813                                  acb->acb_private = private;
2821 2814                                  if (pio != NULL)
2822 2815                                          acb->acb_zio_dummy = zio_null(pio,
2823 2816                                              spa, NULL, NULL, NULL, zio_flags);
2824 2817  
2825 2818                                  ASSERT(acb->acb_done != NULL);
2826 2819                                  acb->acb_next = hdr->b_acb;
2827 2820                                  hdr->b_acb = acb;
2828 2821                                  add_reference(hdr, hash_lock, private);
2829 2822                                  mutex_exit(hash_lock);
2830 2823                                  return (0);
2831 2824                          }
2832 2825                          mutex_exit(hash_lock);
2833 2826                          return (0);
2834 2827                  }
2835 2828  
2836 2829                  ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
2837 2830  
2838 2831                  if (done) {
2839 2832                          add_reference(hdr, hash_lock, private);
2840 2833                          /*
2841 2834                           * If this block is already in use, create a new
2842 2835                           * copy of the data so that we will be guaranteed
2843 2836                           * that arc_release() will always succeed.
2844 2837                           */
2845 2838                          buf = hdr->b_buf;
2846 2839                          ASSERT(buf);
2847 2840                          ASSERT(buf->b_data);
2848 2841                          if (HDR_BUF_AVAILABLE(hdr)) {
2849 2842                                  ASSERT(buf->b_efunc == NULL);
2850 2843                                  hdr->b_flags &= ~ARC_BUF_AVAILABLE;
2851 2844                          } else {
2852 2845                                  buf = arc_buf_clone(buf);
2853 2846                          }
2854 2847  
2855 2848                  } else if (*arc_flags & ARC_PREFETCH &&
2856 2849                      refcount_count(&hdr->b_refcnt) == 0) {
2857 2850                          hdr->b_flags |= ARC_PREFETCH;
2858 2851                  }
2859 2852                  DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
2860 2853                  arc_access(hdr, hash_lock);
2861 2854                  if (*arc_flags & ARC_L2CACHE)
2862 2855                          hdr->b_flags |= ARC_L2CACHE;
2863 2856                  mutex_exit(hash_lock);
2864 2857                  ARCSTAT_BUMP(arcstat_hits);
2865 2858                  ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
2866 2859                      demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
2867 2860                      data, metadata, hits);
2868 2861  
2869 2862                  if (done)
2870 2863                          done(NULL, buf, private);
2871 2864          } else {
2872 2865                  uint64_t size = BP_GET_LSIZE(bp);
2873 2866                  arc_callback_t  *acb;
2874 2867                  vdev_t *vd = NULL;
2875 2868                  uint64_t addr = 0;
2876 2869                  boolean_t devw = B_FALSE;
2877 2870  
2878 2871                  if (hdr == NULL) {
2879 2872                          /* this block is not in the cache */
2880 2873                          arc_buf_hdr_t   *exists;
2881 2874                          arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
2882 2875                          buf = arc_buf_alloc(spa, size, private, type);
2883 2876                          hdr = buf->b_hdr;
2884 2877                          hdr->b_dva = *BP_IDENTITY(bp);
2885 2878                          hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
2886 2879                          hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
2887 2880                          exists = buf_hash_insert(hdr, &hash_lock);
2888 2881                          if (exists) {
2889 2882                                  /* somebody beat us to the hash insert */
2890 2883                                  mutex_exit(hash_lock);
2891 2884                                  buf_discard_identity(hdr);
2892 2885                                  (void) arc_buf_remove_ref(buf, private);
2893 2886                                  goto top; /* restart the IO request */
2894 2887                          }
2895 2888                          /* if this is a prefetch, we don't have a reference */
2896 2889                          if (*arc_flags & ARC_PREFETCH) {
2897 2890                                  (void) remove_reference(hdr, hash_lock,
2898 2891                                      private);
2899 2892                                  hdr->b_flags |= ARC_PREFETCH;
2900 2893                          }
2901 2894                          if (*arc_flags & ARC_L2CACHE)
2902 2895                                  hdr->b_flags |= ARC_L2CACHE;
2903 2896                          if (BP_GET_LEVEL(bp) > 0)
2904 2897                                  hdr->b_flags |= ARC_INDIRECT;
2905 2898                  } else {
2906 2899                          /* this block is in the ghost cache */
2907 2900                          ASSERT(GHOST_STATE(hdr->b_state));
2908 2901                          ASSERT(!HDR_IO_IN_PROGRESS(hdr));
2909 2902                          ASSERT0(refcount_count(&hdr->b_refcnt));
2910 2903                          ASSERT(hdr->b_buf == NULL);
2911 2904  
2912 2905                          /* if this is a prefetch, we don't have a reference */
2913 2906                          if (*arc_flags & ARC_PREFETCH)
2914 2907                                  hdr->b_flags |= ARC_PREFETCH;
2915 2908                          else
2916 2909                                  add_reference(hdr, hash_lock, private);
2917 2910                          if (*arc_flags & ARC_L2CACHE)
2918 2911                                  hdr->b_flags |= ARC_L2CACHE;
2919 2912                          buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2920 2913                          buf->b_hdr = hdr;
2921 2914                          buf->b_data = NULL;
2922 2915                          buf->b_efunc = NULL;
2923 2916                          buf->b_private = NULL;
2924 2917                          buf->b_next = NULL;
2925 2918                          hdr->b_buf = buf;
2926 2919                          ASSERT(hdr->b_datacnt == 0);
2927 2920                          hdr->b_datacnt = 1;
2928 2921                          arc_get_data_buf(buf);
2929 2922                          arc_access(hdr, hash_lock);
2930 2923                  }
2931 2924  
2932 2925                  ASSERT(!GHOST_STATE(hdr->b_state));
2933 2926  
2934 2927                  acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
2935 2928                  acb->acb_done = done;
2936 2929                  acb->acb_private = private;
2937 2930  
2938 2931                  ASSERT(hdr->b_acb == NULL);
2939 2932                  hdr->b_acb = acb;
2940 2933                  hdr->b_flags |= ARC_IO_IN_PROGRESS;
2941 2934  
2942 2935                  if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
2943 2936                      (vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
2944 2937                          devw = hdr->b_l2hdr->b_dev->l2ad_writing;
2945 2938                          addr = hdr->b_l2hdr->b_daddr;
2946 2939                          /*
2947 2940                           * Lock out device removal.
2948 2941                           */
2949 2942                          if (vdev_is_dead(vd) ||
2950 2943                              !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
2951 2944                                  vd = NULL;
2952 2945                  }
2953 2946  
2954 2947                  mutex_exit(hash_lock);
2955 2948  
2956 2949                  ASSERT3U(hdr->b_size, ==, size);
2957 2950                  DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
2958 2951                      uint64_t, size, zbookmark_t *, zb);
2959 2952                  ARCSTAT_BUMP(arcstat_misses);
2960 2953                  ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
2961 2954                      demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
2962 2955                      data, metadata, misses);
2963 2956  
2964 2957                  if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
2965 2958                          /*
2966 2959                           * Read from the L2ARC if the following are true:
2967 2960                           * 1. The L2ARC vdev was previously cached.
2968 2961                           * 2. This buffer still has L2ARC metadata.
2969 2962                           * 3. This buffer isn't currently writing to the L2ARC.
2970 2963                           * 4. The L2ARC entry wasn't evicted, which may
2971 2964                           *    also have invalidated the vdev.
2972 2965                           * 5. This isn't prefetch and l2arc_noprefetch is set.
2973 2966                           */
2974 2967                          if (hdr->b_l2hdr != NULL &&
2975 2968                              !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
2976 2969                              !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
2977 2970                                  l2arc_read_callback_t *cb;
2978 2971  
2979 2972                                  DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
2980 2973                                  ARCSTAT_BUMP(arcstat_l2_hits);
2981 2974  
2982 2975                                  cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
2983 2976                                      KM_SLEEP);
2984 2977                                  cb->l2rcb_buf = buf;
2985 2978                                  cb->l2rcb_spa = spa;
2986 2979                                  cb->l2rcb_bp = *bp;
2987 2980                                  cb->l2rcb_zb = *zb;
2988 2981                                  cb->l2rcb_flags = zio_flags;
2989 2982  
2990 2983                                  ASSERT(addr >= VDEV_LABEL_START_SIZE &&
2991 2984                                      addr + size < vd->vdev_psize -
2992 2985                                      VDEV_LABEL_END_SIZE);
2993 2986  
2994 2987                                  /*
2995 2988                                   * l2arc read.  The SCL_L2ARC lock will be
2996 2989                                   * released by l2arc_read_done().
2997 2990                                   */
2998 2991                                  rzio = zio_read_phys(pio, vd, addr, size,
2999 2992                                      buf->b_data, ZIO_CHECKSUM_OFF,
3000 2993                                      l2arc_read_done, cb, priority, zio_flags |
3001 2994                                      ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
3002 2995                                      ZIO_FLAG_DONT_PROPAGATE |
3003 2996                                      ZIO_FLAG_DONT_RETRY, B_FALSE);
3004 2997                                  DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
3005 2998                                      zio_t *, rzio);
3006 2999                                  ARCSTAT_INCR(arcstat_l2_read_bytes, size);
3007 3000  
3008 3001                                  if (*arc_flags & ARC_NOWAIT) {
3009 3002                                          zio_nowait(rzio);
3010 3003                                          return (0);
3011 3004                                  }
3012 3005  
3013 3006                                  ASSERT(*arc_flags & ARC_WAIT);
3014 3007                                  if (zio_wait(rzio) == 0)
3015 3008                                          return (0);
3016 3009  
3017 3010                                  /* l2arc read error; goto zio_read() */
3018 3011                          } else {
3019 3012                                  DTRACE_PROBE1(l2arc__miss,
3020 3013                                      arc_buf_hdr_t *, hdr);
3021 3014                                  ARCSTAT_BUMP(arcstat_l2_misses);
3022 3015                                  if (HDR_L2_WRITING(hdr))
3023 3016                                          ARCSTAT_BUMP(arcstat_l2_rw_clash);
3024 3017                                  spa_config_exit(spa, SCL_L2ARC, vd);
3025 3018                          }
3026 3019                  } else {
3027 3020                          if (vd != NULL)
3028 3021                                  spa_config_exit(spa, SCL_L2ARC, vd);
3029 3022                          if (l2arc_ndev != 0) {
3030 3023                                  DTRACE_PROBE1(l2arc__miss,
3031 3024                                      arc_buf_hdr_t *, hdr);
3032 3025                                  ARCSTAT_BUMP(arcstat_l2_misses);
3033 3026                          }
3034 3027                  }
3035 3028  
3036 3029                  rzio = zio_read(pio, spa, bp, buf->b_data, size,
3037 3030                      arc_read_done, buf, priority, zio_flags, zb);
3038 3031  
3039 3032                  if (*arc_flags & ARC_WAIT)
3040 3033                          return (zio_wait(rzio));
3041 3034  
3042 3035                  ASSERT(*arc_flags & ARC_NOWAIT);
3043 3036                  zio_nowait(rzio);
3044 3037          }
3045 3038          return (0);
3046 3039  }
3047 3040  
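Both the hit path and the miss path above begin with buf_hash_find(), which, as the callers assume, returns with the header's hash lock held when it finds something; every later mutex_exit(hash_lock) releases that same lock. Below is a generic user-space sketch of that lookup-and-return-locked contract for a chained hash table with one mutex per bucket; the table size, hash function, and names are illustrative, not the arc.c implementation.

    /* Sketch: chained hash lookup that returns with the bucket lock held. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>

    #define NBUCKETS    1024            /* illustrative; power of two */

    struct entry {
        uint64_t        key;
        struct entry    *next;
    };

    static struct entry     *table[NBUCKETS];
    static pthread_mutex_t  bucket_lock[NBUCKETS];

    static unsigned
    bucket_of(uint64_t key)
    {
        /* Stand-in for the crc64-based hash used by the ARC. */
        return ((unsigned)(key * 0x9e3779b97f4a7c15ULL >> 54) % NBUCKETS);
    }

    /*
     * Like buf_hash_find(): on success *lockp is the held bucket mutex
     * and the caller must drop it; on a miss nothing is held.
     */
    static struct entry *
    hash_find(uint64_t key, pthread_mutex_t **lockp)
    {
        unsigned b = bucket_of(key);
        struct entry *e;

        pthread_mutex_lock(&bucket_lock[b]);
        for (e = table[b]; e != NULL; e = e->next) {
            if (e->key == key) {
                *lockp = &bucket_lock[b];
                return (e);
            }
        }
        pthread_mutex_unlock(&bucket_lock[b]);
        *lockp = NULL;
        return (NULL);
    }

    int
    main(void)
    {
        static struct entry ent = { 42, NULL };
        pthread_mutex_t *lock;
        struct entry *e;
        int i;

        for (i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&bucket_lock[i], NULL);
        table[bucket_of(42)] = &ent;

        e = hash_find(42, &lock);
        printf("found key %llu\n", (unsigned long long)e->key);
        pthread_mutex_unlock(lock);     /* caller drops the bucket lock */
        return (0);
    }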
3048 3041  void
3049 3042  arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
3050 3043  {
3051 3044          ASSERT(buf->b_hdr != NULL);
3052 3045          ASSERT(buf->b_hdr->b_state != arc_anon);
3053 3046          ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
3054 3047          ASSERT(buf->b_efunc == NULL);
3055 3048          ASSERT(!HDR_BUF_AVAILABLE(buf->b_hdr));
3056 3049  
3057 3050          buf->b_efunc = func;
3058 3051          buf->b_private = private;
3059 3052  }
3060 3053  
3061 3054  /*
3062 3055   * This is used by the DMU to let the ARC know that a buffer is
3063 3056   * being evicted, so the ARC should clean up.  If this arc buf
3064 3057   * is not yet in the evicted state, it will be put there.
3065 3058   */
3066 3059  int
3067 3060  arc_buf_evict(arc_buf_t *buf)
3068 3061  {
3069 3062          arc_buf_hdr_t *hdr;
3070 3063          kmutex_t *hash_lock;
3071 3064          arc_buf_t **bufp;
3072 3065  
3073 3066          mutex_enter(&buf->b_evict_lock);
3074 3067          hdr = buf->b_hdr;
3075 3068          if (hdr == NULL) {
3076 3069                  /*
3077 3070                   * We are in arc_do_user_evicts().
3078 3071                   */
3079 3072                  ASSERT(buf->b_data == NULL);
3080 3073                  mutex_exit(&buf->b_evict_lock);
3081 3074                  return (0);
3082 3075          } else if (buf->b_data == NULL) {
3083 3076                  arc_buf_t copy = *buf; /* structure assignment */
3084 3077                  /*
3085 3078                   * We are on the eviction list; process this buffer now
3086 3079                   * but let arc_do_user_evicts() do the reaping.
3087 3080                   */
3088 3081                  buf->b_efunc = NULL;
3089 3082                  mutex_exit(&buf->b_evict_lock);
3090 3083                  VERIFY(copy.b_efunc(&copy) == 0);
3091 3084                  return (1);
3092 3085          }
3093 3086          hash_lock = HDR_LOCK(hdr);
3094 3087          mutex_enter(hash_lock);
3095 3088          hdr = buf->b_hdr;
3096 3089          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3097 3090  
3098 3091          ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
3099 3092          ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
3100 3093  
3101 3094          /*
3102 3095           * Pull this buffer off of the hdr
3103 3096           */
3104 3097          bufp = &hdr->b_buf;
3105 3098          while (*bufp != buf)
3106 3099                  bufp = &(*bufp)->b_next;
3107 3100          *bufp = buf->b_next;
3108 3101  
3109 3102          ASSERT(buf->b_data != NULL);
3110 3103          arc_buf_destroy(buf, FALSE, FALSE);
3111 3104  
3112 3105          if (hdr->b_datacnt == 0) {
3113 3106                  arc_state_t *old_state = hdr->b_state;
3114 3107                  arc_state_t *evicted_state;
3115 3108  
3116 3109                  ASSERT(hdr->b_buf == NULL);
3117 3110                  ASSERT(refcount_is_zero(&hdr->b_refcnt));
3118 3111  
3119 3112                  evicted_state =
3120 3113                      (old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
3121 3114  
3122 3115                  mutex_enter(&old_state->arcs_mtx);
3123 3116                  mutex_enter(&evicted_state->arcs_mtx);
3124 3117  
3125 3118                  arc_change_state(evicted_state, hdr, hash_lock);
3126 3119                  ASSERT(HDR_IN_HASH_TABLE(hdr));
3127 3120                  hdr->b_flags |= ARC_IN_HASH_TABLE;
3128 3121                  hdr->b_flags &= ~ARC_BUF_AVAILABLE;
3129 3122  
3130 3123                  mutex_exit(&evicted_state->arcs_mtx);
3131 3124                  mutex_exit(&old_state->arcs_mtx);
3132 3125          }
3133 3126          mutex_exit(hash_lock);
3134 3127          mutex_exit(&buf->b_evict_lock);
3135 3128  
3136 3129          VERIFY(buf->b_efunc(buf) == 0);
3137 3130          buf->b_efunc = NULL;
3138 3131          buf->b_private = NULL;
3139 3132          buf->b_hdr = NULL;
3140 3133          buf->b_next = NULL;
3141 3134          kmem_cache_free(buf_cache, buf);
3142 3135          return (1);
3143 3136  }
3144 3137  
3145 3138  /*
3146 3139   * Release this buffer from the cache.  This must be done
3147 3140   * after a read and prior to modifying the buffer contents.
3148 3141   * If the buffer has more than one reference, we must make
3149 3142   * a new hdr for the buffer.
3150 3143   */
3151 3144  void
3152 3145  arc_release(arc_buf_t *buf, void *tag)
3153 3146  {
3154 3147          arc_buf_hdr_t *hdr;
3155 3148          kmutex_t *hash_lock = NULL;
3156 3149          l2arc_buf_hdr_t *l2hdr;
3157 3150          uint64_t buf_size;
3158 3151  
3159 3152          /*
3160 3153           * It would be nice to assert that if it's DMU metadata (level >
3161 3154           * 0 || it's the dnode file), then it must be syncing context.
3162 3155           * But we don't know that information at this level.
3163 3156           */
3164 3157  
3165 3158          mutex_enter(&buf->b_evict_lock);
3166 3159          hdr = buf->b_hdr;
3167 3160  
3168 3161          /* this buffer is not on any list */
3169 3162          ASSERT(refcount_count(&hdr->b_refcnt) > 0);
3170 3163  
3171 3164          if (hdr->b_state == arc_anon) {
3172 3165                  /* this buffer is already released */
3173 3166                  ASSERT(buf->b_efunc == NULL);
3174 3167          } else {
3175 3168                  hash_lock = HDR_LOCK(hdr);
3176 3169                  mutex_enter(hash_lock);
3177 3170                  hdr = buf->b_hdr;
3178 3171                  ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3179 3172          }
3180 3173  
3181 3174          l2hdr = hdr->b_l2hdr;
3182 3175          if (l2hdr) {
3183 3176                  mutex_enter(&l2arc_buflist_mtx);
3184 3177                  hdr->b_l2hdr = NULL;
3185 3178          }
3186 3179          buf_size = hdr->b_size;
3187 3180  
3188 3181          /*
3189 3182           * Do we have more than one buf?
3190 3183           */
3191 3184          if (hdr->b_datacnt > 1) {
3192 3185                  arc_buf_hdr_t *nhdr;
3193 3186                  arc_buf_t **bufp;
3194 3187                  uint64_t blksz = hdr->b_size;
3195 3188                  uint64_t spa = hdr->b_spa;
3196 3189                  arc_buf_contents_t type = hdr->b_type;
3197 3190                  uint32_t flags = hdr->b_flags;
3198 3191  
3199 3192                  ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
3200 3193                  /*
3201 3194                   * Pull the data off of this hdr and attach it to
3202 3195                   * a new anonymous hdr.
3203 3196                   */
3204 3197                  (void) remove_reference(hdr, hash_lock, tag);
3205 3198                  bufp = &hdr->b_buf;
3206 3199                  while (*bufp != buf)
3207 3200                          bufp = &(*bufp)->b_next;
3208 3201                  *bufp = buf->b_next;
3209 3202                  buf->b_next = NULL;
3210 3203  
3211 3204                  ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
3212 3205                  atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
3213 3206                  if (refcount_is_zero(&hdr->b_refcnt)) {
3214 3207                          uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
3215 3208                          ASSERT3U(*size, >=, hdr->b_size);
3216 3209                          atomic_add_64(size, -hdr->b_size);
3217 3210                  }
3218 3211  
3219 3212                  /*
3220 3213                   * We're releasing a duplicate user data buffer, update
3221 3214                   * our statistics accordingly.
3222 3215                   */
3223 3216                  if (hdr->b_type == ARC_BUFC_DATA) {
3224 3217                          ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
3225 3218                          ARCSTAT_INCR(arcstat_duplicate_buffers_size,
3226 3219                              -hdr->b_size);
3227 3220                  }
3228 3221                  hdr->b_datacnt -= 1;
3229 3222                  arc_cksum_verify(buf);
3230 3223                  arc_buf_unwatch(buf);
3231 3224  
3232 3225                  mutex_exit(hash_lock);
3233 3226  
3234 3227                  nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
3235 3228                  nhdr->b_size = blksz;
3236 3229                  nhdr->b_spa = spa;
3237 3230                  nhdr->b_type = type;
3238 3231                  nhdr->b_buf = buf;
3239 3232                  nhdr->b_state = arc_anon;
3240 3233                  nhdr->b_arc_access = 0;
3241 3234                  nhdr->b_flags = flags & ARC_L2_WRITING;
3242 3235                  nhdr->b_l2hdr = NULL;
3243 3236                  nhdr->b_datacnt = 1;
3244 3237                  nhdr->b_freeze_cksum = NULL;
3245 3238                  (void) refcount_add(&nhdr->b_refcnt, tag);
3246 3239                  buf->b_hdr = nhdr;
3247 3240                  mutex_exit(&buf->b_evict_lock);
3248 3241                  atomic_add_64(&arc_anon->arcs_size, blksz);
3249 3242          } else {
3250 3243                  mutex_exit(&buf->b_evict_lock);
3251 3244                  ASSERT(refcount_count(&hdr->b_refcnt) == 1);
3252 3245                  ASSERT(!list_link_active(&hdr->b_arc_node));
3253 3246                  ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3254 3247                  if (hdr->b_state != arc_anon)
3255 3248                          arc_change_state(arc_anon, hdr, hash_lock);
3256 3249                  hdr->b_arc_access = 0;
3257 3250                  if (hash_lock)
3258 3251                          mutex_exit(hash_lock);
3259 3252  
3260 3253                  buf_discard_identity(hdr);
3261 3254                  arc_buf_thaw(buf);
3262 3255          }
3263 3256          buf->b_efunc = NULL;
3264 3257          buf->b_private = NULL;
3265 3258  
3266 3259          if (l2hdr) {
3267 3260                  list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
3268 3261                  kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
3269 3262                  ARCSTAT_INCR(arcstat_l2_size, -buf_size);
3270 3263                  mutex_exit(&l2arc_buflist_mtx);
3271 3264          }
3272 3265  }
3273 3266  
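The unlink loop that both arc_buf_evict() and arc_release() use on the hdr->b_buf chain (bufp = &hdr->b_buf; walk; *bufp = buf->b_next;) is the classic pointer-to-pointer splice: by advancing the address of the next field rather than the node pointer, the list head needs no special case. A tiny standalone illustration:

    /* The pointer-to-pointer unlink idiom used on hdr->b_buf chains. */
    #include <stdio.h>

    struct node {
        int         val;
        struct node *next;
    };

    /* Remove target from the list headed at *headp; no head special case. */
    static void
    unlink_node(struct node **headp, struct node *target)
    {
        struct node **npp = headp;

        while (*npp != target)
            npp = &(*npp)->next;
        *npp = target->next;
        target->next = NULL;
    }

    int
    main(void)
    {
        struct node c = { 3, NULL };
        struct node b = { 2, &c };
        struct node a = { 1, &b };
        struct node *head = &a, *n;

        unlink_node(&head, &b);     /* works the same for &a (the head) */
        for (n = head; n != NULL; n = n->next)
            printf("%d ", n->val);
        printf("\n");
        return (0);
    }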
3274 3267  int
3275 3268  arc_released(arc_buf_t *buf)
3276 3269  {
3277 3270          int released;
3278 3271  
3279 3272          mutex_enter(&buf->b_evict_lock);
3280 3273          released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
3281 3274          mutex_exit(&buf->b_evict_lock);
3282 3275          return (released);
3283 3276  }
3284 3277  
3285 3278  int
3286 3279  arc_has_callback(arc_buf_t *buf)
3287 3280  {
3288 3281          int callback;
3289 3282  
3290 3283          mutex_enter(&buf->b_evict_lock);
3291 3284          callback = (buf->b_efunc != NULL);
3292 3285          mutex_exit(&buf->b_evict_lock);
3293 3286          return (callback);
3294 3287  }
3295 3288  
3296 3289  #ifdef ZFS_DEBUG
3297 3290  int
3298 3291  arc_referenced(arc_buf_t *buf)
3299 3292  {
3300 3293          int referenced;
3301 3294  
3302 3295          mutex_enter(&buf->b_evict_lock);
3303 3296          referenced = (refcount_count(&buf->b_hdr->b_refcnt));
3304 3297          mutex_exit(&buf->b_evict_lock);
3305 3298          return (referenced);
3306 3299  }
3307 3300  #endif
3308 3301  
3309 3302  static void
3310 3303  arc_write_ready(zio_t *zio)
3311 3304  {
3312 3305          arc_write_callback_t *callback = zio->io_private;
3313 3306          arc_buf_t *buf = callback->awcb_buf;
3314 3307          arc_buf_hdr_t *hdr = buf->b_hdr;
3315 3308  
3316 3309          ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
3317 3310          callback->awcb_ready(zio, buf, callback->awcb_private);
3318 3311  
3319 3312          /*
3320 3313           * If the IO is already in progress, then this is a re-write
3321 3314           * attempt, so we need to thaw and re-compute the cksum.
3322 3315           * It is the responsibility of the callback to handle the
3323 3316           * accounting for any re-write attempt.
3324 3317           */
3325 3318          if (HDR_IO_IN_PROGRESS(hdr)) {
3326 3319                  mutex_enter(&hdr->b_freeze_lock);
3327 3320                  if (hdr->b_freeze_cksum != NULL) {
3328 3321                          kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
3329 3322                          hdr->b_freeze_cksum = NULL;
3330 3323                  }
3331 3324                  mutex_exit(&hdr->b_freeze_lock);
3332 3325          }
3333 3326          arc_cksum_compute(buf, B_FALSE);
3334 3327          hdr->b_flags |= ARC_IO_IN_PROGRESS;
3335 3328  }
3336 3329  
3337 3330  static void
3338 3331  arc_write_done(zio_t *zio)
3339 3332  {
3340 3333          arc_write_callback_t *callback = zio->io_private;
3341 3334          arc_buf_t *buf = callback->awcb_buf;
3342 3335          arc_buf_hdr_t *hdr = buf->b_hdr;
3343 3336  
3344 3337          ASSERT(hdr->b_acb == NULL);
3345 3338  
3346 3339          if (zio->io_error == 0) {
3347 3340                  hdr->b_dva = *BP_IDENTITY(zio->io_bp);
3348 3341                  hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
3349 3342                  hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
3350 3343          } else {
3351 3344                  ASSERT(BUF_EMPTY(hdr));
3352 3345          }
3353 3346  
3354 3347          /*
3355 3348           * If the block to be written was all-zero, we may have
3356 3349           * compressed it away.  In this case no write was performed
3357 3350           * so there will be no dva/birth/checksum.  The buffer must
3358 3351           * therefore remain anonymous (and uncached).
3359 3352           */
3360 3353          if (!BUF_EMPTY(hdr)) {
3361 3354                  arc_buf_hdr_t *exists;
3362 3355                  kmutex_t *hash_lock;
3363 3356  
3364 3357                  ASSERT(zio->io_error == 0);
3365 3358  
3366 3359                  arc_cksum_verify(buf);
3367 3360  
3368 3361                  exists = buf_hash_insert(hdr, &hash_lock);
3369 3362                  if (exists) {
3370 3363                          /*
3371 3364                           * This can only happen if we overwrite for
3372 3365                           * sync-to-convergence, because we remove
3373 3366                           * buffers from the hash table when we arc_free().
3374 3367                           */
3375 3368                          if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
3376 3369                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
3377 3370                                          panic("bad overwrite, hdr=%p exists=%p",
3378 3371                                              (void *)hdr, (void *)exists);
3379 3372                                  ASSERT(refcount_is_zero(&exists->b_refcnt));
3380 3373                                  arc_change_state(arc_anon, exists, hash_lock);
3381 3374                                  mutex_exit(hash_lock);
3382 3375                                  arc_hdr_destroy(exists);
3383 3376                                  exists = buf_hash_insert(hdr, &hash_lock);
3384 3377                                  ASSERT3P(exists, ==, NULL);
3385 3378                          } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
3386 3379                                  /* nopwrite */
3387 3380                                  ASSERT(zio->io_prop.zp_nopwrite);
3388 3381                                  if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
3389 3382                                          panic("bad nopwrite, hdr=%p exists=%p",
3390 3383                                              (void *)hdr, (void *)exists);
3391 3384                          } else {
3392 3385                                  /* Dedup */
3393 3386                                  ASSERT(hdr->b_datacnt == 1);
3394 3387                                  ASSERT(hdr->b_state == arc_anon);
3395 3388                                  ASSERT(BP_GET_DEDUP(zio->io_bp));
3396 3389                                  ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
3397 3390                          }
3398 3391                  }
3399 3392                  hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
3400 3393                  /* if it's not anon, we are doing a scrub */
3401 3394                  if (!exists && hdr->b_state == arc_anon)
3402 3395                          arc_access(hdr, hash_lock);
3403 3396                  mutex_exit(hash_lock);
3404 3397          } else {
3405 3398                  hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
3406 3399          }
3407 3400  
3408 3401          ASSERT(!refcount_is_zero(&hdr->b_refcnt));
3409 3402          callback->awcb_done(zio, buf, callback->awcb_private);
3410 3403  
3411 3404          kmem_free(callback, sizeof (arc_write_callback_t));
3412 3405  }
3413 3406  
3414 3407  zio_t *
3415 3408  arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
3416 3409      blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp,
3417 3410      arc_done_func_t *ready, arc_done_func_t *done, void *private,
3418 3411      int priority, int zio_flags, const zbookmark_t *zb)
3419 3412  {
3420 3413          arc_buf_hdr_t *hdr = buf->b_hdr;
3421 3414          arc_write_callback_t *callback;
3422 3415          zio_t *zio;
3423 3416  
3424 3417          ASSERT(ready != NULL);
3425 3418          ASSERT(done != NULL);
3426 3419          ASSERT(!HDR_IO_ERROR(hdr));
3427 3420          ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
3428 3421          ASSERT(hdr->b_acb == NULL);
3429 3422          if (l2arc)
3430 3423                  hdr->b_flags |= ARC_L2CACHE;
3431 3424          callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
3432 3425          callback->awcb_ready = ready;
3433 3426          callback->awcb_done = done;
3434 3427          callback->awcb_private = private;
3435 3428          callback->awcb_buf = buf;
3436 3429  
3437 3430          zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, zp,
3438 3431              arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);
3439 3432  
3440 3433          return (zio);
3441 3434  }
3442 3435  
3443 3436  static int
3444 3437  arc_memory_throttle(uint64_t reserve, uint64_t inflight_data, uint64_t txg)
3445 3438  {
3446 3439  #ifdef _KERNEL
3447 3440          uint64_t available_memory = ptob(freemem);
3448 3441          static uint64_t page_load = 0;
3449 3442          static uint64_t last_txg = 0;
3450 3443  
3451 3444  #if defined(__i386)
3452 3445          available_memory =
3453 3446              MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
3454 3447  #endif
3455 3448          if (available_memory >= zfs_write_limit_max)
3456 3449                  return (0);
3457 3450  
3458 3451          if (txg > last_txg) {
3459 3452                  last_txg = txg;
3460 3453                  page_load = 0;
3461 3454          }
3462 3455          /*
3463 3456           * If we are in pageout, we know that memory is already tight,
3464 3457           * the arc is already going to be evicting, so we just want to
3465 3458           * continue to let page writes occur as quickly as possible.
3466 3459           */
3467 3460          if (curproc == proc_pageout) {
3468 3461                  if (page_load > MAX(ptob(minfree), available_memory) / 4)
3469 3462                          return (SET_ERROR(ERESTART));
3470 3463                  /* Note: reserve is inflated, so we deflate */
3471 3464                  page_load += reserve / 8;
3472 3465                  return (0);
3473 3466          } else if (page_load > 0 && arc_reclaim_needed()) {
3474 3467                  /* memory is low, delay before restarting */
3475 3468                  ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
3476 3469                  return (SET_ERROR(EAGAIN));
3477 3470          }
3478 3471          page_load = 0;
3479 3472  
3480 3473          if (arc_size > arc_c_min) {
3481 3474                  uint64_t evictable_memory =
3482 3475                      arc_mru->arcs_lsize[ARC_BUFC_DATA] +
3483 3476                      arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
3484 3477                      arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
3485 3478                      arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
3486 3479                  available_memory += MIN(evictable_memory, arc_size - arc_c_min);
3487 3480          }
3488 3481  
3489 3482          if (inflight_data > available_memory / 4) {
3490 3483                  ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
3491 3484                  return (SET_ERROR(ERESTART));
3492 3485          }
3493 3486  #endif
3494 3487          return (0);
3495 3488  }
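The last test above is the actual throttle: in-flight dirty data may not exceed a quarter of reclaimable memory, where "reclaimable" means free pages plus whatever the ARC could release down to arc_c_min. A minimal standalone sketch of that arithmetic (not part of arc.c; all sizes are byte counts, the function name is made up, and the zfs_write_limit_max early-out and pageout special case are omitted):

    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /*
     * Sketch of the final check in arc_memory_throttle(): returns nonzero
     * when the caller should back off (ERESTART in the kernel).
     */
    static int
    write_would_throttle(uint64_t freemem_bytes, uint64_t evictable_bytes,
        uint64_t arc_size, uint64_t arc_c_min, uint64_t inflight_data)
    {
            uint64_t available = freemem_bytes;

            if (arc_size > arc_c_min)
                    available += MIN(evictable_bytes, arc_size - arc_c_min);

            return (inflight_data > available / 4);
    }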
3496 3489  
3497 3490  void
3498 3491  arc_tempreserve_clear(uint64_t reserve)
3499 3492  {
3500 3493          atomic_add_64(&arc_tempreserve, -reserve);
3501 3494          ASSERT((int64_t)arc_tempreserve >= 0);
3502 3495  }
3503 3496  
3504 3497  int
3505 3498  arc_tempreserve_space(uint64_t reserve, uint64_t txg)
3506 3499  {
3507 3500          int error;
3508 3501          uint64_t anon_size;
3509 3502  
3510 3503  #ifdef ZFS_DEBUG
3511 3504          /*
3512 3505           * Once in a while, fail for no reason.  Everything should cope.
3513 3506           */
3514 3507          if (spa_get_random(10000) == 0) {
3515 3508                  dprintf("forcing random failure\n");
3516 3509                  return (SET_ERROR(ERESTART));
3517 3510          }
3518 3511  #endif
3519 3512          if (reserve > arc_c/4 && !arc_no_grow)
3520 3513                  arc_c = MIN(arc_c_max, reserve * 4);
3521 3514          if (reserve > arc_c)
3522 3515                  return (SET_ERROR(ENOMEM));
3523 3516  
3524 3517          /*
3525 3518           * Don't count loaned bufs as in flight dirty data to prevent long
3526 3519           * network delays from blocking transactions that are ready to be
3527 3520           * assigned to a txg.
3528 3521           */
3529 3522          anon_size = MAX((int64_t)(arc_anon->arcs_size - arc_loaned_bytes), 0);
3530 3523  
3531 3524          /*
3532 3525           * Writes will, almost always, require additional memory allocations
3533 3526           * in order to compress/encrypt/etc the data.  We therefore need to
3534 3527           * make sure that there is sufficient available memory for this.
3535 3528           */
3536 3529          if (error = arc_memory_throttle(reserve, anon_size, txg))
3537 3530                  return (error);
3538 3531  
3539 3532          /*
3540 3533           * Throttle writes when the amount of dirty data in the cache
3541 3534           * gets too large.  We try to keep the cache less than half full
3542 3535           * of dirty blocks so that our sync times don't grow too large.
3543 3536           * Note: if two requests come in concurrently, we might let them
3544 3537           * both succeed, when one of them should fail.  Not a huge deal.
3545 3538           */
3546 3539  
3547 3540          if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
3548 3541              anon_size > arc_c / 4) {
3549 3542                  dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
3550 3543                      "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
3551 3544                      arc_tempreserve>>10,
3552 3545                      arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
3553 3546                      arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
3554 3547                      reserve>>10, arc_c>>10);
3555 3548                  return (SET_ERROR(ERESTART));
3556 3549          }
3557 3550          atomic_add_64(&arc_tempreserve, reserve);
3558 3551          return (0);
3559 3552  }
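The final condition above is the dirty-data throttle itself: a reservation fails once reserve + arc_tempreserve + anon dirty data would push past half of arc_c while anon data alone is already over a quarter of it. A self-contained sketch with example numbers (the 16 GB arc_c is a made-up figure for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Mirrors the last check in arc_tempreserve_space(); nonzero means the
     * reservation would fail with ERESTART.
     */
    static int
    dirty_throttle(uint64_t reserve, uint64_t tempreserve,
        uint64_t anon_size, uint64_t arc_c)
    {
            return (reserve + tempreserve + anon_size > arc_c / 2 &&
                anon_size > arc_c / 4);
    }

    int
    main(void)
    {
            uint64_t arc_c = 16ULL << 30;   /* assume a 16 GB ARC target */

            /* 5 GB of anon dirty data plus a 4 GB reserve trips the throttle. */
            printf("%d\n", dirty_throttle(4ULL << 30, 0, 5ULL << 30, arc_c));
            return (0);
    }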
3560 3553  
3561 3554  void
3562 3555  arc_init(void)
3563 3556  {
3564 3557          mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
3565 3558          cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
3566 3559  
3567 3560          /* Convert seconds to clock ticks */
3568 3561          arc_min_prefetch_lifespan = 1 * hz;
3569 3562  
3570 3563          /* Start out with 1/8 of all memory */
3571 3564          arc_c = physmem * PAGESIZE / 8;
3572 3565  
3573 3566  #ifdef _KERNEL
3574 3567          /*
3575 3568           * On architectures where the physical memory can be larger
3576 3569           * than the addressable space (intel in 32-bit mode), we may
3577 3570           * need to limit the cache to 1/8 of VM size.
3578 3571           */
3579 3572          arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
3580 3573  #endif
3581 3574  
3582 3575          /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
3583 3576          arc_c_min = MAX(arc_c / 4, 64<<20);
3584 3577          /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
3585 3578          if (arc_c * 8 >= 1<<30)
3586 3579                  arc_c_max = (arc_c * 8) - (1<<30);
3587 3580          else
3588 3581                  arc_c_max = arc_c_min;
3589 3582          arc_c_max = MAX(arc_c * 6, arc_c_max);
3590 3583  
3591 3584          /*
3592 3585           * Allow the tunables to override our calculations if they are
3593 3586           * reasonable (ie. over 64MB)
3594 3587           */
3595 3588          if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
3596 3589                  arc_c_max = zfs_arc_max;
3597 3590          if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max)
3598 3591                  arc_c_min = zfs_arc_min;
3599 3592  
3600 3593          arc_c = arc_c_max;
3601 3594          arc_p = (arc_c >> 1);
3602 3595  
3603 3596          /* limit meta-data to 1/4 of the arc capacity */
3604 3597          arc_meta_limit = arc_c_max / 4;
3605 3598  
3606 3599          /* Allow the tunable to override if it is reasonable */
3607 3600          if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
3608 3601                  arc_meta_limit = zfs_arc_meta_limit;
3609 3602  
3610 3603          if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
3611 3604                  arc_c_min = arc_meta_limit / 2;
3612 3605  
3613 3606          if (zfs_arc_grow_retry > 0)
3614 3607                  arc_grow_retry = zfs_arc_grow_retry;
3615 3608  
3616 3609          if (zfs_arc_shrink_shift > 0)
3617 3610                  arc_shrink_shift = zfs_arc_shrink_shift;
3618 3611  
3619 3612          if (zfs_arc_p_min_shift > 0)
3620 3613                  arc_p_min_shift = zfs_arc_p_min_shift;
3621 3614  
3622 3615          /* if kmem_flags are set, lets try to use less memory */
3623 3616          if (kmem_debugging())
3624 3617                  arc_c = arc_c / 2;
3625 3618          if (arc_c < arc_c_min)
3626 3619                  arc_c = arc_c_min;
3627 3620  
3628 3621          arc_anon = &ARC_anon;
3629 3622          arc_mru = &ARC_mru;
3630 3623          arc_mru_ghost = &ARC_mru_ghost;
3631 3624          arc_mfu = &ARC_mfu;
3632 3625          arc_mfu_ghost = &ARC_mfu_ghost;
3633 3626          arc_l2c_only = &ARC_l2c_only;
3634 3627          arc_size = 0;
3635 3628  
3636 3629          mutex_init(&arc_anon->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3637 3630          mutex_init(&arc_mru->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3638 3631          mutex_init(&arc_mru_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3639 3632          mutex_init(&arc_mfu->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3640 3633          mutex_init(&arc_mfu_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3641 3634          mutex_init(&arc_l2c_only->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
3642 3635  
3643 3636          list_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
3644 3637              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3645 3638          list_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
3646 3639              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3647 3640          list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
3648 3641              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3649 3642          list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
3650 3643              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3651 3644          list_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
3652 3645              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3653 3646          list_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
3654 3647              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3655 3648          list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
3656 3649              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3657 3650          list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
3658 3651              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3659 3652          list_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
3660 3653              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3661 3654          list_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
3662 3655              sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
3663 3656  
3664 3657          buf_init();
3665 3658  
3666 3659          arc_thread_exit = 0;
3667 3660          arc_eviction_list = NULL;
3668 3661          mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
3669 3662          bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
3670 3663  
3671 3664          arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
3672 3665              sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
3673 3666  
3674 3667          if (arc_ksp != NULL) {
3675 3668                  arc_ksp->ks_data = &arc_stats;
3676 3669                  kstat_install(arc_ksp);
3677 3670          }
3678 3671  
3679 3672          (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
3680 3673              TS_RUN, minclsyspri);
3681 3674  
3682 3675          arc_dead = FALSE;
3683 3676          arc_warm = B_FALSE;
3684 3677  
3685 3678          if (zfs_write_limit_max == 0)
3686 3679                  zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
3687 3680          else
3688 3681                  zfs_write_limit_shift = 0;
3689 3682          mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);
3690 3683  }
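For a concrete sense of what the sizing code above produces, here is a standalone sketch (not part of arc.c) that reproduces the arc_c / arc_c_min / arc_c_max calculation for a 128 GB machine, ignoring the 32-bit heap clamp, the kmem_debugging() halving and the zfs_arc_* tunable overrides:

    #include <stdint.h>
    #include <stdio.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
            uint64_t physmem_bytes = 128ULL << 30;  /* example: 128 GB */
            uint64_t arc_c, arc_c_min, arc_c_max;

            /* Start out with 1/8 of all memory. */
            arc_c = physmem_bytes / 8;

            /* Min: 1/32 of all memory or 64 MB, whichever is more. */
            arc_c_min = MAX(arc_c / 4, 64ULL << 20);

            /* Max: all but 1 GB, but at least 3/4 of all memory. */
            if (arc_c * 8 >= 1ULL << 30)
                    arc_c_max = (arc_c * 8) - (1ULL << 30);
            else
                    arc_c_max = arc_c_min;
            arc_c_max = MAX(arc_c * 6, arc_c_max);

            /* Prints arc_c_min=4096 MB arc_c_max=130048 MB (4 GB / 127 GB). */
            printf("arc_c_min=%llu MB arc_c_max=%llu MB\n",
                (unsigned long long)(arc_c_min >> 20),
                (unsigned long long)(arc_c_max >> 20));
            return (0);
    }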
3691 3684  
3692 3685  void
3693 3686  arc_fini(void)
3694 3687  {
3695 3688          mutex_enter(&arc_reclaim_thr_lock);
3696 3689          arc_thread_exit = 1;
3697 3690          while (arc_thread_exit != 0)
3698 3691                  cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
3699 3692          mutex_exit(&arc_reclaim_thr_lock);
3700 3693  
3701 3694          arc_flush(NULL);
3702 3695  
3703 3696          arc_dead = TRUE;
3704 3697  
3705 3698          if (arc_ksp != NULL) {
3706 3699                  kstat_delete(arc_ksp);
3707 3700                  arc_ksp = NULL;
3708 3701          }
3709 3702  
3710 3703          mutex_destroy(&arc_eviction_mtx);
3711 3704          mutex_destroy(&arc_reclaim_thr_lock);
3712 3705          cv_destroy(&arc_reclaim_thr_cv);
3713 3706  
3714 3707          list_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
3715 3708          list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
3716 3709          list_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
3717 3710          list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
3718 3711          list_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
3719 3712          list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
3720 3713          list_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
3721 3714          list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
3722 3715  
3723 3716          mutex_destroy(&arc_anon->arcs_mtx);
3724 3717          mutex_destroy(&arc_mru->arcs_mtx);
3725 3718          mutex_destroy(&arc_mru_ghost->arcs_mtx);
3726 3719          mutex_destroy(&arc_mfu->arcs_mtx);
3727 3720          mutex_destroy(&arc_mfu_ghost->arcs_mtx);
3728 3721          mutex_destroy(&arc_l2c_only->arcs_mtx);
3729 3722  
3730 3723          mutex_destroy(&zfs_write_limit_lock);
3731 3724  
3732 3725          buf_fini();
3733 3726  
3734 3727          ASSERT(arc_loaned_bytes == 0);
3735 3728  }
3736 3729  
3737 3730  /*
3738 3731   * Level 2 ARC
3739 3732   *
3740 3733   * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
3741 3734   * It uses dedicated storage devices to hold cached data, which are populated
3742 3735   * using large infrequent writes.  The main role of this cache is to boost
3743 3736   * the performance of random read workloads.  The intended L2ARC devices
3744 3737   * include short-stroked disks, solid state disks, and other media with
3745 3738   * substantially faster read latency than disk.
3746 3739   *
3747 3740   *                 +-----------------------+
3748 3741   *                 |         ARC           |
3749 3742   *                 +-----------------------+
3750 3743   *                    |         ^     ^
3751 3744   *                    |         |     |
3752 3745   *      l2arc_feed_thread()    arc_read()
3753 3746   *                    |         |     |
3754 3747   *                    |  l2arc read   |
3755 3748   *                    V         |     |
3756 3749   *               +---------------+    |
3757 3750   *               |     L2ARC     |    |
3758 3751   *               +---------------+    |
3759 3752   *                   |    ^           |
3760 3753   *          l2arc_write() |           |
3761 3754   *                   |    |           |
3762 3755   *                   V    |           |
3763 3756   *                 +-------+      +-------+
3764 3757   *                 | vdev  |      | vdev  |
3765 3758   *                 | cache |      | cache |
3766 3759   *                 +-------+      +-------+
3767 3760   *                 +=========+     .-----.
3768 3761   *                 :  L2ARC  :    |-_____-|
3769 3762   *                 : devices :    | Disks |
3770 3763   *                 +=========+    `-_____-'
3771 3764   *
3772 3765   * Read requests are satisfied from the following sources, in order:
3773 3766   *
3774 3767   *      1) ARC
3775 3768   *      2) vdev cache of L2ARC devices
3776 3769   *      3) L2ARC devices
3777 3770   *      4) vdev cache of disks
3778 3771   *      5) disks
3779 3772   *
3780 3773   * Some L2ARC device types exhibit extremely slow write performance.
3781 3774   * To accommodate this, there are some significant differences between
3782 3775   * the L2ARC and traditional cache design:
3783 3776   *
3784 3777   * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
3785 3778   * the ARC behave as usual, freeing buffers and placing headers on ghost
3786 3779   * lists.  The ARC does not send buffers to the L2ARC during eviction as
3787 3780   * this would add inflated write latencies for all ARC memory pressure.
3788 3781   *
3789 3782   * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
3790 3783   * It does this by periodically scanning buffers from the eviction-end of
3791 3784   * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
3792 3785   * not already there.  It scans until a headroom of buffers is satisfied,
3793 3786   * which itself is a buffer for ARC eviction.  The thread that does this is
3794 3787   * l2arc_feed_thread(), illustrated below; example sizes are included to
3795 3788   * provide a better sense of ratio than this diagram:
3796 3789   *
3797 3790   *             head -->                        tail
3798 3791   *              +---------------------+----------+
3799 3792   *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
3800 3793   *              +---------------------+----------+   |   o L2ARC eligible
3801 3794   *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
3802 3795   *              +---------------------+----------+   |
3803 3796   *                   15.9 Gbytes      ^ 32 Mbytes    |
3804 3797   *                                 headroom          |
3805 3798   *                                            l2arc_feed_thread()
3806 3799   *                                                   |
3807 3800   *                       l2arc write hand <--[oooo]--'
3808 3801   *                               |           8 Mbyte
3809 3802   *                               |          write max
3810 3803   *                               V
3811 3804   *                +==============================+
3812 3805   *      L2ARC dev |####|#|###|###|    |####| ... |
3813 3806   *                +==============================+
3814 3807   *                           32 Gbytes
3815 3808   *
3816 3809   * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
3817 3810   * evicted, then the L2ARC has cached a buffer much sooner than it probably
3818 3811   * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
3819 3812   * safe to say that this is an uncommon case, since buffers at the end of
3820 3813   * the ARC lists have moved there due to inactivity.
3821 3814   *
3822 3815   * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
3823 3816   * then the L2ARC simply misses copying some buffers.  This serves as a
3824 3817   * pressure valve to prevent heavy read workloads from both stalling the ARC
3825 3818   * with waits and clogging the L2ARC with writes.  This also helps prevent
3826 3819   * the potential for the L2ARC to churn if it attempts to cache content too
3827 3820   * quickly, such as during backups of the entire pool.
3828 3821   *
3829 3822   * 5. After system boot and before the ARC has filled main memory, there are
3830 3823   * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
3831 3824   * lists can remain mostly static.  Instead of searching from the tail of these
3832 3825   * lists as pictured, the l2arc_feed_thread() will search from the list heads
3833 3826   * for eligible buffers, greatly increasing its chance of finding them.
3834 3827   *
3835 3828   * The L2ARC device write speed is also boosted during this time so that
3836 3829   * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
3837 3830   * there are no L2ARC reads, and no fear of degrading read performance
3838 3831   * through increased writes.
3839 3832   *
3840 3833   * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
3841 3834   * the vdev queue can aggregate them into larger and fewer writes.  Each
3842 3835   * device is written to in a rotor fashion, sweeping writes through
3843 3836   * available space then repeating.
3844 3837   *
3845 3838   * 7. The L2ARC does not store dirty content.  It never needs to flush
3846 3839   * write buffers back to disk based storage.
3847 3840   *
3848 3841   * 8. If an ARC buffer is written (and dirtied) which also exists in the
3849 3842   * L2ARC, the now stale L2ARC buffer is immediately dropped.
3850 3843   *
3851 3844   * The performance of the L2ARC can be tweaked by a number of tunables, which
3852 3845   * may be necessary for different workloads:
3853 3846   *
3854 3847   *      l2arc_write_max         max write bytes per interval
3855 3848   *      l2arc_write_boost       extra write bytes during device warmup
3856 3849   *      l2arc_noprefetch        skip caching prefetched buffers
3857 3850   *      l2arc_headroom          number of max device writes to precache
3858 3851   *      l2arc_feed_secs         seconds between L2ARC writing
3859 3852   *
3860 3853   * Tunables may be removed or added as future performance improvements are
3861 3854   * integrated, and also may become zpool properties.
3862 3855   *
3863 3856   * There are three key functions that control how the L2ARC warms up:
3864 3857   *
3865 3858   *      l2arc_write_eligible()  check if a buffer is eligible to cache
3866 3859   *      l2arc_write_size()      calculate how much to write
3867 3860   *      l2arc_write_interval()  calculate sleep delay between writes
3868 3861   *
3869 3862   * These three functions determine what to write, how much, and how quickly
3870 3863   * to send writes.
3871 3864   */
3872 3865  
3873 3866  static boolean_t
3874 3867  l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab)
3875 3868  {
3876 3869          /*
3877 3870           * A buffer is *not* eligible for the L2ARC if it:
3878 3871           * 1. belongs to a different spa.
3879 3872           * 2. is already cached on the L2ARC.
3880 3873           * 3. has an I/O in progress (it may be an incomplete read).
3881 3874           * 4. is flagged not eligible (zfs property).
3882 3875           */
3883 3876          if (ab->b_spa != spa_guid || ab->b_l2hdr != NULL ||
3884 3877              HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab))
3885 3878                  return (B_FALSE);
3886 3879  
3887 3880          return (B_TRUE);
3888 3881  }
3889 3882  
3890 3883  static uint64_t
3891 3884  l2arc_write_size(l2arc_dev_t *dev)
3892 3885  {
3893 3886          uint64_t size;
3894 3887  
3895 3888          size = dev->l2ad_write;
3896 3889  
3897 3890          if (arc_warm == B_FALSE)
3898 3891                  size += dev->l2ad_boost;
3899 3892  
3900 3893          return (size);
3901 3894  
3902 3895  }
3903 3896  
3904 3897  static clock_t
3905 3898  l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
3906 3899  {
3907 3900          clock_t interval, next, now;
3908 3901  
3909 3902          /*
3910 3903           * If the ARC lists are busy, increase our write rate; if the
3911 3904           * lists are stale, idle back.  This is achieved by checking
3912 3905           * how much we previously wrote - if it was more than half of
3913 3906           * what we wanted, schedule the next write much sooner.
3914 3907           */
3915 3908          if (l2arc_feed_again && wrote > (wanted / 2))
3916 3909                  interval = (hz * l2arc_feed_min_ms) / 1000;
3917 3910          else
3918 3911                  interval = hz * l2arc_feed_secs;
3919 3912  
3920 3913          now = ddi_get_lbolt();
3921 3914          next = MAX(now, MIN(now + interval, began + interval));
3922 3915  
3923 3916          return (next);
3924 3917  }
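In other words: when the previous pass managed to write more than half of what it wanted, the lists are considered busy and the next feed is scheduled after the short l2arc_feed_min_ms interval instead of the full l2arc_feed_secs. A standalone sketch of that calculation, with hz and the two tunables passed in as plain parameters (their in-kernel defaults live elsewhere in this file):

    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    /*
     * Sketch of l2arc_write_interval(): busy lists shorten the sleep to
     * feed_min_ms, idle lists sleep feed_secs.  The result is clamped so
     * the next feed is never in the past and never more than one interval
     * after the previous pass began.
     */
    static long
    next_feed_tick(long now, long began, uint64_t wanted, uint64_t wrote,
        int hz, int feed_secs, int feed_min_ms, int feed_again)
    {
            long interval;

            if (feed_again && wrote > (wanted / 2))
                    interval = ((long)hz * feed_min_ms) / 1000;
            else
                    interval = (long)hz * feed_secs;

            return (MAX(now, MIN(now + interval, began + interval)));
    }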
3925 3918  
3926 3919  static void
3927 3920  l2arc_hdr_stat_add(void)
3928 3921  {
3929 3922          ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
3930 3923          ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
3931 3924  }
3932 3925  
3933 3926  static void
3934 3927  l2arc_hdr_stat_remove(void)
3935 3928  {
3936 3929          ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
3937 3930          ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
3938 3931  }
3939 3932  
3940 3933  /*
3941 3934   * Cycle through L2ARC devices.  This is how L2ARC load balances.
3942 3935   * If a device is returned, this also returns holding the spa config lock.
3943 3936   */
3944 3937  static l2arc_dev_t *
3945 3938  l2arc_dev_get_next(void)
3946 3939  {
3947 3940          l2arc_dev_t *first, *next = NULL;
3948 3941  
3949 3942          /*
3950 3943           * Lock out the removal of spas (spa_namespace_lock), then removal
3951 3944           * of cache devices (l2arc_dev_mtx).  Once a device has been selected,
3952 3945           * both locks will be dropped and a spa config lock held instead.
3953 3946           */
3954 3947          mutex_enter(&spa_namespace_lock);
3955 3948          mutex_enter(&l2arc_dev_mtx);
3956 3949  
3957 3950          /* if there are no vdevs, there is nothing to do */
3958 3951          if (l2arc_ndev == 0)
3959 3952                  goto out;
3960 3953  
3961 3954          first = NULL;
3962 3955          next = l2arc_dev_last;
3963 3956          do {
3964 3957                  /* loop around the list looking for a non-faulted vdev */
3965 3958                  if (next == NULL) {
3966 3959                          next = list_head(l2arc_dev_list);
3967 3960                  } else {
3968 3961                          next = list_next(l2arc_dev_list, next);
3969 3962                          if (next == NULL)
3970 3963                                  next = list_head(l2arc_dev_list);
3971 3964                  }
3972 3965  
3973 3966                  /* if we have come back to the start, bail out */
3974 3967                  if (first == NULL)
3975 3968                          first = next;
3976 3969                  else if (next == first)
3977 3970                          break;
3978 3971  
3979 3972          } while (vdev_is_dead(next->l2ad_vdev));
3980 3973  
3981 3974          /* if we were unable to find any usable vdevs, return NULL */
3982 3975          if (vdev_is_dead(next->l2ad_vdev))
3983 3976                  next = NULL;
3984 3977  
3985 3978          l2arc_dev_last = next;
3986 3979  
3987 3980  out:
3988 3981          mutex_exit(&l2arc_dev_mtx);
3989 3982  
3990 3983          /*
3991 3984           * Grab the config lock to prevent the 'next' device from being
3992 3985           * removed while we are writing to it.
3993 3986           */
3994 3987          if (next != NULL)
3995 3988                  spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
3996 3989          mutex_exit(&spa_namespace_lock);
3997 3990  
3998 3991          return (next);
3999 3992  }
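The selection above is a simple rotor: resume after the device used last time, walk the list circularly, and stop at the first vdev that is not faulted, giving up after one full lap. A toy array-based sketch of the same idea (the kernel's list_t plumbing and locking are left out, and the type and names here are made up):

    #include <stddef.h>

    struct toy_dev { int faulted; };

    /*
     * Round-robin pick: start just past *rotor, scan at most ndev entries,
     * return the first healthy device and remember it in *rotor.
     */
    static struct toy_dev *
    next_dev(struct toy_dev *devs, size_t ndev, size_t *rotor)
    {
            for (size_t i = 0; i < ndev; i++) {
                    size_t idx = (*rotor + 1 + i) % ndev;

                    if (!devs[idx].faulted) {
                            *rotor = idx;
                            return (&devs[idx]);
                    }
            }
            return (NULL);  /* no devices, or all of them are faulted */
    }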
4000 3993  
4001 3994  /*
4002 3995   * Free buffers that were tagged for destruction.
4003 3996   */
4004 3997  static void
4005 3998  l2arc_do_free_on_write()
4006 3999  {
4007 4000          list_t *buflist;
4008 4001          l2arc_data_free_t *df, *df_prev;
4009 4002  
4010 4003          mutex_enter(&l2arc_free_on_write_mtx);
4011 4004          buflist = l2arc_free_on_write;
4012 4005  
4013 4006          for (df = list_tail(buflist); df; df = df_prev) {
4014 4007                  df_prev = list_prev(buflist, df);
4015 4008                  ASSERT(df->l2df_data != NULL);
4016 4009                  ASSERT(df->l2df_func != NULL);
4017 4010                  df->l2df_func(df->l2df_data, df->l2df_size);
4018 4011                  list_remove(buflist, df);
4019 4012                  kmem_free(df, sizeof (l2arc_data_free_t));
4020 4013          }
4021 4014  
4022 4015          mutex_exit(&l2arc_free_on_write_mtx);
4023 4016  }
4024 4017  
4025 4018  /*
4026 4019   * A write to a cache device has completed.  Update all headers to allow
4027 4020   * reads from these buffers to begin.
4028 4021   */
4029 4022  static void
4030 4023  l2arc_write_done(zio_t *zio)
4031 4024  {
4032 4025          l2arc_write_callback_t *cb;
4033 4026          l2arc_dev_t *dev;
4034 4027          list_t *buflist;
4035 4028          arc_buf_hdr_t *head, *ab, *ab_prev;
4036 4029          l2arc_buf_hdr_t *abl2;
4037 4030          kmutex_t *hash_lock;
4038 4031  
4039 4032          cb = zio->io_private;
4040 4033          ASSERT(cb != NULL);
4041 4034          dev = cb->l2wcb_dev;
4042 4035          ASSERT(dev != NULL);
4043 4036          head = cb->l2wcb_head;
4044 4037          ASSERT(head != NULL);
4045 4038          buflist = dev->l2ad_buflist;
4046 4039          ASSERT(buflist != NULL);
4047 4040          DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
4048 4041              l2arc_write_callback_t *, cb);
4049 4042  
4050 4043          if (zio->io_error != 0)
4051 4044                  ARCSTAT_BUMP(arcstat_l2_writes_error);
4052 4045  
4053 4046          mutex_enter(&l2arc_buflist_mtx);
4054 4047  
4055 4048          /*
4056 4049           * All writes completed, or an error was hit.
4057 4050           */
4058 4051          for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
4059 4052                  ab_prev = list_prev(buflist, ab);
4060 4053  
4061 4054                  hash_lock = HDR_LOCK(ab);
4062 4055                  if (!mutex_tryenter(hash_lock)) {
4063 4056                          /*
4064 4057                           * This buffer misses out.  It may be in the
4065 4058                           * process of eviction.  Its ARC_L2_WRITING flag will
4066 4059                           * be left set, denying reads to this buffer.
4067 4060                           */
4068 4061                          ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
4069 4062                          continue;
4070 4063                  }
4071 4064  
4072 4065                  if (zio->io_error != 0) {
4073 4066                          /*
4074 4067                           * Error - drop L2ARC entry.
4075 4068                           */
4076 4069                          list_remove(buflist, ab);
4077 4070                          abl2 = ab->b_l2hdr;
4078 4071                          ab->b_l2hdr = NULL;
4079 4072                          kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
4080 4073                          ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
4081 4074                  }
4082 4075  
4083 4076                  /*
4084 4077                   * Allow ARC to begin reads to this L2ARC entry.
4085 4078                   */
4086 4079                  ab->b_flags &= ~ARC_L2_WRITING;
4087 4080  
4088 4081                  mutex_exit(hash_lock);
4089 4082          }
4090 4083  
4091 4084          atomic_inc_64(&l2arc_writes_done);
4092 4085          list_remove(buflist, head);
4093 4086          kmem_cache_free(hdr_cache, head);
4094 4087          mutex_exit(&l2arc_buflist_mtx);
4095 4088  
4096 4089          l2arc_do_free_on_write();
4097 4090  
4098 4091          kmem_free(cb, sizeof (l2arc_write_callback_t));
4099 4092  }
4100 4093  
4101 4094  /*
4102 4095   * A read to a cache device completed.  Validate buffer contents before
4103 4096   * handing over to the regular ARC routines.
4104 4097   */
4105 4098  static void
4106 4099  l2arc_read_done(zio_t *zio)
4107 4100  {
4108 4101          l2arc_read_callback_t *cb;
4109 4102          arc_buf_hdr_t *hdr;
4110 4103          arc_buf_t *buf;
4111 4104          kmutex_t *hash_lock;
4112 4105          int equal;
4113 4106  
4114 4107          ASSERT(zio->io_vd != NULL);
4115 4108          ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
4116 4109  
4117 4110          spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
4118 4111  
4119 4112          cb = zio->io_private;
4120 4113          ASSERT(cb != NULL);
4121 4114          buf = cb->l2rcb_buf;
4122 4115          ASSERT(buf != NULL);
4123 4116  
4124 4117          hash_lock = HDR_LOCK(buf->b_hdr);
4125 4118          mutex_enter(hash_lock);
4126 4119          hdr = buf->b_hdr;
4127 4120          ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
4128 4121  
4129 4122          /*
4130 4123           * Check this survived the L2ARC journey.
4131 4124           */
4132 4125          equal = arc_cksum_equal(buf);
4133 4126          if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
4134 4127                  mutex_exit(hash_lock);
4135 4128                  zio->io_private = buf;
4136 4129                  zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
4137 4130                  zio->io_bp = &zio->io_bp_copy;  /* XXX fix in L2ARC 2.0 */
4138 4131                  arc_read_done(zio);
4139 4132          } else {
4140 4133                  mutex_exit(hash_lock);
4141 4134                  /*
4142 4135                   * Buffer didn't survive caching.  Increment stats and
4143 4136                   * reissue to the original storage device.
4144 4137                   */
4145 4138                  if (zio->io_error != 0) {
4146 4139                          ARCSTAT_BUMP(arcstat_l2_io_error);
4147 4140                  } else {
4148 4141                          zio->io_error = SET_ERROR(EIO);
4149 4142                  }
4150 4143                  if (!equal)
4151 4144                          ARCSTAT_BUMP(arcstat_l2_cksum_bad);
4152 4145  
4153 4146                  /*
4154 4147                   * If there's no waiter, issue an async i/o to the primary
4155 4148                   * storage now.  If there *is* a waiter, the caller must
4156 4149                   * issue the i/o in a context where it's OK to block.
4157 4150                   */
4158 4151                  if (zio->io_waiter == NULL) {
4159 4152                          zio_t *pio = zio_unique_parent(zio);
4160 4153  
4161 4154                          ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
4162 4155  
4163 4156                          zio_nowait(zio_read(pio, cb->l2rcb_spa, &cb->l2rcb_bp,
4164 4157                              buf->b_data, zio->io_size, arc_read_done, buf,
4165 4158                              zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
4166 4159                  }
4167 4160          }
4168 4161  
4169 4162          kmem_free(cb, sizeof (l2arc_read_callback_t));
4170 4163  }
4171 4164  
4172 4165  /*
4173 4166   * This is the list priority from which the L2ARC will search for pages to
4174 4167   * cache.  This is used within loops (0..3) to cycle through lists in the
4175 4168   * desired order.  This order can have a significant effect on cache
4176 4169   * performance.
4177 4170   *
4178 4171   * Currently the metadata lists are hit first, MFU then MRU, followed by
4179 4172   * the data lists.  This function returns a locked list, and also returns
4180 4173   * the lock pointer.
4181 4174   */
4182 4175  static list_t *
4183 4176  l2arc_list_locked(int list_num, kmutex_t **lock)
4184 4177  {
4185 4178          list_t *list = NULL;
4186 4179  
4187 4180          ASSERT(list_num >= 0 && list_num <= 3);
4188 4181  
4189 4182          switch (list_num) {
4190 4183          case 0:
4191 4184                  list = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
4192 4185                  *lock = &arc_mfu->arcs_mtx;
4193 4186                  break;
4194 4187          case 1:
4195 4188                  list = &arc_mru->arcs_list[ARC_BUFC_METADATA];
4196 4189                  *lock = &arc_mru->arcs_mtx;
4197 4190                  break;
4198 4191          case 2:
4199 4192                  list = &arc_mfu->arcs_list[ARC_BUFC_DATA];
4200 4193                  *lock = &arc_mfu->arcs_mtx;
4201 4194                  break;
4202 4195          case 3:
4203 4196                  list = &arc_mru->arcs_list[ARC_BUFC_DATA];
4204 4197                  *lock = &arc_mru->arcs_mtx;
4205 4198                  break;
4206 4199          }
4207 4200  
4208 4201          ASSERT(!(MUTEX_HELD(*lock)));
4209 4202          mutex_enter(*lock);
4210 4203          return (list);
4211 4204  }
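Read as a table, the switch above fixes the scan order for each feed pass; a small sketch spelling it out (illustrative only):

    /*
     * l2arc_list_locked() scan order: metadata before data, and within
     * each, MFU before MRU.
     */
    static const char *l2arc_scan_order[4] = {
            "MFU metadata",         /* list_num 0 */
            "MRU metadata",         /* list_num 1 */
            "MFU data",             /* list_num 2 */
            "MRU data",             /* list_num 3 */
    };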
4212 4205  
4213 4206  /*
4214 4207   * Evict buffers from the device write hand to the distance specified in
4215 4208   * bytes.  This distance may span populated buffers, or it may span nothing.
4216 4209   * This is clearing a region on the L2ARC device ready for writing.
4217 4210   * If the 'all' boolean is set, every buffer is evicted.
4218 4211   */
4219 4212  static void
4220 4213  l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
4221 4214  {
4222 4215          list_t *buflist;
4223 4216          l2arc_buf_hdr_t *abl2;
4224 4217          arc_buf_hdr_t *ab, *ab_prev;
4225 4218          kmutex_t *hash_lock;
4226 4219          uint64_t taddr;
4227 4220  
4228 4221          buflist = dev->l2ad_buflist;
4229 4222  
4230 4223          if (buflist == NULL)
4231 4224                  return;
4232 4225  
4233 4226          if (!all && dev->l2ad_first) {
4234 4227                  /*
4235 4228                   * This is the first sweep through the device.  There is
4236 4229                   * nothing to evict.
4237 4230                   */
4238 4231                  return;
4239 4232          }
4240 4233  
4241 4234          if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
4242 4235                  /*
4243 4236                   * When nearing the end of the device, evict to the end
4244 4237                   * before the device write hand jumps to the start.
4245 4238                   */
4246 4239                  taddr = dev->l2ad_end;
4247 4240          } else {
4248 4241                  taddr = dev->l2ad_hand + distance;
4249 4242          }
4250 4243          DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
4251 4244              uint64_t, taddr, boolean_t, all);
4252 4245  
4253 4246  top:
4254 4247          mutex_enter(&l2arc_buflist_mtx);
4255 4248          for (ab = list_tail(buflist); ab; ab = ab_prev) {
4256 4249                  ab_prev = list_prev(buflist, ab);
4257 4250  
4258 4251                  hash_lock = HDR_LOCK(ab);
4259 4252                  if (!mutex_tryenter(hash_lock)) {
4260 4253                          /*
4261 4254                           * Missed the hash lock.  Retry.
4262 4255                           */
4263 4256                          ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
4264 4257                          mutex_exit(&l2arc_buflist_mtx);
4265 4258                          mutex_enter(hash_lock);
4266 4259                          mutex_exit(hash_lock);
4267 4260                          goto top;
4268 4261                  }
4269 4262  
4270 4263                  if (HDR_L2_WRITE_HEAD(ab)) {
4271 4264                          /*
4272 4265                           * We hit a write head node.  Leave it for
4273 4266                           * l2arc_write_done().
4274 4267                           */
4275 4268                          list_remove(buflist, ab);
4276 4269                          mutex_exit(hash_lock);
4277 4270                          continue;
4278 4271                  }
4279 4272  
4280 4273                  if (!all && ab->b_l2hdr != NULL &&
4281 4274                      (ab->b_l2hdr->b_daddr > taddr ||
4282 4275                      ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
4283 4276                          /*
4284 4277                           * We've evicted to the target address,
4285 4278                           * or the end of the device.
4286 4279                           */
4287 4280                          mutex_exit(hash_lock);
4288 4281                          break;
4289 4282                  }
4290 4283  
4291 4284                  if (HDR_FREE_IN_PROGRESS(ab)) {
4292 4285                          /*
4293 4286                           * Already on the path to destruction.
4294 4287                           */
4295 4288                          mutex_exit(hash_lock);
4296 4289                          continue;
4297 4290                  }
4298 4291  
4299 4292                  if (ab->b_state == arc_l2c_only) {
4300 4293                          ASSERT(!HDR_L2_READING(ab));
4301 4294                          /*
4302 4295                           * This doesn't exist in the ARC.  Destroy.
4303 4296                           * arc_hdr_destroy() will call list_remove()
4304 4297                           * and decrement arcstat_l2_size.
4305 4298                           */
4306 4299                          arc_change_state(arc_anon, ab, hash_lock);
4307 4300                          arc_hdr_destroy(ab);
4308 4301                  } else {
4309 4302                          /*
4310 4303                           * Invalidate issued or about to be issued
4311 4304                           * reads, since we may be about to write
4312 4305                           * over this location.
4313 4306                           */
4314 4307                          if (HDR_L2_READING(ab)) {
4315 4308                                  ARCSTAT_BUMP(arcstat_l2_evict_reading);
4316 4309                                  ab->b_flags |= ARC_L2_EVICTED;
4317 4310                          }
4318 4311  
4319 4312                          /*
4320 4313                           * Tell ARC this no longer exists in L2ARC.
4321 4314                           */
4322 4315                          if (ab->b_l2hdr != NULL) {
4323 4316                                  abl2 = ab->b_l2hdr;
4324 4317                                  ab->b_l2hdr = NULL;
4325 4318                                  kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
4326 4319                                  ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
4327 4320                          }
4328 4321                          list_remove(buflist, ab);
4329 4322  
4330 4323                          /*
4331 4324                           * This may have been leftover after a
4332 4325                           * failed write.
4333 4326                           */
4334 4327                          ab->b_flags &= ~ARC_L2_WRITING;
4335 4328                  }
4336 4329                  mutex_exit(hash_lock);
4337 4330          }
4338 4331          mutex_exit(&l2arc_buflist_mtx);
4339 4332  
4340 4333          vdev_space_update(dev->l2ad_vdev, -(taddr - dev->l2ad_evict), 0, 0);
4341 4334          dev->l2ad_evict = taddr;
4342 4335  }
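The eviction target computed at the top of this function is worth restating: normally the region cleared is 'distance' bytes ahead of the write hand, but once the hand is within two distances of the end of the device, eviction runs all the way to the end so the hand can wrap to l2ad_start with clean space waiting. A tiny sketch of that choice (assumes 2 * distance is smaller than the device, as it is in practice):

    #include <stdint.h>

    /* Target address for l2arc_evict(), per the logic above. */
    static uint64_t
    evict_target(uint64_t hand, uint64_t dev_end, uint64_t distance)
    {
            if (hand >= dev_end - (2 * distance))
                    return (dev_end);
            return (hand + distance);
    }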
4343 4336  
4344 4337  /*
4345 4338   * Find and write ARC buffers to the L2ARC device.
4346 4339   *
4347 4340   * An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
4348 4341   * for reading until they have completed writing.
4349 4342   */
4350 4343  static uint64_t
4351 4344  l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
4352 4345  {
4353 4346          arc_buf_hdr_t *ab, *ab_prev, *head;
4354 4347          l2arc_buf_hdr_t *hdrl2;
4355 4348          list_t *list;
4356 4349          uint64_t passed_sz, write_sz, buf_sz, headroom;
4357 4350          void *buf_data;
4358 4351          kmutex_t *hash_lock, *list_lock;
4359 4352          boolean_t have_lock, full;
4360 4353          l2arc_write_callback_t *cb;
4361 4354          zio_t *pio, *wzio;
4362 4355          uint64_t guid = spa_load_guid(spa);
4363 4356  
4364 4357          ASSERT(dev->l2ad_vdev != NULL);
4365 4358  
4366 4359          pio = NULL;
4367 4360          write_sz = 0;
4368 4361          full = B_FALSE;
4369 4362          head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
4370 4363          head->b_flags |= ARC_L2_WRITE_HEAD;
4371 4364  
4372 4365          /*
4373 4366           * Copy buffers for L2ARC writing.
4374 4367           */
4375 4368          mutex_enter(&l2arc_buflist_mtx);
4376 4369          for (int try = 0; try <= 3; try++) {
4377 4370                  list = l2arc_list_locked(try, &list_lock);
4378 4371                  passed_sz = 0;
4379 4372  
4380 4373                  /*
4381 4374                   * L2ARC fast warmup.
4382 4375                   *
4383 4376                   * Until the ARC is warm and starts to evict, read from the
4384 4377                   * head of the ARC lists rather than the tail.
4385 4378                   */
4386 4379                  headroom = target_sz * l2arc_headroom;
4387 4380                  if (arc_warm == B_FALSE)
4388 4381                          ab = list_head(list);
4389 4382                  else
4390 4383                          ab = list_tail(list);
4391 4384  
4392 4385                  for (; ab; ab = ab_prev) {
4393 4386                          if (arc_warm == B_FALSE)
4394 4387                                  ab_prev = list_next(list, ab);
4395 4388                          else
4396 4389                                  ab_prev = list_prev(list, ab);
4397 4390  
4398 4391                          hash_lock = HDR_LOCK(ab);
4399 4392                          have_lock = MUTEX_HELD(hash_lock);
4400 4393                          if (!have_lock && !mutex_tryenter(hash_lock)) {
4401 4394                                  /*
4402 4395                                   * Skip this buffer rather than waiting.
4403 4396                                   */
4404 4397                                  continue;
4405 4398                          }
4406 4399  
4407 4400                          passed_sz += ab->b_size;
4408 4401                          if (passed_sz > headroom) {
4409 4402                                  /*
4410 4403                                   * Searched too far.
4411 4404                                   */
4412 4405                                  mutex_exit(hash_lock);
4413 4406                                  break;
4414 4407                          }
4415 4408  
4416 4409                          if (!l2arc_write_eligible(guid, ab)) {
4417 4410                                  mutex_exit(hash_lock);
4418 4411                                  continue;
4419 4412                          }
4420 4413  
4421 4414                          if ((write_sz + ab->b_size) > target_sz) {
4422 4415                                  full = B_TRUE;
4423 4416                                  mutex_exit(hash_lock);
4424 4417                                  break;
4425 4418                          }
4426 4419  
4427 4420                          if (pio == NULL) {
4428 4421                                  /*
4429 4422                                   * Insert a dummy header on the buflist so
4430 4423                                   * l2arc_write_done() can find where the
4431 4424                                   * write buffers begin without searching.
4432 4425                                   */
4433 4426                                  list_insert_head(dev->l2ad_buflist, head);
4434 4427  
4435 4428                                  cb = kmem_alloc(
4436 4429                                      sizeof (l2arc_write_callback_t), KM_SLEEP);
4437 4430                                  cb->l2wcb_dev = dev;
4438 4431                                  cb->l2wcb_head = head;
4439 4432                                  pio = zio_root(spa, l2arc_write_done, cb,
4440 4433                                      ZIO_FLAG_CANFAIL);
4441 4434                          }
4442 4435  
4443 4436                          /*
4444 4437                           * Create and add a new L2ARC header.
4445 4438                           */
4446 4439                          hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
4447 4440                          hdrl2->b_dev = dev;
4448 4441                          hdrl2->b_daddr = dev->l2ad_hand;
4449 4442  
4450 4443                          ab->b_flags |= ARC_L2_WRITING;
4451 4444                          ab->b_l2hdr = hdrl2;
4452 4445                          list_insert_head(dev->l2ad_buflist, ab);
4453 4446                          buf_data = ab->b_buf->b_data;
4454 4447                          buf_sz = ab->b_size;
4455 4448  
4456 4449                          /*
4457 4450                           * Compute and store the buffer cksum before
4458 4451                           * writing.  On debug the cksum is verified first.
4459 4452                           */
4460 4453                          arc_cksum_verify(ab->b_buf);
4461 4454                          arc_cksum_compute(ab->b_buf, B_TRUE);
4462 4455  
4463 4456                          mutex_exit(hash_lock);
4464 4457  
4465 4458                          wzio = zio_write_phys(pio, dev->l2ad_vdev,
4466 4459                              dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
4467 4460                              NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
4468 4461                              ZIO_FLAG_CANFAIL, B_FALSE);
4469 4462  
4470 4463                          DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
4471 4464                              zio_t *, wzio);
4472 4465                          (void) zio_nowait(wzio);
4473 4466  
4474 4467                          /*
4475 4468                           * Keep the clock hand suitably device-aligned.
4476 4469                           */
4477 4470                          buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
4478 4471  
4479 4472                          write_sz += buf_sz;
4480 4473                          dev->l2ad_hand += buf_sz;
4481 4474                  }
4482 4475  
4483 4476                  mutex_exit(list_lock);
4484 4477  
4485 4478                  if (full == B_TRUE)
4486 4479                          break;
4487 4480          }
4488 4481          mutex_exit(&l2arc_buflist_mtx);
4489 4482  
4490 4483          if (pio == NULL) {
4491 4484                  ASSERT0(write_sz);
4492 4485                  kmem_cache_free(hdr_cache, head);
4493 4486                  return (0);
4494 4487          }
4495 4488  
4496 4489          ASSERT3U(write_sz, <=, target_sz);
4497 4490          ARCSTAT_BUMP(arcstat_l2_writes_sent);
4498 4491          ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
4499 4492          ARCSTAT_INCR(arcstat_l2_size, write_sz);
4500 4493          vdev_space_update(dev->l2ad_vdev, write_sz, 0, 0);
4501 4494  
4502 4495          /*
4503 4496           * Bump device hand to the device start if it is approaching the end.
4504 4497           * l2arc_evict() will already have evicted ahead for this case.
4505 4498           */
4506 4499          if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
4507 4500                  vdev_space_update(dev->l2ad_vdev,
4508 4501                      dev->l2ad_end - dev->l2ad_hand, 0, 0);
4509 4502                  dev->l2ad_hand = dev->l2ad_start;
4510 4503                  dev->l2ad_evict = dev->l2ad_start;
4511 4504                  dev->l2ad_first = B_FALSE;
4512 4505          }
4513 4506  
4514 4507          dev->l2ad_writing = B_TRUE;
4515 4508          (void) zio_wait(pio);
4516 4509          dev->l2ad_writing = B_FALSE;
4517 4510  
4518 4511          return (write_sz);
4519 4512  }
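Two details of the loop above are easy to miss: the hand advances by the asize-rounded buffer size (vdev_psize_to_asize()), not the raw buffer size, and after the pass the hand wraps to l2ad_start once fewer than target_sz bytes remain, relying on l2arc_evict() having already cleared that region. A sketch that folds both steps into one helper, with a plain round-up standing in for vdev_psize_to_asize():

    #include <stdint.h>

    /*
     * Advance the L2ARC write hand by one buffer, then wrap it to the
     * start of the device if less than target_sz remains ahead of it.
     */
    static uint64_t
    advance_hand(uint64_t hand, uint64_t dev_start, uint64_t dev_end,
        uint64_t buf_sz, uint64_t asize, uint64_t target_sz)
    {
            uint64_t aligned = ((buf_sz + asize - 1) / asize) * asize;

            hand += aligned;
            if (hand >= dev_end - target_sz)
                    hand = dev_start;
            return (hand);
    }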
4520 4513  
4521 4514  /*
4522 4515   * This thread feeds the L2ARC at regular intervals.  This is the beating
4523 4516   * heart of the L2ARC.
4524 4517   */
4525 4518  static void
4526 4519  l2arc_feed_thread(void)
4527 4520  {
4528 4521          callb_cpr_t cpr;
4529 4522          l2arc_dev_t *dev;
4530 4523          spa_t *spa;
4531 4524          uint64_t size, wrote;
4532 4525          clock_t begin, next = ddi_get_lbolt();
4533 4526  
4534 4527          CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
4535 4528  
4536 4529          mutex_enter(&l2arc_feed_thr_lock);
4537 4530  
4538 4531          while (l2arc_thread_exit == 0) {
4539 4532                  CALLB_CPR_SAFE_BEGIN(&cpr);
4540 4533                  (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
4541 4534                      next);
4542 4535                  CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
4543 4536                  next = ddi_get_lbolt() + hz;
4544 4537  
4545 4538                  /*
4546 4539                   * Quick check for L2ARC devices.
4547 4540                   */
4548 4541                  mutex_enter(&l2arc_dev_mtx);
4549 4542                  if (l2arc_ndev == 0) {
4550 4543                          mutex_exit(&l2arc_dev_mtx);
4551 4544                          continue;
4552 4545                  }
4553 4546                  mutex_exit(&l2arc_dev_mtx);
4554 4547                  begin = ddi_get_lbolt();
4555 4548  
4556 4549                  /*
4557 4550                   * This selects the next l2arc device to write to, and in
4558 4551                   * doing so the next spa to feed from: dev->l2ad_spa.   This
4559 4552                   * will return NULL if there are now no l2arc devices or if
4560 4553                   * they are all faulted.
4561 4554                   *
4562 4555                   * If a device is returned, its spa's config lock is also
4563 4556                   * held to prevent device removal.  l2arc_dev_get_next()
4564 4557                   * will grab and release l2arc_dev_mtx.
4565 4558                   */
4566 4559                  if ((dev = l2arc_dev_get_next()) == NULL)
4567 4560                          continue;
4568 4561  
4569 4562                  spa = dev->l2ad_spa;
4570 4563                  ASSERT(spa != NULL);
4571 4564  
4572 4565                  /*
4573 4566                   * If the pool is read-only then force the feed thread to
4574 4567                   * sleep a little longer.
4575 4568                   */
4576 4569                  if (!spa_writeable(spa)) {
4577 4570                          next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
4578 4571                          spa_config_exit(spa, SCL_L2ARC, dev);
4579 4572                          continue;
4580 4573                  }
4581 4574  
4582 4575                  /*
4583 4576                   * Avoid contributing to memory pressure.
4584 4577                   */
4585 4578                  if (arc_reclaim_needed()) {
4586 4579                          ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
4587 4580                          spa_config_exit(spa, SCL_L2ARC, dev);
4588 4581                          continue;
4589 4582                  }
4590 4583  
4591 4584                  ARCSTAT_BUMP(arcstat_l2_feeds);
4592 4585  
4593 4586                  size = l2arc_write_size(dev);
4594 4587  
4595 4588                  /*
4596 4589                   * Evict L2ARC buffers that will be overwritten.
4597 4590                   */
4598 4591                  l2arc_evict(dev, size, B_FALSE);
4599 4592  
4600 4593                  /*
4601 4594                   * Write ARC buffers.
4602 4595                   */
4603 4596                  wrote = l2arc_write_buffers(spa, dev, size);
4604 4597  
4605 4598                  /*
4606 4599                   * Calculate interval between writes.
4607 4600                   */
4608 4601                  next = l2arc_write_interval(begin, size, wrote);
4609 4602                  spa_config_exit(spa, SCL_L2ARC, dev);
4610 4603          }
4611 4604  
4612 4605          l2arc_thread_exit = 0;
4613 4606          cv_broadcast(&l2arc_feed_thr_cv);
4614 4607          CALLB_CPR_EXIT(&cpr);           /* drops l2arc_feed_thr_lock */
4615 4608          thread_exit();
4616 4609  }
4617 4610  
4618 4611  boolean_t
4619 4612  l2arc_vdev_present(vdev_t *vd)
4620 4613  {
4621 4614          l2arc_dev_t *dev;
4622 4615  
4623 4616          mutex_enter(&l2arc_dev_mtx);
4624 4617          for (dev = list_head(l2arc_dev_list); dev != NULL;
4625 4618              dev = list_next(l2arc_dev_list, dev)) {
4626 4619                  if (dev->l2ad_vdev == vd)
4627 4620                          break;
4628 4621          }
4629 4622          mutex_exit(&l2arc_dev_mtx);
4630 4623  
4631 4624          return (dev != NULL);
4632 4625  }
4633 4626  
4634 4627  /*
4635 4628   * Add a vdev for use by the L2ARC.  By this point the spa has already
4636 4629   * validated the vdev and opened it.
4637 4630   */
4638 4631  void
4639 4632  l2arc_add_vdev(spa_t *spa, vdev_t *vd)
4640 4633  {
4641 4634          l2arc_dev_t *adddev;
4642 4635  
4643 4636          ASSERT(!l2arc_vdev_present(vd));
4644 4637  
4645 4638          /*
4646 4639           * Create a new l2arc device entry.
4647 4640           */
4648 4641          adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
4649 4642          adddev->l2ad_spa = spa;
4650 4643          adddev->l2ad_vdev = vd;
4651 4644          adddev->l2ad_write = l2arc_write_max;
4652 4645          adddev->l2ad_boost = l2arc_write_boost;
4653 4646          adddev->l2ad_start = VDEV_LABEL_START_SIZE;
4654 4647          adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
4655 4648          adddev->l2ad_hand = adddev->l2ad_start;
4656 4649          adddev->l2ad_evict = adddev->l2ad_start;
4657 4650          adddev->l2ad_first = B_TRUE;
4658 4651          adddev->l2ad_writing = B_FALSE;
4659 4652          ASSERT3U(adddev->l2ad_write, >, 0);
4660 4653  
4661 4654          /*
4662 4655           * This is a list of all ARC buffers that are still valid on the
4663 4656           * device.
4664 4657           */
4665 4658          adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
4666 4659          list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
4667 4660              offsetof(arc_buf_hdr_t, b_l2node));
4668 4661  
4669 4662          vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
4670 4663  
4671 4664          /*
4672 4665           * Add device to global list
4673 4666           */
4674 4667          mutex_enter(&l2arc_dev_mtx);
4675 4668          list_insert_head(l2arc_dev_list, adddev);
4676 4669          atomic_inc_64(&l2arc_ndev);
4677 4670          mutex_exit(&l2arc_dev_mtx);
4678 4671  }
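Note on the bookkeeping in l2arc_add_vdev() above: the device's usable region is [l2ad_start, l2ad_end), with both the write hand and the eviction hand starting at l2ad_start, and the space reported via vdev_space_update() is simply l2ad_end - l2ad_hand. A rough user-land model of that arithmetic is sketched below; the struct, the LABEL_START value, and the example device size are illustrative stand-ins, not the kernel definitions.

        /*
         * Hedged user-land model of the region arithmetic in l2arc_add_vdev().
         * Types and constants are simplified stand-ins, not the kernel's.
         */
        #include <stdio.h>
        #include <stdint.h>

        #define LABEL_START     (4ULL << 20)    /* stand-in for VDEV_LABEL_START_SIZE */

        typedef struct {
                uint64_t start;         /* first usable byte (l2ad_start) */
                uint64_t end;           /* one past last usable byte (l2ad_end) */
                uint64_t hand;          /* next write offset (l2ad_hand) */
                uint64_t evict;         /* eviction hand (l2ad_evict) */
        } l2dev_model_t;

        int
        main(void)
        {
                uint64_t asize = 32ULL << 30;   /* example device size: 32 GiB */
                l2dev_model_t d;

                d.start = LABEL_START;
                d.end = LABEL_START + asize;
                d.hand = d.start;
                d.evict = d.start;

                /*
                 * This difference is what l2arc_add_vdev() reports through
                 * vdev_space_update(vd, 0, 0, l2ad_end - l2ad_hand).
                 */
                (void) printf("usable L2ARC space: %llu bytes\n",
                    (unsigned long long)(d.end - d.hand));
                return (0);
        }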
4679 4672  
4680 4673  /*
4681 4674   * Remove a vdev from the L2ARC.
4682 4675   */
4683 4676  void
4684 4677  l2arc_remove_vdev(vdev_t *vd)
4685 4678  {
4686 4679          l2arc_dev_t *dev, *nextdev, *remdev = NULL;
4687 4680  
4688 4681          /*
4689 4682           * Find the device by vdev
4690 4683           */
4691 4684          mutex_enter(&l2arc_dev_mtx);
4692 4685          for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
4693 4686                  nextdev = list_next(l2arc_dev_list, dev);
4694 4687                  if (vd == dev->l2ad_vdev) {
4695 4688                          remdev = dev;
4696 4689                          break;
4697 4690                  }
4698 4691          }
4699 4692          ASSERT(remdev != NULL);
4700 4693  
4701 4694          /*
4702 4695           * Remove device from global list
4703 4696           */
4704 4697          list_remove(l2arc_dev_list, remdev);
4705 4698          l2arc_dev_last = NULL;          /* may have been invalidated */
4706 4699          atomic_dec_64(&l2arc_ndev);
4707 4700          mutex_exit(&l2arc_dev_mtx);
4708 4701  
4709 4702          /*
4710 4703           * Clear all buflists and ARC references.  L2ARC device flush.
4711 4704           */
4712 4705          l2arc_evict(remdev, 0, B_TRUE);
4713 4706          list_destroy(remdev->l2ad_buflist);
4714 4707          kmem_free(remdev->l2ad_buflist, sizeof (list_t));
4715 4708          kmem_free(remdev, sizeof (l2arc_dev_t));
4716 4709  }
4717 4710  
4718 4711  void
4719 4712  l2arc_init(void)
4720 4713  {
4721 4714          l2arc_thread_exit = 0;
4722 4715          l2arc_ndev = 0;
4723 4716          l2arc_writes_sent = 0;
4724 4717          l2arc_writes_done = 0;
4725 4718  
4726 4719          mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
4727 4720          cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
4728 4721          mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
4729 4722          mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
4730 4723          mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
4731 4724  
4732 4725          l2arc_dev_list = &L2ARC_dev_list;
4733 4726          l2arc_free_on_write = &L2ARC_free_on_write;
4734 4727          list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
4735 4728              offsetof(l2arc_dev_t, l2ad_node));
4736 4729          list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
4737 4730              offsetof(l2arc_data_free_t, l2df_list_node));
4738 4731  }
4739 4732  
4740 4733  void
4741 4734  l2arc_fini(void)
4742 4735  {
4743 4736          /*
4744 4737           * This is called from dmu_fini(), which is called from spa_fini();
4745 4738           * Because of this, we can assume that all l2arc devices have
4746 4739           * already been removed when the pools themselves were removed.
4747 4740           */
4748 4741  
4749 4742          l2arc_do_free_on_write();
4750 4743  
4751 4744          mutex_destroy(&l2arc_feed_thr_lock);
4752 4745          cv_destroy(&l2arc_feed_thr_cv);
4753 4746          mutex_destroy(&l2arc_dev_mtx);
4754 4747          mutex_destroy(&l2arc_buflist_mtx);
4755 4748          mutex_destroy(&l2arc_free_on_write_mtx);
4756 4749  
4757 4750          list_destroy(l2arc_dev_list);
4758 4751          list_destroy(l2arc_free_on_write);
4759 4752  }
4760 4753  
4761 4754  void
4762 4755  l2arc_start(void)
4763 4756  {
4764 4757          if (!(spa_mode_global & FWRITE))
4765 4758                  return;
4766 4759  
4767 4760          (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
4768 4761              TS_RUN, minclsyspri);
4769 4762  }
4770 4763  
4771 4764  void
4772 4765  l2arc_stop(void)
4773 4766  {
4774 4767          if (!(spa_mode_global & FWRITE))
4775 4768                  return;
4776 4769  
4777 4770          mutex_enter(&l2arc_feed_thr_lock);
4778 4771          cv_signal(&l2arc_feed_thr_cv);  /* kick thread out of startup */
4779 4772          l2arc_thread_exit = 1;
4780 4773          while (l2arc_thread_exit != 0)
4781 4774                  cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
4782 4775          mutex_exit(&l2arc_feed_thr_lock);
4783 4776  }
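For readers skimming the listing: the shutdown handshake between l2arc_stop() above and the exit path at the bottom of l2arc_feed_thread() (set l2arc_thread_exit back to 0, cv_broadcast(), drop the lock, thread_exit()) is the usual flag-plus-condition-variable pattern. Below is a minimal user-land sketch of that pattern using POSIX threads instead of the kernel's mutex/cv/CPR primitives; the names are illustrative, not the kernel API.

        /*
         * User-land sketch of the stop handshake: the stopper sets an exit
         * flag and waits on a condition variable; the worker notices the
         * flag, clears it to acknowledge, broadcasts, and exits.
         */
        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t feed_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t feed_cv = PTHREAD_COND_INITIALIZER;
        static int thread_exit_flag = 0;

        static void *
        feed_thread(void *arg)
        {
                (void) arg;
                pthread_mutex_lock(&feed_lock);
                while (thread_exit_flag == 0) {
                        /* The kernel paces feeds with cv_timedwait() here. */
                        pthread_cond_wait(&feed_cv, &feed_lock);
                }
                thread_exit_flag = 0;                   /* acknowledge the stop request */
                pthread_cond_broadcast(&feed_cv);       /* wake the stopper */
                pthread_mutex_unlock(&feed_lock);
                return (NULL);
        }

        static void
        stop_feed_thread(void)
        {
                pthread_mutex_lock(&feed_lock);
                thread_exit_flag = 1;
                pthread_cond_signal(&feed_cv);          /* kick thread out of its wait */
                while (thread_exit_flag != 0)
                        pthread_cond_wait(&feed_cv, &feed_lock);
                pthread_mutex_unlock(&feed_lock);
        }

        int
        main(void)
        {
                pthread_t tid;

                (void) pthread_create(&tid, NULL, feed_thread, NULL);
                stop_feed_thread();
                (void) pthread_join(&tid, NULL);
                (void) printf("feed thread stopped\n");
                return (0);
        }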