NEX-19394 backport 9337 zfs get all is slow due to uncached metadata
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
 Conflicts:
  usr/src/uts/common/fs/zfs/dbuf.c
  usr/src/uts/common/fs/zfs/dmu.c
  usr/src/uts/common/fs/zfs/sys/dmu_objset.h
NEX-15468 panic - Deadlock: cycle in blocking chain with dbuf_destroy calling mutex_vector_enter
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16146 9188 increase size of dbuf cache to reduce indirect block decompression
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-5366 Race between unique_insert() and unique_remove() causes ZFS fsid change
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
6047 SPARC boot should support feature@embedded_data
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-1823 Slow performance doing of a large dataset
5911 ZFS "hangs" while deleting file
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
NEX-3558 KRRP Integration
NEX-3266 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
NEX-3165 segregate ddt in arc
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
DDT is placed either into special or to L2ARC but not in both
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line), place each kmutex_t on cache line by itself to avoid false sharing
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
    
      
    
          --- old/usr/src/uts/common/fs/zfs/dbuf.c
          +++ new/usr/src/uts/common/fs/zfs/dbuf.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23      - * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
       23 + * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  24   24   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  25   25   * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  26   26   * Copyright (c) 2013, Joyent, Inc. All rights reserved.
  27   27   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
  28   28   * Copyright (c) 2014 Integros [integros.com]
  29   29   */
  30   30  
  31   31  #include <sys/zfs_context.h>
  32   32  #include <sys/dmu.h>
  33   33  #include <sys/dmu_send.h>
  34   34  #include <sys/dmu_impl.h>
  35   35  #include <sys/dbuf.h>
  36   36  #include <sys/dmu_objset.h>
  37   37  #include <sys/dsl_dataset.h>
  38   38  #include <sys/dsl_dir.h>
  39   39  #include <sys/dmu_tx.h>
  40   40  #include <sys/spa.h>
       41 +#include <sys/spa_impl.h>
  41   42  #include <sys/zio.h>
  42   43  #include <sys/dmu_zfetch.h>
  43   44  #include <sys/sa.h>
  44   45  #include <sys/sa_impl.h>
  45   46  #include <sys/zfeature.h>
  46   47  #include <sys/blkptr.h>
  47   48  #include <sys/range_tree.h>
  48   49  #include <sys/callb.h>
  49   50  #include <sys/abd.h>
  50      -#include <sys/vdev.h>
  51      -#include <sys/cityhash.h>
  52   51  
  53   52  uint_t zfs_dbuf_evict_key;
  54   53  
  55   54  static boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
  56   55  static void dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx);
  57   56  
  58   57  #ifndef __lint
  59   58  extern inline void dmu_buf_init_user(dmu_buf_user_t *dbu,
  60   59      dmu_buf_evict_func_t *evict_func_sync,
  61   60      dmu_buf_evict_func_t *evict_func_async,
  62   61      dmu_buf_t **clear_on_evict_dbufp);
  63   62  #endif /* ! __lint */
  64   63  
  65   64  /*
  66   65   * Global data structures and functions for the dbuf cache.
  
  67   66   */
  68   67  static kmem_cache_t *dbuf_kmem_cache;
  69   68  static taskq_t *dbu_evict_taskq;
  70   69  
  71   70  static kthread_t *dbuf_cache_evict_thread;
  72   71  static kmutex_t dbuf_evict_lock;
  73   72  static kcondvar_t dbuf_evict_cv;
  74   73  static boolean_t dbuf_evict_thread_exit;
  75   74  
  76   75  /*
  77      - * LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
  78      - * are not currently held but have been recently released. These dbufs
  79      - * are not eligible for arc eviction until they are aged out of the cache.
  80      - * Dbufs are added to the dbuf cache once the last hold is released. If a
  81      - * dbuf is later accessed and still exists in the dbuf cache, then it will
  82      - * be removed from the cache and later re-added to the head of the cache.
  83      - * Dbufs that are aged out of the cache will be immediately destroyed and
  84      - * become eligible for arc eviction.
       76 + * There are two dbuf caches; each dbuf can only be in one of them at a time.
       77 + *
       78 + * 1. Cache of metadata dbufs, to help make read-heavy administrative commands
       79 + *    from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
       80 + *    that represent the metadata that describes filesystems/snapshots/
       81 + *    bookmarks/properties/etc. We only evict from this cache when we export a
       82 + *    pool, to short-circuit as much I/O as possible for all administrative
       83 + *    commands that need the metadata. There is no eviction policy for this
       84 + *    cache, because we try to only include types in it which would occupy a
       85 + *    very small amount of space per object but create a large impact on the
       86 + *    performance of these commands. Instead, after it reaches a maximum size
       87 + *    (which should only happen on very small memory systems with a very large
       88 + *    number of filesystem objects), we stop taking new dbufs into the
       89 + *    metadata cache, instead putting them in the normal dbuf cache.
       90 + *
       91 + * 2. LRU cache of dbufs. The "dbuf cache" maintains a list of dbufs that
       92 + *    are not currently held but have been recently released. These dbufs
       93 + *    are not eligible for arc eviction until they are aged out of the cache.
       94 + *    Dbufs that are aged out of the cache will be immediately destroyed and
       95 + *    become eligible for arc eviction.
       96 + *
       97 + * Dbufs are added to these caches once the last hold is released. If a dbuf is
       98 + * later accessed and still exists in the dbuf cache, then it will be removed
       99 + * from the cache and later re-added to the head of the cache.
      100 + *
      101 + * If a given dbuf meets the requirements for the metadata cache, it will go
      102 + * there, otherwise it will be considered for the generic LRU dbuf cache. The
      103 + * caches and the refcounts tracking their sizes are stored in an array indexed
      104 + * by those caches' matching enum values (from dbuf_cached_state_t).
  85  105   */
  86      -static multilist_t *dbuf_cache;
  87      -static refcount_t dbuf_cache_size;
  88      -uint64_t dbuf_cache_max_bytes = 100 * 1024 * 1024;
      106 +typedef struct dbuf_cache {
      107 +        multilist_t *cache;
      108 +        refcount_t size;
      109 +} dbuf_cache_t;
      110 +dbuf_cache_t dbuf_caches[DB_CACHE_MAX];
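
A minimal sketch of the routing decision described in the comment above, written as a self-contained userland C model (not part of this webrev). The function route_released_dbuf() and its parameters are hypothetical stand-ins for the kernel's DMU_OT_IS_METADATA_CACHED() check and the byte count kept in dbuf_caches[DB_DBUF_METADATA_CACHE].size; the kernel's actual check, dbuf_include_in_metadata_cache(), appears further down in this diff.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's dbuf_cached_state_t values. */
typedef enum {
	CACHE_METADATA,		/* models DB_DBUF_METADATA_CACHE */
	CACHE_LRU		/* models DB_DBUF_CACHE */
} which_cache_t;

/*
 * Route a dbuf whose last hold was just released: metadata-cached object
 * types go to the metadata cache until that cache exceeds its cap;
 * everything else goes to the LRU dbuf cache and ages out of it.
 */
static which_cache_t
route_released_dbuf(bool is_metadata_cached_type,
    uint64_t metadata_cache_bytes, uint64_t metadata_cache_max_bytes)
{
	if (is_metadata_cached_type &&
	    metadata_cache_bytes <= metadata_cache_max_bytes)
		return (CACHE_METADATA);
	return (CACHE_LRU);
}

int
main(void)
{
	/* Fits while the metadata cache is under its cap. */
	printf("%d\n", route_released_dbuf(true, 32ULL << 20, 64ULL << 20));
	/* Over the cap: falls back to the LRU dbuf cache. */
	printf("%d\n", route_released_dbuf(true, 65ULL << 20, 64ULL << 20));
	return (0);
}
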
  89  111  
  90      -/* Cap the size of the dbuf cache to log2 fraction of arc size. */
  91      -int dbuf_cache_max_shift = 5;
      112 +/* Size limits for the caches */
      113 +uint64_t dbuf_cache_max_bytes = 0;
      114 +uint64_t dbuf_metadata_cache_max_bytes = 0;
      115 +/* Set the default sizes of the caches to log2 fraction of arc size */
      116 +int dbuf_cache_shift = 5;
      117 +int dbuf_metadata_cache_shift = 6;
  92  118  
  93  119  /*
  94      - * The dbuf cache uses a three-stage eviction policy:
      120 + * For diagnostic purposes, this is incremented whenever we can't add
      121 + * something to the metadata cache because it's full, and instead put
      122 + * the data in the regular dbuf cache.
      123 + */
      124 +uint64_t dbuf_metadata_cache_overflow;
      125 +
      126 +/*
      127 + * The LRU dbuf cache uses a three-stage eviction policy:
  95  128   *      - A low water marker designates when the dbuf eviction thread
  96  129   *      should stop evicting from the dbuf cache.
  97  130   *      - When we reach the maximum size (aka mid water mark), we
  98  131   *      signal the eviction thread to run.
  99  132   *      - The high water mark indicates when the eviction thread
 100  133   *      is unable to keep up with the incoming load and eviction must
 101  134   *      happen in the context of the calling thread.
 102  135   *
 103  136   * The dbuf cache:
 104  137   *                                                 (max size)
 105  138   *                                      low water   mid water   hi water
 106  139   * +----------------------------------------+----------+----------+
 107  140   * |                                        |          |          |
 108  141   * |                                        |          |          |
 109  142   * |                                        |          |          |
 110  143   * |                                        |          |          |
 111  144   * +----------------------------------------+----------+----------+
 112  145   *                                        stop        signal     evict
 113  146   *                                      evicting     eviction   directly
 114  147   *                                                    thread
 115  148   *
 116  149   * The high and low water marks indicate the operating range for the eviction
 117  150   * thread. The low water mark is, by default, 90% of the total size of the
 118  151   * cache and the high water mark is at 110% (both of these percentages can be
 119  152   * changed by setting dbuf_cache_lowater_pct and dbuf_cache_hiwater_pct,
 120  153   * respectively). The eviction thread will try to ensure that the cache remains
 121  154   * within this range by waking up every second and checking if the cache is
 122  155   * above the low water mark. The thread can also be woken up by callers adding
 123  156   * elements into the cache if the cache is larger than the mid water (i.e max
 124  157   * cache size). Once the eviction thread is woken up and eviction is required,
 125  158   * it will continue evicting buffers until it's able to reduce the cache size
 126  159   * to the low water mark. If the cache size continues to grow and hits the high
  127  160   * water mark, then callers adding elements to the cache will begin to evict
 128  161   * directly from the cache until the cache is no longer above the high water
 129  162   * mark.
 130  163   */
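
As a worked example with illustrative numbers: if dbuf_cache_max_bytes were 100 MiB and both dbuf_cache_hiwater_pct and dbuf_cache_lowater_pct kept their default value of 10, the low water mark would be 90 MiB and the high water mark 110 MiB; the eviction thread keeps evicting until the cache drops back to 90 MiB, and a caller that adds to the cache while it is above 110 MiB evicts one dbuf itself.
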
 131  164  
 132  165  /*
 133  166   * The percentage above and below the maximum cache size.
 134  167   */
 135  168  uint_t dbuf_cache_hiwater_pct = 10;
 136  169  uint_t dbuf_cache_lowater_pct = 10;
 137  170  
 138  171  /* ARGSUSED */
 139  172  static int
 140  173  dbuf_cons(void *vdb, void *unused, int kmflag)
 141  174  {
 142  175          dmu_buf_impl_t *db = vdb;
 143  176          bzero(db, sizeof (dmu_buf_impl_t));
 144  177  
 145  178          mutex_init(&db->db_mtx, NULL, MUTEX_DEFAULT, NULL);
 146  179          cv_init(&db->db_changed, NULL, CV_DEFAULT, NULL);
 147  180          multilist_link_init(&db->db_cache_link);
 148  181          refcount_create(&db->db_holds);
 149  182  
 150  183          return (0);
 151  184  }
 152  185  
 153  186  /* ARGSUSED */
 154  187  static void
 155  188  dbuf_dest(void *vdb, void *unused)
 156  189  {
  
 157  190          dmu_buf_impl_t *db = vdb;
 158  191          mutex_destroy(&db->db_mtx);
 159  192          cv_destroy(&db->db_changed);
 160  193          ASSERT(!multilist_link_active(&db->db_cache_link));
 161  194          refcount_destroy(&db->db_holds);
 162  195  }
 163  196  
 164  197  /*
 165  198   * dbuf hash table routines
 166  199   */
      200 +#pragma align 64(dbuf_hash_table)
 167  201  static dbuf_hash_table_t dbuf_hash_table;
 168  202  
 169  203  static uint64_t dbuf_hash_count;
 170  204  
 171      -/*
 172      - * We use Cityhash for this. It's fast, and has good hash properties without
 173      - * requiring any large static buffers.
 174      - */
 175  205  static uint64_t
 176  206  dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
 177  207  {
 178      -        return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
      208 +        uintptr_t osv = (uintptr_t)os;
      209 +        uint64_t crc = -1ULL;
      210 +
      211 +        ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
      212 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl)) & 0xFF];
      213 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
      214 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
      215 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
      216 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF];
      217 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF];
      218 +
      219 +        crc ^= (osv>>14) ^ (obj>>16) ^ (blkid>>16);
      220 +
      221 +        return (crc);
 179  222  }
 180  223  
 181  224  #define DBUF_EQUAL(dbuf, os, obj, level, blkid)         \
 182  225          ((dbuf)->db.db_object == (obj) &&               \
 183  226          (dbuf)->db_objset == (os) &&                    \
 184  227          (dbuf)->db_level == (level) &&                  \
 185  228          (dbuf)->db_blkid == (blkid))
 186  229  
 187  230  dmu_buf_impl_t *
 188  231  dbuf_find(objset_t *os, uint64_t obj, uint8_t level, uint64_t blkid)
 189  232  {
 190  233          dbuf_hash_table_t *h = &dbuf_hash_table;
 191  234          uint64_t hv = dbuf_hash(os, obj, level, blkid);
 192  235          uint64_t idx = hv & h->hash_table_mask;
 193  236          dmu_buf_impl_t *db;
 194  237  
 195  238          mutex_enter(DBUF_HASH_MUTEX(h, idx));
 196  239          for (db = h->hash_table[idx]; db != NULL; db = db->db_hash_next) {
 197  240                  if (DBUF_EQUAL(db, os, obj, level, blkid)) {
 198  241                          mutex_enter(&db->db_mtx);
 199  242                          if (db->db_state != DB_EVICTING) {
 200  243                                  mutex_exit(DBUF_HASH_MUTEX(h, idx));
 201  244                                  return (db);
 202  245                          }
 203  246                          mutex_exit(&db->db_mtx);
 204  247                  }
 205  248          }
 206  249          mutex_exit(DBUF_HASH_MUTEX(h, idx));
 207  250          return (NULL);
 208  251  }
 209  252  
 210  253  static dmu_buf_impl_t *
 211  254  dbuf_find_bonus(objset_t *os, uint64_t object)
 212  255  {
 213  256          dnode_t *dn;
 214  257          dmu_buf_impl_t *db = NULL;
 215  258  
 216  259          if (dnode_hold(os, object, FTAG, &dn) == 0) {
 217  260                  rw_enter(&dn->dn_struct_rwlock, RW_READER);
 218  261                  if (dn->dn_bonus != NULL) {
 219  262                          db = dn->dn_bonus;
 220  263                          mutex_enter(&db->db_mtx);
 221  264                  }
 222  265                  rw_exit(&dn->dn_struct_rwlock);
 223  266                  dnode_rele(dn, FTAG);
 224  267          }
 225  268          return (db);
 226  269  }
 227  270  
 228  271  /*
 229  272   * Insert an entry into the hash table.  If there is already an element
 230  273   * equal to elem in the hash table, then the already existing element
 231  274   * will be returned and the new element will not be inserted.
 232  275   * Otherwise returns NULL.
 233  276   */
 234  277  static dmu_buf_impl_t *
 235  278  dbuf_hash_insert(dmu_buf_impl_t *db)
 236  279  {
 237  280          dbuf_hash_table_t *h = &dbuf_hash_table;
 238  281          objset_t *os = db->db_objset;
 239  282          uint64_t obj = db->db.db_object;
 240  283          int level = db->db_level;
 241  284          uint64_t blkid = db->db_blkid;
 242  285          uint64_t hv = dbuf_hash(os, obj, level, blkid);
 243  286          uint64_t idx = hv & h->hash_table_mask;
 244  287          dmu_buf_impl_t *dbf;
 245  288  
 246  289          mutex_enter(DBUF_HASH_MUTEX(h, idx));
 247  290          for (dbf = h->hash_table[idx]; dbf != NULL; dbf = dbf->db_hash_next) {
 248  291                  if (DBUF_EQUAL(dbf, os, obj, level, blkid)) {
 249  292                          mutex_enter(&dbf->db_mtx);
 250  293                          if (dbf->db_state != DB_EVICTING) {
 251  294                                  mutex_exit(DBUF_HASH_MUTEX(h, idx));
 252  295                                  return (dbf);
 253  296                          }
 254  297                          mutex_exit(&dbf->db_mtx);
 255  298                  }
 256  299          }
 257  300  
 258  301          mutex_enter(&db->db_mtx);
 259  302          db->db_hash_next = h->hash_table[idx];
 260  303          h->hash_table[idx] = db;
 261  304          mutex_exit(DBUF_HASH_MUTEX(h, idx));
 262  305          atomic_inc_64(&dbuf_hash_count);
 263  306  
 264  307          return (NULL);
 265  308  }
 266  309  
 267  310  /*
 268  311   * Remove an entry from the hash table.  It must be in the EVICTING state.
 269  312   */
 270  313  static void
 271  314  dbuf_hash_remove(dmu_buf_impl_t *db)
 272  315  {
 273  316          dbuf_hash_table_t *h = &dbuf_hash_table;
 274  317          uint64_t hv = dbuf_hash(db->db_objset, db->db.db_object,
 275  318              db->db_level, db->db_blkid);
 276  319          uint64_t idx = hv & h->hash_table_mask;
 277  320          dmu_buf_impl_t *dbf, **dbp;
 278  321  
 279  322          /*
  280  323           * We mustn't hold db_mtx to maintain lock ordering:
 281  324           * DBUF_HASH_MUTEX > db_mtx.
 282  325           */
 283  326          ASSERT(refcount_is_zero(&db->db_holds));
 284  327          ASSERT(db->db_state == DB_EVICTING);
 285  328          ASSERT(!MUTEX_HELD(&db->db_mtx));
 286  329  
 287  330          mutex_enter(DBUF_HASH_MUTEX(h, idx));
 288  331          dbp = &h->hash_table[idx];
 289  332          while ((dbf = *dbp) != db) {
 290  333                  dbp = &dbf->db_hash_next;
 291  334                  ASSERT(dbf != NULL);
 292  335          }
 293  336          *dbp = db->db_hash_next;
 294  337          db->db_hash_next = NULL;
 295  338          mutex_exit(DBUF_HASH_MUTEX(h, idx));
 296  339          atomic_dec_64(&dbuf_hash_count);
 297  340  }
 298  341  
 299  342  typedef enum {
 300  343          DBVU_EVICTING,
 301  344          DBVU_NOT_EVICTING
 302  345  } dbvu_verify_type_t;
 303  346  
 304  347  static void
 305  348  dbuf_verify_user(dmu_buf_impl_t *db, dbvu_verify_type_t verify_type)
 306  349  {
 307  350  #ifdef ZFS_DEBUG
 308  351          int64_t holds;
 309  352  
 310  353          if (db->db_user == NULL)
 311  354                  return;
 312  355  
 313  356          /* Only data blocks support the attachment of user data. */
 314  357          ASSERT(db->db_level == 0);
 315  358  
 316  359          /* Clients must resolve a dbuf before attaching user data. */
 317  360          ASSERT(db->db.db_data != NULL);
 318  361          ASSERT3U(db->db_state, ==, DB_CACHED);
 319  362  
 320  363          holds = refcount_count(&db->db_holds);
 321  364          if (verify_type == DBVU_EVICTING) {
 322  365                  /*
 323  366                   * Immediate eviction occurs when holds == dirtycnt.
 324  367                   * For normal eviction buffers, holds is zero on
 325  368                   * eviction, except when dbuf_fix_old_data() calls
 326  369                   * dbuf_clear_data().  However, the hold count can grow
 327  370                   * during eviction even though db_mtx is held (see
 328  371                   * dmu_bonus_hold() for an example), so we can only
 329  372                   * test the generic invariant that holds >= dirtycnt.
 330  373                   */
 331  374                  ASSERT3U(holds, >=, db->db_dirtycnt);
 332  375          } else {
 333  376                  if (db->db_user_immediate_evict == TRUE)
 334  377                          ASSERT3U(holds, >=, db->db_dirtycnt);
 335  378                  else
 336  379                          ASSERT3U(holds, >, 0);
 337  380          }
 338  381  #endif
 339  382  }
 340  383  
 341  384  static void
 342  385  dbuf_evict_user(dmu_buf_impl_t *db)
 343  386  {
 344  387          dmu_buf_user_t *dbu = db->db_user;
 345  388  
 346  389          ASSERT(MUTEX_HELD(&db->db_mtx));
 347  390  
 348  391          if (dbu == NULL)
 349  392                  return;
 350  393  
 351  394          dbuf_verify_user(db, DBVU_EVICTING);
 352  395          db->db_user = NULL;
 353  396  
 354  397  #ifdef ZFS_DEBUG
 355  398          if (dbu->dbu_clear_on_evict_dbufp != NULL)
 356  399                  *dbu->dbu_clear_on_evict_dbufp = NULL;
 357  400  #endif
 358  401  
 359  402          /*
 360  403           * There are two eviction callbacks - one that we call synchronously
 361  404           * and one that we invoke via a taskq.  The async one is useful for
 362  405           * avoiding lock order reversals and limiting stack depth.
 363  406           *
 364  407           * Note that if we have a sync callback but no async callback,
 365  408           * it's likely that the sync callback will free the structure
 366  409           * containing the dbu.  In that case we need to take care to not
 367  410           * dereference dbu after calling the sync evict func.
 368  411           */
 369  412          boolean_t has_async = (dbu->dbu_evict_func_async != NULL);
 370  413  
 371  414          if (dbu->dbu_evict_func_sync != NULL)
 372  415                  dbu->dbu_evict_func_sync(dbu);
 373  416  
 374  417          if (has_async) {
 375  418                  taskq_dispatch_ent(dbu_evict_taskq, dbu->dbu_evict_func_async,
 376  419                      dbu, 0, &dbu->dbu_tqent);
 377  420          }
 378  421  }
 379  422  
 380  423  boolean_t
 381  424  dbuf_is_metadata(dmu_buf_impl_t *db)
 382  425  {
 383  426          if (db->db_level > 0) {
 384  427                  return (B_TRUE);
 385  428          } else {
  
 386  429                  boolean_t is_metadata;
 387  430  
 388  431                  DB_DNODE_ENTER(db);
 389  432                  is_metadata = DMU_OT_IS_METADATA(DB_DNODE(db)->dn_type);
 390  433                  DB_DNODE_EXIT(db);
 391  434  
 392  435                  return (is_metadata);
 393  436          }
 394  437  }
 395  438  
      439 +boolean_t
      440 +dbuf_is_ddt(dmu_buf_impl_t *db)
      441 +{
      442 +        boolean_t is_ddt;
      443 +
      444 +        DB_DNODE_ENTER(db);
      445 +        is_ddt = (DB_DNODE(db)->dn_type == DMU_OT_DDT_ZAP) ||
      446 +            (DB_DNODE(db)->dn_type == DMU_OT_DDT_STATS);
      447 +        DB_DNODE_EXIT(db);
      448 +
      449 +        return (is_ddt);
      450 +}
      451 +
 396  452  /*
      453 + * This returns whether this dbuf should be stored in the metadata cache, which
      454 + * is based on whether it's from one of the dnode types that store data related
      455 + * to traversing dataset hierarchies.
      456 + */
      457 +static boolean_t
      458 +dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
      459 +{
      460 +        DB_DNODE_ENTER(db);
      461 +        dmu_object_type_t type = DB_DNODE(db)->dn_type;
      462 +        DB_DNODE_EXIT(db);
      463 +
      464 +        /* Check if this dbuf is one of the types we care about */
      465 +        if (DMU_OT_IS_METADATA_CACHED(type)) {
      466 +                /* If we hit this, then we set something up wrong in dmu_ot */
      467 +                ASSERT(DMU_OT_IS_METADATA(type));
      468 +
      469 +                /*
      470 +                 * Sanity check for small-memory systems: don't allocate too
      471 +                 * much memory for this purpose.
      472 +                 */
      473 +                if (refcount_count(&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
      474 +                    dbuf_metadata_cache_max_bytes) {
      475 +                        dbuf_metadata_cache_overflow++;
      476 +                        DTRACE_PROBE1(dbuf__metadata__cache__overflow,
      477 +                            dmu_buf_impl_t *, db);
      478 +                        return (B_FALSE);
      479 +                }
      480 +
      481 +                return (B_TRUE);
      482 +        }
      483 +
      484 +        return (B_FALSE);
      485 +}
      486 +
      487 +/*
 397  488   * This function *must* return indices evenly distributed between all
 398  489   * sublists of the multilist. This is needed due to how the dbuf eviction
 399  490   * code is laid out; dbuf_evict_thread() assumes dbufs are evenly
 400  491   * distributed between all sublists and uses this assumption when
 401  492   * deciding which sublist to evict from and how much to evict from it.
 402  493   */
 403  494  unsigned int
 404  495  dbuf_cache_multilist_index_func(multilist_t *ml, void *obj)
 405  496  {
 406  497          dmu_buf_impl_t *db = obj;
 407  498  
 408  499          /*
 409  500           * The assumption here, is the hash value for a given
  410  501           * dmu_buf_impl_t will remain constant throughout its lifetime
  411  502           * (i.e. its objset, object, level and blkid fields don't change).
 412  503           * Thus, we don't need to store the dbuf's sublist index
 413  504           * on insertion, as this index can be recalculated on removal.
 414  505           *
 415  506           * Also, the low order bits of the hash value are thought to be
 416  507           * distributed evenly. Otherwise, in the case that the multilist
 417  508           * has a power of two number of sublists, each sublists' usage
 418  509           * would not be evenly distributed.
 419  510           */
 420  511          return (dbuf_hash(db->db_objset, db->db.db_object,
  
 421  512              db->db_level, db->db_blkid) %
 422  513              multilist_get_num_sublists(ml));
 423  514  }
 424  515  
 425  516  static inline boolean_t
 426  517  dbuf_cache_above_hiwater(void)
 427  518  {
 428  519          uint64_t dbuf_cache_hiwater_bytes =
 429  520              (dbuf_cache_max_bytes * dbuf_cache_hiwater_pct) / 100;
 430  521  
 431      -        return (refcount_count(&dbuf_cache_size) >
      522 +        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
 432  523              dbuf_cache_max_bytes + dbuf_cache_hiwater_bytes);
 433  524  }
 434  525  
 435  526  static inline boolean_t
 436  527  dbuf_cache_above_lowater(void)
 437  528  {
 438  529          uint64_t dbuf_cache_lowater_bytes =
 439  530              (dbuf_cache_max_bytes * dbuf_cache_lowater_pct) / 100;
 440  531  
 441      -        return (refcount_count(&dbuf_cache_size) >
      532 +        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
 442  533              dbuf_cache_max_bytes - dbuf_cache_lowater_bytes);
 443  534  }
 444  535  
 445  536  /*
 446  537   * Evict the oldest eligible dbuf from the dbuf cache.
 447  538   */
 448  539  static void
 449  540  dbuf_evict_one(void)
 450  541  {
 451      -        int idx = multilist_get_random_index(dbuf_cache);
 452      -        multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
      542 +        int idx = multilist_get_random_index(dbuf_caches[DB_DBUF_CACHE].cache);
      543 +        multilist_sublist_t *mls = multilist_sublist_lock(
      544 +            dbuf_caches[DB_DBUF_CACHE].cache, idx);
 453  545  
 454  546          ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
 455  547  
 456  548          /*
 457  549           * Set the thread's tsd to indicate that it's processing evictions.
 458  550           * Once a thread stops evicting from the dbuf cache it will
 459  551           * reset its tsd to NULL.
 460  552           */
 461  553          ASSERT3P(tsd_get(zfs_dbuf_evict_key), ==, NULL);
 462  554          (void) tsd_set(zfs_dbuf_evict_key, (void *)B_TRUE);
 463  555  
 464  556          dmu_buf_impl_t *db = multilist_sublist_tail(mls);
  
 465  557          while (db != NULL && mutex_tryenter(&db->db_mtx) == 0) {
 466  558                  db = multilist_sublist_prev(mls, db);
 467  559          }
 468  560  
 469  561          DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
 470  562              multilist_sublist_t *, mls);
 471  563  
 472  564          if (db != NULL) {
 473  565                  multilist_sublist_remove(mls, db);
 474  566                  multilist_sublist_unlock(mls);
 475      -                (void) refcount_remove_many(&dbuf_cache_size,
      567 +                (void) refcount_remove_many(&dbuf_caches[DB_DBUF_CACHE].size,
 476  568                      db->db.db_size, db);
      569 +                ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
      570 +                db->db_caching_status = DB_NO_CACHE;
 477  571                  dbuf_destroy(db);
 478  572          } else {
 479  573                  multilist_sublist_unlock(mls);
 480  574          }
 481  575          (void) tsd_set(zfs_dbuf_evict_key, NULL);
 482  576  }
 483  577  
 484  578  /*
 485  579   * The dbuf evict thread is responsible for aging out dbufs from the
  486  580   * cache. Once the cache has reached its maximum size, dbufs are removed
 487  581   * and destroyed. The eviction thread will continue running until the size
 488  582   * of the dbuf cache is at or below the maximum size. Once the dbuf is aged
 489  583   * out of the cache it is destroyed and becomes eligible for arc eviction.
 490  584   */
 491  585  /* ARGSUSED */
 492  586  static void
 493  587  dbuf_evict_thread(void *unused)
 494  588  {
 495  589          callb_cpr_t cpr;
 496  590  
 497  591          CALLB_CPR_INIT(&cpr, &dbuf_evict_lock, callb_generic_cpr, FTAG);
 498  592  
 499  593          mutex_enter(&dbuf_evict_lock);
 500  594          while (!dbuf_evict_thread_exit) {
 501  595                  while (!dbuf_cache_above_lowater() && !dbuf_evict_thread_exit) {
 502  596                          CALLB_CPR_SAFE_BEGIN(&cpr);
 503  597                          (void) cv_timedwait_hires(&dbuf_evict_cv,
 504  598                              &dbuf_evict_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
 505  599                          CALLB_CPR_SAFE_END(&cpr, &dbuf_evict_lock);
 506  600                  }
 507  601                  mutex_exit(&dbuf_evict_lock);
 508  602  
 509  603                  /*
 510  604                   * Keep evicting as long as we're above the low water mark
 511  605                   * for the cache. We do this without holding the locks to
 512  606                   * minimize lock contention.
 513  607                   */
 514  608                  while (dbuf_cache_above_lowater() && !dbuf_evict_thread_exit) {
 515  609                          dbuf_evict_one();
 516  610                  }
 517  611  
 518  612                  mutex_enter(&dbuf_evict_lock);
  
 519  613          }
 520  614  
 521  615          dbuf_evict_thread_exit = B_FALSE;
 522  616          cv_broadcast(&dbuf_evict_cv);
 523  617          CALLB_CPR_EXIT(&cpr);   /* drops dbuf_evict_lock */
 524  618          thread_exit();
 525  619  }
 526  620  
 527  621  /*
 528  622   * Wake up the dbuf eviction thread if the dbuf cache is at its max size.
 529      - * If the dbuf cache is at its high water mark, then evict a dbuf from the
 530      - * dbuf cache using the callers context.
      623 + *
       624 + * Direct eviction (dbuf_evict_one()) is not called here, because
       625 + * the function doesn't care which dbuf gets selected, so the following
       626 + * scenario is possible and would cause a deadlock panic:
      627 + *
      628 + * Thread A is evicting dbufs that are related to dnodeA
       629 + * dnode_evict_dbufs(dnodeA) enters dn_dbufs_mtx and then walks
      630 + * its own AVL of dbufs and calls dbuf_destroy():
      631 + * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
      632 + *  -> select a dbuf from cache -> dbuf_destroy() ->
       633 + *   -> mutex_enter(dn_dbufs_mtx of dnodeB)
      634 + *
      635 + * Thread B is evicting dbufs that are related to dnodeB
       636 + * dnode_evict_dbufs(dnodeB) enters dn_dbufs_mtx and then walks
      637 + * its own AVL of dbufs and calls dbuf_destroy():
      638 + * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
      639 + *  -> select a dbuf from cache -> dbuf_destroy() ->
       640 + *   -> mutex_enter(dn_dbufs_mtx of dnodeA)
 531  641   */
 532  642  static void
 533  643  dbuf_evict_notify(void)
 534  644  {
 535  645  
 536  646          /*
 537  647           * We use thread specific data to track when a thread has
 538  648           * started processing evictions. This allows us to avoid deeply
 539  649           * nested stacks that would have a call flow similar to this:
 540  650           *
 541  651           * dbuf_rele()-->dbuf_rele_and_unlock()-->dbuf_evict_notify()
 542  652           *      ^                                               |
 543  653           *      |                                               |
 544  654           *      +-----dbuf_destroy()<--dbuf_evict_one()<--------+
 545  655           *
 546  656           * The dbuf_eviction_thread will always have its tsd set until
 547  657           * that thread exits. All other threads will only set their tsd
 548  658           * if they are participating in the eviction process. This only
 549  659           * happens if the eviction thread is unable to process evictions
 550  660           * fast enough. To keep the dbuf cache size in check, other threads
 551  661           * can evict from the dbuf cache directly. Those threads will set
 552  662           * their tsd values so that we ensure that they only evict one dbuf
  
 553  663           * from the dbuf cache.
 554  664           */
 555  665          if (tsd_get(zfs_dbuf_evict_key) != NULL)
 556  666                  return;
 557  667  
 558  668          /*
 559  669           * We check if we should evict without holding the dbuf_evict_lock,
 560  670           * because it's OK to occasionally make the wrong decision here,
 561  671           * and grabbing the lock results in massive lock contention.
 562  672           */
 563      -        if (refcount_count(&dbuf_cache_size) > dbuf_cache_max_bytes) {
      673 +        if (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
      674 +            dbuf_cache_max_bytes) {
 564  675                  if (dbuf_cache_above_hiwater())
 565  676                          dbuf_evict_one();
 566  677                  cv_signal(&dbuf_evict_cv);
 567  678          }
 568  679  }
 569  680  
 570  681  void
 571  682  dbuf_init(void)
 572  683  {
 573  684          uint64_t hsize = 1ULL << 16;
 574  685          dbuf_hash_table_t *h = &dbuf_hash_table;
 575  686          int i;
 576  687  
 577  688          /*
 578  689           * The hash table is big enough to fill all of physical memory
 579  690           * with an average 4K block size.  The table will take up
 580  691           * totalmem*sizeof(void*)/4K (i.e. 2MB/GB with 8-byte pointers).
 581  692           */
 582  693          while (hsize * 4096 < physmem * PAGESIZE)
 583  694                  hsize <<= 1;
 584  695  
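
For instance, on a system with 8 GiB of physical memory (an illustrative figure), the loop above doubles hsize from 2^16 until hsize * 4096 reaches 8 GiB, giving hsize = 2^21 buckets, i.e. 16 MiB of 8-byte pointers, which matches the 2MB/GB estimate in the comment.
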
 585  696  retry:
 586  697          h->hash_table_mask = hsize - 1;
 587  698          h->hash_table = kmem_zalloc(hsize * sizeof (void *), KM_NOSLEEP);
 588  699          if (h->hash_table == NULL) {
 589  700                  /* XXX - we should really return an error instead of assert */
  
 590  701                  ASSERT(hsize > (1ULL << 10));
 591  702                  hsize >>= 1;
 592  703                  goto retry;
 593  704          }
 594  705  
 595  706          dbuf_kmem_cache = kmem_cache_create("dmu_buf_impl_t",
 596  707              sizeof (dmu_buf_impl_t),
 597  708              0, dbuf_cons, dbuf_dest, NULL, NULL, NULL, 0);
 598  709  
 599  710          for (i = 0; i < DBUF_MUTEXES; i++)
 600      -                mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);
      711 +                mutex_init(DBUF_HASH_MUTEX(h, i), NULL, MUTEX_DEFAULT, NULL);
 601  712  
      713 +
 602  714          /*
 603      -         * Setup the parameters for the dbuf cache. We cap the size of the
 604      -         * dbuf cache to 1/32nd (default) of the size of the ARC.
      715 +         * Setup the parameters for the dbuf caches. We set the sizes of the
      716 +         * dbuf cache and the metadata cache to 1/32nd and 1/16th (default)
      717 +         * of the size of the ARC, respectively.
 605  718           */
 606      -        dbuf_cache_max_bytes = MIN(dbuf_cache_max_bytes,
 607      -            arc_max_bytes() >> dbuf_cache_max_shift);
      719 +        if (dbuf_cache_max_bytes == 0 ||
      720 +            dbuf_cache_max_bytes >= arc_max_bytes())  {
      721 +                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
      722 +        }
      723 +        if (dbuf_metadata_cache_max_bytes == 0 ||
      724 +            dbuf_metadata_cache_max_bytes >= arc_max_bytes()) {
      725 +                dbuf_metadata_cache_max_bytes =
      726 +                    arc_max_bytes() >> dbuf_metadata_cache_shift;
      727 +        }
 608  728  
 609  729          /*
       730 +         * The combined size of both caches should be less than
       731 +         * the size of the ARC; otherwise, reset both caches to
       732 +         * their default values.
       733 +         *
       734 +         * The divide by 2 is simple overflow protection.
      735 +         */
      736 +        if (((dbuf_cache_max_bytes / 2) +
      737 +            (dbuf_metadata_cache_max_bytes / 2)) >= (arc_max_bytes() / 2)) {
      738 +                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
      739 +                dbuf_metadata_cache_max_bytes =
      740 +                    arc_max_bytes() >> dbuf_metadata_cache_shift;
      741 +        }
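
As a concrete illustration (numbers are only an example): with arc_max_bytes() of 4 GiB and the default shifts of 5 and 6, dbuf_cache_max_bytes comes out to 128 MiB and dbuf_metadata_cache_max_bytes to 64 MiB; half of each sums to 96 MiB, well below half of the ARC, so the sanity check above leaves those values in place.
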
      742 +
      743 +
      744 +        /*
 610  745           * All entries are queued via taskq_dispatch_ent(), so min/maxalloc
 611  746           * configuration is not required.
 612  747           */
 613  748          dbu_evict_taskq = taskq_create("dbu_evict", 1, minclsyspri, 0, 0, 0);
 614  749  
 615      -        dbuf_cache = multilist_create(sizeof (dmu_buf_impl_t),
 616      -            offsetof(dmu_buf_impl_t, db_cache_link),
 617      -            dbuf_cache_multilist_index_func);
 618      -        refcount_create(&dbuf_cache_size);
      750 +        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
      751 +                dbuf_caches[dcs].cache =
      752 +                    multilist_create(sizeof (dmu_buf_impl_t),
      753 +                    offsetof(dmu_buf_impl_t, db_cache_link),
      754 +                    dbuf_cache_multilist_index_func);
      755 +                refcount_create(&dbuf_caches[dcs].size);
      756 +        }
 619  757  
 620  758          tsd_create(&zfs_dbuf_evict_key, NULL);
 621  759          dbuf_evict_thread_exit = B_FALSE;
 622  760          mutex_init(&dbuf_evict_lock, NULL, MUTEX_DEFAULT, NULL);
 623  761          cv_init(&dbuf_evict_cv, NULL, CV_DEFAULT, NULL);
 624  762          dbuf_cache_evict_thread = thread_create(NULL, 0, dbuf_evict_thread,
 625  763              NULL, 0, &p0, TS_RUN, minclsyspri);
 626  764  }
 627  765  
 628  766  void
 629  767  dbuf_fini(void)
 630  768  {
 631  769          dbuf_hash_table_t *h = &dbuf_hash_table;
 632  770          int i;
 633  771  
 634  772          for (i = 0; i < DBUF_MUTEXES; i++)
 635      -                mutex_destroy(&h->hash_mutexes[i]);
      773 +                mutex_destroy(DBUF_HASH_MUTEX(h, i));
 636  774          kmem_free(h->hash_table, (h->hash_table_mask + 1) * sizeof (void *));
 637  775          kmem_cache_destroy(dbuf_kmem_cache);
 638  776          taskq_destroy(dbu_evict_taskq);
 639  777  
 640  778          mutex_enter(&dbuf_evict_lock);
 641  779          dbuf_evict_thread_exit = B_TRUE;
 642  780          while (dbuf_evict_thread_exit) {
 643  781                  cv_signal(&dbuf_evict_cv);
 644  782                  cv_wait(&dbuf_evict_cv, &dbuf_evict_lock);
 645  783          }
 646  784          mutex_exit(&dbuf_evict_lock);
 647  785          tsd_destroy(&zfs_dbuf_evict_key);
 648  786  
 649  787          mutex_destroy(&dbuf_evict_lock);
 650  788          cv_destroy(&dbuf_evict_cv);
 651  789  
 652      -        refcount_destroy(&dbuf_cache_size);
 653      -        multilist_destroy(dbuf_cache);
      790 +        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
      791 +                refcount_destroy(&dbuf_caches[dcs].size);
      792 +                multilist_destroy(dbuf_caches[dcs].cache);
      793 +        }
 654  794  }
 655  795  
 656  796  /*
 657  797   * Other stuff.
 658  798   */
 659  799  
 660  800  #ifdef ZFS_DEBUG
 661  801  static void
 662  802  dbuf_verify(dmu_buf_impl_t *db)
 663  803  {
 664  804          dnode_t *dn;
 665  805          dbuf_dirty_record_t *dr;
 666  806  
 667  807          ASSERT(MUTEX_HELD(&db->db_mtx));
 668  808  
 669  809          if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY))
 670  810                  return;
 671  811  
 672  812          ASSERT(db->db_objset != NULL);
 673  813          DB_DNODE_ENTER(db);
 674  814          dn = DB_DNODE(db);
 675  815          if (dn == NULL) {
 676  816                  ASSERT(db->db_parent == NULL);
 677  817                  ASSERT(db->db_blkptr == NULL);
 678  818          } else {
 679  819                  ASSERT3U(db->db.db_object, ==, dn->dn_object);
 680  820                  ASSERT3P(db->db_objset, ==, dn->dn_objset);
 681  821                  ASSERT3U(db->db_level, <, dn->dn_nlevels);
 682  822                  ASSERT(db->db_blkid == DMU_BONUS_BLKID ||
 683  823                      db->db_blkid == DMU_SPILL_BLKID ||
 684  824                      !avl_is_empty(&dn->dn_dbufs));
 685  825          }
 686  826          if (db->db_blkid == DMU_BONUS_BLKID) {
 687  827                  ASSERT(dn != NULL);
 688  828                  ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
 689  829                  ASSERT3U(db->db.db_offset, ==, DMU_BONUS_BLKID);
 690  830          } else if (db->db_blkid == DMU_SPILL_BLKID) {
 691  831                  ASSERT(dn != NULL);
 692  832                  ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
 693  833                  ASSERT0(db->db.db_offset);
 694  834          } else {
 695  835                  ASSERT3U(db->db.db_offset, ==, db->db_blkid * db->db.db_size);
 696  836          }
 697  837  
 698  838          for (dr = db->db_data_pending; dr != NULL; dr = dr->dr_next)
 699  839                  ASSERT(dr->dr_dbuf == db);
 700  840  
 701  841          for (dr = db->db_last_dirty; dr != NULL; dr = dr->dr_next)
 702  842                  ASSERT(dr->dr_dbuf == db);
 703  843  
 704  844          /*
 705  845           * We can't assert that db_size matches dn_datablksz because it
 706  846           * can be momentarily different when another thread is doing
 707  847           * dnode_set_blksz().
 708  848           */
 709  849          if (db->db_level == 0 && db->db.db_object == DMU_META_DNODE_OBJECT) {
 710  850                  dr = db->db_data_pending;
 711  851                  /*
 712  852                   * It should only be modified in syncing context, so
 713  853                   * make sure we only have one copy of the data.
 714  854                   */
 715  855                  ASSERT(dr == NULL || dr->dt.dl.dr_data == db->db_buf);
 716  856          }
 717  857  
 718  858          /* verify db->db_blkptr */
 719  859          if (db->db_blkptr) {
 720  860                  if (db->db_parent == dn->dn_dbuf) {
 721  861                          /* db is pointed to by the dnode */
 722  862                          /* ASSERT3U(db->db_blkid, <, dn->dn_nblkptr); */
 723  863                          if (DMU_OBJECT_IS_SPECIAL(db->db.db_object))
 724  864                                  ASSERT(db->db_parent == NULL);
 725  865                          else
 726  866                                  ASSERT(db->db_parent != NULL);
 727  867                          if (db->db_blkid != DMU_SPILL_BLKID)
 728  868                                  ASSERT3P(db->db_blkptr, ==,
 729  869                                      &dn->dn_phys->dn_blkptr[db->db_blkid]);
 730  870                  } else {
 731  871                          /* db is pointed to by an indirect block */
 732  872                          int epb = db->db_parent->db.db_size >> SPA_BLKPTRSHIFT;
 733  873                          ASSERT3U(db->db_parent->db_level, ==, db->db_level+1);
 734  874                          ASSERT3U(db->db_parent->db.db_object, ==,
 735  875                              db->db.db_object);
 736  876                          /*
 737  877                           * dnode_grow_indblksz() can make this fail if we don't
 738  878                           * have the struct_rwlock.  XXX indblksz no longer
 739  879                           * grows.  safe to do this now?
 740  880                           */
 741  881                          if (RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
 742  882                                  ASSERT3P(db->db_blkptr, ==,
 743  883                                      ((blkptr_t *)db->db_parent->db.db_data +
 744  884                                      db->db_blkid % epb));
 745  885                          }
 746  886                  }
 747  887          }
 748  888          if ((db->db_blkptr == NULL || BP_IS_HOLE(db->db_blkptr)) &&
 749  889              (db->db_buf == NULL || db->db_buf->b_data) &&
 750  890              db->db.db_data && db->db_blkid != DMU_BONUS_BLKID &&
 751  891              db->db_state != DB_FILL && !dn->dn_free_txg) {
 752  892                  /*
 753  893                   * If the blkptr isn't set but they have nonzero data,
 754  894                   * it had better be dirty, otherwise we'll lose that
 755  895                   * data when we evict this buffer.
 756  896                   *
 757  897                   * There is an exception to this rule for indirect blocks; in
 758  898                   * this case, if the indirect block is a hole, we fill in a few
 759  899                   * fields on each of the child blocks (importantly, birth time)
 760  900                   * to prevent hole birth times from being lost when you
 761  901                   * partially fill in a hole.
 762  902                   */
 763  903                  if (db->db_dirtycnt == 0) {
 764  904                          if (db->db_level == 0) {
 765  905                                  uint64_t *buf = db->db.db_data;
 766  906                                  int i;
 767  907  
 768  908                                  for (i = 0; i < db->db.db_size >> 3; i++) {
 769  909                                          ASSERT(buf[i] == 0);
 770  910                                  }
 771  911                          } else {
 772  912                                  blkptr_t *bps = db->db.db_data;
 773  913                                  ASSERT3U(1 << DB_DNODE(db)->dn_indblkshift, ==,
 774  914                                      db->db.db_size);
 775  915                                  /*
 776  916                                   * We want to verify that all the blkptrs in the
 777  917                                   * indirect block are holes, but we may have
 778  918                                   * automatically set up a few fields for them.
 779  919                                   * We iterate through each blkptr and verify
 780  920                                   * they only have those fields set.
 781  921                                   */
 782  922                                  for (int i = 0;
 783  923                                      i < db->db.db_size / sizeof (blkptr_t);
 784  924                                      i++) {
 785  925                                          blkptr_t *bp = &bps[i];
 786  926                                          ASSERT(ZIO_CHECKSUM_IS_ZERO(
 787  927                                              &bp->blk_cksum));
 788  928                                          ASSERT(
 789  929                                              DVA_IS_EMPTY(&bp->blk_dva[0]) &&
 790  930                                              DVA_IS_EMPTY(&bp->blk_dva[1]) &&
 791  931                                              DVA_IS_EMPTY(&bp->blk_dva[2]));
 792  932                                          ASSERT0(bp->blk_fill);
 793  933                                          ASSERT0(bp->blk_pad[0]);
 794  934                                          ASSERT0(bp->blk_pad[1]);
 795  935                                          ASSERT(!BP_IS_EMBEDDED(bp));
 796  936                                          ASSERT(BP_IS_HOLE(bp));
 797  937                                          ASSERT0(bp->blk_phys_birth);
 798  938                                  }
 799  939                          }
 800  940                  }
 801  941          }
 802  942          DB_DNODE_EXIT(db);
 803  943  }
 804  944  #endif
 805  945  
 806  946  static void
 807  947  dbuf_clear_data(dmu_buf_impl_t *db)
 808  948  {
 809  949          ASSERT(MUTEX_HELD(&db->db_mtx));
 810  950          dbuf_evict_user(db);
 811  951          ASSERT3P(db->db_buf, ==, NULL);
 812  952          db->db.db_data = NULL;
 813  953          if (db->db_state != DB_NOFILL)
 814  954                  db->db_state = DB_UNCACHED;
 815  955  }
 816  956  
 817  957  static void
 818  958  dbuf_set_data(dmu_buf_impl_t *db, arc_buf_t *buf)
 819  959  {
 820  960          ASSERT(MUTEX_HELD(&db->db_mtx));
 821  961          ASSERT(buf != NULL);
 822  962  
 823  963          db->db_buf = buf;
 824  964          ASSERT(buf->b_data != NULL);
 825  965          db->db.db_data = buf->b_data;
 826  966  }
 827  967  
 828  968  /*
 829  969   * Loan out an arc_buf for read.  Return the loaned arc_buf.
 830  970   */
 831  971  arc_buf_t *
 832  972  dbuf_loan_arcbuf(dmu_buf_impl_t *db)
 833  973  {
 834  974          arc_buf_t *abuf;
 835  975  
 836  976          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
 837  977          mutex_enter(&db->db_mtx);
 838  978          if (arc_released(db->db_buf) || refcount_count(&db->db_holds) > 1) {
 839  979                  int blksz = db->db.db_size;
 840  980                  spa_t *spa = db->db_objset->os_spa;
 841  981  
 842  982                  mutex_exit(&db->db_mtx);
 843  983                  abuf = arc_loan_buf(spa, B_FALSE, blksz);
 844  984                  bcopy(db->db.db_data, abuf->b_data, blksz);
 845  985          } else {
 846  986                  abuf = db->db_buf;
 847  987                  arc_loan_inuse_buf(abuf, db);
 848  988                  db->db_buf = NULL;
 849  989                  dbuf_clear_data(db);
 850  990                  mutex_exit(&db->db_mtx);
 851  991          }
 852  992          return (abuf);
 853  993  }
 854  994  
 855  995  /*
 856  996   * Calculate which level n block references the data at the level 0 offset
 857  997   * provided.
 858  998   */
 859  999  uint64_t
 860 1000  dbuf_whichblock(dnode_t *dn, int64_t level, uint64_t offset)
 861 1001  {
 862 1002          if (dn->dn_datablkshift != 0 && dn->dn_indblkshift != 0) {
 863 1003                  /*
 864 1004                   * The level n blkid is equal to the level 0 blkid divided by
 865 1005                   * the number of level 0s in a level n block.
 866 1006                   *
 867 1007                   * The level 0 blkid is offset >> datablkshift =
 868 1008                   * offset / 2^datablkshift.
 869 1009                   *
 870 1010                   * The number of level 0s in a level n block is the number of block
 871 1011                   * pointers in an indirect block, raised to the power of level.
 872 1012                   * This is 2^(indblkshift - SPA_BLKPTRSHIFT)^level =
 873 1013                   * 2^(level*(indblkshift - SPA_BLKPTRSHIFT)).
 874 1014                   *
 875 1015                   * Thus, the level n blkid is: offset /
 876 1016                   * ((2^datablkshift)*(2^(level*(indblkshift - SPA_BLKPTRSHIFT)))
 877 1017                   * = offset / 2^(datablkshift + level *
 878 1018                   *   (indblkshift - SPA_BLKPTRSHIFT))
 879 1019                   * = offset >> (datablkshift + level *
 880 1020                   *   (indblkshift - SPA_BLKPTRSHIFT))
 881 1021                   */
 882 1022                  return (offset >> (dn->dn_datablkshift + level *
 883 1023                      (dn->dn_indblkshift - SPA_BLKPTRSHIFT)));
 884 1024          } else {
 885 1025                  ASSERT3U(offset, <, dn->dn_datablksz);
 886 1026                  return (0);
 887 1027          }
 888 1028  }
 889 1029  
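A worked sketch of the arithmetic above, assuming typical 128K data and indirect blocks (datablkshift = indblkshift = 17) and SPA_BLKPTRSHIFT = 7, i.e. 1024 block pointers per indirect block; the values are illustrative only:

        /* Which level-1 and level-2 blkids cover byte offset 5 GiB? */
        uint64_t offset = 5ULL << 30;                   /* 5 GiB */
        uint64_t l0 = offset >> 17;                     /* level-0 blkid = 40960 */
        uint64_t l1 = offset >> (17 + 1 * (17 - 7));    /* 40960 / 1024 = 40 */
        uint64_t l2 = offset >> (17 + 2 * (17 - 7));    /* 40 / 1024 = 0 */

These correspond to dbuf_whichblock(dn, 1, offset) and dbuf_whichblock(dn, 2, offset) for a dnode with those shifts.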
 890 1030  static void
 891 1031  dbuf_read_done(zio_t *zio, arc_buf_t *buf, void *vdb)
 892 1032  {
 893 1033          dmu_buf_impl_t *db = vdb;
 894 1034  
 895 1035          mutex_enter(&db->db_mtx);
 896 1036          ASSERT3U(db->db_state, ==, DB_READ);
 897 1037          /*
 898 1038           * All reads are synchronous, so we must have a hold on the dbuf
 899 1039           */
 900 1040          ASSERT(refcount_count(&db->db_holds) > 0);
 901 1041          ASSERT(db->db_buf == NULL);
 902 1042          ASSERT(db->db.db_data == NULL);
 903 1043          if (db->db_level == 0 && db->db_freed_in_flight) {
 904 1044                  /* we were freed in flight; disregard any error */
 905 1045                  arc_release(buf, db);
 906 1046                  bzero(buf->b_data, db->db.db_size);
 907 1047                  arc_buf_freeze(buf);
 908 1048                  db->db_freed_in_flight = FALSE;
 909 1049                  dbuf_set_data(db, buf);
 910 1050                  db->db_state = DB_CACHED;
 911 1051          } else if (zio == NULL || zio->io_error == 0) {
 912 1052                  dbuf_set_data(db, buf);
 913 1053                  db->db_state = DB_CACHED;
 914 1054          } else {
 915 1055                  ASSERT(db->db_blkid != DMU_BONUS_BLKID);
 916 1056                  ASSERT3P(db->db_buf, ==, NULL);
 917 1057                  arc_buf_destroy(buf, db);
 918 1058                  db->db_state = DB_UNCACHED;
 919 1059          }
 920 1060          cv_broadcast(&db->db_changed);
 921 1061          dbuf_rele_and_unlock(db, NULL);
 922 1062  }
 923 1063  
 924 1064  static void
 925 1065  dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags)
 926 1066  {
 927 1067          dnode_t *dn;
 928 1068          zbookmark_phys_t zb;
 929 1069          arc_flags_t aflags = ARC_FLAG_NOWAIT;
 930 1070  
 931 1071          DB_DNODE_ENTER(db);
 932 1072          dn = DB_DNODE(db);
 933 1073          ASSERT(!refcount_is_zero(&db->db_holds));
 934 1074          /* We need the struct_rwlock to prevent db_blkptr from changing. */
 935 1075          ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
 936 1076          ASSERT(MUTEX_HELD(&db->db_mtx));
 937 1077          ASSERT(db->db_state == DB_UNCACHED);
 938 1078          ASSERT(db->db_buf == NULL);
 939 1079  
 940 1080          if (db->db_blkid == DMU_BONUS_BLKID) {
 941 1081                  int bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);
 942 1082  
 943 1083                  ASSERT3U(bonuslen, <=, db->db.db_size);
 944 1084                  db->db.db_data = zio_buf_alloc(DN_MAX_BONUSLEN);
 945 1085                  arc_space_consume(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
 946 1086                  if (bonuslen < DN_MAX_BONUSLEN)
 947 1087                          bzero(db->db.db_data, DN_MAX_BONUSLEN);
 948 1088                  if (bonuslen)
 949 1089                          bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);
 950 1090                  DB_DNODE_EXIT(db);
 951 1091                  db->db_state = DB_CACHED;
 952 1092                  mutex_exit(&db->db_mtx);
 953 1093                  return;
 954 1094          }
 955 1095  
 956 1096          /*
 957 1097           * Recheck BP_IS_HOLE() after dnode_block_freed() in case dnode_sync()
 958 1098           * processes the delete record and clears the bp while we are waiting
 959 1099           * for the dn_mtx (resulting in a "no" from block_freed).
 960 1100           */
 961 1101          if (db->db_blkptr == NULL || BP_IS_HOLE(db->db_blkptr) ||
 962 1102              (db->db_level == 0 && (dnode_block_freed(dn, db->db_blkid) ||
 963 1103              BP_IS_HOLE(db->db_blkptr)))) {
 964 1104                  arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
 965 1105  
 966 1106                  dbuf_set_data(db, arc_alloc_buf(db->db_objset->os_spa, db, type,
 967 1107                      db->db.db_size));
 968 1108                  bzero(db->db.db_data, db->db.db_size);
 969 1109  
 970 1110                  if (db->db_blkptr != NULL && db->db_level > 0 &&
 971 1111                      BP_IS_HOLE(db->db_blkptr) &&
 972 1112                      db->db_blkptr->blk_birth != 0) {
 973 1113                          blkptr_t *bps = db->db.db_data;
 974 1114                          for (int i = 0; i < ((1 <<
 975 1115                              DB_DNODE(db)->dn_indblkshift) / sizeof (blkptr_t));
 976 1116                              i++) {
 977 1117                                  blkptr_t *bp = &bps[i];
 978 1118                                  ASSERT3U(BP_GET_LSIZE(db->db_blkptr), ==,
 979 1119                                      1 << dn->dn_indblkshift);
 980 1120                                  BP_SET_LSIZE(bp,
 981 1121                                      BP_GET_LEVEL(db->db_blkptr) == 1 ?
 982 1122                                      dn->dn_datablksz :
 983 1123                                      BP_GET_LSIZE(db->db_blkptr));
 984 1124                                  BP_SET_TYPE(bp, BP_GET_TYPE(db->db_blkptr));
 985 1125                                  BP_SET_LEVEL(bp,
 986 1126                                      BP_GET_LEVEL(db->db_blkptr) - 1);
 987 1127                                  BP_SET_BIRTH(bp, db->db_blkptr->blk_birth, 0);
 988 1128                          }
 989 1129                  }
 990 1130                  DB_DNODE_EXIT(db);
 991 1131                  db->db_state = DB_CACHED;
 992 1132                  mutex_exit(&db->db_mtx);
 993 1133                  return;
 994 1134          }
 995 1135  
 996 1136          DB_DNODE_EXIT(db);
 997 1137  
 998 1138          db->db_state = DB_READ;
 999 1139          mutex_exit(&db->db_mtx);
1000 1140  
1001 1141          if (DBUF_IS_L2CACHEABLE(db))
1002 1142                  aflags |= ARC_FLAG_L2CACHE;
1003 1143  
1004 1144          SET_BOOKMARK(&zb, db->db_objset->os_dsl_dataset ?
1005 1145              db->db_objset->os_dsl_dataset->ds_object : DMU_META_OBJSET,
1006 1146              db->db.db_object, db->db_level, db->db_blkid);
1007 1147  
1008 1148          dbuf_add_ref(db, NULL);
1009 1149  
1010 1150          (void) arc_read(zio, db->db_objset->os_spa, db->db_blkptr,
1011 1151              dbuf_read_done, db, ZIO_PRIORITY_SYNC_READ,
1012 1152              (flags & DB_RF_CANFAIL) ? ZIO_FLAG_CANFAIL : ZIO_FLAG_MUSTSUCCEED,
1013 1153              &aflags, &zb);
1014 1154  }
1015 1155  
1016 1156  /*
1017 1157   * This is our just-in-time copy function.  It makes a copy of buffers that
1018 1158   * have been modified in a previous transaction group before we access them in
1019 1159   * the current active group.
1020 1160   *
1021 1161   * This function is used in three places: when we are dirtying a buffer for the
1022 1162   * first time in a txg, when we are freeing a range in a dnode that includes
1023 1163   * this buffer, and when we are accessing a buffer which was received compressed
1024 1164   * and later referenced in a WRITE_BYREF record.
1025 1165   *
1026 1166   * Note that when we are called from dbuf_free_range() we do not put a hold on
1027 1167   * the buffer, we just traverse the active dbuf list for the dnode.
1028 1168   */
1029 1169  static void
1030 1170  dbuf_fix_old_data(dmu_buf_impl_t *db, uint64_t txg)
1031 1171  {
1032 1172          dbuf_dirty_record_t *dr = db->db_last_dirty;
1033 1173  
1034 1174          ASSERT(MUTEX_HELD(&db->db_mtx));
1035 1175          ASSERT(db->db.db_data != NULL);
1036 1176          ASSERT(db->db_level == 0);
1037 1177          ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);
1038 1178  
1039 1179          if (dr == NULL ||
1040 1180              (dr->dt.dl.dr_data !=
1041 1181              ((db->db_blkid  == DMU_BONUS_BLKID) ? db->db.db_data : db->db_buf)))
1042 1182                  return;
1043 1183  
1044 1184          /*
1045 1185           * If the last dirty record for this dbuf has not yet synced
 1046 1186           * and it's referencing the dbuf data, either:
 1047 1187           *      reset the reference to point to a new copy,
 1048 1188           * or (if there are no active holders)
1049 1189           *      just null out the current db_data pointer.
1050 1190           */
1051 1191          ASSERT(dr->dr_txg >= txg - 2);
1052 1192          if (db->db_blkid == DMU_BONUS_BLKID) {
1053 1193                  /* Note that the data bufs here are zio_bufs */
1054 1194                  dr->dt.dl.dr_data = zio_buf_alloc(DN_MAX_BONUSLEN);
1055 1195                  arc_space_consume(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
1056 1196                  bcopy(db->db.db_data, dr->dt.dl.dr_data, DN_MAX_BONUSLEN);
1057 1197          } else if (refcount_count(&db->db_holds) > db->db_dirtycnt) {
1058 1198                  int size = arc_buf_size(db->db_buf);
1059 1199                  arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
1060 1200                  spa_t *spa = db->db_objset->os_spa;
1061 1201                  enum zio_compress compress_type =
1062 1202                      arc_get_compression(db->db_buf);
1063 1203  
1064 1204                  if (compress_type == ZIO_COMPRESS_OFF) {
1065 1205                          dr->dt.dl.dr_data = arc_alloc_buf(spa, db, type, size);
1066 1206                  } else {
1067 1207                          ASSERT3U(type, ==, ARC_BUFC_DATA);
1068 1208                          dr->dt.dl.dr_data = arc_alloc_compressed_buf(spa, db,
1069 1209                              size, arc_buf_lsize(db->db_buf), compress_type);
1070 1210                  }
1071 1211                  bcopy(db->db.db_data, dr->dt.dl.dr_data->b_data, size);
1072 1212          } else {
1073 1213                  db->db_buf = NULL;
1074 1214                  dbuf_clear_data(db);
1075 1215          }
1076 1216  }
1077 1217  
1078 1218  int
1079 1219  dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags)
1080 1220  {
1081 1221          int err = 0;
1082 1222          boolean_t prefetch;
1083 1223          dnode_t *dn;
1084 1224  
1085 1225          /*
1086 1226           * We don't have to hold the mutex to check db_state because it
1087 1227           * can't be freed while we have a hold on the buffer.
1088 1228           */
1089 1229          ASSERT(!refcount_is_zero(&db->db_holds));
1090 1230  
1091 1231          if (db->db_state == DB_NOFILL)
1092 1232                  return (SET_ERROR(EIO));
1093 1233  
1094 1234          DB_DNODE_ENTER(db);
1095 1235          dn = DB_DNODE(db);
1096 1236          if ((flags & DB_RF_HAVESTRUCT) == 0)
1097 1237                  rw_enter(&dn->dn_struct_rwlock, RW_READER);
1098 1238  
1099 1239          prefetch = db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
1100 1240              (flags & DB_RF_NOPREFETCH) == 0 && dn != NULL &&
1101 1241              DBUF_IS_CACHEABLE(db);
1102 1242  
1103 1243          mutex_enter(&db->db_mtx);
1104 1244          if (db->db_state == DB_CACHED) {
1105 1245                  /*
1106 1246                   * If the arc buf is compressed, we need to decompress it to
1107 1247                   * read the data. This could happen during the "zfs receive" of
1108 1248                   * a stream which is compressed and deduplicated.
1109 1249                   */
1110 1250                  if (db->db_buf != NULL &&
1111 1251                      arc_get_compression(db->db_buf) != ZIO_COMPRESS_OFF) {
1112 1252                          dbuf_fix_old_data(db,
1113 1253                              spa_syncing_txg(dmu_objset_spa(db->db_objset)));
1114 1254                          err = arc_decompress(db->db_buf);
1115 1255                          dbuf_set_data(db, db->db_buf);
1116 1256                  }
1117 1257                  mutex_exit(&db->db_mtx);
1118 1258                  if (prefetch)
1119 1259                          dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE);
1120 1260                  if ((flags & DB_RF_HAVESTRUCT) == 0)
1121 1261                          rw_exit(&dn->dn_struct_rwlock);
1122 1262                  DB_DNODE_EXIT(db);
1123 1263          } else if (db->db_state == DB_UNCACHED) {
1124 1264                  spa_t *spa = dn->dn_objset->os_spa;
1125 1265                  boolean_t need_wait = B_FALSE;
1126 1266  
1127 1267                  if (zio == NULL &&
1128 1268                      db->db_blkptr != NULL && !BP_IS_HOLE(db->db_blkptr)) {
1129 1269                          zio = zio_root(spa, NULL, NULL, ZIO_FLAG_CANFAIL);
1130 1270                          need_wait = B_TRUE;
1131 1271                  }
1132 1272                  dbuf_read_impl(db, zio, flags);
1133 1273  
1134 1274                  /* dbuf_read_impl has dropped db_mtx for us */
1135 1275  
1136 1276                  if (prefetch)
1137 1277                          dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE);
1138 1278  
1139 1279                  if ((flags & DB_RF_HAVESTRUCT) == 0)
1140 1280                          rw_exit(&dn->dn_struct_rwlock);
1141 1281                  DB_DNODE_EXIT(db);
1142 1282  
1143 1283                  if (need_wait)
1144 1284                          err = zio_wait(zio);
1145 1285          } else {
1146 1286                  /*
1147 1287                   * Another reader came in while the dbuf was in flight
1148 1288                   * between UNCACHED and CACHED.  Either a writer will finish
1149 1289                   * writing the buffer (sending the dbuf to CACHED) or the
1150 1290                   * first reader's request will reach the read_done callback
1151 1291                   * and send the dbuf to CACHED.  Otherwise, a failure
1152 1292                   * occurred and the dbuf went to UNCACHED.
1153 1293                   */
1154 1294                  mutex_exit(&db->db_mtx);
1155 1295                  if (prefetch)
1156 1296                          dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE);
1157 1297                  if ((flags & DB_RF_HAVESTRUCT) == 0)
1158 1298                          rw_exit(&dn->dn_struct_rwlock);
1159 1299                  DB_DNODE_EXIT(db);
1160 1300  
1161 1301                  /* Skip the wait per the caller's request. */
1162 1302                  mutex_enter(&db->db_mtx);
1163 1303                  if ((flags & DB_RF_NEVERWAIT) == 0) {
1164 1304                          while (db->db_state == DB_READ ||
1165 1305                              db->db_state == DB_FILL) {
1166 1306                                  ASSERT(db->db_state == DB_READ ||
1167 1307                                      (flags & DB_RF_HAVESTRUCT) == 0);
1168 1308                                  DTRACE_PROBE2(blocked__read, dmu_buf_impl_t *,
1169 1309                                      db, zio_t *, zio);
1170 1310                                  cv_wait(&db->db_changed, &db->db_mtx);
1171 1311                          }
1172 1312                          if (db->db_state == DB_UNCACHED)
1173 1313                                  err = SET_ERROR(EIO);
1174 1314                  }
1175 1315                  mutex_exit(&db->db_mtx);
1176 1316          }
1177 1317  
1178 1318          return (err);
1179 1319  }
1180 1320  
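A minimal, hypothetical caller sketch (not part of this diff) showing the usual hold-then-read pattern and the DB_RF_CANFAIL / DB_RF_NOPREFETCH flags; the surrounding setup and error policy are assumptions:

        dmu_buf_impl_t *db = dbuf_hold(dn, blkid, FTAG);        /* take a hold */
        if (db == NULL)
                return (SET_ERROR(EIO));
        int err = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
        if (err != 0) {
                dbuf_rele(db, FTAG);                             /* drop the hold on failure */
                return (err);
        }

With DB_RF_CANFAIL the underlying arc_read() is issued with ZIO_FLAG_CANFAIL, so read errors propagate back to the caller rather than being issued as ZIO_FLAG_MUSTSUCCEED.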
1181 1321  static void
1182 1322  dbuf_noread(dmu_buf_impl_t *db)
1183 1323  {
1184 1324          ASSERT(!refcount_is_zero(&db->db_holds));
1185 1325          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1186 1326          mutex_enter(&db->db_mtx);
1187 1327          while (db->db_state == DB_READ || db->db_state == DB_FILL)
1188 1328                  cv_wait(&db->db_changed, &db->db_mtx);
1189 1329          if (db->db_state == DB_UNCACHED) {
1190 1330                  arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
1191 1331                  spa_t *spa = db->db_objset->os_spa;
1192 1332  
1193 1333                  ASSERT(db->db_buf == NULL);
1194 1334                  ASSERT(db->db.db_data == NULL);
1195 1335                  dbuf_set_data(db, arc_alloc_buf(spa, db, type, db->db.db_size));
1196 1336                  db->db_state = DB_FILL;
1197 1337          } else if (db->db_state == DB_NOFILL) {
1198 1338                  dbuf_clear_data(db);
1199 1339          } else {
1200 1340                  ASSERT3U(db->db_state, ==, DB_CACHED);
1201 1341          }
1202 1342          mutex_exit(&db->db_mtx);
1203 1343  }
1204 1344  
1205 1345  void
1206 1346  dbuf_unoverride(dbuf_dirty_record_t *dr)
1207 1347  {
1208 1348          dmu_buf_impl_t *db = dr->dr_dbuf;
1209 1349          blkptr_t *bp = &dr->dt.dl.dr_overridden_by;
1210 1350          uint64_t txg = dr->dr_txg;
1211 1351  
1212 1352          ASSERT(MUTEX_HELD(&db->db_mtx));
1213 1353          /*
1214 1354           * This assert is valid because dmu_sync() expects to be called by
1215 1355           * a zilog's get_data while holding a range lock.  This call only
1216 1356           * comes from dbuf_dirty() callers who must also hold a range lock.
1217 1357           */
1218 1358          ASSERT(dr->dt.dl.dr_override_state != DR_IN_DMU_SYNC);
1219 1359          ASSERT(db->db_level == 0);
1220 1360  
1221 1361          if (db->db_blkid == DMU_BONUS_BLKID ||
1222 1362              dr->dt.dl.dr_override_state == DR_NOT_OVERRIDDEN)
1223 1363                  return;
1224 1364  
1225 1365          ASSERT(db->db_data_pending != dr);
1226 1366  
1227 1367          /* free this block */
1228 1368          if (!BP_IS_HOLE(bp) && !dr->dt.dl.dr_nopwrite)
1229 1369                  zio_free(db->db_objset->os_spa, txg, bp);
1230 1370  
1231 1371          dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
1232 1372          dr->dt.dl.dr_nopwrite = B_FALSE;
1233 1373  
1234 1374          /*
1235 1375           * Release the already-written buffer, so we leave it in
1236 1376           * a consistent dirty state.  Note that all callers are
1237 1377           * modifying the buffer, so they will immediately do
1238 1378           * another (redundant) arc_release().  Therefore, leave
1239 1379           * the buf thawed to save the effort of freezing &
1240 1380           * immediately re-thawing it.
1241 1381           */
1242 1382          arc_release(dr->dt.dl.dr_data, db);
1243 1383  }
1244 1384  
1245 1385  /*
 1246 1386   * Evict (if it's unreferenced) or clear (if it's referenced) any level-0
1247 1387   * data blocks in the free range, so that any future readers will find
1248 1388   * empty blocks.
1249 1389   */
1250 1390  void
1251 1391  dbuf_free_range(dnode_t *dn, uint64_t start_blkid, uint64_t end_blkid,
1252 1392      dmu_tx_t *tx)
1253 1393  {
1254 1394          dmu_buf_impl_t db_search;
1255 1395          dmu_buf_impl_t *db, *db_next;
1256 1396          uint64_t txg = tx->tx_txg;
1257 1397          avl_index_t where;
1258 1398  
1259 1399          if (end_blkid > dn->dn_maxblkid &&
1260 1400              !(start_blkid == DMU_SPILL_BLKID || end_blkid == DMU_SPILL_BLKID))
1261 1401                  end_blkid = dn->dn_maxblkid;
1262 1402          dprintf_dnode(dn, "start=%llu end=%llu\n", start_blkid, end_blkid);
1263 1403  
1264 1404          db_search.db_level = 0;
1265 1405          db_search.db_blkid = start_blkid;
1266 1406          db_search.db_state = DB_SEARCH;
1267 1407  
1268 1408          mutex_enter(&dn->dn_dbufs_mtx);
1269 1409          db = avl_find(&dn->dn_dbufs, &db_search, &where);
1270 1410          ASSERT3P(db, ==, NULL);
1271 1411  
1272 1412          db = avl_nearest(&dn->dn_dbufs, where, AVL_AFTER);
1273 1413  
1274 1414          for (; db != NULL; db = db_next) {
1275 1415                  db_next = AVL_NEXT(&dn->dn_dbufs, db);
1276 1416                  ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1277 1417  
1278 1418                  if (db->db_level != 0 || db->db_blkid > end_blkid) {
1279 1419                          break;
1280 1420                  }
1281 1421                  ASSERT3U(db->db_blkid, >=, start_blkid);
1282 1422  
1283 1423                  /* found a level 0 buffer in the range */
1284 1424                  mutex_enter(&db->db_mtx);
1285 1425                  if (dbuf_undirty(db, tx)) {
1286 1426                          /* mutex has been dropped and dbuf destroyed */
1287 1427                          continue;
1288 1428                  }
1289 1429  
1290 1430                  if (db->db_state == DB_UNCACHED ||
1291 1431                      db->db_state == DB_NOFILL ||
1292 1432                      db->db_state == DB_EVICTING) {
1293 1433                          ASSERT(db->db.db_data == NULL);
1294 1434                          mutex_exit(&db->db_mtx);
1295 1435                          continue;
1296 1436                  }
1297 1437                  if (db->db_state == DB_READ || db->db_state == DB_FILL) {
1298 1438                          /* will be handled in dbuf_read_done or dbuf_rele */
1299 1439                          db->db_freed_in_flight = TRUE;
1300 1440                          mutex_exit(&db->db_mtx);
1301 1441                          continue;
1302 1442                  }
1303 1443                  if (refcount_count(&db->db_holds) == 0) {
1304 1444                          ASSERT(db->db_buf);
1305 1445                          dbuf_destroy(db);
1306 1446                          continue;
1307 1447                  }
1308 1448                  /* The dbuf is referenced */
1309 1449  
1310 1450                  if (db->db_last_dirty != NULL) {
1311 1451                          dbuf_dirty_record_t *dr = db->db_last_dirty;
1312 1452  
1313 1453                          if (dr->dr_txg == txg) {
1314 1454                                  /*
1315 1455                                   * This buffer is "in-use", re-adjust the file
1316 1456                                   * size to reflect that this buffer may
1317 1457                                   * contain new data when we sync.
1318 1458                                   */
1319 1459                                  if (db->db_blkid != DMU_SPILL_BLKID &&
1320 1460                                      db->db_blkid > dn->dn_maxblkid)
1321 1461                                          dn->dn_maxblkid = db->db_blkid;
1322 1462                                  dbuf_unoverride(dr);
1323 1463                          } else {
1324 1464                                  /*
1325 1465                                   * This dbuf is not dirty in the open context.
 1326 1466                                   * Either uncache it (if it's not referenced in
1327 1467                                   * the open context) or reset its contents to
1328 1468                                   * empty.
1329 1469                                   */
1330 1470                                  dbuf_fix_old_data(db, txg);
1331 1471                          }
1332 1472                  }
 1333 1473                  /* clear the contents if it's cached */
1334 1474                  if (db->db_state == DB_CACHED) {
1335 1475                          ASSERT(db->db.db_data != NULL);
1336 1476                          arc_release(db->db_buf, db);
1337 1477                          bzero(db->db.db_data, db->db.db_size);
1338 1478                          arc_buf_freeze(db->db_buf);
1339 1479                  }
1340 1480  
1341 1481                  mutex_exit(&db->db_mtx);
1342 1482          }
1343 1483          mutex_exit(&dn->dn_dbufs_mtx);
1344 1484  }
1345 1485  
1346 1486  void
1347 1487  dbuf_new_size(dmu_buf_impl_t *db, int size, dmu_tx_t *tx)
1348 1488  {
1349 1489          arc_buf_t *buf, *obuf;
1350 1490          int osize = db->db.db_size;
1351 1491          arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
1352 1492          dnode_t *dn;
1353 1493  
1354 1494          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1355 1495  
1356 1496          DB_DNODE_ENTER(db);
1357 1497          dn = DB_DNODE(db);
1358 1498  
1359 1499          /* XXX does *this* func really need the lock? */
1360 1500          ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
1361 1501  
1362 1502          /*
1363 1503           * This call to dmu_buf_will_dirty() with the dn_struct_rwlock held
1364 1504           * is OK, because there can be no other references to the db
1365 1505           * when we are changing its size, so no concurrent DB_FILL can
1366 1506           * be happening.
1367 1507           */
1368 1508          /*
1369 1509           * XXX we should be doing a dbuf_read, checking the return
1370 1510           * value and returning that up to our callers
1371 1511           */
1372 1512          dmu_buf_will_dirty(&db->db, tx);
1373 1513  
1374 1514          /* create the data buffer for the new block */
1375 1515          buf = arc_alloc_buf(dn->dn_objset->os_spa, db, type, size);
1376 1516  
1377 1517          /* copy old block data to the new block */
1378 1518          obuf = db->db_buf;
1379 1519          bcopy(obuf->b_data, buf->b_data, MIN(osize, size));
1380 1520          /* zero the remainder */
1381 1521          if (size > osize)
1382 1522                  bzero((uint8_t *)buf->b_data + osize, size - osize);
1383 1523  
1384 1524          mutex_enter(&db->db_mtx);
1385 1525          dbuf_set_data(db, buf);
1386 1526          arc_buf_destroy(obuf, db);
1387 1527          db->db.db_size = size;
1388 1528  
1389 1529          if (db->db_level == 0) {
1390 1530                  ASSERT3U(db->db_last_dirty->dr_txg, ==, tx->tx_txg);
1391 1531                  db->db_last_dirty->dt.dl.dr_data = buf;
1392 1532          }
1393 1533          mutex_exit(&db->db_mtx);
1394 1534  
1395 1535          dmu_objset_willuse_space(dn->dn_objset, size - osize, tx);
1396 1536          DB_DNODE_EXIT(db);
1397 1537  }
1398 1538  
1399 1539  void
1400 1540  dbuf_release_bp(dmu_buf_impl_t *db)
1401 1541  {
1402 1542          objset_t *os = db->db_objset;
1403 1543  
1404 1544          ASSERT(dsl_pool_sync_context(dmu_objset_pool(os)));
1405 1545          ASSERT(arc_released(os->os_phys_buf) ||
1406 1546              list_link_active(&os->os_dsl_dataset->ds_synced_link));
1407 1547          ASSERT(db->db_parent == NULL || arc_released(db->db_parent->db_buf));
1408 1548  
1409 1549          (void) arc_release(db->db_buf, db);
1410 1550  }
1411 1551  
1412 1552  /*
1413 1553   * We already have a dirty record for this TXG, and we are being
1414 1554   * dirtied again.
1415 1555   */
1416 1556  static void
1417      -dbuf_redirty(dbuf_dirty_record_t *dr)
     1557 +dbuf_redirty(dbuf_dirty_record_t *dr, boolean_t usesc)
1418 1558  {
1419 1559          dmu_buf_impl_t *db = dr->dr_dbuf;
1420 1560  
1421 1561          ASSERT(MUTEX_HELD(&db->db_mtx));
1422 1562  
1423 1563          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID) {
1424 1564                  /*
1425 1565                   * If this buffer has already been written out,
1426 1566                   * we now need to reset its state.
1427 1567                   */
1428 1568                  dbuf_unoverride(dr);
1429 1569                  if (db->db.db_object != DMU_META_DNODE_OBJECT &&
1430 1570                      db->db_state != DB_NOFILL) {
1431 1571                          /* Already released on initial dirty, so just thaw. */
1432 1572                          ASSERT(arc_released(db->db_buf));
1433 1573                          arc_buf_thaw(db->db_buf);
1434 1574                  }
1435 1575          }
     1576 +        /*
      1577 +         * The special-class usage of this dirty dbuf may have changed;
      1578 +         * update the dirty record.
     1579 +         */
     1580 +        dr->dr_usesc = usesc;
1436 1581  }
1437 1582  
1438 1583  dbuf_dirty_record_t *
1439      -dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
     1584 +dbuf_dirty_sc(dmu_buf_impl_t *db, dmu_tx_t *tx, boolean_t usesc)
1440 1585  {
1441 1586          dnode_t *dn;
1442 1587          objset_t *os;
1443 1588          dbuf_dirty_record_t **drp, *dr;
1444 1589          int drop_struct_lock = FALSE;
1445 1590          int txgoff = tx->tx_txg & TXG_MASK;
1446 1591  
1447 1592          ASSERT(tx->tx_txg != 0);
1448 1593          ASSERT(!refcount_is_zero(&db->db_holds));
1449 1594          DMU_TX_DIRTY_BUF(tx, db);
1450 1595  
1451 1596          DB_DNODE_ENTER(db);
1452 1597          dn = DB_DNODE(db);
1453 1598          /*
1454 1599           * Shouldn't dirty a regular buffer in syncing context.  Private
1455 1600           * objects may be dirtied in syncing context, but only if they
1456 1601           * were already pre-dirtied in open context.
1457 1602           */
1458 1603  #ifdef DEBUG
1459 1604          if (dn->dn_objset->os_dsl_dataset != NULL) {
1460 1605                  rrw_enter(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock,
1461 1606                      RW_READER, FTAG);
1462 1607          }
1463 1608          ASSERT(!dmu_tx_is_syncing(tx) ||
1464 1609              BP_IS_HOLE(dn->dn_objset->os_rootbp) ||
1465 1610              DMU_OBJECT_IS_SPECIAL(dn->dn_object) ||
1466 1611              dn->dn_objset->os_dsl_dataset == NULL);
1467 1612          if (dn->dn_objset->os_dsl_dataset != NULL)
1468 1613                  rrw_exit(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock, FTAG);
1469 1614  #endif
1470 1615          /*
1471 1616           * We make this assert for private objects as well, but after we
1472 1617           * check if we're already dirty.  They are allowed to re-dirty
1473 1618           * in syncing context.
1474 1619           */
1475 1620          ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
1476 1621              dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
1477 1622              (dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
1478 1623  
1479 1624          mutex_enter(&db->db_mtx);
1480 1625          /*
1481 1626           * XXX make this true for indirects too?  The problem is that
1482 1627           * transactions created with dmu_tx_create_assigned() from
1483 1628           * syncing context don't bother holding ahead.
1484 1629           */
1485 1630          ASSERT(db->db_level != 0 ||
1486 1631              db->db_state == DB_CACHED || db->db_state == DB_FILL ||
1487 1632              db->db_state == DB_NOFILL);
1488 1633  
1489 1634          mutex_enter(&dn->dn_mtx);
1490 1635          /*
1491 1636           * Don't set dirtyctx to SYNC if we're just modifying this as we
1492 1637           * initialize the objset.
1493 1638           */
1494 1639          if (dn->dn_dirtyctx == DN_UNDIRTIED) {
1495 1640                  if (dn->dn_objset->os_dsl_dataset != NULL) {
1496 1641                          rrw_enter(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock,
1497 1642                              RW_READER, FTAG);
1498 1643                  }
1499 1644                  if (!BP_IS_HOLE(dn->dn_objset->os_rootbp)) {
1500 1645                          dn->dn_dirtyctx = (dmu_tx_is_syncing(tx) ?
1501 1646                              DN_DIRTY_SYNC : DN_DIRTY_OPEN);
1502 1647                          ASSERT(dn->dn_dirtyctx_firstset == NULL);
1503 1648                          dn->dn_dirtyctx_firstset = kmem_alloc(1, KM_SLEEP);
1504 1649                  }
1505 1650                  if (dn->dn_objset->os_dsl_dataset != NULL) {
1506 1651                          rrw_exit(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock,
1507 1652                              FTAG);
1508 1653                  }
1509 1654          }
1510 1655          mutex_exit(&dn->dn_mtx);
1511 1656  
1512 1657          if (db->db_blkid == DMU_SPILL_BLKID)
1513 1658                  dn->dn_have_spill = B_TRUE;
1514 1659  
1515 1660          /*
1516 1661           * If this buffer is already dirty, we're done.
1517 1662           */
1518 1663          drp = &db->db_last_dirty;
1519 1664          ASSERT(*drp == NULL || (*drp)->dr_txg <= tx->tx_txg ||
1520 1665              db->db.db_object == DMU_META_DNODE_OBJECT);
1521 1666          while ((dr = *drp) != NULL && dr->dr_txg > tx->tx_txg)
1522 1667                  drp = &dr->dr_next;
1523 1668          if (dr && dr->dr_txg == tx->tx_txg) {
1524 1669                  DB_DNODE_EXIT(db);
1525 1670  
1526      -                dbuf_redirty(dr);
     1671 +                dbuf_redirty(dr, usesc);
1527 1672                  mutex_exit(&db->db_mtx);
1528 1673                  return (dr);
1529 1674          }
1530 1675  
1531 1676          /*
1532 1677           * Only valid if not already dirty.
1533 1678           */
1534 1679          ASSERT(dn->dn_object == 0 ||
1535 1680              dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
1536 1681              (dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
1537 1682  
1538 1683          ASSERT3U(dn->dn_nlevels, >, db->db_level);
1539 1684  
1540 1685          /*
1541 1686           * We should only be dirtying in syncing context if it's the
1542 1687           * mos or we're initializing the os or it's a special object.
1543 1688           * However, we are allowed to dirty in syncing context provided
1544 1689           * we already dirtied it in open context.  Hence we must make
1545 1690           * this assertion only if we're not already dirty.
1546 1691           */
1547 1692          os = dn->dn_objset;
1548 1693          VERIFY3U(tx->tx_txg, <=, spa_final_dirty_txg(os->os_spa));
1549 1694  #ifdef DEBUG
1550 1695          if (dn->dn_objset->os_dsl_dataset != NULL)
1551 1696                  rrw_enter(&os->os_dsl_dataset->ds_bp_rwlock, RW_READER, FTAG);
1552 1697          ASSERT(!dmu_tx_is_syncing(tx) || DMU_OBJECT_IS_SPECIAL(dn->dn_object) ||
1553 1698              os->os_dsl_dataset == NULL || BP_IS_HOLE(os->os_rootbp));
1554 1699          if (dn->dn_objset->os_dsl_dataset != NULL)
1555 1700                  rrw_exit(&os->os_dsl_dataset->ds_bp_rwlock, FTAG);
1556 1701  #endif
1557 1702          ASSERT(db->db.db_size != 0);
1558 1703  
1559 1704          dprintf_dbuf(db, "size=%llx\n", (u_longlong_t)db->db.db_size);
1560 1705  
1561 1706          if (db->db_blkid != DMU_BONUS_BLKID) {
1562 1707                  dmu_objset_willuse_space(os, db->db.db_size, tx);
1563 1708          }
1564 1709  
1565 1710          /*
1566 1711           * If this buffer is dirty in an old transaction group we need
1567 1712           * to make a copy of it so that the changes we make in this
1568 1713           * transaction group won't leak out when we sync the older txg.
1569 1714           */
1570 1715          dr = kmem_zalloc(sizeof (dbuf_dirty_record_t), KM_SLEEP);
1571 1716          if (db->db_level == 0) {
1572 1717                  void *data_old = db->db_buf;
1573 1718  
1574 1719                  if (db->db_state != DB_NOFILL) {
1575 1720                          if (db->db_blkid == DMU_BONUS_BLKID) {
1576 1721                                  dbuf_fix_old_data(db, tx->tx_txg);
1577 1722                                  data_old = db->db.db_data;
1578 1723                          } else if (db->db.db_object != DMU_META_DNODE_OBJECT) {
1579 1724                                  /*
1580 1725                                   * Release the data buffer from the cache so
1581 1726                                   * that we can modify it without impacting
1582 1727                                   * possible other users of this cached data
1583 1728                                   * block.  Note that indirect blocks and
1584 1729                                   * private objects are not released until the
1585 1730                                   * syncing state (since they are only modified
1586 1731                                   * then).
1587 1732                                   */
1588 1733                                  arc_release(db->db_buf, db);
1589 1734                                  dbuf_fix_old_data(db, tx->tx_txg);
1590 1735                                  data_old = db->db_buf;
1591 1736                          }
1592 1737                          ASSERT(data_old != NULL);
1593 1738                  }
1594 1739                  dr->dt.dl.dr_data = data_old;
1595 1740          } else {
  
1597 1742                  list_create(&dr->dt.di.dr_children,
1598 1743                      sizeof (dbuf_dirty_record_t),
1599 1744                      offsetof(dbuf_dirty_record_t, dr_dirty_node));
1600 1745          }
1601 1746          if (db->db_blkid != DMU_BONUS_BLKID && os->os_dsl_dataset != NULL)
1602 1747                  dr->dr_accounted = db->db.db_size;
1603 1748          dr->dr_dbuf = db;
1604 1749          dr->dr_txg = tx->tx_txg;
1605 1750          dr->dr_next = *drp;
     1751 +        dr->dr_usesc = usesc;
1606 1752          *drp = dr;
1607 1753  
1608 1754          /*
1609 1755           * We could have been freed_in_flight between the dbuf_noread
1610 1756           * and dbuf_dirty.  We win, as though the dbuf_noread() had
1611 1757           * happened after the free.
1612 1758           */
1613 1759          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
1614 1760              db->db_blkid != DMU_SPILL_BLKID) {
1615 1761                  mutex_enter(&dn->dn_mtx);
1616 1762                  if (dn->dn_free_ranges[txgoff] != NULL) {
1617 1763                          range_tree_clear(dn->dn_free_ranges[txgoff],
1618 1764                              db->db_blkid, 1);
1619 1765                  }
1620 1766                  mutex_exit(&dn->dn_mtx);
1621 1767                  db->db_freed_in_flight = FALSE;
1622 1768          }
1623 1769  
1624 1770          /*
1625 1771           * This buffer is now part of this txg
1626 1772           */
1627 1773          dbuf_add_ref(db, (void *)(uintptr_t)tx->tx_txg);
1628 1774          db->db_dirtycnt += 1;
  
1630 1776  
1631 1777          mutex_exit(&db->db_mtx);
1632 1778  
1633 1779          if (db->db_blkid == DMU_BONUS_BLKID ||
1634 1780              db->db_blkid == DMU_SPILL_BLKID) {
1635 1781                  mutex_enter(&dn->dn_mtx);
1636 1782                  ASSERT(!list_link_active(&dr->dr_dirty_node));
1637 1783                  list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
1638 1784                  mutex_exit(&dn->dn_mtx);
1639      -                dnode_setdirty(dn, tx);
     1785 +                dnode_setdirty_sc(dn, tx, usesc);
1640 1786                  DB_DNODE_EXIT(db);
1641 1787                  return (dr);
1642 1788          }
1643 1789  
1644 1790          /*
1645 1791           * The dn_struct_rwlock prevents db_blkptr from changing
1646 1792           * due to a write from syncing context completing
1647 1793           * while we are running, so we want to acquire it before
1648 1794           * looking at db_blkptr.
1649 1795           */
1650 1796          if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
1651 1797                  rw_enter(&dn->dn_struct_rwlock, RW_READER);
1652 1798                  drop_struct_lock = TRUE;
1653 1799          }
1654 1800  
1655 1801          /*
1656 1802           * We need to hold the dn_struct_rwlock to make this assertion,
1657 1803           * because it protects dn_phys / dn_next_nlevels from changing.
1658 1804           */
1659 1805          ASSERT((dn->dn_phys->dn_nlevels == 0 && db->db_level == 0) ||
1660 1806              dn->dn_phys->dn_nlevels > db->db_level ||
1661 1807              dn->dn_next_nlevels[txgoff] > db->db_level ||
1662 1808              dn->dn_next_nlevels[(tx->tx_txg-1) & TXG_MASK] > db->db_level ||
1663 1809              dn->dn_next_nlevels[(tx->tx_txg-2) & TXG_MASK] > db->db_level);
1664 1810  
1665 1811          /*
1666 1812           * If we are overwriting a dedup BP, then unless it is snapshotted,
1667 1813           * when we get to syncing context we will need to decrement its
1668 1814           * refcount in the DDT.  Prefetch the relevant DDT block so that
1669 1815           * syncing context won't have to wait for the i/o.
1670 1816           */
1671 1817          ddt_prefetch(os->os_spa, db->db_blkptr);
1672 1818  
1673 1819          if (db->db_level == 0) {
1674      -                dnode_new_blkid(dn, db->db_blkid, tx, drop_struct_lock);
     1820 +                dnode_new_blkid(dn, db->db_blkid, tx, usesc, drop_struct_lock);
1675 1821                  ASSERT(dn->dn_maxblkid >= db->db_blkid);
1676 1822          }
1677 1823  
1678 1824          if (db->db_level+1 < dn->dn_nlevels) {
1679 1825                  dmu_buf_impl_t *parent = db->db_parent;
1680 1826                  dbuf_dirty_record_t *di;
1681 1827                  int parent_held = FALSE;
1682 1828  
1683 1829                  if (db->db_parent == NULL || db->db_parent == dn->dn_dbuf) {
1684 1830                          int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
1685 1831  
1686 1832                          parent = dbuf_hold_level(dn, db->db_level+1,
1687 1833                              db->db_blkid >> epbs, FTAG);
1688 1834                          ASSERT(parent != NULL);
1689 1835                          parent_held = TRUE;
1690 1836                  }
1691 1837                  if (drop_struct_lock)
1692 1838                          rw_exit(&dn->dn_struct_rwlock);
1693 1839                  ASSERT3U(db->db_level+1, ==, parent->db_level);
1694      -                di = dbuf_dirty(parent, tx);
     1840 +                di = dbuf_dirty_sc(parent, tx, usesc);
1695 1841                  if (parent_held)
1696 1842                          dbuf_rele(parent, FTAG);
1697 1843  
1698 1844                  mutex_enter(&db->db_mtx);
1699 1845                  /*
1700 1846                   * Since we've dropped the mutex, it's possible that
1701 1847                   * dbuf_undirty() might have changed this out from under us.
1702 1848                   */
1703 1849                  if (db->db_last_dirty == dr ||
1704 1850                      dn->dn_object == DMU_META_DNODE_OBJECT) {
1705 1851                          mutex_enter(&di->dt.di.dr_mtx);
1706 1852                          ASSERT3U(di->dr_txg, ==, tx->tx_txg);
1707 1853                          ASSERT(!list_link_active(&dr->dr_dirty_node));
1708 1854                          list_insert_tail(&di->dt.di.dr_children, dr);
1709 1855                          mutex_exit(&di->dt.di.dr_mtx);
1710 1856                          dr->dr_parent = di;
1711 1857                  }
     1858 +
     1859 +                /*
      1860 +                 * The special-class usage of this dirty dbuf may have changed;
      1861 +                 * update the dirty record.
     1862 +                 */
     1863 +                dr->dr_usesc = usesc;
1712 1864                  mutex_exit(&db->db_mtx);
1713 1865          } else {
1714 1866                  ASSERT(db->db_level+1 == dn->dn_nlevels);
1715 1867                  ASSERT(db->db_blkid < dn->dn_nblkptr);
1716 1868                  ASSERT(db->db_parent == NULL || db->db_parent == dn->dn_dbuf);
1717 1869                  mutex_enter(&dn->dn_mtx);
1718 1870                  ASSERT(!list_link_active(&dr->dr_dirty_node));
1719 1871                  list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
1720 1872                  mutex_exit(&dn->dn_mtx);
1721 1873                  if (drop_struct_lock)
1722 1874                          rw_exit(&dn->dn_struct_rwlock);
1723 1875          }
1724 1876  
1725      -        dnode_setdirty(dn, tx);
     1877 +        dnode_setdirty_sc(dn, tx, usesc);
1726 1878          DB_DNODE_EXIT(db);
1727 1879          return (dr);
1728 1880  }
1729 1881  
     1882 +dbuf_dirty_record_t *
     1883 +dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
     1884 +{
     1885 +        spa_t *spa;
     1886 +
     1887 +        ASSERT(db->db_objset != NULL);
     1888 +        spa = db->db_objset->os_spa;
     1889 +
     1890 +        return (dbuf_dirty_sc(db, tx, spa->spa_usesc));
     1891 +}
     1892 +
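The split keeps existing callers unchanged: dbuf_dirty() now forwards the pool-wide spa->spa_usesc setting, while special-class-aware paths can choose per dirty record. A hedged sketch of the two entry points (the B_FALSE case is illustrative and presumably means "do not place this dirty data in the special class"):

        /* Legacy path: inherits the pool default (spa->spa_usesc). */
        (void) dbuf_dirty(db, tx);

        /* Special-class-aware path: override the default for this record. */
        (void) dbuf_dirty_sc(db, tx, B_FALSE);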
1730 1893  /*
1731 1894   * Undirty a buffer in the transaction group referenced by the given
1732 1895   * transaction.  Return whether this evicted the dbuf.
1733 1896   */
1734 1897  static boolean_t
1735 1898  dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
1736 1899  {
1737 1900          dnode_t *dn;
1738 1901          uint64_t txg = tx->tx_txg;
1739 1902          dbuf_dirty_record_t *dr, **drp;
1740 1903  
1741 1904          ASSERT(txg != 0);
1742 1905  
1743 1906          /*
1744 1907           * Due to our use of dn_nlevels below, this can only be called
1745 1908           * in open context, unless we are operating on the MOS.
1746 1909           * From syncing context, dn_nlevels may be different from the
1747 1910           * dn_nlevels used when dbuf was dirtied.
1748 1911           */
1749 1912          ASSERT(db->db_objset ==
1750 1913              dmu_objset_pool(db->db_objset)->dp_meta_objset ||
1751 1914              txg != spa_syncing_txg(dmu_objset_spa(db->db_objset)));
1752 1915          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1753 1916          ASSERT0(db->db_level);
1754 1917          ASSERT(MUTEX_HELD(&db->db_mtx));
1755 1918  
1756 1919          /*
1757 1920           * If this buffer is not dirty, we're done.
1758 1921           */
1759 1922          for (drp = &db->db_last_dirty; (dr = *drp) != NULL; drp = &dr->dr_next)
1760 1923                  if (dr->dr_txg <= txg)
1761 1924                          break;
1762 1925          if (dr == NULL || dr->dr_txg < txg)
1763 1926                  return (B_FALSE);
1764 1927          ASSERT(dr->dr_txg == txg);
1765 1928          ASSERT(dr->dr_dbuf == db);
1766 1929  
1767 1930          DB_DNODE_ENTER(db);
1768 1931          dn = DB_DNODE(db);
1769 1932  
1770 1933          dprintf_dbuf(db, "size=%llx\n", (u_longlong_t)db->db.db_size);
1771 1934  
1772 1935          ASSERT(db->db.db_size != 0);
1773 1936  
1774 1937          dsl_pool_undirty_space(dmu_objset_pool(dn->dn_objset),
1775 1938              dr->dr_accounted, txg);
1776 1939  
1777 1940          *drp = dr->dr_next;
1778 1941  
1779 1942          /*
1780 1943           * Note that there are three places in dbuf_dirty()
1781 1944           * where this dirty record may be put on a list.
1782 1945           * Make sure to do a list_remove corresponding to
1783 1946           * every one of those list_insert calls.
1784 1947           */
1785 1948          if (dr->dr_parent) {
1786 1949                  mutex_enter(&dr->dr_parent->dt.di.dr_mtx);
1787 1950                  list_remove(&dr->dr_parent->dt.di.dr_children, dr);
1788 1951                  mutex_exit(&dr->dr_parent->dt.di.dr_mtx);
1789 1952          } else if (db->db_blkid == DMU_SPILL_BLKID ||
1790 1953              db->db_level + 1 == dn->dn_nlevels) {
1791 1954                  ASSERT(db->db_blkptr == NULL || db->db_parent == dn->dn_dbuf);
1792 1955                  mutex_enter(&dn->dn_mtx);
1793 1956                  list_remove(&dn->dn_dirty_records[txg & TXG_MASK], dr);
1794 1957                  mutex_exit(&dn->dn_mtx);
1795 1958          }
1796 1959          DB_DNODE_EXIT(db);
1797 1960  
1798 1961          if (db->db_state != DB_NOFILL) {
1799 1962                  dbuf_unoverride(dr);
1800 1963  
1801 1964                  ASSERT(db->db_buf != NULL);
1802 1965                  ASSERT(dr->dt.dl.dr_data != NULL);
1803 1966                  if (dr->dt.dl.dr_data != db->db_buf)
1804 1967                          arc_buf_destroy(dr->dt.dl.dr_data, db);
1805 1968          }
1806 1969  
1807 1970          kmem_free(dr, sizeof (dbuf_dirty_record_t));
1808 1971  
1809 1972          ASSERT(db->db_dirtycnt > 0);
1810 1973          db->db_dirtycnt -= 1;
1811 1974  
1812 1975          if (refcount_remove(&db->db_holds, (void *)(uintptr_t)txg) == 0) {
1813 1976                  ASSERT(db->db_state == DB_NOFILL || arc_released(db->db_buf));
1814 1977                  dbuf_destroy(db);
1815 1978                  return (B_TRUE);
1816 1979          }
1817 1980  
1818 1981          return (B_FALSE);
1819 1982  }
1820 1983  
1821 1984  void
1822 1985  dmu_buf_will_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
1823 1986  {
1824 1987          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
     1988 +        spa_t *spa = db->db_objset->os_spa;
     1989 +        dmu_buf_will_dirty_sc(db_fake, tx, spa->spa_usesc);
     1990 +}
     1991 +
     1992 +void
     1993 +dmu_buf_will_dirty_sc(dmu_buf_t *db_fake, dmu_tx_t *tx, boolean_t usesc)
     1994 +{
     1995 +        dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
1825 1996          int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;
1826 1997  
1827 1998          ASSERT(tx->tx_txg != 0);
1828 1999          ASSERT(!refcount_is_zero(&db->db_holds));
1829 2000  
1830 2001          /*
 1831 2002           * Quick check for dirtiness.  For already dirty blocks, this
1832 2003           * reduces runtime of this function by >90%, and overall performance
1833 2004           * by 50% for some workloads (e.g. file deletion with indirect blocks
1834 2005           * cached).
1835 2006           */
1836 2007          mutex_enter(&db->db_mtx);
1837 2008          dbuf_dirty_record_t *dr;
1838 2009          for (dr = db->db_last_dirty;
1839 2010              dr != NULL && dr->dr_txg >= tx->tx_txg; dr = dr->dr_next) {
1840 2011                  /*
1841 2012                   * It's possible that it is already dirty but not cached,
1842 2013                   * because there are some calls to dbuf_dirty() that don't
1843 2014                   * go through dmu_buf_will_dirty().
1844 2015                   */
1845 2016                  if (dr->dr_txg == tx->tx_txg && db->db_state == DB_CACHED) {
1846 2017                          /* This dbuf is already dirty and cached. */
1847      -                        dbuf_redirty(dr);
     2018 +                        dbuf_redirty(dr, usesc);
1848 2019                          mutex_exit(&db->db_mtx);
1849 2020                          return;
1850 2021                  }
1851 2022          }
1852 2023          mutex_exit(&db->db_mtx);
1853 2024  
1854 2025          DB_DNODE_ENTER(db);
1855 2026          if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
1856 2027                  rf |= DB_RF_HAVESTRUCT;
1857 2028          DB_DNODE_EXIT(db);
1858 2029          (void) dbuf_read(db, NULL, rf);
1859      -        (void) dbuf_dirty(db, tx);
     2030 +        (void) dbuf_dirty_sc(db, tx, usesc);
1860 2031  }
1861 2032  
     2033 +
1862 2034  void
1863 2035  dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
1864 2036  {
1865 2037          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
1866 2038  
1867 2039          db->db_state = DB_NOFILL;
1868 2040  
1869 2041          dmu_buf_will_fill(db_fake, tx);
1870 2042  }
1871 2043  
1872 2044  void
1873 2045  dmu_buf_will_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
1874 2046  {
1875 2047          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
1876 2048  
1877 2049          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1878 2050          ASSERT(tx->tx_txg != 0);
1879 2051          ASSERT(db->db_level == 0);
1880 2052          ASSERT(!refcount_is_zero(&db->db_holds));
1881 2053  
1882 2054          ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT ||
1883 2055              dmu_tx_private_ok(tx));
1884 2056  
1885 2057          dbuf_noread(db);
1886 2058          (void) dbuf_dirty(db, tx);
1887 2059  }
1888 2060  
1889 2061  #pragma weak dmu_buf_fill_done = dbuf_fill_done
1890 2062  /* ARGSUSED */
1891 2063  void
1892 2064  dbuf_fill_done(dmu_buf_impl_t *db, dmu_tx_t *tx)
1893 2065  {
1894 2066          mutex_enter(&db->db_mtx);
1895 2067          DBUF_VERIFY(db);
1896 2068  
1897 2069          if (db->db_state == DB_FILL) {
1898 2070                  if (db->db_level == 0 && db->db_freed_in_flight) {
1899 2071                          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1900 2072                          /* we were freed while filling */
1901 2073                          /* XXX dbuf_undirty? */
1902 2074                          bzero(db->db.db_data, db->db.db_size);
1903 2075                          db->db_freed_in_flight = FALSE;
1904 2076                  }
1905 2077                  db->db_state = DB_CACHED;
1906 2078                  cv_broadcast(&db->db_changed);
1907 2079          }
1908 2080          mutex_exit(&db->db_mtx);
1909 2081  }
1910 2082  
1911 2083  void
1912 2084  dmu_buf_write_embedded(dmu_buf_t *dbuf, void *data,
1913 2085      bp_embedded_type_t etype, enum zio_compress comp,
1914 2086      int uncompressed_size, int compressed_size, int byteorder,
1915 2087      dmu_tx_t *tx)
1916 2088  {
1917 2089          dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbuf;
1918 2090          struct dirty_leaf *dl;
1919 2091          dmu_object_type_t type;
1920 2092  
1921 2093          if (etype == BP_EMBEDDED_TYPE_DATA) {
1922 2094                  ASSERT(spa_feature_is_active(dmu_objset_spa(db->db_objset),
1923 2095                      SPA_FEATURE_EMBEDDED_DATA));
1924 2096          }
1925 2097  
1926 2098          DB_DNODE_ENTER(db);
1927 2099          type = DB_DNODE(db)->dn_type;
1928 2100          DB_DNODE_EXIT(db);
1929 2101  
1930 2102          ASSERT0(db->db_level);
1931 2103          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1932 2104  
1933 2105          dmu_buf_will_not_fill(dbuf, tx);
1934 2106  
1935 2107          ASSERT3U(db->db_last_dirty->dr_txg, ==, tx->tx_txg);
1936 2108          dl = &db->db_last_dirty->dt.dl;
1937 2109          encode_embedded_bp_compressed(&dl->dr_overridden_by,
1938 2110              data, comp, uncompressed_size, compressed_size);
1939 2111          BPE_SET_ETYPE(&dl->dr_overridden_by, etype);
1940 2112          BP_SET_TYPE(&dl->dr_overridden_by, type);
1941 2113          BP_SET_LEVEL(&dl->dr_overridden_by, 0);
1942 2114          BP_SET_BYTEORDER(&dl->dr_overridden_by, byteorder);
1943 2115  
1944 2116          dl->dr_override_state = DR_OVERRIDDEN;
1945 2117          dl->dr_overridden_by.blk_birth = db->db_last_dirty->dr_txg;
1946 2118  }
1947 2119  
1948 2120  /*
1949 2121   * Directly assign a provided arc buf to a given dbuf if it's not referenced
1950 2122   * by anybody except our caller. Otherwise copy arcbuf's contents to dbuf.
1951 2123   */
1952 2124  void
1953 2125  dbuf_assign_arcbuf(dmu_buf_impl_t *db, arc_buf_t *buf, dmu_tx_t *tx)
1954 2126  {
1955 2127          ASSERT(!refcount_is_zero(&db->db_holds));
1956 2128          ASSERT(db->db_blkid != DMU_BONUS_BLKID);
1957 2129          ASSERT(db->db_level == 0);
1958 2130          ASSERT3U(dbuf_is_metadata(db), ==, arc_is_metadata(buf));
1959 2131          ASSERT(buf != NULL);
1960 2132          ASSERT(arc_buf_lsize(buf) == db->db.db_size);
1961 2133          ASSERT(tx->tx_txg != 0);
1962 2134  
1963 2135          arc_return_buf(buf, db);
1964 2136          ASSERT(arc_released(buf));
1965 2137  
1966 2138          mutex_enter(&db->db_mtx);
1967 2139  
1968 2140          while (db->db_state == DB_READ || db->db_state == DB_FILL)
1969 2141                  cv_wait(&db->db_changed, &db->db_mtx);
1970 2142  
1971 2143          ASSERT(db->db_state == DB_CACHED || db->db_state == DB_UNCACHED);
1972 2144  
1973 2145          if (db->db_state == DB_CACHED &&
1974 2146              refcount_count(&db->db_holds) - 1 > db->db_dirtycnt) {
1975 2147                  mutex_exit(&db->db_mtx);
1976 2148                  (void) dbuf_dirty(db, tx);
1977 2149                  bcopy(buf->b_data, db->db.db_data, db->db.db_size);
1978 2150                  arc_buf_destroy(buf, db);
1979 2151                  xuio_stat_wbuf_copied();
1980 2152                  return;
1981 2153          }
1982 2154  
1983 2155          xuio_stat_wbuf_nocopy();
1984 2156          if (db->db_state == DB_CACHED) {
1985 2157                  dbuf_dirty_record_t *dr = db->db_last_dirty;
1986 2158  
1987 2159                  ASSERT(db->db_buf != NULL);
1988 2160                  if (dr != NULL && dr->dr_txg == tx->tx_txg) {
1989 2161                          ASSERT(dr->dt.dl.dr_data == db->db_buf);
1990 2162                          if (!arc_released(db->db_buf)) {
1991 2163                                  ASSERT(dr->dt.dl.dr_override_state ==
1992 2164                                      DR_OVERRIDDEN);
1993 2165                                  arc_release(db->db_buf, db);
1994 2166                          }
1995 2167                          dr->dt.dl.dr_data = buf;
1996 2168                          arc_buf_destroy(db->db_buf, db);
1997 2169                  } else if (dr == NULL || dr->dt.dl.dr_data != db->db_buf) {
1998 2170                          arc_release(db->db_buf, db);
1999 2171                          arc_buf_destroy(db->db_buf, db);
2000 2172                  }
2001 2173                  db->db_buf = NULL;
2002 2174          }
2003 2175          ASSERT(db->db_buf == NULL);
2004 2176          dbuf_set_data(db, buf);
2005 2177          db->db_state = DB_FILL;
2006 2178          mutex_exit(&db->db_mtx);
2007 2179          (void) dbuf_dirty(db, tx);
2008 2180          dmu_buf_fill_done(&db->db, tx);
2009 2181  }
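
The copy-versus-adopt decision above turns on whether anyone other than the caller (and outstanding dirty records) still holds the cached data: if so the loaned arc buf is copied and destroyed, otherwise it is installed directly with dbuf_set_data(). A minimal standalone sketch of that predicate (illustrative names only, not the kernel API):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustrative only: mirrors the test in dbuf_assign_arcbuf() that
     * decides whether the loaned buffer must be copied rather than adopted.
     * The caller's own hold is discounted before comparing against the
     * number of dirty records.
     */
    static bool
    must_copy_loaned_buf(bool cached, uint64_t holds, uint64_t dirtycnt)
    {
            return (cached && holds - 1 > dirtycnt);
    }
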
2010 2182  
2011 2183  void
2012 2184  dbuf_destroy(dmu_buf_impl_t *db)
2013 2185  {
2014 2186          dnode_t *dn;
2015 2187          dmu_buf_impl_t *parent = db->db_parent;
2016 2188          dmu_buf_impl_t *dndb;
2017 2189  
2018 2190          ASSERT(MUTEX_HELD(&db->db_mtx));
2019 2191          ASSERT(refcount_is_zero(&db->db_holds));
2020 2192  
2021 2193          if (db->db_buf != NULL) {
2022 2194                  arc_buf_destroy(db->db_buf, db);
2023 2195                  db->db_buf = NULL;
2024 2196          }
2025 2197  
2026 2198          if (db->db_blkid == DMU_BONUS_BLKID) {
2027 2199                  ASSERT(db->db.db_data != NULL);
2028 2200                  zio_buf_free(db->db.db_data, DN_MAX_BONUSLEN);
2029 2201                  arc_space_return(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
2030 2202                  db->db_state = DB_UNCACHED;
2031 2203          }
2032 2204  
2033 2205          dbuf_clear_data(db);
2034 2206  
2035 2207          if (multilist_link_active(&db->db_cache_link)) {
2036      -                multilist_remove(dbuf_cache, db);
2037      -                (void) refcount_remove_many(&dbuf_cache_size,
     2208 +                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
     2209 +                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
     2210 +
     2211 +                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
     2212 +                (void) refcount_remove_many(
     2213 +                    &dbuf_caches[db->db_caching_status].size,
2038 2214                      db->db.db_size, db);
     2215 +
     2216 +                db->db_caching_status = DB_NO_CACHE;
2039 2217          }
2040 2218  
2041 2219          ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
2042 2220          ASSERT(db->db_data_pending == NULL);
2043 2221  
2044 2222          db->db_state = DB_EVICTING;
2045 2223          db->db_blkptr = NULL;
2046 2224  
2047 2225          /*
2048 2226           * Now that db_state is DB_EVICTING, nobody else can find this via
2049 2227           * the hash table.  We can now drop db_mtx, which allows us to
2050 2228           * acquire the dn_dbufs_mtx.
2051 2229           */
2052 2230          mutex_exit(&db->db_mtx);
2053 2231  
2054 2232          DB_DNODE_ENTER(db);
2055 2233          dn = DB_DNODE(db);
2056 2234          dndb = dn->dn_dbuf;
2057 2235          if (db->db_blkid != DMU_BONUS_BLKID) {
2058 2236                  boolean_t needlock = !MUTEX_HELD(&dn->dn_dbufs_mtx);
2059 2237                  if (needlock)
2060 2238                          mutex_enter(&dn->dn_dbufs_mtx);
2061 2239                  avl_remove(&dn->dn_dbufs, db);
2062 2240                  atomic_dec_32(&dn->dn_dbufs_count);
2063 2241                  membar_producer();
2064 2242                  DB_DNODE_EXIT(db);
2065 2243                  if (needlock)
2066 2244                          mutex_exit(&dn->dn_dbufs_mtx);
2067 2245                  /*
2068 2246                   * Decrementing the dbuf count means that the hold corresponding
2069 2247                   * to the removed dbuf is no longer discounted in dnode_move(),
2070 2248                   * so the dnode cannot be moved until after we release the hold.
2071 2249                   * The membar_producer() ensures visibility of the decremented
2072 2250                   * value in dnode_move(), since DB_DNODE_EXIT doesn't actually
2073 2251                   * release any lock.
2074 2252                   */
2075 2253                  dnode_rele(dn, db);
2076 2254                  db->db_dnode_handle = NULL;
2077 2255  
2078 2256                  dbuf_hash_remove(db);
2079 2257          } else {
2080 2258                  DB_DNODE_EXIT(db);
2081 2259          }
2082 2260  
2083 2261          ASSERT(refcount_is_zero(&db->db_holds));
2084 2262  
2085 2263          db->db_parent = NULL;
2086 2264  
2087 2265          ASSERT(db->db_buf == NULL);
2088 2266          ASSERT(db->db.db_data == NULL);
2089 2267          ASSERT(db->db_hash_next == NULL);
2090 2268          ASSERT(db->db_blkptr == NULL);
2091 2269          ASSERT(db->db_data_pending == NULL);
     2270 +        ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
2092 2271          ASSERT(!multilist_link_active(&db->db_cache_link));
2093 2272  
2094 2273          kmem_cache_free(dbuf_kmem_cache, db);
2095 2274          arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2096 2275  
2097 2276          /*
2098 2277           * If this dbuf is referenced from an indirect dbuf,
2099 2278           * decrement the ref count on the indirect dbuf.
2100 2279           */
2101 2280          if (parent && parent != dndb)
2102 2281                  dbuf_rele(parent, db);
2103 2282  }
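
The eviction hunk above indexes dbuf_caches[] by the dbuf's db_caching_status, one slot per cached state, each slot pairing an eviction multilist with a byte count. A simplified sketch of that layout, using stand-in types rather than the kernel's multilist_t/refcount_t, and assuming DB_NO_CACHE is an out-of-band value that never indexes the array:

    #include <stdint.h>

    /* Stand-in types; the kernel pairs a multilist_t with a refcount_t here. */
    typedef enum dbuf_cached_state {
            DB_NO_CACHE = -1,               /* not linked on any cache list */
            DB_DBUF_CACHE,                  /* ordinary, size-evicted cache */
            DB_DBUF_METADATA_CACHE,         /* metadata cache (see 9337)    */
            DB_CACHE_MAX
    } dbuf_cached_state_t;

    typedef struct dbuf_cache {
            void            *cache;         /* eviction list for this state */
            uint64_t         size;          /* bytes currently on the list  */
    } dbuf_cache_t;

    static dbuf_cache_t dbuf_caches[DB_CACHE_MAX];
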
2104 2283  
2105 2284  /*
2106 2285   * Note: While bpp will always be updated if the function returns success,
2107 2286   * parentp will not be updated if the dnode does not have dn_dbuf filled in;
2108 2287   * this happens when the dnode is the meta-dnode, or a userused or groupused
2109 2288   * object.
2110 2289   */
2111 2290  static int
2112 2291  dbuf_findbp(dnode_t *dn, int level, uint64_t blkid, int fail_sparse,
2113 2292      dmu_buf_impl_t **parentp, blkptr_t **bpp)
2114 2293  {
2115 2294          *parentp = NULL;
2116 2295          *bpp = NULL;
2117 2296  
2118 2297          ASSERT(blkid != DMU_BONUS_BLKID);
2119 2298  
2120 2299          if (blkid == DMU_SPILL_BLKID) {
2121 2300                  mutex_enter(&dn->dn_mtx);
2122 2301                  if (dn->dn_have_spill &&
2123 2302                      (dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR))
2124 2303                          *bpp = &dn->dn_phys->dn_spill;
2125 2304                  else
2126 2305                          *bpp = NULL;
2127 2306                  dbuf_add_ref(dn->dn_dbuf, NULL);
2128 2307                  *parentp = dn->dn_dbuf;
2129 2308                  mutex_exit(&dn->dn_mtx);
2130 2309                  return (0);
2131 2310          }
2132 2311  
2133 2312          int nlevels =
2134 2313              (dn->dn_phys->dn_nlevels == 0) ? 1 : dn->dn_phys->dn_nlevels;
2135 2314          int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
2136 2315  
2137 2316          ASSERT3U(level * epbs, <, 64);
2138 2317          ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
2139 2318          /*
2140 2319           * This assertion shouldn't trip as long as the max indirect block size
2141 2320           * is less than 1M.  The reason for this is that up to that point,
2142 2321           * the number of levels required to address an entire object with blocks
2143 2322           * of size SPA_MINBLOCKSIZE satisfies nlevels * epbs + 1 <= 64.  In
2144 2323           * other words, if N * epbs + 1 > 64, then if (N-1) * epbs + 1 > 55
2145 2324           * (i.e. we can address the entire object), objects will all use at most
2146 2325           * N-1 levels and the assertion won't overflow.  However, once epbs is
2147 2326           * 13, 4 * 13 + 1 = 53, but 5 * 13 + 1 = 66.  Then, 4 levels will not be
2148 2327           * enough to address an entire object, so objects will have 5 levels,
2149 2328           * but then this assertion will overflow.
2150 2329           *
2151 2330           * All this is to say that if we ever increase DN_MAX_INDBLKSHIFT, we
2152 2331           * need to redo this logic to handle overflows.
2153 2332           */
2154 2333          ASSERT(level >= nlevels ||
2155 2334              ((nlevels - level - 1) * epbs) +
2156 2335              highbit64(dn->dn_phys->dn_nblkptr) <= 64);
2157 2336          if (level >= nlevels ||
2158 2337              blkid >= ((uint64_t)dn->dn_phys->dn_nblkptr <<
2159 2338              ((nlevels - level - 1) * epbs)) ||
2160 2339              (fail_sparse &&
2161 2340              blkid > (dn->dn_phys->dn_maxblkid >> (level * epbs)))) {
2162 2341                  /* the buffer has no parent yet */
2163 2342                  return (SET_ERROR(ENOENT));
2164 2343          } else if (level < nlevels-1) {
2165 2344                  /* this block is referenced from an indirect block */
2166 2345                  int err = dbuf_hold_impl(dn, level+1,
2167 2346                      blkid >> epbs, fail_sparse, FALSE, NULL, parentp);
2168 2347                  if (err)
2169 2348                          return (err);
2170 2349                  err = dbuf_read(*parentp, NULL,
2171 2350                      (DB_RF_HAVESTRUCT | DB_RF_NOPREFETCH | DB_RF_CANFAIL));
2172 2351                  if (err) {
2173 2352                          dbuf_rele(*parentp, NULL);
2174 2353                          *parentp = NULL;
2175 2354                          return (err);
2176 2355                  }
2177 2356                  *bpp = ((blkptr_t *)(*parentp)->db.db_data) +
2178 2357                      (blkid & ((1ULL << epbs) - 1));
2179 2358                  if (blkid > (dn->dn_phys->dn_maxblkid >> (level * epbs)))
2180 2359                          ASSERT(BP_IS_HOLE(*bpp));
2181 2360                  return (0);
2182 2361          } else {
2183 2362                  /* the block is referenced from the dnode */
2184 2363                  ASSERT3U(level, ==, nlevels-1);
2185 2364                  ASSERT(dn->dn_phys->dn_nblkptr == 0 ||
2186 2365                      blkid < dn->dn_phys->dn_nblkptr);
2187 2366                  if (dn->dn_dbuf) {
2188 2367                          dbuf_add_ref(dn->dn_dbuf, NULL);
2189 2368                          *parentp = dn->dn_dbuf;
2190 2369                  }
2191 2370                  *bpp = &dn->dn_phys->dn_blkptr[blkid];
2192 2371                  return (0);
2193 2372          }
2194 2373  }
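
The indirection arithmetic used throughout dbuf_findbp() is a pure shift: with epbs = dn_indblkshift - SPA_BLKPTRSHIFT block pointers per indirect block (for example 17 - 7 = 10, i.e. 1024 entries in a 128K indirect block), the parent of blkid one level up is blkid >> epbs and the slot within that parent is the low epbs bits. A standalone sketch of that mapping (illustrative helper, not part of the source):

    #include <assert.h>
    #include <stdint.h>

    /*
     * Illustrative only: locate a block's parent slot one level up,
     * given epbs = dn_indblkshift - SPA_BLKPTRSHIFT.
     */
    static void
    parent_slot(uint64_t blkid, int epbs, uint64_t *parent_blkid, uint64_t *slot)
    {
            *parent_blkid = blkid >> epbs;              /* id one level up      */
            *slot = blkid & ((1ULL << epbs) - 1);       /* blkptr index within  */
    }

    int
    main(void)
    {
            uint64_t parent, slot;

            parent_slot(1000000, 10, &parent, &slot);
            assert(parent == 976 && slot == 576);       /* 976 * 1024 + 576 */
            return (0);
    }
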
2195 2374  
2196 2375  static dmu_buf_impl_t *
2197 2376  dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
2198 2377      dmu_buf_impl_t *parent, blkptr_t *blkptr)
2199 2378  {
2200 2379          objset_t *os = dn->dn_objset;
2201 2380          dmu_buf_impl_t *db, *odb;
2202 2381  
2203 2382          ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
2204 2383          ASSERT(dn->dn_type != DMU_OT_NONE);
2205 2384  
2206 2385          db = kmem_cache_alloc(dbuf_kmem_cache, KM_SLEEP);
2207 2386  
2208 2387          db->db_objset = os;
2209 2388          db->db.db_object = dn->dn_object;
2210 2389          db->db_level = level;
2211 2390          db->db_blkid = blkid;
2212 2391          db->db_last_dirty = NULL;
2213 2392          db->db_dirtycnt = 0;
2214 2393          db->db_dnode_handle = dn->dn_handle;
2215 2394          db->db_parent = parent;
2216 2395          db->db_blkptr = blkptr;
2217 2396  
2218 2397          db->db_user = NULL;
2219 2398          db->db_user_immediate_evict = FALSE;
2220 2399          db->db_freed_in_flight = FALSE;
2221 2400          db->db_pending_evict = FALSE;
2222 2401  
2223 2402          if (blkid == DMU_BONUS_BLKID) {
2224 2403                  ASSERT3P(parent, ==, dn->dn_dbuf);
2225 2404                  db->db.db_size = DN_MAX_BONUSLEN -
2226 2405                      (dn->dn_nblkptr-1) * sizeof (blkptr_t);
2227 2406                  ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
2228 2407                  db->db.db_offset = DMU_BONUS_BLKID;
2229 2408                  db->db_state = DB_UNCACHED;
     2409 +                db->db_caching_status = DB_NO_CACHE;
2230 2410                  /* the bonus dbuf is not placed in the hash table */
2231 2411                  arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2232 2412                  return (db);
2233 2413          } else if (blkid == DMU_SPILL_BLKID) {
2234 2414                  db->db.db_size = (blkptr != NULL) ?
2235 2415                      BP_GET_LSIZE(blkptr) : SPA_MINBLOCKSIZE;
2236 2416                  db->db.db_offset = 0;
2237 2417          } else {
2238 2418                  int blocksize =
2239 2419                      db->db_level ? 1 << dn->dn_indblkshift : dn->dn_datablksz;
2240 2420                  db->db.db_size = blocksize;
2241 2421                  db->db.db_offset = db->db_blkid * blocksize;
2242 2422          }
2243 2423  
2244 2424          /*
2245 2425           * Hold the dn_dbufs_mtx while we get the new dbuf
2246 2426           * in the hash table *and* added to the dbufs list.
2247 2427           * This prevents a possible deadlock with someone
2248 2428           * trying to look up this dbuf before it's added to the
2249 2429           * dn_dbufs list.
2250 2430           */
2251 2431          mutex_enter(&dn->dn_dbufs_mtx);
2252 2432          db->db_state = DB_EVICTING;
2253 2433          if ((odb = dbuf_hash_insert(db)) != NULL) {
2254 2434                  /* someone else inserted it first */
2255 2435                  kmem_cache_free(dbuf_kmem_cache, db);
2256 2436                  mutex_exit(&dn->dn_dbufs_mtx);
2257 2437                  return (odb);
2258 2438          }
2259 2439          avl_add(&dn->dn_dbufs, db);
2260 2440  
2261 2441          db->db_state = DB_UNCACHED;
     2442 +        db->db_caching_status = DB_NO_CACHE;
2262 2443          mutex_exit(&dn->dn_dbufs_mtx);
2263 2444          arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2264 2445  
2265 2446          if (parent && parent != dn->dn_dbuf)
2266 2447                  dbuf_add_ref(parent, db);
2267 2448  
2268 2449          ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
2269 2450              refcount_count(&dn->dn_holds) > 0);
2270 2451          (void) refcount_add(&dn->dn_holds, db);
2271 2452          atomic_inc_32(&dn->dn_dbufs_count);
2272 2453  
2273 2454          dprintf_dbuf(db, "db=%p\n", db);
2274 2455  
2275 2456          return (db);
2276 2457  }
2277 2458  
2278 2459  typedef struct dbuf_prefetch_arg {
2279 2460          spa_t *dpa_spa; /* The spa to issue the prefetch in. */
2280 2461          zbookmark_phys_t dpa_zb; /* The target block to prefetch. */
2281 2462          int dpa_epbs; /* Entries (blkptr_t's) Per Block Shift. */
2282 2463          int dpa_curlevel; /* The current level that we're reading */
2283 2464          dnode_t *dpa_dnode; /* The dnode associated with the prefetch */
2284 2465          zio_priority_t dpa_prio; /* The priority I/Os should be issued at. */
2285 2466          zio_t *dpa_zio; /* The parent zio_t for all prefetches. */
2286 2467          arc_flags_t dpa_aflags; /* Flags to pass to the final prefetch. */
2287 2468  } dbuf_prefetch_arg_t;
2288 2469  
2289 2470  /*
2290 2471   * Actually issue the prefetch read for the block given.
2291 2472   */
2292 2473  static void
2293 2474  dbuf_issue_final_prefetch(dbuf_prefetch_arg_t *dpa, blkptr_t *bp)
2294 2475  {
2295 2476          if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp))
2296 2477                  return;
2297 2478  
2298 2479          arc_flags_t aflags =
2299 2480              dpa->dpa_aflags | ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH;
2300 2481  
2301 2482          ASSERT3U(dpa->dpa_curlevel, ==, BP_GET_LEVEL(bp));
2302 2483          ASSERT3U(dpa->dpa_curlevel, ==, dpa->dpa_zb.zb_level);
2303 2484          ASSERT(dpa->dpa_zio != NULL);
2304 2485          (void) arc_read(dpa->dpa_zio, dpa->dpa_spa, bp, NULL, NULL,
2305 2486              dpa->dpa_prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
2306 2487              &aflags, &dpa->dpa_zb);
2307 2488  }
2308 2489  
2309 2490  /*
2310 2491   * Called when an indirect block above our prefetch target is read in.  This
2311 2492   * will either read in the next indirect block down the tree or issue the actual
2312 2493   * prefetch if the next block down is our target.
2313 2494   */
2314 2495  static void
2315 2496  dbuf_prefetch_indirect_done(zio_t *zio, arc_buf_t *abuf, void *private)
2316 2497  {
2317 2498          dbuf_prefetch_arg_t *dpa = private;
2318 2499  
2319 2500          ASSERT3S(dpa->dpa_zb.zb_level, <, dpa->dpa_curlevel);
2320 2501          ASSERT3S(dpa->dpa_curlevel, >, 0);
2321 2502  
2322 2503          /*
2323 2504           * The dpa_dnode is only valid if we are called with a NULL
2324 2505           * zio. This indicates that the arc_read() returned without
2325 2506           * first calling zio_read() to issue a physical read. Once
2326 2507           * a physical read is made the dpa_dnode must be invalidated
2327 2508           * as the locks guarding it may have been dropped. If the
2328 2509           * dpa_dnode is still valid, then we want to add it to the dbuf
2329 2510           * cache. To do so, we must hold the dbuf associated with the block
2330 2511           * we just prefetched, read its contents so that we associate it
2331 2512           * with an arc_buf_t, and then release it.
2332 2513           */
2333 2514          if (zio != NULL) {
2334 2515                  ASSERT3S(BP_GET_LEVEL(zio->io_bp), ==, dpa->dpa_curlevel);
2335 2516                  if (zio->io_flags & ZIO_FLAG_RAW) {
2336 2517                          ASSERT3U(BP_GET_PSIZE(zio->io_bp), ==, zio->io_size);
2337 2518                  } else {
2338 2519                          ASSERT3U(BP_GET_LSIZE(zio->io_bp), ==, zio->io_size);
2339 2520                  }
2340 2521                  ASSERT3P(zio->io_spa, ==, dpa->dpa_spa);
2341 2522  
2342 2523                  dpa->dpa_dnode = NULL;
2343 2524          } else if (dpa->dpa_dnode != NULL) {
2344 2525                  uint64_t curblkid = dpa->dpa_zb.zb_blkid >>
2345 2526                      (dpa->dpa_epbs * (dpa->dpa_curlevel -
2346 2527                      dpa->dpa_zb.zb_level));
2347 2528                  dmu_buf_impl_t *db = dbuf_hold_level(dpa->dpa_dnode,
2348 2529                      dpa->dpa_curlevel, curblkid, FTAG);
2349 2530                  (void) dbuf_read(db, NULL,
2350 2531                      DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH | DB_RF_HAVESTRUCT);
2351 2532                  dbuf_rele(db, FTAG);
2352 2533          }
2353 2534  
2354 2535          dpa->dpa_curlevel--;
2355 2536  
2356 2537          uint64_t nextblkid = dpa->dpa_zb.zb_blkid >>
2357 2538              (dpa->dpa_epbs * (dpa->dpa_curlevel - dpa->dpa_zb.zb_level));
2358 2539          blkptr_t *bp = ((blkptr_t *)abuf->b_data) +
2359 2540              P2PHASE(nextblkid, 1ULL << dpa->dpa_epbs);
2360 2541          if (BP_IS_HOLE(bp) || (zio != NULL && zio->io_error != 0)) {
2361 2542                  kmem_free(dpa, sizeof (*dpa));
2362 2543          } else if (dpa->dpa_curlevel == dpa->dpa_zb.zb_level) {
2363 2544                  ASSERT3U(nextblkid, ==, dpa->dpa_zb.zb_blkid);
2364 2545                  dbuf_issue_final_prefetch(dpa, bp);
2365 2546                  kmem_free(dpa, sizeof (*dpa));
2366 2547          } else {
2367 2548                  arc_flags_t iter_aflags = ARC_FLAG_NOWAIT;
2368 2549                  zbookmark_phys_t zb;
2369 2550  
2370 2551                  /* flag if L2ARC eligible, l2arc_noprefetch then decides */
2371 2552                  if (dpa->dpa_aflags & ARC_FLAG_L2CACHE)
2372 2553                          iter_aflags |= ARC_FLAG_L2CACHE;
2373 2554  
2374 2555                  ASSERT3U(dpa->dpa_curlevel, ==, BP_GET_LEVEL(bp));
2375 2556  
2376 2557                  SET_BOOKMARK(&zb, dpa->dpa_zb.zb_objset,
2377 2558                      dpa->dpa_zb.zb_object, dpa->dpa_curlevel, nextblkid);
2378 2559  
2379 2560                  (void) arc_read(dpa->dpa_zio, dpa->dpa_spa,
2380 2561                      bp, dbuf_prefetch_indirect_done, dpa, dpa->dpa_prio,
2381 2562                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
2382 2563                      &iter_aflags, &zb);
2383 2564          }
2384 2565  
2385 2566          arc_buf_destroy(abuf, private);
2386 2567  }
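
As each indirect level lands, nextblkid above is the block id at the new (one lower) level whose subtree contains the prefetch target; the shift simply generalizes the parent/child mapping across several levels. A sketch of that computation (illustrative helper, not part of the source):

    #include <stdint.h>

    /*
     * Illustrative only: the blkid at `level` whose subtree covers a target
     * block at `target_level`, matching the nextblkid computation above.
     */
    static uint64_t
    covering_blkid(uint64_t target_blkid, int target_level, int level, int epbs)
    {
            return (target_blkid >> (epbs * (level - target_level)));
    }

For example, with epbs = 10, level-0 block 1000000 sits under level-1 block 976 and level-2 block 0.
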
2387 2568  
2388 2569  /*
2389 2570   * Issue prefetch reads for the given block on the given level.  If the indirect
2390 2571   * blocks above that block are not in memory, we will read them in
2391 2572   * asynchronously.  As a result, this call never blocks waiting for a read to
2392 2573   * complete.
2393 2574   */
2394 2575  void
2395 2576  dbuf_prefetch(dnode_t *dn, int64_t level, uint64_t blkid, zio_priority_t prio,
2396 2577      arc_flags_t aflags)
2397 2578  {
2398 2579          blkptr_t bp;
2399 2580          int epbs, nlevels, curlevel;
2400 2581          uint64_t curblkid;
2401 2582  
2402 2583          ASSERT(blkid != DMU_BONUS_BLKID);
2403 2584          ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
2404 2585  
2405 2586          if (blkid > dn->dn_maxblkid)
2406 2587                  return;
2407 2588  
2408 2589          if (dnode_block_freed(dn, blkid))
2409 2590                  return;
2410 2591  
2411 2592          /*
2412 2593           * This dnode hasn't been written to disk yet, so there's nothing to
2413 2594           * prefetch.
2414 2595           */
2415 2596          nlevels = dn->dn_phys->dn_nlevels;
2416 2597          if (level >= nlevels || dn->dn_phys->dn_nblkptr == 0)
2417 2598                  return;
2418 2599  
2419 2600          epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
2420 2601          if (dn->dn_phys->dn_maxblkid < blkid << (epbs * level))
2421 2602                  return;
2422 2603  
2423 2604          dmu_buf_impl_t *db = dbuf_find(dn->dn_objset, dn->dn_object,
2424 2605              level, blkid);
2425 2606          if (db != NULL) {
2426 2607                  mutex_exit(&db->db_mtx);
2427 2608                  /*
2428 2609                   * This dbuf already exists.  It is either CACHED, or
2429 2610                   * (we assume) about to be read or filled.
2430 2611                   */
2431 2612                  return;
2432 2613          }
2433 2614  
2434 2615          /*
2435 2616           * Find the closest ancestor (indirect block) of the target block
2436 2617           * that is present in the cache.  In this indirect block, we will
2437 2618           * find the bp that is at curlevel, curblkid.
2438 2619           */
2439 2620          curlevel = level;
2440 2621          curblkid = blkid;
2441 2622          while (curlevel < nlevels - 1) {
2442 2623                  int parent_level = curlevel + 1;
2443 2624                  uint64_t parent_blkid = curblkid >> epbs;
2444 2625                  dmu_buf_impl_t *db;
2445 2626  
2446 2627                  if (dbuf_hold_impl(dn, parent_level, parent_blkid,
2447 2628                      FALSE, TRUE, FTAG, &db) == 0) {
2448 2629                          blkptr_t *bpp = db->db_buf->b_data;
2449 2630                          bp = bpp[P2PHASE(curblkid, 1 << epbs)];
2450 2631                          dbuf_rele(db, FTAG);
2451 2632                          break;
2452 2633                  }
2453 2634  
2454 2635                  curlevel = parent_level;
2455 2636                  curblkid = parent_blkid;
2456 2637          }
2457 2638  
2458 2639          if (curlevel == nlevels - 1) {
2459 2640                  /* No cached indirect blocks found. */
2460 2641                  ASSERT3U(curblkid, <, dn->dn_phys->dn_nblkptr);
2461 2642                  bp = dn->dn_phys->dn_blkptr[curblkid];
2462 2643          }
2463 2644          if (BP_IS_HOLE(&bp))
2464 2645                  return;
2465 2646  
2466 2647          ASSERT3U(curlevel, ==, BP_GET_LEVEL(&bp));
2467 2648  
2468 2649          zio_t *pio = zio_root(dmu_objset_spa(dn->dn_objset), NULL, NULL,
2469 2650              ZIO_FLAG_CANFAIL);
2470 2651  
2471 2652          dbuf_prefetch_arg_t *dpa = kmem_zalloc(sizeof (*dpa), KM_SLEEP);
2472 2653          dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
2473 2654          SET_BOOKMARK(&dpa->dpa_zb, ds != NULL ? ds->ds_object : DMU_META_OBJSET,
2474 2655              dn->dn_object, level, blkid);
2475 2656          dpa->dpa_curlevel = curlevel;
2476 2657          dpa->dpa_prio = prio;
2477 2658          dpa->dpa_aflags = aflags;
2478 2659          dpa->dpa_spa = dn->dn_objset->os_spa;
2479 2660          dpa->dpa_dnode = dn;
2480 2661          dpa->dpa_epbs = epbs;
2481 2662          dpa->dpa_zio = pio;
2482 2663  
2483 2664          /* flag if L2ARC eligible, l2arc_noprefetch then decides */
2484 2665          if (DNODE_LEVEL_IS_L2CACHEABLE(dn, level))
2485 2666                  dpa->dpa_aflags |= ARC_FLAG_L2CACHE;
2486 2667  
2487 2668          /*
2488 2669           * If we have the indirect just above us, no need to do the asynchronous
2489 2670           * prefetch chain; we'll just run the last step ourselves.  If we're at
2490 2671           * a higher level, though, we want to issue the prefetches for all the
2491 2672           * indirect blocks asynchronously, so we can go on with whatever we were
2492 2673           * doing.
2493 2674           */
2494 2675          if (curlevel == level) {
2495 2676                  ASSERT3U(curblkid, ==, blkid);
2496 2677                  dbuf_issue_final_prefetch(dpa, &bp);
2497 2678                  kmem_free(dpa, sizeof (*dpa));
2498 2679          } else {
2499 2680                  arc_flags_t iter_aflags = ARC_FLAG_NOWAIT;
2500 2681                  zbookmark_phys_t zb;
2501 2682  
2502 2683                  /* flag if L2ARC eligible, l2arc_noprefetch then decides */
2503 2684                  if (DNODE_LEVEL_IS_L2CACHEABLE(dn, level))
2504 2685                          iter_aflags |= ARC_FLAG_L2CACHE;
2505 2686  
2506 2687                  SET_BOOKMARK(&zb, ds != NULL ? ds->ds_object : DMU_META_OBJSET,
2507 2688                      dn->dn_object, curlevel, curblkid);
2508 2689                  (void) arc_read(dpa->dpa_zio, dpa->dpa_spa,
2509 2690                      &bp, dbuf_prefetch_indirect_done, dpa, prio,
2510 2691                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
2511 2692                      &iter_aflags, &zb);
2512 2693          }
2513 2694          /*
2514 2695           * We use pio here instead of dpa_zio since it's possible that
2515 2696           * dpa may have already been freed.
2516 2697           */
2517 2698          zio_nowait(pio);
2518 2699  }
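
Callers issue these prefetches with dn_struct_rwlock held, as the assertion at the top of the function requires; the call returns immediately and any missing indirect blocks are read asynchronously. A hedged sketch of a typical sequential-read caller, where start and nblks stand for the caller's range and the flag and priority choices are illustrative:

    /*
     * Sketch only: prefetch a small run of level-0 blocks ahead of a
     * sequential read.  dn_struct_rwlock must be held across the calls.
     */
    rw_enter(&dn->dn_struct_rwlock, RW_READER);
    for (uint64_t blk = start; blk < start + nblks; blk++) {
            dbuf_prefetch(dn, 0, blk, ZIO_PRIORITY_ASYNC_READ,
                ARC_FLAG_PREDICTIVE_PREFETCH);
    }
    rw_exit(&dn->dn_struct_rwlock);
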
2519 2700  
2520 2701  /*
2521 2702   * Returns with db_holds incremented, and db_mtx not held.
2522 2703   * Note: dn_struct_rwlock must be held.
2523 2704   */
2524 2705  int
2525 2706  dbuf_hold_impl(dnode_t *dn, uint8_t level, uint64_t blkid,
2526 2707      boolean_t fail_sparse, boolean_t fail_uncached,
2527 2708      void *tag, dmu_buf_impl_t **dbp)
2528 2709  {
2529 2710          dmu_buf_impl_t *db, *parent = NULL;
2530 2711  
2531 2712          ASSERT(blkid != DMU_BONUS_BLKID);
2532 2713          ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
2533 2714          ASSERT3U(dn->dn_nlevels, >, level);
2534 2715  
2535 2716          *dbp = NULL;
2536 2717  top:
2537 2718          /* dbuf_find() returns with db_mtx held */
2538 2719          db = dbuf_find(dn->dn_objset, dn->dn_object, level, blkid);
2539 2720  
2540 2721          if (db == NULL) {
2541 2722                  blkptr_t *bp = NULL;
2542 2723                  int err;
2543 2724  
2544 2725                  if (fail_uncached)
2545 2726                          return (SET_ERROR(ENOENT));
2546 2727  
2547 2728                  ASSERT3P(parent, ==, NULL);
2548 2729                  err = dbuf_findbp(dn, level, blkid, fail_sparse, &parent, &bp);
2549 2730                  if (fail_sparse) {
2550 2731                          if (err == 0 && bp && BP_IS_HOLE(bp))
2551 2732                                  err = SET_ERROR(ENOENT);
2552 2733                          if (err) {
2553 2734                                  if (parent)
2554 2735                                          dbuf_rele(parent, NULL);
2555 2736                                  return (err);
2556 2737                          }
2557 2738                  }
2558 2739                  if (err && err != ENOENT)
2559 2740                          return (err);
2560 2741                  db = dbuf_create(dn, level, blkid, parent, bp);
2561 2742          }
2562 2743  
2563 2744          if (fail_uncached && db->db_state != DB_CACHED) {
2564 2745                  mutex_exit(&db->db_mtx);
2565 2746                  return (SET_ERROR(ENOENT));
2566 2747          }
2567 2748  
2568      -        if (db->db_buf != NULL)
     2749 +        if (db->db_buf != NULL) {
     2750 +                arc_buf_access(db->db_buf);
2569 2751                  ASSERT3P(db->db.db_data, ==, db->db_buf->b_data);
     2752 +        }
2570 2753  
2571 2754          ASSERT(db->db_buf == NULL || arc_referenced(db->db_buf));
2572 2755  
2573 2756          /*
2574 2757           * If this buffer is currently syncing out, and we are
2575 2758           * still referencing it from db_data, we need to make a copy
2576 2759           * of it in case we decide we want to dirty it again in this txg.
2577 2760           */
2578 2761          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
2579 2762              dn->dn_object != DMU_META_DNODE_OBJECT &&
2580 2763              db->db_state == DB_CACHED && db->db_data_pending) {
2581 2764                  dbuf_dirty_record_t *dr = db->db_data_pending;
2582 2765  
2583 2766                  if (dr->dt.dl.dr_data == db->db_buf) {
2584 2767                          arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
2585 2768  
2586 2769                          dbuf_set_data(db,
2587 2770                              arc_alloc_buf(dn->dn_objset->os_spa, db, type,
2588 2771                              db->db.db_size));
2589 2772                          bcopy(dr->dt.dl.dr_data->b_data, db->db.db_data,
2590 2773                              db->db.db_size);
2591 2774                  }
2592 2775          }
2593 2776  
2594 2777          if (multilist_link_active(&db->db_cache_link)) {
2595 2778                  ASSERT(refcount_is_zero(&db->db_holds));
2596      -                multilist_remove(dbuf_cache, db);
2597      -                (void) refcount_remove_many(&dbuf_cache_size,
     2779 +                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
     2780 +                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
     2781 +
     2782 +                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
     2783 +                (void) refcount_remove_many(
     2784 +                    &dbuf_caches[db->db_caching_status].size,
2598 2785                      db->db.db_size, db);
     2786 +
     2787 +                db->db_caching_status = DB_NO_CACHE;
2599 2788          }
2600 2789          (void) refcount_add(&db->db_holds, tag);
2601 2790          DBUF_VERIFY(db);
2602 2791          mutex_exit(&db->db_mtx);
2603 2792  
2604 2793          /* NOTE: we can't rele the parent until after we drop the db_mtx */
2605 2794          if (parent)
2606 2795                  dbuf_rele(parent, NULL);
2607 2796  
2608 2797          ASSERT3P(DB_DNODE(db), ==, dn);
2609 2798          ASSERT3U(db->db_blkid, ==, blkid);
2610 2799          ASSERT3U(db->db_level, ==, level);
2611 2800          *dbp = db;
2612 2801  
2613 2802          return (0);
2614 2803  }
2615 2804  
2616 2805  dmu_buf_impl_t *
2617 2806  dbuf_hold(dnode_t *dn, uint64_t blkid, void *tag)
2618 2807  {
2619 2808          return (dbuf_hold_level(dn, 0, blkid, tag));
2620 2809  }
2621 2810  
2622 2811  dmu_buf_impl_t *
2623 2812  dbuf_hold_level(dnode_t *dn, int level, uint64_t blkid, void *tag)
2624 2813  {
2625 2814          dmu_buf_impl_t *db;
2626 2815          int err = dbuf_hold_impl(dn, level, blkid, FALSE, FALSE, tag, &db);
2627 2816          return (err ? NULL : db);
2628 2817  }
2629 2818  
2630 2819  void
2631 2820  dbuf_create_bonus(dnode_t *dn)
2632 2821  {
2633 2822          ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
2634 2823  
2635 2824          ASSERT(dn->dn_bonus == NULL);
2636 2825          dn->dn_bonus = dbuf_create(dn, 0, DMU_BONUS_BLKID, dn->dn_dbuf, NULL);
2637 2826  }
2638 2827  
2639 2828  int
2640 2829  dbuf_spill_set_blksz(dmu_buf_t *db_fake, uint64_t blksz, dmu_tx_t *tx)
2641 2830  {
2642 2831          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
2643 2832          dnode_t *dn;
2644 2833  
2645 2834          if (db->db_blkid != DMU_SPILL_BLKID)
2646 2835                  return (SET_ERROR(ENOTSUP));
2647 2836          if (blksz == 0)
2648 2837                  blksz = SPA_MINBLOCKSIZE;
2649 2838          ASSERT3U(blksz, <=, spa_maxblocksize(dmu_objset_spa(db->db_objset)));
2650 2839          blksz = P2ROUNDUP(blksz, SPA_MINBLOCKSIZE);
2651 2840  
2652 2841          DB_DNODE_ENTER(db);
2653 2842          dn = DB_DNODE(db);
2654 2843          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
2655 2844          dbuf_new_size(db, blksz, tx);
2656 2845          rw_exit(&dn->dn_struct_rwlock);
2657 2846          DB_DNODE_EXIT(db);
2658 2847  
2659 2848          return (0);
2660 2849  }
2661 2850  
2662 2851  void
2663 2852  dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
2664 2853  {
2665 2854          dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx);
2666 2855  }
2667 2856  
2668 2857  #pragma weak dmu_buf_add_ref = dbuf_add_ref
2669 2858  void
2670 2859  dbuf_add_ref(dmu_buf_impl_t *db, void *tag)
2671 2860  {
2672 2861          int64_t holds = refcount_add(&db->db_holds, tag);
2673 2862          ASSERT3S(holds, >, 1);
2674 2863  }
2675 2864  
2676 2865  #pragma weak dmu_buf_try_add_ref = dbuf_try_add_ref
2677 2866  boolean_t
2678 2867  dbuf_try_add_ref(dmu_buf_t *db_fake, objset_t *os, uint64_t obj, uint64_t blkid,
2679 2868      void *tag)
2680 2869  {
2681 2870          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
2682 2871          dmu_buf_impl_t *found_db;
2683 2872          boolean_t result = B_FALSE;
2684 2873  
2685 2874          if (db->db_blkid == DMU_BONUS_BLKID)
2686 2875                  found_db = dbuf_find_bonus(os, obj);
2687 2876          else
2688 2877                  found_db = dbuf_find(os, obj, 0, blkid);
2689 2878  
2690 2879          if (found_db != NULL) {
2691 2880                  if (db == found_db && dbuf_refcount(db) > db->db_dirtycnt) {
2692 2881                          (void) refcount_add(&db->db_holds, tag);
2693 2882                          result = B_TRUE;
2694 2883                  }
2695 2884                  mutex_exit(&db->db_mtx);
2696 2885          }
2697 2886          return (result);
2698 2887  }
2699 2888  
2700 2889  /*
2701 2890   * If you call dbuf_rele() you had better not be referencing the dnode handle
2702 2891   * unless you have some other direct or indirect hold on the dnode. (An indirect
2703 2892   * hold is a hold on one of the dnode's dbufs, including the bonus buffer.)
2704 2893   * Without that, the dbuf_rele() could lead to a dnode_rele() followed by the
2705 2894   * dnode's parent dbuf evicting its dnode handles.
2706 2895   */
2707 2896  void
2708 2897  dbuf_rele(dmu_buf_impl_t *db, void *tag)
2709 2898  {
2710 2899          mutex_enter(&db->db_mtx);
2711 2900          dbuf_rele_and_unlock(db, tag);
2712 2901  }
2713 2902  
2714 2903  void
2715 2904  dmu_buf_rele(dmu_buf_t *db, void *tag)
2716 2905  {
2717 2906          dbuf_rele((dmu_buf_impl_t *)db, tag);
2718 2907  }
2719 2908  
2720 2909  /*
2721 2910   * dbuf_rele() for an already-locked dbuf.  This is necessary to allow
2722 2911   * db_dirtycnt and db_holds to be updated atomically.
2723 2912   */
2724 2913  void
2725 2914  dbuf_rele_and_unlock(dmu_buf_impl_t *db, void *tag)
2726 2915  {
2727 2916          int64_t holds;
2728 2917  
2729 2918          ASSERT(MUTEX_HELD(&db->db_mtx));
2730 2919          DBUF_VERIFY(db);
2731 2920  
2732 2921          /*
2733 2922           * Remove the reference to the dbuf before removing its hold on the
2734 2923           * dnode so we can guarantee in dnode_move() that a referenced bonus
2735 2924           * buffer has a corresponding dnode hold.
2736 2925           */
2737 2926          holds = refcount_remove(&db->db_holds, tag);
2738 2927          ASSERT(holds >= 0);
2739 2928  
2740 2929          /*
2741 2930           * We can't freeze indirects if there is a possibility that they
2742 2931           * may be modified in the current syncing context.
2743 2932           */
2744 2933          if (db->db_buf != NULL &&
2745 2934              holds == (db->db_level == 0 ? db->db_dirtycnt : 0)) {
2746 2935                  arc_buf_freeze(db->db_buf);
2747 2936          }
2748 2937  
2749 2938          if (holds == db->db_dirtycnt &&
2750 2939              db->db_level == 0 && db->db_user_immediate_evict)
2751 2940                  dbuf_evict_user(db);
2752 2941  
2753 2942          if (holds == 0) {
2754 2943                  if (db->db_blkid == DMU_BONUS_BLKID) {
2755 2944                          dnode_t *dn;
2756 2945                          boolean_t evict_dbuf = db->db_pending_evict;
2757 2946  
2758 2947                          /*
2759 2948                           * If the dnode moves here, we cannot cross this
2760 2949                           * barrier until the move completes.
2761 2950                           */
2762 2951                          DB_DNODE_ENTER(db);
2763 2952  
2764 2953                          dn = DB_DNODE(db);
2765 2954                          atomic_dec_32(&dn->dn_dbufs_count);
2766 2955  
2767 2956                          /*
2768 2957                           * Decrementing the dbuf count means that the bonus
2769 2958                           * buffer's dnode hold is no longer discounted in
2770 2959                           * dnode_move(). The dnode cannot move until after
2771 2960                           * the dnode_rele() below.
2772 2961                           */
2773 2962                          DB_DNODE_EXIT(db);
2774 2963  
2775 2964                          /*
2776 2965                           * Do not reference db after its lock is dropped.
2777 2966                           * Another thread may evict it.
2778 2967                           */
2779 2968                          mutex_exit(&db->db_mtx);
2780 2969  
2781 2970                          if (evict_dbuf)
2782 2971                                  dnode_evict_bonus(dn);
2783 2972  
2784 2973                          dnode_rele(dn, db);
2785 2974                  } else if (db->db_buf == NULL) {
2786 2975                          /*
2787 2976                           * This is a special case: we never associated this
2788 2977                           * dbuf with any data allocated from the ARC.
2789 2978                           */
2790 2979                          ASSERT(db->db_state == DB_UNCACHED ||
2791 2980                              db->db_state == DB_NOFILL);
2792 2981                          dbuf_destroy(db);
2793 2982                  } else if (arc_released(db->db_buf)) {
2794 2983                          /*
2795 2984                           * This dbuf has anonymous data associated with it.
2796 2985                           */
2797 2986                          dbuf_destroy(db);
2798 2987                  } else {
2799 2988                          boolean_t do_arc_evict = B_FALSE;
2800 2989                          blkptr_t bp;
2801 2990                          spa_t *spa = dmu_objset_spa(db->db_objset);
2802 2991  
2803 2992                          if (!DBUF_IS_CACHEABLE(db) &&
2804 2993                              db->db_blkptr != NULL &&
2805 2994                              !BP_IS_HOLE(db->db_blkptr) &&
2806 2995                              !BP_IS_EMBEDDED(db->db_blkptr)) {
2807 2996                                  do_arc_evict = B_TRUE;
2808 2997                                  bp = *db->db_blkptr;
2809 2998                          }
2810 2999  
2811 3000                          if (!DBUF_IS_CACHEABLE(db) ||
2812 3001                              db->db_pending_evict) {
2813 3002                                  dbuf_destroy(db);
2814 3003                          } else if (!multilist_link_active(&db->db_cache_link)) {
2815      -                                multilist_insert(dbuf_cache, db);
2816      -                                (void) refcount_add_many(&dbuf_cache_size,
     3004 +                                ASSERT3U(db->db_caching_status, ==,
     3005 +                                    DB_NO_CACHE);
     3006 +
     3007 +                                dbuf_cached_state_t dcs =
     3008 +                                    dbuf_include_in_metadata_cache(db) ?
     3009 +                                    DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
     3010 +                                db->db_caching_status = dcs;
     3011 +
     3012 +                                multilist_insert(dbuf_caches[dcs].cache, db);
     3013 +                                (void) refcount_add_many(&dbuf_caches[dcs].size,
2817 3014                                      db->db.db_size, db);
2818 3015                                  mutex_exit(&db->db_mtx);
2819 3016  
2820      -                                dbuf_evict_notify();
     3017 +                                if (db->db_caching_status == DB_DBUF_CACHE) {
     3018 +                                        dbuf_evict_notify();
     3019 +                                }
2821 3020                          }
2822 3021  
2823 3022                          if (do_arc_evict)
2824 3023                                  arc_freed(spa, &bp);
2825 3024                  }
2826 3025          } else {
2827 3026                  mutex_exit(&db->db_mtx);
2828 3027          }
2829 3028  
2830 3029  }
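
Condensed, the last-hold path above either destroys the dbuf or parks it on one of the two caches, and only the ordinary cache wakes the eviction thread; the metadata cache introduced by the 9337 change is not trimmed by dbuf_evict_notify(). A simplified restatement of that decision (bonus, anonymous-buffer, and arc_freed() cases omitted):

    /*
     * Simplified sketch of the final-hold disposition above; the real code
     * also adjusts the per-cache size refcount and drops db_mtx first.
     */
    if (!DBUF_IS_CACHEABLE(db) || db->db_pending_evict) {
            dbuf_destroy(db);
    } else {
            dbuf_cached_state_t dcs = dbuf_include_in_metadata_cache(db) ?
                DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;

            db->db_caching_status = dcs;
            multilist_insert(dbuf_caches[dcs].cache, db);
            if (dcs == DB_DBUF_CACHE)
                    dbuf_evict_notify();    /* only the main cache is sized */
    }
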
2831 3030  
2832 3031  #pragma weak dmu_buf_refcount = dbuf_refcount
2833 3032  uint64_t
2834 3033  dbuf_refcount(dmu_buf_impl_t *db)
2835 3034  {
2836 3035          return (refcount_count(&db->db_holds));
2837 3036  }
2838 3037  
2839 3038  void *
2840 3039  dmu_buf_replace_user(dmu_buf_t *db_fake, dmu_buf_user_t *old_user,
2841 3040      dmu_buf_user_t *new_user)
2842 3041  {
2843 3042          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
2844 3043  
2845 3044          mutex_enter(&db->db_mtx);
2846 3045          dbuf_verify_user(db, DBVU_NOT_EVICTING);
2847 3046          if (db->db_user == old_user)
2848 3047                  db->db_user = new_user;
2849 3048          else
2850 3049                  old_user = db->db_user;
2851 3050          dbuf_verify_user(db, DBVU_NOT_EVICTING);
2852 3051          mutex_exit(&db->db_mtx);
2853 3052  
2854 3053          return (old_user);
2855 3054  }
2856 3055  
2857 3056  void *
2858 3057  dmu_buf_set_user(dmu_buf_t *db_fake, dmu_buf_user_t *user)
2859 3058  {
2860 3059          return (dmu_buf_replace_user(db_fake, NULL, user));
2861 3060  }
2862 3061  
2863 3062  void *
2864 3063  dmu_buf_set_user_ie(dmu_buf_t *db_fake, dmu_buf_user_t *user)
2865 3064  {
2866 3065          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
2867 3066  
2868 3067          db->db_user_immediate_evict = TRUE;
2869 3068          return (dmu_buf_set_user(db_fake, user));
2870 3069  }
2871 3070  
2872 3071  void *
2873 3072  dmu_buf_remove_user(dmu_buf_t *db_fake, dmu_buf_user_t *user)
2874 3073  {
2875 3074          return (dmu_buf_replace_user(db_fake, user, NULL));
2876 3075  }
2877 3076  
2878 3077  void *
2879 3078  dmu_buf_get_user(dmu_buf_t *db_fake)
2880 3079  {
2881 3080          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
2882 3081  
2883 3082          dbuf_verify_user(db, DBVU_NOT_EVICTING);
2884 3083          return (db->db_user);
2885 3084  }
2886 3085  
2887 3086  void
2888 3087  dmu_buf_user_evict_wait()
2889 3088  {
2890 3089          taskq_wait(dbu_evict_taskq);
2891 3090  }
2892 3091  
2893 3092  blkptr_t *
2894 3093  dmu_buf_get_blkptr(dmu_buf_t *db)
2895 3094  {
2896 3095          dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
2897 3096          return (dbi->db_blkptr);
2898 3097  }
2899 3098  
2900 3099  objset_t *
2901 3100  dmu_buf_get_objset(dmu_buf_t *db)
2902 3101  {
2903 3102          dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
2904 3103          return (dbi->db_objset);
2905 3104  }
2906 3105  
2907 3106  dnode_t *
2908 3107  dmu_buf_dnode_enter(dmu_buf_t *db)
2909 3108  {
2910 3109          dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
2911 3110          DB_DNODE_ENTER(dbi);
2912 3111          return (DB_DNODE(dbi));
2913 3112  }
2914 3113  
2915 3114  void
2916 3115  dmu_buf_dnode_exit(dmu_buf_t *db)
2917 3116  {
2918 3117          dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
2919 3118          DB_DNODE_EXIT(dbi);
2920 3119  }
2921 3120  
2922 3121  static void
2923 3122  dbuf_check_blkptr(dnode_t *dn, dmu_buf_impl_t *db)
2924 3123  {
2925 3124          /* ASSERT(dmu_tx_is_syncing(tx)) */
2926 3125          ASSERT(MUTEX_HELD(&db->db_mtx));
2927 3126  
2928 3127          if (db->db_blkptr != NULL)
2929 3128                  return;
2930 3129  
2931 3130          if (db->db_blkid == DMU_SPILL_BLKID) {
2932 3131                  db->db_blkptr = &dn->dn_phys->dn_spill;
2933 3132                  BP_ZERO(db->db_blkptr);
2934 3133                  return;
2935 3134          }
2936 3135          if (db->db_level == dn->dn_phys->dn_nlevels-1) {
2937 3136                  /*
2938 3137                   * This buffer was allocated at a time when there was
2939 3138                   * no available blkptrs from the dnode, or it was
2940 3139                   * inappropriate to hook it in (i.e., nlevels mis-match).
2941 3140                   */
2942 3141                  ASSERT(db->db_blkid < dn->dn_phys->dn_nblkptr);
2943 3142                  ASSERT(db->db_parent == NULL);
2944 3143                  db->db_parent = dn->dn_dbuf;
2945 3144                  db->db_blkptr = &dn->dn_phys->dn_blkptr[db->db_blkid];
2946 3145                  DBUF_VERIFY(db);
2947 3146          } else {
2948 3147                  dmu_buf_impl_t *parent = db->db_parent;
2949 3148                  int epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
2950 3149  
2951 3150                  ASSERT(dn->dn_phys->dn_nlevels > 1);
2952 3151                  if (parent == NULL) {
2953 3152                          mutex_exit(&db->db_mtx);
2954 3153                          rw_enter(&dn->dn_struct_rwlock, RW_READER);
2955 3154                          parent = dbuf_hold_level(dn, db->db_level + 1,
2956 3155                              db->db_blkid >> epbs, db);
2957 3156                          rw_exit(&dn->dn_struct_rwlock);
2958 3157                          mutex_enter(&db->db_mtx);
2959 3158                          db->db_parent = parent;
2960 3159                  }
2961 3160                  db->db_blkptr = (blkptr_t *)parent->db.db_data +
2962 3161                      (db->db_blkid & ((1ULL << epbs) - 1));
2963 3162                  DBUF_VERIFY(db);
2964 3163          }
2965 3164  }
2966 3165  
2967 3166  static void
2968 3167  dbuf_sync_indirect(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
2969 3168  {
2970 3169          dmu_buf_impl_t *db = dr->dr_dbuf;
2971 3170          dnode_t *dn;
2972 3171          zio_t *zio;
2973 3172  
2974 3173          ASSERT(dmu_tx_is_syncing(tx));
2975 3174  
2976 3175          dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);
2977 3176  
2978 3177          mutex_enter(&db->db_mtx);
2979 3178  
2980 3179          ASSERT(db->db_level > 0);
2981 3180          DBUF_VERIFY(db);
2982 3181  
2983 3182          /* Read the block if it hasn't been read yet. */
2984 3183          if (db->db_buf == NULL) {
2985 3184                  mutex_exit(&db->db_mtx);
2986 3185                  (void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED);
2987 3186                  mutex_enter(&db->db_mtx);
2988 3187          }
2989 3188          ASSERT3U(db->db_state, ==, DB_CACHED);
2990 3189          ASSERT(db->db_buf != NULL);
2991 3190  
2992 3191          DB_DNODE_ENTER(db);
2993 3192          dn = DB_DNODE(db);
2994 3193          /* Indirect block size must match what the dnode thinks it is. */
2995 3194          ASSERT3U(db->db.db_size, ==, 1<<dn->dn_phys->dn_indblkshift);
2996 3195          dbuf_check_blkptr(dn, db);
2997 3196          DB_DNODE_EXIT(db);
2998 3197  
2999 3198          /* Provide the pending dirty record to child dbufs */
3000 3199          db->db_data_pending = dr;
3001 3200  
3002 3201          mutex_exit(&db->db_mtx);
3003      -
3004 3202          dbuf_write(dr, db->db_buf, tx);
3005 3203  
3006 3204          zio = dr->dr_zio;
3007 3205          mutex_enter(&dr->dt.di.dr_mtx);
3008 3206          dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);
3009 3207          ASSERT(list_head(&dr->dt.di.dr_children) == NULL);
3010 3208          mutex_exit(&dr->dt.di.dr_mtx);
3011 3209          zio_nowait(zio);
3012 3210  }
3013 3211  
3014 3212  static void
3015 3213  dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
3016 3214  {
3017 3215          arc_buf_t **datap = &dr->dt.dl.dr_data;
3018 3216          dmu_buf_impl_t *db = dr->dr_dbuf;
3019 3217          dnode_t *dn;
3020 3218          objset_t *os;
3021 3219          uint64_t txg = tx->tx_txg;
3022 3220  
3023 3221          ASSERT(dmu_tx_is_syncing(tx));
3024 3222  
3025 3223          dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);
3026 3224  
3027 3225          mutex_enter(&db->db_mtx);
3028 3226          /*
3029 3227           * To be synced, we must be dirtied.  But we
3030 3228           * might have been freed after the dirty.
3031 3229           */
3032 3230          if (db->db_state == DB_UNCACHED) {
3033 3231                  /* This buffer has been freed since it was dirtied */
3034 3232                  ASSERT(db->db.db_data == NULL);
3035 3233          } else if (db->db_state == DB_FILL) {
3036 3234                  /* This buffer was freed and is now being re-filled */
3037 3235                  ASSERT(db->db.db_data != dr->dt.dl.dr_data);
3038 3236          } else {
3039 3237                  ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL);
3040 3238          }
3041 3239          DBUF_VERIFY(db);
3042 3240  
3043 3241          DB_DNODE_ENTER(db);
3044 3242          dn = DB_DNODE(db);
3045 3243  
3046 3244          if (db->db_blkid == DMU_SPILL_BLKID) {
3047 3245                  mutex_enter(&dn->dn_mtx);
3048 3246                  dn->dn_phys->dn_flags |= DNODE_FLAG_SPILL_BLKPTR;
3049 3247                  mutex_exit(&dn->dn_mtx);
3050 3248          }
3051 3249  
3052 3250          /*
3053 3251           * If this is a bonus buffer, simply copy the bonus data into the
3054 3252           * dnode.  It will be written out when the dnode is synced (and it
3055 3253           * will be synced, since it must have been dirty for dbuf_sync to
3056 3254           * be called).
3057 3255           */
3058 3256          if (db->db_blkid == DMU_BONUS_BLKID) {
3059 3257                  dbuf_dirty_record_t **drp;
3060 3258  
3061 3259                  ASSERT(*datap != NULL);
3062 3260                  ASSERT0(db->db_level);
3063 3261                  ASSERT3U(dn->dn_phys->dn_bonuslen, <=, DN_MAX_BONUSLEN);
3064 3262                  bcopy(*datap, DN_BONUS(dn->dn_phys), dn->dn_phys->dn_bonuslen);
3065 3263                  DB_DNODE_EXIT(db);
3066 3264  
3067 3265                  if (*datap != db->db.db_data) {
3068 3266                          zio_buf_free(*datap, DN_MAX_BONUSLEN);
3069 3267                          arc_space_return(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
3070 3268                  }
3071 3269                  db->db_data_pending = NULL;
3072 3270                  drp = &db->db_last_dirty;
3073 3271                  while (*drp != dr)
3074 3272                          drp = &(*drp)->dr_next;
3075 3273                  ASSERT(dr->dr_next == NULL);
3076 3274                  ASSERT(dr->dr_dbuf == db);
3077 3275                  *drp = dr->dr_next;
3078 3276                  kmem_free(dr, sizeof (dbuf_dirty_record_t));
3079 3277                  ASSERT(db->db_dirtycnt > 0);
3080 3278                  db->db_dirtycnt -= 1;
3081 3279                  dbuf_rele_and_unlock(db, (void *)(uintptr_t)txg);
3082 3280                  return;
3083 3281          }
3084 3282  
3085 3283          os = dn->dn_objset;
3086 3284  
3087 3285          /*
3088 3286           * This function may have dropped the db_mtx lock allowing a dmu_sync
3089 3287           * operation to sneak in. As a result, we need to ensure that we
3090 3288           * don't check the dr_override_state until we have returned from
3091 3289           * dbuf_check_blkptr.
3092 3290           */
3093 3291          dbuf_check_blkptr(dn, db);
3094 3292  
3095 3293          /*
3096 3294           * If this buffer is in the middle of an immediate write,
3097 3295           * wait for the synchronous IO to complete.
3098 3296           */
3099 3297          while (dr->dt.dl.dr_override_state == DR_IN_DMU_SYNC) {
3100 3298                  ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT);
3101 3299                  cv_wait(&db->db_changed, &db->db_mtx);
3102 3300                  ASSERT(dr->dt.dl.dr_override_state != DR_NOT_OVERRIDDEN);
3103 3301          }
3104 3302  
3105 3303          if (db->db_state != DB_NOFILL &&
3106 3304              dn->dn_object != DMU_META_DNODE_OBJECT &&
3107 3305              refcount_count(&db->db_holds) > 1 &&
3108 3306              dr->dt.dl.dr_override_state != DR_OVERRIDDEN &&
3109 3307              *datap == db->db_buf) {
3110 3308                  /*
3111 3309                   * If this buffer is currently "in use" (i.e., there
3112 3310                   * are active holds and db_data still references it),
3113 3311                   * then make a copy before we start the write so that
3114 3312                   * any modifications from the open txg will not leak
3115 3313                   * into this write.
3116 3314                   *
3117 3315                   * NOTE: this copy does not need to be made for
3118 3316                   * objects only modified in the syncing context (e.g.
3119 3317                   * dnode blocks).
3120 3318                   */
3121 3319                  int psize = arc_buf_size(*datap);
3122 3320                  arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
3123 3321                  enum zio_compress compress_type = arc_get_compression(*datap);
3124 3322  
3125 3323                  if (compress_type == ZIO_COMPRESS_OFF) {
3126 3324                          *datap = arc_alloc_buf(os->os_spa, db, type, psize);
3127 3325                  } else {
3128 3326                          ASSERT3U(type, ==, ARC_BUFC_DATA);
3129 3327                          int lsize = arc_buf_lsize(*datap);
3130 3328                          *datap = arc_alloc_compressed_buf(os->os_spa, db,
3131 3329                              psize, lsize, compress_type);
3132 3330                  }
3133 3331                  bcopy(db->db.db_data, (*datap)->b_data, psize);
3134 3332          }
3135 3333          db->db_data_pending = dr;
3136 3334  
3137 3335          mutex_exit(&db->db_mtx);
3138 3336  
3139 3337          dbuf_write(dr, *datap, tx);
3140 3338  
3141 3339          ASSERT(!list_link_active(&dr->dr_dirty_node));
3142 3340          if (dn->dn_object == DMU_META_DNODE_OBJECT) {
3143 3341                  list_insert_tail(&dn->dn_dirty_records[txg&TXG_MASK], dr);
3144 3342                  DB_DNODE_EXIT(db);
3145 3343          } else {
3146 3344                  /*
3147 3345                   * Although zio_nowait() does not "wait for an IO", it does
3148 3346                   * initiate the IO. If this is an empty write, it seems plausible
3149 3347                   * that the IO could actually be completed before the nowait
3150 3348                   * returns. We need to DB_DNODE_EXIT() first in case
3151 3349                   * zio_nowait() invalidates the dbuf.
3152 3350                   */
3153 3351                  DB_DNODE_EXIT(db);
3154 3352                  zio_nowait(dr->dr_zio);
3155 3353          }
3156 3354  }
3157 3355  
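For illustration, a minimal userland sketch of the copy-before-write decision made in dbuf_sync_leaf() above: when the buffer still has other holds, the syncing write takes its own snapshot so open-context modifications cannot leak into it. The fake_buf_t type and data_for_sync_write() helper are hypothetical stand-ins, not the kernel interfaces; the real code allocates through the ARC and preserves the compression of the source buffer.

        /* Simplified stand-in for an ARC buffer; not a kernel structure. */
        #include <stdlib.h>
        #include <string.h>

        typedef struct {
                void    *buf_data;
                size_t  buf_size;
                int     buf_holds;
        } fake_buf_t;

        static void *
        data_for_sync_write(fake_buf_t *b)
        {
                void *copy;

                if (b->buf_holds <= 1)
                        return (b->buf_data);   /* nobody else can modify it */

                /* In use: snapshot the current contents for the write. */
                copy = malloc(b->buf_size);
                if (copy != NULL)
                        memcpy(copy, b->buf_data, b->buf_size);
                return (copy);
        }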
3158 3356  void
3159 3357  dbuf_sync_list(list_t *list, int level, dmu_tx_t *tx)
3160 3358  {
3161 3359          dbuf_dirty_record_t *dr;
3162 3360  
3163 3361          while ((dr = list_head(list)) != NULL) {
3164 3362                  if (dr->dr_zio != NULL) {
3165 3363                          /*
3166 3364                           * If we find an already initialized zio then we
3167 3365                           * are processing the meta-dnode, and we have finished.
3168 3366                           * The dbufs for all dnodes are put back on the list
3169 3367                           * during processing, so that we can zio_wait()
3170 3368                           * these IOs after initiating all child IOs.
3171 3369                           */
3172 3370                          ASSERT3U(dr->dr_dbuf->db.db_object, ==,
3173 3371                              DMU_META_DNODE_OBJECT);
3174 3372                          break;
3175 3373                  }
3176 3374                  if (dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID &&
3177 3375                      dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID) {
3178 3376                          VERIFY3U(dr->dr_dbuf->db_level, ==, level);
3179 3377                  }
3180 3378                  list_remove(list, dr);
3181 3379                  if (dr->dr_dbuf->db_level > 0)
3182 3380                          dbuf_sync_indirect(dr, tx);
3183 3381                  else
3184 3382                          dbuf_sync_leaf(dr, tx);
3185 3383          }
3186 3384  }
3187 3385  
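A small sketch of the dispatch loop in dbuf_sync_list() above, assuming a simplified singly linked list: records are removed in order, indirect dbufs go to the indirect path, leaves to the leaf path, and a record that already carries a zio marks the point where the meta-dnode pass stops. All types and helper names below are hypothetical.

        typedef struct fake_dr {
                struct fake_dr  *dr_next;
                int             dr_level;
                int             dr_has_zio;
        } fake_dr_t;                            /* hypothetical stand-in */

        static void sync_indirect(fake_dr_t *dr) { (void)dr; }
        static void sync_leaf(fake_dr_t *dr) { (void)dr; }

        static void
        sync_list(fake_dr_t **headp)
        {
                fake_dr_t *dr;

                while ((dr = *headp) != NULL) {
                        if (dr->dr_has_zio)
                                break;          /* meta-dnode pass finished */
                        *headp = dr->dr_next;   /* remove from the list */
                        if (dr->dr_level > 0)
                                sync_indirect(dr);
                        else
                                sync_leaf(dr);
                }
        }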
3188 3386  /* ARGSUSED */
3189 3387  static void
3190 3388  dbuf_write_ready(zio_t *zio, arc_buf_t *buf, void *vdb)
3191 3389  {
3192 3390          dmu_buf_impl_t *db = vdb;
3193 3391          dnode_t *dn;
3194 3392          blkptr_t *bp = zio->io_bp;
3195 3393          blkptr_t *bp_orig = &zio->io_bp_orig;
3196 3394          spa_t *spa = zio->io_spa;
3197 3395          int64_t delta;
3198 3396          uint64_t fill = 0;
3199 3397          int i;
3200 3398  
3201 3399          ASSERT3P(db->db_blkptr, !=, NULL);
3202 3400          ASSERT3P(&db->db_data_pending->dr_bp_copy, ==, bp);
3203 3401  
3204 3402          DB_DNODE_ENTER(db);
3205 3403          dn = DB_DNODE(db);
3206 3404          delta = bp_get_dsize_sync(spa, bp) - bp_get_dsize_sync(spa, bp_orig);
3207 3405          dnode_diduse_space(dn, delta - zio->io_prev_space_delta);
3208 3406          zio->io_prev_space_delta = delta;
3209 3407  
3210 3408          if (bp->blk_birth != 0) {
3211 3409                  ASSERT((db->db_blkid != DMU_SPILL_BLKID &&
3212 3410                      BP_GET_TYPE(bp) == dn->dn_type) ||
3213 3411                      (db->db_blkid == DMU_SPILL_BLKID &&
3214 3412                      BP_GET_TYPE(bp) == dn->dn_bonustype) ||
3215 3413                      BP_IS_EMBEDDED(bp));
3216 3414                  ASSERT(BP_GET_LEVEL(bp) == db->db_level);
3217 3415          }
3218 3416  
3219 3417          mutex_enter(&db->db_mtx);
3220 3418  
3221 3419  #ifdef ZFS_DEBUG
3222 3420          if (db->db_blkid == DMU_SPILL_BLKID) {
3223 3421                  ASSERT(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR);
3224 3422                  ASSERT(!(BP_IS_HOLE(bp)) &&
3225 3423                      db->db_blkptr == &dn->dn_phys->dn_spill);
3226 3424          }
3227 3425  #endif
3228 3426  
3229 3427          if (db->db_level == 0) {
3230 3428                  mutex_enter(&dn->dn_mtx);
3231 3429                  if (db->db_blkid > dn->dn_phys->dn_maxblkid &&
3232 3430                      db->db_blkid != DMU_SPILL_BLKID)
3233 3431                          dn->dn_phys->dn_maxblkid = db->db_blkid;
3234 3432                  mutex_exit(&dn->dn_mtx);
3235 3433  
3236 3434                  if (dn->dn_type == DMU_OT_DNODE) {
3237 3435                          dnode_phys_t *dnp = db->db.db_data;
3238 3436                          for (i = db->db.db_size >> DNODE_SHIFT; i > 0;
3239 3437                              i--, dnp++) {
3240 3438                                  if (dnp->dn_type != DMU_OT_NONE)
3241 3439                                          fill++;
3242 3440                          }
3243 3441                  } else {
3244 3442                          if (BP_IS_HOLE(bp)) {
3245 3443                                  fill = 0;
3246 3444                          } else {
3247 3445                                  fill = 1;
3248 3446                          }
3249 3447                  }
3250 3448          } else {
3251 3449                  blkptr_t *ibp = db->db.db_data;
3252 3450                  ASSERT3U(db->db.db_size, ==, 1<<dn->dn_phys->dn_indblkshift);
3253 3451                  for (i = db->db.db_size >> SPA_BLKPTRSHIFT; i > 0; i--, ibp++) {
3254 3452                          if (BP_IS_HOLE(ibp))
3255 3453                                  continue;
3256 3454                          fill += BP_GET_FILL(ibp);
3257 3455                  }
3258 3456          }
3259 3457          DB_DNODE_EXIT(db);
3260 3458  
3261 3459          if (!BP_IS_EMBEDDED(bp))
3262 3460                  bp->blk_fill = fill;
3263 3461  
3264 3462          mutex_exit(&db->db_mtx);
3265 3463  
3266 3464          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
3267 3465          *db->db_blkptr = *bp;
3268 3466          rw_exit(&dn->dn_struct_rwlock);
3269 3467  }
3270 3468  
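A sketch of the fill-count accounting performed in dbuf_write_ready() above: a level-0 dnode block counts allocated dnodes, while an indirect block sums the fill of its non-hole children. The structures below are simplified stand-ins for dnode_phys_t and blkptr_t (only the counting logic is shown); DMU_OT_NONE is represented by a zero type.

        #include <stdint.h>
        #include <stddef.h>

        typedef struct { int dn_type; } fake_dnode_phys_t;             /* stand-in */
        typedef struct { uint64_t blk_fill; int is_hole; } fake_blkptr_t; /* stand-in */

        static uint64_t
        fill_for_dnode_block(const fake_dnode_phys_t *dnp, size_t ndnodes)
        {
                uint64_t fill = 0;
                for (size_t i = 0; i < ndnodes; i++)
                        if (dnp[i].dn_type != 0)        /* DMU_OT_NONE == 0 */
                                fill++;
                return (fill);
        }

        static uint64_t
        fill_for_indirect_block(const fake_blkptr_t *ibp, size_t nbps)
        {
                uint64_t fill = 0;
                for (size_t i = 0; i < nbps; i++)
                        if (!ibp[i].is_hole)
                                fill += ibp[i].blk_fill;
                return (fill);
        }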
3271 3469  /* ARGSUSED */
3272 3470  /*
3273 3471   * This function gets called just prior to running through the compression
3274 3472   * stage of the zio pipeline. If we're an indirect block comprised of only
3275 3473   * holes, then we want this indirect to be compressed away to a hole. In
3276 3474   * order to do that we must zero out any information about the holes that
3277 3475   * this indirect points to before we try to compress it.
3278 3476   */
3279 3477  static void
3280 3478  dbuf_write_children_ready(zio_t *zio, arc_buf_t *buf, void *vdb)
3281 3479  {
3282 3480          dmu_buf_impl_t *db = vdb;
3283 3481          dnode_t *dn;
3284 3482          blkptr_t *bp;
3285 3483          unsigned int epbs, i;
3286 3484  
3287 3485          ASSERT3U(db->db_level, >, 0);
3288 3486          DB_DNODE_ENTER(db);
3289 3487          dn = DB_DNODE(db);
3290 3488          epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
3291 3489          ASSERT3U(epbs, <, 31);
3292 3490  
3293 3491          /* Determine if all our children are holes */
3294 3492          for (i = 0, bp = db->db.db_data; i < 1 << epbs; i++, bp++) {
3295 3493                  if (!BP_IS_HOLE(bp))
3296 3494                          break;
3297 3495          }
3298 3496  
3299 3497          /*
3300 3498           * If all the children are holes, then zero them all out so that
3301 3499           * this indirect block may be compressed away to a hole.
3302 3500           */
3303 3501          if (i == 1 << epbs) {
3304 3502                  /*
3305 3503                   * We only found holes. Grab the rwlock to prevent
3306 3504                   * anybody from reading the blocks we're about to
3307 3505                   * zero out.
3308 3506                   */
3309 3507                  rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
3310 3508                  bzero(db->db.db_data, db->db.db_size);
3311 3509                  rw_exit(&dn->dn_struct_rwlock);
3312 3510          }
3313 3511          DB_DNODE_EXIT(db);
3314 3512  }
3315 3513  
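The all-holes test in dbuf_write_children_ready() above can be pictured with a short sketch: epbs is log2 of the number of block pointers per indirect block, and the buffer is zeroed only when every one of the 1 << epbs children is a hole. The blkptr stand-in below is hypothetical; the real check uses BP_IS_HOLE() on 128-byte blkptr_t entries under dn_struct_rwlock.

        #include <string.h>

        typedef struct { int is_hole; } fake_blkptr_t;         /* stand-in */

        static void
        zero_if_all_holes(fake_blkptr_t *bp, unsigned int indblkshift,
            unsigned int blkptrshift)
        {
                unsigned int epbs = indblkshift - blkptrshift;
                unsigned int i;

                for (i = 0; i < (1u << epbs); i++) {
                        if (!bp[i].is_hole)
                                return;         /* at least one child has data */
                }
                /* Only holes: zero the buffer so it compresses to a hole. */
                memset(bp, 0, sizeof (fake_blkptr_t) * (1u << epbs));
        }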
3316 3514  /*
3317 3515   * The SPA will call this callback several times for each zio - once
3318 3516   * for every physical child i/o (zio->io_phys_children times).  This
3319 3517   * allows the DMU to monitor the progress of each logical i/o.  For example,
3320 3518   * there may be 2 copies of an indirect block, or many fragments of a RAID-Z
3321 3519   * block.  There may be a long delay before all copies/fragments are completed,
3322 3520   * so this callback allows us to retire dirty space gradually, as the physical
3323 3521   * i/os complete.
3324 3522   */
3325 3523  /* ARGSUSED */
3326 3524  static void
3327 3525  dbuf_write_physdone(zio_t *zio, arc_buf_t *buf, void *arg)
3328 3526  {
3329 3527          dmu_buf_impl_t *db = arg;
3330 3528          objset_t *os = db->db_objset;
3331 3529          dsl_pool_t *dp = dmu_objset_pool(os);
3332 3530          dbuf_dirty_record_t *dr;
3333 3531          int delta = 0;
3334 3532  
3335 3533          dr = db->db_data_pending;
3336 3534          ASSERT3U(dr->dr_txg, ==, zio->io_txg);
3337 3535  
3338 3536          /*
3339 3537           * The callback will be called io_phys_children times.  Retire one
3340 3538           * portion of our dirty space each time we are called.  Any rounding
3341 3539           * error will be cleaned up by dsl_pool_sync()'s call to
3342 3540           * dsl_pool_undirty_space().
3343 3541           */
3344 3542          delta = dr->dr_accounted / zio->io_phys_children;
3345 3543          dsl_pool_undirty_space(dp, delta, zio->io_txg);
3346 3544  }
3347 3545  
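A short sketch of the gradual dirty-space retirement described in the comment above: each of the io_phys_children callbacks retires accounted / children bytes, and the rounding remainder is left for the end-of-txg cleanup in dsl_pool_sync(). The function below is purely illustrative and does not model the real dsl_pool interfaces.

        #include <stdint.h>
        #include <stdio.h>

        static void
        retire_dirty_space(uint64_t accounted, int phys_children)
        {
                uint64_t per_child = accounted / phys_children;
                uint64_t remainder = accounted - per_child * phys_children;

                for (int i = 0; i < phys_children; i++)
                        printf("callback %d retires %llu bytes\n", i,
                            (unsigned long long)per_child);
                printf("%llu bytes left for the end-of-txg cleanup\n",
                    (unsigned long long)remainder);
        }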
3348 3546  /* ARGSUSED */
3349 3547  static void
3350 3548  dbuf_write_done(zio_t *zio, arc_buf_t *buf, void *vdb)
3351 3549  {
3352 3550          dmu_buf_impl_t *db = vdb;
3353 3551          blkptr_t *bp_orig = &zio->io_bp_orig;
3354 3552          blkptr_t *bp = db->db_blkptr;
3355 3553          objset_t *os = db->db_objset;
3356 3554          dmu_tx_t *tx = os->os_synctx;
3357 3555          dbuf_dirty_record_t **drp, *dr;
3358 3556  
3359 3557          ASSERT0(zio->io_error);
3360 3558          ASSERT(db->db_blkptr == bp);
3361 3559  
3362 3560          /*
3363 3561           * For nopwrites and rewrites we ensure that the bp matches our
3364 3562           * original and bypass all the accounting.
3365 3563           */
3366 3564          if (zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)) {
3367 3565                  ASSERT(BP_EQUAL(bp, bp_orig));
3368 3566          } else {
3369 3567                  dsl_dataset_t *ds = os->os_dsl_dataset;
3370 3568                  (void) dsl_dataset_block_kill(ds, bp_orig, tx, B_TRUE);
3371 3569                  dsl_dataset_block_born(ds, bp, tx);
3372 3570          }
3373 3571  
3374 3572          mutex_enter(&db->db_mtx);
3375 3573  
3376 3574          DBUF_VERIFY(db);
3377 3575  
3378 3576          drp = &db->db_last_dirty;
3379 3577          while ((dr = *drp) != db->db_data_pending)
3380 3578                  drp = &dr->dr_next;
3381 3579          ASSERT(!list_link_active(&dr->dr_dirty_node));
3382 3580          ASSERT(dr->dr_dbuf == db);
3383 3581          ASSERT(dr->dr_next == NULL);
3384 3582          *drp = dr->dr_next;
3385 3583  
3386 3584  #ifdef ZFS_DEBUG
3387 3585          if (db->db_blkid == DMU_SPILL_BLKID) {
3388 3586                  dnode_t *dn;
3389 3587  
3390 3588                  DB_DNODE_ENTER(db);
3391 3589                  dn = DB_DNODE(db);
3392 3590                  ASSERT(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR);
3393 3591                  ASSERT(!(BP_IS_HOLE(db->db_blkptr)) &&
3394 3592                      db->db_blkptr == &dn->dn_phys->dn_spill);
3395 3593                  DB_DNODE_EXIT(db);
3396 3594          }
3397 3595  #endif
3398 3596  
3399 3597          if (db->db_level == 0) {
3400 3598                  ASSERT(db->db_blkid != DMU_BONUS_BLKID);
3401 3599                  ASSERT(dr->dt.dl.dr_override_state == DR_NOT_OVERRIDDEN);
3402 3600                  if (db->db_state != DB_NOFILL) {
3403 3601                          if (dr->dt.dl.dr_data != db->db_buf)
3404 3602                                  arc_buf_destroy(dr->dt.dl.dr_data, db);
3405 3603                  }
3406 3604          } else {
3407 3605                  dnode_t *dn;
3408 3606  
3409 3607                  DB_DNODE_ENTER(db);
3410 3608                  dn = DB_DNODE(db);
3411 3609                  ASSERT(list_head(&dr->dt.di.dr_children) == NULL);
3412 3610                  ASSERT3U(db->db.db_size, ==, 1 << dn->dn_phys->dn_indblkshift);
3413 3611                  if (!BP_IS_HOLE(db->db_blkptr)) {
3414 3612                          int epbs =
3415 3613                              dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
3416 3614                          ASSERT3U(db->db_blkid, <=,
3417 3615                              dn->dn_phys->dn_maxblkid >> (db->db_level * epbs));
3418 3616                          ASSERT3U(BP_GET_LSIZE(db->db_blkptr), ==,
3419 3617                              db->db.db_size);
3420 3618                  }
3421 3619                  DB_DNODE_EXIT(db);
3422 3620                  mutex_destroy(&dr->dt.di.dr_mtx);
3423 3621                  list_destroy(&dr->dt.di.dr_children);
3424 3622          }
3425 3623          kmem_free(dr, sizeof (dbuf_dirty_record_t));
3426 3624  
3427 3625          cv_broadcast(&db->db_changed);
3428 3626          ASSERT(db->db_dirtycnt > 0);
3429 3627          db->db_dirtycnt -= 1;
3430 3628          db->db_data_pending = NULL;
3431 3629          dbuf_rele_and_unlock(db, (void *)(uintptr_t)tx->tx_txg);
3432 3630  }
3433 3631  
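The dirty record is detached in dbuf_write_done() above with a pointer-to-pointer walk over db_last_dirty, which removes a node from a singly linked list without tracking a separate "previous" pointer. A minimal sketch of the same idiom, using a hypothetical dirty_rec_t:

        typedef struct dirty_rec {
                struct dirty_rec *dr_next;
                int dr_txg;
        } dirty_rec_t;                          /* hypothetical stand-in */

        static void
        unlink_dirty_rec(dirty_rec_t **headp, dirty_rec_t *target)
        {
                dirty_rec_t **drp = headp;

                /* Walk the chain of next-pointers until one points at target. */
                while (*drp != target)
                        drp = &(*drp)->dr_next;
                /* Splice target out without needing a previous pointer. */
                *drp = target->dr_next;
        }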
3434 3632  static void
3435 3633  dbuf_write_nofill_ready(zio_t *zio)
3436 3634  {
3437 3635          dbuf_write_ready(zio, NULL, zio->io_private);
3438 3636  }
3439 3637  
3440 3638  static void
3441 3639  dbuf_write_nofill_done(zio_t *zio)
3442 3640  {
3443 3641          dbuf_write_done(zio, NULL, zio->io_private);
3444 3642  }
3445 3643  
3446 3644  static void
3447 3645  dbuf_write_override_ready(zio_t *zio)
3448 3646  {
3449 3647          dbuf_dirty_record_t *dr = zio->io_private;
3450 3648          dmu_buf_impl_t *db = dr->dr_dbuf;
3451 3649  
3452 3650          dbuf_write_ready(zio, NULL, db);
3453 3651  }
3454 3652  
3455 3653  static void
3456 3654  dbuf_write_override_done(zio_t *zio)
3457 3655  {
3458 3656          dbuf_dirty_record_t *dr = zio->io_private;
3459 3657          dmu_buf_impl_t *db = dr->dr_dbuf;
3460 3658          blkptr_t *obp = &dr->dt.dl.dr_overridden_by;
3461 3659  
3462 3660          mutex_enter(&db->db_mtx);
3463 3661          if (!BP_EQUAL(zio->io_bp, obp)) {
3464 3662                  if (!BP_IS_HOLE(obp))
  
3465 3663                          dsl_free(spa_get_dsl(zio->io_spa), zio->io_txg, obp);
3466 3664                  arc_release(dr->dt.dl.dr_data, db);
3467 3665          }
3468 3666          mutex_exit(&db->db_mtx);
3469 3667          dbuf_write_done(zio, NULL, db);
3470 3668  
3471 3669          if (zio->io_abd != NULL)
3472 3670                  abd_put(zio->io_abd);
3473 3671  }
3474 3672  
3475      -typedef struct dbuf_remap_impl_callback_arg {
3476      -        objset_t        *drica_os;
3477      -        uint64_t        drica_blk_birth;
3478      -        dmu_tx_t        *drica_tx;
3479      -} dbuf_remap_impl_callback_arg_t;
3480      -
3481      -static void
3482      -dbuf_remap_impl_callback(uint64_t vdev, uint64_t offset, uint64_t size,
3483      -    void *arg)
3484      -{
3485      -        dbuf_remap_impl_callback_arg_t *drica = arg;
3486      -        objset_t *os = drica->drica_os;
3487      -        spa_t *spa = dmu_objset_spa(os);
3488      -        dmu_tx_t *tx = drica->drica_tx;
3489      -
3490      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3491      -
3492      -        if (os == spa_meta_objset(spa)) {
3493      -                spa_vdev_indirect_mark_obsolete(spa, vdev, offset, size, tx);
3494      -        } else {
3495      -                dsl_dataset_block_remapped(dmu_objset_ds(os), vdev, offset,
3496      -                    size, drica->drica_blk_birth, tx);
3497      -        }
3498      -}
3499      -
3500      -static void
3501      -dbuf_remap_impl(dnode_t *dn, blkptr_t *bp, dmu_tx_t *tx)
3502      -{
3503      -        blkptr_t bp_copy = *bp;
3504      -        spa_t *spa = dmu_objset_spa(dn->dn_objset);
3505      -        dbuf_remap_impl_callback_arg_t drica;
3506      -
3507      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3508      -
3509      -        drica.drica_os = dn->dn_objset;
3510      -        drica.drica_blk_birth = bp->blk_birth;
3511      -        drica.drica_tx = tx;
3512      -        if (spa_remap_blkptr(spa, &bp_copy, dbuf_remap_impl_callback,
3513      -            &drica)) {
3514      -                /*
3515      -                 * The struct_rwlock prevents dbuf_read_impl() from
3516      -                 * dereferencing the BP while we are changing it.  To
3517      -                 * avoid lock contention, only grab it when we are actually
3518      -                 * changing the BP.
3519      -                 */
3520      -                rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
3521      -                *bp = bp_copy;
3522      -                rw_exit(&dn->dn_struct_rwlock);
3523      -        }
3524      -}
3525      -
3526      -/*
3527      - * Returns true if a dbuf_remap would modify the dbuf. We do this by attempting
3528      - * to remap a copy of every bp in the dbuf.
3529      - */
3530      -boolean_t
3531      -dbuf_can_remap(const dmu_buf_impl_t *db)
3532      -{
3533      -        spa_t *spa = dmu_objset_spa(db->db_objset);
3534      -        blkptr_t *bp = db->db.db_data;
3535      -        boolean_t ret = B_FALSE;
3536      -
3537      -        ASSERT3U(db->db_level, >, 0);
3538      -        ASSERT3S(db->db_state, ==, DB_CACHED);
3539      -
3540      -        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
3541      -
3542      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3543      -        for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
3544      -                blkptr_t bp_copy = bp[i];
3545      -                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
3546      -                        ret = B_TRUE;
3547      -                        break;
3548      -                }
3549      -        }
3550      -        spa_config_exit(spa, SCL_VDEV, FTAG);
3551      -
3552      -        return (ret);
3553      -}
3554      -
3555      -boolean_t
3556      -dnode_needs_remap(const dnode_t *dn)
3557      -{
3558      -        spa_t *spa = dmu_objset_spa(dn->dn_objset);
3559      -        boolean_t ret = B_FALSE;
3560      -
3561      -        if (dn->dn_phys->dn_nlevels == 0) {
3562      -                return (B_FALSE);
3563      -        }
3564      -
3565      -        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
3566      -
3567      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3568      -        for (int j = 0; j < dn->dn_phys->dn_nblkptr; j++) {
3569      -                blkptr_t bp_copy = dn->dn_phys->dn_blkptr[j];
3570      -                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
3571      -                        ret = B_TRUE;
3572      -                        break;
3573      -                }
3574      -        }
3575      -        spa_config_exit(spa, SCL_VDEV, FTAG);
3576      -
3577      -        return (ret);
3578      -}
3579      -
3580      -/*
3581      - * Remap any existing BP's to concrete vdevs, if possible.
3582      - */
3583      -static void
3584      -dbuf_remap(dnode_t *dn, dmu_buf_impl_t *db, dmu_tx_t *tx)
3585      -{
3586      -        spa_t *spa = dmu_objset_spa(db->db_objset);
3587      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3588      -
3589      -        if (!spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL))
3590      -                return;
3591      -
3592      -        if (db->db_level > 0) {
3593      -                blkptr_t *bp = db->db.db_data;
3594      -                for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
3595      -                        dbuf_remap_impl(dn, &bp[i], tx);
3596      -                }
3597      -        } else if (db->db.db_object == DMU_META_DNODE_OBJECT) {
3598      -                dnode_phys_t *dnp = db->db.db_data;
3599      -                ASSERT3U(db->db_dnode_handle->dnh_dnode->dn_type, ==,
3600      -                    DMU_OT_DNODE);
3601      -                for (int i = 0; i < db->db.db_size >> DNODE_SHIFT; i++) {
3602      -                        for (int j = 0; j < dnp[i].dn_nblkptr; j++) {
3603      -                                dbuf_remap_impl(dn, &dnp[i].dn_blkptr[j], tx);
3604      -                        }
3605      -                }
3606      -        }
3607      -}
3608      -
3609      -
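The remap path above follows a copy-then-swap pattern: the block pointer is rewritten into a local copy first, and dn_struct_rwlock is taken only if the copy actually changed, which keeps lock contention low. A rough userland sketch of the same idea, with hypothetical types and a pthread rwlock standing in for the kernel lock:

        #include <pthread.h>
        #include <stdbool.h>

        typedef struct { unsigned long long vdev, offset; } fake_bp_t; /* stand-in */

        /* Hypothetical helper: returns true only if it rewrote the copy. */
        static bool try_remap(fake_bp_t *bp) { (void)bp; return (false); }

        static void
        remap_bp_in_place(fake_bp_t *bp, pthread_rwlock_t *lock)
        {
                fake_bp_t bp_copy = *bp;

                if (try_remap(&bp_copy)) {
                        /* Only contend on the lock when the BP really changes. */
                        pthread_rwlock_wrlock(lock);
                        *bp = bp_copy;
                        pthread_rwlock_unlock(lock);
                }
        }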
3610 3673  /* Issue I/O to commit a dirty buffer to disk. */
3611 3674  static void
3612 3675  dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx)
3613 3676  {
3614 3677          dmu_buf_impl_t *db = dr->dr_dbuf;
3615 3678          dnode_t *dn;
3616 3679          objset_t *os;
3617 3680          dmu_buf_impl_t *parent = db->db_parent;
3618 3681          uint64_t txg = tx->tx_txg;
3619 3682          zbookmark_phys_t zb;
3620 3683          zio_prop_t zp;
3621 3684          zio_t *zio;
3622 3685          int wp_flag = 0;
     3686 +        zio_smartcomp_info_t sc;
3623 3687  
3624 3688          ASSERT(dmu_tx_is_syncing(tx));
3625 3689  
3626 3690          DB_DNODE_ENTER(db);
3627 3691          dn = DB_DNODE(db);
3628 3692          os = dn->dn_objset;
3629 3693  
     3694 +        dnode_setup_zio_smartcomp(db, &sc);
     3695 +
3630 3696          if (db->db_state != DB_NOFILL) {
3631 3697                  if (db->db_level > 0 || dn->dn_type == DMU_OT_DNODE) {
3632 3698                          /*
3633 3699                           * Private object buffers are released here rather
3634 3700                           * than in dbuf_dirty() since they are only modified
3635 3701                           * in the syncing context and we don't want the
3636 3702                           * overhead of making multiple copies of the data.
3637 3703                           */
3638 3704                          if (BP_IS_HOLE(db->db_blkptr)) {
3639 3705                                  arc_buf_thaw(data);
3640 3706                          } else {
3641 3707                                  dbuf_release_bp(db);
3642 3708                          }
3643      -                        dbuf_remap(dn, db, tx);
3644 3709                  }
3645 3710          }
3646 3711  
3647 3712          if (parent != dn->dn_dbuf) {
3648 3713                  /* Our parent is an indirect block. */
3649 3714                  /* We have a dirty parent that has been scheduled for write. */
3650 3715                  ASSERT(parent && parent->db_data_pending);
3651 3716                  /* Our parent's buffer is one level closer to the dnode. */
3652 3717                  ASSERT(db->db_level == parent->db_level-1);
3653 3718                  /*
3654 3719                   * We're about to modify our parent's db_data by modifying
3655 3720                   * our block pointer, so the parent must be released.
3656 3721                   */
3657 3722                  ASSERT(arc_released(parent->db_buf));
3658 3723                  zio = parent->db_data_pending->dr_zio;
3659 3724          } else {
3660 3725                  /* Our parent is the dnode itself. */
3661 3726                  ASSERT((db->db_level == dn->dn_phys->dn_nlevels-1 &&
3662 3727                      db->db_blkid != DMU_SPILL_BLKID) ||
3663 3728                      (db->db_blkid == DMU_SPILL_BLKID && db->db_level == 0));
3664 3729                  if (db->db_blkid != DMU_SPILL_BLKID)
3665 3730                          ASSERT3P(db->db_blkptr, ==,
3666 3731                              &dn->dn_phys->dn_blkptr[db->db_blkid]);
3667 3732                  zio = dn->dn_zio;
3668 3733          }
3669 3734  
3670 3735          ASSERT(db->db_level == 0 || data == db->db_buf);
  
3671 3736          ASSERT3U(db->db_blkptr->blk_birth, <=, txg);
3672 3737          ASSERT(zio);
3673 3738  
3674 3739          SET_BOOKMARK(&zb, os->os_dsl_dataset ?
3675 3740              os->os_dsl_dataset->ds_object : DMU_META_OBJSET,
3676 3741              db->db.db_object, db->db_level, db->db_blkid);
3677 3742  
3678 3743          if (db->db_blkid == DMU_SPILL_BLKID)
3679 3744                  wp_flag = WP_SPILL;
3680 3745          wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
     3746 +        WP_SET_SPECIALCLASS(wp_flag, dr->dr_usesc);
3681 3747  
3682 3748          dmu_write_policy(os, dn, db->db_level, wp_flag, &zp);
3683 3749          DB_DNODE_EXIT(db);
3684 3750  
3685 3751          /*
3686 3752           * We copy the blkptr now (rather than when we instantiate the dirty
3687 3753           * record), because its value can change between open context and
3688 3754           * syncing context. We do not need to hold dn_struct_rwlock to read
3689 3755           * db_blkptr because we are in syncing context.
3690 3756           */
3691 3757          dr->dr_bp_copy = *db->db_blkptr;
3692 3758  
3693 3759          if (db->db_level == 0 &&
3694 3760              dr->dt.dl.dr_override_state == DR_OVERRIDDEN) {
3695 3761                  /*
  
3696 3762                   * The BP for this block has been provided by open context
3697 3763                   * (by dmu_sync() or dmu_buf_write_embedded()).
3698 3764                   */
3699 3765                  abd_t *contents = (data != NULL) ?
3700 3766                      abd_get_from_buf(data->b_data, arc_buf_size(data)) : NULL;
3701 3767  
3702 3768                  dr->dr_zio = zio_write(zio, os->os_spa, txg, &dr->dr_bp_copy,
3703 3769                      contents, db->db.db_size, db->db.db_size, &zp,
3704 3770                      dbuf_write_override_ready, NULL, NULL,
3705 3771                      dbuf_write_override_done,
3706      -                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
     3772 +                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb,
     3773 +                    &sc);
3707 3774                  mutex_enter(&db->db_mtx);
3708 3775                  dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
3709 3776                  zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
3710 3777                      dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite);
3711 3778                  mutex_exit(&db->db_mtx);
3712 3779          } else if (db->db_state == DB_NOFILL) {
3713 3780                  ASSERT(zp.zp_checksum == ZIO_CHECKSUM_OFF ||
3714 3781                      zp.zp_checksum == ZIO_CHECKSUM_NOPARITY);
3715 3782                  dr->dr_zio = zio_write(zio, os->os_spa, txg,
3716 3783                      &dr->dr_bp_copy, NULL, db->db.db_size, db->db.db_size, &zp,
3717 3784                      dbuf_write_nofill_ready, NULL, NULL,
3718 3785                      dbuf_write_nofill_done, db,
3719 3786                      ZIO_PRIORITY_ASYNC_WRITE,
3720      -                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb);
     3787 +                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb, &sc);
3721 3788          } else {
3722 3789                  ASSERT(arc_released(data));
3723 3790  
3724 3791                  /*
3725 3792                   * For indirect blocks, we want to set up the children
3726 3793                   * ready callback so that we can properly handle an indirect
3727 3794                   * block that only contains holes.
3728 3795                   */
3729 3796                  arc_done_func_t *children_ready_cb = NULL;
3730 3797                  if (db->db_level != 0)
3731 3798                          children_ready_cb = dbuf_write_children_ready;
3732 3799  
3733 3800                  dr->dr_zio = arc_write(zio, os->os_spa, txg,
3734 3801                      &dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
3735 3802                      &zp, dbuf_write_ready, children_ready_cb,
3736 3803                      dbuf_write_physdone, dbuf_write_done, db,
3737      -                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
     3804 +                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb, &sc);
3738 3805          }
3739 3806  }
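dbuf_write() ends by picking one of three issue paths, which the sketch below condenses: an overridden level-0 buffer reuses the BP supplied in open context, a NOFILL buffer issues a data-less write, and everything else goes through the normal arc_write() path. The enum and helper are hypothetical simplifications, not the zio API.

        #include <stdbool.h>

        typedef enum { WRITE_OVERRIDE, WRITE_NOFILL, WRITE_NORMAL } write_kind_t;

        static write_kind_t
        classify_write(int level, bool overridden, bool nofill)
        {
                if (level == 0 && overridden)
                        return (WRITE_OVERRIDE); /* BP provided in open context */
                if (nofill)
                        return (WRITE_NOFILL);   /* no data payload is written */
                return (WRITE_NORMAL);           /* regular arc_write() path */
        }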
    