NEX-19394 backport 9337 zfs get all is slow due to uncached metadata
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
 Conflicts:
  usr/src/uts/common/fs/zfs/dbuf.c
  usr/src/uts/common/fs/zfs/dmu.c
  usr/src/uts/common/fs/zfs/sys/dmu_objset.h
NEX-15468 panic - Deadlock: cycle in blocking chain with dbuf_destroy calling mutex_vector_enter
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16146 9188 increase size of dbuf cache to reduce indirect block decompression
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-5366 Race between unique_insert() and unique_remove() causes ZFS fsid change
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
NEX-5058 WBC: Race between purging a window and opening a new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
6047 SPARC boot should support feature@embedded_data
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-1823 Slow performance when deleting a large dataset
5911 ZFS "hangs" while deleting file
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
NEX-3558 KRRP Integration
NEX-3266 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
NEX-3165 segregate ddt in arc
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
DDT is placed either into special or into L2ARC, but not both
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line), place each kmutex_t on cache line by itself to avoid false sharing
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

          --- old/usr/src/uts/common/fs/zfs/dbuf.c
          +++ new/usr/src/uts/common/fs/zfs/dbuf.c
(12 lines elided)
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23      - * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
       23 + * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  24   24   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  25   25   * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  26   26   * Copyright (c) 2013, Joyent, Inc. All rights reserved.
  27   27   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
  28   28   * Copyright (c) 2014 Integros [integros.com]
  29   29   */
  30   30  
  31   31  #include <sys/zfs_context.h>
  32   32  #include <sys/dmu.h>
  33   33  #include <sys/dmu_send.h>
  34   34  #include <sys/dmu_impl.h>
  35   35  #include <sys/dbuf.h>
  36   36  #include <sys/dmu_objset.h>
  37   37  #include <sys/dsl_dataset.h>
  38   38  #include <sys/dsl_dir.h>
  39   39  #include <sys/dmu_tx.h>
  40   40  #include <sys/spa.h>
       41 +#include <sys/spa_impl.h>
  41   42  #include <sys/zio.h>
  42   43  #include <sys/dmu_zfetch.h>
  43   44  #include <sys/sa.h>
  44   45  #include <sys/sa_impl.h>
  45   46  #include <sys/zfeature.h>
  46   47  #include <sys/blkptr.h>
  47   48  #include <sys/range_tree.h>
  48   49  #include <sys/callb.h>
  49   50  #include <sys/abd.h>
  50      -#include <sys/vdev.h>
  51      -#include <sys/cityhash.h>
  52   51  
  53   52  uint_t zfs_dbuf_evict_key;
  54   53  
  55   54  static boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
  56   55  static void dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx);
  57   56  
  58   57  #ifndef __lint
  59   58  extern inline void dmu_buf_init_user(dmu_buf_user_t *dbu,
  60   59      dmu_buf_evict_func_t *evict_func_sync,
  61   60      dmu_buf_evict_func_t *evict_func_async,
(5 lines elided)
  67   66   */
  68   67  static kmem_cache_t *dbuf_kmem_cache;
  69   68  static taskq_t *dbu_evict_taskq;
  70   69  
  71   70  static kthread_t *dbuf_cache_evict_thread;
  72   71  static kmutex_t dbuf_evict_lock;
  73   72  static kcondvar_t dbuf_evict_cv;
  74   73  static boolean_t dbuf_evict_thread_exit;
  75   74  
  76   75  /*
  77      - * LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
  78      - * are not currently held but have been recently released. These dbufs
  79      - * are not eligible for arc eviction until they are aged out of the cache.
  80      - * Dbufs are added to the dbuf cache once the last hold is released. If a
  81      - * dbuf is later accessed and still exists in the dbuf cache, then it will
  82      - * be removed from the cache and later re-added to the head of the cache.
  83      - * Dbufs that are aged out of the cache will be immediately destroyed and
  84      - * become eligible for arc eviction.
       76 + * There are two dbuf caches; each dbuf can only be in one of them at a time.
       77 + *
       78 + * 1. Cache of metadata dbufs, to help make read-heavy administrative commands
       79 + *    from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
       80 + *    that represent the metadata that describes filesystems/snapshots/
       81 + *    bookmarks/properties/etc. We only evict from this cache when we export a
       82 + *    pool, to short-circuit as much I/O as possible for all administrative
       83 + *    commands that need the metadata. There is no eviction policy for this
       84 + *    cache, because we try to only include types in it which would occupy a
       85 + *    very small amount of space per object but create a large impact on the
       86 + *    performance of these commands. Instead, after it reaches a maximum size
       87 + *    (which should only happen on very small memory systems with a very large
       88 + *    number of filesystem objects), we stop taking new dbufs into the
       89 + *    metadata cache, instead putting them in the normal dbuf cache.
       90 + *
       91 + * 2. LRU cache of dbufs. The "dbuf cache" maintains a list of dbufs that
       92 + *    are not currently held but have been recently released. These dbufs
       93 + *    are not eligible for arc eviction until they are aged out of the cache.
       94 + *    Dbufs that are aged out of the cache will be immediately destroyed and
       95 + *    become eligible for arc eviction.
       96 + *
       97 + * Dbufs are added to these caches once the last hold is released. If a dbuf is
       98 + * later accessed and still exists in the dbuf cache, then it will be removed
       99 + * from the cache and later re-added to the head of the cache.
      100 + *
      101 + * If a given dbuf meets the requirements for the metadata cache, it will go
      102 + * there, otherwise it will be considered for the generic LRU dbuf cache. The
      103 + * caches and the refcounts tracking their sizes are stored in an array indexed
      104 + * by those caches' matching enum values (from dbuf_cached_state_t).
  85  105   */
  86      -static multilist_t *dbuf_cache;
  87      -static refcount_t dbuf_cache_size;
  88      -uint64_t dbuf_cache_max_bytes = 100 * 1024 * 1024;
      106 +typedef struct dbuf_cache {
      107 +        multilist_t *cache;
      108 +        refcount_t size;
      109 +} dbuf_cache_t;
      110 +dbuf_cache_t dbuf_caches[DB_CACHE_MAX];
  89  111  
  90      -/* Cap the size of the dbuf cache to log2 fraction of arc size. */
  91      -int dbuf_cache_max_shift = 5;
      112 +/* Size limits for the caches */
      113 +uint64_t dbuf_cache_max_bytes = 0;
      114 +uint64_t dbuf_metadata_cache_max_bytes = 0;
      115 +/* Set the default sizes of the caches to log2 fraction of arc size */
      116 +int dbuf_cache_shift = 5;
      117 +int dbuf_metadata_cache_shift = 6;
  92  118  
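
To make the selection policy described in the comment above concrete, here is a minimal user-land sketch of the two-cache scheme, not the kernel code: the cap values are invented for the example, and the enum ordering is illustrative (the real layout is dbuf_cached_state_t in dbuf.h).

#include <stdint.h>
#include <stdio.h>

/* Model of dbuf_cached_state_t; ordering here is illustrative only. */
typedef enum {
        MODEL_METADATA_CACHE,
        MODEL_DBUF_CACHE,
        MODEL_CACHE_MAX
} model_cache_id_t;

typedef struct {
        uint64_t size;          /* bytes currently cached */
        uint64_t max_bytes;     /* metadata cache stops accepting above this */
} model_cache_t;

static model_cache_t model_caches[MODEL_CACHE_MAX] = {
        { 0, 64ULL << 20 },     /* hypothetical 64 MB metadata-cache cap */
        { 0, 128ULL << 20 },    /* hypothetical 128 MB LRU dbuf-cache cap */
};

/*
 * Metadata dbufs prefer the metadata cache; once that cache is at its
 * cap they overflow into the ordinary LRU cache, as described above.
 */
static model_cache_id_t
model_pick_cache(int is_cached_metadata_type)
{
        if (is_cached_metadata_type &&
            model_caches[MODEL_METADATA_CACHE].size <
            model_caches[MODEL_METADATA_CACHE].max_bytes)
                return (MODEL_METADATA_CACHE);
        return (MODEL_DBUF_CACHE);
}

int
main(void)
{
        model_cache_id_t id = model_pick_cache(1);
        model_caches[id].size += 16384;         /* a 16 KB dbuf */
        printf("placed in %s\n", id == MODEL_METADATA_CACHE ?
            "metadata cache" : "dbuf cache");
        return (0);
}
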
  93  119  /*
  94      - * The dbuf cache uses a three-stage eviction policy:
      120 + * For diagnostic purposes, this is incremented whenever we can't add
      121 + * something to the metadata cache because it's full, and instead put
      122 + * the data in the regular dbuf cache.
      123 + */
      124 +uint64_t dbuf_metadata_cache_overflow;
      125 +
      126 +/*
      127 + * The LRU dbuf cache uses a three-stage eviction policy:
  95  128   *      - A low water marker designates when the dbuf eviction thread
  96  129   *      should stop evicting from the dbuf cache.
  97  130   *      - When we reach the maximum size (aka mid water mark), we
  98  131   *      signal the eviction thread to run.
  99  132   *      - The high water mark indicates when the eviction thread
 100  133   *      is unable to keep up with the incoming load and eviction must
 101  134   *      happen in the context of the calling thread.
 102  135   *
 103  136   * The dbuf cache:
 104  137   *                                                 (max size)
(52 lines elided)
 157  190          dmu_buf_impl_t *db = vdb;
 158  191          mutex_destroy(&db->db_mtx);
 159  192          cv_destroy(&db->db_changed);
 160  193          ASSERT(!multilist_link_active(&db->db_cache_link));
 161  194          refcount_destroy(&db->db_holds);
 162  195  }
 163  196  
 164  197  /*
 165  198   * dbuf hash table routines
 166  199   */
      200 +#pragma align 64(dbuf_hash_table)
 167  201  static dbuf_hash_table_t dbuf_hash_table;
 168  202  
 169  203  static uint64_t dbuf_hash_count;
 170  204  
 171      -/*
 172      - * We use Cityhash for this. It's fast, and has good hash properties without
 173      - * requiring any large static buffers.
 174      - */
 175  205  static uint64_t
 176  206  dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
 177  207  {
 178      -        return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
      208 +        uintptr_t osv = (uintptr_t)os;
      209 +        uint64_t crc = -1ULL;
      210 +
      211 +        ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
      212 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl)) & 0xFF];
      213 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
      214 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
      215 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
      216 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF];
      217 +        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF];
      218 +
      219 +        crc ^= (osv>>14) ^ (obj>>16) ^ (blkid>>16);
      220 +
      221 +        return (crc);
 179  222  }
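
The hash above folds the objset pointer, object number, level, and block id into a CRC64 one byte at a time. As a self-contained reference, assuming only that zfs_crc64_table is the standard reflected ECMA-182 table keyed by ZFS_CRC64_POLY, the construction can be reproduced in user land:

#include <stdint.h>
#include <stdio.h>

#define ZFS_CRC64_POLY  0xC96C5795D7870F42ULL  /* ECMA-182, reflected */

static uint64_t crc64_table[256];

static void
crc64_init(void)
{
        for (int i = 0; i < 256; i++) {
                uint64_t ct = i;
                for (int j = 8; j > 0; j--)
                        ct = (ct & 1) ? (ct >> 1) ^ ZFS_CRC64_POLY : ct >> 1;
                crc64_table[i] = ct;
        }
}

/* Fold one byte into a running CRC, exactly as dbuf_hash() does. */
static uint64_t
crc64_fold(uint64_t crc, uint8_t byte)
{
        return ((crc >> 8) ^ crc64_table[(crc ^ byte) & 0xFF]);
}

int
main(void)
{
        crc64_init();

        /* This is the invariant the ASSERT in dbuf_hash() checks. */
        printf("table[128] == poly: %d\n",
            crc64_table[128] == ZFS_CRC64_POLY);

        uint64_t crc = -1ULL;
        crc = crc64_fold(crc, 1);       /* e.g. fold in a level byte */
        printf("crc = %016llx\n", (unsigned long long)crc);
        return (0);
}
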
 180  223  
 181  224  #define DBUF_EQUAL(dbuf, os, obj, level, blkid)         \
 182  225          ((dbuf)->db.db_object == (obj) &&               \
 183  226          (dbuf)->db_objset == (os) &&                    \
 184  227          (dbuf)->db_level == (level) &&                  \
 185  228          (dbuf)->db_blkid == (blkid))
 186  229  
 187  230  dmu_buf_impl_t *
 188  231  dbuf_find(objset_t *os, uint64_t obj, uint8_t level, uint64_t blkid)
(197 lines elided)
 386  429                  boolean_t is_metadata;
 387  430  
 388  431                  DB_DNODE_ENTER(db);
 389  432                  is_metadata = DMU_OT_IS_METADATA(DB_DNODE(db)->dn_type);
 390  433                  DB_DNODE_EXIT(db);
 391  434  
 392  435                  return (is_metadata);
 393  436          }
 394  437  }
 395  438  
      439 +boolean_t
      440 +dbuf_is_ddt(dmu_buf_impl_t *db)
      441 +{
      442 +        boolean_t is_ddt;
      443 +
      444 +        DB_DNODE_ENTER(db);
      445 +        is_ddt = (DB_DNODE(db)->dn_type == DMU_OT_DDT_ZAP) ||
      446 +            (DB_DNODE(db)->dn_type == DMU_OT_DDT_STATS);
      447 +        DB_DNODE_EXIT(db);
      448 +
      449 +        return (is_ddt);
      450 +}
      451 +
 396  452  /*
      453 + * This returns whether this dbuf should be stored in the metadata cache, which
      454 + * is based on whether it's from one of the dnode types that store data related
      455 + * to traversing dataset hierarchies.
      456 + */
      457 +static boolean_t
      458 +dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
      459 +{
      460 +        DB_DNODE_ENTER(db);
      461 +        dmu_object_type_t type = DB_DNODE(db)->dn_type;
      462 +        DB_DNODE_EXIT(db);
      463 +
      464 +        /* Check if this dbuf is one of the types we care about */
      465 +        if (DMU_OT_IS_METADATA_CACHED(type)) {
      466 +                /* If we hit this, then we set something up wrong in dmu_ot */
      467 +                ASSERT(DMU_OT_IS_METADATA(type));
      468 +
      469 +                /*
      470 +                 * Sanity check for small-memory systems: don't allocate too
      471 +                 * much memory for this purpose.
      472 +                 */
      473 +                if (refcount_count(&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
      474 +                    dbuf_metadata_cache_max_bytes) {
      475 +                        dbuf_metadata_cache_overflow++;
      476 +                        DTRACE_PROBE1(dbuf__metadata__cache__overflow,
      477 +                            dmu_buf_impl_t *, db);
      478 +                        return (B_FALSE);
      479 +                }
      480 +
      481 +                return (B_TRUE);
      482 +        }
      483 +
      484 +        return (B_FALSE);
      485 +}
      486 +
      487 +/*
 397  488   * This function *must* return indices evenly distributed between all
 398  489   * sublists of the multilist. This is needed due to how the dbuf eviction
 399  490   * code is laid out; dbuf_evict_thread() assumes dbufs are evenly
 400  491   * distributed between all sublists and uses this assumption when
 401  492   * deciding which sublist to evict from and how much to evict from it.
 402  493   */
 403  494  unsigned int
 404  495  dbuf_cache_multilist_index_func(multilist_t *ml, void *obj)
 405  496  {
 406  497          dmu_buf_impl_t *db = obj;
(14 lines elided)
 421  512              db->db_level, db->db_blkid) %
 422  513              multilist_get_num_sublists(ml));
 423  514  }
 424  515  
 425  516  static inline boolean_t
 426  517  dbuf_cache_above_hiwater(void)
 427  518  {
 428  519          uint64_t dbuf_cache_hiwater_bytes =
 429  520              (dbuf_cache_max_bytes * dbuf_cache_hiwater_pct) / 100;
 430  521  
 431      -        return (refcount_count(&dbuf_cache_size) >
      522 +        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
 432  523              dbuf_cache_max_bytes + dbuf_cache_hiwater_bytes);
 433  524  }
 434  525  
 435  526  static inline boolean_t
 436  527  dbuf_cache_above_lowater(void)
 437  528  {
 438  529          uint64_t dbuf_cache_lowater_bytes =
 439  530              (dbuf_cache_max_bytes * dbuf_cache_lowater_pct) / 100;
 440  531  
 441      -        return (refcount_count(&dbuf_cache_size) >
      532 +        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
 442  533              dbuf_cache_max_bytes - dbuf_cache_lowater_bytes);
 443  534  }
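
As a worked example of the two watermark helpers above, assuming the stock 10% high- and low-water margins (dbuf_cache_hiwater_pct and dbuf_cache_lowater_pct, defined in a portion of the file elided here):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t max_bytes = 100ULL << 20;      /* 100 MB cap */
        uint64_t hiwater_pct = 10;              /* assumed default */
        uint64_t lowater_pct = 10;              /* assumed default */

        uint64_t hiwater = max_bytes + (max_bytes * hiwater_pct) / 100;
        uint64_t lowater = max_bytes - (max_bytes * lowater_pct) / 100;

        /*
         * 100 MB cap with 10% margins: the evict thread is signalled
         * above 100 MB, callers evict directly above 110 MB, and
         * eviction stops once the cache drops below 90 MB.
         */
        printf("hiwater = %llu MB, lowater = %llu MB\n",
            (unsigned long long)(hiwater >> 20),
            (unsigned long long)(lowater >> 20));
        return (0);
}
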
 444  535  
 445  536  /*
 446  537   * Evict the oldest eligible dbuf from the dbuf cache.
 447  538   */
 448  539  static void
 449  540  dbuf_evict_one(void)
 450  541  {
 451      -        int idx = multilist_get_random_index(dbuf_cache);
 452      -        multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
      542 +        int idx = multilist_get_random_index(dbuf_caches[DB_DBUF_CACHE].cache);
      543 +        multilist_sublist_t *mls = multilist_sublist_lock(
      544 +            dbuf_caches[DB_DBUF_CACHE].cache, idx);
 453  545  
 454  546          ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
 455  547  
 456  548          /*
 457  549           * Set the thread's tsd to indicate that it's processing evictions.
 458  550           * Once a thread stops evicting from the dbuf cache it will
 459  551           * reset its tsd to NULL.
 460  552           */
 461  553          ASSERT3P(tsd_get(zfs_dbuf_evict_key), ==, NULL);
 462  554          (void) tsd_set(zfs_dbuf_evict_key, (void *)B_TRUE);
(2 lines elided)
 465  557          while (db != NULL && mutex_tryenter(&db->db_mtx) == 0) {
 466  558                  db = multilist_sublist_prev(mls, db);
 467  559          }
 468  560  
 469  561          DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
 470  562              multilist_sublist_t *, mls);
 471  563  
 472  564          if (db != NULL) {
 473  565                  multilist_sublist_remove(mls, db);
 474  566                  multilist_sublist_unlock(mls);
 475      -                (void) refcount_remove_many(&dbuf_cache_size,
      567 +                (void) refcount_remove_many(&dbuf_caches[DB_DBUF_CACHE].size,
 476  568                      db->db.db_size, db);
      569 +                ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
      570 +                db->db_caching_status = DB_NO_CACHE;
 477  571                  dbuf_destroy(db);
 478  572          } else {
 479  573                  multilist_sublist_unlock(mls);
 480  574          }
 481  575          (void) tsd_set(zfs_dbuf_evict_key, NULL);
 482  576  }
 483  577  
 484  578  /*
 485  579   * The dbuf evict thread is responsible for aging out dbufs from the
  486  580   * cache. Once the cache has reached its maximum size, dbufs are removed
(32 lines elided)
 519  613          }
 520  614  
 521  615          dbuf_evict_thread_exit = B_FALSE;
 522  616          cv_broadcast(&dbuf_evict_cv);
 523  617          CALLB_CPR_EXIT(&cpr);   /* drops dbuf_evict_lock */
 524  618          thread_exit();
 525  619  }
 526  620  
 527  621  /*
 528  622   * Wake up the dbuf eviction thread if the dbuf cache is at its max size.
 529      - * If the dbuf cache is at its high water mark, then evict a dbuf from the
 530      - * dbuf cache using the callers context.
      623 + *
       624 + * Direct eviction (dbuf_evict_one()) is not called here, because
       625 + * dbuf_evict_one() has no control over which dbuf it selects, so the
       626 + * following scenario is possible and would cause a deadlock panic:
       627 + *
       628 + * Thread A is evicting dbufs that are related to dnodeA:
       629 + * dnode_evict_dbufs(dnodeA) enters dn_dbufs_mtx and after that walks
       630 + * its own AVL of dbufs and calls dbuf_destroy():
       631 + * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
       632 + *  -> select a dbuf from cache -> dbuf_destroy() ->
       633 + *   -> mutex_enter(dn_dbufs_mtx of dnodeB)
       634 + *
       635 + * Thread B is evicting dbufs that are related to dnodeB:
       636 + * dnode_evict_dbufs(dnodeB) enters dn_dbufs_mtx and after that walks
       637 + * its own AVL of dbufs and calls dbuf_destroy():
       638 + * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
       639 + *  -> select a dbuf from cache -> dbuf_destroy() ->
       640 + *   -> mutex_enter(dn_dbufs_mtx of dnodeA)
 531  641   */
 532  642  static void
 533  643  dbuf_evict_notify(void)
 534  644  {
 535  645  
 536  646          /*
 537  647           * We use thread specific data to track when a thread has
 538  648           * started processing evictions. This allows us to avoid deeply
 539  649           * nested stacks that would have a call flow similar to this:
 540  650           *
(12 lines elided)
 553  663           * from the dbuf cache.
 554  664           */
 555  665          if (tsd_get(zfs_dbuf_evict_key) != NULL)
 556  666                  return;
 557  667  
 558  668          /*
 559  669           * We check if we should evict without holding the dbuf_evict_lock,
 560  670           * because it's OK to occasionally make the wrong decision here,
 561  671           * and grabbing the lock results in massive lock contention.
 562  672           */
 563      -        if (refcount_count(&dbuf_cache_size) > dbuf_cache_max_bytes) {
      673 +        if (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
      674 +            dbuf_cache_max_bytes) {
 564  675                  if (dbuf_cache_above_hiwater())
 565  676                          dbuf_evict_one();
 566  677                  cv_signal(&dbuf_evict_cv);
 567  678          }
 568  679  }
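
The thread-specific-data guard described in the comment above can be modelled in user land, with POSIX TSD standing in for the kernel's tsd_get()/tsd_set(); a minimal sketch (all names invented for the example):

#include <pthread.h>
#include <stdio.h>

static pthread_key_t evict_key;

static void evict_notify(void);

static void
evict_one(void)
{
        /* Mark this thread as "currently evicting"... */
        (void) pthread_setspecific(evict_key, (void *)1);
        /* ...so a nested notify call returns instead of recursing. */
        evict_notify();
        (void) pthread_setspecific(evict_key, NULL);
}

static void
evict_notify(void)
{
        /* Bail out if this thread is already mid-eviction. */
        if (pthread_getspecific(evict_key) != NULL) {
                printf("recursion avoided\n");
                return;
        }
        evict_one();
}

int
main(void)
{
        (void) pthread_key_create(&evict_key, NULL);
        evict_notify();
        return (0);
}
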
 569  680  
 570  681  void
 571  682  dbuf_init(void)
 572  683  {
 573  684          uint64_t hsize = 1ULL << 16;
(16 lines elided)
 590  701                  ASSERT(hsize > (1ULL << 10));
 591  702                  hsize >>= 1;
 592  703                  goto retry;
 593  704          }
 594  705  
 595  706          dbuf_kmem_cache = kmem_cache_create("dmu_buf_impl_t",
 596  707              sizeof (dmu_buf_impl_t),
 597  708              0, dbuf_cons, dbuf_dest, NULL, NULL, NULL, 0);
 598  709  
 599  710          for (i = 0; i < DBUF_MUTEXES; i++)
 600      -                mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);
      711 +                mutex_init(DBUF_HASH_MUTEX(h, i), NULL, MUTEX_DEFAULT, NULL);
 601  712  
      713 +
 602  714          /*
 603      -         * Setup the parameters for the dbuf cache. We cap the size of the
 604      -         * dbuf cache to 1/32nd (default) of the size of the ARC.
       715 +         * Set up the parameters for the dbuf caches. We set the sizes of the
       716 +         * dbuf cache and the metadata cache to 1/32nd and 1/64th (default)
       717 +         * of the size of the ARC, respectively.
 605  718           */
 606      -        dbuf_cache_max_bytes = MIN(dbuf_cache_max_bytes,
 607      -            arc_max_bytes() >> dbuf_cache_max_shift);
      719 +        if (dbuf_cache_max_bytes == 0 ||
      720 +            dbuf_cache_max_bytes >= arc_max_bytes())  {
      721 +                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
      722 +        }
      723 +        if (dbuf_metadata_cache_max_bytes == 0 ||
      724 +            dbuf_metadata_cache_max_bytes >= arc_max_bytes()) {
      725 +                dbuf_metadata_cache_max_bytes =
      726 +                    arc_max_bytes() >> dbuf_metadata_cache_shift;
      727 +        }
 608  728  
 609  729          /*
       730 +         * The combined size of both caches should be less than
       731 +         * the size of the ARC; otherwise, reset both caches to
       732 +         * their default values.
       733 +         *
       734 +         * Dividing by 2 is simple protection against overflow.
      735 +         */
      736 +        if (((dbuf_cache_max_bytes / 2) +
      737 +            (dbuf_metadata_cache_max_bytes / 2)) >= (arc_max_bytes() / 2)) {
      738 +                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
      739 +                dbuf_metadata_cache_max_bytes =
      740 +                    arc_max_bytes() >> dbuf_metadata_cache_shift;
      741 +        }
      742 +
      743 +
      744 +        /*
 610  745           * All entries are queued via taskq_dispatch_ent(), so min/maxalloc
 611  746           * configuration is not required.
 612  747           */
 613  748          dbu_evict_taskq = taskq_create("dbu_evict", 1, minclsyspri, 0, 0, 0);
 614  749  
 615      -        dbuf_cache = multilist_create(sizeof (dmu_buf_impl_t),
 616      -            offsetof(dmu_buf_impl_t, db_cache_link),
 617      -            dbuf_cache_multilist_index_func);
 618      -        refcount_create(&dbuf_cache_size);
      750 +        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
      751 +                dbuf_caches[dcs].cache =
      752 +                    multilist_create(sizeof (dmu_buf_impl_t),
      753 +                    offsetof(dmu_buf_impl_t, db_cache_link),
      754 +                    dbuf_cache_multilist_index_func);
      755 +                refcount_create(&dbuf_caches[dcs].size);
      756 +        }
 619  757  
 620  758          tsd_create(&zfs_dbuf_evict_key, NULL);
 621  759          dbuf_evict_thread_exit = B_FALSE;
 622  760          mutex_init(&dbuf_evict_lock, NULL, MUTEX_DEFAULT, NULL);
 623  761          cv_init(&dbuf_evict_cv, NULL, CV_DEFAULT, NULL);
 624  762          dbuf_cache_evict_thread = thread_create(NULL, 0, dbuf_evict_thread,
 625  763              NULL, 0, &p0, TS_RUN, minclsyspri);
 626  764  }
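
A quick sanity check of the shift arithmetic in dbuf_init() above, assuming a hypothetical 4 GB ARC:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t arc_max = 4ULL << 30;          /* assume a 4 GB ARC */
        int dbuf_cache_shift = 5;               /* 1/32nd of the ARC */
        int dbuf_metadata_cache_shift = 6;      /* 1/64th of the ARC */

        /* 4 GB >> 5 = 128 MB; 4 GB >> 6 = 64 MB. */
        printf("dbuf cache: %llu MB, metadata cache: %llu MB\n",
            (unsigned long long)((arc_max >> dbuf_cache_shift) >> 20),
            (unsigned long long)((arc_max >> dbuf_metadata_cache_shift) >> 20));
        return (0);
}
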
 627  765  
 628  766  void
 629  767  dbuf_fini(void)
 630  768  {
 631  769          dbuf_hash_table_t *h = &dbuf_hash_table;
 632  770          int i;
 633  771  
 634  772          for (i = 0; i < DBUF_MUTEXES; i++)
 635      -                mutex_destroy(&h->hash_mutexes[i]);
      773 +                mutex_destroy(DBUF_HASH_MUTEX(h, i));
 636  774          kmem_free(h->hash_table, (h->hash_table_mask + 1) * sizeof (void *));
 637  775          kmem_cache_destroy(dbuf_kmem_cache);
 638  776          taskq_destroy(dbu_evict_taskq);
 639  777  
 640  778          mutex_enter(&dbuf_evict_lock);
 641  779          dbuf_evict_thread_exit = B_TRUE;
 642  780          while (dbuf_evict_thread_exit) {
 643  781                  cv_signal(&dbuf_evict_cv);
 644  782                  cv_wait(&dbuf_evict_cv, &dbuf_evict_lock);
 645  783          }
 646  784          mutex_exit(&dbuf_evict_lock);
 647  785          tsd_destroy(&zfs_dbuf_evict_key);
 648  786  
 649  787          mutex_destroy(&dbuf_evict_lock);
 650  788          cv_destroy(&dbuf_evict_cv);
 651  789  
 652      -        refcount_destroy(&dbuf_cache_size);
 653      -        multilist_destroy(dbuf_cache);
      790 +        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
      791 +                refcount_destroy(&dbuf_caches[dcs].size);
      792 +                multilist_destroy(dbuf_caches[dcs].cache);
      793 +        }
 654  794  }
 655  795  
 656  796  /*
 657  797   * Other stuff.
 658  798   */
 659  799  
 660  800  #ifdef ZFS_DEBUG
 661  801  static void
 662  802  dbuf_verify(dmu_buf_impl_t *db)
 663  803  {
(743 lines elided)
1407 1547          ASSERT(db->db_parent == NULL || arc_released(db->db_parent->db_buf));
1408 1548  
1409 1549          (void) arc_release(db->db_buf, db);
1410 1550  }
1411 1551  
1412 1552  /*
1413 1553   * We already have a dirty record for this TXG, and we are being
1414 1554   * dirtied again.
1415 1555   */
1416 1556  static void
1417      -dbuf_redirty(dbuf_dirty_record_t *dr)
     1557 +dbuf_redirty(dbuf_dirty_record_t *dr, boolean_t usesc)
1418 1558  {
1419 1559          dmu_buf_impl_t *db = dr->dr_dbuf;
1420 1560  
1421 1561          ASSERT(MUTEX_HELD(&db->db_mtx));
1422 1562  
1423 1563          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID) {
1424 1564                  /*
1425 1565                   * If this buffer has already been written out,
1426 1566                   * we now need to reset its state.
1427 1567                   */
1428 1568                  dbuf_unoverride(dr);
1429 1569                  if (db->db.db_object != DMU_META_DNODE_OBJECT &&
1430 1570                      db->db_state != DB_NOFILL) {
1431 1571                          /* Already released on initial dirty, so just thaw. */
1432 1572                          ASSERT(arc_released(db->db_buf));
1433 1573                          arc_buf_thaw(db->db_buf);
1434 1574                  }
1435 1575          }
     1576 +        /*
      1577 +         * The special class usage of a dirty dbuf can change;
      1578 +         * update the dirty entry accordingly.
     1579 +         */
     1580 +        dr->dr_usesc = usesc;
1436 1581  }
1437 1582  
1438 1583  dbuf_dirty_record_t *
1439      -dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
     1584 +dbuf_dirty_sc(dmu_buf_impl_t *db, dmu_tx_t *tx, boolean_t usesc)
1440 1585  {
1441 1586          dnode_t *dn;
1442 1587          objset_t *os;
1443 1588          dbuf_dirty_record_t **drp, *dr;
1444 1589          int drop_struct_lock = FALSE;
1445 1590          int txgoff = tx->tx_txg & TXG_MASK;
1446 1591  
1447 1592          ASSERT(tx->tx_txg != 0);
1448 1593          ASSERT(!refcount_is_zero(&db->db_holds));
1449 1594          DMU_TX_DIRTY_BUF(tx, db);
(66 lines elided)
1516 1661           * If this buffer is already dirty, we're done.
1517 1662           */
1518 1663          drp = &db->db_last_dirty;
1519 1664          ASSERT(*drp == NULL || (*drp)->dr_txg <= tx->tx_txg ||
1520 1665              db->db.db_object == DMU_META_DNODE_OBJECT);
1521 1666          while ((dr = *drp) != NULL && dr->dr_txg > tx->tx_txg)
1522 1667                  drp = &dr->dr_next;
1523 1668          if (dr && dr->dr_txg == tx->tx_txg) {
1524 1669                  DB_DNODE_EXIT(db);
1525 1670  
1526      -                dbuf_redirty(dr);
     1671 +                dbuf_redirty(dr, usesc);
1527 1672                  mutex_exit(&db->db_mtx);
1528 1673                  return (dr);
1529 1674          }
1530 1675  
1531 1676          /*
1532 1677           * Only valid if not already dirty.
1533 1678           */
1534 1679          ASSERT(dn->dn_object == 0 ||
1535 1680              dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
1536 1681              (dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
(59 lines elided)
1596 1741                  mutex_init(&dr->dt.di.dr_mtx, NULL, MUTEX_DEFAULT, NULL);
1597 1742                  list_create(&dr->dt.di.dr_children,
1598 1743                      sizeof (dbuf_dirty_record_t),
1599 1744                      offsetof(dbuf_dirty_record_t, dr_dirty_node));
1600 1745          }
1601 1746          if (db->db_blkid != DMU_BONUS_BLKID && os->os_dsl_dataset != NULL)
1602 1747                  dr->dr_accounted = db->db.db_size;
1603 1748          dr->dr_dbuf = db;
1604 1749          dr->dr_txg = tx->tx_txg;
1605 1750          dr->dr_next = *drp;
     1751 +        dr->dr_usesc = usesc;
1606 1752          *drp = dr;
1607 1753  
1608 1754          /*
1609 1755           * We could have been freed_in_flight between the dbuf_noread
1610 1756           * and dbuf_dirty.  We win, as though the dbuf_noread() had
1611 1757           * happened after the free.
1612 1758           */
1613 1759          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
1614 1760              db->db_blkid != DMU_SPILL_BLKID) {
1615 1761                  mutex_enter(&dn->dn_mtx);
(13 lines elided)
1629 1775          ASSERT3U(db->db_dirtycnt, <=, 3);
1630 1776  
1631 1777          mutex_exit(&db->db_mtx);
1632 1778  
1633 1779          if (db->db_blkid == DMU_BONUS_BLKID ||
1634 1780              db->db_blkid == DMU_SPILL_BLKID) {
1635 1781                  mutex_enter(&dn->dn_mtx);
1636 1782                  ASSERT(!list_link_active(&dr->dr_dirty_node));
1637 1783                  list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
1638 1784                  mutex_exit(&dn->dn_mtx);
1639      -                dnode_setdirty(dn, tx);
     1785 +                dnode_setdirty_sc(dn, tx, usesc);
1640 1786                  DB_DNODE_EXIT(db);
1641 1787                  return (dr);
1642 1788          }
1643 1789  
1644 1790          /*
1645 1791           * The dn_struct_rwlock prevents db_blkptr from changing
1646 1792           * due to a write from syncing context completing
1647 1793           * while we are running, so we want to acquire it before
1648 1794           * looking at db_blkptr.
1649 1795           */
(14 lines elided)
1664 1810  
1665 1811          /*
1666 1812           * If we are overwriting a dedup BP, then unless it is snapshotted,
1667 1813           * when we get to syncing context we will need to decrement its
1668 1814           * refcount in the DDT.  Prefetch the relevant DDT block so that
1669 1815           * syncing context won't have to wait for the i/o.
1670 1816           */
1671 1817          ddt_prefetch(os->os_spa, db->db_blkptr);
1672 1818  
1673 1819          if (db->db_level == 0) {
1674      -                dnode_new_blkid(dn, db->db_blkid, tx, drop_struct_lock);
     1820 +                dnode_new_blkid(dn, db->db_blkid, tx, usesc, drop_struct_lock);
1675 1821                  ASSERT(dn->dn_maxblkid >= db->db_blkid);
1676 1822          }
1677 1823  
1678 1824          if (db->db_level+1 < dn->dn_nlevels) {
1679 1825                  dmu_buf_impl_t *parent = db->db_parent;
1680 1826                  dbuf_dirty_record_t *di;
1681 1827                  int parent_held = FALSE;
1682 1828  
1683 1829                  if (db->db_parent == NULL || db->db_parent == dn->dn_dbuf) {
1684 1830                          int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
1685 1831  
1686 1832                          parent = dbuf_hold_level(dn, db->db_level+1,
1687 1833                              db->db_blkid >> epbs, FTAG);
1688 1834                          ASSERT(parent != NULL);
1689 1835                          parent_held = TRUE;
1690 1836                  }
1691 1837                  if (drop_struct_lock)
1692 1838                          rw_exit(&dn->dn_struct_rwlock);
1693 1839                  ASSERT3U(db->db_level+1, ==, parent->db_level);
1694      -                di = dbuf_dirty(parent, tx);
     1840 +                di = dbuf_dirty_sc(parent, tx, usesc);
1695 1841                  if (parent_held)
1696 1842                          dbuf_rele(parent, FTAG);
1697 1843  
1698 1844                  mutex_enter(&db->db_mtx);
1699 1845                  /*
1700 1846                   * Since we've dropped the mutex, it's possible that
1701 1847                   * dbuf_undirty() might have changed this out from under us.
1702 1848                   */
1703 1849                  if (db->db_last_dirty == dr ||
1704 1850                      dn->dn_object == DMU_META_DNODE_OBJECT) {
1705 1851                          mutex_enter(&di->dt.di.dr_mtx);
1706 1852                          ASSERT3U(di->dr_txg, ==, tx->tx_txg);
1707 1853                          ASSERT(!list_link_active(&dr->dr_dirty_node));
1708 1854                          list_insert_tail(&di->dt.di.dr_children, dr);
1709 1855                          mutex_exit(&di->dt.di.dr_mtx);
1710 1856                          dr->dr_parent = di;
1711 1857                  }
     1858 +
     1859 +                /*
      1860 +                 * The special class usage of a dirty dbuf can change;
      1861 +                 * update the dirty entry accordingly.
     1862 +                 */
     1863 +                dr->dr_usesc = usesc;
1712 1864                  mutex_exit(&db->db_mtx);
1713 1865          } else {
1714 1866                  ASSERT(db->db_level+1 == dn->dn_nlevels);
1715 1867                  ASSERT(db->db_blkid < dn->dn_nblkptr);
1716 1868                  ASSERT(db->db_parent == NULL || db->db_parent == dn->dn_dbuf);
1717 1869                  mutex_enter(&dn->dn_mtx);
1718 1870                  ASSERT(!list_link_active(&dr->dr_dirty_node));
1719 1871                  list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
1720 1872                  mutex_exit(&dn->dn_mtx);
1721 1873                  if (drop_struct_lock)
1722 1874                          rw_exit(&dn->dn_struct_rwlock);
1723 1875          }
1724 1876  
1725      -        dnode_setdirty(dn, tx);
     1877 +        dnode_setdirty_sc(dn, tx, usesc);
1726 1878          DB_DNODE_EXIT(db);
1727 1879          return (dr);
1728 1880  }
1729 1881  
     1882 +dbuf_dirty_record_t *
     1883 +dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
     1884 +{
     1885 +        spa_t *spa;
     1886 +
     1887 +        ASSERT(db->db_objset != NULL);
     1888 +        spa = db->db_objset->os_spa;
     1889 +
     1890 +        return (dbuf_dirty_sc(db, tx, spa->spa_usesc));
     1891 +}
     1892 +
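
The dbuf_dirty()/dbuf_dirty_sc() split above keeps every existing caller unchanged while threading the new special-class flag through from the pool. A standalone model of that wrapper pattern (all names below are invented for the sketch; spa_usesc is the only field taken from the patch):

#include <stdbool.h>
#include <stdio.h>

struct pool { bool usesc; };            /* stands in for spa_t::spa_usesc */
struct buf  { struct pool *pool; };     /* stands in for dmu_buf_impl_t */

static void
buf_dirty_sc(struct buf *b, bool usesc)
{
        (void) b;
        printf("dirtied, use special class: %s\n", usesc ? "yes" : "no");
}

static void
buf_dirty(struct buf *b)
{
        /* Legacy entry point: inherit the pool-wide setting. */
        buf_dirty_sc(b, b->pool->usesc);
}

int
main(void)
{
        struct pool p = { .usesc = true };
        struct buf b = { .pool = &p };

        buf_dirty(&b);                  /* old callers, unchanged */
        buf_dirty_sc(&b, false);        /* new callers override explicitly */
        return (0);
}
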
1730 1893  /*
1731 1894   * Undirty a buffer in the transaction group referenced by the given
1732 1895   * transaction.  Return whether this evicted the dbuf.
1733 1896   */
1734 1897  static boolean_t
1735 1898  dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
1736 1899  {
1737 1900          dnode_t *dn;
1738 1901          uint64_t txg = tx->tx_txg;
1739 1902          dbuf_dirty_record_t *dr, **drp;
↓ open down ↓ 75 lines elided ↑ open up ↑
1815 1978                  return (B_TRUE);
1816 1979          }
1817 1980  
1818 1981          return (B_FALSE);
1819 1982  }
1820 1983  
1821 1984  void
1822 1985  dmu_buf_will_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
1823 1986  {
1824 1987          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
     1988 +        spa_t *spa = db->db_objset->os_spa;
     1989 +        dmu_buf_will_dirty_sc(db_fake, tx, spa->spa_usesc);
     1990 +}
     1991 +
     1992 +void
     1993 +dmu_buf_will_dirty_sc(dmu_buf_t *db_fake, dmu_tx_t *tx, boolean_t usesc)
     1994 +{
     1995 +        dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
1825 1996          int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;
1826 1997  
1827 1998          ASSERT(tx->tx_txg != 0);
1828 1999          ASSERT(!refcount_is_zero(&db->db_holds));
1829 2000  
1830 2001          /*
 1831 2002           * Quick check for dirtiness.  For already dirty blocks, this
1832 2003           * reduces runtime of this function by >90%, and overall performance
1833 2004           * by 50% for some workloads (e.g. file deletion with indirect blocks
1834 2005           * cached).
(2 lines elided)
1837 2008          dbuf_dirty_record_t *dr;
1838 2009          for (dr = db->db_last_dirty;
1839 2010              dr != NULL && dr->dr_txg >= tx->tx_txg; dr = dr->dr_next) {
1840 2011                  /*
1841 2012                   * It's possible that it is already dirty but not cached,
1842 2013                   * because there are some calls to dbuf_dirty() that don't
1843 2014                   * go through dmu_buf_will_dirty().
1844 2015                   */
1845 2016                  if (dr->dr_txg == tx->tx_txg && db->db_state == DB_CACHED) {
1846 2017                          /* This dbuf is already dirty and cached. */
1847      -                        dbuf_redirty(dr);
     2018 +                        dbuf_redirty(dr, usesc);
1848 2019                          mutex_exit(&db->db_mtx);
1849 2020                          return;
1850 2021                  }
1851 2022          }
1852 2023          mutex_exit(&db->db_mtx);
1853 2024  
1854 2025          DB_DNODE_ENTER(db);
1855 2026          if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
1856 2027                  rf |= DB_RF_HAVESTRUCT;
1857 2028          DB_DNODE_EXIT(db);
1858 2029          (void) dbuf_read(db, NULL, rf);
1859      -        (void) dbuf_dirty(db, tx);
     2030 +        (void) dbuf_dirty_sc(db, tx, usesc);
1860 2031  }
1861 2032  
     2033 +
1862 2034  void
1863 2035  dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
1864 2036  {
1865 2037          dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
1866 2038  
1867 2039          db->db_state = DB_NOFILL;
1868 2040  
1869 2041          dmu_buf_will_fill(db_fake, tx);
1870 2042  }
1871 2043  
(154 lines elided)
2026 2198          if (db->db_blkid == DMU_BONUS_BLKID) {
2027 2199                  ASSERT(db->db.db_data != NULL);
2028 2200                  zio_buf_free(db->db.db_data, DN_MAX_BONUSLEN);
2029 2201                  arc_space_return(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
2030 2202                  db->db_state = DB_UNCACHED;
2031 2203          }
2032 2204  
2033 2205          dbuf_clear_data(db);
2034 2206  
2035 2207          if (multilist_link_active(&db->db_cache_link)) {
2036      -                multilist_remove(dbuf_cache, db);
2037      -                (void) refcount_remove_many(&dbuf_cache_size,
     2208 +                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
     2209 +                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
     2210 +
     2211 +                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
     2212 +                (void) refcount_remove_many(
     2213 +                    &dbuf_caches[db->db_caching_status].size,
2038 2214                      db->db.db_size, db);
     2215 +
     2216 +                db->db_caching_status = DB_NO_CACHE;
2039 2217          }
2040 2218  
2041 2219          ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
2042 2220          ASSERT(db->db_data_pending == NULL);
2043 2221  
2044 2222          db->db_state = DB_EVICTING;
2045 2223          db->db_blkptr = NULL;
2046 2224  
2047 2225          /*
2048 2226           * Now that db_state is DB_EVICTING, nobody else can find this via
(33 lines elided)
2082 2260  
2083 2261          ASSERT(refcount_is_zero(&db->db_holds));
2084 2262  
2085 2263          db->db_parent = NULL;
2086 2264  
2087 2265          ASSERT(db->db_buf == NULL);
2088 2266          ASSERT(db->db.db_data == NULL);
2089 2267          ASSERT(db->db_hash_next == NULL);
2090 2268          ASSERT(db->db_blkptr == NULL);
2091 2269          ASSERT(db->db_data_pending == NULL);
     2270 +        ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
2092 2271          ASSERT(!multilist_link_active(&db->db_cache_link));
2093 2272  
2094 2273          kmem_cache_free(dbuf_kmem_cache, db);
2095 2274          arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2096 2275  
2097 2276          /*
2098 2277           * If this dbuf is referenced from an indirect dbuf,
2099 2278           * decrement the ref count on the indirect dbuf.
2100 2279           */
2101 2280          if (parent && parent != dndb)
(118 lines elided)
2220 2399          db->db_freed_in_flight = FALSE;
2221 2400          db->db_pending_evict = FALSE;
2222 2401  
2223 2402          if (blkid == DMU_BONUS_BLKID) {
2224 2403                  ASSERT3P(parent, ==, dn->dn_dbuf);
2225 2404                  db->db.db_size = DN_MAX_BONUSLEN -
2226 2405                      (dn->dn_nblkptr-1) * sizeof (blkptr_t);
2227 2406                  ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
2228 2407                  db->db.db_offset = DMU_BONUS_BLKID;
2229 2408                  db->db_state = DB_UNCACHED;
     2409 +                db->db_caching_status = DB_NO_CACHE;
2230 2410                  /* the bonus dbuf is not placed in the hash table */
2231 2411                  arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2232 2412                  return (db);
2233 2413          } else if (blkid == DMU_SPILL_BLKID) {
2234 2414                  db->db.db_size = (blkptr != NULL) ?
2235 2415                      BP_GET_LSIZE(blkptr) : SPA_MINBLOCKSIZE;
2236 2416                  db->db.db_offset = 0;
2237 2417          } else {
2238 2418                  int blocksize =
2239 2419                      db->db_level ? 1 << dn->dn_indblkshift : dn->dn_datablksz;
↓ open down ↓ 12 lines elided ↑ open up ↑
2252 2432          db->db_state = DB_EVICTING;
2253 2433          if ((odb = dbuf_hash_insert(db)) != NULL) {
2254 2434                  /* someone else inserted it first */
2255 2435                  kmem_cache_free(dbuf_kmem_cache, db);
2256 2436                  mutex_exit(&dn->dn_dbufs_mtx);
2257 2437                  return (odb);
2258 2438          }
2259 2439          avl_add(&dn->dn_dbufs, db);
2260 2440  
2261 2441          db->db_state = DB_UNCACHED;
     2442 +        db->db_caching_status = DB_NO_CACHE;
2262 2443          mutex_exit(&dn->dn_dbufs_mtx);
2263 2444          arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
2264 2445  
2265 2446          if (parent && parent != dn->dn_dbuf)
2266 2447                  dbuf_add_ref(parent, db);
2267 2448  
2268 2449          ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
2269 2450              refcount_count(&dn->dn_holds) > 0);
2270 2451          (void) refcount_add(&dn->dn_holds, db);
2271 2452          atomic_inc_32(&dn->dn_dbufs_count);
(286 lines elided)
2558 2739                  if (err && err != ENOENT)
2559 2740                          return (err);
2560 2741                  db = dbuf_create(dn, level, blkid, parent, bp);
2561 2742          }
2562 2743  
2563 2744          if (fail_uncached && db->db_state != DB_CACHED) {
2564 2745                  mutex_exit(&db->db_mtx);
2565 2746                  return (SET_ERROR(ENOENT));
2566 2747          }
2567 2748  
2568      -        if (db->db_buf != NULL)
     2749 +        if (db->db_buf != NULL) {
     2750 +                arc_buf_access(db->db_buf);
2569 2751                  ASSERT3P(db->db.db_data, ==, db->db_buf->b_data);
     2752 +        }
2570 2753  
2571 2754          ASSERT(db->db_buf == NULL || arc_referenced(db->db_buf));
2572 2755  
2573 2756          /*
 2574 2757           * If this buffer is currently syncing out, and we are
2575 2758           * still referencing it from db_data, we need to make a copy
2576 2759           * of it in case we decide we want to dirty it again in this txg.
2577 2760           */
2578 2761          if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
2579 2762              dn->dn_object != DMU_META_DNODE_OBJECT &&
(6 lines elided)
2586 2769                          dbuf_set_data(db,
2587 2770                              arc_alloc_buf(dn->dn_objset->os_spa, db, type,
2588 2771                              db->db.db_size));
2589 2772                          bcopy(dr->dt.dl.dr_data->b_data, db->db.db_data,
2590 2773                              db->db.db_size);
2591 2774                  }
2592 2775          }
2593 2776  
2594 2777          if (multilist_link_active(&db->db_cache_link)) {
2595 2778                  ASSERT(refcount_is_zero(&db->db_holds));
2596      -                multilist_remove(dbuf_cache, db);
2597      -                (void) refcount_remove_many(&dbuf_cache_size,
     2779 +                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
     2780 +                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
     2781 +
     2782 +                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
     2783 +                (void) refcount_remove_many(
     2784 +                    &dbuf_caches[db->db_caching_status].size,
2598 2785                      db->db.db_size, db);
     2786 +
     2787 +                db->db_caching_status = DB_NO_CACHE;
2599 2788          }
2600 2789          (void) refcount_add(&db->db_holds, tag);
2601 2790          DBUF_VERIFY(db);
2602 2791          mutex_exit(&db->db_mtx);
2603 2792  
2604 2793          /* NOTE: we can't rele the parent until after we drop the db_mtx */
2605 2794          if (parent)
2606 2795                  dbuf_rele(parent, NULL);
2607 2796  
2608 2797          ASSERT3P(DB_DNODE(db), ==, dn);
(196 lines elided)
2805 2994                              !BP_IS_HOLE(db->db_blkptr) &&
2806 2995                              !BP_IS_EMBEDDED(db->db_blkptr)) {
2807 2996                                  do_arc_evict = B_TRUE;
2808 2997                                  bp = *db->db_blkptr;
2809 2998                          }
2810 2999  
2811 3000                          if (!DBUF_IS_CACHEABLE(db) ||
2812 3001                              db->db_pending_evict) {
2813 3002                                  dbuf_destroy(db);
2814 3003                          } else if (!multilist_link_active(&db->db_cache_link)) {
2815      -                                multilist_insert(dbuf_cache, db);
2816      -                                (void) refcount_add_many(&dbuf_cache_size,
     3004 +                                ASSERT3U(db->db_caching_status, ==,
     3005 +                                    DB_NO_CACHE);
     3006 +
     3007 +                                dbuf_cached_state_t dcs =
     3008 +                                    dbuf_include_in_metadata_cache(db) ?
     3009 +                                    DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
     3010 +                                db->db_caching_status = dcs;
     3011 +
     3012 +                                multilist_insert(dbuf_caches[dcs].cache, db);
     3013 +                                (void) refcount_add_many(&dbuf_caches[dcs].size,
2817 3014                                      db->db.db_size, db);
2818 3015                                  mutex_exit(&db->db_mtx);
2819 3016  
2820      -                                dbuf_evict_notify();
     3017 +                                if (db->db_caching_status == DB_DBUF_CACHE) {
     3018 +                                        dbuf_evict_notify();
     3019 +                                }
2821 3020                          }
2822 3021  
2823 3022                          if (do_arc_evict)
2824 3023                                  arc_freed(spa, &bp);
2825 3024                  }
2826 3025          } else {
2827 3026                  mutex_exit(&db->db_mtx);
2828 3027          }
2829 3028  
2830 3029  }
(162 lines elided)
2993 3192          dn = DB_DNODE(db);
2994 3193          /* Indirect block size must match what the dnode thinks it is. */
2995 3194          ASSERT3U(db->db.db_size, ==, 1<<dn->dn_phys->dn_indblkshift);
2996 3195          dbuf_check_blkptr(dn, db);
2997 3196          DB_DNODE_EXIT(db);
2998 3197  
2999 3198          /* Provide the pending dirty record to child dbufs */
3000 3199          db->db_data_pending = dr;
3001 3200  
3002 3201          mutex_exit(&db->db_mtx);
3003      -
3004 3202          dbuf_write(dr, db->db_buf, tx);
3005 3203  
3006 3204          zio = dr->dr_zio;
3007 3205          mutex_enter(&dr->dt.di.dr_mtx);
3008 3206          dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);
3009 3207          ASSERT(list_head(&dr->dt.di.dr_children) == NULL);
3010 3208          mutex_exit(&dr->dt.di.dr_mtx);
3011 3209          zio_nowait(zio);
3012 3210  }
3013 3211  
(451 lines elided)
3465 3663                          dsl_free(spa_get_dsl(zio->io_spa), zio->io_txg, obp);
3466 3664                  arc_release(dr->dt.dl.dr_data, db);
3467 3665          }
3468 3666          mutex_exit(&db->db_mtx);
3469 3667          dbuf_write_done(zio, NULL, db);
3470 3668  
3471 3669          if (zio->io_abd != NULL)
3472 3670                  abd_put(zio->io_abd);
3473 3671  }
3474 3672  
3475      -typedef struct dbuf_remap_impl_callback_arg {
3476      -        objset_t        *drica_os;
3477      -        uint64_t        drica_blk_birth;
3478      -        dmu_tx_t        *drica_tx;
3479      -} dbuf_remap_impl_callback_arg_t;
3480      -
3481      -static void
3482      -dbuf_remap_impl_callback(uint64_t vdev, uint64_t offset, uint64_t size,
3483      -    void *arg)
3484      -{
3485      -        dbuf_remap_impl_callback_arg_t *drica = arg;
3486      -        objset_t *os = drica->drica_os;
3487      -        spa_t *spa = dmu_objset_spa(os);
3488      -        dmu_tx_t *tx = drica->drica_tx;
3489      -
3490      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3491      -
3492      -        if (os == spa_meta_objset(spa)) {
3493      -                spa_vdev_indirect_mark_obsolete(spa, vdev, offset, size, tx);
3494      -        } else {
3495      -                dsl_dataset_block_remapped(dmu_objset_ds(os), vdev, offset,
3496      -                    size, drica->drica_blk_birth, tx);
3497      -        }
3498      -}
3499      -
3500      -static void
3501      -dbuf_remap_impl(dnode_t *dn, blkptr_t *bp, dmu_tx_t *tx)
3502      -{
3503      -        blkptr_t bp_copy = *bp;
3504      -        spa_t *spa = dmu_objset_spa(dn->dn_objset);
3505      -        dbuf_remap_impl_callback_arg_t drica;
3506      -
3507      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3508      -
3509      -        drica.drica_os = dn->dn_objset;
3510      -        drica.drica_blk_birth = bp->blk_birth;
3511      -        drica.drica_tx = tx;
3512      -        if (spa_remap_blkptr(spa, &bp_copy, dbuf_remap_impl_callback,
3513      -            &drica)) {
3514      -                /*
3515      -                 * The struct_rwlock prevents dbuf_read_impl() from
3516      -                 * dereferencing the BP while we are changing it.  To
3517      -                 * avoid lock contention, only grab it when we are actually
3518      -                 * changing the BP.
3519      -                 */
3520      -                rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
3521      -                *bp = bp_copy;
3522      -                rw_exit(&dn->dn_struct_rwlock);
3523      -        }
3524      -}
3525      -
3526      -/*
3527      - * Returns true if a dbuf_remap would modify the dbuf. We do this by attempting
3528      - * to remap a copy of every bp in the dbuf.
3529      - */
3530      -boolean_t
3531      -dbuf_can_remap(const dmu_buf_impl_t *db)
3532      -{
3533      -        spa_t *spa = dmu_objset_spa(db->db_objset);
3534      -        blkptr_t *bp = db->db.db_data;
3535      -        boolean_t ret = B_FALSE;
3536      -
3537      -        ASSERT3U(db->db_level, >, 0);
3538      -        ASSERT3S(db->db_state, ==, DB_CACHED);
3539      -
3540      -        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
3541      -
3542      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3543      -        for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
3544      -                blkptr_t bp_copy = bp[i];
3545      -                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
3546      -                        ret = B_TRUE;
3547      -                        break;
3548      -                }
3549      -        }
3550      -        spa_config_exit(spa, SCL_VDEV, FTAG);
3551      -
3552      -        return (ret);
3553      -}
3554      -
3555      -boolean_t
3556      -dnode_needs_remap(const dnode_t *dn)
3557      -{
3558      -        spa_t *spa = dmu_objset_spa(dn->dn_objset);
3559      -        boolean_t ret = B_FALSE;
3560      -
3561      -        if (dn->dn_phys->dn_nlevels == 0) {
3562      -                return (B_FALSE);
3563      -        }
3564      -
3565      -        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
3566      -
3567      -        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
3568      -        for (int j = 0; j < dn->dn_phys->dn_nblkptr; j++) {
3569      -                blkptr_t bp_copy = dn->dn_phys->dn_blkptr[j];
3570      -                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
3571      -                        ret = B_TRUE;
3572      -                        break;
3573      -                }
3574      -        }
3575      -        spa_config_exit(spa, SCL_VDEV, FTAG);
3576      -
3577      -        return (ret);
3578      -}
3579      -
3580      -/*
3581      - * Remap any existing BP's to concrete vdevs, if possible.
3582      - */
3583      -static void
3584      -dbuf_remap(dnode_t *dn, dmu_buf_impl_t *db, dmu_tx_t *tx)
3585      -{
3586      -        spa_t *spa = dmu_objset_spa(db->db_objset);
3587      -        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
3588      -
3589      -        if (!spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL))
3590      -                return;
3591      -
3592      -        if (db->db_level > 0) {
3593      -                blkptr_t *bp = db->db.db_data;
3594      -                for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
3595      -                        dbuf_remap_impl(dn, &bp[i], tx);
3596      -                }
3597      -        } else if (db->db.db_object == DMU_META_DNODE_OBJECT) {
3598      -                dnode_phys_t *dnp = db->db.db_data;
3599      -                ASSERT3U(db->db_dnode_handle->dnh_dnode->dn_type, ==,
3600      -                    DMU_OT_DNODE);
3601      -                for (int i = 0; i < db->db.db_size >> DNODE_SHIFT; i++) {
3602      -                        for (int j = 0; j < dnp[i].dn_nblkptr; j++) {
3603      -                                dbuf_remap_impl(dn, &dnp[i].dn_blkptr[j], tx);
3604      -                        }
3605      -                }
3606      -        }
3607      -}
3608      -
3609      -
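
The block removed above carried the device-removal remap support, and its core locking idiom is worth noting: spa_remap_blkptr() is first attempted on a private copy of the BP, and dn_struct_rwlock is taken as writer only when the copy actually changed, so dbuf_read_impl() readers are never blocked on the common no-op path. A minimal restatement of that idiom from the removed dbuf_remap_impl(), with the callback dropped as in the removed dbuf_can_remap():

        static void
        remap_bp_copy_first(dnode_t *dn, blkptr_t *bp, spa_t *spa)
        {
                blkptr_t bp_copy = *bp;

                /* spa_remap_blkptr() returns B_TRUE only if it changed the BP */
                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
                        rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
                        *bp = bp_copy;
                        rw_exit(&dn->dn_struct_rwlock);
                }
        }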
3610 3673  /* Issue I/O to commit a dirty buffer to disk. */
3611 3674  static void
3612 3675  dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx)
3613 3676  {
3614 3677          dmu_buf_impl_t *db = dr->dr_dbuf;
3615 3678          dnode_t *dn;
3616 3679          objset_t *os;
3617 3680          dmu_buf_impl_t *parent = db->db_parent;
3618 3681          uint64_t txg = tx->tx_txg;
3619 3682          zbookmark_phys_t zb;
3620 3683          zio_prop_t zp;
3621 3684          zio_t *zio;
3622 3685          int wp_flag = 0;
     3686 +        zio_smartcomp_info_t sc;
3623 3687  
3624 3688          ASSERT(dmu_tx_is_syncing(tx));
3625 3689  
3626 3690          DB_DNODE_ENTER(db);
3627 3691          dn = DB_DNODE(db);
3628 3692          os = dn->dn_objset;
3629 3693  
     3694 +        dnode_setup_zio_smartcomp(db, &sc);
     3695 +
3630 3696          if (db->db_state != DB_NOFILL) {
3631 3697                  if (db->db_level > 0 || dn->dn_type == DMU_OT_DNODE) {
3632 3698                          /*
3633 3699                           * Private object buffers are released here rather
3634 3700                           * than in dbuf_dirty() since they are only modified
3635 3701                           * in the syncing context and we don't want the
3636 3702                           * overhead of making multiple copies of the data.
3637 3703                           */
3638 3704                          if (BP_IS_HOLE(db->db_blkptr)) {
3639 3705                                  arc_buf_thaw(data);
3640 3706                          } else {
3641 3707                                  dbuf_release_bp(db);
3642 3708                          }
3643      -                        dbuf_remap(dn, db, tx);
3644 3709                  }
3645 3710          }
3646 3711  
3647 3712          if (parent != dn->dn_dbuf) {
3648 3713                  /* Our parent is an indirect block. */
3649 3714                  /* We have a dirty parent that has been scheduled for write. */
3650 3715                  ASSERT(parent && parent->db_data_pending);
3651 3716                  /* Our parent's buffer is one level closer to the dnode. */
3652 3717                  ASSERT(db->db_level == parent->db_level-1);
3653 3718                  /*
↓ open down ↓ 17 lines elided ↑ open up ↑
3671 3736          ASSERT3U(db->db_blkptr->blk_birth, <=, txg);
3672 3737          ASSERT(zio);
3673 3738  
3674 3739          SET_BOOKMARK(&zb, os->os_dsl_dataset ?
3675 3740              os->os_dsl_dataset->ds_object : DMU_META_OBJSET,
3676 3741              db->db.db_object, db->db_level, db->db_blkid);
3677 3742  
3678 3743          if (db->db_blkid == DMU_SPILL_BLKID)
3679 3744                  wp_flag = WP_SPILL;
3680 3745          wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
     3746 +        WP_SET_SPECIALCLASS(wp_flag, dr->dr_usesc);
3681 3747  
3682 3748          dmu_write_policy(os, dn, db->db_level, wp_flag, &zp);
3683 3749          DB_DNODE_EXIT(db);
3684 3750  
3685 3751          /*
3686 3752           * We copy the blkptr now (rather than when we instantiate the dirty
3687 3753           * record), because its value can change between open context and
3688 3754           * syncing context. We do not need to hold dn_struct_rwlock to read
3689 3755           * db_blkptr because we are in syncing context.
3690 3756           */
↓ open down ↓ 5 lines elided ↑ open up ↑
3696 3762                   * The BP for this block has been provided by open context
3697 3763                   * (by dmu_sync() or dmu_buf_write_embedded()).
3698 3764                   */
3699 3765                  abd_t *contents = (data != NULL) ?
3700 3766                      abd_get_from_buf(data->b_data, arc_buf_size(data)) : NULL;
3701 3767  
3702 3768                  dr->dr_zio = zio_write(zio, os->os_spa, txg, &dr->dr_bp_copy,
3703 3769                      contents, db->db.db_size, db->db.db_size, &zp,
3704 3770                      dbuf_write_override_ready, NULL, NULL,
3705 3771                      dbuf_write_override_done,
3706      -                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
     3772 +                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb,
     3773 +                    &sc);
3707 3774                  mutex_enter(&db->db_mtx);
3708 3775                  dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
3709 3776                  zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
3710 3777                      dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite);
3711 3778                  mutex_exit(&db->db_mtx);
3712 3779          } else if (db->db_state == DB_NOFILL) {
3713 3780                  ASSERT(zp.zp_checksum == ZIO_CHECKSUM_OFF ||
3714 3781                      zp.zp_checksum == ZIO_CHECKSUM_NOPARITY);
3715 3782                  dr->dr_zio = zio_write(zio, os->os_spa, txg,
3716 3783                      &dr->dr_bp_copy, NULL, db->db.db_size, db->db.db_size, &zp,
3717 3784                      dbuf_write_nofill_ready, NULL, NULL,
3718 3785                      dbuf_write_nofill_done, db,
3719 3786                      ZIO_PRIORITY_ASYNC_WRITE,
3720      -                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb);
     3787 +                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb, &sc);
3721 3788          } else {
3722 3789                  ASSERT(arc_released(data));
3723 3790  
3724 3791                  /*
3725 3792                   * For indirect blocks, we want to set up the children
3726 3793                   * ready callback so that we can properly handle an indirect
3727 3794                   * block that only contains holes.
3728 3795                   */
3729 3796                  arc_done_func_t *children_ready_cb = NULL;
3730 3797                  if (db->db_level != 0)
3731 3798                          children_ready_cb = dbuf_write_children_ready;
3732 3799  
3733 3800                  dr->dr_zio = arc_write(zio, os->os_spa, txg,
3734 3801                      &dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
3735 3802                      &zp, dbuf_write_ready, children_ready_cb,
3736 3803                      dbuf_write_physdone, dbuf_write_done, db,
3737      -                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
     3804 +                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb, &sc);
3738 3805          }
3739 3806  }
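
Alongside the smart-compression descriptor (zio_smartcomp_info_t, filled in by dnode_setup_zio_smartcomp() and passed to every zio_write()/arc_write() call above), this delta folds a special-class hint into the write-policy flags. The flag composition from dbuf_write(), isolated as a self-contained helper; the helper name is hypothetical, while WP_SPILL, WP_NOFILL, WP_SET_SPECIALCLASS, and dr_usesc all come from the diff:

        /*
         * Hypothetical helper isolating the wp_flag setup in dbuf_write():
         * spill blocks get WP_SPILL, NOFILL dbufs get WP_NOFILL, and the
         * dirty record's special-class preference (dr_usesc) is folded in
         * via WP_SET_SPECIALCLASS before dmu_write_policy() is consulted.
         */
        static int
        dbuf_write_policy_flags(const dmu_buf_impl_t *db,
            const dbuf_dirty_record_t *dr)
        {
                int wp_flag = 0;

                if (db->db_blkid == DMU_SPILL_BLKID)
                        wp_flag = WP_SPILL;
                wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
                WP_SET_SPECIALCLASS(wp_flag, dr->dr_usesc);
                return (wp_flag);
        }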
    