NEX-19394 backport 9337 zfs get all is slow due to uncached metadata
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
 Conflicts:
  usr/src/uts/common/fs/zfs/dbuf.c
  usr/src/uts/common/fs/zfs/dmu.c
  usr/src/uts/common/fs/zfs/sys/dmu_objset.h
NEX-15468 panic - Deadlock: cycle in blocking chain with dbuf_destroy calling mutex_vector_enter
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16146 9188 increase size of dbuf cache to reduce indirect block decompression
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-5366 Race between unique_insert() and unique_remove() causes ZFS fsid change
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
6047 SPARC boot should support feature@embedded_data
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-1823 Slow performance doing of a large dataset
5911 ZFS "hangs" while deleting file
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
NEX-3558 KRRP Integration
NEX-3266 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
NEX-3165 segregate ddt in arc
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Issue #7: add cacheability to the properties
          Contributors: Boris Protopopov
DDT is placed either on the special vdev or in L2ARC, but not in both
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line size), placing each kmutex_t on a cache line by itself to avoid false sharing
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

@@ -18,11 +18,11 @@
  *
  * CDDL HEADER END
  */
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ * Copyright 2018 Nexenta Systems, Inc.  All rights reserved.
  * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
  * Copyright (c) 2013, Joyent, Inc. All rights reserved.
  * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
  * Copyright (c) 2014 Integros [integros.com]

@@ -36,21 +36,20 @@
 #include <sys/dmu_objset.h>
 #include <sys/dsl_dataset.h>
 #include <sys/dsl_dir.h>
 #include <sys/dmu_tx.h>
 #include <sys/spa.h>
+#include <sys/spa_impl.h>
 #include <sys/zio.h>
 #include <sys/dmu_zfetch.h>
 #include <sys/sa.h>
 #include <sys/sa_impl.h>
 #include <sys/zfeature.h>
 #include <sys/blkptr.h>
 #include <sys/range_tree.h>
 #include <sys/callb.h>
 #include <sys/abd.h>
-#include <sys/vdev.h>
-#include <sys/cityhash.h>
 
 uint_t zfs_dbuf_evict_key;
 
 static boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
 static void dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx);

@@ -72,28 +71,62 @@
 static kmutex_t dbuf_evict_lock;
 static kcondvar_t dbuf_evict_cv;
 static boolean_t dbuf_evict_thread_exit;
 
 /*
- * LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
+ * There are two dbuf caches; each dbuf can only be in one of them at a time.
+ *
+ * 1. Cache of metadata dbufs, to help make read-heavy administrative commands
+ *    from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
+ *    that represent the metadata that describes filesystems/snapshots/
+ *    bookmarks/properties/etc. We only evict from this cache when we export a
+ *    pool, to short-circuit as much I/O as possible for all administrative
+ *    commands that need the metadata. There is no eviction policy for this
+ *    cache, because we try to only include types in it which would occupy a
+ *    very small amount of space per object but create a large impact on the
+ *    performance of these commands. Instead, after it reaches a maximum size
+ *    (which should only happen on very small memory systems with a very large
+ *    number of filesystem objects), we stop taking new dbufs into the
+ *    metadata cache, instead putting them in the normal dbuf cache.
+ *
+ * 2. LRU cache of dbufs. The "dbuf cache" maintains a list of dbufs that
  * are not currently held but have been recently released. These dbufs
  * are not eligible for arc eviction until they are aged out of the cache.
- * Dbufs are added to the dbuf cache once the last hold is released. If a
- * dbuf is later accessed and still exists in the dbuf cache, then it will
- * be removed from the cache and later re-added to the head of the cache.
  * Dbufs that are aged out of the cache will be immediately destroyed and
  * become eligible for arc eviction.
+ *
+ * Dbufs are added to these caches once the last hold is released. If a dbuf is
+ * later accessed and still exists in the dbuf cache, then it will be removed
+ * from the cache and later re-added to the head of the cache.
+ *
+ * If a given dbuf meets the requirements for the metadata cache, it will go
+ * there, otherwise it will be considered for the generic LRU dbuf cache. The
+ * caches and the refcounts tracking their sizes are stored in an array indexed
+ * by those caches' matching enum values (from dbuf_cached_state_t).
  */
-static multilist_t *dbuf_cache;
-static refcount_t dbuf_cache_size;
-uint64_t dbuf_cache_max_bytes = 100 * 1024 * 1024;
+typedef struct dbuf_cache {
+        multilist_t *cache;
+        refcount_t size;
+} dbuf_cache_t;
+dbuf_cache_t dbuf_caches[DB_CACHE_MAX];
 
-/* Cap the size of the dbuf cache to log2 fraction of arc size. */
-int dbuf_cache_max_shift = 5;
+/* Size limits for the caches */
+uint64_t dbuf_cache_max_bytes = 0;
+uint64_t dbuf_metadata_cache_max_bytes = 0;
+/* Set the default sizes of the caches to log2 fraction of arc size */
+int dbuf_cache_shift = 5;
+int dbuf_metadata_cache_shift = 6;
 
 /*
- * The dbuf cache uses a three-stage eviction policy:
+ * For diagnostic purposes, this is incremented whenever we can't add
+ * something to the metadata cache because it's full, and instead put
+ * the data in the regular dbuf cache.
+ */
+uint64_t dbuf_metadata_cache_overflow;
+
+/*
+ * The LRU dbuf cache uses a three-stage eviction policy:
  *      - A low water marker designates when the dbuf eviction thread
  *      should stop evicting from the dbuf cache.
  *      - When we reach the maximum size (aka mid water mark), we
  *      signal the eviction thread to run.
  *      - The high water mark indicates when the eviction thread

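Note: for orientation, the water marks described above work out as follows. A sketch, assuming dbuf_cache_hiwater_pct and dbuf_cache_lowater_pct keep their 10% defaults (defined elsewhere in dbuf.c and not shown in this hunk):

    uint64_t max = 128ULL << 20;            /* dbuf_cache_max_bytes: 128 MiB    */
    uint64_t hi  = max + (max * 10) / 100;  /* ~140.8 MiB: callers evict inline */
    uint64_t lo  = max - (max * 10) / 100;  /* ~115.2 MiB: evict thread stops   */
    /* Reaching max itself (the mid water mark) signals the eviction thread. */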
@@ -162,22 +195,32 @@
 }
 
 /*
  * dbuf hash table routines
  */
+#pragma align 64(dbuf_hash_table)
 static dbuf_hash_table_t dbuf_hash_table;
 
 static uint64_t dbuf_hash_count;
 
-/*
- * We use Cityhash for this. It's fast, and has good hash properties without
- * requiring any large static buffers.
- */
 static uint64_t
 dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
 {
-        return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
+        uintptr_t osv = (uintptr_t)os;
+        uint64_t crc = -1ULL;
+
+        ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl)) & 0xFF];
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF];
+        crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF];
+
+        crc ^= (osv>>14) ^ (obj>>16) ^ (blkid>>16);
+
+        return (crc);
 }
 
 #define DBUF_EQUAL(dbuf, os, obj, level, blkid)         \
         ((dbuf)->db.db_object == (obj) &&               \
         (dbuf)->db_objset == (os) &&                    \

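Note: the ASSERT in dbuf_hash() relies on a property of the reflected CRC-64 table, namely that entry 128 is the polynomial itself. A minimal sketch of how such a table is generated (illustrative only; the real zfs_crc64_table is initialized elsewhere in ZFS):

    uint64_t table[256];
    for (int i = 0; i < 256; i++) {
            uint64_t c = (uint64_t)i;
            for (int j = 0; j < 8; j++)
                    c = (c >> 1) ^ ((c & 1) ? ZFS_CRC64_POLY : 0);
            table[i] = c;
    }
    /*
     * For i == 128 the lone set bit shifts out on the final round, so
     * table[128] == ZFS_CRC64_POLY, which is what the ASSERT verifies.
     */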
@@ -391,11 +434,59 @@
 
                 return (is_metadata);
         }
 }
 
+boolean_t
+dbuf_is_ddt(dmu_buf_impl_t *db)
+{
+        boolean_t is_ddt;
+
+        DB_DNODE_ENTER(db);
+        is_ddt = (DB_DNODE(db)->dn_type == DMU_OT_DDT_ZAP) ||
+            (DB_DNODE(db)->dn_type == DMU_OT_DDT_STATS);
+        DB_DNODE_EXIT(db);
+
+        return (is_ddt);
+}
+
 /*
+ * This returns whether this dbuf should be stored in the metadata cache, which
+ * is based on whether it's from one of the dnode types that store data related
+ * to traversing dataset hierarchies.
+ */
+static boolean_t
+dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
+{
+        DB_DNODE_ENTER(db);
+        dmu_object_type_t type = DB_DNODE(db)->dn_type;
+        DB_DNODE_EXIT(db);
+
+        /* Check if this dbuf is one of the types we care about */
+        if (DMU_OT_IS_METADATA_CACHED(type)) {
+                /* If we hit this, then we set something up wrong in dmu_ot */
+                ASSERT(DMU_OT_IS_METADATA(type));
+
+                /*
+                 * Sanity check for small-memory systems: don't allocate too
+                 * much memory for this purpose.
+                 */
+                if (refcount_count(&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
+                    dbuf_metadata_cache_max_bytes) {
+                        dbuf_metadata_cache_overflow++;
+                        DTRACE_PROBE1(dbuf__metadata__cache__overflow,
+                            dmu_buf_impl_t *, db);
+                        return (B_FALSE);
+                }
+
+                return (B_TRUE);
+        }
+
+        return (B_FALSE);
+}
+
+/*
  * This function *must* return indices evenly distributed between all
  * sublists of the multilist. This is needed due to how the dbuf eviction
  * code is laid out; dbuf_evict_thread() assumes dbufs are evenly
  * distributed between all sublists and uses this assumption when
  * deciding which sublist to evict from and how much to evict from it.

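Note: DMU_OT_IS_METADATA_CACHED() lives in dmu.h, not in this file. In the upstream 9337 change it reads a per-type flag from the dmu_ot[] table; approximately (a sketch, not the verbatim header):

    /*
     * Hierarchy-traversal object types, e.g. DMU_OT_DSL_DIR and
     * DMU_OT_DSL_PROPS, set ot_dbuf_metadata_cache in their
     * dmu_object_type_info_t entry.
     */
    #define DMU_OT_IS_METADATA_CACHED(ot)   (((ot) & DMU_OT_NEWTYPE) ? \
            B_FALSE : dmu_ot[(ot)].ot_dbuf_metadata_cache)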
@@ -426,32 +517,33 @@
 dbuf_cache_above_hiwater(void)
 {
         uint64_t dbuf_cache_hiwater_bytes =
             (dbuf_cache_max_bytes * dbuf_cache_hiwater_pct) / 100;
 
-        return (refcount_count(&dbuf_cache_size) >
+        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
             dbuf_cache_max_bytes + dbuf_cache_hiwater_bytes);
 }
 
 static inline boolean_t
 dbuf_cache_above_lowater(void)
 {
         uint64_t dbuf_cache_lowater_bytes =
             (dbuf_cache_max_bytes * dbuf_cache_lowater_pct) / 100;
 
-        return (refcount_count(&dbuf_cache_size) >
+        return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
             dbuf_cache_max_bytes - dbuf_cache_lowater_bytes);
 }
 
 /*
  * Evict the oldest eligible dbuf from the dbuf cache.
  */
 static void
 dbuf_evict_one(void)
 {
-        int idx = multilist_get_random_index(dbuf_cache);
-        multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
+        int idx = multilist_get_random_index(dbuf_caches[DB_DBUF_CACHE].cache);
+        multilist_sublist_t *mls = multilist_sublist_lock(
+            dbuf_caches[DB_DBUF_CACHE].cache, idx);
 
         ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
 
         /*
          * Set the thread's tsd to indicate that it's processing evictions.

@@ -470,12 +562,14 @@
             multilist_sublist_t *, mls);
 
         if (db != NULL) {
                 multilist_sublist_remove(mls, db);
                 multilist_sublist_unlock(mls);
-                (void) refcount_remove_many(&dbuf_cache_size,
+                (void) refcount_remove_many(&dbuf_caches[DB_DBUF_CACHE].size,
                     db->db.db_size, db);
+                ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
+                db->db_caching_status = DB_NO_CACHE;
                 dbuf_destroy(db);
         } else {
                 multilist_sublist_unlock(mls);
         }
         (void) tsd_set(zfs_dbuf_evict_key, NULL);

@@ -524,12 +618,28 @@
         thread_exit();
 }
 
 /*
  * Wake up the dbuf eviction thread if the dbuf cache is at its max size.
- * If the dbuf cache is at its high water mark, then evict a dbuf from the
- * dbuf cache using the callers context.
+ *
+ * Direct eviction (dbuf_evict_one()) is not performed here, because
+ * that function cannot control which dbuf it selects, so the following
+ * scenario is possible and would result in a deadlock panic:
+ *
+ * Thread A is evicting dbufs that belong to dnodeA:
+ * dnode_evict_dbufs(dnodeA) enters dn_dbufs_mtx and then walks the
+ * dnode's AVL tree of dbufs, calling dbuf_destroy():
+ * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
+ *  -> select a dbuf from cache -> dbuf_destroy() ->
+ *   -> mutex_enter(dn_dbufs_mtx of dnodeB)
+ *
+ * Thread B is evicting dbufs that belong to dnodeB:
+ * dnode_evict_dbufs(dnodeB) enters dn_dbufs_mtx and then walks the
+ * dnode's AVL tree of dbufs, calling dbuf_destroy():
+ * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
+ *  -> select a dbuf from cache -> dbuf_destroy() ->
+ *   -> mutex_enter(dn_dbufs_mtx of dnodeA)
  */
 static void
 dbuf_evict_notify(void)
 {
 

@@ -558,11 +668,12 @@
         /*
          * We check if we should evict without holding the dbuf_evict_lock,
          * because it's OK to occasionally make the wrong decision here,
          * and grabbing the lock results in massive lock contention.
          */
-        if (refcount_count(&dbuf_cache_size) > dbuf_cache_max_bytes) {
+        if (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
+            dbuf_cache_max_bytes) {
                 if (dbuf_cache_above_hiwater())
                         dbuf_evict_one();
                 cv_signal(&dbuf_evict_cv);
         }
 }

@@ -595,29 +706,56 @@
         dbuf_kmem_cache = kmem_cache_create("dmu_buf_impl_t",
             sizeof (dmu_buf_impl_t),
             0, dbuf_cons, dbuf_dest, NULL, NULL, NULL, 0);
 
         for (i = 0; i < DBUF_MUTEXES; i++)
-                mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);
+                mutex_init(DBUF_HASH_MUTEX(h, i), NULL, MUTEX_DEFAULT, NULL);
 
         /*
-         * Setup the parameters for the dbuf cache. We cap the size of the
-         * dbuf cache to 1/32nd (default) of the size of the ARC.
+         * Setup the parameters for the dbuf caches. We set the sizes of the
+         * dbuf cache and the metadata cache to 1/32nd and 1/16th (default)
+         * of the size of the ARC, respectively.
          */
-        dbuf_cache_max_bytes = MIN(dbuf_cache_max_bytes,
-            arc_max_bytes() >> dbuf_cache_max_shift);
+        if (dbuf_cache_max_bytes == 0 ||
+            dbuf_cache_max_bytes >= arc_max_bytes())  {
+                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
+        }
+        if (dbuf_metadata_cache_max_bytes == 0 ||
+            dbuf_metadata_cache_max_bytes >= arc_max_bytes()) {
+                dbuf_metadata_cache_max_bytes =
+                    arc_max_bytes() >> dbuf_metadata_cache_shift;
+        }
 
+        /*
+         * The combined size of the two caches must be less than the
+         * size of the ARC; otherwise, reset both caches to their
+         * default sizes.
+         *
+         * Dividing both sides by 2 is simple overflow protection.
+         */
+        if (((dbuf_cache_max_bytes / 2) +
+            (dbuf_metadata_cache_max_bytes / 2)) >= (arc_max_bytes() / 2)) {
+                dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
+                dbuf_metadata_cache_max_bytes =
+                    arc_max_bytes() >> dbuf_metadata_cache_shift;
+        }
+
+        /*
          * All entries are queued via taskq_dispatch_ent(), so min/maxalloc
          * configuration is not required.
          */
         dbu_evict_taskq = taskq_create("dbu_evict", 1, minclsyspri, 0, 0, 0);
 
-        dbuf_cache = multilist_create(sizeof (dmu_buf_impl_t),
+        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
+                dbuf_caches[dcs].cache =
+                    multilist_create(sizeof (dmu_buf_impl_t),
             offsetof(dmu_buf_impl_t, db_cache_link),
             dbuf_cache_multilist_index_func);
-        refcount_create(&dbuf_cache_size);
+                refcount_create(&dbuf_caches[dcs].size);
+        }
 
         tsd_create(&zfs_dbuf_evict_key, NULL);
         dbuf_evict_thread_exit = B_FALSE;
         mutex_init(&dbuf_evict_lock, NULL, MUTEX_DEFAULT, NULL);
         cv_init(&dbuf_evict_cv, NULL, CV_DEFAULT, NULL);

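Note: a quick worked example of the sizing logic in dbuf_init() above, assuming arc_max_bytes() returns 4 GiB and both tunables were left at 0:

    uint64_t arc_max  = 4ULL << 30;     /* arc_max_bytes()                 */
    uint64_t dbuf_max = arc_max >> 5;   /* 128 MiB for the LRU dbuf cache  */
    uint64_t meta_max = arc_max >> 6;   /*  64 MiB for the metadata cache  */
    /*
     * The sanity check halves both sides to avoid overflow:
     * 64 MiB + 32 MiB < 2 GiB, so the computed sizes are kept.
     */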
@@ -630,11 +768,11 @@
 {
         dbuf_hash_table_t *h = &dbuf_hash_table;
         int i;
 
         for (i = 0; i < DBUF_MUTEXES; i++)
-                mutex_destroy(&h->hash_mutexes[i]);
+                mutex_destroy(DBUF_HASH_MUTEX(h, i));
         kmem_free(h->hash_table, (h->hash_table_mask + 1) * sizeof (void *));
         kmem_cache_destroy(dbuf_kmem_cache);
         taskq_destroy(dbu_evict_taskq);
 
         mutex_enter(&dbuf_evict_lock);

@@ -647,12 +785,14 @@
         tsd_destroy(&zfs_dbuf_evict_key);
 
         mutex_destroy(&dbuf_evict_lock);
         cv_destroy(&dbuf_evict_cv);
 
-        refcount_destroy(&dbuf_cache_size);
-        multilist_destroy(dbuf_cache);
+        for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
+                refcount_destroy(&dbuf_caches[dcs].size);
+                multilist_destroy(dbuf_caches[dcs].cache);
+        }
 }
 
 /*
  * Other stuff.
  */

@@ -1412,11 +1552,11 @@
 /*
  * We already have a dirty record for this TXG, and we are being
  * dirtied again.
  */
 static void
-dbuf_redirty(dbuf_dirty_record_t *dr)
+dbuf_redirty(dbuf_dirty_record_t *dr, boolean_t usesc)
 {
         dmu_buf_impl_t *db = dr->dr_dbuf;
 
         ASSERT(MUTEX_HELD(&db->db_mtx));
 

@@ -1431,14 +1571,19 @@
                         /* Already released on initial dirty, so just thaw. */
                         ASSERT(arc_released(db->db_buf));
                         arc_buf_thaw(db->db_buf);
                 }
         }
+        /*
+         * The special-class usage of a dirty dbuf may have changed,
+         * so update the dirty record.
+         */
+        dr->dr_usesc = usesc;
 }
 
 dbuf_dirty_record_t *
-dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
+dbuf_dirty_sc(dmu_buf_impl_t *db, dmu_tx_t *tx, boolean_t usesc)
 {
         dnode_t *dn;
         objset_t *os;
         dbuf_dirty_record_t **drp, *dr;
         int drop_struct_lock = FALSE;

@@ -1521,11 +1666,11 @@
         while ((dr = *drp) != NULL && dr->dr_txg > tx->tx_txg)
                 drp = &dr->dr_next;
         if (dr && dr->dr_txg == tx->tx_txg) {
                 DB_DNODE_EXIT(db);
 
-                dbuf_redirty(dr);
+                dbuf_redirty(dr, usesc);
                 mutex_exit(&db->db_mtx);
                 return (dr);
         }
 
         /*

@@ -1601,10 +1746,11 @@
         if (db->db_blkid != DMU_BONUS_BLKID && os->os_dsl_dataset != NULL)
                 dr->dr_accounted = db->db.db_size;
         dr->dr_dbuf = db;
         dr->dr_txg = tx->tx_txg;
         dr->dr_next = *drp;
+        dr->dr_usesc = usesc;
         *drp = dr;
 
         /*
          * We could have been freed_in_flight between the dbuf_noread
          * and dbuf_dirty.  We win, as though the dbuf_noread() had

@@ -1634,11 +1780,11 @@
             db->db_blkid == DMU_SPILL_BLKID) {
                 mutex_enter(&dn->dn_mtx);
                 ASSERT(!list_link_active(&dr->dr_dirty_node));
                 list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
                 mutex_exit(&dn->dn_mtx);
-                dnode_setdirty(dn, tx);
+                dnode_setdirty_sc(dn, tx, usesc);
                 DB_DNODE_EXIT(db);
                 return (dr);
         }
 
         /*

@@ -1669,11 +1815,11 @@
          * syncing context won't have to wait for the i/o.
          */
         ddt_prefetch(os->os_spa, db->db_blkptr);
 
         if (db->db_level == 0) {
-                dnode_new_blkid(dn, db->db_blkid, tx, drop_struct_lock);
+                dnode_new_blkid(dn, db->db_blkid, tx, usesc, drop_struct_lock);
                 ASSERT(dn->dn_maxblkid >= db->db_blkid);
         }
 
         if (db->db_level+1 < dn->dn_nlevels) {
                 dmu_buf_impl_t *parent = db->db_parent;

@@ -1689,11 +1835,11 @@
                         parent_held = TRUE;
                 }
                 if (drop_struct_lock)
                         rw_exit(&dn->dn_struct_rwlock);
                 ASSERT3U(db->db_level+1, ==, parent->db_level);
-                di = dbuf_dirty(parent, tx);
+                di = dbuf_dirty_sc(parent, tx, usesc);
                 if (parent_held)
                         dbuf_rele(parent, FTAG);
 
                 mutex_enter(&db->db_mtx);
                 /*

@@ -1707,10 +1853,16 @@
                         ASSERT(!list_link_active(&dr->dr_dirty_node));
                         list_insert_tail(&di->dt.di.dr_children, dr);
                         mutex_exit(&di->dt.di.dr_mtx);
                         dr->dr_parent = di;
                 }
+
+                /*
+                 * The special-class usage of a dirty dbuf may have changed,
+                 * so update the dirty record.
+                 */
+                dr->dr_usesc = usesc;
                 mutex_exit(&db->db_mtx);
         } else {
                 ASSERT(db->db_level+1 == dn->dn_nlevels);
                 ASSERT(db->db_blkid < dn->dn_nblkptr);
                 ASSERT(db->db_parent == NULL || db->db_parent == dn->dn_dbuf);

@@ -1720,15 +1872,26 @@
                 mutex_exit(&dn->dn_mtx);
                 if (drop_struct_lock)
                         rw_exit(&dn->dn_struct_rwlock);
         }
 
-        dnode_setdirty(dn, tx);
+        dnode_setdirty_sc(dn, tx, usesc);
         DB_DNODE_EXIT(db);
         return (dr);
 }
 
+dbuf_dirty_record_t *
+dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
+{
+        spa_t *spa;
+
+        ASSERT(db->db_objset != NULL);
+        spa = db->db_objset->os_spa;
+
+        return (dbuf_dirty_sc(db, tx, spa->spa_usesc));
+}
+
 /*
  * Undirty a buffer in the transaction group referenced by the given
  * transaction.  Return whether this evicted the dbuf.
  */
 static boolean_t

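Note: the dbuf_dirty() wrapper above keeps existing callers on the pool-wide spa_usesc setting, while special-class-aware code can pin the choice per call. A usage sketch (the callers shown are hypothetical):

    /* Default path: the dirty record inherits spa->spa_usesc. */
    dbuf_dirty_record_t *dr = dbuf_dirty(db, tx);

    /* Hypothetical special-class-aware caller: force the normal class. */
    dbuf_dirty_record_t *dr2 = dbuf_dirty_sc(db, tx, B_FALSE);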
@@ -1820,10 +1983,18 @@
 
 void
 dmu_buf_will_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
 {
         dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
+        spa_t *spa = db->db_objset->os_spa;
+        dmu_buf_will_dirty_sc(db_fake, tx, spa->spa_usesc);
+}
+
+void
+dmu_buf_will_dirty_sc(dmu_buf_t *db_fake, dmu_tx_t *tx, boolean_t usesc)
+{
+        dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
         int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;
 
         ASSERT(tx->tx_txg != 0);
         ASSERT(!refcount_is_zero(&db->db_holds));
 

@@ -1842,11 +2013,11 @@
                  * because there are some calls to dbuf_dirty() that don't
                  * go through dmu_buf_will_dirty().
                  */
                 if (dr->dr_txg == tx->tx_txg && db->db_state == DB_CACHED) {
                         /* This dbuf is already dirty and cached. */
-                        dbuf_redirty(dr);
+                        dbuf_redirty(dr, usesc);
                         mutex_exit(&db->db_mtx);
                         return;
                 }
         }
         mutex_exit(&db->db_mtx);

@@ -1854,13 +2025,14 @@
         DB_DNODE_ENTER(db);
         if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
                 rf |= DB_RF_HAVESTRUCT;
         DB_DNODE_EXIT(db);
         (void) dbuf_read(db, NULL, rf);
-        (void) dbuf_dirty(db, tx);
+        (void) dbuf_dirty_sc(db, tx, usesc);
 }
 
 void
 dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
 {
         dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
 

@@ -2031,13 +2203,19 @@
         }
 
         dbuf_clear_data(db);
 
         if (multilist_link_active(&db->db_cache_link)) {
-                multilist_remove(dbuf_cache, db);
-                (void) refcount_remove_many(&dbuf_cache_size,
+                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
+                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
+
+                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
+                (void) refcount_remove_many(
+                    &dbuf_caches[db->db_caching_status].size,
                     db->db.db_size, db);
+
+                db->db_caching_status = DB_NO_CACHE;
         }
 
         ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
         ASSERT(db->db_data_pending == NULL);
 

@@ -2087,10 +2265,11 @@
         ASSERT(db->db_buf == NULL);
         ASSERT(db->db.db_data == NULL);
         ASSERT(db->db_hash_next == NULL);
         ASSERT(db->db_blkptr == NULL);
         ASSERT(db->db_data_pending == NULL);
+        ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
         ASSERT(!multilist_link_active(&db->db_cache_link));
 
         kmem_cache_free(dbuf_kmem_cache, db);
         arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
 

@@ -2225,10 +2404,11 @@
                 db->db.db_size = DN_MAX_BONUSLEN -
                     (dn->dn_nblkptr-1) * sizeof (blkptr_t);
                 ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
                 db->db.db_offset = DMU_BONUS_BLKID;
                 db->db_state = DB_UNCACHED;
+                db->db_caching_status = DB_NO_CACHE;
                 /* the bonus dbuf is not placed in the hash table */
                 arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
                 return (db);
         } else if (blkid == DMU_SPILL_BLKID) {
                 db->db.db_size = (blkptr != NULL) ?

@@ -2257,10 +2437,11 @@
                 return (odb);
         }
         avl_add(&dn->dn_dbufs, db);
 
         db->db_state = DB_UNCACHED;
+        db->db_caching_status = DB_NO_CACHE;
         mutex_exit(&dn->dn_dbufs_mtx);
         arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
 
         if (parent && parent != dn->dn_dbuf)
                 dbuf_add_ref(parent, db);

@@ -2563,12 +2744,14 @@
         if (fail_uncached && db->db_state != DB_CACHED) {
                 mutex_exit(&db->db_mtx);
                 return (SET_ERROR(ENOENT));
         }
 
-        if (db->db_buf != NULL)
+        if (db->db_buf != NULL) {
+                arc_buf_access(db->db_buf);
                 ASSERT3P(db->db.db_data, ==, db->db_buf->b_data);
+        }
 
         ASSERT(db->db_buf == NULL || arc_referenced(db->db_buf));
 
         /*
          * If this buffer is currently syncing out, and we are

@@ -2591,13 +2774,19 @@
                 }
         }
 
         if (multilist_link_active(&db->db_cache_link)) {
                 ASSERT(refcount_is_zero(&db->db_holds));
-                multilist_remove(dbuf_cache, db);
-                (void) refcount_remove_many(&dbuf_cache_size,
+                ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
+                    db->db_caching_status == DB_DBUF_METADATA_CACHE);
+
+                multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
+                (void) refcount_remove_many(
+                    &dbuf_caches[db->db_caching_status].size,
                     db->db.db_size, db);
+
+                db->db_caching_status = DB_NO_CACHE;
         }
         (void) refcount_add(&db->db_holds, tag);
         DBUF_VERIFY(db);
         mutex_exit(&db->db_mtx);
 

@@ -2810,17 +2999,27 @@
 
                         if (!DBUF_IS_CACHEABLE(db) ||
                             db->db_pending_evict) {
                                 dbuf_destroy(db);
                         } else if (!multilist_link_active(&db->db_cache_link)) {
-                                multilist_insert(dbuf_cache, db);
-                                (void) refcount_add_many(&dbuf_cache_size,
+                                ASSERT3U(db->db_caching_status, ==,
+                                    DB_NO_CACHE);
+
+                                dbuf_cached_state_t dcs =
+                                    dbuf_include_in_metadata_cache(db) ?
+                                    DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
+                                db->db_caching_status = dcs;
+
+                                multilist_insert(dbuf_caches[dcs].cache, db);
+                                (void) refcount_add_many(&dbuf_caches[dcs].size,
                                     db->db.db_size, db);
                                 mutex_exit(&db->db_mtx);
 
-                                dbuf_evict_notify();
+                                if (db->db_caching_status == DB_DBUF_CACHE) {
+                                        dbuf_evict_notify();
+                                }
                         }
 
                         if (do_arc_evict)
                                 arc_freed(spa, &bp);
                 }
         } else {

@@ -2998,11 +3197,10 @@
 
         /* Provide the pending dirty record to child dbufs */
         db->db_data_pending = dr;
 
         mutex_exit(&db->db_mtx);
-
         dbuf_write(dr, db->db_buf, tx);
 
         zio = dr->dr_zio;
         mutex_enter(&dr->dt.di.dr_mtx);
         dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);

@@ -3470,145 +3668,10 @@
 
         if (zio->io_abd != NULL)
                 abd_put(zio->io_abd);
 }
 
-typedef struct dbuf_remap_impl_callback_arg {
-        objset_t        *drica_os;
-        uint64_t        drica_blk_birth;
-        dmu_tx_t        *drica_tx;
-} dbuf_remap_impl_callback_arg_t;
-
-static void
-dbuf_remap_impl_callback(uint64_t vdev, uint64_t offset, uint64_t size,
-    void *arg)
-{
-        dbuf_remap_impl_callback_arg_t *drica = arg;
-        objset_t *os = drica->drica_os;
-        spa_t *spa = dmu_objset_spa(os);
-        dmu_tx_t *tx = drica->drica_tx;
-
-        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
-        if (os == spa_meta_objset(spa)) {
-                spa_vdev_indirect_mark_obsolete(spa, vdev, offset, size, tx);
-        } else {
-                dsl_dataset_block_remapped(dmu_objset_ds(os), vdev, offset,
-                    size, drica->drica_blk_birth, tx);
-        }
-}
-
-static void
-dbuf_remap_impl(dnode_t *dn, blkptr_t *bp, dmu_tx_t *tx)
-{
-        blkptr_t bp_copy = *bp;
-        spa_t *spa = dmu_objset_spa(dn->dn_objset);
-        dbuf_remap_impl_callback_arg_t drica;
-
-        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
-        drica.drica_os = dn->dn_objset;
-        drica.drica_blk_birth = bp->blk_birth;
-        drica.drica_tx = tx;
-        if (spa_remap_blkptr(spa, &bp_copy, dbuf_remap_impl_callback,
-            &drica)) {
-                /*
-                 * The struct_rwlock prevents dbuf_read_impl() from
-                 * dereferencing the BP while we are changing it.  To
-                 * avoid lock contention, only grab it when we are actually
-                 * changing the BP.
-                 */
-                rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
-                *bp = bp_copy;
-                rw_exit(&dn->dn_struct_rwlock);
-        }
-}
-
-/*
- * Returns true if a dbuf_remap would modify the dbuf. We do this by attempting
- * to remap a copy of every bp in the dbuf.
- */
-boolean_t
-dbuf_can_remap(const dmu_buf_impl_t *db)
-{
-        spa_t *spa = dmu_objset_spa(db->db_objset);
-        blkptr_t *bp = db->db.db_data;
-        boolean_t ret = B_FALSE;
-
-        ASSERT3U(db->db_level, >, 0);
-        ASSERT3S(db->db_state, ==, DB_CACHED);
-
-        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
-
-        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
-        for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
-                blkptr_t bp_copy = bp[i];
-                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
-                        ret = B_TRUE;
-                        break;
-                }
-        }
-        spa_config_exit(spa, SCL_VDEV, FTAG);
-
-        return (ret);
-}
-
-boolean_t
-dnode_needs_remap(const dnode_t *dn)
-{
-        spa_t *spa = dmu_objset_spa(dn->dn_objset);
-        boolean_t ret = B_FALSE;
-
-        if (dn->dn_phys->dn_nlevels == 0) {
-                return (B_FALSE);
-        }
-
-        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
-
-        spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
-        for (int j = 0; j < dn->dn_phys->dn_nblkptr; j++) {
-                blkptr_t bp_copy = dn->dn_phys->dn_blkptr[j];
-                if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
-                        ret = B_TRUE;
-                        break;
-                }
-        }
-        spa_config_exit(spa, SCL_VDEV, FTAG);
-
-        return (ret);
-}
-
-/*
- * Remap any existing BP's to concrete vdevs, if possible.
- */
-static void
-dbuf_remap(dnode_t *dn, dmu_buf_impl_t *db, dmu_tx_t *tx)
-{
-        spa_t *spa = dmu_objset_spa(db->db_objset);
-        ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
-        if (!spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL))
-                return;
-
-        if (db->db_level > 0) {
-                blkptr_t *bp = db->db.db_data;
-                for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
-                        dbuf_remap_impl(dn, &bp[i], tx);
-                }
-        } else if (db->db.db_object == DMU_META_DNODE_OBJECT) {
-                dnode_phys_t *dnp = db->db.db_data;
-                ASSERT3U(db->db_dnode_handle->dnh_dnode->dn_type, ==,
-                    DMU_OT_DNODE);
-                for (int i = 0; i < db->db.db_size >> DNODE_SHIFT; i++) {
-                        for (int j = 0; j < dnp[i].dn_nblkptr; j++) {
-                                dbuf_remap_impl(dn, &dnp[i].dn_blkptr[j], tx);
-                        }
-                }
-        }
-}
-
-
 /* Issue I/O to commit a dirty buffer to disk. */
 static void
 dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx)
 {
         dmu_buf_impl_t *db = dr->dr_dbuf;

@@ -3618,17 +3681,20 @@
         uint64_t txg = tx->tx_txg;
         zbookmark_phys_t zb;
         zio_prop_t zp;
         zio_t *zio;
         int wp_flag = 0;
+        zio_smartcomp_info_t sc;
 
         ASSERT(dmu_tx_is_syncing(tx));
 
         DB_DNODE_ENTER(db);
         dn = DB_DNODE(db);
         os = dn->dn_objset;
 
+        dnode_setup_zio_smartcomp(db, &sc);
+
         if (db->db_state != DB_NOFILL) {
                 if (db->db_level > 0 || dn->dn_type == DMU_OT_DNODE) {
                         /*
                          * Private object buffers are released here rather
                          * than in dbuf_dirty() since they are only modified

@@ -3638,11 +3704,10 @@
                         if (BP_IS_HOLE(db->db_blkptr)) {
                                 arc_buf_thaw(data);
                         } else {
                                 dbuf_release_bp(db);
                         }
-                        dbuf_remap(dn, db, tx);
                 }
         }
 
         if (parent != dn->dn_dbuf) {
                 /* Our parent is an indirect block. */

@@ -3676,10 +3741,11 @@
             db->db.db_object, db->db_level, db->db_blkid);
 
         if (db->db_blkid == DMU_SPILL_BLKID)
                 wp_flag = WP_SPILL;
         wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
+        WP_SET_SPECIALCLASS(wp_flag, dr->dr_usesc);
 
         dmu_write_policy(os, dn, db->db_level, wp_flag, &zp);
         DB_DNODE_EXIT(db);
 
         /*

@@ -3701,11 +3767,12 @@
 
                 dr->dr_zio = zio_write(zio, os->os_spa, txg, &dr->dr_bp_copy,
                     contents, db->db.db_size, db->db.db_size, &zp,
                     dbuf_write_override_ready, NULL, NULL,
                     dbuf_write_override_done,
-                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
+                    dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb,
+                    &sc);
                 mutex_enter(&db->db_mtx);
                 dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
                 zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
                     dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite);
                 mutex_exit(&db->db_mtx);

@@ -3715,11 +3782,11 @@
                 dr->dr_zio = zio_write(zio, os->os_spa, txg,
                     &dr->dr_bp_copy, NULL, db->db.db_size, db->db.db_size, &zp,
                     dbuf_write_nofill_ready, NULL, NULL,
                     dbuf_write_nofill_done, db,
                     ZIO_PRIORITY_ASYNC_WRITE,
-                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb);
+                    ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb, &sc);
         } else {
                 ASSERT(arc_released(data));
 
                 /*
                  * For indirect blocks, we want to setup the children

@@ -3732,8 +3799,8 @@
 
                 dr->dr_zio = arc_write(zio, os->os_spa, txg,
                     &dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
                     &zp, dbuf_write_ready, children_ready_cb,
                     dbuf_write_physdone, dbuf_write_done, db,
-                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
+                    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb, &sc);
         }
 }