NEX-19394 backport 9337 zfs get all is slow due to uncached metadata
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Conflicts:
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/dmu.c
usr/src/uts/common/fs/zfs/sys/dmu_objset.h
NEX-15468 panic - Deadlock: cycle in blocking chain with dbuf_destroy calling mutex_vector_enter
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16146 9188 increase size of dbuf cache to reduce indirect block decompression
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-5366 Race between unique_insert() and unique_remove() causes ZFS fsid change
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
6047 SPARC boot should support feature@embedded_data
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-1823 Slow performance doing of a large dataset
5911 ZFS "hangs" while deleting file
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
NEX-3558 KRRP Integration
NEX-3266 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
NEX-3165 segregate ddt in arc
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Issue #7: add cacheability to the properties
Contributors: Boris Protopopov
DDT is placed either into the special class or into L2ARC, but not both
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line); place each kmutex_t on a cache line by itself to avoid false sharing
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
@@ -18,11 +18,11 @@
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ * Copyright 2018 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2012, 2017 by Delphix. All rights reserved.
* Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
* Copyright (c) 2013, Joyent, Inc. All rights reserved.
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
* Copyright (c) 2014 Integros [integros.com]
@@ -36,21 +36,20 @@
#include <sys/dmu_objset.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dmu_tx.h>
#include <sys/spa.h>
+#include <sys/spa_impl.h>
#include <sys/zio.h>
#include <sys/dmu_zfetch.h>
#include <sys/sa.h>
#include <sys/sa_impl.h>
#include <sys/zfeature.h>
#include <sys/blkptr.h>
#include <sys/range_tree.h>
#include <sys/callb.h>
#include <sys/abd.h>
-#include <sys/vdev.h>
-#include <sys/cityhash.h>
uint_t zfs_dbuf_evict_key;
static boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
static void dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx);
@@ -72,28 +71,62 @@
static kmutex_t dbuf_evict_lock;
static kcondvar_t dbuf_evict_cv;
static boolean_t dbuf_evict_thread_exit;
/*
- * LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
+ * There are two dbuf caches; each dbuf can only be in one of them at a time.
+ *
+ * 1. Cache of metadata dbufs, to help make read-heavy administrative commands
+ * from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
+ * that represent the metadata that describes filesystems/snapshots/
+ * bookmarks/properties/etc. We only evict from this cache when we export a
+ * pool, to short-circuit as much I/O as possible for all administrative
+ * commands that need the metadata. There is no eviction policy for this
+ * cache, because we try to only include types in it which would occupy a
+ * very small amount of space per object but create a large impact on the
+ * performance of these commands. Instead, after it reaches a maximum size
+ * (which should only happen on very small memory systems with a very large
+ * number of filesystem objects), we stop taking new dbufs into the
+ * metadata cache, instead putting them in the normal dbuf cache.
+ *
+ * 2. LRU cache of dbufs. The "dbuf cache" maintains a list of dbufs that
* are not currently held but have been recently released. These dbufs
* are not eligible for arc eviction until they are aged out of the cache.
- * Dbufs are added to the dbuf cache once the last hold is released. If a
- * dbuf is later accessed and still exists in the dbuf cache, then it will
- * be removed from the cache and later re-added to the head of the cache.
* Dbufs that are aged out of the cache will be immediately destroyed and
* become eligible for arc eviction.
+ *
+ * Dbufs are added to these caches once the last hold is released. If a dbuf is
+ * later accessed and still exists in the dbuf cache, then it will be removed
+ * from the cache and later re-added to the head of the cache.
+ *
+ * If a given dbuf meets the requirements for the metadata cache, it will go
+ * there, otherwise it will be considered for the generic LRU dbuf cache. The
+ * caches and the refcounts tracking their sizes are stored in an array indexed
+ * by those caches' matching enum values (from dbuf_cached_state_t).
*/
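As a condensed sketch of the steering logic (assumed here; it matches the dbuf_rele_and_unlock() hunk later in this diff), a dbuf whose last hold is released goes to exactly one cache, and only the LRU cache wakes the eviction thread:

	dbuf_cached_state_t dcs = dbuf_include_in_metadata_cache(db) ?
	    DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
	db->db_caching_status = dcs;
	multilist_insert(dbuf_caches[dcs].cache, db);
	(void) refcount_add_many(&dbuf_caches[dcs].size, db->db.db_size, db);
	if (dcs == DB_DBUF_CACHE)
		dbuf_evict_notify();	/* the metadata cache is never evicted */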
-static multilist_t *dbuf_cache;
-static refcount_t dbuf_cache_size;
-uint64_t dbuf_cache_max_bytes = 100 * 1024 * 1024;
+typedef struct dbuf_cache {
+ multilist_t *cache;
+ refcount_t size;
+} dbuf_cache_t;
+dbuf_cache_t dbuf_caches[DB_CACHE_MAX];
-/* Cap the size of the dbuf cache to log2 fraction of arc size. */
-int dbuf_cache_max_shift = 5;
+/* Size limits for the caches */
+uint64_t dbuf_cache_max_bytes = 0;
+uint64_t dbuf_metadata_cache_max_bytes = 0;
+/* Set the default sizes of the caches to log2 fraction of arc size */
+int dbuf_cache_shift = 5;
+int dbuf_metadata_cache_shift = 6;
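For example, with a 4 GiB ARC the defaults come out to 4 GiB >> 5 = 128 MiB for the LRU dbuf cache and 4 GiB >> 6 = 64 MiB for the metadata cache.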
/*
- * The dbuf cache uses a three-stage eviction policy:
+ * For diagnostic purposes, this is incremented whenever we can't add
+ * something to the metadata cache because it's full, and instead put
+ * the data in the regular dbuf cache.
+ */
+uint64_t dbuf_metadata_cache_overflow;
+
+/*
+ * The LRU dbuf cache uses a three-stage eviction policy:
* - A low water marker designates when the dbuf eviction thread
* should stop evicting from the dbuf cache.
* - When we reach the maximum size (aka mid water mark), we
* signal the eviction thread to run.
* - The high water mark indicates when the eviction thread
@@ -162,22 +195,32 @@
}
/*
* dbuf hash table routines
*/
+#pragma align 64(dbuf_hash_table)
static dbuf_hash_table_t dbuf_hash_table;
static uint64_t dbuf_hash_count;
-/*
- * We use Cityhash for this. It's fast, and has good hash properties without
- * requiring any large static buffers.
- */
static uint64_t
dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
{
- return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
+ uintptr_t osv = (uintptr_t)os;
+ uint64_t crc = -1ULL;
+
+ ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (lvl)) & 0xFF];
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (osv >> 6)) & 0xFF];
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 0)) & 0xFF];
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (obj >> 8)) & 0xFF];
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 0)) & 0xFF];
+ crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ (blkid >> 8)) & 0xFF];
+
+ crc ^= (osv>>14) ^ (obj>>16) ^ (blkid>>16);
+
+ return (crc);
}
#define DBUF_EQUAL(dbuf, os, obj, level, blkid) \
((dbuf)->db.db_object == (obj) && \
(dbuf)->db_objset == (os) && \
@@ -391,11 +434,59 @@
return (is_metadata);
}
}
+boolean_t
+dbuf_is_ddt(dmu_buf_impl_t *db)
+{
+ boolean_t is_ddt;
+
+ DB_DNODE_ENTER(db);
+ is_ddt = (DB_DNODE(db)->dn_type == DMU_OT_DDT_ZAP) ||
+ (DB_DNODE(db)->dn_type == DMU_OT_DDT_STATS);
+ DB_DNODE_EXIT(db);
+
+ return (is_ddt);
+}
+
/*
+ * This returns whether this dbuf should be stored in the metadata cache, which
+ * is based on whether it's from one of the dnode types that store data related
+ * to traversing dataset hierarchies.
+ */
+static boolean_t
+dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
+{
+ DB_DNODE_ENTER(db);
+ dmu_object_type_t type = DB_DNODE(db)->dn_type;
+ DB_DNODE_EXIT(db);
+
+ /* Check if this dbuf is one of the types we care about */
+ if (DMU_OT_IS_METADATA_CACHED(type)) {
+ /* If we hit this, then we set something up wrong in dmu_ot */
+ ASSERT(DMU_OT_IS_METADATA(type));
+
+ /*
+ * Sanity check for small-memory systems: don't allocate too
+ * much memory for this purpose.
+ */
+ if (refcount_count(&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
+ dbuf_metadata_cache_max_bytes) {
+ dbuf_metadata_cache_overflow++;
+ DTRACE_PROBE1(dbuf__metadata__cache__overflow,
+ dmu_buf_impl_t *, db);
+ return (B_FALSE);
+ }
+
+ return (B_TRUE);
+ }
+
+ return (B_FALSE);
+}
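DMU_OT_IS_METADATA_CACHED() comes from the companion dmu.h/dmu.c change, which is not shown in this hunk. Per the upstream 9337 work it is roughly the following sketch, assuming an ot_dbuf_metadata_cache flag added to the dmu_ot[] type table and set only for the types (DSL dirs/datasets/props, etc.) that describe the dataset hierarchy:

	/* sketch of the dmu.h side of this change (assumed) */
	#define	DMU_OT_IS_METADATA_CACHED(ot)	(((ot) & DMU_OT_NEWTYPE) ? \
		B_FALSE : dmu_ot[(ot)].ot_dbuf_metadata_cache)

The DTRACE_PROBE1() above also gives administrators an SDT probe to watch when diagnosing whether the metadata cache cap is being hit.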
+
+/*
* This function *must* return indices evenly distributed between all
* sublists of the multilist. This is needed due to how the dbuf eviction
* code is laid out; dbuf_evict_thread() assumes dbufs are evenly
* distributed between all sublists and uses this assumption when
* deciding which sublist to evict from and how much to evict from it.
@@ -426,32 +517,33 @@
dbuf_cache_above_hiwater(void)
{
uint64_t dbuf_cache_hiwater_bytes =
(dbuf_cache_max_bytes * dbuf_cache_hiwater_pct) / 100;
- return (refcount_count(&dbuf_cache_size) >
+ return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
dbuf_cache_max_bytes + dbuf_cache_hiwater_bytes);
}
static inline boolean_t
dbuf_cache_above_lowater(void)
{
uint64_t dbuf_cache_lowater_bytes =
(dbuf_cache_max_bytes * dbuf_cache_lowater_pct) / 100;
- return (refcount_count(&dbuf_cache_size) >
+ return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
dbuf_cache_max_bytes - dbuf_cache_lowater_bytes);
}
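To make the watermarks concrete (the hiwater/lowater percentages are defined outside this hunk; 10 each is assumed here, matching the illumos defaults): with dbuf_cache_max_bytes = 128 MiB, dbuf_evict_notify() signals the eviction thread once the cache exceeds 128 MiB, callers additionally evict directly above the high water mark of 128 + 12.8 = 140.8 MiB, and the eviction thread keeps evicting until the cache falls back below the low water mark of 128 - 12.8 = 115.2 MiB.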
/*
* Evict the oldest eligible dbuf from the dbuf cache.
*/
static void
dbuf_evict_one(void)
{
- int idx = multilist_get_random_index(dbuf_cache);
- multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
+ int idx = multilist_get_random_index(dbuf_caches[DB_DBUF_CACHE].cache);
+ multilist_sublist_t *mls = multilist_sublist_lock(
+ dbuf_caches[DB_DBUF_CACHE].cache, idx);
ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
/*
* Set the thread's tsd to indicate that it's processing evictions.
@@ -470,12 +562,14 @@
multilist_sublist_t *, mls);
if (db != NULL) {
multilist_sublist_remove(mls, db);
multilist_sublist_unlock(mls);
- (void) refcount_remove_many(&dbuf_cache_size,
+ (void) refcount_remove_many(&dbuf_caches[DB_DBUF_CACHE].size,
db->db.db_size, db);
+ ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
+ db->db_caching_status = DB_NO_CACHE;
dbuf_destroy(db);
} else {
multilist_sublist_unlock(mls);
}
(void) tsd_set(zfs_dbuf_evict_key, NULL);
@@ -524,12 +618,28 @@
thread_exit();
}
/*
* Wake up the dbuf eviction thread if the dbuf cache is at its max size.
- * If the dbuf cache is at its high water mark, then evict a dbuf from the
- * dbuf cache using the callers context.
+ *
+ * Direct eviction (dbuf_evict_one()) is not called here because this
+ * function does not care which dbuf is selected for eviction, so the
+ * following case would be possible and would cause a deadlock panic:
+ *
+ * Thread A is evicting dbufs that are related to dnodeA
+ * dnode_evict_dbufs(dnodeA) enters dn_dbufs_mtx and after that walks
+ * its own AVL of dbufs and calls dbuf_destroy():
+ * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
+ * -> select a dbuf from cache -> dbuf_destroy() ->
+ * -> mutex_enter(dn_dbufs_mtx of dnodeB)
+ *
+ * Thread B is evicting dbufs that are related to dnodeB
+ * dnode_evict_dbufs(dnodeB) enters dn_dbufs_mtx and after that walks
+ * its own AVL of dbufs and calls dbuf_destroy():
+ * dbuf_destroy() ->...-> dbuf_evict_notify() -> dbuf_evict_one() ->
+ * -> select a dbuf from cache -> dbuf_destroy() ->
+ * -> mutex_enter(dn_dbufs_mtx of dnodeA)
*/
static void
dbuf_evict_notify(void)
{
@@ -558,11 +668,12 @@
/*
* We check if we should evict without holding the dbuf_evict_lock,
* because it's OK to occasionally make the wrong decision here,
* and grabbing the lock results in massive lock contention.
*/
- if (refcount_count(&dbuf_cache_size) > dbuf_cache_max_bytes) {
+ if (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
+ dbuf_cache_max_bytes) {
if (dbuf_cache_above_hiwater())
dbuf_evict_one();
cv_signal(&dbuf_evict_cv);
}
}
@@ -595,29 +706,56 @@
dbuf_kmem_cache = kmem_cache_create("dmu_buf_impl_t",
sizeof (dmu_buf_impl_t),
0, dbuf_cons, dbuf_dest, NULL, NULL, NULL, 0);
for (i = 0; i < DBUF_MUTEXES; i++)
- mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);
+ mutex_init(DBUF_HASH_MUTEX(h, i), NULL, MUTEX_DEFAULT, NULL);
+
/*
- * Setup the parameters for the dbuf cache. We cap the size of the
- * dbuf cache to 1/32nd (default) of the size of the ARC.
+ * Setup the parameters for the dbuf caches. We set the sizes of the
+ * dbuf cache and the metadata cache to 1/32nd and 1/16th (default)
+ * of the size of the ARC, respectively.
*/
- dbuf_cache_max_bytes = MIN(dbuf_cache_max_bytes,
- arc_max_bytes() >> dbuf_cache_max_shift);
+ if (dbuf_cache_max_bytes == 0 ||
+ dbuf_cache_max_bytes >= arc_max_bytes()) {
+ dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
+ }
+ if (dbuf_metadata_cache_max_bytes == 0 ||
+ dbuf_metadata_cache_max_bytes >= arc_max_bytes()) {
+ dbuf_metadata_cache_max_bytes =
+ arc_max_bytes() >> dbuf_metadata_cache_shift;
+ }
/*
+ * The combined size of both caches must be less than the
+ * size of the ARC; otherwise, reset both caches to their
+ * default values.
+ *
+ * Dividing each addend by 2 is simple protection against
+ * overflow of the sum.
+ */
+ if (((dbuf_cache_max_bytes / 2) +
+ (dbuf_metadata_cache_max_bytes / 2)) >= (arc_max_bytes() / 2)) {
+ dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
+ dbuf_metadata_cache_max_bytes =
+ arc_max_bytes() >> dbuf_metadata_cache_shift;
+ }
+
+ /*
* All entries are queued via taskq_dispatch_ent(), so min/maxalloc
* configuration is not required.
*/
dbu_evict_taskq = taskq_create("dbu_evict", 1, minclsyspri, 0, 0, 0);
- dbuf_cache = multilist_create(sizeof (dmu_buf_impl_t),
+ for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
+ dbuf_caches[dcs].cache =
+ multilist_create(sizeof (dmu_buf_impl_t),
offsetof(dmu_buf_impl_t, db_cache_link),
dbuf_cache_multilist_index_func);
- refcount_create(&dbuf_cache_size);
+ refcount_create(&dbuf_caches[dcs].size);
+ }
tsd_create(&zfs_dbuf_evict_key, NULL);
dbuf_evict_thread_exit = B_FALSE;
mutex_init(&dbuf_evict_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&dbuf_evict_cv, NULL, CV_DEFAULT, NULL);
@@ -630,11 +768,11 @@
{
dbuf_hash_table_t *h = &dbuf_hash_table;
int i;
for (i = 0; i < DBUF_MUTEXES; i++)
- mutex_destroy(&h->hash_mutexes[i]);
+ mutex_destroy(DBUF_HASH_MUTEX(h, i));
kmem_free(h->hash_table, (h->hash_table_mask + 1) * sizeof (void *));
kmem_cache_destroy(dbuf_kmem_cache);
taskq_destroy(dbu_evict_taskq);
mutex_enter(&dbuf_evict_lock);
@@ -647,12 +785,14 @@
tsd_destroy(&zfs_dbuf_evict_key);
mutex_destroy(&dbuf_evict_lock);
cv_destroy(&dbuf_evict_cv);
- refcount_destroy(&dbuf_cache_size);
- multilist_destroy(dbuf_cache);
+ for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
+ refcount_destroy(&dbuf_caches[dcs].size);
+ multilist_destroy(dbuf_caches[dcs].cache);
+ }
}
/*
* Other stuff.
*/
@@ -1412,11 +1552,11 @@
/*
* We already have a dirty record for this TXG, and we are being
* dirtied again.
*/
static void
-dbuf_redirty(dbuf_dirty_record_t *dr)
+dbuf_redirty(dbuf_dirty_record_t *dr, boolean_t usesc)
{
dmu_buf_impl_t *db = dr->dr_dbuf;
ASSERT(MUTEX_HELD(&db->db_mtx));
@@ -1431,14 +1571,19 @@
/* Already released on initial dirty, so just thaw. */
ASSERT(arc_released(db->db_buf));
arc_buf_thaw(db->db_buf);
}
}
+ /*
+ * The special-class usage of a dirty dbuf may have changed;
+ * update the dirty record.
+ */
+ dr->dr_usesc = usesc;
}
dbuf_dirty_record_t *
-dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
+dbuf_dirty_sc(dmu_buf_impl_t *db, dmu_tx_t *tx, boolean_t usesc)
{
dnode_t *dn;
objset_t *os;
dbuf_dirty_record_t **drp, *dr;
int drop_struct_lock = FALSE;
@@ -1521,11 +1666,11 @@
while ((dr = *drp) != NULL && dr->dr_txg > tx->tx_txg)
drp = &dr->dr_next;
if (dr && dr->dr_txg == tx->tx_txg) {
DB_DNODE_EXIT(db);
- dbuf_redirty(dr);
+ dbuf_redirty(dr, usesc);
mutex_exit(&db->db_mtx);
return (dr);
}
/*
@@ -1601,10 +1746,11 @@
if (db->db_blkid != DMU_BONUS_BLKID && os->os_dsl_dataset != NULL)
dr->dr_accounted = db->db.db_size;
dr->dr_dbuf = db;
dr->dr_txg = tx->tx_txg;
dr->dr_next = *drp;
+ dr->dr_usesc = usesc;
*drp = dr;
/*
* We could have been freed_in_flight between the dbuf_noread
* and dbuf_dirty. We win, as though the dbuf_noread() had
@@ -1634,11 +1780,11 @@
db->db_blkid == DMU_SPILL_BLKID) {
mutex_enter(&dn->dn_mtx);
ASSERT(!list_link_active(&dr->dr_dirty_node));
list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
mutex_exit(&dn->dn_mtx);
- dnode_setdirty(dn, tx);
+ dnode_setdirty_sc(dn, tx, usesc);
DB_DNODE_EXIT(db);
return (dr);
}
/*
@@ -1669,11 +1815,11 @@
* syncing context won't have to wait for the i/o.
*/
ddt_prefetch(os->os_spa, db->db_blkptr);
if (db->db_level == 0) {
- dnode_new_blkid(dn, db->db_blkid, tx, drop_struct_lock);
+ dnode_new_blkid(dn, db->db_blkid, tx, usesc, drop_struct_lock);
ASSERT(dn->dn_maxblkid >= db->db_blkid);
}
if (db->db_level+1 < dn->dn_nlevels) {
dmu_buf_impl_t *parent = db->db_parent;
@@ -1689,11 +1835,11 @@
parent_held = TRUE;
}
if (drop_struct_lock)
rw_exit(&dn->dn_struct_rwlock);
ASSERT3U(db->db_level+1, ==, parent->db_level);
- di = dbuf_dirty(parent, tx);
+ di = dbuf_dirty_sc(parent, tx, usesc);
if (parent_held)
dbuf_rele(parent, FTAG);
mutex_enter(&db->db_mtx);
/*
@@ -1707,10 +1853,16 @@
ASSERT(!list_link_active(&dr->dr_dirty_node));
list_insert_tail(&di->dt.di.dr_children, dr);
mutex_exit(&di->dt.di.dr_mtx);
dr->dr_parent = di;
}
+
+ /*
+ * The special-class usage of a dirty dbuf may have changed;
+ * update the dirty record.
+ */
+ dr->dr_usesc = usesc;
mutex_exit(&db->db_mtx);
} else {
ASSERT(db->db_level+1 == dn->dn_nlevels);
ASSERT(db->db_blkid < dn->dn_nblkptr);
ASSERT(db->db_parent == NULL || db->db_parent == dn->dn_dbuf);
@@ -1720,15 +1872,26 @@
mutex_exit(&dn->dn_mtx);
if (drop_struct_lock)
rw_exit(&dn->dn_struct_rwlock);
}
- dnode_setdirty(dn, tx);
+ dnode_setdirty_sc(dn, tx, usesc);
DB_DNODE_EXIT(db);
return (dr);
}
+dbuf_dirty_record_t *
+dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
+{
+ spa_t *spa;
+
+ ASSERT(db->db_objset != NULL);
+ spa = db->db_objset->os_spa;
+
+ return (dbuf_dirty_sc(db, tx, spa->spa_usesc));
+}
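The unsuffixed dbuf_dirty() keeps its old signature for existing callers and simply inherits the pool-wide spa_usesc setting. A minimal usage sketch (hypothetical caller, assuming a held dbuf and an assigned tx):

	(void) dbuf_dirty(db, tx);		/* inherit spa->spa_usesc */
	(void) dbuf_dirty_sc(db, tx, B_TRUE);	/* force special-class usage */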
+
/*
* Undirty a buffer in the transaction group referenced by the given
* transaction. Return whether this evicted the dbuf.
*/
static boolean_t
@@ -1820,10 +1983,18 @@
void
dmu_buf_will_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
+ spa_t *spa = db->db_objset->os_spa;
+ dmu_buf_will_dirty_sc(db_fake, tx, spa->spa_usesc);
+}
+
+void
+dmu_buf_will_dirty_sc(dmu_buf_t *db_fake, dmu_tx_t *tx, boolean_t usesc)
+{
+ dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;
ASSERT(tx->tx_txg != 0);
ASSERT(!refcount_is_zero(&db->db_holds));
@@ -1842,11 +2013,11 @@
* because there are some calls to dbuf_dirty() that don't
* go through dmu_buf_will_dirty().
*/
if (dr->dr_txg == tx->tx_txg && db->db_state == DB_CACHED) {
/* This dbuf is already dirty and cached. */
- dbuf_redirty(dr);
+ dbuf_redirty(dr, usesc);
mutex_exit(&db->db_mtx);
return;
}
}
mutex_exit(&db->db_mtx);
@@ -1854,13 +2025,14 @@
DB_DNODE_ENTER(db);
if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
rf |= DB_RF_HAVESTRUCT;
DB_DNODE_EXIT(db);
(void) dbuf_read(db, NULL, rf);
- (void) dbuf_dirty(db, tx);
+ (void) dbuf_dirty_sc(db, tx, usesc);
}
+
void
dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
@@ -2031,13 +2203,19 @@
}
dbuf_clear_data(db);
if (multilist_link_active(&db->db_cache_link)) {
- multilist_remove(dbuf_cache, db);
- (void) refcount_remove_many(&dbuf_cache_size,
+ ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
+ db->db_caching_status == DB_DBUF_METADATA_CACHE);
+
+ multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
+ (void) refcount_remove_many(
+ &dbuf_caches[db->db_caching_status].size,
db->db.db_size, db);
+
+ db->db_caching_status = DB_NO_CACHE;
}
ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
ASSERT(db->db_data_pending == NULL);
@@ -2087,10 +2265,11 @@
ASSERT(db->db_buf == NULL);
ASSERT(db->db.db_data == NULL);
ASSERT(db->db_hash_next == NULL);
ASSERT(db->db_blkptr == NULL);
ASSERT(db->db_data_pending == NULL);
+ ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
ASSERT(!multilist_link_active(&db->db_cache_link));
kmem_cache_free(dbuf_kmem_cache, db);
arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
@@ -2225,10 +2404,11 @@
db->db.db_size = DN_MAX_BONUSLEN -
(dn->dn_nblkptr-1) * sizeof (blkptr_t);
ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
db->db.db_offset = DMU_BONUS_BLKID;
db->db_state = DB_UNCACHED;
+ db->db_caching_status = DB_NO_CACHE;
/* the bonus dbuf is not placed in the hash table */
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
return (db);
} else if (blkid == DMU_SPILL_BLKID) {
db->db.db_size = (blkptr != NULL) ?
@@ -2257,10 +2437,11 @@
return (odb);
}
avl_add(&dn->dn_dbufs, db);
db->db_state = DB_UNCACHED;
+ db->db_caching_status = DB_NO_CACHE;
mutex_exit(&dn->dn_dbufs_mtx);
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
if (parent && parent != dn->dn_dbuf)
dbuf_add_ref(parent, db);
@@ -2563,12 +2744,14 @@
if (fail_uncached && db->db_state != DB_CACHED) {
mutex_exit(&db->db_mtx);
return (SET_ERROR(ENOENT));
}
- if (db->db_buf != NULL)
+ if (db->db_buf != NULL) {
+ arc_buf_access(db->db_buf);
ASSERT3P(db->db.db_data, ==, db->db_buf->b_data);
+ }
ASSERT(db->db_buf == NULL || arc_referenced(db->db_buf));
/*
* If this buffer is currently syncing out, and we are
@@ -2591,13 +2774,19 @@
}
}
if (multilist_link_active(&db->db_cache_link)) {
ASSERT(refcount_is_zero(&db->db_holds));
- multilist_remove(dbuf_cache, db);
- (void) refcount_remove_many(&dbuf_cache_size,
+ ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
+ db->db_caching_status == DB_DBUF_METADATA_CACHE);
+
+ multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
+ (void) refcount_remove_many(
+ &dbuf_caches[db->db_caching_status].size,
db->db.db_size, db);
+
+ db->db_caching_status = DB_NO_CACHE;
}
(void) refcount_add(&db->db_holds, tag);
DBUF_VERIFY(db);
mutex_exit(&db->db_mtx);
@@ -2810,17 +2999,27 @@
if (!DBUF_IS_CACHEABLE(db) ||
db->db_pending_evict) {
dbuf_destroy(db);
} else if (!multilist_link_active(&db->db_cache_link)) {
- multilist_insert(dbuf_cache, db);
- (void) refcount_add_many(&dbuf_cache_size,
+ ASSERT3U(db->db_caching_status, ==,
+ DB_NO_CACHE);
+
+ dbuf_cached_state_t dcs =
+ dbuf_include_in_metadata_cache(db) ?
+ DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
+ db->db_caching_status = dcs;
+
+ multilist_insert(dbuf_caches[dcs].cache, db);
+ (void) refcount_add_many(&dbuf_caches[dcs].size,
db->db.db_size, db);
mutex_exit(&db->db_mtx);
+ if (db->db_caching_status == DB_DBUF_CACHE) {
dbuf_evict_notify();
}
+ }
if (do_arc_evict)
arc_freed(spa, &bp);
}
} else {
@@ -2998,11 +3197,10 @@
/* Provide the pending dirty record to child dbufs */
db->db_data_pending = dr;
mutex_exit(&db->db_mtx);
-
dbuf_write(dr, db->db_buf, tx);
zio = dr->dr_zio;
mutex_enter(&dr->dt.di.dr_mtx);
dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);
@@ -3470,145 +3668,10 @@
if (zio->io_abd != NULL)
abd_put(zio->io_abd);
}
-typedef struct dbuf_remap_impl_callback_arg {
- objset_t *drica_os;
- uint64_t drica_blk_birth;
- dmu_tx_t *drica_tx;
-} dbuf_remap_impl_callback_arg_t;
-
-static void
-dbuf_remap_impl_callback(uint64_t vdev, uint64_t offset, uint64_t size,
- void *arg)
-{
- dbuf_remap_impl_callback_arg_t *drica = arg;
- objset_t *os = drica->drica_os;
- spa_t *spa = dmu_objset_spa(os);
- dmu_tx_t *tx = drica->drica_tx;
-
- ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
- if (os == spa_meta_objset(spa)) {
- spa_vdev_indirect_mark_obsolete(spa, vdev, offset, size, tx);
- } else {
- dsl_dataset_block_remapped(dmu_objset_ds(os), vdev, offset,
- size, drica->drica_blk_birth, tx);
- }
-}
-
-static void
-dbuf_remap_impl(dnode_t *dn, blkptr_t *bp, dmu_tx_t *tx)
-{
- blkptr_t bp_copy = *bp;
- spa_t *spa = dmu_objset_spa(dn->dn_objset);
- dbuf_remap_impl_callback_arg_t drica;
-
- ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
- drica.drica_os = dn->dn_objset;
- drica.drica_blk_birth = bp->blk_birth;
- drica.drica_tx = tx;
- if (spa_remap_blkptr(spa, &bp_copy, dbuf_remap_impl_callback,
- &drica)) {
- /*
- * The struct_rwlock prevents dbuf_read_impl() from
- * dereferencing the BP while we are changing it. To
- * avoid lock contention, only grab it when we are actually
- * changing the BP.
- */
- rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
- *bp = bp_copy;
- rw_exit(&dn->dn_struct_rwlock);
- }
-}
-
-/*
- * Returns true if a dbuf_remap would modify the dbuf. We do this by attempting
- * to remap a copy of every bp in the dbuf.
- */
-boolean_t
-dbuf_can_remap(const dmu_buf_impl_t *db)
-{
- spa_t *spa = dmu_objset_spa(db->db_objset);
- blkptr_t *bp = db->db.db_data;
- boolean_t ret = B_FALSE;
-
- ASSERT3U(db->db_level, >, 0);
- ASSERT3S(db->db_state, ==, DB_CACHED);
-
- ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
-
- spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
- for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
- blkptr_t bp_copy = bp[i];
- if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
- ret = B_TRUE;
- break;
- }
- }
- spa_config_exit(spa, SCL_VDEV, FTAG);
-
- return (ret);
-}
-
-boolean_t
-dnode_needs_remap(const dnode_t *dn)
-{
- spa_t *spa = dmu_objset_spa(dn->dn_objset);
- boolean_t ret = B_FALSE;
-
- if (dn->dn_phys->dn_nlevels == 0) {
- return (B_FALSE);
- }
-
- ASSERT(spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL));
-
- spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
- for (int j = 0; j < dn->dn_phys->dn_nblkptr; j++) {
- blkptr_t bp_copy = dn->dn_phys->dn_blkptr[j];
- if (spa_remap_blkptr(spa, &bp_copy, NULL, NULL)) {
- ret = B_TRUE;
- break;
- }
- }
- spa_config_exit(spa, SCL_VDEV, FTAG);
-
- return (ret);
-}
-
-/*
- * Remap any existing BP's to concrete vdevs, if possible.
- */
-static void
-dbuf_remap(dnode_t *dn, dmu_buf_impl_t *db, dmu_tx_t *tx)
-{
- spa_t *spa = dmu_objset_spa(db->db_objset);
- ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
-
- if (!spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL))
- return;
-
- if (db->db_level > 0) {
- blkptr_t *bp = db->db.db_data;
- for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
- dbuf_remap_impl(dn, &bp[i], tx);
- }
- } else if (db->db.db_object == DMU_META_DNODE_OBJECT) {
- dnode_phys_t *dnp = db->db.db_data;
- ASSERT3U(db->db_dnode_handle->dnh_dnode->dn_type, ==,
- DMU_OT_DNODE);
- for (int i = 0; i < db->db.db_size >> DNODE_SHIFT; i++) {
- for (int j = 0; j < dnp[i].dn_nblkptr; j++) {
- dbuf_remap_impl(dn, &dnp[i].dn_blkptr[j], tx);
- }
- }
- }
-}
-
-
/* Issue I/O to commit a dirty buffer to disk. */
static void
dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx)
{
dmu_buf_impl_t *db = dr->dr_dbuf;
@@ -3618,17 +3681,20 @@
uint64_t txg = tx->tx_txg;
zbookmark_phys_t zb;
zio_prop_t zp;
zio_t *zio;
int wp_flag = 0;
+ zio_smartcomp_info_t sc;
ASSERT(dmu_tx_is_syncing(tx));
DB_DNODE_ENTER(db);
dn = DB_DNODE(db);
os = dn->dn_objset;
+ dnode_setup_zio_smartcomp(db, &sc);
+
if (db->db_state != DB_NOFILL) {
if (db->db_level > 0 || dn->dn_type == DMU_OT_DNODE) {
/*
* Private object buffers are released here rather
* than in dbuf_dirty() since they are only modified
@@ -3638,11 +3704,10 @@
if (BP_IS_HOLE(db->db_blkptr)) {
arc_buf_thaw(data);
} else {
dbuf_release_bp(db);
}
- dbuf_remap(dn, db, tx);
}
}
if (parent != dn->dn_dbuf) {
/* Our parent is an indirect block. */
@@ -3676,10 +3741,11 @@
db->db.db_object, db->db_level, db->db_blkid);
if (db->db_blkid == DMU_SPILL_BLKID)
wp_flag = WP_SPILL;
wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
+ WP_SET_SPECIALCLASS(wp_flag, dr->dr_usesc);
dmu_write_policy(os, dn, db->db_level, wp_flag, &zp);
DB_DNODE_EXIT(db);
/*
@@ -3701,11 +3767,12 @@
dr->dr_zio = zio_write(zio, os->os_spa, txg, &dr->dr_bp_copy,
contents, db->db.db_size, db->db.db_size, &zp,
dbuf_write_override_ready, NULL, NULL,
dbuf_write_override_done,
- dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
+ dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb,
+ &sc);
mutex_enter(&db->db_mtx);
dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite);
mutex_exit(&db->db_mtx);
@@ -3715,11 +3782,11 @@
dr->dr_zio = zio_write(zio, os->os_spa, txg,
&dr->dr_bp_copy, NULL, db->db.db_size, db->db.db_size, &zp,
dbuf_write_nofill_ready, NULL, NULL,
dbuf_write_nofill_done, db,
ZIO_PRIORITY_ASYNC_WRITE,
- ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb);
+ ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb, &sc);
} else {
ASSERT(arc_released(data));
/*
* For indirect blocks, we want to setup the children
@@ -3732,8 +3799,8 @@
dr->dr_zio = arc_write(zio, os->os_spa, txg,
&dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
&zp, dbuf_write_ready, children_ready_cb,
dbuf_write_physdone, dbuf_write_done, db,
- ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
+ ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb, &sc);
}
}