big-one Wdiff usr/src/uts/common/fs/zfs/dnode.c

Revert "8958 Update Intel ucode to 20180108 release"
This reverts commit 1adc3ffcd976ec0a34010cc7db08037a14c3ea4c.
NEX-15280 New default metadata block size is too large
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-5366 Race between unique_insert() and unique_remove() causes ZFS fsid change
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-1823 Slow performance doing of a large dataset
5911 ZFS "hangs" while deleting file
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
NEX-3266 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
SUP-507 Delete or truncate of large files delayed on datasets with small recordsize
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

          --- old/usr/src/uts/common/fs/zfs/dnode.c
          +++ new/usr/src/uts/common/fs/zfs/dnode.c

   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
       23 + * Copyright 2015 Nexenta Systems, Inc.  All rights reserved.
  23   24   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  24   25   * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
  25   26   * Copyright (c) 2014 Integros [integros.com]
  26   27   * Copyright 2017 RackTop Systems.
  27   28   */
  28   29  
  29   30  #include <sys/zfs_context.h>
  30   31  #include <sys/dbuf.h>
  31   32  #include <sys/dnode.h>
  32   33  #include <sys/dmu.h>
  33   34  #include <sys/dmu_impl.h>
  34   35  #include <sys/dmu_tx.h>
  35   36  #include <sys/dmu_objset.h>
  36   37  #include <sys/dsl_dir.h>
  37   38  #include <sys/dsl_dataset.h>
  38   39  #include <sys/spa.h>
  39   40  #include <sys/zio.h>
  40   41  #include <sys/dmu_zfetch.h>
  41   42  #include <sys/range_tree.h>
  42   43  
       44 +static void smartcomp_check_comp(dnode_smartcomp_t *sc);
       45 +
  43   46  static kmem_cache_t *dnode_cache;
  44   47  /*
  45   48   * Define DNODE_STATS to turn on statistic gathering. By default, it is only
  46   49   * turned on when DEBUG is also defined.
  47   50   */
  48   51  #ifdef  DEBUG
  49   52  #define DNODE_STATS
  50   53  #endif  /* DEBUG */
  51   54  
  52   55  #ifdef  DNODE_STATS
  53   56  #define DNODE_STAT_ADD(stat)                    ((stat)++)
  54   57  #else
  55   58  #define DNODE_STAT_ADD(stat)                    /* nothing */
  56   59  #endif  /* DNODE_STATS */
  57   60  
  58   61  static dnode_phys_t dnode_phys_zero;
  59   62  
  60   63  int zfs_default_bs = SPA_MINBLOCKSHIFT;
  61      -int zfs_default_ibs = DN_MAX_INDBLKSHIFT;
       64 +int zfs_default_ibs = DN_DFL_INDBLKSHIFT;
  62   65  
  63   66  #ifdef  _KERNEL
  64   67  static kmem_cbrc_t dnode_move(void *, void *, size_t, void *);
  65   68  #endif  /* _KERNEL */
  66   69  
  67   70  static int
  68   71  dbuf_compare(const void *x1, const void *x2)
  69   72  {
  70   73          const dmu_buf_impl_t *d1 = x1;
  71   74          const dmu_buf_impl_t *d2 = x2;
  72   75  
  73   76          if (d1->db_level < d2->db_level) {
  74   77                  return (-1);
  75   78          }
  76   79          if (d1->db_level > d2->db_level) {
  77   80                  return (1);
  78   81          }
  79   82  
  80   83          if (d1->db_blkid < d2->db_blkid) {
  81   84                  return (-1);
  82   85          }
  83   86          if (d1->db_blkid > d2->db_blkid) {
  84   87                  return (1);
  85   88          }
  86   89  
  87   90          if (d1->db_state == DB_SEARCH) {
  88   91                  ASSERT3S(d2->db_state, !=, DB_SEARCH);
  89   92                  return (-1);
  90   93          } else if (d2->db_state == DB_SEARCH) {
  91   94                  ASSERT3S(d1->db_state, !=, DB_SEARCH);
  92   95                  return (1);
  93   96          }
  94   97  
  95   98          if ((uintptr_t)d1 < (uintptr_t)d2) {
  96   99                  return (-1);
  97  100          }
  98  101          if ((uintptr_t)d1 > (uintptr_t)d2) {
  99  102                  return (1);
 100  103          }
 101  104          return (0);
 102  105  }
 103  106  
 104  107  /* ARGSUSED */
 105  108  static int
 106  109  dnode_cons(void *arg, void *unused, int kmflag)
 107  110  {
 108  111          dnode_t *dn = arg;
 109  112          int i;
 110  113  
 111  114          rw_init(&dn->dn_struct_rwlock, NULL, RW_DEFAULT, NULL);
 112  115          mutex_init(&dn->dn_mtx, NULL, MUTEX_DEFAULT, NULL);
 113  116          mutex_init(&dn->dn_dbufs_mtx, NULL, MUTEX_DEFAULT, NULL);
 114  117          cv_init(&dn->dn_notxholds, NULL, CV_DEFAULT, NULL);
 115  118  
 116  119          /*
 117  120           * Every dbuf has a reference, and dropping a tracked reference is
 118  121           * O(number of references), so don't track dn_holds.
 119  122           */
 120  123          refcount_create_untracked(&dn->dn_holds);
 121  124          refcount_create(&dn->dn_tx_holds);
 122  125          list_link_init(&dn->dn_link);
 123  126  
 124  127          bzero(&dn->dn_next_nblkptr[0], sizeof (dn->dn_next_nblkptr));
 125  128          bzero(&dn->dn_next_nlevels[0], sizeof (dn->dn_next_nlevels));
 126  129          bzero(&dn->dn_next_indblkshift[0], sizeof (dn->dn_next_indblkshift));
 127  130          bzero(&dn->dn_next_bonustype[0], sizeof (dn->dn_next_bonustype));
 128  131          bzero(&dn->dn_rm_spillblk[0], sizeof (dn->dn_rm_spillblk));
 129  132          bzero(&dn->dn_next_bonuslen[0], sizeof (dn->dn_next_bonuslen));
 130  133          bzero(&dn->dn_next_blksz[0], sizeof (dn->dn_next_blksz));
 131  134  
 132  135          for (i = 0; i < TXG_SIZE; i++) {
 133  136                  list_link_init(&dn->dn_dirty_link[i]);
 134  137                  dn->dn_free_ranges[i] = NULL;
 135  138                  list_create(&dn->dn_dirty_records[i],
 136  139                      sizeof (dbuf_dirty_record_t),
 137  140                      offsetof(dbuf_dirty_record_t, dr_dirty_node));
 138  141          }
 139  142  
 140  143          dn->dn_allocated_txg = 0;
 141  144          dn->dn_free_txg = 0;
 142  145          dn->dn_assigned_txg = 0;
 143  146          dn->dn_dirtyctx = 0;
 144  147          dn->dn_dirtyctx_firstset = NULL;
 145  148          dn->dn_bonus = NULL;
 146  149          dn->dn_have_spill = B_FALSE;
 147  150          dn->dn_zio = NULL;
 148  151          dn->dn_oldused = 0;
 149  152          dn->dn_oldflags = 0;
 150  153          dn->dn_olduid = 0;
 151  154          dn->dn_oldgid = 0;
 152  155          dn->dn_newuid = 0;
 153  156          dn->dn_newgid = 0;
 154  157          dn->dn_id_flags = 0;
 155  158  
 156  159          dn->dn_dbufs_count = 0;
 157  160          avl_create(&dn->dn_dbufs, dbuf_compare, sizeof (dmu_buf_impl_t),
 158  161              offsetof(dmu_buf_impl_t, db_link));
 159  162  
 160  163          dn->dn_moved = 0;
      164 +
      165 +        bzero(&dn->dn_smartcomp, sizeof (dn->dn_smartcomp));
      166 +        mutex_init(&dn->dn_smartcomp.sc_lock, NULL, MUTEX_DEFAULT, NULL);
      167 +
 161  168          return (0);
 162  169  }
 163  170  
 164  171  /* ARGSUSED */
 165  172  static void
 166  173  dnode_dest(void *arg, void *unused)
 167  174  {
 168  175          int i;
 169  176          dnode_t *dn = arg;
 170  177  
      178 +        mutex_destroy(&dn->dn_smartcomp.sc_lock);
      179 +
 171  180          rw_destroy(&dn->dn_struct_rwlock);
 172  181          mutex_destroy(&dn->dn_mtx);
 173  182          mutex_destroy(&dn->dn_dbufs_mtx);
 174  183          cv_destroy(&dn->dn_notxholds);
 175  184          refcount_destroy(&dn->dn_holds);
 176  185          refcount_destroy(&dn->dn_tx_holds);
 177  186          ASSERT(!list_link_active(&dn->dn_link));
 178  187  
 179  188          for (i = 0; i < TXG_SIZE; i++) {
 180  189                  ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
 181  190                  ASSERT3P(dn->dn_free_ranges[i], ==, NULL);
 182  191                  list_destroy(&dn->dn_dirty_records[i]);
 183  192                  ASSERT0(dn->dn_next_nblkptr[i]);
 184  193                  ASSERT0(dn->dn_next_nlevels[i]);
 185  194                  ASSERT0(dn->dn_next_indblkshift[i]);
 186  195                  ASSERT0(dn->dn_next_bonustype[i]);
 187  196                  ASSERT0(dn->dn_rm_spillblk[i]);
 188  197                  ASSERT0(dn->dn_next_bonuslen[i]);
 189  198                  ASSERT0(dn->dn_next_blksz[i]);
 190  199          }
 191  200  
 192  201          ASSERT0(dn->dn_allocated_txg);
 193  202          ASSERT0(dn->dn_free_txg);
 194  203          ASSERT0(dn->dn_assigned_txg);
 195  204          ASSERT0(dn->dn_dirtyctx);
 196  205          ASSERT3P(dn->dn_dirtyctx_firstset, ==, NULL);
 197  206          ASSERT3P(dn->dn_bonus, ==, NULL);
 198  207          ASSERT(!dn->dn_have_spill);
 199  208          ASSERT3P(dn->dn_zio, ==, NULL);
 200  209          ASSERT0(dn->dn_oldused);
 201  210          ASSERT0(dn->dn_oldflags);
 202  211          ASSERT0(dn->dn_olduid);
 203  212          ASSERT0(dn->dn_oldgid);
 204  213          ASSERT0(dn->dn_newuid);
 205  214          ASSERT0(dn->dn_newgid);
 206  215          ASSERT0(dn->dn_id_flags);
 207  216  
 208  217          ASSERT0(dn->dn_dbufs_count);
 209  218          avl_destroy(&dn->dn_dbufs);
 210  219  }
 211  220  
 212  221  void
 213  222  dnode_init(void)
 214  223  {
 215  224          ASSERT(dnode_cache == NULL);
 216  225          dnode_cache = kmem_cache_create("dnode_t",
 217  226              sizeof (dnode_t),
 218  227              0, dnode_cons, dnode_dest, NULL, NULL, NULL, 0);
 219  228  #ifdef  _KERNEL
 220  229          kmem_cache_set_move(dnode_cache, dnode_move);
 221  230  #endif  /* _KERNEL */
 222  231  }
 223  232  
 224  233  void
 225  234  dnode_fini(void)
 226  235  {
 227  236          kmem_cache_destroy(dnode_cache);
 228  237          dnode_cache = NULL;
 229  238  }
 230  239  
 231  240  
 232  241  #ifdef ZFS_DEBUG
 233  242  void
 234  243  dnode_verify(dnode_t *dn)
 235  244  {
 236  245          int drop_struct_lock = FALSE;
 237  246  
 238  247          ASSERT(dn->dn_phys);
 239  248          ASSERT(dn->dn_objset);
 240  249          ASSERT(dn->dn_handle->dnh_dnode == dn);
 241  250  
 242  251          ASSERT(DMU_OT_IS_VALID(dn->dn_phys->dn_type));
 243  252  
 244  253          if (!(zfs_flags & ZFS_DEBUG_DNODE_VERIFY))
 245  254                  return;
 246  255  
 247  256          if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
 248  257                  rw_enter(&dn->dn_struct_rwlock, RW_READER);
 249  258                  drop_struct_lock = TRUE;
 250  259          }
 251  260          if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) {
 252  261                  int i;
 253  262                  ASSERT3U(dn->dn_indblkshift, >=, 0);
 254  263                  ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT);
 255  264                  if (dn->dn_datablkshift) {
 256  265                          ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT);
 257  266                          ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT);
 258  267                          ASSERT3U(1<<dn->dn_datablkshift, ==, dn->dn_datablksz);
 259  268                  }
 260  269                  ASSERT3U(dn->dn_nlevels, <=, 30);
 261  270                  ASSERT(DMU_OT_IS_VALID(dn->dn_type));
 262  271                  ASSERT3U(dn->dn_nblkptr, >=, 1);
 263  272                  ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
 264  273                  ASSERT3U(dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
 265  274                  ASSERT3U(dn->dn_datablksz, ==,
 266  275                      dn->dn_datablkszsec << SPA_MINBLOCKSHIFT);
 267  276                  ASSERT3U(ISP2(dn->dn_datablksz), ==, dn->dn_datablkshift != 0);
 268  277                  ASSERT3U((dn->dn_nblkptr - 1) * sizeof (blkptr_t) +
 269  278                      dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
 270  279                  for (i = 0; i < TXG_SIZE; i++) {
 271  280                          ASSERT3U(dn->dn_next_nlevels[i], <=, dn->dn_nlevels);
 272  281                  }
 273  282          }
 274  283          if (dn->dn_phys->dn_type != DMU_OT_NONE)
 275  284                  ASSERT3U(dn->dn_phys->dn_nlevels, <=, dn->dn_nlevels);
 276  285          ASSERT(DMU_OBJECT_IS_SPECIAL(dn->dn_object) || dn->dn_dbuf != NULL);
 277  286          if (dn->dn_dbuf != NULL) {
 278  287                  ASSERT3P(dn->dn_phys, ==,
 279  288                      (dnode_phys_t *)dn->dn_dbuf->db.db_data +
 280  289                      (dn->dn_object % (dn->dn_dbuf->db.db_size >> DNODE_SHIFT)));
 281  290          }
 282  291          if (drop_struct_lock)
 283  292                  rw_exit(&dn->dn_struct_rwlock);
 284  293  }
 285  294  #endif
 286  295  
 287  296  void
 288  297  dnode_byteswap(dnode_phys_t *dnp)
 289  298  {
 290  299          uint64_t *buf64 = (void*)&dnp->dn_blkptr;
 291  300          int i;
 292  301  
 293  302          if (dnp->dn_type == DMU_OT_NONE) {
 294  303                  bzero(dnp, sizeof (dnode_phys_t));
 295  304                  return;
 296  305          }
 297  306  
 298  307          dnp->dn_datablkszsec = BSWAP_16(dnp->dn_datablkszsec);
 299  308          dnp->dn_bonuslen = BSWAP_16(dnp->dn_bonuslen);
 300  309          dnp->dn_maxblkid = BSWAP_64(dnp->dn_maxblkid);
 301  310          dnp->dn_used = BSWAP_64(dnp->dn_used);
 302  311  
 303  312          /*
 304  313           * dn_nblkptr is only one byte, so it's OK to read it in either
 305  314           * byte order.  We can't read dn_bonuslen.
 306  315           */
 307  316          ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT);
 308  317          ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR);
 309  318          for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++)
 310  319                  buf64[i] = BSWAP_64(buf64[i]);
 311  320  
 312  321          /*
 313  322           * OK to check dn_bonuslen for zero, because it won't matter if
 314  323           * we have the wrong byte order.  This is necessary because the
 315  324           * dnode dnode is smaller than a regular dnode.
 316  325           */
 317  326          if (dnp->dn_bonuslen != 0) {
 318  327                  /*
 319  328                   * Note that the bonus length calculated here may be
 320  329                   * longer than the actual bonus buffer.  This is because
 321  330                   * we always put the bonus buffer after the last block
 322  331                   * pointer (instead of packing it against the end of the
 323  332                   * dnode buffer).
 324  333                   */
 325  334                  int off = (dnp->dn_nblkptr-1) * sizeof (blkptr_t);
 326  335                  size_t len = DN_MAX_BONUSLEN - off;
 327  336                  ASSERT(DMU_OT_IS_VALID(dnp->dn_bonustype));
 328  337                  dmu_object_byteswap_t byteswap =
 329  338                      DMU_OT_BYTESWAP(dnp->dn_bonustype);
 330  339                  dmu_ot_byteswap[byteswap].ob_func(dnp->dn_bonus + off, len);
 331  340          }
 332  341  
 333  342          /* Swap SPILL block if we have one */
 334  343          if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR)
 335  344                  byteswap_uint64_array(&dnp->dn_spill, sizeof (blkptr_t));
 336  345  
 337  346  }
 338  347  
 339  348  void
 340  349  dnode_buf_byteswap(void *vbuf, size_t size)
 341  350  {
 342  351          dnode_phys_t *buf = vbuf;
 343  352          int i;
 344  353  
 345  354          ASSERT3U(sizeof (dnode_phys_t), ==, (1<<DNODE_SHIFT));
 346  355          ASSERT((size & (sizeof (dnode_phys_t)-1)) == 0);
 347  356  
 348  357          size >>= DNODE_SHIFT;
 349  358          for (i = 0; i < size; i++) {
 350  359                  dnode_byteswap(buf);
 351  360                  buf++;
 352  361          }
 353  362  }
 354  363  
 355  364  void
 356  365  dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx)
 357  366  {
 358  367          ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);
 359  368  
 360  369          dnode_setdirty(dn, tx);
 361  370          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
 362  371          ASSERT3U(newsize, <=, DN_MAX_BONUSLEN -
 363  372              (dn->dn_nblkptr-1) * sizeof (blkptr_t));
 364  373          dn->dn_bonuslen = newsize;
 365  374          if (newsize == 0)
 366  375                  dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = DN_ZERO_BONUSLEN;
 367  376          else
 368  377                  dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
 369  378          rw_exit(&dn->dn_struct_rwlock);
 370  379  }
 371  380  
 372  381  void
 373  382  dnode_setbonus_type(dnode_t *dn, dmu_object_type_t newtype, dmu_tx_t *tx)
 374  383  {
 375  384          ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);
 376  385          dnode_setdirty(dn, tx);
 377  386          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
 378  387          dn->dn_bonustype = newtype;
 379  388          dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = dn->dn_bonustype;
 380  389          rw_exit(&dn->dn_struct_rwlock);
 381  390  }
 382  391  
 383  392  void
 384  393  dnode_rm_spill(dnode_t *dn, dmu_tx_t *tx)
 385  394  {
 386  395          ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);
 387  396          ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
 388  397          dnode_setdirty(dn, tx);
 389  398          dn->dn_rm_spillblk[tx->tx_txg&TXG_MASK] = DN_KILL_SPILLBLK;
 390  399          dn->dn_have_spill = B_FALSE;
 391  400  }
 392  401  
 393  402  static void
 394  403  dnode_setdblksz(dnode_t *dn, int size)
 395  404  {
 396  405          ASSERT0(P2PHASE(size, SPA_MINBLOCKSIZE));
 397  406          ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
 398  407          ASSERT3U(size, >=, SPA_MINBLOCKSIZE);
 399  408          ASSERT3U(size >> SPA_MINBLOCKSHIFT, <,
 400  409              1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8));
 401  410          dn->dn_datablksz = size;
 402  411          dn->dn_datablkszsec = size >> SPA_MINBLOCKSHIFT;
 403  412          dn->dn_datablkshift = ISP2(size) ? highbit64(size - 1) : 0;
 404  413  }
 405  414  
 406  415  static dnode_t *
 407  416  dnode_create(objset_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
 408  417      uint64_t object, dnode_handle_t *dnh)
 409  418  {
 410  419          dnode_t *dn;
 411  420  
 412  421          dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);
 413  422  #ifdef _KERNEL
 414  423          ASSERT(!POINTER_IS_VALID(dn->dn_objset));
 415  424  #endif /* _KERNEL */
 416  425          dn->dn_moved = 0;
 417  426  
 418  427          /*
 419  428           * Defer setting dn_objset until the dnode is ready to be a candidate
 420  429           * for the dnode_move() callback.
 421  430           */
 422  431          dn->dn_object = object;
 423  432          dn->dn_dbuf = db;
 424  433          dn->dn_handle = dnh;
 425  434          dn->dn_phys = dnp;
 426  435  
 427  436          if (dnp->dn_datablkszsec) {
 428  437                  dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
 429  438          } else {
 430  439                  dn->dn_datablksz = 0;
 431  440                  dn->dn_datablkszsec = 0;
 432  441                  dn->dn_datablkshift = 0;
 433  442          }
 434  443          dn->dn_indblkshift = dnp->dn_indblkshift;
 435  444          dn->dn_nlevels = dnp->dn_nlevels;
 436  445          dn->dn_type = dnp->dn_type;
 437  446          dn->dn_nblkptr = dnp->dn_nblkptr;
 438  447          dn->dn_checksum = dnp->dn_checksum;
 439  448          dn->dn_compress = dnp->dn_compress;
 440  449          dn->dn_bonustype = dnp->dn_bonustype;
 441  450          dn->dn_bonuslen = dnp->dn_bonuslen;
 442  451          dn->dn_maxblkid = dnp->dn_maxblkid;
 443  452          dn->dn_have_spill = ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) != 0);
 444  453          dn->dn_id_flags = 0;
 445  454  
 446  455          dmu_zfetch_init(&dn->dn_zfetch, dn);
 447  456  
 448  457          ASSERT(DMU_OT_IS_VALID(dn->dn_phys->dn_type));
 449  458  
 450  459          mutex_enter(&os->os_lock);
 451  460          if (dnh->dnh_dnode != NULL) {
 452  461                  /* Lost the allocation race. */
 453  462                  mutex_exit(&os->os_lock);
 454  463                  kmem_cache_free(dnode_cache, dn);
 455  464                  return (dnh->dnh_dnode);
 456  465          }
 457  466  
 458  467          /*
 459  468           * Exclude special dnodes from os_dnodes so an empty os_dnodes
 460  469           * signifies that the special dnodes have no references from
 461  470           * their children (the entries in os_dnodes).  This allows
 462  471           * dnode_destroy() to easily determine if the last child has
 463  472           * been removed and then complete eviction of the objset.
 464  473           */
 465  474          if (!DMU_OBJECT_IS_SPECIAL(object))
 466  475                  list_insert_head(&os->os_dnodes, dn);
 467  476          membar_producer();
 468  477  
 469  478          /*
 470  479           * Everything else must be valid before assigning dn_objset
 471  480           * makes the dnode eligible for dnode_move().
 472  481           */
 473  482          dn->dn_objset = os;
 474  483  
 475  484          dnh->dnh_dnode = dn;
 476  485          mutex_exit(&os->os_lock);
 477  486  
 478  487          arc_space_consume(sizeof (dnode_t), ARC_SPACE_OTHER);
 479  488          return (dn);
 480  489  }
 481  490  
 482  491  /*
 483  492   * Caller must be holding the dnode handle, which is released upon return.
 484  493   */
 485  494  static void
 486  495  dnode_destroy(dnode_t *dn)
 487  496  {
 488  497          objset_t *os = dn->dn_objset;
 489  498          boolean_t complete_os_eviction = B_FALSE;
 490  499  
 491  500          ASSERT((dn->dn_id_flags & DN_ID_NEW_EXIST) == 0);
 492  501  
 493  502          mutex_enter(&os->os_lock);
 494  503          POINTER_INVALIDATE(&dn->dn_objset);
 495  504          if (!DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
 496  505                  list_remove(&os->os_dnodes, dn);
 497  506                  complete_os_eviction =
 498  507                      list_is_empty(&os->os_dnodes) &&
 499  508                      list_link_active(&os->os_evicting_node);
 500  509          }
 501  510          mutex_exit(&os->os_lock);
 502  511  
 503  512          /* the dnode can no longer move, so we can release the handle */
 504  513          zrl_remove(&dn->dn_handle->dnh_zrlock);
 505  514  
 506  515          dn->dn_allocated_txg = 0;
 507  516          dn->dn_free_txg = 0;
 508  517          dn->dn_assigned_txg = 0;
 509  518  
 510  519          dn->dn_dirtyctx = 0;
 511  520          if (dn->dn_dirtyctx_firstset != NULL) {
 512  521                  kmem_free(dn->dn_dirtyctx_firstset, 1);
 513  522                  dn->dn_dirtyctx_firstset = NULL;
 514  523          }
 515  524          if (dn->dn_bonus != NULL) {
 516  525                  mutex_enter(&dn->dn_bonus->db_mtx);
 517  526                  dbuf_destroy(dn->dn_bonus);
 518  527                  dn->dn_bonus = NULL;
 519  528          }
 520  529          dn->dn_zio = NULL;
 521  530  
 522  531          dn->dn_have_spill = B_FALSE;
 523  532          dn->dn_oldused = 0;
 524  533          dn->dn_oldflags = 0;
 525  534          dn->dn_olduid = 0;
 526  535          dn->dn_oldgid = 0;
 527  536          dn->dn_newuid = 0;
 528  537          dn->dn_newgid = 0;
 529  538          dn->dn_id_flags = 0;
 530  539  
 531  540          dmu_zfetch_fini(&dn->dn_zfetch);
 532  541          kmem_cache_free(dnode_cache, dn);
 533  542          arc_space_return(sizeof (dnode_t), ARC_SPACE_OTHER);
 534  543  
 535  544          if (complete_os_eviction)
 536  545                  dmu_objset_evict_done(os);
 537  546  }
 538  547  
 539  548  void
 540  549  dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
 541  550      dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
 542  551  {
 543  552          int i;
 544  553  
 545  554          ASSERT3U(blocksize, <=,
 546  555              spa_maxblocksize(dmu_objset_spa(dn->dn_objset)));
 547  556          if (blocksize == 0)
 548  557                  blocksize = 1 << zfs_default_bs;
 549  558          else
 550  559                  blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE);
 551  560  
 552  561          if (ibs == 0)
 553  562                  ibs = zfs_default_ibs;
 554  563  
 555  564          ibs = MIN(MAX(ibs, DN_MIN_INDBLKSHIFT), DN_MAX_INDBLKSHIFT);
 556  565  
 557  566          dprintf("os=%p obj=%llu txg=%llu blocksize=%d ibs=%d\n", dn->dn_objset,
 558  567              dn->dn_object, tx->tx_txg, blocksize, ibs);
 559  568  
 560  569          ASSERT(dn->dn_type == DMU_OT_NONE);
 561  570          ASSERT(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0);
 562  571          ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE);
 563  572          ASSERT(ot != DMU_OT_NONE);
 564  573          ASSERT(DMU_OT_IS_VALID(ot));
 565  574          ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
 566  575              (bonustype == DMU_OT_SA && bonuslen == 0) ||
 567  576              (bonustype != DMU_OT_NONE && bonuslen != 0));
 568  577          ASSERT(DMU_OT_IS_VALID(bonustype));
 569  578          ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
 570  579          ASSERT(dn->dn_type == DMU_OT_NONE);
 571  580          ASSERT0(dn->dn_maxblkid);
 572  581          ASSERT0(dn->dn_allocated_txg);
 573  582          ASSERT0(dn->dn_assigned_txg);
 574  583          ASSERT(refcount_is_zero(&dn->dn_tx_holds));
 575  584          ASSERT3U(refcount_count(&dn->dn_holds), <=, 1);
 576  585          ASSERT(avl_is_empty(&dn->dn_dbufs));
 577  586  
 578  587          for (i = 0; i < TXG_SIZE; i++) {
 579  588                  ASSERT0(dn->dn_next_nblkptr[i]);
 580  589                  ASSERT0(dn->dn_next_nlevels[i]);
 581  590                  ASSERT0(dn->dn_next_indblkshift[i]);
 582  591                  ASSERT0(dn->dn_next_bonuslen[i]);
 583  592                  ASSERT0(dn->dn_next_bonustype[i]);
 584  593                  ASSERT0(dn->dn_rm_spillblk[i]);
 585  594                  ASSERT0(dn->dn_next_blksz[i]);
 586  595                  ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
 587  596                  ASSERT3P(list_head(&dn->dn_dirty_records[i]), ==, NULL);
 588  597                  ASSERT3P(dn->dn_free_ranges[i], ==, NULL);
 589  598          }
 590  599  
 591  600          dn->dn_type = ot;
 592  601          dnode_setdblksz(dn, blocksize);
 593  602          dn->dn_indblkshift = ibs;
 594  603          dn->dn_nlevels = 1;
 595  604          if (bonustype == DMU_OT_SA) /* Maximize bonus space for SA */
 596  605                  dn->dn_nblkptr = 1;
 597  606          else
 598  607                  dn->dn_nblkptr = 1 +
 599  608                      ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
 600  609          dn->dn_bonustype = bonustype;
 601  610          dn->dn_bonuslen = bonuslen;
 602  611          dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
 603  612          dn->dn_compress = ZIO_COMPRESS_INHERIT;
 604  613          dn->dn_dirtyctx = 0;
 605  614  
 606  615          dn->dn_free_txg = 0;
 607  616          if (dn->dn_dirtyctx_firstset) {
 608  617                  kmem_free(dn->dn_dirtyctx_firstset, 1);
 609  618                  dn->dn_dirtyctx_firstset = NULL;
 610  619          }
 611  620  
 612  621          dn->dn_allocated_txg = tx->tx_txg;
 613  622          dn->dn_id_flags = 0;
 614  623  
 615  624          dnode_setdirty(dn, tx);
 616  625          dn->dn_next_indblkshift[tx->tx_txg & TXG_MASK] = ibs;
 617  626          dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
 618  627          dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = dn->dn_bonustype;
 619  628          dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = dn->dn_datablksz;
 620  629  }
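
Illustrative note on the per-txg arrays used above: the `dn_next_*[tx->tx_txg & TXG_MASK]` stores treat each array as a small ring indexed by the low bits of the transaction group number. The sketch below is not part of dnode.c. It assumes a ring of 4 slots, and every `sketch_`/`SKETCH_` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the real ones come from the ZFS headers. */
#define	SKETCH_TXG_SIZE	4
#define	SKETCH_TXG_MASK	(SKETCH_TXG_SIZE - 1)

/* Hypothetical stand-in for one of the per-txg arrays in dnode_t. */
typedef struct sketch_dnode {
	uint32_t next_blksz[SKETCH_TXG_SIZE];
} sketch_dnode_t;

/* Map a transaction group number to its slot in a per-txg array. */
static int
sketch_txg_slot(uint64_t txg)
{
	return ((int)(txg & SKETCH_TXG_MASK));
}

/* Record a pending block size in the slot that belongs to this txg. */
static void
sketch_set_next_blksz(sketch_dnode_t *dn, uint64_t txg, uint32_t blksz)
{
	assert(txg != 0);	/* mirrors ASSERT(tx->tx_txg != 0) */
	dn->next_blksz[sketch_txg_slot(txg)] = blksz;
}
```

Because the slot is taken from the low bits only, txgs that differ by a multiple of the ring size share a slot, which is presumably why the allocation path asserts that every `dn_next_*` entry is zero before reuse.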
 621  630  
 622  631  void
 623  632  dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
 624  633      dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
 625  634  {
 626  635          int nblkptr;
 627  636  
 628  637          ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
 629  638          ASSERT3U(blocksize, <=,
 630  639              spa_maxblocksize(dmu_objset_spa(dn->dn_objset)));

 631  640          ASSERT0(blocksize % SPA_MINBLOCKSIZE);
 632  641          ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
 633  642          ASSERT(tx->tx_txg != 0);
 634  643          ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
 635  644              (bonustype != DMU_OT_NONE && bonuslen != 0) ||
 636  645              (bonustype == DMU_OT_SA && bonuslen == 0));
 637  646          ASSERT(DMU_OT_IS_VALID(bonustype));
 638  647          ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
 639  648  
 640  649          /* clean up any unreferenced dbufs */
 641      -        dnode_evict_dbufs(dn);
      650 +        dnode_evict_dbufs(dn, DBUF_EVICT_ALL);
 642  651  
 643  652          dn->dn_id_flags = 0;
 644  653  
 645  654          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
 646  655          dnode_setdirty(dn, tx);
 647  656          if (dn->dn_datablksz != blocksize) {
 648  657                  /* change blocksize */
 649  658                  ASSERT(dn->dn_maxblkid == 0 &&
 650  659                      (BP_IS_HOLE(&dn->dn_phys->dn_blkptr[0]) ||
 651  660                      dnode_block_freed(dn, 0)));

 652  661                  dnode_setdblksz(dn, blocksize);
 653  662                  dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
 654  663          }
 655  664          if (dn->dn_bonuslen != bonuslen)
 656  665                  dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
 657  666  
 658  667          if (bonustype == DMU_OT_SA) /* Maximize bonus space for SA */
 659  668                  nblkptr = 1;
 660  669          else
 661  670                  nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
 662  671          if (dn->dn_bonustype != bonustype)
 663  672                  dn->dn_next_bonustype[tx->tx_txg&TXG_MASK] = bonustype;
 664  673          if (dn->dn_nblkptr != nblkptr)
 665  674                  dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
 666  675          if (dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
 667  676                  dbuf_rm_spill(dn, tx);
 668  677                  dnode_rm_spill(dn, tx);
 669  678          }
 670  679          rw_exit(&dn->dn_struct_rwlock);
 671  680  
 672  681          /* change type */
 673  682          dn->dn_type = ot;
 674  683  
 675  684          /* change bonus size and type */
 676  685          mutex_enter(&dn->dn_mtx);
 677  686          dn->dn_bonustype = bonustype;
 678  687          dn->dn_bonuslen = bonuslen;
 679  688          dn->dn_nblkptr = nblkptr;
 680  689          dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
 681  690          dn->dn_compress = ZIO_COMPRESS_INHERIT;
 682  691          ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
 683  692  
 684  693          /* fix up the bonus db_size */
 685  694          if (dn->dn_bonus) {
 686  695                  dn->dn_bonus->db.db_size =
 687  696                      DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
 688  697                  ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);
 689  698          }
 690  699  
 691  700          dn->dn_allocated_txg = tx->tx_txg;
 692  701          mutex_exit(&dn->dn_mtx);
 693  702  }
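
Illustrative note on the block-pointer arithmetic in dnode_allocate() and dnode_reallocate(): bonus space and block pointers share the tail of the dnode, so a smaller bonus buffer leaves room for more block pointers. The sketch below is not part of dnode.c. It assumes a 128-byte block pointer (shift of 7) and a 320-byte maximum bonus length, which should be checked against the headers of the tree under review. All `sketch_` names are hypothetical.

```c
#include <assert.h>

/* Assumed constants for illustration only. */
#define	SKETCH_BLKPTRSHIFT	7	/* 128-byte block pointer */
#define	SKETCH_MAX_BONUSLEN	320	/* bonus space with one block pointer */

/* Number of block pointers that fit alongside a bonus buffer. */
static int
sketch_nblkptr(int bonus_is_sa, int bonuslen)
{
	assert(bonuslen >= 0 && bonuslen <= SKETCH_MAX_BONUSLEN);
	if (bonus_is_sa)	/* maximize bonus space for SA */
		return (1);
	return (1 + ((SKETCH_MAX_BONUSLEN - bonuslen) >> SKETCH_BLKPTRSHIFT));
}

/* Bonus capacity left once nblkptr block pointers are reserved. */
static int
sketch_bonus_capacity(int nblkptr)
{
	assert(nblkptr >= 1);
	return (SKETCH_MAX_BONUSLEN -
	    (nblkptr - 1) * (1 << SKETCH_BLKPTRSHIFT));
}
```

Under these assumptions an empty bonus buffer yields 3 block pointers and 64 bytes of bonus capacity, while a 192-byte bonus buffer yields 2 block pointers and exactly 192 bytes of capacity, consistent with the `dn_bonuslen <= db_size` assertion above.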
 694  703  
 695  704  #ifdef  DNODE_STATS
 696  705  static struct {
 697  706          uint64_t dms_dnode_invalid;
 698  707          uint64_t dms_dnode_recheck1;
 699  708          uint64_t dms_dnode_recheck2;
 700  709          uint64_t dms_dnode_special;
 701  710          uint64_t dms_dnode_handle;
 702  711          uint64_t dms_dnode_rwlock;
 703  712          uint64_t dms_dnode_active;
 704  713  } dnode_move_stats;
 705  714  #endif  /* DNODE_STATS */
 706  715  
 707  716  #ifdef  _KERNEL
 708  717  static void
 709  718  dnode_move_impl(dnode_t *odn, dnode_t *ndn)
 710  719  {
 711  720          int i;
 712  721  
 713  722          ASSERT(!RW_LOCK_HELD(&odn->dn_struct_rwlock));
 714  723          ASSERT(MUTEX_NOT_HELD(&odn->dn_mtx));
 715  724          ASSERT(MUTEX_NOT_HELD(&odn->dn_dbufs_mtx));
 716  725          ASSERT(!RW_LOCK_HELD(&odn->dn_zfetch.zf_rwlock));
 717  726  
 718  727          /* Copy fields. */
 719  728          ndn->dn_objset = odn->dn_objset;
 720  729          ndn->dn_object = odn->dn_object;
 721  730          ndn->dn_dbuf = odn->dn_dbuf;
 722  731          ndn->dn_handle = odn->dn_handle;
 723  732          ndn->dn_phys = odn->dn_phys;
 724  733          ndn->dn_type = odn->dn_type;
 725  734          ndn->dn_bonuslen = odn->dn_bonuslen;
 726  735          ndn->dn_bonustype = odn->dn_bonustype;
 727  736          ndn->dn_nblkptr = odn->dn_nblkptr;
 728  737          ndn->dn_checksum = odn->dn_checksum;
 729  738          ndn->dn_compress = odn->dn_compress;
 730  739          ndn->dn_nlevels = odn->dn_nlevels;
 731  740          ndn->dn_indblkshift = odn->dn_indblkshift;
 732  741          ndn->dn_datablkshift = odn->dn_datablkshift;
 733  742          ndn->dn_datablkszsec = odn->dn_datablkszsec;
 734  743          ndn->dn_datablksz = odn->dn_datablksz;
 735  744          ndn->dn_maxblkid = odn->dn_maxblkid;
 736  745          bcopy(&odn->dn_next_nblkptr[0], &ndn->dn_next_nblkptr[0],
 737  746              sizeof (odn->dn_next_nblkptr));
 738  747          bcopy(&odn->dn_next_nlevels[0], &ndn->dn_next_nlevels[0],
 739  748              sizeof (odn->dn_next_nlevels));
 740  749          bcopy(&odn->dn_next_indblkshift[0], &ndn->dn_next_indblkshift[0],
 741  750              sizeof (odn->dn_next_indblkshift));
 742  751          bcopy(&odn->dn_next_bonustype[0], &ndn->dn_next_bonustype[0],
 743  752              sizeof (odn->dn_next_bonustype));
 744  753          bcopy(&odn->dn_rm_spillblk[0], &ndn->dn_rm_spillblk[0],
 745  754              sizeof (odn->dn_rm_spillblk));
 746  755          bcopy(&odn->dn_next_bonuslen[0], &ndn->dn_next_bonuslen[0],
 747  756              sizeof (odn->dn_next_bonuslen));
 748  757          bcopy(&odn->dn_next_blksz[0], &ndn->dn_next_blksz[0],
 749  758              sizeof (odn->dn_next_blksz));
 750  759          for (i = 0; i < TXG_SIZE; i++) {
 751  760                  list_move_tail(&ndn->dn_dirty_records[i],
 752  761                      &odn->dn_dirty_records[i]);
 753  762          }
 754  763          bcopy(&odn->dn_free_ranges[0], &ndn->dn_free_ranges[0],
 755  764              sizeof (odn->dn_free_ranges));
 756  765          ndn->dn_allocated_txg = odn->dn_allocated_txg;
 757  766          ndn->dn_free_txg = odn->dn_free_txg;
 758  767          ndn->dn_assigned_txg = odn->dn_assigned_txg;
 759  768          ndn->dn_dirtyctx = odn->dn_dirtyctx;
 760  769          ndn->dn_dirtyctx_firstset = odn->dn_dirtyctx_firstset;
 761  770          ASSERT(refcount_count(&odn->dn_tx_holds) == 0);
 762  771          refcount_transfer(&ndn->dn_holds, &odn->dn_holds);
 763  772          ASSERT(avl_is_empty(&ndn->dn_dbufs));
 764  773          avl_swap(&ndn->dn_dbufs, &odn->dn_dbufs);
 765  774          ndn->dn_dbufs_count = odn->dn_dbufs_count;
 766  775          ndn->dn_bonus = odn->dn_bonus;
 767  776          ndn->dn_have_spill = odn->dn_have_spill;
 768  777          ndn->dn_zio = odn->dn_zio;
 769  778          ndn->dn_oldused = odn->dn_oldused;
 770  779          ndn->dn_oldflags = odn->dn_oldflags;
 771  780          ndn->dn_olduid = odn->dn_olduid;
 772  781          ndn->dn_oldgid = odn->dn_oldgid;
 773  782          ndn->dn_newuid = odn->dn_newuid;
 774  783          ndn->dn_newgid = odn->dn_newgid;
 775  784          ndn->dn_id_flags = odn->dn_id_flags;
 776  785          dmu_zfetch_init(&ndn->dn_zfetch, NULL);
 777  786          list_move_tail(&ndn->dn_zfetch.zf_stream, &odn->dn_zfetch.zf_stream);
 778  787          ndn->dn_zfetch.zf_dnode = odn->dn_zfetch.zf_dnode;
 779  788  
 780  789          /*
 781  790           * Update back pointers. Updating the handle fixes the back pointer of
 782  791           * every descendant dbuf as well as the bonus dbuf.
 783  792           */
 784  793          ASSERT(ndn->dn_handle->dnh_dnode == odn);
 785  794          ndn->dn_handle->dnh_dnode = ndn;
 786  795          if (ndn->dn_zfetch.zf_dnode == odn) {
 787  796                  ndn->dn_zfetch.zf_dnode = ndn;
 788  797          }
 789  798  
 790  799          /*
 791  800           * Invalidate the original dnode by clearing all of its back pointers.
 792  801           */
 793  802          odn->dn_dbuf = NULL;
 794  803          odn->dn_handle = NULL;
 795  804          avl_create(&odn->dn_dbufs, dbuf_compare, sizeof (dmu_buf_impl_t),
 796  805              offsetof(dmu_buf_impl_t, db_link));
 797  806          odn->dn_dbufs_count = 0;
 798  807          odn->dn_bonus = NULL;
 799  808          odn->dn_zfetch.zf_dnode = NULL;
 800  809  
 801  810          /*
 802  811           * Set the low bit of the objset pointer to ensure that dnode_move()
 803  812           * recognizes the dnode as invalid in any subsequent callback.
 804  813           */
 805  814          POINTER_INVALIDATE(&odn->dn_objset);
 806  815  
 807  816          /*
 808  817           * Satisfy the destructor.
 809  818           */
 810  819          for (i = 0; i < TXG_SIZE; i++) {
 811  820                  list_create(&odn->dn_dirty_records[i],
 812  821                      sizeof (dbuf_dirty_record_t),
 813  822                      offsetof(dbuf_dirty_record_t, dr_dirty_node));
 814  823                  odn->dn_free_ranges[i] = NULL;
 815  824                  odn->dn_next_nlevels[i] = 0;
 816  825                  odn->dn_next_indblkshift[i] = 0;
 817  826                  odn->dn_next_bonustype[i] = 0;
 818  827                  odn->dn_rm_spillblk[i] = 0;
 819  828                  odn->dn_next_bonuslen[i] = 0;
 820  829                  odn->dn_next_blksz[i] = 0;
 821  830          }
 822  831          odn->dn_allocated_txg = 0;
 823  832          odn->dn_free_txg = 0;
 824  833          odn->dn_assigned_txg = 0;
 825  834          odn->dn_dirtyctx = 0;
 826  835          odn->dn_dirtyctx_firstset = NULL;
 827  836          odn->dn_have_spill = B_FALSE;
 828  837          odn->dn_zio = NULL;
 829  838          odn->dn_oldused = 0;
 830  839          odn->dn_oldflags = 0;
 831  840          odn->dn_olduid = 0;
 832  841          odn->dn_oldgid = 0;
 833  842          odn->dn_newuid = 0;
 834  843          odn->dn_newgid = 0;
 835  844          odn->dn_id_flags = 0;
 836  845  
 837  846          /*
 838  847           * Mark the dnode.
 839  848           */
 840  849          ndn->dn_moved = 1;
 841  850          odn->dn_moved = (uint8_t)-1;
 842  851  }
 843  852  
 844  853  /*ARGSUSED*/
 845  854  static kmem_cbrc_t
 846  855  dnode_move(void *buf, void *newbuf, size_t size, void *arg)
 847  856  {
 848  857          dnode_t *odn = buf, *ndn = newbuf;
 849  858          objset_t *os;
 850  859          int64_t refcount;
 851  860          uint32_t dbufs;
 852  861  
 853  862          /*
 854  863           * The dnode is on the objset's list of known dnodes if the objset
 855  864           * pointer is valid. We set the low bit of the objset pointer when
 856  865           * freeing the dnode to invalidate it, and the memory patterns written
 857  866           * by kmem (baddcafe and deadbeef) set at least one of the two low bits.
 858  867           * A newly created dnode sets the objset pointer last of all to indicate
 859  868           * that the dnode is known and in a valid state to be moved by this
 860  869           * function.
 861  870           */
 862  871          os = odn->dn_objset;
 863  872          if (!POINTER_IS_VALID(os)) {
 864  873                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_invalid);
 865  874                  return (KMEM_CBRC_DONT_KNOW);
 866  875          }
 867  876  
 868  877          /*
 869  878           * Ensure that the objset does not go away during the move.
 870  879           */
 871  880          rw_enter(&os_lock, RW_WRITER);
 872  881          if (os != odn->dn_objset) {
 873  882                  rw_exit(&os_lock);
 874  883                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_recheck1);
 875  884                  return (KMEM_CBRC_DONT_KNOW);
 876  885          }
 877  886  
 878  887          /*
 879  888           * If the dnode is still valid, then so is the objset. We know that no
 880  889           * valid objset can be freed while we hold os_lock, so we can safely
 881  890           * ensure that the objset remains in use.
 882  891           */
 883  892          mutex_enter(&os->os_lock);
 884  893  
 885  894          /*
 886  895           * Recheck the objset pointer in case the dnode was removed just before
 887  896           * acquiring the lock.
 888  897           */
 889  898          if (os != odn->dn_objset) {
 890  899                  mutex_exit(&os->os_lock);
 891  900                  rw_exit(&os_lock);
 892  901                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_recheck2);
 893  902                  return (KMEM_CBRC_DONT_KNOW);
 894  903          }
 895  904  
 896  905          /*
 897  906           * At this point we know that as long as we hold os->os_lock, the dnode
 898  907           * cannot be freed and fields within the dnode can be safely accessed.
 899  908           * The objset listing this dnode cannot go away as long as this dnode is
 900  909           * on its list.
 901  910           */
 902  911          rw_exit(&os_lock);
 903  912          if (DMU_OBJECT_IS_SPECIAL(odn->dn_object)) {
 904  913                  mutex_exit(&os->os_lock);
 905  914                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_special);
 906  915                  return (KMEM_CBRC_NO);
 907  916          }
 908  917          ASSERT(odn->dn_dbuf != NULL); /* only "special" dnodes have no parent */
 909  918  
 910  919          /*
 911  920           * Lock the dnode handle to prevent the dnode from obtaining any new
 912  921           * holds. This also prevents the descendant dbufs and the bonus dbuf
 913  922           * from accessing the dnode, so that we can discount their holds. The
 914  923           * handle is safe to access because we know that while the dnode cannot
 915  924           * go away, neither can its handle. Once we hold dnh_zrlock, we can
 916  925           * safely move any dnode referenced only by dbufs.
 917  926           */
 918  927          if (!zrl_tryenter(&odn->dn_handle->dnh_zrlock)) {
 919  928                  mutex_exit(&os->os_lock);
 920  929                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_handle);
 921  930                  return (KMEM_CBRC_LATER);
 922  931          }
 923  932  
 924  933          /*
 925  934           * Ensure a consistent view of the dnode's holds and the dnode's dbufs.
 926  935           * We need to guarantee that there is a hold for every dbuf in order to
 927  936           * determine whether the dnode is actively referenced. Falsely matching
 928  937           * a dbuf to an active hold would lead to an unsafe move. It's possible
 929  938           * that a thread already having an active dnode hold is about to add a
 930  939           * dbuf, and we can't compare hold and dbuf counts while the add is in
 931  940           * progress.
 932  941           */
 933  942          if (!rw_tryenter(&odn->dn_struct_rwlock, RW_WRITER)) {
 934  943                  zrl_exit(&odn->dn_handle->dnh_zrlock);
 935  944                  mutex_exit(&os->os_lock);
 936  945                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_rwlock);
 937  946                  return (KMEM_CBRC_LATER);
 938  947          }
 939  948  
 940  949          /*
 941  950           * A dbuf may be removed (evicted) without an active dnode hold. In that
 942  951           * case, the dbuf count is decremented under the handle lock before the
 943  952           * dbuf's hold is released. This order ensures that if we count the hold
 944  953           * after the dbuf is removed but before its hold is released, we will
 945  954           * treat the unmatched hold as active and exit safely. If we count the
 946  955           * hold before the dbuf is removed, the hold is discounted, and the
 947  956           * removal is blocked until the move completes.
 948  957           */
 949  958          refcount = refcount_count(&odn->dn_holds);
 950  959          ASSERT(refcount >= 0);
 951  960          dbufs = odn->dn_dbufs_count;
 952  961  
 953  962          /* We can't have more dbufs than dnode holds. */
 954  963          ASSERT3U(dbufs, <=, refcount);
 955  964          DTRACE_PROBE3(dnode__move, dnode_t *, odn, int64_t, refcount,
 956  965              uint32_t, dbufs);
 957  966  
 958  967          if (refcount > dbufs) {
 959  968                  rw_exit(&odn->dn_struct_rwlock);
 960  969                  zrl_exit(&odn->dn_handle->dnh_zrlock);
 961  970                  mutex_exit(&os->os_lock);
 962  971                  DNODE_STAT_ADD(dnode_move_stats.dms_dnode_active);
 963  972                  return (KMEM_CBRC_LATER);
 964  973          }
 965  974  
 966  975          rw_exit(&odn->dn_struct_rwlock);
 967  976  
 968  977          /*
 969  978           * At this point we know that anyone with a hold on the dnode is not
 970  979           * actively referencing it. The dnode is known and in a valid state to
 971  980           * move. We're holding the locks needed to execute the critical section.
 972  981           */
 973  982          dnode_move_impl(odn, ndn);
 974  983  
 975  984          list_link_replace(&odn->dn_link, &ndn->dn_link);
 976  985          /* If the dnode was safe to move, the refcount cannot have changed. */
 977  986          ASSERT(refcount == refcount_count(&ndn->dn_holds));
 978  987          ASSERT(dbufs == ndn->dn_dbufs_count);
 979  988          zrl_exit(&ndn->dn_handle->dnh_zrlock); /* handle has moved */
 980  989          mutex_exit(&os->os_lock);
 981  990  
 982  991          return (KMEM_CBRC_YES);
 983  992  }
 984  993  #endif  /* _KERNEL */
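
Illustrative note on the objset pointer check in dnode_move(): the comments above describe tagging the low bit of `dn_objset` so that later callbacks see the dnode as invalid. The sketch below shows one way such a check can work. It is not the definition of POINTER_IS_VALID or POINTER_INVALIDATE from the tree, it assumes valid pointers are at least 4-byte aligned, and the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A pointer counts as valid only if both low bits are clear. */
static int
sketch_pointer_is_valid(const void *p)
{
	return (((uintptr_t)p & 0x3) == 0);
}

/* Tag the pointer in place by setting its low bit. */
static void
sketch_pointer_invalidate(void **pp)
{
	*pp = (void *)((uintptr_t)*pp | 0x1);
}
```

With a two-bit test like this, the kmem fill patterns named in the comment (baddcafe and deadbeef) also read as invalid, since each has at least one of the two low bits set.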
 985  994  
 986  995  void
 987  996  dnode_special_close(dnode_handle_t *dnh)
 988  997  {
 989  998          dnode_t *dn = dnh->dnh_dnode;
 990  999  
 991 1000          /*
 992 1001           * Wait for final references to the dnode to clear.  This can
 993 1002           * only happen if the arc is asynchronously evicting state that
 994 1003           * has a hold on this dnode while we are trying to evict this
 995 1004           * dnode.
 996 1005           */
 997 1006          while (refcount_count(&dn->dn_holds) > 0)
 998 1007                  delay(1);
 999 1008          ASSERT(dn->dn_dbuf == NULL ||
1000 1009              dmu_buf_get_user(&dn->dn_dbuf->db) == NULL);
1001 1010          zrl_add(&dnh->dnh_zrlock);
1002 1011          dnode_destroy(dn); /* implicit zrl_remove() */
1003 1012          zrl_destroy(&dnh->dnh_zrlock);
1004 1013          dnh->dnh_dnode = NULL;
1005 1014  }
1006 1015  
1007 1016  void
1008 1017  dnode_special_open(objset_t *os, dnode_phys_t *dnp, uint64_t object,
1009 1018      dnode_handle_t *dnh)
1010 1019  {
1011 1020          dnode_t *dn;
1012 1021  
1013 1022          dn = dnode_create(os, dnp, NULL, object, dnh);
1014 1023          zrl_init(&dnh->dnh_zrlock);
1015 1024          DNODE_VERIFY(dn);
1016 1025  }
1017 1026  
1018 1027  static void
1019 1028  dnode_buf_evict_async(void *dbu)
1020 1029  {
1021 1030          dnode_children_t *children_dnodes = dbu;
1022 1031          int i;
1023 1032  
1024 1033          for (i = 0; i < children_dnodes->dnc_count; i++) {
1025 1034                  dnode_handle_t *dnh = &children_dnodes->dnc_children[i];
1026 1035                  dnode_t *dn;
1027 1036  
1028 1037                  /*
1029 1038                   * The dnode handle lock guards against the dnode moving to
1030 1039                   * another valid address, so there is no need here to guard
1031 1040                   * against changes to or from NULL.
1032 1041                   */
1033 1042                  if (dnh->dnh_dnode == NULL) {
1034 1043                          zrl_destroy(&dnh->dnh_zrlock);
1035 1044                          continue;
1036 1045                  }
1037 1046  
1038 1047                  zrl_add(&dnh->dnh_zrlock);
1039 1048                  dn = dnh->dnh_dnode;
1040 1049                  /*
1041 1050                   * If there are holds on this dnode, then there should
1042 1051                   * be holds on the dnode's containing dbuf as well; thus
1043 1052                   * it wouldn't be eligible for eviction and this function
1044 1053                   * would not have been called.
1045 1054                   */
1046 1055                  ASSERT(refcount_is_zero(&dn->dn_holds));
1047 1056                  ASSERT(refcount_is_zero(&dn->dn_tx_holds));
1048 1057  
1049 1058                  dnode_destroy(dn); /* implicit zrl_remove() */
1050 1059                  zrl_destroy(&dnh->dnh_zrlock);
1051 1060                  dnh->dnh_dnode = NULL;
1052 1061          }
1053 1062          kmem_free(children_dnodes, sizeof (dnode_children_t) +
1054 1063              children_dnodes->dnc_count * sizeof (dnode_handle_t));
1055 1064  }
1056 1065  
1057 1066  /*
1058 1067   * errors:
1059 1068   * EINVAL - invalid object number.
1060 1069   * EIO - i/o error.
1061 1070   * succeeds even for free dnodes.
1062 1071   */
1063 1072  int
1064 1073  dnode_hold_impl(objset_t *os, uint64_t object, int flag,
1065 1074      void *tag, dnode_t **dnp)
1066 1075  {
1067 1076          int epb, idx, err;
1068 1077          int drop_struct_lock = FALSE;
1069 1078          int type;
1070 1079          uint64_t blk;
1071 1080          dnode_t *mdn, *dn;
1072 1081          dmu_buf_impl_t *db;
1073 1082          dnode_children_t *children_dnodes;
1074 1083          dnode_handle_t *dnh;
1075 1084  
1076 1085          /*
1077 1086           * If you are holding the spa config lock as writer, you shouldn't
1078 1087           * be asking the DMU to do *anything* unless it's the root pool
1079 1088           * which may require us to read from the root filesystem while
1080 1089           * holding some (not all) of the locks as writer.
1081 1090           */
1082 1091          ASSERT(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0 ||
1083 1092              (spa_is_root(os->os_spa) &&
1084 1093              spa_config_held(os->os_spa, SCL_STATE, RW_WRITER)));
1085 1094  
1086 1095          if (object == DMU_USERUSED_OBJECT || object == DMU_GROUPUSED_OBJECT) {
1087 1096                  dn = (object == DMU_USERUSED_OBJECT) ?
1088 1097                      DMU_USERUSED_DNODE(os) : DMU_GROUPUSED_DNODE(os);
1089 1098                  if (dn == NULL)
1090 1099                          return (SET_ERROR(ENOENT));
1091 1100                  type = dn->dn_type;
1092 1101                  if ((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE)
1093 1102                          return (SET_ERROR(ENOENT));
1094 1103                  if ((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE)
1095 1104                          return (SET_ERROR(EEXIST));
1096 1105                  DNODE_VERIFY(dn);
1097 1106                  (void) refcount_add(&dn->dn_holds, tag);
1098 1107                  *dnp = dn;
1099 1108                  return (0);
1100 1109          }
1101 1110  
1102 1111          if (object == 0 || object >= DN_MAX_OBJECT)
1103 1112                  return (SET_ERROR(EINVAL));
1104 1113  
1105 1114          mdn = DMU_META_DNODE(os);
1106 1115          ASSERT(mdn->dn_object == DMU_META_DNODE_OBJECT);
1107 1116  
1108 1117          DNODE_VERIFY(mdn);
1109 1118  
1110 1119          if (!RW_WRITE_HELD(&mdn->dn_struct_rwlock)) {
1111 1120                  rw_enter(&mdn->dn_struct_rwlock, RW_READER);
1112 1121                  drop_struct_lock = TRUE;
1113 1122          }
1114 1123  
1115 1124          blk = dbuf_whichblock(mdn, 0, object * sizeof (dnode_phys_t));
1116 1125  
1117 1126          db = dbuf_hold(mdn, blk, FTAG);
1118 1127          if (drop_struct_lock)
1119 1128                  rw_exit(&mdn->dn_struct_rwlock);
1120 1129          if (db == NULL)
1121 1130                  return (SET_ERROR(EIO));
1122 1131          err = dbuf_read(db, NULL, DB_RF_CANFAIL);
1123 1132          if (err) {
1124 1133                  dbuf_rele(db, FTAG);
1125 1134                  return (err);
1126 1135          }
1127 1136  
1128 1137          ASSERT3U(db->db.db_size, >=, 1<<DNODE_SHIFT);
1129 1138          epb = db->db.db_size >> DNODE_SHIFT;
1130 1139  
1131 1140          idx = object & (epb-1);
1132 1141  
1133 1142          ASSERT(DB_DNODE(db)->dn_type == DMU_OT_DNODE);
1134 1143          children_dnodes = dmu_buf_get_user(&db->db);
1135 1144          if (children_dnodes == NULL) {
1136 1145                  int i;
1137 1146                  dnode_children_t *winner;
1138 1147                  children_dnodes = kmem_zalloc(sizeof (dnode_children_t) +
1139 1148                      epb * sizeof (dnode_handle_t), KM_SLEEP);
1140 1149                  children_dnodes->dnc_count = epb;
1141 1150                  dnh = &children_dnodes->dnc_children[0];
1142 1151                  for (i = 0; i < epb; i++) {
1143 1152                          zrl_init(&dnh[i].dnh_zrlock);
1144 1153                  }
1145 1154                  dmu_buf_init_user(&children_dnodes->dnc_dbu, NULL,
1146 1155                      dnode_buf_evict_async, NULL);
1147 1156                  winner = dmu_buf_set_user(&db->db, &children_dnodes->dnc_dbu);
1148 1157                  if (winner != NULL) {
1149 1158  
1150 1159                          for (i = 0; i < epb; i++) {
1151 1160                                  zrl_destroy(&dnh[i].dnh_zrlock);
1152 1161                          }
1153 1162  
1154 1163                          kmem_free(children_dnodes, sizeof (dnode_children_t) +
1155 1164                              epb * sizeof (dnode_handle_t));
1156 1165                          children_dnodes = winner;
1157 1166                  }
1158 1167          }
1159 1168          ASSERT(children_dnodes->dnc_count == epb);
1160 1169  
1161 1170          dnh = &children_dnodes->dnc_children[idx];
1162 1171          zrl_add(&dnh->dnh_zrlock);
1163 1172          dn = dnh->dnh_dnode;
1164 1173          if (dn == NULL) {
1165 1174                  dnode_phys_t *phys = (dnode_phys_t *)db->db.db_data+idx;
1166 1175  
1167 1176                  dn = dnode_create(os, phys, db, object, dnh);
1168 1177          }
1169 1178  
1170 1179          mutex_enter(&dn->dn_mtx);
1171 1180          type = dn->dn_type;
1172 1181          if (dn->dn_free_txg ||
1173 1182              ((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) ||
1174 1183              ((flag & DNODE_MUST_BE_FREE) &&
1175 1184              (type != DMU_OT_NONE || !refcount_is_zero(&dn->dn_holds)))) {
1176 1185                  mutex_exit(&dn->dn_mtx);
1177 1186                  zrl_remove(&dnh->dnh_zrlock);
1178 1187                  dbuf_rele(db, FTAG);
1179 1188                  return (type == DMU_OT_NONE ? ENOENT : EEXIST);
1180 1189          }
1181 1190          if (refcount_add(&dn->dn_holds, tag) == 1)
1182 1191                  dbuf_add_ref(db, dnh);
1183 1192          mutex_exit(&dn->dn_mtx);
1184 1193  
1185 1194          /* Now we can rely on the hold to prevent the dnode from moving. */
1186 1195          zrl_remove(&dnh->dnh_zrlock);
1187 1196  
1188 1197          DNODE_VERIFY(dn);
1189 1198          ASSERT3P(dn->dn_dbuf, ==, db);
1190 1199          ASSERT3U(dn->dn_object, ==, object);
1191 1200          dbuf_rele(db, FTAG);
1192 1201  
1193 1202          *dnp = dn;
1194 1203          return (0);
1195 1204  }
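
Illustrative note on how dnode_hold_impl() locates a dnode inside its metadnode block: `epb` is the number of dnodes per block and `idx` is the object's position within that block. The sketch below is not part of dnode.c. It assumes 512-byte on-disk dnodes (shift of 9) and a power-of-two block size, and the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 512-byte on-disk dnode. */
#define	SKETCH_DNODE_SHIFT	9

/* Dnodes ("entries") per metadnode block of the given size. */
static int
sketch_dnodes_per_block(uint32_t db_size)
{
	assert(db_size >= (1U << SKETCH_DNODE_SHIFT));
	return ((int)(db_size >> SKETCH_DNODE_SHIFT));
}

/* Index of an object within its block; epb must be a power of two. */
static int
sketch_dnode_index(uint64_t object, int epb)
{
	assert(epb > 0 && (epb & (epb - 1)) == 0);
	return ((int)(object & (uint64_t)(epb - 1)));
}
```

For example, a 16 KB block would hold 32 dnodes under these assumptions, and object 100 would land at index 4 of its block. The mask only works because `epb` is a power of two.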
1196 1205  
1197 1206  /*
1198 1207   * Return held dnode if the object is allocated, NULL if not.
1199 1208   */
1200 1209  int
1201 1210  dnode_hold(objset_t *os, uint64_t object, void *tag, dnode_t **dnp)
1202 1211  {
1203 1212          return (dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED, tag, dnp));
1204 1213  }
1205 1214  
1206 1215  /*
1207 1216   * Can only add a reference if there is already at least one
1208 1217   * reference on the dnode.  Returns FALSE if unable to add a
1209 1218   * new reference.
1210 1219   */
1211 1220  boolean_t
1212 1221  dnode_add_ref(dnode_t *dn, void *tag)
1213 1222  {
1214 1223          mutex_enter(&dn->dn_mtx);
1215 1224          if (refcount_is_zero(&dn->dn_holds)) {
1216 1225                  mutex_exit(&dn->dn_mtx);
1217 1226                  return (FALSE);
1218 1227          }
1219 1228          VERIFY(1 < refcount_add(&dn->dn_holds, tag));
1220 1229          mutex_exit(&dn->dn_mtx);
1221 1230          return (TRUE);
1222 1231  }
1223 1232  
1224 1233  void
1225 1234  dnode_rele(dnode_t *dn, void *tag)
1226 1235  {
1227 1236          mutex_enter(&dn->dn_mtx);
1228 1237          dnode_rele_and_unlock(dn, tag);
1229 1238  }
1230 1239  
1231 1240  void
1232 1241  dnode_rele_and_unlock(dnode_t *dn, void *tag)
1233 1242  {
1234 1243          uint64_t refs;
1235 1244          /* Get while the hold prevents the dnode from moving. */
1236 1245          dmu_buf_impl_t *db = dn->dn_dbuf;
1237 1246          dnode_handle_t *dnh = dn->dn_handle;
1238 1247  
1239 1248          refs = refcount_remove(&dn->dn_holds, tag);
1240 1249          mutex_exit(&dn->dn_mtx);
1241 1250  
1242 1251          /*
1243 1252           * It's unsafe to release the last hold on a dnode by dnode_rele() or
1244 1253           * indirectly by dbuf_rele() while relying on the dnode handle to
1245 1254           * prevent the dnode from moving, since releasing the last hold could
1246 1255           * result in the dnode's parent dbuf evicting its dnode handles. For
1247 1256           * that reason anyone calling dnode_rele() or dbuf_rele() without some
1248 1257           * other direct or indirect hold on the dnode must first drop the dnode
1249 1258           * handle.
1250 1259           */
1251 1260          ASSERT(refs > 0 || dnh->dnh_zrlock.zr_owner != curthread);
1252 1261  
1253 1262          /* NOTE: the DNODE_DNODE does not have a dn_dbuf */
1254 1263          if (refs == 0 && db != NULL) {
1255 1264                  /*
1256 1265                   * Another thread could add a hold to the dnode handle in
1257 1266                   * dnode_hold_impl() while holding the parent dbuf. Since the
1258 1267                   * hold on the parent dbuf prevents the handle from being
1259 1268                   * destroyed, the hold on the handle is OK. We can't yet assert

1260 1269                   * that the handle has zero references, but that will be
1261 1270                   * asserted anyway when the handle gets destroyed.
1262 1271                   */
1263 1272                  dbuf_rele(db, dnh);
1264 1273          }
1265 1274  }
1266 1275  
1267 1276  void
1268 1277  dnode_setdirty(dnode_t *dn, dmu_tx_t *tx)
1269 1278  {
     1279 +        dnode_setdirty_sc(dn, tx, B_TRUE);
     1280 +}
     1281 +
     1282 +void
     1283 +dnode_setdirty_sc(dnode_t *dn, dmu_tx_t *tx, boolean_t usesc)
     1284 +{
1270 1285          objset_t *os = dn->dn_objset;
1271 1286          uint64_t txg = tx->tx_txg;
1272 1287  
1273 1288          if (DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
1274 1289                  dsl_dataset_dirty(os->os_dsl_dataset, tx);
1275 1290                  return;
1276 1291          }
1277 1292  
1278 1293          DNODE_VERIFY(dn);
1279 1294

1280 1295  #ifdef ZFS_DEBUG
1281 1296          mutex_enter(&dn->dn_mtx);
1282 1297          ASSERT(dn->dn_phys->dn_type || dn->dn_allocated_txg);
1283 1298          ASSERT(dn->dn_free_txg == 0 || dn->dn_free_txg >= txg);
1284 1299          mutex_exit(&dn->dn_mtx);
1285 1300  #endif
1286 1301  
1287 1302          /*
1288 1303           * Determine old uid/gid when necessary
1289 1304           */
1290 1305          dmu_objset_userquota_get_ids(dn, B_TRUE, tx);
1291 1306  
1292 1307          multilist_t *dirtylist = os->os_dirty_dnodes[txg & TXG_MASK];
1293 1308          multilist_sublist_t *mls = multilist_sublist_lock_obj(dirtylist, dn);
1294 1309  
1295 1310          /*
1296 1311           * If we are already marked dirty, we're done.
1297 1312           */
1298 1313          if (list_link_active(&dn->dn_dirty_link[txg & TXG_MASK])) {
1299 1314                  multilist_sublist_unlock(mls);
1300 1315                  return;
1301 1316          }
1302 1317  
1303 1318          ASSERT(!refcount_is_zero(&dn->dn_holds) ||
1304 1319              !avl_is_empty(&dn->dn_dbufs));
1305 1320          ASSERT(dn->dn_datablksz != 0);
1306 1321          ASSERT0(dn->dn_next_bonuslen[txg&TXG_MASK]);
1307 1322          ASSERT0(dn->dn_next_blksz[txg&TXG_MASK]);
1308 1323          ASSERT0(dn->dn_next_bonustype[txg&TXG_MASK]);
1309 1324  
1310 1325          dprintf_ds(os->os_dsl_dataset, "obj=%llu txg=%llu\n",
1311 1326              dn->dn_object, txg);
1312 1327  
1313 1328          multilist_sublist_insert_head(mls, dn);
1314 1329  
1315 1330          multilist_sublist_unlock(mls);
1316 1331  
1317 1332          /*

1318 1333           * The dnode maintains a hold on its containing dbuf as
1319 1334           * long as there are holds on it.  Each instantiated child
1320 1335           * dbuf maintains a hold on the dnode.  When the last child
1321 1336           * drops its hold, the dnode will drop its hold on the
1322 1337           * containing dbuf. We add a "dirty hold" here so that the
1323 1338           * dnode will hang around after we finish processing its
1324 1339           * children.
1325 1340           */
1326 1341          VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg));
1327 1342  
1328      -        (void) dbuf_dirty(dn->dn_dbuf, tx);
1329      -
     1343 +        (void) dbuf_dirty_sc(dn->dn_dbuf, tx, usesc);
1330 1344          dsl_dataset_dirty(os->os_dsl_dataset, tx);
1331 1345  }
1332 1346  
1333 1347  void
1334 1348  dnode_free(dnode_t *dn, dmu_tx_t *tx)
1335 1349  {
1336 1350          mutex_enter(&dn->dn_mtx);
1337 1351          if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg) {
1338 1352                  mutex_exit(&dn->dn_mtx);
1339 1353                  return;

1340 1354          }
1341 1355          dn->dn_free_txg = tx->tx_txg;
1342 1356          mutex_exit(&dn->dn_mtx);
1343 1357  
1344 1358          dnode_setdirty(dn, tx);
1345 1359  }
1346 1360  
1347 1361  /*
1348 1362   * Try to change the block size for the indicated dnode.  This can only
1349 1363   * succeed if there are no blocks allocated or dirty beyond the first block.
1350 1364   */
1351 1365  int
1352 1366  dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx)
1353 1367  {
1354 1368          dmu_buf_impl_t *db;
1355 1369          int err;
1356 1370  
1357 1371          ASSERT3U(size, <=, spa_maxblocksize(dmu_objset_spa(dn->dn_objset)));
1358 1372          if (size == 0)
1359 1373                  size = SPA_MINBLOCKSIZE;
1360 1374          else
1361 1375                  size = P2ROUNDUP(size, SPA_MINBLOCKSIZE);
1362 1376  
1363 1377          if (ibs == dn->dn_indblkshift)
1364 1378                  ibs = 0;
1365 1379  
1366 1380          if (size >> SPA_MINBLOCKSHIFT == dn->dn_datablkszsec && ibs == 0)
1367 1381                  return (0);
1368 1382  
1369 1383          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
1370 1384  
1371 1385          /* Check for any allocated blocks beyond the first */
1372 1386          if (dn->dn_maxblkid != 0)
1373 1387                  goto fail;
1374 1388  
1375 1389          mutex_enter(&dn->dn_dbufs_mtx);
1376 1390          for (db = avl_first(&dn->dn_dbufs); db != NULL;
1377 1391              db = AVL_NEXT(&dn->dn_dbufs, db)) {
1378 1392                  if (db->db_blkid != 0 && db->db_blkid != DMU_BONUS_BLKID &&
1379 1393                      db->db_blkid != DMU_SPILL_BLKID) {
1380 1394                          mutex_exit(&dn->dn_dbufs_mtx);
1381 1395                          goto fail;
1382 1396                  }
1383 1397          }
1384 1398          mutex_exit(&dn->dn_dbufs_mtx);
1385 1399  
1386 1400          if (ibs && dn->dn_nlevels != 1)
1387 1401                  goto fail;
1388 1402  
1389 1403          /* resize the old block */
1390 1404          err = dbuf_hold_impl(dn, 0, 0, TRUE, FALSE, FTAG, &db);
1391 1405          if (err == 0)
1392 1406                  dbuf_new_size(db, size, tx);
1393 1407          else if (err != ENOENT)
1394 1408                  goto fail;
1395 1409  
1396 1410          dnode_setdblksz(dn, size);
1397 1411          dnode_setdirty(dn, tx);
1398 1412          dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = size;
1399 1413          if (ibs) {
1400 1414                  dn->dn_indblkshift = ibs;
1401 1415                  dn->dn_next_indblkshift[tx->tx_txg&TXG_MASK] = ibs;
1402 1416          }
1403 1417          /* rele after we have fixed the blocksize in the dnode */
1404 1418          if (db)
1405 1419                  dbuf_rele(db, FTAG);
1406 1420

1407 1421          rw_exit(&dn->dn_struct_rwlock);
1408 1422          return (0);
1409 1423  
1410 1424  fail:
1411 1425          rw_exit(&dn->dn_struct_rwlock);
1412 1426          return (SET_ERROR(ENOTSUP));
1413 1427  }
1414 1428  
1415 1429  /* read-holding callers must not rely on the lock being continuously held */
1416 1430  void
1417      -dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t have_read)
     1431 +dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx,
     1432 +    boolean_t usesc, boolean_t have_read)
1418 1433  {
1419 1434          uint64_t txgoff = tx->tx_txg & TXG_MASK;
1420 1435          int epbs, new_nlevels;
1421 1436          uint64_t sz;
1422 1437  
1423 1438          ASSERT(blkid != DMU_BONUS_BLKID);
1424 1439  
1425 1440          ASSERT(have_read ?
1426 1441              RW_READ_HELD(&dn->dn_struct_rwlock) :
1427 1442              RW_WRITE_HELD(&dn->dn_struct_rwlock));

1428 1443  
1429 1444          /*
1430 1445           * if we have a read-lock, check to see if we need to do any work
1431 1446           * before upgrading to a write-lock.
1432 1447           */
1433 1448          if (have_read) {
1434 1449                  if (blkid <= dn->dn_maxblkid)
1435 1450                          return;
1436 1451  
1437 1452                  if (!rw_tryupgrade(&dn->dn_struct_rwlock)) {
1438 1453                          rw_exit(&dn->dn_struct_rwlock);
1439 1454                          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
1440 1455                  }
1441 1456          }
1442 1457  
1443 1458          if (blkid <= dn->dn_maxblkid)
1444 1459                  goto out;
1445 1460  
1446 1461          dn->dn_maxblkid = blkid;
1447 1462  
1448 1463          /*
1449 1464           * Compute the number of levels necessary to support the new maxblkid.
1450 1465           */
1451 1466          new_nlevels = 1;
1452 1467          epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
1453 1468          for (sz = dn->dn_nblkptr;
1454 1469              sz <= blkid && sz >= dn->dn_nblkptr; sz <<= epbs)
1455 1470                  new_nlevels++;
1456 1471  
1457 1472          if (new_nlevels > dn->dn_nlevels) {
1458 1473                  int old_nlevels = dn->dn_nlevels;
1459 1474                  dmu_buf_impl_t *db;
1460 1475                  list_t *list;

1461 1476                  dbuf_dirty_record_t *new, *dr, *dr_next;
1462 1477  
1463 1478                  dn->dn_nlevels = new_nlevels;
1464 1479  
1465 1480                  ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]);
1466 1481                  dn->dn_next_nlevels[txgoff] = new_nlevels;
1467 1482  
1468 1483                  /* dirty the left indirects */
1469 1484                  db = dbuf_hold_level(dn, old_nlevels, 0, FTAG);
1470 1485                  ASSERT(db != NULL);
1471      -                new = dbuf_dirty(db, tx);
     1486 +                new = dbuf_dirty_sc(db, tx, usesc);
1472 1487                  dbuf_rele(db, FTAG);
1473 1488  
1474 1489                  /* transfer the dirty records to the new indirect */
1475 1490                  mutex_enter(&dn->dn_mtx);
1476 1491                  mutex_enter(&new->dt.di.dr_mtx);
1477 1492                  list = &dn->dn_dirty_records[txgoff];
1478 1493                  for (dr = list_head(list); dr; dr = dr_next) {
1479 1494                          dr_next = list_next(&dn->dn_dirty_records[txgoff], dr);
1480 1495                          if (dr->dr_dbuf->db_level != new_nlevels-1 &&
1481 1496                              dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID &&

1482 1497                              dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID) {
1483 1498                                  ASSERT(dr->dr_dbuf->db_level == old_nlevels-1);
1484 1499                                  list_remove(&dn->dn_dirty_records[txgoff], dr);
1485 1500                                  list_insert_tail(&new->dt.di.dr_children, dr);
1486 1501                                  dr->dr_parent = new;
1487 1502                          }
1488 1503                  }
1489 1504                  mutex_exit(&new->dt.di.dr_mtx);
1490 1505                  mutex_exit(&dn->dn_mtx);
1491 1506          }
1492 1507  
1493 1508  out:
1494 1509          if (have_read)
1495 1510                  rw_downgrade(&dn->dn_struct_rwlock);
1496 1511  }
1497 1512  
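The level computation in dnode_new_blkid() can be tried in isolation. The following is a minimal sketch, not code from dnode.c: the helper name `levels_for_blkid` is hypothetical, and the caller passes `nblkptr` and `epbs` directly instead of reading them from a dnode. It assumes the same loop shape as above, where each extra level multiplies the number of addressable level-0 blocks by 2^epbs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone helper (not part of dnode.c): returns how many
 * levels a block tree needs so that `blkid` is addressable, given `nblkptr`
 * block pointers in the dnode and 2^epbs pointers per indirect block.
 */
static int
levels_for_blkid(uint64_t blkid, uint64_t nblkptr, int epbs)
{
	int nlevels = 1;
	uint64_t sz;

	assert(nblkptr > 0 && epbs > 0 && epbs < 64);

	/*
	 * sz is the number of level-0 blocks reachable with the current
	 * level count.  The `sz >= nblkptr` test ends the loop if the
	 * shift overflows 64 bits.
	 */
	for (sz = nblkptr; sz <= blkid && sz >= nblkptr; sz <<= epbs)
		nlevels++;

	return (nlevels);
}
```

With 3 block pointers and epbs = 10, block IDs 0 through 2 fit in one level, 3 through 3071 need two levels, and 3072 is the first block ID that needs three.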
1498 1513  static void
1499 1514  dnode_dirty_l1(dnode_t *dn, uint64_t l1blkid, dmu_tx_t *tx)
1500 1515  {
1501 1516          dmu_buf_impl_t *db = dbuf_hold_level(dn, 1, l1blkid, FTAG);
1502 1517          if (db != NULL) {
1503 1518                  dmu_buf_will_dirty(&db->db, tx);
1504 1519                  dbuf_rele(db, FTAG);
1505 1520          }
1506 1521  }
1507 1522  
1508 1523  void
1509 1524  dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx)
1510 1525  {
1511 1526          dmu_buf_impl_t *db;
1512 1527          uint64_t blkoff, blkid, nblks;
1513 1528          int blksz, blkshift, head, tail;
1514 1529          int trunc = FALSE;
1515 1530          int epbs;
1516 1531  
1517 1532          rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
1518 1533          blksz = dn->dn_datablksz;
1519 1534          blkshift = dn->dn_datablkshift;
1520 1535          epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
1521 1536  
1522 1537          if (len == DMU_OBJECT_END) {
1523 1538                  len = UINT64_MAX - off;
1524 1539                  trunc = TRUE;
1525 1540          }
1526 1541  
1527 1542          /*
1528 1543           * First, block align the region to free:
1529 1544           */
1530 1545          if (ISP2(blksz)) {
1531 1546                  head = P2NPHASE(off, blksz);
1532 1547                  blkoff = P2PHASE(off, blksz);
1533 1548                  if ((off >> blkshift) > dn->dn_maxblkid)
1534 1549                          goto out;
1535 1550          } else {
1536 1551                  ASSERT(dn->dn_maxblkid == 0);
1537 1552                  if (off == 0 && len >= blksz) {
1538 1553                          /*
1539 1554                           * Freeing the whole block; fast-track this request.
1540 1555                           * Note that we won't dirty any indirect blocks,
1541 1556                           * which is fine because we will be freeing the entire
1542 1557                           * file and thus all indirect blocks will be freed
1543 1558                           * by free_children().
1544 1559                           */
1545 1560                          blkid = 0;
1546 1561                          nblks = 1;
1547 1562                          goto done;
1548 1563                  } else if (off >= blksz) {
1549 1564                          /* Freeing past end-of-data */
1550 1565                          goto out;
1551 1566                  } else {
1552 1567                          /* Freeing part of the block. */
1553 1568                          head = blksz - off;
1554 1569                          ASSERT3U(head, >, 0);
1555 1570                  }
1556 1571                  blkoff = off;
1557 1572          }
1558 1573          /* zero out any partial block data at the start of the range */
1559 1574          if (head) {
1560 1575                  ASSERT3U(blkoff + head, ==, blksz);
1561 1576                  if (len < head)
1562 1577                          head = len;
1563 1578                  if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, 0, off),
1564 1579                      TRUE, FALSE, FTAG, &db) == 0) {
1565 1580                          caddr_t data;
1566 1581  
1567 1582                          /* don't dirty if it isn't on disk and isn't dirty */
1568 1583                          if (db->db_last_dirty ||
1569 1584                              (db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
1570 1585                                  rw_exit(&dn->dn_struct_rwlock);
1571 1586                                  dmu_buf_will_dirty(&db->db, tx);
1572 1587                                  rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
1573 1588                                  data = db->db.db_data;
1574 1589                                  bzero(data + blkoff, head);
1575 1590                          }
1576 1591                          dbuf_rele(db, FTAG);
1577 1592                  }
1578 1593                  off += head;
1579 1594                  len -= head;
1580 1595          }
1581 1596  
1582 1597          /* If the range was less than one block, we're done */
1583 1598          if (len == 0)
1584 1599                  goto out;
1585 1600  
1586 1601          /* If the remaining range is past end of file, we're done */
1587 1602          if ((off >> blkshift) > dn->dn_maxblkid)
1588 1603                  goto out;
1589 1604  
1590 1605          ASSERT(ISP2(blksz));
1591 1606          if (trunc)
1592 1607                  tail = 0;
1593 1608          else
1594 1609                  tail = P2PHASE(len, blksz);
1595 1610  
1596 1611          ASSERT0(P2PHASE(off, blksz));
1597 1612          /* zero out any partial block data at the end of the range */
1598 1613          if (tail) {
1599 1614                  if (len < tail)
1600 1615                          tail = len;
1601 1616                  if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, 0, off+len),
1602 1617                      TRUE, FALSE, FTAG, &db) == 0) {
1603 1618                          /* don't dirty if not on disk and not dirty */
1604 1619                          if (db->db_last_dirty ||
1605 1620                              (db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
1606 1621                                  rw_exit(&dn->dn_struct_rwlock);
1607 1622                                  dmu_buf_will_dirty(&db->db, tx);
1608 1623                                  rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
1609 1624                                  bzero(db->db.db_data, tail);
1610 1625                          }
1611 1626                          dbuf_rele(db, FTAG);
1612 1627                  }
1613 1628                  len -= tail;
1614 1629          }
1615 1630  
1616 1631          /* If the range did not include a full block, we are done */
1617 1632          if (len == 0)
1618 1633                  goto out;
1619 1634  
1620 1635          ASSERT(IS_P2ALIGNED(off, blksz));
1621 1636          ASSERT(trunc || IS_P2ALIGNED(len, blksz));
1622 1637          blkid = off >> blkshift;
1623 1638          nblks = len >> blkshift;
1624 1639          if (trunc)
1625 1640                  nblks += 1;
1626 1641  
1627 1642          /*
1628 1643           * Dirty all the indirect blocks in this range.  Note that only
1629 1644           * the first and last indirect blocks can actually be written
1630 1645           * (if they were partially freed) -- they must be dirtied, even if
1631 1646           * they do not exist on disk yet.  The interior blocks will
1632 1647           * be freed by free_children(), so they will not actually be written.
1633 1648           * Even though these interior blocks will not be written, we
1634 1649           * dirty them for two reasons:
1635 1650           *
1636 1651           *  - It ensures that the indirect blocks remain in memory until
1637 1652           *    syncing context.  (They have already been prefetched by
1638 1653           *    dmu_tx_hold_free(), so we don't have to worry about reading
1639 1654           *    them serially here.)
1640 1655           *
1641 1656           *  - The dirty space accounting will put pressure on the txg sync
1642 1657           *    mechanism to begin syncing, and to delay transactions if there
1643 1658           *    is a large amount of freeing.  Even though these indirect
1644 1659           *    blocks will not be written, we could need to write the same
1645 1660           *    amount of space if we copy the freed BPs into deadlists.
1646 1661           */
1647 1662          if (dn->dn_nlevels > 1) {
1648 1663                  uint64_t first, last;
1649 1664  
1650 1665                  first = blkid >> epbs;
1651 1666                  dnode_dirty_l1(dn, first, tx);
1652 1667                  if (trunc)
1653 1668                          last = dn->dn_maxblkid >> epbs;
1654 1669                  else
1655 1670                          last = (blkid + nblks - 1) >> epbs;
1656 1671                  if (last != first)
1657 1672                          dnode_dirty_l1(dn, last, tx);
1658 1673  
1659 1674                  int shift = dn->dn_datablkshift + dn->dn_indblkshift -
1660 1675                      SPA_BLKPTRSHIFT;
1661 1676                  for (uint64_t i = first + 1; i < last; i++) {
1662 1677                          /*
1663 1678                           * Set i to the blockid of the next non-hole
1664 1679                           * level-1 indirect block at or after i.  Note
1665 1680                           * that dnode_next_offset() operates in terms of
1666 1681                           * level-0-equivalent bytes.
1667 1682                           */
1668 1683                          uint64_t ibyte = i << shift;
1669 1684                          int err = dnode_next_offset(dn, DNODE_FIND_HAVELOCK,
1670 1685                              &ibyte, 2, 1, 0);
1671 1686                          i = ibyte >> shift;
1672 1687                          if (i >= last)
1673 1688                                  break;
1674 1689  
1675 1690                          /*
1676 1691                           * Normally we should not see an error, either
1677 1692                           * from dnode_next_offset() or dbuf_hold_level()
1678 1693                           * (except for ESRCH from dnode_next_offset).
1679 1694                           * If there is an i/o error, then when we read
1680 1695                           * this block in syncing context, it will use
1681 1696                           * ZIO_FLAG_MUSTSUCCEED, and thus hang/panic according
1682 1697                           * to the "failmode" property.  dnode_next_offset()
1683 1698                           * doesn't have a flag to indicate MUSTSUCCEED.
1684 1699                           */
1685 1700                          if (err != 0)
1686 1701                                  break;
1687 1702  
1688 1703                          dnode_dirty_l1(dn, i, tx);
1689 1704                  }

1690 1705          }
1691 1706  
1692 1707  done:
1693 1708          /*
1694 1709           * Add this range to the dnode range list.
1695 1710           * We will finish up this free operation in the syncing phase.
1696 1711           */
1697 1712          mutex_enter(&dn->dn_mtx);
1698 1713          int txgoff = tx->tx_txg & TXG_MASK;
1699 1714          if (dn->dn_free_ranges[txgoff] == NULL) {
1700      -                dn->dn_free_ranges[txgoff] = range_tree_create(NULL, NULL);
     1715 +                dn->dn_free_ranges[txgoff] =
     1716 +                    range_tree_create(NULL, NULL, &dn->dn_mtx);
1701 1717          }
1702 1718          range_tree_clear(dn->dn_free_ranges[txgoff], blkid, nblks);
1703 1719          range_tree_add(dn->dn_free_ranges[txgoff], blkid, nblks);
1704 1720          dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
1705 1721              blkid, nblks, tx->tx_txg);
1706 1722          mutex_exit(&dn->dn_mtx);
1707 1723  
1708 1724          dbuf_free_range(dn, blkid, blkid + nblks - 1, tx);
1709 1725          dnode_setdirty(dn, tx);
1710 1726  out:

1711 1727  
1712 1728          rw_exit(&dn->dn_struct_rwlock);
1713 1729  }
1714 1730  
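The head/tail alignment arithmetic in dnode_free_range() is easier to follow with concrete numbers. The sketch below is illustrative only: `free_split_t` and `split_free_range` are hypothetical names, and the helper covers just the power-of-two block size case, ignoring truncation (DMU_OBJECT_END), the dn_maxblkid checks, and all dbuf handling. The two macros are written out as they are commonly defined in illumos `<sys/sysmacros.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Written out here so the sketch is self-contained. */
#define	P2PHASE(x, align)	((x) & ((align) - 1))
#define	P2NPHASE(x, align)	(-(x) & ((align) - 1))

/* Hypothetical result type for this sketch. */
typedef struct free_split {
	uint64_t head;	/* bytes zeroed at the end of the first block */
	uint64_t tail;	/* bytes zeroed at the start of the last block */
	uint64_t blkid;	/* first whole block to free */
	uint64_t nblks;	/* number of whole blocks to free */
} free_split_t;

static free_split_t
split_free_range(uint64_t off, uint64_t len, uint64_t blksz, int blkshift)
{
	free_split_t fs = { 0, 0, 0, 0 };

	assert(blksz == (1ULL << blkshift));

	/* Partial block at the start: distance from off to the next boundary. */
	fs.head = P2NPHASE(off, blksz);
	if (len < fs.head)
		fs.head = len;
	off += fs.head;
	len -= fs.head;
	if (len == 0)
		return (fs);	/* range was smaller than one block */

	/* Partial block at the end: whatever does not fill a whole block. */
	fs.tail = P2PHASE(len, blksz);
	len -= fs.tail;

	/* What remains is block aligned on both sides. */
	fs.blkid = off >> blkshift;
	fs.nblks = len >> blkshift;
	return (fs);
}
```

For example, freeing 10000 bytes at offset 1000 with 4096-byte blocks zeroes 3096 bytes at the end of block 0, frees block 1 entirely, and zeroes 2808 bytes at the start of block 2.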
1715 1731  static boolean_t
1716 1732  dnode_spill_freed(dnode_t *dn)
1717 1733  {
1718 1734          int i;
1719 1735  
1720 1736          mutex_enter(&dn->dn_mtx);
1721 1737          for (i = 0; i < TXG_SIZE; i++) {
1722 1738                  if (dn->dn_rm_spillblk[i] == DN_KILL_SPILLBLK)
1723 1739                          break;
1724 1740          }
1725 1741          mutex_exit(&dn->dn_mtx);
1726 1742          return (i < TXG_SIZE);
1727 1743  }
1728 1744  
1729 1745  /* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */
1730 1746  uint64_t
1731 1747  dnode_block_freed(dnode_t *dn, uint64_t blkid)
1732 1748  {
1733 1749          void *dp = spa_get_dsl(dn->dn_objset->os_spa);
1734 1750          int i;
1735 1751  
1736 1752          if (blkid == DMU_BONUS_BLKID)
1737 1753                  return (FALSE);
1738 1754  
1739 1755          /*
1740 1756           * If we're in the process of opening the pool, dp will not be
1741 1757           * set yet, but there shouldn't be anything dirty.
1742 1758           */
1743 1759          if (dp == NULL)
1744 1760                  return (FALSE);
1745 1761  
1746 1762          if (dn->dn_free_txg)
1747 1763                  return (TRUE);
1748 1764  
1749 1765          if (blkid == DMU_SPILL_BLKID)
1750 1766                  return (dnode_spill_freed(dn));
1751 1767  
1752 1768          mutex_enter(&dn->dn_mtx);
1753 1769          for (i = 0; i < TXG_SIZE; i++) {
1754 1770                  if (dn->dn_free_ranges[i] != NULL &&
1755 1771                      range_tree_contains(dn->dn_free_ranges[i], blkid, 1))
1756 1772                          break;
1757 1773          }
1758 1774          mutex_exit(&dn->dn_mtx);
1759 1775          return (i < TXG_SIZE);
1760 1776  }
1761 1777  
1762 1778  /* call from syncing context when we actually write/free space for this dnode */
1763 1779  void
1764 1780  dnode_diduse_space(dnode_t *dn, int64_t delta)
1765 1781  {
1766 1782          uint64_t space;
1767 1783          dprintf_dnode(dn, "dn=%p dnp=%p used=%llu delta=%lld\n",
1768 1784              dn, dn->dn_phys,
1769 1785              (u_longlong_t)dn->dn_phys->dn_used,
1770 1786              (longlong_t)delta);
1771 1787  
1772 1788          mutex_enter(&dn->dn_mtx);
1773 1789          space = DN_USED_BYTES(dn->dn_phys);
1774 1790          if (delta > 0) {
1775 1791                  ASSERT3U(space + delta, >=, space); /* no overflow */
1776 1792          } else {
1777 1793                  ASSERT3U(space, >=, -delta); /* no underflow */
1778 1794          }
1779 1795          space += delta;
1780 1796          if (spa_version(dn->dn_objset->os_spa) < SPA_VERSION_DNODE_BYTES) {
1781 1797                  ASSERT((dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) == 0);
1782 1798                  ASSERT0(P2PHASE(space, 1<<DEV_BSHIFT));
1783 1799                  dn->dn_phys->dn_used = space >> DEV_BSHIFT;
1784 1800          } else {
1785 1801                  dn->dn_phys->dn_used = space;
1786 1802                  dn->dn_phys->dn_flags |= DNODE_FLAG_USED_BYTES;
1787 1803          }
1788 1804          mutex_exit(&dn->dn_mtx);
1789 1805  }
1790 1806  
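dnode_diduse_space() stores the space total in one of two encodings: 512-byte sectors on pools older than SPA_VERSION_DNODE_BYTES, or plain bytes with a flag set. The sketch below shows that round trip under stated assumptions: `used_rec_t`, `used_bytes`, and `apply_delta` are hypothetical stand-ins, the constant values are assumed rather than taken from the real headers, and locking is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch; check them against the real headers. */
#define	DEV_BSHIFT	9		/* 512-byte sectors */
#define	FLAG_USED_BYTES	(1 << 0)	/* stands in for DNODE_FLAG_USED_BYTES */

/* Hypothetical cut-down stand-in for the two dnode_phys_t fields involved. */
typedef struct used_rec {
	uint64_t used;
	uint8_t flags;
} used_rec_t;

/* Decode the stored value to bytes, whichever encoding is in use. */
static uint64_t
used_bytes(const used_rec_t *r)
{
	return ((r->flags & FLAG_USED_BYTES) ?
	    r->used : r->used << DEV_BSHIFT);
}

/* Apply a signed delta and store the result in the selected encoding. */
static void
apply_delta(used_rec_t *r, int64_t delta, int bytes_supported)
{
	uint64_t space = used_bytes(r);

	if (delta > 0)
		assert(space + (uint64_t)delta >= space);	/* no overflow */
	else
		assert(space >= 0 - (uint64_t)delta);	/* no underflow */
	space += (uint64_t)delta;

	if (!bytes_supported) {
		/* Legacy encoding: must be a whole number of sectors. */
		assert((r->flags & FLAG_USED_BYTES) == 0);
		assert((space & ((1ULL << DEV_BSHIFT) - 1)) == 0);
		r->used = space >> DEV_BSHIFT;
	} else {
		r->used = space;
		r->flags |= FLAG_USED_BYTES;
	}
}
```

Once the flag is set, the record stays in the byte encoding, which is why the legacy branch asserts that the flag is clear.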
1791 1807  /*
1792 1808   * Scans a block at the indicated "level" looking for a hole or data,
1793 1809   * depending on 'flags'.
1794 1810   *
1795 1811   * If level > 0, then we are scanning an indirect block looking at its
1796 1812   * pointers.  If level == 0, then we are looking at a block of dnodes.
1797 1813   *
1798 1814   * If we don't find what we are looking for in the block, we return ESRCH.
1799 1815   * Otherwise, return with *offset pointing to the beginning (if searching
1800 1816   * forwards) or end (if searching backwards) of the range covered by the
1801 1817   * block pointer we matched on (or dnode).
1802 1818   *
1803 1819   * The basic search algorithm used below by dnode_next_offset() is to
1804 1820   * use this function to search up the block tree (widen the search) until
1805 1821   * we find something (i.e., we don't return ESRCH) and then search back
1806 1822   * down the tree (narrow the search) until we reach our original search
1807 1823   * level.
1808 1824   */
1809 1825  static int
1810 1826  dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
1811 1827      int lvl, uint64_t blkfill, uint64_t txg)
1812 1828  {
1813 1829          dmu_buf_impl_t *db = NULL;
1814 1830          void *data = NULL;
1815 1831          uint64_t epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
1816 1832          uint64_t epb = 1ULL << epbs;
1817 1833          uint64_t minfill, maxfill;
1818 1834          boolean_t hole;
1819 1835          int i, inc, error, span;
1820 1836  
1821 1837          dprintf("probing object %llu offset %llx level %d of %u\n",
1822 1838              dn->dn_object, *offset, lvl, dn->dn_phys->dn_nlevels);
1823 1839  
1824 1840          hole = ((flags & DNODE_FIND_HOLE) != 0);
1825 1841          inc = (flags & DNODE_FIND_BACKWARDS) ? -1 : 1;
1826 1842          ASSERT(txg == 0 || !hole);
1827 1843  
1828 1844          if (lvl == dn->dn_phys->dn_nlevels) {
1829 1845                  error = 0;
1830 1846                  epb = dn->dn_phys->dn_nblkptr;
1831 1847                  data = dn->dn_phys->dn_blkptr;
1832 1848          } else {
1833 1849                  uint64_t blkid = dbuf_whichblock(dn, lvl, *offset);
1834 1850                  error = dbuf_hold_impl(dn, lvl, blkid, TRUE, FALSE, FTAG, &db);
1835 1851                  if (error) {
1836 1852                          if (error != ENOENT)
1837 1853                                  return (error);
1838 1854                          if (hole)
1839 1855                                  return (0);
1840 1856                          /*
1841 1857                           * This can only happen when we are searching up
1842 1858                           * the block tree for data.  We don't really need to
1843 1859                           * adjust the offset, as we will just end up looking
1844 1860                           * at the pointer to this block in its parent, and it's
1845 1861                           * going to be unallocated, so we will skip over it.
1846 1862                           */
1847 1863                          return (SET_ERROR(ESRCH));
1848 1864                  }
1849 1865                  error = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_HAVESTRUCT);
1850 1866                  if (error) {
1851 1867                          dbuf_rele(db, FTAG);
1852 1868                          return (error);
1853 1869                  }
1854 1870                  data = db->db.db_data;
1855 1871          }
1856 1872  
1857 1873  
1858 1874          if (db != NULL && txg != 0 && (db->db_blkptr == NULL ||
1859 1875              db->db_blkptr->blk_birth <= txg ||
1860 1876              BP_IS_HOLE(db->db_blkptr))) {
1861 1877                  /*
1862 1878                   * This can only happen when we are searching up the tree
1863 1879                   * and these conditions mean that we need to keep climbing.
1864 1880                   */
1865 1881                  error = SET_ERROR(ESRCH);
1866 1882          } else if (lvl == 0) {
1867 1883                  dnode_phys_t *dnp = data;
1868 1884                  span = DNODE_SHIFT;
1869 1885                  ASSERT(dn->dn_type == DMU_OT_DNODE);
1870 1886  
1871 1887                  for (i = (*offset >> span) & (blkfill - 1);
1872 1888                      i >= 0 && i < blkfill; i += inc) {
1873 1889                          if ((dnp[i].dn_type == DMU_OT_NONE) == hole)
1874 1890                                  break;
1875 1891                          *offset += (1ULL << span) * inc;
1876 1892                  }
1877 1893                  if (i < 0 || i == blkfill)
1878 1894                          error = SET_ERROR(ESRCH);
1879 1895          } else {
1880 1896                  blkptr_t *bp = data;
1881 1897                  uint64_t start = *offset;
1882 1898                  span = (lvl - 1) * epbs + dn->dn_datablkshift;
1883 1899                  minfill = 0;
1884 1900                  maxfill = blkfill << ((lvl - 1) * epbs);
1885 1901  
1886 1902                  if (hole)
1887 1903                          maxfill--;
1888 1904                  else
1889 1905                          minfill++;
1890 1906  
1891 1907                  *offset = *offset >> span;
1892 1908                  for (i = BF64_GET(*offset, 0, epbs);
1893 1909                      i >= 0 && i < epb; i += inc) {
1894 1910                          if (BP_GET_FILL(&bp[i]) >= minfill &&
1895 1911                              BP_GET_FILL(&bp[i]) <= maxfill &&
1896 1912                              (hole || bp[i].blk_birth > txg))
1897 1913                                  break;
1898 1914                          if (inc > 0 || *offset > 0)
1899 1915                                  *offset += inc;
1900 1916                  }
1901 1917                  *offset = *offset << span;
1902 1918                  if (inc < 0) {
1903 1919                          /* traversing backwards; position offset at the end */
1904 1920                          ASSERT3U(*offset, <=, start);
1905 1921                          *offset = MIN(*offset + (1ULL << span) - 1, start);
1906 1922                  } else if (*offset < start) {
1907 1923                          *offset = start;
1908 1924                  }
1909 1925                  if (i < 0 || i >= epb)
1910 1926                          error = SET_ERROR(ESRCH);
1911 1927          }
1912 1928  
1913 1929          if (db)
1914 1930                  dbuf_rele(db, FTAG);
1915 1931  
1916 1932          return (error);
1917 1933  }
1918 1934  
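In the indirect-block branch of dnode_next_offset_level(), the byte offset is turned into a starting slot by shifting right by `span` and keeping the low `epbs` bits. The helper below is a hypothetical illustration of that arithmetic only (`bp_index_at_level` does not exist in dnode.c), with `datablkshift` and `epbs` passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the block pointer, within a level-`lvl`
 * indirect block, that covers byte `offset`.  Each pointer at level
 * `lvl` spans 2^span bytes of level-0 data.
 */
static uint64_t
bp_index_at_level(uint64_t offset, int lvl, int datablkshift, int epbs)
{
	int span;

	assert(lvl >= 1 && epbs > 0 && epbs < 64);
	span = (lvl - 1) * epbs + datablkshift;
	assert(span < 64);

	/* Keep only the low epbs bits, i.e. the slot within one block. */
	return ((offset >> span) & ((1ULL << epbs) - 1));
}
```

With 128K data blocks (datablkshift = 17) and epbs = 10, the offset (3 << 27) + (5 << 17) maps to slot 5 in its level-1 indirect block and slot 3 in its level-2 indirect block.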
1919 1935  /*
1920 1936   * Find the next hole, data, or sparse region at or after *offset.
1921 1937   * The value 'blkfill' tells us how many items we expect to find
1922 1938   * in an L0 data block; this value is 1 for normal objects,
1923 1939   * DNODES_PER_BLOCK for the meta dnode, and some fraction of
1924 1940   * DNODES_PER_BLOCK when searching for sparse regions thereof.
1925 1941   *
1926 1942   * Examples:
1927 1943   *
1928 1944   * dnode_next_offset(dn, flags, offset, 1, 1, 0);
1929 1945   *      Finds the next/previous hole/data in a file.
1930 1946   *      Used in dmu_offset_next().
1931 1947   *
1932 1948   * dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg);
1933 1949   *      Finds the next free/allocated dnode in an objset's meta-dnode.
1934 1950   *      Only finds objects that have new contents since txg (i.e.
1935 1951   *      bonus buffer changes and content removal are ignored).
1936 1952   *      Used in dmu_object_next().
1937 1953   *
1938 1954   * dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0);
1939 1955   *      Finds the next L2 meta-dnode bp that's at most 1/4 full.
1940 1956   *      Used in dmu_object_alloc().
1941 1957   */
1942 1958  int
1943 1959  dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
1944 1960      int minlvl, uint64_t blkfill, uint64_t txg)
1945 1961  {
1946 1962          uint64_t initial_offset = *offset;
1947 1963          int lvl, maxlvl;
1948 1964          int error = 0;
1949 1965  
1950 1966          if (!(flags & DNODE_FIND_HAVELOCK))
1951 1967                  rw_enter(&dn->dn_struct_rwlock, RW_READER);
1952 1968  
1953 1969          if (dn->dn_phys->dn_nlevels == 0) {
1954 1970                  error = SET_ERROR(ESRCH);
1955 1971                  goto out;
1956 1972          }
1957 1973  
1958 1974          if (dn->dn_datablkshift == 0) {
1959 1975                  if (*offset < dn->dn_datablksz) {
1960 1976                          if (flags & DNODE_FIND_HOLE)
1961 1977                                  *offset = dn->dn_datablksz;
1962 1978                  } else {
1963 1979                          error = SET_ERROR(ESRCH);
1964 1980                  }
1965 1981                  goto out;
1966 1982          }
1967 1983  
1968 1984          maxlvl = dn->dn_phys->dn_nlevels;
1969 1985  
1970 1986          for (lvl = minlvl; lvl <= maxlvl; lvl++) {
1971 1987                  error = dnode_next_offset_level(dn,
1972 1988                      flags, offset, lvl, blkfill, txg);
1973 1989                  if (error != ESRCH)
1974 1990                          break;
1975 1991          }
1976 1992  
1977 1993          while (error == 0 && --lvl >= minlvl) {
1978 1994                  error = dnode_next_offset_level(dn,
1979 1995                      flags, offset, lvl, blkfill, txg);
1980 1996          }
1981 1997  
1982 1998          /*
1983 1999           * There's always a "virtual hole" at the end of the object, even
1984 2000           * if all BP's which physically exist are non-holes.
1985 2001           */
1986 2002          if ((flags & DNODE_FIND_HOLE) && error == ESRCH && txg == 0 &&
1987 2003              minlvl == 1 && blkfill == 1 && !(flags & DNODE_FIND_BACKWARDS)) {
1988 2004                  error = 0;


1989 2005          }
1990 2006  
1991 2007          if (error == 0 && (flags & DNODE_FIND_BACKWARDS ?
1992 2008              initial_offset < *offset : initial_offset > *offset))
1993 2009                  error = SET_ERROR(ESRCH);
1994 2010  out:
1995 2011          if (!(flags & DNODE_FIND_HAVELOCK))
1996 2012                  rw_exit(&dn->dn_struct_rwlock);
1997 2013  
1998 2014          return (error);
     2015 +}
     2016 +
     2017 +/*
     2018 + * When in the compressing phase, we check our results every 1 MiB. If
     2019 + * compression ratio drops below the threshold factor, we give up trying
     2020 + * to compress the file for a while. The length of that deny interval is
     2021 + * calculated from this interval value according to the algorithm in
     2022 + * smartcomp_check_comp.
     2023 + */
     2024 +uint64_t zfs_smartcomp_interval = 1 * 1024 * 1024;
     2025 +
     2026 +/*
     2027 + * Minimum space saving is 12.5% (100% / factor); below that we
     2028 + * consider compression to have failed.
     2029 + */
     2030 +uint64_t zfs_smartcomp_threshold_factor = 8;
     2031 +
     2032 +/*
     2033 + * Maximum power-of-2 exponent on the deny interval and consequently
     2034 + * the maximum number of compression successes and failures we track.
     2035 + * Successive compression failures extend the deny interval, whereas
     2036 + * repeated successes make the algorithm more hesitant to start denying.
     2037 + */
     2038 +int64_t zfs_smartcomp_interval_exp = 5;
     2039 +
     2040 +/*
     2041 + * Callback invoked by the zio machinery when it wants to compress a data
     2042 + * block. If we are in the denying compression phase, we add the amount of
     2043 + * data written to our stats and check if we've denied enough data to
     2044 + * transition back into the compression phase again.
     2045 + */
     2046 +boolean_t
     2047 +dnode_smartcomp_ask_cb(void *userinfo, const zio_t *zio)
     2048 +{
     2049 +        dnode_t *dn = userinfo;
     2050 +        dnode_smartcomp_t *sc;
     2051 +        dnode_smartcomp_state_t old_state;
     2052 +
     2053 +        ASSERT(dn != NULL);
     2054 +
     2055 +        sc = &dn->dn_smartcomp;
     2056 +        mutex_enter(&sc->sc_lock);
     2057 +        old_state = sc->sc_state;
     2058 +        if (sc->sc_state == DNODE_SMARTCOMP_DENYING) {
     2059 +                sc->sc_orig_size += zio->io_orig_size;
     2060 +                if (sc->sc_orig_size >= sc->sc_deny_interval) {
     2061 +                        /* time to retry compression on next call */
     2062 +                        sc->sc_state = DNODE_SMARTCOMP_COMPRESSING;
     2063 +                        sc->sc_size = 0;
     2064 +                        sc->sc_orig_size = 0;
     2065 +                }
     2066 +        }
     2067 +        mutex_exit(&sc->sc_lock);
     2068 +
     2069 +        return (old_state != DNODE_SMARTCOMP_DENYING);
     2070 +}
     2071 +
     2072 +/*
     2073 + * Callback invoked after compression has been performed to allow us to
     2074 + * monitor compression performance. If we're in a compressing phase, we
     2075 + * add the uncompressed and compressed data volumes to our state counters
     2076 + * and see if we need to recheck compression performance in
     2077 + * smartcomp_check_comp.
     2078 + */
     2079 +void
     2080 +dnode_smartcomp_result_cb(void *userinfo, const zio_t *zio)
     2081 +{
     2082 +        dnode_t *dn = userinfo;
     2083 +        dnode_smartcomp_t *sc;
     2084 +        uint64_t io_size = zio->io_size, io_orig_size = zio->io_orig_size;
     2085 +
     2086 +        ASSERT(dn != NULL);
     2087 +        sc = &dn->dn_smartcomp;
     2088 +
     2089 +        if (io_orig_size == 0)
     2090 +                /* XXX: is this valid anyway? */
     2091 +                return;
     2092 +
     2093 +        mutex_enter(&sc->sc_lock);
     2094 +        if (sc->sc_state == DNODE_SMARTCOMP_COMPRESSING) {
     2095 +                /* add last block's compression performance to our stats */
     2096 +                sc->sc_size += io_size;
     2097 +                sc->sc_orig_size += io_orig_size;
     2098 +                /* time to recheck compression performance? */
     2099 +                if (sc->sc_orig_size >= zfs_smartcomp_interval)
     2100 +                        smartcomp_check_comp(sc);
     2101 +        }
     2102 +        mutex_exit(&sc->sc_lock);
     2103 +}
     2104 +
     2105 +/*
     2106 + * This function checks whether the compression we've been getting is above
     2107 + * the threshold value. If it is, we reset sc_comp_failures to zero (or
     2108 + * decrement it if already non-positive) to indicate success. If it isn't,
     2109 + * we increment the counter and potentially start a compression deny phase.
     2110 + */
     2111 +static void
     2112 +smartcomp_check_comp(dnode_smartcomp_t *sc)
     2113 +{
     2114 +        uint64_t threshold = sc->sc_orig_size -
     2115 +            sc->sc_orig_size / zfs_smartcomp_threshold_factor;
     2116 +
     2117 +        ASSERT(MUTEX_HELD(&sc->sc_lock));
     2118 +        if (sc->sc_size > threshold) {
     2119 +                sc->sc_comp_failures =
     2120 +                    MIN(sc->sc_comp_failures + 1, zfs_smartcomp_interval_exp);
     2121 +                if (sc->sc_comp_failures > 0) {
     2122 +                        /* consistently getting too little compression, stop */
     2123 +                        sc->sc_state = DNODE_SMARTCOMP_DENYING;
     2124 +                        sc->sc_deny_interval =
     2125 +                            zfs_smartcomp_interval << sc->sc_comp_failures;
     2126 +                        /* randomize the interval by +-10% to avoid patterns */
     2127 +                        sc->sc_deny_interval = (sc->sc_deny_interval -
     2128 +                            (sc->sc_deny_interval / 10)) +
     2129 +                            spa_get_random(sc->sc_deny_interval / 5 + 1);
     2130 +                }
     2131 +        } else {
     2132 +                if (sc->sc_comp_failures > 0) {
     2133 +                        /*
     2134 +                         * We're biased for compression, so any success makes
     2135 +                         * us forget the file's past incompressibility.
     2136 +                         */
     2137 +                        sc->sc_comp_failures = 0;
     2138 +                } else {
     2139 +                        sc->sc_comp_failures = MAX(sc->sc_comp_failures - 1,
     2140 +                            -zfs_smartcomp_interval_exp);
     2141 +                }
     2142 +        }
     2143 +        /* reset state counters */
     2144 +        sc->sc_size = 0;
     2145 +        sc->sc_orig_size = 0;
     2146 +}
     2147 +
     2148 +/*
     2149 + * Prepares a zio_smartcomp_info_t structure for passing to zio_write or
     2150 + * arc_write depending on whether smart compression should be applied to
     2151 + * the specified objset, dnode and buffer.
     2152 + */
     2153 +extern void
     2154 +dnode_setup_zio_smartcomp(dmu_buf_impl_t *db, zio_smartcomp_info_t *sc)
     2155 +{
     2156 +        dnode_t *dn = DB_DNODE(db);
     2157 +        objset_t *os = dn->dn_objset;
     2158 +
     2159 +        /* Only do smart compression on user data of plain files. */
     2160 +        if (dn->dn_type == DMU_OT_PLAIN_FILE_CONTENTS && db->db_level == 0 &&
     2161 +            os->os_smartcomp_enabled && os->os_compress != ZIO_COMPRESS_OFF) {
     2162 +                sc->sc_ask = dnode_smartcomp_ask_cb;
     2163 +                sc->sc_result = dnode_smartcomp_result_cb;
     2164 +                sc->sc_userinfo = dn;
     2165 +        } else {
     2166 +                /*
     2167 +                 * Zeroing out the structure passed to zio_write will turn
     2168 +                 * smart compression off.
     2169 +                 */
     2170 +                bzero(sc, sizeof (*sc));
     2171 +        }
1999 2172  }
