NEX-20218 Backport Illumos #9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
MFV illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
    9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Andriy Gapon <avg@FreeBSD.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-20208 Backport Illumos #9993 zil writes can get delayed in zio pipeline
MFV illumos-gate@2258ad0b755b24a55c6173b1e6bb6188389f72dd
    9993 zil writes can get delayed in zio pipeline
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15067 KRRP: system panics during ZFS-receive: assertion failed: arc_can_share(hdr, buf)
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8065 ZFS doesn't notice when disk vdevs have no write cache
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
NEX-5856 ddt_capped isn't reset when deduped dataset is destroyed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5188 Removed special-vdev causes panic on read or on get size of special-bp
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4794 Write Back Cache sync and async writes: adjust routing according to watermark limits
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4619 Want kstats to monitor TRIM and UNMAP operation
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-4003 WRC: System panics on debug build
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3411 Removal of small l2arc ddt vdev disables dedup despite enough RAM
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-1110 Odd zpool Latency Output
OS-70 remove zio timer code
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line); place each kmutex_t on a cache line by itself to avoid false sharing (illustrated in the sketch after this log)
Fixup merge results
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
SUP-504 Multiple disks being falsely failed/retired by new zio_timeout handling code
re #12770 rb4121 zio latency reports can produce false positives
re #12645 rb4073 Make vdev delay simulator independent of DEBUG
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #12616 rb4051 zfs_log_write()/dmu_sync() write once to special refactoring
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties
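
A note on the "Align mutex tables" entry above: the idea is easier to picture with a short sketch. The snippet below is illustrative only and is not part of this webrev; the padded_mutex_t and example_hash_locks names, the table size, and the assumption that sizeof (kmutex_t) is smaller than one 64-byte cache line are editorial, not the original change's.

#include <sys/types.h>
#include <sys/mutex.h>

#define	CACHE_LINE_SIZE	64

/*
 * Pad each lock out to a full cache line so that CPUs hashing to
 * adjacent table slots do not ping-pong the same line (false sharing).
 * Assumes sizeof (kmutex_t) < CACHE_LINE_SIZE.
 */
typedef struct padded_mutex {
	kmutex_t	pm_lock;
	char		pm_pad[CACHE_LINE_SIZE - sizeof (kmutex_t)];
} padded_mutex_t;

/* A hypothetical hash-lock table; slots are spaced one cache line apart. */
static padded_mutex_t	example_hash_locks[256];

The trade-off: each lock consumes a full cache line of memory, but lookups that land in neighboring slots no longer invalidate each other's lines.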
    
      
    
          --- old/usr/src/uts/common/fs/zfs/zio.c
          +++ new/usr/src/uts/common/fs/zfs/zio.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
       21 +
  21   22  /*
  22   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   24   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  24      - * Copyright (c) 2011 Nexenta Systems, Inc. All rights reserved.
  25   25   * Copyright (c) 2014 Integros [integros.com]
       26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
  26   27   */
  27   28  
  28   29  #include <sys/sysmacros.h>
  29   30  #include <sys/zfs_context.h>
  30   31  #include <sys/fm/fs/zfs.h>
  31   32  #include <sys/spa.h>
  32   33  #include <sys/txg.h>
  33   34  #include <sys/spa_impl.h>
  34   35  #include <sys/vdev_impl.h>
  35   36  #include <sys/zio_impl.h>
  36   37  #include <sys/zio_compress.h>
  37   38  #include <sys/zio_checksum.h>
  38   39  #include <sys/dmu_objset.h>
  39   40  #include <sys/arc.h>
  40   41  #include <sys/ddt.h>
  41   42  #include <sys/blkptr.h>
       43 +#include <sys/special.h>
       44 +#include <sys/blkptr.h>
  42   45  #include <sys/zfeature.h>
       46 +#include <sys/dkioc_free_util.h>
       47 +#include <sys/dsl_scan.h>
       48 +
  43   49  #include <sys/metaslab_impl.h>
  44   50  #include <sys/abd.h>
  45   51  
       52 +extern int zfs_txg_timeout;
       53 +
  46   54  /*
  47   55   * ==========================================================================
  48   56   * I/O type descriptions
  49   57   * ==========================================================================
  50   58   */
  51   59  const char *zio_type_name[ZIO_TYPES] = {
  52   60          "zio_null", "zio_read", "zio_write", "zio_free", "zio_claim",
  53   61          "zio_ioctl"
  54   62  };
  55   63  
  56   64  boolean_t zio_dva_throttle_enabled = B_TRUE;
  57   65  
  58   66  /*
  59   67   * ==========================================================================
  60   68   * I/O kmem caches
  61   69   * ==========================================================================
  
  62   70   */
  63   71  kmem_cache_t *zio_cache;
  64   72  kmem_cache_t *zio_link_cache;
  65   73  kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
  66   74  kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
  67   75  
  68   76  #ifdef _KERNEL
  69   77  extern vmem_t *zio_alloc_arena;
  70   78  #endif
  71   79  
  72      -#define ZIO_PIPELINE_CONTINUE           0x100
  73      -#define ZIO_PIPELINE_STOP               0x101
  74      -
  75   80  #define BP_SPANB(indblkshift, level) \
  76   81          (((uint64_t)1) << ((level) * ((indblkshift) - SPA_BLKPTRSHIFT)))
  77   82  #define COMPARE_META_LEVEL      0x80000000ul
       83 +
  78   84  /*
  79   85   * The following actions directly effect the spa's sync-to-convergence logic.
  80   86   * The values below define the sync pass when we start performing the action.
  81   87   * Care should be taken when changing these values as they directly impact
  82   88   * spa_sync() performance. Tuning these values may introduce subtle performance
  83   89   * pathologies and should only be done in the context of performance analysis.
  84   90   * These tunables will eventually be removed and replaced with #defines once
  85   91   * enough analysis has been done to determine optimal values.
  86   92   *
  87   93   * The 'zfs_sync_pass_deferred_free' pass must be greater than 1 to ensure that
  88   94   * regular blocks are not deferred.
  89   95   */
  90   96  int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
  91   97  int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
  92   98  int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */
  93   99  
  94  100  /*
  95  101   * An allocating zio is one that either currently has the DVA allocate
  96  102   * stage set or will have it later in its lifetime.
  97  103   */
  
  98  104  #define IO_IS_ALLOCATING(zio) ((zio)->io_orig_pipeline & ZIO_STAGE_DVA_ALLOCATE)
  99  105  
 100  106  boolean_t       zio_requeue_io_start_cut_in_line = B_TRUE;
 101  107  
 102  108  #ifdef ZFS_DEBUG
 103  109  int zio_buf_debug_limit = 16384;
 104  110  #else
 105  111  int zio_buf_debug_limit = 0;
 106  112  #endif
 107  113  
      114 +/*
      115 + * Fault insertion for stress testing
      116 + */
      117 +int zio_faulty_vdev_enabled = 0;
      118 +uint64_t zio_faulty_vdev_guid;
      119 +uint64_t zio_faulty_vdev_delay_us = 1000000;    /* 1 second */
      120 +
      121 +/*
      122 + * Tunable to allow for debugging SCSI UNMAP/SATA TRIM calls. Disabling
      123 + * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
      124 + * underlying storage.
      125 + */
      126 +boolean_t zfs_trim = B_TRUE;
      127 +uint64_t zfs_trim_min_ext_sz = 1 << 20; /* 1 MB */
      128 +
 108  129  static void zio_taskq_dispatch(zio_t *, zio_taskq_type_t, boolean_t);
 109  130  
 110  131  void
 111  132  zio_init(void)
 112  133  {
 113  134          size_t c;
 114  135          vmem_t *data_alloc_arena = NULL;
 115  136  
 116  137  #ifdef _KERNEL
 117  138          data_alloc_arena = zio_alloc_arena;
 118  139  #endif
 119  140          zio_cache = kmem_cache_create("zio_cache",
 120  141              sizeof (zio_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
 121  142          zio_link_cache = kmem_cache_create("zio_link_cache",
 122  143              sizeof (zio_link_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
 123  144  
 124  145          /*
 125  146           * For small buffers, we want a cache for each multiple of
 126  147           * SPA_MINBLOCKSIZE.  For larger buffers, we want a cache
 127  148           * for each quarter-power of 2.
 128  149           */
 129  150          for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
 130  151                  size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
 131  152                  size_t p2 = size;
 132  153                  size_t align = 0;
 133  154                  size_t cflags = (size > zio_buf_debug_limit) ? KMC_NODEBUG : 0;
 134  155  
 135  156                  while (!ISP2(p2))
 136  157                          p2 &= p2 - 1;
 137  158  
 138  159  #ifndef _KERNEL
 139  160                  /*
 140  161                   * If we are using watchpoints, put each buffer on its own page,
 141  162                   * to eliminate the performance overhead of trapping to the
 142  163                   * kernel when modifying a non-watched buffer that shares the
 143  164                   * page with a watched buffer.
 144  165                   */
 145  166                  if (arc_watch && !IS_P2ALIGNED(size, PAGESIZE))
 146  167                          continue;
 147  168  #endif
 148  169                  if (size <= 4 * SPA_MINBLOCKSIZE) {
 149  170                          align = SPA_MINBLOCKSIZE;
 150  171                  } else if (IS_P2ALIGNED(size, p2 >> 2)) {
 151  172                          align = MIN(p2 >> 2, PAGESIZE);
 152  173                  }
 153  174  
 154  175                  if (align != 0) {
 155  176                          char name[36];
 156  177                          (void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
 157  178                          zio_buf_cache[c] = kmem_cache_create(name, size,
 158  179                              align, NULL, NULL, NULL, NULL, NULL, cflags);
 159  180  
 160  181                          /*
 161  182                           * Since zio_data bufs do not appear in crash dumps, we
 162  183                           * pass KMC_NOTOUCH so that no allocator metadata is
 163  184                           * stored with the buffers.
 164  185                           */
 165  186                          (void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
 166  187                          zio_data_buf_cache[c] = kmem_cache_create(name, size,
 167  188                              align, NULL, NULL, NULL, NULL, data_alloc_arena,
 168  189                              cflags | KMC_NOTOUCH);
 169  190                  }
 170  191          }
 171  192  
 172  193          while (--c != 0) {
  
 173  194                  ASSERT(zio_buf_cache[c] != NULL);
 174  195                  if (zio_buf_cache[c - 1] == NULL)
 175  196                          zio_buf_cache[c - 1] = zio_buf_cache[c];
 176  197  
 177  198                  ASSERT(zio_data_buf_cache[c] != NULL);
 178  199                  if (zio_data_buf_cache[c - 1] == NULL)
 179  200                          zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
 180  201          }
 181  202  
 182  203          zio_inject_init();
      204 +
 183  205  }
 184  206  
 185  207  void
 186  208  zio_fini(void)
 187  209  {
 188  210          size_t c;
 189  211          kmem_cache_t *last_cache = NULL;
 190  212          kmem_cache_t *last_data_cache = NULL;
 191  213  
 192  214          for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
 193  215                  if (zio_buf_cache[c] != last_cache) {
 194  216                          last_cache = zio_buf_cache[c];
 195  217                          kmem_cache_destroy(zio_buf_cache[c]);
 196  218                  }
 197  219                  zio_buf_cache[c] = NULL;
 198  220  
 199  221                  if (zio_data_buf_cache[c] != last_data_cache) {
 200  222                          last_data_cache = zio_data_buf_cache[c];
 201  223                          kmem_cache_destroy(zio_data_buf_cache[c]);
 202  224                  }
 203  225                  zio_data_buf_cache[c] = NULL;
 204  226          }
 205  227  
 206  228          kmem_cache_destroy(zio_link_cache);
 207  229          kmem_cache_destroy(zio_cache);
 208  230  
 209  231          zio_inject_fini();
 210  232  }
 211  233  
 212  234  /*
 213  235   * ==========================================================================
 214  236   * Allocate and free I/O buffers
 215  237   * ==========================================================================
 216  238   */
 217  239  
 218  240  /*
 219  241   * Use zio_buf_alloc to allocate ZFS metadata.  This data will appear in a
 220  242   * crashdump if the kernel panics, so use it judiciously.  Obviously, it's
 221  243   * useful to inspect ZFS metadata, but if possible, we should avoid keeping
 222  244   * excess / transient data in-core during a crashdump.
 223  245   */
 224  246  void *
 225  247  zio_buf_alloc(size_t size)
 226  248  {
 227  249          size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 228  250  
 229  251          VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 230  252  
 231  253          return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
 232  254  }
 233  255  
 234  256  /*
 235  257   * Use zio_data_buf_alloc to allocate data.  The data will not appear in a
 236  258   * crashdump if the kernel panics.  This exists so that we will limit the amount
 237  259   * of ZFS data that shows up in a kernel crashdump.  (Thus reducing the amount
 238  260   * of kernel heap dumped to disk when the kernel panics)
 239  261   */
 240  262  void *
 241  263  zio_data_buf_alloc(size_t size)
 242  264  {
 243  265          size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 244  266  
 245  267          VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 246  268  
 247  269          return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
 248  270  }
 249  271  
 250  272  void
 251  273  zio_buf_free(void *buf, size_t size)
 252  274  {
 253  275          size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 254  276  
 255  277          VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 256  278  
 257  279          kmem_cache_free(zio_buf_cache[c], buf);
 258  280  }
 259  281  
 260  282  void
 261  283  zio_data_buf_free(void *buf, size_t size)
 262  284  {
 263  285          size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 264  286  
 265  287          VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 266  288  
 267  289          kmem_cache_free(zio_data_buf_cache[c], buf);
 268  290  }
 269  291  
 270  292  /*
 271  293   * ==========================================================================
 272  294   * Push and pop I/O transform buffers
 273  295   * ==========================================================================
 274  296   */
 275  297  void
 276  298  zio_push_transform(zio_t *zio, abd_t *data, uint64_t size, uint64_t bufsize,
 277  299      zio_transform_func_t *transform)
 278  300  {
 279  301          zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);
 280  302  
 281  303          /*
 282  304           * Ensure that anyone expecting this zio to contain a linear ABD isn't
 283  305           * going to get a nasty surprise when they try to access the data.
 284  306           */
 285  307          IMPLY(abd_is_linear(zio->io_abd), abd_is_linear(data));
 286  308  
 287  309          zt->zt_orig_abd = zio->io_abd;
 288  310          zt->zt_orig_size = zio->io_size;
 289  311          zt->zt_bufsize = bufsize;
 290  312          zt->zt_transform = transform;
 291  313  
 292  314          zt->zt_next = zio->io_transform_stack;
 293  315          zio->io_transform_stack = zt;
 294  316  
 295  317          zio->io_abd = data;
 296  318          zio->io_size = size;
 297  319  }
 298  320  
 299  321  void
 300  322  zio_pop_transforms(zio_t *zio)
 301  323  {
 302  324          zio_transform_t *zt;
 303  325  
 304  326          while ((zt = zio->io_transform_stack) != NULL) {
 305  327                  if (zt->zt_transform != NULL)
 306  328                          zt->zt_transform(zio,
 307  329                              zt->zt_orig_abd, zt->zt_orig_size);
 308  330  
 309  331                  if (zt->zt_bufsize != 0)
 310  332                          abd_free(zio->io_abd);
 311  333  
 312  334                  zio->io_abd = zt->zt_orig_abd;
 313  335                  zio->io_size = zt->zt_orig_size;
 314  336                  zio->io_transform_stack = zt->zt_next;
 315  337  
 316  338                  kmem_free(zt, sizeof (zio_transform_t));
 317  339          }
 318  340  }
 319  341  
 320  342  /*
 321  343   * ==========================================================================
 322  344   * I/O transform callbacks for subblocks and decompression
 323  345   * ==========================================================================
 324  346   */
 325  347  static void
 326  348  zio_subblock(zio_t *zio, abd_t *data, uint64_t size)
 327  349  {
 328  350          ASSERT(zio->io_size > size);
 329  351  
 330  352          if (zio->io_type == ZIO_TYPE_READ)
 331  353                  abd_copy(data, zio->io_abd, size);
 332  354  }
 333  355  
 334  356  static void
 335  357  zio_decompress(zio_t *zio, abd_t *data, uint64_t size)
 336  358  {
 337  359          if (zio->io_error == 0) {
 338  360                  void *tmp = abd_borrow_buf(data, size);
 339  361                  int ret = zio_decompress_data(BP_GET_COMPRESS(zio->io_bp),
 340  362                      zio->io_abd, tmp, zio->io_size, size);
 341  363                  abd_return_buf_copy(data, tmp, size);
 342  364  
 343  365                  if (ret != 0)
 344  366                          zio->io_error = SET_ERROR(EIO);
 345  367          }
 346  368  }
 347  369  
 348  370  /*
 349  371   * ==========================================================================
 350  372   * I/O parent/child relationships and pipeline interlocks
 351  373   * ==========================================================================
 352  374   */
 353  375  zio_t *
 354  376  zio_walk_parents(zio_t *cio, zio_link_t **zl)
 355  377  {
 356  378          list_t *pl = &cio->io_parent_list;
 357  379  
 358  380          *zl = (*zl == NULL) ? list_head(pl) : list_next(pl, *zl);
 359  381          if (*zl == NULL)
 360  382                  return (NULL);
 361  383  
 362  384          ASSERT((*zl)->zl_child == cio);
 363  385          return ((*zl)->zl_parent);
 364  386  }
 365  387  
 366  388  zio_t *
 367  389  zio_walk_children(zio_t *pio, zio_link_t **zl)
 368  390  {
 369  391          list_t *cl = &pio->io_child_list;
 370  392  
 371  393          *zl = (*zl == NULL) ? list_head(cl) : list_next(cl, *zl);
 372  394          if (*zl == NULL)
 373  395                  return (NULL);
 374  396  
 375  397          ASSERT((*zl)->zl_parent == pio);
 376  398          return ((*zl)->zl_child);
 377  399  }
 378  400  
 379  401  zio_t *
 380  402  zio_unique_parent(zio_t *cio)
 381  403  {
 382  404          zio_link_t *zl = NULL;
 383  405          zio_t *pio = zio_walk_parents(cio, &zl);
 384  406  
 385  407          VERIFY3P(zio_walk_parents(cio, &zl), ==, NULL);
 386  408          return (pio);
 387  409  }
 388  410  
 389  411  void
 390  412  zio_add_child(zio_t *pio, zio_t *cio)
 391  413  {
 392  414          zio_link_t *zl = kmem_cache_alloc(zio_link_cache, KM_SLEEP);
 393  415  
 394  416          /*
 395  417           * Logical I/Os can have logical, gang, or vdev children.
 396  418           * Gang I/Os can have gang or vdev children.
 397  419           * Vdev I/Os can only have vdev children.
 398  420           * The following ASSERT captures all of these constraints.
 399  421           */
 400  422          ASSERT3S(cio->io_child_type, <=, pio->io_child_type);
 401  423  
 402  424          zl->zl_parent = pio;
 403  425          zl->zl_child = cio;
 404  426  
 405  427          mutex_enter(&cio->io_lock);
 406  428          mutex_enter(&pio->io_lock);
 407  429  
 408  430          ASSERT(pio->io_state[ZIO_WAIT_DONE] == 0);
 409  431  
 410  432          for (int w = 0; w < ZIO_WAIT_TYPES; w++)
 411  433                  pio->io_children[cio->io_child_type][w] += !cio->io_state[w];
 412  434  
 413  435          list_insert_head(&pio->io_child_list, zl);
 414  436          list_insert_head(&cio->io_parent_list, zl);
 415  437  
 416  438          pio->io_child_count++;
 417  439          cio->io_parent_count++;
 418  440  
 419  441          mutex_exit(&pio->io_lock);
 420  442          mutex_exit(&cio->io_lock);
 421  443  }
 422  444  
 423  445  static void
 424  446  zio_remove_child(zio_t *pio, zio_t *cio, zio_link_t *zl)
 425  447  {
 426  448          ASSERT(zl->zl_parent == pio);
 427  449          ASSERT(zl->zl_child == cio);
 428  450  
 429  451          mutex_enter(&cio->io_lock);
 430  452          mutex_enter(&pio->io_lock);
 431  453  
 432  454          list_remove(&pio->io_child_list, zl);
 433  455          list_remove(&cio->io_parent_list, zl);
 434  456  
  
 435  457          pio->io_child_count--;
 436  458          cio->io_parent_count--;
 437  459  
 438  460          mutex_exit(&pio->io_lock);
 439  461          mutex_exit(&cio->io_lock);
 440  462  
 441  463          kmem_cache_free(zio_link_cache, zl);
 442  464  }
 443  465  
 444  466  static boolean_t
 445      -zio_wait_for_children(zio_t *zio, uint8_t childbits, enum zio_wait_type wait)
      467 +zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
 446  468  {
      469 +        uint64_t *countp = &zio->io_children[child][wait];
 447  470          boolean_t waiting = B_FALSE;
 448  471  
 449  472          mutex_enter(&zio->io_lock);
 450  473          ASSERT(zio->io_stall == NULL);
 451      -        for (int c = 0; c < ZIO_CHILD_TYPES; c++) {
 452      -                if (!(ZIO_CHILD_BIT_IS_SET(childbits, c)))
 453      -                        continue;
 454      -
 455      -                uint64_t *countp = &zio->io_children[c][wait];
 456      -                if (*countp != 0) {
 457      -                        zio->io_stage >>= 1;
 458      -                        ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
 459      -                        zio->io_stall = countp;
 460      -                        waiting = B_TRUE;
 461      -                        break;
 462      -                }
      474 +        if (*countp != 0) {
      475 +                zio->io_stage >>= 1;
      476 +                ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
      477 +                zio->io_stall = countp;
      478 +                waiting = B_TRUE;
 463  479          }
 464  480          mutex_exit(&zio->io_lock);
      481 +
 465  482          return (waiting);
 466  483  }
 467  484  
 468  485  static void
 469  486  zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
 470  487  {
 471  488          uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
 472  489          int *errorp = &pio->io_child_error[zio->io_child_type];
 473  490  
 474  491          mutex_enter(&pio->io_lock);
 475  492          if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
 476  493                  *errorp = zio_worst_error(*errorp, zio->io_error);
 477  494          pio->io_reexecute |= zio->io_reexecute;
 478  495          ASSERT3U(*countp, >, 0);
 479  496  
 480  497          (*countp)--;
 481  498  
 482  499          if (*countp == 0 && pio->io_stall == countp) {
 483  500                  zio_taskq_type_t type =
 484  501                      pio->io_stage < ZIO_STAGE_VDEV_IO_START ? ZIO_TASKQ_ISSUE :
 485  502                      ZIO_TASKQ_INTERRUPT;
 486  503                  pio->io_stall = NULL;
 487  504                  mutex_exit(&pio->io_lock);
 488  505                  /*
 489  506                   * Dispatch the parent zio in its own taskq so that
 490  507                   * the child can continue to make progress. This also
 491  508                   * prevents overflowing the stack when we have deeply nested
 492  509                   * parent-child relationships.
 493  510                   */
 494  511                  zio_taskq_dispatch(pio, type, B_FALSE);
 495  512          } else {
 496  513                  mutex_exit(&pio->io_lock);
 497  514          }
 498  515  }
 499  516  
 500  517  static void
 501  518  zio_inherit_child_errors(zio_t *zio, enum zio_child c)
 502  519  {
 503  520          if (zio->io_child_error[c] != 0 && zio->io_error == 0)
 504  521                  zio->io_error = zio->io_child_error[c];
 505  522  }
 506  523  
 507  524  int
 508  525  zio_bookmark_compare(const void *x1, const void *x2)
 509  526  {
 510  527          const zio_t *z1 = x1;
 511  528          const zio_t *z2 = x2;
 512  529  
 513  530          if (z1->io_bookmark.zb_objset < z2->io_bookmark.zb_objset)
 514  531                  return (-1);
 515  532          if (z1->io_bookmark.zb_objset > z2->io_bookmark.zb_objset)
 516  533                  return (1);
 517  534  
 518  535          if (z1->io_bookmark.zb_object < z2->io_bookmark.zb_object)
 519  536                  return (-1);
 520  537          if (z1->io_bookmark.zb_object > z2->io_bookmark.zb_object)
 521  538                  return (1);
 522  539  
 523  540          if (z1->io_bookmark.zb_level < z2->io_bookmark.zb_level)
 524  541                  return (-1);
 525  542          if (z1->io_bookmark.zb_level > z2->io_bookmark.zb_level)
 526  543                  return (1);
 527  544  
 528  545          if (z1->io_bookmark.zb_blkid < z2->io_bookmark.zb_blkid)
 529  546                  return (-1);
 530  547          if (z1->io_bookmark.zb_blkid > z2->io_bookmark.zb_blkid)
 531  548                  return (1);
 532  549  
 533  550          if (z1 < z2)
 534  551                  return (-1);
 535  552          if (z1 > z2)
 536  553                  return (1);
 537  554  
 538  555          return (0);
 539  556  }
 540  557  
 541  558  /*
 542  559   * ==========================================================================
 543  560   * Create the various types of I/O (read, write, free, etc)
 544  561   * ==========================================================================
 545  562   */
 546  563  static zio_t *
 547  564  zio_create(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
 548  565      abd_t *data, uint64_t lsize, uint64_t psize, zio_done_func_t *done,
 549  566      void *private, zio_type_t type, zio_priority_t priority,
 550  567      enum zio_flag flags, vdev_t *vd, uint64_t offset,
 551  568      const zbookmark_phys_t *zb, enum zio_stage stage, enum zio_stage pipeline)
 552  569  {
 553  570          zio_t *zio;
 554  571  
 555  572          ASSERT3U(psize, <=, SPA_MAXBLOCKSIZE);
 556  573          ASSERT(P2PHASE(psize, SPA_MINBLOCKSIZE) == 0);
 557  574          ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);
 558  575  
 559  576          ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));
 560  577          ASSERT(!bp || !(flags & ZIO_FLAG_CONFIG_WRITER));
 561  578          ASSERT(vd || stage == ZIO_STAGE_OPEN);
 562  579  
 563  580          IMPLY(lsize != psize, (flags & ZIO_FLAG_RAW) != 0);
 564  581  
 565  582          zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
 566  583          bzero(zio, sizeof (zio_t));
 567  584  
 568  585          mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
 569  586          cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);
 570  587  
 571  588          list_create(&zio->io_parent_list, sizeof (zio_link_t),
 572  589              offsetof(zio_link_t, zl_parent_node));
 573  590          list_create(&zio->io_child_list, sizeof (zio_link_t),
 574  591              offsetof(zio_link_t, zl_child_node));
 575  592          metaslab_trace_init(&zio->io_alloc_list);
 576  593  
 577  594          if (vd != NULL)
 578  595                  zio->io_child_type = ZIO_CHILD_VDEV;
 579  596          else if (flags & ZIO_FLAG_GANG_CHILD)
 580  597                  zio->io_child_type = ZIO_CHILD_GANG;
 581  598          else if (flags & ZIO_FLAG_DDT_CHILD)
 582  599                  zio->io_child_type = ZIO_CHILD_DDT;
 583  600          else
 584  601                  zio->io_child_type = ZIO_CHILD_LOGICAL;
 585  602  
 586  603          if (bp != NULL) {
 587  604                  zio->io_bp = (blkptr_t *)bp;
 588  605                  zio->io_bp_copy = *bp;
 589  606                  zio->io_bp_orig = *bp;
 590  607                  if (type != ZIO_TYPE_WRITE ||
 591  608                      zio->io_child_type == ZIO_CHILD_DDT)
 592  609                          zio->io_bp = &zio->io_bp_copy;  /* so caller can free */
 593  610                  if (zio->io_child_type == ZIO_CHILD_LOGICAL)
 594  611                          zio->io_logical = zio;
 595  612                  if (zio->io_child_type > ZIO_CHILD_GANG && BP_IS_GANG(bp))
 596  613                          pipeline |= ZIO_GANG_STAGES;
 597  614          }
 598  615  
 599  616          zio->io_spa = spa;
 600  617          zio->io_txg = txg;
 601  618          zio->io_done = done;
 602  619          zio->io_private = private;
 603  620          zio->io_type = type;
 604  621          zio->io_priority = priority;
 605  622          zio->io_vd = vd;
 606  623          zio->io_offset = offset;
 607  624          zio->io_orig_abd = zio->io_abd = data;
 608  625          zio->io_orig_size = zio->io_size = psize;
 609  626          zio->io_lsize = lsize;
 610  627          zio->io_orig_flags = zio->io_flags = flags;
 611  628          zio->io_orig_stage = zio->io_stage = stage;
  
 612  629          zio->io_orig_pipeline = zio->io_pipeline = pipeline;
 613  630          zio->io_pipeline_trace = ZIO_STAGE_OPEN;
 614  631  
 615  632          zio->io_state[ZIO_WAIT_READY] = (stage >= ZIO_STAGE_READY);
 616  633          zio->io_state[ZIO_WAIT_DONE] = (stage >= ZIO_STAGE_DONE);
 617  634  
 618  635          if (zb != NULL)
 619  636                  zio->io_bookmark = *zb;
 620  637  
 621  638          if (pio != NULL) {
      639 +                zio->io_mc = pio->io_mc;
 622  640                  if (zio->io_logical == NULL)
 623  641                          zio->io_logical = pio->io_logical;
 624  642                  if (zio->io_child_type == ZIO_CHILD_GANG)
 625  643                          zio->io_gang_leader = pio->io_gang_leader;
 626  644                  zio_add_child(pio, zio);
      645 +
      646 +                /* copy the smartcomp setting when creating child zio's */
      647 +                bcopy(&pio->io_smartcomp, &zio->io_smartcomp,
      648 +                    sizeof (zio->io_smartcomp));
 627  649          }
 628  650  
 629  651          return (zio);
 630  652  }
 631  653  
 632  654  static void
 633  655  zio_destroy(zio_t *zio)
 634  656  {
 635  657          metaslab_trace_fini(&zio->io_alloc_list);
 636  658          list_destroy(&zio->io_parent_list);
 637  659          list_destroy(&zio->io_child_list);
 638  660          mutex_destroy(&zio->io_lock);
 639  661          cv_destroy(&zio->io_cv);
 640  662          kmem_cache_free(zio_cache, zio);
 641  663  }
 642  664  
 643  665  zio_t *
 644  666  zio_null(zio_t *pio, spa_t *spa, vdev_t *vd, zio_done_func_t *done,
 645  667      void *private, enum zio_flag flags)
 646  668  {
 647  669          zio_t *zio;
 648  670  
 649  671          zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
 650  672              ZIO_TYPE_NULL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
 651  673              ZIO_STAGE_OPEN, ZIO_INTERLOCK_PIPELINE);
 652  674  
 653  675          return (zio);
 654  676  }
  
 655  677  
 656  678  zio_t *
 657  679  zio_root(spa_t *spa, zio_done_func_t *done, void *private, enum zio_flag flags)
 658  680  {
 659  681          return (zio_null(NULL, spa, NULL, done, private, flags));
 660  682  }
 661  683  
 662  684  void
 663  685  zfs_blkptr_verify(spa_t *spa, const blkptr_t *bp)
 664  686  {
      687 +        /*
      688 +         * SPECIAL-BP has two DVAs, but DVA[0] in this case is a
      689 +         * temporary DVA, and after migration only the DVA[1]
      690 +         * contains valid data. Therefore, we start walking for
      691 +         * these BPs from DVA[1].
      692 +         */
      693 +        int start_dva = BP_IS_SPECIAL(bp) ? 1 : 0;
      694 +
 665  695          if (!DMU_OT_IS_VALID(BP_GET_TYPE(bp))) {
 666  696                  zfs_panic_recover("blkptr at %p has invalid TYPE %llu",
 667  697                      bp, (longlong_t)BP_GET_TYPE(bp));
 668  698          }
 669  699          if (BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
 670  700              BP_GET_CHECKSUM(bp) <= ZIO_CHECKSUM_ON) {
 671  701                  zfs_panic_recover("blkptr at %p has invalid CHECKSUM %llu",
 672  702                      bp, (longlong_t)BP_GET_CHECKSUM(bp));
 673  703          }
 674  704          if (BP_GET_COMPRESS(bp) >= ZIO_COMPRESS_FUNCTIONS ||
 675  705              BP_GET_COMPRESS(bp) <= ZIO_COMPRESS_ON) {
 676  706                  zfs_panic_recover("blkptr at %p has invalid COMPRESS %llu",
 677  707                      bp, (longlong_t)BP_GET_COMPRESS(bp));
 678  708          }
 679  709          if (BP_GET_LSIZE(bp) > SPA_MAXBLOCKSIZE) {
 680  710                  zfs_panic_recover("blkptr at %p has invalid LSIZE %llu",
 681  711                      bp, (longlong_t)BP_GET_LSIZE(bp));
 682  712          }
 683  713          if (BP_GET_PSIZE(bp) > SPA_MAXBLOCKSIZE) {
 684  714                  zfs_panic_recover("blkptr at %p has invalid PSIZE %llu",
 685  715                      bp, (longlong_t)BP_GET_PSIZE(bp));
  
 686  716          }
 687  717  
 688  718          if (BP_IS_EMBEDDED(bp)) {
 689  719                  if (BPE_GET_ETYPE(bp) > NUM_BP_EMBEDDED_TYPES) {
 690  720                          zfs_panic_recover("blkptr at %p has invalid ETYPE %llu",
 691  721                              bp, (longlong_t)BPE_GET_ETYPE(bp));
 692  722                  }
 693  723          }
 694  724  
 695  725          /*
 696      -         * Do not verify individual DVAs if the config is not trusted. This
 697      -         * will be done once the zio is executed in vdev_mirror_map_alloc.
 698      -         */
 699      -        if (!spa->spa_trust_config)
 700      -                return;
 701      -
 702      -        /*
 703  726           * Pool-specific checks.
 704  727           *
 705  728           * Note: it would be nice to verify that the blk_birth and
 706  729           * BP_PHYSICAL_BIRTH() are not too large.  However, spa_freeze()
 707  730           * allows the birth time of log blocks (and dmu_sync()-ed blocks
 708  731           * that are in the log) to be arbitrarily large.
 709  732           */
 710      -        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
      733 +        for (int i = start_dva; i < BP_GET_NDVAS(bp); i++) {
 711  734                  uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[i]);
 712  735                  if (vdevid >= spa->spa_root_vdev->vdev_children) {
 713  736                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
 714  737                              "VDEV %llu",
 715  738                              bp, i, (longlong_t)vdevid);
 716  739                          continue;
 717  740                  }
 718  741                  vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
 719  742                  if (vd == NULL) {
 720  743                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
 721  744                              "VDEV %llu",
 722  745                              bp, i, (longlong_t)vdevid);
 723  746                          continue;
 724  747                  }
 725  748                  if (vd->vdev_ops == &vdev_hole_ops) {
 726  749                          zfs_panic_recover("blkptr at %p DVA %u has hole "
 727  750                              "VDEV %llu",
 728  751                              bp, i, (longlong_t)vdevid);
 729  752                          continue;
 730  753                  }
 731  754                  if (vd->vdev_ops == &vdev_missing_ops) {
 732  755                          /*
 733  756                           * "missing" vdevs are valid during import, but we
 734  757                           * don't have their detailed info (e.g. asize), so
 735  758                           * we can't perform any more checks on them.
 736  759                           */
 737  760                          continue;
 738  761                  }
 739  762                  uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[i]);
 740  763                  uint64_t asize = DVA_GET_ASIZE(&bp->blk_dva[i]);
  
 741  764                  if (BP_IS_GANG(bp))
 742  765                          asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
 743  766                  if (offset + asize > vd->vdev_asize) {
 744  767                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
 745  768                              "OFFSET %llu",
 746  769                              bp, i, (longlong_t)offset);
 747  770                  }
 748  771          }
 749  772  }
 750  773  
 751      -boolean_t
 752      -zfs_dva_valid(spa_t *spa, const dva_t *dva, const blkptr_t *bp)
 753      -{
 754      -        uint64_t vdevid = DVA_GET_VDEV(dva);
 755      -
 756      -        if (vdevid >= spa->spa_root_vdev->vdev_children)
 757      -                return (B_FALSE);
 758      -
 759      -        vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
 760      -        if (vd == NULL)
 761      -                return (B_FALSE);
 762      -
 763      -        if (vd->vdev_ops == &vdev_hole_ops)
 764      -                return (B_FALSE);
 765      -
 766      -        if (vd->vdev_ops == &vdev_missing_ops) {
 767      -                return (B_FALSE);
 768      -        }
 769      -
 770      -        uint64_t offset = DVA_GET_OFFSET(dva);
 771      -        uint64_t asize = DVA_GET_ASIZE(dva);
 772      -
 773      -        if (BP_IS_GANG(bp))
 774      -                asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
 775      -        if (offset + asize > vd->vdev_asize)
 776      -                return (B_FALSE);
 777      -
 778      -        return (B_TRUE);
 779      -}
 780      -
 781  774  zio_t *
 782  775  zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
 783  776      abd_t *data, uint64_t size, zio_done_func_t *done, void *private,
 784  777      zio_priority_t priority, enum zio_flag flags, const zbookmark_phys_t *zb)
 785  778  {
 786  779          zio_t *zio;
 787  780  
 788  781          zfs_blkptr_verify(spa, bp);
 789  782  
 790  783          zio = zio_create(pio, spa, BP_PHYSICAL_BIRTH(bp), bp,
 791  784              data, size, size, done, private,
 792  785              ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
 793  786              ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
 794  787              ZIO_DDT_CHILD_READ_PIPELINE : ZIO_READ_PIPELINE);
  
 795  788  
 796  789          return (zio);
 797  790  }
 798  791  
 799  792  zio_t *
 800  793  zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
 801  794      abd_t *data, uint64_t lsize, uint64_t psize, const zio_prop_t *zp,
 802  795      zio_done_func_t *ready, zio_done_func_t *children_ready,
 803  796      zio_done_func_t *physdone, zio_done_func_t *done,
 804  797      void *private, zio_priority_t priority, enum zio_flag flags,
 805      -    const zbookmark_phys_t *zb)
      798 +    const zbookmark_phys_t *zb,
      799 +    const zio_smartcomp_info_t *smartcomp)
 806  800  {
 807  801          zio_t *zio;
 808  802  
 809  803          ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
 810  804              zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
 811  805              zp->zp_compress >= ZIO_COMPRESS_OFF &&
 812  806              zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
 813  807              DMU_OT_IS_VALID(zp->zp_type) &&
 814  808              zp->zp_level < 32 &&
 815  809              zp->zp_copies > 0 &&
 816  810              zp->zp_copies <= spa_max_replication(spa));
  
 817  811  
 818  812          zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
 819  813              ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
 820  814              ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
 821  815              ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);
 822  816  
 823  817          zio->io_ready = ready;
 824  818          zio->io_children_ready = children_ready;
 825  819          zio->io_physdone = physdone;
 826  820          zio->io_prop = *zp;
      821 +        if (smartcomp != NULL)
      822 +                bcopy(smartcomp, &zio->io_smartcomp, sizeof (*smartcomp));
 827  823  
 828  824          /*
 829  825           * Data can be NULL if we are going to call zio_write_override() to
 830  826           * provide the already-allocated BP.  But we may need the data to
 831  827           * verify a dedup hit (if requested).  In this case, don't try to
 832  828           * dedup (just take the already-allocated BP verbatim).
 833  829           */
 834  830          if (data == NULL && zio->io_prop.zp_dedup_verify) {
 835  831                  zio->io_prop.zp_dedup = zio->io_prop.zp_dedup_verify = B_FALSE;
 836  832          }
 837  833  
 838  834          return (zio);
 839  835  }
 840  836  
 841  837  zio_t *
 842  838  zio_rewrite(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, abd_t *data,
 843  839      uint64_t size, zio_done_func_t *done, void *private,
 844  840      zio_priority_t priority, enum zio_flag flags, zbookmark_phys_t *zb)
 845  841  {
 846  842          zio_t *zio;
 847  843  
 848  844          zio = zio_create(pio, spa, txg, bp, data, size, size, done, private,
 849  845              ZIO_TYPE_WRITE, priority, flags | ZIO_FLAG_IO_REWRITE, NULL, 0, zb,
 850  846              ZIO_STAGE_OPEN, ZIO_REWRITE_PIPELINE);
 851  847  
 852  848          return (zio);
 853  849  }
 854  850  
 855  851  void
 856  852  zio_write_override(zio_t *zio, blkptr_t *bp, int copies, boolean_t nopwrite)
 857  853  {
 858  854          ASSERT(zio->io_type == ZIO_TYPE_WRITE);
 859  855          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
 860  856          ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
 861  857          ASSERT(zio->io_txg == spa_syncing_txg(zio->io_spa));
 862  858  
 863  859          /*
 864  860           * We must reset the io_prop to match the values that existed
 865  861           * when the bp was first written by dmu_sync() keeping in mind
 866  862           * that nopwrite and dedup are mutually exclusive.
 867  863           */
  
 868  864          zio->io_prop.zp_dedup = nopwrite ? B_FALSE : zio->io_prop.zp_dedup;
 869  865          zio->io_prop.zp_nopwrite = nopwrite;
 870  866          zio->io_prop.zp_copies = copies;
 871  867          zio->io_bp_override = bp;
 872  868  }
 873  869  
 874  870  void
 875  871  zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp)
 876  872  {
 877  873  
 878      -        zfs_blkptr_verify(spa, bp);
 879      -
 880  874          /*
 881  875           * The check for EMBEDDED is a performance optimization.  We
 882  876           * process the free here (by ignoring it) rather than
 883  877           * putting it on the list and then processing it in zio_free_sync().
 884  878           */
 885  879          if (BP_IS_EMBEDDED(bp))
 886  880                  return;
 887  881          metaslab_check_free(spa, bp);
 888  882  
 889  883          /*
 890  884           * Frees that are for the currently-syncing txg, are not going to be
 891  885           * deferred, and which will not need to do a read (i.e. not GANG or
 892  886           * DEDUP), can be processed immediately.  Otherwise, put them on the
 893  887           * in-memory list for later processing.
 894  888           */
 895  889          if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
 896  890              txg != spa->spa_syncing_txg ||
 897  891              spa_sync_pass(spa) >= zfs_sync_pass_deferred_free) {
 898  892                  bplist_append(&spa->spa_free_bplist[txg & TXG_MASK], bp);
 899  893          } else {
 900  894                  VERIFY0(zio_wait(zio_free_sync(NULL, spa, txg, bp, 0)));
 901  895          }
 902  896  }
 903  897  
 904  898  zio_t *
 905  899  zio_free_sync(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
 906  900      enum zio_flag flags)
 907  901  {
 908  902          zio_t *zio;
 909  903          enum zio_stage stage = ZIO_FREE_PIPELINE;
  
 910  904  
 911  905          ASSERT(!BP_IS_HOLE(bp));
 912  906          ASSERT(spa_syncing_txg(spa) == txg);
 913  907          ASSERT(spa_sync_pass(spa) < zfs_sync_pass_deferred_free);
 914  908  
 915  909          if (BP_IS_EMBEDDED(bp))
 916  910                  return (zio_null(pio, spa, NULL, NULL, NULL, 0));
 917  911  
 918  912          metaslab_check_free(spa, bp);
 919  913          arc_freed(spa, bp);
      914 +        dsl_scan_freed(spa, bp);
 920  915  
 921  916          /*
 922  917           * GANG and DEDUP blocks can induce a read (for the gang block header,
 923  918           * or the DDT), so issue them asynchronously so that this thread is
 924  919           * not tied up.
 925  920           */
 926  921          if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp))
 927  922                  stage |= ZIO_STAGE_ISSUE_ASYNC;
 928  923  
 929  924          zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
 930  925              BP_GET_PSIZE(bp), NULL, NULL, ZIO_TYPE_FREE, ZIO_PRIORITY_NOW,
 931  926              flags, NULL, 0, NULL, ZIO_STAGE_OPEN, stage);
  
 932  927  
 933  928          return (zio);
 934  929  }
 935  930  
 936  931  zio_t *
 937  932  zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
 938  933      zio_done_func_t *done, void *private, enum zio_flag flags)
 939  934  {
 940  935          zio_t *zio;
 941  936  
 942      -        zfs_blkptr_verify(spa, bp);
      937 +        dprintf_bp(bp, "claiming in txg %llu", txg);
 943  938  
 944  939          if (BP_IS_EMBEDDED(bp))
 945  940                  return (zio_null(pio, spa, NULL, NULL, NULL, 0));
 946  941  
 947  942          /*
 948  943           * A claim is an allocation of a specific block.  Claims are needed
 949  944           * to support immediate writes in the intent log.  The issue is that
 950  945           * immediate writes contain committed data, but in a txg that was
 951  946           * *not* committed.  Upon opening the pool after an unclean shutdown,
 952  947           * the intent log claims all blocks that contain immediate write data
 953  948           * so that the SPA knows they're in use.
 954  949           *
 955  950           * All claims *must* be resolved in the first txg -- before the SPA
 956  951           * starts allocating blocks -- so that nothing is allocated twice.
 957  952           * If txg == 0 we just verify that the block is claimable.
 958  953           */
 959  954          ASSERT3U(spa->spa_uberblock.ub_rootbp.blk_birth, <, spa_first_txg(spa));
 960  955          ASSERT(txg == spa_first_txg(spa) || txg == 0);
  
     [ 8 lines elided ]
  
 961  956          ASSERT(!BP_GET_DEDUP(bp) || !spa_writeable(spa));       /* zdb(1M) */
 962  957  
 963  958          zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
 964  959              BP_GET_PSIZE(bp), done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW,
 965  960              flags, NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);
 966  961          ASSERT0(zio->io_queued_timestamp);
 967  962  
 968  963          return (zio);
 969  964  }
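
A hedged sketch of the claim pattern the comment above describes (claim_log_block() and its caller are hypothetical, not part of this change): at pool open, an intent-log walker claims each block holding immediate-write data in the pool's first txg, or merely verifies claimability when txg is 0.

	static int
	claim_log_block(spa_t *spa, const blkptr_t *bp, uint64_t first_txg)
	{
		/*
		 * txg == spa_first_txg(spa) actually claims the block;
		 * txg == 0 only verifies that the block is claimable.
		 */
		return (zio_wait(zio_claim(NULL, spa, first_txg, bp,
		    NULL, NULL, ZIO_FLAG_CANFAIL)));
	}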
 970  965  
 971      -zio_t *
 972      -zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
 973      -    zio_done_func_t *done, void *private, enum zio_flag flags)
      966 +static zio_t *
      967 +zio_ioctl_with_pipeline(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
      968 +    zio_done_func_t *done, void *private, enum zio_flag flags,
      969 +    enum zio_stage pipeline)
 974  970  {
 975  971          zio_t *zio;
 976  972          int c;
 977  973  
 978  974          if (vd->vdev_children == 0) {
 979  975                  zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
 980  976                      ZIO_TYPE_IOCTL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
 981      -                    ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
      977 +                    ZIO_STAGE_OPEN, pipeline);
 982  978  
 983  979                  zio->io_cmd = cmd;
 984  980          } else {
 985      -                zio = zio_null(pio, spa, NULL, NULL, NULL, flags);
 986      -
 987      -                for (c = 0; c < vd->vdev_children; c++)
 988      -                        zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
 989      -                            done, private, flags));
      981 +                zio = zio_null(pio, spa, vd, done, private, flags);
      982 +                /*
      983 +                 * DKIOCFREE ioctl's need some special handling on interior
      984 +                 * vdevs. If the device provides an ops function to handle
      985 +                 * recomputing dkioc_free extents, then we call it.
      986 +                 * Otherwise the default behavior applies, which simply fans
      987 +                 * out the ioctl to all component vdevs.
      988 +                 */
      989 +                if (cmd == DKIOCFREE && vd->vdev_ops->vdev_op_trim != NULL) {
      990 +                        vd->vdev_ops->vdev_op_trim(vd, zio, private);
      991 +                } else {
      992 +                        for (c = 0; c < vd->vdev_children; c++)
      993 +                                zio_nowait(zio_ioctl_with_pipeline(zio,
      994 +                                    spa, vd->vdev_child[c], cmd, NULL,
      995 +                                    private, flags, pipeline));
      996 +                }
 990  997          }
 991  998  
 992  999          return (zio);
 993 1000  }
 994 1001  
 995 1002  zio_t *
     1003 +zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
     1004 +    zio_done_func_t *done, void *private, enum zio_flag flags)
     1005 +{
     1006 +        return (zio_ioctl_with_pipeline(pio, spa, vd, cmd, done,
     1007 +            private, flags, ZIO_IOCTL_PIPELINE));
     1008 +}
     1009 +
     1010 +/*
     1011 + * Callback for when a trim zio has completed. This simply frees the
     1012 + * dkioc_free_list_t extent list of the DKIOCFREE ioctl.
     1013 + */
     1014 +static void
     1015 +zio_trim_done(zio_t *zio)
     1016 +{
     1017 +        VERIFY(zio->io_private != NULL);
     1018 +        dfl_free(zio->io_private);
     1019 +}
     1020 +
     1021 +static void
     1022 +zio_trim_check(uint64_t start, uint64_t len, void *msp)
     1023 +{
     1024 +        metaslab_t *ms = msp;
     1025 +        boolean_t held = MUTEX_HELD(&ms->ms_lock);
     1026 +        if (!held)
     1027 +                mutex_enter(&ms->ms_lock);
     1028 +        ASSERT(ms->ms_trimming_ts != NULL);
     1029 +        ASSERT(range_tree_contains(ms->ms_trimming_ts->ts_tree,
     1030 +            start - VDEV_LABEL_START_SIZE, len));
     1031 +        if (!held)
     1032 +                mutex_exit(&ms->ms_lock);
     1033 +}
     1034 +
     1035 +/*
      1036 + * Takes a set of freed extents and tells the underlying vdevs that the
      1037 + * space associated with these extents can be released.
      1038 + * This is used by flash storage to pre-erase blocks for rapid reuse later,
      1039 + * and by thin-provisioned block storage to reclaim unused blocks.
     1040 + */
     1041 +zio_t *
     1042 +zio_trim(spa_t *spa, vdev_t *vd, struct range_tree *tree,
     1043 +    zio_done_func_t *done, void *private, enum zio_flag flags,
     1044 +    int trim_flags, metaslab_t *msp)
     1045 +{
     1046 +        dkioc_free_list_t *dfl = NULL;
     1047 +        range_seg_t *rs;
     1048 +        uint64_t rs_idx;
     1049 +        uint64_t num_exts;
     1050 +        uint64_t bytes_issued = 0, bytes_skipped = 0, exts_skipped = 0;
     1051 +        /*
     1052 +         * We need this to invoke the caller's `done' callback with the
     1053 +         * correct io_private (not the dkioc_free_list_t, which is needed
     1054 +         * by the underlying DKIOCFREE ioctl).
     1055 +         */
     1056 +        zio_t *sub_pio = zio_root(spa, done, private, flags);
     1057 +
     1058 +        ASSERT(range_tree_space(tree) != 0);
     1059 +
     1060 +        if (!zfs_trim)
     1061 +                return (sub_pio);
     1062 +
     1063 +        num_exts = avl_numnodes(&tree->rt_root);
     1064 +        dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
     1065 +        dfl->dfl_flags = trim_flags;
     1066 +        dfl->dfl_num_exts = num_exts;
     1067 +        dfl->dfl_offset = VDEV_LABEL_START_SIZE;
     1068 +        if (msp) {
     1069 +                dfl->dfl_ck_func = zio_trim_check;
     1070 +                dfl->dfl_ck_arg = msp;
     1071 +        }
     1072 +
     1073 +        for (rs = avl_first(&tree->rt_root), rs_idx = 0; rs != NULL;
     1074 +            rs = AVL_NEXT(&tree->rt_root, rs)) {
     1075 +                uint64_t len = rs->rs_end - rs->rs_start;
     1076 +
     1077 +                if (len < zfs_trim_min_ext_sz) {
     1078 +                        bytes_skipped += len;
     1079 +                        exts_skipped++;
     1080 +                        continue;
     1081 +                }
     1082 +
     1083 +                dfl->dfl_exts[rs_idx].dfle_start = rs->rs_start;
     1084 +                dfl->dfl_exts[rs_idx].dfle_length = len;
     1085 +
      1086 +                /* check that start and length are multiples of the vdev ashift */
     1087 +                ASSERT0(dfl->dfl_exts[rs_idx].dfle_start &
     1088 +                    ((1 << vd->vdev_ashift) - 1));
     1089 +                ASSERT0(dfl->dfl_exts[rs_idx].dfle_length &
     1090 +                    ((1 << vd->vdev_ashift) - 1));
     1091 +
     1092 +                rs_idx++;
     1093 +                bytes_issued += len;
     1094 +        }
     1095 +
     1096 +        spa_trimstats_update(spa, rs_idx, bytes_issued, exts_skipped,
     1097 +            bytes_skipped);
     1098 +
     1099 +        /* the zfs_trim_min_ext_sz filter may have shortened the list */
     1100 +        if (dfl->dfl_num_exts != rs_idx) {
     1101 +                dkioc_free_list_t *dfl2 = kmem_zalloc(DFL_SZ(rs_idx), KM_SLEEP);
     1102 +                bcopy(dfl, dfl2, DFL_SZ(rs_idx));
     1103 +                dfl2->dfl_num_exts = rs_idx;
     1104 +                dfl_free(dfl);
     1105 +                dfl = dfl2;
     1106 +        }
     1107 +
     1108 +        zio_nowait(zio_ioctl_with_pipeline(sub_pio, spa, vd, DKIOCFREE,
     1109 +            zio_trim_done, dfl, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
     1110 +            ZIO_FLAG_DONT_RETRY, ZIO_TRIM_PIPELINE));
     1111 +        return (sub_pio);
     1112 +}
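
A rough usage sketch for the interface above (example_trim_freed_extents(), freed_tree, and the zero trim_flags value are illustrative assumptions, not taken from this change): a caller with a range tree of freed extents hands it to zio_trim() and waits on the returned root zio; errors from the underlying DKIOCFREE ioctls are not propagated back to it.

	static int
	example_trim_freed_extents(spa_t *spa, vdev_t *vd, range_tree_t *freed_tree)
	{
		/*
		 * zio_trim() converts the range tree into a dkioc_free_list_t,
		 * skips extents smaller than zfs_trim_min_ext_sz, and fans a
		 * DKIOCFREE ioctl out down the vdev tree; passing a metaslab
		 * instead of NULL would enable the zio_trim_check() verification.
		 */
		return (zio_wait(zio_trim(spa, vd, freed_tree, NULL, NULL,
		    ZIO_FLAG_CANFAIL, 0, NULL)));
	}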
     1113 +
     1114 +zio_t *
 996 1115  zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
 997 1116      abd_t *data, int checksum, zio_done_func_t *done, void *private,
 998 1117      zio_priority_t priority, enum zio_flag flags, boolean_t labels)
 999 1118  {
1000 1119          zio_t *zio;
1001 1120  
1002 1121          ASSERT(vd->vdev_children == 0);
1003 1122          ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
1004 1123              offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
1005 1124          ASSERT3U(offset + size, <=, vd->vdev_psize);
1006 1125  
1007 1126          zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, size, done,
1008 1127              private, ZIO_TYPE_READ, priority, flags | ZIO_FLAG_PHYSICAL, vd,
1009 1128              offset, NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
1010 1129  
1011 1130          zio->io_prop.zp_checksum = checksum;
1012 1131  
1013 1132          return (zio);
1014 1133  }
1015 1134  
1016 1135  zio_t *
1017 1136  zio_write_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
1018 1137      abd_t *data, int checksum, zio_done_func_t *done, void *private,
1019 1138      zio_priority_t priority, enum zio_flag flags, boolean_t labels)
1020 1139  {
1021 1140          zio_t *zio;
1022 1141  
1023 1142          ASSERT(vd->vdev_children == 0);
1024 1143          ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
1025 1144              offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
1026 1145          ASSERT3U(offset + size, <=, vd->vdev_psize);
1027 1146  
1028 1147          zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, size, done,
1029 1148              private, ZIO_TYPE_WRITE, priority, flags | ZIO_FLAG_PHYSICAL, vd,
1030 1149              offset, NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
1031 1150  
1032 1151          zio->io_prop.zp_checksum = checksum;
1033 1152  
1034 1153          if (zio_checksum_table[checksum].ci_flags & ZCHECKSUM_FLAG_EMBEDDED) {
1035 1154                  /*
1036 1155                   * zec checksums are necessarily destructive -- they modify
1037 1156                   * the end of the write buffer to hold the verifier/checksum.
1038 1157                   * Therefore, we must make a local copy in case the data is
1039 1158                   * being written to multiple places in parallel.
1040 1159                   */
1041 1160                  abd_t *wbuf = abd_alloc_sametype(data, size);
1042 1161                  abd_copy(wbuf, data, size);
1043 1162  
1044 1163                  zio_push_transform(zio, wbuf, size, size, NULL);
1045 1164          }
1046 1165  
1047 1166          return (zio);
1048 1167  }
1049 1168  
1050 1169  /*
  
     [ 45 lines elided ]
  
1051 1170   * Create a child I/O to do some work for us.
1052 1171   */
1053 1172  zio_t *
1054 1173  zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
1055 1174      abd_t *data, uint64_t size, int type, zio_priority_t priority,
1056 1175      enum zio_flag flags, zio_done_func_t *done, void *private)
1057 1176  {
1058 1177          enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;
1059 1178          zio_t *zio;
1060 1179  
1061      -        /*
1062      -         * vdev child I/Os do not propagate their error to the parent.
1063      -         * Therefore, for correct operation the caller *must* check for
1064      -         * and handle the error in the child i/o's done callback.
1065      -         * The only exceptions are i/os that we don't care about
1066      -         * (OPTIONAL or REPAIR).
1067      -         */
1068      -        ASSERT((flags & ZIO_FLAG_OPTIONAL) || (flags & ZIO_FLAG_IO_REPAIR) ||
1069      -            done != NULL);
     1180 +        ASSERT(vd->vdev_parent ==
     1181 +            (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
1070 1182  
1071      -        /*
1072      -         * In the common case, where the parent zio was to a normal vdev,
1073      -         * the child zio must be to a child vdev of that vdev.  Otherwise,
1074      -         * the child zio must be to a top-level vdev.
1075      -         */
1076      -        if (pio->io_vd != NULL && pio->io_vd->vdev_ops != &vdev_indirect_ops) {
1077      -                ASSERT3P(vd->vdev_parent, ==, pio->io_vd);
1078      -        } else {
1079      -                ASSERT3P(vd, ==, vd->vdev_top);
1080      -        }
1081      -
1082 1183          if (type == ZIO_TYPE_READ && bp != NULL) {
1083 1184                  /*
1084 1185                   * If we have the bp, then the child should perform the
1085 1186                   * checksum and the parent need not.  This pushes error
1086 1187                   * detection as close to the leaves as possible and
1087 1188                   * eliminates redundant checksums in the interior nodes.
1088 1189                   */
1089 1190                  pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
1090 1191                  pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
1091 1192          }
1092 1193  
1093      -        if (vd->vdev_ops->vdev_op_leaf) {
1094      -                ASSERT0(vd->vdev_children);
     1194 +        if (vd->vdev_children == 0)
1095 1195                  offset += VDEV_LABEL_START_SIZE;
1096      -        }
1097 1196  
1098      -        flags |= ZIO_VDEV_CHILD_FLAGS(pio);
     1197 +        flags |= ZIO_VDEV_CHILD_FLAGS(pio) | ZIO_FLAG_DONT_PROPAGATE;
1099 1198  
1100 1199          /*
1101 1200           * If we've decided to do a repair, the write is not speculative --
1102 1201           * even if the original read was.
1103 1202           */
1104 1203          if (flags & ZIO_FLAG_IO_REPAIR)
1105 1204                  flags &= ~ZIO_FLAG_SPECULATIVE;
1106 1205  
1107 1206          /*
1108 1207           * If we're creating a child I/O that is not associated with a
1109 1208           * top-level vdev, then the child zio is not an allocating I/O.
1110 1209           * If this is a retried I/O then we ignore it since we will
1111 1210           * have already processed the original allocating I/O.
1112 1211           */
1113 1212          if (flags & ZIO_FLAG_IO_ALLOCATING &&
1114 1213              (vd != vd->vdev_top || (flags & ZIO_FLAG_IO_RETRY))) {
1115      -                metaslab_class_t *mc = spa_normal_class(pio->io_spa);
     1214 +                metaslab_class_t *mc = pio->io_mc;
1116 1215  
1117 1216                  ASSERT(mc->mc_alloc_throttle_enabled);
1118 1217                  ASSERT(type == ZIO_TYPE_WRITE);
1119 1218                  ASSERT(priority == ZIO_PRIORITY_ASYNC_WRITE);
1120 1219                  ASSERT(!(flags & ZIO_FLAG_IO_REPAIR));
1121 1220                  ASSERT(!(pio->io_flags & ZIO_FLAG_IO_REWRITE) ||
1122 1221                      pio->io_child_type == ZIO_CHILD_GANG);
1123 1222  
1124 1223                  flags &= ~ZIO_FLAG_IO_ALLOCATING;
1125 1224          }
1126 1225  
1127 1226          zio = zio_create(pio, pio->io_spa, pio->io_txg, bp, data, size, size,
1128 1227              done, private, type, priority, flags, vd, offset, &pio->io_bookmark,
1129 1228              ZIO_STAGE_VDEV_IO_START >> 1, pipeline);
1130 1229          ASSERT3U(zio->io_child_type, ==, ZIO_CHILD_VDEV);
1131 1230  
1132 1231          zio->io_physdone = pio->io_physdone;
1133 1232          if (vd->vdev_ops->vdev_op_leaf && zio->io_logical != NULL)
1134 1233                  zio->io_logical->io_phys_children++;
1135 1234  
1136 1235          return (zio);
1137 1236  }
1138 1237  
1139 1238  zio_t *
1140 1239  zio_vdev_delegated_io(vdev_t *vd, uint64_t offset, abd_t *data, uint64_t size,
1141 1240      int type, zio_priority_t priority, enum zio_flag flags,
1142 1241      zio_done_func_t *done, void *private)
1143 1242  {
1144 1243          zio_t *zio;
1145 1244  
1146 1245          ASSERT(vd->vdev_ops->vdev_op_leaf);
1147 1246  
1148 1247          zio = zio_create(NULL, vd->vdev_spa, 0, NULL,
1149 1248              data, size, size, done, private, type, priority,
1150 1249              flags | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_RETRY | ZIO_FLAG_DELEGATED,
1151 1250              vd, offset, NULL,
1152 1251              ZIO_STAGE_VDEV_IO_START >> 1, ZIO_VDEV_CHILD_PIPELINE);
1153 1252  
1154 1253          return (zio);
1155 1254  }
1156 1255  
1157 1256  void
1158 1257  zio_flush(zio_t *zio, vdev_t *vd)
1159 1258  {
1160 1259          zio_nowait(zio_ioctl(zio, zio->io_spa, vd, DKIOCFLUSHWRITECACHE,
1161 1260              NULL, NULL,
1162 1261              ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY));
1163 1262  }
1164 1263  
1165 1264  void
1166 1265  zio_shrink(zio_t *zio, uint64_t size)
1167 1266  {
1168 1267          ASSERT3P(zio->io_executor, ==, NULL);
1169 1268          ASSERT3P(zio->io_orig_size, ==, zio->io_size);
1170 1269          ASSERT3U(size, <=, zio->io_size);
1171 1270  
1172 1271          /*
1173 1272           * We don't shrink for raidz because of problems with the
1174 1273           * reconstruction when reading back less than the block size.
1175 1274           * Note, BP_IS_RAIDZ() assumes no compression.
1176 1275           */
1177 1276          ASSERT(BP_GET_COMPRESS(zio->io_bp) == ZIO_COMPRESS_OFF);
1178 1277          if (!BP_IS_RAIDZ(zio->io_bp)) {
1179 1278                  /* we are not doing a raw write */
1180 1279                  ASSERT3U(zio->io_size, ==, zio->io_lsize);
1181 1280                  zio->io_orig_size = zio->io_size = zio->io_lsize = size;
1182 1281          }
1183 1282  }
1184 1283  
1185 1284  /*
  
     [ 60 lines elided ]
  
1186 1285   * ==========================================================================
1187 1286   * Prepare to read and write logical blocks
1188 1287   * ==========================================================================
1189 1288   */
1190 1289  
1191 1290  static int
1192 1291  zio_read_bp_init(zio_t *zio)
1193 1292  {
1194 1293          blkptr_t *bp = zio->io_bp;
1195 1294  
1196      -        ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1197      -
1198 1295          if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF &&
1199 1296              zio->io_child_type == ZIO_CHILD_LOGICAL &&
1200 1297              !(zio->io_flags & ZIO_FLAG_RAW)) {
1201 1298                  uint64_t psize =
1202 1299                      BP_IS_EMBEDDED(bp) ? BPE_GET_PSIZE(bp) : BP_GET_PSIZE(bp);
1203 1300                  zio_push_transform(zio, abd_alloc_sametype(zio->io_abd, psize),
1204 1301                      psize, psize, zio_decompress);
1205 1302          }
1206 1303  
1207 1304          if (BP_IS_EMBEDDED(bp) && BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA) {
1208 1305                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1209 1306  
1210 1307                  int psize = BPE_GET_PSIZE(bp);
1211 1308                  void *data = abd_borrow_buf(zio->io_abd, psize);
1212 1309                  decode_embedded_bp_compressed(bp, data);
1213 1310                  abd_return_buf_copy(zio->io_abd, data, psize);
1214 1311          } else {
1215 1312                  ASSERT(!BP_IS_EMBEDDED(bp));
1216      -                ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1217 1313          }
1218 1314  
1219      -        if (!DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) && BP_GET_LEVEL(bp) == 0)
     1315 +        if (!BP_IS_METADATA(bp))
1220 1316                  zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1221 1317  
1222 1318          if (BP_GET_TYPE(bp) == DMU_OT_DDT_ZAP)
1223 1319                  zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1224 1320  
1225 1321          if (BP_GET_DEDUP(bp) && zio->io_child_type == ZIO_CHILD_LOGICAL)
1226 1322                  zio->io_pipeline = ZIO_DDT_READ_PIPELINE;
1227 1323  
1228 1324          return (ZIO_PIPELINE_CONTINUE);
1229 1325  }
1230 1326  
1231 1327  static int
1232 1328  zio_write_bp_init(zio_t *zio)
1233 1329  {
1234 1330          if (!IO_IS_ALLOCATING(zio))
1235 1331                  return (ZIO_PIPELINE_CONTINUE);
1236 1332  
1237 1333          ASSERT(zio->io_child_type != ZIO_CHILD_DDT);
1238 1334  
1239 1335          if (zio->io_bp_override) {
1240 1336                  blkptr_t *bp = zio->io_bp;
1241 1337                  zio_prop_t *zp = &zio->io_prop;
1242 1338  
1243 1339                  ASSERT(bp->blk_birth != zio->io_txg);
1244 1340                  ASSERT(BP_GET_DEDUP(zio->io_bp_override) == 0);
1245 1341  
1246 1342                  *bp = *zio->io_bp_override;
1247 1343                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1248 1344  
1249 1345                  if (BP_IS_EMBEDDED(bp))
1250 1346                          return (ZIO_PIPELINE_CONTINUE);
1251 1347  
1252 1348                  /*
1253 1349                   * If we've been overridden and nopwrite is set then
1254 1350                   * set the flag accordingly to indicate that a nopwrite
1255 1351                   * has already occurred.
1256 1352                   */
1257 1353                  if (!BP_IS_HOLE(bp) && zp->zp_nopwrite) {
1258 1354                          ASSERT(!zp->zp_dedup);
1259 1355                          ASSERT3U(BP_GET_CHECKSUM(bp), ==, zp->zp_checksum);
1260 1356                          zio->io_flags |= ZIO_FLAG_NOPWRITE;
1261 1357                          return (ZIO_PIPELINE_CONTINUE);
1262 1358                  }
1263 1359  
1264 1360                  ASSERT(!zp->zp_nopwrite);
1265 1361  
1266 1362                  if (BP_IS_HOLE(bp) || !zp->zp_dedup)
1267 1363                          return (ZIO_PIPELINE_CONTINUE);
1268 1364  
1269 1365                  ASSERT((zio_checksum_table[zp->zp_checksum].ci_flags &
1270 1366                      ZCHECKSUM_FLAG_DEDUP) || zp->zp_dedup_verify);
1271 1367  
1272 1368                  if (BP_GET_CHECKSUM(bp) == zp->zp_checksum) {
1273 1369                          BP_SET_DEDUP(bp, 1);
1274 1370                          zio->io_pipeline |= ZIO_STAGE_DDT_WRITE;
1275 1371                          return (ZIO_PIPELINE_CONTINUE);
1276 1372                  }
1277 1373  
1278 1374                  /*
1279 1375                   * We were unable to handle this as an override bp, treat
1280 1376                   * it as a regular write I/O.
1281 1377                   */
1282 1378                  zio->io_bp_override = NULL;
1283 1379                  *bp = zio->io_bp_orig;
1284 1380                  zio->io_pipeline = zio->io_orig_pipeline;
1285 1381          }
1286 1382  
1287 1383          return (ZIO_PIPELINE_CONTINUE);
1288 1384  }
1289 1385  
1290 1386  static int
1291 1387  zio_write_compress(zio_t *zio)
1292 1388  {
1293 1389          spa_t *spa = zio->io_spa;
1294 1390          zio_prop_t *zp = &zio->io_prop;
1295 1391          enum zio_compress compress = zp->zp_compress;
1296 1392          blkptr_t *bp = zio->io_bp;
  
     [ 67 lines elided ]
  
1297 1393          uint64_t lsize = zio->io_lsize;
1298 1394          uint64_t psize = zio->io_size;
1299 1395          int pass = 1;
1300 1396  
1301 1397          EQUIV(lsize != psize, (zio->io_flags & ZIO_FLAG_RAW) != 0);
1302 1398  
1303 1399          /*
1304 1400           * If our children haven't all reached the ready stage,
1305 1401           * wait for them and then repeat this pipeline stage.
1306 1402           */
1307      -        if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
1308      -            ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
     1403 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
     1404 +            zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
1309 1405                  return (ZIO_PIPELINE_STOP);
1310      -        }
1311 1406  
1312 1407          if (!IO_IS_ALLOCATING(zio))
1313 1408                  return (ZIO_PIPELINE_CONTINUE);
1314 1409  
1315 1410          if (zio->io_children_ready != NULL) {
1316 1411                  /*
1317 1412                   * Now that all our children are ready, run the callback
1318 1413                   * associated with this zio in case it wants to modify the
1319 1414                   * data to be written.
1320 1415                   */
1321 1416                  ASSERT3U(zp->zp_level, >, 0);
1322 1417                  zio->io_children_ready(zio);
1323 1418          }
1324 1419  
1325 1420          ASSERT(zio->io_child_type != ZIO_CHILD_DDT);
1326 1421          ASSERT(zio->io_bp_override == NULL);
1327 1422  
1328 1423          if (!BP_IS_HOLE(bp) && bp->blk_birth == zio->io_txg) {
1329 1424                  /*
1330 1425                   * We're rewriting an existing block, which means we're
1331 1426                   * working on behalf of spa_sync().  For spa_sync() to
1332 1427                   * converge, it must eventually be the case that we don't
1333 1428                   * have to allocate new blocks.  But compression changes
1334 1429                   * the blocksize, which forces a reallocate, and makes
1335 1430                   * convergence take longer.  Therefore, after the first
1336 1431                   * few passes, stop compressing to ensure convergence.
1337 1432                   */
1338 1433                  pass = spa_sync_pass(spa);
1339 1434  
1340 1435                  ASSERT(zio->io_txg == spa_syncing_txg(spa));
1341 1436                  ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
  
     [ 21 lines elided ]
  
1342 1437                  ASSERT(!BP_GET_DEDUP(bp));
1343 1438  
1344 1439                  if (pass >= zfs_sync_pass_dont_compress)
1345 1440                          compress = ZIO_COMPRESS_OFF;
1346 1441  
1347 1442                  /* Make sure someone doesn't change their mind on overwrites */
1348 1443                  ASSERT(BP_IS_EMBEDDED(bp) || MIN(zp->zp_copies + BP_IS_GANG(bp),
1349 1444                      spa_max_replication(spa)) == BP_GET_NDVAS(bp));
1350 1445          }
1351 1446  
     1447 +        DTRACE_PROBE1(zio_compress_ready, zio_t *, zio);
1352 1448          /* If it's a compressed write that is not raw, compress the buffer. */
1353      -        if (compress != ZIO_COMPRESS_OFF && psize == lsize) {
     1449 +        if (compress != ZIO_COMPRESS_OFF && psize == lsize &&
     1450 +            ZIO_SHOULD_COMPRESS(zio)) {
1354 1451                  void *cbuf = zio_buf_alloc(lsize);
1355 1452                  psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
1356 1453                  if (psize == 0 || psize == lsize) {
1357 1454                          compress = ZIO_COMPRESS_OFF;
1358 1455                          zio_buf_free(cbuf, lsize);
1359 1456                  } else if (!zp->zp_dedup && psize <= BPE_PAYLOAD_SIZE &&
1360 1457                      zp->zp_level == 0 && !DMU_OT_HAS_FILL(zp->zp_type) &&
1361 1458                      spa_feature_is_enabled(spa, SPA_FEATURE_EMBEDDED_DATA)) {
1362 1459                          encode_embedded_bp_compressed(bp,
1363 1460                              cbuf, compress, lsize, psize);
1364 1461                          BPE_SET_ETYPE(bp, BP_EMBEDDED_TYPE_DATA);
1365 1462                          BP_SET_TYPE(bp, zio->io_prop.zp_type);
1366 1463                          BP_SET_LEVEL(bp, zio->io_prop.zp_level);
1367 1464                          zio_buf_free(cbuf, lsize);
1368 1465                          bp->blk_birth = zio->io_txg;
1369 1466                          zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1370 1467                          ASSERT(spa_feature_is_active(spa,
1371 1468                              SPA_FEATURE_EMBEDDED_DATA));
     1469 +                        if (zio->io_smartcomp.sc_result != NULL) {
     1470 +                                zio->io_smartcomp.sc_result(
     1471 +                                    zio->io_smartcomp.sc_userinfo, zio);
     1472 +                        } else {
     1473 +                                ASSERT(zio->io_smartcomp.sc_ask == NULL);
     1474 +                        }
1372 1475                          return (ZIO_PIPELINE_CONTINUE);
1373 1476                  } else {
1374 1477                          /*
 1375 1478                          * Round the compressed size up to the ashift
1376 1479                           * of the smallest-ashift device, and zero the tail.
1377 1480                           * This ensures that the compressed size of the BP
1378 1481                           * (and thus compressratio property) are correct,
1379 1482                           * in that we charge for the padding used to fill out
1380 1483                           * the last sector.
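                          *
                          * Illustrative numbers (an assumption for this sketch,
                          * not from the source): with spa_min_ashift == 12
                          * (4 KiB sectors), lsize == 131072 and a compressed
                          * psize of 90000,
                          *
                          *	rounded = P2ROUNDUP(90000, 1ULL << 12) = 90112
                          *
                          * so the 112-byte tail of the last sector is zeroed and
                          * the BP is charged 90112 bytes; had rounded reached
                          * lsize, compression would be abandoned below and psize
                          * reset to lsize.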
1381 1484                           */
1382 1485                          ASSERT3U(spa->spa_min_ashift, >=, SPA_MINBLOCKSHIFT);
1383 1486                          size_t rounded = (size_t)P2ROUNDUP(psize,
1384 1487                              1ULL << spa->spa_min_ashift);
1385 1488                          if (rounded >= lsize) {
1386 1489                                  compress = ZIO_COMPRESS_OFF;
1387 1490                                  zio_buf_free(cbuf, lsize);
1388 1491                                  psize = lsize;
  
     [ 7 lines elided ]
  
1389 1492                          } else {
1390 1493                                  abd_t *cdata = abd_get_from_buf(cbuf, lsize);
1391 1494                                  abd_take_ownership_of_buf(cdata, B_TRUE);
1392 1495                                  abd_zero_off(cdata, psize, rounded - psize);
1393 1496                                  psize = rounded;
1394 1497                                  zio_push_transform(zio, cdata,
1395 1498                                      psize, lsize, NULL);
1396 1499                          }
1397 1500                  }
1398 1501  
     1502 +                if (zio->io_smartcomp.sc_result != NULL) {
     1503 +                        zio->io_smartcomp.sc_result(
     1504 +                            zio->io_smartcomp.sc_userinfo, zio);
     1505 +                } else {
     1506 +                        ASSERT(zio->io_smartcomp.sc_ask == NULL);
     1507 +                }
     1508 +
1399 1509                  /*
1400 1510                   * We were unable to handle this as an override bp, treat
1401 1511                   * it as a regular write I/O.
1402 1512                   */
1403 1513                  zio->io_bp_override = NULL;
1404 1514                  *bp = zio->io_bp_orig;
1405 1515                  zio->io_pipeline = zio->io_orig_pipeline;
1406 1516          } else {
1407 1517                  ASSERT3U(psize, !=, 0);
     1518 +
     1519 +                /*
      1520 +                 * We get here because:
      1521 +                 *      - compress == ZIO_COMPRESS_OFF,
      1522 +                 *      - SmartCompression decided not to compress this data, or
      1523 +                 *      - this is a raw write.
      1524 +                 *
      1525 +                 *      For a raw write we must not override "compress".
     1526 +                 */
     1527 +                if ((zio->io_flags & ZIO_FLAG_RAW) == 0)
     1528 +                        compress = ZIO_COMPRESS_OFF;
1408 1529          }
1409 1530  
1410 1531          /*
1411 1532           * The final pass of spa_sync() must be all rewrites, but the first
1412 1533           * few passes offer a trade-off: allocating blocks defers convergence,
1413 1534           * but newly allocated blocks are sequential, so they can be written
1414 1535           * to disk faster.  Therefore, we allow the first few passes of
1415 1536           * spa_sync() to allocate new blocks, but force rewrites after that.
1416 1537           * There should only be a handful of blocks after pass 1 in any case.
1417 1538           */
1418 1539          if (!BP_IS_HOLE(bp) && bp->blk_birth == zio->io_txg &&
1419 1540              BP_GET_PSIZE(bp) == psize &&
1420 1541              pass >= zfs_sync_pass_rewrite) {
1421 1542                  ASSERT(psize != 0);
1422 1543                  enum zio_stage gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
1423 1544                  zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
1424 1545                  zio->io_flags |= ZIO_FLAG_IO_REWRITE;
1425 1546          } else {
1426 1547                  BP_ZERO(bp);
1427 1548                  zio->io_pipeline = ZIO_WRITE_PIPELINE;
1428 1549          }
1429 1550  
  
     [ 12 lines elided ]
  
1430 1551          if (psize == 0) {
1431 1552                  if (zio->io_bp_orig.blk_birth != 0 &&
1432 1553                      spa_feature_is_active(spa, SPA_FEATURE_HOLE_BIRTH)) {
1433 1554                          BP_SET_LSIZE(bp, lsize);
1434 1555                          BP_SET_TYPE(bp, zp->zp_type);
1435 1556                          BP_SET_LEVEL(bp, zp->zp_level);
1436 1557                          BP_SET_BIRTH(bp, zio->io_txg, 0);
1437 1558                  }
1438 1559                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1439 1560          } else {
     1561 +                if (zp->zp_dedup) {
     1562 +                        /* check the best-effort dedup setting */
     1563 +                        zio_best_effort_dedup(zio);
     1564 +                }
1440 1565                  ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
1441 1566                  BP_SET_LSIZE(bp, lsize);
1442 1567                  BP_SET_TYPE(bp, zp->zp_type);
1443 1568                  BP_SET_LEVEL(bp, zp->zp_level);
1444 1569                  BP_SET_PSIZE(bp, psize);
1445 1570                  BP_SET_COMPRESS(bp, compress);
1446 1571                  BP_SET_CHECKSUM(bp, zp->zp_checksum);
1447 1572                  BP_SET_DEDUP(bp, zp->zp_dedup);
1448 1573                  BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
1449 1574                  if (zp->zp_dedup) {
1450 1575                          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1451 1576                          ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
1452 1577                          zio->io_pipeline = ZIO_DDT_WRITE_PIPELINE;
1453 1578                  }
1454 1579                  if (zp->zp_nopwrite) {
1455 1580                          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1456 1581                          ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
1457 1582                          zio->io_pipeline |= ZIO_STAGE_NOP_WRITE;
1458 1583                  }
1459 1584          }
1460 1585          return (ZIO_PIPELINE_CONTINUE);
1461 1586  }
1462 1587  
  
     [ 13 lines elided ]
  
1463 1588  static int
1464 1589  zio_free_bp_init(zio_t *zio)
1465 1590  {
1466 1591          blkptr_t *bp = zio->io_bp;
1467 1592  
1468 1593          if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
1469 1594                  if (BP_GET_DEDUP(bp))
1470 1595                          zio->io_pipeline = ZIO_DDT_FREE_PIPELINE;
1471 1596          }
1472 1597  
1473      -        ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1474      -
1475 1598          return (ZIO_PIPELINE_CONTINUE);
1476 1599  }
1477 1600  
1478 1601  /*
1479 1602   * ==========================================================================
1480 1603   * Execute the I/O pipeline
1481 1604   * ==========================================================================
1482 1605   */
1483 1606  
1484 1607  static void
1485 1608  zio_taskq_dispatch(zio_t *zio, zio_taskq_type_t q, boolean_t cutinline)
1486 1609  {
1487 1610          spa_t *spa = zio->io_spa;
1488 1611          zio_type_t t = zio->io_type;
1489 1612          int flags = (cutinline ? TQ_FRONT : 0);
1490 1613  
1491 1614          /*
1492 1615           * If we're a config writer or a probe, the normal issue and
1493 1616           * interrupt threads may all be blocked waiting for the config lock.
1494 1617           * In this case, select the otherwise-unused taskq for ZIO_TYPE_NULL.
1495 1618           */
1496 1619          if (zio->io_flags & (ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_PROBE))
1497 1620                  t = ZIO_TYPE_NULL;
1498 1621  
  
     [ 14 lines elided ]
  
1499 1622          /*
1500 1623           * A similar issue exists for the L2ARC write thread until L2ARC 2.0.
1501 1624           */
1502 1625          if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
1503 1626                  t = ZIO_TYPE_NULL;
1504 1627  
1505 1628          /*
1506 1629           * If this is a high priority I/O, then use the high priority taskq if
1507 1630           * available.
1508 1631           */
1509      -        if (zio->io_priority == ZIO_PRIORITY_NOW &&
     1632 +        if ((zio->io_priority == ZIO_PRIORITY_NOW ||
     1633 +            zio->io_priority == ZIO_PRIORITY_SYNC_WRITE) &&
1510 1634              spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
1511 1635                  q++;
1512 1636  
1513 1637          ASSERT3U(q, <, ZIO_TASKQ_TYPES);
1514 1638  
1515 1639          /*
1516 1640           * NB: We are assuming that the zio can only be dispatched
1517 1641           * to a single taskq at a time.  It would be a grievous error
1518 1642           * to dispatch the zio to another taskq at the same time.
1519 1643           */
1520 1644          ASSERT(zio->io_tqent.tqent_next == NULL);
1521 1645          spa_taskq_dispatch_ent(spa, t, q, (task_func_t *)zio_execute, zio,
1522 1646              flags, &zio->io_tqent);
1523 1647  }
1524 1648  
1525 1649  static boolean_t
1526 1650  zio_taskq_member(zio_t *zio, zio_taskq_type_t q)
1527 1651  {
1528 1652          kthread_t *executor = zio->io_executor;
1529 1653          spa_t *spa = zio->io_spa;
1530 1654  
1531 1655          for (zio_type_t t = 0; t < ZIO_TYPES; t++) {
1532 1656                  spa_taskqs_t *tqs = &spa->spa_zio_taskq[t][q];
1533 1657                  uint_t i;
1534 1658                  for (i = 0; i < tqs->stqs_count; i++) {
1535 1659                          if (taskq_member(tqs->stqs_taskq[i], executor))
1536 1660                                  return (B_TRUE);
1537 1661                  }
1538 1662          }
1539 1663  
1540 1664          return (B_FALSE);
1541 1665  }
1542 1666  
1543 1667  static int
1544 1668  zio_issue_async(zio_t *zio)
1545 1669  {
1546 1670          zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);
1547 1671  
1548 1672          return (ZIO_PIPELINE_STOP);
1549 1673  }
1550 1674  
1551 1675  void
1552 1676  zio_interrupt(zio_t *zio)
1553 1677  {
1554 1678          zio_taskq_dispatch(zio, ZIO_TASKQ_INTERRUPT, B_FALSE);
1555 1679  }
1556 1680  
1557 1681  void
1558 1682  zio_delay_interrupt(zio_t *zio)
1559 1683  {
1560 1684          /*
1561 1685           * The timeout_generic() function isn't defined in userspace, so
1562 1686           * rather than trying to implement the function, the zio delay
1563 1687           * functionality has been disabled for userspace builds.
1564 1688           */
1565 1689  
1566 1690  #ifdef _KERNEL
1567 1691          /*
1568 1692           * If io_target_timestamp is zero, then no delay has been registered
1569 1693           * for this IO, thus jump to the end of this function and "skip" the
1570 1694           * delay; issuing it directly to the zio layer.
1571 1695           */
1572 1696          if (zio->io_target_timestamp != 0) {
1573 1697                  hrtime_t now = gethrtime();
1574 1698  
1575 1699                  if (now >= zio->io_target_timestamp) {
1576 1700                          /*
1577 1701                           * This IO has already taken longer than the target
1578 1702                           * delay to complete, so we don't want to delay it
1579 1703                           * any longer; we "miss" the delay and issue it
1580 1704                           * directly to the zio layer. This is likely due to
1581 1705                           * the target latency being set to a value less than
1582 1706                           * the underlying hardware can satisfy (e.g. delay
1583 1707                           * set to 1ms, but the disks take 10ms to complete an
1584 1708                           * IO request).
1585 1709                           */
1586 1710  
1587 1711                          DTRACE_PROBE2(zio__delay__miss, zio_t *, zio,
1588 1712                              hrtime_t, now);
1589 1713  
1590 1714                          zio_interrupt(zio);
1591 1715                  } else {
1592 1716                          hrtime_t diff = zio->io_target_timestamp - now;
1593 1717  
1594 1718                          DTRACE_PROBE3(zio__delay__hit, zio_t *, zio,
1595 1719                              hrtime_t, now, hrtime_t, diff);
1596 1720  
1597 1721                          (void) timeout_generic(CALLOUT_NORMAL,
1598 1722                              (void (*)(void *))zio_interrupt, zio, diff, 1, 0);
1599 1723                  }
1600 1724  
1601 1725                  return;
1602 1726          }
1603 1727  #endif
1604 1728  
1605 1729          DTRACE_PROBE1(zio__delay__skip, zio_t *, zio);
1606 1730          zio_interrupt(zio);
1607 1731  }
1608 1732  
1609 1733  /*
1610 1734   * Execute the I/O pipeline until one of the following occurs:
1611 1735   *
1612 1736   *      (1) the I/O completes
1613 1737   *      (2) the pipeline stalls waiting for dependent child I/Os
1614 1738   *      (3) the I/O issues, so we're waiting for an I/O completion interrupt
1615 1739   *      (4) the I/O is delegated by vdev-level caching or aggregation
1616 1740   *      (5) the I/O is deferred due to vdev-level queueing
1617 1741   *      (6) the I/O is handed off to another thread.
1618 1742   *
1619 1743   * In all cases, the pipeline stops whenever there's no CPU work; it never
1620 1744   * burns a thread in cv_wait().
1621 1745   *
1622 1746   * There's no locking on io_stage because there's no legitimate way
1623 1747   * for multiple threads to be attempting to process the same I/O.
1624 1748   */
1625 1749  static zio_pipe_stage_t *zio_pipeline[];
  
     [ 106 lines elided ]
  
1626 1750  
1627 1751  void
1628 1752  zio_execute(zio_t *zio)
1629 1753  {
1630 1754          zio->io_executor = curthread;
1631 1755  
1632 1756          ASSERT3U(zio->io_queued_timestamp, >, 0);
1633 1757  
1634 1758          while (zio->io_stage < ZIO_STAGE_DONE) {
1635 1759                  enum zio_stage pipeline = zio->io_pipeline;
     1760 +                enum zio_stage old_stage = zio->io_stage;
1636 1761                  enum zio_stage stage = zio->io_stage;
1637 1762                  int rv;
1638 1763  
1639 1764                  ASSERT(!MUTEX_HELD(&zio->io_lock));
1640 1765                  ASSERT(ISP2(stage));
1641 1766                  ASSERT(zio->io_stall == NULL);
1642 1767  
1643 1768                  do {
1644 1769                          stage <<= 1;
1645 1770                  } while ((stage & pipeline) == 0);
1646 1771  
1647 1772                  ASSERT(stage <= ZIO_STAGE_DONE);
1648 1773  
1649 1774                  /*
1650 1775                   * If we are in interrupt context and this pipeline stage
1651 1776                   * will grab a config lock that is held across I/O,
1652 1777                   * or may wait for an I/O that needs an interrupt thread
1653 1778                   * to complete, issue async to avoid deadlock.
1654 1779                   *
1655 1780                   * For VDEV_IO_START, we cut in line so that the io will
1656 1781                   * be sent to disk promptly.
1657 1782                   */
1658 1783                  if ((stage & ZIO_BLOCKING_STAGES) && zio->io_vd == NULL &&
1659 1784                      zio_taskq_member(zio, ZIO_TASKQ_INTERRUPT)) {
1660 1785                          boolean_t cut = (stage == ZIO_STAGE_VDEV_IO_START) ?
1661 1786                              zio_requeue_io_start_cut_in_line : B_FALSE;
1662 1787                          zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, cut);
  
     [ 17 lines elided ]
  
1663 1788                          return;
1664 1789                  }
1665 1790  
1666 1791                  zio->io_stage = stage;
1667 1792                  zio->io_pipeline_trace |= zio->io_stage;
1668 1793                  rv = zio_pipeline[highbit64(stage) - 1](zio);
1669 1794  
1670 1795                  if (rv == ZIO_PIPELINE_STOP)
1671 1796                          return;
1672 1797  
     1798 +                if (rv == ZIO_PIPELINE_RESTART_STAGE) {
     1799 +                        zio->io_stage = old_stage;
     1800 +                        (void) zio_issue_async(zio);
     1801 +                        return;
     1802 +                }
     1803 +
1673 1804                  ASSERT(rv == ZIO_PIPELINE_CONTINUE);
1674 1805          }
1675 1806  }
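
A hedged sketch of how a pipeline stage can use the ZIO_PIPELINE_RESTART_STAGE return value handled above (zio_example_stage() and example_resource_available() are hypothetical): returning it makes zio_execute() rewind io_stage to its value on entry and re-dispatch the zio to the issue taskq, so the same stage is retried from taskq context instead of blocking the current thread.

	static int
	zio_example_stage(zio_t *zio)
	{
		/* Hypothetical back-off: retry this stage asynchronously. */
		if (!example_resource_available(zio->io_spa))
			return (ZIO_PIPELINE_RESTART_STAGE);

		/* ... perform the stage's actual work here ... */

		return (ZIO_PIPELINE_CONTINUE);
	}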
1676 1807  
1677 1808  /*
1678 1809   * ==========================================================================
1679 1810   * Initiate I/O, either sync or async
1680 1811   * ==========================================================================
1681 1812   */
1682 1813  int
1683 1814  zio_wait(zio_t *zio)
1684 1815  {
1685 1816          int error;
1686 1817  
1687 1818          ASSERT3P(zio->io_stage, ==, ZIO_STAGE_OPEN);
1688 1819          ASSERT3P(zio->io_executor, ==, NULL);
1689 1820  
1690 1821          zio->io_waiter = curthread;
1691 1822          ASSERT0(zio->io_queued_timestamp);
1692 1823          zio->io_queued_timestamp = gethrtime();
1693 1824  
1694 1825          zio_execute(zio);
1695 1826  
1696 1827          mutex_enter(&zio->io_lock);
1697 1828          while (zio->io_executor != NULL)
1698 1829                  cv_wait(&zio->io_cv, &zio->io_lock);
1699 1830          mutex_exit(&zio->io_lock);
1700 1831  
1701 1832          error = zio->io_error;
1702 1833          zio_destroy(zio);
1703 1834  
1704 1835          return (error);
1705 1836  }
1706 1837  
1707 1838  void
1708 1839  zio_nowait(zio_t *zio)
1709 1840  {
1710 1841          ASSERT3P(zio->io_executor, ==, NULL);
1711 1842  
1712 1843          if (zio->io_child_type == ZIO_CHILD_LOGICAL &&
1713 1844              zio_unique_parent(zio) == NULL) {
1714 1845                  /*
1715 1846                   * This is a logical async I/O with no parent to wait for it.
1716 1847                   * We add it to the spa_async_root_zio "Godfather" I/O which
1717 1848                   * will ensure they complete prior to unloading the pool.
1718 1849                   */
1719 1850                  spa_t *spa = zio->io_spa;
1720 1851  
1721 1852                  zio_add_child(spa->spa_async_zio_root[CPU_SEQID], zio);
1722 1853          }
1723 1854  
1724 1855          ASSERT0(zio->io_queued_timestamp);
1725 1856          zio->io_queued_timestamp = gethrtime();
1726 1857          zio_execute(zio);
1727 1858  }
1728 1859  
1729 1860  /*
1730 1861   * ==========================================================================
1731 1862   * Reexecute, cancel, or suspend/resume failed I/O
1732 1863   * ==========================================================================
1733 1864   */
1734 1865  
1735 1866  static void
1736 1867  zio_reexecute(zio_t *pio)
1737 1868  {
1738 1869          zio_t *cio, *cio_next;
1739 1870  
1740 1871          ASSERT(pio->io_child_type == ZIO_CHILD_LOGICAL);
1741 1872          ASSERT(pio->io_orig_stage == ZIO_STAGE_OPEN);
1742 1873          ASSERT(pio->io_gang_leader == NULL);
1743 1874          ASSERT(pio->io_gang_tree == NULL);
1744 1875  
1745 1876          pio->io_flags = pio->io_orig_flags;
1746 1877          pio->io_stage = pio->io_orig_stage;
1747 1878          pio->io_pipeline = pio->io_orig_pipeline;
1748 1879          pio->io_reexecute = 0;
1749 1880          pio->io_flags |= ZIO_FLAG_REEXECUTED;
1750 1881          pio->io_pipeline_trace = 0;
1751 1882          pio->io_error = 0;
1752 1883          for (int w = 0; w < ZIO_WAIT_TYPES; w++)
1753 1884                  pio->io_state[w] = 0;
1754 1885          for (int c = 0; c < ZIO_CHILD_TYPES; c++)
1755 1886                  pio->io_child_error[c] = 0;
1756 1887  
1757 1888          if (IO_IS_ALLOCATING(pio))
1758 1889                  BP_ZERO(pio->io_bp);
1759 1890  
1760 1891          /*
1761 1892           * As we reexecute pio's children, new children could be created.
1762 1893           * New children go to the head of pio's io_child_list, however,
1763 1894           * so we will (correctly) not reexecute them.  The key is that
1764 1895           * the remainder of pio's io_child_list, from 'cio_next' onward,
1765 1896           * cannot be affected by any side effects of reexecuting 'cio'.
1766 1897           */
1767 1898          zio_link_t *zl = NULL;
1768 1899          for (cio = zio_walk_children(pio, &zl); cio != NULL; cio = cio_next) {
1769 1900                  cio_next = zio_walk_children(pio, &zl);
1770 1901                  mutex_enter(&pio->io_lock);
1771 1902                  for (int w = 0; w < ZIO_WAIT_TYPES; w++)
1772 1903                          pio->io_children[cio->io_child_type][w]++;
1773 1904                  mutex_exit(&pio->io_lock);
1774 1905                  zio_reexecute(cio);
1775 1906          }
1776 1907  
1777 1908          /*
1778 1909           * Now that all children have been reexecuted, execute the parent.
1779 1910           * We don't reexecute "The Godfather" I/O here as it's the
1780 1911           * responsibility of the caller to wait on it.
1781 1912           */
1782 1913          if (!(pio->io_flags & ZIO_FLAG_GODFATHER)) {
1783 1914                  pio->io_queued_timestamp = gethrtime();
1784 1915                  zio_execute(pio);
1785 1916          }
1786 1917  }
1787 1918  
1788 1919  void
1789 1920  zio_suspend(spa_t *spa, zio_t *zio)
1790 1921  {
1791 1922          if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_PANIC)
1792 1923                  fm_panic("Pool '%s' has encountered an uncorrectable I/O "
1793 1924                      "failure and the failure mode property for this pool "
1794 1925                      "is set to panic.", spa_name(spa));
1795 1926  
1796 1927          zfs_ereport_post(FM_EREPORT_ZFS_IO_FAILURE, spa, NULL, NULL, 0, 0);
1797 1928  
1798 1929          mutex_enter(&spa->spa_suspend_lock);
1799 1930  
1800 1931          if (spa->spa_suspend_zio_root == NULL)
1801 1932                  spa->spa_suspend_zio_root = zio_root(spa, NULL, NULL,
1802 1933                      ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
1803 1934                      ZIO_FLAG_GODFATHER);
1804 1935  
1805 1936          spa->spa_suspended = B_TRUE;
1806 1937  
1807 1938          if (zio != NULL) {
1808 1939                  ASSERT(!(zio->io_flags & ZIO_FLAG_GODFATHER));
1809 1940                  ASSERT(zio != spa->spa_suspend_zio_root);
1810 1941                  ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
1811 1942                  ASSERT(zio_unique_parent(zio) == NULL);
1812 1943                  ASSERT(zio->io_stage == ZIO_STAGE_DONE);
1813 1944                  zio_add_child(spa->spa_suspend_zio_root, zio);
1814 1945          }
1815 1946  
1816 1947          mutex_exit(&spa->spa_suspend_lock);
1817 1948  }
1818 1949  
1819 1950  int
1820 1951  zio_resume(spa_t *spa)
1821 1952  {
1822 1953          zio_t *pio;
1823 1954  
1824 1955          /*
1825 1956           * Reexecute all previously suspended i/o.
1826 1957           */
1827 1958          mutex_enter(&spa->spa_suspend_lock);
1828 1959          spa->spa_suspended = B_FALSE;
1829 1960          cv_broadcast(&spa->spa_suspend_cv);
1830 1961          pio = spa->spa_suspend_zio_root;
1831 1962          spa->spa_suspend_zio_root = NULL;
1832 1963          mutex_exit(&spa->spa_suspend_lock);
1833 1964  
1834 1965          if (pio == NULL)
1835 1966                  return (0);
1836 1967  
1837 1968          zio_reexecute(pio);
1838 1969          return (zio_wait(pio));
1839 1970  }
1840 1971  
1841 1972  void
1842 1973  zio_resume_wait(spa_t *spa)
1843 1974  {
1844 1975          mutex_enter(&spa->spa_suspend_lock);
1845 1976          while (spa_suspended(spa))
1846 1977                  cv_wait(&spa->spa_suspend_cv, &spa->spa_suspend_lock);
1847 1978          mutex_exit(&spa->spa_suspend_lock);
1848 1979  }
1849 1980  
1850 1981  /*
1851 1982   * ==========================================================================
1852 1983   * Gang blocks.
1853 1984   *
1854 1985   * A gang block is a collection of small blocks that looks to the DMU
1855 1986   * like one large block.  When zio_dva_allocate() cannot find a block
1856 1987   * of the requested size, due to either severe fragmentation or the pool
1857 1988   * being nearly full, it calls zio_write_gang_block() to construct the
1858 1989   * block from smaller fragments.
1859 1990   *
1860 1991   * A gang block consists of a gang header (zio_gbh_phys_t) and up to
1861 1992   * three (SPA_GBH_NBLKPTRS) gang members.  The gang header is just like
1862 1993   * an indirect block: it's an array of block pointers.  It consumes
1863 1994   * only one sector and hence is allocatable regardless of fragmentation.
1864 1995   * The gang header's bps point to its gang members, which hold the data.
1865 1996   *
1866 1997   * Gang blocks are self-checksumming, using the bp's <vdev, offset, txg>
1867 1998   * as the verifier to ensure uniqueness of the SHA256 checksum.
1868 1999   * Critically, the gang block bp's blk_cksum is the checksum of the data,
1869 2000   * not the gang header.  This ensures that data block signatures (needed for
1870 2001   * deduplication) are independent of how the block is physically stored.
1871 2002   *
1872 2003   * Gang blocks can be nested: a gang member may itself be a gang block.
1873 2004   * Thus every gang block is a tree in which root and all interior nodes are
1874 2005   * gang headers, and the leaves are normal blocks that contain user data.
1875 2006   * The root of the gang tree is called the gang leader.
1876 2007   *
1877 2008   * To perform any operation (read, rewrite, free, claim) on a gang block,
1878 2009   * zio_gang_assemble() first assembles the gang tree (minus data leaves)
1879 2010   * in the io_gang_tree field of the original logical i/o by recursively
1880 2011   * reading the gang leader and all gang headers below it.  This yields
1881 2012   * an in-core tree containing the contents of every gang header and the
1882 2013   * bps for every constituent of the gang block.
1883 2014   *
1884 2015   * With the gang tree now assembled, zio_gang_issue() just walks the gang tree
1885 2016   * and invokes a callback on each bp.  To free a gang block, zio_gang_issue()
1886 2017   * calls zio_free_gang() -- a trivial wrapper around zio_free() -- for each bp.
1887 2018   * zio_claim_gang() provides a similarly trivial wrapper for zio_claim().
1888 2019   * zio_read_gang() is a wrapper around zio_read() that omits reading gang
1889 2020   * headers, since we already have those in io_gang_tree.  zio_rewrite_gang()
1890 2021   * performs a zio_rewrite() of the data or, for gang headers, a zio_rewrite()
1891 2022   * of the gang header plus zio_checksum_compute() of the data to update the
1892 2023   * gang header's blk_cksum as described above.
1893 2024   *
1894 2025   * The two-phase assemble/issue model solves the problem of partial failure --
1895 2026   * what if you'd freed part of a gang block but then couldn't read the
1896 2027   * gang header for another part?  Assembling the entire gang tree first
1897 2028   * ensures that all the necessary gang header I/O has succeeded before
1898 2029   * starting the actual work of free, claim, or write.  Once the gang tree
1899 2030   * is assembled, free and claim are in-memory operations that cannot fail.
1900 2031   *
1901 2032   * In the event that a gang write fails, zio_dva_unallocate() walks the
1902 2033   * gang tree to immediately free (i.e. insert back into the space map)
1903 2034   * everything we've allocated.  This ensures that we don't get ENOSPC
1904 2035   * errors during repeated suspend/resume cycles due to a flaky device.
1905 2036   *
1906 2037   * Gang rewrites only happen during sync-to-convergence.  If we can't assemble
1907 2038   * the gang tree, we won't modify the block, so we can safely defer the free
1908 2039   * (knowing that the block is still intact).  If we *can* assemble the gang
1909 2040   * tree, then even if some of the rewrites fail, zio_dva_unallocate() will free
1910 2041   * each constituent bp and we can allocate a new block on the next sync pass.
1911 2042   *
1912 2043   * In all cases, the gang tree allows complete recovery from partial failure.
1913 2044   * ==========================================================================
1914 2045   */
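/*
 * Simplified illustration (editor's sketch, not from zio.c): the issue phase
 * described above is just a recursive walk of the in-core gang tree, invoking
 * the per-operation callback on every bp and descending into nested gang
 * headers.  All names below (NPTRS, bp_t, gnode_t, gang_issue) are invented
 * for the example; the real walk is zio_gang_tree_issue() further down.
 */
#define	NPTRS	3				/* stands in for SPA_GBH_NBLKPTRS */

typedef struct bp {				/* stand-in for blkptr_t */
	int	b_dummy;
} bp_t;

typedef struct gnode {				/* stand-in for zio_gang_node_t */
	bp_t		gn_bp[NPTRS];		/* bps held by this gang header */
	struct gnode	*gn_child[NPTRS];	/* non-NULL where gn_bp[g] is itself gang */
} gnode_t;

static void
gang_issue(gnode_t *gn, void (*func)(bp_t *))
{
	for (int g = 0; g < NPTRS; g++) {
		func(&gn->gn_bp[g]);			/* callback on every bp */
		if (gn->gn_child[g] != NULL)
			gang_issue(gn->gn_child[g], func);	/* recurse into nested headers */
	}
}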
1915 2046  
1916 2047  static void
1917 2048  zio_gang_issue_func_done(zio_t *zio)
1918 2049  {
1919 2050          abd_put(zio->io_abd);
1920 2051  }
1921 2052  
1922 2053  static zio_t *
1923 2054  zio_read_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1924 2055      uint64_t offset)
1925 2056  {
1926 2057          if (gn != NULL)
1927 2058                  return (pio);
1928 2059  
1929 2060          return (zio_read(pio, pio->io_spa, bp, abd_get_offset(data, offset),
1930 2061              BP_GET_PSIZE(bp), zio_gang_issue_func_done,
1931 2062              NULL, pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
1932 2063              &pio->io_bookmark));
1933 2064  }
1934 2065  
1935 2066  static zio_t *
1936 2067  zio_rewrite_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1937 2068      uint64_t offset)
1938 2069  {
1939 2070          zio_t *zio;
1940 2071  
1941 2072          if (gn != NULL) {
1942 2073                  abd_t *gbh_abd =
1943 2074                      abd_get_from_buf(gn->gn_gbh, SPA_GANGBLOCKSIZE);
1944 2075                  zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
1945 2076                      gbh_abd, SPA_GANGBLOCKSIZE, zio_gang_issue_func_done, NULL,
1946 2077                      pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
1947 2078                      &pio->io_bookmark);
1948 2079                  /*
1949 2080                   * As we rewrite each gang header, the pipeline will compute
1950 2081                   * a new gang block header checksum for it; but no one will
1951 2082                   * compute a new data checksum, so we do that here.  The one
1952 2083                   * exception is the gang leader: the pipeline already computed
1953 2084                   * its data checksum because that stage precedes gang assembly.
1954 2085                   * (Presently, nothing actually uses interior data checksums;
1955 2086                   * this is just good hygiene.)
1956 2087                   */
1957 2088                  if (gn != pio->io_gang_leader->io_gang_tree) {
1958 2089                          abd_t *buf = abd_get_offset(data, offset);
1959 2090  
1960 2091                          zio_checksum_compute(zio, BP_GET_CHECKSUM(bp),
1961 2092                              buf, BP_GET_PSIZE(bp));
1962 2093  
1963 2094                          abd_put(buf);
1964 2095                  }
1965 2096                  /*
1966 2097                   * If we are here to damage data for testing purposes,
1967 2098                   * leave the GBH alone so that we can detect the damage.
1968 2099                   */
1969 2100                  if (pio->io_gang_leader->io_flags & ZIO_FLAG_INDUCE_DAMAGE)
1970 2101                          zio->io_pipeline &= ~ZIO_VDEV_IO_STAGES;
1971 2102          } else {
1972 2103                  zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
1973 2104                      abd_get_offset(data, offset), BP_GET_PSIZE(bp),
1974 2105                      zio_gang_issue_func_done, NULL, pio->io_priority,
1975 2106                      ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
1976 2107          }
1977 2108  
1978 2109          return (zio);
1979 2110  }
1980 2111  
1981 2112  /* ARGSUSED */
1982 2113  static zio_t *
1983 2114  zio_free_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1984 2115      uint64_t offset)
1985 2116  {
1986 2117          return (zio_free_sync(pio, pio->io_spa, pio->io_txg, bp,
1987 2118              ZIO_GANG_CHILD_FLAGS(pio)));
1988 2119  }
1989 2120  
1990 2121  /* ARGSUSED */
1991 2122  static zio_t *
1992 2123  zio_claim_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, abd_t *data,
1993 2124      uint64_t offset)
1994 2125  {
1995 2126          return (zio_claim(pio, pio->io_spa, pio->io_txg, bp,
1996 2127              NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
1997 2128  }
1998 2129  
1999 2130  static zio_gang_issue_func_t *zio_gang_issue_func[ZIO_TYPES] = {
2000 2131          NULL,
2001 2132          zio_read_gang,
2002 2133          zio_rewrite_gang,
2003 2134          zio_free_gang,
2004 2135          zio_claim_gang,
2005 2136          NULL
2006 2137  };
2007 2138  
2008 2139  static void zio_gang_tree_assemble_done(zio_t *zio);
2009 2140  
2010 2141  static zio_gang_node_t *
2011 2142  zio_gang_node_alloc(zio_gang_node_t **gnpp)
2012 2143  {
2013 2144          zio_gang_node_t *gn;
2014 2145  
2015 2146          ASSERT(*gnpp == NULL);
2016 2147  
2017 2148          gn = kmem_zalloc(sizeof (*gn), KM_SLEEP);
2018 2149          gn->gn_gbh = zio_buf_alloc(SPA_GANGBLOCKSIZE);
2019 2150          *gnpp = gn;
2020 2151  
2021 2152          return (gn);
2022 2153  }
2023 2154  
2024 2155  static void
2025 2156  zio_gang_node_free(zio_gang_node_t **gnpp)
2026 2157  {
2027 2158          zio_gang_node_t *gn = *gnpp;
2028 2159  
2029 2160          for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
2030 2161                  ASSERT(gn->gn_child[g] == NULL);
2031 2162  
2032 2163          zio_buf_free(gn->gn_gbh, SPA_GANGBLOCKSIZE);
2033 2164          kmem_free(gn, sizeof (*gn));
2034 2165          *gnpp = NULL;
2035 2166  }
2036 2167  
2037 2168  static void
2038 2169  zio_gang_tree_free(zio_gang_node_t **gnpp)
2039 2170  {
2040 2171          zio_gang_node_t *gn = *gnpp;
2041 2172  
2042 2173          if (gn == NULL)
2043 2174                  return;
2044 2175  
2045 2176          for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
2046 2177                  zio_gang_tree_free(&gn->gn_child[g]);
2047 2178  
2048 2179          zio_gang_node_free(gnpp);
2049 2180  }
2050 2181  
2051 2182  static void
2052 2183  zio_gang_tree_assemble(zio_t *gio, blkptr_t *bp, zio_gang_node_t **gnpp)
2053 2184  {
2054 2185          zio_gang_node_t *gn = zio_gang_node_alloc(gnpp);
2055 2186          abd_t *gbh_abd = abd_get_from_buf(gn->gn_gbh, SPA_GANGBLOCKSIZE);
2056 2187  
2057 2188          ASSERT(gio->io_gang_leader == gio);
2058 2189          ASSERT(BP_IS_GANG(bp));
2059 2190  
2060 2191          zio_nowait(zio_read(gio, gio->io_spa, bp, gbh_abd, SPA_GANGBLOCKSIZE,
2061 2192              zio_gang_tree_assemble_done, gn, gio->io_priority,
2062 2193              ZIO_GANG_CHILD_FLAGS(gio), &gio->io_bookmark));
2063 2194  }
2064 2195  
2065 2196  static void
2066 2197  zio_gang_tree_assemble_done(zio_t *zio)
2067 2198  {
2068 2199          zio_t *gio = zio->io_gang_leader;
2069 2200          zio_gang_node_t *gn = zio->io_private;
2070 2201          blkptr_t *bp = zio->io_bp;
2071 2202  
2072 2203          ASSERT(gio == zio_unique_parent(zio));
2073 2204          ASSERT(zio->io_child_count == 0);
2074 2205  
2075 2206          if (zio->io_error)
2076 2207                  return;
2077 2208  
2078 2209          /* this ABD was created from a linear buf in zio_gang_tree_assemble */
2079 2210          if (BP_SHOULD_BYTESWAP(bp))
2080 2211                  byteswap_uint64_array(abd_to_buf(zio->io_abd), zio->io_size);
2081 2212  
2082 2213          ASSERT3P(abd_to_buf(zio->io_abd), ==, gn->gn_gbh);
2083 2214          ASSERT(zio->io_size == SPA_GANGBLOCKSIZE);
2084 2215          ASSERT(gn->gn_gbh->zg_tail.zec_magic == ZEC_MAGIC);
2085 2216  
2086 2217          abd_put(zio->io_abd);
2087 2218  
2088 2219          for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2089 2220                  blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
2090 2221                  if (!BP_IS_GANG(gbp))
2091 2222                          continue;
2092 2223                  zio_gang_tree_assemble(gio, gbp, &gn->gn_child[g]);
2093 2224          }
2094 2225  }
2095 2226  
2096 2227  static void
2097 2228  zio_gang_tree_issue(zio_t *pio, zio_gang_node_t *gn, blkptr_t *bp, abd_t *data,
2098 2229      uint64_t offset)
2099 2230  {
2100 2231          zio_t *gio = pio->io_gang_leader;
2101 2232          zio_t *zio;
2102 2233  
2103 2234          ASSERT(BP_IS_GANG(bp) == !!gn);
2104 2235          ASSERT(BP_GET_CHECKSUM(bp) == BP_GET_CHECKSUM(gio->io_bp));
2105 2236          ASSERT(BP_GET_LSIZE(bp) == BP_GET_PSIZE(bp) || gn == gio->io_gang_tree);
2106 2237  
2107 2238          /*
2108 2239           * If you're a gang header, your data is in gn->gn_gbh.
2109 2240           * If you're a gang member, your data is in 'data' and gn == NULL.
2110 2241           */
2111 2242          zio = zio_gang_issue_func[gio->io_type](pio, bp, gn, data, offset);
2112 2243  
2113 2244          if (gn != NULL) {
2114 2245                  ASSERT(gn->gn_gbh->zg_tail.zec_magic == ZEC_MAGIC);
2115 2246  
2116 2247                  for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2117 2248                          blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
2118 2249                          if (BP_IS_HOLE(gbp))
2119 2250                                  continue;
2120 2251                          zio_gang_tree_issue(zio, gn->gn_child[g], gbp, data,
2121 2252                              offset);
2122 2253                          offset += BP_GET_PSIZE(gbp);
2123 2254                  }
2124 2255          }
2125 2256  
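        /*
         * Back at the root of the gang tree, the accumulated offset must
         * account for the entire logical I/O.
         */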
2126 2257          if (gn == gio->io_gang_tree)
2127 2258                  ASSERT3U(gio->io_size, ==, offset);
2128 2259  
2129 2260          if (zio != pio)
2130 2261                  zio_nowait(zio);
2131 2262  }
2132 2263  
2133 2264  static int
2134 2265  zio_gang_assemble(zio_t *zio)
2135 2266  {
2136 2267          blkptr_t *bp = zio->io_bp;
2137 2268  
2138 2269          ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == NULL);
2139 2270          ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2140 2271  
2141 2272          zio->io_gang_leader = zio;
2142 2273  
  
2143 2274          zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);
2144 2275  
2145 2276          return (ZIO_PIPELINE_CONTINUE);
2146 2277  }
2147 2278  
2148 2279  static int
2149 2280  zio_gang_issue(zio_t *zio)
2150 2281  {
2151 2282          blkptr_t *bp = zio->io_bp;
2152 2283  
2153      -        if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT, ZIO_WAIT_DONE)) {
     2284 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
2154 2285                  return (ZIO_PIPELINE_STOP);
2155      -        }
2156 2286  
2157 2287          ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
2158 2288          ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2159 2289  
2160 2290          if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
2161 2291                  zio_gang_tree_issue(zio, zio->io_gang_tree, bp, zio->io_abd,
2162 2292                      0);
2163 2293          else
2164 2294                  zio_gang_tree_free(&zio->io_gang_tree);
2165 2295  
2166 2296          zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2167 2297  
2168 2298          return (ZIO_PIPELINE_CONTINUE);
2169 2299  }
2170 2300  
2171 2301  static void
2172 2302  zio_write_gang_member_ready(zio_t *zio)
2173 2303  {
2174 2304          zio_t *pio = zio_unique_parent(zio);
2175 2305          zio_t *gio = zio->io_gang_leader;
2176 2306          dva_t *cdva = zio->io_bp->blk_dva;
2177 2307          dva_t *pdva = pio->io_bp->blk_dva;
2178 2308          uint64_t asize;
2179 2309  
2180 2310          if (BP_IS_HOLE(zio->io_bp))
2181 2311                  return;
2182 2312  
2183 2313          ASSERT(BP_IS_HOLE(&zio->io_bp_orig));
2184 2314  
2185 2315          ASSERT(zio->io_child_type == ZIO_CHILD_GANG);
2186 2316          ASSERT3U(zio->io_prop.zp_copies, ==, gio->io_prop.zp_copies);
2187 2317          ASSERT3U(zio->io_prop.zp_copies, <=, BP_GET_NDVAS(zio->io_bp));
2188 2318          ASSERT3U(pio->io_prop.zp_copies, <=, BP_GET_NDVAS(pio->io_bp));
2189 2319          ASSERT3U(BP_GET_NDVAS(zio->io_bp), <=, BP_GET_NDVAS(pio->io_bp));
2190 2320  
2191 2321          mutex_enter(&pio->io_lock);
2192 2322          for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
2193 2323                  ASSERT(DVA_GET_GANG(&pdva[d]));
2194 2324                  asize = DVA_GET_ASIZE(&pdva[d]);
2195 2325                  asize += DVA_GET_ASIZE(&cdva[d]);
2196 2326                  DVA_SET_ASIZE(&pdva[d], asize);
2197 2327          }
2198 2328          mutex_exit(&pio->io_lock);
2199 2329  }
2200 2330  
  
2201 2331  static void
2202 2332  zio_write_gang_done(zio_t *zio)
2203 2333  {
2204 2334          abd_put(zio->io_abd);
2205 2335  }
2206 2336  
2207 2337  static int
2208 2338  zio_write_gang_block(zio_t *pio)
2209 2339  {
2210 2340          spa_t *spa = pio->io_spa;
2211      -        metaslab_class_t *mc = spa_normal_class(spa);
     2341 +        metaslab_class_t *mc = pio->io_mc;
2212 2342          blkptr_t *bp = pio->io_bp;
2213 2343          zio_t *gio = pio->io_gang_leader;
2214 2344          zio_t *zio;
2215 2345          zio_gang_node_t *gn, **gnpp;
2216 2346          zio_gbh_phys_t *gbh;
2217 2347          abd_t *gbh_abd;
2218 2348          uint64_t txg = pio->io_txg;
2219 2349          uint64_t resid = pio->io_size;
2220 2350          uint64_t lsize;
2221 2351          int copies = gio->io_prop.zp_copies;
2222 2352          int gbh_copies = MIN(copies + 1, spa_max_replication(spa));
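        /*
         * The gang header is written with one more copy than the data
         * (capped at the pool's max replication), since losing a gang
         * header would make all of its gang members unreachable.
         */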
2223 2353          zio_prop_t zp;
2224 2354          int error;
2225 2355  
2226 2356          int flags = METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER;
2227 2357          if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2228 2358                  ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2229 2359                  ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2230 2360  
2231 2361                  flags |= METASLAB_ASYNC_ALLOC;
2232 2362                  VERIFY(refcount_held(&mc->mc_alloc_slots, pio));
2233 2363  
2234 2364                  /*
2235 2365                   * The logical zio has already placed a reservation for
2236 2366                   * 'copies' allocation slots but gang blocks may require
2237 2367                   * additional copies. These additional copies
2238 2368                   * (i.e. gbh_copies - copies) are guaranteed to succeed
2239 2369                   * since metaslab_class_throttle_reserve() always allows
2240 2370                   * additional reservations for gang blocks.
2241 2371                   */
2242 2372                  VERIFY(metaslab_class_throttle_reserve(mc, gbh_copies - copies,
2243 2373                      pio, flags));
2244 2374          }
2245 2375  
2246 2376          error = metaslab_alloc(spa, mc, SPA_GANGBLOCKSIZE,
2247 2377              bp, gbh_copies, txg, pio == gio ? NULL : gio->io_bp, flags,
2248 2378              &pio->io_alloc_list, pio);
2249 2379          if (error) {
2250 2380                  if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2251 2381                          ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2252 2382                          ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2253 2383  
2254 2384                          /*
2255 2385                           * If we failed to allocate the gang block header then
2256 2386                           * we remove any additional allocation reservations that
2257 2387                           * we placed here. The original reservation will
2258 2388                           * be removed when the logical I/O goes to the ready
2259 2389                           * stage.
2260 2390                           */
2261 2391                          metaslab_class_throttle_unreserve(mc,
2262 2392                              gbh_copies - copies, pio);
2263 2393                  }
2264 2394                  pio->io_error = error;
2265 2395                  return (ZIO_PIPELINE_CONTINUE);
2266 2396          }
2267 2397  
2268 2398          if (pio == gio) {
2269 2399                  gnpp = &gio->io_gang_tree;
2270 2400          } else {
2271 2401                  gnpp = pio->io_private;
2272 2402                  ASSERT(pio->io_ready == zio_write_gang_member_ready);
2273 2403          }
2274 2404  
2275 2405          gn = zio_gang_node_alloc(gnpp);
2276 2406          gbh = gn->gn_gbh;
2277 2407          bzero(gbh, SPA_GANGBLOCKSIZE);
2278 2408          gbh_abd = abd_get_from_buf(gbh, SPA_GANGBLOCKSIZE);
2279 2409  
2280 2410          /*
2281 2411           * Create the gang header.
2282 2412           */
2283 2413          zio = zio_rewrite(pio, spa, txg, bp, gbh_abd, SPA_GANGBLOCKSIZE,
2284 2414              zio_write_gang_done, NULL, pio->io_priority,
2285 2415              ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
2286 2416  
2287 2417          /*
2288 2418           * Create and nowait the gang children.
2289 2419           */
2290 2420          for (int g = 0; resid != 0; resid -= lsize, g++) {
2291 2421                  lsize = P2ROUNDUP(resid / (SPA_GBH_NBLKPTRS - g),
2292 2422                      SPA_MINBLOCKSIZE);
2293 2423                  ASSERT(lsize >= SPA_MINBLOCKSIZE && lsize <= resid);
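                /*
                 * Each pass takes roughly 1/Nth of what remains, rounded up
                 * to SPA_MINBLOCKSIZE, so the last member absorbs the
                 * remainder.  Hypothetical example with resid = 102400
                 * (100K), three bps per header, SPA_MINBLOCKSIZE = 512:
                 *   g=0: 102400 / 3 = 34133 -> 34304;  resid -> 68096
                 *   g=1:  68096 / 2 = 34048 -> 34304;  resid -> 33792
                 *   g=2:  33792 / 1 = 33792 (aligned); resid -> 0
                 */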
2294 2424  
2295 2425                  zp.zp_checksum = gio->io_prop.zp_checksum;
2296 2426                  zp.zp_compress = ZIO_COMPRESS_OFF;
2297 2427                  zp.zp_type = DMU_OT_NONE;
  
2298 2428                  zp.zp_level = 0;
2299 2429                  zp.zp_copies = gio->io_prop.zp_copies;
2300 2430                  zp.zp_dedup = B_FALSE;
2301 2431                  zp.zp_dedup_verify = B_FALSE;
2302 2432                  zp.zp_nopwrite = B_FALSE;
2303 2433  
2304 2434                  zio_t *cio = zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
2305 2435                      abd_get_offset(pio->io_abd, pio->io_size - resid), lsize,
2306 2436                      lsize, &zp, zio_write_gang_member_ready, NULL, NULL,
2307 2437                      zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
2308      -                    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
     2438 +                    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark,
     2439 +                    &pio->io_smartcomp);
2309 2440  
     2441 +                cio->io_mc = mc;
     2442 +
2310 2443                  if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2311 2444                          ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2312 2445                          ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2313 2446  
2314 2447                          /*
2315 2448                           * Gang children won't throttle but we should
2316 2449                           * account for their work, so reserve an allocation
2317 2450                           * slot for them here.
2318 2451                           */
2319 2452                          VERIFY(metaslab_class_throttle_reserve(mc,
2320 2453                              zp.zp_copies, cio, flags));
2321 2454                  }
2322 2455                  zio_nowait(cio);
2323 2456          }
2324 2457  
2325 2458          /*
2326 2459           * Set pio's pipeline to just wait for zio to finish.
2327 2460           */
2328 2461          pio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2329 2462  
2330 2463          zio_nowait(zio);
2331 2464  
2332 2465          return (ZIO_PIPELINE_CONTINUE);
2333 2466  }
2334 2467  
2335 2468  /*
2336 2469   * The zio_nop_write stage in the pipeline determines if allocating a
2337 2470   * new bp is necessary.  The nopwrite feature can handle writes in
2338 2471   * either syncing or open context (i.e. zil writes) and as a result is
2339 2472   * mutually exclusive with dedup.
2340 2473   *
2341 2474   * By leveraging a cryptographically secure checksum, such as SHA256, we
2342 2475   * can compare the checksums of the new data and the old to determine if
2343 2476   * allocating a new block is required.  Note that our requirements for
2344 2477   * cryptographic strength are fairly weak: there can't be any accidental
2345 2478   * hash collisions, but we don't need to be secure against intentional
2346 2479   * (malicious) collisions.  To trigger a nopwrite, you have to be able
2347 2480   * to write the file to begin with, and triggering an incorrect (hash
2348 2481   * collision) nopwrite is no worse than simply writing to the file.
2349 2482   * That said, there are no known attacks against the checksum algorithms
2350 2483   * used for nopwrite, assuming that the salt and the checksums
2351 2484   * themselves remain secret.
2352 2485   */
2353 2486  static int
2354 2487  zio_nop_write(zio_t *zio)
2355 2488  {
2356 2489          blkptr_t *bp = zio->io_bp;
2357 2490          blkptr_t *bp_orig = &zio->io_bp_orig;
2358 2491          zio_prop_t *zp = &zio->io_prop;
2359 2492  
2360 2493          ASSERT(BP_GET_LEVEL(bp) == 0);
2361 2494          ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
2362 2495          ASSERT(zp->zp_nopwrite);
2363 2496          ASSERT(!zp->zp_dedup);
2364 2497          ASSERT(zio->io_bp_override == NULL);
2365 2498          ASSERT(IO_IS_ALLOCATING(zio));
2366 2499  
2367 2500          /*
2368 2501           * Check to see if the original bp and the new bp have matching
2369 2502           * characteristics (i.e. same checksum, compression algorithms, etc).
2370 2503           * If they don't then just continue with the pipeline which will
2371 2504           * allocate a new bp.
2372 2505           */
2373 2506          if (BP_IS_HOLE(bp_orig) ||
2374 2507              !(zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_flags &
2375 2508              ZCHECKSUM_FLAG_NOPWRITE) ||
2376 2509              BP_GET_CHECKSUM(bp) != BP_GET_CHECKSUM(bp_orig) ||
2377 2510              BP_GET_COMPRESS(bp) != BP_GET_COMPRESS(bp_orig) ||
2378 2511              BP_GET_DEDUP(bp) != BP_GET_DEDUP(bp_orig) ||
2379 2512              zp->zp_copies != BP_GET_NDVAS(bp_orig))
2380 2513                  return (ZIO_PIPELINE_CONTINUE);
2381 2514  
2382 2515          /*
2383 2516           * If the checksums match then reset the pipeline so that we
2384 2517           * avoid allocating a new bp and issuing any I/O.
2385 2518           */
2386 2519          if (ZIO_CHECKSUM_EQUAL(bp->blk_cksum, bp_orig->blk_cksum)) {
2387 2520                  ASSERT(zio_checksum_table[zp->zp_checksum].ci_flags &
2388 2521                      ZCHECKSUM_FLAG_NOPWRITE);
2389 2522                  ASSERT3U(BP_GET_PSIZE(bp), ==, BP_GET_PSIZE(bp_orig));
2390 2523                  ASSERT3U(BP_GET_LSIZE(bp), ==, BP_GET_LSIZE(bp_orig));
2391 2524                  ASSERT(zp->zp_compress != ZIO_COMPRESS_OFF);
2392 2525                  ASSERT(bcmp(&bp->blk_prop, &bp_orig->blk_prop,
2393 2526                      sizeof (uint64_t)) == 0);
2394 2527  
2395 2528                  *bp = *bp_orig;
2396 2529                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
2397 2530                  zio->io_flags |= ZIO_FLAG_NOPWRITE;
2398 2531          }
2399 2532  
2400 2533          return (ZIO_PIPELINE_CONTINUE);
2401 2534  }
2402 2535  
2403 2536  /*
2404 2537   * ==========================================================================
2405 2538   * Dedup
2406 2539   * ==========================================================================
2407 2540   */
2408 2541  static void
2409 2542  zio_ddt_child_read_done(zio_t *zio)
2410 2543  {
2411 2544          blkptr_t *bp = zio->io_bp;
2412 2545          ddt_entry_t *dde = zio->io_private;
2413 2546          ddt_phys_t *ddp;
2414 2547          zio_t *pio = zio_unique_parent(zio);
2415 2548  
2416 2549          mutex_enter(&pio->io_lock);
2417 2550          ddp = ddt_phys_select(dde, bp);
2418 2551          if (zio->io_error == 0)
2419 2552                  ddt_phys_clear(ddp);    /* this ddp doesn't need repair */
2420 2553  
2421 2554          if (zio->io_error == 0 && dde->dde_repair_abd == NULL)
2422 2555                  dde->dde_repair_abd = zio->io_abd;
2423 2556          else
2424 2557                  abd_free(zio->io_abd);
2425 2558          mutex_exit(&pio->io_lock);
2426 2559  }
2427 2560  
2428 2561  static int
2429 2562  zio_ddt_read_start(zio_t *zio)
2430 2563  {
2431 2564          blkptr_t *bp = zio->io_bp;
2432 2565  
2433 2566          ASSERT(BP_GET_DEDUP(bp));
2434 2567          ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
2435 2568          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2436 2569  
2437 2570          if (zio->io_child_error[ZIO_CHILD_DDT]) {
2438 2571                  ddt_t *ddt = ddt_select(zio->io_spa, bp);
2439 2572                  ddt_entry_t *dde = ddt_repair_start(ddt, bp);
2440 2573                  ddt_phys_t *ddp = dde->dde_phys;
2441 2574                  ddt_phys_t *ddp_self = ddt_phys_select(dde, bp);
2442 2575                  blkptr_t blk;
2443 2576  
2444 2577                  ASSERT(zio->io_vsd == NULL);
2445 2578                  zio->io_vsd = dde;
2446 2579  
2447 2580                  if (ddp_self == NULL)
2448 2581                          return (ZIO_PIPELINE_CONTINUE);
2449 2582  
2450 2583                  for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
2451 2584                          if (ddp->ddp_phys_birth == 0 || ddp == ddp_self)
2452 2585                                  continue;
2453 2586                          ddt_bp_create(ddt->ddt_checksum, &dde->dde_key, ddp,
2454 2587                              &blk);
2455 2588                          zio_nowait(zio_read(zio, zio->io_spa, &blk,
2456 2589                              abd_alloc_for_io(zio->io_size, B_TRUE),
2457 2590                              zio->io_size, zio_ddt_child_read_done, dde,
2458 2591                              zio->io_priority, ZIO_DDT_CHILD_FLAGS(zio) |
2459 2592                              ZIO_FLAG_DONT_PROPAGATE, &zio->io_bookmark));
2460 2593                  }
2461 2594                  return (ZIO_PIPELINE_CONTINUE);
2462 2595          }
2463 2596  
2464 2597          zio_nowait(zio_read(zio, zio->io_spa, bp,
2465 2598              zio->io_abd, zio->io_size, NULL, NULL, zio->io_priority,
  
2466 2599              ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark));
2467 2600  
2468 2601          return (ZIO_PIPELINE_CONTINUE);
2469 2602  }
2470 2603  
2471 2604  static int
2472 2605  zio_ddt_read_done(zio_t *zio)
2473 2606  {
2474 2607          blkptr_t *bp = zio->io_bp;
2475 2608  
2476      -        if (zio_wait_for_children(zio, ZIO_CHILD_DDT_BIT, ZIO_WAIT_DONE)) {
     2609 +        if (zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE))
2477 2610                  return (ZIO_PIPELINE_STOP);
2478      -        }
2479 2611  
2480 2612          ASSERT(BP_GET_DEDUP(bp));
2481 2613          ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
2482 2614          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2483 2615  
2484 2616          if (zio->io_child_error[ZIO_CHILD_DDT]) {
2485 2617                  ddt_t *ddt = ddt_select(zio->io_spa, bp);
2486 2618                  ddt_entry_t *dde = zio->io_vsd;
2487 2619                  if (ddt == NULL) {
2488 2620                          ASSERT(spa_load_state(zio->io_spa) != SPA_LOAD_NONE);
2489 2621                          return (ZIO_PIPELINE_CONTINUE);
2490 2622                  }
2491 2623                  if (dde == NULL) {
2492 2624                          zio->io_stage = ZIO_STAGE_DDT_READ_START >> 1;
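                        /*
                         * io_stage now points one stage back, so the
                         * redispatch below re-runs DDT_READ_START.
                         */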
2493 2625                          zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);
2494 2626                          return (ZIO_PIPELINE_STOP);
2495 2627                  }
2496 2628                  if (dde->dde_repair_abd != NULL) {
2497 2629                          abd_copy(zio->io_abd, dde->dde_repair_abd,
2498 2630                              zio->io_size);
2499 2631                          zio->io_child_error[ZIO_CHILD_DDT] = 0;
  
2500 2632                  }
2501 2633                  ddt_repair_done(ddt, dde);
2502 2634                  zio->io_vsd = NULL;
2503 2635          }
2504 2636  
2505 2637          ASSERT(zio->io_vsd == NULL);
2506 2638  
2507 2639          return (ZIO_PIPELINE_CONTINUE);
2508 2640  }
2509 2641  
     2642 +/* ARGSUSED */
2510 2643  static boolean_t
2511 2644  zio_ddt_collision(zio_t *zio, ddt_t *ddt, ddt_entry_t *dde)
2512 2645  {
2513 2646          spa_t *spa = zio->io_spa;
2514 2647          boolean_t do_raw = (zio->io_flags & ZIO_FLAG_RAW);
2515 2648  
2516 2649          /* We should never get a raw, override zio */
2517 2650          ASSERT(!(zio->io_bp_override && do_raw));
2518 2651  
2519 2652          /*
2520 2653           * Note: we compare the original data, not the transformed data,
2521 2654           * because when zio->io_bp is an override bp, we will not have
2522 2655           * pushed the I/O transforms.  That's an important optimization
2523 2656           * because otherwise we'd compress/encrypt all dmu_sync() data twice.
2524 2657           */
2525 2658          for (int p = DDT_PHYS_SINGLE; p <= DDT_PHYS_TRIPLE; p++) {
2526 2659                  zio_t *lio = dde->dde_lead_zio[p];
2527 2660  
2528 2661                  if (lio != NULL) {
2529 2662                          return (lio->io_orig_size != zio->io_orig_size ||
2530 2663                              abd_cmp(zio->io_orig_abd, lio->io_orig_abd,
2531 2664                              zio->io_orig_size) != 0);
2532 2665                  }
2533 2666          }
2534 2667  
2535 2668          for (int p = DDT_PHYS_SINGLE; p <= DDT_PHYS_TRIPLE; p++) {
2536 2669                  ddt_phys_t *ddp = &dde->dde_phys[p];
  
2537 2670  
2538 2671                  if (ddp->ddp_phys_birth != 0) {
2539 2672                          arc_buf_t *abuf = NULL;
2540 2673                          arc_flags_t aflags = ARC_FLAG_WAIT;
2541 2674                          int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE;
2542 2675                          blkptr_t blk = *zio->io_bp;
2543 2676                          int error;
2544 2677  
2545 2678                          ddt_bp_fill(ddp, &blk, ddp->ddp_phys_birth);
2546 2679  
2547      -                        ddt_exit(ddt);
     2680 +                        dde_exit(dde);
2548 2681  
2549 2682                          /*
2550 2683                           * Intuitively, it would make more sense to compare
2551 2684                           * io_abd than io_orig_abd in the raw case since you
2552 2685                           * don't want to look at any transformations that have
2553 2686                           * happened to the data. However, for raw I/Os the
2554 2687                           * data will actually be the same in io_abd and
2555 2688                           * io_orig_abd, so all we have to do is issue this as
2556 2689                           * a raw ARC read.
2557 2690                           */
2558 2691                          if (do_raw) {
2559 2692                                  zio_flags |= ZIO_FLAG_RAW;
2560 2693                                  ASSERT3U(zio->io_size, ==, zio->io_orig_size);
2561 2694                                  ASSERT0(abd_cmp(zio->io_abd, zio->io_orig_abd,
2562 2695                                      zio->io_size));
2563 2696                                  ASSERT3P(zio->io_transform_stack, ==, NULL);
2564 2697                          }
2565 2698  
2566 2699                          error = arc_read(NULL, spa, &blk,
2567 2700                              arc_getbuf_func, &abuf, ZIO_PRIORITY_SYNC_READ,
  
2568 2701                              zio_flags, &aflags, &zio->io_bookmark);
2569 2702  
2570 2703                          if (error == 0) {
2571 2704                                  if (arc_buf_size(abuf) != zio->io_orig_size ||
2572 2705                                      abd_cmp_buf(zio->io_orig_abd, abuf->b_data,
2573 2706                                      zio->io_orig_size) != 0)
2574 2707                                          error = SET_ERROR(EEXIST);
2575 2708                                  arc_buf_destroy(abuf, &abuf);
2576 2709                          }
2577 2710  
2578      -                        ddt_enter(ddt);
     2711 +                        dde_enter(dde);
2579 2712                          return (error != 0);
2580 2713                  }
2581 2714          }
2582 2715  
2583 2716          return (B_FALSE);
2584 2717  }
2585 2718  
2586 2719  static void
2587 2720  zio_ddt_child_write_ready(zio_t *zio)
2588 2721  {
2589 2722          int p = zio->io_prop.zp_copies;
2590      -        ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2591 2723          ddt_entry_t *dde = zio->io_private;
2592 2724          ddt_phys_t *ddp = &dde->dde_phys[p];
2593 2725          zio_t *pio;
2594 2726  
2595 2727          if (zio->io_error)
2596 2728                  return;
2597 2729  
2598      -        ddt_enter(ddt);
     2730 +        dde_enter(dde);
2599 2731  
2600 2732          ASSERT(dde->dde_lead_zio[p] == zio);
2601 2733  
2602 2734          ddt_phys_fill(ddp, zio->io_bp);
2603 2735  
2604 2736          zio_link_t *zl = NULL;
2605 2737          while ((pio = zio_walk_parents(zio, &zl)) != NULL)
2606 2738                  ddt_bp_fill(ddp, pio->io_bp, zio->io_txg);
2607 2739  
2608      -        ddt_exit(ddt);
     2740 +        dde_exit(dde);
2609 2741  }
2610 2742  
2611 2743  static void
2612 2744  zio_ddt_child_write_done(zio_t *zio)
2613 2745  {
2614 2746          int p = zio->io_prop.zp_copies;
2615      -        ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2616 2747          ddt_entry_t *dde = zio->io_private;
2617 2748          ddt_phys_t *ddp = &dde->dde_phys[p];
2618 2749  
2619      -        ddt_enter(ddt);
     2750 +        dde_enter(dde);
2620 2751  
2621 2752          ASSERT(ddp->ddp_refcnt == 0);
2622 2753          ASSERT(dde->dde_lead_zio[p] == zio);
2623 2754          dde->dde_lead_zio[p] = NULL;
2624 2755  
2625 2756          if (zio->io_error == 0) {
2626 2757                  zio_link_t *zl = NULL;
2627 2758                  while (zio_walk_parents(zio, &zl) != NULL)
2628 2759                          ddt_phys_addref(ddp);
2629 2760          } else {
2630 2761                  ddt_phys_clear(ddp);
2631 2762          }
2632 2763  
2633      -        ddt_exit(ddt);
     2764 +        dde_exit(dde);
2634 2765  }
2635 2766  
2636 2767  static void
2637 2768  zio_ddt_ditto_write_done(zio_t *zio)
2638 2769  {
2639 2770          int p = DDT_PHYS_DITTO;
2640 2771          zio_prop_t *zp = &zio->io_prop;
2641 2772          blkptr_t *bp = zio->io_bp;
2642 2773          ddt_t *ddt = ddt_select(zio->io_spa, bp);
2643 2774          ddt_entry_t *dde = zio->io_private;
2644 2775          ddt_phys_t *ddp = &dde->dde_phys[p];
2645 2776          ddt_key_t *ddk = &dde->dde_key;
2646 2777  
2647      -        ddt_enter(ddt);
     2778 +        dde_enter(dde);
2648 2779  
2649 2780          ASSERT(ddp->ddp_refcnt == 0);
2650 2781          ASSERT(dde->dde_lead_zio[p] == zio);
2651 2782          dde->dde_lead_zio[p] = NULL;
2652 2783  
2653 2784          if (zio->io_error == 0) {
2654 2785                  ASSERT(ZIO_CHECKSUM_EQUAL(bp->blk_cksum, ddk->ddk_cksum));
2655 2786                  ASSERT(zp->zp_copies < SPA_DVAS_PER_BP);
2656 2787                  ASSERT(zp->zp_copies == BP_GET_NDVAS(bp) - BP_IS_GANG(bp));
2657 2788                  if (ddp->ddp_phys_birth != 0)
2658 2789                          ddt_phys_free(ddt, ddk, ddp, zio->io_txg);
2659 2790                  ddt_phys_fill(ddp, bp);
2660 2791          }
2661 2792  
2662      -        ddt_exit(ddt);
     2793 +        dde_exit(dde);
2663 2794  }
2664 2795  
2665 2796  static int
2666 2797  zio_ddt_write(zio_t *zio)
2667 2798  {
2668 2799          spa_t *spa = zio->io_spa;
2669 2800          blkptr_t *bp = zio->io_bp;
2670 2801          uint64_t txg = zio->io_txg;
2671 2802          zio_prop_t *zp = &zio->io_prop;
2672 2803          int p = zp->zp_copies;
2673 2804          int ditto_copies;
2674 2805          zio_t *cio = NULL;
  
2675 2806          zio_t *dio = NULL;
2676 2807          ddt_t *ddt = ddt_select(spa, bp);
2677 2808          ddt_entry_t *dde;
2678 2809          ddt_phys_t *ddp;
2679 2810  
2680 2811          ASSERT(BP_GET_DEDUP(bp));
2681 2812          ASSERT(BP_GET_CHECKSUM(bp) == zp->zp_checksum);
2682 2813          ASSERT(BP_IS_HOLE(bp) || zio->io_bp_override);
2683 2814          ASSERT(!(zio->io_bp_override && (zio->io_flags & ZIO_FLAG_RAW)));
2684 2815  
2685      -        ddt_enter(ddt);
2686 2816          dde = ddt_lookup(ddt, bp, B_TRUE);
2687      -        ddp = &dde->dde_phys[p];
2688 2817  
     2818 +        /*
      2819 +         * If we're not using the special tier, then for each new DDE not yet
      2820 +         * on disk, disable dedup once the "allowed" DDT L2/ARC space is exhausted.
     2821 +         */
     2822 +        if ((dde->dde_state & DDE_NEW) && !spa->spa_usesc &&
     2823 +            (zfs_ddt_limit_type != DDT_NO_LIMIT || zfs_ddt_byte_ceiling != 0)) {
     2824 +                /* turn off dedup if we need to stop DDT growth */
     2825 +                if (spa_enable_dedup_cap(spa)) {
     2826 +                        dde->dde_state |= DDE_DONT_SYNC;
     2827 +
     2828 +                        /* disable dedup and use the ordinary write pipeline */
     2829 +                        zio_pop_transforms(zio);
     2830 +                        zp->zp_dedup = zp->zp_dedup_verify = B_FALSE;
     2831 +                        zio->io_stage = ZIO_STAGE_OPEN;
     2832 +                        zio->io_pipeline = ZIO_WRITE_PIPELINE;
     2833 +                        zio->io_bp_override = NULL;
     2834 +                        BP_ZERO(bp);
     2835 +                        dde_exit(dde);
     2836 +
     2837 +                        return (ZIO_PIPELINE_CONTINUE);
     2838 +                }
     2839 +        }
     2840 +        ASSERT(!(dde->dde_state & DDE_DONT_SYNC));
     2841 +
2689 2842          if (zp->zp_dedup_verify && zio_ddt_collision(zio, ddt, dde)) {
2690 2843                  /*
2691 2844                   * If we're using a weak checksum, upgrade to a strong checksum
2692 2845                   * and try again.  If we're already using a strong checksum,
2693 2846                   * we can't resolve it, so just convert to an ordinary write.
2694 2847                   * (And automatically e-mail a paper to Nature?)
2695 2848                   */
2696 2849                  if (!(zio_checksum_table[zp->zp_checksum].ci_flags &
2697 2850                      ZCHECKSUM_FLAG_DEDUP)) {
2698 2851                          zp->zp_checksum = spa_dedup_checksum(spa);
2699 2852                          zio_pop_transforms(zio);
2700 2853                          zio->io_stage = ZIO_STAGE_OPEN;
2701 2854                          BP_ZERO(bp);
2702 2855                  } else {
2703 2856                          zp->zp_dedup = B_FALSE;
2704 2857                          BP_SET_DEDUP(bp, B_FALSE);
2705 2858                  }
2706 2859                  ASSERT(!BP_GET_DEDUP(bp));
2707 2860                  zio->io_pipeline = ZIO_WRITE_PIPELINE;
2708      -                ddt_exit(ddt);
     2861 +                dde_exit(dde);
2709 2862                  return (ZIO_PIPELINE_CONTINUE);
2710 2863          }
2711 2864  
     2865 +        ddp = &dde->dde_phys[p];
2712 2866          ditto_copies = ddt_ditto_copies_needed(ddt, dde, ddp);
2713 2867          ASSERT(ditto_copies < SPA_DVAS_PER_BP);
2714 2868  
2715 2869          if (ditto_copies > ddt_ditto_copies_present(dde) &&
2716 2870              dde->dde_lead_zio[DDT_PHYS_DITTO] == NULL) {
2717 2871                  zio_prop_t czp = *zp;
2718 2872  
2719 2873                  czp.zp_copies = ditto_copies;
2720 2874  
2721 2875                  /*
2722 2876                   * If we arrived here with an override bp, we won't have run
2723 2877                   * the transform stack, so we won't have the data we need to
  
2724 2878                   * generate a child i/o.  So, toss the override bp and restart.
2725 2879                   * This is safe, because using the override bp is just an
2726 2880                   * optimization; and it's rare, so the cost doesn't matter.
2727 2881                   */
2728 2882                  if (zio->io_bp_override) {
2729 2883                          zio_pop_transforms(zio);
2730 2884                          zio->io_stage = ZIO_STAGE_OPEN;
2731 2885                          zio->io_pipeline = ZIO_WRITE_PIPELINE;
2732 2886                          zio->io_bp_override = NULL;
2733 2887                          BP_ZERO(bp);
2734      -                        ddt_exit(ddt);
     2888 +                        dde_exit(dde);
2735 2889                          return (ZIO_PIPELINE_CONTINUE);
2736 2890                  }
2737 2891  
2738 2892                  dio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2739 2893                      zio->io_orig_size, zio->io_orig_size, &czp, NULL, NULL,
2740 2894                      NULL, zio_ddt_ditto_write_done, dde, zio->io_priority,
2741      -                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
     2895 +                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2742 2896  
2743 2897                  zio_push_transform(dio, zio->io_abd, zio->io_size, 0, NULL);
2744 2898                  dde->dde_lead_zio[DDT_PHYS_DITTO] = dio;
2745 2899          }
2746 2900  
2747 2901          if (ddp->ddp_phys_birth != 0 || dde->dde_lead_zio[p] != NULL) {
2748 2902                  if (ddp->ddp_phys_birth != 0)
2749 2903                          ddt_bp_fill(ddp, bp, txg);
2750 2904                  if (dde->dde_lead_zio[p] != NULL)
2751 2905                          zio_add_child(zio, dde->dde_lead_zio[p]);
2752 2906                  else
2753 2907                          ddt_phys_addref(ddp);
  
2754 2908          } else if (zio->io_bp_override) {
2755 2909                  ASSERT(bp->blk_birth == txg);
2756 2910                  ASSERT(BP_EQUAL(bp, zio->io_bp_override));
2757 2911                  ddt_phys_fill(ddp, bp);
2758 2912                  ddt_phys_addref(ddp);
2759 2913          } else {
2760 2914                  cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2761 2915                      zio->io_orig_size, zio->io_orig_size, zp,
2762 2916                      zio_ddt_child_write_ready, NULL, NULL,
2763 2917                      zio_ddt_child_write_done, dde, zio->io_priority,
2764      -                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
     2918 +                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2765 2919  
2766 2920                  zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);
2767 2921                  dde->dde_lead_zio[p] = cio;
2768 2922          }
2769 2923  
2770      -        ddt_exit(ddt);
     2924 +        dde_exit(dde);
2771 2925  
2772 2926          if (cio)
2773 2927                  zio_nowait(cio);
2774 2928          if (dio)
2775 2929                  zio_nowait(dio);
2776 2930  
2777 2931          return (ZIO_PIPELINE_CONTINUE);
2778 2932  }
2779 2933  
2780 2934  ddt_entry_t *freedde; /* for debugging */
2781 2935  
2782 2936  static int
2783 2937  zio_ddt_free(zio_t *zio)
  
2784 2938  {
2785 2939          spa_t *spa = zio->io_spa;
2786 2940          blkptr_t *bp = zio->io_bp;
2787 2941          ddt_t *ddt = ddt_select(spa, bp);
2788 2942          ddt_entry_t *dde;
2789 2943          ddt_phys_t *ddp;
2790 2944  
2791 2945          ASSERT(BP_GET_DEDUP(bp));
2792 2946          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2793 2947  
2794      -        ddt_enter(ddt);
2795 2948          freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
2796 2949          ddp = ddt_phys_select(dde, bp);
2797      -        ddt_phys_decref(ddp);
2798      -        ddt_exit(ddt);
     2950 +        if (ddp)
     2951 +                ddt_phys_decref(ddp);
     2952 +        dde_exit(dde);
2799 2953  
2800 2954          return (ZIO_PIPELINE_CONTINUE);
2801 2955  }
2802 2956  
2803 2957  /*
2804 2958   * ==========================================================================
2805 2959   * Allocate and free blocks
2806 2960   * ==========================================================================
2807 2961   */
2808 2962  
2809 2963  static zio_t *
2810      -zio_io_to_allocate(spa_t *spa)
     2964 +zio_io_to_allocate(metaslab_class_t *mc)
2811 2965  {
2812 2966          zio_t *zio;
2813 2967  
2814      -        ASSERT(MUTEX_HELD(&spa->spa_alloc_lock));
     2968 +        ASSERT(MUTEX_HELD(&mc->mc_alloc_lock));
2815 2969  
2816      -        zio = avl_first(&spa->spa_alloc_tree);
     2970 +        zio = avl_first(&mc->mc_alloc_tree);
2817 2971          if (zio == NULL)
2818 2972                  return (NULL);
2819 2973  
2820 2974          ASSERT(IO_IS_ALLOCATING(zio));
2821 2975  
2822 2976          /*
2823 2977           * Try to place a reservation for this zio. If we're unable to
2824 2978           * reserve then we throttle.
2825 2979           */
2826      -        if (!metaslab_class_throttle_reserve(spa_normal_class(spa),
     2980 +        if (!metaslab_class_throttle_reserve(mc,
2827 2981              zio->io_prop.zp_copies, zio, 0)) {
2828 2982                  return (NULL);
2829 2983          }
2830 2984  
2831      -        avl_remove(&spa->spa_alloc_tree, zio);
     2985 +        avl_remove(&mc->mc_alloc_tree, zio);
2832 2986          ASSERT3U(zio->io_stage, <, ZIO_STAGE_DVA_ALLOCATE);
2833 2987  
2834 2988          return (zio);
2835 2989  }
2836 2990  
2837 2991  static int
2838 2992  zio_dva_throttle(zio_t *zio)
2839 2993  {
2840 2994          spa_t *spa = zio->io_spa;
2841 2995          zio_t *nio;
2842 2996  
      2997 +        /* Use the parent's metaslab class; if it isn't set yet, select one now. */
     2998 +        if (zio->io_mc == NULL) {
     2999 +                zio->io_mc = spa_select_class(spa, zio);
     3000 +                if (zio->io_prop.zp_usewbc)
     3001 +                        return (ZIO_PIPELINE_CONTINUE);
     3002 +        }
     3003 +
2843 3004          if (zio->io_priority == ZIO_PRIORITY_SYNC_WRITE ||
2844      -            !spa_normal_class(zio->io_spa)->mc_alloc_throttle_enabled ||
     3005 +            !zio->io_mc->mc_alloc_throttle_enabled ||
2845 3006              zio->io_child_type == ZIO_CHILD_GANG ||
2846 3007              zio->io_flags & ZIO_FLAG_NODATA) {
2847 3008                  return (ZIO_PIPELINE_CONTINUE);
2848 3009          }
2849 3010  
2850 3011          ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2851 3012  
2852 3013          ASSERT3U(zio->io_queued_timestamp, >, 0);
2853 3014          ASSERT(zio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2854 3015  
2855      -        mutex_enter(&spa->spa_alloc_lock);
     3016 +        mutex_enter(&zio->io_mc->mc_alloc_lock);
2856 3017  
2857 3018          ASSERT(zio->io_type == ZIO_TYPE_WRITE);
2858      -        avl_add(&spa->spa_alloc_tree, zio);
     3019 +        avl_add(&zio->io_mc->mc_alloc_tree, zio);
2859 3020  
2860      -        nio = zio_io_to_allocate(zio->io_spa);
2861      -        mutex_exit(&spa->spa_alloc_lock);
     3021 +        nio = zio_io_to_allocate(zio->io_mc);
     3022 +        mutex_exit(&zio->io_mc->mc_alloc_lock);
2862 3023  
2863 3024          if (nio == zio)
2864 3025                  return (ZIO_PIPELINE_CONTINUE);
2865 3026  
2866 3027          if (nio != NULL) {
2867 3028                  ASSERT(nio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2868 3029                  /*
2869 3030                   * We are passing control to a new zio so make sure that
2870 3031                   * it is processed by a different thread. We do this to
2871 3032                   * avoid stack overflows that can occur when parents are
2872 3033                   * throttled and children are making progress. We allow
2873 3034                   * it to go to the head of the taskq since it's already
2874 3035                   * been waiting.
2875 3036                   */
2876 3037                  zio_taskq_dispatch(nio, ZIO_TASKQ_ISSUE, B_TRUE);
2877 3038          }
2878 3039          return (ZIO_PIPELINE_STOP);
2879 3040  }
2880 3041  
2881 3042  void
2882      -zio_allocate_dispatch(spa_t *spa)
     3043 +zio_allocate_dispatch(metaslab_class_t *mc)
2883 3044  {
2884 3045          zio_t *zio;
2885 3046  
2886      -        mutex_enter(&spa->spa_alloc_lock);
2887      -        zio = zio_io_to_allocate(spa);
2888      -        mutex_exit(&spa->spa_alloc_lock);
     3047 +        mutex_enter(&mc->mc_alloc_lock);
     3048 +        zio = zio_io_to_allocate(mc);
     3049 +        mutex_exit(&mc->mc_alloc_lock);
2889 3050          if (zio == NULL)
2890 3051                  return;
2891 3052  
2892 3053          ASSERT3U(zio->io_stage, ==, ZIO_STAGE_DVA_THROTTLE);
2893 3054          ASSERT0(zio->io_error);
2894 3055          zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_TRUE);
2895 3056  }
2896 3057  
2897 3058  static int
2898 3059  zio_dva_allocate(zio_t *zio)
2899 3060  {
2900 3061          spa_t *spa = zio->io_spa;
2901      -        metaslab_class_t *mc = spa_normal_class(spa);
     3062 +        metaslab_class_t *mc = zio->io_mc;
     3063 +
2902 3064          blkptr_t *bp = zio->io_bp;
2903 3065          int error;
2904 3066          int flags = 0;
2905 3067  
2906 3068          if (zio->io_gang_leader == NULL) {
2907 3069                  ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2908 3070                  zio->io_gang_leader = zio;
2909 3071          }
2910 3072  
2911 3073          ASSERT(BP_IS_HOLE(bp));
2912 3074          ASSERT0(BP_GET_NDVAS(bp));
2913 3075          ASSERT3U(zio->io_prop.zp_copies, >, 0);
2914 3076          ASSERT3U(zio->io_prop.zp_copies, <=, spa_max_replication(spa));
2915 3077          ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
2916 3078  
2917      -        if (zio->io_flags & ZIO_FLAG_NODATA) {
     3079 +        if (zio->io_flags & ZIO_FLAG_NODATA || zio->io_prop.zp_usewbc) {
2918 3080                  flags |= METASLAB_DONT_THROTTLE;
2919 3081          }
2920 3082          if (zio->io_flags & ZIO_FLAG_GANG_CHILD) {
2921 3083                  flags |= METASLAB_GANG_CHILD;
2922 3084          }
2923      -        if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE) {
     3085 +        if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE &&
     3086 +            zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2924 3087                  flags |= METASLAB_ASYNC_ALLOC;
2925 3088          }
2926 3089  
2927 3090          error = metaslab_alloc(spa, mc, zio->io_size, bp,
2928 3091              zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
2929 3092              &zio->io_alloc_list, zio);
2930 3093  
     3094 +#ifdef _KERNEL
     3095 +        DTRACE_PROBE6(zio_dva_allocate,
     3096 +            uint64_t, DVA_GET_VDEV(&bp->blk_dva[0]),
     3097 +            uint64_t, DVA_GET_VDEV(&bp->blk_dva[1]),
     3098 +            uint64_t, BP_GET_LEVEL(bp),
     3099 +            boolean_t, BP_IS_SPECIAL(bp),
     3100 +            boolean_t, BP_IS_METADATA(bp),
     3101 +            int, error);
     3102 +#endif
     3103 +
2931 3104          if (error != 0) {
2932 3105                  spa_dbgmsg(spa, "%s: metaslab allocation failure: zio %p, "
2933 3106                      "size %llu, error %d", spa_name(spa), zio, zio->io_size,
2934 3107                      error);
2935      -                if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
     3108 +                if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE) {
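                        /*
                         * If this was a WBC placement (zp_usewbc), drop back
                         * to the normal class before retrying as a gang write.
                         */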
     3109 +                        if (zio->io_prop.zp_usewbc) {
     3110 +                                zio->io_prop.zp_usewbc = B_FALSE;
     3111 +                                zio->io_prop.zp_usesc = B_FALSE;
     3112 +                                zio->io_mc = spa_normal_class(spa);
     3113 +                        }
     3114 +
2936 3115                          return (zio_write_gang_block(zio));
     3116 +                }
     3117 +
2937 3118                  zio->io_error = error;
2938 3119          }
2939 3120  
2940 3121          return (ZIO_PIPELINE_CONTINUE);
2941 3122  }
2942 3123  
2943 3124  static int
2944 3125  zio_dva_free(zio_t *zio)
2945 3126  {
2946 3127          metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
2947 3128  
2948 3129          return (ZIO_PIPELINE_CONTINUE);
2949 3130  }
2950 3131  
2951 3132  static int
2952 3133  zio_dva_claim(zio_t *zio)
2953 3134  {
2954 3135          int error;
2955 3136  
2956 3137          error = metaslab_claim(zio->io_spa, zio->io_bp, zio->io_txg);
2957 3138          if (error)
2958 3139                  zio->io_error = error;
2959 3140  
2960 3141          return (ZIO_PIPELINE_CONTINUE);
2961 3142  }
2962 3143  
2963 3144  /*
2964 3145   * Undo an allocation.  This is used by zio_done() when an I/O fails
2965 3146   * and we want to give back the block we just allocated.
2966 3147   * This handles both normal blocks and gang blocks.
2967 3148   */
2968 3149  static void
2969 3150  zio_dva_unallocate(zio_t *zio, zio_gang_node_t *gn, blkptr_t *bp)
2970 3151  {
2971 3152          ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
2972 3153          ASSERT(zio->io_bp_override == NULL);
2973 3154  
2974 3155          if (!BP_IS_HOLE(bp))
2975 3156                  metaslab_free(zio->io_spa, bp, bp->blk_birth, B_TRUE);
2976 3157  
2977 3158          if (gn != NULL) {
2978 3159                  for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
2979 3160                          zio_dva_unallocate(zio, gn->gn_child[g],
2980 3161                              &gn->gn_gbh->zg_blkptr[g]);
2981 3162                  }
2982 3163          }
2983 3164  }
  
2984 3165  
2985 3166  /*
2986 3167   * Try to allocate an intent log block.  Return 0 on success, errno on failure.
2987 3168   */
2988 3169  int
2989 3170  zio_alloc_zil(spa_t *spa, uint64_t txg, blkptr_t *new_bp, blkptr_t *old_bp,
2990 3171      uint64_t size, boolean_t *slog)
2991 3172  {
2992 3173          int error = 1;
2993 3174          zio_alloc_list_t io_alloc_list;
     3175 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
2994 3176  
2995 3177          ASSERT(txg > spa_syncing_txg(spa));
2996 3178  
2997 3179          metaslab_trace_init(&io_alloc_list);
2998      -        error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
2999      -            txg, old_bp, METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
3000      -        if (error == 0) {
3001      -                *slog = TRUE;
3002      -        } else {
     3180 +
     3181 +        /*
      3182 +         * ZIL blocks are always contiguous (i.e. not gang blocks), so we
      3183 +         * pass the METASLAB_HINTBP_AVOID flag to keep the allocator from
      3184 +         * "fast ganging" them.
      3185 +         * If the caller indicates that the slog is not to be used
      3186 +         * (via use_slog),
      3187 +         * then no separate allocation class is used at all,
      3188 +         * regardless of whether it would be the log or the special class.
     3189 +         */
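        /*
         * The attempts below are ordered: dedicated slog devices first,
         * then the special class when the policy permits it, and finally
         * the normal class.
         */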
     3190 +
     3191 +        if (spa_has_slogs(spa)) {
     3192 +                error = metaslab_alloc(spa, spa_log_class(spa),
     3193 +                    size, new_bp, 1, txg, old_bp,
     3194 +                    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
     3195 +
     3196 +                DTRACE_PROBE2(zio_alloc_zil_log,
     3197 +                    spa_t *, spa, int, error);
     3198 +
     3199 +                if (error == 0)
     3200 +                        *slog = TRUE;
     3201 +        }
     3202 +
     3203 +        /*
      3204 +         * Fall back to the special class when allocation from the regular
      3205 +         * slog fails, but only if that is allowed and the special class's
      3206 +         * used space is still below the watermarks.
     3207 +         */
     3208 +        if (error != 0 && spa_can_special_be_used(spa) &&
     3209 +            mp->spa_sync_to_special != SYNC_TO_SPECIAL_DISABLED) {
     3210 +                error = metaslab_alloc(spa, spa_special_class(spa),
     3211 +                    size, new_bp, 1, txg, old_bp,
     3212 +                    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
     3213 +
     3214 +                DTRACE_PROBE2(zio_alloc_zil_special,
     3215 +                    spa_t *, spa, int, error);
     3216 +
     3217 +                if (error == 0)
     3218 +                        *slog = FALSE;
     3219 +        }
     3220 +
     3221 +        if (error != 0) {
3003 3222                  error = metaslab_alloc(spa, spa_normal_class(spa), size,
3004 3223                      new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID,
3005 3224                      &io_alloc_list, NULL);
     3225 +
     3226 +                DTRACE_PROBE2(zio_alloc_zil_normal,
     3227 +                    spa_t *, spa, int, error);
     3228 +
3006 3229                  if (error == 0)
3007 3230                          *slog = FALSE;
3008 3231          }
     3232 +
3009 3233          metaslab_trace_fini(&io_alloc_list);
3010 3234  
3011 3235          if (error == 0) {
3012 3236                  BP_SET_LSIZE(new_bp, size);
3013 3237                  BP_SET_PSIZE(new_bp, size);
3014 3238                  BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
3015 3239                  BP_SET_CHECKSUM(new_bp,
3016 3240                      spa_version(spa) >= SPA_VERSION_SLIM_ZIL
3017 3241                      ? ZIO_CHECKSUM_ZILOG2 : ZIO_CHECKSUM_ZILOG);
3018 3242                  BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
3019 3243                  BP_SET_LEVEL(new_bp, 0);
3020 3244                  BP_SET_DEDUP(new_bp, 0);
3021 3245                  BP_SET_BYTEORDER(new_bp, ZFS_HOST_BYTEORDER);
3022 3246          } else {
3023 3247                  zfs_dbgmsg("%s: zil block allocation failure: "
3024 3248                      "size %llu, error %d", spa_name(spa), size, error);
3025 3249          }
3026 3250  
3027 3251          return (error);
3028 3252  }
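
The fallback order introduced above (slog first, then the special class, then the normal class) is spread across three guarded blocks in the diff. The following is a minimal user-space sketch of just that control flow; the try_alloc_*() helpers are hypothetical stand-ins for metaslab_alloc() against the corresponding metaslab classes, and the guards from the real code (spa_has_slogs() and the sync-to-special policy) are omitted, so only the ordering mirrors zio_alloc_zil().

    /*
     * Sketch of the ZIL allocation fallback: slog -> special -> normal.
     * try_alloc_*() are hypothetical stand-ins, not real ZFS functions.
     */
    #include <stdio.h>
    #include <errno.h>

    static int try_alloc_slog(void)    { return (ENOSPC); } /* pretend the slog is full */
    static int try_alloc_special(void) { return (0); }
    static int try_alloc_normal(void)  { return (0); }

    int
    main(void)
    {
            int error;
            int slog = 0;

            error = try_alloc_slog();
            if (error == 0)
                    slog = 1;

            /* Fall back to the special class only if the slog allocation failed. */
            if (error != 0) {
                    error = try_alloc_special();
                    if (error == 0)
                            slog = 0;
            }

            /* Last resort: the normal class. */
            if (error != 0) {
                    error = try_alloc_normal();
                    if (error == 0)
                            slog = 0;
            }

            printf("error=%d slog=%d\n", error, slog);
            return (error);
    }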
3029 3253  
3030 3254  /*
3031 3255   * Free an intent log block.
3032 3256   */
3033 3257  void
3034 3258  zio_free_zil(spa_t *spa, uint64_t txg, blkptr_t *bp)
3035 3259  {
3036 3260          ASSERT(BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG);
3037 3261          ASSERT(!BP_IS_GANG(bp));
3038 3262  
3039 3263          zio_free(spa, txg, bp);
3040 3264  }
3041 3265  
3042 3266  /*
3043 3267   * ==========================================================================
3044 3268   * Read and write to physical devices
3045 3269   * ==========================================================================
3046 3270   */
3047 3271  
3048 3272  
3049 3273  /*
3050 3274   * Issue an I/O to the underlying vdev. Typically the issue pipeline
3051 3275   * stops after this stage and will resume upon I/O completion.
3052 3276   * However, there are instances where the vdev layer may need to
3053 3277   * continue the pipeline when an I/O was not issued. Since the I/O
3054 3278   * that was sent to the vdev layer might be different than the one
  
3055 3279   * currently active in the pipeline (see vdev_queue_io()), we explicitly
3056 3280   * force the underlying vdev layers to call either zio_execute() or
3057 3281   * zio_interrupt() to ensure that the pipeline continues with the correct I/O.
3058 3282   */
3059 3283  static int
3060 3284  zio_vdev_io_start(zio_t *zio)
3061 3285  {
3062 3286          vdev_t *vd = zio->io_vd;
3063 3287          uint64_t align;
3064 3288          spa_t *spa = zio->io_spa;
     3289 +        zio_type_t type = zio->io_type;
     3290 +        zio->io_vd_timestamp = gethrtime();
3065 3291  
3066 3292          ASSERT(zio->io_error == 0);
3067 3293          ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
3068 3294  
3069 3295          if (vd == NULL) {
3070 3296                  if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3071 3297                          spa_config_enter(spa, SCL_ZIO, zio, RW_READER);
3072 3298  
3073 3299                  /*
3074 3300                   * The mirror_ops handle multiple DVAs in a single BP.
3075 3301                   */
3076 3302                  vdev_mirror_ops.vdev_op_io_start(zio);
3077 3303                  return (ZIO_PIPELINE_STOP);
3078 3304          }
3079 3305  
3080 3306          ASSERT3P(zio->io_logical, !=, zio);
3081      -        if (zio->io_type == ZIO_TYPE_WRITE) {
3082      -                ASSERT(spa->spa_trust_config);
3083 3307  
3084      -                if (zio->io_vd->vdev_removing) {
3085      -                        ASSERT(zio->io_flags &
3086      -                            (ZIO_FLAG_PHYSICAL | ZIO_FLAG_SELF_HEAL |
3087      -                            ZIO_FLAG_INDUCE_DAMAGE));
3088      -                }
3089      -        }
3090      -
3091      -        /*
3092      -         * We keep track of time-sensitive I/Os so that the scan thread
3093      -         * can quickly react to certain workloads.  In particular, we care
3094      -         * about non-scrubbing, top-level reads and writes with the following
3095      -         * characteristics:
3096      -         *      - synchronous writes of user data to non-slog devices
3097      -         *      - any reads of user data
3098      -         * When these conditions are met, adjust the timestamp of spa_last_io
3099      -         * which allows the scan thread to adjust its workload accordingly.
3100      -         */
3101      -        if (!(zio->io_flags & ZIO_FLAG_SCAN_THREAD) && zio->io_bp != NULL &&
3102      -            vd == vd->vdev_top && !vd->vdev_islog &&
3103      -            zio->io_bookmark.zb_objset != DMU_META_OBJSET &&
3104      -            zio->io_txg != spa_syncing_txg(spa)) {
3105      -                uint64_t old = spa->spa_last_io;
3106      -                uint64_t new = ddi_get_lbolt64();
3107      -                if (old != new)
3108      -                        (void) atomic_cas_64(&spa->spa_last_io, old, new);
3109      -        }
3110      -
3111 3308          align = 1ULL << vd->vdev_top->vdev_ashift;
3112 3309  
3113 3310          if (!(zio->io_flags & ZIO_FLAG_PHYSICAL) &&
3114 3311              P2PHASE(zio->io_size, align) != 0) {
3115 3312                  /* Transform logical writes to be a full physical block size. */
3116 3313                  uint64_t asize = P2ROUNDUP(zio->io_size, align);
3117 3314                  abd_t *abuf = abd_alloc_sametype(zio->io_abd, asize);
3118 3315                  ASSERT(vd == vd->vdev_top);
3119      -                if (zio->io_type == ZIO_TYPE_WRITE) {
     3316 +                if (type == ZIO_TYPE_WRITE) {
3120 3317                          abd_copy(abuf, zio->io_abd, zio->io_size);
3121 3318                          abd_zero_off(abuf, zio->io_size, asize - zio->io_size);
3122 3319                  }
3123 3320                  zio_push_transform(zio, abuf, asize, asize, zio_subblock);
3124 3321          }
3125 3322  
3126 3323          /*
3127 3324           * If this is not a physical io, make sure that it is properly aligned
3128 3325           * before proceeding.
3129 3326           */
3130 3327          if (!(zio->io_flags & ZIO_FLAG_PHYSICAL)) {
3131 3328                  ASSERT0(P2PHASE(zio->io_offset, align));
  
3132 3329                  ASSERT0(P2PHASE(zio->io_size, align));
3133 3330          } else {
3134 3331                  /*
3135 3332                   * For physical writes, we allow 512b aligned writes and assume
3136 3333                   * the device will perform a read-modify-write as necessary.
3137 3334                   */
3138 3335                  ASSERT0(P2PHASE(zio->io_offset, SPA_MINBLOCKSIZE));
3139 3336                  ASSERT0(P2PHASE(zio->io_size, SPA_MINBLOCKSIZE));
3140 3337          }
3141 3338  
3142      -        VERIFY(zio->io_type != ZIO_TYPE_WRITE || spa_writeable(spa));
     3339 +        VERIFY(type != ZIO_TYPE_WRITE || spa_writeable(spa));
3143 3340  
3144 3341          /*
3145 3342           * If this is a repair I/O, and there's no self-healing involved --
3146 3343           * that is, we're just resilvering what we expect to resilver --
3147 3344           * then don't do the I/O unless zio's txg is actually in vd's DTL.
3148 3345           * This prevents spurious resilvering with nested replication.
3149 3346           * For example, given a mirror of mirrors, (A+B)+(C+D), if only
3150 3347           * A is out of date, we'll read from C+D, then use the data to
3151 3348           * resilver A+B -- but we don't actually want to resilver B, just A.
3152 3349           * The top-level mirror has no way to know this, so instead we just
3153 3350           * discard unnecessary repairs as we work our way down the vdev tree.
3154 3351           * The same logic applies to any form of nested replication:
3155 3352           * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.
3156 3353           */
3157 3354          if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
3158 3355              !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
3159 3356              zio->io_txg != 0 && /* not a delegated i/o */
3160 3357              !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
3161      -                ASSERT(zio->io_type == ZIO_TYPE_WRITE);
     3358 +                ASSERT(type == ZIO_TYPE_WRITE);
3162 3359                  zio_vdev_io_bypass(zio);
3163 3360                  return (ZIO_PIPELINE_CONTINUE);
3164 3361          }
3165 3362  
3166 3363          if (vd->vdev_ops->vdev_op_leaf &&
3167      -            (zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
3168      -
3169      -                if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio))
     3364 +            (type == ZIO_TYPE_READ || type == ZIO_TYPE_WRITE)) {
     3365 +                if (type == ZIO_TYPE_READ && vdev_cache_read(zio))
3170 3366                          return (ZIO_PIPELINE_CONTINUE);
3171 3367  
3172 3368                  if ((zio = vdev_queue_io(zio)) == NULL)
3173 3369                          return (ZIO_PIPELINE_STOP);
3174 3370  
3175 3371                  if (!vdev_accessible(vd, zio)) {
3176 3372                          zio->io_error = SET_ERROR(ENXIO);
3177 3373                          zio_interrupt(zio);
3178 3374                          return (ZIO_PIPELINE_STOP);
3179 3375                  }
     3376 +
     3377 +                /*
     3378 +                 * Insert a fault simulation delay for a particular vdev.
     3379 +                 */
     3380 +                if (zio_faulty_vdev_enabled &&
     3381 +                    (zio->io_vd->vdev_guid == zio_faulty_vdev_guid)) {
     3382 +                        delay(NSEC_TO_TICK(zio_faulty_vdev_delay_us *
     3383 +                            (NANOSEC / MICROSEC)));
     3384 +                }
3180 3385          }
3181 3386  
3182 3387          vd->vdev_ops->vdev_op_io_start(zio);
3183 3388          return (ZIO_PIPELINE_STOP);
3184 3389  }
3185 3390  
3186 3391  static int
3187 3392  zio_vdev_io_done(zio_t *zio)
3188 3393  {
3189 3394          vdev_t *vd = zio->io_vd;
3190 3395          vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
3191 3396          boolean_t unexpected_error = B_FALSE;
3192 3397  
3193      -        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
     3398 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3194 3399                  return (ZIO_PIPELINE_STOP);
3195      -        }
3196 3400  
3197 3401          ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
3198 3402  
3199 3403          if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
3200      -
3201 3404                  vdev_queue_io_done(zio);
3202 3405  
3203 3406                  if (zio->io_type == ZIO_TYPE_WRITE)
3204 3407                          vdev_cache_write(zio);
3205 3408  
3206 3409                  if (zio_injection_enabled && zio->io_error == 0)
3207 3410                          zio->io_error = zio_handle_device_injection(vd,
3208 3411                              zio, EIO);
3209 3412  
3210 3413                  if (zio_injection_enabled && zio->io_error == 0)
3211 3414                          zio->io_error = zio_handle_label_injection(zio, EIO);
3212 3415  
3213 3416                  if (zio->io_error) {
3214 3417                          if (!vdev_accessible(vd, zio)) {
3215 3418                                  zio->io_error = SET_ERROR(ENXIO);
3216 3419                          } else {
  
3217 3420                                  unexpected_error = B_TRUE;
3218 3421                          }
3219 3422                  }
3220 3423          }
3221 3424  
3222 3425          ops->vdev_op_io_done(zio);
3223 3426  
3224 3427          if (unexpected_error)
3225 3428                  VERIFY(vdev_probe(vd, zio) == NULL);
3226 3429  
     3430 +        /*
     3431 +         * Measure delta between start and end of the I/O in nanoseconds.
     3432 +         * XXX: Handle overflow.
     3433 +         */
     3434 +        zio->io_vd_timestamp = gethrtime() - zio->io_vd_timestamp;
     3435 +
3227 3436          return (ZIO_PIPELINE_CONTINUE);
3228 3437  }
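
The new io_vd_timestamp handling stamps the zio with gethrtime() in zio_vdev_io_start() and converts that stamp into an elapsed time here in zio_vdev_io_done(). A standalone sketch of the same start/done delta pattern, using clock_gettime(CLOCK_MONOTONIC) as a user-space stand-in for the kernel's gethrtime(), might look like this:

    /*
     * Record a monotonic timestamp when the I/O is issued and turn it
     * into an elapsed time (in nanoseconds) when it completes.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t
    now_ns(void)
    {
            struct timespec ts;

            (void) clock_gettime(CLOCK_MONOTONIC, &ts);
            return ((uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec);
    }

    int
    main(void)
    {
            uint64_t io_vd_timestamp = now_ns();    /* as in zio_vdev_io_start() */

            /* ... the I/O would be in flight here ... */

            io_vd_timestamp = now_ns() - io_vd_timestamp;   /* as in zio_vdev_io_done() */
            printf("vdev I/O took %llu ns\n", (unsigned long long)io_vd_timestamp);
            return (0);
    }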
3229 3438  
3230 3439  /*
3231 3440   * For non-raidz ZIOs, we can just copy aside the bad data read from the
3232 3441   * disk, and use that to finish the checksum ereport later.
3233 3442   */
3234 3443  static void
3235 3444  zio_vsd_default_cksum_finish(zio_cksum_report_t *zcr,
3236 3445      const void *good_buf)
3237 3446  {
3238 3447          /* no processing needed */
3239 3448          zfs_ereport_finish_checksum(zcr, good_buf, zcr->zcr_cbdata, B_FALSE);
3240 3449  }
3241 3450  
3242 3451  /*ARGSUSED*/
3243 3452  void
3244 3453  zio_vsd_default_cksum_report(zio_t *zio, zio_cksum_report_t *zcr, void *ignored)
3245 3454  {
3246 3455          void *buf = zio_buf_alloc(zio->io_size);
3247 3456  
3248 3457          abd_copy_to_buf(buf, zio->io_abd, zio->io_size);
3249 3458  
3250 3459          zcr->zcr_cbinfo = zio->io_size;
  
3251 3460          zcr->zcr_cbdata = buf;
3252 3461          zcr->zcr_finish = zio_vsd_default_cksum_finish;
3253 3462          zcr->zcr_free = zio_buf_free;
3254 3463  }
3255 3464  
3256 3465  static int
3257 3466  zio_vdev_io_assess(zio_t *zio)
3258 3467  {
3259 3468          vdev_t *vd = zio->io_vd;
3260 3469  
3261      -        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
     3470 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3262 3471                  return (ZIO_PIPELINE_STOP);
3263      -        }
3264 3472  
3265 3473          if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3266 3474                  spa_config_exit(zio->io_spa, SCL_ZIO, zio);
3267 3475  
3268 3476          if (zio->io_vsd != NULL) {
3269 3477                  zio->io_vsd_ops->vsd_free(zio);
3270 3478                  zio->io_vsd = NULL;
3271 3479          }
3272 3480  
3273 3481          if (zio_injection_enabled && zio->io_error == 0)
3274 3482                  zio->io_error = zio_handle_fault_injection(zio, EIO);
3275 3483  
3276 3484          /*
3277 3485           * If the I/O failed, determine whether we should attempt to retry it.
3278 3486           *
3279 3487           * On retry, we cut in line in the issue queue, since we don't want
3280 3488           * compression/checksumming/etc. work to prevent our (cheap) IO reissue.
3281 3489           */
3282 3490          if (zio->io_error && vd == NULL &&
3283 3491              !(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
3284 3492                  ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
3285 3493                  ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS));  /* not a leaf */
3286 3494                  zio->io_error = 0;
3287 3495                  zio->io_flags |= ZIO_FLAG_IO_RETRY |
3288 3496                      ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
3289 3497                  zio->io_stage = ZIO_STAGE_VDEV_IO_START >> 1;
3290 3498                  zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE,
3291 3499                      zio_requeue_io_start_cut_in_line);
3292 3500                  return (ZIO_PIPELINE_STOP);
3293 3501          }
3294 3502  
3295 3503          /*
3296 3504           * If we got an error on a leaf device, convert it to ENXIO
3297 3505           * if the device is not accessible at all.
3298 3506           */
3299 3507          if (zio->io_error && vd != NULL && vd->vdev_ops->vdev_op_leaf &&
3300 3508              !vdev_accessible(vd, zio))
3301 3509                  zio->io_error = SET_ERROR(ENXIO);
3302 3510  
3303 3511          /*
3304 3512           * If we can't write to an interior vdev (mirror or RAID-Z),
3305 3513           * set vdev_cant_write so that we stop trying to allocate from it.
3306 3514           */
3307 3515          if (zio->io_error == ENXIO && zio->io_type == ZIO_TYPE_WRITE &&
3308 3516              vd != NULL && !vd->vdev_ops->vdev_op_leaf) {
3309 3517                  vd->vdev_cant_write = B_TRUE;
3310 3518          }
3311 3519  
3312 3520          /*
3313 3521           * If a cache flush returns ENOTSUP or ENOTTY, we know that no future
3314 3522           * attempts will ever succeed. In this case we set a persistent bit so
3315 3523           * that we don't bother with it in the future.
3316 3524           */
3317 3525          if ((zio->io_error == ENOTSUP || zio->io_error == ENOTTY) &&
3318 3526              zio->io_type == ZIO_TYPE_IOCTL &&
3319 3527              zio->io_cmd == DKIOCFLUSHWRITECACHE && vd != NULL)
3320 3528                  vd->vdev_nowritecache = B_TRUE;
3321 3529  
3322 3530          if (zio->io_error)
3323 3531                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
3324 3532  
3325 3533          if (vd != NULL && vd->vdev_ops->vdev_op_leaf &&
3326 3534              zio->io_physdone != NULL) {
3327 3535                  ASSERT(!(zio->io_flags & ZIO_FLAG_DELEGATED));
3328 3536                  ASSERT(zio->io_child_type == ZIO_CHILD_VDEV);
3329 3537                  zio->io_physdone(zio->io_logical);
3330 3538          }
3331 3539  
3332 3540          return (ZIO_PIPELINE_CONTINUE);
3333 3541  }
3334 3542  
3335 3543  void
3336 3544  zio_vdev_io_reissue(zio_t *zio)
3337 3545  {
3338 3546          ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
3339 3547          ASSERT(zio->io_error == 0);
3340 3548  
3341 3549          zio->io_stage >>= 1;
3342 3550  }
3343 3551  
3344 3552  void
3345 3553  zio_vdev_io_redone(zio_t *zio)
3346 3554  {
3347 3555          ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_DONE);
3348 3556  
3349 3557          zio->io_stage >>= 1;
3350 3558  }
3351 3559  
3352 3560  void
3353 3561  zio_vdev_io_bypass(zio_t *zio)
3354 3562  {
3355 3563          ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
3356 3564          ASSERT(zio->io_error == 0);
3357 3565  
3358 3566          zio->io_flags |= ZIO_FLAG_IO_BYPASS;
3359 3567          zio->io_stage = ZIO_STAGE_VDEV_IO_ASSESS >> 1;
3360 3568  }
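
zio_vdev_io_reissue(), zio_vdev_io_redone(), zio_vdev_io_bypass(), and the retry path in zio_vdev_io_assess() all rely on pipeline stages being one-hot bits: zio_execute() advances to the next stage bit above io_stage that is set in the pipeline, so assigning (TARGET >> 1) causes TARGET to run next. The toy below demonstrates that mechanism with illustrative stage values, not the real ZIO_STAGE_* constants.

    /* One-hot stage rewind trick, modeled in user space. */
    #include <stdio.h>
    #include <stdint.h>

    #define STAGE_OPEN              (1U << 0)
    #define STAGE_VDEV_IO_START     (1U << 1)
    #define STAGE_VDEV_IO_DONE      (1U << 2)
    #define STAGE_VDEV_IO_ASSESS    (1U << 3)
    #define STAGE_DONE              (1U << 4)

    static uint32_t
    next_stage(uint32_t cur, uint32_t pipeline)
    {
            uint32_t stage = cur << 1;

            /* Advance to the next stage bit that is part of this pipeline. */
            while ((stage & pipeline) == 0)
                    stage <<= 1;
            return (stage);
    }

    int
    main(void)
    {
            uint32_t pipeline = STAGE_VDEV_IO_START | STAGE_VDEV_IO_DONE |
                STAGE_VDEV_IO_ASSESS | STAGE_DONE;
            uint32_t stage;

            /* Retry: rewind so that VDEV_IO_START executes again. */
            stage = STAGE_VDEV_IO_START >> 1;
            printf("next stage: 0x%x\n", next_stage(stage, pipeline));

            /* Bypass: skip straight to VDEV_IO_ASSESS. */
            stage = STAGE_VDEV_IO_ASSESS >> 1;
            printf("next stage: 0x%x\n", next_stage(stage, pipeline));
            return (0);
    }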
3361 3569  
3362 3570  /*
3363 3571   * ==========================================================================
3364 3572   * Generate and verify checksums
3365 3573   * ==========================================================================
3366 3574   */
3367 3575  static int
3368 3576  zio_checksum_generate(zio_t *zio)
3369 3577  {
3370 3578          blkptr_t *bp = zio->io_bp;
3371 3579          enum zio_checksum checksum;
3372 3580  
3373 3581          if (bp == NULL) {
3374 3582                  /*
3375 3583                   * This is zio_write_phys().
3376 3584                   * We're either generating a label checksum, or none at all.
3377 3585                   */
3378 3586                  checksum = zio->io_prop.zp_checksum;
3379 3587  
3380 3588                  if (checksum == ZIO_CHECKSUM_OFF)
3381 3589                          return (ZIO_PIPELINE_CONTINUE);
3382 3590  
3383 3591                  ASSERT(checksum == ZIO_CHECKSUM_LABEL);
3384 3592          } else {
3385 3593                  if (BP_IS_GANG(bp) && zio->io_child_type == ZIO_CHILD_GANG) {
3386 3594                          ASSERT(!IO_IS_ALLOCATING(zio));
3387 3595                          checksum = ZIO_CHECKSUM_GANG_HEADER;
3388 3596                  } else {
3389 3597                          checksum = BP_GET_CHECKSUM(bp);
3390 3598                  }
3391 3599          }
3392 3600  
3393 3601          zio_checksum_compute(zio, checksum, zio->io_abd, zio->io_size);
3394 3602  
3395 3603          return (ZIO_PIPELINE_CONTINUE);
3396 3604  }
3397 3605  
3398 3606  static int
3399 3607  zio_checksum_verify(zio_t *zio)
3400 3608  {
3401 3609          zio_bad_cksum_t info;
3402 3610          blkptr_t *bp = zio->io_bp;
3403 3611          int error;
3404 3612  
3405 3613          ASSERT(zio->io_vd != NULL);
3406 3614  
3407 3615          if (bp == NULL) {
3408 3616                  /*
3409 3617                   * This is zio_read_phys().
3410 3618                   * We're either verifying a label checksum, or nothing at all.
3411 3619                   */
3412 3620                  if (zio->io_prop.zp_checksum == ZIO_CHECKSUM_OFF)
3413 3621                          return (ZIO_PIPELINE_CONTINUE);
3414 3622  
3415 3623                  ASSERT(zio->io_prop.zp_checksum == ZIO_CHECKSUM_LABEL);
3416 3624          }
3417 3625  
3418 3626          if ((error = zio_checksum_error(zio, &info)) != 0) {
3419 3627                  zio->io_error = error;
3420 3628                  if (error == ECKSUM &&
3421 3629                      !(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
3422 3630                          zfs_ereport_start_checksum(zio->io_spa,
3423 3631                              zio->io_vd, zio, zio->io_offset,
3424 3632                              zio->io_size, NULL, &info);
3425 3633                  }
3426 3634          }
3427 3635  
3428 3636          return (ZIO_PIPELINE_CONTINUE);
3429 3637  }
3430 3638  
3431 3639  /*
3432 3640   * Called by RAID-Z to ensure we don't compute the checksum twice.
3433 3641   */
3434 3642  void
3435 3643  zio_checksum_verified(zio_t *zio)
3436 3644  {
3437 3645          zio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
3438 3646  }
3439 3647  
3440 3648  /*
3441 3649   * ==========================================================================
3442 3650   * Error rank.  Errors are ranked in the order 0, ENXIO, ECKSUM, EIO, other.
3443 3651   * An error of 0 indicates success.  ENXIO indicates whole-device failure,
3444 3652   * which may be transient (e.g. unplugged) or permanent.  ECKSUM and EIO
3445 3653   * indicate errors that are specific to one I/O, and most likely permanent.
3446 3654   * Any other error is presumed to be worse because we weren't expecting it.
3447 3655   * ==========================================================================
3448 3656   */
3449 3657  int
3450 3658  zio_worst_error(int e1, int e2)
3451 3659  {
3452 3660          static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
3453 3661          int r1, r2;
3454 3662  
3455 3663          for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
3456 3664                  if (e1 == zio_error_rank[r1])
3457 3665                          break;
3458 3666  
3459 3667          for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
3460 3668                  if (e2 == zio_error_rank[r2])
3461 3669                          break;
3462 3670  
3463 3671          return (r1 > r2 ? e1 : e2);
3464 3672  }
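
A quick standalone check of the ranking described in the block comment (0, then ENXIO, then ECKSUM, then EIO, with any unlisted errno treated as worst). The function body is copied from above; ECKSUM is defined here the way the ZFS headers define it (as EBADE), which assumes an errno.h that provides EBADE, as on illumos or Linux.

    #include <assert.h>
    #include <errno.h>

    #ifndef ECKSUM
    #define ECKSUM  EBADE   /* mirrors the definition in the ZFS headers */
    #endif

    static int
    zio_worst_error(int e1, int e2)
    {
            static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
            int r1, r2;

            for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
                    if (e1 == zio_error_rank[r1])
                            break;

            for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
                    if (e2 == zio_error_rank[r2])
                            break;

            return (r1 > r2 ? e1 : e2);
    }

    int
    main(void)
    {
            assert(zio_worst_error(0, ENXIO) == ENXIO);
            assert(zio_worst_error(ENXIO, ECKSUM) == ECKSUM);
            assert(zio_worst_error(ECKSUM, EIO) == EIO);
            assert(zio_worst_error(EIO, EINVAL) == EINVAL); /* unknown errno ranks worst */
            return (0);
    }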
3465 3673  
3466 3674  /*
3467 3675   * ==========================================================================
  
3468 3676   * I/O completion
3469 3677   * ==========================================================================
3470 3678   */
3471 3679  static int
3472 3680  zio_ready(zio_t *zio)
3473 3681  {
3474 3682          blkptr_t *bp = zio->io_bp;
3475 3683          zio_t *pio, *pio_next;
3476 3684          zio_link_t *zl = NULL;
3477 3685  
3478      -        if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT | ZIO_CHILD_DDT_BIT,
3479      -            ZIO_WAIT_READY)) {
     3686 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
     3687 +            zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_READY))
3480 3688                  return (ZIO_PIPELINE_STOP);
3481      -        }
3482 3689  
3483 3690          if (zio->io_ready) {
3484 3691                  ASSERT(IO_IS_ALLOCATING(zio));
3485 3692                  ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
3486 3693                      (zio->io_flags & ZIO_FLAG_NOPWRITE));
3487 3694                  ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
3488 3695  
3489 3696                  zio->io_ready(zio);
3490 3697          }
3491 3698  
3492 3699          if (bp != NULL && bp != &zio->io_bp_copy)
3493 3700                  zio->io_bp_copy = *bp;
3494 3701  
  
3495 3702          if (zio->io_error != 0) {
3496 3703                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
3497 3704  
3498 3705                  if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
3499 3706                          ASSERT(IO_IS_ALLOCATING(zio));
3500 3707                          ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
3501 3708                          /*
3502 3709                           * We were unable to allocate anything, unreserve and
3503 3710                           * issue the next I/O to allocate.
3504 3711                           */
3505      -                        metaslab_class_throttle_unreserve(
3506      -                            spa_normal_class(zio->io_spa),
     3712 +                        metaslab_class_throttle_unreserve(zio->io_mc,
3507 3713                              zio->io_prop.zp_copies, zio);
3508      -                        zio_allocate_dispatch(zio->io_spa);
     3714 +                        zio_allocate_dispatch(zio->io_mc);
3509 3715                  }
3510 3716          }
3511 3717  
3512 3718          mutex_enter(&zio->io_lock);
3513 3719          zio->io_state[ZIO_WAIT_READY] = 1;
3514 3720          pio = zio_walk_parents(zio, &zl);
3515 3721          mutex_exit(&zio->io_lock);
3516 3722  
3517 3723          /*
3518 3724           * As we notify zio's parents, new parents could be added.
3519 3725           * New parents go to the head of zio's io_parent_list, however,
3520 3726           * so we will (correctly) not notify them.  The remainder of zio's
3521 3727           * io_parent_list, from 'pio_next' onward, cannot change because
3522 3728           * all parents must wait for us to be done before they can be done.
3523 3729           */
3524 3730          for (; pio != NULL; pio = pio_next) {
3525 3731                  pio_next = zio_walk_parents(zio, &zl);
3526 3732                  zio_notify_parent(pio, zio, ZIO_WAIT_READY);
3527 3733          }
3528 3734  
3529 3735          if (zio->io_flags & ZIO_FLAG_NODATA) {
3530 3736                  if (BP_IS_GANG(bp)) {
3531 3737                          zio->io_flags &= ~ZIO_FLAG_NODATA;
3532 3738                  } else {
3533 3739                          ASSERT((uintptr_t)zio->io_abd < SPA_MAXBLOCKSIZE);
3534 3740                          zio->io_pipeline &= ~ZIO_VDEV_IO_STAGES;
3535 3741                  }
3536 3742          }
3537 3743  
3538 3744          if (zio_injection_enabled &&
3539 3745              zio->io_spa->spa_syncing_txg == zio->io_txg)
3540 3746                  zio_handle_ignored_writes(zio);
3541 3747  
3542 3748          return (ZIO_PIPELINE_CONTINUE);
3543 3749  }
3544 3750  
3545 3751  /*
3546 3752   * Update the allocation throttle accounting.
3547 3753   */
3548 3754  static void
3549 3755  zio_dva_throttle_done(zio_t *zio)
3550 3756  {
3551 3757          zio_t *lio = zio->io_logical;
3552 3758          zio_t *pio = zio_unique_parent(zio);
3553 3759          vdev_t *vd = zio->io_vd;
3554 3760          int flags = METASLAB_ASYNC_ALLOC;
3555 3761  
3556 3762          ASSERT3P(zio->io_bp, !=, NULL);
3557 3763          ASSERT3U(zio->io_type, ==, ZIO_TYPE_WRITE);
3558 3764          ASSERT3U(zio->io_priority, ==, ZIO_PRIORITY_ASYNC_WRITE);
3559 3765          ASSERT3U(zio->io_child_type, ==, ZIO_CHILD_VDEV);
3560 3766          ASSERT(vd != NULL);
3561 3767          ASSERT3P(vd, ==, vd->vdev_top);
3562 3768          ASSERT(!(zio->io_flags & (ZIO_FLAG_IO_REPAIR | ZIO_FLAG_IO_RETRY)));
3563 3769          ASSERT(zio->io_flags & ZIO_FLAG_IO_ALLOCATING);
3564 3770          ASSERT(!(lio->io_flags & ZIO_FLAG_IO_REWRITE));
3565 3771          ASSERT(!(lio->io_orig_flags & ZIO_FLAG_NODATA));
3566 3772  
3567 3773          /*
3568 3774           * Parents of gang children can have two flavors -- ones that
3569 3775           * allocated the gang header (will have ZIO_FLAG_IO_REWRITE set)
3570 3776           * and ones that allocated the constituent blocks. The allocation
3571 3777           * throttle needs to know the allocating parent zio so we must find
3572 3778           * it here.
3573 3779           */
3574 3780          if (pio->io_child_type == ZIO_CHILD_GANG) {
3575 3781                  /*
3576 3782                   * If our parent is a rewrite gang child then our grandparent
3577 3783                   * would have been the one that performed the allocation.
3578 3784                   */
3579 3785                  if (pio->io_flags & ZIO_FLAG_IO_REWRITE)
3580 3786                          pio = zio_unique_parent(pio);
3581 3787                  flags |= METASLAB_GANG_CHILD;
3582 3788          }
3583 3789  
  
3584 3790          ASSERT(IO_IS_ALLOCATING(pio));
3585 3791          ASSERT3P(zio, !=, zio->io_logical);
3586 3792          ASSERT(zio->io_logical != NULL);
3587 3793          ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REPAIR));
3588 3794          ASSERT0(zio->io_flags & ZIO_FLAG_NOPWRITE);
3589 3795  
3590 3796          mutex_enter(&pio->io_lock);
3591 3797          metaslab_group_alloc_decrement(zio->io_spa, vd->vdev_id, pio, flags);
3592 3798          mutex_exit(&pio->io_lock);
3593 3799  
3594      -        metaslab_class_throttle_unreserve(spa_normal_class(zio->io_spa),
3595      -            1, pio);
     3800 +        metaslab_class_throttle_unreserve(pio->io_mc, 1, pio);
3596 3801  
3597 3802          /*
3598 3803           * Call into the pipeline to see if there is more work that
3599 3804           * needs to be done. If there is work to be done it will be
3600 3805           * dispatched to another taskq thread.
3601 3806           */
3602      -        zio_allocate_dispatch(zio->io_spa);
     3807 +        zio_allocate_dispatch(pio->io_mc);
3603 3808  }
3604 3809  
3605 3810  static int
3606 3811  zio_done(zio_t *zio)
3607 3812  {
3608 3813          spa_t *spa = zio->io_spa;
3609 3814          zio_t *lio = zio->io_logical;
3610 3815          blkptr_t *bp = zio->io_bp;
3611 3816          vdev_t *vd = zio->io_vd;
3612 3817          uint64_t psize = zio->io_size;
3613 3818          zio_t *pio, *pio_next;
3614      -        metaslab_class_t *mc = spa_normal_class(spa);
     3819 +        metaslab_class_t *mc = zio->io_mc;
3615 3820          zio_link_t *zl = NULL;
3616 3821  
3617 3822          /*
3618 3823           * If our children haven't all completed,
3619 3824           * wait for them and then repeat this pipeline stage.
3620 3825           */
3621      -        if (zio_wait_for_children(zio, ZIO_CHILD_ALL_BITS, ZIO_WAIT_DONE)) {
     3826 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
     3827 +            zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
     3828 +            zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) ||
     3829 +            zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
3622 3830                  return (ZIO_PIPELINE_STOP);
3623      -        }
3624 3831  
3625 3832          /*
3626 3833           * If the allocation throttle is enabled, then update the accounting.
3627 3834           * We only track child I/Os that are part of an allocating async
3628 3835           * write. We must do this since the allocation is performed
3629 3836           * by the logical I/O but the actual write is done by child I/Os.
3630 3837           */
3631 3838          if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING &&
3632 3839              zio->io_child_type == ZIO_CHILD_VDEV) {
3633 3840                  ASSERT(mc->mc_alloc_throttle_enabled);
3634 3841                  zio_dva_throttle_done(zio);
3635 3842          }
3636 3843  
3637 3844          /*
3638 3845           * If the allocation throttle is enabled, verify that
3639 3846           * we have decremented the refcounts for every I/O that was throttled.
3640 3847           */
3641 3848          if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
3642 3849                  ASSERT(zio->io_type == ZIO_TYPE_WRITE);
3643 3850                  ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
3644 3851                  ASSERT(bp != NULL);
3645 3852                  metaslab_group_alloc_verify(spa, zio->io_bp, zio);
3646 3853                  VERIFY(refcount_not_held(&mc->mc_alloc_slots, zio));
3647 3854          }
3648 3855  
3649 3856          for (int c = 0; c < ZIO_CHILD_TYPES; c++)
3650 3857                  for (int w = 0; w < ZIO_WAIT_TYPES; w++)
3651 3858                          ASSERT(zio->io_children[c][w] == 0);
3652 3859  
3653 3860          if (bp != NULL && !BP_IS_EMBEDDED(bp)) {
3654 3861                  ASSERT(bp->blk_pad[0] == 0);
3655 3862                  ASSERT(bp->blk_pad[1] == 0);
3656 3863                  ASSERT(bcmp(bp, &zio->io_bp_copy, sizeof (blkptr_t)) == 0 ||
3657 3864                      (bp == zio_unique_parent(zio)->io_bp));
3658 3865                  if (zio->io_type == ZIO_TYPE_WRITE && !BP_IS_HOLE(bp) &&
3659 3866                      zio->io_bp_override == NULL &&
3660 3867                      !(zio->io_flags & ZIO_FLAG_IO_REPAIR)) {
3661 3868                          ASSERT(!BP_SHOULD_BYTESWAP(bp));
3662 3869                          ASSERT3U(zio->io_prop.zp_copies, <=, BP_GET_NDVAS(bp));
3663 3870                          ASSERT(BP_COUNT_GANG(bp) == 0 ||
3664 3871                              (BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
3665 3872                  }
3666 3873                  if (zio->io_flags & ZIO_FLAG_NOPWRITE)
3667 3874                          VERIFY(BP_EQUAL(bp, &zio->io_bp_orig));
3668 3875          }
3669 3876  
3670 3877          /*
3671 3878           * If there were child vdev/gang/ddt errors, they apply to us now.
3672 3879           */
3673 3880          zio_inherit_child_errors(zio, ZIO_CHILD_VDEV);
3674 3881          zio_inherit_child_errors(zio, ZIO_CHILD_GANG);
3675 3882          zio_inherit_child_errors(zio, ZIO_CHILD_DDT);
3676 3883  
3677 3884          /*
3678 3885           * If the I/O on the transformed data was successful, generate any
3679 3886           * checksum reports now while we still have the transformed data.
3680 3887           */
3681 3888          if (zio->io_error == 0) {
3682 3889                  while (zio->io_cksum_report != NULL) {
3683 3890                          zio_cksum_report_t *zcr = zio->io_cksum_report;
3684 3891                          uint64_t align = zcr->zcr_align;
3685 3892                          uint64_t asize = P2ROUNDUP(psize, align);
3686 3893                          char *abuf = NULL;
3687 3894                          abd_t *adata = zio->io_abd;
3688 3895  
3689 3896                          if (asize != psize) {
3690 3897                                  adata = abd_alloc_linear(asize, B_TRUE);
3691 3898                                  abd_copy(adata, zio->io_abd, psize);
3692 3899                                  abd_zero_off(adata, psize, asize - psize);
3693 3900                          }
3694 3901  
3695 3902                          if (adata != NULL)
3696 3903                                  abuf = abd_borrow_buf_copy(adata, asize);
3697 3904  
3698 3905                          zio->io_cksum_report = zcr->zcr_next;
3699 3906                          zcr->zcr_next = NULL;
3700 3907                          zcr->zcr_finish(zcr, abuf);
3701 3908                          zfs_ereport_free_checksum(zcr);
3702 3909  
3703 3910                          if (adata != NULL)
3704 3911                                  abd_return_buf(adata, abuf, asize);
3705 3912  
3706 3913                          if (asize != psize)
3707 3914                                  abd_free(adata);
3708 3915                  }
3709 3916          }
3710 3917  
3711 3918          zio_pop_transforms(zio);        /* note: may set zio->io_error */
3712 3919  
3713 3920          vdev_stat_update(zio, psize);
3714 3921  
3715 3922          if (zio->io_error) {
3716 3923                  /*
3717 3924                   * If this I/O is attached to a particular vdev,
3718 3925                   * generate an error message describing the I/O failure
3719 3926                   * at the block level.  We ignore these errors if the
3720 3927                   * device is currently unavailable.
3721 3928                   */
3722 3929                  if (zio->io_error != ECKSUM && vd != NULL && !vdev_is_dead(vd))
3723 3930                          zfs_ereport_post(FM_EREPORT_ZFS_IO, spa, vd, zio, 0, 0);
3724 3931  
3725 3932                  if ((zio->io_error == EIO || !(zio->io_flags &
3726 3933                      (ZIO_FLAG_SPECULATIVE | ZIO_FLAG_DONT_PROPAGATE))) &&
3727 3934                      zio == lio) {
3728 3935                          /*
3729 3936                           * For logical I/O requests, tell the SPA to log the
3730 3937                           * error and generate a logical data ereport.
3731 3938                           */
3732 3939                          spa_log_error(spa, zio);
3733 3940                          zfs_ereport_post(FM_EREPORT_ZFS_DATA, spa, NULL, zio,
3734 3941                              0, 0);
3735 3942                  }
3736 3943          }
3737 3944  
3738 3945          if (zio->io_error && zio == lio) {
3739 3946                  /*
3740 3947                   * Determine whether zio should be reexecuted.  This will
3741 3948                   * propagate all the way to the root via zio_notify_parent().
3742 3949                   */
3743 3950                  ASSERT(vd == NULL && bp != NULL);
3744 3951                  ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
3745 3952  
3746 3953                  if (IO_IS_ALLOCATING(zio) &&
3747 3954                      !(zio->io_flags & ZIO_FLAG_CANFAIL)) {
3748 3955                          if (zio->io_error != ENOSPC)
3749 3956                                  zio->io_reexecute |= ZIO_REEXECUTE_NOW;
3750 3957                          else
3751 3958                                  zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3752 3959                  }
3753 3960  
3754 3961                  if ((zio->io_type == ZIO_TYPE_READ ||
3755 3962                      zio->io_type == ZIO_TYPE_FREE) &&
3756 3963                      !(zio->io_flags & ZIO_FLAG_SCAN_THREAD) &&
3757 3964                      zio->io_error == ENXIO &&
3758 3965                      spa_load_state(spa) == SPA_LOAD_NONE &&
3759 3966                      spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)
3760 3967                          zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3761 3968  
3762 3969                  if (!(zio->io_flags & ZIO_FLAG_CANFAIL) && !zio->io_reexecute)
3763 3970                          zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
3764 3971  
3765 3972                  /*
3766 3973                   * Here is a possibly good place to attempt to do
3767 3974                   * either combinatorial reconstruction or error correction
3768 3975                   * based on checksums.  It also might be a good place
3769 3976                   * to send out preliminary ereports before we suspend
3770 3977                   * processing.
3771 3978                   */
3772 3979          }
3773 3980  
3774 3981          /*
3775 3982           * If there were logical child errors, they apply to us now.
3776 3983           * We defer this until now to avoid conflating logical child
3777 3984           * errors with errors that happened to the zio itself when
3778 3985           * updating vdev stats and reporting FMA events above.
3779 3986           */
3780 3987          zio_inherit_child_errors(zio, ZIO_CHILD_LOGICAL);
3781 3988  
3782 3989          if ((zio->io_error || zio->io_reexecute) &&
3783 3990              IO_IS_ALLOCATING(zio) && zio->io_gang_leader == zio &&
3784 3991              !(zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)))
3785 3992                  zio_dva_unallocate(zio, zio->io_gang_tree, bp);
3786 3993  
3787 3994          zio_gang_tree_free(&zio->io_gang_tree);
3788 3995  
3789 3996          /*
3790 3997           * Godfather I/Os should never suspend.
3791 3998           */
3792 3999          if ((zio->io_flags & ZIO_FLAG_GODFATHER) &&
3793 4000              (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND))
3794 4001                  zio->io_reexecute = 0;
3795 4002  
3796 4003          if (zio->io_reexecute) {
3797 4004                  /*
3798 4005                   * This is a logical I/O that wants to reexecute.
3799 4006                   *
3800 4007                   * Reexecute is top-down.  When an i/o fails, if it's not
3801 4008                   * the root, it simply notifies its parent and sticks around.
3802 4009                   * The parent, seeing that it still has children in zio_done(),
3803 4010                   * does the same.  This percolates all the way up to the root.
3804 4011                   * The root i/o will reexecute or suspend the entire tree.
3805 4012                   *
3806 4013                   * This approach ensures that zio_reexecute() honors
3807 4014                   * all the original i/o dependency relationships, e.g.
3808 4015                   * parents not executing until children are ready.
3809 4016                   */
3810 4017                  ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
3811 4018  
3812 4019                  zio->io_gang_leader = NULL;
3813 4020  
3814 4021                  mutex_enter(&zio->io_lock);
3815 4022                  zio->io_state[ZIO_WAIT_DONE] = 1;
3816 4023                  mutex_exit(&zio->io_lock);
3817 4024  
3818 4025                  /*
3819 4026                   * "The Godfather" I/O monitors its children but is
3820 4027                   * not a true parent to them. It will track them through
3821 4028                   * the pipeline but severs its ties whenever they get into
3822 4029                   * trouble (e.g. suspended). This allows "The Godfather"
3823 4030                   * I/O to return status without blocking.
3824 4031                   */
3825 4032                  zl = NULL;
3826 4033                  for (pio = zio_walk_parents(zio, &zl); pio != NULL;
3827 4034                      pio = pio_next) {
3828 4035                          zio_link_t *remove_zl = zl;
3829 4036                          pio_next = zio_walk_parents(zio, &zl);
3830 4037  
3831 4038                          if ((pio->io_flags & ZIO_FLAG_GODFATHER) &&
3832 4039                              (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND)) {
3833 4040                                  zio_remove_child(pio, zio, remove_zl);
3834 4041                                  zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3835 4042                          }
3836 4043                  }
3837 4044  
3838 4045                  if ((pio = zio_unique_parent(zio)) != NULL) {
3839 4046                          /*
3840 4047                           * We're not a root i/o, so there's nothing to do
3841 4048                           * but notify our parent.  Don't propagate errors
3842 4049                           * upward since we haven't permanently failed yet.
3843 4050                           */
3844 4051                          ASSERT(!(zio->io_flags & ZIO_FLAG_GODFATHER));
3845 4052                          zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
3846 4053                          zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3847 4054                  } else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
3848 4055                          /*
3849 4056                           * We'd fail again if we reexecuted now, so suspend
3850 4057                           * until conditions improve (e.g. device comes online).
3851 4058                           */
3852 4059                          zio_suspend(spa, zio);
3853 4060                  } else {
3854 4061                          /*
3855 4062                           * Reexecution is potentially a huge amount of work.
3856 4063                           * Hand it off to the otherwise-unused claim taskq.
3857 4064                           */
3858 4065                          ASSERT(zio->io_tqent.tqent_next == NULL);
3859 4066                          spa_taskq_dispatch_ent(spa, ZIO_TYPE_CLAIM,
3860 4067                              ZIO_TASKQ_ISSUE, (task_func_t *)zio_reexecute, zio,
3861 4068                              0, &zio->io_tqent);
3862 4069                  }
3863 4070                  return (ZIO_PIPELINE_STOP);
3864 4071          }
3865 4072  
3866 4073          ASSERT(zio->io_child_count == 0);
3867 4074          ASSERT(zio->io_reexecute == 0);
3868 4075          ASSERT(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL));
3869 4076  
3870 4077          /*
3871 4078           * Report any checksum errors, since the I/O is complete.
3872 4079           */
3873 4080          while (zio->io_cksum_report != NULL) {
3874 4081                  zio_cksum_report_t *zcr = zio->io_cksum_report;
3875 4082                  zio->io_cksum_report = zcr->zcr_next;
3876 4083                  zcr->zcr_next = NULL;
3877 4084                  zcr->zcr_finish(zcr, NULL);
3878 4085                  zfs_ereport_free_checksum(zcr);
3879 4086          }
3880 4087  
3881 4088          /*
3882 4089           * It is the responsibility of the done callback to ensure that this
3883 4090           * particular zio is no longer discoverable for adoption, and as
3884 4091           * such, cannot acquire any new parents.
3885 4092           */
3886 4093          if (zio->io_done)
3887 4094                  zio->io_done(zio);
3888 4095  
3889 4096          mutex_enter(&zio->io_lock);
3890 4097          zio->io_state[ZIO_WAIT_DONE] = 1;
3891 4098          mutex_exit(&zio->io_lock);
3892 4099  
3893 4100          zl = NULL;
3894 4101          for (pio = zio_walk_parents(zio, &zl); pio != NULL; pio = pio_next) {
3895 4102                  zio_link_t *remove_zl = zl;
3896 4103                  pio_next = zio_walk_parents(zio, &zl);
3897 4104                  zio_remove_child(pio, zio, remove_zl);
3898 4105                  zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
3899 4106          }
3900 4107  
3901 4108          if (zio->io_waiter != NULL) {
3902 4109                  mutex_enter(&zio->io_lock);
  
3903 4110                  zio->io_executor = NULL;
3904 4111                  cv_broadcast(&zio->io_cv);
3905 4112                  mutex_exit(&zio->io_lock);
3906 4113          } else {
3907 4114                  zio_destroy(zio);
3908 4115          }
3909 4116  
3910 4117          return (ZIO_PIPELINE_STOP);
3911 4118  }
3912 4119  
     4120 +zio_t *
     4121 +zio_wbc(zio_type_t type, vdev_t *vd, abd_t *data,
     4122 +    uint64_t size, uint64_t offset)
     4123 +{
     4124 +        zio_t *zio = NULL;
     4125 +
     4126 +        switch (type) {
     4127 +        case ZIO_TYPE_WRITE:
     4128 +                zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
     4129 +                    size, NULL, NULL, ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
     4130 +                    ZIO_FLAG_PHYSICAL, vd, offset,
     4131 +                    NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
     4132 +                break;
     4133 +        case ZIO_TYPE_READ:
     4134 +                zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
     4135 +                    size, NULL, NULL, ZIO_TYPE_READ, ZIO_PRIORITY_ASYNC_READ,
     4136 +                    ZIO_FLAG_DONT_CACHE | ZIO_FLAG_PHYSICAL, vd, offset,
     4137 +                    NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
     4138 +                break;
     4139 +        default:
     4140 +                ASSERT(0);
     4141 +        }
     4142 +
     4143 +        zio->io_prop.zp_checksum = ZIO_CHECKSUM_OFF;
     4144 +
     4145 +        return (zio);
     4146 +}
     4147 +
3913 4148  /*
3914 4149   * ==========================================================================
3915 4150   * I/O pipeline definition
3916 4151   * ==========================================================================
3917 4152   */
3918 4153  static zio_pipe_stage_t *zio_pipeline[] = {
3919 4154          NULL,
3920 4155          zio_read_bp_init,
3921 4156          zio_write_bp_init,
3922 4157          zio_free_bp_init,
3923 4158          zio_issue_async,
3924 4159          zio_write_compress,
3925 4160          zio_checksum_generate,
3926 4161          zio_nop_write,
3927 4162          zio_ddt_read_start,
3928 4163          zio_ddt_read_done,
3929 4164          zio_ddt_write,
3930 4165          zio_ddt_free,
3931 4166          zio_gang_assemble,
3932 4167          zio_gang_issue,
3933 4168          zio_dva_throttle,
3934 4169          zio_dva_allocate,
3935 4170          zio_dva_free,
3936 4171          zio_dva_claim,
3937 4172          zio_ready,
3938 4173          zio_vdev_io_start,
3939 4174          zio_vdev_io_done,
3940 4175          zio_vdev_io_assess,
3941 4176          zio_checksum_verify,
3942 4177          zio_done
3943 4178  };
3944 4179  
3945 4180  
3946 4181  
3947 4182  
3948 4183  /*
3949 4184   * Compare two zbookmark_phys_t's to see which we would reach first in a
3950 4185   * pre-order traversal of the object tree.
3951 4186   *
3952 4187   * This is simple in every case aside from the meta-dnode object. For all other
3953 4188   * objects, we traverse them in order (object 1 before object 2, and so on).
3954 4189   * However, all of these objects are traversed while traversing object 0, since
3955 4190   * the data it points to is the list of objects.  Thus, we need to convert to a
3956 4191   * canonical representation so we can compare meta-dnode bookmarks to
3957 4192   * non-meta-dnode bookmarks.
3958 4193   *
3959 4194   * We do this by calculating "equivalents" for each field of the zbookmark.
3960 4195   * zbookmarks outside of the meta-dnode use their own object and level, and
3961 4196   * calculate the level 0 equivalent (the first L0 blkid that is contained in the
3962 4197   * blocks this bookmark refers to) by multiplying their blkid by their span
3963 4198   * (the number of L0 blocks contained within one block at their level).
3964 4199   * zbookmarks inside the meta-dnode calculate their object equivalent
3965 4200   * (which is L0equiv * dnodes per data block), use 0 for their L0equiv, and use
3966 4201   * level + 1<<31 (any value larger than a level could ever be) for their level.
3967 4202   * This causes them to always compare before a bookmark in their object
3968 4203   * equivalent, compare appropriately to bookmarks in other objects, and to
3969 4204   * compare appropriately to other bookmarks in the meta-dnode.
3970 4205   */
3971 4206  int
3972 4207  zbookmark_compare(uint16_t dbss1, uint8_t ibs1, uint16_t dbss2, uint8_t ibs2,
3973 4208      const zbookmark_phys_t *zb1, const zbookmark_phys_t *zb2)
3974 4209  {
3975 4210          /*
3976 4211           * These variables represent the "equivalent" values for the zbookmark,
3977 4212           * after converting zbookmarks inside the meta dnode to their
3978 4213           * normal-object equivalents.
3979 4214           */
3980 4215          uint64_t zb1obj, zb2obj;
3981 4216          uint64_t zb1L0, zb2L0;
3982 4217          uint64_t zb1level, zb2level;
3983 4218  
3984 4219          if (zb1->zb_object == zb2->zb_object &&
3985 4220              zb1->zb_level == zb2->zb_level &&
3986 4221              zb1->zb_blkid == zb2->zb_blkid)
3987 4222                  return (0);
3988 4223  
3989 4224          /*
3990 4225           * BP_SPANB calculates the span in blocks.
3991 4226           */
3992 4227          zb1L0 = (zb1->zb_blkid) * BP_SPANB(ibs1, zb1->zb_level);
3993 4228          zb2L0 = (zb2->zb_blkid) * BP_SPANB(ibs2, zb2->zb_level);
3994 4229  
3995 4230          if (zb1->zb_object == DMU_META_DNODE_OBJECT) {
3996 4231                  zb1obj = zb1L0 * (dbss1 << (SPA_MINBLOCKSHIFT - DNODE_SHIFT));
3997 4232                  zb1L0 = 0;
3998 4233                  zb1level = zb1->zb_level + COMPARE_META_LEVEL;
3999 4234          } else {
4000 4235                  zb1obj = zb1->zb_object;
4001 4236                  zb1level = zb1->zb_level;
4002 4237          }
4003 4238  
4004 4239          if (zb2->zb_object == DMU_META_DNODE_OBJECT) {
4005 4240                  zb2obj = zb2L0 * (dbss2 << (SPA_MINBLOCKSHIFT - DNODE_SHIFT));
4006 4241                  zb2L0 = 0;
4007 4242                  zb2level = zb2->zb_level + COMPARE_META_LEVEL;
4008 4243          } else {
4009 4244                  zb2obj = zb2->zb_object;
4010 4245                  zb2level = zb2->zb_level;
4011 4246          }
4012 4247  
4013 4248          /* Now that we have a canonical representation, do the comparison. */
4014 4249          if (zb1obj != zb2obj)
4015 4250                  return (zb1obj < zb2obj ? -1 : 1);
4016 4251          else if (zb1L0 != zb2L0)
4017 4252                  return (zb1L0 < zb2L0 ? -1 : 1);
4018 4253          else if (zb1level != zb2level)
4019 4254                  return (zb1level > zb2level ? -1 : 1);
4020 4255          /*
4021 4256           * This can (theoretically) happen if the bookmarks have the same object
4022 4257           * and level, but different blkids, if the block sizes are not the same.
4023 4258           * There is presently no way to change the indirect block sizes
4024 4259           */
4025 4260          return (0);
4026 4261  }
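
A worked example of the level-0 canonicalization described in the block comment before zbookmark_compare(). The sketch below re-derives BP_SPANB locally for illustration only, assuming 128-byte block pointers (SPA_BLKPTRSHIFT = 7) and 128K indirect blocks (indblkshift = 17), so each indirect block covers 1024 L0 blocks per level.

    #include <stdio.h>
    #include <stdint.h>

    #define SPA_BLKPTRSHIFT 7
    #define BP_SPANB(indblkshift, level) \
            (((uint64_t)1) << (((indblkshift) - SPA_BLKPTRSHIFT) * (level)))

    int
    main(void)
    {
            uint8_t ibs = 17;       /* 128K indirect blocks */

            /* A level-1 bookmark at blkid 5 covers L0 blkids 5120..6143. */
            uint64_t zb1L0 = 5 * BP_SPANB(ibs, 1);

            /* A level-0 bookmark at blkid 6000 lies inside that range. */
            uint64_t zb2L0 = 6000 * BP_SPANB(ibs, 0);

            printf("zb1L0 = %llu, zb2L0 = %llu\n",
                (unsigned long long)zb1L0, (unsigned long long)zb2L0);
            return (0);
    }

With those values, the level-1 bookmark maps to L0 equivalent 5120 and therefore sorts before the level-0 bookmark at blkid 6000 within the same object, which matches the pre-order traversal that zbookmark_compare() is meant to encode.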
4027 4262  
4028 4263  /*
4029 4264   *  This function checks the following: given that last_block is the place that
4030 4265   *  our traversal stopped last time, does that guarantee that we've visited
4031 4266   *  every node under subtree_root?  Therefore, we can't just use the raw output
4032 4267   *  of zbookmark_compare.  We have to pass in a modified version of
4033 4268   *  subtree_root; by incrementing the block id, and then checking whether
4034 4269   *  last_block is before or equal to that, we can tell whether or not having
4035 4270   *  visited last_block implies that all of subtree_root's children have been
4036 4271   *  visited.
4037 4272   */
4038 4273  boolean_t
4039 4274  zbookmark_subtree_completed(const dnode_phys_t *dnp,
4040 4275      const zbookmark_phys_t *subtree_root, const zbookmark_phys_t *last_block)
4041 4276  {
4042 4277          zbookmark_phys_t mod_zb = *subtree_root;
4043 4278          mod_zb.zb_blkid++;
4044 4279          ASSERT(last_block->zb_level == 0);
4045 4280  
4046 4281          /* The objset_phys_t isn't before anything. */
4047 4282          if (dnp == NULL)
4048 4283                  return (B_FALSE);
4049 4284  
4050 4285          /*
4051 4286           * We pass in 1ULL << (DNODE_BLOCK_SHIFT - SPA_MINBLOCKSHIFT) for the
4052 4287           * data block size in sectors, because that variable is only used if
4053 4288           * the bookmark refers to a block in the meta-dnode.  Since we don't
4054 4289           * know without examining it what object it refers to, and there's no
4055 4290           * harm in passing in this value in other cases, we always pass it in.
4056 4291           *
4057 4292           * We pass in 0 for the indirect block size shift because zb2 must be
4058 4293           * level 0.  The indirect block size is only used to calculate the span
4059 4294           * of the bookmark, but since the bookmark must be level 0, the span is
4060 4295           * always 1, so the math works out.
4061 4296           *
4062 4297           * If you make changes to how the zbookmark_compare code works, be sure
4063 4298           * to make sure that this code still works afterwards.
4064 4299           */
4065 4300          return (zbookmark_compare(dnp->dn_datablkszsec, dnp->dn_indblkshift,
4066 4301              1ULL << (DNODE_BLOCK_SHIFT - SPA_MINBLOCKSHIFT), 0, &mod_zb,
4067 4302              last_block) <= 0);
4068 4303  }
  