NEX-20218 Backport Illumos #9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
MFV illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
    9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Andriy Gapon <avg@FreeBSD.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-20208 Backport Illumos #9993 zil writes can get delayed in zio pipeline
MFV illumos-gate@2258ad0b755b24a55c6173b1e6bb6188389f72dd
    9993 zil writes can get delayed in zio pipeline
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
NEX-9552 zfs_scan_idle throttling harms performance and needs to be removed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-15067 KRRP: system panics during ZFS-receive: assertion failed: arc_can_share(hdr, buf)
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-13140 DVA-throttle support for special-class
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-6088 ZFS scrub/resilver take excessively long due to issuing lots of random IO
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8065 ZFS doesn't notice when disk vdevs have no write cache
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
NEX-5856 ddt_capped isn't reset when deduped dataset is destroyed
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5795 Rename 'wrc' as 'wbc' in the source and in the tech docs
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5367 special vdev: sync-write options (NEW)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5318 Cleanup specialclass property (obsolete, not used) and fix related meta-to-special case
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5188 Removed special-vdev causes panic on read or on get size of special-bp
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-4794 Write Back Cache sync and async writes: adjust routing according to watermark limits
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4619 Want kstats to monitor TRIM and UNMAP operation
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-4582 update wrc test cases to allow use of write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
NEX-3502 dedup ceiling should set a pool prop when cap is in effect
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-3984 On-demand TRIM
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Conflicts:
        usr/src/common/zfs/zpool_prop.c
        usr/src/uts/common/sys/fs/zfs.h
NEX-4003 WRC: System panics on debug build
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
NEX-3508 CLONE - Port NEX-2946 Add UNMAP/TRIM functionality to ZFS and illumos
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Conflicts:
    usr/src/uts/common/io/scsi/targets/sd.c
    usr/src/uts/common/sys/scsi/targets/sddef.h
NEX-3411 Removal of small l2arc ddt vdev disables dedup despite enough RAM
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Tony Nguyen <tony.nguyen@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-1110 Odd zpool Latency Output
OS-70 remove zio timer code
Moved closed ZFS files to open repo, changed Makefiles accordingly
Removed unneeded weak symbols
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (cache line); place each kmutex_t on a cache line by itself to avoid false sharing
Fixup merge results
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
SUP-504 Multiple disks being falsely failed/retired by new zio_timeout handling code
re #12770 rb4121 zio latency reports can produce false positives
re #12645 rb4073 Make vdev delay simulator independent of DEBUG
re #12643 rb4064 ZFS meta refactoring - vdev utilization tracking, auto-dedup
re #12616 rb4051 zfs_log_write()/dmu_sync() write once to special refactoring
re #8279 rb3915 need a mechanism to notify NMS about ZFS config changes (fix lint - courtesy of Yuri Pankov)
re #12584 rb4049 zfsxx latest code merge (fix lint - courtesy of Yuri Pankov)
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
re #12393 rb3935 Kerberos and smbd disagree about who is our AD server (fix elf runtime attributes check)
re #11612 rb3907 Failing vdev of a mirrored pool should not take zfs operations out of action for extended periods of time.
re #8346 rb2639 KT disk failures
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties

          --- old/usr/src/uts/common/fs/zfs/zio.c
          +++ new/usr/src/uts/common/fs/zfs/zio.c
(10 lines elided)
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
       21 +
  21   22  /*
  22   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23   24   * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
  24      - * Copyright (c) 2011 Nexenta Systems, Inc. All rights reserved.
  25   25   * Copyright (c) 2014 Integros [integros.com]
       26 + * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
  26   27   */
  27   28  
  28   29  #include <sys/sysmacros.h>
  29   30  #include <sys/zfs_context.h>
  30   31  #include <sys/fm/fs/zfs.h>
  31   32  #include <sys/spa.h>
  32   33  #include <sys/txg.h>
  33   34  #include <sys/spa_impl.h>
  34   35  #include <sys/vdev_impl.h>
  35   36  #include <sys/zio_impl.h>
  36   37  #include <sys/zio_compress.h>
  37   38  #include <sys/zio_checksum.h>
  38   39  #include <sys/dmu_objset.h>
  39   40  #include <sys/arc.h>
  40   41  #include <sys/ddt.h>
  41   42  #include <sys/blkptr.h>
       43 +#include <sys/special.h>
       44 +#include <sys/blkptr.h>
  42   45  #include <sys/zfeature.h>
       46 +#include <sys/dkioc_free_util.h>
       47 +#include <sys/dsl_scan.h>
       48 +
  43   49  #include <sys/metaslab_impl.h>
  44   50  #include <sys/abd.h>
  45   51  
       52 +extern int zfs_txg_timeout;
       53 +
  46   54  /*
  47   55   * ==========================================================================
  48   56   * I/O type descriptions
  49   57   * ==========================================================================
  50   58   */
  51   59  const char *zio_type_name[ZIO_TYPES] = {
  52   60          "zio_null", "zio_read", "zio_write", "zio_free", "zio_claim",
  53   61          "zio_ioctl"
  54   62  };
  55   63  
(6 lines elided)
  62   70   */
  63   71  kmem_cache_t *zio_cache;
  64   72  kmem_cache_t *zio_link_cache;
  65   73  kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
  66   74  kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
  67   75  
  68   76  #ifdef _KERNEL
  69   77  extern vmem_t *zio_alloc_arena;
  70   78  #endif
  71   79  
  72      -#define ZIO_PIPELINE_CONTINUE           0x100
  73      -#define ZIO_PIPELINE_STOP               0x101
  74      -
  75   80  #define BP_SPANB(indblkshift, level) \
  76   81          (((uint64_t)1) << ((level) * ((indblkshift) - SPA_BLKPTRSHIFT)))
  77   82  #define COMPARE_META_LEVEL      0x80000000ul
       83 +
  78   84  /*
  79   85   * The following actions directly effect the spa's sync-to-convergence logic.
  80   86   * The values below define the sync pass when we start performing the action.
  81   87   * Care should be taken when changing these values as they directly impact
  82   88   * spa_sync() performance. Tuning these values may introduce subtle performance
  83   89   * pathologies and should only be done in the context of performance analysis.
  84   90   * These tunables will eventually be removed and replaced with #defines once
  85   91   * enough analysis has been done to determine optimal values.
  86   92   *
  87   93   * The 'zfs_sync_pass_deferred_free' pass must be greater than 1 to ensure that
(10 lines elided)
  98  104  #define IO_IS_ALLOCATING(zio) ((zio)->io_orig_pipeline & ZIO_STAGE_DVA_ALLOCATE)
  99  105  
 100  106  boolean_t       zio_requeue_io_start_cut_in_line = B_TRUE;
 101  107  
 102  108  #ifdef ZFS_DEBUG
 103  109  int zio_buf_debug_limit = 16384;
 104  110  #else
 105  111  int zio_buf_debug_limit = 0;
 106  112  #endif
 107  113  
      114 +/*
      115 + * Fault insertion for stress testing
      116 + */
      117 +int zio_faulty_vdev_enabled = 0;
      118 +uint64_t zio_faulty_vdev_guid;
      119 +uint64_t zio_faulty_vdev_delay_us = 1000000;    /* 1 second */
      120 +
      121 +/*
      122 + * Tunable to allow for debugging SCSI UNMAP/SATA TRIM calls. Disabling
      123 + * it will prevent ZFS from attempting to issue DKIOCFREE ioctls to the
      124 + * underlying storage.
      125 + */
      126 +boolean_t zfs_trim = B_TRUE;
      127 +uint64_t zfs_trim_min_ext_sz = 1 << 20; /* 1 MB */
      128 +
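
[Note: the two TRIM globals above are ordinary kernel tunables; on an illumos system they would typically be pinned from /etc/system rather than by recompiling. A minimal sketch with illustrative values, not part of this patch:

    * Illustrative /etc/system fragment: disable DKIOCFREE issuance,
    * or raise the minimum trimmed extent to 2 MB (0x200000).
    set zfs:zfs_trim = 0
    set zfs:zfs_trim_min_ext_sz = 0x200000
]
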
 108  129  static void zio_taskq_dispatch(zio_t *, zio_taskq_type_t, boolean_t);
 109  130  
 110  131  void
 111  132  zio_init(void)
 112  133  {
 113  134          size_t c;
 114  135          vmem_t *data_alloc_arena = NULL;
 115  136  
 116  137  #ifdef _KERNEL
 117  138          data_alloc_arena = zio_alloc_arena;
(55 lines elided)
 173  194                  ASSERT(zio_buf_cache[c] != NULL);
 174  195                  if (zio_buf_cache[c - 1] == NULL)
 175  196                          zio_buf_cache[c - 1] = zio_buf_cache[c];
 176  197  
 177  198                  ASSERT(zio_data_buf_cache[c] != NULL);
 178  199                  if (zio_data_buf_cache[c - 1] == NULL)
 179  200                          zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
 180  201          }
 181  202  
 182  203          zio_inject_init();
      204 +
 183  205  }
 184  206  
 185  207  void
 186  208  zio_fini(void)
 187  209  {
 188  210          size_t c;
 189  211          kmem_cache_t *last_cache = NULL;
 190  212          kmem_cache_t *last_data_cache = NULL;
 191  213  
 192  214          for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
(242 lines elided)
 435  457          pio->io_child_count--;
 436  458          cio->io_parent_count--;
 437  459  
 438  460          mutex_exit(&pio->io_lock);
 439  461          mutex_exit(&cio->io_lock);
 440  462  
 441  463          kmem_cache_free(zio_link_cache, zl);
 442  464  }
 443  465  
 444  466  static boolean_t
 445      -zio_wait_for_children(zio_t *zio, uint8_t childbits, enum zio_wait_type wait)
      467 +zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
 446  468  {
      469 +        uint64_t *countp = &zio->io_children[child][wait];
 447  470          boolean_t waiting = B_FALSE;
 448  471  
 449  472          mutex_enter(&zio->io_lock);
 450  473          ASSERT(zio->io_stall == NULL);
 451      -        for (int c = 0; c < ZIO_CHILD_TYPES; c++) {
 452      -                if (!(ZIO_CHILD_BIT_IS_SET(childbits, c)))
 453      -                        continue;
 454      -
 455      -                uint64_t *countp = &zio->io_children[c][wait];
 456      -                if (*countp != 0) {
 457      -                        zio->io_stage >>= 1;
 458      -                        ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
 459      -                        zio->io_stall = countp;
 460      -                        waiting = B_TRUE;
 461      -                        break;
 462      -                }
      474 +        if (*countp != 0) {
      475 +                zio->io_stage >>= 1;
      476 +                ASSERT3U(zio->io_stage, !=, ZIO_STAGE_OPEN);
      477 +                zio->io_stall = countp;
      478 +                waiting = B_TRUE;
 463  479          }
 464  480          mutex_exit(&zio->io_lock);
      481 +
 465  482          return (waiting);
 466  483  }
 467  484  
 468  485  static void
 469  486  zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
 470  487  {
 471  488          uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
 472  489          int *errorp = &pio->io_child_error[zio->io_child_type];
 473  490  
 474  491          mutex_enter(&pio->io_lock);
(137 lines elided)
 612  629          zio->io_orig_pipeline = zio->io_pipeline = pipeline;
 613  630          zio->io_pipeline_trace = ZIO_STAGE_OPEN;
 614  631  
 615  632          zio->io_state[ZIO_WAIT_READY] = (stage >= ZIO_STAGE_READY);
 616  633          zio->io_state[ZIO_WAIT_DONE] = (stage >= ZIO_STAGE_DONE);
 617  634  
 618  635          if (zb != NULL)
 619  636                  zio->io_bookmark = *zb;
 620  637  
 621  638          if (pio != NULL) {
      639 +                zio->io_mc = pio->io_mc;
 622  640                  if (zio->io_logical == NULL)
 623  641                          zio->io_logical = pio->io_logical;
 624  642                  if (zio->io_child_type == ZIO_CHILD_GANG)
 625  643                          zio->io_gang_leader = pio->io_gang_leader;
 626  644                  zio_add_child(pio, zio);
      645 +
      646 +                /* copy the smartcomp setting when creating child zio's */
      647 +                bcopy(&pio->io_smartcomp, &zio->io_smartcomp,
      648 +                    sizeof (zio->io_smartcomp));
 627  649          }
 628  650  
 629  651          return (zio);
 630  652  }
 631  653  
 632  654  static void
 633  655  zio_destroy(zio_t *zio)
 634  656  {
 635  657          metaslab_trace_fini(&zio->io_alloc_list);
 636  658          list_destroy(&zio->io_parent_list);
(18 lines elided)
 655  677  
 656  678  zio_t *
 657  679  zio_root(spa_t *spa, zio_done_func_t *done, void *private, enum zio_flag flags)
 658  680  {
 659  681          return (zio_null(NULL, spa, NULL, done, private, flags));
 660  682  }
 661  683  
 662  684  void
 663  685  zfs_blkptr_verify(spa_t *spa, const blkptr_t *bp)
 664  686  {
      687 +        /*
      688 +         * SPECIAL-BP has two DVAs, but DVA[0] in this case is a
      689 +         * temporary DVA, and after migration only the DVA[1]
      690 +         * contains valid data. Therefore, we start walking for
      691 +         * these BPs from DVA[1].
      692 +         */
      693 +        int start_dva = BP_IS_SPECIAL(bp) ? 1 : 0;
      694 +
 665  695          if (!DMU_OT_IS_VALID(BP_GET_TYPE(bp))) {
 666  696                  zfs_panic_recover("blkptr at %p has invalid TYPE %llu",
 667  697                      bp, (longlong_t)BP_GET_TYPE(bp));
 668  698          }
 669  699          if (BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
 670  700              BP_GET_CHECKSUM(bp) <= ZIO_CHECKSUM_ON) {
 671  701                  zfs_panic_recover("blkptr at %p has invalid CHECKSUM %llu",
 672  702                      bp, (longlong_t)BP_GET_CHECKSUM(bp));
 673  703          }
 674  704          if (BP_GET_COMPRESS(bp) >= ZIO_COMPRESS_FUNCTIONS ||
(11 lines elided)
 686  716          }
 687  717  
 688  718          if (BP_IS_EMBEDDED(bp)) {
 689  719                  if (BPE_GET_ETYPE(bp) > NUM_BP_EMBEDDED_TYPES) {
 690  720                          zfs_panic_recover("blkptr at %p has invalid ETYPE %llu",
 691  721                              bp, (longlong_t)BPE_GET_ETYPE(bp));
 692  722                  }
 693  723          }
 694  724  
 695  725          /*
 696      -         * Do not verify individual DVAs if the config is not trusted. This
 697      -         * will be done once the zio is executed in vdev_mirror_map_alloc.
 698      -         */
 699      -        if (!spa->spa_trust_config)
 700      -                return;
 701      -
 702      -        /*
 703  726           * Pool-specific checks.
 704  727           *
 705  728           * Note: it would be nice to verify that the blk_birth and
 706  729           * BP_PHYSICAL_BIRTH() are not too large.  However, spa_freeze()
 707  730           * allows the birth time of log blocks (and dmu_sync()-ed blocks
 708  731           * that are in the log) to be arbitrarily large.
 709  732           */
 710      -        for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
      733 +        for (int i = start_dva; i < BP_GET_NDVAS(bp); i++) {
 711  734                  uint64_t vdevid = DVA_GET_VDEV(&bp->blk_dva[i]);
 712  735                  if (vdevid >= spa->spa_root_vdev->vdev_children) {
 713  736                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
 714  737                              "VDEV %llu",
 715  738                              bp, i, (longlong_t)vdevid);
 716  739                          continue;
 717  740                  }
 718  741                  vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
 719  742                  if (vd == NULL) {
 720  743                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
(20 lines elided)
 741  764                  if (BP_IS_GANG(bp))
 742  765                          asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
 743  766                  if (offset + asize > vd->vdev_asize) {
 744  767                          zfs_panic_recover("blkptr at %p DVA %u has invalid "
 745  768                              "OFFSET %llu",
 746  769                              bp, i, (longlong_t)offset);
 747  770                  }
 748  771          }
 749  772  }
 750  773  
 751      -boolean_t
 752      -zfs_dva_valid(spa_t *spa, const dva_t *dva, const blkptr_t *bp)
 753      -{
 754      -        uint64_t vdevid = DVA_GET_VDEV(dva);
 755      -
 756      -        if (vdevid >= spa->spa_root_vdev->vdev_children)
 757      -                return (B_FALSE);
 758      -
 759      -        vdev_t *vd = spa->spa_root_vdev->vdev_child[vdevid];
 760      -        if (vd == NULL)
 761      -                return (B_FALSE);
 762      -
 763      -        if (vd->vdev_ops == &vdev_hole_ops)
 764      -                return (B_FALSE);
 765      -
 766      -        if (vd->vdev_ops == &vdev_missing_ops) {
 767      -                return (B_FALSE);
 768      -        }
 769      -
 770      -        uint64_t offset = DVA_GET_OFFSET(dva);
 771      -        uint64_t asize = DVA_GET_ASIZE(dva);
 772      -
 773      -        if (BP_IS_GANG(bp))
 774      -                asize = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);
 775      -        if (offset + asize > vd->vdev_asize)
 776      -                return (B_FALSE);
 777      -
 778      -        return (B_TRUE);
 779      -}
 780      -
 781  774  zio_t *
 782  775  zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
 783  776      abd_t *data, uint64_t size, zio_done_func_t *done, void *private,
 784  777      zio_priority_t priority, enum zio_flag flags, const zbookmark_phys_t *zb)
 785  778  {
 786  779          zio_t *zio;
 787  780  
 788  781          zfs_blkptr_verify(spa, bp);
 789  782  
 790  783          zio = zio_create(pio, spa, BP_PHYSICAL_BIRTH(bp), bp,
(4 lines elided)
 795  788  
 796  789          return (zio);
 797  790  }
 798  791  
 799  792  zio_t *
 800  793  zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
 801  794      abd_t *data, uint64_t lsize, uint64_t psize, const zio_prop_t *zp,
 802  795      zio_done_func_t *ready, zio_done_func_t *children_ready,
 803  796      zio_done_func_t *physdone, zio_done_func_t *done,
 804  797      void *private, zio_priority_t priority, enum zio_flag flags,
 805      -    const zbookmark_phys_t *zb)
      798 +    const zbookmark_phys_t *zb,
      799 +    const zio_smartcomp_info_t *smartcomp)
 806  800  {
 807  801          zio_t *zio;
 808  802  
 809  803          ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
 810  804              zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
 811  805              zp->zp_compress >= ZIO_COMPRESS_OFF &&
 812  806              zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
 813  807              DMU_OT_IS_VALID(zp->zp_type) &&
 814  808              zp->zp_level < 32 &&
 815  809              zp->zp_copies > 0 &&
(1 line elided)
 817  811  
 818  812          zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
 819  813              ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
 820  814              ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
 821  815              ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);
 822  816  
 823  817          zio->io_ready = ready;
 824  818          zio->io_children_ready = children_ready;
 825  819          zio->io_physdone = physdone;
 826  820          zio->io_prop = *zp;
      821 +        if (smartcomp != NULL)
      822 +                bcopy(smartcomp, &zio->io_smartcomp, sizeof (*smartcomp));
 827  823  
 828  824          /*
 829  825           * Data can be NULL if we are going to call zio_write_override() to
 830  826           * provide the already-allocated BP.  But we may need the data to
 831  827           * verify a dedup hit (if requested).  In this case, don't try to
 832  828           * dedup (just take the already-allocated BP verbatim).
 833  829           */
 834  830          if (data == NULL && zio->io_prop.zp_dedup_verify) {
 835  831                  zio->io_prop.zp_dedup = zio->io_prop.zp_dedup_verify = B_FALSE;
 836  832          }
(31 lines elided)
 868  864          zio->io_prop.zp_dedup = nopwrite ? B_FALSE : zio->io_prop.zp_dedup;
 869  865          zio->io_prop.zp_nopwrite = nopwrite;
 870  866          zio->io_prop.zp_copies = copies;
 871  867          zio->io_bp_override = bp;
 872  868  }
 873  869  
 874  870  void
 875  871  zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp)
 876  872  {
 877  873  
 878      -        zfs_blkptr_verify(spa, bp);
 879      -
 880  874          /*
 881  875           * The check for EMBEDDED is a performance optimization.  We
 882  876           * process the free here (by ignoring it) rather than
 883  877           * putting it on the list and then processing it in zio_free_sync().
 884  878           */
 885  879          if (BP_IS_EMBEDDED(bp))
 886  880                  return;
 887  881          metaslab_check_free(spa, bp);
 888  882  
 889  883          /*
(20 lines elided)
 910  904  
 911  905          ASSERT(!BP_IS_HOLE(bp));
 912  906          ASSERT(spa_syncing_txg(spa) == txg);
 913  907          ASSERT(spa_sync_pass(spa) < zfs_sync_pass_deferred_free);
 914  908  
 915  909          if (BP_IS_EMBEDDED(bp))
 916  910                  return (zio_null(pio, spa, NULL, NULL, NULL, 0));
 917  911  
 918  912          metaslab_check_free(spa, bp);
 919  913          arc_freed(spa, bp);
      914 +        dsl_scan_freed(spa, bp);
 920  915  
 921  916          /*
 922  917           * GANG and DEDUP blocks can induce a read (for the gang block header,
 923  918           * or the DDT), so issue them asynchronously so that this thread is
 924  919           * not tied up.
 925  920           */
 926  921          if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp))
 927  922                  stage |= ZIO_STAGE_ISSUE_ASYNC;
 928  923  
 929  924          zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
(2 lines elided)
 932  927  
 933  928          return (zio);
 934  929  }
 935  930  
 936  931  zio_t *
 937  932  zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, const blkptr_t *bp,
 938  933      zio_done_func_t *done, void *private, enum zio_flag flags)
 939  934  {
 940  935          zio_t *zio;
 941  936  
 942      -        zfs_blkptr_verify(spa, bp);
      937 +        dprintf_bp(bp, "claiming in txg %llu", txg);
 943  938  
 944  939          if (BP_IS_EMBEDDED(bp))
 945  940                  return (zio_null(pio, spa, NULL, NULL, NULL, 0));
 946  941  
 947  942          /*
 948  943           * A claim is an allocation of a specific block.  Claims are needed
 949  944           * to support immediate writes in the intent log.  The issue is that
 950  945           * immediate writes contain committed data, but in a txg that was
 951  946           * *not* committed.  Upon opening the pool after an unclean shutdown,
 952  947           * the intent log claims all blocks that contain immediate write data
(8 lines elided)
 961  956          ASSERT(!BP_GET_DEDUP(bp) || !spa_writeable(spa));       /* zdb(1M) */
 962  957  
 963  958          zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
 964  959              BP_GET_PSIZE(bp), done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW,
 965  960              flags, NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);
 966  961          ASSERT0(zio->io_queued_timestamp);
 967  962  
 968  963          return (zio);
 969  964  }
 970  965  
 971      -zio_t *
 972      -zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
 973      -    zio_done_func_t *done, void *private, enum zio_flag flags)
      966 +static zio_t *
      967 +zio_ioctl_with_pipeline(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
      968 +    zio_done_func_t *done, void *private, enum zio_flag flags,
      969 +    enum zio_stage pipeline)
 974  970  {
 975  971          zio_t *zio;
 976  972          int c;
 977  973  
 978  974          if (vd->vdev_children == 0) {
 979  975                  zio = zio_create(pio, spa, 0, NULL, NULL, 0, 0, done, private,
 980  976                      ZIO_TYPE_IOCTL, ZIO_PRIORITY_NOW, flags, vd, 0, NULL,
 981      -                    ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
      977 +                    ZIO_STAGE_OPEN, pipeline);
 982  978  
 983  979                  zio->io_cmd = cmd;
 984  980          } else {
 985      -                zio = zio_null(pio, spa, NULL, NULL, NULL, flags);
 986      -
 987      -                for (c = 0; c < vd->vdev_children; c++)
 988      -                        zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
 989      -                            done, private, flags));
      981 +                zio = zio_null(pio, spa, vd, done, private, flags);
      982 +                /*
      983 +                 * DKIOCFREE ioctl's need some special handling on interior
      984 +                 * vdevs. If the device provides an ops function to handle
      985 +                 * recomputing dkioc_free extents, then we call it.
      986 +                 * Otherwise the default behavior applies, which simply fans
      987 +                 * out the ioctl to all component vdevs.
      988 +                 */
      989 +                if (cmd == DKIOCFREE && vd->vdev_ops->vdev_op_trim != NULL) {
      990 +                        vd->vdev_ops->vdev_op_trim(vd, zio, private);
      991 +                } else {
      992 +                        for (c = 0; c < vd->vdev_children; c++)
      993 +                                zio_nowait(zio_ioctl_with_pipeline(zio,
      994 +                                    spa, vd->vdev_child[c], cmd, NULL,
      995 +                                    private, flags, pipeline));
      996 +                }
 990  997          }
 991  998  
 992  999          return (zio);
 993 1000  }
 994 1001  
 995 1002  zio_t *
     1003 +zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
     1004 +    zio_done_func_t *done, void *private, enum zio_flag flags)
     1005 +{
     1006 +        return (zio_ioctl_with_pipeline(pio, spa, vd, cmd, done,
     1007 +            private, flags, ZIO_IOCTL_PIPELINE));
     1008 +}
     1009 +
     1010 +/*
     1011 + * Callback for when a trim zio has completed. This simply frees the
     1012 + * dkioc_free_list_t extent list of the DKIOCFREE ioctl.
     1013 + */
     1014 +static void
     1015 +zio_trim_done(zio_t *zio)
     1016 +{
     1017 +        VERIFY(zio->io_private != NULL);
     1018 +        dfl_free(zio->io_private);
     1019 +}
     1020 +
     1021 +static void
     1022 +zio_trim_check(uint64_t start, uint64_t len, void *msp)
     1023 +{
     1024 +        metaslab_t *ms = msp;
     1025 +        boolean_t held = MUTEX_HELD(&ms->ms_lock);
     1026 +        if (!held)
     1027 +                mutex_enter(&ms->ms_lock);
     1028 +        ASSERT(ms->ms_trimming_ts != NULL);
     1029 +        ASSERT(range_tree_contains(ms->ms_trimming_ts->ts_tree,
     1030 +            start - VDEV_LABEL_START_SIZE, len));
     1031 +        if (!held)
     1032 +                mutex_exit(&ms->ms_lock);
     1033 +}
     1034 +
     1035 +/*
     1036 + * Takes a bunch of freed extents and tells the underlying vdevs that the
     1037 + * space associated with these extents can be released.
     1038 + * This is used by flash storage to pre-erase blocks for rapid reuse later
     1039 + * and thin-provisioned block storage to reclaim unused blocks.
     1040 + */
     1041 +zio_t *
     1042 +zio_trim(spa_t *spa, vdev_t *vd, struct range_tree *tree,
     1043 +    zio_done_func_t *done, void *private, enum zio_flag flags,
     1044 +    int trim_flags, metaslab_t *msp)
     1045 +{
     1046 +        dkioc_free_list_t *dfl = NULL;
     1047 +        range_seg_t *rs;
     1048 +        uint64_t rs_idx;
     1049 +        uint64_t num_exts;
     1050 +        uint64_t bytes_issued = 0, bytes_skipped = 0, exts_skipped = 0;
     1051 +        /*
     1052 +         * We need this to invoke the caller's `done' callback with the
     1053 +         * correct io_private (not the dkioc_free_list_t, which is needed
     1054 +         * by the underlying DKIOCFREE ioctl).
     1055 +         */
     1056 +        zio_t *sub_pio = zio_root(spa, done, private, flags);
     1057 +
     1058 +        ASSERT(range_tree_space(tree) != 0);
     1059 +
     1060 +        if (!zfs_trim)
     1061 +                return (sub_pio);
     1062 +
     1063 +        num_exts = avl_numnodes(&tree->rt_root);
     1064 +        dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
     1065 +        dfl->dfl_flags = trim_flags;
     1066 +        dfl->dfl_num_exts = num_exts;
     1067 +        dfl->dfl_offset = VDEV_LABEL_START_SIZE;
     1068 +        if (msp) {
     1069 +                dfl->dfl_ck_func = zio_trim_check;
     1070 +                dfl->dfl_ck_arg = msp;
     1071 +        }
     1072 +
     1073 +        for (rs = avl_first(&tree->rt_root), rs_idx = 0; rs != NULL;
     1074 +            rs = AVL_NEXT(&tree->rt_root, rs)) {
     1075 +                uint64_t len = rs->rs_end - rs->rs_start;
     1076 +
     1077 +                if (len < zfs_trim_min_ext_sz) {
     1078 +                        bytes_skipped += len;
     1079 +                        exts_skipped++;
     1080 +                        continue;
     1081 +                }
     1082 +
     1083 +                dfl->dfl_exts[rs_idx].dfle_start = rs->rs_start;
     1084 +                dfl->dfl_exts[rs_idx].dfle_length = len;
     1085 +
      1086 +                /* check we're a multiple of the vdev ashift */
     1087 +                ASSERT0(dfl->dfl_exts[rs_idx].dfle_start &
     1088 +                    ((1 << vd->vdev_ashift) - 1));
     1089 +                ASSERT0(dfl->dfl_exts[rs_idx].dfle_length &
     1090 +                    ((1 << vd->vdev_ashift) - 1));
     1091 +
     1092 +                rs_idx++;
     1093 +                bytes_issued += len;
     1094 +        }
     1095 +
     1096 +        spa_trimstats_update(spa, rs_idx, bytes_issued, exts_skipped,
     1097 +            bytes_skipped);
     1098 +
     1099 +        /* the zfs_trim_min_ext_sz filter may have shortened the list */
     1100 +        if (dfl->dfl_num_exts != rs_idx) {
     1101 +                dkioc_free_list_t *dfl2 = kmem_zalloc(DFL_SZ(rs_idx), KM_SLEEP);
     1102 +                bcopy(dfl, dfl2, DFL_SZ(rs_idx));
     1103 +                dfl2->dfl_num_exts = rs_idx;
     1104 +                dfl_free(dfl);
     1105 +                dfl = dfl2;
     1106 +        }
     1107 +
     1108 +        zio_nowait(zio_ioctl_with_pipeline(sub_pio, spa, vd, DKIOCFREE,
     1109 +            zio_trim_done, dfl, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
     1110 +            ZIO_FLAG_DONT_RETRY, ZIO_TRIM_PIPELINE));
     1111 +        return (sub_pio);
     1112 +}
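
[Note: a minimal sketch of the calling contract for zio_trim(), assuming a caller that has already assembled a range_tree of freed extents; the wrapper name and flag choice are illustrative, only the zio_trim()/zio_wait() usage follows the signature above:

    /* Hypothetical caller: synchronously trim one vdev's freed extents. */
    static int
    example_trim_extents(spa_t *spa, vdev_t *vd, struct range_tree *tree,
        metaslab_t *msp)
    {
            /*
             * zio_trim() returns the root of the DKIOCFREE fan-out; the
             * caller must zio_wait() (or zio_nowait()) it.  Passing a
             * metaslab arms the zio_trim_check() verification callback.
             */
            zio_t *zio = zio_trim(spa, vd, tree, NULL, NULL,
                ZIO_FLAG_CANFAIL, 0, msp);
            return (zio_wait(zio));
    }
]
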
     1113 +
     1114 +zio_t *
 996 1115  zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
 997 1116      abd_t *data, int checksum, zio_done_func_t *done, void *private,
 998 1117      zio_priority_t priority, enum zio_flag flags, boolean_t labels)
 999 1118  {
1000 1119          zio_t *zio;
1001 1120  
1002 1121          ASSERT(vd->vdev_children == 0);
1003 1122          ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
1004 1123              offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
1005 1124          ASSERT3U(offset + size, <=, vd->vdev_psize);
↓ open down ↓ 45 lines elided ↑ open up ↑
1051 1170   * Create a child I/O to do some work for us.
1052 1171   */
1053 1172  zio_t *
1054 1173  zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
1055 1174      abd_t *data, uint64_t size, int type, zio_priority_t priority,
1056 1175      enum zio_flag flags, zio_done_func_t *done, void *private)
1057 1176  {
1058 1177          enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;
1059 1178          zio_t *zio;
1060 1179  
1061      -        /*
1062      -         * vdev child I/Os do not propagate their error to the parent.
1063      -         * Therefore, for correct operation the caller *must* check for
1064      -         * and handle the error in the child i/o's done callback.
1065      -         * The only exceptions are i/os that we don't care about
1066      -         * (OPTIONAL or REPAIR).
1067      -         */
1068      -        ASSERT((flags & ZIO_FLAG_OPTIONAL) || (flags & ZIO_FLAG_IO_REPAIR) ||
1069      -            done != NULL);
     1180 +        ASSERT(vd->vdev_parent ==
     1181 +            (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
1070 1182  
1071      -        /*
1072      -         * In the common case, where the parent zio was to a normal vdev,
1073      -         * the child zio must be to a child vdev of that vdev.  Otherwise,
1074      -         * the child zio must be to a top-level vdev.
1075      -         */
1076      -        if (pio->io_vd != NULL && pio->io_vd->vdev_ops != &vdev_indirect_ops) {
1077      -                ASSERT3P(vd->vdev_parent, ==, pio->io_vd);
1078      -        } else {
1079      -                ASSERT3P(vd, ==, vd->vdev_top);
1080      -        }
1081      -
1082 1183          if (type == ZIO_TYPE_READ && bp != NULL) {
1083 1184                  /*
1084 1185                   * If we have the bp, then the child should perform the
1085 1186                   * checksum and the parent need not.  This pushes error
1086 1187                   * detection as close to the leaves as possible and
1087 1188                   * eliminates redundant checksums in the interior nodes.
1088 1189                   */
1089 1190                  pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
1090 1191                  pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
1091 1192          }
1092 1193  
1093      -        if (vd->vdev_ops->vdev_op_leaf) {
1094      -                ASSERT0(vd->vdev_children);
     1194 +        if (vd->vdev_children == 0)
1095 1195                  offset += VDEV_LABEL_START_SIZE;
1096      -        }
1097 1196  
1098      -        flags |= ZIO_VDEV_CHILD_FLAGS(pio);
     1197 +        flags |= ZIO_VDEV_CHILD_FLAGS(pio) | ZIO_FLAG_DONT_PROPAGATE;
1099 1198  
1100 1199          /*
1101 1200           * If we've decided to do a repair, the write is not speculative --
1102 1201           * even if the original read was.
1103 1202           */
1104 1203          if (flags & ZIO_FLAG_IO_REPAIR)
1105 1204                  flags &= ~ZIO_FLAG_SPECULATIVE;
1106 1205  
1107 1206          /*
1108 1207           * If we're creating a child I/O that is not associated with a
1109 1208           * top-level vdev, then the child zio is not an allocating I/O.
1110 1209           * If this is a retried I/O then we ignore it since we will
1111 1210           * have already processed the original allocating I/O.
1112 1211           */
1113 1212          if (flags & ZIO_FLAG_IO_ALLOCATING &&
1114 1213              (vd != vd->vdev_top || (flags & ZIO_FLAG_IO_RETRY))) {
1115      -                metaslab_class_t *mc = spa_normal_class(pio->io_spa);
     1214 +                metaslab_class_t *mc = pio->io_mc;
1116 1215  
1117 1216                  ASSERT(mc->mc_alloc_throttle_enabled);
1118 1217                  ASSERT(type == ZIO_TYPE_WRITE);
1119 1218                  ASSERT(priority == ZIO_PRIORITY_ASYNC_WRITE);
1120 1219                  ASSERT(!(flags & ZIO_FLAG_IO_REPAIR));
1121 1220                  ASSERT(!(pio->io_flags & ZIO_FLAG_IO_REWRITE) ||
1122 1221                      pio->io_child_type == ZIO_CHILD_GANG);
1123 1222  
1124 1223                  flags &= ~ZIO_FLAG_IO_ALLOCATING;
1125 1224          }
↓ open down ↓ 60 lines elided ↑ open up ↑
1186 1285   * ==========================================================================
1187 1286   * Prepare to read and write logical blocks
1188 1287   * ==========================================================================
1189 1288   */
1190 1289  
1191 1290  static int
1192 1291  zio_read_bp_init(zio_t *zio)
1193 1292  {
1194 1293          blkptr_t *bp = zio->io_bp;
1195 1294  
1196      -        ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1197      -
1198 1295          if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF &&
1199 1296              zio->io_child_type == ZIO_CHILD_LOGICAL &&
1200 1297              !(zio->io_flags & ZIO_FLAG_RAW)) {
1201 1298                  uint64_t psize =
1202 1299                      BP_IS_EMBEDDED(bp) ? BPE_GET_PSIZE(bp) : BP_GET_PSIZE(bp);
1203 1300                  zio_push_transform(zio, abd_alloc_sametype(zio->io_abd, psize),
1204 1301                      psize, psize, zio_decompress);
1205 1302          }
1206 1303  
1207 1304          if (BP_IS_EMBEDDED(bp) && BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA) {
1208 1305                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1209 1306  
1210 1307                  int psize = BPE_GET_PSIZE(bp);
1211 1308                  void *data = abd_borrow_buf(zio->io_abd, psize);
1212 1309                  decode_embedded_bp_compressed(bp, data);
1213 1310                  abd_return_buf_copy(zio->io_abd, data, psize);
1214 1311          } else {
1215 1312                  ASSERT(!BP_IS_EMBEDDED(bp));
1216      -                ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1217 1313          }
1218 1314  
1219      -        if (!DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) && BP_GET_LEVEL(bp) == 0)
     1315 +        if (!BP_IS_METADATA(bp))
1220 1316                  zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1221 1317  
1222 1318          if (BP_GET_TYPE(bp) == DMU_OT_DDT_ZAP)
1223 1319                  zio->io_flags |= ZIO_FLAG_DONT_CACHE;
1224 1320  
1225 1321          if (BP_GET_DEDUP(bp) && zio->io_child_type == ZIO_CHILD_LOGICAL)
1226 1322                  zio->io_pipeline = ZIO_DDT_READ_PIPELINE;
1227 1323  
1228 1324          return (ZIO_PIPELINE_CONTINUE);
1229 1325  }
(67 lines elided)
1297 1393          uint64_t lsize = zio->io_lsize;
1298 1394          uint64_t psize = zio->io_size;
1299 1395          int pass = 1;
1300 1396  
1301 1397          EQUIV(lsize != psize, (zio->io_flags & ZIO_FLAG_RAW) != 0);
1302 1398  
1303 1399          /*
1304 1400           * If our children haven't all reached the ready stage,
1305 1401           * wait for them and then repeat this pipeline stage.
1306 1402           */
1307      -        if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
1308      -            ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
     1403 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
     1404 +            zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
1309 1405                  return (ZIO_PIPELINE_STOP);
1310      -        }
1311 1406  
1312 1407          if (!IO_IS_ALLOCATING(zio))
1313 1408                  return (ZIO_PIPELINE_CONTINUE);
1314 1409  
1315 1410          if (zio->io_children_ready != NULL) {
1316 1411                  /*
1317 1412                   * Now that all our children are ready, run the callback
1318 1413                   * associated with this zio in case it wants to modify the
1319 1414                   * data to be written.
1320 1415                   */
(21 lines elided)
1342 1437                  ASSERT(!BP_GET_DEDUP(bp));
1343 1438  
1344 1439                  if (pass >= zfs_sync_pass_dont_compress)
1345 1440                          compress = ZIO_COMPRESS_OFF;
1346 1441  
1347 1442                  /* Make sure someone doesn't change their mind on overwrites */
1348 1443                  ASSERT(BP_IS_EMBEDDED(bp) || MIN(zp->zp_copies + BP_IS_GANG(bp),
1349 1444                      spa_max_replication(spa)) == BP_GET_NDVAS(bp));
1350 1445          }
1351 1446  
     1447 +        DTRACE_PROBE1(zio_compress_ready, zio_t *, zio);
1352 1448          /* If it's a compressed write that is not raw, compress the buffer. */
1353      -        if (compress != ZIO_COMPRESS_OFF && psize == lsize) {
     1449 +        if (compress != ZIO_COMPRESS_OFF && psize == lsize &&
     1450 +            ZIO_SHOULD_COMPRESS(zio)) {
1354 1451                  void *cbuf = zio_buf_alloc(lsize);
1355 1452                  psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
1356 1453                  if (psize == 0 || psize == lsize) {
1357 1454                          compress = ZIO_COMPRESS_OFF;
1358 1455                          zio_buf_free(cbuf, lsize);
1359 1456                  } else if (!zp->zp_dedup && psize <= BPE_PAYLOAD_SIZE &&
1360 1457                      zp->zp_level == 0 && !DMU_OT_HAS_FILL(zp->zp_type) &&
1361 1458                      spa_feature_is_enabled(spa, SPA_FEATURE_EMBEDDED_DATA)) {
1362 1459                          encode_embedded_bp_compressed(bp,
1363 1460                              cbuf, compress, lsize, psize);
1364 1461                          BPE_SET_ETYPE(bp, BP_EMBEDDED_TYPE_DATA);
1365 1462                          BP_SET_TYPE(bp, zio->io_prop.zp_type);
1366 1463                          BP_SET_LEVEL(bp, zio->io_prop.zp_level);
1367 1464                          zio_buf_free(cbuf, lsize);
1368 1465                          bp->blk_birth = zio->io_txg;
1369 1466                          zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1370 1467                          ASSERT(spa_feature_is_active(spa,
1371 1468                              SPA_FEATURE_EMBEDDED_DATA));
     1469 +                        if (zio->io_smartcomp.sc_result != NULL) {
     1470 +                                zio->io_smartcomp.sc_result(
     1471 +                                    zio->io_smartcomp.sc_userinfo, zio);
     1472 +                        } else {
     1473 +                                ASSERT(zio->io_smartcomp.sc_ask == NULL);
     1474 +                        }
1372 1475                          return (ZIO_PIPELINE_CONTINUE);
1373 1476                  } else {
1374 1477                          /*
1375 1478                           * Round up compressed size up to the ashift
1376 1479                           * of the smallest-ashift device, and zero the tail.
1377 1480                           * This ensures that the compressed size of the BP
1378 1481                           * (and thus compressratio property) are correct,
1379 1482                           * in that we charge for the padding used to fill out
1380 1483                           * the last sector.
1381 1484                           */
(7 lines elided)
1389 1492                          } else {
1390 1493                                  abd_t *cdata = abd_get_from_buf(cbuf, lsize);
1391 1494                                  abd_take_ownership_of_buf(cdata, B_TRUE);
1392 1495                                  abd_zero_off(cdata, psize, rounded - psize);
1393 1496                                  psize = rounded;
1394 1497                                  zio_push_transform(zio, cdata,
1395 1498                                      psize, lsize, NULL);
1396 1499                          }
1397 1500                  }
1398 1501  
     1502 +                if (zio->io_smartcomp.sc_result != NULL) {
     1503 +                        zio->io_smartcomp.sc_result(
     1504 +                            zio->io_smartcomp.sc_userinfo, zio);
     1505 +                } else {
     1506 +                        ASSERT(zio->io_smartcomp.sc_ask == NULL);
     1507 +                }
     1508 +
1399 1509                  /*
1400 1510                   * We were unable to handle this as an override bp, treat
1401 1511                   * it as a regular write I/O.
1402 1512                   */
1403 1513                  zio->io_bp_override = NULL;
1404 1514                  *bp = zio->io_bp_orig;
1405 1515                  zio->io_pipeline = zio->io_orig_pipeline;
1406 1516          } else {
1407 1517                  ASSERT3U(psize, !=, 0);
     1518 +
     1519 +                /*
     1520 +                 * We are here because of:
     1521 +                 *      - compress == ZIO_COMPRESS_OFF
     1522 +                 *      - SmartCompression decides don't compress this data
     1523 +                 *      - this is a RAW-write
     1524 +                 *
     1525 +                 *      In case of RAW-write we should not override "compress"
     1526 +                 */
     1527 +                if ((zio->io_flags & ZIO_FLAG_RAW) == 0)
     1528 +                        compress = ZIO_COMPRESS_OFF;
1408 1529          }
1409 1530  
1410 1531          /*
1411 1532           * The final pass of spa_sync() must be all rewrites, but the first
1412 1533           * few passes offer a trade-off: allocating blocks defers convergence,
1413 1534           * but newly allocated blocks are sequential, so they can be written
1414 1535           * to disk faster.  Therefore, we allow the first few passes of
1415 1536           * spa_sync() to allocate new blocks, but force rewrites after that.
1416 1537           * There should only be a handful of blocks after pass 1 in any case.
1417 1538           */
(12 lines elided)
1430 1551          if (psize == 0) {
1431 1552                  if (zio->io_bp_orig.blk_birth != 0 &&
1432 1553                      spa_feature_is_active(spa, SPA_FEATURE_HOLE_BIRTH)) {
1433 1554                          BP_SET_LSIZE(bp, lsize);
1434 1555                          BP_SET_TYPE(bp, zp->zp_type);
1435 1556                          BP_SET_LEVEL(bp, zp->zp_level);
1436 1557                          BP_SET_BIRTH(bp, zio->io_txg, 0);
1437 1558                  }
1438 1559                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1439 1560          } else {
     1561 +                if (zp->zp_dedup) {
     1562 +                        /* check the best-effort dedup setting */
     1563 +                        zio_best_effort_dedup(zio);
     1564 +                }
1440 1565                  ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
1441 1566                  BP_SET_LSIZE(bp, lsize);
1442 1567                  BP_SET_TYPE(bp, zp->zp_type);
1443 1568                  BP_SET_LEVEL(bp, zp->zp_level);
1444 1569                  BP_SET_PSIZE(bp, psize);
1445 1570                  BP_SET_COMPRESS(bp, compress);
1446 1571                  BP_SET_CHECKSUM(bp, zp->zp_checksum);
1447 1572                  BP_SET_DEDUP(bp, zp->zp_dedup);
1448 1573                  BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
1449 1574                  if (zp->zp_dedup) {
(13 lines elided)
1463 1588  static int
1464 1589  zio_free_bp_init(zio_t *zio)
1465 1590  {
1466 1591          blkptr_t *bp = zio->io_bp;
1467 1592  
1468 1593          if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
1469 1594                  if (BP_GET_DEDUP(bp))
1470 1595                          zio->io_pipeline = ZIO_DDT_FREE_PIPELINE;
1471 1596          }
1472 1597  
1473      -        ASSERT3P(zio->io_bp, ==, &zio->io_bp_copy);
1474      -
1475 1598          return (ZIO_PIPELINE_CONTINUE);
1476 1599  }
1477 1600  
1478 1601  /*
1479 1602   * ==========================================================================
1480 1603   * Execute the I/O pipeline
1481 1604   * ==========================================================================
1482 1605   */
1483 1606  
1484 1607  static void
(14 lines elided)
1499 1622          /*
1500 1623           * A similar issue exists for the L2ARC write thread until L2ARC 2.0.
1501 1624           */
1502 1625          if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
1503 1626                  t = ZIO_TYPE_NULL;
1504 1627  
1505 1628          /*
1506 1629           * If this is a high priority I/O, then use the high priority taskq if
1507 1630           * available.
1508 1631           */
1509      -        if (zio->io_priority == ZIO_PRIORITY_NOW &&
     1632 +        if ((zio->io_priority == ZIO_PRIORITY_NOW ||
     1633 +            zio->io_priority == ZIO_PRIORITY_SYNC_WRITE) &&
1510 1634              spa->spa_zio_taskq[t][q + 1].stqs_count != 0)
1511 1635                  q++;
1512 1636  
1513 1637          ASSERT3U(q, <, ZIO_TASKQ_TYPES);
1514 1638  
1515 1639          /*
1516 1640           * NB: We are assuming that the zio can only be dispatched
1517 1641           * to a single taskq at a time.  It would be a grievous error
1518 1642           * to dispatch the zio to another taskq at the same time.
1519 1643           */
(106 lines elided)
1626 1750  
1627 1751  void
1628 1752  zio_execute(zio_t *zio)
1629 1753  {
1630 1754          zio->io_executor = curthread;
1631 1755  
1632 1756          ASSERT3U(zio->io_queued_timestamp, >, 0);
1633 1757  
1634 1758          while (zio->io_stage < ZIO_STAGE_DONE) {
1635 1759                  enum zio_stage pipeline = zio->io_pipeline;
     1760 +                enum zio_stage old_stage = zio->io_stage;
1636 1761                  enum zio_stage stage = zio->io_stage;
1637 1762                  int rv;
1638 1763  
1639 1764                  ASSERT(!MUTEX_HELD(&zio->io_lock));
1640 1765                  ASSERT(ISP2(stage));
1641 1766                  ASSERT(zio->io_stall == NULL);
1642 1767  
1643 1768                  do {
1644 1769                          stage <<= 1;
1645 1770                  } while ((stage & pipeline) == 0);
(17 lines elided)
1663 1788                          return;
1664 1789                  }
1665 1790  
1666 1791                  zio->io_stage = stage;
1667 1792                  zio->io_pipeline_trace |= zio->io_stage;
1668 1793                  rv = zio_pipeline[highbit64(stage) - 1](zio);
1669 1794  
1670 1795                  if (rv == ZIO_PIPELINE_STOP)
1671 1796                          return;
1672 1797  
     1798 +                if (rv == ZIO_PIPELINE_RESTART_STAGE) {
     1799 +                        zio->io_stage = old_stage;
     1800 +                        (void) zio_issue_async(zio);
     1801 +                        return;
     1802 +                }
     1803 +
1673 1804                  ASSERT(rv == ZIO_PIPELINE_CONTINUE);
1674 1805          }
1675 1806  }
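
A stage may now return ZIO_PIPELINE_RESTART_STAGE to have zio_execute()
rewind io_stage and reissue the same stage from a taskq thread via
zio_issue_async(). A minimal sketch of a stage using this convention
(zio_example_stage and its blocking test are hypothetical, for
illustration only):

        static int
        zio_example_stage(zio_t *zio)
        {
                /*
                 * Hypothetical: if this stage cannot make progress in the
                 * current context, ask zio_execute() to reissue the same
                 * stage asynchronously rather than blocking this thread.
                 */
                if (zio_example_stage_would_block(zio))
                        return (ZIO_PIPELINE_RESTART_STAGE);

                return (ZIO_PIPELINE_CONTINUE);
        }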
1676 1807  
1677 1808  /*
1678 1809   * ==========================================================================
1679 1810   * Initiate I/O, either sync or async
1680 1811   * ==========================================================================
1681 1812   */
1682 1813  int
... 460 lines elided ...
2143 2274          zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);
2144 2275  
2145 2276          return (ZIO_PIPELINE_CONTINUE);
2146 2277  }
2147 2278  
2148 2279  static int
2149 2280  zio_gang_issue(zio_t *zio)
2150 2281  {
2151 2282          blkptr_t *bp = zio->io_bp;
2152 2283  
2153      -        if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT, ZIO_WAIT_DONE)) {
     2284 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
2154 2285                  return (ZIO_PIPELINE_STOP);
2155      -        }
2156 2286  
2157 2287          ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
2158 2288          ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2159 2289  
2160 2290          if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
2161 2291                  zio_gang_tree_issue(zio, zio->io_gang_tree, bp, zio->io_abd,
2162 2292                      0);
2163 2293          else
2164 2294                  zio_gang_tree_free(&zio->io_gang_tree);
2165 2295  
... 35 lines elided ...
2201 2331  static void
2202 2332  zio_write_gang_done(zio_t *zio)
2203 2333  {
2204 2334          abd_put(zio->io_abd);
2205 2335  }
2206 2336  
2207 2337  static int
2208 2338  zio_write_gang_block(zio_t *pio)
2209 2339  {
2210 2340          spa_t *spa = pio->io_spa;
2211      -        metaslab_class_t *mc = spa_normal_class(spa);
     2341 +        metaslab_class_t *mc = pio->io_mc;
2212 2342          blkptr_t *bp = pio->io_bp;
2213 2343          zio_t *gio = pio->io_gang_leader;
2214 2344          zio_t *zio;
2215 2345          zio_gang_node_t *gn, **gnpp;
2216 2346          zio_gbh_phys_t *gbh;
2217 2347          abd_t *gbh_abd;
2218 2348          uint64_t txg = pio->io_txg;
2219 2349          uint64_t resid = pio->io_size;
2220 2350          uint64_t lsize;
2221 2351          int copies = gio->io_prop.zp_copies;
... 76 lines elided ...
2298 2428                  zp.zp_level = 0;
2299 2429                  zp.zp_copies = gio->io_prop.zp_copies;
2300 2430                  zp.zp_dedup = B_FALSE;
2301 2431                  zp.zp_dedup_verify = B_FALSE;
2302 2432                  zp.zp_nopwrite = B_FALSE;
2303 2433  
2304 2434                  zio_t *cio = zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
2305 2435                      abd_get_offset(pio->io_abd, pio->io_size - resid), lsize,
2306 2436                      lsize, &zp, zio_write_gang_member_ready, NULL, NULL,
2307 2437                      zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
2308      -                    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
     2438 +                    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark,
     2439 +                    &pio->io_smartcomp);
2309 2440  
     2441 +                cio->io_mc = mc;
     2442 +
2310 2443                  if (pio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2311 2444                          ASSERT(pio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
2312 2445                          ASSERT(!(pio->io_flags & ZIO_FLAG_NODATA));
2313 2446  
2314 2447                          /*
2315 2448                           * Gang children won't throttle but we should
2316 2449                           * account for their work, so reserve an allocation
2317 2450                           * slot for them here.
2318 2451                           */
2319 2452                          VERIFY(metaslab_class_throttle_reserve(mc,
... 146 lines elided ...
2466 2599              ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark));
2467 2600  
2468 2601          return (ZIO_PIPELINE_CONTINUE);
2469 2602  }
2470 2603  
2471 2604  static int
2472 2605  zio_ddt_read_done(zio_t *zio)
2473 2606  {
2474 2607          blkptr_t *bp = zio->io_bp;
2475 2608  
2476      -        if (zio_wait_for_children(zio, ZIO_CHILD_DDT_BIT, ZIO_WAIT_DONE)) {
     2609 +        if (zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE))
2477 2610                  return (ZIO_PIPELINE_STOP);
2478      -        }
2479 2611  
2480 2612          ASSERT(BP_GET_DEDUP(bp));
2481 2613          ASSERT(BP_GET_PSIZE(bp) == zio->io_size);
2482 2614          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2483 2615  
2484 2616          if (zio->io_child_error[ZIO_CHILD_DDT]) {
2485 2617                  ddt_t *ddt = ddt_select(zio->io_spa, bp);
2486 2618                  ddt_entry_t *dde = zio->io_vsd;
2487 2619                  if (ddt == NULL) {
2488 2620                          ASSERT(spa_load_state(zio->io_spa) != SPA_LOAD_NONE);
... 11 lines elided ...
2500 2632                  }
2501 2633                  ddt_repair_done(ddt, dde);
2502 2634                  zio->io_vsd = NULL;
2503 2635          }
2504 2636  
2505 2637          ASSERT(zio->io_vsd == NULL);
2506 2638  
2507 2639          return (ZIO_PIPELINE_CONTINUE);
2508 2640  }
2509 2641  
     2642 +/* ARGSUSED */
2510 2643  static boolean_t
2511 2644  zio_ddt_collision(zio_t *zio, ddt_t *ddt, ddt_entry_t *dde)
2512 2645  {
2513 2646          spa_t *spa = zio->io_spa;
2514 2647          boolean_t do_raw = (zio->io_flags & ZIO_FLAG_RAW);
2515 2648  
2516 2649          /* We should never get a raw, override zio */
2517 2650          ASSERT(!(zio->io_bp_override && do_raw));
2518 2651  
2519 2652          /*
... 17 lines elided ...
2537 2670  
2538 2671                  if (ddp->ddp_phys_birth != 0) {
2539 2672                          arc_buf_t *abuf = NULL;
2540 2673                          arc_flags_t aflags = ARC_FLAG_WAIT;
2541 2674                          int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE;
2542 2675                          blkptr_t blk = *zio->io_bp;
2543 2676                          int error;
2544 2677  
2545 2678                          ddt_bp_fill(ddp, &blk, ddp->ddp_phys_birth);
2546 2679  
2547      -                        ddt_exit(ddt);
     2680 +                        dde_exit(dde);
2548 2681  
2549 2682                          /*
2550 2683                           * Intuitively, it would make more sense to compare
2551 2684                           * io_abd than io_orig_abd in the raw case since you
2552 2685                           * don't want to look at any transformations that have
2553 2686                           * happened to the data. However, for raw I/Os the
2554 2687                           * data will actually be the same in io_abd and
2555 2688                           * io_orig_abd, so all we have to do is issue this as
2556 2689                           * a raw ARC read.
2557 2690                           */
... 10 lines elided ...
2568 2701                              zio_flags, &aflags, &zio->io_bookmark);
2569 2702  
2570 2703                          if (error == 0) {
2571 2704                                  if (arc_buf_size(abuf) != zio->io_orig_size ||
2572 2705                                      abd_cmp_buf(zio->io_orig_abd, abuf->b_data,
2573 2706                                      zio->io_orig_size) != 0)
2574 2707                                          error = SET_ERROR(EEXIST);
2575 2708                                  arc_buf_destroy(abuf, &abuf);
2576 2709                          }
2577 2710  
2578      -                        ddt_enter(ddt);
     2711 +                        dde_enter(dde);
2579 2712                          return (error != 0);
2580 2713                  }
2581 2714          }
2582 2715  
2583 2716          return (B_FALSE);
2584 2717  }
2585 2718  
2586 2719  static void
2587 2720  zio_ddt_child_write_ready(zio_t *zio)
2588 2721  {
2589 2722          int p = zio->io_prop.zp_copies;
2590      -        ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2591 2723          ddt_entry_t *dde = zio->io_private;
2592 2724          ddt_phys_t *ddp = &dde->dde_phys[p];
2593 2725          zio_t *pio;
2594 2726  
2595 2727          if (zio->io_error)
2596 2728                  return;
2597 2729  
2598      -        ddt_enter(ddt);
     2730 +        dde_enter(dde);
2599 2731  
2600 2732          ASSERT(dde->dde_lead_zio[p] == zio);
2601 2733  
2602 2734          ddt_phys_fill(ddp, zio->io_bp);
2603 2735  
2604 2736          zio_link_t *zl = NULL;
2605 2737          while ((pio = zio_walk_parents(zio, &zl)) != NULL)
2606 2738                  ddt_bp_fill(ddp, pio->io_bp, zio->io_txg);
2607 2739  
2608      -        ddt_exit(ddt);
     2740 +        dde_exit(dde);
2609 2741  }
2610 2742  
2611 2743  static void
2612 2744  zio_ddt_child_write_done(zio_t *zio)
2613 2745  {
2614 2746          int p = zio->io_prop.zp_copies;
2615      -        ddt_t *ddt = ddt_select(zio->io_spa, zio->io_bp);
2616 2747          ddt_entry_t *dde = zio->io_private;
2617 2748          ddt_phys_t *ddp = &dde->dde_phys[p];
2618 2749  
2619      -        ddt_enter(ddt);
     2750 +        dde_enter(dde);
2620 2751  
2621 2752          ASSERT(ddp->ddp_refcnt == 0);
2622 2753          ASSERT(dde->dde_lead_zio[p] == zio);
2623 2754          dde->dde_lead_zio[p] = NULL;
2624 2755  
2625 2756          if (zio->io_error == 0) {
2626 2757                  zio_link_t *zl = NULL;
2627 2758                  while (zio_walk_parents(zio, &zl) != NULL)
2628 2759                          ddt_phys_addref(ddp);
2629 2760          } else {
2630 2761                  ddt_phys_clear(ddp);
2631 2762          }
2632 2763  
2633      -        ddt_exit(ddt);
     2764 +        dde_exit(dde);
2634 2765  }
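
These ready/done callbacks now serialize on the DDT entry itself via
dde_enter()/dde_exit() rather than on the whole table via
ddt_enter()/ddt_exit(), which is why the ddt_select() lookups above were
dropped. A plausible shape for the wrappers, assuming a per-entry lock
named dde_lock (an assumption; the real definitions live elsewhere in
this change):

        static void
        dde_enter(ddt_entry_t *dde)
        {
                mutex_enter(&dde->dde_lock);    /* assumed per-entry lock */
        }

        static void
        dde_exit(ddt_entry_t *dde)
        {
                mutex_exit(&dde->dde_lock);     /* assumed per-entry lock */
        }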
2635 2766  
2636 2767  static void
2637 2768  zio_ddt_ditto_write_done(zio_t *zio)
2638 2769  {
2639 2770          int p = DDT_PHYS_DITTO;
2640 2771          zio_prop_t *zp = &zio->io_prop;
2641 2772          blkptr_t *bp = zio->io_bp;
2642 2773          ddt_t *ddt = ddt_select(zio->io_spa, bp);
2643 2774          ddt_entry_t *dde = zio->io_private;
2644 2775          ddt_phys_t *ddp = &dde->dde_phys[p];
2645 2776          ddt_key_t *ddk = &dde->dde_key;
2646 2777  
2647      -        ddt_enter(ddt);
     2778 +        dde_enter(dde);
2648 2779  
2649 2780          ASSERT(ddp->ddp_refcnt == 0);
2650 2781          ASSERT(dde->dde_lead_zio[p] == zio);
2651 2782          dde->dde_lead_zio[p] = NULL;
2652 2783  
2653 2784          if (zio->io_error == 0) {
2654 2785                  ASSERT(ZIO_CHECKSUM_EQUAL(bp->blk_cksum, ddk->ddk_cksum));
2655 2786                  ASSERT(zp->zp_copies < SPA_DVAS_PER_BP);
2656 2787                  ASSERT(zp->zp_copies == BP_GET_NDVAS(bp) - BP_IS_GANG(bp));
2657 2788                  if (ddp->ddp_phys_birth != 0)
2658 2789                          ddt_phys_free(ddt, ddk, ddp, zio->io_txg);
2659 2790                  ddt_phys_fill(ddp, bp);
2660 2791          }
2661 2792  
2662      -        ddt_exit(ddt);
     2793 +        dde_exit(dde);
2663 2794  }
2664 2795  
2665 2796  static int
2666 2797  zio_ddt_write(zio_t *zio)
2667 2798  {
2668 2799          spa_t *spa = zio->io_spa;
2669 2800          blkptr_t *bp = zio->io_bp;
2670 2801          uint64_t txg = zio->io_txg;
2671 2802          zio_prop_t *zp = &zio->io_prop;
2672 2803          int p = zp->zp_copies;
... 2 lines elided ...
2675 2806          zio_t *dio = NULL;
2676 2807          ddt_t *ddt = ddt_select(spa, bp);
2677 2808          ddt_entry_t *dde;
2678 2809          ddt_phys_t *ddp;
2679 2810  
2680 2811          ASSERT(BP_GET_DEDUP(bp));
2681 2812          ASSERT(BP_GET_CHECKSUM(bp) == zp->zp_checksum);
2682 2813          ASSERT(BP_IS_HOLE(bp) || zio->io_bp_override);
2683 2814          ASSERT(!(zio->io_bp_override && (zio->io_flags & ZIO_FLAG_RAW)));
2684 2815  
2685      -        ddt_enter(ddt);
2686 2816          dde = ddt_lookup(ddt, bp, B_TRUE);
2687      -        ddp = &dde->dde_phys[p];
2688 2817  
     2818 +        /*
     2819 +         * If we're not using the special tier: for each new DDE that is
     2820 +         * not yet on disk, disable dedup once "allowed" DDT L2/ARC space is exhausted.
     2821 +         */
     2822 +        if ((dde->dde_state & DDE_NEW) && !spa->spa_usesc &&
     2823 +            (zfs_ddt_limit_type != DDT_NO_LIMIT || zfs_ddt_byte_ceiling != 0)) {
     2824 +                /* turn off dedup if we need to stop DDT growth */
     2825 +                if (spa_enable_dedup_cap(spa)) {
     2826 +                        dde->dde_state |= DDE_DONT_SYNC;
     2827 +
     2828 +                        /* disable dedup and use the ordinary write pipeline */
     2829 +                        zio_pop_transforms(zio);
     2830 +                        zp->zp_dedup = zp->zp_dedup_verify = B_FALSE;
     2831 +                        zio->io_stage = ZIO_STAGE_OPEN;
     2832 +                        zio->io_pipeline = ZIO_WRITE_PIPELINE;
     2833 +                        zio->io_bp_override = NULL;
     2834 +                        BP_ZERO(bp);
     2835 +                        dde_exit(dde);
     2836 +
     2837 +                        return (ZIO_PIPELINE_CONTINUE);
     2838 +                }
     2839 +        }
     2840 +        ASSERT(!(dde->dde_state & DDE_DONT_SYNC));
     2841 +
2689 2842          if (zp->zp_dedup_verify && zio_ddt_collision(zio, ddt, dde)) {
2690 2843                  /*
2691 2844                   * If we're using a weak checksum, upgrade to a strong checksum
2692 2845                   * and try again.  If we're already using a strong checksum,
2693 2846                   * we can't resolve it, so just convert to an ordinary write.
2694 2847                   * (And automatically e-mail a paper to Nature?)
2695 2848                   */
2696 2849                  if (!(zio_checksum_table[zp->zp_checksum].ci_flags &
2697 2850                      ZCHECKSUM_FLAG_DEDUP)) {
2698 2851                          zp->zp_checksum = spa_dedup_checksum(spa);
2699 2852                          zio_pop_transforms(zio);
2700 2853                          zio->io_stage = ZIO_STAGE_OPEN;
2701 2854                          BP_ZERO(bp);
2702 2855                  } else {
2703 2856                          zp->zp_dedup = B_FALSE;
2704 2857                          BP_SET_DEDUP(bp, B_FALSE);
2705 2858                  }
2706 2859                  ASSERT(!BP_GET_DEDUP(bp));
2707 2860                  zio->io_pipeline = ZIO_WRITE_PIPELINE;
2708      -                ddt_exit(ddt);
     2861 +                dde_exit(dde);
2709 2862                  return (ZIO_PIPELINE_CONTINUE);
2710 2863          }
2711 2864  
     2865 +        ddp = &dde->dde_phys[p];
2712 2866          ditto_copies = ddt_ditto_copies_needed(ddt, dde, ddp);
2713 2867          ASSERT(ditto_copies < SPA_DVAS_PER_BP);
2714 2868  
2715 2869          if (ditto_copies > ddt_ditto_copies_present(dde) &&
2716 2870              dde->dde_lead_zio[DDT_PHYS_DITTO] == NULL) {
2717 2871                  zio_prop_t czp = *zp;
2718 2872  
2719 2873                  czp.zp_copies = ditto_copies;
2720 2874  
2721 2875                  /*
... 2 lines elided ...
2724 2878                   * generate a child i/o.  So, toss the override bp and restart.
2725 2879                   * This is safe, because using the override bp is just an
2726 2880                   * optimization; and it's rare, so the cost doesn't matter.
2727 2881                   */
2728 2882                  if (zio->io_bp_override) {
2729 2883                          zio_pop_transforms(zio);
2730 2884                          zio->io_stage = ZIO_STAGE_OPEN;
2731 2885                          zio->io_pipeline = ZIO_WRITE_PIPELINE;
2732 2886                          zio->io_bp_override = NULL;
2733 2887                          BP_ZERO(bp);
2734      -                        ddt_exit(ddt);
     2888 +                        dde_exit(dde);
2735 2889                          return (ZIO_PIPELINE_CONTINUE);
2736 2890                  }
2737 2891  
2738 2892                  dio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2739 2893                      zio->io_orig_size, zio->io_orig_size, &czp, NULL, NULL,
2740 2894                      NULL, zio_ddt_ditto_write_done, dde, zio->io_priority,
2741      -                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
     2895 +                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2742 2896  
2743 2897                  zio_push_transform(dio, zio->io_abd, zio->io_size, 0, NULL);
2744 2898                  dde->dde_lead_zio[DDT_PHYS_DITTO] = dio;
2745 2899          }
2746 2900  
2747 2901          if (ddp->ddp_phys_birth != 0 || dde->dde_lead_zio[p] != NULL) {
2748 2902                  if (ddp->ddp_phys_birth != 0)
2749 2903                          ddt_bp_fill(ddp, bp, txg);
2750 2904                  if (dde->dde_lead_zio[p] != NULL)
2751 2905                          zio_add_child(zio, dde->dde_lead_zio[p]);
... 2 lines elided ...
2754 2908          } else if (zio->io_bp_override) {
2755 2909                  ASSERT(bp->blk_birth == txg);
2756 2910                  ASSERT(BP_EQUAL(bp, zio->io_bp_override));
2757 2911                  ddt_phys_fill(ddp, bp);
2758 2912                  ddt_phys_addref(ddp);
2759 2913          } else {
2760 2914                  cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
2761 2915                      zio->io_orig_size, zio->io_orig_size, zp,
2762 2916                      zio_ddt_child_write_ready, NULL, NULL,
2763 2917                      zio_ddt_child_write_done, dde, zio->io_priority,
2764      -                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);
     2918 +                    ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark, NULL);
2765 2919  
2766 2920                  zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);
2767 2921                  dde->dde_lead_zio[p] = cio;
2768 2922          }
2769 2923  
2770      -        ddt_exit(ddt);
     2924 +        dde_exit(dde);
2771 2925  
2772 2926          if (cio)
2773 2927                  zio_nowait(cio);
2774 2928          if (dio)
2775 2929                  zio_nowait(dio);
2776 2930  
2777 2931          return (ZIO_PIPELINE_CONTINUE);
2778 2932  }
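
zio_ddt_write() now converts a dedup write back into an ordinary write in
three places: the DDT-growth cap, the weak-checksum upgrade, and the
tossed override bp. The recurring reset sequence could be factored out
roughly as below; zio_ddt_demote_write() is a hypothetical sketch, not
part of this change:

        static void
        zio_ddt_demote_write(zio_t *zio)
        {
                /* Undo transforms and restart from the top of the pipeline. */
                zio_pop_transforms(zio);
                zio->io_stage = ZIO_STAGE_OPEN;
                zio->io_pipeline = ZIO_WRITE_PIPELINE;
                zio->io_bp_override = NULL;
                BP_ZERO(zio->io_bp);
        }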
2779 2933  
2780 2934  ddt_entry_t *freedde; /* for debugging */
... 3 lines elided ...
2784 2938  {
2785 2939          spa_t *spa = zio->io_spa;
2786 2940          blkptr_t *bp = zio->io_bp;
2787 2941          ddt_t *ddt = ddt_select(spa, bp);
2788 2942          ddt_entry_t *dde;
2789 2943          ddt_phys_t *ddp;
2790 2944  
2791 2945          ASSERT(BP_GET_DEDUP(bp));
2792 2946          ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
2793 2947  
2794      -        ddt_enter(ddt);
2795 2948          freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
2796 2949          ddp = ddt_phys_select(dde, bp);
2797      -        ddt_phys_decref(ddp);
2798      -        ddt_exit(ddt);
     2950 +        if (ddp)
     2951 +                ddt_phys_decref(ddp);
     2952 +        dde_exit(dde);
2799 2953  
2800 2954          return (ZIO_PIPELINE_CONTINUE);
2801 2955  }
2802 2956  
2803 2957  /*
2804 2958   * ==========================================================================
2805 2959   * Allocate and free blocks
2806 2960   * ==========================================================================
2807 2961   */
2808 2962  
2809 2963  static zio_t *
2810      -zio_io_to_allocate(spa_t *spa)
     2964 +zio_io_to_allocate(metaslab_class_t *mc)
2811 2965  {
2812 2966          zio_t *zio;
2813 2967  
2814      -        ASSERT(MUTEX_HELD(&spa->spa_alloc_lock));
     2968 +        ASSERT(MUTEX_HELD(&mc->mc_alloc_lock));
2815 2969  
2816      -        zio = avl_first(&spa->spa_alloc_tree);
     2970 +        zio = avl_first(&mc->mc_alloc_tree);
2817 2971          if (zio == NULL)
2818 2972                  return (NULL);
2819 2973  
2820 2974          ASSERT(IO_IS_ALLOCATING(zio));
2821 2975  
2822 2976          /*
2823 2977           * Try to place a reservation for this zio. If we're unable to
2824 2978           * reserve then we throttle.
2825 2979           */
2826      -        if (!metaslab_class_throttle_reserve(spa_normal_class(spa),
     2980 +        if (!metaslab_class_throttle_reserve(mc,
2827 2981              zio->io_prop.zp_copies, zio, 0)) {
2828 2982                  return (NULL);
2829 2983          }
2830 2984  
2831      -        avl_remove(&spa->spa_alloc_tree, zio);
     2985 +        avl_remove(&mc->mc_alloc_tree, zio);
2832 2986          ASSERT3U(zio->io_stage, <, ZIO_STAGE_DVA_ALLOCATE);
2833 2987  
2834 2988          return (zio);
2835 2989  }
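
The allocation-throttle queue has moved from the spa (spa_alloc_lock,
spa_alloc_tree) into the metaslab class, so each class now throttles its
allocations independently. A sketch of the per-class state this code
relies on (field names are taken from the calls above; the actual layout
is defined in metaslab_impl.h):

        struct metaslab_class_sketch {
                kmutex_t        mc_alloc_lock;  /* protects mc_alloc_tree */
                avl_tree_t      mc_alloc_tree;  /* zios queued to allocate */
                boolean_t       mc_alloc_throttle_enabled;
                /* ... remaining metaslab_class_t fields elided ... */
        };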
2836 2990  
2837 2991  static int
2838 2992  zio_dva_throttle(zio_t *zio)
2839 2993  {
2840 2994          spa_t *spa = zio->io_spa;
2841 2995          zio_t *nio;
2842 2996  
     2997 +        /* Use the parent's metaslab class if inherited; otherwise select one */
     2998 +        if (zio->io_mc == NULL) {
     2999 +                zio->io_mc = spa_select_class(spa, zio);
     3000 +                if (zio->io_prop.zp_usewbc)
     3001 +                        return (ZIO_PIPELINE_CONTINUE);
     3002 +        }
     3003 +
2843 3004          if (zio->io_priority == ZIO_PRIORITY_SYNC_WRITE ||
2844      -            !spa_normal_class(zio->io_spa)->mc_alloc_throttle_enabled ||
     3005 +            !zio->io_mc->mc_alloc_throttle_enabled ||
2845 3006              zio->io_child_type == ZIO_CHILD_GANG ||
2846 3007              zio->io_flags & ZIO_FLAG_NODATA) {
2847 3008                  return (ZIO_PIPELINE_CONTINUE);
2848 3009          }
2849 3010  
2850 3011          ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2851 3012  
2852 3013          ASSERT3U(zio->io_queued_timestamp, >, 0);
2853 3014          ASSERT(zio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2854 3015  
2855      -        mutex_enter(&spa->spa_alloc_lock);
     3016 +        mutex_enter(&zio->io_mc->mc_alloc_lock);
2856 3017  
2857 3018          ASSERT(zio->io_type == ZIO_TYPE_WRITE);
2858      -        avl_add(&spa->spa_alloc_tree, zio);
     3019 +        avl_add(&zio->io_mc->mc_alloc_tree, zio);
2859 3020  
2860      -        nio = zio_io_to_allocate(zio->io_spa);
2861      -        mutex_exit(&spa->spa_alloc_lock);
     3021 +        nio = zio_io_to_allocate(zio->io_mc);
     3022 +        mutex_exit(&zio->io_mc->mc_alloc_lock);
2862 3023  
2863 3024          if (nio == zio)
2864 3025                  return (ZIO_PIPELINE_CONTINUE);
2865 3026  
2866 3027          if (nio != NULL) {
2867 3028                  ASSERT(nio->io_stage == ZIO_STAGE_DVA_THROTTLE);
2868 3029                  /*
2869 3030                   * We are passing control to a new zio so make sure that
2870 3031                   * it is processed by a different thread. We do this to
2871 3032                   * avoid stack overflows that can occur when parents are
2872 3033                   * throttled and children are making progress. We allow
2873 3034                   * it to go to the head of the taskq since it's already
2874 3035                   * been waiting.
2875 3036                   */
2876 3037                  zio_taskq_dispatch(nio, ZIO_TASKQ_ISSUE, B_TRUE);
2877 3038          }
2878 3039          return (ZIO_PIPELINE_STOP);
2879 3040  }
2880 3041  
2881 3042  void
2882      -zio_allocate_dispatch(spa_t *spa)
     3043 +zio_allocate_dispatch(metaslab_class_t *mc)
2883 3044  {
2884 3045          zio_t *zio;
2885 3046  
2886      -        mutex_enter(&spa->spa_alloc_lock);
2887      -        zio = zio_io_to_allocate(spa);
2888      -        mutex_exit(&spa->spa_alloc_lock);
     3047 +        mutex_enter(&mc->mc_alloc_lock);
     3048 +        zio = zio_io_to_allocate(mc);
     3049 +        mutex_exit(&mc->mc_alloc_lock);
2889 3050          if (zio == NULL)
2890 3051                  return;
2891 3052  
2892 3053          ASSERT3U(zio->io_stage, ==, ZIO_STAGE_DVA_THROTTLE);
2893 3054          ASSERT0(zio->io_error);
2894 3055          zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_TRUE);
2895 3056  }
2896 3057  
2897 3058  static int
2898 3059  zio_dva_allocate(zio_t *zio)
2899 3060  {
2900 3061          spa_t *spa = zio->io_spa;
2901      -        metaslab_class_t *mc = spa_normal_class(spa);
     3062 +        metaslab_class_t *mc = zio->io_mc;
     3063 +
2902 3064          blkptr_t *bp = zio->io_bp;
2903 3065          int error;
2904 3066          int flags = 0;
2905 3067  
2906 3068          if (zio->io_gang_leader == NULL) {
2907 3069                  ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
2908 3070                  zio->io_gang_leader = zio;
2909 3071          }
2910 3072  
2911 3073          ASSERT(BP_IS_HOLE(bp));
2912 3074          ASSERT0(BP_GET_NDVAS(bp));
2913 3075          ASSERT3U(zio->io_prop.zp_copies, >, 0);
2914 3076          ASSERT3U(zio->io_prop.zp_copies, <=, spa_max_replication(spa));
2915 3077          ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
2916 3078  
2917      -        if (zio->io_flags & ZIO_FLAG_NODATA) {
     3079 +        if (zio->io_flags & ZIO_FLAG_NODATA || zio->io_prop.zp_usewbc) {
2918 3080                  flags |= METASLAB_DONT_THROTTLE;
2919 3081          }
2920 3082          if (zio->io_flags & ZIO_FLAG_GANG_CHILD) {
2921 3083                  flags |= METASLAB_GANG_CHILD;
2922 3084          }
2923      -        if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE) {
     3085 +        if (zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE &&
     3086 +            zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
2924 3087                  flags |= METASLAB_ASYNC_ALLOC;
2925 3088          }
2926 3089  
2927 3090          error = metaslab_alloc(spa, mc, zio->io_size, bp,
2928 3091              zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
2929 3092              &zio->io_alloc_list, zio);
2930 3093  
     3094 +#ifdef _KERNEL
     3095 +        DTRACE_PROBE6(zio_dva_allocate,
     3096 +            uint64_t, DVA_GET_VDEV(&bp->blk_dva[0]),
     3097 +            uint64_t, DVA_GET_VDEV(&bp->blk_dva[1]),
     3098 +            uint64_t, BP_GET_LEVEL(bp),
     3099 +            boolean_t, BP_IS_SPECIAL(bp),
     3100 +            boolean_t, BP_IS_METADATA(bp),
     3101 +            int, error);
     3102 +#endif
     3103 +
2931 3104          if (error != 0) {
2932 3105                  spa_dbgmsg(spa, "%s: metaslab allocation failure: zio %p, "
2933 3106                      "size %llu, error %d", spa_name(spa), zio, zio->io_size,
2934 3107                      error);
2935      -                if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
     3108 +                if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE) {
     3109 +                        if (zio->io_prop.zp_usewbc) {
     3110 +                                zio->io_prop.zp_usewbc = B_FALSE;
     3111 +                                zio->io_prop.zp_usesc = B_FALSE;
     3112 +                                zio->io_mc = spa_normal_class(spa);
     3113 +                        }
     3114 +
2936 3115                          return (zio_write_gang_block(zio));
     3116 +                }
     3117 +
2937 3118                  zio->io_error = error;
2938 3119          }
2939 3120  
2940 3121          return (ZIO_PIPELINE_CONTINUE);
2941 3122  }
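
The DTRACE_PROBE6 added above exposes each DVA allocation as a static
SDT probe, so placement decisions and failures can be observed on a
production kernel without a debug build. For reference, the probe's
argument mapping as seen from DTrace:

        /*
         * zio_dva_allocate probe arguments:
         *
         *   arg0  vdev id of DVA[0]        arg3  BP_IS_SPECIAL(bp)
         *   arg1  vdev id of DVA[1]        arg4  BP_IS_METADATA(bp)
         *   arg2  BP_GET_LEVEL(bp)         arg5  metaslab_alloc() error
         */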
2942 3123  
2943 3124  static int
2944 3125  zio_dva_free(zio_t *zio)
2945 3126  {
2946 3127          metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
... 37 lines elided ...
2984 3165  
2985 3166  /*
2986 3167   * Try to allocate an intent log block.  Return 0 on success, errno on failure.
2987 3168   */
2988 3169  int
2989 3170  zio_alloc_zil(spa_t *spa, uint64_t txg, blkptr_t *new_bp, blkptr_t *old_bp,
2990 3171      uint64_t size, boolean_t *slog)
2991 3172  {
2992 3173          int error = 1;
2993 3174          zio_alloc_list_t io_alloc_list;
     3175 +        spa_meta_placement_t *mp = &spa->spa_meta_policy;
2994 3176  
2995 3177          ASSERT(txg > spa_syncing_txg(spa));
2996 3178  
2997 3179          metaslab_trace_init(&io_alloc_list);
2998      -        error = metaslab_alloc(spa, spa_log_class(spa), size, new_bp, 1,
2999      -            txg, old_bp, METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
3000      -        if (error == 0) {
3001      -                *slog = TRUE;
3002      -        } else {
     3180 +
     3181 +        /*
     3182 +         * ZIL blocks are always contiguous (i.e. not gang blocks),
     3183 +         * so we set the METASLAB_HINTBP_AVOID flag so that they
     3184 +         * don't "fast gang" when allocated.
     3185 +         * If the caller indicates that the slog is not to be used
     3186 +         * (via use_slog), then no separate allocation class will
     3187 +         * be used at all, regardless of whether it is the log or
     3188 +         * the special class.
     3189 +         */
     3190 +
     3191 +        if (spa_has_slogs(spa)) {
     3192 +                error = metaslab_alloc(spa, spa_log_class(spa),
     3193 +                    size, new_bp, 1, txg, old_bp,
     3194 +                    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
     3195 +
     3196 +                DTRACE_PROBE2(zio_alloc_zil_log,
     3197 +                    spa_t *, spa, int, error);
     3198 +
     3199 +                if (error == 0)
     3200 +                        *slog = TRUE;
     3201 +        }
     3202 +
     3203 +        /*
     3204 +         * Use the special class when allocation from the regular
     3205 +         * slog fails, but only if that is allowed and the special
     3206 +         * class's used space is below the watermarks.
     3207 +         */
     3208 +        if (error != 0 && spa_can_special_be_used(spa) &&
     3209 +            mp->spa_sync_to_special != SYNC_TO_SPECIAL_DISABLED) {
     3210 +                error = metaslab_alloc(spa, spa_special_class(spa),
     3211 +                    size, new_bp, 1, txg, old_bp,
     3212 +                    METASLAB_HINTBP_AVOID, &io_alloc_list, NULL);
     3213 +
     3214 +                DTRACE_PROBE2(zio_alloc_zil_special,
     3215 +                    spa_t *, spa, int, error);
     3216 +
     3217 +                if (error == 0)
     3218 +                        *slog = FALSE;
     3219 +        }
     3220 +
     3221 +        if (error != 0) {
3003 3222                  error = metaslab_alloc(spa, spa_normal_class(spa), size,
3004 3223                      new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID,
3005 3224                      &io_alloc_list, NULL);
     3225 +
     3226 +                DTRACE_PROBE2(zio_alloc_zil_normal,
     3227 +                    spa_t *, spa, int, error);
     3228 +
3006 3229                  if (error == 0)
3007 3230                          *slog = FALSE;
3008 3231          }
     3232 +
3009 3233          metaslab_trace_fini(&io_alloc_list);
3010 3234  
3011 3235          if (error == 0) {
3012 3236                  BP_SET_LSIZE(new_bp, size);
3013 3237                  BP_SET_PSIZE(new_bp, size);
3014 3238                  BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
3015 3239                  BP_SET_CHECKSUM(new_bp,
3016 3240                      spa_version(spa) >= SPA_VERSION_SLIM_ZIL
3017 3241                      ? ZIO_CHECKSUM_ZILOG2 : ZIO_CHECKSUM_ZILOG);
3018 3242                  BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
↓ open down ↓ 36 lines elided ↑ open up ↑
3055 3279   * currently active in the pipeline (see vdev_queue_io()), we explicitly
3056 3280   * force the underlying vdev layers to call either zio_execute() or
3057 3281   * zio_interrupt() to ensure that the pipeline continues with the correct I/O.
3058 3282   */
3059 3283  static int
3060 3284  zio_vdev_io_start(zio_t *zio)
3061 3285  {
3062 3286          vdev_t *vd = zio->io_vd;
3063 3287          uint64_t align;
3064 3288          spa_t *spa = zio->io_spa;
     3289 +        zio_type_t type = zio->io_type;
     3290 +        zio->io_vd_timestamp = gethrtime();
3065 3291  
3066 3292          ASSERT(zio->io_error == 0);
3067 3293          ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
3068 3294  
3069 3295          if (vd == NULL) {
3070 3296                  if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3071 3297                          spa_config_enter(spa, SCL_ZIO, zio, RW_READER);
3072 3298  
3073 3299                  /*
3074 3300                   * The mirror_ops handle multiple DVAs in a single BP.
3075 3301                   */
3076 3302                  vdev_mirror_ops.vdev_op_io_start(zio);
3077 3303                  return (ZIO_PIPELINE_STOP);
3078 3304          }
3079 3305  
3080 3306          ASSERT3P(zio->io_logical, !=, zio);
3081      -        if (zio->io_type == ZIO_TYPE_WRITE) {
3082      -                ASSERT(spa->spa_trust_config);
3083 3307  
3084      -                if (zio->io_vd->vdev_removing) {
3085      -                        ASSERT(zio->io_flags &
3086      -                            (ZIO_FLAG_PHYSICAL | ZIO_FLAG_SELF_HEAL |
3087      -                            ZIO_FLAG_INDUCE_DAMAGE));
3088      -                }
3089      -        }
3090      -
3091      -        /*
3092      -         * We keep track of time-sensitive I/Os so that the scan thread
3093      -         * can quickly react to certain workloads.  In particular, we care
3094      -         * about non-scrubbing, top-level reads and writes with the following
3095      -         * characteristics:
3096      -         *      - synchronous writes of user data to non-slog devices
3097      -         *      - any reads of user data
3098      -         * When these conditions are met, adjust the timestamp of spa_last_io
3099      -         * which allows the scan thread to adjust its workload accordingly.
3100      -         */
3101      -        if (!(zio->io_flags & ZIO_FLAG_SCAN_THREAD) && zio->io_bp != NULL &&
3102      -            vd == vd->vdev_top && !vd->vdev_islog &&
3103      -            zio->io_bookmark.zb_objset != DMU_META_OBJSET &&
3104      -            zio->io_txg != spa_syncing_txg(spa)) {
3105      -                uint64_t old = spa->spa_last_io;
3106      -                uint64_t new = ddi_get_lbolt64();
3107      -                if (old != new)
3108      -                        (void) atomic_cas_64(&spa->spa_last_io, old, new);
3109      -        }
3110      -
3111 3308          align = 1ULL << vd->vdev_top->vdev_ashift;
3112 3309  
3113 3310          if (!(zio->io_flags & ZIO_FLAG_PHYSICAL) &&
3114 3311              P2PHASE(zio->io_size, align) != 0) {
3115 3312                  /* Transform logical writes to be a full physical block size. */
3116 3313                  uint64_t asize = P2ROUNDUP(zio->io_size, align);
3117 3314                  abd_t *abuf = abd_alloc_sametype(zio->io_abd, asize);
3118 3315                  ASSERT(vd == vd->vdev_top);
3119      -                if (zio->io_type == ZIO_TYPE_WRITE) {
     3316 +                if (type == ZIO_TYPE_WRITE) {
3120 3317                          abd_copy(abuf, zio->io_abd, zio->io_size);
3121 3318                          abd_zero_off(abuf, zio->io_size, asize - zio->io_size);
3122 3319                  }
3123 3320                  zio_push_transform(zio, abuf, asize, asize, zio_subblock);
3124 3321          }
3125 3322  
3126 3323          /*
3127 3324           * If this is not a physical io, make sure that it is properly aligned
3128 3325           * before proceeding.
3129 3326           */
... 2 lines elided ...
3132 3329                  ASSERT0(P2PHASE(zio->io_size, align));
3133 3330          } else {
3134 3331                  /*
3135 3332                   * For physical writes, we allow 512b aligned writes and assume
3136 3333                   * the device will perform a read-modify-write as necessary.
3137 3334                   */
3138 3335                  ASSERT0(P2PHASE(zio->io_offset, SPA_MINBLOCKSIZE));
3139 3336                  ASSERT0(P2PHASE(zio->io_size, SPA_MINBLOCKSIZE));
3140 3337          }
3141 3338  
3142      -        VERIFY(zio->io_type != ZIO_TYPE_WRITE || spa_writeable(spa));
     3339 +        VERIFY(type != ZIO_TYPE_WRITE || spa_writeable(spa));
3143 3340  
3144 3341          /*
3145 3342           * If this is a repair I/O, and there's no self-healing involved --
3146 3343           * that is, we're just resilvering what we expect to resilver --
3147 3344           * then don't do the I/O unless zio's txg is actually in vd's DTL.
3148 3345           * This prevents spurious resilvering with nested replication.
3149 3346           * For example, given a mirror of mirrors, (A+B)+(C+D), if only
3150 3347           * A is out of date, we'll read from C+D, then use the data to
3151 3348           * resilver A+B -- but we don't actually want to resilver B, just A.
3152 3349           * The top-level mirror has no way to know this, so instead we just
3153 3350           * discard unnecessary repairs as we work our way down the vdev tree.
3154 3351           * The same logic applies to any form of nested replication:
3155 3352           * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.
3156 3353           */
3157 3354          if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
3158 3355              !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
3159 3356              zio->io_txg != 0 && /* not a delegated i/o */
3160 3357              !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
3161      -                ASSERT(zio->io_type == ZIO_TYPE_WRITE);
     3358 +                ASSERT(type == ZIO_TYPE_WRITE);
3162 3359                  zio_vdev_io_bypass(zio);
3163 3360                  return (ZIO_PIPELINE_CONTINUE);
3164 3361          }
3165 3362  
3166 3363          if (vd->vdev_ops->vdev_op_leaf &&
3167      -            (zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
3168      -
3169      -                if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio))
     3364 +            (type == ZIO_TYPE_READ || type == ZIO_TYPE_WRITE)) {
     3365 +                if (type == ZIO_TYPE_READ && vdev_cache_read(zio))
3170 3366                          return (ZIO_PIPELINE_CONTINUE);
3171 3367  
3172 3368                  if ((zio = vdev_queue_io(zio)) == NULL)
3173 3369                          return (ZIO_PIPELINE_STOP);
3174 3370  
3175 3371                  if (!vdev_accessible(vd, zio)) {
3176 3372                          zio->io_error = SET_ERROR(ENXIO);
3177 3373                          zio_interrupt(zio);
3178 3374                          return (ZIO_PIPELINE_STOP);
3179 3375                  }
     3376 +
     3377 +                /*
     3378 +                 * Insert a fault simulation delay for a particular vdev.
     3379 +                 */
     3380 +                if (zio_faulty_vdev_enabled &&
     3381 +                    (zio->io_vd->vdev_guid == zio_faulty_vdev_guid)) {
     3382 +                        delay(NSEC_TO_TICK(zio_faulty_vdev_delay_us *
     3383 +                            (NANOSEC / MICROSEC)));
     3384 +                }
3180 3385          }
3181 3386  
3182 3387          vd->vdev_ops->vdev_op_io_start(zio);
3183 3388          return (ZIO_PIPELINE_STOP);
3184 3389  }
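
The fault-simulation delay above is driven by three tunables whose
definitions are not visible in this hunk. A plausible sketch (the exact
types and defaults here are assumptions):

        boolean_t zio_faulty_vdev_enabled = B_FALSE;  /* master switch */
        uint64_t zio_faulty_vdev_guid = 0;            /* guid of target vdev */
        uint64_t zio_faulty_vdev_delay_us = 0;        /* injected delay (usec) */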
3185 3390  
3186 3391  static int
3187 3392  zio_vdev_io_done(zio_t *zio)
3188 3393  {
3189 3394          vdev_t *vd = zio->io_vd;
3190 3395          vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
3191 3396          boolean_t unexpected_error = B_FALSE;
3192 3397  
3193      -        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
     3398 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3194 3399                  return (ZIO_PIPELINE_STOP);
3195      -        }
3196 3400  
3197 3401          ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
3198 3402  
3199 3403          if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
3200      -
3201 3404                  vdev_queue_io_done(zio);
3202 3405  
3203 3406                  if (zio->io_type == ZIO_TYPE_WRITE)
3204 3407                          vdev_cache_write(zio);
3205 3408  
3206 3409                  if (zio_injection_enabled && zio->io_error == 0)
3207 3410                          zio->io_error = zio_handle_device_injection(vd,
3208 3411                              zio, EIO);
3209 3412  
3210 3413                  if (zio_injection_enabled && zio->io_error == 0)
... 6 lines elided ...
3217 3420                                  unexpected_error = B_TRUE;
3218 3421                          }
3219 3422                  }
3220 3423          }
3221 3424  
3222 3425          ops->vdev_op_io_done(zio);
3223 3426  
3224 3427          if (unexpected_error)
3225 3428                  VERIFY(vdev_probe(vd, zio) == NULL);
3226 3429  
     3430 +        /*
     3431 +         * Measure delta between start and end of the I/O in nanoseconds.
     3432 +         * XXX: Handle overflow.
     3433 +         */
     3434 +        zio->io_vd_timestamp = gethrtime() - zio->io_vd_timestamp;
     3435 +
3227 3436          return (ZIO_PIPELINE_CONTINUE);
3228 3437  }
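
io_vd_timestamp is stamped at the top of zio_vdev_io_start() and reduced
to a service-time delta here; gethrtime() returns a monotonic signed
64-bit nanosecond counter, so the overflow flagged in the XXX would take
centuries. A consumer could convert the delta to microseconds, e.g. (a
hypothetical helper, for illustration):

        static uint64_t
        zio_vd_svc_time_us(const zio_t *zio)
        {
                /* io_vd_timestamp holds the measured delta in nanoseconds */
                return (zio->io_vd_timestamp / (NANOSEC / MICROSEC));
        }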
3229 3438  
3230 3439  /*
3231 3440   * For non-raidz ZIOs, we can just copy aside the bad data read from the
3232 3441   * disk, and use that to finish the checksum ereport later.
3233 3442   */
3234 3443  static void
3235 3444  zio_vsd_default_cksum_finish(zio_cksum_report_t *zcr,
3236 3445      const void *good_buf)
... 14 lines elided ...
3251 3460          zcr->zcr_cbdata = buf;
3252 3461          zcr->zcr_finish = zio_vsd_default_cksum_finish;
3253 3462          zcr->zcr_free = zio_buf_free;
3254 3463  }
3255 3464  
3256 3465  static int
3257 3466  zio_vdev_io_assess(zio_t *zio)
3258 3467  {
3259 3468          vdev_t *vd = zio->io_vd;
3260 3469  
3261      -        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV_BIT, ZIO_WAIT_DONE)) {
     3470 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
3262 3471                  return (ZIO_PIPELINE_STOP);
3263      -        }
3264 3472  
3265 3473          if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
3266 3474                  spa_config_exit(zio->io_spa, SCL_ZIO, zio);
3267 3475  
3268 3476          if (zio->io_vsd != NULL) {
3269 3477                  zio->io_vsd_ops->vsd_free(zio);
3270 3478                  zio->io_vsd = NULL;
3271 3479          }
3272 3480  
3273 3481          if (zio_injection_enabled && zio->io_error == 0)
... 194 lines elided ...
3468 3676   * I/O completion
3469 3677   * ==========================================================================
3470 3678   */
3471 3679  static int
3472 3680  zio_ready(zio_t *zio)
3473 3681  {
3474 3682          blkptr_t *bp = zio->io_bp;
3475 3683          zio_t *pio, *pio_next;
3476 3684          zio_link_t *zl = NULL;
3477 3685  
3478      -        if (zio_wait_for_children(zio, ZIO_CHILD_GANG_BIT | ZIO_CHILD_DDT_BIT,
3479      -            ZIO_WAIT_READY)) {
     3686 +        if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
     3687 +            zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_READY))
3480 3688                  return (ZIO_PIPELINE_STOP);
3481      -        }
3482 3689  
3483 3690          if (zio->io_ready) {
3484 3691                  ASSERT(IO_IS_ALLOCATING(zio));
3485 3692                  ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
3486 3693                      (zio->io_flags & ZIO_FLAG_NOPWRITE));
3487 3694                  ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
3488 3695  
3489 3696                  zio->io_ready(zio);
3490 3697          }
3491 3698  
... 3 lines elided ...
3495 3702          if (zio->io_error != 0) {
3496 3703                  zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
3497 3704  
3498 3705                  if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING) {
3499 3706                          ASSERT(IO_IS_ALLOCATING(zio));
3500 3707                          ASSERT(zio->io_priority == ZIO_PRIORITY_ASYNC_WRITE);
3501 3708                          /*
3502 3709                           * We were unable to allocate anything, unreserve and
3503 3710                           * issue the next I/O to allocate.
3504 3711                           */
3505      -                        metaslab_class_throttle_unreserve(
3506      -                            spa_normal_class(zio->io_spa),
     3712 +                        metaslab_class_throttle_unreserve(zio->io_mc,
3507 3713                              zio->io_prop.zp_copies, zio);
3508      -                        zio_allocate_dispatch(zio->io_spa);
     3714 +                        zio_allocate_dispatch(zio->io_mc);
3509 3715                  }
3510 3716          }
3511 3717  
3512 3718          mutex_enter(&zio->io_lock);
3513 3719          zio->io_state[ZIO_WAIT_READY] = 1;
3514 3720          pio = zio_walk_parents(zio, &zl);
3515 3721          mutex_exit(&zio->io_lock);
3516 3722  
3517 3723          /*
3518 3724           * As we notify zio's parents, new parents could be added.
... 65 lines elided ...
3584 3790          ASSERT(IO_IS_ALLOCATING(pio));
3585 3791          ASSERT3P(zio, !=, zio->io_logical);
3586 3792          ASSERT(zio->io_logical != NULL);
3587 3793          ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REPAIR));
3588 3794          ASSERT0(zio->io_flags & ZIO_FLAG_NOPWRITE);
3589 3795  
3590 3796          mutex_enter(&pio->io_lock);
3591 3797          metaslab_group_alloc_decrement(zio->io_spa, vd->vdev_id, pio, flags);
3592 3798          mutex_exit(&pio->io_lock);
3593 3799  
3594      -        metaslab_class_throttle_unreserve(spa_normal_class(zio->io_spa),
3595      -            1, pio);
     3800 +        metaslab_class_throttle_unreserve(pio->io_mc, 1, pio);
3596 3801  
3597 3802          /*
3598 3803           * Call into the pipeline to see if there is more work that
3599 3804           * needs to be done. If there is work to be done it will be
3600 3805           * dispatched to another taskq thread.
3601 3806           */
3602      -        zio_allocate_dispatch(zio->io_spa);
     3807 +        zio_allocate_dispatch(pio->io_mc);
3603 3808  }
3604 3809  
3605 3810  static int
3606 3811  zio_done(zio_t *zio)
3607 3812  {
3608 3813          spa_t *spa = zio->io_spa;
3609 3814          zio_t *lio = zio->io_logical;
3610 3815          blkptr_t *bp = zio->io_bp;
3611 3816          vdev_t *vd = zio->io_vd;
3612 3817          uint64_t psize = zio->io_size;
3613 3818          zio_t *pio, *pio_next;
3614      -        metaslab_class_t *mc = spa_normal_class(spa);
     3819 +        metaslab_class_t *mc = zio->io_mc;
3615 3820          zio_link_t *zl = NULL;
3616 3821  
3617 3822          /*
3618 3823           * If our children haven't all completed,
3619 3824           * wait for them and then repeat this pipeline stage.
3620 3825           */
3621      -        if (zio_wait_for_children(zio, ZIO_CHILD_ALL_BITS, ZIO_WAIT_DONE)) {
     3826 +        if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
     3827 +            zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
     3828 +            zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) ||
     3829 +            zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
3622 3830                  return (ZIO_PIPELINE_STOP);
3623      -        }
3624 3831  
3625 3832          /*
3626 3833           * If the allocation throttle is enabled, then update the accounting.
3627 3834           * We only track child I/Os that are part of an allocating async
3628 3835           * write. We must do this since the allocation is performed
3629 3836           * by the logical I/O but the actual write is done by child I/Os.
3630 3837           */
3631 3838          if (zio->io_flags & ZIO_FLAG_IO_ALLOCATING &&
3632 3839              zio->io_child_type == ZIO_CHILD_VDEV) {
3633 3840                  ASSERT(mc->mc_alloc_throttle_enabled);
... 269 lines elided ...
3903 4110                  zio->io_executor = NULL;
3904 4111                  cv_broadcast(&zio->io_cv);
3905 4112                  mutex_exit(&zio->io_lock);
3906 4113          } else {
3907 4114                  zio_destroy(zio);
3908 4115          }
3909 4116  
3910 4117          return (ZIO_PIPELINE_STOP);
3911 4118  }
3912 4119  
     4120 +zio_t *
     4121 +zio_wbc(zio_type_t type, vdev_t *vd, abd_t *data,
     4122 +    uint64_t size, uint64_t offset)
     4123 +{
     4124 +        zio_t *zio = NULL;
     4125 +
     4126 +        switch (type) {
     4127 +        case ZIO_TYPE_WRITE:
     4128 +                zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
     4129 +                    size, NULL, NULL, ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
     4130 +                    ZIO_FLAG_PHYSICAL, vd, offset,
     4131 +                    NULL, ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
     4132 +                break;
     4133 +        case ZIO_TYPE_READ:
     4134 +                zio = zio_create(NULL, vd->vdev_spa, 0, NULL, data, size,
     4135 +                    size, NULL, NULL, ZIO_TYPE_READ, ZIO_PRIORITY_ASYNC_READ,
     4136 +                    ZIO_FLAG_DONT_CACHE | ZIO_FLAG_PHYSICAL, vd, offset,
     4137 +                    NULL, ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
     4138 +                break;
     4139 +        default:
     4140 +                ASSERT(0);
     4141 +        }
     4142 +
     4143 +        zio->io_prop.zp_checksum = ZIO_CHECKSUM_OFF;
     4144 +
     4145 +        return (zio);
     4146 +}
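
zio_wbc() builds a physical, checksum-off read or write against a single
vdev on behalf of the write-back cache. A hypothetical synchronous
caller (the wrapper name and error handling are illustrative
assumptions):

        static int
        wbc_rw_block(zio_type_t type, vdev_t *vd, abd_t *abd,
            uint64_t size, uint64_t offset)
        {
                /* zio_wait() executes the zio and returns its error. */
                return (zio_wait(zio_wbc(type, vd, abd, size, offset)));
        }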
     4147 +
3913 4148  /*
3914 4149   * ==========================================================================
3915 4150   * I/O pipeline definition
3916 4151   * ==========================================================================
3917 4152   */
3918 4153  static zio_pipe_stage_t *zio_pipeline[] = {
3919 4154          NULL,
3920 4155          zio_read_bp_init,
3921 4156          zio_write_bp_init,
3922 4157          zio_free_bp_init,
... 146 lines elided ...