NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)
        
*** 19,37 ****
   * CDDL HEADER END
   */
  
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
   * Copyright 2015 Joyent, Inc.
   * Copyright 2017 Nexenta Systems, Inc.
   */
  
- /* Portions Copyright 2007 Jeremy Teo */
- /* Portions Copyright 2010 Robert Milkowski */
- 
  #include <sys/types.h>
  #include <sys/param.h>
  #include <sys/time.h>
  #include <sys/systm.h>
  #include <sys/sysmacros.h>
--- 19,36 ----
   * CDDL HEADER END
   */
  
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+  * Portions Copyright 2007 Jeremy Teo
+  * Portions Copyright 2010 Robert Milkowski
   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
   * Copyright (c) 2014 Integros [integros.com]
   * Copyright 2015 Joyent, Inc.
   * Copyright 2017 Nexenta Systems, Inc.
   */
  
  #include <sys/types.h>
  #include <sys/param.h>
  #include <sys/time.h>
  #include <sys/systm.h>
  #include <sys/sysmacros.h>
*** 81,90 ****
--- 80,90 ----
  #include <sys/zfs_rlock.h>
  #include <sys/extdirent.h>
  #include <sys/kidmap.h>
  #include <sys/cred.h>
  #include <sys/attr.h>
+ #include <sys/dsl_prop.h>
  #include <sys/zil.h>
  
  /*
   * Programming rules.
   *
*** 133,143 ****
   *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
   *      forever, because the previous txg can't quiesce until B's tx commits.
   *
   *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
   *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
!  *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
   *      to indicate that this operation has already called dmu_tx_wait().
   *      This will ensure that we don't retry forever, waiting a short bit
   *      each time.
   *
   *  (5) If the operation succeeded, generate the intent log entry for it
--- 133,143 ----
   *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
   *      forever, because the previous txg can't quiesce until B's tx commits.
   *
   *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
   *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
!  *      calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
   *      to indicate that this operation has already called dmu_tx_wait().
   *      This will ensure that we don't retry forever, waiting a short bit
   *      each time.
   *
   *  (5) If the operation succeeded, generate the intent log entry for it
*** 158,168 ****
   * top:
   *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
   *      rw_enter(...);                  // grab any other locks you need
   *      tx = dmu_tx_create(...);        // get DMU tx
   *      dmu_tx_hold_*();                // hold each object you might modify
!  *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
   *      if (error) {
   *              rw_exit(...);           // drop locks
   *              zfs_dirent_unlock(dl);  // unlock directory entry
   *              VN_RELE(...);           // release held vnodes
   *              if (error == ERESTART) {
--- 158,168 ----
   * top:
   *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
   *      rw_enter(...);                  // grab any other locks you need
   *      tx = dmu_tx_create(...);        // get DMU tx
   *      dmu_tx_hold_*();                // hold each object you might modify
!  *      error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
   *      if (error) {
   *              rw_exit(...);           // drop locks
   *              zfs_dirent_unlock(dl);  // unlock directory entry
   *              VN_RELE(...);           // release held vnodes
   *              if (error == ERESTART) {
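
The pseudocode in the hunk above is the retry pattern every VOP in this file follows. As a minimal user-space sketch of that loop — assuming hypothetical dmu_tx_assign_stub()/dmu_tx_wait_stub() stand-ins for the kernel DMU calls, with illustrative TXG_WAITED/TXG_NOWAIT values — it looks like this:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum txg_how { TXG_NOWAIT, TXG_WAITED };

/* Hypothetical stand-in: throttle the first TXG_NOWAIT attempt. */
static int
dmu_tx_assign_stub(enum txg_how how)
{
        static int calls;

        return (calls++ == 0 && how == TXG_NOWAIT ? ERESTART : 0);
}

/* Hypothetical stand-in: the real dmu_tx_wait() blocks briefly. */
static void
dmu_tx_wait_stub(void)
{
}

int
main(void)
{
        bool waited = false;
        int error;

top:
        /* ... take locks, create the tx, hold objects ... */
        error = dmu_tx_assign_stub(waited ? TXG_WAITED : TXG_NOWAIT);
        if (error != 0) {
                /* ... drop locks and abort the tx ... */
                if (error == ERESTART) {
                        waited = true;  /* pass TXG_WAITED on the retry */
                        dmu_tx_wait_stub();
                        goto top;
                }
                return (error);
        }
        /* ... modify objects, log the change, and commit ... */
        printf("assigned after %s\n", waited ? "one wait" : "no wait");
        return (0);
}

Note how waited is set exactly once, so the operation waits a single time for dirty data to drain and then retries without being throttled again.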
*** 185,194 ****
--- 185,235 ----
   *      zil_commit(zilog, foid);        // synchronous when necessary
   *      ZFS_EXIT(zfsvfs);               // finished in zfs
   *      return (error);                 // done, report error
   */
  
+ /* set this tunable to zero to disable asynchronous freeing of files */
+ boolean_t zfs_do_async_free = B_TRUE;
+ 
+ /*
+  * This value will be multiplied by zfs_dirty_data_max to determine
+  * the threshold past which we will call zfs_inactive_impl() async.
+  *
+  * Selecting the multiplier is a balance between how long we're willing to wait
+  * for delete/free to complete (get shell back, have an NFS thread captive, etc.)
+  * and reducing the number of active requests in the backing taskq.
+  *
+  * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB,
+  * meaning by default we will call zfs_inactive_impl() async for vnodes > 64 GiB.
+  *
+  * WARNING: Setting this tunable to zero will enable asynchronous freeing for
+  * all files, which can have undesirable side effects.
+  */
+ uint16_t zfs_inactive_async_multiplier = 16;
+ 
+ int nms_worm_transition_time = 30;
+ 
+ int
+ zfs_worm_in_trans(znode_t *zp)
+ {
+         zfsvfs_t                *zfsvfs = zp->z_zfsvfs;
+         timestruc_t             now;
+         sa_bulk_attr_t          bulk[2];
+         uint64_t                ctime[2];
+         int                     count = 0;
+ 
+         if (!nms_worm_transition_time)
+                 return (0);
+ 
+         gethrestime(&now);
+         SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
+             &ctime, sizeof (ctime));
+         if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
+                 return (0);
+ 
+         return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
+ }
+ 
  /* ARGSUSED */
  static int
  zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
  {
          znode_t *zp = VTOZ(*vpp);
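
The hunk above introduces zfs_do_async_free and zfs_inactive_async_multiplier. To make the threshold arithmetic in the comment concrete, here is a small stand-alone check using the documented defaults (4 GiB zfs_dirty_data_max, multiplier 16); the variable names mirror the diff, but the program itself is purely illustrative:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t zfs_dirty_data_max = 4ULL << 30;       /* 4 GiB default */
        uint16_t zfs_inactive_async_multiplier = 16;    /* default */
        uint64_t threshold =
            zfs_inactive_async_multiplier * zfs_dirty_data_max;

        /* znodes larger than the threshold are freed asynchronously. */
        printf("async free threshold: %llu GiB\n",
            (unsigned long long)(threshold >> 30));     /* prints 64 */
        return (0);
}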
*** 225,240 ****
  zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
      caller_context_t *ct)
  {
          znode_t *zp = VTOZ(vp);
          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
  
          /*
           * Clean up any locks held by this process on the vp.
           */
!         cleanlocks(vp, ddi_get_pid(), 0);
!         cleanshares(vp, ddi_get_pid());
  
          ZFS_ENTER(zfsvfs);
          ZFS_VERIFY_ZP(zp);
  
          /* Decrement the synchronous opens in the znode */
--- 266,282 ----
  zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
      caller_context_t *ct)
  {
          znode_t *zp = VTOZ(vp);
          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+         pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
  
          /*
           * Clean up any locks held by this process on the vp.
           */
!         cleanlocks(vp, caller_pid, 0);
!         cleanshares(vp, caller_pid);
  
          ZFS_ENTER(zfsvfs);
          ZFS_VERIFY_ZP(zp);
  
          /* Decrement the synchronous opens in the znode */
*** 484,493 ****
--- 526,693 ----
                          break;
          }
          return (error);
  }
  
+ 
+ /*
+  * ZFS I/O rate throttling
+  */
+ 
+ #define DELAY_SHIFT 24
+ 
+ typedef struct zfs_rate_delay {
+         uint_t rl_rate;
+         hrtime_t rl_delay;
+ } zfs_rate_delay_t;
+ 
+ /*
+  * The time we'll attempt to cv_wait (below), in nSec.
+  * This should be no less than the minimum time it normally takes
+  * to block a thread and wake back up after the timeout fires.
+  *
+  * Each table entry represents the delay for each 4MB of bandwidth.
+  * We reduce the delay as the size of the I/O increases.
+  */
+ zfs_rate_delay_t zfs_rate_delay_table[] = {
+         {0, 100000},
+         {1, 100000},
+         {2, 100000},
+         {3, 100000},
+         {4, 100000},
+         {5, 50000},
+         {6, 50000},
+         {7, 50000},
+         {8, 50000},
+         {9, 25000},
+         {10, 25000},
+         {11, 25000},
+         {12, 25000},
+         {13, 12500},
+         {14, 12500},
+         {15, 12500},
+         {16, 12500},
+         {17, 6250},
+         {18, 6250},
+         {19, 6250},
+         {20, 6250},
+         {21, 3125},
+         {22, 3125},
+         {23, 3125},
+         {24, 3125},
+ };
+ 
+ #define MAX_RATE_TBL_ENTRY 24
+ 
+ /*
+  * The delay we use should be reduced based on the size of the iorate;
+  * for higher iorates we want a shorter delay.
+  */
+ static inline hrtime_t
+ zfs_get_delay(ssize_t iorate)
+ {
+         uint_t rate = iorate >> DELAY_SHIFT;
+ 
+         if (rate > MAX_RATE_TBL_ENTRY)
+                 rate = MAX_RATE_TBL_ENTRY;
+         return (zfs_rate_delay_table[rate].rl_delay);
+ }
+ 
+ /*
+  * ZFS I/O rate throttling
+  * See "Token Bucket" on Wikipedia
+  *
+  * This is "Token Bucket" with some modifications to avoid wait times
+  * longer than a couple seconds, so that we don't trigger NFS retries
+  * or similar.  This does mean that concurrent requests might take us
+  * over the rate limit, but that's a lesser evil.
+  */
+ static void
+ zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
+ {
+         zfs_rate_state_t *rate = &zfsvfs->z_rate;
+         hrtime_t now, delta; /* nanoseconds */
+         int64_t refill;
+ 
+         VERIFY(rate->rate_cap > 0);
+         mutex_enter(&rate->rate_lock);
+ 
+         /*
+          * If another thread is already waiting, we must queue up behind them.
+          * We'll wait up to 1 sec here.  We normally will resume by cv_signal,
+          * so we don't need fine timer resolution on this wait.
+          */
+         if (rate->rate_token_bucket < 0) {
+                 rate->rate_waiters++;
+                 (void) cv_timedwait_hires(
+                     &rate->rate_wait_cv, &rate->rate_lock,
+                     NANOSEC, TR_CLOCK_TICK, 0);
+                 rate->rate_waiters--;
+         }
+ 
+         /*
+          * How long since we last updated the bucket?
+          */
+         now = gethrtime();
+         delta = now - rate->rate_last_update;
+         rate->rate_last_update = now;
+         if (delta < 0)
+                 delta = 0; /* paranoid */
+ 
+         /*
+          * Add "tokens" for time since last update,
+          * being careful about possible overflow.
+          */
+         refill = (delta * rate->rate_cap) / NANOSEC;
+         if (refill < 0 || refill > rate->rate_cap)
+                 refill = rate->rate_cap; /* overflow */
+         rate->rate_token_bucket += refill;
+         if (rate->rate_token_bucket > rate->rate_cap)
+                 rate->rate_token_bucket = rate->rate_cap;
+ 
+         /*
+          * Withdraw tokens for the current I/O.  If this makes us overdrawn,
+          * wait an amount of time proportionate to the overdraft.  However,
+          * as a sanity measure, never wait more than 1 sec, and never try to
+          * wait less than the time it normally takes to block and reschedule.
+          *
+          * Leave the bucket negative while we wait so other threads know to
+          * queue up. In here, "refill" is the debt we're waiting to pay off.
+          */
+         rate->rate_token_bucket -= iosize;
+         if (rate->rate_token_bucket < 0) {
+                 hrtime_t zfs_rate_wait = 0;
+ 
+                 refill = rate->rate_token_bucket;
+                 DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
+                     int64_t, refill);
+ 
+                 if (rate->rate_cap <= 0)
+                         goto nocap;
+ 
+                 delta = (refill * NANOSEC) / rate->rate_cap;
+                 delta = MIN(delta, NANOSEC);
+ 
+                 zfs_rate_wait = zfs_get_delay(rate->rate_cap);
+ 
+                 if (delta > zfs_rate_wait) {
+                         (void) cv_timedwait_hires(
+                             &rate->rate_wait_cv, &rate->rate_lock,
+                             delta, TR_CLOCK_TICK, 0);
+                 }
+ 
+                 rate->rate_token_bucket += refill;
+         }
+ nocap:
+         if (rate->rate_waiters > 0) {
+                 cv_signal(&rate->rate_wait_cv);
+         }
+ 
+         mutex_exit(&rate->rate_lock);
+ }
+ 
+ 
  offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
  
  /*
   * Read bytes from specified file into supplied buffer.
   *
*** 550,559 ****
--- 750,765 ----
                          return (error);
                  }
          }
  
          /*
+          * ZFS I/O rate throttling
+          */
+         if (zfsvfs->z_rate.rate_cap)
+                 zfs_rate_throttle(zfsvfs, uio->uio_resid);
+ 
+         /*
           * If we're in FRSYNC mode, sync out this znode before reading it.
           */
          if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
                  zil_commit(zfsvfs->z_log, zp->z_id);
  
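For readers who want to experiment with the algorithm outside the kernel, below is a single-threaded user-space sketch of the same token-bucket logic. The struct and function names are invented for the sketch, and the mutex, cv_timedwait_hires() queueing, per-rate delay table, and DTrace probes of zfs_rate_throttle() are deliberately omitted:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NANOSEC 1000000000LL

struct rate_state {
        int64_t cap;            /* bytes per second */
        int64_t bucket;         /* available tokens, in bytes */
        int64_t last_update;    /* nanoseconds */
};

static int64_t
now_ns(void)
{
        struct timespec ts;

        (void) clock_gettime(CLOCK_MONOTONIC, &ts);
        return ((int64_t)ts.tv_sec * NANOSEC + ts.tv_nsec);
}

/*
 * Returns how long (in ns) the caller should sleep to stay under the
 * cap; the debt stays in the bucket and is repaid by later refills.
 */
static int64_t
rate_throttle(struct rate_state *rs, int64_t iosize)
{
        int64_t now = now_ns();
        int64_t delta = now - rs->last_update;
        int64_t refill;

        rs->last_update = now;
        if (delta < 0)
                delta = 0;              /* paranoid, as in the original */

        /* Add tokens for elapsed time, clamping at one second's worth. */
        refill = (delta * rs->cap) / NANOSEC;
        if (refill < 0 || refill > rs->cap)
                refill = rs->cap;       /* overflow */
        rs->bucket += refill;
        if (rs->bucket > rs->cap)
                rs->bucket = rs->cap;

        /* Withdraw tokens; if overdrawn, wait proportionally, 1 sec max. */
        rs->bucket -= iosize;
        if (rs->bucket < 0) {
                int64_t wait = (-rs->bucket * NANOSEC) / rs->cap;

                return (wait < NANOSEC ? wait : NANOSEC);
        }
        return (0);
}

int
main(void)
{
        struct rate_state rs = { .cap = 100 << 20 };    /* 100 MiB/s cap */

        rs.last_update = now_ns();
        printf("wait %lld ns\n",
            (long long)rate_throttle(&rs, 10 << 20));   /* a 10 MiB read */
        return (0);
}

Capping the wait at one second reflects the design note above: a long sleep would trip NFS retries, so briefly exceeding the configured rate under concurrency is accepted as the lesser evil.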
*** 713,725 ****
--- 919,935 ----
           * See zfs_zaccess_common()
           */
          if ((zp->z_pflags & ZFS_IMMUTABLE) ||
              ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
              (uio->uio_loffset < zp->z_size))) {
+                 /* Make sure we're not a WORM before returning EPERM. */
+                 if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
+                     !zp->z_zfsvfs->z_isworm) {
                          ZFS_EXIT(zfsvfs);
                          return (SET_ERROR(EPERM));
                  }
+         }
  
          zilog = zfsvfs->z_log;
  
          /*
           * Validate file offset
*** 739,748 ****
--- 949,964 ----
                  ZFS_EXIT(zfsvfs);
                  return (error);
          }
  
          /*
+          * ZFS I/O rate throttling
+          */
+         if (zfsvfs->z_rate.rate_cap)
+                 zfs_rate_throttle(zfsvfs, uio->uio_resid);
+ 
+         /*
           * Pre-fault the pages to ensure slow (eg NFS) pages
           * don't hold up txg.
           * Skip this if uio contains loaned arc_buf.
           */
          if ((uio->uio_extflg == UIO_XUIO) &&
*** 1013,1022 ****
--- 1229,1239 ----
  
          ZFS_EXIT(zfsvfs);
          return (0);
  }
  
+ /* ARGSUSED */
  void
  zfs_get_done(zgd_t *zgd, int error)
  {
          znode_t *zp = zgd->zgd_private;
          objset_t *os = zp->z_zfsvfs->z_os;
*** 1030,1042 ****
           * Release the vnode asynchronously as we currently have the
           * txg stopped from syncing.
           */
          VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
  
-         if (error == 0 && zgd->zgd_bp)
-                 zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
- 
          kmem_free(zgd, sizeof (zgd_t));
  }
  
  #ifdef DEBUG
  static int zil_fault_io = 0;
--- 1247,1256 ----
*** 1156,1170 ****
                                  lr->lr_common.lrc_txtype = TX_WRITE2;
                                  /*
                                   * TX_WRITE2 relies on the data previously
                                   * written by the TX_WRITE that caused
                                   * EALREADY.  We zero out the BP because
!                                  * it is the old, currently-on-disk BP,
!                                  * so there's no need to zio_flush() its
!                                  * vdevs (flushing would needlesly hurt
!                                  * performance, and doesn't work on
!                                  * indirect vdevs).
                                   */
                                  zgd->zgd_bp = NULL;
                                  BP_ZERO(bp);
                                  error = 0;
                          }
--- 1370,1380 ----
                                  lr->lr_common.lrc_txtype = TX_WRITE2;
                                  /*
                                   * TX_WRITE2 relies on the data previously
                                   * written by the TX_WRITE that caused
                                   * EALREADY.  We zero out the BP because
!                                  * it is the old, currently-on-disk BP.
                                   */
                                  zgd->zgd_bp = NULL;
                                  BP_ZERO(bp);
                                  error = 0;
                          }
*** 1243,1253 ****
  static int
  zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
      int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
      int *direntflags, pathname_t *realpnp)
  {
!         znode_t *zdp = VTOZ(dvp);
          zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
          int     error = 0;
  
          /*
           * Fast path lookup, however we must skip DNLC lookup
--- 1453,1463 ----
  static int
  zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
      int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
      int *direntflags, pathname_t *realpnp)
  {
!         znode_t *zp, *zdp = VTOZ(dvp);
          zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
          int     error = 0;
  
          /*
           * Fast path lookup, however we must skip DNLC lookup
*** 1361,1370 ****
--- 1571,1588 ----
          }
  
          error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
          if (error == 0)
                  error = specvp_check(vpp, cr);
+         if (*vpp) {
+                 zp = VTOZ(*vpp);
+                 if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
+                     ((*vpp)->v_type != VDIR) &&
+                     zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
+                         zp->z_pflags |= ZFS_IMMUTABLE;
+                 }
+         }
  
          ZFS_EXIT(zfsvfs);
          return (error);
  }
  
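zfs_lookup() above only latches ZFS_IMMUTABLE once a file on a WORM dataset has aged out of its transition window. A minimal user-space sketch of the zfs_worm_in_trans() test, using time(2) in place of gethrestime() and the SA ctime lookup:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int nms_worm_transition_time = 30;       /* seconds, as in the diff */

/* Returns nonzero while ctime_sec is within the transition window. */
static int
worm_in_trans(time_t ctime_sec)
{
        if (nms_worm_transition_time == 0)
                return (0);
        return ((uint64_t)time(NULL) - (uint64_t)ctime_sec <
            (uint64_t)nms_worm_transition_time);
}

int
main(void)
{
        time_t now = time(NULL);

        printf("fresh file: %d\n", worm_in_trans(now));         /* 1 */
        printf("old file:   %d\n", worm_in_trans(now - 60));    /* 0 */
        return (0);
}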
*** 1396,1405 ****
--- 1614,1624 ----
  static int
  zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
      int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
      vsecattr_t *vsecp)
  {
+         int             imm_was_set = 0;
          znode_t         *zp, *dzp = VTOZ(dvp);
          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
          zilog_t         *zilog;
          objset_t        *os;
          zfs_dirlock_t   *dl;
*** 1481,1500 ****
--- 1700,1730 ----
          }
  
          if (zp == NULL) {
                  uint64_t txtype;
  
+                 if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+                     dzp->z_zfsvfs->z_isworm) {
+                         imm_was_set = 1;
+                         dzp->z_pflags &= ~ZFS_IMMUTABLE;
+                 }
+ 
                  /*
                   * Create a new file object and update the directory
                   * to reference it.
                   */
                  if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
                          if (have_acl)
                                  zfs_acl_ids_free(&acl_ids);
+                         if (imm_was_set)
+                                 dzp->z_pflags |= ZFS_IMMUTABLE;
                          goto out;
                  }
  
+                 if (imm_was_set)
+                         dzp->z_pflags |= ZFS_IMMUTABLE;
+ 
                  /*
                   * We only support the creation of regular files in
                   * extended attribute directories.
                   */
  
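The imm_was_set dance in the hunk above repeats in zfs_mkdir() and zfs_symlink() below. A distilled sketch of the pattern, with a hypothetical access-check stub and an illustrative flag value standing in for the real ZFS_IMMUTABLE bit:

#include <stdint.h>
#include <stdio.h>

#define ZFS_IMMUTABLE   0x10ULL         /* illustrative value only */

/* Hypothetical stand-in for zfs_zaccess(): deny immutable objects. */
static int
zaccess_stub(uint64_t pflags)
{
        return ((pflags & ZFS_IMMUTABLE) ? 1 : 0);
}

/*
 * Temporarily lift ZFS_IMMUTABLE on a WORM parent so the access check
 * can pass, and restore it on every exit path, success or failure.
 */
static int
worm_create_check(uint64_t *pflags, int isworm)
{
        int imm_was_set = 0;
        int error;

        if ((*pflags & ZFS_IMMUTABLE) && isworm) {
                imm_was_set = 1;
                *pflags &= ~ZFS_IMMUTABLE;
        }
        error = zaccess_stub(*pflags);
        if (imm_was_set)
                *pflags |= ZFS_IMMUTABLE;
        return (error);
}

int
main(void)
{
        uint64_t pflags = ZFS_IMMUTABLE;

        printf("worm dir:    %d\n", worm_create_check(&pflags, 1)); /* 0 */
        printf("flag intact: %d\n", (pflags & ZFS_IMMUTABLE) != 0); /* 1 */
        return (0);
}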
*** 1530,1541 ****
                  if (!zfsvfs->z_use_sa &&
                      acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
                          dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
                              0, acl_ids.z_aclp->z_acl_bytes);
                  }
!                 error = dmu_tx_assign(tx,
!                     (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
                  if (error) {
                          zfs_dirent_unlock(dl);
                          if (error == ERESTART) {
                                  waited = B_TRUE;
                                  dmu_tx_wait(tx);
--- 1760,1770 ----
                  if (!zfsvfs->z_use_sa &&
                      acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
                          dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
                              0, acl_ids.z_aclp->z_acl_bytes);
                  }
!                 error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
                  if (error) {
                          zfs_dirent_unlock(dl);
                          if (error == ERESTART) {
                                  waited = B_TRUE;
                                  dmu_tx_wait(tx);
*** 1550,1559 ****
--- 1779,1791 ----
                  zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
  
                  if (fuid_dirtied)
                          zfs_fuid_sync(zfsvfs, tx);
  
+                 if (imm_was_set)
+                         zp->z_pflags |= ZFS_IMMUTABLE;
+ 
                  (void) zfs_link_create(dl, zp, tx, ZNEW);
                  txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
                  if (flag & FIGNORECASE)
                          txtype |= TX_CI;
                  zfs_log_create(zilog, tx, txtype, dzp, zp, name,
*** 1582,1598 ****
--- 1814,1847 ----
                   */
                  if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
                          error = SET_ERROR(EISDIR);
                          goto out;
                  }
+                 if ((flag & FWRITE) &&
+                     dzp->z_zfsvfs->z_isworm) {
+                         error = EPERM;
+                         goto out;
+                 }
+ 
+                 if (!(flag & FAPPEND) &&
+                     (zp->z_pflags & ZFS_IMMUTABLE) &&
+                     dzp->z_zfsvfs->z_isworm) {
+                         imm_was_set = 1;
+                         zp->z_pflags &= ~ZFS_IMMUTABLE;
+                 }
                  /*
                   * Verify requested access to file.
                   */
                  if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
+                         if (imm_was_set)
+                                 zp->z_pflags |= ZFS_IMMUTABLE;
                          goto out;
                  }
  
+                 if (imm_was_set)
+                         zp->z_pflags |= ZFS_IMMUTABLE;
+ 
                  mutex_enter(&dzp->z_lock);
                  dzp->z_seq++;
                  mutex_exit(&dzp->z_lock);
  
                  /*
*** 1695,1704 ****
--- 1944,1958 ----
                  return (error);
          }
  
          vp = ZTOV(zp);
  
+         if (zp->z_zfsvfs->z_isworm) {
+                 error = SET_ERROR(EPERM);
+                 goto out;
+         }
+ 
          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
                  goto out;
          }
  
          /*
*** 1761,1771 ****
          /*
           * Mark this transaction as typically resulting in a net free of space
           */
          dmu_tx_mark_netfree(tx);
  
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  VN_RELE(vp);
                  if (xzp)
                          VN_RELE(ZTOV(xzp));
--- 2015,2025 ----
          /*
           * Mark this transaction as typically resulting in a net free of space
           */
          dmu_tx_mark_netfree(tx);
  
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  VN_RELE(vp);
                  if (xzp)
                          VN_RELE(ZTOV(xzp));
*** 1888,1897 ****
--- 2142,2152 ----
  /*ARGSUSED*/
  static int
  zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
      caller_context_t *ct, int flags, vsecattr_t *vsecp)
  {
+         int             imm_was_set = 0;
          znode_t         *zp, *dzp = VTOZ(dvp);
          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
          zilog_t         *zilog;
          zfs_dirlock_t   *dl;
          uint64_t        txtype;
*** 1967,1983 ****
--- 2222,2249 ----
                  zfs_acl_ids_free(&acl_ids);
                  ZFS_EXIT(zfsvfs);
                  return (error);
          }
  
+         if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+             dzp->z_zfsvfs->z_isworm) {
+                 imm_was_set = 1;
+                 dzp->z_pflags &= ~ZFS_IMMUTABLE;
+         }
+ 
          if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
+                 if (imm_was_set)
+                         dzp->z_pflags |= ZFS_IMMUTABLE;
                  zfs_acl_ids_free(&acl_ids);
                  zfs_dirent_unlock(dl);
                  ZFS_EXIT(zfsvfs);
                  return (error);
          }
  
+         if (imm_was_set)
+                 dzp->z_pflags |= ZFS_IMMUTABLE;
+ 
          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
                  zfs_acl_ids_free(&acl_ids);
                  zfs_dirent_unlock(dl);
                  ZFS_EXIT(zfsvfs);
                  return (SET_ERROR(EDQUOT));
*** 1998,2008 ****
          }
  
          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
              ZFS_SA_BASE_ATTR_SIZE);
  
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
--- 2264,2274 ----
          }
  
          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
              ZFS_SA_BASE_ATTR_SIZE);
  
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
*** 2100,2109 ****
--- 2366,2380 ----
                  return (error);
          }
  
          vp = ZTOV(zp);
  
+         if (dzp->z_zfsvfs->z_isworm) {
+                 error = SET_ERROR(EPERM);
+                 goto out;
+         }
+ 
          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
                  goto out;
          }
  
          if (vp->v_type != VDIR) {
*** 2135,2145 ****
          dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
          zfs_sa_upgrade_txholds(tx, zp);
          zfs_sa_upgrade_txholds(tx, dzp);
          dmu_tx_mark_netfree(tx);
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  rw_exit(&zp->z_parent_lock);
                  rw_exit(&zp->z_name_lock);
                  zfs_dirent_unlock(dl);
                  VN_RELE(vp);
--- 2406,2416 ----
          dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
          zfs_sa_upgrade_txholds(tx, zp);
          zfs_sa_upgrade_txholds(tx, dzp);
          dmu_tx_mark_netfree(tx);
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  rw_exit(&zp->z_parent_lock);
                  rw_exit(&zp->z_name_lock);
                  zfs_dirent_unlock(dl);
                  VN_RELE(vp);
*** 2792,2809 ****
          xoap = xva_getxoptattr(xvap);
  
          xva_init(&tmpxvattr);
  
          /*
!          * Immutable files can only alter immutable bit and atime
           */
          if ((zp->z_pflags & ZFS_IMMUTABLE) &&
              ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
              ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
                  ZFS_EXIT(zfsvfs);
                  return (SET_ERROR(EPERM));
          }
  
          /*
           * Note: ZFS_READONLY is handled in zfs_zaccess_common.
           */
  
--- 3063,3092 ----
          xoap = xva_getxoptattr(xvap);
  
          xva_init(&tmpxvattr);
  
          /*
!          * Do not allow the immutable bit to be altered after it is set
           */
          if ((zp->z_pflags & ZFS_IMMUTABLE) &&
+             XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
+             zp->z_zfsvfs->z_isworm) {
+                 ZFS_EXIT(zfsvfs);
+                 return (SET_ERROR(EPERM));
+         }
+ 
+         /*
+          * Immutable files can only alter atime
+          */
+         if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
              ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
              ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
+                 if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
                          ZFS_EXIT(zfsvfs);
                          return (SET_ERROR(EPERM));
                  }
+         }
  
          /*
           * Note: ZFS_READONLY is handled in zfs_zaccess_common.
           */
  
*** 3708,3718 ****
                  zfs_sa_upgrade_txholds(tx, tzp);
          }
  
          zfs_sa_upgrade_txholds(tx, szp);
          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  if (zl != NULL)
                          zfs_rename_unlock(&zl);
                  zfs_dirent_unlock(sdl);
                  zfs_dirent_unlock(tdl);
--- 3991,4001 ----
                  zfs_sa_upgrade_txholds(tx, tzp);
          }
  
          zfs_sa_upgrade_txholds(tx, szp);
          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  if (zl != NULL)
                          zfs_rename_unlock(&zl);
                  zfs_dirent_unlock(sdl);
                  zfs_dirent_unlock(tdl);
*** 3832,3841 ****
--- 4115,4125 ----
          znode_t         *zp, *dzp = VTOZ(dvp);
          zfs_dirlock_t   *dl;
          dmu_tx_t        *tx;
          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
          zilog_t         *zilog;
+         int             imm_was_set = 0;
          uint64_t        len = strlen(link);
          int             error;
          int             zflg = ZNEW;
          zfs_acl_ids_t   acl_ids;
          boolean_t       fuid_dirtied;
*** 3875,3890 ****
--- 4159,4182 ----
                  zfs_acl_ids_free(&acl_ids);
                  ZFS_EXIT(zfsvfs);
                  return (error);
          }
  
+         if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
+                 imm_was_set = 1;
+                 dzp->z_pflags &= ~ZFS_IMMUTABLE;
+         }
          if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
+                 if (imm_was_set)
+                         dzp->z_pflags |= ZFS_IMMUTABLE;
                  zfs_acl_ids_free(&acl_ids);
                  zfs_dirent_unlock(dl);
                  ZFS_EXIT(zfsvfs);
                  return (error);
          }
+         if (imm_was_set)
+                 dzp->z_pflags |= ZFS_IMMUTABLE;
  
          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
                  zfs_acl_ids_free(&acl_ids);
                  zfs_dirent_unlock(dl);
                  ZFS_EXIT(zfsvfs);
*** 3901,3911 ****
                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
                      acl_ids.z_aclp->z_acl_bytes);
          }
          if (fuid_dirtied)
                  zfs_fuid_txhold(zfsvfs, tx);
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
--- 4193,4203 ----
                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
                      acl_ids.z_aclp->z_acl_bytes);
          }
          if (fuid_dirtied)
                  zfs_fuid_txhold(zfsvfs, tx);
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
*** 4122,4132 ****
          tx = dmu_tx_create(zfsvfs->z_os);
          dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
          zfs_sa_upgrade_txholds(tx, szp);
          zfs_sa_upgrade_txholds(tx, dzp);
!         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
--- 4414,4424 ----
          tx = dmu_tx_create(zfsvfs->z_os);
          dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
          zfs_sa_upgrade_txholds(tx, szp);
          zfs_sa_upgrade_txholds(tx, dzp);
!         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
          if (error) {
                  zfs_dirent_unlock(dl);
                  if (error == ERESTART) {
                          waited = B_TRUE;
                          dmu_tx_wait(tx);
*** 4398,4444 ****
                  zil_commit(zfsvfs->z_log, zp->z_id);
          ZFS_EXIT(zfsvfs);
          return (error);
  }
  
! /*ARGSUSED*/
! void
! zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
  {
-         znode_t *zp = VTOZ(vp);
          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
!         int error;
  
!         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
          if (zp->z_sa_hdl == NULL) {
                  /*
                   * The fs has been unmounted, or we did a
                   * suspend/resume and this file no longer exists.
                   */
                  if (vn_has_cached_data(vp)) {
                          (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
!                             B_INVAL, cr);
                  }
  
                  mutex_enter(&zp->z_lock);
                  mutex_enter(&vp->v_lock);
                  ASSERT(vp->v_count == 1);
                  VN_RELE_LOCKED(vp);
                  mutex_exit(&vp->v_lock);
                  mutex_exit(&zp->z_lock);
                  rw_exit(&zfsvfs->z_teardown_inactive_lock);
                  zfs_znode_free(zp);
!                 return;
          }
  
          /*
           * Attempt to push any data in the page cache.  If this fails
           * we will get kicked out later in zfs_zinactive().
           */
          if (vn_has_cached_data(vp)) {
                  (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
!                     cr);
          }
  
          if (zp->z_atime_dirty && zp->z_unlinked == 0) {
                  dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
  
--- 4690,4760 ----
                  zil_commit(zfsvfs->z_log, zp->z_id);
          ZFS_EXIT(zfsvfs);
          return (error);
  }
  
! /*
!  * Returns B_TRUE and exits the z_teardown_inactive_lock
!  * if the znode we are looking at is no longer valid
!  */
! static boolean_t
! zfs_znode_free_invalid(znode_t *zp)
  {
          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
!         vnode_t *vp = ZTOV(zp);
  
!         ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
! 
          if (zp->z_sa_hdl == NULL) {
                  /*
                   * The fs has been unmounted, or we did a
                   * suspend/resume and this file no longer exists.
                   */
                  if (vn_has_cached_data(vp)) {
                          (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
!                             B_INVAL, CRED());
                  }
  
                  mutex_enter(&zp->z_lock);
                  mutex_enter(&vp->v_lock);
                  ASSERT(vp->v_count == 1);
                  VN_RELE_LOCKED(vp);
                  mutex_exit(&vp->v_lock);
                  mutex_exit(&zp->z_lock);
+                 VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
+                     UINT32_MAX);
                  rw_exit(&zfsvfs->z_teardown_inactive_lock);
                  zfs_znode_free(zp);
!                 return (B_TRUE);
          }
  
+         return (B_FALSE);
+ }
+ 
+ /*
+  * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
+  * actual freeing.
+  * This code used to be in zfs_inactive() before the async delete patch came in.
+  */
+ static void
+ zfs_inactive_impl(znode_t *zp)
+ {
+         vnode_t *vp = ZTOV(zp);
+         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+         int error;
+ 
+         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+         if (zfs_znode_free_invalid(zp))
+                 return; /* z_teardown_inactive_lock already dropped */
+ 
          /*
           * Attempt to push any data in the page cache.  If this fails
           * we will get kicked out later in zfs_zinactive().
           */
          if (vn_has_cached_data(vp)) {
                  (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
!                     CRED());
          }
  
          if (zp->z_atime_dirty && zp->z_unlinked == 0) {
                  dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
  
*** 4456,4468 ****
--- 4772,4826 ----
                          dmu_tx_commit(tx);
                  }
          }
  
          zfs_zinactive(zp);
+ 
+         VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
+ 
          rw_exit(&zfsvfs->z_teardown_inactive_lock);
  }
  
+ /*
+  * taskq callback that calls zfs_inactive_impl() so that we can free the znode
+  */
+ static void
+ zfs_inactive_task(void *task_arg)
+ {
+         znode_t *zp = task_arg;
+         ASSERT(zp != NULL);
+         zfs_inactive_impl(zp);
+ }
+ 
+ /*ARGSUSED*/
+ void
+ zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+ {
+         znode_t *zp = VTOZ(vp);
+         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ 
+         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+ 
+         VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
+ 
+         if (zfs_znode_free_invalid(zp))
+                 return; /* z_teardown_inactive_lock already dropped */
+ 
+         if (zfs_do_async_free &&
+             zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
+             taskq_dispatch(dsl_pool_vnrele_taskq(
+             dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
+             zp, TQ_NOSLEEP) != NULL) {
+                 rw_exit(&zfsvfs->z_teardown_inactive_lock);
+                 return; /* task dispatched, we're done */
+         }
+         rw_exit(&zfsvfs->z_teardown_inactive_lock);
+ 
+         /* If the taskq dispatch failed, do a sync zfs_inactive_impl() call. */
+         zfs_inactive_impl(zp);
+ }
+ 
  /*
   * Bounds-check the seek operation.
   *
   *      IN:     vp      - vnode seeking within
   *              ooff    - old file offset
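
Finally, the dispatch-or-fall-back logic in zfs_inactive() above maps naturally onto any worker-pool API. A user-space analog using pthreads in place of the dsl_pool vnrele taskq — the threshold tunables mirror the diff's defaults, everything else is illustrative:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t dirty_data_max = 4ULL << 30;    /* mirrors the default */
static uint16_t async_multiplier = 16;          /* mirrors the default */

static void *
inactive_task(void *arg)
{
        printf("freeing %llu bytes asynchronously\n",
            (unsigned long long)*(uint64_t *)arg);
        return (NULL);
}

/* Try async dispatch for big files; fall back to a synchronous free. */
static void
inactive(uint64_t *sizep)
{
        pthread_t tid;

        if (*sizep > async_multiplier * dirty_data_max &&
            pthread_create(&tid, NULL, inactive_task, sizep) == 0) {
                (void) pthread_join(tid, NULL); /* demo: wait for output */
                return;
        }
        printf("freeing %llu bytes synchronously\n",
            (unsigned long long)*sizep);
}

int
main(void)
{
        uint64_t small = 1ULL << 20;    /* 1 MiB: below the threshold */
        uint64_t big = 65ULL << 30;     /* 65 GiB: above 64 GiB default */

        inactive(&small);
        inactive(&big);
        return (0);
}

As in the kernel version, a failed dispatch is not an error: the caller simply pays the cost of the free synchronously, which is the pre-patch behavior.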