NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)

@@ -19,19 +19,18 @@
  * CDDL HEADER END
  */
 
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ * Portions Copyright 2007 Jeremy Teo
+ * Portions Copyright 2010 Robert Milkowski
  * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  * Copyright (c) 2014 Integros [integros.com]
  * Copyright 2015 Joyent, Inc.
  * Copyright 2017 Nexenta Systems, Inc.
  */
 
-/* Portions Copyright 2007 Jeremy Teo */
-/* Portions Copyright 2010 Robert Milkowski */
-
 #include <sys/types.h>
 #include <sys/param.h>
 #include <sys/time.h>
 #include <sys/systm.h>
 #include <sys/sysmacros.h>

@@ -81,10 +80,11 @@
 #include <sys/zfs_rlock.h>
 #include <sys/extdirent.h>
 #include <sys/kidmap.h>
 #include <sys/cred.h>
 #include <sys/attr.h>
+#include <sys/dsl_prop.h>
 #include <sys/zil.h>
 
 /*
  * Programming rules.
  *

@@ -133,11 +133,11 @@
  *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
  *      forever, because the previous txg can't quiesce until B's tx commits.
  *
  *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
  *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
- *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
+ *      calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
  *      to indicate that this operation has already called dmu_tx_wait().
  *      This will ensure that we don't retry forever, waiting a short bit
  *      each time.
  *
  *  (5) If the operation succeeded, generate the intent log entry for it

@@ -158,11 +158,11 @@
  * top:
  *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
  *      rw_enter(...);                  // grab any other locks you need
  *      tx = dmu_tx_create(...);        // get DMU tx
  *      dmu_tx_hold_*();                // hold each object you might modify
- *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ *      error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
  *      if (error) {
  *              rw_exit(...);           // drop locks
  *              zfs_dirent_unlock(dl);  // unlock directory entry
  *              VN_RELE(...);           // release held vnodes
  *              if (error == ERESTART) {

@@ -185,10 +185,51 @@
  *      zil_commit(zilog, foid);        // synchronous when necessary
  *      ZFS_EXIT(zfsvfs);               // finished in zfs
  *      return (error);                 // done, report error
  */
 
+/* set this tunable to zero to disable asynchronous freeing of files */
+boolean_t zfs_do_async_free = B_TRUE;
+
+/*
+ * This value will be multiplied by zfs_dirty_data_max to determine
+ * the threshold past which we will call zfs_inactive_impl() async.
+ *
+ * Selecting the multiplier is a balance between how long we're willing to wait
+ * for delete/free to complete (get a shell back, have an NFS thread captive,
+ * etc.) and reducing the number of active requests in the backing taskq.
+ *
+ * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
+ * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
+ *
+ * WARNING: Setting this tunable to zero will enable asynchronous freeing for
+ * all files which can have undesirable side effects.
+ */
+uint16_t zfs_inactive_async_multiplier = 16;
+
+int nms_worm_transition_time = 30;
+int
+zfs_worm_in_trans(znode_t *zp)
+{
+        zfsvfs_t                *zfsvfs = zp->z_zfsvfs;
+        timestruc_t             now;
+        sa_bulk_attr_t          bulk[2];
+        uint64_t                ctime[2];
+        int                     count = 0;
+
+        if (!nms_worm_transition_time)
+                return (0);
+
+        gethrestime(&now);
+        SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
+            &ctime, sizeof (ctime));
+        if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
+                return (0);
+
+        return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
+}
+
 /* ARGSUSED */
 static int
 zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
 {
         znode_t *zp = VTOZ(*vpp);
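
A minimal sketch of how the multiplier above is consumed (it mirrors the size
check this change adds to zfs_inactive() further down; names are from this
diff, and the arithmetic uses the defaults quoted in the comment):

        /*
         * Sketch only: with zfs_dirty_data_max = 4 GiB and a multiplier of
         * 16 the cutoff is 64 GiB.
         */
        if (zfs_do_async_free &&
            zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max) {
                /* dispatch zfs_inactive_impl(zp) to the vnrele taskq */
        } else {
                /* fall back to freeing synchronously in the caller */
        }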

@@ -225,16 +266,17 @@
 zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
     caller_context_t *ct)
 {
         znode_t *zp = VTOZ(vp);
         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+        pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
 
         /*
          * Clean up any locks held by this process on the vp.
          */
-        cleanlocks(vp, ddi_get_pid(), 0);
-        cleanshares(vp, ddi_get_pid());
+        cleanlocks(vp, caller_pid, 0);
+        cleanshares(vp, caller_pid);
 
         ZFS_ENTER(zfsvfs);
         ZFS_VERIFY_ZP(zp);
 
         /* Decrement the synchronous opens in the znode */

@@ -484,10 +526,168 @@
                         break;
         }
         return (error);
 }
 
+
+/*
+ * ZFS I/O rate throttling
+ */
+
+#define DELAY_SHIFT 24
+
+typedef struct zfs_rate_delay {
+        uint_t rl_rate;
+        hrtime_t rl_delay;
+} zfs_rate_delay_t;
+
+/*
+ * The time we'll attempt to cv_wait (below), in nSec.
+ * This should be no less than the minimum time it normally takes
+ * to block a thread and wake back up after the timeout fires.
+ *
+ * Each table entry represents the delay for each 16 MiB/s of bandwidth
+ * (the rate cap); we reduce the delay as the rate cap increases.
+ */
+zfs_rate_delay_t zfs_rate_delay_table[] = {
+        {0, 100000},
+        {1, 100000},
+        {2, 100000},
+        {3, 100000},
+        {4, 100000},
+        {5, 50000},
+        {6, 50000},
+        {7, 50000},
+        {8, 50000},
+        {9, 25000},
+        {10, 25000},
+        {11, 25000},
+        {12, 25000},
+        {13, 12500},
+        {14, 12500},
+        {15, 12500},
+        {16, 12500},
+        {17, 6250},
+        {18, 6250},
+        {19, 6250},
+        {20, 6250},
+        {21, 3125},
+        {22, 3125},
+        {23, 3125},
+        {24, 3125},
+};
+
+#define MAX_RATE_TBL_ENTRY 24
+
+/*
+ * The delay we use is reduced based on the size of the iorate;
+ * for higher iorates we want a shorter delay.
+ */
+static inline hrtime_t
+zfs_get_delay(ssize_t iorate)
+{
+        uint_t rate = iorate >> DELAY_SHIFT;
+
+        if (rate > MAX_RATE_TBL_ENTRY)
+                rate = MAX_RATE_TBL_ENTRY;
+        return (zfs_rate_delay_table[rate].rl_delay);
+}
+
+/*
+ * ZFS I/O rate throttling
+ * See "Token Bucket" on Wikipedia
+ *
+ * This is "Token Bucket" with some modifications to avoid wait times
+ * longer than a couple seconds, so that we don't trigger NFS retries
+ * or similar.  This does mean that concurrent requests might take us
+ * over the rate limit, but that's a lesser evil.
+ */
+static void
+zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
+{
+        zfs_rate_state_t *rate = &zfsvfs->z_rate;
+        hrtime_t now, delta; /* nanoseconds */
+        int64_t refill;
+
+        VERIFY(rate->rate_cap > 0);
+        mutex_enter(&rate->rate_lock);
+
+        /*
+         * If another thread is already waiting, we must queue up behind them.
+         * We'll wait up to 1 sec here.  We normally will resume by cv_signal,
+         * so we don't need fine timer resolution on this wait.
+         */
+        if (rate->rate_token_bucket < 0) {
+                rate->rate_waiters++;
+                (void) cv_timedwait_hires(
+                    &rate->rate_wait_cv, &rate->rate_lock,
+                    NANOSEC, TR_CLOCK_TICK, 0);
+                rate->rate_waiters--;
+        }
+
+        /*
+         * How long since we last updated the bucket?
+         */
+        now = gethrtime();
+        delta = now - rate->rate_last_update;
+        rate->rate_last_update = now;
+        if (delta < 0)
+                delta = 0; /* paranoid */
+
+        /*
+         * Add "tokens" for time since last update,
+         * being careful about possible overflow.
+         */
+        refill = (delta * rate->rate_cap) / NANOSEC;
+        if (refill < 0 || refill > rate->rate_cap)
+                refill = rate->rate_cap; /* overflow */
+        rate->rate_token_bucket += refill;
+        if (rate->rate_token_bucket > rate->rate_cap)
+                rate->rate_token_bucket = rate->rate_cap;
+
+        /*
+         * Withdraw tokens for the current I/O.  If this makes us overdrawn,
+         * wait an amount of time proportionate to the overdraft.  However,
+         * as a sanity measure, never wait more than 1 sec, and never try to
+         * wait less than the time it normally takes to block and reschedule.
+         *
+         * Leave the bucket negative while we wait so other threads know to
+         * queue up. In here, "refill" is the debt we're waiting to pay off.
+         */
+        rate->rate_token_bucket -= iosize;
+        if (rate->rate_token_bucket < 0) {
+                hrtime_t zfs_rate_wait = 0;
+
+                refill = -rate->rate_token_bucket;
+                DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
+                    int64_t, refill);
+
+                if (rate->rate_cap <= 0)
+                        goto nocap;
+
+                delta = (refill * NANOSEC) / rate->rate_cap;
+                delta = MIN(delta, NANOSEC);
+
+                zfs_rate_wait = zfs_get_delay(rate->rate_cap);
+
+                if (delta > zfs_rate_wait) {
+                        (void) cv_timedwait_hires(
+                            &rate->rate_wait_cv, &rate->rate_lock,
+                            delta, TR_CLOCK_TICK, 0);
+                }
+
+                rate->rate_token_bucket += refill;
+        }
+nocap:
+        if (rate->rate_waiters > 0) {
+                cv_signal(&rate->rate_wait_cv);
+        }
+
+        mutex_exit(&rate->rate_lock);
+}
+
+
 offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
 
 /*
  * Read bytes from specified file into supplied buffer.
  *
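
The zfs_rate_throttle() added above is a conventional token bucket: the bucket
refills at rate_cap bytes per second, each I/O withdraws its size, and an
overdraft is converted into a sleep proportional to the deficit, capped at one
second and skipped when it would be shorter than the scheduling overhead from
zfs_get_delay().  A stand-alone user-space sketch of the same arithmetic
(illustrative values only, not part of the change):

#include <stdint.h>
#include <stdio.h>

#define NANOSEC         1000000000LL

int
main(void)
{
        int64_t rate_cap = 100LL << 20; /* cap: 100 MiB/s */
        int64_t bucket = 0;             /* available tokens, in bytes */
        int64_t delta_ns = 5000000;     /* 5 ms since the last update */
        int64_t iosize = 1LL << 20;     /* incoming 1 MiB write */

        /* Refill for the elapsed time, never banking more than one second. */
        int64_t refill = (delta_ns * rate_cap) / NANOSEC;
        if (refill > rate_cap)
                refill = rate_cap;
        bucket += refill;
        if (bucket > rate_cap)
                bucket = rate_cap;

        /* Withdraw tokens; a negative bucket is a deficit we must wait out. */
        bucket -= iosize;
        if (bucket < 0) {
                int64_t wait_ns = (-bucket * NANOSEC) / rate_cap;
                if (wait_ns > NANOSEC)  /* never sleep longer than one second */
                        wait_ns = NANOSEC;
                /*
                 * 5 ms at 100 MiB/s earns 512 KiB, so a 1 MiB write leaves a
                 * 512 KiB deficit; wait_ns works out to 5,000,000 ns (5 ms).
                 */
                printf("deficit %lld bytes, wait %lld ns\n",
                    (long long)-bucket, (long long)wait_ns);
        }
        return (0);
}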

@@ -550,10 +750,16 @@
                         return (error);
                 }
         }
 
         /*
+         * ZFS I/O rate throttling
+         */
+        if (zfsvfs->z_rate.rate_cap)
+                zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+        /*
          * If we're in FRSYNC mode, sync out this znode before reading it.
          */
         if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
                 zil_commit(zfsvfs->z_log, zp->z_id);
 

@@ -713,13 +919,17 @@
          * See zfs_zaccess_common()
          */
         if ((zp->z_pflags & ZFS_IMMUTABLE) ||
             ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
             (uio->uio_loffset < zp->z_size))) {
+                /* Make sure we're not a WORM before returning EPERM. */
+                if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
+                    !zp->z_zfsvfs->z_isworm) {
                 ZFS_EXIT(zfsvfs);
                 return (SET_ERROR(EPERM));
         }
+        }
 
         zilog = zfsvfs->z_log;
 
         /*
          * Validate file offset

@@ -739,10 +949,16 @@
                 ZFS_EXIT(zfsvfs);
                 return (error);
         }
 
         /*
+         * ZFS I/O rate throttling
+         */
+        if (zfsvfs->z_rate.rate_cap)
+                zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+        /*
          * Pre-fault the pages to ensure slow (eg NFS) pages
          * don't hold up txg.
          * Skip this if uio contains loaned arc_buf.
          */
         if ((uio->uio_extflg == UIO_XUIO) &&

@@ -1013,10 +1229,11 @@
 
         ZFS_EXIT(zfsvfs);
         return (0);
 }
 
+/* ARGSUSED */
 void
 zfs_get_done(zgd_t *zgd, int error)
 {
         znode_t *zp = zgd->zgd_private;
         objset_t *os = zp->z_zfsvfs->z_os;

@@ -1030,13 +1247,10 @@
          * Release the vnode asynchronously as we currently have the
          * txg stopped from syncing.
          */
         VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
 
-        if (error == 0 && zgd->zgd_bp)
-                zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
-
         kmem_free(zgd, sizeof (zgd_t));
 }
 
 #ifdef DEBUG
 static int zil_fault_io = 0;

@@ -1156,15 +1370,11 @@
                                 lr->lr_common.lrc_txtype = TX_WRITE2;
                                 /*
                                  * TX_WRITE2 relies on the data previously
                                  * written by the TX_WRITE that caused
                                  * EALREADY.  We zero out the BP because
-                                 * it is the old, currently-on-disk BP,
-                                 * so there's no need to zio_flush() its
-                                 * vdevs (flushing would needlesly hurt
-                                 * performance, and doesn't work on
-                                 * indirect vdevs).
+                                 * it is the old, currently-on-disk BP.
                                  */
                                 zgd->zgd_bp = NULL;
                                 BP_ZERO(bp);
                                 error = 0;
                         }

@@ -1243,11 +1453,11 @@
 static int
 zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
     int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
     int *direntflags, pathname_t *realpnp)
 {
-        znode_t *zdp = VTOZ(dvp);
+        znode_t *zp, *zdp = VTOZ(dvp);
         zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
         int     error = 0;
 
         /*
          * Fast path lookup, however we must skip DNLC lookup

@@ -1361,10 +1571,18 @@
         }
 
         error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
         if (error == 0)
                 error = specvp_check(vpp, cr);
+        if (*vpp) {
+                zp = VTOZ(*vpp);
+                if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
+                    ((*vpp)->v_type != VDIR) &&
+                    zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
+                        zp->z_pflags |= ZFS_IMMUTABLE;
+                }
+        }
 
         ZFS_EXIT(zfsvfs);
         return (error);
 }
 

@@ -1396,10 +1614,11 @@
 static int
 zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
     int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
     vsecattr_t *vsecp)
 {
+        int             imm_was_set = 0;
         znode_t         *zp, *dzp = VTOZ(dvp);
         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
         zilog_t         *zilog;
         objset_t        *os;
         zfs_dirlock_t   *dl;

@@ -1481,20 +1700,31 @@
         }
 
         if (zp == NULL) {
                 uint64_t txtype;
 
+                if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+                    dzp->z_zfsvfs->z_isworm) {
+                        imm_was_set = 1;
+                        dzp->z_pflags &= ~ZFS_IMMUTABLE;
+                }
+
                 /*
                  * Create a new file object and update the directory
                  * to reference it.
                  */
                 if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
                         if (have_acl)
                                 zfs_acl_ids_free(&acl_ids);
+                        if (imm_was_set)
+                                dzp->z_pflags |= ZFS_IMMUTABLE;
                         goto out;
                 }
 
+                if (imm_was_set)
+                        dzp->z_pflags |= ZFS_IMMUTABLE;
+
                 /*
                  * We only support the creation of regular files in
                  * extended attribute directories.
                  */
 

@@ -1530,12 +1760,11 @@
                 if (!zfsvfs->z_use_sa &&
                     acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
                         dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
                             0, acl_ids.z_aclp->z_acl_bytes);
                 }
-                error = dmu_tx_assign(tx,
-                    (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+                error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
                 if (error) {
                         zfs_dirent_unlock(dl);
                         if (error == ERESTART) {
                                 waited = B_TRUE;
                                 dmu_tx_wait(tx);

@@ -1550,10 +1779,13 @@
                 zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
 
                 if (fuid_dirtied)
                         zfs_fuid_sync(zfsvfs, tx);
 
+                if (imm_was_set)
+                        zp->z_pflags |= ZFS_IMMUTABLE;
+
                 (void) zfs_link_create(dl, zp, tx, ZNEW);
                 txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
                 if (flag & FIGNORECASE)
                         txtype |= TX_CI;
                 zfs_log_create(zilog, tx, txtype, dzp, zp, name,

@@ -1582,17 +1814,34 @@
                  */
                 if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
                         error = SET_ERROR(EISDIR);
                         goto out;
                 }
+                if ((flag & FWRITE) &&
+                    dzp->z_zfsvfs->z_isworm) {
+                        error = SET_ERROR(EPERM);
+                        goto out;
+                }
+
+                if (!(flag & FAPPEND) &&
+                    (zp->z_pflags & ZFS_IMMUTABLE) &&
+                    dzp->z_zfsvfs->z_isworm) {
+                        imm_was_set = 1;
+                        zp->z_pflags &= ~ZFS_IMMUTABLE;
+                }
                 /*
                  * Verify requested access to file.
                  */
                 if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
+                        if (imm_was_set)
+                                zp->z_pflags |= ZFS_IMMUTABLE;
                         goto out;
                 }
 
+                if (imm_was_set)
+                        zp->z_pflags |= ZFS_IMMUTABLE;
+
                 mutex_enter(&dzp->z_lock);
                 dzp->z_seq++;
                 mutex_exit(&dzp->z_lock);
 
                 /*

@@ -1695,10 +1944,15 @@
                 return (error);
         }
 
         vp = ZTOV(zp);
 
+        if (zp->z_zfsvfs->z_isworm) {
+                error = SET_ERROR(EPERM);
+                goto out;
+        }
+
         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
                 goto out;
         }
 
         /*

@@ -1761,11 +2015,11 @@
         /*
          * Mark this transaction as typically resulting in a net free of space
          */
         dmu_tx_mark_netfree(tx);
 
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 zfs_dirent_unlock(dl);
                 VN_RELE(vp);
                 if (xzp)
                         VN_RELE(ZTOV(xzp));

@@ -1888,10 +2142,11 @@
 /*ARGSUSED*/
 static int
 zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
     caller_context_t *ct, int flags, vsecattr_t *vsecp)
 {
+        int             imm_was_set = 0;
         znode_t         *zp, *dzp = VTOZ(dvp);
         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
         zilog_t         *zilog;
         zfs_dirlock_t   *dl;
         uint64_t        txtype;

@@ -1967,17 +2222,28 @@
                 zfs_acl_ids_free(&acl_ids);
                 ZFS_EXIT(zfsvfs);
                 return (error);
         }
 
+        if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+            dzp->z_zfsvfs->z_isworm) {
+                imm_was_set = 1;
+                dzp->z_pflags &= ~ZFS_IMMUTABLE;
+        }
+
         if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
+                if (imm_was_set)
+                        dzp->z_pflags |= ZFS_IMMUTABLE;
                 zfs_acl_ids_free(&acl_ids);
                 zfs_dirent_unlock(dl);
                 ZFS_EXIT(zfsvfs);
                 return (error);
         }
 
+        if (imm_was_set)
+                dzp->z_pflags |= ZFS_IMMUTABLE;
+
         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
                 zfs_acl_ids_free(&acl_ids);
                 zfs_dirent_unlock(dl);
                 ZFS_EXIT(zfsvfs);
                 return (SET_ERROR(EDQUOT));

@@ -1998,11 +2264,11 @@
         }
 
         dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
             ZFS_SA_BASE_ATTR_SIZE);
 
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 zfs_dirent_unlock(dl);
                 if (error == ERESTART) {
                         waited = B_TRUE;
                         dmu_tx_wait(tx);

@@ -2100,10 +2366,15 @@
                 return (error);
         }
 
         vp = ZTOV(zp);
 
+        if (dzp->z_zfsvfs->z_isworm) {
+                error = SET_ERROR(EPERM);
+                goto out;
+        }
+
         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
                 goto out;
         }
 
         if (vp->v_type != VDIR) {

@@ -2135,11 +2406,11 @@
         dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
         zfs_sa_upgrade_txholds(tx, zp);
         zfs_sa_upgrade_txholds(tx, dzp);
         dmu_tx_mark_netfree(tx);
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 rw_exit(&zp->z_parent_lock);
                 rw_exit(&zp->z_name_lock);
                 zfs_dirent_unlock(dl);
                 VN_RELE(vp);

@@ -2792,18 +3063,30 @@
         xoap = xva_getxoptattr(xvap);
 
         xva_init(&tmpxvattr);
 
         /*
-         * Immutable files can only alter immutable bit and atime
+         * Do not allow altering the immutable bit after it is set
          */
         if ((zp->z_pflags & ZFS_IMMUTABLE) &&
+            XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
+            zp->z_zfsvfs->z_isworm) {
+                ZFS_EXIT(zfsvfs);
+                return (SET_ERROR(EPERM));
+        }
+
+        /*
+         * Immutable files can only alter atime
+         */
+        if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
             ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
             ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
+                if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
                 ZFS_EXIT(zfsvfs);
                 return (SET_ERROR(EPERM));
         }
+        }
 
         /*
          * Note: ZFS_READONLY is handled in zfs_zaccess_common.
          */
 

@@ -3708,11 +3991,11 @@
                 zfs_sa_upgrade_txholds(tx, tzp);
         }
 
         zfs_sa_upgrade_txholds(tx, szp);
         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 if (zl != NULL)
                         zfs_rename_unlock(&zl);
                 zfs_dirent_unlock(sdl);
                 zfs_dirent_unlock(tdl);

@@ -3832,10 +4115,11 @@
         znode_t         *zp, *dzp = VTOZ(dvp);
         zfs_dirlock_t   *dl;
         dmu_tx_t        *tx;
         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
         zilog_t         *zilog;
+        int             imm_was_set = 0;
         uint64_t        len = strlen(link);
         int             error;
         int             zflg = ZNEW;
         zfs_acl_ids_t   acl_ids;
         boolean_t       fuid_dirtied;

@@ -3875,16 +4159,24 @@
                 zfs_acl_ids_free(&acl_ids);
                 ZFS_EXIT(zfsvfs);
                 return (error);
         }
 
+        if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
+                imm_was_set = 1;
+                dzp->z_pflags &= ~ZFS_IMMUTABLE;
+        }
         if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
+                if (imm_was_set)
+                        dzp->z_pflags |= ZFS_IMMUTABLE;
                 zfs_acl_ids_free(&acl_ids);
                 zfs_dirent_unlock(dl);
                 ZFS_EXIT(zfsvfs);
                 return (error);
         }
+        if (imm_was_set)
+                dzp->z_pflags |= ZFS_IMMUTABLE;
 
         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
                 zfs_acl_ids_free(&acl_ids);
                 zfs_dirent_unlock(dl);
                 ZFS_EXIT(zfsvfs);

@@ -3901,11 +4193,11 @@
                 dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
                     acl_ids.z_aclp->z_acl_bytes);
         }
         if (fuid_dirtied)
                 zfs_fuid_txhold(zfsvfs, tx);
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 zfs_dirent_unlock(dl);
                 if (error == ERESTART) {
                         waited = B_TRUE;
                         dmu_tx_wait(tx);

@@ -4122,11 +4414,11 @@
         tx = dmu_tx_create(zfsvfs->z_os);
         dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
         zfs_sa_upgrade_txholds(tx, szp);
         zfs_sa_upgrade_txholds(tx, dzp);
-        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
         if (error) {
                 zfs_dirent_unlock(dl);
                 if (error == ERESTART) {
                         waited = B_TRUE;
                         dmu_tx_wait(tx);

@@ -4398,47 +4690,71 @@
                 zil_commit(zfsvfs->z_log, zp->z_id);
         ZFS_EXIT(zfsvfs);
         return (error);
 }
 
-/*ARGSUSED*/
-void
-zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+/*
+ * Returns B_TRUE and exits the z_teardown_inactive_lock
+ * if the znode we are looking at is no longer valid
+ */
+static boolean_t
+zfs_znode_free_invalid(znode_t *zp)
 {
-        znode_t *zp = VTOZ(vp);
         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
-        int error;
+        vnode_t *vp = ZTOV(zp);
 
-        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
+        ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
+
         if (zp->z_sa_hdl == NULL) {
                 /*
                  * The fs has been unmounted, or we did a
                  * suspend/resume and this file no longer exists.
                  */
                 if (vn_has_cached_data(vp)) {
                         (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
-                            B_INVAL, cr);
+                            B_INVAL, CRED());
                 }
 
                 mutex_enter(&zp->z_lock);
                 mutex_enter(&vp->v_lock);
                 ASSERT(vp->v_count == 1);
                 VN_RELE_LOCKED(vp);
                 mutex_exit(&vp->v_lock);
                 mutex_exit(&zp->z_lock);
+                VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
+                    UINT32_MAX);
                 rw_exit(&zfsvfs->z_teardown_inactive_lock);
                 zfs_znode_free(zp);
-                return;
+                return (B_TRUE);
         }
 
+        return (B_FALSE);
+}
+
+/*
+ * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
+ * actual freeing.
+ * This code was in zfs_inactive() before the async delete patch came in.
+ */
+static void
+zfs_inactive_impl(znode_t *zp)
+{
+        vnode_t *vp = ZTOV(zp);
+        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+        int error;
+
+        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+        if (zfs_znode_free_invalid(zp))
+                return; /* z_teardown_inactive_lock already dropped */
+
         /*
          * Attempt to push any data in the page cache.  If this fails
          * we will get kicked out later in zfs_zinactive().
          */
         if (vn_has_cached_data(vp)) {
                 (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
-                    cr);
+                    CRED());
         }
 
         if (zp->z_atime_dirty && zp->z_unlinked == 0) {
                 dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
 

@@ -4456,13 +4772,55 @@
                         dmu_tx_commit(tx);
                 }
         }
 
         zfs_zinactive(zp);
+
+        VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
+
         rw_exit(&zfsvfs->z_teardown_inactive_lock);
 }
 
+/*
+ * Taskq callback that calls zfs_inactive_impl() to free the znode.
+ */
+static void
+zfs_inactive_task(void *task_arg)
+{
+        znode_t *zp = task_arg;
+        ASSERT(zp != NULL);
+        zfs_inactive_impl(zp);
+}
+
+/*ARGSUSED*/
+void
+zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+{
+        znode_t *zp = VTOZ(vp);
+        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+
+        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+
+        VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
+
+        if (zfs_znode_free_invalid(zp))
+                return; /* z_teardown_inactive_lock already dropped */
+
+        if (zfs_do_async_free &&
+            zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
+            taskq_dispatch(dsl_pool_vnrele_taskq(
+            dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
+            zp, TQ_NOSLEEP) != NULL) {
+                rw_exit(&zfsvfs->z_teardown_inactive_lock);
+                return; /* task dispatched, we're done */
+        }
+        rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+        /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
+        zfs_inactive_impl(zp);
+}
+
 /*
  * Bounds-check the seek operation.
  *
  *      IN:     vp      - vnode seeking within
  *              ooff    - old file offset