NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)
@@ -19,19 +19,18 @@
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ * Portions Copyright 2007 Jeremy Teo
+ * Portions Copyright 2010 Robert Milkowski
* Copyright (c) 2012, 2017 by Delphix. All rights reserved.
* Copyright (c) 2014 Integros [integros.com]
* Copyright 2015 Joyent, Inc.
* Copyright 2017 Nexenta Systems, Inc.
*/
-/* Portions Copyright 2007 Jeremy Teo */
-/* Portions Copyright 2010 Robert Milkowski */
-
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
@@ -81,10 +80,11 @@
#include <sys/zfs_rlock.h>
#include <sys/extdirent.h>
#include <sys/kidmap.h>
#include <sys/cred.h>
#include <sys/attr.h>
+#include <sys/dsl_prop.h>
#include <sys/zil.h>
/*
* Programming rules.
*
@@ -133,11 +133,11 @@
* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
* forever, because the previous txg can't quiesce until B's tx commits.
*
* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
* then drop all locks, call dmu_tx_wait(), and try again. On subsequent
- * calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
+ * calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
* to indicate that this operation has already called dmu_tx_wait().
* This will ensure that we don't retry forever, waiting a short bit
* each time.
*
* (5) If the operation succeeded, generate the intent log entry for it
@@ -158,11 +158,11 @@
* top:
* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
* rw_enter(...); // grab any other locks you need
* tx = dmu_tx_create(...); // get DMU tx
* dmu_tx_hold_*(); // hold each object you might modify
- * error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ * error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
* if (error) {
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* if (error == ERESTART) {
@@ -185,10 +185,51 @@
* zil_commit(zilog, foid); // synchronous when necessary
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // done, report error
*/
+/* set this tunable to zero to disable asynchronous freeing of files */
+boolean_t zfs_do_async_free = B_TRUE;
+
+/*
+ * This value will be multiplied by zfs_dirty_data_max to determine
+ * the threshold past which we will call zfs_inactive_impl() async.
+ *
+ * Selecting the multiplier is a balance between how long we're willing to wait
+ * for delete/free to complete (get the shell back, have an NFS thread captive, etc.)
+ * and reducing the number of active requests in the backing taskq.
+ *
+ * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
+ * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
+ *
+ * WARNING: Setting this tunable to zero will enable asynchronous freeing for
+ * all files, which can have undesirable side effects.
+ */
+uint16_t zfs_inactive_async_multiplier = 16;
+
+int nms_worm_transition_time = 30;
+int
+zfs_worm_in_trans(znode_t *zp)
+{
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ timestruc_t now;
+ sa_bulk_attr_t bulk[2];
+ uint64_t ctime[2];
+ int count = 0;
+
+ if (!nms_worm_transition_time)
+ return (0);
+
+ gethrestime(&now);
+ SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
+ &ctime, sizeof (ctime));
+ if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
+ return (0);
+
+ return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
+}
+
/* ARGSUSED */
static int
zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(*vpp);
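As a quick reference for how the two async-free tunables introduced above combine, here is a minimal sketch of the size check they feed. The helper name zfs_should_free_async is hypothetical; the actual check is done inline in zfs_inactive() near the end of this change.

static boolean_t
zfs_should_free_async(znode_t *zp)
{
	/* 16 * 4 GiB = 64 GiB threshold with the defaults described above */
	uint64_t threshold =
	    (uint64_t)zfs_inactive_async_multiplier * zfs_dirty_data_max;

	return (zfs_do_async_free && zp->z_size > threshold);
}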
@@ -225,16 +266,17 @@
zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
/*
* Clean up any locks held by this process on the vp.
*/
- cleanlocks(vp, ddi_get_pid(), 0);
- cleanshares(vp, ddi_get_pid());
+ cleanlocks(vp, caller_pid, 0);
+ cleanshares(vp, caller_pid);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/* Decrement the synchronous opens in the znode */
@@ -484,10 +526,168 @@
break;
}
return (error);
}
+
+/*
+ * ZFS I/O rate throttling
+ */
+
+#define DELAY_SHIFT 24
+
+typedef struct zfs_rate_delay {
+ uint_t rl_rate;
+ hrtime_t rl_delay;
+} zfs_rate_delay_t;
+
+/*
+ * The time we'll attempt to cv_wait (below), in nSec.
+ * This should be no less than the minimum time it normally takes
+ * to block a thread and wake back up after the timeout fires.
+ *
+ * Each table entry represents the delay for each 4MB of bandwidth.
+ * We reduce the delay as the size of the I/O increases.
+ */
+zfs_rate_delay_t zfs_rate_delay_table[] = {
+ {0, 100000},
+ {1, 100000},
+ {2, 100000},
+ {3, 100000},
+ {4, 100000},
+ {5, 50000},
+ {6, 50000},
+ {7, 50000},
+ {8, 50000},
+ {9, 25000},
+ {10, 25000},
+ {11, 25000},
+ {12, 25000},
+ {13, 12500},
+ {14, 12500},
+ {15, 12500},
+ {16, 12500},
+ {17, 6250},
+ {18, 6250},
+ {19, 6250},
+ {20, 6250},
+ {21, 3125},
+ {22, 3125},
+ {23, 3125},
+ {24, 3125},
+};
+
+#define MAX_RATE_TBL_ENTRY 24
+
+/*
+ * The delay we use should be reduced based on the size of the iorate:
+ * for higher iorates we want a shorter delay.
+ */
+static inline hrtime_t
+zfs_get_delay(ssize_t iorate)
+{
+ uint_t rate = iorate >> DELAY_SHIFT;
+
+ if (rate > MAX_RATE_TBL_ENTRY)
+ rate = MAX_RATE_TBL_ENTRY;
+ return (zfs_rate_delay_table[rate].rl_delay);
+}
+
+/*
+ * ZFS I/O rate throttling
+ * See "Token Bucket" on Wikipedia
+ *
+ * This is "Token Bucket" with some modifications to avoid wait times
+ * longer than a couple seconds, so that we don't trigger NFS retries
+ * or similar. This does mean that concurrent requests might take us
+ * over the rate limit, but that's a lesser evil.
+ */
+static void
+zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
+{
+ zfs_rate_state_t *rate = &zfsvfs->z_rate;
+ hrtime_t now, delta; /* nanoseconds */
+ int64_t refill;
+
+ VERIFY(rate->rate_cap > 0);
+ mutex_enter(&rate->rate_lock);
+
+ /*
+ * If another thread is already waiting, we must queue up behind them.
+ * We'll wait up to 1 sec here. We will normally be resumed by cv_signal,
+ * so we don't need fine timer resolution on this wait.
+ */
+ if (rate->rate_token_bucket < 0) {
+ rate->rate_waiters++;
+ (void) cv_timedwait_hires(
+ &rate->rate_wait_cv, &rate->rate_lock,
+ NANOSEC, TR_CLOCK_TICK, 0);
+ rate->rate_waiters--;
+ }
+
+ /*
+ * How long since we last updated the bucket?
+ */
+ now = gethrtime();
+ delta = now - rate->rate_last_update;
+ rate->rate_last_update = now;
+ if (delta < 0)
+ delta = 0; /* paranoid */
+
+ /*
+ * Add "tokens" for time since last update,
+ * being careful about possible overflow.
+ */
+ refill = (delta * rate->rate_cap) / NANOSEC;
+ if (refill < 0 || refill > rate->rate_cap)
+ refill = rate->rate_cap; /* overflow */
+ rate->rate_token_bucket += refill;
+ if (rate->rate_token_bucket > rate->rate_cap)
+ rate->rate_token_bucket = rate->rate_cap;
+
+ /*
+ * Withdraw tokens for the current I/O. If this makes us overdrawn,
+ * wait an amount of time proportionate to the overdraft. However,
+ * as a sanity measure, never wait more than 1 sec, and never try to
+ * wait less than the time it normally takes to block and reschedule.
+ *
+ * Leave the bucket negative while we wait so other threads know to
+ * queue up. In here, "refill" is the debt we're waiting to pay off.
+ */
+ rate->rate_token_bucket -= iosize;
+ if (rate->rate_token_bucket < 0) {
+ hrtime_t zfs_rate_wait = 0;
+
+ refill = rate->rate_token_bucket;
+ DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
+ int64_t, refill);
+
+ if (rate->rate_cap <= 0)
+ goto nocap;
+
+ delta = (refill * NANOSEC) / rate->rate_cap;
+ delta = MIN(delta, NANOSEC);
+
+ zfs_rate_wait = zfs_get_delay(rate->rate_cap);
+
+ if (delta > zfs_rate_wait) {
+ (void) cv_timedwait_hires(
+ &rate->rate_wait_cv, &rate->rate_lock,
+ delta, TR_CLOCK_TICK, 0);
+ }
+
+ rate->rate_token_bucket += refill;
+ }
+nocap:
+ if (rate->rate_waiters > 0) {
+ cv_signal(&rate->rate_wait_cv);
+ }
+
+ mutex_exit(&rate->rate_lock);
+}
+
+
offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
/*
* Read bytes from specified file into supplied buffer.
*
@@ -550,10 +750,16 @@
return (error);
}
}
/*
+ * ZFS I/O rate throttling
+ */
+ if (zfsvfs->z_rate.rate_cap)
+ zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+ /*
* If we're in FRSYNC mode, sync out this znode before reading it.
*/
if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
zil_commit(zfsvfs->z_log, zp->z_id);
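To make the token-bucket arithmetic concrete, below is a simplified, self-contained user-land model of the behavior the comments above describe; it is an illustration only, and none of the names or structures here come from the kernel code. With a 100 MiB/s cap, an empty bucket, and 10 ms since the last update, the bucket refills by 1 MiB; a 4 MiB request then overdraws it by 3 MiB, for a delay of roughly 30 ms (capped at one second).

#include <stdint.h>
#include <stdio.h>

#define NANOSEC 1000000000LL

typedef struct {
	int64_t cap;      /* bytes per second; also the bucket limit */
	int64_t bucket;   /* available tokens (bytes), may go negative */
	int64_t last_ns;  /* time of the last update */
} rate_t;

/* Charge one I/O against the bucket; return nanoseconds the caller should wait. */
static int64_t
rate_charge(rate_t *r, int64_t iosize, int64_t now_ns)
{
	/* Refill tokens for the time elapsed since the last update. */
	int64_t refill = (now_ns - r->last_ns) * r->cap / NANOSEC;

	r->last_ns = now_ns;
	r->bucket += refill;
	if (r->bucket > r->cap)
		r->bucket = r->cap;

	/* Withdraw tokens for this I/O. */
	r->bucket -= iosize;
	if (r->bucket >= 0)
		return (0);

	/* Overdrawn: wait proportionally to the debt, never more than 1 sec. */
	int64_t wait_ns = -r->bucket * NANOSEC / r->cap;
	return (wait_ns < NANOSEC ? wait_ns : NANOSEC);
}

int
main(void)
{
	rate_t r = { .cap = 100 << 20, .bucket = 0, .last_ns = 0 };

	/* 4 MiB request arriving 10 ms later: prints 30000000 (30 ms). */
	printf("%lld\n", (long long)rate_charge(&r, 4 << 20, 10000000LL));
	return (0);
}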
@@ -713,13 +919,17 @@
* See zfs_zaccess_common()
*/
if ((zp->z_pflags & ZFS_IMMUTABLE) ||
((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
(uio->uio_loffset < zp->z_size))) {
+ /* Make sure we're not a WORM before returning EPERM. */
+ if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
+ !zp->z_zfsvfs->z_isworm) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EPERM));
}
+ }
zilog = zfsvfs->z_log;
/*
* Validate file offset
@@ -739,10 +949,16 @@
ZFS_EXIT(zfsvfs);
return (error);
}
/*
+ * ZFS I/O rate throttling
+ */
+ if (zfsvfs->z_rate.rate_cap)
+ zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+ /*
* Pre-fault the pages to ensure slow (eg NFS) pages
* don't hold up txg.
* Skip this if uio contains loaned arc_buf.
*/
if ((uio->uio_extflg == UIO_XUIO) &&
@@ -1013,10 +1229,11 @@
ZFS_EXIT(zfsvfs);
return (0);
}
+/* ARGSUSED */
void
zfs_get_done(zgd_t *zgd, int error)
{
znode_t *zp = zgd->zgd_private;
objset_t *os = zp->z_zfsvfs->z_os;
@@ -1030,13 +1247,10 @@
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
- if (error == 0 && zgd->zgd_bp)
- zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
-
kmem_free(zgd, sizeof (zgd_t));
}
#ifdef DEBUG
static int zil_fault_io = 0;
@@ -1156,15 +1370,11 @@
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
- * it is the old, currently-on-disk BP,
- * so there's no need to zio_flush() its
- * vdevs (flushing would needlesly hurt
- * performance, and doesn't work on
- * indirect vdevs).
+ * it is the old, currently-on-disk BP.
*/
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
@@ -1243,11 +1453,11 @@
static int
zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct,
int *direntflags, pathname_t *realpnp)
{
- znode_t *zdp = VTOZ(dvp);
+ znode_t *zp, *zdp = VTOZ(dvp);
zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
int error = 0;
/*
* Fast path lookup, however we must skip DNLC lookup
@@ -1361,10 +1571,18 @@
}
error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
if (error == 0)
error = specvp_check(vpp, cr);
+ if (*vpp) {
+ zp = VTOZ(*vpp);
+ if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
+ ((*vpp)->v_type != VDIR) &&
+ zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
+ zp->z_pflags |= ZFS_IMMUTABLE;
+ }
+ }
ZFS_EXIT(zfsvfs);
return (error);
}
@@ -1396,10 +1614,11 @@
static int
zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
vsecattr_t *vsecp)
{
+ int imm_was_set = 0;
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
objset_t *os;
zfs_dirlock_t *dl;
@@ -1481,20 +1700,31 @@
}
if (zp == NULL) {
uint64_t txtype;
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
+
/*
* Create a new file object and update the directory
* to reference it.
*/
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
if (have_acl)
zfs_acl_ids_free(&acl_ids);
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
goto out;
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
+
/*
* We only support the creation of regular files in
* extended attribute directories.
*/
@@ -1530,12 +1760,11 @@
if (!zfsvfs->z_use_sa &&
acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, acl_ids.z_aclp->z_acl_bytes);
}
- error = dmu_tx_assign(tx,
- (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
@@ -1550,10 +1779,13 @@
zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
if (fuid_dirtied)
zfs_fuid_sync(zfsvfs, tx);
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
+
(void) zfs_link_create(dl, zp, tx, ZNEW);
txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
if (flag & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, name,
@@ -1582,17 +1814,34 @@
*/
if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
error = SET_ERROR(EISDIR);
goto out;
}
+ if ((flag & FWRITE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ error = EPERM;
+ goto out;
+ }
+
+ if (!(flag & FAPPEND) &&
+ (zp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ zp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
/*
* Verify requested access to file.
*/
if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
goto out;
}
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
+
mutex_enter(&dzp->z_lock);
dzp->z_seq++;
mutex_exit(&dzp->z_lock);
/*
@@ -1695,10 +1944,15 @@
return (error);
}
vp = ZTOV(zp);
+ if (zp->z_zfsvfs->z_isworm) {
+ error = SET_ERROR(EPERM);
+ goto out;
+ }
+
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
/*
@@ -1761,11 +2015,11 @@
/*
* Mark this transaction as typically resulting in a net free of space
*/
dmu_tx_mark_netfree(tx);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (xzp)
VN_RELE(ZTOV(xzp));
@@ -1888,10 +2142,11 @@
/*ARGSUSED*/
static int
zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
caller_context_t *ct, int flags, vsecattr_t *vsecp)
{
+ int imm_was_set = 0;
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
uint64_t txtype;
@@ -1967,17 +2222,28 @@
zfs_acl_ids_free(&acl_ids);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
+
if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
+
if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EDQUOT));
@@ -1998,11 +2264,11 @@
}
dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
ZFS_SA_BASE_ATTR_SIZE);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
@@ -2100,10 +2366,15 @@
return (error);
}
vp = ZTOV(zp);
+ if (dzp->z_zfsvfs->z_isworm) {
+ error = SET_ERROR(EPERM);
+ goto out;
+ }
+
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
if (vp->v_type != VDIR) {
@@ -2135,11 +2406,11 @@
dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
zfs_sa_upgrade_txholds(tx, zp);
zfs_sa_upgrade_txholds(tx, dzp);
dmu_tx_mark_netfree(tx);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
zfs_dirent_unlock(dl);
VN_RELE(vp);
@@ -2792,18 +3063,30 @@
xoap = xva_getxoptattr(xvap);
xva_init(&tmpxvattr);
/*
- * Immutable files can only alter immutable bit and atime
+ * Do not allow to alter immutable bit after it is set
*/
if ((zp->z_pflags & ZFS_IMMUTABLE) &&
+ XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
+ zp->z_zfsvfs->z_isworm) {
+ ZFS_EXIT(zfsvfs);
+ return (SET_ERROR(EPERM));
+ }
+
+ /*
+ * Immutable files can only alter atime
+ */
+ if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
+ if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EPERM));
}
+ }
/*
* Note: ZFS_READONLY is handled in zfs_zaccess_common.
*/
@@ -3708,11 +3991,11 @@
zfs_sa_upgrade_txholds(tx, tzp);
}
zfs_sa_upgrade_txholds(tx, szp);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
@@ -3832,10 +4115,11 @@
znode_t *zp, *dzp = VTOZ(dvp);
zfs_dirlock_t *dl;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
+ int imm_was_set = 0;
uint64_t len = strlen(link);
int error;
int zflg = ZNEW;
zfs_acl_ids_t acl_ids;
boolean_t fuid_dirtied;
@@ -3875,16 +4159,24 @@
zfs_acl_ids_free(&acl_ids);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
@@ -3901,11 +4193,11 @@
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
acl_ids.z_aclp->z_acl_bytes);
}
if (fuid_dirtied)
zfs_fuid_txhold(zfsvfs, tx);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
@@ -4122,11 +4414,11 @@
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
zfs_sa_upgrade_txholds(tx, szp);
zfs_sa_upgrade_txholds(tx, dzp);
- error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
+ error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
@@ -4398,47 +4690,71 @@
zil_commit(zfsvfs->z_log, zp->z_id);
ZFS_EXIT(zfsvfs);
return (error);
}
-/*ARGSUSED*/
-void
-zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+/*
+ * Returns B_TRUE and exits the z_teardown_inactive_lock
+ * if the znode we are looking at is no longer valid
+ */
+static boolean_t
+zfs_znode_free_invalid(znode_t *zp)
{
- znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
- int error;
+ vnode_t *vp = ZTOV(zp);
- rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
+ ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
+
if (zp->z_sa_hdl == NULL) {
/*
* The fs has been unmounted, or we did a
* suspend/resume and this file no longer exists.
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
- B_INVAL, cr);
+ B_INVAL, CRED());
}
mutex_enter(&zp->z_lock);
mutex_enter(&vp->v_lock);
ASSERT(vp->v_count == 1);
VN_RELE_LOCKED(vp);
mutex_exit(&vp->v_lock);
mutex_exit(&zp->z_lock);
+ VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
+ UINT32_MAX);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
zfs_znode_free(zp);
- return;
+ return (B_TRUE);
}
+ return (B_FALSE);
+}
+
+/*
+ * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
+ * actual freeing.
+ * This code used to be in zfs_inactive() before the async delete patch came in.
+ */
+static void
+zfs_inactive_impl(znode_t *zp)
+{
+ vnode_t *vp = ZTOV(zp);
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ int error;
+
+ rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+ if (zfs_znode_free_invalid(zp))
+ return; /* z_teardown_inactive_lock already dropped */
+
/*
* Attempt to push any data in the page cache. If this fails
* we will get kicked out later in zfs_zinactive().
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
- cr);
+ CRED());
}
if (zp->z_atime_dirty && zp->z_unlinked == 0) {
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
@@ -4456,13 +4772,55 @@
dmu_tx_commit(tx);
}
}
zfs_zinactive(zp);
+
+ VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
+
rw_exit(&zfsvfs->z_teardown_inactive_lock);
}
+/*
+ * taskq callback that calls zfs_inactive_impl() so that we can free the znode.
+ */
+static void
+zfs_inactive_task(void *task_arg)
+{
+ znode_t *zp = task_arg;
+ ASSERT(zp != NULL);
+ zfs_inactive_impl(zp);
+}
+
+/*ARGSUSED*/
+void
+zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+{
+ znode_t *zp = VTOZ(vp);
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+
+ rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+
+ VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
+
+ if (zfs_znode_free_invalid(zp))
+ return; /* z_teardown_inactive_lock already dropped */
+
+ if (zfs_do_async_free &&
+ zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
+ taskq_dispatch(dsl_pool_vnrele_taskq(
+ dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
+ zp, TQ_NOSLEEP) != NULL) {
+ rw_exit(&zfsvfs->z_teardown_inactive_lock);
+ return; /* task dispatched, we're done */
+ }
+ rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+ /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
+ zfs_inactive_impl(zp);
+}
+
/*
* Bounds-check the seek operation.
*
* IN: vp - vnode seeking within
* ooff - old file offset