NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)
*** 19,37 ****
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012, 2017 by Delphix. All rights reserved.
* Copyright (c) 2014 Integros [integros.com]
* Copyright 2015 Joyent, Inc.
* Copyright 2017 Nexenta Systems, Inc.
*/
- /* Portions Copyright 2007 Jeremy Teo */
- /* Portions Copyright 2010 Robert Milkowski */
-
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
--- 19,36 ----
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ * Portions Copyright 2007 Jeremy Teo
+ * Portions Copyright 2010 Robert Milkowski
* Copyright (c) 2012, 2017 by Delphix. All rights reserved.
* Copyright (c) 2014 Integros [integros.com]
* Copyright 2015 Joyent, Inc.
* Copyright 2017 Nexenta Systems, Inc.
*/
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
*** 81,90 ****
--- 80,90 ----
#include <sys/zfs_rlock.h>
#include <sys/extdirent.h>
#include <sys/kidmap.h>
#include <sys/cred.h>
#include <sys/attr.h>
+ #include <sys/dsl_prop.h>
#include <sys/zil.h>
/*
* Programming rules.
*
*** 133,143 ****
* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
* forever, because the previous txg can't quiesce until B's tx commits.
*
* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
* then drop all locks, call dmu_tx_wait(), and try again. On subsequent
! * calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
* to indicate that this operation has already called dmu_tx_wait().
* This will ensure that we don't retry forever, waiting a short bit
* each time.
*
* (5) If the operation succeeded, generate the intent log entry for it
--- 133,143 ----
* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
* forever, because the previous txg can't quiesce until B's tx commits.
*
* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
* then drop all locks, call dmu_tx_wait(), and try again. On subsequent
! * calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
* to indicate that this operation has already called dmu_tx_wait().
* This will ensure that we don't retry forever, waiting a short bit
* each time.
*
* (5) If the operation succeeded, generate the intent log entry for it
*** 158,168 ****
* top:
* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
* rw_enter(...); // grab any other locks you need
* tx = dmu_tx_create(...); // get DMU tx
* dmu_tx_hold_*(); // hold each object you might modify
! * error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
* if (error) {
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* if (error == ERESTART) {
--- 158,168 ----
* top:
* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
* rw_enter(...); // grab any other locks you need
* tx = dmu_tx_create(...); // get DMU tx
* dmu_tx_hold_*(); // hold each object you might modify
! * error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
* if (error) {
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* if (error == ERESTART) {
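For reference, the retry pattern described in the comment above condenses to the following minimal sketch. The my_locks_enter()/my_locks_exit() helpers are hypothetical placeholders for whatever directory-entry and other locks a real operation takes; the DMU calls are the ones named in the comment, using the TXG_WAITED flag as in this tree.

static int
example_zfs_op(zfsvfs_t *zfsvfs, znode_t *zp)
{
	dmu_tx_t *tx;
	boolean_t waited = B_FALSE;
	int error;

top:
	my_locks_enter(zp);			/* hypothetical lock helper */
	tx = dmu_tx_create(zfsvfs->z_os);	/* get DMU tx */
	dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE); /* hold objects we modify */
	error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
	if (error) {
		my_locks_exit(zp);		/* drop locks before waiting */
		if (error == ERESTART) {
			waited = B_TRUE;	/* only wait once per operation */
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		return (error);
	}
	/* ... modify the held objects, log the change ... */
	dmu_tx_commit(tx);
	my_locks_exit(zp);
	return (0);
}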
*** 185,194 ****
--- 185,235 ----
* zil_commit(zilog, foid); // synchronous when necessary
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // done, report error
*/
+ /* set this tunable to zero to disable asynchronous freeing of files */
+ boolean_t zfs_do_async_free = B_TRUE;
+
+ /*
+ * This value will be multiplied by zfs_dirty_data_max to determine
+ * the threshold past which we will call zfs_inactive_impl() async.
+ *
+ * Selecting the multiplier is a balance between how long we're willing to wait
+ * for delete/free to complete (get the shell back, have an NFS thread captive, etc.)
+ * and reducing the number of active requests in the backing taskq.
+ *
+ * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
+ * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
+ *
+ * WARNING: Setting this tunable to zero will enable asynchronous freeing for
+ * all files, which can have undesirable side effects.
+ */
+ uint16_t zfs_inactive_async_multiplier = 16;
+
+ int nms_worm_transition_time = 30;
+ int
+ zfs_worm_in_trans(znode_t *zp)
+ {
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ timestruc_t now;
+ sa_bulk_attr_t bulk[2];
+ uint64_t ctime[2];
+ int count = 0;
+
+ if (!nms_worm_transition_time)
+ return (0);
+
+ gethrestime(&now);
+ SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
+ &ctime, sizeof (ctime));
+ if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
+ return (0);
+
+ return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
+ }
+
/* ARGSUSED */
static int
zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(*vpp);
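To make the threshold arithmetic above concrete: with zfs_dirty_data_max at its 4 GiB default and the default multiplier of 16, only vnodes larger than 64 GiB are freed asynchronously. A minimal sketch of that size test follows; the helper name is hypothetical, and the real check lives in zfs_inactive() later in this file.

static boolean_t
example_should_free_async(znode_t *zp)
{
	/* 4 GiB (zfs_dirty_data_max default) * 16 (multiplier) = 64 GiB */
	uint64_t threshold =
	    (uint64_t)zfs_inactive_async_multiplier * zfs_dirty_data_max;

	return (zfs_do_async_free && zp->z_size > threshold);
}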
*** 225,240 ****
zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
/*
* Clean up any locks held by this process on the vp.
*/
! cleanlocks(vp, ddi_get_pid(), 0);
! cleanshares(vp, ddi_get_pid());
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/* Decrement the synchronous opens in the znode */
--- 266,282 ----
zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
/*
* Clean up any locks held by this process on the vp.
*/
! cleanlocks(vp, caller_pid, 0);
! cleanshares(vp, caller_pid);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/* Decrement the synchronous opens in the znode */
*** 484,493 ****
--- 526,693 ----
break;
}
return (error);
}
+
+ /*
+ * ZFS I/O rate throttling
+ */
+
+ #define DELAY_SHIFT 24
+
+ typedef struct zfs_rate_delay {
+ uint_t rl_rate;
+ hrtime_t rl_delay;
+ } zfs_rate_delay_t;
+
+ /*
+ * The time we'll attempt to cv_wait (below), in nSec.
+ * This should be no less than the minimum time it normally takes
+ * to block a thread and wake back up after the timeout fires.
+ *
+ * Each table entry represents the delay for each 4MB of bandwidth.
+ * We reduce the delay as the size of the I/O increases.
+ */
+ zfs_rate_delay_t zfs_rate_delay_table[] = {
+ {0, 100000},
+ {1, 100000},
+ {2, 100000},
+ {3, 100000},
+ {4, 100000},
+ {5, 50000},
+ {6, 50000},
+ {7, 50000},
+ {8, 50000},
+ {9, 25000},
+ {10, 25000},
+ {11, 25000},
+ {12, 25000},
+ {13, 12500},
+ {14, 12500},
+ {15, 12500},
+ {16, 12500},
+ {17, 6250},
+ {18, 6250},
+ {19, 6250},
+ {20, 6250},
+ {21, 3125},
+ {22, 3125},
+ {23, 3125},
+ {24, 3125},
+ };
+
+ #define MAX_RATE_TBL_ENTRY 24
+
+ /*
+ * The delay we use should be reduced based on the size of the iorate;
+ * for higher iorates we want a shorter delay.
+ */
+ static inline hrtime_t
+ zfs_get_delay(ssize_t iorate)
+ {
+ uint_t rate = iorate >> DELAY_SHIFT;
+
+ if (rate > MAX_RATE_TBL_ENTRY)
+ rate = MAX_RATE_TBL_ENTRY;
+ return (zfs_rate_delay_table[rate].rl_delay);
+ }
+
+ /*
+ * ZFS I/O rate throttling
+ * See "Token Bucket" on Wikipedia
+ *
+ * This is "Token Bucket" with some modifications to avoid wait times
+ * longer than a couple seconds, so that we don't trigger NFS retries
+ * or similar. This does mean that concurrent requests might take us
+ * over the rate limit, but that's a lesser evil.
+ */
+ static void
+ zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
+ {
+ zfs_rate_state_t *rate = &zfsvfs->z_rate;
+ hrtime_t now, delta; /* nanoseconds */
+ int64_t refill;
+
+ VERIFY(rate->rate_cap > 0);
+ mutex_enter(&rate->rate_lock);
+
+ /*
+ * If another thread is already waiting, we must queue up behind them.
+ * We'll wait up to 1 sec here. We will normally be woken by cv_signal,
+ * so we don't need fine timer resolution on this wait.
+ */
+ if (rate->rate_token_bucket < 0) {
+ rate->rate_waiters++;
+ (void) cv_timedwait_hires(
+ &rate->rate_wait_cv, &rate->rate_lock,
+ NANOSEC, TR_CLOCK_TICK, 0);
+ rate->rate_waiters--;
+ }
+
+ /*
+ * How long since we last updated the bucket?
+ */
+ now = gethrtime();
+ delta = now - rate->rate_last_update;
+ rate->rate_last_update = now;
+ if (delta < 0)
+ delta = 0; /* paranoid */
+
+ /*
+ * Add "tokens" for time since last update,
+ * being careful about possible overflow.
+ */
+ refill = (delta * rate->rate_cap) / NANOSEC;
+ if (refill < 0 || refill > rate->rate_cap)
+ refill = rate->rate_cap; /* overflow */
+ rate->rate_token_bucket += refill;
+ if (rate->rate_token_bucket > rate->rate_cap)
+ rate->rate_token_bucket = rate->rate_cap;
+
+ /*
+ * Withdraw tokens for the current I/O. If this makes us overdrawn,
+ * wait an amount of time proportionate to the overdraft. However,
+ * as a sanity measure, never wait more than 1 sec, and never try to
+ * wait less than the time it normally takes to block and reschedule.
+ *
+ * Leave the bucket negative while we wait so other threads know to
+ * queue up. In here, "refill" is the debt we're waiting to pay off.
+ */
+ rate->rate_token_bucket -= iosize;
+ if (rate->rate_token_bucket < 0) {
+ hrtime_t zfs_rate_wait = 0;
+
+ refill = rate->rate_token_bucket;
+ DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
+ int64_t, refill);
+
+ if (rate->rate_cap <= 0)
+ goto nocap;
+
+ delta = (refill * NANOSEC) / rate->rate_cap;
+ delta = MIN(delta, NANOSEC);
+
+ zfs_rate_wait = zfs_get_delay(rate->rate_cap);
+
+ if (delta > zfs_rate_wait) {
+ (void) cv_timedwait_hires(
+ &rate->rate_wait_cv, &rate->rate_lock,
+ delta, TR_CLOCK_TICK, 0);
+ }
+
+ rate->rate_token_bucket += refill;
+ }
+ nocap:
+ if (rate->rate_waiters > 0) {
+ cv_signal(&rate->rate_wait_cv);
+ }
+
+ mutex_exit(&rate->rate_lock);
+ }
+
+
offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
/*
* Read bytes from specified file into supplied buffer.
*
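As a rough worked example of the token-bucket arithmetic above: with a 100 MiB/s cap and an empty bucket, a 10 MiB write leaves the bucket 10 MiB overdrawn, so the thread waits on the order of 10/100 of a second (never more than 1 sec). The simplified, single-threaded model below shows only the refill and withdraw steps performed by zfs_rate_throttle(); it omits the locking, condition-variable waits, and delay-table lookup, and its names are illustrative, not part of the driver.

typedef struct example_tb {
	int64_t		tb_bucket;	/* tokens, in bytes; may go negative */
	int64_t		tb_cap;		/* rate cap, in bytes per second */
	hrtime_t	tb_last;	/* last refill time, in nanoseconds */
} example_tb_t;

/* Returns how long (in ns) the caller should wait before issuing this I/O. */
static hrtime_t
example_tb_charge(example_tb_t *tb, ssize_t iosize, hrtime_t now)
{
	hrtime_t delta = now - tb->tb_last;
	int64_t refill;

	tb->tb_last = now;
	if (delta < 0)
		delta = 0;		/* paranoid, as in the driver */

	/* Add tokens for elapsed time, clamped to one full bucket. */
	refill = (delta * tb->tb_cap) / NANOSEC;
	if (refill < 0 || refill > tb->tb_cap)
		refill = tb->tb_cap;
	tb->tb_bucket = MIN(tb->tb_bucket + refill, tb->tb_cap);

	/* Withdraw tokens for this I/O; a negative bucket is an overdraft. */
	tb->tb_bucket -= iosize;
	if (tb->tb_bucket >= 0)
		return (0);

	/* Wait time proportional to the overdraft, capped at 1 sec. */
	return (MIN((-tb->tb_bucket * NANOSEC) / tb->tb_cap, NANOSEC));
}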
*** 550,559 ****
--- 750,765 ----
return (error);
}
}
/*
+ * ZFS I/O rate throttling
+ */
+ if (zfsvfs->z_rate.rate_cap)
+ zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+ /*
* If we're in FRSYNC mode, sync out this znode before reading it.
*/
if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
zil_commit(zfsvfs->z_log, zp->z_id);
*** 713,725 ****
--- 919,935 ----
* See zfs_zaccess_common()
*/
if ((zp->z_pflags & ZFS_IMMUTABLE) ||
((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
(uio->uio_loffset < zp->z_size))) {
+ /* Make sure we're not a WORM before returning EPERM. */
+ if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
+ !zp->z_zfsvfs->z_isworm) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EPERM));
}
+ }
zilog = zfsvfs->z_log;
/*
* Validate file offset
*** 739,748 ****
--- 949,964 ----
ZFS_EXIT(zfsvfs);
return (error);
}
/*
+ * ZFS I/O rate throttling
+ */
+ if (zfsvfs->z_rate.rate_cap)
+ zfs_rate_throttle(zfsvfs, uio->uio_resid);
+
+ /*
* Pre-fault the pages to ensure slow (eg NFS) pages
* don't hold up txg.
* Skip this if uio contains loaned arc_buf.
*/
if ((uio->uio_extflg == UIO_XUIO) &&
*** 1013,1022 ****
--- 1229,1239 ----
ZFS_EXIT(zfsvfs);
return (0);
}
+ /* ARGSUSED */
void
zfs_get_done(zgd_t *zgd, int error)
{
znode_t *zp = zgd->zgd_private;
objset_t *os = zp->z_zfsvfs->z_os;
*** 1030,1042 ****
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
- if (error == 0 && zgd->zgd_bp)
- zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
-
kmem_free(zgd, sizeof (zgd_t));
}
#ifdef DEBUG
static int zil_fault_io = 0;
--- 1247,1256 ----
*** 1156,1170 ****
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
! * it is the old, currently-on-disk BP,
! * so there's no need to zio_flush() its
! * vdevs (flushing would needlesly hurt
! * performance, and doesn't work on
! * indirect vdevs).
*/
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
--- 1370,1380 ----
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
! * it is the old, currently-on-disk BP.
*/
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
*** 1243,1253 ****
static int
zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct,
int *direntflags, pathname_t *realpnp)
{
! znode_t *zdp = VTOZ(dvp);
zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
int error = 0;
/*
* Fast path lookup, however we must skip DNLC lookup
--- 1453,1463 ----
static int
zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct,
int *direntflags, pathname_t *realpnp)
{
! znode_t *zp, *zdp = VTOZ(dvp);
zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
int error = 0;
/*
* Fast path lookup, however we must skip DNLC lookup
*** 1361,1370 ****
--- 1571,1588 ----
}
error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
if (error == 0)
error = specvp_check(vpp, cr);
+ if (*vpp) {
+ zp = VTOZ(*vpp);
+ if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
+ ((*vpp)->v_type != VDIR) &&
+ zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
+ zp->z_pflags |= ZFS_IMMUTABLE;
+ }
+ }
ZFS_EXIT(zfsvfs);
return (error);
}
*** 1396,1405 ****
--- 1614,1624 ----
static int
zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
vsecattr_t *vsecp)
{
+ int imm_was_set = 0;
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
objset_t *os;
zfs_dirlock_t *dl;
*** 1481,1500 ****
--- 1700,1730 ----
}
if (zp == NULL) {
uint64_t txtype;
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
+
/*
* Create a new file object and update the directory
* to reference it.
*/
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
if (have_acl)
zfs_acl_ids_free(&acl_ids);
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
goto out;
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
+
/*
* We only support the creation of regular files in
* extended attribute directories.
*/
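The save/clear/restore dance around ZFS_IMMUTABLE above also appears in zfs_mkdir() and zfs_symlink() below. A sketch of a hypothetical helper pair makes the intended ordering explicit; these helpers do not exist in the source, and the inline pattern is what the diff actually adds.

/*
 * Temporarily lift ZFS_IMMUTABLE on a WORM parent so the access check can
 * pass; returns 1 if the caller must restore the flag afterwards.
 */
static int
example_worm_imm_save(znode_t *dzp)
{
	if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
		dzp->z_pflags &= ~ZFS_IMMUTABLE;
		return (1);
	}
	return (0);
}

/* Restore ZFS_IMMUTABLE whether or not the guarded check succeeded. */
static void
example_worm_imm_restore(znode_t *dzp, int imm_was_set)
{
	if (imm_was_set)
		dzp->z_pflags |= ZFS_IMMUTABLE;
}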
*** 1530,1541 ****
if (!zfsvfs->z_use_sa &&
acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, acl_ids.z_aclp->z_acl_bytes);
}
! error = dmu_tx_assign(tx,
! (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
--- 1760,1770 ----
if (!zfsvfs->z_use_sa &&
acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, acl_ids.z_aclp->z_acl_bytes);
}
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
*** 1550,1559 ****
--- 1779,1791 ----
zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
if (fuid_dirtied)
zfs_fuid_sync(zfsvfs, tx);
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
+
(void) zfs_link_create(dl, zp, tx, ZNEW);
txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
if (flag & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, name,
*** 1582,1598 ****
--- 1814,1847 ----
*/
if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
error = SET_ERROR(EISDIR);
goto out;
}
+ if ((flag & FWRITE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ error = EPERM;
+ goto out;
+ }
+
+ if (!(flag & FAPPEND) &&
+ (zp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ zp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
/*
* Verify requested access to file.
*/
if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
goto out;
}
+ if (imm_was_set)
+ zp->z_pflags |= ZFS_IMMUTABLE;
+
mutex_enter(&dzp->z_lock);
dzp->z_seq++;
mutex_exit(&dzp->z_lock);
/*
*** 1695,1704 ****
--- 1944,1958 ----
return (error);
}
vp = ZTOV(zp);
+ if (zp->z_zfsvfs->z_isworm) {
+ error = SET_ERROR(EPERM);
+ goto out;
+ }
+
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
/*
*** 1761,1771 ****
/*
* Mark this transaction as typically resulting in a net free of space
*/
dmu_tx_mark_netfree(tx);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (xzp)
VN_RELE(ZTOV(xzp));
--- 2015,2025 ----
/*
* Mark this transaction as typically resulting in a net free of space
*/
dmu_tx_mark_netfree(tx);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (xzp)
VN_RELE(ZTOV(xzp));
*** 1888,1897 ****
--- 2142,2152 ----
/*ARGSUSED*/
static int
zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
caller_context_t *ct, int flags, vsecattr_t *vsecp)
{
+ int imm_was_set = 0;
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
uint64_t txtype;
*** 1967,1983 ****
--- 2222,2249 ----
zfs_acl_ids_free(&acl_ids);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
+ dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
+
if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
+
if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EDQUOT));
*** 1998,2008 ****
}
dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
ZFS_SA_BASE_ATTR_SIZE);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
--- 2264,2274 ----
}
dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
ZFS_SA_BASE_ATTR_SIZE);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
*** 2100,2109 ****
--- 2366,2380 ----
return (error);
}
vp = ZTOV(zp);
+ if (dzp->z_zfsvfs->z_isworm) {
+ error = SET_ERROR(EPERM);
+ goto out;
+ }
+
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
if (vp->v_type != VDIR) {
*** 2135,2145 ****
dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
zfs_sa_upgrade_txholds(tx, zp);
zfs_sa_upgrade_txholds(tx, dzp);
dmu_tx_mark_netfree(tx);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
zfs_dirent_unlock(dl);
VN_RELE(vp);
--- 2406,2416 ----
dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
zfs_sa_upgrade_txholds(tx, zp);
zfs_sa_upgrade_txholds(tx, dzp);
dmu_tx_mark_netfree(tx);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
zfs_dirent_unlock(dl);
VN_RELE(vp);
*** 2792,2809 ****
xoap = xva_getxoptattr(xvap);
xva_init(&tmpxvattr);
/*
! * Immutable files can only alter immutable bit and atime
*/
if ((zp->z_pflags & ZFS_IMMUTABLE) &&
((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EPERM));
}
/*
* Note: ZFS_READONLY is handled in zfs_zaccess_common.
*/
--- 3063,3092 ----
xoap = xva_getxoptattr(xvap);
xva_init(&tmpxvattr);
/*
! * Do not allow the immutable bit to be altered after it is set
*/
if ((zp->z_pflags & ZFS_IMMUTABLE) &&
+ XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
+ zp->z_zfsvfs->z_isworm) {
+ ZFS_EXIT(zfsvfs);
+ return (SET_ERROR(EPERM));
+ }
+
+ /*
+ * Immutable files can only alter atime
+ */
+ if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
+ if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
ZFS_EXIT(zfsvfs);
return (SET_ERROR(EPERM));
}
+ }
/*
* Note: ZFS_READONLY is handled in zfs_zaccess_common.
*/
*** 3708,3718 ****
zfs_sa_upgrade_txholds(tx, tzp);
}
zfs_sa_upgrade_txholds(tx, szp);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
--- 3991,4001 ----
zfs_sa_upgrade_txholds(tx, tzp);
}
zfs_sa_upgrade_txholds(tx, szp);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
*** 3832,3841 ****
--- 4115,4125 ----
znode_t *zp, *dzp = VTOZ(dvp);
zfs_dirlock_t *dl;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
+ int imm_was_set = 0;
uint64_t len = strlen(link);
int error;
int zflg = ZNEW;
zfs_acl_ids_t acl_ids;
boolean_t fuid_dirtied;
*** 3875,3890 ****
--- 4159,4182 ----
zfs_acl_ids_free(&acl_ids);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
+ imm_was_set = 1;
+ dzp->z_pflags &= ~ZFS_IMMUTABLE;
+ }
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
+ if (imm_was_set)
+ dzp->z_pflags |= ZFS_IMMUTABLE;
if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
zfs_acl_ids_free(&acl_ids);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
*** 3901,3911 ****
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
acl_ids.z_aclp->z_acl_bytes);
}
if (fuid_dirtied)
zfs_fuid_txhold(zfsvfs, tx);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
--- 4193,4203 ----
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
acl_ids.z_aclp->z_acl_bytes);
}
if (fuid_dirtied)
zfs_fuid_txhold(zfsvfs, tx);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
*** 4122,4132 ****
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
zfs_sa_upgrade_txholds(tx, szp);
zfs_sa_upgrade_txholds(tx, dzp);
! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
--- 4414,4424 ----
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
zfs_sa_upgrade_txholds(tx, szp);
zfs_sa_upgrade_txholds(tx, dzp);
! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART) {
waited = B_TRUE;
dmu_tx_wait(tx);
*** 4398,4444 ****
zil_commit(zfsvfs->z_log, zp->z_id);
ZFS_EXIT(zfsvfs);
return (error);
}
! /*ARGSUSED*/
! void
! zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
{
- znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
! int error;
! rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
if (zp->z_sa_hdl == NULL) {
/*
* The fs has been unmounted, or we did a
* suspend/resume and this file no longer exists.
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
! B_INVAL, cr);
}
mutex_enter(&zp->z_lock);
mutex_enter(&vp->v_lock);
ASSERT(vp->v_count == 1);
VN_RELE_LOCKED(vp);
mutex_exit(&vp->v_lock);
mutex_exit(&zp->z_lock);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
zfs_znode_free(zp);
! return;
}
/*
* Attempt to push any data in the page cache. If this fails
* we will get kicked out later in zfs_zinactive().
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
! cr);
}
if (zp->z_atime_dirty && zp->z_unlinked == 0) {
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
--- 4690,4760 ----
zil_commit(zfsvfs->z_log, zp->z_id);
ZFS_EXIT(zfsvfs);
return (error);
}
! /*
! * Returns B_TRUE and exits the z_teardown_inactive_lock
! * if the znode we are looking at is no longer valid
! */
! static boolean_t
! zfs_znode_free_invalid(znode_t *zp)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
! vnode_t *vp = ZTOV(zp);
! ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
!
if (zp->z_sa_hdl == NULL) {
/*
* The fs has been unmounted, or we did a
* suspend/resume and this file no longer exists.
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
! B_INVAL, CRED());
}
mutex_enter(&zp->z_lock);
mutex_enter(&vp->v_lock);
ASSERT(vp->v_count == 1);
VN_RELE_LOCKED(vp);
mutex_exit(&vp->v_lock);
mutex_exit(&zp->z_lock);
+ VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
+ UINT32_MAX);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
zfs_znode_free(zp);
! return (B_TRUE);
}
+ return (B_FALSE);
+ }
+
+ /*
+ * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
+ * actual freeing.
+ * This code used to be in zfs_inactive() before the async delete patch came in.
+ */
+ static void
+ zfs_inactive_impl(znode_t *zp)
+ {
+ vnode_t *vp = ZTOV(zp);
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+ int error;
+
+ rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+ if (zfs_znode_free_invalid(zp))
+ return; /* z_teardown_inactive_lock already dropped */
+
/*
* Attempt to push any data in the page cache. If this fails
* we will get kicked out later in zfs_zinactive().
*/
if (vn_has_cached_data(vp)) {
(void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
! CRED());
}
if (zp->z_atime_dirty && zp->z_unlinked == 0) {
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
*** 4456,4468 ****
--- 4772,4826 ----
dmu_tx_commit(tx);
}
}
zfs_zinactive(zp);
+
+ VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
+
rw_exit(&zfsvfs->z_teardown_inactive_lock);
}
+ /*
+ * taskq task calls zfs_inactive_impl() so that we can free the znode
+ */
+ static void
+ zfs_inactive_task(void *task_arg)
+ {
+ znode_t *zp = task_arg;
+ ASSERT(zp != NULL);
+ zfs_inactive_impl(zp);
+ }
+
+ /*ARGSUSED*/
+ void
+ zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
+ {
+ znode_t *zp = VTOZ(vp);
+ zfsvfs_t *zfsvfs = zp->z_zfsvfs;
+
+ rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
+
+ VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
+
+ if (zfs_znode_free_invalid(zp))
+ return; /* z_teardown_inactive_lock already dropped */
+
+ if (zfs_do_async_free &&
+ zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
+ taskq_dispatch(dsl_pool_vnrele_taskq(
+ dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
+ zp, TQ_NOSLEEP) != NULL) {
+ rw_exit(&zfsvfs->z_teardown_inactive_lock);
+ return; /* task dispatched, we're done */
+ }
+ rw_exit(&zfsvfs->z_teardown_inactive_lock);
+
+ /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
+ zfs_inactive_impl(zp);
+ }
+
/*
* Bounds-check the seek operation.
*
* IN: vp - vnode seeking within
* ooff - old file offset