NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)

*** 19,37 **** * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2012, 2017 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] * Copyright 2015 Joyent, Inc. * Copyright 2017 Nexenta Systems, Inc. */ - /* Portions Copyright 2007 Jeremy Teo */ - /* Portions Copyright 2010 Robert Milkowski */ - #include <sys/types.h> #include <sys/param.h> #include <sys/time.h> #include <sys/systm.h> #include <sys/sysmacros.h> --- 19,36 ---- * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. + * Portions Copyright 2007 Jeremy Teo + * Portions Copyright 2010 Robert Milkowski * Copyright (c) 2012, 2017 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] * Copyright 2015 Joyent, Inc. * Copyright 2017 Nexenta Systems, Inc. */ #include <sys/types.h> #include <sys/param.h> #include <sys/time.h> #include <sys/systm.h> #include <sys/sysmacros.h>
*** 81,90 **** --- 80,90 ---- #include <sys/zfs_rlock.h> #include <sys/extdirent.h> #include <sys/kidmap.h> #include <sys/cred.h> #include <sys/attr.h> + #include <sys/dsl_prop.h> #include <sys/zil.h> /* * Programming rules. *
*** 133,143 **** * Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open() * forever, because the previous txg can't quiesce until B's tx commits. * * If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT, * then drop all locks, call dmu_tx_wait(), and try again. On subsequent ! * calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT, * to indicate that this operation has already called dmu_tx_wait(). * This will ensure that we don't retry forever, waiting a short bit * each time. * * (5) If the operation succeeded, generate the intent log entry for it --- 133,143 ---- * Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open() * forever, because the previous txg can't quiesce until B's tx commits. * * If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT, * then drop all locks, call dmu_tx_wait(), and try again. On subsequent ! * calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT, * to indicate that this operation has already called dmu_tx_wait(). * This will ensure that we don't retry forever, waiting a short bit * each time. * * (5) If the operation succeeded, generate the intent log entry for it
*** 158,168 **** * top: * zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD()) * rw_enter(...); // grab any other locks you need * tx = dmu_tx_create(...); // get DMU tx * dmu_tx_hold_*(); // hold each object you might modify ! * error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); * if (error) { * rw_exit(...); // drop locks * zfs_dirent_unlock(dl); // unlock directory entry * VN_RELE(...); // release held vnodes * if (error == ERESTART) { --- 158,168 ---- * top: * zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD()) * rw_enter(...); // grab any other locks you need * tx = dmu_tx_create(...); // get DMU tx * dmu_tx_hold_*(); // hold each object you might modify ! * error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); * if (error) { * rw_exit(...); // drop locks * zfs_dirent_unlock(dl); // unlock directory entry * VN_RELE(...); // release held vnodes * if (error == ERESTART) {
*** 185,194 **** --- 185,235 ---- * zil_commit(zilog, foid); // synchronous when necessary * ZFS_EXIT(zfsvfs); // finished in zfs * return (error); // done, report error */ + /* set this tunable to zero to disable asynchronous freeing of files */ + boolean_t zfs_do_async_free = B_TRUE; + + /* + * This value will be multiplied by zfs_dirty_data_max to determine + * the threshold past which we will call zfs_inactive_impl() async. + * + * Selecting the multiplier is a balance between how long we're willing to wait + * for delete/free to complete (get shell back, have a NFS thread captive, etc) + * and reducing the number of active requests in the backing taskq. + * + * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB + * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB + * + * WARNING: Setting this tunable to zero will enable asynchronous freeing for + * all files which can have undesirable side effects. + */ + uint16_t zfs_inactive_async_multiplier = 16; + + int nms_worm_transition_time = 30; + int + zfs_worm_in_trans(znode_t *zp) + { + zfsvfs_t *zfsvfs = zp->z_zfsvfs; + timestruc_t now; + sa_bulk_attr_t bulk[2]; + uint64_t ctime[2]; + int count = 0; + + if (!nms_worm_transition_time) + return (0); + + gethrestime(&now); + SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL, + &ctime, sizeof (ctime)); + if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0) + return (0); + + return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time); + } + /* ARGSUSED */ static int zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct) { znode_t *zp = VTOZ(*vpp);
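The threshold arithmetic described in the new comment is easy to misread, so here is a minimal user-space sketch of the same computation (the 4 GiB figure assumes zfs_dirty_data_max is sitting at the cap the comment mentions; the program and its output are illustrative, not part of the patch):

	#include <stdio.h>
	#include <stdint.h>

	int
	main(void)
	{
		/* Assumed default: zfs_dirty_data_max capped at 4 GiB. */
		uint64_t zfs_dirty_data_max = 4ULL << 30;
		uint16_t zfs_inactive_async_multiplier = 16;	/* default above */
		uint64_t threshold = (uint64_t)zfs_inactive_async_multiplier *
		    zfs_dirty_data_max;

		/* 16 * 4 GiB = 64 GiB: znodes larger than this are freed async. */
		printf("async free threshold: %llu GiB\n",
		    (unsigned long long)(threshold >> 30));
		return (0);
	}

If these tunables ever need adjusting on a live illumos-derived system, the usual route would be an /etc/system entry such as "set zfs:zfs_do_async_free = 0", assuming the variables are exported from the zfs module as the comment implies.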
*** 225,240 **** zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr, caller_context_t *ct) { znode_t *zp = VTOZ(vp); zfsvfs_t *zfsvfs = zp->z_zfsvfs; /* * Clean up any locks held by this process on the vp. */ ! cleanlocks(vp, ddi_get_pid(), 0); ! cleanshares(vp, ddi_get_pid()); ZFS_ENTER(zfsvfs); ZFS_VERIFY_ZP(zp); /* Decrement the synchronous opens in the znode */ --- 266,282 ---- zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr, caller_context_t *ct) { znode_t *zp = VTOZ(vp); zfsvfs_t *zfsvfs = zp->z_zfsvfs; + pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid(); /* * Clean up any locks held by this process on the vp. */ ! cleanlocks(vp, caller_pid, 0); ! cleanshares(vp, caller_pid); ZFS_ENTER(zfsvfs); ZFS_VERIFY_ZP(zp); /* Decrement the synchronous opens in the znode */
*** 484,493 **** --- 526,693 ---- break; } return (error); } + + /* + * ZFS I/O rate throttling + */ + + #define DELAY_SHIFT 24 + + typedef struct zfs_rate_delay { + uint_t rl_rate; + hrtime_t rl_delay; + } zfs_rate_delay_t; + + /* + * The time we'll attempt to cv_wait (below), in nSec. + * This should be no less than the minimum time it normally takes + * to block a thread and wake back up after the timeout fires. + * + * Each table entry represents the delay for each 4MB of bandwidth. + * We reduce the delay as the size of the I/O increases. + */ + zfs_rate_delay_t zfs_rate_delay_table[] = { + {0, 100000}, + {1, 100000}, + {2, 100000}, + {3, 100000}, + {4, 100000}, + {5, 50000}, + {6, 50000}, + {7, 50000}, + {8, 50000}, + {9, 25000}, + {10, 25000}, + {11, 25000}, + {12, 25000}, + {13, 12500}, + {14, 12500}, + {15, 12500}, + {16, 12500}, + {17, 6250}, + {18, 6250}, + {19, 6250}, + {20, 6250}, + {21, 3125}, + {22, 3125}, + {23, 3125}, + {24, 3125}, + }; + + #define MAX_RATE_TBL_ENTRY 24 + + /* + * The delay we use should be reduced based on the size of the iorate; + * for higher iorates we want a shorter delay. + */ + static inline hrtime_t + zfs_get_delay(ssize_t iorate) + { + uint_t rate = iorate >> DELAY_SHIFT; + + if (rate > MAX_RATE_TBL_ENTRY) + rate = MAX_RATE_TBL_ENTRY; + return (zfs_rate_delay_table[rate].rl_delay); + } + + /* + * ZFS I/O rate throttling + * See "Token Bucket" on Wikipedia + * + * This is "Token Bucket" with some modifications to avoid wait times + * longer than a couple seconds, so that we don't trigger NFS retries + * or similar. This does mean that concurrent requests might take us + * over the rate limit, but that's a lesser evil. + */ + static void + zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize) + { + zfs_rate_state_t *rate = &zfsvfs->z_rate; + hrtime_t now, delta; /* nanoseconds */ + int64_t refill; + + VERIFY(rate->rate_cap > 0); + mutex_enter(&rate->rate_lock); + + /* + * If another thread is already waiting, we must queue up behind them. + * We'll wait up to 1 sec here. We normally will resume by cv_signal, + * so we don't need fine timer resolution on this wait. + */ + if (rate->rate_token_bucket < 0) { + rate->rate_waiters++; + (void) cv_timedwait_hires( + &rate->rate_wait_cv, &rate->rate_lock, + NANOSEC, TR_CLOCK_TICK, 0); + rate->rate_waiters--; + } + + /* + * How long since we last updated the bucket? + */ + now = gethrtime(); + delta = now - rate->rate_last_update; + rate->rate_last_update = now; + if (delta < 0) + delta = 0; /* paranoid */ + + /* + * Add "tokens" for time since last update, + * being careful about possible overflow. + */ + refill = (delta * rate->rate_cap) / NANOSEC; + if (refill < 0 || refill > rate->rate_cap) + refill = rate->rate_cap; /* overflow */ + rate->rate_token_bucket += refill; + if (rate->rate_token_bucket > rate->rate_cap) + rate->rate_token_bucket = rate->rate_cap; + + /* + * Withdraw tokens for the current I/O. If this makes us overdrawn, + * wait an amount of time proportionate to the overdraft. However, + * as a sanity measure, never wait more than 1 sec, and never try to + * wait less than the time it normally takes to block and reschedule. + * + * Leave the bucket negative while we wait so other threads know to + * queue up. In here, "refill" is the debt we're waiting to pay off. + */ + rate->rate_token_bucket -= iosize; + if (rate->rate_token_bucket < 0) { + hrtime_t zfs_rate_wait = 0; + + refill = rate->rate_token_bucket; + DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs, + int64_t, refill); + + if (rate->rate_cap <= 0) + goto nocap; + + delta = (refill * NANOSEC) / rate->rate_cap; + delta = MIN(delta, NANOSEC); + + zfs_rate_wait = zfs_get_delay(rate->rate_cap); + + if (delta > zfs_rate_wait) { + (void) cv_timedwait_hires( + &rate->rate_wait_cv, &rate->rate_lock, + delta, TR_CLOCK_TICK, 0); + } + + rate->rate_token_bucket += refill; + } + nocap: + if (rate->rate_waiters > 0) { + cv_signal(&rate->rate_wait_cv); + } + + mutex_exit(&rate->rate_lock); + } + + offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */ /* * Read bytes from specified file into supplied buffer. *
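For readers who do not want to chase the "Token Bucket" reference in the comment above, the following is a stripped-down, single-threaded user-space sketch of the same idea: refill proportional to elapsed time, overdraft paid off by sleeping, capped at one second. Names, the monotonic clock, and the usleep() back-off are illustrative; the kernel code above additionally serializes waiters on a condvar under a mutex.

	#include <stdint.h>
	#include <time.h>
	#include <unistd.h>

	#define	NANOSEC	1000000000LL

	typedef struct tbucket {
		int64_t	tb_cap;		/* allowed bytes per second */
		int64_t	tb_tokens;	/* bytes available (may go negative) */
		int64_t	tb_last;	/* last refill, monotonic nanoseconds */
	} tbucket_t;

	static int64_t
	now_ns(void)
	{
		struct timespec ts;

		(void) clock_gettime(CLOCK_MONOTONIC, &ts);
		return ((int64_t)ts.tv_sec * NANOSEC + ts.tv_nsec);
	}

	/* Charge one I/O of `size` bytes, sleeping if the bucket is overdrawn. */
	static void
	tbucket_throttle(tbucket_t *tb, int64_t size)
	{
		int64_t now = now_ns();
		int64_t refill = ((now - tb->tb_last) * tb->tb_cap) / NANOSEC;

		tb->tb_last = now;
		tb->tb_tokens += refill;
		if (tb->tb_tokens > tb->tb_cap)	/* never bank more than 1 sec */
			tb->tb_tokens = tb->tb_cap;

		tb->tb_tokens -= size;
		if (tb->tb_tokens < 0) {
			/* Sleep long enough to pay off the debt, capped at 1 sec. */
			int64_t debt_ns = (-tb->tb_tokens * NANOSEC) / tb->tb_cap;

			if (debt_ns > NANOSEC)
				debt_ns = NANOSEC;
			(void) usleep((useconds_t)(debt_ns / 1000));
		}
	}

	int
	main(void)
	{
		tbucket_t tb = { 8 << 20, 8 << 20, 0 };	/* 8 MiB/s cap */

		tb.tb_last = now_ns();
		for (int i = 0; i < 10; i++)
			tbucket_throttle(&tb, 4 << 20);	/* 4 MiB "I/Os" */
		return (0);
	}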
*** 550,559 **** --- 750,765 ---- return (error); } } /* + * ZFS I/O rate throttling + */ + if (zfsvfs->z_rate.rate_cap) + zfs_rate_throttle(zfsvfs, uio->uio_resid); + + /* * If we're in FRSYNC mode, sync out this znode before reading it. */ if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS) zil_commit(zfsvfs->z_log, zp->z_id);
*** 713,725 **** --- 919,935 ---- * See zfs_zaccess_common() */ if ((zp->z_pflags & ZFS_IMMUTABLE) || ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) && (uio->uio_loffset < zp->z_size))) { + /* Make sure we're not a WORM before returning EPERM. */ + if (!(zp->z_pflags & ZFS_IMMUTABLE) || + !zp->z_zfsvfs->z_isworm) { ZFS_EXIT(zfsvfs); return (SET_ERROR(EPERM)); } + } zilog = zfsvfs->z_log; /* * Validate file offset
*** 739,748 **** --- 949,964 ---- ZFS_EXIT(zfsvfs); return (error); } /* + * ZFS I/O rate throttling + */ + if (zfsvfs->z_rate.rate_cap) + zfs_rate_throttle(zfsvfs, uio->uio_resid); + + /* * Pre-fault the pages to ensure slow (eg NFS) pages * don't hold up txg. * Skip this if uio contains loaned arc_buf. */ if ((uio->uio_extflg == UIO_XUIO) &&
*** 1013,1022 **** --- 1229,1239 ---- ZFS_EXIT(zfsvfs); return (0); } + /* ARGSUSED */ void zfs_get_done(zgd_t *zgd, int error) { znode_t *zp = zgd->zgd_private; objset_t *os = zp->z_zfsvfs->z_os;
*** 1030,1042 **** * Release the vnode asynchronously as we currently have the * txg stopped from syncing. */ VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os))); - if (error == 0 && zgd->zgd_bp) - zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp); - kmem_free(zgd, sizeof (zgd_t)); } #ifdef DEBUG static int zil_fault_io = 0; --- 1247,1256 ----
*** 1156,1170 **** lr->lr_common.lrc_txtype = TX_WRITE2; /* * TX_WRITE2 relies on the data previously * written by the TX_WRITE that caused * EALREADY. We zero out the BP because ! * it is the old, currently-on-disk BP, ! * so there's no need to zio_flush() its ! * vdevs (flushing would needlesly hurt ! * performance, and doesn't work on ! * indirect vdevs). */ zgd->zgd_bp = NULL; BP_ZERO(bp); error = 0; } --- 1370,1380 ---- lr->lr_common.lrc_txtype = TX_WRITE2; /* * TX_WRITE2 relies on the data previously * written by the TX_WRITE that caused * EALREADY. We zero out the BP because ! * it is the old, currently-on-disk BP. */ zgd->zgd_bp = NULL; BP_ZERO(bp); error = 0; }
*** 1243,1253 **** static int zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp, int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct, int *direntflags, pathname_t *realpnp) { ! znode_t *zdp = VTOZ(dvp); zfsvfs_t *zfsvfs = zdp->z_zfsvfs; int error = 0; /* * Fast path lookup, however we must skip DNLC lookup --- 1453,1463 ---- static int zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp, int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct, int *direntflags, pathname_t *realpnp) { ! znode_t *zp, *zdp = VTOZ(dvp); zfsvfs_t *zfsvfs = zdp->z_zfsvfs; int error = 0; /* * Fast path lookup, however we must skip DNLC lookup
*** 1361,1370 **** --- 1571,1588 ---- } error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp); if (error == 0) error = specvp_check(vpp, cr); + if (*vpp) { + zp = VTOZ(*vpp); + if (!(zp->z_pflags & ZFS_IMMUTABLE) && + ((*vpp)->v_type != VDIR) && + zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) { + zp->z_pflags |= ZFS_IMMUTABLE; + } + } ZFS_EXIT(zfsvfs); return (error); }
*** 1396,1405 **** --- 1614,1624 ---- static int zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl, int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct, vsecattr_t *vsecp) { + int imm_was_set = 0; znode_t *zp, *dzp = VTOZ(dvp); zfsvfs_t *zfsvfs = dzp->z_zfsvfs; zilog_t *zilog; objset_t *os; zfs_dirlock_t *dl;
*** 1481,1500 **** --- 1700,1730 ---- } if (zp == NULL) { uint64_t txtype; + if ((dzp->z_pflags & ZFS_IMMUTABLE) && + dzp->z_zfsvfs->z_isworm) { + imm_was_set = 1; + dzp->z_pflags &= ~ZFS_IMMUTABLE; + } + /* * Create a new file object and update the directory * to reference it. */ if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) { if (have_acl) zfs_acl_ids_free(&acl_ids); + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; goto out; } + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; + /* * We only support the creation of regular files in * extended attribute directories. */
*** 1530,1541 **** if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) { dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, acl_ids.z_aclp->z_acl_bytes); } ! error = dmu_tx_assign(tx, ! (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx); --- 1760,1770 ---- if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) { dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, acl_ids.z_aclp->z_acl_bytes); } ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx);
*** 1550,1559 **** --- 1779,1791 ---- zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids); if (fuid_dirtied) zfs_fuid_sync(zfsvfs, tx); + if (imm_was_set) + zp->z_pflags |= ZFS_IMMUTABLE; + (void) zfs_link_create(dl, zp, tx, ZNEW); txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap); if (flag & FIGNORECASE) txtype |= TX_CI; zfs_log_create(zilog, tx, txtype, dzp, zp, name,
*** 1582,1598 **** --- 1814,1847 ---- */ if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) { error = SET_ERROR(EISDIR); goto out; } + if ((flag & FWRITE) && + dzp->z_zfsvfs->z_isworm) { + error = EPERM; + goto out; + } + + if (!(flag & FAPPEND) && + (zp->z_pflags & ZFS_IMMUTABLE) && + dzp->z_zfsvfs->z_isworm) { + imm_was_set = 1; + zp->z_pflags &= ~ZFS_IMMUTABLE; + } /* * Verify requested access to file. */ if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) { + if (imm_was_set) + zp->z_pflags |= ZFS_IMMUTABLE; goto out; } + if (imm_was_set) + zp->z_pflags |= ZFS_IMMUTABLE; + mutex_enter(&dzp->z_lock); dzp->z_seq++; mutex_exit(&dzp->z_lock); /*
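The create path above (and the mkdir and symlink paths later in the diff) all repeat the same save, clear, restore dance on ZFS_IMMUTABLE so that a WORM parent directory still accepts new entries. A hypothetical helper pair, not present in the patch, shows the intent more compactly:

	/*
	 * Sketch only: temporarily lift ZFS_IMMUTABLE from a WORM znode so an
	 * access check or link creation can proceed, then restore it. The
	 * diff open-codes this with the imm_was_set local in each caller.
	 */
	static int
	zfs_worm_lift_immutable(znode_t *zp)
	{
		if ((zp->z_pflags & ZFS_IMMUTABLE) && zp->z_zfsvfs->z_isworm) {
			zp->z_pflags &= ~ZFS_IMMUTABLE;
			return (1);		/* caller must restore the flag */
		}
		return (0);
	}

	static void
	zfs_worm_restore_immutable(znode_t *zp, int was_set)
	{
		if (was_set)
			zp->z_pflags |= ZFS_IMMUTABLE;
	}

Usage would mirror the diff: lift before zfs_zaccess(), restore immediately afterwards, and in zfs_create() also re-apply the flag to the newly created znode.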
*** 1695,1704 **** --- 1944,1958 ---- return (error); } vp = ZTOV(zp); + if (zp->z_zfsvfs->z_isworm) { + error = SET_ERROR(EPERM); + goto out; + } + if (error = zfs_zaccess_delete(dzp, zp, cr)) { goto out; } /*
*** 1761,1771 **** /* * Mark this transaction as typically resulting in a net free of space */ dmu_tx_mark_netfree(tx); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); VN_RELE(vp); if (xzp) VN_RELE(ZTOV(xzp)); --- 2015,2025 ---- /* * Mark this transaction as typically resulting in a net free of space */ dmu_tx_mark_netfree(tx); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); VN_RELE(vp); if (xzp) VN_RELE(ZTOV(xzp));
*** 1888,1897 **** --- 2142,2152 ---- /*ARGSUSED*/ static int zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr, caller_context_t *ct, int flags, vsecattr_t *vsecp) { + int imm_was_set = 0; znode_t *zp, *dzp = VTOZ(dvp); zfsvfs_t *zfsvfs = dzp->z_zfsvfs; zilog_t *zilog; zfs_dirlock_t *dl; uint64_t txtype;
*** 1967,1983 **** --- 2222,2249 ---- zfs_acl_ids_free(&acl_ids); ZFS_EXIT(zfsvfs); return (error); } + if ((dzp->z_pflags & ZFS_IMMUTABLE) && + dzp->z_zfsvfs->z_isworm) { + imm_was_set = 1; + dzp->z_pflags &= ~ZFS_IMMUTABLE; + } + if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) { + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; zfs_acl_ids_free(&acl_ids); zfs_dirent_unlock(dl); ZFS_EXIT(zfsvfs); return (error); } + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; + if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) { zfs_acl_ids_free(&acl_ids); zfs_dirent_unlock(dl); ZFS_EXIT(zfsvfs); return (SET_ERROR(EDQUOT));
*** 1998,2008 **** } dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes + ZFS_SA_BASE_ATTR_SIZE); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx); --- 2264,2274 ---- } dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes + ZFS_SA_BASE_ATTR_SIZE); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx);
*** 2100,2109 **** --- 2366,2380 ---- return (error); } vp = ZTOV(zp); + if (dzp->z_zfsvfs->z_isworm) { + error = SET_ERROR(EPERM); + goto out; + } + if (error = zfs_zaccess_delete(dzp, zp, cr)) { goto out; } if (vp->v_type != VDIR) {
*** 2135,2145 **** dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE); dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL); zfs_sa_upgrade_txholds(tx, zp); zfs_sa_upgrade_txholds(tx, dzp); dmu_tx_mark_netfree(tx); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { rw_exit(&zp->z_parent_lock); rw_exit(&zp->z_name_lock); zfs_dirent_unlock(dl); VN_RELE(vp); --- 2406,2416 ---- dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE); dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL); zfs_sa_upgrade_txholds(tx, zp); zfs_sa_upgrade_txholds(tx, dzp); dmu_tx_mark_netfree(tx); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { rw_exit(&zp->z_parent_lock); rw_exit(&zp->z_name_lock); zfs_dirent_unlock(dl); VN_RELE(vp);
*** 2792,2809 **** xoap = xva_getxoptattr(xvap); xva_init(&tmpxvattr); /* ! * Immutable files can only alter immutable bit and atime */ if ((zp->z_pflags & ZFS_IMMUTABLE) && ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) || ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) { ZFS_EXIT(zfsvfs); return (SET_ERROR(EPERM)); } /* * Note: ZFS_READONLY is handled in zfs_zaccess_common. */ --- 3063,3092 ---- xoap = xva_getxoptattr(xvap); xva_init(&tmpxvattr); /* ! * Do not allow to alter immutable bit after it is set */ if ((zp->z_pflags & ZFS_IMMUTABLE) && + XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) && + zp->z_zfsvfs->z_isworm) { + ZFS_EXIT(zfsvfs); + return (SET_ERROR(EPERM)); + } + + /* + * Immutable files can only alter atime + */ + if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) && ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) || ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) { + if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) { ZFS_EXIT(zfsvfs); return (SET_ERROR(EPERM)); } + } /* * Note: ZFS_READONLY is handled in zfs_zaccess_common. */
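Restated as a single predicate (a schematic of the logic above, not code from the patch), the zfs_setattr() checks added here amount to:

	/*
	 * Schematic of the WORM checks added to zfs_setattr():
	 *  - on a WORM file system the immutable bit, once set, can never be
	 *    changed again;
	 *  - size/uid/gid/mtime/mode (and crtime) updates are refused for
	 *    immutable or WORM files unless the file is still inside the
	 *    nms_worm_transition_time grace window after creation.
	 */
	static int
	worm_setattr_denied(znode_t *zp, uint_t mask, xvattr_t *xvap)
	{
		boolean_t worm = zp->z_zfsvfs->z_isworm;
		boolean_t immutable = (zp->z_pflags & ZFS_IMMUTABLE) != 0;

		if (immutable && worm && XVA_ISSET_REQ(xvap, XAT_IMMUTABLE))
			return (1);			/* EPERM */

		if ((immutable || worm) &&
		    ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
		    ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME))) &&
		    (!worm || !zfs_worm_in_trans(zp)))
			return (1);			/* EPERM */

		return (0);
	}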
*** 3708,3718 **** zfs_sa_upgrade_txholds(tx, tzp); } zfs_sa_upgrade_txholds(tx, szp); dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { if (zl != NULL) zfs_rename_unlock(&zl); zfs_dirent_unlock(sdl); zfs_dirent_unlock(tdl); --- 3991,4001 ---- zfs_sa_upgrade_txholds(tx, tzp); } zfs_sa_upgrade_txholds(tx, szp); dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { if (zl != NULL) zfs_rename_unlock(&zl); zfs_dirent_unlock(sdl); zfs_dirent_unlock(tdl);
*** 3832,3841 **** --- 4115,4125 ---- znode_t *zp, *dzp = VTOZ(dvp); zfs_dirlock_t *dl; dmu_tx_t *tx; zfsvfs_t *zfsvfs = dzp->z_zfsvfs; zilog_t *zilog; + int imm_was_set = 0; uint64_t len = strlen(link); int error; int zflg = ZNEW; zfs_acl_ids_t acl_ids; boolean_t fuid_dirtied;
*** 3875,3890 **** --- 4159,4182 ---- zfs_acl_ids_free(&acl_ids); ZFS_EXIT(zfsvfs); return (error); } + if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) { + imm_was_set = 1; + dzp->z_pflags &= ~ZFS_IMMUTABLE; + } if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) { + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; zfs_acl_ids_free(&acl_ids); zfs_dirent_unlock(dl); ZFS_EXIT(zfsvfs); return (error); } + if (imm_was_set) + dzp->z_pflags |= ZFS_IMMUTABLE; if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) { zfs_acl_ids_free(&acl_ids); zfs_dirent_unlock(dl); ZFS_EXIT(zfsvfs);
*** 3901,3911 **** dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, acl_ids.z_aclp->z_acl_bytes); } if (fuid_dirtied) zfs_fuid_txhold(zfsvfs, tx); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx); --- 4193,4203 ---- dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, acl_ids.z_aclp->z_acl_bytes); } if (fuid_dirtied) zfs_fuid_txhold(zfsvfs, tx); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx);
*** 4122,4132 **** tx = dmu_tx_create(zfsvfs->z_os); dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE); dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name); zfs_sa_upgrade_txholds(tx, szp); zfs_sa_upgrade_txholds(tx, dzp); ! error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx); --- 4414,4424 ---- tx = dmu_tx_create(zfsvfs->z_os); dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE); dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name); zfs_sa_upgrade_txholds(tx, szp); zfs_sa_upgrade_txholds(tx, dzp); ! error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); if (error) { zfs_dirent_unlock(dl); if (error == ERESTART) { waited = B_TRUE; dmu_tx_wait(tx);
*** 4398,4444 **** zil_commit(zfsvfs->z_log, zp->z_id); ZFS_EXIT(zfsvfs); return (error); } ! /*ARGSUSED*/ ! void ! zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct) { - znode_t *zp = VTOZ(vp); zfsvfs_t *zfsvfs = zp->z_zfsvfs; ! int error; ! rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER); if (zp->z_sa_hdl == NULL) { /* * The fs has been unmounted, or we did a * suspend/resume and this file no longer exists. */ if (vn_has_cached_data(vp)) { (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage, ! B_INVAL, cr); } mutex_enter(&zp->z_lock); mutex_enter(&vp->v_lock); ASSERT(vp->v_count == 1); VN_RELE_LOCKED(vp); mutex_exit(&vp->v_lock); mutex_exit(&zp->z_lock); rw_exit(&zfsvfs->z_teardown_inactive_lock); zfs_znode_free(zp); ! return; } /* * Attempt to push any data in the page cache. If this fails * we will get kicked out later in zfs_zinactive(). */ if (vn_has_cached_data(vp)) { (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC, ! cr); } if (zp->z_atime_dirty && zp->z_unlinked == 0) { dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os); --- 4690,4760 ---- zil_commit(zfsvfs->z_log, zp->z_id); ZFS_EXIT(zfsvfs); return (error); } ! /* ! * Returns B_TRUE and exits the z_teardown_inactive_lock ! * if the znode we are looking at is no longer valid ! */ ! static boolean_t ! zfs_znode_free_invalid(znode_t *zp) { zfsvfs_t *zfsvfs = zp->z_zfsvfs; ! vnode_t *vp = ZTOV(zp); ! ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock)); ! if (zp->z_sa_hdl == NULL) { /* * The fs has been unmounted, or we did a * suspend/resume and this file no longer exists. */ if (vn_has_cached_data(vp)) { (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage, ! B_INVAL, CRED()); } mutex_enter(&zp->z_lock); mutex_enter(&vp->v_lock); ASSERT(vp->v_count == 1); VN_RELE_LOCKED(vp); mutex_exit(&vp->v_lock); mutex_exit(&zp->z_lock); + VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != + UINT32_MAX); rw_exit(&zfsvfs->z_teardown_inactive_lock); zfs_znode_free(zp); ! return (B_TRUE); } + return (B_FALSE); + } + + /* + * Does the prep work for freeing the znode, then calls zfs_zinactive to do the + * actual freeing. + * This code used be in zfs_inactive() before the async delete patch came in + */ + static void + zfs_inactive_impl(znode_t *zp) + { + vnode_t *vp = ZTOV(zp); + zfsvfs_t *zfsvfs = zp->z_zfsvfs; + int error; + + rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER); + if (zfs_znode_free_invalid(zp)) + return; /* z_teardown_inactive_lock already dropped */ + /* * Attempt to push any data in the page cache. If this fails * we will get kicked out later in zfs_zinactive(). */ if (vn_has_cached_data(vp)) { (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC, ! CRED()); } if (zp->z_atime_dirty && zp->z_unlinked == 0) { dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
*** 4456,4468 **** --- 4772,4826 ---- dmu_tx_commit(tx); } } zfs_zinactive(zp); + + VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX); + rw_exit(&zfsvfs->z_teardown_inactive_lock); } + /* + * taskq task calls zfs_inactive_impl() so that we can free the znode + */ + static void + zfs_inactive_task(void *task_arg) + { + znode_t *zp = task_arg; + ASSERT(zp != NULL); + zfs_inactive_impl(zp); + } + + /*ARGSUSED*/ + void + zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct) + { + znode_t *zp = VTOZ(vp); + zfsvfs_t *zfsvfs = zp->z_zfsvfs; + + rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER); + + VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0); + + if (zfs_znode_free_invalid(zp)) + return; /* z_teardown_inactive_lock already dropped */ + + if (zfs_do_async_free && + zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max && + taskq_dispatch(dsl_pool_vnrele_taskq( + dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task, + zp, TQ_NOSLEEP) != NULL) { + rw_exit(&zfsvfs->z_teardown_inactive_lock); + return; /* task dispatched, we're done */ + } + rw_exit(&zfsvfs->z_teardown_inactive_lock); + + /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */ + zfs_inactive_impl(zp); + } + /* * Bounds-check the seek operation. * * IN: vp - vnode seeking within * ooff - old file offset
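The reworked zfs_inactive() above boils down to one idiom: try to hand the expensive free to the pool's vnrele taskq without sleeping, and if the dispatch fails (or the znode is below the size threshold, or async freeing is disabled) do the work inline. A generic sketch of that shape, with placeholder names and assuming only <sys/taskq.h>; the failure check mirrors the "!= NULL" test used in the diff, where newer illumos code would compare against TASKQID_INVALID:

	/*
	 * Illustrative only: queue func(arg) on a taskq if it can be done
	 * without blocking, otherwise run it synchronously in this thread.
	 */
	static void
	dispatch_or_inline(taskq_t *tq, task_func_t *func, void *arg)
	{
		if (taskq_dispatch(tq, func, arg, TQ_NOSLEEP) == NULL)
			func(arg);
	}

In the patch both paths are bracketed by the z_znodes_freeing_cnt counter, presumably so that teardown and unmount can account for frees still in flight.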