NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)

          --- old/usr/src/uts/common/fs/zfs/zfs_vnops.c
          +++ new/usr/src/uts/common/fs/zfs/zfs_vnops.c
[13 lines elided]
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
       24 + * Portions Copyright 2007 Jeremy Teo
       25 + * Portions Copyright 2010 Robert Milkowski
  24   26   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  25   27   * Copyright (c) 2014 Integros [integros.com]
  26   28   * Copyright 2015 Joyent, Inc.
  27   29   * Copyright 2017 Nexenta Systems, Inc.
  28   30   */
  29   31  
  30      -/* Portions Copyright 2007 Jeremy Teo */
  31      -/* Portions Copyright 2010 Robert Milkowski */
  32      -
  33   32  #include <sys/types.h>
  34   33  #include <sys/param.h>
  35   34  #include <sys/time.h>
  36   35  #include <sys/systm.h>
  37   36  #include <sys/sysmacros.h>
  38   37  #include <sys/resource.h>
  39   38  #include <sys/vfs.h>
  40   39  #include <sys/vfs_opreg.h>
  41   40  #include <sys/vnode.h>
  42   41  #include <sys/file.h>
[33 lines elided]
  76   75  #include "fs/fs_subr.h"
  77   76  #include <sys/zfs_ctldir.h>
  78   77  #include <sys/zfs_fuid.h>
  79   78  #include <sys/zfs_sa.h>
  80   79  #include <sys/dnlc.h>
  81   80  #include <sys/zfs_rlock.h>
  82   81  #include <sys/extdirent.h>
  83   82  #include <sys/kidmap.h>
  84   83  #include <sys/cred.h>
  85   84  #include <sys/attr.h>
       85 +#include <sys/dsl_prop.h>
  86   86  #include <sys/zil.h>
  87   87  
  88   88  /*
  89   89   * Programming rules.
  90   90   *
  91   91   * Each vnode op performs some logical unit of work.  To do this, the ZPL must
  92   92   * properly lock its in-core state, create a DMU transaction, do the work,
  93   93   * record this work in the intent log (ZIL), commit the DMU transaction,
  94   94   * and wait for the intent log to commit if it is a synchronous operation.
  95   95   * Moreover, the vnode ops must work in both normal and log replay context.
[32 lines elided]
 128  128   *      the tx assigns, and sometimes after (e.g. z_lock), then failing
 129  129   *      to use a non-blocking assign can deadlock the system.  The scenario:
 130  130   *
 131  131   *      Thread A has grabbed a lock before calling dmu_tx_assign().
 132  132   *      Thread B is in an already-assigned tx, and blocks for this lock.
 133  133   *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
 134  134   *      forever, because the previous txg can't quiesce until B's tx commits.
 135  135   *
 136  136   *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
 137  137   *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
 138      - *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
      138 + *      calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
 139  139   *      to indicate that this operation has already called dmu_tx_wait().
 140  140   *      This will ensure that we don't retry forever, waiting a short bit
 141  141   *      each time.
 142  142   *
 143  143   *  (5) If the operation succeeded, generate the intent log entry for it
 144  144   *      before dropping locks.  This ensures that the ordering of events
 145  145   *      in the intent log matches the order in which they actually occurred.
 146  146   *      During ZIL replay the zfs_log_* functions will update the sequence
 147  147   *      number to indicate the zil transaction has replayed.
 148  148   *
[4 lines elided]
 153  153   *      to ensure that synchronous semantics are provided when necessary.
 154  154   *
 155  155   * In general, this is how things should be ordered in each vnode op:
 156  156   *
 157  157   *      ZFS_ENTER(zfsvfs);              // exit if unmounted
 158  158   * top:
 159  159   *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
 160  160   *      rw_enter(...);                  // grab any other locks you need
 161  161   *      tx = dmu_tx_create(...);        // get DMU tx
 162  162   *      dmu_tx_hold_*();                // hold each object you might modify
 163      - *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
      163 + *      error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
 164  164   *      if (error) {
 165  165   *              rw_exit(...);           // drop locks
 166  166   *              zfs_dirent_unlock(dl);  // unlock directory entry
 167  167   *              VN_RELE(...);           // release held vnodes
 168  168   *              if (error == ERESTART) {
 169  169   *                      waited = B_TRUE;
 170  170   *                      dmu_tx_wait(tx);
 171  171   *                      dmu_tx_abort(tx);
 172  172   *                      goto top;
 173  173   *              }
[6 lines elided]
 180  180   *              zfs_log_*(...);         // on success, make ZIL entry
 181  181   *      dmu_tx_commit(tx);              // commit DMU tx -- error or not
 182  182   *      rw_exit(...);                   // drop locks
 183  183   *      zfs_dirent_unlock(dl);          // unlock directory entry
 184  184   *      VN_RELE(...);                   // release held vnodes
 185  185   *      zil_commit(zilog, foid);        // synchronous when necessary
 186  186   *      ZFS_EXIT(zfsvfs);               // finished in zfs
 187  187   *      return (error);                 // done, report error
 188  188   */
 189  189  
      190 +/* set this tunable to zero to disable asynchronous freeing of files */
      191 +boolean_t zfs_do_async_free = B_TRUE;
      192 +
      193 +/*
      194 + * This value will be multiplied by zfs_dirty_data_max to determine
      195 + * the threshold past which we will call zfs_inactive_impl() async.
      196 + *
      197 + * Selecting the multiplier is a balance between how long we're willing to wait
      198 + * for delete/free to complete (get shell back, have a NFS thread captive, etc)
      199 + * and reducing the number of active requests in the backing taskq.
      200 + *
      201 + * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
      202 + * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
      203 + *
      204 + * WARNING: Setting this tunable to zero will enable asynchronous freeing for
       205 + * all files, which can have undesirable side effects.
      206 + */
      207 +uint16_t zfs_inactive_async_multiplier = 16;
      208 +
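The threshold arithmetic described in the comment above can be sketched in standalone C. The names mirror the tunables but this is an illustrative userland sketch, not the kernel code:

```c
#include <assert.h>

/*
 * Illustrative sketch of the async-free decision: free asynchronously
 * when the znode size exceeds zfs_dirty_data_max multiplied by
 * zfs_inactive_async_multiplier (4 GiB * 16 = 64 GiB by default).
 */
static int
should_free_async(unsigned long long zsize,
    unsigned long long dirty_data_max, unsigned short multiplier)
{
	/* multiplier == 0 makes the threshold 0: every file goes async */
	return (zsize > dirty_data_max * multiplier);
}
```

With the default values, only vnodes strictly larger than 64 GiB take the async path; a multiplier of zero sends everything async, which is the side effect the warning above cautions against.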
      209 +int nms_worm_transition_time = 30;
      210 +int
      211 +zfs_worm_in_trans(znode_t *zp)
      212 +{
      213 +        zfsvfs_t                *zfsvfs = zp->z_zfsvfs;
      214 +        timestruc_t             now;
      215 +        sa_bulk_attr_t          bulk[2];
      216 +        uint64_t                ctime[2];
      217 +        int                     count = 0;
      218 +
      219 +        if (!nms_worm_transition_time)
      220 +                return (0);
      221 +
      222 +        gethrestime(&now);
      223 +        SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
      224 +            &ctime, sizeof (ctime));
      225 +        if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
      226 +                return (0);
      227 +
      228 +        return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
      229 +}
      230 +
 190  231  /* ARGSUSED */
 191  232  static int
 192  233  zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
 193  234  {
 194  235          znode_t *zp = VTOZ(*vpp);
 195  236          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 196  237  
 197  238          ZFS_ENTER(zfsvfs);
 198  239          ZFS_VERIFY_ZP(zp);
 199  240  
[20 lines elided]
 220  261          return (0);
 221  262  }
 222  263  
 223  264  /* ARGSUSED */
 224  265  static int
 225  266  zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
 226  267      caller_context_t *ct)
 227  268  {
 228  269          znode_t *zp = VTOZ(vp);
 229  270          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
      271 +        pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
 230  272  
 231  273          /*
 232  274           * Clean up any locks held by this process on the vp.
 233  275           */
 234      -        cleanlocks(vp, ddi_get_pid(), 0);
 235      -        cleanshares(vp, ddi_get_pid());
      276 +        cleanlocks(vp, caller_pid, 0);
      277 +        cleanshares(vp, caller_pid);
 236  278  
 237  279          ZFS_ENTER(zfsvfs);
 238  280          ZFS_VERIFY_ZP(zp);
 239  281  
 240  282          /* Decrement the synchronous opens in the znode */
 241  283          if ((flag & (FSYNC | FDSYNC)) && (count == 1))
 242  284                  atomic_dec_32(&zp->z_sync_cnt);
 243  285  
 244  286          if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 245  287              ZTOV(zp)->v_type == VREG &&
[233 lines elided]
 479  521                              uio, bytes);
 480  522                  }
 481  523                  len -= bytes;
 482  524                  off = 0;
 483  525                  if (error)
 484  526                          break;
 485  527          }
 486  528          return (error);
 487  529  }
 488  530  
      531 +
      532 +/*
      533 + * ZFS I/O rate throttling
      534 + */
      535 +
      536 +#define DELAY_SHIFT 24
      537 +
      538 +typedef struct zfs_rate_delay {
      539 +        uint_t rl_rate;
      540 +        hrtime_t rl_delay;
      541 +} zfs_rate_delay_t;
      542 +
      543 +/*
      544 + * The time we'll attempt to cv_wait (below), in nSec.
      545 + * This should be no less than the minimum time it normally takes
      546 + * to block a thread and wake back up after the timeout fires.
      547 + *
       548 + * Each table entry represents the delay for each 4MB of bandwidth.
       549 + * We reduce the delay as the size of the I/O increases.
      550 + */
      551 +zfs_rate_delay_t zfs_rate_delay_table[] = {
      552 +        {0, 100000},
      553 +        {1, 100000},
      554 +        {2, 100000},
      555 +        {3, 100000},
      556 +        {4, 100000},
      557 +        {5, 50000},
      558 +        {6, 50000},
      559 +        {7, 50000},
      560 +        {8, 50000},
      561 +        {9, 25000},
      562 +        {10, 25000},
      563 +        {11, 25000},
      564 +        {12, 25000},
      565 +        {13, 12500},
      566 +        {14, 12500},
      567 +        {15, 12500},
      568 +        {16, 12500},
      569 +        {17, 6250},
      570 +        {18, 6250},
      571 +        {19, 6250},
      572 +        {20, 6250},
      573 +        {21, 3125},
      574 +        {22, 3125},
      575 +        {23, 3125},
      576 +        {24, 3125},
      577 +};
      578 +
      579 +#define MAX_RATE_TBL_ENTRY 24
      580 +
      581 +/*
       582 + * The delay we use should be reduced based on the size of the iorate;
       583 + * for higher iorates we want a shorter delay.
      584 + */
      585 +static inline hrtime_t
      586 +zfs_get_delay(ssize_t iorate)
      587 +{
      588 +        uint_t rate = iorate >> DELAY_SHIFT;
      589 +
      590 +        if (rate > MAX_RATE_TBL_ENTRY)
      591 +                rate = MAX_RATE_TBL_ENTRY;
      592 +        return (zfs_rate_delay_table[rate].rl_delay);
      593 +}
      594 +
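The table lookup above can be exercised in a standalone userland sketch (the names here are illustrative, not the kernel symbols): the rate cap is bucketed by shifting right `DELAY_SHIFT` bits, and higher buckets map to shorter waits, clamped at the last table entry.

```c
#include <assert.h>

#define DELAY_SHIFT 24
#define MAX_RATE_TBL_ENTRY 24

/* Delay (nsec) per rate bucket, mirroring zfs_rate_delay_table above. */
static const long long delay_tbl[MAX_RATE_TBL_ENTRY + 1] = {
	100000, 100000, 100000, 100000, 100000,
	50000, 50000, 50000, 50000,
	25000, 25000, 25000, 25000,
	12500, 12500, 12500, 12500,
	6250, 6250, 6250, 6250,
	3125, 3125, 3125, 3125,
};

static long long
get_delay_sketch(long long iorate)
{
	unsigned rate = (unsigned)(iorate >> DELAY_SHIFT);

	if (rate > MAX_RATE_TBL_ENTRY)
		rate = MAX_RATE_TBL_ENTRY;
	return (delay_tbl[rate]);
}
```

A small cap lands in bucket 0 (the longest wait, 100 usec); any cap past bucket 24 is clamped to the shortest wait of 3125 nsec.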
      595 +/*
      596 + * ZFS I/O rate throttling
      597 + * See "Token Bucket" on Wikipedia
      598 + *
      599 + * This is "Token Bucket" with some modifications to avoid wait times
      600 + * longer than a couple seconds, so that we don't trigger NFS retries
      601 + * or similar.  This does mean that concurrent requests might take us
      602 + * over the rate limit, but that's a lesser evil.
      603 + */
      604 +static void
      605 +zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
      606 +{
      607 +        zfs_rate_state_t *rate = &zfsvfs->z_rate;
      608 +        hrtime_t now, delta; /* nanoseconds */
      609 +        int64_t refill;
      610 +
      611 +        VERIFY(rate->rate_cap > 0);
      612 +        mutex_enter(&rate->rate_lock);
      613 +
      614 +        /*
      615 +         * If another thread is already waiting, we must queue up behind them.
      616 +         * We'll wait up to 1 sec here.  We normally will resume by cv_signal,
      617 +         * so we don't need fine timer resolution on this wait.
      618 +         */
      619 +        if (rate->rate_token_bucket < 0) {
      620 +                rate->rate_waiters++;
      621 +                (void) cv_timedwait_hires(
      622 +                    &rate->rate_wait_cv, &rate->rate_lock,
      623 +                    NANOSEC, TR_CLOCK_TICK, 0);
      624 +                rate->rate_waiters--;
      625 +        }
      626 +
      627 +        /*
      628 +         * How long since we last updated the bucket?
      629 +         */
      630 +        now = gethrtime();
      631 +        delta = now - rate->rate_last_update;
      632 +        rate->rate_last_update = now;
      633 +        if (delta < 0)
      634 +                delta = 0; /* paranoid */
      635 +
      636 +        /*
      637 +         * Add "tokens" for time since last update,
      638 +         * being careful about possible overflow.
      639 +         */
      640 +        refill = (delta * rate->rate_cap) / NANOSEC;
      641 +        if (refill < 0 || refill > rate->rate_cap)
      642 +                refill = rate->rate_cap; /* overflow */
      643 +        rate->rate_token_bucket += refill;
      644 +        if (rate->rate_token_bucket > rate->rate_cap)
      645 +                rate->rate_token_bucket = rate->rate_cap;
      646 +
      647 +        /*
       648 + * Withdraw tokens for the current I/O.  If this makes us overdrawn,
      649 +         * wait an amount of time proportionate to the overdraft.  However,
      650 +         * as a sanity measure, never wait more than 1 sec, and never try to
      651 +         * wait less than the time it normally takes to block and reschedule.
      652 +         *
      653 +         * Leave the bucket negative while we wait so other threads know to
      654 +         * queue up. In here, "refill" is the debt we're waiting to pay off.
      655 +         */
      656 +        rate->rate_token_bucket -= iosize;
      657 +        if (rate->rate_token_bucket < 0) {
      658 +                hrtime_t zfs_rate_wait = 0;
      659 +
      660 +                refill = rate->rate_token_bucket;
      661 +                DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
      662 +                    int64_t, refill);
      663 +
      664 +                if (rate->rate_cap <= 0)
      665 +                        goto nocap;
      666 +
      667 +                delta = (refill * NANOSEC) / rate->rate_cap;
      668 +                delta = MIN(delta, NANOSEC);
      669 +
      670 +                zfs_rate_wait = zfs_get_delay(rate->rate_cap);
      671 +
      672 +                if (delta > zfs_rate_wait) {
      673 +                        (void) cv_timedwait_hires(
      674 +                            &rate->rate_wait_cv, &rate->rate_lock,
      675 +                            delta, TR_CLOCK_TICK, 0);
      676 +                }
      677 +
      678 +                rate->rate_token_bucket += refill;
      679 +        }
      680 +nocap:
      681 +        if (rate->rate_waiters > 0) {
      682 +                cv_signal(&rate->rate_wait_cv);
      683 +        }
      684 +
      685 +        mutex_exit(&rate->rate_lock);
      686 +}
      687 +
      688 +
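The core token-bucket arithmetic in zfs_rate_throttle() can be sketched single-threaded as below. The names are illustrative only; the kernel version adds the mutex, cv waits, waiter queuing, and the delay-table floor.

```c
#define NANOSEC 1000000000LL

typedef struct bucket {
	long long cap;		/* bytes per second (rate_cap) */
	long long tokens;	/* may go negative when overdrawn */
} bucket_t;

/*
 * Refill tokens proportional to elapsed time (capped at one full
 * bucket), withdraw the I/O size, and return the nanoseconds the
 * caller should wait: zero if under the cap, otherwise a wait
 * proportional to the overdraft, capped at one second.
 */
static long long
bucket_charge(bucket_t *b, long long elapsed_ns, long long iosize)
{
	long long refill = (elapsed_ns * b->cap) / NANOSEC;
	long long wait;

	if (refill < 0 || refill > b->cap)
		refill = b->cap;	/* overflow or long idle period */
	b->tokens += refill;
	if (b->tokens > b->cap)
		b->tokens = b->cap;	/* bucket never exceeds the cap */

	b->tokens -= iosize;
	if (b->tokens >= 0)
		return (0);
	wait = (-b->tokens * NANOSEC) / b->cap;
	return (wait > NANOSEC ? NANOSEC : wait);
}
```

The one-second cap on the wait is what keeps concurrent requests from stalling long enough to trigger NFS retries, at the cost of occasionally exceeding the configured rate, as the comment above notes.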
 489  689  offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
 490  690  
 491  691  /*
 492  692   * Read bytes from specified file into supplied buffer.
 493  693   *
 494  694   *      IN:     vp      - vnode of file to be read from.
 495  695   *              uio     - structure supplying read location, range info,
 496  696   *                        and return buffer.
 497  697   *              ioflag  - SYNC flags; used to provide FRSYNC semantics.
 498  698   *              cr      - credentials of caller.
[46 lines elided]
 545  745           */
 546  746          if (MANDMODE(zp->z_mode)) {
 547  747                  if (error = chklock(vp, FREAD,
 548  748                      uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
 549  749                          ZFS_EXIT(zfsvfs);
 550  750                          return (error);
 551  751                  }
 552  752          }
 553  753  
 554  754          /*
      755 +         * ZFS I/O rate throttling
      756 +         */
      757 +        if (zfsvfs->z_rate.rate_cap)
      758 +                zfs_rate_throttle(zfsvfs, uio->uio_resid);
      759 +
      760 +        /*
 555  761           * If we're in FRSYNC mode, sync out this znode before reading it.
 556  762           */
 557  763          if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
 558  764                  zil_commit(zfsvfs->z_log, zp->z_id);
 559  765  
 560  766          /*
 561  767           * Lock the range against changes.
 562  768           */
 563  769          rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
 564  770  
[143 lines elided]
 708  914          }
 709  915  
 710  916          /*
 711  917           * If immutable or not appending then return EPERM.
 712  918           * Intentionally allow ZFS_READONLY through here.
 713  919           * See zfs_zaccess_common()
 714  920           */
 715  921          if ((zp->z_pflags & ZFS_IMMUTABLE) ||
 716  922              ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
 717  923              (uio->uio_loffset < zp->z_size))) {
 718      -                ZFS_EXIT(zfsvfs);
 719      -                return (SET_ERROR(EPERM));
      924 +                /* Make sure we're not a WORM before returning EPERM. */
      925 +                if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
      926 +                    !zp->z_zfsvfs->z_isworm) {
      927 +                        ZFS_EXIT(zfsvfs);
      928 +                        return (SET_ERROR(EPERM));
      929 +                }
 720  930          }
 721  931  
 722  932          zilog = zfsvfs->z_log;
 723  933  
 724  934          /*
 725  935           * Validate file offset
 726  936           */
 727  937          woff = ioflag & FAPPEND ? zp->z_size : uio->uio_loffset;
 728  938          if (woff < 0) {
 729  939                  ZFS_EXIT(zfsvfs);
[4 lines elided]
 734  944           * Check for mandatory locks before calling zfs_range_lock()
 735  945           * in order to prevent a deadlock with locks set via fcntl().
 736  946           */
 737  947          if (MANDMODE((mode_t)zp->z_mode) &&
 738  948              (error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
 739  949                  ZFS_EXIT(zfsvfs);
 740  950                  return (error);
 741  951          }
 742  952  
 743  953          /*
      954 +         * ZFS I/O rate throttling
      955 +         */
      956 +        if (zfsvfs->z_rate.rate_cap)
      957 +                zfs_rate_throttle(zfsvfs, uio->uio_resid);
      958 +
      959 +        /*
 744  960           * Pre-fault the pages to ensure slow (eg NFS) pages
 745  961           * don't hold up txg.
 746  962           * Skip this if uio contains loaned arc_buf.
 747  963           */
 748  964          if ((uio->uio_extflg == UIO_XUIO) &&
 749  965              (((xuio_t *)uio)->xu_type == UIOTYPE_ZEROCOPY))
 750  966                  xuio = (xuio_t *)uio;
 751  967          else
 752  968                  uio_prefaultpages(MIN(n, max_blksz), uio);
 753  969  
[254 lines elided]
1008 1224          }
1009 1225  
1010 1226          if (ioflag & (FSYNC | FDSYNC) ||
1011 1227              zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
1012 1228                  zil_commit(zilog, zp->z_id);
1013 1229  
1014 1230          ZFS_EXIT(zfsvfs);
1015 1231          return (0);
1016 1232  }
1017 1233  
     1234 +/* ARGSUSED */
1018 1235  void
1019 1236  zfs_get_done(zgd_t *zgd, int error)
1020 1237  {
1021 1238          znode_t *zp = zgd->zgd_private;
1022 1239          objset_t *os = zp->z_zfsvfs->z_os;
1023 1240  
1024 1241          if (zgd->zgd_db)
1025 1242                  dmu_buf_rele(zgd->zgd_db, zgd);
1026 1243  
1027 1244          zfs_range_unlock(zgd->zgd_rl);
1028 1245  
1029 1246          /*
1030 1247           * Release the vnode asynchronously as we currently have the
1031 1248           * txg stopped from syncing.
1032 1249           */
1033 1250          VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
1034 1251  
1035      -        if (error == 0 && zgd->zgd_bp)
1036      -                zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
1037      -
1038 1252          kmem_free(zgd, sizeof (zgd_t));
1039 1253  }
1040 1254  
1041 1255  #ifdef DEBUG
1042 1256  static int zil_fault_io = 0;
1043 1257  #endif
1044 1258  
1045 1259  /*
1046 1260   * Get data to generate a TX_WRITE intent log record.
1047 1261   */
[103 lines elided]
1151 1365                           */
1152 1366                          if (error == 0)
1153 1367                                  return (0);
1154 1368  
1155 1369                          if (error == EALREADY) {
1156 1370                                  lr->lr_common.lrc_txtype = TX_WRITE2;
1157 1371                                  /*
1158 1372                                   * TX_WRITE2 relies on the data previously
1159 1373                                   * written by the TX_WRITE that caused
1160 1374                                   * EALREADY.  We zero out the BP because
1161      -                                 * it is the old, currently-on-disk BP,
1162      -                                 * so there's no need to zio_flush() its
1163      -                                 * vdevs (flushing would needlesly hurt
1164      -                                 * performance, and doesn't work on
1165      -                                 * indirect vdevs).
     1375 +                                 * it is the old, currently-on-disk BP.
1166 1376                                   */
1167 1377                                  zgd->zgd_bp = NULL;
1168 1378                                  BP_ZERO(bp);
1169 1379                                  error = 0;
1170 1380                          }
1171 1381                  }
1172 1382          }
1173 1383  
1174 1384          zfs_get_done(zgd, error);
1175 1385  
[62 lines elided]
1238 1448   *
1239 1449   * Timestamps:
1240 1450   *      NA
1241 1451   */
1242 1452  /* ARGSUSED */
1243 1453  static int
1244 1454  zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
1245 1455      int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
1246 1456      int *direntflags, pathname_t *realpnp)
1247 1457  {
1248      -        znode_t *zdp = VTOZ(dvp);
     1458 +        znode_t *zp, *zdp = VTOZ(dvp);
1249 1459          zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
1250 1460          int     error = 0;
1251 1461  
1252 1462          /*
1253 1463           * Fast path lookup, however we must skip DNLC lookup
1254 1464           * for case folding or normalizing lookups because the
1255 1465           * DNLC code only stores the passed in name.  This means
1256 1466           * creating 'a' and removing 'A' on a case insensitive
1257 1467           * file system would work, but DNLC still thinks 'a'
1258 1468           * exists and won't let you create it again on the next
[97 lines elided]
1356 1566  
1357 1567          if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
1358 1568              NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
1359 1569                  ZFS_EXIT(zfsvfs);
1360 1570                  return (SET_ERROR(EILSEQ));
1361 1571          }
1362 1572  
1363 1573          error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
1364 1574          if (error == 0)
1365 1575                  error = specvp_check(vpp, cr);
     1576 +        if (*vpp) {
     1577 +                zp = VTOZ(*vpp);
     1578 +                if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
     1579 +                    ((*vpp)->v_type != VDIR) &&
     1580 +                    zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
     1581 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1582 +                }
     1583 +        }
1366 1584  
1367 1585          ZFS_EXIT(zfsvfs);
1368 1586          return (error);
1369 1587  }
1370 1588  
1371 1589  /*
1372 1590   * Attempt to create a new entry in a directory.  If the entry
1373 1591   * already exists, truncate the file if permissible, else return
1374 1592   * an error.  Return the vp of the created or trunc'd file.
1375 1593   *
[15 lines elided]
1391 1609   *      dvp - ctime|mtime updated if new entry created
1392 1610   *       vp - ctime|mtime always, atime if new
1393 1611   */
1394 1612  
1395 1613  /* ARGSUSED */
1396 1614  static int
1397 1615  zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
1398 1616      int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
1399 1617      vsecattr_t *vsecp)
1400 1618  {
     1619 +        int             imm_was_set = 0;
1401 1620          znode_t         *zp, *dzp = VTOZ(dvp);
1402 1621          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1403 1622          zilog_t         *zilog;
1404 1623          objset_t        *os;
1405 1624          zfs_dirlock_t   *dl;
1406 1625          dmu_tx_t        *tx;
1407 1626          int             error;
1408 1627          ksid_t          *ksid;
1409 1628          uid_t           uid;
1410 1629          gid_t           gid = crgetgid(cr);
[65 lines elided]
1476 1695                          if (strcmp(name, "..") == 0)
1477 1696                                  error = SET_ERROR(EISDIR);
1478 1697                          ZFS_EXIT(zfsvfs);
1479 1698                          return (error);
1480 1699                  }
1481 1700          }
1482 1701  
1483 1702          if (zp == NULL) {
1484 1703                  uint64_t txtype;
1485 1704  
     1705 +                if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
     1706 +                    dzp->z_zfsvfs->z_isworm) {
     1707 +                        imm_was_set = 1;
     1708 +                        dzp->z_pflags &= ~ZFS_IMMUTABLE;
     1709 +                }
     1710 +
1486 1711                  /*
1487 1712                   * Create a new file object and update the directory
1488 1713                   * to reference it.
1489 1714                   */
1490 1715                  if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
1491 1716                          if (have_acl)
1492 1717                                  zfs_acl_ids_free(&acl_ids);
     1718 +                        if (imm_was_set)
     1719 +                                dzp->z_pflags |= ZFS_IMMUTABLE;
1493 1720                          goto out;
1494 1721                  }
1495 1722  
     1723 +                if (imm_was_set)
     1724 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
     1725 +
1496 1726                  /*
1497 1727                   * We only support the creation of regular files in
1498 1728                   * extended attribute directories.
1499 1729                   */
1500 1730  
1501 1731                  if ((dzp->z_pflags & ZFS_XATTR) &&
1502 1732                      (vap->va_type != VREG)) {
1503 1733                          if (have_acl)
1504 1734                                  zfs_acl_ids_free(&acl_ids);
1505 1735                          error = SET_ERROR(EINVAL);
↓ open down ↓ 19 lines elided ↑ open up ↑
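The `imm_was_set` handling in the hunk above follows a save/clear/restore shape: lift the WORM directory's immutable bit for the duration of the access check, then restore it on every exit path. A minimal user-space sketch of that idiom (all names here are invented stand-ins, not the kernel API):

```c
#include <assert.h>

/* Hypothetical stand-ins for ZFS_IMMUTABLE and the znode fields. */
#define IMM_FLAG 0x1u

struct node {
        unsigned pflags;
        int isworm;
};

/* Records whether the flag was clear while the check ran. */
static int saw_clear;

static int demo_check(struct node *n)
{
        saw_clear = ((n->pflags & IMM_FLAG) == 0);
        return (0);
}

/*
 * The save/clear/restore idiom from the hunk above: a WORM
 * directory's immutable bit is lifted only while the access
 * check runs and is restored no matter how the check ends.
 */
static int access_check_with_worm(struct node *dzp,
    int (*check)(struct node *))
{
        int imm_was_set = 0;
        int error;

        if ((dzp->pflags & IMM_FLAG) && dzp->isworm) {
                imm_was_set = 1;
                dzp->pflags &= ~IMM_FLAG;
        }
        error = check(dzp);
        if (imm_was_set)
                dzp->pflags |= IMM_FLAG;        /* restore on every path */
        return (error);
}
```

The restore-before-`goto out` in the error branch of the real hunk is the same discipline: the flag must be put back on the failure path as well as the success path.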
1525 1755                  fuid_dirtied = zfsvfs->z_fuid_dirty;
1526 1756                  if (fuid_dirtied)
1527 1757                          zfs_fuid_txhold(zfsvfs, tx);
1528 1758                  dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
1529 1759                  dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
1530 1760                  if (!zfsvfs->z_use_sa &&
1531 1761                      acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1532 1762                          dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
1533 1763                              0, acl_ids.z_aclp->z_acl_bytes);
1534 1764                  }
1535      -                error = dmu_tx_assign(tx,
1536      -                    (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     1765 +                error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
1537 1766                  if (error) {
1538 1767                          zfs_dirent_unlock(dl);
1539 1768                          if (error == ERESTART) {
1540 1769                                  waited = B_TRUE;
1541 1770                                  dmu_tx_wait(tx);
1542 1771                                  dmu_tx_abort(tx);
1543 1772                                  goto top;
1544 1773                          }
1545 1774                          zfs_acl_ids_free(&acl_ids);
1546 1775                          dmu_tx_abort(tx);
1547 1776                          ZFS_EXIT(zfsvfs);
1548 1777                          return (error);
1549 1778                  }
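The `dmu_tx_assign` hunks in this review swap the upstream `(waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT` for Nexenta's `TXG_WAITED` flag; the surrounding retry loop is unchanged. A user-space sketch of that retry idiom (hypothetical `try_assign` and flag names, not the DMU API):

```c
#include <assert.h>

#define MY_ERESTART 91          /* stand-in for "txg throttled, retry" */
#define F_NOWAIT    0x1
#define F_WAITED    0x2         /* caller already waited; skip throttle */

/* Hypothetical assigner: throttles only the first NOWAIT attempt. */
static int try_assign(int flags, int *attempts)
{
        (*attempts)++;
        if ((flags & F_WAITED) == 0 && *attempts == 1)
                return (MY_ERESTART);
        return (0);
}

/*
 * The idiom from the hunks above: on ERESTART, wait for the open
 * txg, remember that we waited, and restart the whole operation so
 * the second attempt is not throttled again.
 */
static int assign_with_retry(int *attempts)
{
        int waited = 0;
        int error;
top:
        error = try_assign(waited ? F_WAITED : F_NOWAIT, attempts);
        if (error == MY_ERESTART) {
                waited = 1;     /* dmu_tx_wait() would block here */
                goto top;
        }
        return (error);
}
```

In the kernel code the `goto top` restarts from directory-lock acquisition, which is why the tx is aborted and the dirlock dropped before retrying.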
1550 1779                  zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
1551 1780  
1552 1781                  if (fuid_dirtied)
1553 1782                          zfs_fuid_sync(zfsvfs, tx);
1554 1783  
     1784 +                if (imm_was_set)
     1785 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1786 +
1555 1787                  (void) zfs_link_create(dl, zp, tx, ZNEW);
1556 1788                  txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
1557 1789                  if (flag & FIGNORECASE)
1558 1790                          txtype |= TX_CI;
1559 1791                  zfs_log_create(zilog, tx, txtype, dzp, zp, name,
1560 1792                      vsecp, acl_ids.z_fuidp, vap);
1561 1793                  zfs_acl_ids_free(&acl_ids);
1562 1794                  dmu_tx_commit(tx);
1563 1795          } else {
1564 1796                  int aflags = (flag & FAPPEND) ? V_APPEND : 0;
↓ open down ↓ 12 lines elided ↑ open up ↑
1577 1809                          error = SET_ERROR(EEXIST);
1578 1810                          goto out;
1579 1811                  }
1580 1812                  /*
1581 1813                   * Can't open a directory for writing.
1582 1814                   */
1583 1815                  if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
1584 1816                          error = SET_ERROR(EISDIR);
1585 1817                          goto out;
1586 1818                  }
     1819 +                if ((flag & FWRITE) &&
     1820 +                    dzp->z_zfsvfs->z_isworm) {
     1821 +                        error = SET_ERROR(EPERM);
     1822 +                        goto out;
     1823 +                }
     1824 +
     1825 +                if (!(flag & FAPPEND) &&
     1826 +                    (zp->z_pflags & ZFS_IMMUTABLE) &&
     1827 +                    dzp->z_zfsvfs->z_isworm) {
     1828 +                        imm_was_set = 1;
     1829 +                        zp->z_pflags &= ~ZFS_IMMUTABLE;
     1830 +                }
1587 1831                  /*
1588 1832                   * Verify requested access to file.
1589 1833                   */
1590 1834                  if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
     1835 +                        if (imm_was_set)
     1836 +                                zp->z_pflags |= ZFS_IMMUTABLE;
1591 1837                          goto out;
1592 1838                  }
1593 1839  
     1840 +                if (imm_was_set)
     1841 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1842 +
1594 1843                  mutex_enter(&dzp->z_lock);
1595 1844                  dzp->z_seq++;
1596 1845                  mutex_exit(&dzp->z_lock);
1597 1846  
1598 1847                  /*
1599 1848                   * Truncate regular files if requested.
1600 1849                   */
1601 1850                  if ((ZTOV(zp)->v_type == VREG) &&
1602 1851                      (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1603 1852                          /* we can't hold any locks when calling zfs_freesp() */
↓ open down ↓ 86 lines elided ↑ open up ↑
1690 1939          if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1691 1940              NULL, realnmp)) {
1692 1941                  if (realnmp)
1693 1942                          pn_free(realnmp);
1694 1943                  ZFS_EXIT(zfsvfs);
1695 1944                  return (error);
1696 1945          }
1697 1946  
1698 1947          vp = ZTOV(zp);
1699 1948  
     1949 +        if (zp->z_zfsvfs->z_isworm) {
     1950 +                error = SET_ERROR(EPERM);
     1951 +                goto out;
     1952 +        }
     1953 +
1700 1954          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
1701 1955                  goto out;
1702 1956          }
1703 1957  
1704 1958          /*
1705 1959           * Need to use rmdir for removing directories.
1706 1960           */
1707 1961          if (vp->v_type == VDIR) {
1708 1962                  error = SET_ERROR(EPERM);
1709 1963                  goto out;
↓ open down ↓ 46 lines elided ↑ open up ↑
1756 2010          mutex_exit(&zp->z_lock);
1757 2011  
1758 2012          /* charge as an update -- would be nice not to charge at all */
1759 2013          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
1760 2014  
1761 2015          /*
1762 2016           * Mark this transaction as typically resulting in a net free of space
1763 2017           */
1764 2018          dmu_tx_mark_netfree(tx);
1765 2019  
1766      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2020 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
1767 2021          if (error) {
1768 2022                  zfs_dirent_unlock(dl);
1769 2023                  VN_RELE(vp);
1770 2024                  if (xzp)
1771 2025                          VN_RELE(ZTOV(xzp));
1772 2026                  if (error == ERESTART) {
1773 2027                          waited = B_TRUE;
1774 2028                          dmu_tx_wait(tx);
1775 2029                          dmu_tx_abort(tx);
1776 2030                          goto top;
↓ open down ↓ 106 lines elided ↑ open up ↑
1883 2137   *
1884 2138   * Timestamps:
1885 2139   *      dvp - ctime|mtime updated
1886 2140   *       vp - ctime|mtime|atime updated
1887 2141   */
1888 2142  /*ARGSUSED*/
1889 2143  static int
1890 2144  zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
1891 2145      caller_context_t *ct, int flags, vsecattr_t *vsecp)
1892 2146  {
     2147 +        int             imm_was_set = 0;
1893 2148          znode_t         *zp, *dzp = VTOZ(dvp);
1894 2149          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1895 2150          zilog_t         *zilog;
1896 2151          zfs_dirlock_t   *dl;
1897 2152          uint64_t        txtype;
1898 2153          dmu_tx_t        *tx;
1899 2154          int             error;
1900 2155          int             zf = ZNEW;
1901 2156          ksid_t          *ksid;
1902 2157          uid_t           uid;
↓ open down ↓ 59 lines elided ↑ open up ↑
1962 2217  top:
1963 2218          *vpp = NULL;
1964 2219  
1965 2220          if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
1966 2221              NULL, NULL)) {
1967 2222                  zfs_acl_ids_free(&acl_ids);
1968 2223                  ZFS_EXIT(zfsvfs);
1969 2224                  return (error);
1970 2225          }
1971 2226  
     2227 +        if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
     2228 +            dzp->z_zfsvfs->z_isworm) {
     2229 +                imm_was_set = 1;
     2230 +                dzp->z_pflags &= ~ZFS_IMMUTABLE;
     2231 +        }
     2232 +
1972 2233          if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
     2234 +                if (imm_was_set)
     2235 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
1973 2236                  zfs_acl_ids_free(&acl_ids);
1974 2237                  zfs_dirent_unlock(dl);
1975 2238                  ZFS_EXIT(zfsvfs);
1976 2239                  return (error);
1977 2240          }
1978 2241  
     2242 +        if (imm_was_set)
     2243 +                dzp->z_pflags |= ZFS_IMMUTABLE;
     2244 +
1979 2245          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
1980 2246                  zfs_acl_ids_free(&acl_ids);
1981 2247                  zfs_dirent_unlock(dl);
1982 2248                  ZFS_EXIT(zfsvfs);
1983 2249                  return (SET_ERROR(EDQUOT));
1984 2250          }
1985 2251  
1986 2252          /*
1987 2253           * Add a new entry to the directory.
1988 2254           */
↓ open down ↓ 4 lines elided ↑ open up ↑
1993 2259          if (fuid_dirtied)
1994 2260                  zfs_fuid_txhold(zfsvfs, tx);
1995 2261          if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1996 2262                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
1997 2263                      acl_ids.z_aclp->z_acl_bytes);
1998 2264          }
1999 2265  
2000 2266          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
2001 2267              ZFS_SA_BASE_ATTR_SIZE);
2002 2268  
2003      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2269 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2004 2270          if (error) {
2005 2271                  zfs_dirent_unlock(dl);
2006 2272                  if (error == ERESTART) {
2007 2273                          waited = B_TRUE;
2008 2274                          dmu_tx_wait(tx);
2009 2275                          dmu_tx_abort(tx);
2010 2276                          goto top;
2011 2277                  }
2012 2278                  zfs_acl_ids_free(&acl_ids);
2013 2279                  dmu_tx_abort(tx);
↓ open down ↓ 81 lines elided ↑ open up ↑
2095 2361           * Attempt to lock directory; fail if entry doesn't exist.
2096 2362           */
2097 2363          if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
2098 2364              NULL, NULL)) {
2099 2365                  ZFS_EXIT(zfsvfs);
2100 2366                  return (error);
2101 2367          }
2102 2368  
2103 2369          vp = ZTOV(zp);
2104 2370  
     2371 +        if (dzp->z_zfsvfs->z_isworm) {
     2372 +                error = SET_ERROR(EPERM);
     2373 +                goto out;
     2374 +        }
     2375 +
2105 2376          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
2106 2377                  goto out;
2107 2378          }
2108 2379  
2109 2380          if (vp->v_type != VDIR) {
2110 2381                  error = SET_ERROR(ENOTDIR);
2111 2382                  goto out;
2112 2383          }
2113 2384  
2114 2385          if (vp == cwd) {
↓ open down ↓ 15 lines elided ↑ open up ↑
2130 2401           */
2131 2402          rw_enter(&zp->z_parent_lock, RW_WRITER);
2132 2403  
2133 2404          tx = dmu_tx_create(zfsvfs->z_os);
2134 2405          dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
2135 2406          dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
2136 2407          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
2137 2408          zfs_sa_upgrade_txholds(tx, zp);
2138 2409          zfs_sa_upgrade_txholds(tx, dzp);
2139 2410          dmu_tx_mark_netfree(tx);
2140      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2411 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2141 2412          if (error) {
2142 2413                  rw_exit(&zp->z_parent_lock);
2143 2414                  rw_exit(&zp->z_name_lock);
2144 2415                  zfs_dirent_unlock(dl);
2145 2416                  VN_RELE(vp);
2146 2417                  if (error == ERESTART) {
2147 2418                          waited = B_TRUE;
2148 2419                          dmu_tx_wait(tx);
2149 2420                          dmu_tx_abort(tx);
2150 2421                          goto top;
↓ open down ↓ 636 lines elided ↑ open up ↑
2787 3058  
2788 3059          /*
2789 3060           * If this is an xvattr_t, then get a pointer to the structure of
2790 3061           * optional attributes.  If this is NULL, then we have a vattr_t.
2791 3062           */
2792 3063          xoap = xva_getxoptattr(xvap);
2793 3064  
2794 3065          xva_init(&tmpxvattr);
2795 3066  
2796 3067          /*
2797      -         * Immutable files can only alter immutable bit and atime
     3068 +         * Do not allow the immutable bit to be altered once it is set
2798 3069           */
2799 3070          if ((zp->z_pflags & ZFS_IMMUTABLE) &&
2800      -            ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
2801      -            ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
     3071 +            XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
     3072 +            zp->z_zfsvfs->z_isworm) {
2802 3073                  ZFS_EXIT(zfsvfs);
2803 3074                  return (SET_ERROR(EPERM));
2804 3075          }
2805 3076  
2806 3077          /*
     3078 +         * Immutable files can only alter atime
     3079 +         */
     3080 +        if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
     3081 +            ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
     3082 +            ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
     3083 +                if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
     3084 +                        ZFS_EXIT(zfsvfs);
     3085 +                        return (SET_ERROR(EPERM));
     3086 +                }
     3087 +        }
     3088 +
     3089 +        /*
2807 3090           * Note: ZFS_READONLY is handled in zfs_zaccess_common.
2808 3091           */
2809 3092  
2810 3093          /*
2811 3094           * Verify timestamps doesn't overflow 32 bits.
2812 3095           * ZFS can handle large timestamps, but 32bit syscalls can't
2813 3096           * handle times greater than 2039.  This check should be removed
2814 3097           * once large timestamps are fully supported.
2815 3098           */
2816 3099          if (mask & (AT_ATIME | AT_MTIME)) {
↓ open down ↓ 886 lines elided ↑ open up ↑
3703 3986                  dmu_tx_hold_sa(tx, tdzp->z_sa_hdl, B_FALSE);
3704 3987                  zfs_sa_upgrade_txholds(tx, tdzp);
3705 3988          }
3706 3989          if (tzp) {
3707 3990                  dmu_tx_hold_sa(tx, tzp->z_sa_hdl, B_FALSE);
3708 3991                  zfs_sa_upgrade_txholds(tx, tzp);
3709 3992          }
3710 3993  
3711 3994          zfs_sa_upgrade_txholds(tx, szp);
3712 3995          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
3713      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     3996 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
3714 3997          if (error) {
3715 3998                  if (zl != NULL)
3716 3999                          zfs_rename_unlock(&zl);
3717 4000                  zfs_dirent_unlock(sdl);
3718 4001                  zfs_dirent_unlock(tdl);
3719 4002  
3720 4003                  if (sdzp == tdzp)
3721 4004                          rw_exit(&sdzp->z_name_lock);
3722 4005  
3723 4006                  VN_RELE(ZTOV(szp));
↓ open down ↓ 103 lines elided ↑ open up ↑
3827 4110  /*ARGSUSED*/
3828 4111  static int
3829 4112  zfs_symlink(vnode_t *dvp, char *name, vattr_t *vap, char *link, cred_t *cr,
3830 4113      caller_context_t *ct, int flags)
3831 4114  {
3832 4115          znode_t         *zp, *dzp = VTOZ(dvp);
3833 4116          zfs_dirlock_t   *dl;
3834 4117          dmu_tx_t        *tx;
3835 4118          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
3836 4119          zilog_t         *zilog;
     4120 +        int             imm_was_set = 0;
3837 4121          uint64_t        len = strlen(link);
3838 4122          int             error;
3839 4123          int             zflg = ZNEW;
3840 4124          zfs_acl_ids_t   acl_ids;
3841 4125          boolean_t       fuid_dirtied;
3842 4126          uint64_t        txtype = TX_SYMLINK;
3843 4127          boolean_t       waited = B_FALSE;
3844 4128  
3845 4129          ASSERT(vap->va_type == VLNK);
3846 4130  
↓ open down ↓ 23 lines elided ↑ open up ↑
3870 4154          /*
3871 4155           * Attempt to lock directory; fail if entry already exists.
3872 4156           */
3873 4157          error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
3874 4158          if (error) {
3875 4159                  zfs_acl_ids_free(&acl_ids);
3876 4160                  ZFS_EXIT(zfsvfs);
3877 4161                  return (error);
3878 4162          }
3879 4163  
     4164 +        if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
     4165 +                imm_was_set = 1;
     4166 +                dzp->z_pflags &= ~ZFS_IMMUTABLE;
     4167 +        }
3880 4168          if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
     4169 +                if (imm_was_set)
     4170 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
3881 4171                  zfs_acl_ids_free(&acl_ids);
3882 4172                  zfs_dirent_unlock(dl);
3883 4173                  ZFS_EXIT(zfsvfs);
3884 4174                  return (error);
3885 4175          }
     4176 +        if (imm_was_set)
     4177 +                dzp->z_pflags |= ZFS_IMMUTABLE;
3886 4178  
3887 4179          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
3888 4180                  zfs_acl_ids_free(&acl_ids);
3889 4181                  zfs_dirent_unlock(dl);
3890 4182                  ZFS_EXIT(zfsvfs);
3891 4183                  return (SET_ERROR(EDQUOT));
3892 4184          }
3893 4185          tx = dmu_tx_create(zfsvfs->z_os);
3894 4186          fuid_dirtied = zfsvfs->z_fuid_dirty;
3895 4187          dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
3896 4188          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
3897 4189          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
3898 4190              ZFS_SA_BASE_ATTR_SIZE + len);
3899 4191          dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
3900 4192          if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
3901 4193                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
3902 4194                      acl_ids.z_aclp->z_acl_bytes);
3903 4195          }
3904 4196          if (fuid_dirtied)
3905 4197                  zfs_fuid_txhold(zfsvfs, tx);
3906      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     4198 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
3907 4199          if (error) {
3908 4200                  zfs_dirent_unlock(dl);
3909 4201                  if (error == ERESTART) {
3910 4202                          waited = B_TRUE;
3911 4203                          dmu_tx_wait(tx);
3912 4204                          dmu_tx_abort(tx);
3913 4205                          goto top;
3914 4206                  }
3915 4207                  zfs_acl_ids_free(&acl_ids);
3916 4208                  dmu_tx_abort(tx);
↓ open down ↓ 200 lines elided ↑ open up ↑
4117 4409          if (error) {
4118 4410                  ZFS_EXIT(zfsvfs);
4119 4411                  return (error);
4120 4412          }
4121 4413  
4122 4414          tx = dmu_tx_create(zfsvfs->z_os);
4123 4415          dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
4124 4416          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
4125 4417          zfs_sa_upgrade_txholds(tx, szp);
4126 4418          zfs_sa_upgrade_txholds(tx, dzp);
4127      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     4419 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
4128 4420          if (error) {
4129 4421                  zfs_dirent_unlock(dl);
4130 4422                  if (error == ERESTART) {
4131 4423                          waited = B_TRUE;
4132 4424                          dmu_tx_wait(tx);
4133 4425                          dmu_tx_abort(tx);
4134 4426                          goto top;
4135 4427                  }
4136 4428                  dmu_tx_abort(tx);
4137 4429                  ZFS_EXIT(zfsvfs);
↓ open down ↓ 255 lines elided ↑ open up ↑
4393 4685                  }
4394 4686          }
4395 4687  out:
4396 4688          zfs_range_unlock(rl);
4397 4689          if ((flags & B_ASYNC) == 0 || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
4398 4690                  zil_commit(zfsvfs->z_log, zp->z_id);
4399 4691          ZFS_EXIT(zfsvfs);
4400 4692          return (error);
4401 4693  }
4402 4694  
4403      -/*ARGSUSED*/
4404      -void
4405      -zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
     4695 +/*
     4696 + * Returns B_TRUE and exits the z_teardown_inactive_lock
     4697 + * if the znode we are looking at is no longer valid
     4698 + */
     4699 +static boolean_t
     4700 +zfs_znode_free_invalid(znode_t *zp)
4406 4701  {
4407      -        znode_t *zp = VTOZ(vp);
4408 4702          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4409      -        int error;
     4703 +        vnode_t *vp = ZTOV(zp);
4410 4704  
4411      -        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
     4705 +        ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
     4706 +
4412 4707          if (zp->z_sa_hdl == NULL) {
4413 4708                  /*
4414 4709                   * The fs has been unmounted, or we did a
4415 4710                   * suspend/resume and this file no longer exists.
4416 4711                   */
4417 4712                  if (vn_has_cached_data(vp)) {
4418 4713                          (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
4419      -                            B_INVAL, cr);
     4714 +                            B_INVAL, CRED());
4420 4715                  }
4421 4716  
4422 4717                  mutex_enter(&zp->z_lock);
4423 4718                  mutex_enter(&vp->v_lock);
4424 4719                  ASSERT(vp->v_count == 1);
4425 4720                  VN_RELE_LOCKED(vp);
4426 4721                  mutex_exit(&vp->v_lock);
4427 4722                  mutex_exit(&zp->z_lock);
     4723 +                VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
     4724 +                    UINT32_MAX);
4428 4725                  rw_exit(&zfsvfs->z_teardown_inactive_lock);
4429 4726                  zfs_znode_free(zp);
4430      -                return;
     4727 +                return (B_TRUE);
4431 4728          }
4432 4729  
     4730 +        return (B_FALSE);
     4731 +}
     4732 +
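`zfs_znode_free_invalid()` has an unusual contract worth flagging for reviewers: when it returns `B_TRUE`, it has already dropped `z_teardown_inactive_lock`, so the caller must return without unlocking. A pthread sketch of that "lock dropped on the true path" contract (names invented):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

/*
 * Sketch of the zfs_znode_free_invalid() contract: when the object
 * turns out to be dead, the helper releases the lock itself and
 * signals that via its return value.  Caller must hold lk on entry.
 */
static int free_if_invalid(int invalid)
{
        if (invalid) {
                pthread_mutex_unlock(&lk);      /* dropped on this path */
                return (1);
        }
        return (0);                             /* still locked */
}

static int do_work(int invalid)
{
        pthread_mutex_lock(&lk);
        if (free_if_invalid(invalid))
                return (-1);                    /* lk already dropped */
        /* ... normal work under the lock ... */
        pthread_mutex_unlock(&lk);
        return (0);
}
```

Misreading the return value here double-unlocks or leaks the lock, which is why both callers in the diff comment the early return explicitly.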
     4733 +/*
     4734 + * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
     4735 + * actual freeing.
     4736 + * This code used to be in zfs_inactive() before the async delete patch came in.
     4737 + */
     4738 +static void
     4739 +zfs_inactive_impl(znode_t *zp)
     4740 +{
     4741 +        vnode_t *vp = ZTOV(zp);
     4742 +        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
     4743 +        int error;
     4744 +
     4745 +        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
     4746 +        if (zfs_znode_free_invalid(zp))
     4747 +                return; /* z_teardown_inactive_lock already dropped */
     4748 +
4433 4749          /*
4434 4750           * Attempt to push any data in the page cache.  If this fails
4435 4751           * we will get kicked out later in zfs_zinactive().
4436 4752           */
4437 4753          if (vn_has_cached_data(vp)) {
4438 4754                  (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
4439      -                    cr);
     4755 +                    CRED());
4440 4756          }
4441 4757  
4442 4758          if (zp->z_atime_dirty && zp->z_unlinked == 0) {
4443 4759                  dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
4444 4760  
4445 4761                  dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
4446 4762                  zfs_sa_upgrade_txholds(tx, zp);
4447 4763                  error = dmu_tx_assign(tx, TXG_WAIT);
4448 4764                  if (error) {
4449 4765                          dmu_tx_abort(tx);
↓ open down ↓ 1 lines elided ↑ open up ↑
4451 4767                          mutex_enter(&zp->z_lock);
4452 4768                          (void) sa_update(zp->z_sa_hdl, SA_ZPL_ATIME(zfsvfs),
4453 4769                              (void *)&zp->z_atime, sizeof (zp->z_atime), tx);
4454 4770                          zp->z_atime_dirty = 0;
4455 4771                          mutex_exit(&zp->z_lock);
4456 4772                          dmu_tx_commit(tx);
4457 4773                  }
4458 4774          }
4459 4775  
4460 4776          zfs_zinactive(zp);
     4777 +
     4778 +        VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
     4779 +
4461 4780          rw_exit(&zfsvfs->z_teardown_inactive_lock);
4462 4781  }
4463 4782  
     4783 +/*
     4784 + * Taskq callback that calls zfs_inactive_impl() so the znode can be freed.
     4785 + */
     4786 +static void
     4787 +zfs_inactive_task(void *task_arg)
     4788 +{
     4789 +        znode_t *zp = task_arg;
     4790 +        ASSERT(zp != NULL);
     4791 +        zfs_inactive_impl(zp);
     4792 +}
     4793 +
     4794 +/*ARGSUSED*/
     4795 +void
     4796 +zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
     4797 +{
     4798 +        znode_t *zp = VTOZ(vp);
     4799 +        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
     4800 +
     4801 +        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
     4802 +
     4803 +        VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
     4804 +
     4805 +        if (zfs_znode_free_invalid(zp))
     4806 +                return; /* z_teardown_inactive_lock already dropped */
     4807 +
     4808 +        if (zfs_do_async_free &&
     4809 +            zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
     4810 +            taskq_dispatch(dsl_pool_vnrele_taskq(
     4811 +            dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
     4812 +            zp, TQ_NOSLEEP) != NULL) {
     4813 +                rw_exit(&zfsvfs->z_teardown_inactive_lock);
     4814 +                return; /* task dispatched, we're done */
     4815 +        }
     4816 +        rw_exit(&zfsvfs->z_teardown_inactive_lock);
     4817 +
     4818 +        /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
     4819 +        zfs_inactive_impl(zp);
     4820 +}
     4821 +
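The new `zfs_inactive()` dispatches the free of a large znode to the pool's vnrele taskq with `TQ_NOSLEEP`, and falls back to the synchronous path when dispatch fails. A user-space sketch of that dispatch-or-fallback shape (the queue and size test are modeled as plain booleans; nothing here is the taskq API):

```c
#include <assert.h>

static int freed;               /* counts completed frees */

static void inactive_impl(void)
{
        freed++;
}

/* Hypothetical dispatcher: refuses when the queue has no room,
 * the way taskq_dispatch(..., TQ_NOSLEEP) can fail. */
static int fake_dispatch(void (*fn)(void), int queue_has_room)
{
        if (!queue_has_room)
                return (0);     /* dispatch failed */
        fn();                   /* a real taskq would run this async */
        return (1);
}

/*
 * The shape of zfs_inactive() above: only files past the size
 * threshold are worth the dispatch overhead, and a failed dispatch
 * degrades to the synchronous zfs_inactive_impl() call.
 */
static void inactive(int big_file, int queue_has_room)
{
        if (big_file && fake_dispatch(inactive_impl, queue_has_room))
                return;         /* task dispatched, we're done */
        inactive_impl();        /* sync fallback */
}
```

The `z_znodes_freeing_cnt` counter in the real code exists so unmount can tell whether any of these deferred frees are still outstanding (the NEX-8972 EBUSY fix); this sketch omits it.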
4464 4822  /*
4465 4823   * Bounds-check the seek operation.
4466 4824   *
4467 4825   *      IN:     vp      - vnode seeking within
4468 4826   *              ooff    - old file offset
4469 4827   *              noffp   - pointer to new file offset
4470 4828   *              ct      - caller context
4471 4829   *
4472 4830   *      RETURN: 0 on success, EINVAL if new offset invalid.
4473 4831   */
↓ open down ↓ 925 lines elided ↑ open up ↑