NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)

          --- old/usr/src/uts/common/fs/zfs/zfs_vnops.c
          +++ new/usr/src/uts/common/fs/zfs/zfs_vnops.c
[13 lines elided]
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
       24 + * Portions Copyright 2007 Jeremy Teo
       25 + * Portions Copyright 2010 Robert Milkowski
  24   26   * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  25   27   * Copyright (c) 2014 Integros [integros.com]
  26   28   * Copyright 2015 Joyent, Inc.
  27   29   * Copyright 2017 Nexenta Systems, Inc.
  28   30   */
  29   31  
  30      -/* Portions Copyright 2007 Jeremy Teo */
  31      -/* Portions Copyright 2010 Robert Milkowski */
  32      -
  33   32  #include <sys/types.h>
  34   33  #include <sys/param.h>
  35   34  #include <sys/time.h>
  36   35  #include <sys/systm.h>
  37   36  #include <sys/sysmacros.h>
  38   37  #include <sys/resource.h>
  39   38  #include <sys/vfs.h>
  40   39  #include <sys/vfs_opreg.h>
  41   40  #include <sys/vnode.h>
  42   41  #include <sys/file.h>
[33 lines elided]
  76   75  #include "fs/fs_subr.h"
  77   76  #include <sys/zfs_ctldir.h>
  78   77  #include <sys/zfs_fuid.h>
  79   78  #include <sys/zfs_sa.h>
  80   79  #include <sys/dnlc.h>
  81   80  #include <sys/zfs_rlock.h>
  82   81  #include <sys/extdirent.h>
  83   82  #include <sys/kidmap.h>
  84   83  #include <sys/cred.h>
  85   84  #include <sys/attr.h>
       85 +#include <sys/dsl_prop.h>
  86   86  #include <sys/zil.h>
  87   87  
  88   88  /*
  89   89   * Programming rules.
  90   90   *
  91   91   * Each vnode op performs some logical unit of work.  To do this, the ZPL must
  92   92   * properly lock its in-core state, create a DMU transaction, do the work,
  93   93   * record this work in the intent log (ZIL), commit the DMU transaction,
  94   94   * and wait for the intent log to commit if it is a synchronous operation.
  95   95   * Moreover, the vnode ops must work in both normal and log replay context.
[32 lines elided]
 128  128   *      the tx assigns, and sometimes after (e.g. z_lock), then failing
 129  129   *      to use a non-blocking assign can deadlock the system.  The scenario:
 130  130   *
 131  131   *      Thread A has grabbed a lock before calling dmu_tx_assign().
 132  132   *      Thread B is in an already-assigned tx, and blocks for this lock.
 133  133   *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
 134  134   *      forever, because the previous txg can't quiesce until B's tx commits.
 135  135   *
 136  136   *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
 137  137   *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
 138      - *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
      138 + *      calls to dmu_tx_assign(), pass TXG_WAITED rather than TXG_NOWAIT,
 139  139   *      to indicate that this operation has already called dmu_tx_wait().
 140  140   *      This will ensure that we don't retry forever, waiting a short bit
 141  141   *      each time.
 142  142   *
 143  143   *  (5) If the operation succeeded, generate the intent log entry for it
 144  144   *      before dropping locks.  This ensures that the ordering of events
 145  145   *      in the intent log matches the order in which they actually occurred.
 146  146   *      During ZIL replay the zfs_log_* functions will update the sequence
 147  147   *      number to indicate the zil transaction has replayed.
 148  148   *
[4 lines elided]
 153  153   *      to ensure that synchronous semantics are provided when necessary.
 154  154   *
 155  155   * In general, this is how things should be ordered in each vnode op:
 156  156   *
 157  157   *      ZFS_ENTER(zfsvfs);              // exit if unmounted
 158  158   * top:
 159  159   *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
 160  160   *      rw_enter(...);                  // grab any other locks you need
 161  161   *      tx = dmu_tx_create(...);        // get DMU tx
 162  162   *      dmu_tx_hold_*();                // hold each object you might modify
 163      - *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
      163 + *      error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
 164  164   *      if (error) {
 165  165   *              rw_exit(...);           // drop locks
 166  166   *              zfs_dirent_unlock(dl);  // unlock directory entry
 167  167   *              VN_RELE(...);           // release held vnodes
 168  168   *              if (error == ERESTART) {
 169  169   *                      waited = B_TRUE;
 170  170   *                      dmu_tx_wait(tx);
 171  171   *                      dmu_tx_abort(tx);
 172  172   *                      goto top;
 173  173   *              }
[6 lines elided]
 180  180   *              zfs_log_*(...);         // on success, make ZIL entry
 181  181   *      dmu_tx_commit(tx);              // commit DMU tx -- error or not
 182  182   *      rw_exit(...);                   // drop locks
 183  183   *      zfs_dirent_unlock(dl);          // unlock directory entry
 184  184   *      VN_RELE(...);                   // release held vnodes
 185  185   *      zil_commit(zilog, foid);        // synchronous when necessary
 186  186   *      ZFS_EXIT(zfsvfs);               // finished in zfs
 187  187   *      return (error);                 // done, report error
 188  188   */
 189  189  
      190 +/* set this tunable to zero to disable asynchronous freeing of files */
      191 +boolean_t zfs_do_async_free = B_TRUE;
      192 +
      193 +/*
      194 + * This value will be multiplied by zfs_dirty_data_max to determine
      195 + * the threshold past which we will call zfs_inactive_impl() async.
      196 + *
      197 + * Selecting the multiplier is a balance between how long we're willing to wait
      198 + * for delete/free to complete (get shell back, have a NFS thread captive, etc)
      199 + * and reducing the number of active requests in the backing taskq.
      200 + *
      201 + * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
      202 + * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
      203 + *
      204 + * WARNING: Setting this tunable to zero will enable asynchronous freeing for
       205 + * all files, which can have undesirable side effects.
      206 + */
      207 +uint16_t zfs_inactive_async_multiplier = 16;
      208 +
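The threshold arithmetic described in the comment above can be sketched in standalone C. The names mirror the tunables but this is an illustrative userland sketch, not the kernel code:

```c
#include <assert.h>

/*
 * Illustrative sketch of the async-free decision: free asynchronously
 * when the znode size exceeds zfs_dirty_data_max multiplied by
 * zfs_inactive_async_multiplier (4 GiB * 16 = 64 GiB by default).
 */
static int
should_free_async(unsigned long long zsize,
    unsigned long long dirty_data_max, unsigned short multiplier)
{
	/* multiplier == 0 makes the threshold 0: every file goes async */
	return (zsize > dirty_data_max * multiplier);
}
```

With the default values, only vnodes strictly larger than 64 GiB take the async path; a multiplier of zero sends everything async, which is the side effect the warning above cautions against.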
      209 +int nms_worm_transition_time = 30;
      210 +int
      211 +zfs_worm_in_trans(znode_t *zp)
      212 +{
      213 +        zfsvfs_t                *zfsvfs = zp->z_zfsvfs;
      214 +        timestruc_t             now;
      215 +        sa_bulk_attr_t          bulk[2];
      216 +        uint64_t                ctime[2];
      217 +        int                     count = 0;
      218 +
      219 +        if (!nms_worm_transition_time)
      220 +                return (0);
      221 +
      222 +        gethrestime(&now);
      223 +        SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
      224 +            &ctime, sizeof (ctime));
      225 +        if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
      226 +                return (0);
      227 +
      228 +        return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
      229 +}
      230 +
 190  231  /* ARGSUSED */
 191  232  static int
 192  233  zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
 193  234  {
 194  235          znode_t *zp = VTOZ(*vpp);
 195  236          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 196  237  
 197  238          ZFS_ENTER(zfsvfs);
 198  239          ZFS_VERIFY_ZP(zp);
 199  240  
[20 lines elided]
 220  261          return (0);
 221  262  }
 222  263  
 223  264  /* ARGSUSED */
 224  265  static int
 225  266  zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
 226  267      caller_context_t *ct)
 227  268  {
 228  269          znode_t *zp = VTOZ(vp);
 229  270          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
      271 +        pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
 230  272  
 231  273          /*
 232  274           * Clean up any locks held by this process on the vp.
 233  275           */
 234      -        cleanlocks(vp, ddi_get_pid(), 0);
 235      -        cleanshares(vp, ddi_get_pid());
      276 +        cleanlocks(vp, caller_pid, 0);
      277 +        cleanshares(vp, caller_pid);
 236  278  
 237  279          ZFS_ENTER(zfsvfs);
 238  280          ZFS_VERIFY_ZP(zp);
 239  281  
 240  282          /* Decrement the synchronous opens in the znode */
 241  283          if ((flag & (FSYNC | FDSYNC)) && (count == 1))
 242  284                  atomic_dec_32(&zp->z_sync_cnt);
 243  285  
 244  286          if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 245  287              ZTOV(zp)->v_type == VREG &&
[233 lines elided]
 479  521                              uio, bytes);
 480  522                  }
 481  523                  len -= bytes;
 482  524                  off = 0;
 483  525                  if (error)
 484  526                          break;
 485  527          }
 486  528          return (error);
 487  529  }
 488  530  
      531 +
      532 +/*
      533 + * ZFS I/O rate throttling
      534 + */
      535 +
      536 +#define DELAY_SHIFT 24
      537 +
      538 +typedef struct zfs_rate_delay {
      539 +        uint_t rl_rate;
      540 +        hrtime_t rl_delay;
      541 +} zfs_rate_delay_t;
      542 +
      543 +/*
      544 + * The time we'll attempt to cv_wait (below), in nSec.
      545 + * This should be no less than the minimum time it normally takes
      546 + * to block a thread and wake back up after the timeout fires.
      547 + *
       548 + * Each table entry represents the delay for each 4MB of bandwidth.
       549 + * We reduce the delay as the size of the I/O increases.
      550 + */
      551 +zfs_rate_delay_t zfs_rate_delay_table[] = {
      552 +        {0, 100000},
      553 +        {1, 100000},
      554 +        {2, 100000},
      555 +        {3, 100000},
      556 +        {4, 100000},
      557 +        {5, 50000},
      558 +        {6, 50000},
      559 +        {7, 50000},
      560 +        {8, 50000},
      561 +        {9, 25000},
      562 +        {10, 25000},
      563 +        {11, 25000},
      564 +        {12, 25000},
      565 +        {13, 12500},
      566 +        {14, 12500},
      567 +        {15, 12500},
      568 +        {16, 12500},
      569 +        {17, 6250},
      570 +        {18, 6250},
      571 +        {19, 6250},
      572 +        {20, 6250},
      573 +        {21, 3125},
      574 +        {22, 3125},
      575 +        {23, 3125},
      576 +        {24, 3125},
      577 +};
      578 +
      579 +#define MAX_RATE_TBL_ENTRY 24
      580 +
      581 +/*
       582 + * The delay we use should be reduced based on the size of the iorate;
       583 + * for higher iorates we want a shorter delay.
      584 + */
      585 +static inline hrtime_t
      586 +zfs_get_delay(ssize_t iorate)
      587 +{
      588 +        uint_t rate = iorate >> DELAY_SHIFT;
      589 +
      590 +        if (rate > MAX_RATE_TBL_ENTRY)
      591 +                rate = MAX_RATE_TBL_ENTRY;
      592 +        return (zfs_rate_delay_table[rate].rl_delay);
      593 +}
      594 +
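The table lookup above can be exercised in a standalone userland sketch (the names here are illustrative, not the kernel symbols): the rate cap is bucketed by shifting right `DELAY_SHIFT` bits, and higher buckets map to shorter waits, clamped at the last table entry.

```c
#include <assert.h>

#define DELAY_SHIFT 24
#define MAX_RATE_TBL_ENTRY 24

/* Delay (nsec) per rate bucket, mirroring zfs_rate_delay_table above. */
static const long long delay_tbl[MAX_RATE_TBL_ENTRY + 1] = {
	100000, 100000, 100000, 100000, 100000,
	50000, 50000, 50000, 50000,
	25000, 25000, 25000, 25000,
	12500, 12500, 12500, 12500,
	6250, 6250, 6250, 6250,
	3125, 3125, 3125, 3125,
};

static long long
get_delay_sketch(long long iorate)
{
	unsigned rate = (unsigned)(iorate >> DELAY_SHIFT);

	if (rate > MAX_RATE_TBL_ENTRY)
		rate = MAX_RATE_TBL_ENTRY;
	return (delay_tbl[rate]);
}
```

A small cap lands in bucket 0 (the longest wait, 100 usec); any cap past bucket 24 is clamped to the shortest wait of 3125 nsec.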
      595 +/*
      596 + * ZFS I/O rate throttling
      597 + * See "Token Bucket" on Wikipedia
      598 + *
      599 + * This is "Token Bucket" with some modifications to avoid wait times
      600 + * longer than a couple seconds, so that we don't trigger NFS retries
      601 + * or similar.  This does mean that concurrent requests might take us
      602 + * over the rate limit, but that's a lesser evil.
      603 + */
      604 +static void
      605 +zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
      606 +{
      607 +        zfs_rate_state_t *rate = &zfsvfs->z_rate;
      608 +        hrtime_t now, delta; /* nanoseconds */
      609 +        int64_t refill;
      610 +
      611 +        VERIFY(rate->rate_cap > 0);
      612 +        mutex_enter(&rate->rate_lock);
      613 +
      614 +        /*
      615 +         * If another thread is already waiting, we must queue up behind them.
      616 +         * We'll wait up to 1 sec here.  We normally will resume by cv_signal,
      617 +         * so we don't need fine timer resolution on this wait.
      618 +         */
      619 +        if (rate->rate_token_bucket < 0) {
      620 +                rate->rate_waiters++;
      621 +                (void) cv_timedwait_hires(
      622 +                    &rate->rate_wait_cv, &rate->rate_lock,
      623 +                    NANOSEC, TR_CLOCK_TICK, 0);
      624 +                rate->rate_waiters--;
      625 +        }
      626 +
      627 +        /*
      628 +         * How long since we last updated the bucket?
      629 +         */
      630 +        now = gethrtime();
      631 +        delta = now - rate->rate_last_update;
      632 +        rate->rate_last_update = now;
      633 +        if (delta < 0)
      634 +                delta = 0; /* paranoid */
      635 +
      636 +        /*
      637 +         * Add "tokens" for time since last update,
      638 +         * being careful about possible overflow.
      639 +         */
      640 +        refill = (delta * rate->rate_cap) / NANOSEC;
      641 +        if (refill < 0 || refill > rate->rate_cap)
      642 +                refill = rate->rate_cap; /* overflow */
      643 +        rate->rate_token_bucket += refill;
      644 +        if (rate->rate_token_bucket > rate->rate_cap)
      645 +                rate->rate_token_bucket = rate->rate_cap;
      646 +
      647 +        /*
       648 + * Withdraw tokens for the current I/O.  If this makes us overdrawn,
      649 +         * wait an amount of time proportionate to the overdraft.  However,
      650 +         * as a sanity measure, never wait more than 1 sec, and never try to
      651 +         * wait less than the time it normally takes to block and reschedule.
      652 +         *
      653 +         * Leave the bucket negative while we wait so other threads know to
      654 +         * queue up. In here, "refill" is the debt we're waiting to pay off.
      655 +         */
      656 +        rate->rate_token_bucket -= iosize;
      657 +        if (rate->rate_token_bucket < 0) {
      658 +                hrtime_t zfs_rate_wait = 0;
      659 +
      660 +                refill = rate->rate_token_bucket;
      661 +                DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
      662 +                    int64_t, refill);
      663 +
      664 +                if (rate->rate_cap <= 0)
      665 +                        goto nocap;
      666 +
      667 +                delta = (refill * NANOSEC) / rate->rate_cap;
      668 +                delta = MIN(delta, NANOSEC);
      669 +
      670 +                zfs_rate_wait = zfs_get_delay(rate->rate_cap);
      671 +
      672 +                if (delta > zfs_rate_wait) {
      673 +                        (void) cv_timedwait_hires(
      674 +                            &rate->rate_wait_cv, &rate->rate_lock,
      675 +                            delta, TR_CLOCK_TICK, 0);
      676 +                }
      677 +
      678 +                rate->rate_token_bucket += refill;
      679 +        }
      680 +nocap:
      681 +        if (rate->rate_waiters > 0) {
      682 +                cv_signal(&rate->rate_wait_cv);
      683 +        }
      684 +
      685 +        mutex_exit(&rate->rate_lock);
      686 +}
      687 +
      688 +
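The core token-bucket arithmetic in zfs_rate_throttle() can be sketched single-threaded as below. The names are illustrative only; the kernel version adds the mutex, cv waits, waiter queuing, and the delay-table floor.

```c
#define NANOSEC 1000000000LL

typedef struct bucket {
	long long cap;		/* bytes per second (rate_cap) */
	long long tokens;	/* may go negative when overdrawn */
} bucket_t;

/*
 * Refill tokens proportional to elapsed time (capped at one full
 * bucket), withdraw the I/O size, and return the nanoseconds the
 * caller should wait: zero if under the cap, otherwise a wait
 * proportional to the overdraft, capped at one second.
 */
static long long
bucket_charge(bucket_t *b, long long elapsed_ns, long long iosize)
{
	long long refill = (elapsed_ns * b->cap) / NANOSEC;
	long long wait;

	if (refill < 0 || refill > b->cap)
		refill = b->cap;	/* overflow or long idle period */
	b->tokens += refill;
	if (b->tokens > b->cap)
		b->tokens = b->cap;	/* bucket never exceeds the cap */

	b->tokens -= iosize;
	if (b->tokens >= 0)
		return (0);
	wait = (-b->tokens * NANOSEC) / b->cap;
	return (wait > NANOSEC ? NANOSEC : wait);
}
```

The one-second cap on the wait is what keeps concurrent requests from stalling long enough to trigger NFS retries, at the cost of occasionally exceeding the configured rate, as the comment above notes.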
 489  689  offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
 490  690  
 491  691  /*
 492  692   * Read bytes from specified file into supplied buffer.
 493  693   *
 494  694   *      IN:     vp      - vnode of file to be read from.
 495  695   *              uio     - structure supplying read location, range info,
 496  696   *                        and return buffer.
 497  697   *              ioflag  - SYNC flags; used to provide FRSYNC semantics.
 498  698   *              cr      - credentials of caller.
[46 lines elided]
 545  745           */
 546  746          if (MANDMODE(zp->z_mode)) {
 547  747                  if (error = chklock(vp, FREAD,
 548  748                      uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
 549  749                          ZFS_EXIT(zfsvfs);
 550  750                          return (error);
 551  751                  }
 552  752          }
 553  753  
 554  754          /*
      755 +         * ZFS I/O rate throttling
      756 +         */
      757 +        if (zfsvfs->z_rate.rate_cap)
      758 +                zfs_rate_throttle(zfsvfs, uio->uio_resid);
      759 +
      760 +        /*
 555  761           * If we're in FRSYNC mode, sync out this znode before reading it.
 556  762           */
 557  763          if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
 558  764                  zil_commit(zfsvfs->z_log, zp->z_id);
 559  765  
 560  766          /*
 561  767           * Lock the range against changes.
 562  768           */
 563  769          rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
 564  770  
[143 lines elided]
 708  914          }
 709  915  
 710  916          /*
 711  917           * If immutable or not appending then return EPERM.
 712  918           * Intentionally allow ZFS_READONLY through here.
 713  919           * See zfs_zaccess_common()
 714  920           */
 715  921          if ((zp->z_pflags & ZFS_IMMUTABLE) ||
 716  922              ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
 717  923              (uio->uio_loffset < zp->z_size))) {
 718      -                ZFS_EXIT(zfsvfs);
 719      -                return (SET_ERROR(EPERM));
      924 +                /* Make sure we're not a WORM before returning EPERM. */
      925 +                if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
      926 +                    !zp->z_zfsvfs->z_isworm) {
      927 +                        ZFS_EXIT(zfsvfs);
      928 +                        return (SET_ERROR(EPERM));
      929 +                }
 720  930          }
 721  931  
 722  932          zilog = zfsvfs->z_log;
 723  933  
 724  934          /*
 725  935           * Validate file offset
 726  936           */
 727  937          woff = ioflag & FAPPEND ? zp->z_size : uio->uio_loffset;
 728  938          if (woff < 0) {
 729  939                  ZFS_EXIT(zfsvfs);
[4 lines elided]
 734  944           * Check for mandatory locks before calling zfs_range_lock()
 735  945           * in order to prevent a deadlock with locks set via fcntl().
 736  946           */
 737  947          if (MANDMODE((mode_t)zp->z_mode) &&
 738  948              (error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
 739  949                  ZFS_EXIT(zfsvfs);
 740  950                  return (error);
 741  951          }
 742  952  
 743  953          /*
      954 +         * ZFS I/O rate throttling
      955 +         */
      956 +        if (zfsvfs->z_rate.rate_cap)
      957 +                zfs_rate_throttle(zfsvfs, uio->uio_resid);
      958 +
      959 +        /*
 744  960           * Pre-fault the pages to ensure slow (eg NFS) pages
 745  961           * don't hold up txg.
 746  962           * Skip this if uio contains loaned arc_buf.
 747  963           */
 748  964          if ((uio->uio_extflg == UIO_XUIO) &&
 749  965              (((xuio_t *)uio)->xu_type == UIOTYPE_ZEROCOPY))
 750  966                  xuio = (xuio_t *)uio;
 751  967          else
 752  968                  uio_prefaultpages(MIN(n, max_blksz), uio);
 753  969  
[254 lines elided]
1008 1224          }
1009 1225  
1010 1226          if (ioflag & (FSYNC | FDSYNC) ||
1011 1227              zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
1012 1228                  zil_commit(zilog, zp->z_id);
1013 1229  
1014 1230          ZFS_EXIT(zfsvfs);
1015 1231          return (0);
1016 1232  }
1017 1233  
     1234 +/* ARGSUSED */
1018 1235  void
1019 1236  zfs_get_done(zgd_t *zgd, int error)
1020 1237  {
1021 1238          znode_t *zp = zgd->zgd_private;
1022 1239          objset_t *os = zp->z_zfsvfs->z_os;
1023 1240  
1024 1241          if (zgd->zgd_db)
1025 1242                  dmu_buf_rele(zgd->zgd_db, zgd);
1026 1243  
1027 1244          zfs_range_unlock(zgd->zgd_rl);
1028 1245  
1029 1246          /*
1030 1247           * Release the vnode asynchronously as we currently have the
1031 1248           * txg stopped from syncing.
1032 1249           */
1033 1250          VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
1034 1251  
1035      -        if (error == 0 && zgd->zgd_bp)
1036      -                zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
1037      -
1038 1252          kmem_free(zgd, sizeof (zgd_t));
1039 1253  }
1040 1254  
1041 1255  #ifdef DEBUG
1042 1256  static int zil_fault_io = 0;
1043 1257  #endif
1044 1258  
1045 1259  /*
1046 1260   * Get data to generate a TX_WRITE intent log record.
1047 1261   */
[103 lines elided]
1151 1365                           */
1152 1366                          if (error == 0)
1153 1367                                  return (0);
1154 1368  
1155 1369                          if (error == EALREADY) {
1156 1370                                  lr->lr_common.lrc_txtype = TX_WRITE2;
1157 1371                                  /*
1158 1372                                   * TX_WRITE2 relies on the data previously
1159 1373                                   * written by the TX_WRITE that caused
1160 1374                                   * EALREADY.  We zero out the BP because
1161      -                                 * it is the old, currently-on-disk BP,
1162      -                                 * so there's no need to zio_flush() its
1163      -                                 * vdevs (flushing would needlesly hurt
1164      -                                 * performance, and doesn't work on
1165      -                                 * indirect vdevs).
     1375 +                                 * it is the old, currently-on-disk BP.
1166 1376                                   */
1167 1377                                  zgd->zgd_bp = NULL;
1168 1378                                  BP_ZERO(bp);
1169 1379                                  error = 0;
1170 1380                          }
1171 1381                  }
1172 1382          }
1173 1383  
1174 1384          zfs_get_done(zgd, error);
1175 1385  
[62 lines elided]
1238 1448   *
1239 1449   * Timestamps:
1240 1450   *      NA
1241 1451   */
1242 1452  /* ARGSUSED */
1243 1453  static int
1244 1454  zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
1245 1455      int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
1246 1456      int *direntflags, pathname_t *realpnp)
1247 1457  {
1248      -        znode_t *zdp = VTOZ(dvp);
     1458 +        znode_t *zp, *zdp = VTOZ(dvp);
1249 1459          zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
1250 1460          int     error = 0;
1251 1461  
1252 1462          /*
1253 1463           * Fast path lookup, however we must skip DNLC lookup
1254 1464           * for case folding or normalizing lookups because the
1255 1465           * DNLC code only stores the passed in name.  This means
1256 1466           * creating 'a' and removing 'A' on a case insensitive
1257 1467           * file system would work, but DNLC still thinks 'a'
1258 1468           * exists and won't let you create it again on the next
[97 lines elided]
1356 1566  
1357 1567          if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
1358 1568              NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
1359 1569                  ZFS_EXIT(zfsvfs);
1360 1570                  return (SET_ERROR(EILSEQ));
1361 1571          }
1362 1572  
1363 1573          error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
1364 1574          if (error == 0)
1365 1575                  error = specvp_check(vpp, cr);
     1576 +        if (*vpp) {
     1577 +                zp = VTOZ(*vpp);
     1578 +                if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
     1579 +                    ((*vpp)->v_type != VDIR) &&
     1580 +                    zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
     1581 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1582 +                }
     1583 +        }
1366 1584  
1367 1585          ZFS_EXIT(zfsvfs);
1368 1586          return (error);
1369 1587  }
1370 1588  
1371 1589  /*
1372 1590   * Attempt to create a new entry in a directory.  If the entry
1373 1591   * already exists, truncate the file if permissible, else return
1374 1592   * an error.  Return the vp of the created or trunc'd file.
1375 1593   *
[15 lines elided]
1391 1609   *      dvp - ctime|mtime updated if new entry created
1392 1610   *       vp - ctime|mtime always, atime if new
1393 1611   */
1394 1612  
1395 1613  /* ARGSUSED */
1396 1614  static int
1397 1615  zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
1398 1616      int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
1399 1617      vsecattr_t *vsecp)
1400 1618  {
     1619 +        int             imm_was_set = 0;
1401 1620          znode_t         *zp, *dzp = VTOZ(dvp);
1402 1621          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1403 1622          zilog_t         *zilog;
1404 1623          objset_t        *os;
1405 1624          zfs_dirlock_t   *dl;
1406 1625          dmu_tx_t        *tx;
1407 1626          int             error;
1408 1627          ksid_t          *ksid;
1409 1628          uid_t           uid;
1410 1629          gid_t           gid = crgetgid(cr);
[65 lines elided]
1476 1695                          if (strcmp(name, "..") == 0)
1477 1696                                  error = SET_ERROR(EISDIR);
1478 1697                          ZFS_EXIT(zfsvfs);
1479 1698                          return (error);
1480 1699                  }
1481 1700          }
1482 1701  
1483 1702          if (zp == NULL) {
1484 1703                  uint64_t txtype;
1485 1704  
     1705 +                if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
     1706 +                    dzp->z_zfsvfs->z_isworm) {
     1707 +                        imm_was_set = 1;
     1708 +                        dzp->z_pflags &= ~ZFS_IMMUTABLE;
     1709 +                }
     1710 +
1486 1711                  /*
1487 1712                   * Create a new file object and update the directory
1488 1713                   * to reference it.
1489 1714                   */
1490 1715                  if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
1491 1716                          if (have_acl)
1492 1717                                  zfs_acl_ids_free(&acl_ids);
     1718 +                        if (imm_was_set)
     1719 +                                dzp->z_pflags |= ZFS_IMMUTABLE;
1493 1720                          goto out;
1494 1721                  }
1495 1722  
     1723 +                if (imm_was_set)
     1724 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
     1725 +
1496 1726                  /*
1497 1727                   * We only support the creation of regular files in
1498 1728                   * extended attribute directories.
1499 1729                   */
1500 1730  
1501 1731                  if ((dzp->z_pflags & ZFS_XATTR) &&
1502 1732                      (vap->va_type != VREG)) {
1503 1733                          if (have_acl)
1504 1734                                  zfs_acl_ids_free(&acl_ids);
1505 1735                          error = SET_ERROR(EINVAL);
↓ open down ↓ 19 lines elided ↑ open up ↑
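The `imm_was_set` handling in the hunk above follows a save/clear/restore shape: lift the WORM directory's immutable bit for the duration of the access check, then restore it on every exit path. A minimal user-space sketch of that idiom (all names here are invented stand-ins, not the kernel API):

```c
#include <assert.h>

/* Hypothetical stand-ins for ZFS_IMMUTABLE and the znode fields. */
#define IMM_FLAG 0x1u

struct node {
        unsigned pflags;
        int isworm;
};

/* Records whether the flag was clear while the check ran. */
static int saw_clear;

static int demo_check(struct node *n)
{
        saw_clear = ((n->pflags & IMM_FLAG) == 0);
        return (0);
}

/*
 * The save/clear/restore idiom from the hunk above: a WORM
 * directory's immutable bit is lifted only while the access
 * check runs and is restored no matter how the check ends.
 */
static int access_check_with_worm(struct node *dzp,
    int (*check)(struct node *))
{
        int imm_was_set = 0;
        int error;

        if ((dzp->pflags & IMM_FLAG) && dzp->isworm) {
                imm_was_set = 1;
                dzp->pflags &= ~IMM_FLAG;
        }
        error = check(dzp);
        if (imm_was_set)
                dzp->pflags |= IMM_FLAG;        /* restore on every path */
        return (error);
}
```

The restore-before-`goto out` in the error branch of the real hunk is the same discipline: the flag must be put back on the failure path as well as the success path.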
1525 1755                  fuid_dirtied = zfsvfs->z_fuid_dirty;
1526 1756                  if (fuid_dirtied)
1527 1757                          zfs_fuid_txhold(zfsvfs, tx);
1528 1758                  dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
1529 1759                  dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
1530 1760                  if (!zfsvfs->z_use_sa &&
1531 1761                      acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1532 1762                          dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
1533 1763                              0, acl_ids.z_aclp->z_acl_bytes);
1534 1764                  }
1535      -                error = dmu_tx_assign(tx,
1536      -                    (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     1765 +                error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
1537 1766                  if (error) {
1538 1767                          zfs_dirent_unlock(dl);
1539 1768                          if (error == ERESTART) {
1540 1769                                  waited = B_TRUE;
1541 1770                                  dmu_tx_wait(tx);
1542 1771                                  dmu_tx_abort(tx);
1543 1772                                  goto top;
1544 1773                          }
1545 1774                          zfs_acl_ids_free(&acl_ids);
1546 1775                          dmu_tx_abort(tx);
1547 1776                          ZFS_EXIT(zfsvfs);
1548 1777                          return (error);
1549 1778                  }
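The `dmu_tx_assign` hunks in this review swap the upstream `(waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT` for Nexenta's `TXG_WAITED` flag; the surrounding retry loop is unchanged. A user-space sketch of that retry idiom (hypothetical `try_assign` and flag names, not the DMU API):

```c
#include <assert.h>

#define MY_ERESTART 91          /* stand-in for "txg throttled, retry" */
#define F_NOWAIT    0x1
#define F_WAITED    0x2         /* caller already waited; skip throttle */

/* Hypothetical assigner: throttles only the first NOWAIT attempt. */
static int try_assign(int flags, int *attempts)
{
        (*attempts)++;
        if ((flags & F_WAITED) == 0 && *attempts == 1)
                return (MY_ERESTART);
        return (0);
}

/*
 * The idiom from the hunks above: on ERESTART, wait for the open
 * txg, remember that we waited, and restart the whole operation so
 * the second attempt is not throttled again.
 */
static int assign_with_retry(int *attempts)
{
        int waited = 0;
        int error;
top:
        error = try_assign(waited ? F_WAITED : F_NOWAIT, attempts);
        if (error == MY_ERESTART) {
                waited = 1;     /* dmu_tx_wait() would block here */
                goto top;
        }
        return (error);
}
```

In the kernel code the `goto top` restarts from directory-lock acquisition, which is why the tx is aborted and the dirlock dropped before retrying.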
1550 1779                  zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
1551 1780  
1552 1781                  if (fuid_dirtied)
1553 1782                          zfs_fuid_sync(zfsvfs, tx);
1554 1783  
     1784 +                if (imm_was_set)
     1785 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1786 +
1555 1787                  (void) zfs_link_create(dl, zp, tx, ZNEW);
1556 1788                  txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
1557 1789                  if (flag & FIGNORECASE)
1558 1790                          txtype |= TX_CI;
1559 1791                  zfs_log_create(zilog, tx, txtype, dzp, zp, name,
1560 1792                      vsecp, acl_ids.z_fuidp, vap);
1561 1793                  zfs_acl_ids_free(&acl_ids);
1562 1794                  dmu_tx_commit(tx);
1563 1795          } else {
1564 1796                  int aflags = (flag & FAPPEND) ? V_APPEND : 0;
↓ open down ↓ 12 lines elided ↑ open up ↑
1577 1809                          error = SET_ERROR(EEXIST);
1578 1810                          goto out;
1579 1811                  }
1580 1812                  /*
1581 1813                   * Can't open a directory for writing.
1582 1814                   */
1583 1815                  if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
1584 1816                          error = SET_ERROR(EISDIR);
1585 1817                          goto out;
1586 1818                  }
     1819 +                if ((flag & FWRITE) &&
     1820 +                    dzp->z_zfsvfs->z_isworm) {
     1821 +                        error = SET_ERROR(EPERM);
     1822 +                        goto out;
     1823 +                }
     1824 +
     1825 +                if (!(flag & FAPPEND) &&
     1826 +                    (zp->z_pflags & ZFS_IMMUTABLE) &&
     1827 +                    dzp->z_zfsvfs->z_isworm) {
     1828 +                        imm_was_set = 1;
     1829 +                        zp->z_pflags &= ~ZFS_IMMUTABLE;
     1830 +                }
1587 1831                  /*
1588 1832                   * Verify requested access to file.
1589 1833                   */
1590 1834                  if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
     1835 +                        if (imm_was_set)
     1836 +                                zp->z_pflags |= ZFS_IMMUTABLE;
1591 1837                          goto out;
1592 1838                  }
1593 1839  
     1840 +                if (imm_was_set)
     1841 +                        zp->z_pflags |= ZFS_IMMUTABLE;
     1842 +
1594 1843                  mutex_enter(&dzp->z_lock);
1595 1844                  dzp->z_seq++;
1596 1845                  mutex_exit(&dzp->z_lock);
1597 1846  
1598 1847                  /*
1599 1848                   * Truncate regular files if requested.
1600 1849                   */
1601 1850                  if ((ZTOV(zp)->v_type == VREG) &&
1602 1851                      (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1603 1852                          /* we can't hold any locks when calling zfs_freesp() */
↓ open down ↓ 86 lines elided ↑ open up ↑
1690 1939          if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1691 1940              NULL, realnmp)) {
1692 1941                  if (realnmp)
1693 1942                          pn_free(realnmp);
1694 1943                  ZFS_EXIT(zfsvfs);
1695 1944                  return (error);
1696 1945          }
1697 1946  
1698 1947          vp = ZTOV(zp);
1699 1948  
     1949 +        if (zp->z_zfsvfs->z_isworm) {
     1950 +                error = SET_ERROR(EPERM);
     1951 +                goto out;
     1952 +        }
     1953 +
1700 1954          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
1701 1955                  goto out;
1702 1956          }
1703 1957  
1704 1958          /*
1705 1959           * Need to use rmdir for removing directories.
1706 1960           */
1707 1961          if (vp->v_type == VDIR) {
1708 1962                  error = SET_ERROR(EPERM);
1709 1963                  goto out;
↓ open down ↓ 46 lines elided ↑ open up ↑
1756 2010          mutex_exit(&zp->z_lock);
1757 2011  
1758 2012          /* charge as an update -- would be nice not to charge at all */
1759 2013          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
1760 2014  
1761 2015          /*
1762 2016           * Mark this transaction as typically resulting in a net free of space
1763 2017           */
1764 2018          dmu_tx_mark_netfree(tx);
1765 2019  
1766      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2020 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
1767 2021          if (error) {
1768 2022                  zfs_dirent_unlock(dl);
1769 2023                  VN_RELE(vp);
1770 2024                  if (xzp)
1771 2025                          VN_RELE(ZTOV(xzp));
1772 2026                  if (error == ERESTART) {
1773 2027                          waited = B_TRUE;
1774 2028                          dmu_tx_wait(tx);
1775 2029                          dmu_tx_abort(tx);
1776 2030                          goto top;
↓ open down ↓ 106 lines elided ↑ open up ↑
1883 2137   *
1884 2138   * Timestamps:
1885 2139   *      dvp - ctime|mtime updated
1886 2140   *       vp - ctime|mtime|atime updated
1887 2141   */
1888 2142  /*ARGSUSED*/
1889 2143  static int
1890 2144  zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
1891 2145      caller_context_t *ct, int flags, vsecattr_t *vsecp)
1892 2146  {
     2147 +        int             imm_was_set = 0;
1893 2148          znode_t         *zp, *dzp = VTOZ(dvp);
1894 2149          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1895 2150          zilog_t         *zilog;
1896 2151          zfs_dirlock_t   *dl;
1897 2152          uint64_t        txtype;
1898 2153          dmu_tx_t        *tx;
1899 2154          int             error;
1900 2155          int             zf = ZNEW;
1901 2156          ksid_t          *ksid;
1902 2157          uid_t           uid;
↓ open down ↓ 59 lines elided ↑ open up ↑
1962 2217  top:
1963 2218          *vpp = NULL;
1964 2219  
1965 2220          if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
1966 2221              NULL, NULL)) {
1967 2222                  zfs_acl_ids_free(&acl_ids);
1968 2223                  ZFS_EXIT(zfsvfs);
1969 2224                  return (error);
1970 2225          }
1971 2226  
     2227 +        if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
     2228 +            dzp->z_zfsvfs->z_isworm) {
     2229 +                imm_was_set = 1;
     2230 +                dzp->z_pflags &= ~ZFS_IMMUTABLE;
     2231 +        }
     2232 +
1972 2233          if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
     2234 +                if (imm_was_set)
     2235 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
1973 2236                  zfs_acl_ids_free(&acl_ids);
1974 2237                  zfs_dirent_unlock(dl);
1975 2238                  ZFS_EXIT(zfsvfs);
1976 2239                  return (error);
1977 2240          }
1978 2241  
     2242 +        if (imm_was_set)
     2243 +                dzp->z_pflags |= ZFS_IMMUTABLE;
     2244 +
1979 2245          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
1980 2246                  zfs_acl_ids_free(&acl_ids);
1981 2247                  zfs_dirent_unlock(dl);
1982 2248                  ZFS_EXIT(zfsvfs);
1983 2249                  return (SET_ERROR(EDQUOT));
1984 2250          }
1985 2251  
1986 2252          /*
1987 2253           * Add a new entry to the directory.
1988 2254           */
↓ open down ↓ 4 lines elided ↑ open up ↑
1993 2259          if (fuid_dirtied)
1994 2260                  zfs_fuid_txhold(zfsvfs, tx);
1995 2261          if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1996 2262                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
1997 2263                      acl_ids.z_aclp->z_acl_bytes);
1998 2264          }
1999 2265  
2000 2266          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
2001 2267              ZFS_SA_BASE_ATTR_SIZE);
2002 2268  
2003      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2269 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2004 2270          if (error) {
2005 2271                  zfs_dirent_unlock(dl);
2006 2272                  if (error == ERESTART) {
2007 2273                          waited = B_TRUE;
2008 2274                          dmu_tx_wait(tx);
2009 2275                          dmu_tx_abort(tx);
2010 2276                          goto top;
2011 2277                  }
2012 2278                  zfs_acl_ids_free(&acl_ids);
2013 2279                  dmu_tx_abort(tx);
↓ open down ↓ 81 lines elided ↑ open up ↑
2095 2361           * Attempt to lock directory; fail if entry doesn't exist.
2096 2362           */
2097 2363          if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
2098 2364              NULL, NULL)) {
2099 2365                  ZFS_EXIT(zfsvfs);
2100 2366                  return (error);
2101 2367          }
2102 2368  
2103 2369          vp = ZTOV(zp);
2104 2370  
     2371 +        if (dzp->z_zfsvfs->z_isworm) {
     2372 +                error = SET_ERROR(EPERM);
     2373 +                goto out;
     2374 +        }
     2375 +
2105 2376          if (error = zfs_zaccess_delete(dzp, zp, cr)) {
2106 2377                  goto out;
2107 2378          }
2108 2379  
2109 2380          if (vp->v_type != VDIR) {
2110 2381                  error = SET_ERROR(ENOTDIR);
2111 2382                  goto out;
2112 2383          }
2113 2384  
2114 2385          if (vp == cwd) {
↓ open down ↓ 15 lines elided ↑ open up ↑
2130 2401           */
2131 2402          rw_enter(&zp->z_parent_lock, RW_WRITER);
2132 2403  
2133 2404          tx = dmu_tx_create(zfsvfs->z_os);
2134 2405          dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
2135 2406          dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
2136 2407          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
2137 2408          zfs_sa_upgrade_txholds(tx, zp);
2138 2409          zfs_sa_upgrade_txholds(tx, dzp);
2139 2410          dmu_tx_mark_netfree(tx);
2140      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     2411 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2141 2412          if (error) {
2142 2413                  rw_exit(&zp->z_parent_lock);
2143 2414                  rw_exit(&zp->z_name_lock);
2144 2415                  zfs_dirent_unlock(dl);
2145 2416                  VN_RELE(vp);
2146 2417                  if (error == ERESTART) {
2147 2418                          waited = B_TRUE;
2148 2419                          dmu_tx_wait(tx);
2149 2420                          dmu_tx_abort(tx);
2150 2421                          goto top;
↓ open down ↓ 636 lines elided ↑ open up ↑
2787 3058  
2788 3059          /*
2789 3060           * If this is an xvattr_t, then get a pointer to the structure of
2790 3061           * optional attributes.  If this is NULL, then we have a vattr_t.
2791 3062           */
2792 3063          xoap = xva_getxoptattr(xvap);
2793 3064  
2794 3065          xva_init(&tmpxvattr);
2795 3066  
2796 3067          /*
2797      -         * Immutable files can only alter immutable bit and atime
     3068 +         * Do not allow the immutable bit to be altered once it is set
2798 3069           */
2799 3070          if ((zp->z_pflags & ZFS_IMMUTABLE) &&
2800      -            ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
2801      -            ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
     3071 +            XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
     3072 +            zp->z_zfsvfs->z_isworm) {
2802 3073                  ZFS_EXIT(zfsvfs);
2803 3074                  return (SET_ERROR(EPERM));
2804 3075          }
2805 3076  
2806 3077          /*
     3078 +         * Immutable files can only alter atime
     3079 +         */
     3080 +        if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
     3081 +            ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
     3082 +            ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
     3083 +                if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
     3084 +                        ZFS_EXIT(zfsvfs);
     3085 +                        return (SET_ERROR(EPERM));
     3086 +                }
     3087 +        }
     3088 +
     3089 +        /*
2807 3090           * Note: ZFS_READONLY is handled in zfs_zaccess_common.
2808 3091           */
2809 3092  
2810 3093          /*
2811 3094           * Verify timestamps doesn't overflow 32 bits.
2812 3095           * ZFS can handle large timestamps, but 32bit syscalls can't
2813 3096           * handle times greater than 2039.  This check should be removed
2814 3097           * once large timestamps are fully supported.
2815 3098           */
2816 3099          if (mask & (AT_ATIME | AT_MTIME)) {
↓ open down ↓ 886 lines elided ↑ open up ↑
3703 3986                  dmu_tx_hold_sa(tx, tdzp->z_sa_hdl, B_FALSE);
3704 3987                  zfs_sa_upgrade_txholds(tx, tdzp);
3705 3988          }
3706 3989          if (tzp) {
3707 3990                  dmu_tx_hold_sa(tx, tzp->z_sa_hdl, B_FALSE);
3708 3991                  zfs_sa_upgrade_txholds(tx, tzp);
3709 3992          }
3710 3993  
3711 3994          zfs_sa_upgrade_txholds(tx, szp);
3712 3995          dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
3713      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     3996 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
3714 3997          if (error) {
3715 3998                  if (zl != NULL)
3716 3999                          zfs_rename_unlock(&zl);
3717 4000                  zfs_dirent_unlock(sdl);
3718 4001                  zfs_dirent_unlock(tdl);
3719 4002  
3720 4003                  if (sdzp == tdzp)
3721 4004                          rw_exit(&sdzp->z_name_lock);
3722 4005  
3723 4006                  VN_RELE(ZTOV(szp));
↓ open down ↓ 103 lines elided ↑ open up ↑
3827 4110  /*ARGSUSED*/
3828 4111  static int
3829 4112  zfs_symlink(vnode_t *dvp, char *name, vattr_t *vap, char *link, cred_t *cr,
3830 4113      caller_context_t *ct, int flags)
3831 4114  {
3832 4115          znode_t         *zp, *dzp = VTOZ(dvp);
3833 4116          zfs_dirlock_t   *dl;
3834 4117          dmu_tx_t        *tx;
3835 4118          zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
3836 4119          zilog_t         *zilog;
     4120 +        int             imm_was_set = 0;
3837 4121          uint64_t        len = strlen(link);
3838 4122          int             error;
3839 4123          int             zflg = ZNEW;
3840 4124          zfs_acl_ids_t   acl_ids;
3841 4125          boolean_t       fuid_dirtied;
3842 4126          uint64_t        txtype = TX_SYMLINK;
3843 4127          boolean_t       waited = B_FALSE;
3844 4128  
3845 4129          ASSERT(vap->va_type == VLNK);
3846 4130  
↓ open down ↓ 23 lines elided ↑ open up ↑
3870 4154          /*
3871 4155           * Attempt to lock directory; fail if entry already exists.
3872 4156           */
3873 4157          error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
3874 4158          if (error) {
3875 4159                  zfs_acl_ids_free(&acl_ids);
3876 4160                  ZFS_EXIT(zfsvfs);
3877 4161                  return (error);
3878 4162          }
3879 4163  
     4164 +        if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
     4165 +                imm_was_set = 1;
     4166 +                dzp->z_pflags &= ~ZFS_IMMUTABLE;
     4167 +        }
3880 4168          if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
     4169 +                if (imm_was_set)
     4170 +                        dzp->z_pflags |= ZFS_IMMUTABLE;
3881 4171                  zfs_acl_ids_free(&acl_ids);
3882 4172                  zfs_dirent_unlock(dl);
3883 4173                  ZFS_EXIT(zfsvfs);
3884 4174                  return (error);
3885 4175          }
     4176 +        if (imm_was_set)
     4177 +                dzp->z_pflags |= ZFS_IMMUTABLE;
3886 4178  
3887 4179          if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
3888 4180                  zfs_acl_ids_free(&acl_ids);
3889 4181                  zfs_dirent_unlock(dl);
3890 4182                  ZFS_EXIT(zfsvfs);
3891 4183                  return (SET_ERROR(EDQUOT));
3892 4184          }
3893 4185          tx = dmu_tx_create(zfsvfs->z_os);
3894 4186          fuid_dirtied = zfsvfs->z_fuid_dirty;
3895 4187          dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
3896 4188          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
3897 4189          dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
3898 4190              ZFS_SA_BASE_ATTR_SIZE + len);
3899 4191          dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
3900 4192          if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
3901 4193                  dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
3902 4194                      acl_ids.z_aclp->z_acl_bytes);
3903 4195          }
3904 4196          if (fuid_dirtied)
3905 4197                  zfs_fuid_txhold(zfsvfs, tx);
3906      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     4198 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
3907 4199          if (error) {
3908 4200                  zfs_dirent_unlock(dl);
3909 4201                  if (error == ERESTART) {
3910 4202                          waited = B_TRUE;
3911 4203                          dmu_tx_wait(tx);
3912 4204                          dmu_tx_abort(tx);
3913 4205                          goto top;
3914 4206                  }
3915 4207                  zfs_acl_ids_free(&acl_ids);
3916 4208                  dmu_tx_abort(tx);
↓ open down ↓ 200 lines elided ↑ open up ↑
4117 4409          if (error) {
4118 4410                  ZFS_EXIT(zfsvfs);
4119 4411                  return (error);
4120 4412          }
4121 4413  
4122 4414          tx = dmu_tx_create(zfsvfs->z_os);
4123 4415          dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
4124 4416          dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
4125 4417          zfs_sa_upgrade_txholds(tx, szp);
4126 4418          zfs_sa_upgrade_txholds(tx, dzp);
4127      -        error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
     4419 +        error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
4128 4420          if (error) {
4129 4421                  zfs_dirent_unlock(dl);
4130 4422                  if (error == ERESTART) {
4131 4423                          waited = B_TRUE;
4132 4424                          dmu_tx_wait(tx);
4133 4425                          dmu_tx_abort(tx);
4134 4426                          goto top;
4135 4427                  }
4136 4428                  dmu_tx_abort(tx);
4137 4429                  ZFS_EXIT(zfsvfs);
↓ open down ↓ 255 lines elided ↑ open up ↑
4393 4685                  }
4394 4686          }
4395 4687  out:
4396 4688          zfs_range_unlock(rl);
4397 4689          if ((flags & B_ASYNC) == 0 || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
4398 4690                  zil_commit(zfsvfs->z_log, zp->z_id);
4399 4691          ZFS_EXIT(zfsvfs);
4400 4692          return (error);
4401 4693  }
4402 4694  
4403      -/*ARGSUSED*/
4404      -void
4405      -zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
     4695 +/*
     4696 + * Returns B_TRUE and exits the z_teardown_inactive_lock
     4697 + * if the znode we are looking at is no longer valid
     4698 + */
     4699 +static boolean_t
     4700 +zfs_znode_free_invalid(znode_t *zp)
4406 4701  {
4407      -        znode_t *zp = VTOZ(vp);
4408 4702          zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4409      -        int error;
     4703 +        vnode_t *vp = ZTOV(zp);
4410 4704  
4411      -        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
     4705 +        ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
     4706 +
4412 4707          if (zp->z_sa_hdl == NULL) {
4413 4708                  /*
4414 4709                   * The fs has been unmounted, or we did a
4415 4710                   * suspend/resume and this file no longer exists.
4416 4711                   */
4417 4712                  if (vn_has_cached_data(vp)) {
4418 4713                          (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
4419      -                            B_INVAL, cr);
     4714 +                            B_INVAL, CRED());
4420 4715                  }
4421 4716  
4422 4717                  mutex_enter(&zp->z_lock);
4423 4718                  mutex_enter(&vp->v_lock);
4424 4719                  ASSERT(vp->v_count == 1);
4425 4720                  VN_RELE_LOCKED(vp);
4426 4721                  mutex_exit(&vp->v_lock);
4427 4722                  mutex_exit(&zp->z_lock);
     4723 +                VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
     4724 +                    UINT32_MAX);
4428 4725                  rw_exit(&zfsvfs->z_teardown_inactive_lock);
4429 4726                  zfs_znode_free(zp);
4430      -                return;
     4727 +                return (B_TRUE);
4431 4728          }
4432 4729  
     4730 +        return (B_FALSE);
     4731 +}
     4732 +
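`zfs_znode_free_invalid()` has an unusual contract worth flagging for reviewers: when it returns `B_TRUE`, it has already dropped `z_teardown_inactive_lock`, so the caller must return without unlocking. A pthread sketch of that "lock dropped on the true path" contract (names invented):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

/*
 * Sketch of the zfs_znode_free_invalid() contract: when the object
 * turns out to be dead, the helper releases the lock itself and
 * signals that via its return value.  Caller must hold lk on entry.
 */
static int free_if_invalid(int invalid)
{
        if (invalid) {
                pthread_mutex_unlock(&lk);      /* dropped on this path */
                return (1);
        }
        return (0);                             /* still locked */
}

static int do_work(int invalid)
{
        pthread_mutex_lock(&lk);
        if (free_if_invalid(invalid))
                return (-1);                    /* lk already dropped */
        /* ... normal work under the lock ... */
        pthread_mutex_unlock(&lk);
        return (0);
}
```

Misreading the return value here double-unlocks or leaks the lock, which is why both callers in the diff comment the early return explicitly.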
     4733 +/*
     4734 + * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
     4735 + * actual freeing.
     4736 + * This code used to be in zfs_inactive() before the async delete patch came in.
     4737 + */
     4738 +static void
     4739 +zfs_inactive_impl(znode_t *zp)
     4740 +{
     4741 +        vnode_t *vp = ZTOV(zp);
     4742 +        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
     4743 +        int error;
     4744 +
     4745 +        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
     4746 +        if (zfs_znode_free_invalid(zp))
     4747 +                return; /* z_teardown_inactive_lock already dropped */
     4748 +
4433 4749          /*
4434 4750           * Attempt to push any data in the page cache.  If this fails
4435 4751           * we will get kicked out later in zfs_zinactive().
4436 4752           */
4437 4753          if (vn_has_cached_data(vp)) {
4438 4754                  (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
4439      -                    cr);
     4755 +                    CRED());
4440 4756          }
4441 4757  
4442 4758          if (zp->z_atime_dirty && zp->z_unlinked == 0) {
4443 4759                  dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
4444 4760  
4445 4761                  dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
4446 4762                  zfs_sa_upgrade_txholds(tx, zp);
4447 4763                  error = dmu_tx_assign(tx, TXG_WAIT);
4448 4764                  if (error) {
4449 4765                          dmu_tx_abort(tx);
↓ open down ↓ 1 lines elided ↑ open up ↑
4451 4767                          mutex_enter(&zp->z_lock);
4452 4768                          (void) sa_update(zp->z_sa_hdl, SA_ZPL_ATIME(zfsvfs),
4453 4769                              (void *)&zp->z_atime, sizeof (zp->z_atime), tx);
4454 4770                          zp->z_atime_dirty = 0;
4455 4771                          mutex_exit(&zp->z_lock);
4456 4772                          dmu_tx_commit(tx);
4457 4773                  }
4458 4774          }
4459 4775  
4460 4776          zfs_zinactive(zp);
     4777 +
     4778 +        VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
     4779 +
4461 4780          rw_exit(&zfsvfs->z_teardown_inactive_lock);
4462 4781  }
4463 4782  
     4783 +/*
     4784 + * Taskq callback that calls zfs_inactive_impl() so the znode can be freed.
     4785 + */
     4786 +static void
     4787 +zfs_inactive_task(void *task_arg)
     4788 +{
     4789 +        znode_t *zp = task_arg;
     4790 +        ASSERT(zp != NULL);
     4791 +        zfs_inactive_impl(zp);
     4792 +}
     4793 +
     4794 +/*ARGSUSED*/
     4795 +void
     4796 +zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
     4797 +{
     4798 +        znode_t *zp = VTOZ(vp);
     4799 +        zfsvfs_t *zfsvfs = zp->z_zfsvfs;
     4800 +
     4801 +        rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
     4802 +
     4803 +        VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
     4804 +
     4805 +        if (zfs_znode_free_invalid(zp))
     4806 +                return; /* z_teardown_inactive_lock already dropped */
     4807 +
     4808 +        if (zfs_do_async_free &&
     4809 +            zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
     4810 +            taskq_dispatch(dsl_pool_vnrele_taskq(
     4811 +            dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
     4812 +            zp, TQ_NOSLEEP) != NULL) {
     4813 +                rw_exit(&zfsvfs->z_teardown_inactive_lock);
     4814 +                return; /* task dispatched, we're done */
     4815 +        }
     4816 +        rw_exit(&zfsvfs->z_teardown_inactive_lock);
     4817 +
     4818 +        /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
     4819 +        zfs_inactive_impl(zp);
     4820 +}
     4821 +
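The new `zfs_inactive()` dispatches the free of a large znode to the pool's vnrele taskq with `TQ_NOSLEEP`, and falls back to the synchronous path when dispatch fails. A user-space sketch of that dispatch-or-fallback shape (the queue and size test are modeled as plain booleans; nothing here is the taskq API):

```c
#include <assert.h>

static int freed;               /* counts completed frees */

static void inactive_impl(void)
{
        freed++;
}

/* Hypothetical dispatcher: refuses when the queue has no room,
 * the way taskq_dispatch(..., TQ_NOSLEEP) can fail. */
static int fake_dispatch(void (*fn)(void), int queue_has_room)
{
        if (!queue_has_room)
                return (0);     /* dispatch failed */
        fn();                   /* a real taskq would run this async */
        return (1);
}

/*
 * The shape of zfs_inactive() above: only files past the size
 * threshold are worth the dispatch overhead, and a failed dispatch
 * degrades to the synchronous zfs_inactive_impl() call.
 */
static void inactive(int big_file, int queue_has_room)
{
        if (big_file && fake_dispatch(inactive_impl, queue_has_room))
                return;         /* task dispatched, we're done */
        inactive_impl();        /* sync fallback */
}
```

The `z_znodes_freeing_cnt` counter in the real code exists so unmount can tell whether any of these deferred frees are still outstanding (the NEX-8972 EBUSY fix); this sketch omits it.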
4464 4822  /*
4465 4823   * Bounds-check the seek operation.
4466 4824   *
4467 4825   *      IN:     vp      - vnode seeking within
4468 4826   *              ooff    - old file offset
4469 4827   *              noffp   - pointer to new file offset
4470 4828   *              ct      - caller context
4471 4829   *
4472 4830   *      RETURN: 0 on success, EINVAL if new offset invalid.
4473 4831   */
↓ open down ↓ 925 lines elided ↑ open up ↑