big-one Sdiff usr/src/uts/common/fs/zfs/zfs

NEX-19083 backport OS-7314 zil_commit should omit cache thrash
9962 zil_commit should omit cache thrash
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
NEX-10069 ZFS_READONLY is a little too strict
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9436 Rate limiting controls ... (fix cstyle)
NEX-3562 filename normalization doesn't work for removes (sync with upstream)
NEX-9436 Rate limiting controls (was QoS) per ZFS dataset, updates from demo
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-9213 comment for enabling async delete for all files is reversed.
Reviewed by: Jean Mccormack <jean.mccormack@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-9090 trigger async freeing based on znode size
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-8972 Async-delete side-effect that may cause unmount EBUSY
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-8852 Quality-of-Service (QoS) controls per NFS share
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "NEX-5085 implement async delete for large files"
This reverts commit 65aa8f42d93fcbd6e0efb3d4883170a20d760611.
Fails regression testing of the zfs test mirror_stress_004.
NEX-5085 implement async delete for large files
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
NEX-7543 backout async delete (NEX-5085 and NEX-6151)
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-6151 panic when forcefully unmounting the FS with large open files
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-5085 implement async delete for large files
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3562 filename normalization doesn't work for removes
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
6328 Fix cstyle errors in zfs codebase (fix studio)
6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
NEX-4582 update wrc test cases for allow to use write back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
5692 expose the number of hole blocks in a file
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-4229 Panic destroying the pool using file backing store on FS with nbmand=on
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-1196 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Fixup merge results
re #14162 DOS issue with ZFS/NFS
re #7550 rb2134 lint-clean nza-kernel
re #6815 rb1758 need WORM in nza-kernel (4.0)

   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 
  22 /*
  23  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24  * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  25  * Copyright (c) 2014 Integros [integros.com]
  26  * Copyright 2015 Joyent, Inc.
  27  * Copyright 2017 Nexenta Systems, Inc.
  28  */
  29 
  30 /* Portions Copyright 2007 Jeremy Teo */
  31 /* Portions Copyright 2010 Robert Milkowski */
  32 
  33 #include <sys/types.h>
  34 #include <sys/param.h>
  35 #include <sys/time.h>
  36 #include <sys/systm.h>
  37 #include <sys/sysmacros.h>
  38 #include <sys/resource.h>
  39 #include <sys/vfs.h>
  40 #include <sys/vfs_opreg.h>
  41 #include <sys/vnode.h>
  42 #include <sys/file.h>
  43 #include <sys/stat.h>
  44 #include <sys/kmem.h>
  45 #include <sys/taskq.h>
  46 #include <sys/uio.h>
  47 #include <sys/vmsystm.h>
  48 #include <sys/atomic.h>
  49 #include <sys/vm.h>
  50 #include <vm/seg_vn.h>
  51 #include <vm/pvn.h>
  52 #include <vm/as.h>

  66 #include <sys/spa.h>
  67 #include <sys/txg.h>
  68 #include <sys/dbuf.h>
  69 #include <sys/zap.h>
  70 #include <sys/sa.h>
  71 #include <sys/dirent.h>
  72 #include <sys/policy.h>
  73 #include <sys/sunddi.h>
  74 #include <sys/filio.h>
  75 #include <sys/sid.h>
  76 #include "fs/fs_subr.h"
  77 #include <sys/zfs_ctldir.h>
  78 #include <sys/zfs_fuid.h>
  79 #include <sys/zfs_sa.h>
  80 #include <sys/dnlc.h>
  81 #include <sys/zfs_rlock.h>
  82 #include <sys/extdirent.h>
  83 #include <sys/kidmap.h>
  84 #include <sys/cred.h>
  85 #include <sys/attr.h>

  86 #include <sys/zil.h>
  87 
  88 /*
  89  * Programming rules.
  90  *
  91  * Each vnode op performs some logical unit of work.  To do this, the ZPL must
  92  * properly lock its in-core state, create a DMU transaction, do the work,
  93  * record this work in the intent log (ZIL), commit the DMU transaction,
  94  * and wait for the intent log to commit if it is a synchronous operation.
  95  * Moreover, the vnode ops must work in both normal and log replay context.
  96  * The ordering of events is important to avoid deadlocks and references
  97  * to freed memory.  The example below illustrates the following Big Rules:
  98  *
  99  *  (1) A check must be made in each zfs thread for a mounted file system.
 100  *      This is done avoiding races using ZFS_ENTER(zfsvfs).
 101  *      A ZFS_EXIT(zfsvfs) is needed before all returns.  Any znodes
 102  *      must be checked with ZFS_VERIFY_ZP(zp).  Both of these macros
 103  *      can return EIO from the calling function.
 104  *
 105  *  (2) VN_RELE() should always be the last thing except for zil_commit()

 118  *  (4) If ZPL locks are held, pass TXG_NOWAIT as the second argument to
 119  *      dmu_tx_assign().  This is critical because we don't want to block
 120  *      while holding locks.
 121  *
 122  *      If no ZPL locks are held (aside from ZFS_ENTER()), use TXG_WAIT.  This
 123  *      reduces lock contention and CPU usage when we must wait (note that if
 124  *      throughput is constrained by the storage, nearly every transaction
 125  *      must wait).
 126  *
 127  *      Note, in particular, that if a lock is sometimes acquired before
 128  *      the tx assigns, and sometimes after (e.g. z_lock), then failing
 129  *      to use a non-blocking assign can deadlock the system.  The scenario:
 130  *
 131  *      Thread A has grabbed a lock before calling dmu_tx_assign().
 132  *      Thread B is in an already-assigned tx, and blocks for this lock.
 133  *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
 134  *      forever, because the previous txg can't quiesce until B's tx commits.
 135  *
 136  *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
 137  *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
 138  *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
 139  *      to indicate that this operation has already called dmu_tx_wait().
 140  *      This will ensure that we don't retry forever, waiting a short bit
 141  *      each time.
 142  *
 143  *  (5) If the operation succeeded, generate the intent log entry for it
 144  *      before dropping locks.  This ensures that the ordering of events
 145  *      in the intent log matches the order in which they actually occurred.
 146  *      During ZIL replay the zfs_log_* functions will update the sequence
 147  *      number to indicate the zil transaction has replayed.
 148  *
 149  *  (6) At the end of each vnode op, the DMU tx must always commit,
 150  *      regardless of whether there were any errors.
 151  *
 152  *  (7) After dropping all locks, invoke zil_commit(zilog, foid)
 153  *      to ensure that synchronous semantics are provided when necessary.
 154  *
 155  * In general, this is how things should be ordered in each vnode op:
 156  *
 157  *      ZFS_ENTER(zfsvfs);              // exit if unmounted
 158  * top:
 159  *      zfs_dirent_lock(&dl, ...)   // lock directory entry (may VN_HOLD())
 160  *      rw_enter(...);                  // grab any other locks you need
 161  *      tx = dmu_tx_create(...);        // get DMU tx
 162  *      dmu_tx_hold_*();                // hold each object you might modify
 163  *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
 164  *      if (error) {
 165  *              rw_exit(...);           // drop locks
 166  *              zfs_dirent_unlock(dl);  // unlock directory entry
 167  *              VN_RELE(...);           // release held vnodes
 168  *              if (error == ERESTART) {
 169  *                      waited = B_TRUE;
 170  *                      dmu_tx_wait(tx);
 171  *                      dmu_tx_abort(tx);
 172  *                      goto top;
 173  *              }
 174  *              dmu_tx_abort(tx);       // abort DMU tx
 175  *              ZFS_EXIT(zfsvfs);       // finished in zfs
 176  *              return (error);         // really out of space
 177  *      }
 178  *      error = do_real_work();         // do whatever this VOP does
 179  *      if (error == 0)
 180  *              zfs_log_*(...);         // on success, make ZIL entry
 181  *      dmu_tx_commit(tx);              // commit DMU tx -- error or not
 182  *      rw_exit(...);                   // drop locks
 183  *      zfs_dirent_unlock(dl);          // unlock directory entry
 184  *      VN_RELE(...);                   // release held vnodes
 185  *      zil_commit(zilog, foid);        // synchronous when necessary
 186  *      ZFS_EXIT(zfsvfs);               // finished in zfs
 187  *      return (error);                 // done, report error
 188  */
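
The retry ordering in the template above can be exercised outside the kernel with a small user-space mock. This is only a sketch under stated assumptions: `mock_tx_t`, `mock_tx_assign()`, `mock_tx_wait()`, `mock_vnode_op()` and the `MOCK_*` constants are hypothetical stand-ins invented for illustration, not the real DMU interfaces, and the mock reuses one transaction object instead of aborting and recreating it as the template does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the real DMU interfaces. */
#define	MOCK_ERESTART		1
#define	MOCK_TXG_NOWAIT		0x1
#define	MOCK_TXG_NOTHROTTLE	0x2

typedef struct mock_tx {
	int busy_rounds;	/* assigns that will fail with MOCK_ERESTART */
	int waits;		/* number of mock_tx_wait() calls */
	int last_flags;		/* flags seen by the most recent assign */
} mock_tx_t;

static int
mock_tx_assign(mock_tx_t *tx, int flags)
{
	tx->last_flags = flags;
	if (tx->busy_rounds > 0) {
		tx->busy_rounds--;
		return (MOCK_ERESTART);
	}
	return (0);
}

static void
mock_tx_wait(mock_tx_t *tx)
{
	tx->waits++;
}

/* Same ordering as the template: lock, assign, unlock and wait, retry. */
static int
mock_vnode_op(mock_tx_t *tx)
{
	bool waited = false;
	int locks_held = 0;
	int error;

top:
	locks_held++;			/* take a (pretend) ZPL lock */
	error = mock_tx_assign(tx,
	    (waited ? MOCK_TXG_NOTHROTTLE : 0) | MOCK_TXG_NOWAIT);
	if (error != 0) {
		locks_held--;		/* drop locks before waiting */
		if (error == MOCK_ERESTART) {
			waited = true;
			assert(locks_held == 0);	/* never wait with locks */
			mock_tx_wait(tx);
			goto top;
		}
		return (error);
	}
	/* the real work, ZIL logging and tx commit would go here */
	locks_held--;
	return (0);
}
```

With `busy_rounds` set to 2, the mock waits twice and the final assign carries both the no-wait and no-throttle flags, which is the behavior rule (4) describes for retries.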
 189 
 190 /* ARGSUSED */
 191 static int
 192 zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
 193 {
 194         znode_t *zp = VTOZ(*vpp);
 195         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 196 
 197         ZFS_ENTER(zfsvfs);
 198         ZFS_VERIFY_ZP(zp);
 199 
 200         if ((flag & FWRITE) && (zp->z_pflags & ZFS_APPENDONLY) &&
 201             ((flag & FAPPEND) == 0)) {
 202                 ZFS_EXIT(zfsvfs);
 203                 return (SET_ERROR(EPERM));
 204         }
 205 
 206         if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 207             ZTOV(zp)->v_type == VREG &&
 208             !(zp->z_pflags & ZFS_AV_QUARANTINED) && zp->z_size > 0) {
 209                 if (fs_vscan(*vpp, cr, 0) != 0) {
 210                         ZFS_EXIT(zfsvfs);
 211                         return (SET_ERROR(EACCES));
 212                 }
 213         }
 214 
 215         /* Keep a count of the synchronous opens in the znode */
 216         if (flag & (FSYNC | FDSYNC))
 217                 atomic_inc_32(&zp->z_sync_cnt);
 218 
 219         ZFS_EXIT(zfsvfs);
 220         return (0);
 221 }
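
The append-only test at the top of zfs_open() is a pure function of the open flags and the znode's persistent flags, so it can be sketched in isolation. The helper name `appendonly_open_check()` and the `MOCK_*` bit values below are assumptions made for illustration; the real `FWRITE`, `FAPPEND` and `ZFS_APPENDONLY` values come from the system headers and are not reproduced here.

```c
#include <errno.h>

/* Hypothetical bit values; the real ones live in the system headers. */
#define	MOCK_FREAD	0x01
#define	MOCK_FWRITE	0x02
#define	MOCK_FAPPEND	0x08
#define	MOCK_APPENDONLY	0x01

/*
 * Mirrors the condition in zfs_open(): a write open of an append-only
 * file is refused unless the caller also asked for append mode.
 */
static int
appendonly_open_check(int flag, unsigned long pflags)
{
	if ((flag & MOCK_FWRITE) && (pflags & MOCK_APPENDONLY) &&
	    (flag & MOCK_FAPPEND) == 0)
		return (EPERM);
	return (0);
}
```

Read-only opens and opens of files without the append-only flag pass through unchanged; only the write-without-append combination yields `EPERM`.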
 222 
 223 /* ARGSUSED */
 224 static int
 225 zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
 226     caller_context_t *ct)
 227 {
 228         znode_t *zp = VTOZ(vp);
 229         zfsvfs_t *zfsvfs = zp->z_zfsvfs;

 230 
 231         /*
 232          * Clean up any locks held by this process on the vp.
 233          */
 234         cleanlocks(vp, ddi_get_pid(), 0);
 235         cleanshares(vp, ddi_get_pid());
 236 
 237         ZFS_ENTER(zfsvfs);
 238         ZFS_VERIFY_ZP(zp);
 239 
 240         /* Decrement the synchronous opens in the znode */
 241         if ((flag & (FSYNC | FDSYNC)) && (count == 1))
 242                 atomic_dec_32(&zp->z_sync_cnt);
 243 
 244         if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 245             ZTOV(zp)->v_type == VREG &&
 246             !(zp->z_pflags & ZFS_AV_QUARANTINED) && zp->z_size > 0)
 247                 VERIFY(fs_vscan(vp, cr, 1) == 0);
 248 
 249         ZFS_EXIT(zfsvfs);
 250         return (0);
 251 }
 252 
 253 /*
 254  * Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and
 255  * data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter.

 469 
 470                 if (pp = page_lookup(vp, start, SE_SHARED)) {
 471                         caddr_t va;
 472 
 473                         va = zfs_map_page(pp, S_READ);
 474                         error = uiomove(va + off, bytes, UIO_READ, uio);
 475                         zfs_unmap_page(pp, va);
 476                         page_unlock(pp);
 477                 } else {
 478                         error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl),
 479                             uio, bytes);
 480                 }
 481                 len -= bytes;
 482                 off = 0;
 483                 if (error)
 484                         break;
 485         }
 486         return (error);
 487 }
 488 
 489 offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
 490 
 491 /*
 492  * Read bytes from specified file into supplied buffer.
 493  *
 494  *      IN:     vp      - vnode of file to be read from.
 495  *              uio     - structure supplying read location, range info,
 496  *                        and return buffer.
 497  *              ioflag  - SYNC flags; used to provide FRSYNC semantics.
 498  *              cr      - credentials of caller.
 499  *              ct      - caller context
 500  *
 501  *      OUT:    uio     - updated offset and range, buffer filled.
 502  *
 503  *      RETURN: 0 on success, error code on failure.
 504  *
 505  * Side Effects:
 506  *      vp - atime updated if byte count > 0
 507  */
 508 /* ARGSUSED */

 535         /*
 536          * Fasttrack empty reads
 537          */
 538         if (uio->uio_resid == 0) {
 539                 ZFS_EXIT(zfsvfs);
 540                 return (0);
 541         }
 542 
 543         /*
 544          * Check for mandatory locks
 545          */
 546         if (MANDMODE(zp->z_mode)) {
 547                 if (error = chklock(vp, FREAD,
 548                     uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
 549                         ZFS_EXIT(zfsvfs);
 550                         return (error);
 551                 }
 552         }
 553 
 554         /*
 555          * If we're in FRSYNC mode, sync out this znode before reading it.
 556          */
 557         if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
 558                 zil_commit(zfsvfs->z_log, zp->z_id);
 559 
 560         /*
 561          * Lock the range against changes.
 562          */
 563         rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
 564 
 565         /*
 566          * If we are reading past end-of-file we can skip
 567          * to the end; but we might still need to set atime.
 568          */
 569         if (uio->uio_loffset >= zp->z_size) {
 570                 error = 0;
 571                 goto out;
 572         }
 573 
 574         ASSERT(uio->uio_loffset < zp->z_size);

 698             &zp->z_pflags, 8);
 699 
 700         /*
 701          * In a case vp->v_vfsp != zp->z_zfsvfs->z_vfs (e.g. snapshots) our
 702          * callers might not be able to detect properly that we are read-only,
 703          * so check it explicitly here.
 704          */
 705         if (zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) {
 706                 ZFS_EXIT(zfsvfs);
 707                 return (SET_ERROR(EROFS));
 708         }
 709 
 710         /*
 711          * If immutable or not appending then return EPERM.
 712          * Intentionally allow ZFS_READONLY through here.
 713          * See zfs_zaccess_common()
 714          */
 715         if ((zp->z_pflags & ZFS_IMMUTABLE) ||
 716             ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
 717             (uio->uio_loffset < zp->z_size))) {
 718                 ZFS_EXIT(zfsvfs);
 719                 return (SET_ERROR(EPERM));
 720         }

 721 
 722         zilog = zfsvfs->z_log;
 723 
 724         /*
 725          * Validate file offset
 726          */
 727         woff = ioflag & FAPPEND ? zp->z_size : uio->uio_loffset;
 728         if (woff < 0) {
 729                 ZFS_EXIT(zfsvfs);
 730                 return (SET_ERROR(EINVAL));
 731         }
 732 
 733         /*
 734          * Check for mandatory locks before calling zfs_range_lock()
 735          * in order to prevent a deadlock with locks set via fcntl().
 736          */
 737         if (MANDMODE((mode_t)zp->z_mode) &&
 738             (error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
 739                 ZFS_EXIT(zfsvfs);
 740                 return (error);
 741         }
 742 
 743         /*
 744          * Pre-fault the pages to ensure slow (eg NFS) pages
 745          * don't hold up txg.
 746          * Skip this if uio contains loaned arc_buf.
 747          */
 748         if ((uio->uio_extflg == UIO_XUIO) &&
 749             (((xuio_t *)uio)->xu_type == UIOTYPE_ZEROCOPY))
 750                 xuio = (xuio_t *)uio;
 751         else
 752                 uio_prefaultpages(MIN(n, max_blksz), uio);
 753 
 754         /*
 755          * If in append mode, set the io offset pointer to eof.
 756          */
 757         if (ioflag & FAPPEND) {
 758                 /*
 759                  * Obtain an appending range lock to guarantee file append
 760                  * semantics.  We reset the write offset once we have the lock.
 761                  */
 762                 rl = zfs_range_lock(zp, 0, n, RL_APPEND);
 763                 woff = rl->r_off;

 998 
 999         zfs_range_unlock(rl);
1000 
1001         /*
1002          * If we're in replay mode, or we made no progress, return error.
1003          * Otherwise, it's at least a partial write, so it's successful.
1004          */
1005         if (zfsvfs->z_replay || uio->uio_resid == start_resid) {
1006                 ZFS_EXIT(zfsvfs);
1007                 return (error);
1008         }
1009 
1010         if (ioflag & (FSYNC | FDSYNC) ||
1011             zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
1012                 zil_commit(zilog, zp->z_id);
1013 
1014         ZFS_EXIT(zfsvfs);
1015         return (0);
1016 }
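
The return rule at the end of the write path (report the error only in replay mode or when nothing was written, otherwise treat a partial write as success) can be sketched as a small pure function. `write_result()` and its parameters are hypothetical names chosen for this illustration; the sketch leaves out the zil_commit() call that the real code issues for synchronous writes before returning 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring only the return rule:
 *   error       - error from the write loop, 0 if none
 *   replay      - nonzero when running in log replay context
 *   start_resid - bytes requested when the write began
 *   resid       - bytes still unwritten when the loop ended
 */
static int
write_result(int error, int replay, size_t start_resid, size_t resid)
{
	assert(resid <= start_resid);	/* resid only ever shrinks */

	if (replay || resid == start_resid)
		return (error);		/* replay, or no progress at all */
	return (0);			/* at least a partial write */
}
```

A caller that moved 60 of 100 bytes before hitting an error therefore sees success with a short count, while the same error with no bytes moved is returned as-is.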
1017 

1018 void
1019 zfs_get_done(zgd_t *zgd, int error)
1020 {
1021         znode_t *zp = zgd->zgd_private;
1022         objset_t *os = zp->z_zfsvfs->z_os;
1023 
1024         if (zgd->zgd_db)
1025                 dmu_buf_rele(zgd->zgd_db, zgd);
1026 
1027         zfs_range_unlock(zgd->zgd_rl);
1028 
1029         /*
1030          * Release the vnode asynchronously as we currently have the
1031          * txg stopped from syncing.
1032          */
1033         VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
1034 
1035         if (error == 0 && zgd->zgd_bp)
1036                 zil_lwb_add_block(zgd->zgd_lwb, zgd->zgd_bp);
1037 
1038         kmem_free(zgd, sizeof (zgd_t));
1039 }
1040 
1041 #ifdef DEBUG
1042 static int zil_fault_io = 0;
1043 #endif
1044 
1045 /*
1046  * Get data to generate a TX_WRITE intent log record.
1047  */
1048 int
1049 zfs_get_data(void *arg, lr_write_t *lr, char *buf, struct lwb *lwb, zio_t *zio)
1050 {
1051         zfsvfs_t *zfsvfs = arg;
1052         objset_t *os = zfsvfs->z_os;
1053         znode_t *zp;
1054         uint64_t object = lr->lr_foid;
1055         uint64_t offset = lr->lr_offset;
1056         uint64_t size = lr->lr_length;
1057         dmu_buf_t *db;

1141 
1142                         error = dmu_sync(zio, lr->lr_common.lrc_txg,
1143                             zfs_get_done, zgd);
1144                         ASSERT(error || lr->lr_length <= size);
1145 
1146                         /*
1147                          * On success, we need to wait for the write I/O
1148                          * initiated by dmu_sync() to complete before we can
1149                          * release this dbuf.  We will finish everything up
1150                          * in the zfs_get_done() callback.
1151                          */
1152                         if (error == 0)
1153                                 return (0);
1154 
1155                         if (error == EALREADY) {
1156                                 lr->lr_common.lrc_txtype = TX_WRITE2;
1157                                 /*
1158                                  * TX_WRITE2 relies on the data previously
1159                                  * written by the TX_WRITE that caused
1160                                  * EALREADY.  We zero out the BP because
1161                                  * it is the old, currently-on-disk BP,
1162                                  * so there's no need to zio_flush() its
1163                                  * vdevs (flushing would needlessly hurt
1164                                  * performance, and doesn't work on
1165                                  * indirect vdevs).
1166                                  */
1167                                 zgd->zgd_bp = NULL;
1168                                 BP_ZERO(bp);
1169                                 error = 0;
1170                         }
1171                 }
1172         }
1173 
1174         zfs_get_done(zgd, error);
1175 
1176         return (error);
1177 }
1178 
1179 /*ARGSUSED*/
1180 static int
1181 zfs_access(vnode_t *vp, int mode, int flag, cred_t *cr,
1182     caller_context_t *ct)
1183 {
1184         znode_t *zp = VTOZ(vp);
1185         zfsvfs_t *zfsvfs = zp->z_zfsvfs;

1228  *              flags   - LOOKUP_XATTR set if looking for an attribute.
1229  *              rdir    - root directory vnode [UNUSED].
1230  *              cr      - credentials of caller.
1231  *              ct      - caller context
1232  *              direntflags - directory lookup flags
1233  *              realpnp - returned pathname.
1234  *
1235  *      OUT:    vpp     - vnode of located entry, NULL if not found.
1236  *
1237  *      RETURN: 0 on success, error code on failure.
1238  *
1239  * Timestamps:
1240  *      NA
1241  */
1242 /* ARGSUSED */
1243 static int
1244 zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
1245     int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
1246     int *direntflags, pathname_t *realpnp)
1247 {
1248         znode_t *zdp = VTOZ(dvp);
1249         zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
1250         int     error = 0;
1251 
1252         /*
1253          * Fast path lookup, however we must skip DNLC lookup
1254          * for case folding or normalizing lookups because the
1255          * DNLC code only stores the passed in name.  This means
1256          * creating 'a' and removing 'A' on a case insensitive
1257          * file system would work, but DNLC still thinks 'a'
1258          * exists and won't let you create it again on the next
1259          * pass through fast path.
1260          */
1261         if (!(flags & (LOOKUP_XATTR | FIGNORECASE))) {
1262 
1263                 if (dvp->v_type != VDIR) {
1264                         return (SET_ERROR(ENOTDIR));
1265                 } else if (zdp->z_sa_hdl == NULL) {
1266                         return (SET_ERROR(EIO));
1267                 }
1268

1346         }
1347 
1348         /*
1349          * Check accessibility of directory.
1350          */
1351 
1352         if (error = zfs_zaccess(zdp, ACE_EXECUTE, 0, B_FALSE, cr)) {
1353                 ZFS_EXIT(zfsvfs);
1354                 return (error);
1355         }
1356 
1357         if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
1358             NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
1359                 ZFS_EXIT(zfsvfs);
1360                 return (SET_ERROR(EILSEQ));
1361         }
1362 
1363         error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
1364         if (error == 0)
1365                 error = specvp_check(vpp, cr);
1366 
1367         ZFS_EXIT(zfsvfs);
1368         return (error);
1369 }
1370 
1371 /*
1372  * Attempt to create a new entry in a directory.  If the entry
1373  * already exists, truncate the file if permissible, else return
1374  * an error.  Return the vp of the created or trunc'd file.
1375  *
1376  *      IN:     dvp     - vnode of directory to put new file entry in.
1377  *              name    - name of new file entry.
1378  *              vap     - attributes of new file.
1379  *              excl    - flag indicating exclusive or non-exclusive mode.
1380  *              mode    - mode to open file with.
1381  *              cr      - credentials of caller.
1382  *              flag    - large file flag [UNUSED].
1383  *              ct      - caller context
1384  *              vsecp   - ACL to be set
1385  *
1386  *      OUT:    vpp     - vnode of created or trunc'd entry.
1387  *
1388  *      RETURN: 0 on success, error code on failure.
1389  *
1390  * Timestamps:
1391  *      dvp - ctime|mtime updated if new entry created
1392  *       vp - ctime|mtime always, atime if new
1393  */
1394 
1395 /* ARGSUSED */
1396 static int
1397 zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
1398     int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
1399     vsecattr_t *vsecp)
1400 {

1401         znode_t         *zp, *dzp = VTOZ(dvp);
1402         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1403         zilog_t         *zilog;
1404         objset_t        *os;
1405         zfs_dirlock_t   *dl;
1406         dmu_tx_t        *tx;
1407         int             error;
1408         ksid_t          *ksid;
1409         uid_t           uid;
1410         gid_t           gid = crgetgid(cr);
1411         zfs_acl_ids_t   acl_ids;
1412         boolean_t       fuid_dirtied;
1413         boolean_t       have_acl = B_FALSE;
1414         boolean_t       waited = B_FALSE;
1415 
1416         /*
1417          * If we have an ephemeral id, ACL, or XVATTR then
1418          * make sure file system is at proper version
1419          */
1420

1466                 int zflg = 0;
1467 
1468                 if (flag & FIGNORECASE)
1469                         zflg |= ZCILOOK;
1470 
1471                 error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1472                     NULL, NULL);
1473                 if (error) {
1474                         if (have_acl)
1475                                 zfs_acl_ids_free(&acl_ids);
1476                         if (strcmp(name, "..") == 0)
1477                                 error = SET_ERROR(EISDIR);
1478                         ZFS_EXIT(zfsvfs);
1479                         return (error);
1480                 }
1481         }
1482 
1483         if (zp == NULL) {
1484                 uint64_t txtype;
1485 
1486                 /*
1487                  * Create a new file object and update the directory
1488                  * to reference it.
1489                  */
1490                 if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
1491                         if (have_acl)
1492                                 zfs_acl_ids_free(&acl_ids);
1493                         goto out;
1494                 }
1495 
1496                 /*
1497                  * We only support the creation of regular files in
1498                  * extended attribute directories.
1499                  */
1500 
1501                 if ((dzp->z_pflags & ZFS_XATTR) &&
1502                     (vap->va_type != VREG)) {
1503                         if (have_acl)
1504                                 zfs_acl_ids_free(&acl_ids);
1505                         error = SET_ERROR(EINVAL);
1506                         goto out;
1507                 }
1508 
1509                 if (!have_acl && (error = zfs_acl_ids_create(dzp, 0, vap,
1510                     cr, vsecp, &acl_ids)) != 0)
1511                         goto out;
1512                 have_acl = B_TRUE;
1513 
1514                 if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
1515                         zfs_acl_ids_free(&acl_ids);
1516                         error = SET_ERROR(EDQUOT);
1517                         goto out;
1518                 }
1519 
1520                 tx = dmu_tx_create(os);
1521 
1522                 dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
1523                     ZFS_SA_BASE_ATTR_SIZE);
1524 
1525                 fuid_dirtied = zfsvfs->z_fuid_dirty;
1526                 if (fuid_dirtied)
1527                         zfs_fuid_txhold(zfsvfs, tx);
1528                 dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
1529                 dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
1530                 if (!zfsvfs->z_use_sa &&
1531                     acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1532                         dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
1533                             0, acl_ids.z_aclp->z_acl_bytes);
1534                 }
1535                 error = dmu_tx_assign(tx,
1536                     (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
1537                 if (error) {
1538                         zfs_dirent_unlock(dl);
1539                         if (error == ERESTART) {
1540                                 waited = B_TRUE;
1541                                 dmu_tx_wait(tx);
1542                                 dmu_tx_abort(tx);
1543                                 goto top;
1544                         }
1545                         zfs_acl_ids_free(&acl_ids);
1546                         dmu_tx_abort(tx);
1547                         ZFS_EXIT(zfsvfs);
1548                         return (error);
1549                 }
1550                 zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
1551 
1552                 if (fuid_dirtied)
1553                         zfs_fuid_sync(zfsvfs, tx);
1554 



1555                 (void) zfs_link_create(dl, zp, tx, ZNEW);
1556                 txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
1557                 if (flag & FIGNORECASE)
1558                         txtype |= TX_CI;
1559                 zfs_log_create(zilog, tx, txtype, dzp, zp, name,
1560                     vsecp, acl_ids.z_fuidp, vap);
1561                 zfs_acl_ids_free(&acl_ids);
1562                 dmu_tx_commit(tx);
1563         } else {
1564                 int aflags = (flag & FAPPEND) ? V_APPEND : 0;
1565 
1566                 if (have_acl)
1567                         zfs_acl_ids_free(&acl_ids);
1568                 have_acl = B_FALSE;
1569 
1570                 /*
1571                  * A directory entry already exists for this name.
1572                  */
1573                 /*
1574                  * Can't truncate an existing file if in exclusive mode.
1575                  */
1576                 if (excl == EXCL) {
1577                         error = SET_ERROR(EEXIST);
1578                         goto out;
1579                 }
1580                 /*
1581                  * Can't open a directory for writing.
1582                  */
1583                 if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
1584                         error = SET_ERROR(EISDIR);
1585                         goto out;
1586                 }
1587                 /*
1588                  * Verify requested access to file.
1589                  */
1590                 if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {


1591                         goto out;
1592                 }
1593 



1594                 mutex_enter(&dzp->z_lock);
1595                 dzp->z_seq++;
1596                 mutex_exit(&dzp->z_lock);
1597 
1598                 /*
1599                  * Truncate regular files if requested.
1600                  */
1601                 if ((ZTOV(zp)->v_type == VREG) &&
1602                     (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1603                         /* we can't hold any locks when calling zfs_freesp() */
1604                         zfs_dirent_unlock(dl);
1605                         dl = NULL;
1606                         error = zfs_freesp(zp, 0, 0, mode, TRUE);
1607                         if (error == 0) {
1608                                 vnevent_create(ZTOV(zp), ct);
1609                         }
1610                 }
1611         }
1612 out:
1613

1680                 pn_alloc(&realnm);
1681                 realnmp = &realnm;
1682         }
1683 
1684 top:
1685         xattr_obj = 0;
1686         xzp = NULL;
1687         /*
1688          * Attempt to lock directory; fail if entry doesn't exist.
1689          */
1690         if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1691             NULL, realnmp)) {
1692                 if (realnmp)
1693                         pn_free(realnmp);
1694                 ZFS_EXIT(zfsvfs);
1695                 return (error);
1696         }
1697 
1698         vp = ZTOV(zp);
1699 
1700         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
1701                 goto out;
1702         }
1703 
1704         /*
1705          * Need to use rmdir for removing directories.
1706          */
1707         if (vp->v_type == VDIR) {
1708                 error = SET_ERROR(EPERM);
1709                 goto out;
1710         }
1711 
1712         vnevent_remove(vp, dvp, name, ct);
1713 
1714         if (realnmp)
1715                 dnlc_remove(dvp, realnmp->pn_buf);
1716         else
1717                 dnlc_remove(dvp, name);
1718 
1719         mutex_enter(&vp->v_lock);

1746         if (error == 0 && xattr_obj) {
1747                 error = zfs_zget(zfsvfs, xattr_obj, &xzp);
1748                 ASSERT0(error);
1749                 dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_TRUE);
1750                 dmu_tx_hold_sa(tx, xzp->z_sa_hdl, B_FALSE);
1751         }
1752 
1753         mutex_enter(&zp->z_lock);
1754         if ((acl_obj = zfs_external_acl(zp)) != 0 && may_delete_now)
1755                 dmu_tx_hold_free(tx, acl_obj, 0, DMU_OBJECT_END);
1756         mutex_exit(&zp->z_lock);
1757 
1758         /* charge as an update -- would be nice not to charge at all */
1759         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
1760 
1761         /*
1762          * Mark this transaction as typically resulting in a net free of space
1763          */
1764         dmu_tx_mark_netfree(tx);
1765 
1766         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
1767         if (error) {
1768                 zfs_dirent_unlock(dl);
1769                 VN_RELE(vp);
1770                 if (xzp)
1771                         VN_RELE(ZTOV(xzp));
1772                 if (error == ERESTART) {
1773                         waited = B_TRUE;
1774                         dmu_tx_wait(tx);
1775                         dmu_tx_abort(tx);
1776                         goto top;
1777                 }
1778                 if (realnmp)
1779                         pn_free(realnmp);
1780                 dmu_tx_abort(tx);
1781                 ZFS_EXIT(zfsvfs);
1782                 return (error);
1783         }
1784 
1785         /*
1786          * Remove the directory entry.

1873  *              dirname - name of new directory.
1874  *              vap     - attributes of new directory.
1875  *              cr      - credentials of caller.
1876  *              ct      - caller context
1877  *              flags   - case flags
1878  *              vsecp   - ACL to be set
1879  *
1880  *      OUT:    vpp     - vnode of created directory.
1881  *
1882  *      RETURN: 0 on success, error code on failure.
1883  *
1884  * Timestamps:
1885  *      dvp - ctime|mtime updated
1886  *       vp - ctime|mtime|atime updated
1887  */
1888 /*ARGSUSED*/
1889 static int
1890 zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
1891     caller_context_t *ct, int flags, vsecattr_t *vsecp)
1892 {

1893         znode_t         *zp, *dzp = VTOZ(dvp);
1894         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1895         zilog_t         *zilog;
1896         zfs_dirlock_t   *dl;
1897         uint64_t        txtype;
1898         dmu_tx_t        *tx;
1899         int             error;
1900         int             zf = ZNEW;
1901         ksid_t          *ksid;
1902         uid_t           uid;
1903         gid_t           gid = crgetgid(cr);
1904         zfs_acl_ids_t   acl_ids;
1905         boolean_t       fuid_dirtied;
1906         boolean_t       waited = B_FALSE;
1907 
1908         ASSERT(vap->va_type == VDIR);
1909 
1910         /*
1911          * If we have an ephemeral id, ACL, or XVATTR then
1912          * make sure file system is at proper version

1952                 ZFS_EXIT(zfsvfs);
1953                 return (error);
1954         }
1955         /*
1956          * First make sure the new directory doesn't exist.
1957          *
1958          * Existence is checked first to make sure we don't return
1959          * EACCES instead of EEXIST which can cause some applications
1960          * to fail.
1961          */
1962 top:
1963         *vpp = NULL;
1964 
1965         if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
1966             NULL, NULL)) {
1967                 zfs_acl_ids_free(&acl_ids);
1968                 ZFS_EXIT(zfsvfs);
1969                 return (error);
1970         }
1971 
1972         if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {


1973                 zfs_acl_ids_free(&acl_ids);
1974                 zfs_dirent_unlock(dl);
1975                 ZFS_EXIT(zfsvfs);
1976                 return (error);
1977         }
1978 



1979         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
1980                 zfs_acl_ids_free(&acl_ids);
1981                 zfs_dirent_unlock(dl);
1982                 ZFS_EXIT(zfsvfs);
1983                 return (SET_ERROR(EDQUOT));
1984         }
1985 
1986         /*
1987          * Add a new entry to the directory.
1988          */
1989         tx = dmu_tx_create(zfsvfs->z_os);
1990         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, dirname);
1991         dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
1992         fuid_dirtied = zfsvfs->z_fuid_dirty;
1993         if (fuid_dirtied)
1994                 zfs_fuid_txhold(zfsvfs, tx);
1995         if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1996                 dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
1997                     acl_ids.z_aclp->z_acl_bytes);
1998         }
1999 
2000         dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
2001             ZFS_SA_BASE_ATTR_SIZE);
2002 
2003         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
2004         if (error) {
2005                 zfs_dirent_unlock(dl);
2006                 if (error == ERESTART) {
2007                         waited = B_TRUE;
2008                         dmu_tx_wait(tx);
2009                         dmu_tx_abort(tx);
2010                         goto top;
2011                 }
2012                 zfs_acl_ids_free(&acl_ids);
2013                 dmu_tx_abort(tx);
2014                 ZFS_EXIT(zfsvfs);
2015                 return (error);
2016         }
2017 
2018         /*
2019          * Create new node.
2020          */
2021         zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
2022 
2023         if (fuid_dirtied)

2085         ZFS_ENTER(zfsvfs);
2086         ZFS_VERIFY_ZP(dzp);
2087         zilog = zfsvfs->z_log;
2088 
2089         if (flags & FIGNORECASE)
2090                 zflg |= ZCILOOK;
2091 top:
2092         zp = NULL;
2093 
2094         /*
2095          * Attempt to lock directory; fail if entry doesn't exist.
2096          */
2097         if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
2098             NULL, NULL)) {
2099                 ZFS_EXIT(zfsvfs);
2100                 return (error);
2101         }
2102 
2103         vp = ZTOV(zp);
2104 
2105         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
2106                 goto out;
2107         }
2108 
2109         if (vp->v_type != VDIR) {
2110                 error = SET_ERROR(ENOTDIR);
2111                 goto out;
2112         }
2113 
2114         if (vp == cwd) {
2115                 error = SET_ERROR(EINVAL);
2116                 goto out;
2117         }
2118 
2119         vnevent_rmdir(vp, dvp, name, ct);
2120 
2121         /*
2122          * Grab a lock on the directory to make sure that no one is
2123          * trying to add (or lookup) entries while we are removing it.
2124          */
2125         rw_enter(&zp->z_name_lock, RW_WRITER);
2126 
2127         /*
2128          * Grab a lock on the parent pointer to make sure we play well
2129          * with the treewalk and directory rename code.
2130          */
2131         rw_enter(&zp->z_parent_lock, RW_WRITER);
2132 
2133         tx = dmu_tx_create(zfsvfs->z_os);
2134         dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
2135         dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
2136         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
2137         zfs_sa_upgrade_txholds(tx, zp);
2138         zfs_sa_upgrade_txholds(tx, dzp);
2139         dmu_tx_mark_netfree(tx);
2140         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
2141         if (error) {
2142                 rw_exit(&zp->z_parent_lock);
2143                 rw_exit(&zp->z_name_lock);
2144                 zfs_dirent_unlock(dl);
2145                 VN_RELE(vp);
2146                 if (error == ERESTART) {
2147                         waited = B_TRUE;
2148                         dmu_tx_wait(tx);
2149                         dmu_tx_abort(tx);
2150                         goto top;
2151                 }
2152                 dmu_tx_abort(tx);
2153                 ZFS_EXIT(zfsvfs);
2154                 return (error);
2155         }
2156 
2157         error = zfs_link_destroy(dl, zp, tx, zflg, NULL);
2158 
2159         if (error == 0) {
2160                 uint64_t txtype = TX_RMDIR;

2777 
2778         if (mask & AT_SIZE && vp->v_type == VDIR) {
2779                 ZFS_EXIT(zfsvfs);
2780                 return (SET_ERROR(EISDIR));
2781         }
2782 
2783         if (mask & AT_SIZE && vp->v_type != VREG && vp->v_type != VFIFO) {
2784                 ZFS_EXIT(zfsvfs);
2785                 return (SET_ERROR(EINVAL));
2786         }
2787 
2788         /*
2789          * If this is an xvattr_t, then get a pointer to the structure of
2790          * optional attributes.  If this is NULL, then we have a vattr_t.
2791          */
2792         xoap = xva_getxoptattr(xvap);
2793 
2794         xva_init(&tmpxvattr);
2795 
2796         /*
2797          * Immutable files can only alter immutable bit and atime
2798          */
2799         if ((zp->z_pflags & ZFS_IMMUTABLE) &&
2800             ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
2801             ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {

2802                 ZFS_EXIT(zfsvfs);
2803                 return (SET_ERROR(EPERM));
2804         }

2805 
2806         /*
2807          * Note: ZFS_READONLY is handled in zfs_zaccess_common.
2808          */
2809 
2810         /*
2811          * Verify timestamps don't overflow 32 bits.
2812          * ZFS can handle large timestamps, but 32bit syscalls can't
2813          * handle times greater than 2039.  This check should be removed
2814          * once large timestamps are fully supported.
2815          */
2816         if (mask & (AT_ATIME | AT_MTIME)) {
2817                 if (((mask & AT_ATIME) && TIMESPEC_OVERFLOW(&vap->va_atime)) ||
2818                     ((mask & AT_MTIME) && TIMESPEC_OVERFLOW(&vap->va_mtime))) {
2819                         ZFS_EXIT(zfsvfs);
2820                         return (SET_ERROR(EOVERFLOW));
2821                 }
2822         }
2823 
2824 top:

3693         if (tdvp != sdvp) {
3694                 vnevent_pre_rename_dest_dir(tdvp, ZTOV(szp), tnm, ct);
3695         }
3696 
3697         tx = dmu_tx_create(zfsvfs->z_os);
3698         dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
3699         dmu_tx_hold_sa(tx, sdzp->z_sa_hdl, B_FALSE);
3700         dmu_tx_hold_zap(tx, sdzp->z_id, FALSE, snm);
3701         dmu_tx_hold_zap(tx, tdzp->z_id, TRUE, tnm);
3702         if (sdzp != tdzp) {
3703                 dmu_tx_hold_sa(tx, tdzp->z_sa_hdl, B_FALSE);
3704                 zfs_sa_upgrade_txholds(tx, tdzp);
3705         }
3706         if (tzp) {
3707                 dmu_tx_hold_sa(tx, tzp->z_sa_hdl, B_FALSE);
3708                 zfs_sa_upgrade_txholds(tx, tzp);
3709         }
3710 
3711         zfs_sa_upgrade_txholds(tx, szp);
3712         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
3713         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
3714         if (error) {
3715                 if (zl != NULL)
3716                         zfs_rename_unlock(&zl);
3717                 zfs_dirent_unlock(sdl);
3718                 zfs_dirent_unlock(tdl);
3719 
3720                 if (sdzp == tdzp)
3721                         rw_exit(&sdzp->z_name_lock);
3722 
3723                 VN_RELE(ZTOV(szp));
3724                 if (tzp)
3725                         VN_RELE(ZTOV(tzp));
3726                 if (error == ERESTART) {
3727                         waited = B_TRUE;
3728                         dmu_tx_wait(tx);
3729                         dmu_tx_abort(tx);
3730                         goto top;
3731                 }
3732                 dmu_tx_abort(tx);
3733                 ZFS_EXIT(zfsvfs);

3817  *              vap     - Attributes of new entry.
3818  *              cr      - credentials of caller.
3819  *              ct      - caller context
3820  *              flags   - case flags
3821  *
3822  *      RETURN: 0 on success, error code on failure.
3823  *
3824  * Timestamps:
3825  *      dvp - ctime|mtime updated
3826  */
3827 /*ARGSUSED*/
3828 static int
3829 zfs_symlink(vnode_t *dvp, char *name, vattr_t *vap, char *link, cred_t *cr,
3830     caller_context_t *ct, int flags)
3831 {
3832         znode_t         *zp, *dzp = VTOZ(dvp);
3833         zfs_dirlock_t   *dl;
3834         dmu_tx_t        *tx;
3835         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
3836         zilog_t         *zilog;

3837         uint64_t        len = strlen(link);
3838         int             error;
3839         int             zflg = ZNEW;
3840         zfs_acl_ids_t   acl_ids;
3841         boolean_t       fuid_dirtied;
3842         uint64_t        txtype = TX_SYMLINK;
3843         boolean_t       waited = B_FALSE;
3844 
3845         ASSERT(vap->va_type == VLNK);
3846 
3847         ZFS_ENTER(zfsvfs);
3848         ZFS_VERIFY_ZP(dzp);
3849         zilog = zfsvfs->z_log;
3850 
3851         if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
3852             NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
3853                 ZFS_EXIT(zfsvfs);
3854                 return (SET_ERROR(EILSEQ));
3855         }
3856         if (flags & FIGNORECASE)

3860                 ZFS_EXIT(zfsvfs);
3861                 return (SET_ERROR(ENAMETOOLONG));
3862         }
3863 
3864         if ((error = zfs_acl_ids_create(dzp, 0,
3865             vap, cr, NULL, &acl_ids)) != 0) {
3866                 ZFS_EXIT(zfsvfs);
3867                 return (error);
3868         }
3869 top:
3870         /*
3871          * Attempt to lock directory; fail if entry already exists.
3872          */
3873         error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
3874         if (error) {
3875                 zfs_acl_ids_free(&acl_ids);
3876                 ZFS_EXIT(zfsvfs);
3877                 return (error);
3878         }
3879 




3880         if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {


3881                 zfs_acl_ids_free(&acl_ids);
3882                 zfs_dirent_unlock(dl);
3883                 ZFS_EXIT(zfsvfs);
3884                 return (error);
3885         }


3886 
3887         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
3888                 zfs_acl_ids_free(&acl_ids);
3889                 zfs_dirent_unlock(dl);
3890                 ZFS_EXIT(zfsvfs);
3891                 return (SET_ERROR(EDQUOT));
3892         }
3893         tx = dmu_tx_create(zfsvfs->z_os);
3894         fuid_dirtied = zfsvfs->z_fuid_dirty;
3895         dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
3896         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
3897         dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
3898             ZFS_SA_BASE_ATTR_SIZE + len);
3899         dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
3900         if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
3901                 dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
3902                     acl_ids.z_aclp->z_acl_bytes);
3903         }
3904         if (fuid_dirtied)
3905                 zfs_fuid_txhold(zfsvfs, tx);
3906         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
3907         if (error) {
3908                 zfs_dirent_unlock(dl);
3909                 if (error == ERESTART) {
3910                         waited = B_TRUE;
3911                         dmu_tx_wait(tx);
3912                         dmu_tx_abort(tx);
3913                         goto top;
3914                 }
3915                 zfs_acl_ids_free(&acl_ids);
3916                 dmu_tx_abort(tx);
3917                 ZFS_EXIT(zfsvfs);
3918                 return (error);
3919         }
3920 
3921         /*
3922          * Create a new object for the symlink.
3923          * for version 4 ZPL datasets the symlink will be an SA attribute
3924          */
3925         zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
3926

4107         if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
4108                 ZFS_EXIT(zfsvfs);
4109                 return (error);
4110         }
4111 
4112 top:
4113         /*
4114          * Attempt to lock directory; fail if entry already exists.
4115          */
4116         error = zfs_dirent_lock(&dl, dzp, name, &tzp, zf, NULL, NULL);
4117         if (error) {
4118                 ZFS_EXIT(zfsvfs);
4119                 return (error);
4120         }
4121 
4122         tx = dmu_tx_create(zfsvfs->z_os);
4123         dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
4124         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
4125         zfs_sa_upgrade_txholds(tx, szp);
4126         zfs_sa_upgrade_txholds(tx, dzp);
4127         error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
4128         if (error) {
4129                 zfs_dirent_unlock(dl);
4130                 if (error == ERESTART) {
4131                         waited = B_TRUE;
4132                         dmu_tx_wait(tx);
4133                         dmu_tx_abort(tx);
4134                         goto top;
4135                 }
4136                 dmu_tx_abort(tx);
4137                 ZFS_EXIT(zfsvfs);
4138                 return (error);
4139         }
4140 
4141         error = zfs_link_create(dl, szp, tx, 0);
4142 
4143         if (error == 0) {
4144                 uint64_t txtype = TX_LINK;
4145                 if (flags & FIGNORECASE)
4146                         txtype |= TX_CI;
4147                 zfs_log_link(zilog, tx, txtype, dzp, szp, name);

4383                         int err;
4384 
4385                         /*
4386                          * Found a dirty page to push
4387                          */
4388                         err = zfs_putapage(vp, pp, &io_off, &io_len, flags, cr);
4389                         if (err)
4390                                 error = err;
4391                 } else {
4392                         io_len = PAGESIZE;
4393                 }
4394         }
4395 out:
4396         zfs_range_unlock(rl);
4397         if ((flags & B_ASYNC) == 0 || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
4398                 zil_commit(zfsvfs->z_log, zp->z_id);
4399         ZFS_EXIT(zfsvfs);
4400         return (error);
4401 }
4402 
4403 /*ARGSUSED*/
4404 void
4405 zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)



4406 {
4407         znode_t *zp = VTOZ(vp);
4408         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4409         int error;
4410 
4411         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);

4412         if (zp->z_sa_hdl == NULL) {
4413                 /*
4414                  * The fs has been unmounted, or we did a
4415                  * suspend/resume and this file no longer exists.
4416                  */
4417                 if (vn_has_cached_data(vp)) {
4418                         (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
4419                             B_INVAL, cr);
4420                 }
4421 
4422                 mutex_enter(&zp->z_lock);
4423                 mutex_enter(&vp->v_lock);
4424                 ASSERT(vp->v_count == 1);
4425                 VN_RELE_LOCKED(vp);
4426                 mutex_exit(&vp->v_lock);
4427                 mutex_exit(&zp->z_lock);


4428                 rw_exit(&zfsvfs->z_teardown_inactive_lock);
4429                 zfs_znode_free(zp);
4430                 return;
4431         }
4432 
4433         /*
4434          * Attempt to push any data in the page cache.  If this fails
4435          * we will get kicked out later in zfs_zinactive().
4436          */
4437         if (vn_has_cached_data(vp)) {
4438                 (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
4439                     cr);
4440         }
4441 
4442         if (zp->z_atime_dirty && zp->z_unlinked == 0) {
4443                 dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
4444 
4445                 dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
4446                 zfs_sa_upgrade_txholds(tx, zp);
4447                 error = dmu_tx_assign(tx, TXG_WAIT);
4448                 if (error) {
4449                         dmu_tx_abort(tx);
4450                 } else {
4451                         mutex_enter(&zp->z_lock);
4452                         (void) sa_update(zp->z_sa_hdl, SA_ZPL_ATIME(zfsvfs),
4453                             (void *)&zp->z_atime, sizeof (zp->z_atime), tx);
4454                         zp->z_atime_dirty = 0;
4455                         mutex_exit(&zp->z_lock);
4456                         dmu_tx_commit(tx);
4457                 }
4458         }
4459 
4460         zfs_zinactive(zp);



4461         rw_exit(&zfsvfs->z_teardown_inactive_lock);
4462 }
4463 
4464 /*
4465  * Bounds-check the seek operation.
4466  *
4467  *      IN:     vp      - vnode seeking within
4468  *              ooff    - old file offset
4469  *              noffp   - pointer to new file offset
4470  *              ct      - caller context
4471  *
4472  *      RETURN: 0 on success, EINVAL if new offset invalid.
4473  */
4474 /* ARGSUSED */
4475 static int
4476 zfs_seek(vnode_t *vp, offset_t ooff, offset_t *noffp,
4477     caller_context_t *ct)
4478 {
4479         if (vp->v_type == VDIR)
4480                 return (0);
4481         return ((*noffp < 0 || *noffp > MAXOFFSET_T) ? EINVAL : 0);
4482 }
4483

   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 
  22 /*
  23  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  24  * Portions Copyright 2007 Jeremy Teo
  25  * Portions Copyright 2010 Robert Milkowski
  26  * Copyright (c) 2012, 2017 by Delphix. All rights reserved.
  27  * Copyright (c) 2014 Integros [integros.com]
  28  * Copyright 2015 Joyent, Inc.
  29  * Copyright 2017 Nexenta Systems, Inc.
  30  */
  31 



  32 #include <sys/types.h>
  33 #include <sys/param.h>
  34 #include <sys/time.h>
  35 #include <sys/systm.h>
  36 #include <sys/sysmacros.h>
  37 #include <sys/resource.h>
  38 #include <sys/vfs.h>
  39 #include <sys/vfs_opreg.h>
  40 #include <sys/vnode.h>
  41 #include <sys/file.h>
  42 #include <sys/stat.h>
  43 #include <sys/kmem.h>
  44 #include <sys/taskq.h>
  45 #include <sys/uio.h>
  46 #include <sys/vmsystm.h>
  47 #include <sys/atomic.h>
  48 #include <sys/vm.h>
  49 #include <vm/seg_vn.h>
  50 #include <vm/pvn.h>
  51 #include <vm/as.h>

  65 #include <sys/spa.h>
  66 #include <sys/txg.h>
  67 #include <sys/dbuf.h>
  68 #include <sys/zap.h>
  69 #include <sys/sa.h>
  70 #include <sys/dirent.h>
  71 #include <sys/policy.h>
  72 #include <sys/sunddi.h>
  73 #include <sys/filio.h>
  74 #include <sys/sid.h>
  75 #include "fs/fs_subr.h"
  76 #include <sys/zfs_ctldir.h>
  77 #include <sys/zfs_fuid.h>
  78 #include <sys/zfs_sa.h>
  79 #include <sys/dnlc.h>
  80 #include <sys/zfs_rlock.h>
  81 #include <sys/extdirent.h>
  82 #include <sys/kidmap.h>
  83 #include <sys/cred.h>
  84 #include <sys/attr.h>
  85 #include <sys/dsl_prop.h>
  86 #include <sys/zil.h>
  87 
  88 /*
  89  * Programming rules.
  90  *
  91  * Each vnode op performs some logical unit of work.  To do this, the ZPL must
  92  * properly lock its in-core state, create a DMU transaction, do the work,
  93  * record this work in the intent log (ZIL), commit the DMU transaction,
  94  * and wait for the intent log to commit if it is a synchronous operation.
  95  * Moreover, the vnode ops must work in both normal and log replay context.
  96  * The ordering of events is important to avoid deadlocks and references
  97  * to freed memory.  The example below illustrates the following Big Rules:
  98  *
  99  *  (1) A check must be made in each zfs thread for a mounted file system.
 100  *      This is done avoiding races using ZFS_ENTER(zfsvfs).
 101  *      A ZFS_EXIT(zfsvfs) is needed before all returns.  Any znodes
 102  *      must be checked with ZFS_VERIFY_ZP(zp).  Both of these macros
 103  *      can return EIO from the calling function.
 104  *
 105  *  (2) VN_RELE() should always be the last thing except for zil_commit()

 118  *  (4) If ZPL locks are held, pass TXG_NOWAIT as the second argument to
 119  *      dmu_tx_assign().  This is critical because we don't want to block
 120  *      while holding locks.
 121  *
 122  *      If no ZPL locks are held (aside from ZFS_ENTER()), use TXG_WAIT.  This
 123  *      reduces lock contention and CPU usage when we must wait (note that if
 124  *      throughput is constrained by the storage, nearly every transaction
 125  *      must wait).
 126  *
 127  *      Note, in particular, that if a lock is sometimes acquired before
 128  *      the tx assigns, and sometimes after (e.g. z_lock), then failing
 129  *      to use a non-blocking assign can deadlock the system.  The scenario:
 130  *
 131  *      Thread A has grabbed a lock before calling dmu_tx_assign().
 132  *      Thread B is in an already-assigned tx, and blocks for this lock.
 133  *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
 134  *      forever, because the previous txg can't quiesce until B's tx commits.
 135  *
 136  *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
 137  *      then drop all locks, call dmu_tx_wait(), and try again.  On subsequent
 138  *      calls to dmu_tx_assign(), pass TXG_NOTHROTTLE in addition to TXG_NOWAIT,
 139  *      to indicate that this operation has already called dmu_tx_wait().
 140  *      This will ensure that we don't retry forever, waiting a short bit
 141  *      each time.
 142  *
 143  *  (5) If the operation succeeded, generate the intent log entry for it
 144  *      before dropping locks.  This ensures that the ordering of events
 145  *      in the intent log matches the order in which they actually occurred.
 146  *      During ZIL replay the zfs_log_* functions will update the sequence
 147  *      number to indicate the zil transaction has replayed.
 148  *
 149  *  (6) At the end of each vnode op, the DMU tx must always commit,
 150  *      regardless of whether there were any errors.
 151  *
 152  *  (7) After dropping all locks, invoke zil_commit(zilog, foid)
 153  *      to ensure that synchronous semantics are provided when necessary.
 154  *
 155  * In general, this is how things should be ordered in each vnode op:
 156  *
 157  *      ZFS_ENTER(zfsvfs);              // exit if unmounted
 158  * top:
 159  *      zfs_dirent_lock(&dl, ...)   // lock directory entry (may VN_HOLD())
 160  *      rw_enter(...);                  // grab any other locks you need
 161  *      tx = dmu_tx_create(...);        // get DMU tx
 162  *      dmu_tx_hold_*();                // hold each object you might modify
 163  *      error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
 164  *      if (error) {
 165  *              rw_exit(...);           // drop locks
 166  *              zfs_dirent_unlock(dl);  // unlock directory entry
 167  *              VN_RELE(...);           // release held vnodes
 168  *              if (error == ERESTART) {
 169  *                      waited = B_TRUE;
 170  *                      dmu_tx_wait(tx);
 171  *                      dmu_tx_abort(tx);
 172  *                      goto top;
 173  *              }
 174  *              dmu_tx_abort(tx);       // abort DMU tx
 175  *              ZFS_EXIT(zfsvfs);       // finished in zfs
 176  *              return (error);         // really out of space
 177  *      }
 178  *      error = do_real_work();         // do whatever this VOP does
 179  *      if (error == 0)
 180  *              zfs_log_*(...);         // on success, make ZIL entry
 181  *      dmu_tx_commit(tx);              // commit DMU tx -- error or not
 182  *      rw_exit(...);                   // drop locks
 183  *      zfs_dirent_unlock(dl);          // unlock directory entry
 184  *      VN_RELE(...);                   // release held vnodes
 185  *      zil_commit(zilog, foid);        // synchronous when necessary
 186  *      ZFS_EXIT(zfsvfs);               // finished in zfs
 187  *      return (error);                 // done, report error
 188  */
 189 
 190 /* set this tunable to zero to disable asynchronous freeing of files */
 191 boolean_t zfs_do_async_free = B_TRUE;
 192 
 193 /*
 194  * This value will be multiplied by zfs_dirty_data_max to determine
 195  * the threshold past which we will call zfs_inactive_impl() async.
 196  *
 197  * Selecting the multiplier is a balance between how long we're willing to wait
 198  * for delete/free to complete (get shell back, have an NFS thread captive, etc.)
 199  * and reducing the number of active requests in the backing taskq.
 200  *
 201  * 4 GiB (zfs_dirty_data_max default) * 16 (multiplier default) = 64 GiB
 202  * meaning by default we will call zfs_inactive_impl async for vnodes > 64 GiB
 203  *
 204  * WARNING: Setting this tunable to zero will enable asynchronous freeing for
 205  * all files, which can have undesirable side effects.
 206  */
 207 uint16_t zfs_inactive_async_multiplier = 16;
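
/*
 * Illustration only, not part of this change: a minimal userland sketch of
 * the threshold arithmetic described in the comment above.  The helper names
 * async_free_threshold() and should_free_async() are hypothetical, and the
 * strict ">" comparison is an assumption taken from the "vnodes > 64 GiB"
 * wording; the real check may differ, and zfs_do_async_free is not modeled.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: threshold = dirty_data_max * multiplier, in bytes. */
static uint64_t
async_free_threshold(uint64_t dirty_data_max, uint16_t multiplier)
{
	/* Guard against wraparound in the 64-bit product. */
	assert(multiplier == 0 || dirty_data_max <= UINT64_MAX / multiplier);
	return (dirty_data_max * multiplier);
}

/* Hypothetical predicate: nonzero when 'size' exceeds the threshold. */
static int
should_free_async(uint64_t size, uint64_t dirty_data_max, uint16_t multiplier)
{
	return (size > async_free_threshold(dirty_data_max, multiplier));
}
```

/*
 * With the defaults quoted above (4 GiB and 16) the threshold is 64 GiB.
 * A multiplier of zero gives a threshold of zero, so every non-empty file
 * qualifies, which is the case the WARNING describes.
 */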
 208 
 209 int nms_worm_transition_time = 30;
 210 int
 211 zfs_worm_in_trans(znode_t *zp)
 212 {
 213         zfsvfs_t                *zfsvfs = zp->z_zfsvfs;
 214         timestruc_t             now;
 215         sa_bulk_attr_t          bulk[2];
 216         uint64_t                ctime[2];
 217         int                     count = 0;
 218 
 219         if (!nms_worm_transition_time)
 220                 return (0);
 221 
 222         gethrestime(&now);
 223         SA_ADD_BULK_ATTR(bulk, count, SA_ZPL_CTIME(zfsvfs), NULL,
 224             &ctime, sizeof (ctime));
 225         if (sa_bulk_lookup(zp->z_sa_hdl, bulk, count) != 0)
 226                 return (0);
 227 
 228         return ((uint64_t)now.tv_sec - ctime[0] < nms_worm_transition_time);
 229 }
 230 
 231 /* ARGSUSED */
 232 static int
 233 zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
 234 {
 235         znode_t *zp = VTOZ(*vpp);
 236         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 237 
 238         ZFS_ENTER(zfsvfs);
 239         ZFS_VERIFY_ZP(zp);
 240 
 241         if ((flag & FWRITE) && (zp->z_pflags & ZFS_APPENDONLY) &&
 242             ((flag & FAPPEND) == 0)) {
 243                 ZFS_EXIT(zfsvfs);
 244                 return (SET_ERROR(EPERM));
 245         }
 246 
 247         if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 248             ZTOV(zp)->v_type == VREG &&
 249             !(zp->z_pflags & ZFS_AV_QUARANTINED) && zp->z_size > 0) {
 250                 if (fs_vscan(*vpp, cr, 0) != 0) {
 251                         ZFS_EXIT(zfsvfs);
 252                         return (SET_ERROR(EACCES));
 253                 }
 254         }
 255 
 256         /* Keep a count of the synchronous opens in the znode */
 257         if (flag & (FSYNC | FDSYNC))
 258                 atomic_inc_32(&zp->z_sync_cnt);
 259 
 260         ZFS_EXIT(zfsvfs);
 261         return (0);
 262 }
 263 
 264 /* ARGSUSED */
 265 static int
 266 zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
 267     caller_context_t *ct)
 268 {
 269         znode_t *zp = VTOZ(vp);
 270         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 271         pid_t caller_pid = (ct != NULL) ? ct->cc_pid : ddi_get_pid();
 272 
 273         /*
 274          * Clean up any locks held by this process on the vp.
 275          */
 276         cleanlocks(vp, caller_pid, 0);
 277         cleanshares(vp, caller_pid);
 278 
 279         ZFS_ENTER(zfsvfs);
 280         ZFS_VERIFY_ZP(zp);
 281 
 282         /* Decrement the synchronous opens in the znode */
 283         if ((flag & (FSYNC | FDSYNC)) && (count == 1))
 284                 atomic_dec_32(&zp->z_sync_cnt);
 285 
 286         if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
 287             ZTOV(zp)->v_type == VREG &&
 288             !(zp->z_pflags & ZFS_AV_QUARANTINED) && zp->z_size > 0)
 289                 VERIFY(fs_vscan(vp, cr, 1) == 0);
 290 
 291         ZFS_EXIT(zfsvfs);
 292         return (0);
 293 }
 294 
 295 /*
 296  * Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and
 297  * data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter.

 511 
 512                 if (pp = page_lookup(vp, start, SE_SHARED)) {
 513                         caddr_t va;
 514 
 515                         va = zfs_map_page(pp, S_READ);
 516                         error = uiomove(va + off, bytes, UIO_READ, uio);
 517                         zfs_unmap_page(pp, va);
 518                         page_unlock(pp);
 519                 } else {
 520                         error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl),
 521                             uio, bytes);
 522                 }
 523                 len -= bytes;
 524                 off = 0;
 525                 if (error)
 526                         break;
 527         }
 528         return (error);
 529 }
 530 
 531 
 532 /*
 533  * ZFS I/O rate throttling
 534  */
 535 
 536 #define DELAY_SHIFT 24
 537 
 538 typedef struct zfs_rate_delay {
 539         uint_t rl_rate;
 540         hrtime_t rl_delay;
 541 } zfs_rate_delay_t;
 542 
 543 /*
 544  * The time we'll attempt to cv_wait (below), in nanoseconds.
 545  * This should be no less than the minimum time it normally takes
 546  * to block a thread and wake back up after the timeout fires.
 547  *
 548  * Each table entry represents the delay for each 16 MiB of bandwidth
 549  * (1 << DELAY_SHIFT).  We reduce the delay as the I/O rate increases.
 550  */
 551 zfs_rate_delay_t zfs_rate_delay_table[] = {
 552         {0, 100000},
 553         {1, 100000},
 554         {2, 100000},
 555         {3, 100000},
 556         {4, 100000},
 557         {5, 50000},
 558         {6, 50000},
 559         {7, 50000},
 560         {8, 50000},
 561         {9, 25000},
 562         {10, 25000},
 563         {11, 25000},
 564         {12, 25000},
 565         {13, 12500},
 566         {14, 12500},
 567         {15, 12500},
 568         {16, 12500},
 569         {17, 6250},
 570         {18, 6250},
 571         {19, 6250},
 572         {20, 6250},
 573         {21, 3125},
 574         {22, 3125},
 575         {23, 3125},
 576         {24, 3125},
 577 };
 578 
 579 #define MAX_RATE_TBL_ENTRY 24
 580 
 581 /*
 582  * The delay we use shrinks as the iorate grows:
 583  * for higher iorates we want a shorter delay.
 584  */
 585 static inline hrtime_t
 586 zfs_get_delay(ssize_t iorate)
 587 {
 588         uint_t rate = iorate >> DELAY_SHIFT;
 589 
 590         if (rate > MAX_RATE_TBL_ENTRY)
 591                 rate = MAX_RATE_TBL_ENTRY;
 592         return (zfs_rate_delay_table[rate].rl_delay);
 593 }
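
/*
 * Illustration only, not part of this change: the table above follows a
 * regular pattern (100000 ns for index 0, then halving after every four
 * entries), so the same lookup can be sketched in closed form.  The names
 * sketch_get_delay(), SKETCH_DELAY_SHIFT and SKETCH_MAX_ENTRY are
 * hypothetical; the kernel code uses the explicit table.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_DELAY_SHIFT	24	/* same value as DELAY_SHIFT */
#define	SKETCH_MAX_ENTRY	24	/* same value as MAX_RATE_TBL_ENTRY */

/* Return the delay in nanoseconds for a rate given in bytes per second. */
static int64_t
sketch_get_delay(int64_t iorate)
{
	uint64_t idx;

	assert(iorate >= 0);
	idx = (uint64_t)iorate >> SKETCH_DELAY_SHIFT;	/* 16 MiB units */
	if (idx > SKETCH_MAX_ENTRY)
		idx = SKETCH_MAX_ENTRY;			/* clamp to last entry */
	if (idx == 0)
		return (100000);
	/* Entries 1-4 share one delay, 5-8 half of it, and so on. */
	return (100000 >> ((idx - 1) / 4));
}
```

/*
 * The explicit table is easier to tune entry by entry; the closed form is
 * shown only to make the halving schedule visible.
 */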
 594 
 595 /*
 596  * ZFS I/O rate throttling
 597  * See "Token Bucket" on Wikipedia
 598  *
 599  * This is "Token Bucket" with some modifications to avoid wait times
 600  * longer than a couple seconds, so that we don't trigger NFS retries
 601  * or similar.  This does mean that concurrent requests might take us
 602  * over the rate limit, but that's a lesser evil.
 603  */
 604 static void
 605 zfs_rate_throttle(zfsvfs_t *zfsvfs, ssize_t iosize)
 606 {
 607         zfs_rate_state_t *rate = &zfsvfs->z_rate;
 608         hrtime_t now, delta; /* nanoseconds */
 609         int64_t refill;
 610 
 611         VERIFY(rate->rate_cap > 0);
 612         mutex_enter(&rate->rate_lock);
 613 
 614         /*
 615          * If another thread is already waiting, we must queue up behind them.
 616          * We'll wait up to 1 sec here.  We normally will resume by cv_signal,
 617          * so we don't need fine timer resolution on this wait.
 618          */
 619         if (rate->rate_token_bucket < 0) {
 620                 rate->rate_waiters++;
 621                 (void) cv_timedwait_hires(
 622                     &rate->rate_wait_cv, &rate->rate_lock,
 623                     NANOSEC, TR_CLOCK_TICK, 0);
 624                 rate->rate_waiters--;
 625         }
 626 
 627         /*
 628          * How long since we last updated the bucket?
 629          */
 630         now = gethrtime();
 631         delta = now - rate->rate_last_update;
 632         rate->rate_last_update = now;
 633         if (delta < 0)
 634                 delta = 0; /* paranoid */
 635 
 636         /*
 637          * Add "tokens" for time since last update,
 638          * being careful about possible overflow.
 639          */
 640         refill = (delta * rate->rate_cap) / NANOSEC;
 641         if (refill < 0 || refill > rate->rate_cap)
 642                 refill = rate->rate_cap; /* overflow */
 643         rate->rate_token_bucket += refill;
 644         if (rate->rate_token_bucket > rate->rate_cap)
 645                 rate->rate_token_bucket = rate->rate_cap;
 646 
 647         /*
 648          * Withdraw tokens for the current I/O.  If this makes us overdrawn,
 649          * wait an amount of time proportionate to the overdraft.  However,
 650          * as a sanity measure, never wait more than 1 sec, and never try to
 651          * wait less than the time it normally takes to block and reschedule.
 652          *
 653          * Leave the bucket negative while we wait so other threads know to
 654          * queue up. In here, "refill" is the debt we're waiting to pay off.
 655          */
 656         rate->rate_token_bucket -= iosize;
 657         if (rate->rate_token_bucket < 0) {
 658                 hrtime_t zfs_rate_wait = 0;
 659 
 660                 refill = -rate->rate_token_bucket; /* debt, positive */
 661                 DTRACE_PROBE2(zfs_rate_over, zfsvfs_t *, zfsvfs,
 662                     int64_t, refill);
 663 
 664                 if (rate->rate_cap <= 0)
 665                         goto nocap;
 666 
 667                 delta = (refill * NANOSEC) / rate->rate_cap;
 668                 delta = MIN(delta, NANOSEC);
 669 
 670                 zfs_rate_wait = zfs_get_delay(rate->rate_cap);
 671 
 672                 if (delta > zfs_rate_wait) {
 673                         (void) cv_timedwait_hires(
 674                             &rate->rate_wait_cv, &rate->rate_lock,
 675                             delta, TR_CLOCK_TICK, 0);
 676                 }
 677 
 678                 rate->rate_token_bucket += refill;
 679         }
 680 nocap:
 681         if (rate->rate_waiters > 0) {
 682                 cv_signal(&rate->rate_wait_cv);
 683         }
 684 
 685         mutex_exit(&rate->rate_lock);
 686 }
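
/*
 * Illustration only, not part of this change: a single-threaded userland
 * model of the refill/withdraw arithmetic in zfs_rate_throttle().  All
 * names here (sketch_bucket_t, sketch_throttle(), SKETCH_NANOSEC) are
 * hypothetical.  Unlike the kernel code, the sketch never blocks and never
 * forgives the debt; it returns the number of nanoseconds the caller would
 * wait, capped at one second, and lets later refills pay the debt off.
 * It also assumes the cap is small enough that the products cannot
 * overflow, instead of detecting overflow after the fact.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_NANOSEC	1000000000LL

typedef struct sketch_bucket {
	int64_t	cap;		/* bytes per second; also the bucket size */
	int64_t	tokens;		/* goes negative while in debt */
	int64_t	last_update;	/* timestamp in nanoseconds */
} sketch_bucket_t;

/* Account for one I/O of 'iosize' bytes at time 'now'; return wait in ns. */
static int64_t
sketch_throttle(sketch_bucket_t *b, int64_t now, int64_t iosize)
{
	int64_t delta, refill, debt;

	/* Sketch restriction: keeps delta * cap and debt * NANOSEC in range. */
	assert(b->cap > 0 && b->cap <= INT64_MAX / SKETCH_NANOSEC);

	delta = now - b->last_update;
	b->last_update = now;
	if (delta < 0)
		delta = 0;

	/* Add tokens for the elapsed time; one second or more fills the bucket. */
	if (delta >= SKETCH_NANOSEC)
		refill = b->cap;
	else
		refill = (delta * b->cap) / SKETCH_NANOSEC;
	b->tokens += refill;
	if (b->tokens > b->cap)
		b->tokens = b->cap;

	/* Withdraw tokens for this I/O. */
	b->tokens -= iosize;
	if (b->tokens >= 0)
		return (0);

	/* Overdrawn: wait in proportion to the debt, never more than 1 sec. */
	debt = -b->tokens;
	if (debt >= b->cap)
		return (SKETCH_NANOSEC);
	return ((debt * SKETCH_NANOSEC) / b->cap);
}
```

/*
 * Example: with cap = 1000 bytes/sec and a full bucket, a 400-byte I/O
 * passes immediately, and a following 1100-byte I/O leaves a 500-byte debt,
 * for which the sketch reports a wait of half a second.
 */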
 687 
 688 
 689 offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
 690 
 691 /*
 692  * Read bytes from specified file into supplied buffer.
 693  *
 694  *      IN:     vp      - vnode of file to be read from.
 695  *              uio     - structure supplying read location, range info,
 696  *                        and return buffer.
 697  *              ioflag  - SYNC flags; used to provide FRSYNC semantics.
 698  *              cr      - credentials of caller.
 699  *              ct      - caller context
 700  *
 701  *      OUT:    uio     - updated offset and range, buffer filled.
 702  *
 703  *      RETURN: 0 on success, error code on failure.
 704  *
 705  * Side Effects:
 706  *      vp - atime updated if byte count > 0
 707  */
 708 /* ARGSUSED */

 735         /*
 736          * Fasttrack empty reads
 737          */
 738         if (uio->uio_resid == 0) {
 739                 ZFS_EXIT(zfsvfs);
 740                 return (0);
 741         }
 742 
 743         /*
 744          * Check for mandatory locks
 745          */
 746         if (MANDMODE(zp->z_mode)) {
 747                 if (error = chklock(vp, FREAD,
 748                     uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
 749                         ZFS_EXIT(zfsvfs);
 750                         return (error);
 751                 }
 752         }
 753 
 754         /*
 755          * ZFS I/O rate throttling
 756          */
 757         if (zfsvfs->z_rate.rate_cap)
 758                 zfs_rate_throttle(zfsvfs, uio->uio_resid);
 759 
 760         /*
 761          * If we're in FRSYNC mode, sync out this znode before reading it.
 762          */
 763         if (ioflag & FRSYNC || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
 764                 zil_commit(zfsvfs->z_log, zp->z_id);
 765 
 766         /*
 767          * Lock the range against changes.
 768          */
 769         rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
 770 
 771         /*
 772          * If we are reading past end-of-file we can skip
 773          * to the end; but we might still need to set atime.
 774          */
 775         if (uio->uio_loffset >= zp->z_size) {
 776                 error = 0;
 777                 goto out;
 778         }
 779 
 780         ASSERT(uio->uio_loffset < zp->z_size);

 904             &zp->z_pflags, 8);
 905 
 906         /*
 907          * In the case where vp->v_vfsp != zp->z_zfsvfs->z_vfs (e.g. snapshots) our
 908          * callers might not be able to detect properly that we are read-only,
 909          * so check it explicitly here.
 910          */
 911         if (zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) {
 912                 ZFS_EXIT(zfsvfs);
 913                 return (SET_ERROR(EROFS));
 914         }
 915 
 916         /*
 917          * If immutable or not appending then return EPERM.
 918          * Intentionally allow ZFS_READONLY through here.
 919          * See zfs_zaccess_common()
 920          */
 921         if ((zp->z_pflags & ZFS_IMMUTABLE) ||
 922             ((zp->z_pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
 923             (uio->uio_loffset < zp->z_size))) {
 924                 /* Make sure we're not a WORM before returning EPERM. */
 925                 if (!(zp->z_pflags & ZFS_IMMUTABLE) ||
 926                     !zp->z_zfsvfs->z_isworm) {
 927                         ZFS_EXIT(zfsvfs);
 928                         return (SET_ERROR(EPERM));
 929                 }
 930         }
 931 
 932         zilog = zfsvfs->z_log;
 933 
 934         /*
 935          * Validate file offset
 936          */
 937         woff = ioflag & FAPPEND ? zp->z_size : uio->uio_loffset;
 938         if (woff < 0) {
 939                 ZFS_EXIT(zfsvfs);
 940                 return (SET_ERROR(EINVAL));
 941         }
 942 
 943         /*
 944          * Check for mandatory locks before calling zfs_range_lock()
 945          * in order to prevent a deadlock with locks set via fcntl().
 946          */
 947         if (MANDMODE((mode_t)zp->z_mode) &&
 948             (error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
 949                 ZFS_EXIT(zfsvfs);
 950                 return (error);
 951         }
 952 
 953         /*
 954          * ZFS I/O rate throttling
 955          */
 956         if (zfsvfs->z_rate.rate_cap)
 957                 zfs_rate_throttle(zfsvfs, uio->uio_resid);
 958 
 959         /*
 960          * Pre-fault the pages to ensure slow (eg NFS) pages
 961          * don't hold up txg.
 962          * Skip this if uio contains loaned arc_buf.
 963          */
 964         if ((uio->uio_extflg == UIO_XUIO) &&
 965             (((xuio_t *)uio)->xu_type == UIOTYPE_ZEROCOPY))
 966                 xuio = (xuio_t *)uio;
 967         else
 968                 uio_prefaultpages(MIN(n, max_blksz), uio);
 969 
 970         /*
 971          * If in append mode, set the io offset pointer to eof.
 972          */
 973         if (ioflag & FAPPEND) {
 974                 /*
 975                  * Obtain an appending range lock to guarantee file append
 976                  * semantics.  We reset the write offset once we have the lock.
 977                  */
 978                 rl = zfs_range_lock(zp, 0, n, RL_APPEND);
 979                 woff = rl->r_off;

1214 
1215         zfs_range_unlock(rl);
1216 
1217         /*
1218          * If we're in replay mode, or we made no progress, return error.
1219          * Otherwise, it's at least a partial write, so it's successful.
1220          */
1221         if (zfsvfs->z_replay || uio->uio_resid == start_resid) {
1222                 ZFS_EXIT(zfsvfs);
1223                 return (error);
1224         }
1225 
1226         if (ioflag & (FSYNC | FDSYNC) ||
1227             zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
1228                 zil_commit(zilog, zp->z_id);
1229 
1230         ZFS_EXIT(zfsvfs);
1231         return (0);
1232 }
1233 
1234 /* ARGSUSED */
1235 void
1236 zfs_get_done(zgd_t *zgd, int error)
1237 {
1238         znode_t *zp = zgd->zgd_private;
1239         objset_t *os = zp->z_zfsvfs->z_os;
1240 
1241         if (zgd->zgd_db)
1242                 dmu_buf_rele(zgd->zgd_db, zgd);
1243 
1244         zfs_range_unlock(zgd->zgd_rl);
1245 
1246         /*
1247          * Release the vnode asynchronously as we currently have the
1248          * txg stopped from syncing.
1249          */
1250         VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
1251 
1252         kmem_free(zgd, sizeof (zgd_t));
1253 }
1254 
1255 #ifdef DEBUG
1256 static int zil_fault_io = 0;
1257 #endif
1258 
1259 /*
1260  * Get data to generate a TX_WRITE intent log record.
1261  */
1262 int
1263 zfs_get_data(void *arg, lr_write_t *lr, char *buf, struct lwb *lwb, zio_t *zio)
1264 {
1265         zfsvfs_t *zfsvfs = arg;
1266         objset_t *os = zfsvfs->z_os;
1267         znode_t *zp;
1268         uint64_t object = lr->lr_foid;
1269         uint64_t offset = lr->lr_offset;
1270         uint64_t size = lr->lr_length;
1271         dmu_buf_t *db;

1355 
1356                         error = dmu_sync(zio, lr->lr_common.lrc_txg,
1357                             zfs_get_done, zgd);
1358                         ASSERT(error || lr->lr_length <= size);
1359 
1360                         /*
1361                          * On success, we need to wait for the write I/O
1362                          * initiated by dmu_sync() to complete before we can
1363                          * release this dbuf.  We will finish everything up
1364                          * in the zfs_get_done() callback.
1365                          */
1366                         if (error == 0)
1367                                 return (0);
1368 
1369                         if (error == EALREADY) {
1370                                 lr->lr_common.lrc_txtype = TX_WRITE2;
1371                                 /*
1372                                  * TX_WRITE2 relies on the data previously
1373                                  * written by the TX_WRITE that caused
1374                                  * EALREADY.  We zero out the BP because
1375                                  * it is the old, currently-on-disk BP.
1376                                  */
1377                                 zgd->zgd_bp = NULL;
1378                                 BP_ZERO(bp);
1379                                 error = 0;
1380                         }
1381                 }
1382         }
1383 
1384         zfs_get_done(zgd, error);
1385 
1386         return (error);
1387 }
1388 
1389 /*ARGSUSED*/
1390 static int
1391 zfs_access(vnode_t *vp, int mode, int flag, cred_t *cr,
1392     caller_context_t *ct)
1393 {
1394         znode_t *zp = VTOZ(vp);
1395         zfsvfs_t *zfsvfs = zp->z_zfsvfs;

1438  *              flags   - LOOKUP_XATTR set if looking for an attribute.
1439  *              rdir    - root directory vnode [UNUSED].
1440  *              cr      - credentials of caller.
1441  *              ct      - caller context
1442  *              direntflags - directory lookup flags
1443  *              realpnp - returned pathname.
1444  *
1445  *      OUT:    vpp     - vnode of located entry, NULL if not found.
1446  *
1447  *      RETURN: 0 on success, error code on failure.
1448  *
1449  * Timestamps:
1450  *      NA
1451  */
1452 /* ARGSUSED */
1453 static int
1454 zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct pathname *pnp,
1455     int flags, vnode_t *rdir, cred_t *cr,  caller_context_t *ct,
1456     int *direntflags, pathname_t *realpnp)
1457 {
1458         znode_t *zp, *zdp = VTOZ(dvp);
1459         zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
1460         int     error = 0;
1461 
1462         /*
1463          * Fast path lookup; however, we must skip DNLC lookup
1464          * for case folding or normalizing lookups because the
1465          * DNLC code only stores the passed in name.  This means
1466          * creating 'a' and removing 'A' on a case insensitive
1467          * file system would work, but DNLC still thinks 'a'
1468          * exists and won't let you create it again on the next
1469          * pass through fast path.
1470          */
1471         if (!(flags & (LOOKUP_XATTR | FIGNORECASE))) {
1472 
1473                 if (dvp->v_type != VDIR) {
1474                         return (SET_ERROR(ENOTDIR));
1475                 } else if (zdp->z_sa_hdl == NULL) {
1476                         return (SET_ERROR(EIO));
1477                 }
1478

1556         }
1557 
1558         /*
1559          * Check accessibility of directory.
1560          */
1561 
1562         if (error = zfs_zaccess(zdp, ACE_EXECUTE, 0, B_FALSE, cr)) {
1563                 ZFS_EXIT(zfsvfs);
1564                 return (error);
1565         }
1566 
1567         if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
1568             NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
1569                 ZFS_EXIT(zfsvfs);
1570                 return (SET_ERROR(EILSEQ));
1571         }
1572 
1573         error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
1574         if (error == 0)
1575                 error = specvp_check(vpp, cr);
1576         if (*vpp) {
1577                 zp = VTOZ(*vpp);
1578                 if (!(zp->z_pflags & ZFS_IMMUTABLE) &&
1579                     ((*vpp)->v_type != VDIR) &&
1580                     zfsvfs->z_isworm && !zfs_worm_in_trans(zp)) {
1581                         zp->z_pflags |= ZFS_IMMUTABLE;
1582                 }
1583         }
1584 
1585         ZFS_EXIT(zfsvfs);
1586         return (error);
1587 }
1588 
1589 /*
1590  * Attempt to create a new entry in a directory.  If the entry
1591  * already exists, truncate the file if permissible, else return
1592  * an error.  Return the vp of the created or trunc'd file.
1593  *
1594  *      IN:     dvp     - vnode of directory to put new file entry in.
1595  *              name    - name of new file entry.
1596  *              vap     - attributes of new file.
1597  *              excl    - flag indicating exclusive or non-exclusive mode.
1598  *              mode    - mode to open file with.
1599  *              cr      - credentials of caller.
1600  *              flag    - large file flag [UNUSED].
1601  *              ct      - caller context
1602  *              vsecp   - ACL to be set
1603  *
1604  *      OUT:    vpp     - vnode of created or trunc'd entry.
1605  *
1606  *      RETURN: 0 on success, error code on failure.
1607  *
1608  * Timestamps:
1609  *      dvp - ctime|mtime updated if new entry created
1610  *       vp - ctime|mtime always, atime if new
1611  */
1612 
1613 /* ARGSUSED */
1614 static int
1615 zfs_create(vnode_t *dvp, char *name, vattr_t *vap, vcexcl_t excl,
1616     int mode, vnode_t **vpp, cred_t *cr, int flag, caller_context_t *ct,
1617     vsecattr_t *vsecp)
1618 {
1619         int             imm_was_set = 0;
1620         znode_t         *zp, *dzp = VTOZ(dvp);
1621         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
1622         zilog_t         *zilog;
1623         objset_t        *os;
1624         zfs_dirlock_t   *dl;
1625         dmu_tx_t        *tx;
1626         int             error;
1627         ksid_t          *ksid;
1628         uid_t           uid;
1629         gid_t           gid = crgetgid(cr);
1630         zfs_acl_ids_t   acl_ids;
1631         boolean_t       fuid_dirtied;
1632         boolean_t       have_acl = B_FALSE;
1633         boolean_t       waited = B_FALSE;
1634 
1635         /*
1636          * If we have an ephemeral id, ACL, or XVATTR then
1637          * make sure file system is at proper version
1638          */
1639

1685                 int zflg = 0;
1686 
1687                 if (flag & FIGNORECASE)
1688                         zflg |= ZCILOOK;
1689 
1690                 error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1691                     NULL, NULL);
1692                 if (error) {
1693                         if (have_acl)
1694                                 zfs_acl_ids_free(&acl_ids);
1695                         if (strcmp(name, "..") == 0)
1696                                 error = SET_ERROR(EISDIR);
1697                         ZFS_EXIT(zfsvfs);
1698                         return (error);
1699                 }
1700         }
1701 
1702         if (zp == NULL) {
1703                 uint64_t txtype;
1704 
1705                 if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
1706                     dzp->z_zfsvfs->z_isworm) {
1707                         imm_was_set = 1;
1708                         dzp->z_pflags &= ~ZFS_IMMUTABLE;
1709                 }
1710 
1711                 /*
1712                  * Create a new file object and update the directory
1713                  * to reference it.
1714                  */
1715                 if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
1716                         if (have_acl)
1717                                 zfs_acl_ids_free(&acl_ids);
1718                         if (imm_was_set)
1719                                 dzp->z_pflags |= ZFS_IMMUTABLE;
1720                         goto out;
1721                 }
1722 
1723                 if (imm_was_set)
1724                         dzp->z_pflags |= ZFS_IMMUTABLE;
1725 
1726                 /*
1727                  * We only support the creation of regular files in
1728                  * extended attribute directories.
1729                  */
1730 
1731                 if ((dzp->z_pflags & ZFS_XATTR) &&
1732                     (vap->va_type != VREG)) {
1733                         if (have_acl)
1734                                 zfs_acl_ids_free(&acl_ids);
1735                         error = SET_ERROR(EINVAL);
1736                         goto out;
1737                 }
1738 
1739                 if (!have_acl && (error = zfs_acl_ids_create(dzp, 0, vap,
1740                     cr, vsecp, &acl_ids)) != 0)
1741                         goto out;
1742                 have_acl = B_TRUE;
1743 
1744                 if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
1745                         zfs_acl_ids_free(&acl_ids);
1746                         error = SET_ERROR(EDQUOT);
1747                         goto out;
1748                 }
1749 
1750                 tx = dmu_tx_create(os);
1751 
1752                 dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
1753                     ZFS_SA_BASE_ATTR_SIZE);
1754 
1755                 fuid_dirtied = zfsvfs->z_fuid_dirty;
1756                 if (fuid_dirtied)
1757                         zfs_fuid_txhold(zfsvfs, tx);
1758                 dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
1759                 dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
1760                 if (!zfsvfs->z_use_sa &&
1761                     acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
1762                         dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
1763                             0, acl_ids.z_aclp->z_acl_bytes);
1764                 }
1765                 error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
1766                 if (error) {
1767                         zfs_dirent_unlock(dl);
1768                         if (error == ERESTART) {
1769                                 waited = B_TRUE;
1770                                 dmu_tx_wait(tx);
1771                                 dmu_tx_abort(tx);
1772                                 goto top;
1773                         }
1774                         zfs_acl_ids_free(&acl_ids);
1775                         dmu_tx_abort(tx);
1776                         ZFS_EXIT(zfsvfs);
1777                         return (error);
1778                 }
1779                 zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
1780 
1781                 if (fuid_dirtied)
1782                         zfs_fuid_sync(zfsvfs, tx);
1783 
1784                 if (imm_was_set)
1785                         zp->z_pflags |= ZFS_IMMUTABLE;
1786 
1787                 (void) zfs_link_create(dl, zp, tx, ZNEW);
1788                 txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
1789                 if (flag & FIGNORECASE)
1790                         txtype |= TX_CI;
1791                 zfs_log_create(zilog, tx, txtype, dzp, zp, name,
1792                     vsecp, acl_ids.z_fuidp, vap);
1793                 zfs_acl_ids_free(&acl_ids);
1794                 dmu_tx_commit(tx);
1795         } else {
1796                 int aflags = (flag & FAPPEND) ? V_APPEND : 0;
1797 
1798                 if (have_acl)
1799                         zfs_acl_ids_free(&acl_ids);
1800                 have_acl = B_FALSE;
1801 
1802                 /*
1803                  * A directory entry already exists for this name.
1804                  */
1805                 /*
1806                  * Can't truncate an existing file if in exclusive mode.
1807                  */
1808                 if (excl == EXCL) {
1809                         error = SET_ERROR(EEXIST);
1810                         goto out;
1811                 }
1812                 /*
1813                  * Can't open a directory for writing.
1814                  */
1815                 if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
1816                         error = SET_ERROR(EISDIR);
1817                         goto out;
1818                 }
1819                 if ((flag & FWRITE) &&
1820                     dzp->z_zfsvfs->z_isworm) {
1821                         error = SET_ERROR(EPERM);
1822                         goto out;
1823                 }
1824 
1825                 if (!(flag & FAPPEND) &&
1826                     (zp->z_pflags & ZFS_IMMUTABLE) &&
1827                     dzp->z_zfsvfs->z_isworm) {
1828                         imm_was_set = 1;
1829                         zp->z_pflags &= ~ZFS_IMMUTABLE;
1830                 }
1831                 /*
1832                  * Verify requested access to file.
1833                  */
1834                 if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
1835                         if (imm_was_set)
1836                                 zp->z_pflags |= ZFS_IMMUTABLE;
1837                         goto out;
1838                 }
1839 
1840                 if (imm_was_set)
1841                         zp->z_pflags |= ZFS_IMMUTABLE;
1842 
1843                 mutex_enter(&dzp->z_lock);
1844                 dzp->z_seq++;
1845                 mutex_exit(&dzp->z_lock);
1846 
1847                 /*
1848                  * Truncate regular files if requested.
1849                  */
1850                 if ((ZTOV(zp)->v_type == VREG) &&
1851                     (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1852                         /* we can't hold any locks when calling zfs_freesp() */
1853                         zfs_dirent_unlock(dl);
1854                         dl = NULL;
1855                         error = zfs_freesp(zp, 0, 0, mode, TRUE);
1856                         if (error == 0) {
1857                                 vnevent_create(ZTOV(zp), ct);
1858                         }
1859                 }
1860         }
1861 out:
1862

1929                 pn_alloc(&realnm);
1930                 realnmp = &realnm;
1931         }
1932 
1933 top:
1934         xattr_obj = 0;
1935         xzp = NULL;
1936         /*
1937          * Attempt to lock directory; fail if entry doesn't exist.
1938          */
1939         if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
1940             NULL, realnmp)) {
1941                 if (realnmp)
1942                         pn_free(realnmp);
1943                 ZFS_EXIT(zfsvfs);
1944                 return (error);
1945         }
1946 
1947         vp = ZTOV(zp);
1948 
1949         if (zp->z_zfsvfs->z_isworm) {
1950                 error = SET_ERROR(EPERM);
1951                 goto out;
1952         }
1953 
1954         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
1955                 goto out;
1956         }
1957 
1958         /*
1959          * Need to use rmdir for removing directories.
1960          */
1961         if (vp->v_type == VDIR) {
1962                 error = SET_ERROR(EPERM);
1963                 goto out;
1964         }
1965 
1966         vnevent_remove(vp, dvp, name, ct);
1967 
1968         if (realnmp)
1969                 dnlc_remove(dvp, realnmp->pn_buf);
1970         else
1971                 dnlc_remove(dvp, name);
1972 
1973         mutex_enter(&vp->v_lock);

2000         if (error == 0 && xattr_obj) {
2001                 error = zfs_zget(zfsvfs, xattr_obj, &xzp);
2002                 ASSERT0(error);
2003                 dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_TRUE);
2004                 dmu_tx_hold_sa(tx, xzp->z_sa_hdl, B_FALSE);
2005         }
2006 
2007         mutex_enter(&zp->z_lock);
2008         if ((acl_obj = zfs_external_acl(zp)) != 0 && may_delete_now)
2009                 dmu_tx_hold_free(tx, acl_obj, 0, DMU_OBJECT_END);
2010         mutex_exit(&zp->z_lock);
2011 
2012         /* charge as an update -- would be nice not to charge at all */
2013         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
2014 
2015         /*
2016          * Mark this transaction as typically resulting in a net free of space
2017          */
2018         dmu_tx_mark_netfree(tx);
2019 
2020         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2021         if (error) {
2022                 zfs_dirent_unlock(dl);
2023                 VN_RELE(vp);
2024                 if (xzp)
2025                         VN_RELE(ZTOV(xzp));
2026                 if (error == ERESTART) {
2027                         waited = B_TRUE;
2028                         dmu_tx_wait(tx);
2029                         dmu_tx_abort(tx);
2030                         goto top;
2031                 }
2032                 if (realnmp)
2033                         pn_free(realnmp);
2034                 dmu_tx_abort(tx);
2035                 ZFS_EXIT(zfsvfs);
2036                 return (error);
2037         }
2038 
2039         /*
2040          * Remove the directory entry.

2127  *              dirname - name of new directory.
2128  *              vap     - attributes of new directory.
2129  *              cr      - credentials of caller.
2130  *              ct      - caller context
2131  *              flags   - case flags
2132  *              vsecp   - ACL to be set
2133  *
2134  *      OUT:    vpp     - vnode of created directory.
2135  *
2136  *      RETURN: 0 on success, error code on failure.
2137  *
2138  * Timestamps:
2139  *      dvp - ctime|mtime updated
2140  *       vp - ctime|mtime|atime updated
2141  */
2142 /*ARGSUSED*/
2143 static int
2144 zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
2145     caller_context_t *ct, int flags, vsecattr_t *vsecp)
2146 {
2147         int             imm_was_set = 0;
2148         znode_t         *zp, *dzp = VTOZ(dvp);
2149         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
2150         zilog_t         *zilog;
2151         zfs_dirlock_t   *dl;
2152         uint64_t        txtype;
2153         dmu_tx_t        *tx;
2154         int             error;
2155         int             zf = ZNEW;
2156         ksid_t          *ksid;
2157         uid_t           uid;
2158         gid_t           gid = crgetgid(cr);
2159         zfs_acl_ids_t   acl_ids;
2160         boolean_t       fuid_dirtied;
2161         boolean_t       waited = B_FALSE;
2162 
2163         ASSERT(vap->va_type == VDIR);
2164 
2165         /*
2166          * If we have an ephemeral id, ACL, or XVATTR then
2167          * make sure file system is at proper version

2207                 ZFS_EXIT(zfsvfs);
2208                 return (error);
2209         }
2210         /*
2211          * First make sure the new directory doesn't exist.
2212          *
2213          * Existence is checked first to make sure we don't return
2214          * EACCES instead of EEXIST which can cause some applications
2215          * to fail.
2216          */
2217 top:
2218         *vpp = NULL;
2219 
2220         if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
2221             NULL, NULL)) {
2222                 zfs_acl_ids_free(&acl_ids);
2223                 ZFS_EXIT(zfsvfs);
2224                 return (error);
2225         }
2226 
2227         if ((dzp->z_pflags & ZFS_IMMUTABLE) &&
2228             dzp->z_zfsvfs->z_isworm) {
2229                 imm_was_set = 1;
2230                 dzp->z_pflags &= ~ZFS_IMMUTABLE;
2231         }
2232 
2233         if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
2234                 if (imm_was_set)
2235                         dzp->z_pflags |= ZFS_IMMUTABLE;
2236                 zfs_acl_ids_free(&acl_ids);
2237                 zfs_dirent_unlock(dl);
2238                 ZFS_EXIT(zfsvfs);
2239                 return (error);
2240         }
2241 
2242         if (imm_was_set)
2243                 dzp->z_pflags |= ZFS_IMMUTABLE;
2244 
2245         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
2246                 zfs_acl_ids_free(&acl_ids);
2247                 zfs_dirent_unlock(dl);
2248                 ZFS_EXIT(zfsvfs);
2249                 return (SET_ERROR(EDQUOT));
2250         }
2251 
2252         /*
2253          * Add a new entry to the directory.
2254          */
2255         tx = dmu_tx_create(zfsvfs->z_os);
2256         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, dirname);
2257         dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
2258         fuid_dirtied = zfsvfs->z_fuid_dirty;
2259         if (fuid_dirtied)
2260                 zfs_fuid_txhold(zfsvfs, tx);
2261         if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
2262                 dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
2263                     acl_ids.z_aclp->z_acl_bytes);
2264         }
2265 
2266         dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
2267             ZFS_SA_BASE_ATTR_SIZE);
2268 
2269         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2270         if (error) {
2271                 zfs_dirent_unlock(dl);
2272                 if (error == ERESTART) {
2273                         waited = B_TRUE;
2274                         dmu_tx_wait(tx);
2275                         dmu_tx_abort(tx);
2276                         goto top;
2277                 }
2278                 zfs_acl_ids_free(&acl_ids);
2279                 dmu_tx_abort(tx);
2280                 ZFS_EXIT(zfsvfs);
2281                 return (error);
2282         }
2283 
2284         /*
2285          * Create new node.
2286          */
2287         zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
2288 
2289         if (fuid_dirtied)

2351         ZFS_ENTER(zfsvfs);
2352         ZFS_VERIFY_ZP(dzp);
2353         zilog = zfsvfs->z_log;
2354 
2355         if (flags & FIGNORECASE)
2356                 zflg |= ZCILOOK;
2357 top:
2358         zp = NULL;
2359 
2360         /*
2361          * Attempt to lock directory; fail if entry doesn't exist.
2362          */
2363         if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
2364             NULL, NULL)) {
2365                 ZFS_EXIT(zfsvfs);
2366                 return (error);
2367         }
2368 
2369         vp = ZTOV(zp);
2370 
2371         if (dzp->z_zfsvfs->z_isworm) {
2372                 error = SET_ERROR(EPERM);
2373                 goto out;
2374         }
2375 
2376         if (error = zfs_zaccess_delete(dzp, zp, cr)) {
2377                 goto out;
2378         }
2379 
2380         if (vp->v_type != VDIR) {
2381                 error = SET_ERROR(ENOTDIR);
2382                 goto out;
2383         }
2384 
2385         if (vp == cwd) {
2386                 error = SET_ERROR(EINVAL);
2387                 goto out;
2388         }
2389 
2390         vnevent_rmdir(vp, dvp, name, ct);
2391 
2392         /*
2393          * Grab a lock on the directory to make sure that no one is
2394          * trying to add (or lookup) entries while we are removing it.
2395          */
2396         rw_enter(&zp->z_name_lock, RW_WRITER);
2397 
2398         /*
2399          * Grab a lock on the parent pointer to make sure we play well
2400          * with the treewalk and directory rename code.
2401          */
2402         rw_enter(&zp->z_parent_lock, RW_WRITER);
2403 
2404         tx = dmu_tx_create(zfsvfs->z_os);
2405         dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
2406         dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
2407         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
2408         zfs_sa_upgrade_txholds(tx, zp);
2409         zfs_sa_upgrade_txholds(tx, dzp);
2410         dmu_tx_mark_netfree(tx);
2411         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
2412         if (error) {
2413                 rw_exit(&zp->z_parent_lock);
2414                 rw_exit(&zp->z_name_lock);
2415                 zfs_dirent_unlock(dl);
2416                 VN_RELE(vp);
2417                 if (error == ERESTART) {
2418                         waited = B_TRUE;
2419                         dmu_tx_wait(tx);
2420                         dmu_tx_abort(tx);
2421                         goto top;
2422                 }
2423                 dmu_tx_abort(tx);
2424                 ZFS_EXIT(zfsvfs);
2425                 return (error);
2426         }
2427 
2428         error = zfs_link_destroy(dl, zp, tx, zflg, NULL);
2429 
2430         if (error == 0) {
2431                 uint64_t txtype = TX_RMDIR;

3048 
3049         if (mask & AT_SIZE && vp->v_type == VDIR) {
3050                 ZFS_EXIT(zfsvfs);
3051                 return (SET_ERROR(EISDIR));
3052         }
3053 
3054         if (mask & AT_SIZE && vp->v_type != VREG && vp->v_type != VFIFO) {
3055                 ZFS_EXIT(zfsvfs);
3056                 return (SET_ERROR(EINVAL));
3057         }
3058 
3059         /*
3060          * If this is an xvattr_t, then get a pointer to the structure of
3061          * optional attributes.  If this is NULL, then we have a vattr_t.
3062          */
3063         xoap = xva_getxoptattr(xvap);
3064 
3065         xva_init(&tmpxvattr);
3066 
3067         /*
3068          * Do not allow the immutable bit to be altered once it is set
3069          */
3070         if ((zp->z_pflags & ZFS_IMMUTABLE) &&
3071             XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
3072             zp->z_zfsvfs->z_isworm) {
3073                 ZFS_EXIT(zfsvfs);
3074                 return (SET_ERROR(EPERM));
3075         }
3076 
3077         /*
3078          * Immutable files can only alter atime
3079          */
3080         if (((zp->z_pflags & ZFS_IMMUTABLE) || zp->z_zfsvfs->z_isworm) &&
3081             ((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
3082             ((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
3083                 if (!zp->z_zfsvfs->z_isworm || !zfs_worm_in_trans(zp)) {
3084                         ZFS_EXIT(zfsvfs);
3085                         return (SET_ERROR(EPERM));
3086                 }
3087         }
3088 
3089         /*
3090          * Note: ZFS_READONLY is handled in zfs_zaccess_common.
3091          */
3092 
3093         /*
3094          * Verify timestamps don't overflow 32 bits.
3095          * ZFS can handle large timestamps, but 32bit syscalls can't
3096          * handle times greater than 2039.  This check should be removed
3097          * once large timestamps are fully supported.
3098          */
3099         if (mask & (AT_ATIME | AT_MTIME)) {
3100                 if (((mask & AT_ATIME) && TIMESPEC_OVERFLOW(&vap->va_atime)) ||
3101                     ((mask & AT_MTIME) && TIMESPEC_OVERFLOW(&vap->va_mtime))) {
3102                         ZFS_EXIT(zfsvfs);
3103                         return (SET_ERROR(EOVERFLOW));
3104                 }
3105         }
3106 
3107 top:

3976         if (tdvp != sdvp) {
3977                 vnevent_pre_rename_dest_dir(tdvp, ZTOV(szp), tnm, ct);
3978         }
3979 
3980         tx = dmu_tx_create(zfsvfs->z_os);
3981         dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
3982         dmu_tx_hold_sa(tx, sdzp->z_sa_hdl, B_FALSE);
3983         dmu_tx_hold_zap(tx, sdzp->z_id, FALSE, snm);
3984         dmu_tx_hold_zap(tx, tdzp->z_id, TRUE, tnm);
3985         if (sdzp != tdzp) {
3986                 dmu_tx_hold_sa(tx, tdzp->z_sa_hdl, B_FALSE);
3987                 zfs_sa_upgrade_txholds(tx, tdzp);
3988         }
3989         if (tzp) {
3990                 dmu_tx_hold_sa(tx, tzp->z_sa_hdl, B_FALSE);
3991                 zfs_sa_upgrade_txholds(tx, tzp);
3992         }
3993 
3994         zfs_sa_upgrade_txholds(tx, szp);
3995         dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
3996         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
3997         if (error) {
3998                 if (zl != NULL)
3999                         zfs_rename_unlock(&zl);
4000                 zfs_dirent_unlock(sdl);
4001                 zfs_dirent_unlock(tdl);
4002 
4003                 if (sdzp == tdzp)
4004                         rw_exit(&sdzp->z_name_lock);
4005 
4006                 VN_RELE(ZTOV(szp));
4007                 if (tzp)
4008                         VN_RELE(ZTOV(tzp));
4009                 if (error == ERESTART) {
4010                         waited = B_TRUE;
4011                         dmu_tx_wait(tx);
4012                         dmu_tx_abort(tx);
4013                         goto top;
4014                 }
4015                 dmu_tx_abort(tx);
4016                 ZFS_EXIT(zfsvfs);

4100  *              vap     - Attributes of new entry.
4101  *              cr      - credentials of caller.
4102  *              ct      - caller context
4103  *              flags   - case flags
4104  *
4105  *      RETURN: 0 on success, error code on failure.
4106  *
4107  * Timestamps:
4108  *      dvp - ctime|mtime updated
4109  */
4110 /*ARGSUSED*/
4111 static int
4112 zfs_symlink(vnode_t *dvp, char *name, vattr_t *vap, char *link, cred_t *cr,
4113     caller_context_t *ct, int flags)
4114 {
4115         znode_t         *zp, *dzp = VTOZ(dvp);
4116         zfs_dirlock_t   *dl;
4117         dmu_tx_t        *tx;
4118         zfsvfs_t        *zfsvfs = dzp->z_zfsvfs;
4119         zilog_t         *zilog;
4120         int             imm_was_set = 0;
4121         uint64_t        len = strlen(link);
4122         int             error;
4123         int             zflg = ZNEW;
4124         zfs_acl_ids_t   acl_ids;
4125         boolean_t       fuid_dirtied;
4126         uint64_t        txtype = TX_SYMLINK;
4127         boolean_t       waited = B_FALSE;
4128 
4129         ASSERT(vap->va_type == VLNK);
4130 
4131         ZFS_ENTER(zfsvfs);
4132         ZFS_VERIFY_ZP(dzp);
4133         zilog = zfsvfs->z_log;
4134 
4135         if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
4136             NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
4137                 ZFS_EXIT(zfsvfs);
4138                 return (SET_ERROR(EILSEQ));
4139         }
4140         if (flags & FIGNORECASE)

4144                 ZFS_EXIT(zfsvfs);
4145                 return (SET_ERROR(ENAMETOOLONG));
4146         }
4147 
4148         if ((error = zfs_acl_ids_create(dzp, 0,
4149             vap, cr, NULL, &acl_ids)) != 0) {
4150                 ZFS_EXIT(zfsvfs);
4151                 return (error);
4152         }
4153 top:
4154         /*
4155          * Attempt to lock directory; fail if entry already exists.
4156          */
4157         error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
4158         if (error) {
4159                 zfs_acl_ids_free(&acl_ids);
4160                 ZFS_EXIT(zfsvfs);
4161                 return (error);
4162         }
4163 
4164         if ((dzp->z_pflags & ZFS_IMMUTABLE) && dzp->z_zfsvfs->z_isworm) {
4165                 imm_was_set = 1;
4166                 dzp->z_pflags &= ~ZFS_IMMUTABLE;
4167         }
4168         if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
4169                 if (imm_was_set)
4170                         dzp->z_pflags |= ZFS_IMMUTABLE;
4171                 zfs_acl_ids_free(&acl_ids);
4172                 zfs_dirent_unlock(dl);
4173                 ZFS_EXIT(zfsvfs);
4174                 return (error);
4175         }
4176         if (imm_was_set)
4177                 dzp->z_pflags |= ZFS_IMMUTABLE;
4178 
4179         if (zfs_acl_ids_overquota(zfsvfs, &acl_ids)) {
4180                 zfs_acl_ids_free(&acl_ids);
4181                 zfs_dirent_unlock(dl);
4182                 ZFS_EXIT(zfsvfs);
4183                 return (SET_ERROR(EDQUOT));
4184         }
4185         tx = dmu_tx_create(zfsvfs->z_os);
4186         fuid_dirtied = zfsvfs->z_fuid_dirty;
4187         dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
4188         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
4189         dmu_tx_hold_sa_create(tx, acl_ids.z_aclp->z_acl_bytes +
4190             ZFS_SA_BASE_ATTR_SIZE + len);
4191         dmu_tx_hold_sa(tx, dzp->z_sa_hdl, B_FALSE);
4192         if (!zfsvfs->z_use_sa && acl_ids.z_aclp->z_acl_bytes > ZFS_ACE_SPACE) {
4193                 dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
4194                     acl_ids.z_aclp->z_acl_bytes);
4195         }
4196         if (fuid_dirtied)
4197                 zfs_fuid_txhold(zfsvfs, tx);
4198         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
4199         if (error) {
4200                 zfs_dirent_unlock(dl);
4201                 if (error == ERESTART) {
4202                         waited = B_TRUE;
4203                         dmu_tx_wait(tx);
4204                         dmu_tx_abort(tx);
4205                         goto top;
4206                 }
4207                 zfs_acl_ids_free(&acl_ids);
4208                 dmu_tx_abort(tx);
4209                 ZFS_EXIT(zfsvfs);
4210                 return (error);
4211         }
4212 
4213         /*
4214          * Create a new object for the symlink.
4215          * for version 4 ZPL datasets the symlink will be an SA attribute
4216          */
4217         zfs_mknode(dzp, vap, tx, cr, 0, &zp, &acl_ids);
4218

4399         if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
4400                 ZFS_EXIT(zfsvfs);
4401                 return (error);
4402         }
4403 
4404 top:
4405         /*
4406          * Attempt to lock directory; fail if entry already exists.
4407          */
4408         error = zfs_dirent_lock(&dl, dzp, name, &tzp, zf, NULL, NULL);
4409         if (error) {
4410                 ZFS_EXIT(zfsvfs);
4411                 return (error);
4412         }
4413 
4414         tx = dmu_tx_create(zfsvfs->z_os);
4415         dmu_tx_hold_sa(tx, szp->z_sa_hdl, B_FALSE);
4416         dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
4417         zfs_sa_upgrade_txholds(tx, szp);
4418         zfs_sa_upgrade_txholds(tx, dzp);
4419         error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
4420         if (error) {
4421                 zfs_dirent_unlock(dl);
4422                 if (error == ERESTART) {
4423                         waited = B_TRUE;
4424                         dmu_tx_wait(tx);
4425                         dmu_tx_abort(tx);
4426                         goto top;
4427                 }
4428                 dmu_tx_abort(tx);
4429                 ZFS_EXIT(zfsvfs);
4430                 return (error);
4431         }
4432 
4433         error = zfs_link_create(dl, szp, tx, 0);
4434 
4435         if (error == 0) {
4436                 uint64_t txtype = TX_LINK;
4437                 if (flags & FIGNORECASE)
4438                         txtype |= TX_CI;
4439                 zfs_log_link(zilog, tx, txtype, dzp, szp, name);

4675                         int err;
4676 
4677                         /*
4678                          * Found a dirty page to push
4679                          */
4680                         err = zfs_putapage(vp, pp, &io_off, &io_len, flags, cr);
4681                         if (err)
4682                                 error = err;
4683                 } else {
4684                         io_len = PAGESIZE;
4685                 }
4686         }
4687 out:
4688         zfs_range_unlock(rl);
4689         if ((flags & B_ASYNC) == 0 || zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
4690                 zil_commit(zfsvfs->z_log, zp->z_id);
4691         ZFS_EXIT(zfsvfs);
4692         return (error);
4693 }
4694 
4695 /*
4696  * Returns B_TRUE and exits the z_teardown_inactive_lock
4697  * if the znode we are looking at is no longer valid
4698  */
4699 static boolean_t
4700 zfs_znode_free_invalid(znode_t *zp)
4701 {

4702         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4703         vnode_t *vp = ZTOV(zp);
4704 
4705         ASSERT(rw_read_held(&zfsvfs->z_teardown_inactive_lock));
4706 
4707         if (zp->z_sa_hdl == NULL) {
4708                 /*
4709                  * The fs has been unmounted, or we did a
4710                  * suspend/resume and this file no longer exists.
4711                  */
4712                 if (vn_has_cached_data(vp)) {
4713                         (void) pvn_vplist_dirty(vp, 0, zfs_null_putapage,
4714                             B_INVAL, CRED());
4715                 }
4716 
4717                 mutex_enter(&zp->z_lock);
4718                 mutex_enter(&vp->v_lock);
4719                 ASSERT(vp->v_count == 1);
4720                 VN_RELE_LOCKED(vp);
4721                 mutex_exit(&vp->v_lock);
4722                 mutex_exit(&zp->z_lock);
4723                 VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) !=
4724                     UINT32_MAX);
4725                 rw_exit(&zfsvfs->z_teardown_inactive_lock);
4726                 zfs_znode_free(zp);
4727                 return (B_TRUE);
4728         }
4729 
4730         return (B_FALSE);
4731 }
4732 
4733 /*
4734  * Does the prep work for freeing the znode, then calls zfs_zinactive to do the
4735  * actual freeing.
4736  * This code used to be in zfs_inactive() before the async delete patch came in
4737  */
4738 static void
4739 zfs_inactive_impl(znode_t *zp)
4740 {
4741         vnode_t *vp = ZTOV(zp);
4742         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4743         int error;
4744 
4745         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
4746         if (zfs_znode_free_invalid(zp))
4747                 return; /* z_teardown_inactive_lock already dropped */
4748 
4749         /*
4750          * Attempt to push any data in the page cache.  If this fails
4751          * we will get kicked out later in zfs_zinactive().
4752          */
4753         if (vn_has_cached_data(vp)) {
4754                 (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL|B_ASYNC,
4755                     CRED());
4756         }
4757 
4758         if (zp->z_atime_dirty && zp->z_unlinked == 0) {
4759                 dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
4760 
4761                 dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
4762                 zfs_sa_upgrade_txholds(tx, zp);
4763                 error = dmu_tx_assign(tx, TXG_WAIT);
4764                 if (error) {
4765                         dmu_tx_abort(tx);
4766                 } else {
4767                         mutex_enter(&zp->z_lock);
4768                         (void) sa_update(zp->z_sa_hdl, SA_ZPL_ATIME(zfsvfs),
4769                             (void *)&zp->z_atime, sizeof (zp->z_atime), tx);
4770                         zp->z_atime_dirty = 0;
4771                         mutex_exit(&zp->z_lock);
4772                         dmu_tx_commit(tx);
4773                 }
4774         }
4775 
4776         zfs_zinactive(zp);
4777 
4778         VERIFY(atomic_dec_32_nv(&zfsvfs->z_znodes_freeing_cnt) != UINT32_MAX);
4779 
4780         rw_exit(&zfsvfs->z_teardown_inactive_lock);
4781 }
4782 
4783 /*
4784  * taskq task calls zfs_inactive_impl() so that we can free the znode
4785  */
4786 static void
4787 zfs_inactive_task(void *task_arg)
4788 {
4789         znode_t *zp = task_arg;
4790         ASSERT(zp != NULL);
4791         zfs_inactive_impl(zp);
4792 }
4793 
4794 /*ARGSUSED*/
4795 void
4796 zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
4797 {
4798         znode_t *zp = VTOZ(vp);
4799         zfsvfs_t *zfsvfs = zp->z_zfsvfs;
4800 
4801         rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER_STARVEWRITER);
4802 
4803         VERIFY(atomic_inc_32_nv(&zfsvfs->z_znodes_freeing_cnt) != 0);
4804 
4805         if (zfs_znode_free_invalid(zp))
4806                 return; /* z_teardown_inactive_lock already dropped */
4807 
4808         if (zfs_do_async_free &&
4809             zp->z_size > zfs_inactive_async_multiplier * zfs_dirty_data_max &&
4810             taskq_dispatch(dsl_pool_vnrele_taskq(
4811             dmu_objset_pool(zp->z_zfsvfs->z_os)), zfs_inactive_task,
4812             zp, TQ_NOSLEEP) != NULL) {
4813                 rw_exit(&zfsvfs->z_teardown_inactive_lock);
4814                 return; /* task dispatched, we're done */
4815         }
4816         rw_exit(&zfsvfs->z_teardown_inactive_lock);
4817 
4818         /* if the taskq dispatch failed - do a sync zfs_inactive_impl() call */
4819         zfs_inactive_impl(zp);
4820 }
4821 
4822 /*
4823  * Bounds-check the seek operation.
4824  *
4825  *      IN:     vp      - vnode seeking within
4826  *              ooff    - old file offset
4827  *              noffp   - pointer to new file offset
4828  *              ct      - caller context
4829  *
4830  *      RETURN: 0 on success, EINVAL if new offset invalid.
4831  */
4832 /* ARGSUSED */
4833 static int
4834 zfs_seek(vnode_t *vp, offset_t ooff, offset_t *noffp,
4835     caller_context_t *ct)
4836 {
4837         if (vp->v_type == VDIR)
4838                 return (0);
4839         return ((*noffp < 0 || *noffp > MAXOFFSET_T) ? EINVAL : 0);
4840 }
4841