NEX-15740 NFS deadlock in rfs4_compound with hundreds of threads waiting for lock owned by rfs4_op_rename (lint fix)
NEX-15740 NFS deadlock in rfs4_compound with hundreds of threads waiting for lock owned by rfs4_op_rename
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-16917 Need to reduce the impact of NFS per-share kstats on failover
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
NEX-16835 Kernel panic during BDD tests at rfs4_compound func
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-15924 Getting panic: BAD TRAP: type=d (#gp General protection) rp=ffffff0021464690 addr=12
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
NEX-16812 Timing window where dtrace probe could try to access share info after unshared
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
NEX-16452 NFS server in a zone state database needs to be per zone
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-15279 support NFS server in zone
NEX-15520 online NFS shares cause zoneadm halt to hang in nfs_export_zone_fini
Portions contributed by: Dan Kruchinin dan.kruchinin@nexenta.com
Portions contributed by: Stepan Zastupov stepan.zastupov@gmail.com
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
NEX-9275 Got "bad mutex" panic when run IO to nfs share from clients
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-7366 Getting panic in "module "nfssrv" due to a NULL pointer dereference" when updating NFS shares on a pool
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-6778 NFS kstats leak and cause system to hang
Revert "NEX-4261 Per-client NFS server IOPS, bandwidth, and latency kstats"
This reverts commit 586c3ab1927647487f01c337ddc011c642575a52.
Revert "NEX-5354 Aggregated IOPS, bandwidth, and latency kstats for NFS server"
This reverts commit c91d7614da8618ef48018102b077f60ecbbac8c2.
Revert "NEX-5667 nfssrv_stats_flags does not work for aggregated kstats"
This reverts commit 3dcf42618be7dd5f408c327f429c81e07ca08e74.
Revert "NEX-5750 Time values for aggregated NFS server kstats should be normalized"
This reverts commit 1f4d4f901153b0191027969fa4a8064f9d3b9ee1.
Revert "NEX-5942 Panic in rfs4_minorvers_mismatch() with NFSv4.1 client"
This reverts commit 40766417094a162f5e4cc8786c0fa0a7e5871cd9.
Revert "NEX-5752 NFS server: namespace collision in kstats"
This reverts commit ae81e668db86050da8e483264acb0cce0444a132.
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-6109 NFS client panics in nfssrv when running nfsv4-test basic_ops STC tests
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Steve Peng <steve.peng@nexenta.com>
NEX-4261 Per-client NFS server IOPS, bandwidth, and latency kstats
Reviewed by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-5134 Deadlock between rfs4_do_lock() and rfs4_op_read()
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
NEX-3311 NFSv4: setlock() can spin forever
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
NEX-3097 IOPS, bandwidth, and latency kstats for NFS server
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-1128 NFS server: Generic uid and gid remapping for AUTH_SYS
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
OS-72 NULL pointer dereference in rfs4_op_setclientid()
Reviewed by: Dan McDonald <danmcd@nexenta.com>

@@ -18,20 +18,23 @@
  *
  * CDDL HEADER END
  */
 
 /*
- * Copyright 2016 Nexenta Systems, Inc.  All rights reserved.
  * Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2012, 2016 by Delphix. All rights reserved.
  */
 
 /*
  *      Copyright (c) 1983,1984,1985,1986,1987,1988,1989  AT&T.
  *      All Rights Reserved
  */
 
+/*
+ * Copyright 2019 Nexenta Systems, Inc.
+ * Copyright (c) 2012, 2016 by Delphix. All rights reserved.
+ */
+
 #include <sys/param.h>
 #include <sys/types.h>
 #include <sys/systm.h>
 #include <sys/cred.h>
 #include <sys/buf.h>

@@ -55,23 +58,26 @@
 #include <sys/policy.h>
 #include <sys/fem.h>
 #include <sys/sdt.h>
 #include <sys/ddi.h>
 #include <sys/zone.h>
+#include <sys/kstat.h>
 
 #include <fs/fs_reparse.h>
 
 #include <rpc/types.h>
 #include <rpc/auth.h>
 #include <rpc/rpcsec_gss.h>
 #include <rpc/svc.h>
 
 #include <nfs/nfs.h>
+#include <nfs/nfssys.h>
 #include <nfs/export.h>
 #include <nfs/nfs_cmd.h>
 #include <nfs/lm.h>
 #include <nfs/nfs4.h>
+#include <nfs/nfs4_drc.h>
 
 #include <sys/strsubr.h>
 #include <sys/strsun.h>
 
 #include <inet/common.h>

@@ -145,20 +151,17 @@
  *
  */
 #define DIRENT64_TO_DIRCOUNT(dp) \
         (3 * BYTES_PER_XDR_UNIT + DIRENT64_NAMELEN((dp)->d_reclen))
 
-time_t rfs4_start_time;                 /* Initialized in rfs4_srvrinit */
+zone_key_t      rfs4_zone_key;
 
 static sysid_t lockt_sysid;             /* dummy sysid for all LOCKT calls */
 
 u_longlong_t    nfs4_srv_caller_id;
 uint_t          nfs4_srv_vkey = 0;
 
-verifier4       Write4verf;
-verifier4       Readdir4verf;
-
 void    rfs4_init_compound_state(struct compound_state *);
 
 static void     nullfree(caddr_t);
 static void     rfs4_op_inval(nfs_argop4 *, nfs_resop4 *, struct svc_req *,
                         struct compound_state *);

@@ -243,15 +246,16 @@
                         struct svc_req *req, struct compound_state *);
 static void     rfs4_op_secinfo(nfs_argop4 *, nfs_resop4 *, struct svc_req *,
                         struct compound_state *);
 static void     rfs4_op_secinfo_free(nfs_resop4 *);
 
-static nfsstat4 check_open_access(uint32_t,
-                                struct compound_state *, struct svc_req *);
+static nfsstat4 check_open_access(uint32_t, struct compound_state *,
+                    struct svc_req *);
 nfsstat4 rfs4_client_sysid(rfs4_client_t *, sysid_t *);
-void rfs4_ss_clid(rfs4_client_t *);
+void            rfs4_ss_clid(nfs4_srv_t *, rfs4_client_t *);
 
+
 /*
  * translation table for attrs
  */
 struct nfs4_ntov_table {
         union nfs4_attr_u *na;

@@ -266,151 +270,183 @@
 
 static nfsstat4 do_rfs4_set_attrs(bitmap4 *resp, fattr4 *fattrp,
                     struct compound_state *cs, struct nfs4_svgetit_arg *sargp,
                     struct nfs4_ntov_table *ntovp, nfs4_attr_cmd_t cmd);
 
+static void     hanfsv4_failover(nfs4_srv_t *);
+
 fem_t           *deleg_rdops;
 fem_t           *deleg_wrops;
 
-rfs4_servinst_t *rfs4_cur_servinst = NULL;      /* current server instance */
-kmutex_t        rfs4_servinst_lock;     /* protects linked list */
-int             rfs4_seen_first_compound;       /* set first time we see one */
-
 /*
  * NFS4 op dispatch table
  */
 
 struct rfsv4disp {
         void    (*dis_proc)();          /* proc to call */
         void    (*dis_resfree)();       /* frees space allocated by proc */
         int     dis_flags;              /* RPC_IDEMPOTENT, etc... */
+        int     op_type;                /* operation type, see below */
 };
 
+/*
+ * operation types; used primarily for the per-exportinfo kstat implementation
+ */
+#define NFS4_OP_NOFH    0       /* The operation does not operate with any */
+                                /* particular filehandle; we cannot associate */
+                                /* it with any exportinfo. */
+
+#define NFS4_OP_CFH     1       /* The operation works with the current */
+                                /* filehandle; we associate the operation */
+                                /* with the exportinfo related to the current */
+                                /* filehandle (as set before the operation is */
+                                /* executed). */
+
+#define NFS4_OP_SFH     2       /* The operation works with the saved */
+                                /* filehandle; we associate the operation */
+                                /* with the exportinfo related to the saved */
+                                /* filehandle (as set before the operation is */
+                                /* executed). */
+
+#define NFS4_OP_POSTCFH 3       /* The operation ignores the current */
+                                /* filehandle, but sets the new current */
+                                /* filehandle instead; we associate the */
+                                /* operation with the exportinfo related to */
+                                /* the current filehandle as set after the */
+                                /* operation is successfully executed.  Since */
+                                /* we do not know the particular exportinfo */
+                                /* (and thus the kstat) before the operation */
+                                /* is done, there is no simple way to */
+                                /* update some I/O kstat statistics related */
+                                /* to kstat_queue(9F). */
+
 static struct rfsv4disp rfsv4disptab[] = {
         /*
          * NFS VERSION 4
          */
 
         /* RFS_NULL = 0 */
-        {rfs4_op_illegal, nullfree, 0},
+        {rfs4_op_illegal, nullfree, 0, NFS4_OP_NOFH},
 
         /* UNUSED = 1 */
-        {rfs4_op_illegal, nullfree, 0},
+        {rfs4_op_illegal, nullfree, 0, NFS4_OP_NOFH},
 
         /* UNUSED = 2 */
-        {rfs4_op_illegal, nullfree, 0},
+        {rfs4_op_illegal, nullfree, 0, NFS4_OP_NOFH},
 
         /* OP_ACCESS = 3 */
-        {rfs4_op_access, nullfree, RPC_IDEMPOTENT},
+        {rfs4_op_access, nullfree, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_CLOSE = 4 */
-        {rfs4_op_close, nullfree, 0},
+        {rfs4_op_close, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_COMMIT = 5 */
-        {rfs4_op_commit, nullfree, RPC_IDEMPOTENT},
+        {rfs4_op_commit, nullfree, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_CREATE = 6 */
-        {rfs4_op_create, nullfree, 0},
+        {rfs4_op_create, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_DELEGPURGE = 7 */
-        {rfs4_op_delegpurge, nullfree, 0},
+        {rfs4_op_delegpurge, nullfree, 0, NFS4_OP_NOFH},
 
         /* OP_DELEGRETURN = 8 */
-        {rfs4_op_delegreturn, nullfree, 0},
+        {rfs4_op_delegreturn, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_GETATTR = 9 */
-        {rfs4_op_getattr, rfs4_op_getattr_free, RPC_IDEMPOTENT},
+        {rfs4_op_getattr, rfs4_op_getattr_free, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_GETFH = 10 */
-        {rfs4_op_getfh, rfs4_op_getfh_free, RPC_ALL},
+        {rfs4_op_getfh, rfs4_op_getfh_free, RPC_ALL, NFS4_OP_CFH},
 
         /* OP_LINK = 11 */
-        {rfs4_op_link, nullfree, 0},
+        {rfs4_op_link, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_LOCK = 12 */
-        {rfs4_op_lock, lock_denied_free, 0},
+        {rfs4_op_lock, lock_denied_free, 0, NFS4_OP_CFH},
 
         /* OP_LOCKT = 13 */
-        {rfs4_op_lockt, lock_denied_free, 0},
+        {rfs4_op_lockt, lock_denied_free, 0, NFS4_OP_CFH},
 
         /* OP_LOCKU = 14 */
-        {rfs4_op_locku, nullfree, 0},
+        {rfs4_op_locku, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_LOOKUP = 15 */
-        {rfs4_op_lookup, nullfree, (RPC_IDEMPOTENT | RPC_PUBLICFH_OK)},
+        {rfs4_op_lookup, nullfree, (RPC_IDEMPOTENT | RPC_PUBLICFH_OK),
+            NFS4_OP_CFH},
 
         /* OP_LOOKUPP = 16 */
-        {rfs4_op_lookupp, nullfree, (RPC_IDEMPOTENT | RPC_PUBLICFH_OK)},
+        {rfs4_op_lookupp, nullfree, (RPC_IDEMPOTENT | RPC_PUBLICFH_OK),
+            NFS4_OP_CFH},
 
         /* OP_NVERIFY = 17 */
-        {rfs4_op_nverify, nullfree, RPC_IDEMPOTENT},
+        {rfs4_op_nverify, nullfree, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_OPEN = 18 */
-        {rfs4_op_open, rfs4_free_reply, 0},
+        {rfs4_op_open, rfs4_free_reply, 0, NFS4_OP_CFH},
 
         /* OP_OPENATTR = 19 */
-        {rfs4_op_openattr, nullfree, 0},
+        {rfs4_op_openattr, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_OPEN_CONFIRM = 20 */
-        {rfs4_op_open_confirm, nullfree, 0},
+        {rfs4_op_open_confirm, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_OPEN_DOWNGRADE = 21 */
-        {rfs4_op_open_downgrade, nullfree, 0},
+        {rfs4_op_open_downgrade, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_PUTFH = 22 */
-        {rfs4_op_putfh, nullfree, RPC_ALL},
+        {rfs4_op_putfh, nullfree, RPC_ALL, NFS4_OP_POSTCFH},
 
         /* OP_PUTPUBFH = 23 */
-        {rfs4_op_putpubfh, nullfree, RPC_ALL},
+        {rfs4_op_putpubfh, nullfree, RPC_ALL, NFS4_OP_POSTCFH},
 
         /* OP_PUTROOTFH = 24 */
-        {rfs4_op_putrootfh, nullfree, RPC_ALL},
+        {rfs4_op_putrootfh, nullfree, RPC_ALL, NFS4_OP_POSTCFH},
 
         /* OP_READ = 25 */
-        {rfs4_op_read, rfs4_op_read_free, RPC_IDEMPOTENT},
+        {rfs4_op_read, rfs4_op_read_free, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_READDIR = 26 */
-        {rfs4_op_readdir, rfs4_op_readdir_free, RPC_IDEMPOTENT},
+        {rfs4_op_readdir, rfs4_op_readdir_free, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_READLINK = 27 */
-        {rfs4_op_readlink, rfs4_op_readlink_free, RPC_IDEMPOTENT},
+        {rfs4_op_readlink, rfs4_op_readlink_free, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_REMOVE = 28 */
-        {rfs4_op_remove, nullfree, 0},
+        {rfs4_op_remove, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_RENAME = 29 */
-        {rfs4_op_rename, nullfree, 0},
+        {rfs4_op_rename, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_RENEW = 30 */
-        {rfs4_op_renew, nullfree, 0},
+        {rfs4_op_renew, nullfree, 0, NFS4_OP_NOFH},
 
         /* OP_RESTOREFH = 31 */
-        {rfs4_op_restorefh, nullfree, RPC_ALL},
+        {rfs4_op_restorefh, nullfree, RPC_ALL, NFS4_OP_SFH},
 
         /* OP_SAVEFH = 32 */
-        {rfs4_op_savefh, nullfree, RPC_ALL},
+        {rfs4_op_savefh, nullfree, RPC_ALL, NFS4_OP_CFH},
 
         /* OP_SECINFO = 33 */
-        {rfs4_op_secinfo, rfs4_op_secinfo_free, 0},
+        {rfs4_op_secinfo, rfs4_op_secinfo_free, 0, NFS4_OP_CFH},
 
         /* OP_SETATTR = 34 */
-        {rfs4_op_setattr, nullfree, 0},
+        {rfs4_op_setattr, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_SETCLIENTID = 35 */
-        {rfs4_op_setclientid, nullfree, 0},
+        {rfs4_op_setclientid, nullfree, 0, NFS4_OP_NOFH},
 
         /* OP_SETCLIENTID_CONFIRM = 36 */
-        {rfs4_op_setclientid_confirm, nullfree, 0},
+        {rfs4_op_setclientid_confirm, nullfree, 0, NFS4_OP_NOFH},
 
         /* OP_VERIFY = 37 */
-        {rfs4_op_verify, nullfree, RPC_IDEMPOTENT},
+        {rfs4_op_verify, nullfree, RPC_IDEMPOTENT, NFS4_OP_CFH},
 
         /* OP_WRITE = 38 */
-        {rfs4_op_write, nullfree, 0},
+        {rfs4_op_write, nullfree, 0, NFS4_OP_CFH},
 
         /* OP_RELEASE_LOCKOWNER = 39 */
-        {rfs4_op_release_lockowner, nullfree, 0},
+        {rfs4_op_release_lockowner, nullfree, 0, NFS4_OP_NOFH},
 };
 
 static uint_t rfsv4disp_cnt = sizeof (rfsv4disptab) / sizeof (rfsv4disptab[0]);
 
 #define OP_ILLEGAL_IDX (rfsv4disp_cnt)
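
Reviewer note: the new op_type column only classifies each operation. The code that consumes it is not part of this hunk. The following is a minimal user-space sketch of how the four types could map to an export for accounting, based only on the comment block above. The names enum op_type, struct export, struct fh_state, export_before_op() and export_after_op() are hypothetical and do not appear in the change.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the NFS4_OP_* values added in the diff (names are illustrative). */
enum op_type {
	OP_NOFH = 0,	/* no filehandle involved */
	OP_CFH = 1,	/* uses the current filehandle */
	OP_SFH = 2,	/* uses the saved filehandle */
	OP_POSTCFH = 3	/* sets a new current filehandle */
};

/* Hypothetical stand-in for struct exportinfo. */
struct export {
	const char *name;
};

/* Hypothetical subset of the compound state. */
struct fh_state {
	struct export *cur_exi;		/* export of the current filehandle */
	struct export *saved_exi;	/* export of the saved filehandle */
};

/*
 * Export that can be charged before the operation runs.
 * Returns NULL when no export is known yet (NOFH and POSTCFH).
 */
struct export *
export_before_op(enum op_type type, const struct fh_state *st)
{
	assert(st != NULL);

	switch (type) {
	case OP_CFH:
		return (st->cur_exi);
	case OP_SFH:
		return (st->saved_exi);
	case OP_NOFH:
	case OP_POSTCFH:
	default:
		return (NULL);
	}
}

/*
 * Export that becomes known only after the operation has run.
 * Only a successful POSTCFH operation yields one: the export of the
 * filehandle it has just installed as current.
 */
struct export *
export_after_op(enum op_type type, const struct fh_state *st, int op_ok)
{
	assert(st != NULL);

	if (type == OP_POSTCFH && op_ok)
		return (st->cur_exi);
	return (NULL);
}
```

The split into a "before" and an "after" lookup reflects the limitation stated in the comment: for NFS4_OP_POSTCFH the export is unknown when the operation starts, so any statistic that must be opened before the operation cannot be attributed to it.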

@@ -464,11 +500,11 @@
         "rfs4_op_release_lockowner",
         "rfs4_op_illegal"
 };
 #endif
 
-void    rfs4_ss_chkclid(rfs4_client_t *);
+void    rfs4_ss_chkclid(nfs4_srv_t *, rfs4_client_t *);
 
 extern size_t   strlcpy(char *dst, const char *src, size_t dstsize);
 
 extern void     rfs4_free_fs_locations4(fs_locations4 *);
 

@@ -497,18 +533,19 @@
         VOPNAME_SETSECATTR,     { .femop_setsecattr = deleg_wr_setsecattr },
         VOPNAME_VNEVENT,        { .femop_vnevent = deleg_wr_vnevent },
         NULL,                   NULL
 };
 
-int
-rfs4_srvrinit(void)
+/* ARGSUSED */
+static void *
+rfs4_zone_init(zoneid_t zoneid)
 {
+        nfs4_srv_t *nsrv4;
         timespec32_t verf;
-        int error;
-        extern void rfs4_attr_init();
-        extern krwlock_t rfs4_deleg_policy_lock;
 
+        nsrv4 = kmem_zalloc(sizeof (*nsrv4), KM_SLEEP);
+
         /*
          * The following algorithm attempts to find a unique verifier
          * to be used as the write verifier returned from the server
          * to the client.  It is important that this verifier change
          * whenever the server reboots.  Of secondary importance, it

@@ -533,73 +570,120 @@
 
                 gethrestime(&tverf);
                 verf.tv_sec = (time_t)tverf.tv_sec;
                 verf.tv_nsec = tverf.tv_nsec;
         }
+        nsrv4->write4verf = *(uint64_t *)&verf;
 
-        Write4verf = *(uint64_t *)&verf;
+        /* Used to manage create/destroy of server state */
+        nsrv4->nfs4_server_state = NULL;
+        nsrv4->nfs4_cur_servinst = NULL;
+        nsrv4->nfs4_deleg_policy = SRV_NEVER_DELEGATE;
+        mutex_init(&nsrv4->deleg_lock, NULL, MUTEX_DEFAULT, NULL);
+        mutex_init(&nsrv4->state_lock, NULL, MUTEX_DEFAULT, NULL);
+        mutex_init(&nsrv4->servinst_lock, NULL, MUTEX_DEFAULT, NULL);
+        rw_init(&nsrv4->deleg_policy_lock, NULL, RW_DEFAULT, NULL);
 
-        rfs4_attr_init();
-        mutex_init(&rfs4_deleg_lock, NULL, MUTEX_DEFAULT, NULL);
+        return (nsrv4);
+}
 
-        /* Used to manage create/destroy of server state */
-        mutex_init(&rfs4_state_lock, NULL, MUTEX_DEFAULT, NULL);
+/* ARGSUSED */
+static void
+rfs4_zone_fini(zoneid_t zoneid, void *data)
+{
+        nfs4_srv_t *nsrv4 = data;
 
-        /* Used to manage access to server instance linked list */
-        mutex_init(&rfs4_servinst_lock, NULL, MUTEX_DEFAULT, NULL);
+        mutex_destroy(&nsrv4->deleg_lock);
+        mutex_destroy(&nsrv4->state_lock);
+        mutex_destroy(&nsrv4->servinst_lock);
+        rw_destroy(&nsrv4->deleg_policy_lock);
 
-        /* Used to manage access to rfs4_deleg_policy */
-        rw_init(&rfs4_deleg_policy_lock, NULL, RW_DEFAULT, NULL);
+        kmem_free(nsrv4, sizeof (*nsrv4));
+}
 
-        error = fem_create("deleg_rdops", nfs4_rd_deleg_tmpl, &deleg_rdops);
-        if (error != 0) {
+void
+rfs4_srvrinit(void)
+{
+        extern void rfs4_attr_init();
+
+        zone_key_create(&rfs4_zone_key, rfs4_zone_init, NULL, rfs4_zone_fini);
+
+        rfs4_attr_init();
+
+
+        if (fem_create("deleg_rdops", nfs4_rd_deleg_tmpl, &deleg_rdops) != 0) {
                 rfs4_disable_delegation();
-        } else {
-                error = fem_create("deleg_wrops", nfs4_wr_deleg_tmpl,
-                    &deleg_wrops);
-                if (error != 0) {
+        } else if (fem_create("deleg_wrops", nfs4_wr_deleg_tmpl,
+            &deleg_wrops) != 0) {
                         rfs4_disable_delegation();
                         fem_free(deleg_rdops);
                 }
-        }
 
         nfs4_srv_caller_id = fs_new_caller_id();
-
         lockt_sysid = lm_alloc_sysidt();
-
         vsd_create(&nfs4_srv_vkey, NULL);
-
-        return (0);
+        rfs4_state_g_init();
 }
 
 void
 rfs4_srvrfini(void)
 {
-        extern krwlock_t rfs4_deleg_policy_lock;
-
         if (lockt_sysid != LM_NOSYSID) {
                 lm_free_sysidt(lockt_sysid);
                 lockt_sysid = LM_NOSYSID;
         }
 
-        mutex_destroy(&rfs4_deleg_lock);
-        mutex_destroy(&rfs4_state_lock);
-        rw_destroy(&rfs4_deleg_policy_lock);
+        rfs4_state_g_fini();
 
         fem_free(deleg_rdops);
         fem_free(deleg_wrops);
+
+        (void) zone_key_delete(rfs4_zone_key);
 }
 
 void
+rfs4_do_server_start(int server_upordown,
+    int srv_delegation, int cluster_booted)
+{
+        nfs4_srv_t *nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
+
+        /* Is this a warm start? */
+        if (server_upordown == NFS_SERVER_QUIESCED) {
+                cmn_err(CE_NOTE, "nfs4_srv: "
+                    "server was previously quiesced; "
+                    "existing NFSv4 state will be re-used");
+
+                /*
+                 * HA-NFSv4: this is also the signal
+                 * that a Resource Group failover has
+                 * occurred.
+                 */
+                if (cluster_booted)
+                        hanfsv4_failover(nsrv4);
+        } else {
+                /* Cold start */
+                nsrv4->rfs4_start_time = 0;
+                rfs4_state_zone_init(nsrv4);
+                nsrv4->nfs4_drc = rfs4_init_drc(nfs4_drc_max,
+                    nfs4_drc_hash);
+        }
+
+        /* Check if delegation is to be enabled */
+        if (srv_delegation != FALSE)
+                rfs4_set_deleg_policy(nsrv4, SRV_NORMAL_DELEGATE);
+}
+
+void
 rfs4_init_compound_state(struct compound_state *cs)
 {
         bzero(cs, sizeof (*cs));
         cs->cont = TRUE;
         cs->access = CS_ACCESS_DENIED;
         cs->deleg = FALSE;
         cs->mandlock = FALSE;
         cs->fh.nfs_fh4_val = cs->fhbuf;
+        cs->statusp = NULL;
 }
 
 void
 rfs4_grace_start(rfs4_servinst_t *sip)
 {
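
Reviewer note: this hunk moves the former globals (Write4verf, the state and delegation locks, the current server instance) into a per-zone nfs4_srv_t that is created and destroyed by callbacks registered with zone_key_create() and looked up with zone_getspecific(). The sketch below is a simplified user-space model of that pattern, not the kernel API. It assumes a fixed number of zones created eagerly, and struct zkey, struct srv_state, zkey_create(), zkey_getspecific() and zkey_delete() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define	MAX_ZONES	4	/* assumption: small fixed zone table */

typedef void *(*zinit_fn)(int zoneid);
typedef void (*zfini_fn)(int zoneid, void *data);

/* Hypothetical model of a zone key: callbacks plus one slot per zone. */
struct zkey {
	zinit_fn init;
	zfini_fn fini;
	void *data[MAX_ZONES];
};

/* Hypothetical stand-in for the per-zone server state. */
struct srv_state {
	unsigned long long write_verf;	/* per-zone write verifier */
	int deleg_policy;		/* per-zone delegation policy */
};

/* Per-zone constructor: allocate zeroed state, give it its own verifier. */
void *
srv_zone_init(int zoneid)
{
	struct srv_state *s = calloc(1, sizeof (*s));

	if (s != NULL)
		s->write_verf = 0x1000ULL + (unsigned int)zoneid;
	return (s);
}

/* Per-zone destructor: release whatever the constructor allocated. */
void
srv_zone_fini(int zoneid, void *data)
{
	(void) zoneid;
	free(data);
}

/* Register the callbacks and run the constructor for every zone. */
void
zkey_create(struct zkey *k, zinit_fn init, zfini_fn fini)
{
	int i;

	k->init = init;
	k->fini = fini;
	for (i = 0; i < MAX_ZONES; i++)
		k->data[i] = init(i);
}

/* Look up the state belonging to one zone. */
void *
zkey_getspecific(const struct zkey *k, int zoneid)
{
	assert(zoneid >= 0 && zoneid < MAX_ZONES);
	return (k->data[zoneid]);
}

/* Run the destructor for every zone and clear the slots. */
void
zkey_delete(struct zkey *k)
{
	int i;

	for (i = 0; i < MAX_ZONES; i++) {
		if (k->data[i] != NULL) {
			k->fini(i, k->data[i]);
			k->data[i] = NULL;
		}
	}
}
```

In this model, a caller fetches its zone's state once per request and works on that copy, which is the shape of the `nsrv4 = zone_getspecific(rfs4_zone_key, curzone)` calls added elsewhere in the change.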

@@ -650,38 +734,39 @@
 
 /*
  * reset all currently active grace periods
  */
 void
-rfs4_grace_reset_all(void)
+rfs4_grace_reset_all(nfs4_srv_t *nsrv4)
 {
         rfs4_servinst_t *sip;
 
-        mutex_enter(&rfs4_servinst_lock);
-        for (sip = rfs4_cur_servinst; sip != NULL; sip = sip->prev)
+        mutex_enter(&nsrv4->servinst_lock);
+        for (sip = nsrv4->nfs4_cur_servinst; sip != NULL; sip = sip->prev)
                 if (rfs4_servinst_in_grace(sip))
                         rfs4_grace_start(sip);
-        mutex_exit(&rfs4_servinst_lock);
+        mutex_exit(&nsrv4->servinst_lock);
 }
 
 /*
  * start any new instances' grace periods
  */
 void
-rfs4_grace_start_new(void)
+rfs4_grace_start_new(nfs4_srv_t *nsrv4)
 {
         rfs4_servinst_t *sip;
 
-        mutex_enter(&rfs4_servinst_lock);
-        for (sip = rfs4_cur_servinst; sip != NULL; sip = sip->prev)
+        mutex_enter(&nsrv4->servinst_lock);
+        for (sip = nsrv4->nfs4_cur_servinst; sip != NULL; sip = sip->prev)
                 if (rfs4_servinst_grace_new(sip))
                         rfs4_grace_start(sip);
-        mutex_exit(&rfs4_servinst_lock);
+        mutex_exit(&nsrv4->servinst_lock);
 }
 
 static rfs4_dss_path_t *
-rfs4_dss_newpath(rfs4_servinst_t *sip, char *path, unsigned index)
+rfs4_dss_newpath(nfs4_srv_t *nsrv4, rfs4_servinst_t *sip,
+    char *path, unsigned index)
 {
         size_t len;
         rfs4_dss_path_t *dss_path;
 
         dss_path = kmem_alloc(sizeof (rfs4_dss_path_t), KM_SLEEP);

@@ -701,19 +786,19 @@
 
         /*
          * Add to list of served paths.
          * No locking required, as we're only ever called at startup.
          */
-        if (rfs4_dss_pathlist == NULL) {
+        if (nsrv4->dss_pathlist == NULL) {
                 /* this is the first dss_path_t */
 
                 /* needed for insque/remque */
                 dss_path->next = dss_path->prev = dss_path;
 
-                rfs4_dss_pathlist = dss_path;
+                nsrv4->dss_pathlist = dss_path;
         } else {
-                insque(dss_path, rfs4_dss_pathlist);
+                insque(dss_path, nsrv4->dss_pathlist);
         }
 
         return (dss_path);
 }
 

@@ -721,11 +806,12 @@
  * Create a new server instance, and make it the currently active instance.
  * Note that starting the grace period too early will reduce the clients'
  * recovery window.
  */
 void
-rfs4_servinst_create(int start_grace, int dss_npaths, char **dss_paths)
+rfs4_servinst_create(nfs4_srv_t *nsrv4, int start_grace,
+    int dss_npaths, char **dss_paths)
 {
         unsigned i;
         rfs4_servinst_t *sip;
         rfs4_oldstate_t *oldstate;
 

@@ -752,43 +838,44 @@
         sip->dss_npaths = dss_npaths;
         sip->dss_paths = kmem_alloc(dss_npaths *
             sizeof (rfs4_dss_path_t *), KM_SLEEP);
 
         for (i = 0; i < dss_npaths; i++) {
-                sip->dss_paths[i] = rfs4_dss_newpath(sip, dss_paths[i], i);
+                /* CSTYLED */
+                sip->dss_paths[i] = rfs4_dss_newpath(nsrv4, sip, dss_paths[i], i);
         }
 
-        mutex_enter(&rfs4_servinst_lock);
-        if (rfs4_cur_servinst != NULL) {
+        mutex_enter(&nsrv4->servinst_lock);
+        if (nsrv4->nfs4_cur_servinst != NULL) {
                 /* add to linked list */
-                sip->prev = rfs4_cur_servinst;
-                rfs4_cur_servinst->next = sip;
+                sip->prev = nsrv4->nfs4_cur_servinst;
+                nsrv4->nfs4_cur_servinst->next = sip;
         }
         if (start_grace)
                 rfs4_grace_start(sip);
         /* make the new instance "current" */
-        rfs4_cur_servinst = sip;
+        nsrv4->nfs4_cur_servinst = sip;
 
-        mutex_exit(&rfs4_servinst_lock);
+        mutex_exit(&nsrv4->servinst_lock);
 }
 
 /*
  * In future, we might add a rfs4_servinst_destroy(sip) but, for now, destroy
  * all instances directly.
  */
 void
-rfs4_servinst_destroy_all(void)
+rfs4_servinst_destroy_all(nfs4_srv_t *nsrv4)
 {
         rfs4_servinst_t *sip, *prev, *current;
 #ifdef DEBUG
         int n = 0;
 #endif
 
-        mutex_enter(&rfs4_servinst_lock);
-        ASSERT(rfs4_cur_servinst != NULL);
-        current = rfs4_cur_servinst;
-        rfs4_cur_servinst = NULL;
+        mutex_enter(&nsrv4->servinst_lock);
+        ASSERT(nsrv4->nfs4_cur_servinst != NULL);
+        current = nsrv4->nfs4_cur_servinst;
+        nsrv4->nfs4_cur_servinst = NULL;
         for (sip = current; sip != NULL; sip = prev) {
                 prev = sip->prev;
                 rw_destroy(&sip->rwlock);
                 if (sip->oldstate)
                         kmem_free(sip->oldstate, sizeof (rfs4_oldstate_t));

@@ -798,29 +885,30 @@
                 kmem_free(sip, sizeof (rfs4_servinst_t));
 #ifdef DEBUG
                 n++;
 #endif
         }
-        mutex_exit(&rfs4_servinst_lock);
+        mutex_exit(&nsrv4->servinst_lock);
 }
 
 /*
  * Assign the current server instance to a client_t.
  * Should be called with cp->rc_dbe held.
  */
 void
-rfs4_servinst_assign(rfs4_client_t *cp, rfs4_servinst_t *sip)
+rfs4_servinst_assign(nfs4_srv_t *nsrv4, rfs4_client_t *cp,
+    rfs4_servinst_t *sip)
 {
         ASSERT(rfs4_dbe_refcnt(cp->rc_dbe) > 0);
 
         /*
          * The lock ensures that if the current instance is in the process
          * of changing, we will see the new one.
          */
-        mutex_enter(&rfs4_servinst_lock);
+        mutex_enter(&nsrv4->servinst_lock);
         cp->rc_server_instance = sip;
-        mutex_exit(&rfs4_servinst_lock);
+        mutex_exit(&nsrv4->servinst_lock);
 }
 
 rfs4_servinst_t *
 rfs4_servinst(rfs4_client_t *cp)
 {
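
Reviewer note: the server-instance hunks above keep the same list discipline as before and only move the list head and its lock into nfs4_srv_t. As an aid for reading them, here is a minimal user-space sketch of that discipline: a new instance is linked behind the previous one and becomes current, and walkers follow prev from the current instance. Locking is deliberately omitted (the kernel code holds servinst_lock around each of these steps), and struct inst, struct srv, inst_create(), grace_reset_all() and inst_destroy_all() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for rfs4_servinst_t. */
struct inst {
	int id;
	int in_grace;		/* nonzero while the grace period is active */
	int grace_starts;	/* how many times grace was (re)started */
	struct inst *prev;
	struct inst *next;
};

/* Hypothetical stand-in for the per-zone list head. */
struct srv {
	struct inst *cur;	/* current (most recent) instance */
};

/* Create an instance, link it after the current one, make it current. */
struct inst *
inst_create(struct srv *s, int id, int start_grace)
{
	struct inst *n;

	assert(s != NULL);
	n = calloc(1, sizeof (*n));
	if (n == NULL)
		return (NULL);
	n->id = id;
	if (s->cur != NULL) {
		n->prev = s->cur;
		s->cur->next = n;
	}
	if (start_grace) {
		n->in_grace = 1;
		n->grace_starts++;
	}
	s->cur = n;
	return (n);
}

/* Restart grace on every instance that is currently in grace. */
int
grace_reset_all(struct srv *s)
{
	struct inst *p;
	int count = 0;

	assert(s != NULL);
	for (p = s->cur; p != NULL; p = p->prev) {
		if (p->in_grace) {
			p->grace_starts++;
			count++;
		}
	}
	return (count);
}

/* Detach the list from the head first, then free it from newest to oldest. */
void
inst_destroy_all(struct srv *s)
{
	struct inst *p, *prev;

	assert(s != NULL);
	p = s->cur;
	s->cur = NULL;
	for (; p != NULL; p = prev) {
		prev = p->prev;
		free(p);
	}
}
```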

@@ -877,10 +965,11 @@
         secinfo4 *resok_val;
         struct secinfo *secp;
         seconfig_t *si;
         bool_t did_traverse = FALSE;
         int dotdot, walk;
+        nfs_export_t *ne = nfs_get_export();
 
         dvp = cs->vp;
         dotdot = (nm[0] == '.' && nm[1] == '.' && nm[2] == '\0');
 
         /*

@@ -898,11 +987,11 @@
 
                         /*
                          * If at the system root, then can
                          * go up no further.
                          */
-                        if (VN_CMP(dvp, rootdir))
+                        if (VN_CMP(dvp, ZONE_ROOTVP()))
                                 return (puterrno4(ENOENT));
 
                         /*
                          * Traverse back to the mounted-on filesystem
                          */

@@ -1015,11 +1104,11 @@
          *
          * Return all flavors for a pseudo node.
          * For a real export node, return the flavor that the client
          * has access with.
          */
-        ASSERT(RW_LOCK_HELD(&exported_lock));
+        ASSERT(RW_LOCK_HELD(&ne->exported_lock));
         if (PSEUDO(exi)) {
                 count = exi->exi_export.ex_seccnt; /* total sec count */
                 resok_val = kmem_alloc(count * sizeof (secinfo4), KM_SLEEP);
                 secp = exi->exi_export.ex_secinfo;
 

@@ -1378,10 +1467,11 @@
         COMMIT4res *resp = &resop->nfs_resop4_u.opcommit;
         int error;
         vnode_t *vp = cs->vp;
         cred_t *cr = cs->cr;
         vattr_t va;
+        nfs4_srv_t *nsrv4;
 
         DTRACE_NFSV4_2(op__commit__start, struct compound_state *, cs,
             COMMIT4args *, args);
 
         if (vp == NULL) {

@@ -1434,12 +1524,13 @@
         if (error) {
                 *cs->statusp = resp->status = puterrno4(error);
                 goto out;
         }
 
+        nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
         *cs->statusp = resp->status = NFS4_OK;
-        resp->writeverf = Write4verf;
+        resp->writeverf = nsrv4->write4verf;
 out:
         DTRACE_NFSV4_2(op__commit__done, struct compound_state *, cs,
             COMMIT4res *, resp);
 }
 

@@ -2643,11 +2734,11 @@
 
                         /*
                          * If at the system root, then can
                          * go up no further.
                          */
-                        if (VN_CMP(cs->vp, rootdir))
+                        if (VN_CMP(cs->vp, ZONE_ROOTVP()))
                                 return (puterrno4(ENOENT));
 
                         /*
                          * Traverse back to the mounted-on filesystem
                          */

@@ -3407,10 +3498,11 @@
         PUTPUBFH4res    *resp = &resop->nfs_resop4_u.opputpubfh;
         int             error;
         vnode_t         *vp;
         struct exportinfo *exi, *sav_exi;
         nfs_fh4_fmt_t   *fh_fmtp;
+        nfs_export_t *ne = nfs_get_export();
 
         DTRACE_NFSV4_1(op__putpubfh__start, struct compound_state *, cs);
 
         if (cs->vp) {
                 VN_RELE(cs->vp);

@@ -3420,23 +3512,23 @@
         if (cs->cr)
                 crfree(cs->cr);
 
         cs->cr = crdup(cs->basecr);
 
-        vp = exi_public->exi_vp;
+        vp = ne->exi_public->exi_vp;
         if (vp == NULL) {
                 *cs->statusp = resp->status = NFS4ERR_SERVERFAULT;
                 goto out;
         }
 
-        error = makefh4(&cs->fh, vp, exi_public);
+        error = makefh4(&cs->fh, vp, ne->exi_public);
         if (error != 0) {
                 *cs->statusp = resp->status = puterrno4(error);
                 goto out;
         }
         sav_exi = cs->exi;
-        if (exi_public == exi_root) {
+        if (ne->exi_public == ne->exi_root) {
                 /*
                  * No filesystem is actually shared public, so we default
                  * to exi_root. In this case, we must check whether root
                  * is exported.
                  */

@@ -3447,16 +3539,16 @@
                  * should use is what checkexport4 returns, because root_exi is
                  * actually a mostly empty struct.
                  */
                 exi = checkexport4(&fh_fmtp->fh4_fsid,
                     (fid_t *)&fh_fmtp->fh4_xlen, NULL);
-                cs->exi = ((exi != NULL) ? exi : exi_public);
+                cs->exi = ((exi != NULL) ? exi : ne->exi_public);
         } else {
                 /*
                  * it's a properly shared filesystem
                  */
-                cs->exi = exi_public;
+                cs->exi = ne->exi_public;
         }
 
         if (is_system_labeled()) {
                 bslabel_t *clabel;
 

@@ -3527,11 +3619,10 @@
         if (cs->cr) {
                 crfree(cs->cr);
                 cs->cr = NULL;
         }
 
-
         if (args->object.nfs_fh4_len < NFS_FH4_LEN) {
                 *cs->statusp = resp->status = NFS4ERR_BADHANDLE;
                 goto out;
         }
 

@@ -3594,11 +3685,11 @@
          * Using rootdir, the system root vnode,
          * get its fid.
          */
         bzero(&fid, sizeof (fid));
         fid.fid_len = MAXFIDSZ;
-        error = vop_fid_pseudo(rootdir, &fid);
+        error = vop_fid_pseudo(ZONE_ROOTVP(), &fid);
         if (error != 0) {
                 *cs->statusp = resp->status = puterrno4(error);
                 goto out;
         }
 

@@ -3608,11 +3699,11 @@
          * If the server root isn't exported directly, then
          * it should at least be a pseudo export based on
          * one or more exports further down in the server's
          * file tree.
          */
-        exi = checkexport4(&rootdir->v_vfsp->vfs_fsid, &fid, NULL);
+        exi = checkexport4(&ZONE_ROOTVP()->v_vfsp->vfs_fsid, &fid, NULL);
         if (exi == NULL || exi->exi_export.ex_flags & EX_PUBLIC) {
                 NFS4_DEBUG(rfs4_debug,
                     (CE_WARN, "rfs4_op_putrootfh: export check failure"));
                 *cs->statusp = resp->status = NFS4ERR_SERVERFAULT;
                 goto out;

@@ -3620,24 +3711,24 @@
 
         /*
          * Now make a filehandle based on the root
          * export and root vnode.
          */
-        error = makefh4(&cs->fh, rootdir, exi);
+        error = makefh4(&cs->fh, ZONE_ROOTVP(), exi);
         if (error != 0) {
                 *cs->statusp = resp->status = puterrno4(error);
                 goto out;
         }
 
         sav_exi = cs->exi;
         cs->exi = exi;
 
-        VN_HOLD(rootdir);
-        cs->vp = rootdir;
+        VN_HOLD(ZONE_ROOTVP());
+        cs->vp = ZONE_ROOTVP();
 
         if ((resp->status = call_checkauth4(cs, req)) != NFS4_OK) {
-                VN_RELE(rootdir);
+                VN_RELE(cs->vp);
                 cs->vp = NULL;
                 cs->exi = sav_exi;
                 goto out;
         }
 

@@ -4244,11 +4335,11 @@
                          * not ENOTEMPTY, if the directory is not
                          * empty.  A System V NFS server needs to map
                          * NFS4ERR_EXIST to NFS4ERR_NOTEMPTY to
                          * transmit over the wire.
                          */
-                        if ((error = VOP_RMDIR(dvp, name, rootdir, cs->cr,
+                        if ((error = VOP_RMDIR(dvp, name, ZONE_ROOTVP(), cs->cr,
                             NULL, 0)) == EEXIST)
                                 error = ENOTEMPTY;
                 }
         } else {
                 if ((error = VOP_REMOVE(dvp, name, cs->cr, NULL, 0)) == 0 &&

@@ -4356,18 +4447,19 @@
         RENAME4args *args = &argop->nfs_argop4_u.oprename;
         RENAME4res *resp = &resop->nfs_resop4_u.oprename;
         int error;
         vnode_t *odvp;
         vnode_t *ndvp;
-        vnode_t *srcvp, *targvp;
+        vnode_t *srcvp, *targvp, *tvp;
         struct vattr obdva, oidva, oadva;
         struct vattr nbdva, nidva, nadva;
         char *onm, *nnm;
         uint_t olen, nlen;
         rfs4_file_t *fp, *sfp;
         int in_crit_src, in_crit_targ;
         int fp_rele_grant_hold, sfp_rele_grant_hold;
+        int unlinked;
         bslabel_t *clabel;
         struct sockaddr *ca;
         char *converted_onm = NULL;
         char *converted_nnm = NULL;
         nfsstat4 status;

@@ -4374,13 +4466,14 @@
 
         DTRACE_NFSV4_2(op__rename__start, struct compound_state *, cs,
             RENAME4args *, args);
 
         fp = sfp = NULL;
-        srcvp = targvp = NULL;
+        srcvp = targvp = tvp = NULL;
         in_crit_src = in_crit_targ = 0;
         fp_rele_grant_hold = sfp_rele_grant_hold = 0;
+        unlinked = 0;
 
         /* CURRENT_FH: target directory */
         ndvp = cs->vp;
         if (ndvp == NULL) {
                 *cs->statusp = resp->status = NFS4ERR_NOFILEHANDLE;

@@ -4549,11 +4642,10 @@
                         goto err_out;
                 }
         }
         fp_rele_grant_hold = 1;
 
-
         /* Check for NBMAND lock on both source and target */
         if (nbl_need_check(srcvp)) {
                 nbl_start_crit(srcvp, RW_READER);
                 in_crit_src = 1;
                 if (nbl_conflict(srcvp, NBL_RENAME, 0, 0, 0, NULL)) {

@@ -4584,35 +4676,45 @@
         }
 
         NFS4_SET_FATTR4_CHANGE(resp->source_cinfo.before, obdva.va_ctime)
         NFS4_SET_FATTR4_CHANGE(resp->target_cinfo.before, nbdva.va_ctime)
 
-        if ((error = VOP_RENAME(odvp, converted_onm, ndvp, converted_nnm,
-            cs->cr, NULL, 0)) == 0 && fp != NULL) {
-                struct vattr va;
-                vnode_t *tvp;
+        error = VOP_RENAME(odvp, converted_onm, ndvp, converted_nnm, cs->cr,
+            NULL, 0);
 
+        /*
+         * If the target existed and was unlinked by VOP_RENAME, its state
+         * needs to be closed. To avoid deadlock, rfs4_close_all_state is
+         * called after any necessary nbl_end_crit on srcvp and targvp.
+         */
+        if (error == 0 && fp != NULL) {
                 rfs4_dbe_lock(fp->rf_dbe);
                 tvp = fp->rf_vp;
                 if (tvp)
                         VN_HOLD(tvp);
                 rfs4_dbe_unlock(fp->rf_dbe);
 
                 if (tvp) {
+                        struct vattr va;
                         va.va_mask = AT_NLINK;
+
                         if (!VOP_GETATTR(tvp, &va, 0, cs->cr, NULL) &&
                             va.va_nlink == 0) {
-                                /* The file is gone and so should the state */
-                                if (in_crit_targ) {
-                                        nbl_end_crit(targvp);
-                                        in_crit_targ = 0;
+                                unlinked = 1;
+
+                                /* DEBUG data */
+                                if ((srcvp == targvp) || (tvp != targvp)) {
+                                        cmn_err(CE_WARN, "rfs4_op_rename: "
+                                            "srcvp %p, targvp: %p, tvp: %p",
+                                            (void *)srcvp, (void *)targvp,
+                                            (void *)tvp);
                                 }
-                                rfs4_close_all_state(fp);
-                        }
+                        } else {
                         VN_RELE(tvp);
                 }
         }
+        }
         if (error == 0)
                 vn_renamepath(ndvp, srcvp, nnm, nlen - 1);
 
         if (in_crit_src)
                 nbl_end_crit(srcvp);

@@ -4621,10 +4723,25 @@
         if (in_crit_targ)
                 nbl_end_crit(targvp);
         if (targvp)
                 VN_RELE(targvp);
 
+        if (unlinked) {
+                ASSERT(fp != NULL);
+                ASSERT(tvp != NULL);
+
+                /* DEBUG data */
+                if (RW_READ_HELD(&tvp->v_nbllock)) {
+                        cmn_err(CE_WARN, "rfs4_op_rename: "
+                            "RW_READ_HELD(%p)", (void *)tvp);
+                }
+
+                /* The file is gone and so should the state */
+                rfs4_close_all_state(fp);
+                VN_RELE(tvp);
+        }
+
         if (sfp) {
                 rfs4_clear_dont_grant(sfp);
                 rfs4_file_rele(sfp);
         }
         if (fp) {

@@ -5557,10 +5674,11 @@
         cred_t *savecred, *cr;
         bool_t *deleg = &cs->deleg;
         nfsstat4 stat;
         int in_crit = 0;
         caller_context_t ct;
+        nfs4_srv_t *nsrv4;
 
         DTRACE_NFSV4_2(op__write__start, struct compound_state *, cs,
             WRITE4args *, args);
 
         vp = cs->vp;

@@ -5627,15 +5745,16 @@
         if (MANDLOCK(vp, bva.va_mode)) {
                 *cs->statusp = resp->status = NFS4ERR_ACCESS;
                 goto out;
         }
 
+        nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
         if (args->data_len == 0) {
                 *cs->statusp = resp->status = NFS4_OK;
                 resp->count = 0;
                 resp->committed = args->stable;
-                resp->writeverf = Write4verf;
+                resp->writeverf = nsrv4->write4verf;
                 goto out;
         }
 
         if (args->mblk != NULL) {
                 mblk_t *m;

@@ -5727,11 +5846,11 @@
         if (ioflag == 0)
                 resp->committed = UNSTABLE4;
         else
                 resp->committed = FILE_SYNC4;
 
-        resp->writeverf = Write4verf;
+        resp->writeverf = nsrv4->write4verf;
 
 out:
         if (in_crit)
                 nbl_end_crit(vp);
 

@@ -5747,10 +5866,12 @@
 rfs4_compound(COMPOUND4args *args, COMPOUND4res *resp, struct exportinfo *exi,
     struct svc_req *req, cred_t *cr, int *rv)
 {
         uint_t i;
         struct compound_state cs;
+        nfs4_srv_t *nsrv4;
+        nfs_export_t *ne = nfs_get_export();
 
         if (rv != NULL)
                 *rv = 0;
         rfs4_init_compound_state(&cs);
         /*

@@ -5804,10 +5925,11 @@
         resp->array_len = args->array_len;
         resp->array = kmem_zalloc(args->array_len * sizeof (nfs_resop4),
             KM_SLEEP);
 
         cs.basecr = cr;
+        nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
 
         DTRACE_NFSV4_2(compound__start, struct compound_state *, &cs,
             COMPOUND4args *, args);
 
         /*

@@ -5818,24 +5940,24 @@
          * per proc (excluding public exinfo), and exi_count design
          * is sufficient to protect concurrent execution of NFS2/3
          * ops along with unexport.  This lock will be removed as
          * part of the NFSv4 phase 2 namespace redesign work.
          */
-        rw_enter(&exported_lock, RW_READER);
+        rw_enter(&ne->exported_lock, RW_READER);
 
         /*
          * If this is the first compound we've seen, we need to start all
          * new instances' grace periods.
          */
-        if (rfs4_seen_first_compound == 0) {
-                rfs4_grace_start_new();
+        if (nsrv4->seen_first_compound == 0) {
+                rfs4_grace_start_new(nsrv4);
                 /*
                  * This must be set after rfs4_grace_start_new(), otherwise
                  * another thread could proceed past here before the former
                  * is finished.
                  */
-                rfs4_seen_first_compound = 1;
+                nsrv4->seen_first_compound = 1;
         }
 
         for (i = 0; i < args->array_len && cs.cont; i++) {
                 nfs_argop4 *argop;
                 nfs_resop4 *resop;

@@ -5845,24 +5967,86 @@
                 resop = &resp->array[i];
                 resop->resop = argop->argop;
                 op = (uint_t)resop->resop;
 
                 if (op < rfsv4disp_cnt) {
+                        kstat_t *ksp = rfsprocio_v4_ptr[op];
+                        kstat_t *exi_ksp = NULL;
+
                         /*
                          * Count the individual ops here; NULL and COMPOUND
                          * are counted in common_dispatch()
                          */
                         rfsproccnt_v4_ptr[op].value.ui64++;
 
+                        if (ksp != NULL) {
+                                mutex_enter(ksp->ks_lock);
+                                kstat_runq_enter(KSTAT_IO_PTR(ksp));
+                                mutex_exit(ksp->ks_lock);
+                        }
+
+                        switch (rfsv4disptab[op].op_type) {
+                        case NFS4_OP_CFH:
+                                resop->exi = cs.exi;
+                                break;
+                        case NFS4_OP_SFH:
+                                resop->exi = cs.saved_exi;
+                                break;
+                        default:
+                                ASSERT(resop->exi == NULL);
+                                break;
+                        }
+
+                        if (resop->exi != NULL) {
+                                exi_ksp = NULL;
+                                if (resop->exi->exi_kstats != NULL) {
+                                        exi_ksp = exp_kstats_v4(
+                                            resop->exi->exi_kstats, op);
+                                }
+                                if (exi_ksp != NULL) {
+                                        mutex_enter(exi_ksp->ks_lock);
+                                        kstat_runq_enter(KSTAT_IO_PTR(exi_ksp));
+                                        mutex_exit(exi_ksp->ks_lock);
+                                }
+                        }
+
                         NFS4_DEBUG(rfs4_debug > 1,
                             (CE_NOTE, "Executing %s", rfs4_op_string[op]));
                         (*rfsv4disptab[op].dis_proc)(argop, resop, req, &cs);
                         NFS4_DEBUG(rfs4_debug > 1, (CE_NOTE, "%s returned %d",
                             rfs4_op_string[op], *cs.statusp));
                         if (*cs.statusp != NFS4_OK)
                                 cs.cont = FALSE;
+
+                        if (rfsv4disptab[op].op_type == NFS4_OP_POSTCFH &&
+                            *cs.statusp == NFS4_OK &&
+                            (resop->exi = cs.exi) != NULL) {
+                                exi_ksp = NULL;
+                                if (resop->exi->exi_kstats != NULL) {
+                                        exi_ksp = exp_kstats_v4(
+                                            resop->exi->exi_kstats, op);
+                                }
+                        }
+
+                        if (exi_ksp != NULL) {
+                                mutex_enter(exi_ksp->ks_lock);
+                                KSTAT_IO_PTR(exi_ksp)->nwritten +=
+                                    argop->opsize;
+                                KSTAT_IO_PTR(exi_ksp)->writes++;
+                                if (rfsv4disptab[op].op_type != NFS4_OP_POSTCFH)
+                                        kstat_runq_exit(KSTAT_IO_PTR(exi_ksp));
+                                mutex_exit(exi_ksp->ks_lock);
                 } else {
+                                resop->exi = NULL;
+                        }
+
+                        if (ksp != NULL) {
+                                mutex_enter(ksp->ks_lock);
+                                kstat_runq_exit(KSTAT_IO_PTR(ksp));
+                                mutex_exit(ksp->ks_lock);
+                        }
+                } else {
                         /*
                          * This is effectively dead code since XDR code
                          * will have already returned BADXDR if op doesn't
                          * decode to a legal value.  This is only done for the
                          * day when XDR code doesn't verify v4 opcodes.

@@ -5873,35 +6057,50 @@
                         rfs4_op_illegal(argop, resop, req, &cs);
                         cs.cont = FALSE;
                 }
 
                 /*
+                 * The exi is saved in the resop to be used for the kstats
+                 * update once the opsize is calculated during XDR response
+                 * encoding. Put a hold on resop->exi so that it can't be
+                 * destroyed.
+                 */
+                if (resop->exi != NULL)
+                        exi_hold(resop->exi);
+
+                /*
                  * If not at last op, and if we are to stop, then
                  * compact the results array.
                  */
                 if ((i + 1) < args->array_len && !cs.cont) {
                         nfs_resop4 *new_res = kmem_alloc(
-                            (i+1) * sizeof (nfs_resop4), KM_SLEEP);
+                            (i + 1) * sizeof (nfs_resop4), KM_SLEEP);
                         bcopy(resp->array,
-                            new_res, (i+1) * sizeof (nfs_resop4));
+                            new_res, (i + 1) * sizeof (nfs_resop4));
                         kmem_free(resp->array,
                             args->array_len * sizeof (nfs_resop4));
 
                         resp->array_len =  i + 1;
                         resp->array = new_res;
                 }
         }
 
-        rw_exit(&exported_lock);
+        rw_exit(&ne->exported_lock);
 
-        DTRACE_NFSV4_2(compound__done, struct compound_state *, &cs,
-            COMPOUND4res *, resp);
-
+        /*
+         * Clear the exportinfo and vnode fields in compound_state before the
+         * dtrace probe, to avoid tracing residual values for path and share
+         * path.
+         */
         if (cs.vp)
                 VN_RELE(cs.vp);
         if (cs.saved_vp)
                 VN_RELE(cs.saved_vp);
+        cs.exi = cs.saved_exi = NULL;
+        cs.vp = cs.saved_vp = NULL;
+
+        DTRACE_NFSV4_2(compound__done, struct compound_state *, &cs,
+            COMPOUND4res *, resp);
+
         if (cs.saved_fh.nfs_fh4_val)
                 kmem_free(cs.saved_fh.nfs_fh4_val, NFS4_FHSIZE);
 
         if (cs.basecr)
                 crfree(cs.basecr);

@@ -5967,10 +6166,97 @@
                         flag = 0;
         }
         *flagp = flag;
 }
 
+/*
+ * Update the kstats for the received requests.
+ * Note: writes/nwritten are used to hold count and nbytes of requests received.
+ *
+ * Per export request statistics need to be updated during the compound request
+ * processing (rfs4_compound()) as that is where it is known which exportinfo to
+ * associate the kstats with.
+ */
+void
+rfs4_compound_kstat_args(COMPOUND4args *args)
+{
+        int i;
+
+        for (i = 0; i < args->array_len; i++) {
+                uint_t op = (uint_t)args->array[i].argop;
+
+                if (op < rfsv4disp_cnt) {
+                        kstat_t *ksp = rfsprocio_v4_ptr[op];
+
+                        if (ksp != NULL) {
+                                mutex_enter(ksp->ks_lock);
+                                KSTAT_IO_PTR(ksp)->nwritten +=
+                                    args->array[i].opsize;
+                                KSTAT_IO_PTR(ksp)->writes++;
+                                mutex_exit(ksp->ks_lock);
+                        }
+                }
+        }
+}
+
+/*
+ * Update the kstats for the sent responses.
+ * Note: reads/nread are used to hold count and nbytes of responses sent.
+ *
+ * Per export response statistics cannot be updated until here, after the
+ * response send has generated the opsize (bytes sent) in the XDR encoding.
+ * The exportinfo with which the kstats should be associated is thus saved
+ * in the response structure (by rfs4_compound()) for use here. A hold is
+ * placed on the exi to ensure it cannot be deleted before use. This hold
+ * is released, and the exi set to NULL, here.
+ */
+void
+rfs4_compound_kstat_res(COMPOUND4res *res)
+{
+        int i;
+        nfs_export_t *ne = nfs_get_export();
+
+        for (i = 0; i < res->array_len; i++) {
+                uint_t op = (uint_t)res->array[i].resop;
+
+                if (op < rfsv4disp_cnt) {
+                        kstat_t *ksp = rfsprocio_v4_ptr[op];
+                        struct exportinfo *exi = res->array[i].exi;
+
+                        if (ksp != NULL) {
+                                mutex_enter(ksp->ks_lock);
+                                KSTAT_IO_PTR(ksp)->nread +=
+                                    res->array[i].opsize;
+                                KSTAT_IO_PTR(ksp)->reads++;
+                                mutex_exit(ksp->ks_lock);
+                        }
+
+                        if (exi != NULL) {
+                                kstat_t *exi_ksp = NULL;
+
+                                rw_enter(&ne->exported_lock, RW_READER);
+
+                                if (exi->exi_kstats != NULL) {
+                                        /*CSTYLED*/
+                                        exi_ksp = exp_kstats_v4(exi->exi_kstats, op);
+                                }
+                                if (exi_ksp != NULL) {
+                                        mutex_enter(exi_ksp->ks_lock);
+                                        KSTAT_IO_PTR(exi_ksp)->nread +=
+                                            res->array[i].opsize;
+                                        KSTAT_IO_PTR(exi_ksp)->reads++;
+                                        mutex_exit(exi_ksp->ks_lock);
+                                }
+
+                                exi_rele(&exi);
+                                res->array[i].exi = NULL;
+                                rw_exit(&ne->exported_lock);
+                        }
+                }
+        }
+}
+
 nfsstat4
 rfs4_client_sysid(rfs4_client_t *cp, sysid_t *sp)
 {
         nfsstat4 e;
 

@@ -6601,29 +6887,31 @@
                  */
 
                 if (trunc) {
                         int in_crit = 0;
                         rfs4_file_t *fp;
+                        nfs4_srv_t *nsrv4;
                         bool_t create = FALSE;
 
                         /*
                          * We are writing over an existing file.
                          * Check to see if we need to recall a delegation.
                          */
-                        rfs4_hold_deleg_policy();
+                        nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
+                        rfs4_hold_deleg_policy(nsrv4);
                         if ((fp = rfs4_findfile(vp, NULL, &create)) != NULL) {
                                 if (rfs4_check_delegated_byfp(FWRITE, fp,
                                     (reqsize == 0), FALSE, FALSE, &clientid)) {
                                         rfs4_file_rele(fp);
-                                        rfs4_rele_deleg_policy();
+                                        rfs4_rele_deleg_policy(nsrv4);
                                         VN_RELE(vp);
                                         *attrset = 0;
                                         return (NFS4ERR_DELAY);
                                 }
                                 rfs4_file_rele(fp);
                         }
-                        rfs4_rele_deleg_policy();
+                        rfs4_rele_deleg_policy(nsrv4);
 
                         if (nbl_need_check(vp)) {
                                 in_crit = 1;
 
                                 ASSERT(reqsize == 0);

@@ -8177,15 +8465,17 @@
         SETCLIENTID_CONFIRM4args *args =
             &argop->nfs_argop4_u.opsetclientid_confirm;
         SETCLIENTID_CONFIRM4res *res =
             &resop->nfs_resop4_u.opsetclientid_confirm;
         rfs4_client_t *cp, *cptoclose = NULL;
+        nfs4_srv_t *nsrv4;
 
         DTRACE_NFSV4_2(op__setclientid__confirm__start,
             struct compound_state *, cs,
             SETCLIENTID_CONFIRM4args *, args);
 
+        nsrv4 = zone_getspecific(rfs4_zone_key, curzone);
         *cs->statusp = res->status = NFS4_OK;
 
         cp = rfs4_findclient_by_id(args->clientid, TRUE);
 
         if (cp == NULL) {

@@ -8217,18 +8507,18 @@
 
         /*
          * Update the client's associated server instance, if it's changed
          * since the client was created.
          */
-        if (rfs4_servinst(cp) != rfs4_cur_servinst)
-                rfs4_servinst_assign(cp, rfs4_cur_servinst);
+        if (rfs4_servinst(cp) != nsrv4->nfs4_cur_servinst)
+                rfs4_servinst_assign(nsrv4, cp, nsrv4->nfs4_cur_servinst);
 
         /*
          * Record clientid in stable storage.
          * Must be done after server instance has been assigned.
          */
-        rfs4_ss_clid(cp);
+        rfs4_ss_clid(nsrv4, cp);
 
         rfs4_dbe_unlock(cp->rc_dbe);
 
         if (cptoclose)
                 /* don't need to rele, client_close does it */

@@ -8239,11 +8529,11 @@
         rfs4_update_lease(cp);
 
         /*
          * Check to see if client can perform reclaims
          */
-        rfs4_ss_chkclid(cp);
+        rfs4_ss_chkclid(nsrv4, cp);
 
         rfs4_client_rele(cp);
 
 out:
         DTRACE_NFSV4_2(op__setclientid__confirm__done,

@@ -9883,6 +10173,170 @@
         if (ci == NULL)
                 return (0);
         is_downrev = ci->ri_no_referrals;
         rfs4_dbe_rele(ci->ri_dbe);
         return (is_downrev);
+}
+
+/*
+ * Do the main work of handling HA-NFSv4 Resource Group failover on
+ * Sun Cluster.
+ * We need to detect whether any RG admin paths have been added or removed,
+ * and adjust resources accordingly.
+ * Currently we're using a very inefficient algorithm, ~ 2 * O(n**2). In
+ * order to scale, the list and array of paths need to be held in more
+ * suitable data structures.
+ */
+static void
+hanfsv4_failover(nfs4_srv_t *nsrv4)
+{
+        int i, start_grace, numadded_paths = 0;
+        char **added_paths = NULL;
+        rfs4_dss_path_t *dss_path;
+
+        /*
+         * Note: currently, dss_pathlist cannot be NULL, since
+         * it will always include an entry for NFS4_DSS_VAR_DIR. If we
+         * make the latter dynamically specified too, the following will
+         * need to be adjusted.
+         */
+
+        /*
+         * First, look for removed paths: RGs that have been failed-over
+         * away from this node.
+         * Walk the "currently-serving" dss_pathlist and, for each
+         * path, check if it is on the "passed-in" rfs4_dss_newpaths array
+         * from nfsd. If not, that RG path has been removed.
+         *
+         * Note that nfsd has sorted rfs4_dss_newpaths for us, and removed
+         * any duplicates.
+         */
+        dss_path = nsrv4->dss_pathlist;
+        do {
+                int found = 0;
+                char *path = dss_path->path;
+
+                /* used only for non-HA so may not be removed */
+                if (strcmp(path, NFS4_DSS_VAR_DIR) == 0) {
+                        dss_path = dss_path->next;
+                        continue;
+                }
+
+                for (i = 0; i < rfs4_dss_numnewpaths; i++) {
+                        int cmpret;
+                        char *newpath = rfs4_dss_newpaths[i];
+
+                        /*
+                         * Since nfsd has sorted rfs4_dss_newpaths for us,
+                         * once the return from strcmp is negative we know
+                         * we've passed the point where "path" should be,
+                         * and can stop searching: "path" has been removed.
+                         */
+                        cmpret = strcmp(path, newpath);
+                        if (cmpret < 0)
+                                break;
+                        if (cmpret == 0) {
+                                found = 1;
+                                break;
+                        }
+                }
+
+                if (found == 0) {
+                        unsigned index = dss_path->index;
+                        rfs4_servinst_t *sip = dss_path->sip;
+                        rfs4_dss_path_t *path_next = dss_path->next;
+
+                        /*
+                         * This path has been removed.
+                         * We must clear out the servinst reference to
+                         * it, since it's now owned by another
+                         * node: we should not attempt to touch it.
+                         */
+                        ASSERT(dss_path == sip->dss_paths[index]);
+                        sip->dss_paths[index] = NULL;
+
+                        /* remove from "currently-serving" list, and destroy */
+                        remque(dss_path);
+                        /* allow for NUL */
+                        kmem_free(dss_path->path, strlen(dss_path->path) + 1);
+                        kmem_free(dss_path, sizeof (rfs4_dss_path_t));
+
+                        dss_path = path_next;
+                } else {
+                        /* path was found; not removed */
+                        dss_path = dss_path->next;
+                }
+        } while (dss_path != nsrv4->dss_pathlist);
+
+        /*
+         * Now, look for added paths: RGs that have been failed-over
+         * to this node.
+         * Walk the "passed-in" rfs4_dss_newpaths array from nfsd and,
+         * for each path, check if it is on the "currently-serving"
+         * dss_pathlist. If not, that RG path has been added.
+         *
+         * Note: we don't do duplicate detection here; nfsd does that for us.
+         *
+         * Note: numadded_paths <= rfs4_dss_numnewpaths, which gives us
+         * an upper bound for the size needed for added_paths[numadded_paths].
+         */
+
+        /* probably more space than we need, but guaranteed to be enough */
+        if (rfs4_dss_numnewpaths > 0) {
+                size_t sz = rfs4_dss_numnewpaths * sizeof (char *);
+                added_paths = kmem_zalloc(sz, KM_SLEEP);
+        }
+
+        /* walk the "passed-in" rfs4_dss_newpaths array from nfsd */
+        for (i = 0; i < rfs4_dss_numnewpaths; i++) {
+                int found = 0;
+                char *newpath = rfs4_dss_newpaths[i];
+
+                dss_path = nsrv4->dss_pathlist;
+                do {
+                        char *path = dss_path->path;
+
+                        /* used only for non-HA */
+                        if (strcmp(path, NFS4_DSS_VAR_DIR) == 0) {
+                                dss_path = dss_path->next;
+                                continue;
+                        }
+
+                        if (strncmp(path, newpath, strlen(path)) == 0) {
+                                found = 1;
+                                break;
+                        }
+
+                        dss_path = dss_path->next;
+                } while (dss_path != nsrv4->dss_pathlist);
+
+                if (found == 0) {
+                        added_paths[numadded_paths] = newpath;
+                        numadded_paths++;
+                }
+        }
+
+        /* did we find any added paths? */
+        if (numadded_paths > 0) {
+
+                /* create a new server instance, and start its grace period */
+                start_grace = 1;
+                /* CSTYLED */
+                rfs4_servinst_create(nsrv4, start_grace, numadded_paths, added_paths);
+
+                /* read in the stable storage state from these paths */
+                rfs4_dss_readstate(nsrv4, numadded_paths, added_paths);
+
+                /*
+                 * Multiple failovers during a grace period will cause
+                 * clients of the same resource group to be partitioned
+                 * into different server instances, with different
+                 * grace periods.  Since clients of the same resource
+                 * group must be subject to the same grace period,
+                 * we need to reset all currently active grace periods.
+                 */
+                rfs4_grace_reset_all(nsrv4);
+        }
+
+        if (rfs4_dss_numnewpaths > 0)
+                kmem_free(added_paths, rfs4_dss_numnewpaths * sizeof (char *));
 }
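
Note on the removed-path scan in hanfsv4_failover(): the inner loop stops as soon as strcmp() returns a negative value, which is only valid because nfsd passes rfs4_dss_newpaths sorted and free of duplicates. The following is a minimal user-space sketch of that early-exit search, not code from this change. The helper name path_in_sorted() and its signature are hypothetical, and it assumes the array is in ascending strcmp() order.

```c
#include <string.h>

/*
 * Hypothetical helper, not part of this change: return 1 if "path" occurs
 * in "sorted", an array of "n" strings assumed to be in ascending strcmp()
 * order with no duplicates; return 0 otherwise.
 */
static int
path_in_sorted(const char *path, char *const *sorted, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		int cmpret = strcmp(path, sorted[i]);

		/*
		 * Once "path" sorts before the current entry, it cannot
		 * appear later in the array, so the search can stop.
		 */
		if (cmpret < 0)
			break;
		if (cmpret == 0)
			return (1);
	}
	return (0);
}
```

With the array { "/a", "/c", "/e" }, a lookup of "/b" stops at "/c" without examining "/e". The early exit saves comparisons but does not change the worst case, which stays linear per lookup, in line with the block comment's remark that the overall algorithm is quadratic.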