NEX-20178 Heavy read load using 10G i40e causes network disconnect
MFV illumos-joyent@83a8d0d616db36010b59cc850d1926c0f6a30de1
OS-7457 i40e Tx freezes on zero descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
OS-7492 i40e Tx freeze when b_cont chain exceeds 8 descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@b4bede175d4c50ac1b36078a677b69388f6fb59f
OS-7577 initialize FC for i40e
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Rob Johnston <rob.johnston@joyent.com>
NEX-19928 i40e: cannot create static IP address
Reviewed by: Cynthia Eastham <cynthia.eastham@nexenta.com>
MFV: illumos-joyent@b93056a35d6d6d301f24bc6631fc57dd2c8992c4
OS-7456 i40e default VSI sometimes lacks implicit L2 filter
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@61dc3dec4f82a3e13e94609a0a83d5f66c64e760
OS-6846 want i40e multi-group support
OS-7372 i40e_alloc_ring_mem() unwinds when it shouldn't
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@9e30beee2f0c127bf41868db46257124206e28d6
OS-5225 Want Fortville TSO support
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
NEX-7822 40Gb Intel XL710 NIC performance data
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
        
@@ -9,11 +9,11 @@
  * http://www.illumos.org/license/CDDL.
  */
 
 /*
  * Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
- * Copyright (c) 2017, Joyent, Inc.
+ * Copyright 2019 Joyent, Inc.
  * Copyright 2017 Tegile Systems, Inc.  All rights reserved.
  */
 
 /*
  * i40e - Intel 10/40 Gb Ethernet driver
@@ -135,11 +135,11 @@
  * For each unique PCI device that we encounter, we'll create a i40e_device_t.
  * From there, because we don't have a good way to tell the GLDv3 about sharing
  * resources between everything, we'll end up just dividing the resources
  * evenly between all of the functions. Longer term, if we don't have to declare
  * to the GLDv3 that these resources are shared, then we'll maintain a pool and
- * hae each PF allocate from the pool in the device, thus if only two of four
+ * have each PF allocate from the pool in the device, thus if only two of four
  * ports are being used, for example, then all of the resources can still be
  * used.
  *
  * -------------------------------------------
  * Transmit and Receive Queue Pair Allocations
@@ -167,11 +167,11 @@
  * the default VSI.
  *
  * To receive broadcast traffic, we enable it through the admin queue, rather
  * than use one of our filters for it. For multicast traffic, we reserve a
  * certain number of the hash filters and assign them to a given PF. When we
- * exceed those, we then switch to using promicuous mode for multicast traffic.
+ * exceed those, we then switch to using promiscuous mode for multicast traffic.
  *
  * More specifically, once we exceed the number of filters (indicated because
  * the i40e_t`i40e_resources.ifr_nmcastfilt ==
  * i40e_t`i40e_resources.ifr_nmcastfilt_used), we then instead need to toggle
  * promiscuous mode. If promiscuous mode is toggled then we keep track of the
@@ -186,18 +186,19 @@
  *
  * --------------
  * VSI Management
  * --------------
  *
- * At this time, we currently only support a single MAC group, and thus a single
- * VSI. This VSI is considered the default VSI and should be the only one that
- * exists after a reset. Currently it is stored as the member
- * i40e_t`i40e_vsi_id. While this works for the moment and for an initial
- * driver, it's not sufficient for the longer-term path of the driver. Instead,
- * we'll want to actually have a unique i40e_vsi_t structure which is used
- * everywhere. Note that this means that every place that uses the
- * i40e_t`i40e_vsi_id will need to be refactored.
+ * The PFs share 384 VSIs. The firmware creates one VSI per PF by default.
+ * During chip start we retrieve the SEID of this VSI and assign it as the
+ * default VSI for our VEB (one VEB per PF). We then add additional VSIs to
+ * the VEB up to the determined number of rx groups: i40e_t`i40e_num_rx_groups.
+ * We currently cap this number to I40E_GROUP_MAX to a) make sure all PFs can
+ * allocate the same number of VSIs, and b) to keep the interrupt multiplexing
+ * under control. In the future, when we improve the interrupt allocation, we
+ * may want to revisit this cap to make better use of the available VSIs. The
+ * VSI allocation and configuration can be found in i40e_chip_start().
  *
  * ----------------
  * Structure Layout
  * ----------------
  *
@@ -238,24 +239,23 @@
  *          | i40e_device_t *         --+-----+
  *          | i40e_state_t            --+---> Device State
  *          | i40e_hw_t               --+---> Intel common code structure
  *          | mac_handle_t            --+---> GLDv3 handle to MAC
  *          | ddi_periodic_t          --+---> Link activity timer
- *          | int (vsi_id)            --+---> VSI ID, main identifier
+ *          | i40e_vsi_t *            --+---> Array of VSIs
  *          | i40e_func_rsrc_t        --+---> Available hardware resources
  *          | i40e_switch_rsrc_t *    --+---> Switch resource snapshot
  *          | i40e_sdu                --+---> Current MTU
  *          | i40e_frame_max          --+---> Current HW frame size
  *          | i40e_uaddr_t *          --+---> Array of assigned unicast MACs
  *          | i40e_maddr_t *          --+---> Array of assigned multicast MACs
  *          | i40e_mcast_promisccount --+---> Active multicast state
  *          | i40e_promisc_on         --+---> Current promiscuous mode state
- *          | int                     --+---> Number of transmit/receive pairs
+ *          | uint_t                  --+---> Number of transmit/receive pairs
+ *          | i40e_rx_group_t *       --+---> Array of Rx groups
  *          | kstat_t *               --+---> PF kstats
- *          | kstat_t *               --+---> VSI kstats
  *          | i40e_pf_stats_t         --+---> PF kstat backing data
- *          | i40e_vsi_stats_t        --+---> VSI kstat backing data
  *          | i40e_trqpair_t *        --+---------+
  *          +---------------------------+         |
  *                                                |
  *                                                v
  *  +-------------------------------+       +-----------------------------+
@@ -357,12 +357,10 @@
  * At the moment the i40e_t driver is rather bare bones, allowing us to start
  * getting data flowing and folks using it while we develop additional features.
  * While bugs have been filed to cover this future work, the following gives an
  * overview of expected work:
  *
- *  o TSO support
- *  o Multiple group support
  *  o DMA binding and breaking up the locking in ring recycling.
  *  o Enhanced detection of device errors
  *  o Participation in IRM
  *  o FMA device reset
  *  o Stall detection, temperature error detection, etc.
@@ -369,11 +367,11 @@
  *  o More dynamic resource pools
  */
 
 #include "i40e_sw.h"
 
-static char i40e_ident[] = "Intel 10/40Gb Ethernet v1.0.1";
+static char i40e_ident[] = "Intel 10/40Gb Ethernet v1.0.3";
 
 /*
  * The i40e_glock primarily protects the lists below and the i40e_device_t
  * structures.
  */
@@ -759,19 +757,20 @@
                     FM_VERSION, DATA_TYPE_UINT8, FM_EREPORT_VERS0, NULL);
         }
 }
 
 /*
- * Here we're trying to get the ID of the default VSI. In general, when we come
- * through and look at this shortly after attach, we expect there to only be a
- * single element present, which is the default VSI. Importantly, each PF seems
- * to not see any other devices, in part because of the simple switch mode that
- * we're using. If for some reason, we see more artifact, we'll need to revisit
- * what we're doing here.
+ * Here we're trying to set the SEID of the default VSI. In general,
+ * when we come through and look at this shortly after attach, we
+ * expect there to only be a single element present, which is the
+ * default VSI. Importantly, each PF seems to not see any other
+ * devices, in part because of the simple switch mode that we're
+ * using. If for some reason, we see more artifacts, we'll need to
+ * revisit what we're doing here.
  */
-static int
-i40e_get_vsi_id(i40e_t *i40e)
+static boolean_t
+i40e_set_def_vsi_seid(i40e_t *i40e)
 {
         i40e_hw_t *hw = &i40e->i40e_hw_space;
         struct i40e_aqc_get_switch_config_resp *sw_config;
         uint8_t aq_buf[I40E_AQ_LARGE_BUF];
         uint16_t next = 0;
@@ -782,21 +781,47 @@
         rc = i40e_aq_get_switch_config(hw, sw_config, sizeof (aq_buf), &next,
             NULL);
         if (rc != I40E_SUCCESS) {
                 i40e_error(i40e, "i40e_aq_get_switch_config() failed %d: %d",
                     rc, hw->aq.asq_last_status);
-                return (-1);
+                return (B_FALSE);
         }
 
         if (LE_16(sw_config->header.num_reported) != 1) {
                 i40e_error(i40e, "encountered multiple (%d) switching units "
                     "during attach, not proceeding",
                     LE_16(sw_config->header.num_reported));
+                return (B_FALSE);
+        }
+
+        I40E_DEF_VSI_SEID(i40e) = sw_config->element[0].seid;
+        return (B_TRUE);
+}
+
+/*
+ * Get the SEID of the uplink MAC.
+ */
+static int
+i40e_get_mac_seid(i40e_t *i40e)
+{
+        i40e_hw_t *hw = &i40e->i40e_hw_space;
+        struct i40e_aqc_get_switch_config_resp *sw_config;
+        uint8_t aq_buf[I40E_AQ_LARGE_BUF];
+        uint16_t next = 0;
+        int rc;
+
+        /* LINTED: E_BAD_PTR_CAST_ALIGN */
+        sw_config = (struct i40e_aqc_get_switch_config_resp *)aq_buf;
+        rc = i40e_aq_get_switch_config(hw, sw_config, sizeof (aq_buf), &next,
+            NULL);
+        if (rc != I40E_SUCCESS) {
+                i40e_error(i40e, "i40e_aq_get_switch_config() failed %d: %d",
+                    rc, hw->aq.asq_last_status);
                 return (-1);
         }
 
-        return (sw_config->element[0].seid);
+        return (LE_16(sw_config->element[0].uplink_seid));
 }
 
 /*
  * We need to fill the i40e_hw_t structure with the capabilities of this PF. We
  * must also provide the memory for it; however, we don't need to keep it around
@@ -1096,15 +1121,20 @@
  * Free receive & transmit rings.
  */
 static void
 i40e_free_trqpairs(i40e_t *i40e)
 {
-        int i;
         i40e_trqpair_t *itrq;
 
+        if (i40e->i40e_rx_groups != NULL) {
+                kmem_free(i40e->i40e_rx_groups,
+                    sizeof (i40e_rx_group_t) * i40e->i40e_num_rx_groups);
+                i40e->i40e_rx_groups = NULL;
+        }
+
         if (i40e->i40e_trqpairs != NULL) {
-                for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
+                for (uint_t i = 0; i < i40e->i40e_num_trqpairs; i++) {
                         itrq = &i40e->i40e_trqpairs[i];
                         mutex_destroy(&itrq->itrq_rx_lock);
                         mutex_destroy(&itrq->itrq_tx_lock);
                         mutex_destroy(&itrq->itrq_tcb_lock);
 
@@ -1131,11 +1161,10 @@
  * need.
  */
 static boolean_t
 i40e_alloc_trqpairs(i40e_t *i40e)
 {
-        int i;
         void *mutexpri = DDI_INTR_PRI(i40e->i40e_intr_pri);
 
         /*
          * Now that we have the priority for the interrupts, initialize
          * all relevant locks.
@@ -1144,39 +1173,52 @@
         mutex_init(&i40e->i40e_rx_pending_lock, NULL, MUTEX_DRIVER, mutexpri);
         cv_init(&i40e->i40e_rx_pending_cv, NULL, CV_DRIVER, NULL);
 
         i40e->i40e_trqpairs = kmem_zalloc(sizeof (i40e_trqpair_t) *
             i40e->i40e_num_trqpairs, KM_SLEEP);
-        for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
+        for (uint_t i = 0; i < i40e->i40e_num_trqpairs; i++) {
                 i40e_trqpair_t *itrq = &i40e->i40e_trqpairs[i];
 
                 itrq->itrq_i40e = i40e;
                 mutex_init(&itrq->itrq_rx_lock, NULL, MUTEX_DRIVER, mutexpri);
                 mutex_init(&itrq->itrq_tx_lock, NULL, MUTEX_DRIVER, mutexpri);
                 mutex_init(&itrq->itrq_tcb_lock, NULL, MUTEX_DRIVER, mutexpri);
                 itrq->itrq_index = i;
         }
 
+        i40e->i40e_rx_groups = kmem_zalloc(sizeof (i40e_rx_group_t) *
+            i40e->i40e_num_rx_groups, KM_SLEEP);
+
+        for (uint_t i = 0; i < i40e->i40e_num_rx_groups; i++) {
+                i40e_rx_group_t *rxg = &i40e->i40e_rx_groups[i];
+
+                rxg->irg_index = i;
+                rxg->irg_i40e = i40e;
+        }
+
         return (B_TRUE);
 }
 
 
 
 /*
  * Unless a .conf file already overrode i40e_t structure values, they will
  * be 0, and need to be set in conjunction with the now-available HW report.
- *
- * However, at the moment, we cap all of these resources as we only support a
- * single receive ring and a single group.
  */
 /* ARGSUSED */
 static void
 i40e_hw_to_instance(i40e_t *i40e, i40e_hw_t *hw)
 {
-        if (i40e->i40e_num_trqpairs == 0) {
-                i40e->i40e_num_trqpairs = I40E_TRQPAIR_MAX;
+        if (i40e->i40e_num_trqpairs_per_vsi == 0) {
+                if (i40e_is_x722(i40e)) {
+                        i40e->i40e_num_trqpairs_per_vsi =
+                            I40E_722_MAX_TC_QUEUES;
+                } else {
+                        i40e->i40e_num_trqpairs_per_vsi =
+                            I40E_710_MAX_TC_QUEUES;
         }
+        }
 
         if (i40e->i40e_num_rx_groups == 0) {
                 i40e->i40e_num_rx_groups = I40E_GROUP_MAX;
         }
 }
@@ -1307,16 +1349,15 @@
                     rc);
                 return (B_FALSE);
         }
 
         /*
-         * We need to obtain the Virtual Station ID (VSI) before we can
-         * perform other operations on the device.
+         * We need to obtain the SEID of the default Virtual Station
+         * Interface (VSI) before we can operate on the device.
          */
-        i40e->i40e_vsi_id = i40e_get_vsi_id(i40e);
-        if (i40e->i40e_vsi_id == -1) {
-                i40e_error(i40e, "failed to obtain VSI ID");
+        if (!i40e_set_def_vsi_seid(i40e)) {
+                i40e_error(i40e, "failed to obtain Default VSI SEID");
                 return (B_FALSE);
         }
 
         return (B_TRUE);
 }
@@ -1557,10 +1598,13 @@
             I40E_DEF_RX_LIMIT_PER_INTR);
 
         i40e->i40e_tx_hcksum_enable = i40e_get_prop(i40e, "tx_hcksum_enable",
             B_FALSE, B_TRUE, B_TRUE);
 
+        i40e->i40e_tx_lso_enable = i40e_get_prop(i40e, "tx_lso_enable",
+            B_FALSE, B_TRUE, B_TRUE);
+
         i40e->i40e_rx_hcksum_enable = i40e_get_prop(i40e, "rx_hcksum_enable",
             B_FALSE, B_TRUE, B_TRUE);
 
         i40e->i40e_rx_dma_min = i40e_get_prop(i40e, "rx_dma_threshold",
             I40E_MIN_RX_DMA_THRESH, I40E_MAX_RX_DMA_THRESH,
@@ -1726,27 +1770,69 @@
                     rc);
                 return (B_FALSE);
         }
 
         i40e->i40e_intr_type = 0;
+        i40e->i40e_num_rx_groups = I40E_GROUP_MAX;
 
+        /*
+         * We need to determine the number of queue pairs per traffic
+         * class. We only have one traffic class (TC0), so we'll base
+         * this off the number of interrupts provided. Furthermore,
+         * since we only use one traffic class, the number of queues
+         * per traffic class and per VSI are the same.
+         */
         if ((intr_types & DDI_INTR_TYPE_MSIX) &&
-            i40e->i40e_intr_force <= I40E_INTR_MSIX) {
-                if (i40e_alloc_intr_handles(i40e, devinfo,
-                    DDI_INTR_TYPE_MSIX)) {
-                        i40e->i40e_num_trqpairs =
-                            MIN(i40e->i40e_intr_count - 1, max_trqpairs);
+            (i40e->i40e_intr_force <= I40E_INTR_MSIX) &&
+            (i40e_alloc_intr_handles(i40e, devinfo, DDI_INTR_TYPE_MSIX))) {
+                uint32_t n;
+
+                /*
+                 * While we want the number of queue pairs to match
+                 * the number of interrupts, we must stay within the
+                 * bounds of the maximum number of queues per traffic
+                 * class. We subtract one from i40e_intr_count to
+                 * account for interrupt zero, which is currently
+                 * restricted to admin queue commands and other
+                 * interrupt causes.
+                 */
+                n = MIN(i40e->i40e_intr_count - 1, max_trqpairs);
+                ASSERT3U(n, >, 0);
+
+                /*
+                 * Round up to the nearest power of two to ensure that
+                 * the QBASE aligns with the TC size which must be
+                 * programmed as a power of two. See the queue mapping
+                 * description in section 7.4.9.5.5.1.
+                 *
+                 * If i40e_intr_count - 1 is not a power of two then
+                 * some queue pairs on the same VSI will have to share
+                 * an interrupt.
+                 *
+                 * We may want to revisit this logic in a future where
+                 * we have more interrupts and more VSIs. Otherwise,
+                 * each VSI will use as many interrupts as possible.
+                 * Using more QPs per VSI means better RSS for each
+                 * group, but at the same time may require more
+                 * sharing of interrupts across VSIs. This may be a
+                 * good candidate for a .conf tunable.
+                 */
+                n = 0x1 << ddi_fls(n);
+                i40e->i40e_num_trqpairs_per_vsi = n;
+                ASSERT3U(i40e->i40e_num_rx_groups, >, 0);
+                i40e->i40e_num_trqpairs = i40e->i40e_num_trqpairs_per_vsi *
+                    i40e->i40e_num_rx_groups;
                         return (B_TRUE);
                 }
-        }
 
         /*
          * We only use multiple transmit/receive pairs when MSI-X interrupts are
          * available due to the fact that the device basically only supports a
          * single MSI interrupt.
          */
         i40e->i40e_num_trqpairs = I40E_TRQPAIR_NOMSIX;
+        i40e->i40e_num_trqpairs_per_vsi = i40e->i40e_num_trqpairs;
         i40e->i40e_num_rx_groups = I40E_GROUP_NOMSIX;
 
         if ((intr_types & DDI_INTR_TYPE_MSI) &&
             (i40e->i40e_intr_force <= I40E_INTR_MSI)) {
                 if (i40e_alloc_intr_handles(i40e, devinfo, DDI_INTR_TYPE_MSI))
@@ -1765,28 +1851,24 @@
  * Map different interrupts to MSI-X vectors.
  */
 static boolean_t
 i40e_map_intrs_to_vectors(i40e_t *i40e)
 {
-        int i;
-
         if (i40e->i40e_intr_type != DDI_INTR_TYPE_MSIX) {
                 return (B_TRUE);
         }
 
         /*
-         * Each queue pair is mapped to a single interrupt, so transmit
-         * and receive interrupts for a given queue share the same vector.
-         * The number of queue pairs is one less than the number of interrupt
-         * vectors and is assigned the vector one higher than its index.
-         * Vector zero is reserved for the admin queue.
+         * Each queue pair is mapped to a single interrupt, so
+         * transmit and receive interrupts for a given queue share the
+         * same vector. Vector zero is reserved for the admin queue.
          */
-        ASSERT(i40e->i40e_intr_count == i40e->i40e_num_trqpairs + 1);
+        for (uint_t i = 0; i < i40e->i40e_num_trqpairs; i++) {
+                uint_t vector = i % (i40e->i40e_intr_count - 1);
 
-        for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
-                i40e->i40e_trqpairs[i].itrq_rx_intrvec = i + 1;
-                i40e->i40e_trqpairs[i].itrq_tx_intrvec = i + 1;
+                i40e->i40e_trqpairs[i].itrq_rx_intrvec = vector + 1;
+                i40e->i40e_trqpairs[i].itrq_tx_intrvec = vector + 1;
         }
 
         return (B_TRUE);
 }
 
@@ -1921,110 +2003,324 @@
 i40e_init_macaddrs(i40e_t *i40e, i40e_hw_t *hw)
 {
 }
 
 /*
- * Configure the hardware for the Virtual Station Interface (VSI).  Currently
- * we only support one, but in the future we could instantiate more than one
- * per attach-point.
+ * Set the properties which have common values across all the VSIs.
+ * Consult the "Add VSI" command section (7.4.9.5.5.1) for a
+ * complete description of these properties.
  */
-static boolean_t
-i40e_config_vsi(i40e_t *i40e, i40e_hw_t *hw)
+static void
+i40e_set_shared_vsi_props(i40e_t *i40e,
+    struct i40e_aqc_vsi_properties_data *info, uint_t vsi_idx)
 {
-        struct i40e_vsi_context context;
-        int err, tc_queues;
+        uint_t tc_queues;
+        uint16_t vsi_qp_base;
 
-        bzero(&context, sizeof (struct i40e_vsi_context));
-        context.seid = i40e->i40e_vsi_id;
-        context.pf_num = hw->pf_id;
-        err = i40e_aq_get_vsi_params(hw, &context, NULL);
-        if (err != I40E_SUCCESS) {
-                i40e_error(i40e, "get VSI params failed with %d", err);
-                return (B_FALSE);
-        }
+        /*
+         * It's important that we use bitwise-OR here; callers to this
+         * function might enable other sections before calling this
+         * function.
+         */
+        info->valid_sections |= LE_16(I40E_AQ_VSI_PROP_QUEUE_MAP_VALID |
+            I40E_AQ_VSI_PROP_VLAN_VALID);
 
-        i40e->i40e_vsi_num = context.vsi_number;
-
         /*
-         * Set the queue and traffic class bits.  Keep it simple for now.
+         * Calculate the starting QP index for this VSI. This base is
+         * relative to the PF queue space; so a value of 0 for PF#1
+         * represents the absolute index PFLAN_QALLOC_FIRSTQ for PF#1.
          */
-        context.info.valid_sections = I40E_AQ_VSI_PROP_QUEUE_MAP_VALID;
-        context.info.mapping_flags = I40E_AQ_VSI_QUE_MAP_CONTIG;
-        context.info.queue_mapping[0] = I40E_ASSIGN_ALL_QUEUES;
+        vsi_qp_base = vsi_idx * i40e->i40e_num_trqpairs_per_vsi;
+        info->mapping_flags = LE_16(I40E_AQ_VSI_QUE_MAP_CONTIG);
+        info->queue_mapping[0] =
+            LE_16((vsi_qp_base << I40E_AQ_VSI_QUEUE_SHIFT) &
+                I40E_AQ_VSI_QUEUE_MASK);
 
         /*
-         * tc_queues determines the size of the traffic class, where the size is
-         * 2^^tc_queues to a maximum of 64 for the X710 and 128 for the X722.
+         * tc_queues determines the size of the traffic class, where
+         * the size is 2^^tc_queues to a maximum of 64 for the X710
+         * and 128 for the X722.
          *
          * Some examples:
-         *      i40e_num_trqpairs == 1 =>  tc_queues = 0, 2^^0 = 1.
-         *      i40e_num_trqpairs == 7 =>  tc_queues = 3, 2^^3 = 8.
-         *      i40e_num_trqpairs == 8 =>  tc_queues = 3, 2^^3 = 8.
-         *      i40e_num_trqpairs == 9 =>  tc_queues = 4, 2^^4 = 16.
-         *      i40e_num_trqpairs == 17 => tc_queues = 5, 2^^5 = 32.
-         *      i40e_num_trqpairs == 64 => tc_queues = 6, 2^^6 = 64.
+         *      i40e_num_trqpairs_per_vsi == 1 =>  tc_queues = 0, 2^^0 = 1.
+         *      i40e_num_trqpairs_per_vsi == 7 =>  tc_queues = 3, 2^^3 = 8.
+         *      i40e_num_trqpairs_per_vsi == 8 =>  tc_queues = 3, 2^^3 = 8.
+         *      i40e_num_trqpairs_per_vsi == 9 =>  tc_queues = 4, 2^^4 = 16.
+         *      i40e_num_trqpairs_per_vsi == 17 => tc_queues = 5, 2^^5 = 32.
+         *      i40e_num_trqpairs_per_vsi == 64 => tc_queues = 6, 2^^6 = 64.
          */
-        tc_queues = ddi_fls(i40e->i40e_num_trqpairs - 1);
+        tc_queues = ddi_fls(i40e->i40e_num_trqpairs_per_vsi - 1);
 
-        context.info.tc_mapping[0] = ((0 << I40E_AQ_VSI_TC_QUE_OFFSET_SHIFT) &
+        /*
+         * The TC queue mapping is in relation to the VSI queue space.
+         * Since we are only using one traffic class (TC0) we always
+         * start at queue offset 0.
+         */
+        info->tc_mapping[0] =
+            LE_16(((0 << I40E_AQ_VSI_TC_QUE_OFFSET_SHIFT) &
             I40E_AQ_VSI_TC_QUE_OFFSET_MASK) |
             ((tc_queues << I40E_AQ_VSI_TC_QUE_NUMBER_SHIFT) &
-            I40E_AQ_VSI_TC_QUE_NUMBER_MASK);
+                    I40E_AQ_VSI_TC_QUE_NUMBER_MASK));
 
-        context.info.valid_sections |= I40E_AQ_VSI_PROP_VLAN_VALID;
-        context.info.port_vlan_flags = I40E_AQ_VSI_PVLAN_MODE_ALL |
+        /*
+         * I40E_AQ_VSI_PVLAN_MODE_ALL ("VLAN driver insertion mode")
+         *
+         *      Allow tagged and untagged packets to be sent to this
+         *      VSI from the host.
+         *
+         * I40E_AQ_VSI_PVLAN_EMOD_NOTHING ("VLAN and UP expose mode")
+         *
+         *      Leave the tag on the frame and place no VLAN
+         *      information in the descriptor. We want this mode
+         *      because our MAC layer will take care of the VLAN tag,
+         *      if there is one.
+         */
+        info->port_vlan_flags = I40E_AQ_VSI_PVLAN_MODE_ALL |
             I40E_AQ_VSI_PVLAN_EMOD_NOTHING;
+}
 
-        context.flags = LE16_TO_CPU(I40E_AQ_VSI_TYPE_PF);
+/*
+ * Delete the VSI at this index, if one exists. We assume there is no
+ * action we can take if this command fails but to log the failure.
+ */
+static void
+i40e_delete_vsi(i40e_t *i40e, uint_t idx)
+{
+        i40e_hw_t       *hw = &i40e->i40e_hw_space;
+        uint16_t        seid = i40e->i40e_vsis[idx].iv_seid;
 
-        i40e->i40e_vsi_stat_id = LE16_TO_CPU(context.info.stat_counter_idx);
-        if (i40e_stat_vsi_init(i40e) == B_FALSE)
-                return (B_FALSE);
+        if (seid != 0) {
+                int rc;
 
-        err = i40e_aq_update_vsi_params(hw, &context, NULL);
-        if (err != I40E_SUCCESS) {
-                i40e_error(i40e, "Update VSI params failed with %d", err);
+                rc = i40e_aq_delete_element(hw, seid, NULL);
+
+                if (rc != I40E_SUCCESS) {
+                        i40e_error(i40e, "Failed to delete VSI %u: %d (%d)",
+                            seid, rc, hw->aq.asq_last_status);
+                }
+
+                i40e->i40e_vsis[idx].iv_seid = 0;
+        }
+}
+
+/*
+ * Add a new VSI.
+ */
+static boolean_t
+i40e_add_vsi(i40e_t *i40e, i40e_hw_t *hw, uint_t idx)
+{
+        struct i40e_vsi_context ctx;
+        i40e_rx_group_t         *rxg;
+        int                     rc;
+
+        /*
+         * The default VSI is created by the controller. This function
+         * creates new, non-default VSIs only.
+         */
+        ASSERT3U(idx, !=, 0);
+
+        bzero(&ctx, sizeof (struct i40e_vsi_context));
+        ctx.uplink_seid = i40e->i40e_veb_seid;
+        ctx.pf_num = hw->pf_id;
+        ctx.flags = I40E_AQ_VSI_TYPE_PF;
+        ctx.connection_type = I40E_AQ_VSI_CONN_TYPE_NORMAL;
+        i40e_set_shared_vsi_props(i40e, &ctx.info, idx);
+
+        rc = i40e_aq_add_vsi(hw, &ctx, NULL);
+        if (rc != I40E_SUCCESS) {
+                i40e_error(i40e, "i40e_aq_add_vsi() failed %d: %d", rc,
+                    hw->aq.asq_last_status);
                 return (B_FALSE);
         }
 
+        rxg = &i40e->i40e_rx_groups[idx];
+        rxg->irg_vsi_seid = ctx.seid;
+        i40e->i40e_vsis[idx].iv_number = ctx.vsi_number;
+        i40e->i40e_vsis[idx].iv_seid = ctx.seid;
+        i40e->i40e_vsis[idx].iv_stats_id = LE_16(ctx.info.stat_counter_idx);
 
+        if (i40e_stat_vsi_init(i40e, idx) == B_FALSE)
+                return (B_FALSE);
+
         return (B_TRUE);
 }
 
 /*
- * Configure the RSS key. For the X710 controller family, this is set on a
- * per-PF basis via registers. For the X722, this is done on a per-VSI basis
- * through the admin queue.
+ * Configure the hardware for the Default Virtual Station Interface (VSI).
  */
 static boolean_t
-i40e_config_rss_key(i40e_t *i40e, i40e_hw_t *hw)
+i40e_config_def_vsi(i40e_t *i40e, i40e_hw_t *hw)
 {
-        uint32_t seed[I40E_PFQF_HKEY_MAX_INDEX + 1];
+        struct i40e_vsi_context ctx;
+        i40e_rx_group_t *def_rxg;
+        int err;
+        struct i40e_aqc_remove_macvlan_element_data filt;
 
-        (void) random_get_pseudo_bytes((uint8_t *)seed, sizeof (seed));
+        bzero(&ctx, sizeof (struct i40e_vsi_context));
+        ctx.seid = I40E_DEF_VSI_SEID(i40e);
+        ctx.pf_num = hw->pf_id;
+        err = i40e_aq_get_vsi_params(hw, &ctx, NULL);
+        if (err != I40E_SUCCESS) {
+                i40e_error(i40e, "get VSI params failed with %d", err);
+                return (B_FALSE);
+        }
 
-        if (i40e_is_x722(i40e)) {
+        ctx.info.valid_sections = 0;
+        i40e->i40e_vsis[0].iv_number = ctx.vsi_number;
+        i40e->i40e_vsis[0].iv_stats_id = LE_16(ctx.info.stat_counter_idx);
+        if (i40e_stat_vsi_init(i40e, 0) == B_FALSE)
+                return (B_FALSE);
+
+        i40e_set_shared_vsi_props(i40e, &ctx.info, I40E_DEF_VSI_IDX);
+
+        err = i40e_aq_update_vsi_params(hw, &ctx, NULL);
+        if (err != I40E_SUCCESS) {
+                i40e_error(i40e, "Update VSI params failed with %d", err);
+                return (B_FALSE);
+        }
+
+        def_rxg = &i40e->i40e_rx_groups[0];
+        def_rxg->irg_vsi_seid = I40E_DEF_VSI_SEID(i40e);
+
+        /*
+         * We have seen three different behaviors in regards to the
+         * Default VSI and its implicit L2 MAC+VLAN filter.
+         *
+         * 1. It has an implicit filter for the factory MAC address
+         *    and this filter counts against 'ifr_nmacfilt_used'.
+         *
+         * 2. It has an implicit filter for the factory MAC address
+         *    and this filter DOES NOT count against 'ifr_nmacfilt_used'.
+         *
+         * 3. It DOES NOT have an implicit filter.
+         *
+         * All three of these cases are accounted for below. If we
+         * fail to remove the L2 filter (ENOENT) then we assume there
+         * wasn't one. Otherwise, if we successfully remove the
+         * filter, we make sure to update the 'ifr_nmacfilt_used'
+         * count accordingly.
+         *
+         * We remove this filter to prevent duplicate delivery of
+         * packets destined for the primary MAC address as DLS will
+         * create the same filter on a non-default VSI for the primary
+         * MAC client.
+         *
+         * If you change the following code please test it across as
+         * many X700 series controllers and firmware revisions as you
+         * can.
+         */
+        bzero(&filt, sizeof (filt));
+        bcopy(hw->mac.port_addr, filt.mac_addr, ETHERADDRL);
+        filt.flags = I40E_AQC_MACVLAN_DEL_PERFECT_MATCH;
+        filt.vlan_tag = 0;
+
+        ASSERT3U(i40e->i40e_resources.ifr_nmacfilt_used, <=, 1);
+        i40e_log(i40e, "Num L2 filters: %u",
+            i40e->i40e_resources.ifr_nmacfilt_used);
+
+        err = i40e_aq_remove_macvlan(hw, I40E_DEF_VSI_SEID(i40e), &filt, 1,
+            NULL);
+        if (err == I40E_SUCCESS) {
+                i40e_log(i40e,
+                    "Removed L2 filter from Default VSI with SEID %u",
+                    I40E_DEF_VSI_SEID(i40e));
+        } else if (hw->aq.asq_last_status == ENOENT) {
+                i40e_log(i40e,
+                    "No L2 filter for Default VSI with SEID %u",
+                    I40E_DEF_VSI_SEID(i40e));
+        } else {
+                i40e_error(i40e, "Failed to remove L2 filter from"
+                    " Default VSI with SEID %u: %d (%d)",
+                    I40E_DEF_VSI_SEID(i40e), err, hw->aq.asq_last_status);
+
+                return (B_FALSE);
+        }
+
+        /*
+         * As mentioned above, the controller created an implicit L2
+         * filter for the primary MAC. We want to both remove the
+         * filter and decrement the filter count. However, not all
+         * controllers count this implicit filter against the total
+         * MAC filter count. So here we are making sure it is either
+         * one or zero. If it is one, then we know it is for the
+         * implicit filter and we should decrement since we just
+         * removed the filter above. If it is zero then we know the
+         * controller does not count the implicit filter, and it
+         * was enough to just remove it; we leave the count alone.
+         * But if it is neither, then we have never seen a controller
+         * like this before and we should fail to attach.
+         *
+         * It is unfortunate that this code must exist but the
+         * behavior of this implicit L2 filter and its corresponding
+         * count was discovered through empirical testing. The
+         * programming manuals hint at this filter but do not
+         * explicitly call out the exact behavior.
+         */
+        if (i40e->i40e_resources.ifr_nmacfilt_used == 1) {
+                i40e->i40e_resources.ifr_nmacfilt_used--;
+        } else {
+                if (i40e->i40e_resources.ifr_nmacfilt_used != 0) {
+                        i40e_error(i40e, "Unexpected L2 filter count: %u"
+                            " (expected 0)",
+                            i40e->i40e_resources.ifr_nmacfilt_used);
+                        return (B_FALSE);
+                }
+        }
+
+        return (B_TRUE);
+}
+
+static boolean_t
+i40e_config_rss_key_x722(i40e_t *i40e, i40e_hw_t *hw)
+{
+        for (uint_t i = 0; i < i40e->i40e_num_rx_groups; i++) {
+                uint32_t seed[I40E_PFQF_HKEY_MAX_INDEX + 1];
                 struct i40e_aqc_get_set_rss_key_data key;
-                const char *u8seed = (char *)seed;
+                const char *u8seed;
                 enum i40e_status_code status;
+                uint16_t vsi_number = i40e->i40e_vsis[i].iv_number;
 
+                (void) random_get_pseudo_bytes((uint8_t *)seed, sizeof (seed));
+                u8seed = (char *)seed;
+
                 CTASSERT(sizeof (key) >= (sizeof (key.standard_rss_key) +
                     sizeof (key.extended_hash_key)));
 
                 bcopy(u8seed, key.standard_rss_key,
                     sizeof (key.standard_rss_key));
                 bcopy(&u8seed[sizeof (key.standard_rss_key)],
                     key.extended_hash_key, sizeof (key.extended_hash_key));
 
-                status = i40e_aq_set_rss_key(hw, i40e->i40e_vsi_num, &key);
+                ASSERT3U(vsi_number, !=, 0);
+                status = i40e_aq_set_rss_key(hw, vsi_number, &key);
+
                 if (status != I40E_SUCCESS) {
-                        i40e_error(i40e, "failed to set rss key: %d", status);
+                        i40e_error(i40e, "failed to set RSS key for VSI %u: %d",
+                            vsi_number, status);
                         return (B_FALSE);
                 }
+        }
+
+        return (B_TRUE);
+}
+
+/*
+ * Configure the RSS key. For the X710 controller family, this is set on a
+ * per-PF basis via registers. For the X722, this is done on a per-VSI basis
+ * through the admin queue.
+ */
+static boolean_t
+i40e_config_rss_key(i40e_t *i40e, i40e_hw_t *hw)
+{
+        if (i40e_is_x722(i40e)) {
+                if (!i40e_config_rss_key_x722(i40e, hw))
+                        return (B_FALSE);
         } else {
-                uint_t i;
-                for (i = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++)
+                uint32_t seed[I40E_PFQF_HKEY_MAX_INDEX + 1];
+
+                (void) random_get_pseudo_bytes((uint8_t *)seed, sizeof (seed));
+                for (uint_t i = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++)
                         i40e_write_rx_ctl(hw, I40E_PFQF_HKEY(i), seed[i]);
         }
 
         return (B_TRUE);
 }
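Reviewer note: the X722 path above carves one random seed buffer into the two fields of the admin queue key structure. The standalone sketch below illustrates only that split. The field sizes (40 and 12 bytes), the `rss_key` struct, and the `split_seed()` helper are assumptions made for illustration and are not taken from the driver headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes for illustration; the driver uses sizeof on the AQ struct. */
#define	STD_KEY_LEN	40
#define	EXT_KEY_LEN	12
#define	SEED_WORDS	((STD_KEY_LEN + EXT_KEY_LEN) / sizeof (uint32_t))

/* Hypothetical stand-in for the admin queue RSS key structure. */
struct rss_key {
	uint8_t standard_rss_key[STD_KEY_LEN];
	uint8_t extended_hash_key[EXT_KEY_LEN];
};

/*
 * Copy the first STD_KEY_LEN bytes of the seed into the standard key and
 * the next EXT_KEY_LEN bytes into the extended key.
 */
static void
split_seed(const uint32_t *seed, struct rss_key *key)
{
	const uint8_t *bytes = (const uint8_t *)seed;

	memcpy(key->standard_rss_key, bytes, sizeof (key->standard_rss_key));
	memcpy(key->extended_hash_key, &bytes[sizeof (key->standard_rss_key)],
	    sizeof (key->extended_hash_key));
}
```

The seed must hold at least `SEED_WORDS` 32-bit words, which mirrors the CTASSERT in the diff that checks the key structure is large enough for both fields.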
@@ -2032,15 +2328,16 @@
 /*
  * Populate the LUT. The size of each entry in the LUT depends on the controller
  * family, with the X722 using a known 7-bit width. On the X710 controller, this
  * is programmed through its control registers whereas on the X722 this is
  * configured through the admin queue. Also of note, the X722 allows the LUT to
- * be set on a per-PF or VSI basis. At this time, as we only have a single VSI,
- * we use the PF setting as it is the primary VSI.
+ * be set on a per-PF or VSI basis. At this time we use the PF setting. If we
+ * decide to use the per-VSI LUT in the future, then we will need to modify the
+ * i40e_add_vsi() function to set the RSS LUT bits in the queueing section.
  *
  * We populate the LUT in a round robin fashion with the rx queue indices from 0
- * to i40e_num_trqpairs - 1.
+ * to i40e_num_trqpairs_per_vsi - 1.
  */
 static boolean_t
 i40e_config_rss_hlut(i40e_t *i40e, i40e_hw_t *hw)
 {
         uint32_t *hlut;
@@ -2066,19 +2363,24 @@
                 lut_mask = (1 << 7) - 1;
         } else {
                 lut_mask = (1 << hw->func_caps.rss_table_entry_width) - 1;
         }
 
-        for (i = 0; i < I40E_HLUT_TABLE_SIZE; i++)
-                ((uint8_t *)hlut)[i] = (i % i40e->i40e_num_trqpairs) & lut_mask;
+        for (i = 0; i < I40E_HLUT_TABLE_SIZE; i++) {
+                ((uint8_t *)hlut)[i] =
+                    (i % i40e->i40e_num_trqpairs_per_vsi) & lut_mask;
+        }
 
         if (i40e_is_x722(i40e)) {
                 enum i40e_status_code status;
-                status = i40e_aq_set_rss_lut(hw, i40e->i40e_vsi_num, B_TRUE,
-                    (uint8_t *)hlut, I40E_HLUT_TABLE_SIZE);
+
+                status = i40e_aq_set_rss_lut(hw, 0, B_TRUE, (uint8_t *)hlut,
+                    I40E_HLUT_TABLE_SIZE);
+
                 if (status != I40E_SUCCESS) {
-                        i40e_error(i40e, "failed to set RSS LUT: %d", status);
+                        i40e_error(i40e, "failed to set RSS LUT %d: %d",
+                            status, hw->aq.asq_last_status);
                         goto out;
                 }
         } else {
                 for (i = 0; i < I40E_HLUT_TABLE_SIZE >> 2; i++) {
                         I40E_WRITE_REG(hw, I40E_PFQF_HLUT(i), hlut[i]);
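Reviewer note: the round-robin LUT fill in this hunk can be tried in isolation. The sketch below is a minimal standalone version under stated assumptions: `fill_lut()` and its parameters are hypothetical names, and the caller supplies the table size and entry width rather than reading them from the hardware capabilities.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill a lookup table so that entry i maps to queue (i % nqueues), masked
 * to the entry width. Assumes nqueues > 0 and entry_width <= 8.
 */
static void
fill_lut(uint8_t *lut, size_t nentries, unsigned int nqueues,
    unsigned int entry_width)
{
	uint8_t mask = (uint8_t)((1u << entry_width) - 1);

	for (size_t i = 0; i < nentries; i++)
		lut[i] = (uint8_t)((i % nqueues) & mask);
}
```

With four queues per VSI and a 7-bit entry width, the table repeats 0, 1, 2, 3, so every VSI spreads its flows over its own queue indices instead of over the global queue range.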
@@ -2150,10 +2452,11 @@
 i40e_chip_start(i40e_t *i40e)
 {
         i40e_hw_t *hw = &i40e->i40e_hw_space;
         struct i40e_filter_control_settings filter;
         int rc;
+        uint8_t err;
 
         if (((hw->aq.fw_maj_ver == 4) && (hw->aq.fw_min_ver < 33)) ||
             (hw->aq.fw_maj_ver < 4)) {
                 i40e_msec_delay(75);
                 if (i40e_aq_set_link_restart_an(hw, TRUE, NULL) !=
@@ -2165,10 +2468,19 @@
         }
 
         /* Determine hardware state */
         i40e_get_hw_state(i40e, hw);
 
+        /* For now, we always disable Ethernet Flow Control. */
+        hw->fc.requested_mode = I40E_FC_NONE;
+        rc = i40e_set_fc(hw, &err, B_TRUE);
+        if (rc != I40E_SUCCESS) {
+                i40e_error(i40e, "Setting flow control failed, returned %d"
+                    " with error: 0x%x", rc, err);
+                return (B_FALSE);
+        }
+
         /* Initialize mac addresses. */
         i40e_init_macaddrs(i40e, hw);
 
         /*
          * Set up the filter control. If the hash lut size is changed from
@@ -2186,13 +2498,39 @@
                 return (B_FALSE);
         }
 
         i40e_intr_chip_init(i40e);
 
-        if (!i40e_config_vsi(i40e, hw))
+        rc = i40e_get_mac_seid(i40e);
+        if (rc == -1) {
+                i40e_error(i40e, "failed to obtain MAC Uplink SEID");
                 return (B_FALSE);
+        }
+        i40e->i40e_mac_seid = (uint16_t)rc;
 
+        /*
+         * Create a VEB in order to support multiple VSIs. Each VSI
+         * functions as a MAC group. This call sets the PF's MAC as
+         * the uplink port and the PF's default VSI as the default
+         * downlink port.
+         */
+        rc = i40e_aq_add_veb(hw, i40e->i40e_mac_seid, I40E_DEF_VSI_SEID(i40e),
+            0x1, B_TRUE, &i40e->i40e_veb_seid, B_FALSE, NULL);
+        if (rc != I40E_SUCCESS) {
+                i40e_error(i40e, "i40e_aq_add_veb() failed %d: %d", rc,
+                    hw->aq.asq_last_status);
+                return (B_FALSE);
+        }
+
+        if (!i40e_config_def_vsi(i40e, hw))
+                return (B_FALSE);
+
+        for (uint_t i = 1; i < i40e->i40e_num_rx_groups; i++) {
+                if (!i40e_add_vsi(i40e, hw, i))
+                        return (B_FALSE);
+        }
+
         if (!i40e_config_rss(i40e, hw))
                 return (B_FALSE);
 
         i40e_flush(hw);
 
@@ -2547,11 +2885,11 @@
          * traffic class index for the first device. We query the VSI parameters
          * again to get what the handle is. Note that every queue is always
          * assigned to traffic class zero, because we don't actually use them.
          */
         bzero(&context, sizeof (struct i40e_vsi_context));
-        context.seid = i40e->i40e_vsi_id;
+        context.seid = I40E_DEF_VSI_SEID(i40e);
         context.pf_num = hw->pf_id;
         err = i40e_aq_get_vsi_params(hw, &context, NULL);
         if (err != I40E_SUCCESS) {
                 i40e_error(i40e, "get VSI params failed with %d", err);
                 return (B_FALSE);
@@ -2651,11 +2989,12 @@
 }
 
 void
 i40e_stop(i40e_t *i40e, boolean_t free_allocations)
 {
-        int i;
+        uint_t i;
+        i40e_hw_t *hw = &i40e->i40e_hw_space;
 
         ASSERT(MUTEX_HELD(&i40e->i40e_general_lock));
 
         /*
          * Shutdown and drain the tx and rx pipeline. We do this using the
@@ -2687,10 +3026,31 @@
                 ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_LOST);
         }
 
         delay(50 * drv_usectohz(1000));
 
+        /*
+         * We don't delete the default VSI because it replaces the VEB
+         * after VEB deletion (see the "Delete Element" section).
+         * Furthermore, since the default VSI is provided by the
+         * firmware, we never attempt to delete it.
+         */
+        for (i = 1; i < i40e->i40e_num_rx_groups; i++) {
+                i40e_delete_vsi(i40e, i);
+        }
+
+        if (i40e->i40e_veb_seid != 0) {
+                int rc = i40e_aq_delete_element(hw, i40e->i40e_veb_seid, NULL);
+
+                if (rc != I40E_SUCCESS) {
+                        i40e_error(i40e, "Failed to delete VEB %d: %d", rc,
+                            hw->aq.asq_last_status);
+                }
+
+                i40e->i40e_veb_seid = 0;
+        }
+
         i40e_intr_chip_fini(i40e);
 
         for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
                 mutex_enter(&i40e->i40e_trqpairs[i].itrq_rx_lock);
                 mutex_enter(&i40e->i40e_trqpairs[i].itrq_tx_lock);
@@ -2716,11 +3076,13 @@
         for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
                 mutex_exit(&i40e->i40e_trqpairs[i].itrq_rx_lock);
                 mutex_exit(&i40e->i40e_trqpairs[i].itrq_tx_lock);
         }
 
-        i40e_stat_vsi_fini(i40e);
+        for (i = 0; i < i40e->i40e_num_rx_groups; i++) {
+                i40e_stat_vsi_fini(i40e, i);
+        }
 
         i40e->i40e_link_speed = 0;
         i40e->i40e_link_duplex = 0;
         i40e_link_state_set(i40e, LINK_STATE_UNKNOWN);
 
@@ -2781,11 +3143,12 @@
 
         /*
          * Enable broadcast traffic; however, do not enable multicast traffic.
          * That's handled exclusively through MAC's mc_multicst routines.
          */
-        err = i40e_aq_set_vsi_broadcast(hw, i40e->i40e_vsi_id, B_TRUE, NULL);
+        err = i40e_aq_set_vsi_broadcast(hw, I40E_DEF_VSI_SEID(i40e), B_TRUE,
+            NULL);
         if (err != I40E_SUCCESS) {
                 i40e_error(i40e, "failed to set default VSI: %d", err);
                 rc = B_FALSE;
                 goto done;
         }