NEX-20178 Heavy read load using 10G i40e causes network disconnect
MFV illumos-joyent@83a8d0d616db36010b59cc850d1926c0f6a30de1
OS-7457 i40e Tx freezes on zero descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
OS-7492 i40e Tx freeze when b_cont chain exceeds 8 descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@b4bede175d4c50ac1b36078a677b69388f6fb59f
OS-7577 initialize FC for i40e
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@61dc3dec4f82a3e13e94609a0a83d5f66c64e760
OS-6846 want i40e multi-group support
OS-7372 i40e_alloc_ring_mem() unwinds when it shouldn't
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@6f6fae1b433b461a7b014f48ad94fc7f4927c6ed
OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@757454db6669c1186f60bc625510c1b67217aae6
OS-7082 i40e: blown assert in i40e_tx_cleanup_ring()
OS-7086 i40e: add mdb dcmd to dump info on tx descriptor rings
OS-7101 i40e: add kstat to track TX DMA bind failures
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@9e30beee2f0c127bf41868db46257124206e28d6
OS-5225 Want Fortville TSO support
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
        
*** 9,19 ****
   * http://www.illumos.org/license/CDDL.
   */
  
  /*
   * Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
!  * Copyright 2016 Joyent, Inc.
   */
  
  #include "i40e_sw.h"
  
  /*
--- 9,19 ----
   * http://www.illumos.org/license/CDDL.
   */
  
  /*
   * Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
!  * Copyright 2019 Joyent, Inc.
   */
  
  #include "i40e_sw.h"
  
  /*
*** 58,80 ****
   * i40e_t`i40e_sdu changes.
   *
   * This size is then rounded up to the nearest 1k chunk, which represents the
   * actual amount of memory that we'll allocate for a single frame.
   *
!  * Note, that for rx, we do something that might be unexpected. We always add
   * an extra two bytes to the frame size that we allocate. We then offset the DMA
   * address that we receive a packet into by two bytes. This ensures that the IP
   * header will always be 4 byte aligned because the MAC header is either 14 or
   * 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's
   * and MAC's lives easier.
   *
!  * Both the rx and tx descriptor rings (which are what we use to communicate
   * with hardware) are allocated as a single region of DMA memory which is the
   * size of the descriptor (4 bytes and 2 bytes respectively) times the total
!  * number of descriptors for an rx and tx ring.
   *
!  * While the rx and tx descriptors are allocated using DMA-based memory, the
   * control blocks for each of them are allocated using normal kernel memory.
   * They aren't special from a DMA perspective. We'll go over the design of both
   * receiving and transmitting separately, as they have slightly different
   * control blocks and different ways that we manage the relationship between
   * control blocks and descriptors.
--- 58,80 ----
   * i40e_t`i40e_sdu changes.
   *
   * This size is then rounded up to the nearest 1k chunk, which represents the
   * actual amount of memory that we'll allocate for a single frame.
   *
!  * Note that for RX, we do something that might be unexpected. We always add
   * an extra two bytes to the frame size that we allocate. We then offset the DMA
   * address that we receive a packet into by two bytes. This ensures that the IP
   * header will always be 4 byte aligned because the MAC header is either 14 or
   * 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's
   * and MAC's lives easier.
   *
!  * Both the RX and TX descriptor rings (which are what we use to communicate
   * with hardware) are allocated as a single region of DMA memory which is the
   * size of the descriptor (4 bytes and 2 bytes respectively) times the total
!  * number of descriptors for an RX and TX ring.
   *
!  * While the RX and TX descriptors are allocated using DMA-based memory, the
   * control blocks for each of them are allocated using normal kernel memory.
   * They aren't special from a DMA perspective. We'll go over the design of both
   * receiving and transmitting separately, as they have slightly different
   * control blocks and different ways that we manage the relationship between
   * control blocks and descriptors.
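
As a concrete illustration of the sizing and alignment rules above, here is a small stand-alone sketch (hypothetical helper name, not driver code) that rounds a frame allocation up to the next 1 KB chunk and shows why the extra 2-byte offset leaves the IP header 4-byte aligned for both 14-byte and 18-byte MAC headers.

    #include <stdio.h>
    #include <stdint.h>

    /* Round the per-frame allocation up to the next 1 KB chunk. */
    static uint32_t
    rx_buf_size(uint32_t framesz)
    {
            return ((framesz + 1023) & ~1023u);
    }

    int
    main(void)
    {
            uint32_t machdr[2] = { 14, 18 };    /* untagged and 802.1Q tagged */

            /* A 1518-byte frame plus the extra two bytes described above. */
            printf("buffer size: %u bytes\n", rx_buf_size(1518 + 2));

            /*
             * With the receive DMA address offset by two bytes, the IP header
             * starts at offset 2 + MAC header length, a multiple of four in
             * both cases.
             */
            for (int i = 0; i < 2; i++) {
                    uint32_t ipoff = 2 + machdr[i];
                    printf("MAC header %u -> IP header offset %u (%s)\n",
                        machdr[i], ipoff,
                        (ipoff % 4) == 0 ? "aligned" : "unaligned");
            }
            return (0);
    }
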
*** 111,145 ****
   * builds, we allow someone to whack the variable i40e_debug_rx_mode to override
   * the behavior and always do a bcopy or a DMA bind.
   *
   * To try and ensure that the device always has blocks that it can receive data
   * into, we maintain two lists of control blocks, a working list and a free
!  * list. Each list is sized equal to the number of descriptors in the rx ring.
!  * During the GLDv3 mc_start routine, we allocate a number of rx control blocks
   * equal to twice the number of descriptors in the ring and we assign them
   * equally to the free list and to the working list. Each control block also has
   * DMA memory allocated and associated with which it will be used to receive the
   * actual packet data. All of a received frame's data will end up in a single
   * DMA buffer.
   *
!  * During operation, we always maintain the invariant that each rx descriptor
!  * has an associated rx control block which lives in the working list. If we
   * feel that we should loan up DMA memory to MAC in the form of a message block,
   * we can only do so if we can maintain this invariant. To do that, we swap in
   * one of the buffers from the free list. If none are available, then we resort
   * to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the
   * size.
   *
   * Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is
!  * called on the block, at which point we restore the rx control block to the
   * free list and are able to reuse the DMA memory again. While the scheme may
   * seem odd, it importantly keeps us out of trying to do any DMA allocations in
   * the normal path of operation, even though we may still have to allocate
   * message blocks and copy.
   *
!  * The following state machine describes the life time of a rx control block. In
!  * the diagram we abbrviate the rx ring descriptor entry as rxd and the rx
   * control block entry as rcb.
   *
   *             |                                   |
   *             * ... 1/2 of all initial rcb's  ... *
   *             |                                   |
--- 111,145 ----
   * builds, we allow someone to whack the variable i40e_debug_rx_mode to override
   * the behavior and always do a bcopy or a DMA bind.
   *
   * To try and ensure that the device always has blocks that it can receive data
   * into, we maintain two lists of control blocks, a working list and a free
!  * list. Each list is sized equal to the number of descriptors in the RX ring.
!  * During the GLDv3 mc_start routine, we allocate a number of RX control blocks
   * equal to twice the number of descriptors in the ring and we assign them
   * equally to the free list and to the working list. Each control block also has
   * DMA memory allocated and associated with which it will be used to receive the
   * actual packet data. All of a received frame's data will end up in a single
   * DMA buffer.
   *
!  * During operation, we always maintain the invariant that each RX descriptor
!  * has an associated RX control block which lives in the working list. If we
   * feel that we should loan up DMA memory to MAC in the form of a message block,
   * we can only do so if we can maintain this invariant. To do that, we swap in
   * one of the buffers from the free list. If none are available, then we resort
   * to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the
   * size.
   *
   * Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is
!  * called on the block, at which point we restore the RX control block to the
   * free list and are able to reuse the DMA memory again. While the scheme may
   * seem odd, it importantly keeps us out of trying to do any DMA allocations in
   * the normal path of operation, even though we may still have to allocate
   * message blocks and copy.
   *
!  * The following state machine describes the lifetime of an RX control block.
!  * In the diagram we abbreviate the RX ring descriptor entry as rxd and the RX
   * control block entry as rcb.
   *
   *             |                                   |
   *             * ... 1/2 of all initial rcb's  ... *
   *             |                                   |
*** 158,172 ****
   *             |   and it is                      v
   *             |   recycled.              +-------------------+
   *             +--------------------<-----| rcb loaned to MAC |
   *                                        +-------------------+
   *
!  * Finally, note that every rx control block has a reference count on it. One
   * reference is added as long as the driver has had the GLDv3 mc_start endpoint
   * called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and
   * no other DLPI consumers remain, then we'll decrement the reference count by
!  * one. Whenever we loan up the rx control block and associated buffer to MAC,
   * then we bump the reference count again. Even though the device is stopped,
   * there may still be loaned frames in upper levels that we'll want to account
   * for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure
   * that it is cleaned up.
   *
--- 158,172 ----
   *             |   and it is                      v
   *             |   recycled.              +-------------------+
   *             +--------------------<-----| rcb loaned to MAC |
   *                                        +-------------------+
   *
!  * Finally, note that every RX control block has a reference count on it. One
   * reference is added as long as the driver has had the GLDv3 mc_start endpoint
   * called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and
   * no other DLPI consumers remain, then we'll decrement the reference count by
!  * one. Whenever we loan up the RX control block and associated buffer to MAC,
   * then we bump the reference count again. Even though the device is stopped,
   * there may still be loaned frames in upper levels that we'll want to account
   * for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure
   * that it is cleaned up.
   *
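
The reference-count scheme above boils down to: one reference held while the device is started, one more per loaned buffer, and teardown when the count reaches zero. The following is a minimal user-space model of that lifecycle, with hypothetical names and C11 atomics standing in for the driver's own atomic operations; it is a sketch, not the driver code.

    #include <stdatomic.h>
    #include <stdio.h>

    /* One reference while mc_start is active, plus one per rcb loaned to MAC. */
    static atomic_uint rxdata_ref;

    static void
    rxdata_rele(const char *why)
    {
            if (atomic_fetch_sub(&rxdata_ref, 1) == 1)
                    printf("last reference (%s): tear down RX data\n", why);
    }

    int
    main(void)
    {
            atomic_store(&rxdata_ref, 1);           /* mc_start */
            atomic_fetch_add(&rxdata_ref, 1);       /* rcb loaned up to MAC */
            rxdata_rele("mc_stop");                 /* stopped, loan outstanding */
            rxdata_rele("freemsg callback");        /* loaned block finally freed */
            return (0);
    }
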
*** 190,203 ****
   * the HEAD and TAIL, inclusive. Note that while we initially program the HEAD,
   * the only values we ever consult ourselves are the TAIL register and our own
   * state tracking. Effectively, we cache the HEAD register and then update it
   * ourselves based on our work.
   *
!  * When we iterate over the rx descriptors and thus the received frames, we are
   * either in an interrupt context or we've been asked by MAC to poll on the
   * ring. If we've been asked to poll on the ring, we have a maximum number of
!  * bytes of mblk_t's to return. If processing an rx descriptor would cause us to
   * exceed that count, then we do not process it. When in interrupt context, we
   * don't have a strict byte count. However, to ensure liveness, we limit the
   * amount of data based on a configuration value
   * (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this
   * is based on similar numbers that are used for ixgbe. After some additional
--- 190,203 ----
   * the HEAD and TAIL, inclusive. Note that while we initially program the HEAD,
   * the only values we ever consult ourselves are the TAIL register and our own
   * state tracking. Effectively, we cache the HEAD register and then update it
   * ourselves based on our work.
   *
!  * When we iterate over the RX descriptors and thus the received frames, we are
   * either in an interrupt context or we've been asked by MAC to poll on the
   * ring. If we've been asked to poll on the ring, we have a maximum number of
!  * bytes of mblk_t's to return. If processing an RX descriptor would cause us to
   * exceed that count, then we do not process it. When in interrupt context, we
   * don't have a strict byte count. However, to ensure liveness, we limit the
   * amount of data based on a configuration value
   * (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this
   * is based on similar numbers that are used for ixgbe. After some additional
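
The byte-budget check described above amounts to a simple accumulation loop. The sketch below is a hypothetical stand-alone version, not the driver's ring-walk code: in polling mode the budget is the byte count MAC supplied, while in interrupt context it plays the role of the i40e_rx_limit_per_intr cap.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /*
     * Walk ready frames, stopping once accepting the next frame would push
     * the accumulated byte count past the budget.  Returns the number of
     * frames taken.
     */
    static size_t
    rx_take_frames(const uint32_t *framelens, size_t nframes, uint32_t budget)
    {
            uint32_t total = 0;
            size_t i;

            for (i = 0; i < nframes; i++) {
                    if (total + framelens[i] > budget)
                            break;
                    total += framelens[i];
            }
            return (i);
    }

    int
    main(void)
    {
            uint32_t frames[] = { 1514, 1514, 1514, 64 };

            printf("4KB poll budget: %zu frames\n",
                rx_take_frames(frames, 4, 4096));
            printf("64KB interrupt budget: %zu frames\n",
                rx_take_frames(frames, 4, 65536));
            return (0);
    }
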
*** 247,281 ****
   *
   * While the transmit path is similar in spirit to the receive path, it works
   * differently due to the fact that all data is originated by the operating
   * system and not by the device.
   *
!  * Like rx, there is both a descriptor ring that we use to communicate to the
   * driver and which points to the memory used to transmit a frame. Similarly,
!  * there is a corresponding transmit control block. Each transmit control block
!  * has a region of DMA memory allocated to it; however, the way we use it
!  * varies.
   *
   * The driver is asked to process a single frame at a time. That message block
   * may be made up of multiple fragments linked together by the mblk_t`b_cont
   * member. The device has a hard limit of up to 8 buffers being allowed for use
!  * for a single logical frame. For each fragment, we'll try and use an entry
!  * from the tx descriptor ring and then we'll allocate a corresponding tx
!  * control block. Depending on the size of the fragment, we may copy it around
!  * or we might instead try to do DMA binding of the fragment.
   *
!  * If we exceed the number of blocks that fit, we'll try to pull up the block
!  * and then we'll do a DMA bind and send it out.
   *
!  * If we don't have enough space in the ring or tx control blocks available,
   * then we'll return the unprocessed message block to MAC. This will induce flow
   * control and once we recycle enough entries, we'll once again enable sending
   * on the ring.
   *
   * We size the working list as equal to the number of descriptors in the ring.
   * We size the free list as equal to 1.5 times the number of descriptors in the
!  * ring. We'll allocate a number of tx control block entries equal to the number
   * of entries in the free list. By default, all entries are placed in the free
   * list. As we come along and try to send something, we'll allocate entries from
   * the free list and add them to the working list, where they'll stay until the
   * hardware indicates that all of the data has been written back to us. The
   * reason that we start with 1.5x is to help facilitate having more than one TX
--- 247,304 ----
   *
   * While the transmit path is similar in spirit to the receive path, it works
   * differently due to the fact that all data is originated by the operating
   * system and not by the device.
   *
!  * Like RX, there is both a descriptor ring that we use to communicate to the
   * driver and which points to the memory used to transmit a frame.  Similarly,
!  * there is a corresponding transmit control block; however, the correspondence
!  * between descriptors and control blocks is more complex and not necessarily
!  * 1-to-1.
   *
   * The driver is asked to process a single frame at a time. That message block
   * may be made up of multiple fragments linked together by the mblk_t`b_cont
   * member. The device has a hard limit of up to 8 buffers being allowed for use
!  * for a single non-LSO packet or LSO segment. The number of TX ring entries
!  * (and thus TX control blocks) used depends on the fragment sizes and DMA
!  * layout, as explained below.
   *
!  * We alter our DMA strategy based on a threshold tied to the fragment size.
!  * This threshold is configurable via the tx_dma_threshold property. If the
!  * fragment is above the threshold, we DMA bind it -- consuming one TCB and
!  * potentially several data descriptors. The exact number of descriptors (equal
!  * to the number of DMA cookies) depends on page size, MTU size, b_rptr offset
!  * into page, b_wptr offset into page, and the physical layout of the dblk's
!  * memory (contiguous or not). Essentially, we are at the mercy of the DMA
!  * engine and the dblk's memory allocation. Knowing the exact number of
!  * descriptors up front is a task best not taken on by the driver itself.
!  * Instead, we attempt to DMA bind the fragment and verify the descriptor
!  * layout meets hardware constraints. If the proposed DMA bind does not satisfy
!  * the hardware constraints, then we discard it and instead copy the entire
!  * fragment into the pre-allocated TCB buffer (or buffers if the fragment is
!  * larger than the TCB buffer).
   *
!  * If the fragment is below or at the threshold, we copy it to the pre-allocated
!  * buffer of a TCB. We compress consecutive copy fragments into a single TCB to
!  * conserve resources. We are guaranteed that the TCB buffer is made up of only
!  * 1 DMA cookie; and therefore consumes only one descriptor on the controller.
!  *
!  * Furthermore, if the frame requires HW offloads such as LSO, tunneling or
!  * filtering, then the TX data descriptors must be preceded by a single TX
!  * context descriptor.  Because there is no DMA transfer associated with the
!  * context descriptor, we allocate a control block with a special type which
!  * indicates to the TX ring recycle code that there are no associated DMA
!  * resources to unbind when the control block is free'd.
!  *
!  * If we don't have enough space in the ring or TX control blocks available,
   * then we'll return the unprocessed message block to MAC. This will induce flow
   * control and once we recycle enough entries, we'll once again enable sending
   * on the ring.
   *
   * We size the working list as equal to the number of descriptors in the ring.
   * We size the free list as equal to 1.5 times the number of descriptors in the
!  * ring. We'll allocate a number of TX control block entries equal to the number
   * of entries in the free list. By default, all entries are placed in the free
   * list. As we come along and try to send something, we'll allocate entries from
   * the free list and add them to the working list, where they'll stay until the
   * hardware indicates that all of the data has been written back to us. The
   * reason that we start with 1.5x is to help facilitate having more than one TX
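
A rough stand-alone sketch of the descriptor accounting implied by the copy-versus-bind strategy above (hypothetical names; it ignores the context descriptor and assumes each run of copied fragments fits in one TCB buffer): fragments above the threshold are DMA-bound and cost one descriptor per cookie, while consecutive at-or-below-threshold fragments are coalesced into a single copy TCB and cost one descriptor.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /*
     * Count the data descriptors a frame would consume, given per-fragment
     * lengths, a copy/bind threshold, and the number of DMA cookies each
     * bound fragment happens to produce.
     */
    static uint32_t
    tx_count_descs(const uint32_t *fraglen, const uint32_t *ncookies,
        size_t nfrags, uint32_t threshold)
    {
            uint32_t descs = 0;
            int in_copy_run = 0;

            for (size_t i = 0; i < nfrags; i++) {
                    if (fraglen[i] > threshold) {
                            descs += ncookies[i];   /* DMA bind */
                            in_copy_run = 0;
                    } else if (!in_copy_run) {
                            descs++;                /* start a new copy TCB */
                            in_copy_run = 1;
                    }
                    /* else: appended to the current copy TCB, no new descriptor */
            }
            return (descs);
    }

    int
    main(void)
    {
            uint32_t len[] = { 80, 60, 2800, 80 };
            uint32_t cookies[] = { 1, 1, 2, 1 };

            printf("descriptors needed: %u\n",
                tx_count_descs(len, cookies, 4, 256));
            return (0);
    }
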
*** 323,356 ****
   *             |
   *             v
   *    +------------------+                       +------------------+
   *    | tcb on free list |---*------------------>| tcb on work list |
   *    +------------------+   .                   +------------------+
!  *             ^             . tcb allocated               |
   *             |               to send frame               v
   *             |               or fragment on              |
   *             |               wire, mblk from             |
   *             |               MAC associated.             |
   *             |                                           |
   *             +------*-------------------------------<----+
   *                    .
   *                    . Hardware indicates
   *                      entry transmitted.
!  *                      tcb recycled, mblk
   *                      from MAC freed.
   *
   * ------------
   * Blocking MAC
   * ------------
   *
!  * Wen performing transmit, we can run out of descriptors and ring entries. When
!  * such a case happens, we return the mblk_t to MAC to indicate that we've been
!  * blocked. At that point in time, MAC becomes blocked and will not transmit
!  * anything out that specific ring until we notify MAC. To indicate that we're
!  * in such a situation we set i40e_trqpair_t`itrq_tx_blocked member to B_TRUE.
   *
!  * When we recycle tx descriptors then we'll end up signaling MAC by calling
   * mac_tx_ring_update() if we were blocked, letting it know that it's safe to
   * start sending frames out to us again.
   */
  
  /*
--- 346,386 ----
   *             |
   *             v
   *    +------------------+                       +------------------+
   *    | tcb on free list |---*------------------>| tcb on work list |
   *    +------------------+   .                   +------------------+
!  *             ^             . N tcbs allocated[1]         |
   *             |               to send frame               v
   *             |               or fragment on              |
   *             |               wire, mblk from             |
   *             |               MAC associated.             |
   *             |                                           |
   *             +------*-------------------------------<----+
   *                    .
   *                    . Hardware indicates
   *                      entry transmitted.
!  *                      tcbs recycled, mblk
   *                      from MAC freed.
   *
+  * [1] We allocate N tcbs to transmit a single frame where N can be 1 context
+  *     descriptor plus 1 data descriptor, in the non-DMA-bind case.  In the DMA
+  *     bind case, N can be 1 context descriptor plus 1 data descriptor per
+  *     b_cont in the mblk.  In this case, the mblk is associated with the first
+  *     data descriptor and freed as part of freeing that data descriptor.
+  *
   * ------------
   * Blocking MAC
   * ------------
   *
!  * When performing transmit, we can run out of descriptors and ring entries.
!  * When such a case happens, we return the mblk_t to MAC to indicate that we've
!  * been blocked. At that point in time, MAC becomes blocked and will not
!  * transmit anything out that specific ring until we notify MAC. To indicate
!  * that we're in such a situation we set i40e_trqpair_t`itrq_tx_blocked member
!  * to B_TRUE.
   *
!  * When we recycle TX descriptors then we'll end up signaling MAC by calling
   * mac_tx_ring_update() if we were blocked, letting it know that it's safe to
   * start sending frames out to us again.
   */
  
  /*
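
The blocking behaviour described above can be modelled in a few lines of stand-alone C. This is a toy sketch with hypothetical names; the printf stands in for the mac_tx_ring_update() notification the driver issues once it has recycled enough descriptors.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal stand-in for the per-ring TX state discussed above. */
    typedef struct {
            uint32_t        desc_free;
            bool            tx_blocked;
    } toy_ring_t;

    /* Returns true if the frame was queued, false if the caller must hold it. */
    static bool
    toy_tx(toy_ring_t *r, uint32_t descs_needed)
    {
            if (r->desc_free < descs_needed) {
                    r->tx_blocked = true;   /* itrq_tx_blocked = B_TRUE */
                    return (false);         /* mblk handed back to MAC */
            }
            r->desc_free -= descs_needed;
            return (true);
    }

    static void
    toy_recycle(toy_ring_t *r, uint32_t descs_recycled)
    {
            r->desc_free += descs_recycled;
            if (r->tx_blocked) {
                    r->tx_blocked = false;
                    /* The driver calls mac_tx_ring_update() at this point. */
                    printf("ring unblocked, notify MAC\n");
            }
    }

    int
    main(void)
    {
            toy_ring_t r = { .desc_free = 2, .tx_blocked = false };

            (void) toy_tx(&r, 2);
            if (!toy_tx(&r, 3))
                    printf("ring full, frame returned to MAC\n");
            toy_recycle(&r, 2);
            return (0);
    }
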
*** 365,381 ****
  #error  "unknown architecture for i40e"
  #endif
  
  /*
   * This structure is used to maintain information and flags related to
!  * transmitting a frame. The first member is the set of flags we need to or into
!  * the command word (generally checksumming related). The second member controls
!  * the word offsets which is required for IP and L4 checksumming.
   */
  typedef struct i40e_tx_context {
!         enum i40e_tx_desc_cmd_bits      itc_cmdflags;
!         uint32_t                        itc_offsets;
  } i40e_tx_context_t;
  
  /*
   * Toggles on debug builds which can be used to override our RX behaviour based
   * on thresholds.
--- 395,413 ----
  #error  "unknown architecture for i40e"
  #endif
  
  /*
   * This structure is used to maintain information and flags related to
!  * transmitting a frame.  These fields are ultimately used to construct the
!  * TX data descriptor(s) and, if necessary, the TX context descriptor.
   */
  typedef struct i40e_tx_context {
!         enum i40e_tx_desc_cmd_bits      itc_data_cmdflags;
!         uint32_t                        itc_data_offsets;
!         enum i40e_tx_ctx_desc_cmd_bits  itc_ctx_cmdflags;
!         uint32_t                        itc_ctx_tsolen;
!         uint32_t                        itc_ctx_mss;
  } i40e_tx_context_t;
  
  /*
   * Toggles on debug builds which can be used to override our RX behaviour based
   * on thresholds.
*** 393,410 ****
  /*
   * Notes on the following pair of DMA attributes. The first attribute,
   * i40e_static_dma_attr, is designed to be used for both the descriptor rings
   * and the static buffers that we associate with control blocks. For this
   * reason, we force an SGL length of one. While technically the driver supports
!  * a larger SGL (5 on rx and 8 on tx), we opt to only use one to simplify our
   * management here. In addition, when the Intel common code wants to allocate
   * memory via the i40e_allocate_virt_mem osdep function, we have it leverage
   * the static dma attr.
   *
!  * The second set of attributes, i40e_txbind_dma_attr, is what we use when we're
!  * binding a bunch of mblk_t fragments to go out the door. Note that the main
!  * difference here is that we're allowed a larger SGL length -- eight.
   *
   * Note, we default to setting ourselves to be DMA capable here. However,
   * because we could have multiple instances which have different FMA error
   * checking capabilities, or end up on different buses, we make these static
   * and const and copy them into the i40e_t for the given device with the actual
--- 425,446 ----
  /*
   * Notes on the following pair of DMA attributes. The first attribute,
   * i40e_static_dma_attr, is designed to be used for both the descriptor rings
   * and the static buffers that we associate with control blocks. For this
   * reason, we force an SGL length of one. While technically the driver supports
!  * a larger SGL (5 on RX and 8 on TX), we opt to only use one to simplify our
   * management here. In addition, when the Intel common code wants to allocate
   * memory via the i40e_allocate_virt_mem osdep function, we have it leverage
   * the static dma attr.
   *
!  * The latter two sets of attributes are what we use when we're binding a
!  * bunch of mblk_t fragments to go out the door. Note that the main difference
!  * here is that we're allowed a larger SGL length.  For non-LSO TX, we
!  * restrict the SGL length to match the number of TX buffers available to the
!  * PF (8).  For the LSO case we can go much larger, with the caveat that each
!  * MSS-sized chunk (segment) must not span more than 8 data descriptors and
!  * hence must not span more than 8 cookies.
   *
   * Note, we default to setting ourselves to be DMA capable here. However,
   * because we could have multiple instances which have different FMA error
   * checking capabilities, or end up on different buses, we make these static
   * and const and copy them into the i40e_t for the given device with the actual
*** 427,437 ****
  
  static const ddi_dma_attr_t i40e_g_txbind_dma_attr = {
          DMA_ATTR_V0,                    /* version number */
          0x0000000000000000ull,          /* low address */
          0xFFFFFFFFFFFFFFFFull,          /* high address */
!         0x00000000FFFFFFFFull,          /* dma counter max */
          I40E_DMA_ALIGNMENT,             /* alignment */
          0x00000FFF,                     /* burst sizes */
          0x00000001,                     /* minimum transfer size */
          0x00000000FFFFFFFFull,          /* maximum transfer size */
          0xFFFFFFFFFFFFFFFFull,          /* maximum segment size  */
--- 463,473 ----
  
  static const ddi_dma_attr_t i40e_g_txbind_dma_attr = {
          DMA_ATTR_V0,                    /* version number */
          0x0000000000000000ull,          /* low address */
          0xFFFFFFFFFFFFFFFFull,          /* high address */
!         I40E_MAX_TX_BUFSZ - 1,          /* dma counter max */
          I40E_DMA_ALIGNMENT,             /* alignment */
          0x00000FFF,                     /* burst sizes */
          0x00000001,                     /* minimum transfer size */
          0x00000000FFFFFFFFull,          /* maximum transfer size */
          0xFFFFFFFFFFFFFFFFull,          /* maximum segment size  */
*** 438,447 ****
--- 474,498 ----
          I40E_TX_MAX_COOKIE,             /* scatter/gather list length */
          0x00000001,                     /* granularity */
          DDI_DMA_FLAGERR                 /* DMA flags */
  };
  
+ static const ddi_dma_attr_t i40e_g_txbind_lso_dma_attr = {
+         DMA_ATTR_V0,                    /* version number */
+         0x0000000000000000ull,          /* low address */
+         0xFFFFFFFFFFFFFFFFull,          /* high address */
+         I40E_MAX_TX_BUFSZ - 1,          /* dma counter max */
+         I40E_DMA_ALIGNMENT,             /* alignment */
+         0x00000FFF,                     /* burst sizes */
+         0x00000001,                     /* minimum transfer size */
+         0x00000000FFFFFFFFull,          /* maximum transfer size */
+         0xFFFFFFFFFFFFFFFFull,          /* maximum segment size  */
+         I40E_TX_LSO_MAX_COOKIE,         /* scatter/gather list length */
+         0x00000001,                     /* granularity */
+         DDI_DMA_FLAGERR                 /* DMA flags */
+ };
+ 
  /*
   * Next, we have the attributes for these structures. The descriptor rings are
   * all strictly little endian, while the data buffers are just arrays of bytes
   * representing frames. Because of this, we purposefully simplify the driver
   * programming life by programming the descriptor ring as little endian, while
*** 666,685 ****
          rxd->rxd_rcb_free = rxd->rxd_free_list_size;
  
          rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
              rxd->rxd_ring_size, KM_NOSLEEP);
          if (rxd->rxd_work_list == NULL) {
!                 i40e_error(i40e, "failed to allocate rx work list for a ring "
                      "of %d entries for ring %d", rxd->rxd_ring_size,
                      itrq->itrq_index);
                  goto cleanup;
          }
  
          rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
              rxd->rxd_free_list_size, KM_NOSLEEP);
          if (rxd->rxd_free_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry rx free list "
                      "for ring %d", rxd->rxd_free_list_size, itrq->itrq_index);
                  goto cleanup;
          }
  
          rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) *
--- 717,736 ----
          rxd->rxd_rcb_free = rxd->rxd_free_list_size;
  
          rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
              rxd->rxd_ring_size, KM_NOSLEEP);
          if (rxd->rxd_work_list == NULL) {
!                 i40e_error(i40e, "failed to allocate RX work list for a ring "
                      "of %d entries for ring %d", rxd->rxd_ring_size,
                      itrq->itrq_index);
                  goto cleanup;
          }
  
          rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
              rxd->rxd_free_list_size, KM_NOSLEEP);
          if (rxd->rxd_free_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry RX free list "
                      "for ring %d", rxd->rxd_free_list_size, itrq->itrq_index);
                  goto cleanup;
          }
  
          rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) *
*** 763,781 ****
          size_t dmasz;
          i40e_rx_control_block_t *rcb;
          i40e_t *i40e = rxd->rxd_i40e;
  
          /*
!          * First allocate the rx descriptor ring.
           */
          dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size;
          VERIFY(dmasz > 0);
          if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area,
              &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE,
              B_TRUE, dmasz) == B_FALSE) {
                  i40e_error(i40e, "failed to allocate DMA resources "
!                     "for rx descriptor ring");
                  return (B_FALSE);
          }
          rxd->rxd_desc_ring =
              (i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address;
          rxd->rxd_desc_next = 0;
--- 814,832 ----
          size_t dmasz;
          i40e_rx_control_block_t *rcb;
          i40e_t *i40e = rxd->rxd_i40e;
  
          /*
!          * First allocate the RX descriptor ring.
           */
          dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size;
          VERIFY(dmasz > 0);
          if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area,
              &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE,
              B_TRUE, dmasz) == B_FALSE) {
                  i40e_error(i40e, "failed to allocate DMA resources "
!                     "for RX descriptor ring");
                  return (B_FALSE);
          }
          rxd->rxd_desc_ring =
              (i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address;
          rxd->rxd_desc_next = 0;
*** 797,807 ****
  
                  dmap = &rcb->rcb_dma;
                  if (i40e_alloc_dma_buffer(i40e, dmap,
                      &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
                      B_TRUE, B_FALSE, dmasz) == B_FALSE) {
!                         i40e_error(i40e, "failed to allocate rx dma buffer");
                          return (B_FALSE);
                  }
  
                  /*
                   * Initialize the control block and offset the DMA address. See
--- 848,858 ----
  
                  dmap = &rcb->rcb_dma;
                  if (i40e_alloc_dma_buffer(i40e, dmap,
                      &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
                      B_TRUE, B_FALSE, dmasz) == B_FALSE) {
!                         i40e_error(i40e, "failed to allocate RX dma buffer");
                          return (B_FALSE);
                  }
  
                  /*
                   * Initialize the control block and offset the DMA address. See
*** 839,849 ****
--- 890,904 ----
                          i40e_free_dma_buffer(&tcb->tcb_dma);
                          if (tcb->tcb_dma_handle != NULL) {
                                  ddi_dma_free_handle(&tcb->tcb_dma_handle);
                                  tcb->tcb_dma_handle = NULL;
                          }
+                         if (tcb->tcb_lso_dma_handle != NULL) {
+                                 ddi_dma_free_handle(&tcb->tcb_lso_dma_handle);
+                                 tcb->tcb_lso_dma_handle = NULL;
                          }
+                 }
  
                  fsz = sizeof (i40e_tx_control_block_t) *
                      itrq->itrq_tx_free_list_size;
                  kmem_free(itrq->itrq_tcb_area, fsz);
                  itrq->itrq_tcb_area = NULL;
*** 879,898 ****
          itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size;
          itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size +
              (i40e->i40e_tx_ring_size >> 1);
  
          /*
!          * Allocate an additional tx descriptor for the writeback head.
           */
          dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size;
          dmasz += sizeof (i40e_tx_desc_t);
  
          VERIFY(dmasz > 0);
          if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area,
              &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr,
              B_FALSE, B_TRUE, dmasz) == B_FALSE) {
!                 i40e_error(i40e, "failed to allocate DMA resources for tx "
                      "descriptor ring");
                  return (B_FALSE);
          }
          itrq->itrq_desc_ring =
              (i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address;
--- 934,953 ----
          itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size;
          itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size +
              (i40e->i40e_tx_ring_size >> 1);
  
          /*
!          * Allocate an additional TX descriptor for the writeback head.
           */
          dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size;
          dmasz += sizeof (i40e_tx_desc_t);
  
          VERIFY(dmasz > 0);
          if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area,
              &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr,
              B_FALSE, B_TRUE, dmasz) == B_FALSE) {
!                 i40e_error(i40e, "failed to allocate DMA resources for TX "
                      "descriptor ring");
                  return (B_FALSE);
          }
          itrq->itrq_desc_ring =
              (i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address;
*** 903,928 ****
          itrq->itrq_desc_free = itrq->itrq_tx_ring_size;
  
          itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size *
              sizeof (i40e_tx_control_block_t *), KM_NOSLEEP);
          if (itrq->itrq_tcb_work_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry tx work list "
                      "for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index);
                  goto cleanup;
          }
  
          itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size *
              sizeof (i40e_tx_control_block_t *), KM_SLEEP);
          if (itrq->itrq_tcb_free_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry tx free list "
                      "for ring %d", itrq->itrq_tx_free_list_size,
                      itrq->itrq_index);
                  goto cleanup;
          }
  
          /*
!          * We allocate enough tx control blocks to cover the free list.
           */
          itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) *
              itrq->itrq_tx_free_list_size, KM_NOSLEEP);
          if (itrq->itrq_tcb_area == NULL) {
                  i40e_error(i40e, "failed to allocate a %d entry tcb area for "
--- 958,983 ----
          itrq->itrq_desc_free = itrq->itrq_tx_ring_size;
  
          itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size *
              sizeof (i40e_tx_control_block_t *), KM_NOSLEEP);
          if (itrq->itrq_tcb_work_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry TX work list "
                      "for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index);
                  goto cleanup;
          }
  
          itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size *
              sizeof (i40e_tx_control_block_t *), KM_SLEEP);
          if (itrq->itrq_tcb_free_list == NULL) {
!                 i40e_error(i40e, "failed to allocate a %d entry TX free list "
                      "for ring %d", itrq->itrq_tx_free_list_size,
                      itrq->itrq_index);
                  goto cleanup;
          }
  
          /*
!          * We allocate enough TX control blocks to cover the free list.
           */
          itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) *
              itrq->itrq_tx_free_list_size, KM_NOSLEEP);
          if (itrq->itrq_tcb_area == NULL) {
                  i40e_error(i40e, "failed to allocate a %d entry tcb area for "
*** 946,967 ****
                   */
                  ret = ddi_dma_alloc_handle(i40e->i40e_dip,
                      &i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL,
                      &tcb->tcb_dma_handle);
                  if (ret != DDI_SUCCESS) {
!                         i40e_error(i40e, "failed to allocate DMA handle for tx "
                              "data binding on ring %d: %d", itrq->itrq_index,
                              ret);
                          tcb->tcb_dma_handle = NULL;
                          goto cleanup;
                  }
  
                  if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma,
                      &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
                      B_TRUE, B_FALSE, dmasz) == B_FALSE) {
                          i40e_error(i40e, "failed to allocate %ld bytes of "
!                             "DMA for tx data binding on ring %d", dmasz,
                              itrq->itrq_index);
                          goto cleanup;
                  }
  
                  itrq->itrq_tcb_free_list[i] = tcb;
--- 1001,1033 ----
                   */
                  ret = ddi_dma_alloc_handle(i40e->i40e_dip,
                      &i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL,
                      &tcb->tcb_dma_handle);
                  if (ret != DDI_SUCCESS) {
!                         i40e_error(i40e, "failed to allocate DMA handle for TX "
                              "data binding on ring %d: %d", itrq->itrq_index,
                              ret);
                          tcb->tcb_dma_handle = NULL;
                          goto cleanup;
                  }
  
+                 ret = ddi_dma_alloc_handle(i40e->i40e_dip,
+                     &i40e->i40e_txbind_lso_dma_attr, DDI_DMA_DONTWAIT, NULL,
+                     &tcb->tcb_lso_dma_handle);
+                 if (ret != DDI_SUCCESS) {
+                         i40e_error(i40e, "failed to allocate DMA handle for TX "
+                             "LSO data binding on ring %d: %d", itrq->itrq_index,
+                             ret);
+                         tcb->tcb_lso_dma_handle = NULL;
+                         goto cleanup;
+                 }
+ 
                  if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma,
                      &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
                      B_TRUE, B_FALSE, dmasz) == B_FALSE) {
                          i40e_error(i40e, "failed to allocate %ld bytes of "
!                             "DMA for TX data binding on ring %d", dmasz,
                              itrq->itrq_index);
                          goto cleanup;
                  }
  
                  itrq->itrq_tcb_free_list[i] = tcb;
*** 987,1000 ****
  
          for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
                  i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata;
  
                  /*
!                  * Clean up our rx data. We have to free DMA resources first and
                   * then if we have no more pending RCB's, then we'll go ahead
                   * and clean things up. Note, we can't set the stopped flag on
!                  * the rx data until after we've done the first pass of the
                   * pending resources. Otherwise we might race with
                   * i40e_rx_recycle on determining who should free the
                   * i40e_rx_data_t above.
                   */
                  i40e_free_rx_dma(rxd, failed_init);
--- 1053,1073 ----
  
          for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
                  i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata;
  
                  /*
!                  * In some cases i40e_alloc_rx_data() may have failed
!                  * and in that case there is no rxd to free.
!                  */
!                 if (rxd == NULL)
!                         continue;
! 
!                 /*
!                  * Clean up our RX data. We have to free DMA resources first and
                   * then if we have no more pending RCB's, then we'll go ahead
                   * and clean things up. Note, we can't set the stopped flag on
!                  * the RX data until after we've done the first pass of the
                   * pending resources. Otherwise we might race with
                   * i40e_rx_recycle on determining who should free the
                   * i40e_rx_data_t above.
                   */
                  i40e_free_rx_dma(rxd, failed_init);
*** 1053,1073 ****
--- 1126,1152 ----
  {
          bcopy(&i40e_g_static_dma_attr, &i40e->i40e_static_dma_attr,
              sizeof (ddi_dma_attr_t));
          bcopy(&i40e_g_txbind_dma_attr, &i40e->i40e_txbind_dma_attr,
              sizeof (ddi_dma_attr_t));
+         bcopy(&i40e_g_txbind_lso_dma_attr, &i40e->i40e_txbind_lso_dma_attr,
+             sizeof (ddi_dma_attr_t));
          bcopy(&i40e_g_desc_acc_attr, &i40e->i40e_desc_acc_attr,
              sizeof (ddi_device_acc_attr_t));
          bcopy(&i40e_g_buf_acc_attr, &i40e->i40e_buf_acc_attr,
              sizeof (ddi_device_acc_attr_t));
  
          if (fma == B_TRUE) {
                  i40e->i40e_static_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
                  i40e->i40e_txbind_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
+                 i40e->i40e_txbind_lso_dma_attr.dma_attr_flags |=
+                     DDI_DMA_FLAGERR;
          } else {
                  i40e->i40e_static_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR;
                  i40e->i40e_txbind_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR;
+                 i40e->i40e_txbind_lso_dma_attr.dma_attr_flags &=
+                     ~DDI_DMA_FLAGERR;
          }
  }
  
  static void
  i40e_rcb_free(i40e_rx_data_t *rxd, i40e_rx_control_block_t *rcb)
*** 1100,1110 ****
  }
  
  /*
   * This is the callback that we get from the OS when freemsg(9F) has been called
   * on a loaned descriptor. In addition, if we take the last reference count
!  * here, then we have to tear down all of the rx data.
   */
  void
  i40e_rx_recycle(caddr_t arg)
  {
          uint32_t ref;
--- 1179,1189 ----
  }
  
  /*
   * This is the callback that we get from the OS when freemsg(9F) has been called
   * on a loaned descriptor. In addition, if we take the last reference count
!  * here, then we have to tear down all of the RX data.
   */
  void
  i40e_rx_recycle(caddr_t arg)
  {
          uint32_t ref;
*** 1766,1884 ****
  /*
  * Attempt to put together the information we'll need to feed into a descriptor
   * to properly program the hardware for checksum offload as well as the
   * generally required flags.
   *
!  * The i40e_tx_context_t`itc_cmdflags contains the set of flags we need to or
!  * into the descriptor based on the checksum flags for this mblk_t and the
   * actual information we care about.
   */
  static int
  i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp,
!     i40e_tx_context_t *tctx)
  {
!         int ret;
!         uint32_t flags, start;
!         mac_ether_offload_info_t meo;
          i40e_txq_stat_t *txs = &itrq->itrq_txstat;
  
          bzero(tctx, sizeof (i40e_tx_context_t));
  
          if (i40e->i40e_tx_hcksum_enable != B_TRUE)
                  return (0);
  
!         mac_hcksum_get(mp, &start, NULL, NULL, NULL, &flags);
!         if (flags == 0)
                  return (0);
  
-         if ((ret = mac_ether_offload_info(mp, &meo)) != 0) {
-                 txs->itxs_hck_meoifail.value.ui64++;
-                 return (ret);
-         }
- 
          /*
           * Have we been asked to checksum an IPv4 header. If so, verify that we
           * have sufficient information and then set the proper fields in the
           * command structure.
           */
!         if (flags & HCK_IPV4_HDRCKSUM) {
!                 if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) {
                          txs->itxs_hck_nol2info.value.ui64++;
                          return (-1);
                  }
!                 if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) {
                          txs->itxs_hck_nol3info.value.ui64++;
                          return (-1);
                  }
!                 if (meo.meoi_l3proto != ETHERTYPE_IP) {
                          txs->itxs_hck_badl3.value.ui64++;
                          return (-1);
                  }
!                 tctx->itc_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM;
!                 tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) <<
                      I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
!                 tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) <<
                      I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
          }
  
          /*
           * We've been asked to provide an L4 header, first, set up the IP
           * information in the descriptor if we haven't already before moving
           * onto seeing if we have enough information for the L4 checksum
           * offload.
           */
!         if (flags & HCK_PARTIALCKSUM) {
!                 if ((meo.meoi_flags & MEOI_L4INFO_SET) == 0) {
                          txs->itxs_hck_nol4info.value.ui64++;
                          return (-1);
                  }
  
!                 if (!(flags & HCK_IPV4_HDRCKSUM)) {
!                         if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) {
                                  txs->itxs_hck_nol2info.value.ui64++;
                                  return (-1);
                          }
!                         if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) {
                                  txs->itxs_hck_nol3info.value.ui64++;
                                  return (-1);
                          }
  
!                         if (meo.meoi_l3proto == ETHERTYPE_IP) {
!                                 tctx->itc_cmdflags |=
                                      I40E_TX_DESC_CMD_IIPT_IPV4;
!                         } else if (meo.meoi_l3proto == ETHERTYPE_IPV6) {
!                                 tctx->itc_cmdflags |=
                                      I40E_TX_DESC_CMD_IIPT_IPV6;
                          } else {
                                  txs->itxs_hck_badl3.value.ui64++;
                                  return (-1);
                          }
!                         tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) <<
                              I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
!                         tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) <<
                              I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
                  }
  
!                 switch (meo.meoi_l4proto) {
                  case IPPROTO_TCP:
!                         tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_TCP;
                          break;
                  case IPPROTO_UDP:
!                         tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_UDP;
                          break;
                  case IPPROTO_SCTP:
!                         tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_SCTP;
                          break;
                  default:
                          txs->itxs_hck_badl4.value.ui64++;
                          return (-1);
                  }
  
!                 tctx->itc_offsets |= (meo.meoi_l4hlen >> 2) <<
                      I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT;
          }
  
          return (0);
  }
  
  static void
  i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb)
--- 1845,1981 ----
  /*
  * Attempt to put together the information we'll need to feed into a descriptor
   * to properly program the hardware for checksum offload as well as the
   * generally required flags.
   *
!  * The i40e_tx_context_t`itc_data_cmdflags contains the set of flags we need to
!  * 'or' into the descriptor based on the checksum flags for this mblk_t and the
   * actual information we care about.
+  *
+  * If the mblk requires LSO then we'll also gather the information that will be
+  * used to construct the Transmit Context Descriptor.
   */
  static int
  i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp,
!     mac_ether_offload_info_t *meo, i40e_tx_context_t *tctx)
  {
!         uint32_t chkflags, start, mss, lsoflags;
          i40e_txq_stat_t *txs = &itrq->itrq_txstat;
  
          bzero(tctx, sizeof (i40e_tx_context_t));
  
          if (i40e->i40e_tx_hcksum_enable != B_TRUE)
                  return (0);
  
!         mac_hcksum_get(mp, &start, NULL, NULL, NULL, &chkflags);
!         mac_lso_get(mp, &mss, &lsoflags);
! 
!         if (chkflags == 0 && lsoflags == 0)
                  return (0);
  
          /*
           * Have we been asked to checksum an IPv4 header. If so, verify that we
           * have sufficient information and then set the proper fields in the
           * command structure.
           */
!         if (chkflags & HCK_IPV4_HDRCKSUM) {
!                 if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) {
                          txs->itxs_hck_nol2info.value.ui64++;
                          return (-1);
                  }
!                 if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) {
                          txs->itxs_hck_nol3info.value.ui64++;
                          return (-1);
                  }
!                 if (meo->meoi_l3proto != ETHERTYPE_IP) {
                          txs->itxs_hck_badl3.value.ui64++;
                          return (-1);
                  }
!                 tctx->itc_data_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM;
!                 tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) <<
                      I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
!                 tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) <<
                      I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
          }
  
          /*
           * We've been asked to provide an L4 header, first, set up the IP
           * information in the descriptor if we haven't already before moving
           * onto seeing if we have enough information for the L4 checksum
           * offload.
           */
!         if (chkflags & HCK_PARTIALCKSUM) {
!                 if ((meo->meoi_flags & MEOI_L4INFO_SET) == 0) {
                          txs->itxs_hck_nol4info.value.ui64++;
                          return (-1);
                  }
  
!                 if (!(chkflags & HCK_IPV4_HDRCKSUM)) {
!                         if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) {
                                  txs->itxs_hck_nol2info.value.ui64++;
                                  return (-1);
                          }
!                         if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) {
                                  txs->itxs_hck_nol3info.value.ui64++;
                                  return (-1);
                          }
  
!                         if (meo->meoi_l3proto == ETHERTYPE_IP) {
!                                 tctx->itc_data_cmdflags |=
                                      I40E_TX_DESC_CMD_IIPT_IPV4;
!                         } else if (meo->meoi_l3proto == ETHERTYPE_IPV6) {
!                                 tctx->itc_data_cmdflags |=
                                      I40E_TX_DESC_CMD_IIPT_IPV6;
                          } else {
                                  txs->itxs_hck_badl3.value.ui64++;
                                  return (-1);
                          }
!                         tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) <<
                              I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
!                         tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) <<
                              I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
                  }
  
!                 switch (meo->meoi_l4proto) {
                  case IPPROTO_TCP:
!                         tctx->itc_data_cmdflags |=
!                             I40E_TX_DESC_CMD_L4T_EOFT_TCP;
                          break;
                  case IPPROTO_UDP:
!                         tctx->itc_data_cmdflags |=
!                             I40E_TX_DESC_CMD_L4T_EOFT_UDP;
                          break;
                  case IPPROTO_SCTP:
!                         tctx->itc_data_cmdflags |=
!                             I40E_TX_DESC_CMD_L4T_EOFT_SCTP;
                          break;
                  default:
                          txs->itxs_hck_badl4.value.ui64++;
                          return (-1);
                  }
  
!                 tctx->itc_data_offsets |= (meo->meoi_l4hlen >> 2) <<
                      I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT;
          }
  
+         if (lsoflags & HW_LSO) {
+                 /*
+                  * LSO requires that checksum offloads are enabled.  If for
+                  * some reason they're not we bail out with an error.
+                  */
+                 if ((chkflags & HCK_IPV4_HDRCKSUM) == 0 ||
+                     (chkflags & HCK_PARTIALCKSUM) == 0) {
+                         txs->itxs_lso_nohck.value.ui64++;
+                         return (-1);
+                 }
+ 
+                 tctx->itc_ctx_cmdflags |= I40E_TX_CTX_DESC_TSO;
+                 tctx->itc_ctx_mss = mss;
+                 tctx->itc_ctx_tsolen = msgsize(mp) -
+                     (meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen);
+         }
+ 
          return (0);
  }
  
  static void
  i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb)
*** 1923,1944 ****
--- 2020,2056 ----
          switch (tcb->tcb_type) {
          case I40E_TX_COPY:
                  tcb->tcb_dma.dmab_len = 0;
                  break;
          case I40E_TX_DMA:
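+                 /*
+                  * For a DMA-bound tcb, unbind whichever handle was used
+                  * (LSO or standard) and release the cookie info array.
+                  */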
+                 if (tcb->tcb_used_lso == B_TRUE && tcb->tcb_bind_ncookies > 0)
+                         (void) ddi_dma_unbind_handle(tcb->tcb_lso_dma_handle);
+                 else if (tcb->tcb_bind_ncookies > 0)
                          (void) ddi_dma_unbind_handle(tcb->tcb_dma_handle);
+                 if (tcb->tcb_bind_info != NULL) {
+                         kmem_free(tcb->tcb_bind_info,
+                             tcb->tcb_bind_ncookies *
+                             sizeof (struct i40e_dma_bind_info));
+                 }
+                 tcb->tcb_bind_info = NULL;
+                 tcb->tcb_bind_ncookies = 0;
+                 tcb->tcb_used_lso = B_FALSE;
                  break;
+         case I40E_TX_DESC:
+                 break;
          case I40E_TX_NONE:
                  /* Cast to pacify lint */
                  panic("trying to free tcb %p with bad type none", (void *)tcb);
          default:
                  panic("unknown i40e tcb type: %d", tcb->tcb_type);
          }
  
          tcb->tcb_type = I40E_TX_NONE;
+         if (tcb->tcb_mp != NULL) {
                  freemsg(tcb->tcb_mp);
                  tcb->tcb_mp = NULL;
+         }
          tcb->tcb_next = NULL;
  }
  
  /*
   * This is called as part of shutting down to clean up all outstanding
*** 1967,1980 ****
          index = itrq->itrq_desc_head;
          while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) {
                  i40e_tx_control_block_t *tcb;
  
                  tcb = itrq->itrq_tcb_work_list[index];
!                 VERIFY(tcb != NULL);
                  itrq->itrq_tcb_work_list[index] = NULL;
                  i40e_tcb_reset(tcb);
                  i40e_tcb_free(itrq, tcb);
  
                  bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t));
                  index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size);
                  itrq->itrq_desc_free++;
          }
--- 2079,2093 ----
          index = itrq->itrq_desc_head;
          while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) {
                  i40e_tx_control_block_t *tcb;
  
                  tcb = itrq->itrq_tcb_work_list[index];
!                 if (tcb != NULL) {
                          itrq->itrq_tcb_work_list[index] = NULL;
                          i40e_tcb_reset(tcb);
                          i40e_tcb_free(itrq, tcb);
+                 }
  
                  bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t));
                  index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size);
                  itrq->itrq_desc_free++;
          }
*** 1993,2002 ****
--- 2106,2116 ----
  i40e_tx_recycle_ring(i40e_trqpair_t *itrq)
  {
          uint32_t wbhead, toclean, count;
          i40e_tx_control_block_t *tcbhead;
          i40e_t *i40e = itrq->itrq_i40e;
+         uint_t desc_per_tcb, i;
  
          mutex_enter(&itrq->itrq_tx_lock);
  
          ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
          if (itrq->itrq_desc_free == itrq->itrq_tx_ring_size) {
*** 2040,2055 ****
                  ASSERT(tcb != NULL);
                  tcb->tcb_next = tcbhead;
                  tcbhead = tcb;
  
                  /*
                   * We zero this out for sanity purposes.
                   */
!                 bzero(&itrq->itrq_desc_ring[toclean], sizeof (i40e_tx_desc_t));
!                 toclean = i40e_next_desc(toclean, 1, itrq->itrq_tx_ring_size);
                  count++;
          }
  
          itrq->itrq_desc_head = wbhead;
          itrq->itrq_desc_free += count;
          itrq->itrq_txstat.itxs_recycled.value.ui64 += count;
          ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
--- 2154,2185 ----
                  ASSERT(tcb != NULL);
                  tcb->tcb_next = tcbhead;
                  tcbhead = tcb;
  
                  /*
+                  * In the DMA bind case, there may not necessarily be a 1:1
+                  * mapping between tcb's and descriptors.  If the tcb type
+                  * indicates a DMA binding then check the number of DMA
+                  * cookies to determine how many entries to clean in the
+                  * descriptor ring.
+                  */
+                 if (tcb->tcb_type == I40E_TX_DMA)
+                         desc_per_tcb = tcb->tcb_bind_ncookies;
+                 else
+                         desc_per_tcb = 1;
+ 
+                 for (i = 0; i < desc_per_tcb; i++) {
+                         /*
                           * We zero this out for sanity purposes.
                           */
!                         bzero(&itrq->itrq_desc_ring[toclean],
!                             sizeof (i40e_tx_desc_t));
!                         toclean = i40e_next_desc(toclean, 1,
!                             itrq->itrq_tx_ring_size);
                          count++;
                  }
+         }
  
          itrq->itrq_desc_head = wbhead;
          itrq->itrq_desc_free += count;
          itrq->itrq_txstat.itxs_recycled.value.ui64 += count;
          ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
*** 2076,2089 ****
          }
  
          DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count);
  }
  
  /*
   * We've been asked to send a message block on the wire. We'll only have a
   * single chain. There will not be any b_next pointers; however, there may be
!  * multiple b_cont blocks.
   *
   * We may do one of three things with any given mblk_t chain:
   *
   *   1) Drop it
   *   2) Transmit it
--- 2206,2793 ----
          }
  
          DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count);
  }
  
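+ /*
+  * Copy 'len' bytes from 'mp', starting at 'off', into the tcb's
+  * pre-allocated DMA buffer, mark the tcb as a copy tcb, and sync the
+  * buffer for the device.
+  */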
+ static void
+ i40e_tx_copy_fragment(i40e_tx_control_block_t *tcb, const mblk_t *mp,
+     const size_t off, const size_t len)
+ {
+         const void *soff = mp->b_rptr + off;
+         void *doff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len;
+ 
+         ASSERT3U(len, >, 0);
+         ASSERT3P(soff, >=, mp->b_rptr);
+         ASSERT3P(soff, <=, mp->b_wptr);
+         ASSERT3U(len, <=, MBLKL(mp));
+         ASSERT3U((uintptr_t)soff + len, <=, (uintptr_t)mp->b_wptr);
+         ASSERT3U(tcb->tcb_dma.dmab_size - tcb->tcb_dma.dmab_len, >=, len);
+         bcopy(soff, doff, len);
+         tcb->tcb_type = I40E_TX_COPY;
+         tcb->tcb_dma.dmab_len += len;
+         I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV);
+ }
+ 
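+ /*
+  * Allocate a tcb and DMA bind the remainder of 'mp', starting at
+  * 'off', recording the physical address and length of each cookie in
+  * the tcb's bind info array. Returns NULL (with the tcb released) if
+  * the allocation or the bind fails.
+  */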
+ static i40e_tx_control_block_t *
+ i40e_tx_bind_fragment(i40e_trqpair_t *itrq, const mblk_t *mp,
+     size_t off, boolean_t use_lso)
+ {
+         ddi_dma_handle_t dma_handle;
+         ddi_dma_cookie_t dma_cookie;
+         uint_t i = 0, ncookies = 0, dmaflags;
+         i40e_tx_control_block_t *tcb;
+         i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+ 
+         if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+                 txs->itxs_err_notcb.value.ui64++;
+                 return (NULL);
+         }
+         tcb->tcb_type = I40E_TX_DMA;
+ 
+         if (use_lso == B_TRUE)
+                 dma_handle = tcb->tcb_lso_dma_handle;
+         else
+                 dma_handle = tcb->tcb_dma_handle;
+ 
+         dmaflags = DDI_DMA_WRITE | DDI_DMA_STREAMING;
+         if (ddi_dma_addr_bind_handle(dma_handle, NULL,
+             (caddr_t)(mp->b_rptr + off), MBLKL(mp) - off, dmaflags,
+             DDI_DMA_DONTWAIT, NULL, &dma_cookie, &ncookies) != DDI_DMA_MAPPED) {
+                 txs->itxs_bind_fails.value.ui64++;
+                 goto bffail;
+         }
+ 
+         tcb->tcb_bind_ncookies = ncookies;
+         tcb->tcb_used_lso = use_lso;
+ 
+         tcb->tcb_bind_info =
+             kmem_zalloc(ncookies * sizeof (struct i40e_dma_bind_info),
+             KM_NOSLEEP);
+         if (tcb->tcb_bind_info == NULL)
+                 goto bffail;
+ 
+         while (i < ncookies) {
+                 if (i > 0)
+                         ddi_dma_nextcookie(dma_handle, &dma_cookie);
+ 
+                 tcb->tcb_bind_info[i].dbi_paddr =
+                     (caddr_t)dma_cookie.dmac_laddress;
+                 tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size;
+         }
+ 
+         return (tcb);
+ 
+ bffail:
+         i40e_tcb_reset(tcb);
+         i40e_tcb_free(itrq, tcb);
+         return (NULL);
+ }
+ 
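+ /*
+  * Fill in the next data descriptor on the ring with the given buffer
+  * address, length, and the command/offset values from the tx context.
+  * The final descriptor of the frame also gets the EOP and RS bits.
+  */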
+ static void
+ i40e_tx_set_data_desc(i40e_trqpair_t *itrq, i40e_tx_context_t *tctx,
+     caddr_t buff, size_t len, boolean_t last_desc)
+ {
+         i40e_tx_desc_t *txdesc;
+         int cmd;
+ 
+         ASSERT(MUTEX_HELD(&itrq->itrq_tx_lock));
+         itrq->itrq_desc_free--;
+         txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail];
+         itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1,
+             itrq->itrq_tx_ring_size);
+ 
+         cmd = I40E_TX_DESC_CMD_ICRC | tctx->itc_data_cmdflags;
+ 
+         /*
+          * The last data descriptor needs the EOP bit set, so that the HW knows
+          * that we're ready to send.  Additionally, we set the RS (Report
+          * Status) bit, so that we are notified when the transmit engine has
+          * completed DMA'ing all of the data descriptors and data buffers
+          * associated with this frame.
+          */
+         if (last_desc == B_TRUE) {
+                 cmd |= I40E_TX_DESC_CMD_EOP;
+                 cmd |= I40E_TX_DESC_CMD_RS;
+         }
+ 
+         /*
+          * Per the X710 manual, section 8.4.2.1.1, the buffer size
+          * must be a value from 1 to 16K minus 1, inclusive.
+          */
+         ASSERT3U(len, >=, 1);
+         ASSERT3U(len, <=, I40E_MAX_TX_BUFSZ - 1);
+ 
+         txdesc->buffer_addr = CPU_TO_LE64((uintptr_t)buff);
+         txdesc->cmd_type_offset_bsz =
+             LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA |
+             ((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
+             ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
+             ((uint64_t)len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
+ }
+ 
  /*
+  * Place 'tcb' on the tail of the list represented by 'head'/'tail'.
+  */
+ static inline void
+ tcb_list_append(i40e_tx_control_block_t **head, i40e_tx_control_block_t **tail,
+     i40e_tx_control_block_t *tcb)
+ {
+         if (*head == NULL) {
+                 *head = tcb;
+                 *tail = *head;
+         } else {
+                 ASSERT3P(*tail, !=, NULL);
+                 ASSERT3P((*tail)->tcb_next, ==, NULL);
+                 (*tail)->tcb_next = tcb;
+                 *tail = tcb;
+         }
+ }
+ 
+ /*
+  * This function takes a single packet, possibly consisting of
+  * multiple mblks, and creates a TCB chain to send to the controller.
+  * This TCB chain may span a maximum of 8 descriptors. A copy
+  * TCB consumes one descriptor, whereas a DMA TCB may consume 1 or
+  * more, depending on several factors. For each fragment (individual
+  * mblk making up the packet), we determine if its size dictates a
+  * copy to the TCB buffer or a DMA bind of the dblk buffer. We keep a
+  * count of descriptors used; when that count reaches the max we force
+  * all remaining fragments into a single TCB buffer. We have a
+  * guarantee that the TCB buffer is always larger than the MTU -- so
+  * there is always enough room. Consecutive fragments below the DMA
+  * threshold are copied into a single TCB. In the event of an error
+  * this function returns NULL but leaves 'mp' alone.
+  */
+ static i40e_tx_control_block_t *
+ i40e_non_lso_chain(i40e_trqpair_t *itrq, mblk_t *mp, uint_t *ndesc)
+ {
+         const mblk_t *nmp = mp;
+         uint_t needed_desc = 0;
+         boolean_t force_copy = B_FALSE;
+         i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL;
+         i40e_t *i40e = itrq->itrq_i40e;
+         i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+ 
+         /* TCB buffer is always larger than MTU. */
+         ASSERT3U(msgsize(mp), <, i40e->i40e_tx_buf_size);
+ 
+         while (nmp != NULL) {
+                 const size_t nmp_len = MBLKL(nmp);
+ 
+                 /* Ignore zero-length mblks. */
+                 if (nmp_len == 0) {
+                         nmp = nmp->b_cont;
+                         continue;
+                 }
+ 
+                 if (nmp_len < i40e->i40e_tx_dma_min || force_copy) {
+                         /* Compress consecutive copies into one TCB. */
+                         if (tcb != NULL && tcb->tcb_type == I40E_TX_COPY) {
+                                 i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len);
+                                 nmp = nmp->b_cont;
+                                 continue;
+                         }
+ 
+                         if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+                                 txs->itxs_err_notcb.value.ui64++;
+                                 goto fail;
+                         }
+ 
+                         /*
+                          * TCB DMA buffer is guaranteed to be one
+                          * cookie by i40e_alloc_dma_buffer().
+                          */
+                         i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len);
+                         needed_desc++;
+                         tcb_list_append(&tcbhead, &tcbtail, tcb);
+                 } else {
+                         uint_t total_desc;
+ 
+                         tcb = i40e_tx_bind_fragment(itrq, nmp, 0, B_FALSE);
+                         if (tcb == NULL) {
+                                 i40e_error(i40e, "dma bind failed!");
+                                 goto fail;
+                         }
+ 
+                         /*
+                          * If the new total exceeds the max or we've
+                          * reached the limit and there's data left,
+                          * then give up binding and copy the rest into
+                          * the pre-allocated TCB buffer.
+                          */
+                         total_desc = needed_desc + tcb->tcb_bind_ncookies;
+                         if ((total_desc > I40E_TX_MAX_COOKIE) ||
+                             (total_desc == I40E_TX_MAX_COOKIE &&
+                             nmp->b_cont != NULL)) {
+                                 i40e_tcb_reset(tcb);
+                                 i40e_tcb_free(itrq, tcb);
+ 
+                                 if (tcbtail != NULL &&
+                                     tcbtail->tcb_type == I40E_TX_COPY) {
+                                         tcb = tcbtail;
+                                 } else {
+                                         tcb = NULL;
+                                 }
+ 
+                                 force_copy = B_TRUE;
+                                 txs->itxs_force_copy.value.ui64++;
+                                 continue;
+                         }
+ 
+                         needed_desc += tcb->tcb_bind_ncookies;
+                         tcb_list_append(&tcbhead, &tcbtail, tcb);
+                 }
+ 
+                 nmp = nmp->b_cont;
+         }
+ 
+         ASSERT3P(nmp, ==, NULL);
+         ASSERT3U(needed_desc, <=, I40E_TX_MAX_COOKIE);
+         ASSERT3P(tcbhead, !=, NULL);
+         *ndesc += needed_desc;
+         return (tcbhead);
+ 
+ fail:
+         tcb = tcbhead;
+         while (tcb != NULL) {
+                 i40e_tx_control_block_t *next = tcb->tcb_next;
+ 
+                 ASSERT(tcb->tcb_type == I40E_TX_DMA ||
+                     tcb->tcb_type == I40E_TX_COPY);
+ 
+                 tcb->tcb_mp = NULL;
+                 i40e_tcb_reset(tcb);
+                 i40e_tcb_free(itrq, tcb);
+                 tcb = next;
+         }
+ 
+         return (NULL);
+ }
+ 
+ /*
+  * Section 8.4.1 of the 700-series programming guide states that a
+  * segment may span up to 8 data descriptors, including both header
+  * and payload data. However, empirical evidence shows that the
+  * controller freezes the Tx queue when presented with a segment of 8
+  * descriptors. Or, at least, when the first segment contains 8
+  * descriptors. One explanation is that the controller counts the
+  * context descriptor against the first segment, even though the
+  * programming guide makes no mention of such a constraint. In any
+  * case, we limit TSO segments to 7 descriptors to prevent Tx queue
+  * freezes. We still allow non-TSO segments to utilize all 8
+  * descriptors as they have not demonstrated the faulty behavior.
+  */
+ uint_t i40e_lso_num_descs = 7;
+ 
+ #define I40E_TCB_LEFT(tcb)                              \
+         ((tcb)->tcb_dma.dmab_size - (tcb)->tcb_dma.dmab_len)
+ 
+ /*
+  * This function is similar in spirit to i40e_non_lso_chain(), but
+  * much more complicated in reality. Like the previous function, it
+  * takes a packet (an LSO packet) as input and returns a chain of
+  * TCBs. The complication comes with the fact that we are no longer
+  * trying to fit the entire packet into 8 descriptors, but rather we
+  * must fit each MSS-sized segment of the LSO packet into 8 descriptors.
+  * Except it's really 7 descriptors, see i40e_lso_num_descs.
+  *
+  * Your first inclination might be to verify that a given segment
+  * spans no more than 7 mblks; but it's actually much more subtle than
+  * that. First, let's describe what the hardware expects, and then we
+  * can expound on the software side of things.
+  *
+  * For an LSO packet the hardware expects the following:
+  *
+  *      o Each MSS-sized segment must span no more than 7 descriptors.
+  *
+  *      o The header size does not count towards the segment size.
+  *
+  *      o If header and payload share the first descriptor, then the
+  *        controller will count the descriptor twice.
+  *
+  * The most important thing to keep in mind is that the hardware does
+  * not view the segments in terms of mblks, like we do. The hardware
+  * only sees descriptors. It will iterate each descriptor in turn,
+  * keeping a tally of bytes seen and descriptors visited. If the byte
+  * count hasn't reached MSS by the time the descriptor count reaches
+  * 7, then the controller freezes the queue and we are stuck.
+  * Furthermore, the hardware picks up its tally where it left off. So
+  * if it reached MSS in the middle of a descriptor, it will start
+  * tallying the next segment in the middle of that descriptor. The
+  * hardware's view is entirely removed from the mblk chain or even the
+  * descriptor layout. Consider these facts:
+  *
+  *      o The MSS will vary depending on MTU and other factors.
+  *
+  *      o The dblk allocation will sit at various offsets within a
+  *        memory page.
+  *
+  *      o The page size itself could vary in the future (i.e. not
+  *        always 4K).
+  *
+  *      o Just because a dblk is virtually contiguous doesn't mean
+  *        it's physically contiguous. The number of cookies
+  *        (descriptors) required by a DMA bind of a single dblk is at
+  *        the mercy of the page size and physical layout.
+  *
+  *      o The descriptors will most often NOT start/end on an MSS
+  *        boundary. Thus the hardware will often start counting the
+  *        MSS mid descriptor and finish mid descriptor.
+  *
+  * The upshot of all this is that the driver must learn to think like
+  * the controller; and verify that none of the constraints are broken.
+  * It does this by tallying up the segment just like the hardware
+  * would. This is handled by the two variables 'segsz' and 'segdesc'.
+  * After each attempt to bind a dblk, we check the constraints. If
+  * violated, we undo the DMA and force a copy until MSS is met. We
+  * have a guarantee that the TCB buffer is larger than the MTU, which
+  * ensures we can always meet the MSS with a single copy buffer. We
+  * also copy consecutive non-DMA fragments into the same TCB buffer.
+  */
+ static i40e_tx_control_block_t *
+ i40e_lso_chain(i40e_trqpair_t *itrq, const mblk_t *mp,
+     const mac_ether_offload_info_t *meo, const i40e_tx_context_t *tctx,
+     uint_t *ndesc)
+ {
+         size_t mp_len = MBLKL(mp);
+         /*
+          * The cpoff (copy offset) variable tracks the offset inside
+          * the current mp. There are cases where the entire mp is not
+          * fully copied in one go: such as the header copy followed by
+          * a non-DMA mblk, or a TCB buffer that only has enough space
+          * to copy part of the current mp.
+          */
+         size_t cpoff = 0;
+         /*
+          * The segsz and segdesc variables track the controller's view
+          * of the segment. The needed_desc variable tracks the total
+          * number of data descriptors used by the driver.
+          */
+         size_t segsz = 0;
+         uint_t segdesc = 0;
+         uint_t needed_desc = 0;
+         const size_t hdrlen =
+             meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen;
+         const size_t mss = tctx->itc_ctx_mss;
+         boolean_t force_copy = B_FALSE;
+         i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL;
+         i40e_t *i40e = itrq->itrq_i40e;
+         i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+ 
+         /*
+          * We always copy the header in order to avoid more
+          * complicated code dealing with various edge cases.
+          */
+         ASSERT3U(MBLKL(mp), >=, hdrlen);
+         if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+                 txs->itxs_err_notcb.value.ui64++;
+                 goto fail;
+         }
+         needed_desc++;
+ 
+         tcb_list_append(&tcbhead, &tcbtail, tcb);
+         i40e_tx_copy_fragment(tcb, mp, 0, hdrlen);
+         cpoff += hdrlen;
+ 
+         /*
+          * A single descriptor containing both header and data is
+          * counted twice by the controller.
+          */
+         if ((mp_len > hdrlen && mp_len < i40e->i40e_tx_dma_min) ||
+             (mp->b_cont != NULL &&
+             MBLKL(mp->b_cont) < i40e->i40e_tx_dma_min)) {
+                 segdesc = 2;
+         } else {
+                 segdesc = 1;
+         }
+ 
+         /* If this fragment was pure header, then move to the next one. */
+         if (cpoff == mp_len) {
+                 mp = mp->b_cont;
+                 cpoff = 0;
+         }
+ 
+         while (mp != NULL) {
+                 mp_len = MBLKL(mp);
+ force_copy:
+                 /* Ignore zero-length mblks. */
+                 if (mp_len == 0) {
+                         mp = mp->b_cont;
+                         cpoff = 0;
+                         continue;
+                 }
+ 
+                 /*
+                  * We copy into the preallocated TCB buffer when the
+                  * current fragment is less than the DMA threshold OR
+                  * when the DMA bind can't meet the controller's
+                  * segment descriptor limit.
+                  */
+                 if (mp_len < i40e->i40e_tx_dma_min || force_copy) {
+                         size_t tocopy;
+ 
+                         /*
+                          * Our objective here is to compress
+                          * consecutive copies into one TCB (until it
+                          * is full). If there is no current TCB, or if
+                          * it is a DMA TCB, then allocate a new one.
+                          */
+                         if (tcb == NULL ||
+                             (tcb != NULL && tcb->tcb_type != I40E_TX_COPY)) {
+                                 if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+                                         txs->itxs_err_notcb.value.ui64++;
+                                         goto fail;
+                                 }
+ 
+                                 /*
+                                  * The TCB DMA buffer is guaranteed to
+                                  * be one cookie by i40e_alloc_dma_buffer().
+                                  */
+                                 needed_desc++;
+                                 segdesc++;
+                                 ASSERT3U(segdesc, <=, i40e_lso_num_descs);
+                                 tcb_list_append(&tcbhead, &tcbtail, tcb);
+                         }
+ 
+                         tocopy = MIN(I40E_TCB_LEFT(tcb), mp_len - cpoff);
+                         i40e_tx_copy_fragment(tcb, mp, cpoff, tocopy);
+                         cpoff += tocopy;
+                         segsz += tocopy;
+ 
+                         /* We have consumed the current mp. */
+                         if (cpoff == mp_len) {
+                                 mp = mp->b_cont;
+                                 cpoff = 0;
+                         }
+ 
+                         /* We have consumed the current TCB buffer. */
+                         if (I40E_TCB_LEFT(tcb) == 0) {
+                                 tcb = NULL;
+                         }
+ 
+                         /*
+                          * We have met MSS with this copy; restart the
+                          * counters.
+                          */
+                         if (segsz >= mss) {
+                                 segsz = segsz % mss;
+                                 segdesc = segsz == 0 ? 0 : 1;
+                                 force_copy = B_FALSE;
+                         }
+ 
+                         /*
+                          * We are at the controller's descriptor
+                          * limit; we must copy into the current TCB
+                          * until MSS is reached. The TCB buffer is
+                          * always bigger than the MTU so we know it is
+                          * big enough to meet the MSS.
+                          */
+                         if (segdesc == i40e_lso_num_descs) {
+                                 force_copy = B_TRUE;
+                         }
+                 } else {
+                         uint_t tsegdesc = segdesc;
+                         size_t tsegsz = segsz;
+ 
+                         ASSERT(force_copy == B_FALSE);
+                         ASSERT3U(tsegdesc, <, i40e_lso_num_descs);
+ 
+                         tcb = i40e_tx_bind_fragment(itrq, mp, cpoff, B_TRUE);
+                         if (tcb == NULL) {
+                                 i40e_error(i40e, "dma bind failed!");
+                                 goto fail;
+                         }
+ 
+                         for (uint_t i = 0; i < tcb->tcb_bind_ncookies; i++) {
+                                 struct i40e_dma_bind_info dbi =
+                                     tcb->tcb_bind_info[i];
+ 
+                                 tsegsz += dbi.dbi_len;
+                                 tsegdesc++;
+                                 ASSERT3U(tsegdesc, <=, i40e_lso_num_descs);
+ 
+                                 /*
+                                  * We've met the MSS with this portion
+                                  * of the DMA.
+                                  */
+                                 if (tsegsz >= mss) {
+                                         tsegdesc = 1;
+                                         tsegsz = tsegsz % mss;
+                                 }
+ 
+                                 /*
+                                  * We've reached max descriptors but
+                                  * have not met the MSS. Undo the bind
+                                  * and instead copy.
+                                  */
+                                 if (tsegdesc == i40e_lso_num_descs) {
+                                         i40e_tcb_reset(tcb);
+                                         i40e_tcb_free(itrq, tcb);
+ 
+                                         if (tcbtail != NULL &&
+                                             tcbtail->tcb_type == I40E_TX_COPY &&
+                                             I40E_TCB_LEFT(tcbtail) > 0) {
+                                                 tcb = tcbtail;
+                                         } else {
+                                                 tcb = NULL;
+                                         }
+ 
+                                         /*
+                                          * Remember, we are still on
+                                          * the same mp.
+                                          */
+                                         force_copy = B_TRUE;
+                                         txs->itxs_tso_force_copy.value.ui64++;
+                                         goto force_copy;
+                                 }
+                         }
+ 
+                         ASSERT3U(tsegdesc, <=, i40e_lso_num_descs);
+                         ASSERT3U(tsegsz, <, mss);
+ 
+                         /*
+                          * We've made it through the loop without
+                          * breaking the segment descriptor contract
+                          * with the controller -- replace the segment
+                          * tracking values with the temporary ones.
+                          */
+                         segdesc = tsegdesc;
+                         segsz = tsegsz;
+                         needed_desc += tcb->tcb_bind_ncookies;
+                         cpoff = 0;
+                         tcb_list_append(&tcbhead, &tcbtail, tcb);
+                         mp = mp->b_cont;
+                 }
+         }
+ 
+         ASSERT3P(mp, ==, NULL);
+         ASSERT3P(tcbhead, !=, NULL);
+         *ndesc += needed_desc;
+         return (tcbhead);
+ 
+ fail:
+         tcb = tcbhead;
+         while (tcb != NULL) {
+                 i40e_tx_control_block_t *next = tcb->tcb_next;
+ 
+                 ASSERT(tcb->tcb_type == I40E_TX_DMA ||
+                     tcb->tcb_type == I40E_TX_COPY);
+ 
+                 tcb->tcb_mp = NULL;
+                 i40e_tcb_reset(tcb);
+                 i40e_tcb_free(itrq, tcb);
+                 tcb = next;
+         }
+ 
+         return (NULL);
+ }
+ 
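The segment tally that the comment above describes can be illustrated with a minimal, standalone sketch. This is not driver code; the names lso_seg_would_freeze, desc_lens, and desc_limit are illustrative only, and the header double-count special case is ignored for simplicity. The sketch walks a list of data descriptor lengths the way the controller is described to, and reports whether any MSS-sized segment would hit the descriptor limit before accumulating MSS bytes.

#include <stddef.h>

/*
 * Illustrative sketch only -- not part of the driver. Tally bytes toward
 * the current MSS-sized segment while counting descriptors. Return 1 if
 * any segment would reach 'desc_limit' descriptors before reaching 'mss'
 * bytes, 0 otherwise.
 */
int
lso_seg_would_freeze(const size_t *desc_lens, size_t ndescs, size_t mss,
    unsigned int desc_limit)
{
	size_t segsz = 0;
	unsigned int segdesc = 0;

	for (size_t i = 0; i < ndescs; i++) {
		segsz += desc_lens[i];
		segdesc++;

		if (segsz >= mss) {
			/* Segment met; leftover bytes start the next one. */
			segsz %= mss;
			segdesc = (segsz == 0) ? 0 : 1;
		} else if (segdesc >= desc_limit) {
			/* Descriptor limit hit before MSS was reached. */
			return (1);
		}
	}

	return (0);
}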
+ /*
   * We've been asked to send a message block on the wire. We'll only have a
   * single chain. There will not be any b_next pointers; however, there may be
!  * multiple b_cont blocks. The number of b_cont blocks may exceed the
!  * controller's Tx descriptor limit.
   *
   * We may do one of three things with any given mblk_t chain:
   *
   *   1) Drop it
   *   2) Transmit it
*** 2094,2109 ****
   * something.
   */
  mblk_t *
  i40e_ring_tx(void *arg, mblk_t *mp)
  {
!         const mblk_t *nmp;
!         size_t mpsize;
!         i40e_tx_control_block_t *tcb;
!         i40e_tx_desc_t *txdesc;
          i40e_tx_context_t tctx;
!         int cmd, type;
  
          i40e_trqpair_t *itrq = arg;
          i40e_t *i40e = itrq->itrq_i40e;
          i40e_hw_t *hw = &i40e->i40e_hw_space;
          i40e_txq_stat_t *txs = &itrq->itrq_txstat;
--- 2798,2815 ----
   * something.
   */
  mblk_t *
  i40e_ring_tx(void *arg, mblk_t *mp)
  {
!         size_t msglen;
!         i40e_tx_control_block_t *tcb_ctx = NULL, *tcb = NULL, *tcbhead = NULL;
!         i40e_tx_context_desc_t *ctxdesc;
!         mac_ether_offload_info_t meo;
          i40e_tx_context_t tctx;
!         int type;
!         uint_t needed_desc = 0;
!         boolean_t do_ctx_desc = B_FALSE, use_lso = B_FALSE;
  
          i40e_trqpair_t *itrq = arg;
          i40e_t *i40e = itrq->itrq_i40e;
          i40e_hw_t *hw = &i40e->i40e_hw_space;
          i40e_txq_stat_t *txs = &itrq->itrq_txstat;
*** 2117,2235 ****
              (i40e->i40e_link_state != LINK_STATE_UP)) {
                  freemsg(mp);
                  return (NULL);
          }
  
          /*
           * Figure out the relevant context about this frame that we might need
!          * for enabling checksum, lso, etc. This also fills in information that
           * we might set around the packet type, etc.
           */
!         if (i40e_tx_context(i40e, itrq, mp, &tctx) < 0) {
                  freemsg(mp);
                  itrq->itrq_txstat.itxs_err_context.value.ui64++;
                  return (NULL);
          }
  
          /*
           * For the primordial driver we can punt on doing any recycling right
           * now; however, longer term we probably need to do some more pro-active
!          * recycling to cut back on stalls in the tx path.
           */
  
!         /*
!          * Do a quick size check to make sure it fits into what we think it
!          * should for this device. Note that longer term this will be false,
!          * particularly when we have the world of TSO.
!          */
!         mpsize = 0;
!         for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) {
!                 mpsize += MBLKL(nmp);
!         }
  
          /*
!          * First we allocate our tx control block and prepare the packet for
!          * transmit before we do a final check for descriptors. We do it this
!          * way to minimize the time under the tx lock.
           */
!         tcb = i40e_tcb_alloc(itrq);
!         if (tcb == NULL) {
                  txs->itxs_err_notcb.value.ui64++;
                  goto txfail;
          }
  
!         /*
!          * For transmitting a block, we're currently going to use just a
!          * single control block and bcopy all of the fragments into it. We
!          * should be more intelligent about doing DMA binding or otherwise, but
!          * for getting off the ground this will have to do.
!          */
!         ASSERT(tcb->tcb_dma.dmab_len == 0);
!         ASSERT(tcb->tcb_dma.dmab_size >= mpsize);
!         for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) {
!                 size_t clen = MBLKL(nmp);
!                 void *coff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len;
! 
!                 bcopy(nmp->b_rptr, coff, clen);
!                 tcb->tcb_dma.dmab_len += clen;
          }
-         ASSERT(tcb->tcb_dma.dmab_len == mpsize);
  
          /*
!          * While there's really no need to keep the mp here, but let's just do
!          * it to help with our own debugging for now.
           */
-         tcb->tcb_mp = mp;
-         tcb->tcb_type = I40E_TX_COPY;
-         I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV);
- 
          mutex_enter(&itrq->itrq_tx_lock);
!         if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh) {
                  txs->itxs_err_nodescs.value.ui64++;
                  mutex_exit(&itrq->itrq_tx_lock);
                  goto txfail;
          }
  
          /*
!          * Build up the descriptor and send it out. Thankfully at the moment
!          * we only need a single desc, because we're not doing anything fancy
!          * yet.
           */
!         ASSERT(itrq->itrq_desc_free > 0);
          itrq->itrq_desc_free--;
!         txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail];
!         itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb;
!         itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1,
              itrq->itrq_tx_ring_size);
  
!         /*
!          * Note, we always set EOP and RS which indicates that this is the last
!          * data frame and that we should ask for it to be transmitted. We also
!          * must always set ICRC, because that is an internal bit that must be
!          * set to one for data descriptors. The remaining bits in the command
!          * descriptor depend on checksumming and are determined based on the
!          * information set up in i40e_tx_context().
!          */
!         type = I40E_TX_DESC_DTYPE_DATA;
!         cmd = I40E_TX_DESC_CMD_EOP |
!             I40E_TX_DESC_CMD_RS |
!             I40E_TX_DESC_CMD_ICRC |
!             tctx.itc_cmdflags;
!         txdesc->buffer_addr =
!             CPU_TO_LE64((uintptr_t)tcb->tcb_dma.dmab_dma_address);
!         txdesc->cmd_type_offset_bsz = CPU_TO_LE64(((uint64_t)type |
!             ((uint64_t)tctx.itc_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
!             ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
!             ((uint64_t)tcb->tcb_dma.dmab_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
  
          /*
           * Now, finally, sync the DMA data and alert hardware.
           */
          I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV);
  
          I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index),
              itrq->itrq_desc_tail);
          if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) !=
              DDI_FM_OK) {
                  /*
                   * Note, we can't really go through and clean this up very well,
                   * because the memory has been given to the device, so just
--- 2823,2972 ----
              (i40e->i40e_link_state != LINK_STATE_UP)) {
                  freemsg(mp);
                  return (NULL);
          }
  
+         if (mac_ether_offload_info(mp, &meo) != 0) {
+                 freemsg(mp);
+                 itrq->itrq_txstat.itxs_hck_meoifail.value.ui64++;
+                 return (NULL);
+         }
+ 
          /*
           * Figure out the relevant context about this frame that we might need
!          * for enabling checksum, LSO, etc. This also fills in information that
           * we might set around the packet type, etc.
           */
!         if (i40e_tx_context(i40e, itrq, mp, &meo, &tctx) < 0) {
                  freemsg(mp);
                  itrq->itrq_txstat.itxs_err_context.value.ui64++;
                  return (NULL);
          }
+         if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) {
+                 use_lso = B_TRUE;
+                 do_ctx_desc = B_TRUE;
+         }
  
          /*
           * For the primordial driver we can punt on doing any recycling right
           * now; however, longer term we probably need to do some more pro-active
!          * recycling to cut back on stalls in the TX path.
           */
  
!         msglen = msgsize(mp);
  
+         if (do_ctx_desc) {
                  /*
!                  * If we're doing tunneling or LSO, then we'll need a TX
!                  * context descriptor in addition to one or more TX data
!                  * descriptors.  Since there's no data DMA block or handle
!                  * associated with the context descriptor, we create a special
!                  * control block that behaves effectively like a NOP.
                   */
!                 if ((tcb_ctx = i40e_tcb_alloc(itrq)) == NULL) {
                          txs->itxs_err_notcb.value.ui64++;
                          goto txfail;
                  }
+                 tcb_ctx->tcb_type = I40E_TX_DESC;
+                 needed_desc++;
+         }
  
!         if (!use_lso) {
!                 tcbhead = i40e_non_lso_chain(itrq, mp, &needed_desc);
!         } else {
!                 tcbhead = i40e_lso_chain(itrq, mp, &meo, &tctx, &needed_desc);
          }
  
+         if (tcbhead == NULL)
+                 goto txfail;
+ 
+         tcbhead->tcb_mp = mp;
+ 
          /*
!          * The second condition ensures that 'itrq_desc_tail' never
!          * equals 'itrq_desc_head'. This enforces the rule found in
!          * the second bullet point of section 8.4.3.1.5 of the XL710
!          * PG, which declares the TAIL pointer in I40E_QTX_TAIL should
!          * never overlap with the head. This means that we only ever
!          * have 'itrq_tx_ring_size - 1' total available descriptors.
           */
          mutex_enter(&itrq->itrq_tx_lock);
!         if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh ||
!             (itrq->itrq_desc_free - 1) < needed_desc) {
                  txs->itxs_err_nodescs.value.ui64++;
                  mutex_exit(&itrq->itrq_tx_lock);
                  goto txfail;
          }
  
+         if (do_ctx_desc) {
                  /*
!                  * If we're enabling any offloads for this frame, then we'll
!                  * need to build up a transmit context descriptor first.  The
!                  * context descriptor needs to be placed in the TX ring before
!                  * the data descriptor(s).  See section 8.4.2, table 8-16.
                   */
!                 uint_t tail = itrq->itrq_desc_tail;
                  itrq->itrq_desc_free--;
!                 ctxdesc = (i40e_tx_context_desc_t *)&itrq->itrq_desc_ring[tail];
!                 itrq->itrq_tcb_work_list[tail] = tcb_ctx;
!                 itrq->itrq_desc_tail = i40e_next_desc(tail, 1,
                      itrq->itrq_tx_ring_size);
  
!                 /* QW0 */
!                 type = I40E_TX_DESC_DTYPE_CONTEXT;
!                 ctxdesc->tunneling_params = 0;
!                 ctxdesc->l2tag2 = 0;
  
+                 /* QW1 */
+                 ctxdesc->type_cmd_tso_mss = CPU_TO_LE64((uint64_t)type);
+                 if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) {
+                         ctxdesc->type_cmd_tso_mss |= CPU_TO_LE64((uint64_t)
+                             ((uint64_t)tctx.itc_ctx_cmdflags <<
+                             I40E_TXD_CTX_QW1_CMD_SHIFT) |
+                             ((uint64_t)tctx.itc_ctx_tsolen <<
+                             I40E_TXD_CTX_QW1_TSO_LEN_SHIFT) |
+                             ((uint64_t)tctx.itc_ctx_mss <<
+                             I40E_TXD_CTX_QW1_MSS_SHIFT));
+                 }
+         }
+ 
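+         /*
+          * Now place the data descriptors: one for each copy tcb and one for
+          * each DMA cookie of a bind tcb, marking the frame's final
+          * descriptor as the last.
+          */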
+         tcb = tcbhead;
+         while (tcb != NULL) {
+ 
+                 itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb;
+                 if (tcb->tcb_type == I40E_TX_COPY) {
+                         boolean_t last_desc = (tcb->tcb_next == NULL);
+ 
+                         i40e_tx_set_data_desc(itrq, &tctx,
+                             (caddr_t)tcb->tcb_dma.dmab_dma_address,
+                             tcb->tcb_dma.dmab_len, last_desc);
+                 } else {
+                         boolean_t last_desc = B_FALSE;
+                         ASSERT3S(tcb->tcb_type, ==, I40E_TX_DMA);
+ 
+                         for (uint_t c = 0; c < tcb->tcb_bind_ncookies; c++) {
+                                 last_desc = (c == tcb->tcb_bind_ncookies - 1) &&
+                                     (tcb->tcb_next == NULL);
+ 
+                                 i40e_tx_set_data_desc(itrq, &tctx,
+                                     tcb->tcb_bind_info[c].dbi_paddr,
+                                     tcb->tcb_bind_info[c].dbi_len,
+                                     last_desc);
+                         }
+                 }
+ 
+                 tcb = tcb->tcb_next;
+         }
+ 
          /*
           * Now, finally, sync the DMA data and alert hardware.
           */
          I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV);
  
          I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index),
              itrq->itrq_desc_tail);
+ 
          if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) !=
              DDI_FM_OK) {
                  /*
                   * Note, we can't really go through and clean this up very well,
                   * because the memory has been given to the device, so just
*** 2237,2249 ****
                   */
                  ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED);
                  atomic_or_32(&i40e->i40e_state, I40E_ERROR);
          }
  
!         txs->itxs_bytes.value.ui64 += mpsize;
          txs->itxs_packets.value.ui64++;
!         txs->itxs_descriptors.value.ui64++;
  
          mutex_exit(&itrq->itrq_tx_lock);
  
          return (NULL);
  
--- 2974,2986 ----
                   */
                  ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED);
                  atomic_or_32(&i40e->i40e_state, I40E_ERROR);
          }
  
!         txs->itxs_bytes.value.ui64 += msglen;
          txs->itxs_packets.value.ui64++;
!         txs->itxs_descriptors.value.ui64 += needed_desc;
  
          mutex_exit(&itrq->itrq_tx_lock);
  
          return (NULL);
  
*** 2252,2265 ****
           * We ran out of resources. Return it to MAC and indicate that we'll
           * need to signal MAC. If there are allocated tcb's, return them now.
           * Make sure to reset their message blocks, since we'll return them
           * back to MAC.
           */
!         if (tcb != NULL) {
                  tcb->tcb_mp = NULL;
                  i40e_tcb_reset(tcb);
                  i40e_tcb_free(itrq, tcb);
          }
  
          mutex_enter(&itrq->itrq_tx_lock);
          itrq->itrq_tx_blocked = B_TRUE;
          mutex_exit(&itrq->itrq_tx_lock);
--- 2989,3015 ----
           * We ran out of resources. Return it to MAC and indicate that we'll
           * need to signal MAC. If there are allocated tcb's, return them now.
           * Make sure to reset their message blocks, since we'll return them
           * back to MAC.
           */
!         if (tcb_ctx != NULL) {
!                 tcb_ctx->tcb_mp = NULL;
!                 i40e_tcb_reset(tcb_ctx);
!                 i40e_tcb_free(itrq, tcb_ctx);
!         }
! 
!         tcb = tcbhead;
!         while (tcb != NULL) {
!                 i40e_tx_control_block_t *next = tcb->tcb_next;
! 
!                 ASSERT(tcb->tcb_type == I40E_TX_DMA ||
!                     tcb->tcb_type == I40E_TX_COPY);
! 
                  tcb->tcb_mp = NULL;
                  i40e_tcb_reset(tcb);
                  i40e_tcb_free(itrq, tcb);
+                 tcb = next;
          }
  
          mutex_enter(&itrq->itrq_tx_lock);
          itrq->itrq_tx_blocked = B_TRUE;
          mutex_exit(&itrq->itrq_tx_lock);