NEX-20178 Heavy read load using 10G i40e causes network disconnect
MFV illumos-joyent@83a8d0d616db36010b59cc850d1926c0f6a30de1
OS-7457 i40e Tx freezes on zero descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
OS-7492 i40e Tx freeze when b_cont chain exceeds 8 descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@b4bede175d4c50ac1b36078a677b69388f6fb59f
OS-7577 initialize FC for i40e
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@61dc3dec4f82a3e13e94609a0a83d5f66c64e760
OS-6846 want i40e multi-group support
OS-7372 i40e_alloc_ring_mem() unwinds when it shouldn't
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@6f6fae1b433b461a7b014f48ad94fc7f4927c6ed
OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@757454db6669c1186f60bc625510c1b67217aae6
OS-7082 i40e: blown assert in i40e_tx_cleanup_ring()
OS-7086 i40e: add mdb dcmd to dump info on tx descriptor rings
OS-7101 i40e: add kstat to track TX DMA bind failures
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@9e30beee2f0c127bf41868db46257124206e28d6
OS-5225 Want Fortville TSO support
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
*** 9,19 ****
* http://www.illumos.org/license/CDDL.
*/
/*
* Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
! * Copyright 2016 Joyent, Inc.
*/
#include "i40e_sw.h"
/*
--- 9,19 ----
* http://www.illumos.org/license/CDDL.
*/
/*
* Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved.
! * Copyright 2019 Joyent, Inc.
*/
#include "i40e_sw.h"
/*
*** 58,80 ****
* i40e_t`i40e_sdu changes.
*
* This size is then rounded up to the nearest 1k chunk, which represents the
* actual amount of memory that we'll allocate for a single frame.
*
! * Note, that for rx, we do something that might be unexpected. We always add
* an extra two bytes to the frame size that we allocate. We then offset the DMA
* address that we receive a packet into by two bytes. This ensures that the IP
* header will always be 4 byte aligned because the MAC header is either 14 or
* 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's
* and MAC's lives easier.
*
! * Both the rx and tx descriptor rings (which are what we use to communicate
* with hardware) are allocated as a single region of DMA memory which is the
* size of the descriptor (4 bytes and 2 bytes respectively) times the total
! * number of descriptors for an rx and tx ring.
*
! * While the rx and tx descriptors are allocated using DMA-based memory, the
* control blocks for each of them are allocated using normal kernel memory.
* They aren't special from a DMA perspective. We'll go over the design of both
* receiving and transmitting separately, as they have slightly different
* control blocks and different ways that we manage the relationship between
* control blocks and descriptors.
--- 58,80 ----
* i40e_t`i40e_sdu changes.
*
* This size is then rounded up to the nearest 1k chunk, which represents the
* actual amount of memory that we'll allocate for a single frame.
*
! * Note, that for RX, we do something that might be unexpected. We always add
* an extra two bytes to the frame size that we allocate. We then offset the DMA
* address that we receive a packet into by two bytes. This ensures that the IP
* header will always be 4 byte aligned because the MAC header is either 14 or
* 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's
* and MAC's lives easier.
*
! * Both the RX and TX descriptor rings (which are what we use to communicate
* with hardware) are allocated as a single region of DMA memory which is the
* size of the descriptor (4 bytes and 2 bytes respectively) times the total
! * number of descriptors for an RX and TX ring.
*
! * While the RX and TX descriptors are allocated using DMA-based memory, the
* control blocks for each of them are allocated using normal kernel memory.
* They aren't special from a DMA perspective. We'll go over the design of both
* receiving and transmitting separately, as they have slightly different
* control blocks and different ways that we manage the relationship between
* control blocks and descriptors.
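As a quick illustration of the alignment arithmetic described in the comment above, here is a small standalone sketch (not driver code; the values simply restate the comment):

#include <stdio.h>

/*
 * With a 2-byte pad in front of the received frame, the IP header lands
 * on a 4-byte boundary for both untagged (14-byte) and 802.1Q-tagged
 * (18-byte) MAC headers.
 */
int
main(void)
{
	unsigned pad = 2;
	unsigned mac_hdr_len[] = { 14, 18 };

	for (unsigned i = 0; i < 2; i++) {
		unsigned ip_off = pad + mac_hdr_len[i];
		printf("MAC header %u bytes -> IP header at offset %u (%s)\n",
		    mac_hdr_len[i], ip_off,
		    (ip_off % 4) == 0 ? "4-byte aligned" : "misaligned");
	}
	return (0);
}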
*** 111,145 ****
* builds, we allow someone to whack the variable i40e_debug_rx_mode to override
* the behavior and always do a bcopy or a DMA bind.
*
* To try and ensure that the device always has blocks that it can receive data
* into, we maintain two lists of control blocks, a working list and a free
! * list. Each list is sized equal to the number of descriptors in the rx ring.
! * During the GLDv3 mc_start routine, we allocate a number of rx control blocks
* equal to twice the number of descriptors in the ring and we assign them
* equally to the free list and to the working list. Each control block also has
* DMA memory allocated and associated with which it will be used to receive the
* actual packet data. All of a received frame's data will end up in a single
* DMA buffer.
*
! * During operation, we always maintain the invariant that each rx descriptor
! * has an associated rx control block which lives in the working list. If we
* feel that we should loan up DMA memory to MAC in the form of a message block,
* we can only do so if we can maintain this invariant. To do that, we swap in
* one of the buffers from the free list. If none are available, then we resort
* to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the
* size.
*
* Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is
! * called on the block, at which point we restore the rx control block to the
* free list and are able to reuse the DMA memory again. While the scheme may
* seem odd, it importantly keeps us out of trying to do any DMA allocations in
* the normal path of operation, even though we may still have to allocate
* message blocks and copy.
*
! * The following state machine describes the life time of a rx control block. In
! * the diagram we abbrviate the rx ring descriptor entry as rxd and the rx
* control block entry as rcb.
*
* | |
* * ... 1/2 of all initial rcb's ... *
* | |
--- 111,145 ----
* builds, we allow someone to whack the variable i40e_debug_rx_mode to override
* the behavior and always do a bcopy or a DMA bind.
*
* To try and ensure that the device always has blocks that it can receive data
* into, we maintain two lists of control blocks, a working list and a free
! * list. Each list is sized equal to the number of descriptors in the RX ring.
! * During the GLDv3 mc_start routine, we allocate a number of RX control blocks
* equal to twice the number of descriptors in the ring and we assign them
* equally to the free list and to the working list. Each control block also has
* DMA memory allocated and associated with which it will be used to receive the
* actual packet data. All of a received frame's data will end up in a single
* DMA buffer.
*
! * During operation, we always maintain the invariant that each RX descriptor
! * has an associated RX control block which lives in the working list. If we
* feel that we should loan up DMA memory to MAC in the form of a message block,
* we can only do so if we can maintain this invariant. To do that, we swap in
* one of the buffers from the free list. If none are available, then we resort
* to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the
* size.
*
* Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is
! * called on the block, at which point we restore the RX control block to the
* free list and are able to reuse the DMA memory again. While the scheme may
* seem odd, it importantly keeps us out of trying to do any DMA allocations in
* the normal path of operation, even though we may still have to allocate
* message blocks and copy.
*
! * The following state machine describes the lifetime of an RX control block. In
! * the diagram we abbreviate the RX ring descriptor entry as rxd and the RX
* control block entry as rcb.
*
* | |
* * ... 1/2 of all initial rcb's ... *
* | |
*** 158,172 ****
* | and it is v
* | recycled. +-------------------+
* +--------------------<-----| rcb loaned to MAC |
* +-------------------+
*
! * Finally, note that every rx control block has a reference count on it. One
* reference is added as long as the driver has had the GLDv3 mc_start endpoint
* called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and
* no other DLPI consumers remain, then we'll decrement the reference count by
! * one. Whenever we loan up the rx control block and associated buffer to MAC,
* then we bump the reference count again. Even though the device is stopped,
* there may still be loaned frames in upper levels that we'll want to account
* for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure
* that it is cleaned up.
*
--- 158,172 ----
* | and it is v
* | recycled. +-------------------+
* +--------------------<-----| rcb loaned to MAC |
* +-------------------+
*
! * Finally, note that every RX control block has a reference count on it. One
* reference is added as long as the driver has had the GLDv3 mc_start endpoint
* called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and
* no other DLPI consumers remain, then we'll decrement the reference count by
! * one. Whenever we loan up the RX control block and associated buffer to MAC,
* then we bump the reference count again. Even though the device is stopped,
* there may still be loaned frames in upper levels that we'll want to account
* for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure
* that it is cleaned up.
*
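The loan/free-list invariant and reference counting described above can be sketched in a few lines. This is a simplified illustration under those stated rules; the structure and function names here are hypothetical stand-ins, not the driver's identifiers:

#include <stdatomic.h>
#include <string.h>

typedef struct rcb {
	void		*rcb_buf;	/* receive DMA buffer */
	struct rcb	*rcb_next;
} rcb_t;

typedef struct rx_data {
	rcb_t		*rxd_free_list;	 /* spare rcbs for swapping in */
	rcb_t		**rxd_work_list; /* one rcb per RX descriptor */
	atomic_uint	rxd_ref;	 /* held by mc_start plus each loan */
} rx_data_t;

/*
 * A buffer may only go upstream if a spare rcb can replace it in the
 * work list (preserving the invariant that every descriptor has an rcb).
 * Otherwise the frame is copied and the original rcb stays in place.
 * Each successful loan takes a reference, released later by the
 * freemsg(9F)/freeb(9F) callback.
 */
rcb_t *
rx_loan_or_copy(rx_data_t *rxd, unsigned desc, void *copy_dst, size_t len)
{
	rcb_t *loaned = rxd->rxd_work_list[desc];
	rcb_t *spare = rxd->rxd_free_list;

	if (spare == NULL) {
		memcpy(copy_dst, loaned->rcb_buf, len);	/* allocb+bcopy path */
		return (NULL);
	}

	rxd->rxd_free_list = spare->rcb_next;
	rxd->rxd_work_list[desc] = spare;
	atomic_fetch_add(&rxd->rxd_ref, 1);
	return (loaned);	/* caller wraps this buffer in an mblk */
}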
*** 190,203 ****
* the HEAD and TAIL, inclusive. Note that while we initially program the HEAD,
* the only values we ever consult ourselves are the TAIL register and our own
* state tracking. Effectively, we cache the HEAD register and then update it
* ourselves based on our work.
*
! * When we iterate over the rx descriptors and thus the received frames, we are
* either in an interrupt context or we've been asked by MAC to poll on the
* ring. If we've been asked to poll on the ring, we have a maximum number of
! * bytes of mblk_t's to return. If processing an rx descriptor would cause us to
* exceed that count, then we do not process it. When in interrupt context, we
* don't have a strict byte count. However, to ensure liveness, we limit the
* amount of data based on a configuration value
* (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this
* is based on similar numbers that are used for ixgbe. After some additional
--- 190,203 ----
* the HEAD and TAIL, inclusive. Note that while we initially program the HEAD,
* the only values we ever consult ourselves are the TAIL register and our own
* state tracking. Effectively, we cache the HEAD register and then update it
* ourselves based on our work.
*
! * When we iterate over the RX descriptors and thus the received frames, we are
* either in an interrupt context or we've been asked by MAC to poll on the
* ring. If we've been asked to poll on the ring, we have a maximum number of
! * bytes of mblk_t's to return. If processing an RX descriptor would cause us to
* exceed that count, then we do not process it. When in interrupt context, we
* don't have a strict byte count. However, to ensure liveness, we limit the
* amount of data based on a configuration value
* (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this
* is based on similar numbers that are used for ixgbe. After some additional
*** 247,281 ****
*
* While the transmit path is similar in spirit to the receive path, it works
* differently due to the fact that all data is originated by the operating
* system and not by the device.
*
! * Like rx, there is both a descriptor ring that we use to communicate to the
* driver and which points to the memory used to transmit a frame. Similarly,
! * there is a corresponding transmit control block. Each transmit control block
! * has a region of DMA memory allocated to it; however, the way we use it
! * varies.
*
* The driver is asked to process a single frame at a time. That message block
* may be made up of multiple fragments linked together by the mblk_t`b_cont
* member. The device has a hard limit of up to 8 buffers being allowed for use
! * for a single logical frame. For each fragment, we'll try and use an entry
! * from the tx descriptor ring and then we'll allocate a corresponding tx
! * control block. Depending on the size of the fragment, we may copy it around
! * or we might instead try to do DMA binding of the fragment.
*
! * If we exceed the number of blocks that fit, we'll try to pull up the block
! * and then we'll do a DMA bind and send it out.
*
! * If we don't have enough space in the ring or tx control blocks available,
* then we'll return the unprocessed message block to MAC. This will induce flow
* control and once we recycle enough entries, we'll once again enable sending
* on the ring.
*
* We size the working list as equal to the number of descriptors in the ring.
* We size the free list as equal to 1.5 times the number of descriptors in the
! * ring. We'll allocate a number of tx control block entries equal to the number
* of entries in the free list. By default, all entries are placed in the free
* list. As we come along and try to send something, we'll allocate entries from
* the free list and add them to the working list, where they'll stay until the
* hardware indicates that all of the data has been written back to us. The
* reason that we start with 1.5x is to help facilitate having more than one TX
--- 247,304 ----
*
* While the transmit path is similar in spirit to the receive path, it works
* differently due to the fact that all data is originated by the operating
* system and not by the device.
*
! * Like RX, there is both a descriptor ring that we use to communicate to the
* driver and which points to the memory used to transmit a frame. Similarly,
! * there is a corresponding transmit control block; however, the correspondence
! * between descriptors and control blocks is more complex and not necessarily
! * 1-to-1.
*
* The driver is asked to process a single frame at a time. That message block
* may be made up of multiple fragments linked together by the mblk_t`b_cont
* member. The device has a hard limit of up to 8 buffers being allowed for use
! * for a single non-LSO packet or LSO segment. The number of TX ring entries
! * (and thus TX control blocks) used depends on the fragment sizes and DMA
! * layout, as explained below.
*
! * We alter our DMA strategy based on a threshold tied to the fragment size.
! * This threshold is configurable via the tx_dma_threshold property. If the
! * fragment is above the threshold, we DMA bind it -- consuming one TCB and
! * potentially several data descriptors. The exact number of descriptors (equal
! * to the number of DMA cookies) depends on page size, MTU size, b_rptr offset
! * into page, b_wptr offset into page, and the physical layout of the dblk's
! * memory (contiguous or not). Essentially, we are at the mercy of the DMA
! * engine and the dblk's memory allocation. Knowing the exact number of
! * descriptors up front is a task best not taken on by the driver itself.
! * Instead, we attempt to DMA bind the fragment and verify the descriptor
! * layout meets hardware constraints. If the proposed DMA bind does not satisfy
! * the hardware constraints, then we discard it and instead copy the entire
! * fragment into the pre-allocated TCB buffer (or buffers if the fragment is
! * larger than the TCB buffer).
*
! * If the fragment is below or at the threshold, we copy it to the pre-allocated
! * buffer of a TCB. We compress consecutive copy fragments into a single TCB to
! * conserve resources. We are guaranteed that the TCB buffer is made up of only
! * 1 DMA cookie; and therefore consumes only one descriptor on the controller.
! *
! * Furthermore, if the frame requires HW offloads such as LSO, tunneling or
! * filtering, then the TX data descriptors must be preceded by a single TX
! * context descriptor. Because there is no DMA transfer associated with the
! * context descriptor, we allocate a control block with a special type which
! * indicates to the TX ring recycle code that there are no associated DMA
! * resources to unbind when the control block is free'd.
! *
! * If we don't have enough space in the ring or TX control blocks available,
* then we'll return the unprocessed message block to MAC. This will induce flow
* control and once we recycle enough entries, we'll once again enable sending
* on the ring.
*
* We size the working list as equal to the number of descriptors in the ring.
* We size the free list as equal to 1.5 times the number of descriptors in the
! * ring. We'll allocate a number of TX control block entries equal to the number
* of entries in the free list. By default, all entries are placed in the free
* list. As we come along and try to send something, we'll allocate entries from
* the free list and add them to the working list, where they'll stay until the
* hardware indicates that all of the data has been written back to us. The
* reason that we start with 1.5x is to help facilitate having more than one TX
*** 323,356 ****
* |
* v
* +------------------+ +------------------+
* | tcb on free list |---*------------------>| tcb on work list |
* +------------------+ . +------------------+
! * ^ . tcb allocated |
* | to send frame v
* | or fragment on |
* | wire, mblk from |
* | MAC associated. |
* | |
* +------*-------------------------------<----+
* .
* . Hardware indicates
* entry transmitted.
! * tcb recycled, mblk
* from MAC freed.
*
* ------------
* Blocking MAC
* ------------
*
! * Wen performing transmit, we can run out of descriptors and ring entries. When
! * such a case happens, we return the mblk_t to MAC to indicate that we've been
! * blocked. At that point in time, MAC becomes blocked and will not transmit
! * anything out that specific ring until we notify MAC. To indicate that we're
! * in such a situation we set i40e_trqpair_t`itrq_tx_blocked member to B_TRUE.
*
! * When we recycle tx descriptors then we'll end up signaling MAC by calling
* mac_tx_ring_update() if we were blocked, letting it know that it's safe to
* start sending frames out to us again.
*/
/*
--- 346,386 ----
* |
* v
* +------------------+ +------------------+
* | tcb on free list |---*------------------>| tcb on work list |
* +------------------+ . +------------------+
! * ^ . N tcbs allocated[1] |
* | to send frame v
* | or fragment on |
* | wire, mblk from |
* | MAC associated. |
* | |
* +------*-------------------------------<----+
* .
* . Hardware indicates
* entry transmitted.
! * tcbs recycled, mblk
* from MAC freed.
*
+ * [1] We allocate N tcbs to transmit a single frame where N can be 1 context
+ * descriptor plus 1 data descriptor, in the non-DMA-bind case. In the DMA
+ * bind case, N can be 1 context descriptor plus 1 data descriptor per
+ * b_cont in the mblk. In this case, the mblk is associated with the first
+ * data descriptor and freed as part of freeing that data descriptor.
+ *
* ------------
* Blocking MAC
* ------------
*
! * When performing transmit, we can run out of descriptors and ring entries.
! * When such a case happens, we return the mblk_t to MAC to indicate that we've
! * been blocked. At that point in time, MAC becomes blocked and will not
! * transmit anything out that specific ring until we notify MAC. To indicate
! * that we're in such a situation we set i40e_trqpair_t`itrq_tx_blocked member
! * to B_TRUE.
*
! * When we recycle TX descriptors then we'll end up signaling MAC by calling
* mac_tx_ring_update() if we were blocked, letting it know that it's safe to
* start sending frames out to us again.
*/
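The copy-versus-bind decision and the 8-descriptor budget described earlier in this theory statement reduce to a small predicate. The following is only a sketch under the assumptions stated in the comment (threshold-based copying, one descriptor per copy TCB, one descriptor per DMA cookie); the names are illustrative, not the driver's:

#include <stdbool.h>
#include <stddef.h>

#define TX_MAX_DESC_PER_PACKET	8	/* hardware limit cited above */

/*
 * Decide, per fragment, whether to copy it into a pre-allocated TCB
 * buffer (always one descriptor) or DMA bind it (one descriptor per
 * cookie), forcing a copy when the bind would exceed the per-packet
 * descriptor budget or leave no descriptor for the remaining data.
 */
bool
tx_fragment_should_bind(size_t frag_len, size_t dma_threshold,
    unsigned cookies_if_bound, unsigned descs_used, bool more_frags)
{
	unsigned total;

	if (frag_len <= dma_threshold)
		return (false);			/* small fragment: copy */

	total = descs_used + cookies_if_bound;
	if (total > TX_MAX_DESC_PER_PACKET ||
	    (total == TX_MAX_DESC_PER_PACKET && more_frags))
		return (false);			/* would overflow: force copy */

	return (true);
}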
/*
*** 365,381 ****
#error "unknown architecture for i40e"
#endif
/*
* This structure is used to maintain information and flags related to
! * transmitting a frame. The first member is the set of flags we need to or into
! * the command word (generally checksumming related). The second member controls
! * the word offsets which is required for IP and L4 checksumming.
*/
typedef struct i40e_tx_context {
! enum i40e_tx_desc_cmd_bits itc_cmdflags;
! uint32_t itc_offsets;
} i40e_tx_context_t;
/*
* Toggles on debug builds which can be used to override our RX behaviour based
* on thresholds.
--- 395,413 ----
#error "unknown architecture for i40e"
#endif
/*
* This structure is used to maintain information and flags related to
! * transmitting a frame. These fields are ultimately used to construct the
! * TX data descriptor(s) and, if necessary, the TX context descriptor.
*/
typedef struct i40e_tx_context {
! enum i40e_tx_desc_cmd_bits itc_data_cmdflags;
! uint32_t itc_data_offsets;
! enum i40e_tx_ctx_desc_cmd_bits itc_ctx_cmdflags;
! uint32_t itc_ctx_tsolen;
! uint32_t itc_ctx_mss;
} i40e_tx_context_t;
/*
* Toggles on debug builds which can be used to override our RX behaviour based
* on thresholds.
*** 393,410 ****
/*
* Notes on the following pair of DMA attributes. The first attribute,
* i40e_static_dma_attr, is designed to be used for both the descriptor rings
* and the static buffers that we associate with control blocks. For this
* reason, we force an SGL length of one. While technically the driver supports
! * a larger SGL (5 on rx and 8 on tx), we opt to only use one to simplify our
* management here. In addition, when the Intel common code wants to allocate
* memory via the i40e_allocate_virt_mem osdep function, we have it leverage
* the static dma attr.
*
! * The second set of attributes, i40e_txbind_dma_attr, is what we use when we're
! * binding a bunch of mblk_t fragments to go out the door. Note that the main
! * difference here is that we're allowed a larger SGL length -- eight.
*
* Note, we default to setting ourselves to be DMA capable here. However,
* because we could have multiple instances which have different FMA error
* checking capabilities, or end up on different buses, we make these static
* and const and copy them into the i40e_t for the given device with the actual
--- 425,446 ----
/*
* Notes on the following pair of DMA attributes. The first attribute,
* i40e_static_dma_attr, is designed to be used for both the descriptor rings
* and the static buffers that we associate with control blocks. For this
* reason, we force an SGL length of one. While technically the driver supports
! * a larger SGL (5 on RX and 8 on TX), we opt to only use one to simplify our
* management here. In addition, when the Intel common code wants to allocate
* memory via the i40e_allocate_virt_mem osdep function, we have it leverage
* the static dma attr.
*
! * The latter two sets of attributes are what we use when we're binding a
! * bunch of mblk_t fragments to go out the door. Note that the main difference
! * here is that we're allowed a larger SGL length. For non-LSO TX, we
! * restrict the SGL length to match the number of TX buffers available to the
! * PF (8). For the LSO case we can go much larger, with the caveat that each
! * MSS-sized chunk (segment) must not span more than 8 data descriptors and
! * hence must not span more than 8 cookies.
*
* Note, we default to setting ourselves to be DMA capable here. However,
* because we could have multiple instances which have different FMA error
* checking capabilities, or end up on different buses, we make these static
* and const and copy them into the i40e_t for the given device with the actual
*** 427,437 ****
static const ddi_dma_attr_t i40e_g_txbind_dma_attr = {
DMA_ATTR_V0, /* version number */
0x0000000000000000ull, /* low address */
0xFFFFFFFFFFFFFFFFull, /* high address */
! 0x00000000FFFFFFFFull, /* dma counter max */
I40E_DMA_ALIGNMENT, /* alignment */
0x00000FFF, /* burst sizes */
0x00000001, /* minimum transfer size */
0x00000000FFFFFFFFull, /* maximum transfer size */
0xFFFFFFFFFFFFFFFFull, /* maximum segment size */
--- 463,473 ----
static const ddi_dma_attr_t i40e_g_txbind_dma_attr = {
DMA_ATTR_V0, /* version number */
0x0000000000000000ull, /* low address */
0xFFFFFFFFFFFFFFFFull, /* high address */
! I40E_MAX_TX_BUFSZ - 1, /* dma counter max */
I40E_DMA_ALIGNMENT, /* alignment */
0x00000FFF, /* burst sizes */
0x00000001, /* minimum transfer size */
0x00000000FFFFFFFFull, /* maximum transfer size */
0xFFFFFFFFFFFFFFFFull, /* maximum segment size */
*** 438,447 ****
--- 474,498 ----
I40E_TX_MAX_COOKIE, /* scatter/gather list length */
0x00000001, /* granularity */
DDI_DMA_FLAGERR /* DMA flags */
};
+ static const ddi_dma_attr_t i40e_g_txbind_lso_dma_attr = {
+ DMA_ATTR_V0, /* version number */
+ 0x0000000000000000ull, /* low address */
+ 0xFFFFFFFFFFFFFFFFull, /* high address */
+ I40E_MAX_TX_BUFSZ - 1, /* dma counter max */
+ I40E_DMA_ALIGNMENT, /* alignment */
+ 0x00000FFF, /* burst sizes */
+ 0x00000001, /* minimum transfer size */
+ 0x00000000FFFFFFFFull, /* maximum transfer size */
+ 0xFFFFFFFFFFFFFFFFull, /* maximum segment size */
+ I40E_TX_LSO_MAX_COOKIE, /* scatter/gather list length */
+ 0x00000001, /* granularity */
+ DDI_DMA_FLAGERR /* DMA flags */
+ };
+
/*
* Next, we have the attributes for these structures. The descriptor rings are
* all strictly little endian, while the data buffers are just arrays of bytes
* representing frames. Because of this, we purposefully simplify the driver
* programming life by programming the descriptor ring as little endian, while
*** 666,685 ****
rxd->rxd_rcb_free = rxd->rxd_free_list_size;
rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
rxd->rxd_ring_size, KM_NOSLEEP);
if (rxd->rxd_work_list == NULL) {
! i40e_error(i40e, "failed to allocate rx work list for a ring "
"of %d entries for ring %d", rxd->rxd_ring_size,
itrq->itrq_index);
goto cleanup;
}
rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
rxd->rxd_free_list_size, KM_NOSLEEP);
if (rxd->rxd_free_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry rx free list "
"for ring %d", rxd->rxd_free_list_size, itrq->itrq_index);
goto cleanup;
}
rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) *
--- 717,736 ----
rxd->rxd_rcb_free = rxd->rxd_free_list_size;
rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
rxd->rxd_ring_size, KM_NOSLEEP);
if (rxd->rxd_work_list == NULL) {
! i40e_error(i40e, "failed to allocate RX work list for a ring "
"of %d entries for ring %d", rxd->rxd_ring_size,
itrq->itrq_index);
goto cleanup;
}
rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) *
rxd->rxd_free_list_size, KM_NOSLEEP);
if (rxd->rxd_free_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry RX free list "
"for ring %d", rxd->rxd_free_list_size, itrq->itrq_index);
goto cleanup;
}
rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) *
*** 763,781 ****
size_t dmasz;
i40e_rx_control_block_t *rcb;
i40e_t *i40e = rxd->rxd_i40e;
/*
! * First allocate the rx descriptor ring.
*/
dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size;
VERIFY(dmasz > 0);
if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area,
&i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE,
B_TRUE, dmasz) == B_FALSE) {
i40e_error(i40e, "failed to allocate DMA resources "
! "for rx descriptor ring");
return (B_FALSE);
}
rxd->rxd_desc_ring =
(i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address;
rxd->rxd_desc_next = 0;
--- 814,832 ----
size_t dmasz;
i40e_rx_control_block_t *rcb;
i40e_t *i40e = rxd->rxd_i40e;
/*
! * First allocate the RX descriptor ring.
*/
dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size;
VERIFY(dmasz > 0);
if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area,
&i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE,
B_TRUE, dmasz) == B_FALSE) {
i40e_error(i40e, "failed to allocate DMA resources "
! "for RX descriptor ring");
return (B_FALSE);
}
rxd->rxd_desc_ring =
(i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address;
rxd->rxd_desc_next = 0;
*** 797,807 ****
dmap = &rcb->rcb_dma;
if (i40e_alloc_dma_buffer(i40e, dmap,
&i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
B_TRUE, B_FALSE, dmasz) == B_FALSE) {
! i40e_error(i40e, "failed to allocate rx dma buffer");
return (B_FALSE);
}
/*
* Initialize the control block and offset the DMA address. See
--- 848,858 ----
dmap = &rcb->rcb_dma;
if (i40e_alloc_dma_buffer(i40e, dmap,
&i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
B_TRUE, B_FALSE, dmasz) == B_FALSE) {
! i40e_error(i40e, "failed to allocate RX dma buffer");
return (B_FALSE);
}
/*
* Initialize the control block and offset the DMA address. See
*** 839,849 ****
--- 890,904 ----
i40e_free_dma_buffer(&tcb->tcb_dma);
if (tcb->tcb_dma_handle != NULL) {
ddi_dma_free_handle(&tcb->tcb_dma_handle);
tcb->tcb_dma_handle = NULL;
}
+ if (tcb->tcb_lso_dma_handle != NULL) {
+ ddi_dma_free_handle(&tcb->tcb_lso_dma_handle);
+ tcb->tcb_lso_dma_handle = NULL;
}
+ }
fsz = sizeof (i40e_tx_control_block_t) *
itrq->itrq_tx_free_list_size;
kmem_free(itrq->itrq_tcb_area, fsz);
itrq->itrq_tcb_area = NULL;
*** 879,898 ****
itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size;
itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size +
(i40e->i40e_tx_ring_size >> 1);
/*
! * Allocate an additional tx descriptor for the writeback head.
*/
dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size;
dmasz += sizeof (i40e_tx_desc_t);
VERIFY(dmasz > 0);
if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area,
&i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr,
B_FALSE, B_TRUE, dmasz) == B_FALSE) {
! i40e_error(i40e, "failed to allocate DMA resources for tx "
"descriptor ring");
return (B_FALSE);
}
itrq->itrq_desc_ring =
(i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address;
--- 934,953 ----
itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size;
itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size +
(i40e->i40e_tx_ring_size >> 1);
/*
! * Allocate an additional TX descriptor for the writeback head.
*/
dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size;
dmasz += sizeof (i40e_tx_desc_t);
VERIFY(dmasz > 0);
if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area,
&i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr,
B_FALSE, B_TRUE, dmasz) == B_FALSE) {
! i40e_error(i40e, "failed to allocate DMA resources for TX "
"descriptor ring");
return (B_FALSE);
}
itrq->itrq_desc_ring =
(i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address;
*** 903,928 ****
itrq->itrq_desc_free = itrq->itrq_tx_ring_size;
itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size *
sizeof (i40e_tx_control_block_t *), KM_NOSLEEP);
if (itrq->itrq_tcb_work_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry tx work list "
"for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index);
goto cleanup;
}
itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size *
sizeof (i40e_tx_control_block_t *), KM_SLEEP);
if (itrq->itrq_tcb_free_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry tx free list "
"for ring %d", itrq->itrq_tx_free_list_size,
itrq->itrq_index);
goto cleanup;
}
/*
! * We allocate enough tx control blocks to cover the free list.
*/
itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) *
itrq->itrq_tx_free_list_size, KM_NOSLEEP);
if (itrq->itrq_tcb_area == NULL) {
i40e_error(i40e, "failed to allocate a %d entry tcb area for "
--- 958,983 ----
itrq->itrq_desc_free = itrq->itrq_tx_ring_size;
itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size *
sizeof (i40e_tx_control_block_t *), KM_NOSLEEP);
if (itrq->itrq_tcb_work_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry TX work list "
"for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index);
goto cleanup;
}
itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size *
sizeof (i40e_tx_control_block_t *), KM_SLEEP);
if (itrq->itrq_tcb_free_list == NULL) {
! i40e_error(i40e, "failed to allocate a %d entry TX free list "
"for ring %d", itrq->itrq_tx_free_list_size,
itrq->itrq_index);
goto cleanup;
}
/*
! * We allocate enough TX control blocks to cover the free list.
*/
itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) *
itrq->itrq_tx_free_list_size, KM_NOSLEEP);
if (itrq->itrq_tcb_area == NULL) {
i40e_error(i40e, "failed to allocate a %d entry tcb area for "
*** 946,967 ****
*/
ret = ddi_dma_alloc_handle(i40e->i40e_dip,
&i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL,
&tcb->tcb_dma_handle);
if (ret != DDI_SUCCESS) {
! i40e_error(i40e, "failed to allocate DMA handle for tx "
"data binding on ring %d: %d", itrq->itrq_index,
ret);
tcb->tcb_dma_handle = NULL;
goto cleanup;
}
if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma,
&i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
B_TRUE, B_FALSE, dmasz) == B_FALSE) {
i40e_error(i40e, "failed to allocate %ld bytes of "
! "DMA for tx data binding on ring %d", dmasz,
itrq->itrq_index);
goto cleanup;
}
itrq->itrq_tcb_free_list[i] = tcb;
--- 1001,1033 ----
*/
ret = ddi_dma_alloc_handle(i40e->i40e_dip,
&i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL,
&tcb->tcb_dma_handle);
if (ret != DDI_SUCCESS) {
! i40e_error(i40e, "failed to allocate DMA handle for TX "
"data binding on ring %d: %d", itrq->itrq_index,
ret);
tcb->tcb_dma_handle = NULL;
goto cleanup;
}
+ ret = ddi_dma_alloc_handle(i40e->i40e_dip,
+ &i40e->i40e_txbind_lso_dma_attr, DDI_DMA_DONTWAIT, NULL,
+ &tcb->tcb_lso_dma_handle);
+ if (ret != DDI_SUCCESS) {
+ i40e_error(i40e, "failed to allocate DMA handle for TX "
+ "LSO data binding on ring %d: %d", itrq->itrq_index,
+ ret);
+ tcb->tcb_lso_dma_handle = NULL;
+ goto cleanup;
+ }
+
if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma,
&i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr,
B_TRUE, B_FALSE, dmasz) == B_FALSE) {
i40e_error(i40e, "failed to allocate %ld bytes of "
! "DMA for TX data binding on ring %d", dmasz,
itrq->itrq_index);
goto cleanup;
}
itrq->itrq_tcb_free_list[i] = tcb;
*** 987,1000 ****
for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata;
/*
! * Clean up our rx data. We have to free DMA resources first and
* then if we have no more pending RCB's, then we'll go ahead
* and clean things up. Note, we can't set the stopped flag on
! * the rx data until after we've done the first pass of the
* pending resources. Otherwise we might race with
* i40e_rx_recycle on determining who should free the
* i40e_rx_data_t above.
*/
i40e_free_rx_dma(rxd, failed_init);
--- 1053,1073 ----
for (i = 0; i < i40e->i40e_num_trqpairs; i++) {
i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata;
/*
! * In some cases i40e_alloc_rx_data() may have failed
! * and in that case there is no rxd to free.
! */
! if (rxd == NULL)
! continue;
!
! /*
! * Clean up our RX data. We have to free DMA resources first and
* then if we have no more pending RCB's, then we'll go ahead
* and clean things up. Note, we can't set the stopped flag on
! * the RX data until after we've done the first pass of the
* pending resources. Otherwise we might race with
* i40e_rx_recycle on determining who should free the
* i40e_rx_data_t above.
*/
i40e_free_rx_dma(rxd, failed_init);
*** 1053,1073 ****
--- 1126,1152 ----
{
bcopy(&i40e_g_static_dma_attr, &i40e->i40e_static_dma_attr,
sizeof (ddi_dma_attr_t));
bcopy(&i40e_g_txbind_dma_attr, &i40e->i40e_txbind_dma_attr,
sizeof (ddi_dma_attr_t));
+ bcopy(&i40e_g_txbind_lso_dma_attr, &i40e->i40e_txbind_lso_dma_attr,
+ sizeof (ddi_dma_attr_t));
bcopy(&i40e_g_desc_acc_attr, &i40e->i40e_desc_acc_attr,
sizeof (ddi_device_acc_attr_t));
bcopy(&i40e_g_buf_acc_attr, &i40e->i40e_buf_acc_attr,
sizeof (ddi_device_acc_attr_t));
if (fma == B_TRUE) {
i40e->i40e_static_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
i40e->i40e_txbind_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
+ i40e->i40e_txbind_lso_dma_attr.dma_attr_flags |=
+ DDI_DMA_FLAGERR;
} else {
i40e->i40e_static_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR;
i40e->i40e_txbind_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR;
+ i40e->i40e_txbind_lso_dma_attr.dma_attr_flags &=
+ ~DDI_DMA_FLAGERR;
}
}
static void
i40e_rcb_free(i40e_rx_data_t *rxd, i40e_rx_control_block_t *rcb)
*** 1100,1110 ****
}
/*
* This is the callback that we get from the OS when freemsg(9F) has been called
* on a loaned descriptor. In addition, if we take the last reference count
! * here, then we have to tear down all of the rx data.
*/
void
i40e_rx_recycle(caddr_t arg)
{
uint32_t ref;
--- 1179,1189 ----
}
/*
* This is the callback that we get from the OS when freemsg(9F) has been called
* on a loaned descriptor. In addition, if we take the last reference count
! * here, then we have to tear down all of the RX data.
*/
void
i40e_rx_recycle(caddr_t arg)
{
uint32_t ref;
*** 1766,1884 ****
/*
* Attempt to put togther the information we'll need to feed into a descriptor
* to properly program the hardware for checksum offload as well as the
* generally required flags.
*
! * The i40e_tx_context_t`itc_cmdflags contains the set of flags we need to or
! * into the descriptor based on the checksum flags for this mblk_t and the
* actual information we care about.
*/
static int
i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp,
! i40e_tx_context_t *tctx)
{
! int ret;
! uint32_t flags, start;
! mac_ether_offload_info_t meo;
i40e_txq_stat_t *txs = &itrq->itrq_txstat;
bzero(tctx, sizeof (i40e_tx_context_t));
if (i40e->i40e_tx_hcksum_enable != B_TRUE)
return (0);
! mac_hcksum_get(mp, &start, NULL, NULL, NULL, &flags);
! if (flags == 0)
return (0);
- if ((ret = mac_ether_offload_info(mp, &meo)) != 0) {
- txs->itxs_hck_meoifail.value.ui64++;
- return (ret);
- }
-
/*
* Have we been asked to checksum an IPv4 header. If so, verify that we
* have sufficient information and then set the proper fields in the
* command structure.
*/
! if (flags & HCK_IPV4_HDRCKSUM) {
! if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) {
txs->itxs_hck_nol2info.value.ui64++;
return (-1);
}
! if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) {
txs->itxs_hck_nol3info.value.ui64++;
return (-1);
}
! if (meo.meoi_l3proto != ETHERTYPE_IP) {
txs->itxs_hck_badl3.value.ui64++;
return (-1);
}
! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM;
! tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) <<
I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
! tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) <<
I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
}
/*
* We've been asked to provide an L4 header, first, set up the IP
* information in the descriptor if we haven't already before moving
* onto seeing if we have enough information for the L4 checksum
* offload.
*/
! if (flags & HCK_PARTIALCKSUM) {
! if ((meo.meoi_flags & MEOI_L4INFO_SET) == 0) {
txs->itxs_hck_nol4info.value.ui64++;
return (-1);
}
! if (!(flags & HCK_IPV4_HDRCKSUM)) {
! if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) {
txs->itxs_hck_nol2info.value.ui64++;
return (-1);
}
! if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) {
txs->itxs_hck_nol3info.value.ui64++;
return (-1);
}
! if (meo.meoi_l3proto == ETHERTYPE_IP) {
! tctx->itc_cmdflags |=
I40E_TX_DESC_CMD_IIPT_IPV4;
! } else if (meo.meoi_l3proto == ETHERTYPE_IPV6) {
! tctx->itc_cmdflags |=
I40E_TX_DESC_CMD_IIPT_IPV6;
} else {
txs->itxs_hck_badl3.value.ui64++;
return (-1);
}
! tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) <<
I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
! tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) <<
I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
}
! switch (meo.meoi_l4proto) {
case IPPROTO_TCP:
! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_TCP;
break;
case IPPROTO_UDP:
! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_UDP;
break;
case IPPROTO_SCTP:
! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_SCTP;
break;
default:
txs->itxs_hck_badl4.value.ui64++;
return (-1);
}
! tctx->itc_offsets |= (meo.meoi_l4hlen >> 2) <<
I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT;
}
return (0);
}
static void
i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb)
--- 1845,1981 ----
/*
* Attempt to put together the information we'll need to feed into a descriptor
* to properly program the hardware for checksum offload as well as the
* generally required flags.
*
! * The i40e_tx_context_t`itc_data_cmdflags contains the set of flags we need to
! * 'or' into the descriptor based on the checksum flags for this mblk_t and the
* actual information we care about.
+ *
+ * If the mblk requires LSO then we'll also gather the information that will be
+ * used to construct the Transmit Context Descriptor.
*/
static int
i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp,
! mac_ether_offload_info_t *meo, i40e_tx_context_t *tctx)
{
! uint32_t chkflags, start, mss, lsoflags;
i40e_txq_stat_t *txs = &itrq->itrq_txstat;
bzero(tctx, sizeof (i40e_tx_context_t));
if (i40e->i40e_tx_hcksum_enable != B_TRUE)
return (0);
! mac_hcksum_get(mp, &start, NULL, NULL, NULL, &chkflags);
! mac_lso_get(mp, &mss, &lsoflags);
!
! if (chkflags == 0 && lsoflags == 0)
return (0);
/*
* Have we been asked to checksum an IPv4 header. If so, verify that we
* have sufficient information and then set the proper fields in the
* command structure.
*/
! if (chkflags & HCK_IPV4_HDRCKSUM) {
! if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) {
txs->itxs_hck_nol2info.value.ui64++;
return (-1);
}
! if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) {
txs->itxs_hck_nol3info.value.ui64++;
return (-1);
}
! if (meo->meoi_l3proto != ETHERTYPE_IP) {
txs->itxs_hck_badl3.value.ui64++;
return (-1);
}
! tctx->itc_data_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM;
! tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) <<
I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
! tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) <<
I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
}
/*
* We've been asked to provide an L4 header, first, set up the IP
* information in the descriptor if we haven't already before moving
* onto seeing if we have enough information for the L4 checksum
* offload.
*/
! if (chkflags & HCK_PARTIALCKSUM) {
! if ((meo->meoi_flags & MEOI_L4INFO_SET) == 0) {
txs->itxs_hck_nol4info.value.ui64++;
return (-1);
}
! if (!(chkflags & HCK_IPV4_HDRCKSUM)) {
! if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) {
txs->itxs_hck_nol2info.value.ui64++;
return (-1);
}
! if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) {
txs->itxs_hck_nol3info.value.ui64++;
return (-1);
}
! if (meo->meoi_l3proto == ETHERTYPE_IP) {
! tctx->itc_data_cmdflags |=
I40E_TX_DESC_CMD_IIPT_IPV4;
! } else if (meo->meoi_l3proto == ETHERTYPE_IPV6) {
! tctx->itc_data_cmdflags |=
I40E_TX_DESC_CMD_IIPT_IPV6;
} else {
txs->itxs_hck_badl3.value.ui64++;
return (-1);
}
! tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) <<
I40E_TX_DESC_LENGTH_MACLEN_SHIFT;
! tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) <<
I40E_TX_DESC_LENGTH_IPLEN_SHIFT;
}
! switch (meo->meoi_l4proto) {
case IPPROTO_TCP:
! tctx->itc_data_cmdflags |=
! I40E_TX_DESC_CMD_L4T_EOFT_TCP;
break;
case IPPROTO_UDP:
! tctx->itc_data_cmdflags |=
! I40E_TX_DESC_CMD_L4T_EOFT_UDP;
break;
case IPPROTO_SCTP:
! tctx->itc_data_cmdflags |=
! I40E_TX_DESC_CMD_L4T_EOFT_SCTP;
break;
default:
txs->itxs_hck_badl4.value.ui64++;
return (-1);
}
! tctx->itc_data_offsets |= (meo->meoi_l4hlen >> 2) <<
I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT;
}
+ if (lsoflags & HW_LSO) {
+ /*
+ * LSO requires that checksum offloads are enabled. If for
+ * some reason they're not we bail out with an error.
+ */
+ if ((chkflags & HCK_IPV4_HDRCKSUM) == 0 ||
+ (chkflags & HCK_PARTIALCKSUM) == 0) {
+ txs->itxs_lso_nohck.value.ui64++;
+ return (-1);
+ }
+
+ tctx->itc_ctx_cmdflags |= I40E_TX_CTX_DESC_TSO;
+ tctx->itc_ctx_mss = mss;
+ tctx->itc_ctx_tsolen = msgsize(mp) -
+ (meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen);
+ }
+
return (0);
}
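A worked example of the TSO length computed above: the context descriptor carries only the payload length, i.e. the full message size minus the L2 + L3 + L4 headers that are replicated into every segment. The numbers below are illustrative, not taken from the driver:

#include <stdio.h>

int
main(void)
{
	unsigned msg_size = 65226;		/* total mblk chain length */
	unsigned l2 = 14, l3 = 20, l4 = 20;	/* Ethernet + IPv4 + TCP */
	unsigned mss = 1460;

	unsigned tsolen = msg_size - (l2 + l3 + l4);
	unsigned nsegs = (tsolen + mss - 1) / mss;	/* segments produced */

	printf("tsolen = %u bytes, %u segments of at most %u bytes each\n",
	    tsolen, nsegs, mss);
	return (0);
}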
static void
i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb)
*** 1923,1944 ****
--- 2020,2056 ----
switch (tcb->tcb_type) {
case I40E_TX_COPY:
tcb->tcb_dma.dmab_len = 0;
break;
case I40E_TX_DMA:
+ if (tcb->tcb_used_lso == B_TRUE && tcb->tcb_bind_ncookies > 0)
+ (void) ddi_dma_unbind_handle(tcb->tcb_lso_dma_handle);
+ else if (tcb->tcb_bind_ncookies > 0)
(void) ddi_dma_unbind_handle(tcb->tcb_dma_handle);
+ if (tcb->tcb_bind_info != NULL) {
+ kmem_free(tcb->tcb_bind_info,
+ tcb->tcb_bind_ncookies *
+ sizeof (struct i40e_dma_bind_info));
+ }
+ tcb->tcb_bind_info = NULL;
+ tcb->tcb_bind_ncookies = 0;
+ tcb->tcb_used_lso = B_FALSE;
break;
+ case I40E_TX_DESC:
+ break;
case I40E_TX_NONE:
/* Cast to pacify lint */
panic("trying to free tcb %p with bad type none", (void *)tcb);
default:
panic("unknown i40e tcb type: %d", tcb->tcb_type);
}
tcb->tcb_type = I40E_TX_NONE;
+ if (tcb->tcb_mp != NULL) {
freemsg(tcb->tcb_mp);
tcb->tcb_mp = NULL;
+ }
tcb->tcb_next = NULL;
}
/*
* This is called as part of shutting down to clean up all outstanding
*** 1967,1980 ****
index = itrq->itrq_desc_head;
while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) {
i40e_tx_control_block_t *tcb;
tcb = itrq->itrq_tcb_work_list[index];
! VERIFY(tcb != NULL);
itrq->itrq_tcb_work_list[index] = NULL;
i40e_tcb_reset(tcb);
i40e_tcb_free(itrq, tcb);
bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t));
index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size);
itrq->itrq_desc_free++;
}
--- 2079,2093 ----
index = itrq->itrq_desc_head;
while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) {
i40e_tx_control_block_t *tcb;
tcb = itrq->itrq_tcb_work_list[index];
! if (tcb != NULL) {
itrq->itrq_tcb_work_list[index] = NULL;
i40e_tcb_reset(tcb);
i40e_tcb_free(itrq, tcb);
+ }
bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t));
index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size);
itrq->itrq_desc_free++;
}
*** 1993,2002 ****
--- 2106,2116 ----
i40e_tx_recycle_ring(i40e_trqpair_t *itrq)
{
uint32_t wbhead, toclean, count;
i40e_tx_control_block_t *tcbhead;
i40e_t *i40e = itrq->itrq_i40e;
+ uint_t desc_per_tcb, i;
mutex_enter(&itrq->itrq_tx_lock);
ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
if (itrq->itrq_desc_free == itrq->itrq_tx_ring_size) {
*** 2040,2055 ****
ASSERT(tcb != NULL);
tcb->tcb_next = tcbhead;
tcbhead = tcb;
/*
* We zero this out for sanity purposes.
*/
! bzero(&itrq->itrq_desc_ring[toclean], sizeof (i40e_tx_desc_t));
! toclean = i40e_next_desc(toclean, 1, itrq->itrq_tx_ring_size);
count++;
}
itrq->itrq_desc_head = wbhead;
itrq->itrq_desc_free += count;
itrq->itrq_txstat.itxs_recycled.value.ui64 += count;
ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
--- 2154,2185 ----
ASSERT(tcb != NULL);
tcb->tcb_next = tcbhead;
tcbhead = tcb;
/*
+ * In the DMA bind case, there may not necessarily be a 1:1
+ * mapping between tcb's and descriptors. If the tcb type
+ * indicates a DMA binding then check the number of DMA
+ * cookies to determine how many entries to clean in the
+ * descriptor ring.
+ */
+ if (tcb->tcb_type == I40E_TX_DMA)
+ desc_per_tcb = tcb->tcb_bind_ncookies;
+ else
+ desc_per_tcb = 1;
+
+ for (i = 0; i < desc_per_tcb; i++) {
+ /*
* We zero this out for sanity purposes.
*/
! bzero(&itrq->itrq_desc_ring[toclean],
! sizeof (i40e_tx_desc_t));
! toclean = i40e_next_desc(toclean, 1,
! itrq->itrq_tx_ring_size);
count++;
}
+ }
itrq->itrq_desc_head = wbhead;
itrq->itrq_desc_free += count;
itrq->itrq_txstat.itxs_recycled.value.ui64 += count;
ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
*** 2076,2089 ****
}
DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count);
}
/*
* We've been asked to send a message block on the wire. We'll only have a
* single chain. There will not be any b_next pointers; however, there may be
! * multiple b_cont blocks.
*
* We may do one of three things with any given mblk_t chain:
*
* 1) Drop it
* 2) Transmit it
--- 2206,2793 ----
}
DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count);
}
+ static void
+ i40e_tx_copy_fragment(i40e_tx_control_block_t *tcb, const mblk_t *mp,
+ const size_t off, const size_t len)
+ {
+ const void *soff = mp->b_rptr + off;
+ void *doff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len;
+
+ ASSERT3U(len, >, 0);
+ ASSERT3P(soff, >=, mp->b_rptr);
+ ASSERT3P(soff, <=, mp->b_wptr);
+ ASSERT3U(len, <=, MBLKL(mp));
+ ASSERT3U((uintptr_t)soff + len, <=, (uintptr_t)mp->b_wptr);
+ ASSERT3U(tcb->tcb_dma.dmab_size - tcb->tcb_dma.dmab_len, >=, len);
+ bcopy(soff, doff, len);
+ tcb->tcb_type = I40E_TX_COPY;
+ tcb->tcb_dma.dmab_len += len;
+ I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV);
+ }
+
+ static i40e_tx_control_block_t *
+ i40e_tx_bind_fragment(i40e_trqpair_t *itrq, const mblk_t *mp,
+ size_t off, boolean_t use_lso)
+ {
+ ddi_dma_handle_t dma_handle;
+ ddi_dma_cookie_t dma_cookie;
+ uint_t i = 0, ncookies = 0, dmaflags;
+ i40e_tx_control_block_t *tcb;
+ i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+
+ if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+ txs->itxs_err_notcb.value.ui64++;
+ return (NULL);
+ }
+ tcb->tcb_type = I40E_TX_DMA;
+
+ if (use_lso == B_TRUE)
+ dma_handle = tcb->tcb_lso_dma_handle;
+ else
+ dma_handle = tcb->tcb_dma_handle;
+
+ dmaflags = DDI_DMA_WRITE | DDI_DMA_STREAMING;
+ if (ddi_dma_addr_bind_handle(dma_handle, NULL,
+ (caddr_t)(mp->b_rptr + off), MBLKL(mp) - off, dmaflags,
+ DDI_DMA_DONTWAIT, NULL, &dma_cookie, &ncookies) != DDI_DMA_MAPPED) {
+ txs->itxs_bind_fails.value.ui64++;
+ goto bffail;
+ }
+
+ tcb->tcb_bind_ncookies = ncookies;
+ tcb->tcb_used_lso = use_lso;
+
+ tcb->tcb_bind_info =
+ kmem_zalloc(ncookies * sizeof (struct i40e_dma_bind_info),
+ KM_NOSLEEP);
+ if (tcb->tcb_bind_info == NULL)
+ goto bffail;
+
+ while (i < ncookies) {
+ if (i > 0)
+ ddi_dma_nextcookie(dma_handle, &dma_cookie);
+
+ tcb->tcb_bind_info[i].dbi_paddr =
+ (caddr_t)dma_cookie.dmac_laddress;
+ tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size;
+ }
+
+ return (tcb);
+
+ bffail:
+ i40e_tcb_reset(tcb);
+ i40e_tcb_free(itrq, tcb);
+ return (NULL);
+ }
+
+ static void
+ i40e_tx_set_data_desc(i40e_trqpair_t *itrq, i40e_tx_context_t *tctx,
+ caddr_t buff, size_t len, boolean_t last_desc)
+ {
+ i40e_tx_desc_t *txdesc;
+ int cmd;
+
+ ASSERT(MUTEX_HELD(&itrq->itrq_tx_lock));
+ itrq->itrq_desc_free--;
+ txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail];
+ itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1,
+ itrq->itrq_tx_ring_size);
+
+ cmd = I40E_TX_DESC_CMD_ICRC | tctx->itc_data_cmdflags;
+
+ /*
+ * The last data descriptor needs the EOP bit set, so that the HW knows
+ * that we're ready to send. Additionally, we set the RS (Report
+ * Status) bit, so that we are notified when the transmit engine has
+ * completed DMA'ing all of the data descriptors and data buffers
+ * associated with this frame.
+ */
+ if (last_desc == B_TRUE) {
+ cmd |= I40E_TX_DESC_CMD_EOP;
+ cmd |= I40E_TX_DESC_CMD_RS;
+ }
+
+ /*
+ * Per the X710 manual, section 8.4.2.1.1, the buffer size
+ * must be a value from 1 to 16K minus 1, inclusive.
+ */
+ ASSERT3U(len, >=, 1);
+ ASSERT3U(len, <=, I40E_MAX_TX_BUFSZ - 1);
+
+ txdesc->buffer_addr = CPU_TO_LE64((uintptr_t)buff);
+ txdesc->cmd_type_offset_bsz =
+ LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA |
+ ((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
+ ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
+ ((uint64_t)len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
+ }
+
/*
+ * Place 'tcb' on the tail of the list represented by 'head'/'tail'.
+ */
+ static inline void
+ tcb_list_append(i40e_tx_control_block_t **head, i40e_tx_control_block_t **tail,
+ i40e_tx_control_block_t *tcb)
+ {
+ if (*head == NULL) {
+ *head = tcb;
+ *tail = *head;
+ } else {
+ ASSERT3P(*tail, !=, NULL);
+ ASSERT3P((*tail)->tcb_next, ==, NULL);
+ (*tail)->tcb_next = tcb;
+ *tail = tcb;
+ }
+ }
+
+ /*
+ * This function takes a single packet, possibly consisting of
+ * multiple mblks, and creates a TCB chain to send to the controller.
+ * This TCB chain may span up to a maximum of 8 descriptors. A copy
+ * TCB consumes one descriptor; whereas a DMA TCB may consume 1 or
+ * more, depending on several factors. For each fragment (invidual
+ * mblk making up the packet), we determine if its size dictates a
+ * copy to the TCB buffer or a DMA bind of the dblk buffer. We keep a
+ * count of descriptors used; when that count reaches the max we force
+ * all remaining fragments into a single TCB buffer. We have a
+ * guarantee that the TCB buffer is always larger than the MTU -- so
+ * there is always enough room. Consecutive fragments below the DMA
+ * threshold are copied into a single TCB. In the event of an error
+ * this function returns NULL but leaves 'mp' alone.
+ */
+ static i40e_tx_control_block_t *
+ i40e_non_lso_chain(i40e_trqpair_t *itrq, mblk_t *mp, uint_t *ndesc)
+ {
+ const mblk_t *nmp = mp;
+ uint_t needed_desc = 0;
+ boolean_t force_copy = B_FALSE;
+ i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL;
+ i40e_t *i40e = itrq->itrq_i40e;
+ i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+
+ /* TCB buffer is always larger than MTU. */
+ ASSERT3U(msgsize(mp), <, i40e->i40e_tx_buf_size);
+
+ while (nmp != NULL) {
+ const size_t nmp_len = MBLKL(nmp);
+
+ /* Ignore zero-length mblks. */
+ if (nmp_len == 0) {
+ nmp = nmp->b_cont;
+ continue;
+ }
+
+ if (nmp_len < i40e->i40e_tx_dma_min || force_copy) {
+ /* Compress consecutive copies into one TCB. */
+ if (tcb != NULL && tcb->tcb_type == I40E_TX_COPY) {
+ i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len);
+ nmp = nmp->b_cont;
+ continue;
+ }
+
+ if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+ txs->itxs_err_notcb.value.ui64++;
+ goto fail;
+ }
+
+ /*
+ * TCB DMA buffer is guaranteed to be one
+ * cookie by i40e_alloc_dma_buffer().
+ */
+ i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len);
+ needed_desc++;
+ tcb_list_append(&tcbhead, &tcbtail, tcb);
+ } else {
+ uint_t total_desc;
+
+ tcb = i40e_tx_bind_fragment(itrq, nmp, 0, B_FALSE);
+ if (tcb == NULL) {
+ i40e_error(i40e, "dma bind failed!");
+ goto fail;
+ }
+
+ /*
+ * If the new total exceeds the max or we've
+ * reached the limit and there's data left,
+ * then give up binding and copy the rest into
+ * the pre-allocated TCB buffer.
+ */
+ total_desc = needed_desc + tcb->tcb_bind_ncookies;
+ if ((total_desc > I40E_TX_MAX_COOKIE) ||
+ (total_desc == I40E_TX_MAX_COOKIE &&
+ nmp->b_cont != NULL)) {
+ i40e_tcb_reset(tcb);
+ i40e_tcb_free(itrq, tcb);
+
+ if (tcbtail != NULL &&
+ tcbtail->tcb_type == I40E_TX_COPY) {
+ tcb = tcbtail;
+ } else {
+ tcb = NULL;
+ }
+
+ force_copy = B_TRUE;
+ txs->itxs_force_copy.value.ui64++;
+ continue;
+ }
+
+ needed_desc += tcb->tcb_bind_ncookies;
+ tcb_list_append(&tcbhead, &tcbtail, tcb);
+ }
+
+ nmp = nmp->b_cont;
+ }
+
+ ASSERT3P(nmp, ==, NULL);
+ ASSERT3U(needed_desc, <=, I40E_TX_MAX_COOKIE);
+ ASSERT3P(tcbhead, !=, NULL);
+ *ndesc += needed_desc;
+ return (tcbhead);
+
+ fail:
+ tcb = tcbhead;
+ while (tcb != NULL) {
+ i40e_tx_control_block_t *next = tcb->tcb_next;
+
+ ASSERT(tcb->tcb_type == I40E_TX_DMA ||
+ tcb->tcb_type == I40E_TX_COPY);
+
+ tcb->tcb_mp = NULL;
+ i40e_tcb_reset(tcb);
+ i40e_tcb_free(itrq, tcb);
+ tcb = next;
+ }
+
+ return (NULL);
+ }
+
+ /*
+ * Section 8.4.1 of the 700-series programming guide states that a
+ * segment may span up to 8 data descriptors; including both header
+ * and payload data. However, empirical evidence shows that the
+ * controller freezes the Tx queue when presented with a segment of 8
+ * descriptors. Or, at least, when the first segment contains 8
+ * descriptors. One explanation is that the controller counts the
+ * context descriptor against the first segment, even though the
+ * programming guide makes no mention of such a constraint. In any
+ * case, we limit TSO segments to 7 descriptors to prevent Tx queue
+ * freezes. We still allow non-TSO segments to utilize all 8
+ * descriptors as they have not demonstrated the faulty behavior.
+ */
+ uint_t i40e_lso_num_descs = 7;
+
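The per-segment descriptor limit above (and the controller's byte tally described in the i40e_lso_chain comment below) can be checked with a simulation along these lines. This is only a sketch: it ignores the header accounting for brevity and the names are illustrative:

#include <stdbool.h>
#include <stddef.h>

/*
 * Walk the data descriptors the way the controller does: tally payload
 * bytes, starting a new segment each time the tally reaches MSS
 * (possibly in the middle of a descriptor). Return false if any segment
 * would span more than max_descs descriptors.
 */
bool
lso_layout_ok(const size_t *desc_len, size_t ndesc, size_t mss,
    unsigned max_descs)
{
	size_t seg_bytes = 0;
	unsigned seg_descs = 0;

	if (mss == 0)
		return (false);

	for (size_t i = 0; i < ndesc; i++) {
		size_t left = desc_len[i];

		/* This descriptor contributes to the current segment. */
		if (++seg_descs > max_descs)
			return (false);

		/* Segments may complete, and restart, mid-descriptor. */
		while (seg_bytes + left >= mss) {
			left -= mss - seg_bytes;
			seg_bytes = 0;
			seg_descs = (left > 0) ? 1 : 0;
		}
		seg_bytes += left;
	}
	return (true);
}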
+ #define I40E_TCB_LEFT(tcb) \
+ ((tcb)->tcb_dma.dmab_size - (tcb)->tcb_dma.dmab_len)
+
+ /*
+ * This function is similar in spirit to i40e_non_lso_chain(), but
+ * much more complicated in reality. Like the previous function, it
+ * takes a packet (an LSO packet) as input and returns a chain of
+ * TCBs. The complication comes with the fact that we are no longer
+ * trying to fit the entire packet into 8 descriptors, but rather we
+ * must fit each MSS-size segment of the LSO packet into 8 descriptors.
+ * Except it's really 7 descriptors, see i40e_lso_num_descs.
+ *
+ * Your first inclination might be to verify that a given segment
+ * spans no more than 7 mblks, but it's actually much more subtle than
+ * that. First, let's describe what the hardware expects, and then we
+ * can expound on the software side of things.
+ *
+ * For an LSO packet the hardware expects the following:
+ *
+ * o Each MSS-sized segment must span no more than 7 descriptors.
+ *
+ * o The header size does not count towards the segment size.
+ *
+ * o If header and payload share the first descriptor, then the
+ * controller will count the descriptor twice.
+ *
+ * The most important thing to keep in mind is that the hardware does
+ * not view the segments in terms of mblks, like we do. The hardware
+ * only sees descriptors. It will iterate each descriptor in turn,
+ * keeping a tally of bytes seen and descriptors visited. If the byte
+ * count hasn't reached MSS by the time the descriptor count reaches
+ * 7, then the controller freezes the queue and we are stuck.
+ * Furthermore, the hardware picks up its tally where it left off. So
+ * if it reached MSS in the middle of a descriptor, it will start
+ * tallying the next segment in the middle of that descriptor. The
+ * hardware's view is entirely removed from the mblk chain or even the
+ * descriptor layout. Consider these facts:
+ *
+ * o The MSS will vary depending on MTU and other factors.
+ *
+ * o The dblk allocation will sit at various offsets within a
+ * memory page.
+ *
+ * o The page size itself could vary in the future (i.e. not
+ * always 4K).
+ *
+ * o Just because a dblk is virtually contiguous doesn't mean
+ * it's physically contiguous. The number of cookies
+ * (descriptors) required by a DMA bind of a single dblk is at
+ * the mercy of the page size and physical layout.
+ *
+ * o The descriptors will most often NOT start/end on an MSS
+ * boundary. Thus the hardware will often start counting the
+ * MSS mid descriptor and finish mid descriptor.
+ *
+ * The upshot of all this is that the driver must learn to think like
+ * the controller, and verify that none of the constraints are broken.
+ * It does this by tallying up the segment just like the hardware
+ * would. This is handled by the two variables 'segsz' and 'segdesc'.
+ * After each attempt to bind a dblk, we check the constraints. If
+ * violated, we undo the DMA and force a copy until MSS is met. We
+ * have a guarantee that the TCB buffer is larger than MTU; thus
+ * ensuring we can always meet the MSS with a single copy buffer. We
+ * also copy consecutive non-DMA fragments into the same TCB buffer.
+ */
+ static i40e_tx_control_block_t *
+ i40e_lso_chain(i40e_trqpair_t *itrq, const mblk_t *mp,
+ const mac_ether_offload_info_t *meo, const i40e_tx_context_t *tctx,
+ uint_t *ndesc)
+ {
+ size_t mp_len = MBLKL(mp);
+ /*
+ * The cpoff (copy offset) variable tracks the offset inside
+ * the current mp. There are cases where the entire mp is not
+ * fully copied in one go, such as the header copy followed by
+ * a non-DMA mblk, or a TCB buffer that only has enough space
+ * to copy part of the current mp.
+ */
+ size_t cpoff = 0;
+ /*
+ * The segsz and segdesc variables track the controller's view
+ * of the segment. The needed_desc variable tracks the total
+ * number of data descriptors used by the driver.
+ */
+ size_t segsz = 0;
+ uint_t segdesc = 0;
+ uint_t needed_desc = 0;
+ const size_t hdrlen =
+ meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen;
+ const size_t mss = tctx->itc_ctx_mss;
+ boolean_t force_copy = B_FALSE;
+ i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL;
+ i40e_t *i40e = itrq->itrq_i40e;
+ i40e_txq_stat_t *txs = &itrq->itrq_txstat;
+
+ /*
+ * We always copy the header in order to avoid more
+ * complicated code dealing with various edge cases.
+ */
+ ASSERT3U(MBLKL(mp), >=, hdrlen);
+ if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+ txs->itxs_err_notcb.value.ui64++;
+ goto fail;
+ }
+ needed_desc++;
+
+ tcb_list_append(&tcbhead, &tcbtail, tcb);
+ i40e_tx_copy_fragment(tcb, mp, 0, hdrlen);
+ cpoff += hdrlen;
+
+ /*
+ * A single descriptor containing both header and data is
+ * counted twice by the controller.
+ */
+ if ((mp_len > hdrlen && mp_len < i40e->i40e_tx_dma_min) ||
+ (mp->b_cont != NULL &&
+ MBLKL(mp->b_cont) < i40e->i40e_tx_dma_min)) {
+ segdesc = 2;
+ } else {
+ segdesc = 1;
+ }
+
+ /* If this fragment was pure header, then move to the next one. */
+ if (cpoff == mp_len) {
+ mp = mp->b_cont;
+ cpoff = 0;
+ }
+
+ while (mp != NULL) {
+ mp_len = MBLKL(mp);
+ force_copy:
+ /* Ignore zero-length mblks. */
+ if (mp_len == 0) {
+ mp = mp->b_cont;
+ cpoff = 0;
+ continue;
+ }
+
+ /*
+ * We copy into the preallocated TCB buffer when the
+ * current fragment is less than the DMA threshold OR
+ * when the DMA bind can't meet the controller's
+ * segment descriptor limit.
+ */
+ if (mp_len < i40e->i40e_tx_dma_min || force_copy) {
+ size_t tocopy;
+
+ /*
+ * Our objective here is to compress
+ * consecutive copies into one TCB (until it
+ * is full). If there is no current TCB, or if
+ * it is a DMA TCB, then allocate a new one.
+ */
+ if (tcb == NULL ||
+ (tcb != NULL && tcb->tcb_type != I40E_TX_COPY)) {
+ if ((tcb = i40e_tcb_alloc(itrq)) == NULL) {
+ txs->itxs_err_notcb.value.ui64++;
+ goto fail;
+ }
+
+ /*
+ * The TCB DMA buffer is guaranteed to
+ * be one cookie by i40e_alloc_dma_buffer().
+ */
+ needed_desc++;
+ segdesc++;
+ ASSERT3U(segdesc, <=, i40e_lso_num_descs);
+ tcb_list_append(&tcbhead, &tcbtail, tcb);
+ }
+
+ tocopy = MIN(I40E_TCB_LEFT(tcb), mp_len - cpoff);
+ i40e_tx_copy_fragment(tcb, mp, cpoff, tocopy);
+ cpoff += tocopy;
+ segsz += tocopy;
+
+ /* We have consumed the current mp. */
+ if (cpoff == mp_len) {
+ mp = mp->b_cont;
+ cpoff = 0;
+ }
+
+ /* We have consumed the current TCB buffer. */
+ if (I40E_TCB_LEFT(tcb) == 0) {
+ tcb = NULL;
+ }
+
+ /*
+ * We have met MSS with this copy; restart the
+ * counters.
+ */
+ if (segsz >= mss) {
+ segsz = segsz % mss;
+ segdesc = segsz == 0 ? 0 : 1;
+ force_copy = B_FALSE;
+ }
+
+ /*
+ * We are at the controller's descriptor
+ * limit; we must copy into the current TCB
+ * until MSS is reached. The TCB buffer is
+ * always bigger than the MTU so we know it is
+ * big enough to meet the MSS.
+ */
+ if (segdesc == i40e_lso_num_descs) {
+ force_copy = B_TRUE;
+ }
+ } else {
+ uint_t tsegdesc = segdesc;
+ size_t tsegsz = segsz;
+
+ ASSERT(force_copy == B_FALSE);
+ ASSERT3U(tsegdesc, <, i40e_lso_num_descs);
+
+ tcb = i40e_tx_bind_fragment(itrq, mp, cpoff, B_TRUE);
+ if (tcb == NULL) {
+ i40e_error(i40e, "dma bind failed!");
+ goto fail;
+ }
+
+ for (uint_t i = 0; i < tcb->tcb_bind_ncookies; i++) {
+ struct i40e_dma_bind_info dbi =
+ tcb->tcb_bind_info[i];
+
+ tsegsz += dbi.dbi_len;
+ tsegdesc++;
+ ASSERT3U(tsegdesc, <=, i40e_lso_num_descs);
+
+ /*
+ * We've met the MSS with this portion
+ * of the DMA.
+ */
+ if (tsegsz >= mss) {
+ tsegdesc = 1;
+ tsegsz = tsegsz % mss;
+ }
+
+ /*
+ * We've reached max descriptors but
+ * have not met the MSS. Undo the bind
+ * and instead copy.
+ */
+ if (tsegdesc == i40e_lso_num_descs) {
+ i40e_tcb_reset(tcb);
+ i40e_tcb_free(itrq, tcb);
+
+ if (tcbtail != NULL &&
+ I40E_TCB_LEFT(tcbtail) > 0 &&
+ tcbtail->tcb_type == I40E_TX_COPY) {
+ tcb = tcbtail;
+ } else {
+ tcb = NULL;
+ }
+
+ /*
+ * Remember, we are still on
+ * the same mp.
+ */
+ force_copy = B_TRUE;
+ txs->itxs_tso_force_copy.value.ui64++;
+ goto force_copy;
+ }
+ }
+
+ ASSERT3U(tsegdesc, <=, i40e_lso_num_descs);
+ ASSERT3U(tsegsz, <, mss);
+
+ /*
+ * We've made it through the loop without
+ * breaking the segment descriptor contract
+ * with the controller -- replace the segment
+ * tracking values with the temporary ones.
+ */
+ segdesc = tsegdesc;
+ segsz = tsegsz;
+ needed_desc += tcb->tcb_bind_ncookies;
+ cpoff = 0;
+ tcb_list_append(&tcbhead, &tcbtail, tcb);
+ mp = mp->b_cont;
+ }
+ }
+
+ ASSERT3P(mp, ==, NULL);
+ ASSERT3P(tcbhead, !=, NULL);
+ *ndesc += needed_desc;
+ return (tcbhead);
+
+ fail:
+ tcb = tcbhead;
+ while (tcb != NULL) {
+ i40e_tx_control_block_t *next = tcb->tcb_next;
+
+ ASSERT(tcb->tcb_type == I40E_TX_DMA ||
+ tcb->tcb_type == I40E_TX_COPY);
+
+ tcb->tcb_mp = NULL;
+ i40e_tcb_reset(tcb);
+ i40e_tcb_free(itrq, tcb);
+ tcb = next;
+ }
+
+ return (NULL);
+ }
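
The controller's tallying rule described in the block comment above i40e_lso_chain() can be modeled in isolation. The sketch below is hypothetical (the ex_* name is not part of the driver): it walks a list of data-descriptor byte counts the way the hardware does, carrying the remainder over mid-descriptor at each MSS boundary, and reports the freeze condition of exhausting the per-segment descriptor budget before MSS is met. (The double counting of a descriptor shared by header and payload, noted above, is not modeled here.)

#include <sys/types.h>

/*
 * Hypothetical model of the per-segment tally: 'lens' holds the byte
 * count of each data descriptor in ring order, 'mss' is the segment
 * size, and 'maxdesc' is the per-segment descriptor budget (7 in the
 * driver, per i40e_lso_num_descs).  Returns B_FALSE if any segment
 * would need more descriptors than the budget allows.
 */
static boolean_t
ex_segments_fit(const size_t *lens, uint_t nlens, size_t mss, uint_t maxdesc)
{
        size_t segsz = 0;
        uint_t segdesc = 0;

        for (uint_t i = 0; i < nlens; i++) {
                segdesc++;
                segsz += lens[i];

                if (segsz >= mss) {
                        /* MSS met; the tally carries over mid-descriptor. */
                        segsz %= mss;
                        segdesc = (segsz == 0) ? 0 : 1;
                } else if (segdesc == maxdesc) {
                        /* Descriptor budget exhausted before MSS. */
                        return (B_FALSE);
                }
        }

        return (B_TRUE);
}

For instance, with mss = 1460 and maxdesc = 7, seven consecutive 100-byte descriptors fail (700 bytes seen when the budget runs out), which is exactly the situation the copy fallback in i40e_lso_chain() exists to avoid.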
+
+ /*
* We've been asked to send a message block on the wire. We'll only have a
* single chain. There will not be any b_next pointers; however, there may be
! * multiple b_cont blocks. The number of b_cont blocks may exceed the
! * controller's Tx descriptor limit.
*
* We may do one of three things with any given mblk_t chain:
*
* 1) Drop it
* 2) Transmit it
*** 2094,2109 ****
* something.
*/
mblk_t *
i40e_ring_tx(void *arg, mblk_t *mp)
{
! const mblk_t *nmp;
! size_t mpsize;
! i40e_tx_control_block_t *tcb;
! i40e_tx_desc_t *txdesc;
i40e_tx_context_t tctx;
! int cmd, type;
i40e_trqpair_t *itrq = arg;
i40e_t *i40e = itrq->itrq_i40e;
i40e_hw_t *hw = &i40e->i40e_hw_space;
i40e_txq_stat_t *txs = &itrq->itrq_txstat;
--- 2798,2815 ----
* something.
*/
mblk_t *
i40e_ring_tx(void *arg, mblk_t *mp)
{
! size_t msglen;
! i40e_tx_control_block_t *tcb_ctx = NULL, *tcb = NULL, *tcbhead = NULL;
! i40e_tx_context_desc_t *ctxdesc;
! mac_ether_offload_info_t meo;
i40e_tx_context_t tctx;
! int type;
! uint_t needed_desc = 0;
! boolean_t do_ctx_desc = B_FALSE, use_lso = B_FALSE;
i40e_trqpair_t *itrq = arg;
i40e_t *i40e = itrq->itrq_i40e;
i40e_hw_t *hw = &i40e->i40e_hw_space;
i40e_txq_stat_t *txs = &itrq->itrq_txstat;
*** 2117,2235 ****
(i40e->i40e_link_state != LINK_STATE_UP)) {
freemsg(mp);
return (NULL);
}
/*
* Figure out the relevant context about this frame that we might need
! * for enabling checksum, lso, etc. This also fills in information that
* we might set around the packet type, etc.
*/
! if (i40e_tx_context(i40e, itrq, mp, &tctx) < 0) {
freemsg(mp);
itrq->itrq_txstat.itxs_err_context.value.ui64++;
return (NULL);
}
/*
* For the primordial driver we can punt on doing any recycling right
* now; however, longer term we need to probably do some more pro-active
! * recycling to cut back on stalls in the tx path.
*/
! /*
! * Do a quick size check to make sure it fits into what we think it
! * should for this device. Note that longer term this will be false,
! * particularly when we have the world of TSO.
! */
! mpsize = 0;
! for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) {
! mpsize += MBLKL(nmp);
! }
/*
! * First we allocate our tx control block and prepare the packet for
! * transmit before we do a final check for descriptors. We do it this
! * way to minimize the time under the tx lock.
*/
! tcb = i40e_tcb_alloc(itrq);
! if (tcb == NULL) {
txs->itxs_err_notcb.value.ui64++;
goto txfail;
}
! /*
! * For transmitting a block, we're currently going to use just a
! * single control block and bcopy all of the fragments into it. We
! * should be more intelligent about doing DMA binding or otherwise, but
! * for getting off the ground this will have to do.
! */
! ASSERT(tcb->tcb_dma.dmab_len == 0);
! ASSERT(tcb->tcb_dma.dmab_size >= mpsize);
! for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) {
! size_t clen = MBLKL(nmp);
! void *coff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len;
!
! bcopy(nmp->b_rptr, coff, clen);
! tcb->tcb_dma.dmab_len += clen;
}
- ASSERT(tcb->tcb_dma.dmab_len == mpsize);
/*
! * While there's really no need to keep the mp here, but let's just do
! * it to help with our own debugging for now.
*/
- tcb->tcb_mp = mp;
- tcb->tcb_type = I40E_TX_COPY;
- I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV);
-
mutex_enter(&itrq->itrq_tx_lock);
! if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh) {
txs->itxs_err_nodescs.value.ui64++;
mutex_exit(&itrq->itrq_tx_lock);
goto txfail;
}
/*
! * Build up the descriptor and send it out. Thankfully at the moment
! * we only need a single desc, because we're not doing anything fancy
! * yet.
*/
! ASSERT(itrq->itrq_desc_free > 0);
itrq->itrq_desc_free--;
! txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail];
! itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb;
! itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1,
itrq->itrq_tx_ring_size);
! /*
! * Note, we always set EOP and RS which indicates that this is the last
! * data frame and that we should ask for it to be transmitted. We also
! * must always set ICRC, because that is an internal bit that must be
! * set to one for data descriptors. The remaining bits in the command
! * descriptor depend on checksumming and are determined based on the
! * information set up in i40e_tx_context().
! */
! type = I40E_TX_DESC_DTYPE_DATA;
! cmd = I40E_TX_DESC_CMD_EOP |
! I40E_TX_DESC_CMD_RS |
! I40E_TX_DESC_CMD_ICRC |
! tctx.itc_cmdflags;
! txdesc->buffer_addr =
! CPU_TO_LE64((uintptr_t)tcb->tcb_dma.dmab_dma_address);
! txdesc->cmd_type_offset_bsz = CPU_TO_LE64(((uint64_t)type |
! ((uint64_t)tctx.itc_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
! ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
! ((uint64_t)tcb->tcb_dma.dmab_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
/*
* Now, finally, sync the DMA data and alert hardware.
*/
I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV);
I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index),
itrq->itrq_desc_tail);
if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) !=
DDI_FM_OK) {
/*
* Note, we can't really go through and clean this up very well,
* because the memory has been given to the device, so just
--- 2823,2972 ----
(i40e->i40e_link_state != LINK_STATE_UP)) {
freemsg(mp);
return (NULL);
}
+ if (mac_ether_offload_info(mp, &meo) != 0) {
+ freemsg(mp);
+ itrq->itrq_txstat.itxs_hck_meoifail.value.ui64++;
+ return (NULL);
+ }
+
/*
* Figure out the relevant context about this frame that we might need
! * for enabling checksum, LSO, etc. This also fills in information that
* we might set around the packet type, etc.
*/
! if (i40e_tx_context(i40e, itrq, mp, &meo, &tctx) < 0) {
freemsg(mp);
itrq->itrq_txstat.itxs_err_context.value.ui64++;
return (NULL);
}
+ if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) {
+ use_lso = B_TRUE;
+ do_ctx_desc = B_TRUE;
+ }
/*
* For the primordial driver we can punt on doing any recycling right
* now; however, longer term we need to probably do some more pro-active
! * recycling to cut back on stalls in the TX path.
*/
! msglen = msgsize(mp);
+ if (do_ctx_desc) {
/*
! * If we're doing tunneling or LSO, then we'll need a TX
! * context descriptor in addition to one or more TX data
! * descriptors. Since there's no data DMA block or handle
! * associated with the context descriptor, we create a special
! * control block that behaves effectively like a NOP.
*/
! if ((tcb_ctx = i40e_tcb_alloc(itrq)) == NULL) {
txs->itxs_err_notcb.value.ui64++;
goto txfail;
}
+ tcb_ctx->tcb_type = I40E_TX_DESC;
+ needed_desc++;
+ }
! if (!use_lso) {
! tcbhead = i40e_non_lso_chain(itrq, mp, &needed_desc);
! } else {
! tcbhead = i40e_lso_chain(itrq, mp, &meo, &tctx, &needed_desc);
}
+ if (tcbhead == NULL)
+ goto txfail;
+
+ tcbhead->tcb_mp = mp;
+
/*
! * The second condition ensures that 'itrq_desc_tail' never
! * equals 'itrq_desc_head'. This enforces the rule found in
! * the second bullet point of section 8.4.3.1.5 of the XL710
! * PG, which declares the TAIL pointer in I40E_QTX_TAIL should
! * never overlap with the head. This means that we only ever
! * have 'itrq_tx_ring_size - 1' total available descriptors.
*/
mutex_enter(&itrq->itrq_tx_lock);
! if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh ||
! (itrq->itrq_desc_free - 1) < needed_desc) {
txs->itxs_err_nodescs.value.ui64++;
mutex_exit(&itrq->itrq_tx_lock);
goto txfail;
}
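
Read in isolation, the availability check above amounts to the following standalone sketch (the ex_* name is hypothetical, not driver code): with a ring of N entries at most N - 1 descriptors may ever be outstanding, so the request only proceeds when free - 1 still covers it and free has not fallen below the recycle threshold.

#include <sys/types.h>

/*
 * Hypothetical restatement of the check above: 'free' plays the role
 * of itrq_desc_free, 'needed' is the descriptor count computed for
 * this packet, and 'block_thresh' is i40e_tx_block_thresh.
 */
static boolean_t
ex_tx_ring_has_room(uint_t free, uint_t needed, uint_t block_thresh)
{
        if (free < block_thresh)
                return (B_FALSE);

        /* Tail may never catch the head: only ring_size - 1 slots usable. */
        return ((free - 1) >= needed);
}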
+ if (do_ctx_desc) {
/*
! * If we're enabling any offloads for this frame, then we'll
! * need to build up a transmit context descriptor, first. The
! * context descriptor needs to be placed in the TX ring before
! * the data descriptor(s). See section 8.4.2, table 8-16.
*/
! uint_t tail = itrq->itrq_desc_tail;
itrq->itrq_desc_free--;
! ctxdesc = (i40e_tx_context_desc_t *)&itrq->itrq_desc_ring[tail];
! itrq->itrq_tcb_work_list[tail] = tcb_ctx;
! itrq->itrq_desc_tail = i40e_next_desc(tail, 1,
itrq->itrq_tx_ring_size);
! /* QW0 */
! type = I40E_TX_DESC_DTYPE_CONTEXT;
! ctxdesc->tunneling_params = 0;
! ctxdesc->l2tag2 = 0;
+ /* QW1 */
+ ctxdesc->type_cmd_tso_mss = CPU_TO_LE64((uint64_t)type);
+ if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) {
+ ctxdesc->type_cmd_tso_mss |= CPU_TO_LE64((uint64_t)
+ ((uint64_t)tctx.itc_ctx_cmdflags <<
+ I40E_TXD_CTX_QW1_CMD_SHIFT) |
+ ((uint64_t)tctx.itc_ctx_tsolen <<
+ I40E_TXD_CTX_QW1_TSO_LEN_SHIFT) |
+ ((uint64_t)tctx.itc_ctx_mss <<
+ I40E_TXD_CTX_QW1_MSS_SHIFT));
+ }
+ }
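
As a concrete reading of the QW1 packing above, here is a hypothetical worked example (the ex_* name and all values are illustrative, not taken from the driver; it reuses the shift macros from the surrounding code and assumes the usual driver headers are in scope, and that the TSO length counts payload bytes only, excluding the L2/L3/L4 headers):

/*
 * Hypothetical example: a 7354-byte TCP message carrying 54 bytes of
 * Ethernet/IP/TCP headers with an MSS of 1460.
 */
static uint64_t
ex_tso_ctx_qw1(void)
{
        const uint64_t cmd = I40E_TX_CTX_DESC_TSO;      /* TSO enable flag */
        const uint64_t tsolen = 7354 - 54;              /* payload bytes only */
        const uint64_t mss = 1460;                      /* bytes per segment */

        return ((uint64_t)I40E_TX_DESC_DTYPE_CONTEXT |
            (cmd << I40E_TXD_CTX_QW1_CMD_SHIFT) |
            (tsolen << I40E_TXD_CTX_QW1_TSO_LEN_SHIFT) |
            (mss << I40E_TXD_CTX_QW1_MSS_SHIFT));
}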
+
+ tcb = tcbhead;
+ while (tcb != NULL) {
+
+ itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb;
+ if (tcb->tcb_type == I40E_TX_COPY) {
+ boolean_t last_desc = (tcb->tcb_next == NULL);
+
+ i40e_tx_set_data_desc(itrq, &tctx,
+ (caddr_t)tcb->tcb_dma.dmab_dma_address,
+ tcb->tcb_dma.dmab_len, last_desc);
+ } else {
+ boolean_t last_desc = B_FALSE;
+ ASSERT3S(tcb->tcb_type, ==, I40E_TX_DMA);
+
+ for (uint_t c = 0; c < tcb->tcb_bind_ncookies; c++) {
+ last_desc = (c == tcb->tcb_bind_ncookies - 1) &&
+ (tcb->tcb_next == NULL);
+
+ i40e_tx_set_data_desc(itrq, &tctx,
+ tcb->tcb_bind_info[c].dbi_paddr,
+ tcb->tcb_bind_info[c].dbi_len,
+ last_desc);
+ }
+ }
+
+ tcb = tcb->tcb_next;
+ }
+
/*
* Now, finally, sync the DMA data and alert hardware.
*/
I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV);
I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index),
itrq->itrq_desc_tail);
+
if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) !=
DDI_FM_OK) {
/*
* Note, we can't really go through and clean this up very well,
* because the memory has been given to the device, so just
*** 2237,2249 ****
*/
ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED);
atomic_or_32(&i40e->i40e_state, I40E_ERROR);
}
! txs->itxs_bytes.value.ui64 += mpsize;
txs->itxs_packets.value.ui64++;
! txs->itxs_descriptors.value.ui64++;
mutex_exit(&itrq->itrq_tx_lock);
return (NULL);
--- 2974,2986 ----
*/
ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED);
atomic_or_32(&i40e->i40e_state, I40E_ERROR);
}
! txs->itxs_bytes.value.ui64 += msglen;
txs->itxs_packets.value.ui64++;
! txs->itxs_descriptors.value.ui64 += needed_desc;
mutex_exit(&itrq->itrq_tx_lock);
return (NULL);
*** 2252,2265 ****
* We ran out of resources. Return it to MAC and indicate that we'll
* need to signal MAC. If there are allocated tcb's, return them now.
* Make sure to reset their message blocks, since we'll return them
* back to MAC.
*/
! if (tcb != NULL) {
tcb->tcb_mp = NULL;
i40e_tcb_reset(tcb);
i40e_tcb_free(itrq, tcb);
}
mutex_enter(&itrq->itrq_tx_lock);
itrq->itrq_tx_blocked = B_TRUE;
mutex_exit(&itrq->itrq_tx_lock);
--- 2989,3015 ----
* We ran out of resources. Return it to MAC and indicate that we'll
* need to signal MAC. If there are allocated tcb's, return them now.
* Make sure to reset their message blocks, since we'll return them
* back to MAC.
*/
! if (tcb_ctx != NULL) {
! tcb_ctx->tcb_mp = NULL;
! i40e_tcb_reset(tcb_ctx);
! i40e_tcb_free(itrq, tcb_ctx);
! }
!
! tcb = tcbhead;
! while (tcb != NULL) {
! i40e_tx_control_block_t *next = tcb->tcb_next;
!
! ASSERT(tcb->tcb_type == I40E_TX_DMA ||
! tcb->tcb_type == I40E_TX_COPY);
!
tcb->tcb_mp = NULL;
i40e_tcb_reset(tcb);
i40e_tcb_free(itrq, tcb);
+ tcb = next;
}
mutex_enter(&itrq->itrq_tx_lock);
itrq->itrq_tx_blocked = B_TRUE;
mutex_exit(&itrq->itrq_tx_lock);