NEX-20178 Heavy read load using 10G i40e causes network disconnect
MFV illumos-joyent@83a8d0d616db36010b59cc850d1926c0f6a30de1
OS-7457 i40e Tx freezes on zero descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
OS-7492 i40e Tx freeze when b_cont chain exceeds 8 descriptors
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
MFV illumos-joyent@b4bede175d4c50ac1b36078a677b69388f6fb59f
OS-7577 initialize FC for i40e
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@61dc3dec4f82a3e13e94609a0a83d5f66c64e760
OS-6846 want i40e multi-group support
OS-7372 i40e_alloc_ring_mem() unwinds when it shouldn't
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@6f6fae1b433b461a7b014f48ad94fc7f4927c6ed
OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Ryan Zezeski <rpz@joyent.com>
MFV: illumos-joyent@757454db6669c1186f60bc625510c1b67217aae6
OS-7082 i40e: blown assert in i40e_tx_cleanup_ring()
OS-7086 i40e: add mdb dcmd to dump info on tx descriptor rings
OS-7101 i40e: add kstat to track TX DMA bind failures
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>
MFV: illumos-joyent@9e30beee2f0c127bf41868db46257124206e28d6
OS-5225 Want Fortville TSO support
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>
Author: Rob Johnston <rob.johnston@joyent.com>

*** 9,19 **** * http://www.illumos.org/license/CDDL. */ /* * Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved. ! * Copyright 2016 Joyent, Inc. */ #include "i40e_sw.h" /* --- 9,19 ---- * http://www.illumos.org/license/CDDL. */ /* * Copyright 2015 OmniTI Computer Consulting, Inc. All rights reserved. ! * Copyright 2019 Joyent, Inc. */ #include "i40e_sw.h" /*
*** 58,80 **** * i40e_t`i40e_sdu changes. * * This size is then rounded up to the nearest 1k chunk, which represents the * actual amount of memory that we'll allocate for a single frame. * ! * Note, that for rx, we do something that might be unexpected. We always add * an extra two bytes to the frame size that we allocate. We then offset the DMA * address that we receive a packet into by two bytes. This ensures that the IP * header will always be 4 byte aligned because the MAC header is either 14 or * 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's * and MAC's lives easier. * ! * Both the rx and tx descriptor rings (which are what we use to communicate * with hardware) are allocated as a single region of DMA memory which is the * size of the descriptor (4 bytes and 2 bytes respectively) times the total ! * number of descriptors for an rx and tx ring. * ! * While the rx and tx descriptors are allocated using DMA-based memory, the * control blocks for each of them are allocated using normal kernel memory. * They aren't special from a DMA perspective. We'll go over the design of both * receiving and transmitting separately, as they have slightly different * control blocks and different ways that we manage the relationship between * control blocks and descriptors. --- 58,80 ---- * i40e_t`i40e_sdu changes. * * This size is then rounded up to the nearest 1k chunk, which represents the * actual amount of memory that we'll allocate for a single frame. * ! * Note, that for RX, we do something that might be unexpected. We always add * an extra two bytes to the frame size that we allocate. We then offset the DMA * address that we receive a packet into by two bytes. This ensures that the IP * header will always be 4 byte aligned because the MAC header is either 14 or * 18 bytes in length, depending on the use of 802.1Q tagging, which makes IP's * and MAC's lives easier. * ! * Both the RX and TX descriptor rings (which are what we use to communicate * with hardware) are allocated as a single region of DMA memory which is the * size of the descriptor (4 bytes and 2 bytes respectively) times the total ! * number of descriptors for an RX and TX ring. * ! * While the RX and TX descriptors are allocated using DMA-based memory, the * control blocks for each of them are allocated using normal kernel memory. * They aren't special from a DMA perspective. We'll go over the design of both * receiving and transmitting separately, as they have slightly different * control blocks and different ways that we manage the relationship between * control blocks and descriptors.
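To make the buffer-sizing arithmetic above concrete, here is a minimal userland sketch; the names and example values are illustrative and are not the driver's own:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the RX buffer sizing described above: two bytes of slack are
 * added so that, behind a 14- or 18-byte MAC header, the IP header lands on
 * a 4-byte boundary, and the per-frame allocation is rounded up to 1 KiB.
 */
int
main(void)
{
	const uint32_t ip_align = 2;
	uint32_t sdu = 1514;			/* example MAC SDU */
	uint32_t bufsz = sdu + ip_align;

	bufsz = (bufsz + 1023) & ~1023u;	/* round up to the next 1 KiB */

	printf("rx buffer %u bytes; IP header offset %u (no VLAN) / %u (VLAN)\n",
	    bufsz, 14 + ip_align, 18 + ip_align);
	return (0);
}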
*** 111,145 **** * builds, we allow someone to whack the variable i40e_debug_rx_mode to override * the behavior and always do a bcopy or a DMA bind. * * To try and ensure that the device always has blocks that it can receive data * into, we maintain two lists of control blocks, a working list and a free ! * list. Each list is sized equal to the number of descriptors in the rx ring. ! * During the GLDv3 mc_start routine, we allocate a number of rx control blocks * equal to twice the number of descriptors in the ring and we assign them * equally to the free list and to the working list. Each control block also has * DMA memory allocated and associated with which it will be used to receive the * actual packet data. All of a received frame's data will end up in a single * DMA buffer. * ! * During operation, we always maintain the invariant that each rx descriptor ! * has an associated rx control block which lives in the working list. If we * feel that we should loan up DMA memory to MAC in the form of a message block, * we can only do so if we can maintain this invariant. To do that, we swap in * one of the buffers from the free list. If none are available, then we resort * to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the * size. * * Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is ! * called on the block, at which point we restore the rx control block to the * free list and are able to reuse the DMA memory again. While the scheme may * seem odd, it importantly keeps us out of trying to do any DMA allocations in * the normal path of operation, even though we may still have to allocate * message blocks and copy. * ! * The following state machine describes the life time of a rx control block. In ! * the diagram we abbrviate the rx ring descriptor entry as rxd and the rx * control block entry as rcb. * * | | * * ... 1/2 of all initial rcb's ... * * | | --- 111,145 ---- * builds, we allow someone to whack the variable i40e_debug_rx_mode to override * the behavior and always do a bcopy or a DMA bind. * * To try and ensure that the device always has blocks that it can receive data * into, we maintain two lists of control blocks, a working list and a free ! * list. Each list is sized equal to the number of descriptors in the RX ring. ! * During the GLDv3 mc_start routine, we allocate a number of RX control blocks * equal to twice the number of descriptors in the ring and we assign them * equally to the free list and to the working list. Each control block also has * DMA memory allocated and associated with which it will be used to receive the * actual packet data. All of a received frame's data will end up in a single * DMA buffer. * ! * During operation, we always maintain the invariant that each RX descriptor ! * has an associated RX control block which lives in the working list. If we * feel that we should loan up DMA memory to MAC in the form of a message block, * we can only do so if we can maintain this invariant. To do that, we swap in * one of the buffers from the free list. If none are available, then we resort * to using allocb(9F) and bcopy(9F) on the packet instead, regardless of the * size. * * Loaned message blocks come back to use when freemsg(9F) or freeb(9F) is ! * called on the block, at which point we restore the RX control block to the * free list and are able to reuse the DMA memory again. 
While the scheme may * seem odd, it importantly keeps us out of trying to do any DMA allocations in * the normal path of operation, even though we may still have to allocate * message blocks and copy. * ! * The following state machine describes the life time of an RX control block. In ! * the diagram we abbreviate the RX ring descriptor entry as rxd and the rx * control block entry as rcb. * * | | * * ... 1/2 of all initial rcb's ... * * | |
*** 158,172 **** * | and it is v * | recycled. +-------------------+ * +--------------------<-----| rcb loaned to MAC | * +-------------------+ * ! * Finally, note that every rx control block has a reference count on it. One * reference is added as long as the driver has had the GLDv3 mc_start endpoint * called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and * no other DLPI consumers remain, then we'll decrement the reference count by ! * one. Whenever we loan up the rx control block and associated buffer to MAC, * then we bump the reference count again. Even though the device is stopped, * there may still be loaned frames in upper levels that we'll want to account * for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure * that it is cleaned up. * --- 158,172 ---- * | and it is v * | recycled. +-------------------+ * +--------------------<-----| rcb loaned to MAC | * +-------------------+ * ! * Finally, note that every RX control block has a reference count on it. One * reference is added as long as the driver has had the GLDv3 mc_start endpoint * called. If the GLDv3 mc_stop entry point is called, IP has been unplumbed and * no other DLPI consumers remain, then we'll decrement the reference count by ! * one. Whenever we loan up the RX control block and associated buffer to MAC, * then we bump the reference count again. Even though the device is stopped, * there may still be loaned frames in upper levels that we'll want to account * for. Our callback from freemsg(9F)/freeb(9F) will take care of making sure * that it is cleaned up. *
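The loan-or-copy decision and the reference counting described above can be modeled in a few lines; this is only an illustrative sketch with made-up names (rcb_ref, rx_one), not driver code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct rcb {
	int	rcb_ref;		/* reference count, as described above */
	char	rcb_buf[2048];		/* stands in for the DMA buffer */
} rcb_t;

#define	NDESC	4

static rcb_t	*work[NDESC];		/* one rcb per descriptor, always */
static rcb_t	*free_list[NDESC];
static int	nfree = NDESC;

static void
rx_one(int slot, const char *frame, size_t len)
{
	if (nfree > 0) {
		/*
		 * Loan the filled buffer up the stack and swap a spare rcb
		 * into the descriptor slot, preserving the invariant. The
		 * freemsg(9F) callback would later drop the reference and
		 * return the rcb to the free list (omitted here).
		 */
		rcb_t *loaned = work[slot];
		loaned->rcb_ref++;
		work[slot] = free_list[--nfree];
		printf("slot %d: loaned rcb, ref %d\n", slot, loaned->rcb_ref);
	} else {
		/* No spare rcb: copy the frame instead. */
		char *copy = malloc(len);
		(void) memcpy(copy, frame, len);
		printf("slot %d: copied %zu bytes\n", slot, len);
		free(copy);
	}
}

int
main(void)
{
	for (int i = 0; i < NDESC; i++) {
		work[i] = calloc(1, sizeof (rcb_t));
		free_list[i] = calloc(1, sizeof (rcb_t));
		work[i]->rcb_ref = free_list[i]->rcb_ref = 1;
	}
	for (int i = 0; i < NDESC + 1; i++)
		rx_one(i % NDESC, "frame", 5);
	return (0);
}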
*** 190,203 **** * the HEAD and TAIL, inclusive. Note that while we initially program the HEAD, * the only values we ever consult ourselves are the TAIL register and our own * state tracking. Effectively, we cache the HEAD register and then update it * ourselves based on our work. * ! * When we iterate over the rx descriptors and thus the received frames, we are * either in an interrupt context or we've been asked by MAC to poll on the * ring. If we've been asked to poll on the ring, we have a maximum number of ! * bytes of mblk_t's to return. If processing an rx descriptor would cause us to * exceed that count, then we do not process it. When in interrupt context, we * don't have a strict byte count. However, to ensure liveness, we limit the * amount of data based on a configuration value * (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this * is based on similar numbers that are used for ixgbe. After some additional --- 190,203 ---- * the HEAD and TAIL, inclusive. Note that while we initially program the HEAD, * the only values we ever consult ourselves are the TAIL register and our own * state tracking. Effectively, we cache the HEAD register and then update it * ourselves based on our work. * ! * When we iterate over the RX descriptors and thus the received frames, we are * either in an interrupt context or we've been asked by MAC to poll on the * ring. If we've been asked to poll on the ring, we have a maximum number of ! * bytes of mblk_t's to return. If processing an RX descriptor would cause us to * exceed that count, then we do not process it. When in interrupt context, we * don't have a strict byte count. However, to ensure liveness, we limit the * amount of data based on a configuration value * (i40e_t`i40e_rx_limit_per_intr). The number that we've started with for this * is based on similar numbers that are used for ixgbe. After some additional
*** 247,281 **** * * While the transmit path is similar in spirit to the receive path, it works * differently due to the fact that all data is originated by the operating * system and not by the device. * ! * Like rx, there is both a descriptor ring that we use to communicate to the * driver and which points to the memory used to transmit a frame. Similarly, ! * there is a corresponding transmit control block. Each transmit control block ! * has a region of DMA memory allocated to it; however, the way we use it ! * varies. * * The driver is asked to process a single frame at a time. That message block * may be made up of multiple fragments linked together by the mblk_t`b_cont * member. The device has a hard limit of up to 8 buffers being allowed for use ! * for a single logical frame. For each fragment, we'll try and use an entry ! * from the tx descriptor ring and then we'll allocate a corresponding tx ! * control block. Depending on the size of the fragment, we may copy it around ! * or we might instead try to do DMA binding of the fragment. * ! * If we exceed the number of blocks that fit, we'll try to pull up the block ! * and then we'll do a DMA bind and send it out. * ! * If we don't have enough space in the ring or tx control blocks available, * then we'll return the unprocessed message block to MAC. This will induce flow * control and once we recycle enough entries, we'll once again enable sending * on the ring. * * We size the working list as equal to the number of descriptors in the ring. * We size the free list as equal to 1.5 times the number of descriptors in the ! * ring. We'll allocate a number of tx control block entries equal to the number * of entries in the free list. By default, all entries are placed in the free * list. As we come along and try to send something, we'll allocate entries from * the free list and add them to the working list, where they'll stay until the * hardware indicates that all of the data has been written back to us. The * reason that we start with 1.5x is to help facilitate having more than one TX --- 247,304 ---- * * While the transmit path is similar in spirit to the receive path, it works * differently due to the fact that all data is originated by the operating * system and not by the device. * ! * Like RX, there is both a descriptor ring that we use to communicate to the * driver and which points to the memory used to transmit a frame. Similarly, ! * there is a corresponding transmit control block, however, the correspondence ! * between descriptors and control blocks is more complex and not necessarily ! * 1-to-1. * * The driver is asked to process a single frame at a time. That message block * may be made up of multiple fragments linked together by the mblk_t`b_cont * member. The device has a hard limit of up to 8 buffers being allowed for use ! * for a single non-LSO packet or LSO segment. The number of TX ring entries ! * (and thus TX control blocks) used depends on the fragment sizes and DMA ! * layout, as explained below. * ! * We alter our DMA strategy based on a threshold tied to the fragment size. ! * This threshold is configurable via the tx_dma_threshold property. If the ! * fragment is above the threshold, we DMA bind it -- consuming one TCB and ! * potentially several data descriptors. The exact number of descriptors (equal ! * to the number of DMA cookies) depends on page size, MTU size, b_rptr offset ! * into page, b_wptr offset into page, and the physical layout of the dblk's ! * memory (contiguous or not). 
Essentially, we are at the mercy of the DMA ! * engine and the dblk's memory allocation. Knowing the exact number of ! * descriptors up front is a task best not taken on by the driver itself. ! * Instead, we attempt to DMA bind the fragment and verify the descriptor ! * layout meets hardware constraints. If the proposed DMA bind does not satisfy ! * the hardware constraints, then we discard it and instead copy the entire ! * fragment into the pre-allocated TCB buffer (or buffers if the fragment is ! * larger than the TCB buffer). * ! * If the fragment is below or at the threshold, we copy it to the pre-allocated ! * buffer of a TCB. We compress consecutive copy fragments into a single TCB to ! * conserve resources. We are guaranteed that the TCB buffer is made up of only ! * 1 DMA cookie; and therefore consumes only one descriptor on the controller. ! * ! * Furthermore, if the frame requires HW offloads such as LSO, tunneling or ! * filtering, then the TX data descriptors must be preceded by a single TX ! * context descriptor. Because there is no DMA transfer associated with the ! * context descriptor, we allocate a control block with a special type which ! * indicates to the TX ring recycle code that there are no associated DMA ! * resources to unbind when the control block is free'd. ! * ! * If we don't have enough space in the ring or TX control blocks available, * then we'll return the unprocessed message block to MAC. This will induce flow * control and once we recycle enough entries, we'll once again enable sending * on the ring. * * We size the working list as equal to the number of descriptors in the ring. * We size the free list as equal to 1.5 times the number of descriptors in the ! * ring. We'll allocate a number of TX control block entries equal to the number * of entries in the free list. By default, all entries are placed in the free * list. As we come along and try to send something, we'll allocate entries from * the free list and add them to the working list, where they'll stay until the * hardware indicates that all of the data has been written back to us. The * reason that we start with 1.5x is to help facilitate having more than one TX
*** 323,356 **** * | * v * +------------------+ +------------------+ * | tcb on free list |---*------------------>| tcb on work list | * +------------------+ . +------------------+ ! * ^ . tcb allocated | * | to send frame v * | or fragment on | * | wire, mblk from | * | MAC associated. | * | | * +------*-------------------------------<----+ * . * . Hardware indicates * entry transmitted. ! * tcb recycled, mblk * from MAC freed. * * ------------ * Blocking MAC * ------------ * ! * Wen performing transmit, we can run out of descriptors and ring entries. When ! * such a case happens, we return the mblk_t to MAC to indicate that we've been ! * blocked. At that point in time, MAC becomes blocked and will not transmit ! * anything out that specific ring until we notify MAC. To indicate that we're ! * in such a situation we set i40e_trqpair_t`itrq_tx_blocked member to B_TRUE. * ! * When we recycle tx descriptors then we'll end up signaling MAC by calling * mac_tx_ring_update() if we were blocked, letting it know that it's safe to * start sending frames out to us again. */ /* --- 346,386 ---- * | * v * +------------------+ +------------------+ * | tcb on free list |---*------------------>| tcb on work list | * +------------------+ . +------------------+ ! * ^ . N tcbs allocated[1] | * | to send frame v * | or fragment on | * | wire, mblk from | * | MAC associated. | * | | * +------*-------------------------------<----+ * . * . Hardware indicates * entry transmitted. ! * tcbs recycled, mblk * from MAC freed. * + * [1] We allocate N tcbs to transmit a single frame where N can be 1 context + * descriptor plus 1 data descriptor, in the non-DMA-bind case. In the DMA + * bind case, N can be 1 context descriptor plus 1 data descriptor per + * b_cont in the mblk. In this case, the mblk is associated with the first + * data descriptor and freed as part of freeing that data descriptor. + * * ------------ * Blocking MAC * ------------ * ! * When performing transmit, we can run out of descriptors and ring entries. ! * When such a case happens, we return the mblk_t to MAC to indicate that we've ! * been blocked. At that point in time, MAC becomes blocked and will not ! * transmit anything out that specific ring until we notify MAC. To indicate ! * that we're in such a situation we set i40e_trqpair_t`itrq_tx_blocked member ! * to B_TRUE. * ! * When we recycle TX descriptors then we'll end up signaling MAC by calling * mac_tx_ring_update() if we were blocked, letting it know that it's safe to * start sending frames out to us again. */ /*
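The per-fragment copy-versus-bind strategy and the 8-descriptor budget described above can be illustrated with a short sketch; the threshold, fragment lengths, and cookie counts below are made up, and the real driver additionally compresses consecutive copies into a single TCB buffer:

#include <stdio.h>

#define	MAX_DESC	8	/* hardware limit per non-LSO packet */
#define	COPY_THRESHOLD	256	/* stands in for tx_dma_threshold */

int
main(void)
{
	/* Fragment lengths of one frame and the cookies a bind would need. */
	int frag_len[] = { 64, 1448, 1448, 90 };
	int bind_cookies[] = { 1, 2, 2, 1 };
	int nfrags = 4, used = 0, force_copy = 0;

	for (int i = 0; i < nfrags; i++) {
		if (frag_len[i] <= COPY_THRESHOLD || force_copy) {
			/* Copy: the TCB buffer is a single cookie. */
			used += 1;
			printf("frag %d: copy, 1 descriptor\n", i);
		} else if (used + bind_cookies[i] > MAX_DESC ||
		    (used + bind_cookies[i] == MAX_DESC && i != nfrags - 1)) {
			/* Binding would blow the budget: copy the rest. */
			force_copy = 1;
			used += 1;
			printf("frag %d: forced copy, 1 descriptor\n", i);
		} else {
			used += bind_cookies[i];
			printf("frag %d: bind, %d descriptors\n", i,
			    bind_cookies[i]);
		}
	}
	printf("%d of %d descriptors used\n", used, MAX_DESC);
	return (0);
}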
*** 365,381 **** #error "unknown architecture for i40e" #endif /* * This structure is used to maintain information and flags related to ! * transmitting a frame. The first member is the set of flags we need to or into ! * the command word (generally checksumming related). The second member controls ! * the word offsets which is required for IP and L4 checksumming. */ typedef struct i40e_tx_context { ! enum i40e_tx_desc_cmd_bits itc_cmdflags; ! uint32_t itc_offsets; } i40e_tx_context_t; /* * Toggles on debug builds which can be used to override our RX behaviour based * on thresholds. --- 395,413 ---- #error "unknown architecture for i40e" #endif /* * This structure is used to maintain information and flags related to ! * transmitting a frame. These fields are ultimately used to construct the ! * TX data descriptor(s) and, if necessary, the TX context descriptor. */ typedef struct i40e_tx_context { ! enum i40e_tx_desc_cmd_bits itc_data_cmdflags; ! uint32_t itc_data_offsets; ! enum i40e_tx_ctx_desc_cmd_bits itc_ctx_cmdflags; ! uint32_t itc_ctx_tsolen; ! uint32_t itc_ctx_mss; } i40e_tx_context_t; /* * Toggles on debug builds which can be used to override our RX behaviour based * on thresholds.
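As a worked example of the LSO fields above (illustrative values only): itc_ctx_tsolen is the payload length, i.e. the full message minus the L2/L3/L4 headers, and itc_ctx_mss is the MSS handed to us by MAC; the controller then carves that payload into MSS-sized segments:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t msg_len = 65226;		/* total mblk chain length */
	uint32_t l2 = 14, l3 = 20, l4 = 20;	/* Ethernet + IPv4 + TCP */
	uint32_t mss = 1460;			/* from mac_lso_get() */

	uint32_t tsolen = msg_len - (l2 + l3 + l4);	/* itc_ctx_tsolen */
	uint32_t nsegs = (tsolen + mss - 1) / mss;	/* segments emitted */

	printf("tsolen %u, mss %u -> %u segments\n", tsolen, mss, nsegs);
	return (0);
}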
*** 393,410 **** /* * Notes on the following pair of DMA attributes. The first attribute, * i40e_static_dma_attr, is designed to be used for both the descriptor rings * and the static buffers that we associate with control blocks. For this * reason, we force an SGL length of one. While technically the driver supports ! * a larger SGL (5 on rx and 8 on tx), we opt to only use one to simplify our * management here. In addition, when the Intel common code wants to allocate * memory via the i40e_allocate_virt_mem osdep function, we have it leverage * the static dma attr. * ! * The second set of attributes, i40e_txbind_dma_attr, is what we use when we're ! * binding a bunch of mblk_t fragments to go out the door. Note that the main ! * difference here is that we're allowed a larger SGL length -- eight. * * Note, we default to setting ourselves to be DMA capable here. However, * because we could have multiple instances which have different FMA error * checking capabilities, or end up on different buses, we make these static * and const and copy them into the i40e_t for the given device with the actual --- 425,446 ---- /* * Notes on the following pair of DMA attributes. The first attribute, * i40e_static_dma_attr, is designed to be used for both the descriptor rings * and the static buffers that we associate with control blocks. For this * reason, we force an SGL length of one. While technically the driver supports ! * a larger SGL (5 on RX and 8 on TX), we opt to only use one to simplify our * management here. In addition, when the Intel common code wants to allocate * memory via the i40e_allocate_virt_mem osdep function, we have it leverage * the static dma attr. * ! * The latter two sets of attributes, are what we use when we're binding a ! * bunch of mblk_t fragments to go out the door. Note that the main difference ! * here is that we're allowed a larger SGL length. For non-LSO TX, we ! * restrict the SGL length to match the number of TX buffers available to the ! * PF (8). For the LSO case we can go much larger, with the caveat that each ! * MSS-sized chunk (segment) must not span more than 8 data descriptors and ! * hence must not span more than 8 cookies. * * Note, we default to setting ourselves to be DMA capable here. However, * because we could have multiple instances which have different FMA error * checking capabilities, or end up on different buses, we make these static * and const and copy them into the i40e_t for the given device with the actual
*** 427,437 **** static const ddi_dma_attr_t i40e_g_txbind_dma_attr = { DMA_ATTR_V0, /* version number */ 0x0000000000000000ull, /* low address */ 0xFFFFFFFFFFFFFFFFull, /* high address */ ! 0x00000000FFFFFFFFull, /* dma counter max */ I40E_DMA_ALIGNMENT, /* alignment */ 0x00000FFF, /* burst sizes */ 0x00000001, /* minimum transfer size */ 0x00000000FFFFFFFFull, /* maximum transfer size */ 0xFFFFFFFFFFFFFFFFull, /* maximum segment size */ --- 463,473 ---- static const ddi_dma_attr_t i40e_g_txbind_dma_attr = { DMA_ATTR_V0, /* version number */ 0x0000000000000000ull, /* low address */ 0xFFFFFFFFFFFFFFFFull, /* high address */ ! I40E_MAX_TX_BUFSZ - 1, /* dma counter max */ I40E_DMA_ALIGNMENT, /* alignment */ 0x00000FFF, /* burst sizes */ 0x00000001, /* minimum transfer size */ 0x00000000FFFFFFFFull, /* maximum transfer size */ 0xFFFFFFFFFFFFFFFFull, /* maximum segment size */
*** 438,447 **** --- 474,498 ---- I40E_TX_MAX_COOKIE, /* scatter/gather list length */ 0x00000001, /* granularity */ DDI_DMA_FLAGERR /* DMA flags */ }; + static const ddi_dma_attr_t i40e_g_txbind_lso_dma_attr = { + DMA_ATTR_V0, /* version number */ + 0x0000000000000000ull, /* low address */ + 0xFFFFFFFFFFFFFFFFull, /* high address */ + I40E_MAX_TX_BUFSZ - 1, /* dma counter max */ + I40E_DMA_ALIGNMENT, /* alignment */ + 0x00000FFF, /* burst sizes */ + 0x00000001, /* minimum transfer size */ + 0x00000000FFFFFFFFull, /* maximum transfer size */ + 0xFFFFFFFFFFFFFFFFull, /* maximum segment size */ + I40E_TX_LSO_MAX_COOKIE, /* scatter/gather list length */ + 0x00000001, /* granularity */ + DDI_DMA_FLAGERR /* DMA flags */ + }; + /* * Next, we have the attributes for these structures. The descriptor rings are * all strictly little endian, while the data buffers are just arrays of bytes * representing frames. Because of this, we purposefully simplify the driver * programming life by programming the descriptor ring as little endian, while
*** 666,685 **** rxd->rxd_rcb_free = rxd->rxd_free_list_size; rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) * rxd->rxd_ring_size, KM_NOSLEEP); if (rxd->rxd_work_list == NULL) { ! i40e_error(i40e, "failed to allocate rx work list for a ring " "of %d entries for ring %d", rxd->rxd_ring_size, itrq->itrq_index); goto cleanup; } rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) * rxd->rxd_free_list_size, KM_NOSLEEP); if (rxd->rxd_free_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry rx free list " "for ring %d", rxd->rxd_free_list_size, itrq->itrq_index); goto cleanup; } rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) * --- 717,736 ---- rxd->rxd_rcb_free = rxd->rxd_free_list_size; rxd->rxd_work_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) * rxd->rxd_ring_size, KM_NOSLEEP); if (rxd->rxd_work_list == NULL) { ! i40e_error(i40e, "failed to allocate RX work list for a ring " "of %d entries for ring %d", rxd->rxd_ring_size, itrq->itrq_index); goto cleanup; } rxd->rxd_free_list = kmem_zalloc(sizeof (i40e_rx_control_block_t *) * rxd->rxd_free_list_size, KM_NOSLEEP); if (rxd->rxd_free_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry RX free list " "for ring %d", rxd->rxd_free_list_size, itrq->itrq_index); goto cleanup; } rxd->rxd_rcb_area = kmem_zalloc(sizeof (i40e_rx_control_block_t) *
*** 763,781 **** size_t dmasz; i40e_rx_control_block_t *rcb; i40e_t *i40e = rxd->rxd_i40e; /* ! * First allocate the rx descriptor ring. */ dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size; VERIFY(dmasz > 0); if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area, &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE, B_TRUE, dmasz) == B_FALSE) { i40e_error(i40e, "failed to allocate DMA resources " ! "for rx descriptor ring"); return (B_FALSE); } rxd->rxd_desc_ring = (i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address; rxd->rxd_desc_next = 0; --- 814,832 ---- size_t dmasz; i40e_rx_control_block_t *rcb; i40e_t *i40e = rxd->rxd_i40e; /* ! * First allocate the RX descriptor ring. */ dmasz = sizeof (i40e_rx_desc_t) * rxd->rxd_ring_size; VERIFY(dmasz > 0); if (i40e_alloc_dma_buffer(i40e, &rxd->rxd_desc_area, &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE, B_TRUE, dmasz) == B_FALSE) { i40e_error(i40e, "failed to allocate DMA resources " ! "for RX descriptor ring"); return (B_FALSE); } rxd->rxd_desc_ring = (i40e_rx_desc_t *)(uintptr_t)rxd->rxd_desc_area.dmab_address; rxd->rxd_desc_next = 0;
*** 797,807 **** dmap = &rcb->rcb_dma; if (i40e_alloc_dma_buffer(i40e, dmap, &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr, B_TRUE, B_FALSE, dmasz) == B_FALSE) { ! i40e_error(i40e, "failed to allocate rx dma buffer"); return (B_FALSE); } /* * Initialize the control block and offset the DMA address. See --- 848,858 ---- dmap = &rcb->rcb_dma; if (i40e_alloc_dma_buffer(i40e, dmap, &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr, B_TRUE, B_FALSE, dmasz) == B_FALSE) { ! i40e_error(i40e, "failed to allocate RX dma buffer"); return (B_FALSE); } /* * Initialize the control block and offset the DMA address. See
*** 839,849 **** --- 890,904 ---- i40e_free_dma_buffer(&tcb->tcb_dma); if (tcb->tcb_dma_handle != NULL) { ddi_dma_free_handle(&tcb->tcb_dma_handle); tcb->tcb_dma_handle = NULL; } + if (tcb->tcb_lso_dma_handle != NULL) { + ddi_dma_free_handle(&tcb->tcb_lso_dma_handle); + tcb->tcb_lso_dma_handle = NULL; } + } fsz = sizeof (i40e_tx_control_block_t) * itrq->itrq_tx_free_list_size; kmem_free(itrq->itrq_tcb_area, fsz); itrq->itrq_tcb_area = NULL;
*** 879,898 **** itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size; itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size + (i40e->i40e_tx_ring_size >> 1); /* ! * Allocate an additional tx descriptor for the writeback head. */ dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size; dmasz += sizeof (i40e_tx_desc_t); VERIFY(dmasz > 0); if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area, &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE, B_TRUE, dmasz) == B_FALSE) { ! i40e_error(i40e, "failed to allocate DMA resources for tx " "descriptor ring"); return (B_FALSE); } itrq->itrq_desc_ring = (i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address; --- 934,953 ---- itrq->itrq_tx_ring_size = i40e->i40e_tx_ring_size; itrq->itrq_tx_free_list_size = i40e->i40e_tx_ring_size + (i40e->i40e_tx_ring_size >> 1); /* ! * Allocate an additional TX descriptor for the writeback head. */ dmasz = sizeof (i40e_tx_desc_t) * itrq->itrq_tx_ring_size; dmasz += sizeof (i40e_tx_desc_t); VERIFY(dmasz > 0); if (i40e_alloc_dma_buffer(i40e, &itrq->itrq_desc_area, &i40e->i40e_static_dma_attr, &i40e->i40e_desc_acc_attr, B_FALSE, B_TRUE, dmasz) == B_FALSE) { ! i40e_error(i40e, "failed to allocate DMA resources for TX " "descriptor ring"); return (B_FALSE); } itrq->itrq_desc_ring = (i40e_tx_desc_t *)(uintptr_t)itrq->itrq_desc_area.dmab_address;
*** 903,928 **** itrq->itrq_desc_free = itrq->itrq_tx_ring_size; itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size * sizeof (i40e_tx_control_block_t *), KM_NOSLEEP); if (itrq->itrq_tcb_work_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry tx work list " "for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index); goto cleanup; } itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size * sizeof (i40e_tx_control_block_t *), KM_SLEEP); if (itrq->itrq_tcb_free_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry tx free list " "for ring %d", itrq->itrq_tx_free_list_size, itrq->itrq_index); goto cleanup; } /* ! * We allocate enough tx control blocks to cover the free list. */ itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) * itrq->itrq_tx_free_list_size, KM_NOSLEEP); if (itrq->itrq_tcb_area == NULL) { i40e_error(i40e, "failed to allocate a %d entry tcb area for " --- 958,983 ---- itrq->itrq_desc_free = itrq->itrq_tx_ring_size; itrq->itrq_tcb_work_list = kmem_zalloc(itrq->itrq_tx_ring_size * sizeof (i40e_tx_control_block_t *), KM_NOSLEEP); if (itrq->itrq_tcb_work_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry TX work list " "for ring %d", itrq->itrq_tx_ring_size, itrq->itrq_index); goto cleanup; } itrq->itrq_tcb_free_list = kmem_zalloc(itrq->itrq_tx_free_list_size * sizeof (i40e_tx_control_block_t *), KM_SLEEP); if (itrq->itrq_tcb_free_list == NULL) { ! i40e_error(i40e, "failed to allocate a %d entry TX free list " "for ring %d", itrq->itrq_tx_free_list_size, itrq->itrq_index); goto cleanup; } /* ! * We allocate enough TX control blocks to cover the free list. */ itrq->itrq_tcb_area = kmem_zalloc(sizeof (i40e_tx_control_block_t) * itrq->itrq_tx_free_list_size, KM_NOSLEEP); if (itrq->itrq_tcb_area == NULL) { i40e_error(i40e, "failed to allocate a %d entry tcb area for "
*** 946,967 **** */ ret = ddi_dma_alloc_handle(i40e->i40e_dip, &i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL, &tcb->tcb_dma_handle); if (ret != DDI_SUCCESS) { ! i40e_error(i40e, "failed to allocate DMA handle for tx " "data binding on ring %d: %d", itrq->itrq_index, ret); tcb->tcb_dma_handle = NULL; goto cleanup; } if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma, &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr, B_TRUE, B_FALSE, dmasz) == B_FALSE) { i40e_error(i40e, "failed to allocate %ld bytes of " ! "DMA for tx data binding on ring %d", dmasz, itrq->itrq_index); goto cleanup; } itrq->itrq_tcb_free_list[i] = tcb; --- 1001,1033 ---- */ ret = ddi_dma_alloc_handle(i40e->i40e_dip, &i40e->i40e_txbind_dma_attr, DDI_DMA_DONTWAIT, NULL, &tcb->tcb_dma_handle); if (ret != DDI_SUCCESS) { ! i40e_error(i40e, "failed to allocate DMA handle for TX " "data binding on ring %d: %d", itrq->itrq_index, ret); tcb->tcb_dma_handle = NULL; goto cleanup; } + ret = ddi_dma_alloc_handle(i40e->i40e_dip, + &i40e->i40e_txbind_lso_dma_attr, DDI_DMA_DONTWAIT, NULL, + &tcb->tcb_lso_dma_handle); + if (ret != DDI_SUCCESS) { + i40e_error(i40e, "failed to allocate DMA handle for TX " + "LSO data binding on ring %d: %d", itrq->itrq_index, + ret); + tcb->tcb_lso_dma_handle = NULL; + goto cleanup; + } + if (i40e_alloc_dma_buffer(i40e, &tcb->tcb_dma, &i40e->i40e_static_dma_attr, &i40e->i40e_buf_acc_attr, B_TRUE, B_FALSE, dmasz) == B_FALSE) { i40e_error(i40e, "failed to allocate %ld bytes of " ! "DMA for TX data binding on ring %d", dmasz, itrq->itrq_index); goto cleanup; } itrq->itrq_tcb_free_list[i] = tcb;
*** 987,1000 **** for (i = 0; i < i40e->i40e_num_trqpairs; i++) { i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata; /* ! * Clean up our rx data. We have to free DMA resources first and * then if we have no more pending RCB's, then we'll go ahead * and clean things up. Note, we can't set the stopped flag on ! * the rx data until after we've done the first pass of the * pending resources. Otherwise we might race with * i40e_rx_recycle on determining who should free the * i40e_rx_data_t above. */ i40e_free_rx_dma(rxd, failed_init); --- 1053,1073 ---- for (i = 0; i < i40e->i40e_num_trqpairs; i++) { i40e_rx_data_t *rxd = i40e->i40e_trqpairs[i].itrq_rxdata; /* ! * In some cases i40e_alloc_rx_data() may have failed ! * and in that case there is no rxd to free. ! */ ! if (rxd == NULL) ! continue; ! ! /* ! * Clean up our RX data. We have to free DMA resources first and * then if we have no more pending RCB's, then we'll go ahead * and clean things up. Note, we can't set the stopped flag on ! * the RX data until after we've done the first pass of the * pending resources. Otherwise we might race with * i40e_rx_recycle on determining who should free the * i40e_rx_data_t above. */ i40e_free_rx_dma(rxd, failed_init);
*** 1053,1073 **** --- 1126,1152 ---- { bcopy(&i40e_g_static_dma_attr, &i40e->i40e_static_dma_attr, sizeof (ddi_dma_attr_t)); bcopy(&i40e_g_txbind_dma_attr, &i40e->i40e_txbind_dma_attr, sizeof (ddi_dma_attr_t)); + bcopy(&i40e_g_txbind_lso_dma_attr, &i40e->i40e_txbind_lso_dma_attr, + sizeof (ddi_dma_attr_t)); bcopy(&i40e_g_desc_acc_attr, &i40e->i40e_desc_acc_attr, sizeof (ddi_device_acc_attr_t)); bcopy(&i40e_g_buf_acc_attr, &i40e->i40e_buf_acc_attr, sizeof (ddi_device_acc_attr_t)); if (fma == B_TRUE) { i40e->i40e_static_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR; i40e->i40e_txbind_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR; + i40e->i40e_txbind_lso_dma_attr.dma_attr_flags |= + DDI_DMA_FLAGERR; } else { i40e->i40e_static_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR; i40e->i40e_txbind_dma_attr.dma_attr_flags &= ~DDI_DMA_FLAGERR; + i40e->i40e_txbind_lso_dma_attr.dma_attr_flags &= + ~DDI_DMA_FLAGERR; } } static void i40e_rcb_free(i40e_rx_data_t *rxd, i40e_rx_control_block_t *rcb)
*** 1100,1110 **** } /* * This is the callback that we get from the OS when freemsg(9F) has been called * on a loaned descriptor. In addition, if we take the last reference count ! * here, then we have to tear down all of the rx data. */ void i40e_rx_recycle(caddr_t arg) { uint32_t ref; --- 1179,1189 ---- } /* * This is the callback that we get from the OS when freemsg(9F) has been called * on a loaned descriptor. In addition, if we take the last reference count ! * here, then we have to tear down all of the RX data. */ void i40e_rx_recycle(caddr_t arg) { uint32_t ref;
*** 1766,1884 **** /* * Attempt to put togther the information we'll need to feed into a descriptor * to properly program the hardware for checksum offload as well as the * generally required flags. * ! * The i40e_tx_context_t`itc_cmdflags contains the set of flags we need to or ! * into the descriptor based on the checksum flags for this mblk_t and the * actual information we care about. */ static int i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp, ! i40e_tx_context_t *tctx) { ! int ret; ! uint32_t flags, start; ! mac_ether_offload_info_t meo; i40e_txq_stat_t *txs = &itrq->itrq_txstat; bzero(tctx, sizeof (i40e_tx_context_t)); if (i40e->i40e_tx_hcksum_enable != B_TRUE) return (0); ! mac_hcksum_get(mp, &start, NULL, NULL, NULL, &flags); ! if (flags == 0) return (0); - if ((ret = mac_ether_offload_info(mp, &meo)) != 0) { - txs->itxs_hck_meoifail.value.ui64++; - return (ret); - } - /* * Have we been asked to checksum an IPv4 header. If so, verify that we * have sufficient information and then set the proper fields in the * command structure. */ ! if (flags & HCK_IPV4_HDRCKSUM) { ! if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) { txs->itxs_hck_nol2info.value.ui64++; return (-1); } ! if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) { txs->itxs_hck_nol3info.value.ui64++; return (-1); } ! if (meo.meoi_l3proto != ETHERTYPE_IP) { txs->itxs_hck_badl3.value.ui64++; return (-1); } ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM; ! tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) << I40E_TX_DESC_LENGTH_MACLEN_SHIFT; ! tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) << I40E_TX_DESC_LENGTH_IPLEN_SHIFT; } /* * We've been asked to provide an L4 header, first, set up the IP * information in the descriptor if we haven't already before moving * onto seeing if we have enough information for the L4 checksum * offload. */ ! if (flags & HCK_PARTIALCKSUM) { ! if ((meo.meoi_flags & MEOI_L4INFO_SET) == 0) { txs->itxs_hck_nol4info.value.ui64++; return (-1); } ! if (!(flags & HCK_IPV4_HDRCKSUM)) { ! if ((meo.meoi_flags & MEOI_L2INFO_SET) == 0) { txs->itxs_hck_nol2info.value.ui64++; return (-1); } ! if ((meo.meoi_flags & MEOI_L3INFO_SET) == 0) { txs->itxs_hck_nol3info.value.ui64++; return (-1); } ! if (meo.meoi_l3proto == ETHERTYPE_IP) { ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4; ! } else if (meo.meoi_l3proto == ETHERTYPE_IPV6) { ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV6; } else { txs->itxs_hck_badl3.value.ui64++; return (-1); } ! tctx->itc_offsets |= (meo.meoi_l2hlen >> 1) << I40E_TX_DESC_LENGTH_MACLEN_SHIFT; ! tctx->itc_offsets |= (meo.meoi_l3hlen >> 2) << I40E_TX_DESC_LENGTH_IPLEN_SHIFT; } ! switch (meo.meoi_l4proto) { case IPPROTO_TCP: ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_TCP; break; case IPPROTO_UDP: ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_UDP; break; case IPPROTO_SCTP: ! tctx->itc_cmdflags |= I40E_TX_DESC_CMD_L4T_EOFT_SCTP; break; default: txs->itxs_hck_badl4.value.ui64++; return (-1); } ! tctx->itc_offsets |= (meo.meoi_l4hlen >> 2) << I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT; } return (0); } static void i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb) --- 1845,1981 ---- /* * Attempt to put togther the information we'll need to feed into a descriptor * to properly program the hardware for checksum offload as well as the * generally required flags. * ! * The i40e_tx_context_t`itc_data_cmdflags contains the set of flags we need to ! * 'or' into the descriptor based on the checksum flags for this mblk_t and the * actual information we care about. 
+ * + * If the mblk requires LSO then we'll also gather the information that will be + * used to construct the Transmit Context Descriptor. */ static int i40e_tx_context(i40e_t *i40e, i40e_trqpair_t *itrq, mblk_t *mp, ! mac_ether_offload_info_t *meo, i40e_tx_context_t *tctx) { ! uint32_t chkflags, start, mss, lsoflags; i40e_txq_stat_t *txs = &itrq->itrq_txstat; bzero(tctx, sizeof (i40e_tx_context_t)); if (i40e->i40e_tx_hcksum_enable != B_TRUE) return (0); ! mac_hcksum_get(mp, &start, NULL, NULL, NULL, &chkflags); ! mac_lso_get(mp, &mss, &lsoflags); ! ! if (chkflags == 0 && lsoflags == 0) return (0); /* * Have we been asked to checksum an IPv4 header. If so, verify that we * have sufficient information and then set the proper fields in the * command structure. */ ! if (chkflags & HCK_IPV4_HDRCKSUM) { ! if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) { txs->itxs_hck_nol2info.value.ui64++; return (-1); } ! if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) { txs->itxs_hck_nol3info.value.ui64++; return (-1); } ! if (meo->meoi_l3proto != ETHERTYPE_IP) { txs->itxs_hck_badl3.value.ui64++; return (-1); } ! tctx->itc_data_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM; ! tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) << I40E_TX_DESC_LENGTH_MACLEN_SHIFT; ! tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) << I40E_TX_DESC_LENGTH_IPLEN_SHIFT; } /* * We've been asked to provide an L4 header, first, set up the IP * information in the descriptor if we haven't already before moving * onto seeing if we have enough information for the L4 checksum * offload. */ ! if (chkflags & HCK_PARTIALCKSUM) { ! if ((meo->meoi_flags & MEOI_L4INFO_SET) == 0) { txs->itxs_hck_nol4info.value.ui64++; return (-1); } ! if (!(chkflags & HCK_IPV4_HDRCKSUM)) { ! if ((meo->meoi_flags & MEOI_L2INFO_SET) == 0) { txs->itxs_hck_nol2info.value.ui64++; return (-1); } ! if ((meo->meoi_flags & MEOI_L3INFO_SET) == 0) { txs->itxs_hck_nol3info.value.ui64++; return (-1); } ! if (meo->meoi_l3proto == ETHERTYPE_IP) { ! tctx->itc_data_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV4; ! } else if (meo->meoi_l3proto == ETHERTYPE_IPV6) { ! tctx->itc_data_cmdflags |= I40E_TX_DESC_CMD_IIPT_IPV6; } else { txs->itxs_hck_badl3.value.ui64++; return (-1); } ! tctx->itc_data_offsets |= (meo->meoi_l2hlen >> 1) << I40E_TX_DESC_LENGTH_MACLEN_SHIFT; ! tctx->itc_data_offsets |= (meo->meoi_l3hlen >> 2) << I40E_TX_DESC_LENGTH_IPLEN_SHIFT; } ! switch (meo->meoi_l4proto) { case IPPROTO_TCP: ! tctx->itc_data_cmdflags |= ! I40E_TX_DESC_CMD_L4T_EOFT_TCP; break; case IPPROTO_UDP: ! tctx->itc_data_cmdflags |= ! I40E_TX_DESC_CMD_L4T_EOFT_UDP; break; case IPPROTO_SCTP: ! tctx->itc_data_cmdflags |= ! I40E_TX_DESC_CMD_L4T_EOFT_SCTP; break; default: txs->itxs_hck_badl4.value.ui64++; return (-1); } ! tctx->itc_data_offsets |= (meo->meoi_l4hlen >> 2) << I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT; } + if (lsoflags & HW_LSO) { + /* + * LSO requires that checksum offloads are enabled. If for + * some reason they're not we bail out with an error. + */ + if ((chkflags & HCK_IPV4_HDRCKSUM) == 0 || + (chkflags & HCK_PARTIALCKSUM) == 0) { + txs->itxs_lso_nohck.value.ui64++; + return (-1); + } + + tctx->itc_ctx_cmdflags |= I40E_TX_CTX_DESC_TSO; + tctx->itc_ctx_mss = mss; + tctx->itc_ctx_tsolen = msgsize(mp) - + (meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen); + } + return (0); } static void i40e_tcb_free(i40e_trqpair_t *itrq, i40e_tx_control_block_t *tcb)
*** 1923,1944 **** --- 2020,2056 ---- switch (tcb->tcb_type) { case I40E_TX_COPY: tcb->tcb_dma.dmab_len = 0; break; case I40E_TX_DMA: + if (tcb->tcb_used_lso == B_TRUE && tcb->tcb_bind_ncookies > 0) + (void) ddi_dma_unbind_handle(tcb->tcb_lso_dma_handle); + else if (tcb->tcb_bind_ncookies > 0) (void) ddi_dma_unbind_handle(tcb->tcb_dma_handle); + if (tcb->tcb_bind_info != NULL) { + kmem_free(tcb->tcb_bind_info, + tcb->tcb_bind_ncookies * + sizeof (struct i40e_dma_bind_info)); + } + tcb->tcb_bind_info = NULL; + tcb->tcb_bind_ncookies = 0; + tcb->tcb_used_lso = B_FALSE; break; + case I40E_TX_DESC: + break; case I40E_TX_NONE: /* Cast to pacify lint */ panic("trying to free tcb %p with bad type none", (void *)tcb); default: panic("unknown i40e tcb type: %d", tcb->tcb_type); } tcb->tcb_type = I40E_TX_NONE; + if (tcb->tcb_mp != NULL) { freemsg(tcb->tcb_mp); tcb->tcb_mp = NULL; + } tcb->tcb_next = NULL; } /* * This is called as part of shutting down to clean up all outstanding
*** 1967,1980 **** index = itrq->itrq_desc_head; while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) { i40e_tx_control_block_t *tcb; tcb = itrq->itrq_tcb_work_list[index]; ! VERIFY(tcb != NULL); itrq->itrq_tcb_work_list[index] = NULL; i40e_tcb_reset(tcb); i40e_tcb_free(itrq, tcb); bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t)); index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size); itrq->itrq_desc_free++; } --- 2079,2093 ---- index = itrq->itrq_desc_head; while (itrq->itrq_desc_free < itrq->itrq_tx_ring_size) { i40e_tx_control_block_t *tcb; tcb = itrq->itrq_tcb_work_list[index]; ! if (tcb != NULL) { itrq->itrq_tcb_work_list[index] = NULL; i40e_tcb_reset(tcb); i40e_tcb_free(itrq, tcb); + } bzero(&itrq->itrq_desc_ring[index], sizeof (i40e_tx_desc_t)); index = i40e_next_desc(index, 1, itrq->itrq_tx_ring_size); itrq->itrq_desc_free++; }
*** 1993,2002 **** --- 2106,2116 ---- i40e_tx_recycle_ring(i40e_trqpair_t *itrq) { uint32_t wbhead, toclean, count; i40e_tx_control_block_t *tcbhead; i40e_t *i40e = itrq->itrq_i40e; + uint_t desc_per_tcb, i; mutex_enter(&itrq->itrq_tx_lock); ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size); if (itrq->itrq_desc_free == itrq->itrq_tx_ring_size) {
*** 2040,2055 **** ASSERT(tcb != NULL); tcb->tcb_next = tcbhead; tcbhead = tcb; /* * We zero this out for sanity purposes. */ ! bzero(&itrq->itrq_desc_ring[toclean], sizeof (i40e_tx_desc_t)); ! toclean = i40e_next_desc(toclean, 1, itrq->itrq_tx_ring_size); count++; } itrq->itrq_desc_head = wbhead; itrq->itrq_desc_free += count; itrq->itrq_txstat.itxs_recycled.value.ui64 += count; ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size); --- 2154,2185 ---- ASSERT(tcb != NULL); tcb->tcb_next = tcbhead; tcbhead = tcb; /* + * In the DMA bind case, there may not necessarily be a 1:1 + * mapping between tcb's and descriptors. If the tcb type + * indicates a DMA binding then check the number of DMA + * cookies to determine how many entries to clean in the + * descriptor ring. + */ + if (tcb->tcb_type == I40E_TX_DMA) + desc_per_tcb = tcb->tcb_bind_ncookies; + else + desc_per_tcb = 1; + + for (i = 0; i < desc_per_tcb; i++) { + /* * We zero this out for sanity purposes. */ ! bzero(&itrq->itrq_desc_ring[toclean], ! sizeof (i40e_tx_desc_t)); ! toclean = i40e_next_desc(toclean, 1, ! itrq->itrq_tx_ring_size); count++; } + } itrq->itrq_desc_head = wbhead; itrq->itrq_desc_free += count; itrq->itrq_txstat.itxs_recycled.value.ui64 += count; ASSERT(itrq->itrq_desc_free <= itrq->itrq_tx_ring_size);
*** 2076,2089 **** } DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count); } /* * We've been asked to send a message block on the wire. We'll only have a * single chain. There will not be any b_next pointers; however, there may be ! * multiple b_cont blocks. * * We may do one of three things with any given mblk_t chain: * * 1) Drop it * 2) Transmit it --- 2206,2793 ---- } DTRACE_PROBE2(i40e__recycle, i40e_trqpair_t *, itrq, uint32_t, count); } + static void + i40e_tx_copy_fragment(i40e_tx_control_block_t *tcb, const mblk_t *mp, + const size_t off, const size_t len) + { + const void *soff = mp->b_rptr + off; + void *doff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len; + + ASSERT3U(len, >, 0); + ASSERT3P(soff, >=, mp->b_rptr); + ASSERT3P(soff, <=, mp->b_wptr); + ASSERT3U(len, <=, MBLKL(mp)); + ASSERT3U((uintptr_t)soff + len, <=, (uintptr_t)mp->b_wptr); + ASSERT3U(tcb->tcb_dma.dmab_size - tcb->tcb_dma.dmab_len, >=, len); + bcopy(soff, doff, len); + tcb->tcb_type = I40E_TX_COPY; + tcb->tcb_dma.dmab_len += len; + I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV); + } + + static i40e_tx_control_block_t * + i40e_tx_bind_fragment(i40e_trqpair_t *itrq, const mblk_t *mp, + size_t off, boolean_t use_lso) + { + ddi_dma_handle_t dma_handle; + ddi_dma_cookie_t dma_cookie; + uint_t i = 0, ncookies = 0, dmaflags; + i40e_tx_control_block_t *tcb; + i40e_txq_stat_t *txs = &itrq->itrq_txstat; + + if ((tcb = i40e_tcb_alloc(itrq)) == NULL) { + txs->itxs_err_notcb.value.ui64++; + return (NULL); + } + tcb->tcb_type = I40E_TX_DMA; + + if (use_lso == B_TRUE) + dma_handle = tcb->tcb_lso_dma_handle; + else + dma_handle = tcb->tcb_dma_handle; + + dmaflags = DDI_DMA_WRITE | DDI_DMA_STREAMING; + if (ddi_dma_addr_bind_handle(dma_handle, NULL, + (caddr_t)(mp->b_rptr + off), MBLKL(mp) - off, dmaflags, + DDI_DMA_DONTWAIT, NULL, &dma_cookie, &ncookies) != DDI_DMA_MAPPED) { + txs->itxs_bind_fails.value.ui64++; + goto bffail; + } + + tcb->tcb_bind_ncookies = ncookies; + tcb->tcb_used_lso = use_lso; + + tcb->tcb_bind_info = + kmem_zalloc(ncookies * sizeof (struct i40e_dma_bind_info), + KM_NOSLEEP); + if (tcb->tcb_bind_info == NULL) + goto bffail; + + while (i < ncookies) { + if (i > 0) + ddi_dma_nextcookie(dma_handle, &dma_cookie); + + tcb->tcb_bind_info[i].dbi_paddr = + (caddr_t)dma_cookie.dmac_laddress; + tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size; + } + + return (tcb); + + bffail: + i40e_tcb_reset(tcb); + i40e_tcb_free(itrq, tcb); + return (NULL); + } + + static void + i40e_tx_set_data_desc(i40e_trqpair_t *itrq, i40e_tx_context_t *tctx, + caddr_t buff, size_t len, boolean_t last_desc) + { + i40e_tx_desc_t *txdesc; + int cmd; + + ASSERT(MUTEX_HELD(&itrq->itrq_tx_lock)); + itrq->itrq_desc_free--; + txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail]; + itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1, + itrq->itrq_tx_ring_size); + + cmd = I40E_TX_DESC_CMD_ICRC | tctx->itc_data_cmdflags; + + /* + * The last data descriptor needs the EOP bit set, so that the HW knows + * that we're ready to send. Additionally, we set the RS (Report + * Status) bit, so that we are notified when the transmit engine has + * completed DMA'ing all of the data descriptors and data buffers + * associated with this frame. + */ + if (last_desc == B_TRUE) { + cmd |= I40E_TX_DESC_CMD_EOP; + cmd |= I40E_TX_DESC_CMD_RS; + } + + /* + * Per the X710 manual, section 8.4.2.1.1, the buffer size + * must be a value from 1 to 16K minus 1, inclusive. 
+ */ + ASSERT3U(len, >=, 1); + ASSERT3U(len, <=, I40E_MAX_TX_BUFSZ - 1); + + txdesc->buffer_addr = CPU_TO_LE64((uintptr_t)buff); + txdesc->cmd_type_offset_bsz = + LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA | + ((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) | + ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) | + ((uint64_t)len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT))); + } + /* + * Place 'tcb' on the tail of the list represented by 'head'/'tail'. + */ + static inline void + tcb_list_append(i40e_tx_control_block_t **head, i40e_tx_control_block_t **tail, + i40e_tx_control_block_t *tcb) + { + if (*head == NULL) { + *head = tcb; + *tail = *head; + } else { + ASSERT3P(*tail, !=, NULL); + ASSERT3P((*tail)->tcb_next, ==, NULL); + (*tail)->tcb_next = tcb; + *tail = tcb; + } + } + + /* + * This function takes a single packet, possibly consisting of + * multiple mblks, and creates a TCB chain to send to the controller. + * This TCB chain may span up to a maximum of 8 descriptors. A copy + * TCB consumes one descriptor; whereas a DMA TCB may consume 1 or + * more, depending on several factors. For each fragment (invidual + * mblk making up the packet), we determine if its size dictates a + * copy to the TCB buffer or a DMA bind of the dblk buffer. We keep a + * count of descriptors used; when that count reaches the max we force + * all remaining fragments into a single TCB buffer. We have a + * guarantee that the TCB buffer is always larger than the MTU -- so + * there is always enough room. Consecutive fragments below the DMA + * threshold are copied into a single TCB. In the event of an error + * this function returns NULL but leaves 'mp' alone. + */ + static i40e_tx_control_block_t * + i40e_non_lso_chain(i40e_trqpair_t *itrq, mblk_t *mp, uint_t *ndesc) + { + const mblk_t *nmp = mp; + uint_t needed_desc = 0; + boolean_t force_copy = B_FALSE; + i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL; + i40e_t *i40e = itrq->itrq_i40e; + i40e_txq_stat_t *txs = &itrq->itrq_txstat; + + /* TCB buffer is always larger than MTU. */ + ASSERT3U(msgsize(mp), <, i40e->i40e_tx_buf_size); + + while (nmp != NULL) { + const size_t nmp_len = MBLKL(nmp); + + /* Ignore zero-length mblks. */ + if (nmp_len == 0) { + nmp = nmp->b_cont; + continue; + } + + if (nmp_len < i40e->i40e_tx_dma_min || force_copy) { + /* Compress consecutive copies into one TCB. */ + if (tcb != NULL && tcb->tcb_type == I40E_TX_COPY) { + i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len); + nmp = nmp->b_cont; + continue; + } + + if ((tcb = i40e_tcb_alloc(itrq)) == NULL) { + txs->itxs_err_notcb.value.ui64++; + goto fail; + } + + /* + * TCB DMA buffer is guaranteed to be one + * cookie by i40e_alloc_dma_buffer(). + */ + i40e_tx_copy_fragment(tcb, nmp, 0, nmp_len); + needed_desc++; + tcb_list_append(&tcbhead, &tcbtail, tcb); + } else { + uint_t total_desc; + + tcb = i40e_tx_bind_fragment(itrq, nmp, 0, B_FALSE); + if (tcb == NULL) { + i40e_error(i40e, "dma bind failed!"); + goto fail; + } + + /* + * If the new total exceeds the max or we've + * reached the limit and there's data left, + * then give up binding and copy the rest into + * the pre-allocated TCB buffer. 
+ */ + total_desc = needed_desc + tcb->tcb_bind_ncookies; + if ((total_desc > I40E_TX_MAX_COOKIE) || + (total_desc == I40E_TX_MAX_COOKIE && + nmp->b_cont != NULL)) { + i40e_tcb_reset(tcb); + i40e_tcb_free(itrq, tcb); + + if (tcbtail != NULL && + tcbtail->tcb_type == I40E_TX_COPY) { + tcb = tcbtail; + } else { + tcb = NULL; + } + + force_copy = B_TRUE; + txs->itxs_force_copy.value.ui64++; + continue; + } + + needed_desc += tcb->tcb_bind_ncookies; + tcb_list_append(&tcbhead, &tcbtail, tcb); + } + + nmp = nmp->b_cont; + } + + ASSERT3P(nmp, ==, NULL); + ASSERT3U(needed_desc, <=, I40E_TX_MAX_COOKIE); + ASSERT3P(tcbhead, !=, NULL); + *ndesc += needed_desc; + return (tcbhead); + + fail: + tcb = tcbhead; + while (tcb != NULL) { + i40e_tx_control_block_t *next = tcb->tcb_next; + + ASSERT(tcb->tcb_type == I40E_TX_DMA || + tcb->tcb_type == I40E_TX_COPY); + + tcb->tcb_mp = NULL; + i40e_tcb_reset(tcb); + i40e_tcb_free(itrq, tcb); + tcb = next; + } + + return (NULL); + } + + /* + * Section 8.4.1 of the 700-series programming guide states that a + * segment may span up to 8 data descriptors; including both header + * and payload data. However, empirical evidence shows that the + * controller freezes the Tx queue when presented with a segment of 8 + * descriptors. Or, at least, when the first segment contains 8 + * descriptors. One explanation is that the controller counts the + * context descriptor against the first segment, even though the + * programming guide makes no mention of such a constraint. In any + * case, we limit TSO segments to 7 descriptors to prevent Tx queue + * freezes. We still allow non-TSO segments to utilize all 8 + * descriptors as they have not demonstrated the faulty behavior. + */ + uint_t i40e_lso_num_descs = 7; + + #define I40E_TCB_LEFT(tcb) \ + ((tcb)->tcb_dma.dmab_size - (tcb)->tcb_dma.dmab_len) + + /* + * This function is similar in spirit to i40e_non_lso_chain(), but + * much more complicated in reality. Like the previous function, it + * takes a packet (an LSO packet) as input and returns a chain of + * TCBs. The complication comes with the fact that we are no longer + * trying to fit the entire packet into 8 descriptors, but rather we + * must fit each MSS-size segment of the LSO packet into 8 descriptors. + * Except it's really 7 descriptors, see i40e_lso_num_descs. + * + * Your first inclination might be to verify that a given segment + * spans no more than 7 mblks; but it's actually much more subtle than + * that. First, let's describe what the hardware expects, and then we + * can expound on the software side of things. + * + * For an LSO packet the hardware expects the following: + * + * o Each MSS-sized segment must span no more than 7 descriptors. + * + * o The header size does not count towards the segment size. + * + * o If header and payload share the first descriptor, then the + * controller will count the descriptor twice. + * + * The most important thing to keep in mind is that the hardware does + * not view the segments in terms of mblks, like we do. The hardware + * only sees descriptors. It will iterate each descriptor in turn, + * keeping a tally of bytes seen and descriptors visited. If the byte + * count hasn't reached MSS by the time the descriptor count reaches + * 7, then the controller freezes the queue and we are stuck. + * Furthermore, the hardware picks up its tally where it left off. So + * if it reached MSS in the middle of a descriptor, it will start + * tallying the next segment in the middle of that descriptor. 
The + * hardware's view is entirely removed from the mblk chain or even the + * descriptor layout. Consider these facts: + * + * o The MSS will vary depending on MTU and other factors. + * + * o The dblk allocation will sit at various offsets within a + * memory page. + * + * o The page size itself could vary in the future (i.e. not + * always 4K). + * + * o Just because a dblk is virtually contiguous doesn't mean + * it's physically contiguous. The number of cookies + * (descriptors) required by a DMA bind of a single dblk is at + * the mercy of the page size and physical layout. + * + * o The descriptors will most often NOT start/end on an MSS + * boundary. Thus the hardware will often start counting the + * MSS mid descriptor and finish mid descriptor. + * + * The upshot of all this is that the driver must learn to think like + * the controller and verify that none of the constraints are broken. + * It does this by tallying up the segment just like the hardware + * would. This is handled by the two variables 'segsz' and 'segdesc'. + * After each attempt to bind a dblk, we check the constraints. If + * violated, we undo the DMA and force a copy until MSS is met. We + * have a guarantee that the TCB buffer is larger than MTU, thus + * ensuring we can always meet the MSS with a single copy buffer. We + * also copy consecutive non-DMA fragments into the same TCB buffer. + */ + static i40e_tx_control_block_t * + i40e_lso_chain(i40e_trqpair_t *itrq, const mblk_t *mp, + const mac_ether_offload_info_t *meo, const i40e_tx_context_t *tctx, + uint_t *ndesc) + { + size_t mp_len = MBLKL(mp); + /* + * The cpoff (copy offset) variable tracks the offset inside + * the current mp. There are cases where the entire mp is not + * fully copied in one go, such as the header copy followed by + * a non-DMA mblk, or a TCB buffer that only has enough space + * to copy part of the current mp. + */ + size_t cpoff = 0; + /* + * The segsz and segdesc variables track the controller's view + * of the segment. The needed_desc variable tracks the total + * number of data descriptors used by the driver. + */ + size_t segsz = 0; + uint_t segdesc = 0; + uint_t needed_desc = 0; + const size_t hdrlen = + meo->meoi_l2hlen + meo->meoi_l3hlen + meo->meoi_l4hlen; + const size_t mss = tctx->itc_ctx_mss; + boolean_t force_copy = B_FALSE; + i40e_tx_control_block_t *tcb = NULL, *tcbhead = NULL, *tcbtail = NULL; + i40e_t *i40e = itrq->itrq_i40e; + i40e_txq_stat_t *txs = &itrq->itrq_txstat; + + /* + * We always copy the header in order to avoid more + * complicated code dealing with various edge cases. + */ + ASSERT3U(MBLKL(mp), >=, hdrlen); + if ((tcb = i40e_tcb_alloc(itrq)) == NULL) { + txs->itxs_err_notcb.value.ui64++; + goto fail; + } + needed_desc++; + + tcb_list_append(&tcbhead, &tcbtail, tcb); + i40e_tx_copy_fragment(tcb, mp, 0, hdrlen); + cpoff += hdrlen; + + /* + * A single descriptor containing both header and data is + * counted twice by the controller. + */ + if ((mp_len > hdrlen && mp_len < i40e->i40e_tx_dma_min) || + (mp->b_cont != NULL && + MBLKL(mp->b_cont) < i40e->i40e_tx_dma_min)) { + segdesc = 2; + } else { + segdesc = 1; + } + + /* If this fragment was pure header, then move to the next one. */ + if (cpoff == mp_len) { + mp = mp->b_cont; + cpoff = 0; + } + + while (mp != NULL) { + mp_len = MBLKL(mp); + force_copy: + /* Ignore zero-length mblks.
*/ + if (mp_len == 0) { + mp = mp->b_cont; + cpoff = 0; + continue; + } + + /* + * We copy into the preallocated TCB buffer when the + * current fragment is less than the DMA threshold OR + * when the DMA bind can't meet the controller's + * segment descriptor limit. + */ + if (mp_len < i40e->i40e_tx_dma_min || force_copy) { + size_t tocopy; + + /* + * Our objective here is to compress + * consecutive copies into one TCB (until it + * is full). If there is no current TCB, or if + * it is a DMA TCB, then allocate a new one. + */ + if (tcb == NULL || + (tcb != NULL && tcb->tcb_type != I40E_TX_COPY)) { + if ((tcb = i40e_tcb_alloc(itrq)) == NULL) { + txs->itxs_err_notcb.value.ui64++; + goto fail; + } + + /* + * The TCB DMA buffer is guaranteed to + * be one cookie by i40e_alloc_dma_buffer(). + */ + needed_desc++; + segdesc++; + ASSERT3U(segdesc, <=, i40e_lso_num_descs); + tcb_list_append(&tcbhead, &tcbtail, tcb); + } + + tocopy = MIN(I40E_TCB_LEFT(tcb), mp_len - cpoff); + i40e_tx_copy_fragment(tcb, mp, cpoff, tocopy); + cpoff += tocopy; + segsz += tocopy; + + /* We have consumed the current mp. */ + if (cpoff == mp_len) { + mp = mp->b_cont; + cpoff = 0; + } + + /* We have consumed the current TCB buffer. */ + if (I40E_TCB_LEFT(tcb) == 0) { + tcb = NULL; + } + + /* + * We have met MSS with this copy; restart the + * counters. + */ + if (segsz >= mss) { + segsz = segsz % mss; + segdesc = segsz == 0 ? 0 : 1; + force_copy = B_FALSE; + } + + /* + * We are at the controller's descriptor + * limit; we must copy into the current TCB + * until MSS is reached. The TCB buffer is + * always bigger than the MTU so we know it is + * big enough to meet the MSS. + */ + if (segdesc == i40e_lso_num_descs) { + force_copy = B_TRUE; + } + } else { + uint_t tsegdesc = segdesc; + size_t tsegsz = segsz; + + ASSERT(force_copy == B_FALSE); + ASSERT3U(tsegdesc, <, i40e_lso_num_descs); + + tcb = i40e_tx_bind_fragment(itrq, mp, cpoff, B_TRUE); + if (tcb == NULL) { + i40e_error(i40e, "dma bind failed!"); + goto fail; + } + + for (uint_t i = 0; i < tcb->tcb_bind_ncookies; i++) { + struct i40e_dma_bind_info dbi = + tcb->tcb_bind_info[i]; + + tsegsz += dbi.dbi_len; + tsegdesc++; + ASSERT3U(tsegdesc, <=, i40e_lso_num_descs); + + /* + * We've met the MSS with this portion + * of the DMA. + */ + if (tsegsz >= mss) { + tsegdesc = 1; + tsegsz = tsegsz % mss; + } + + /* + * We've reached max descriptors but + * have not met the MSS. Undo the bind + * and instead copy. + */ + if (tsegdesc == i40e_lso_num_descs) { + i40e_tcb_reset(tcb); + i40e_tcb_free(itrq, tcb); + + if (tcbtail != NULL && + I40E_TCB_LEFT(tcbtail) > 0 && + tcbtail->tcb_type == I40E_TX_COPY) { + tcb = tcbtail; + } else { + tcb = NULL; + } + + /* + * Remember, we are still on + * the same mp. + */ + force_copy = B_TRUE; + txs->itxs_tso_force_copy.value.ui64++; + goto force_copy; + } + } + + ASSERT3U(tsegdesc, <=, i40e_lso_num_descs); + ASSERT3U(tsegsz, <, mss); + + /* + * We've made it through the loop without + * breaking the segment descriptor contract + * with the controller -- replace the segment + * tracking values with the temporary ones.
+ */ + segdesc = tsegdesc; + segsz = tsegsz; + needed_desc += tcb->tcb_bind_ncookies; + cpoff = 0; + tcb_list_append(&tcbhead, &tcbtail, tcb); + mp = mp->b_cont; + } + } + + ASSERT3P(mp, ==, NULL); + ASSERT3P(tcbhead, !=, NULL); + *ndesc += needed_desc; + return (tcbhead); + + fail: + tcb = tcbhead; + while (tcb != NULL) { + i40e_tx_control_block_t *next = tcb->tcb_next; + + ASSERT(tcb->tcb_type == I40E_TX_DMA || + tcb->tcb_type == I40E_TX_COPY); + + tcb->tcb_mp = NULL; + i40e_tcb_reset(tcb); + i40e_tcb_free(itrq, tcb); + tcb = next; + } + + return (NULL); + } + + /* * We've been asked to send a message block on the wire. We'll only have a * single chain. There will not be any b_next pointers; however, there may be ! * multiple b_cont blocks. The number of b_cont blocks may exceed the ! * controller's Tx descriptor limit. * * We may do one of three things with any given mblk_t chain: * * 1) Drop it * 2) Transmit it
*** 2094,2109 **** * something. */ mblk_t * i40e_ring_tx(void *arg, mblk_t *mp) { ! const mblk_t *nmp; ! size_t mpsize; ! i40e_tx_control_block_t *tcb; ! i40e_tx_desc_t *txdesc; i40e_tx_context_t tctx; ! int cmd, type; i40e_trqpair_t *itrq = arg; i40e_t *i40e = itrq->itrq_i40e; i40e_hw_t *hw = &i40e->i40e_hw_space; i40e_txq_stat_t *txs = &itrq->itrq_txstat; --- 2798,2815 ---- * something. */ mblk_t * i40e_ring_tx(void *arg, mblk_t *mp) { ! size_t msglen; ! i40e_tx_control_block_t *tcb_ctx = NULL, *tcb = NULL, *tcbhead = NULL; ! i40e_tx_context_desc_t *ctxdesc; ! mac_ether_offload_info_t meo; i40e_tx_context_t tctx; ! int type; ! uint_t needed_desc = 0; ! boolean_t do_ctx_desc = B_FALSE, use_lso = B_FALSE; i40e_trqpair_t *itrq = arg; i40e_t *i40e = itrq->itrq_i40e; i40e_hw_t *hw = &i40e->i40e_hw_space; i40e_txq_stat_t *txs = &itrq->itrq_txstat;
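The per-segment tally that i40e_lso_chain() must respect (described in its block comment above) can be expressed independently of mblks and DMA cookies. The sketch below is illustrative only and is not driver code: the function name, its parameters, and the plain array of descriptor lengths are hypothetical, and the header/payload double-count special case is not modeled. It walks the data descriptors of one LSO packet in ring order the way the controller does and reports whether every MSS worth of payload is reached within the per-segment descriptor budget.

#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, driver-independent sketch of the controller's segment
 * tally: 'desc_lens' holds the byte count of each data descriptor for
 * one LSO packet (header bytes excluded), 'mss' is the segment size,
 * and 'max_segdesc' is the per-segment descriptor budget (7 for this
 * driver, per i40e_lso_num_descs).
 */
static bool
lso_tally_fits(const size_t *desc_lens, size_t ndesc, size_t mss,
    unsigned int max_segdesc)
{
	size_t segsz = 0;		/* bytes tallied in the current segment */
	unsigned int segdesc = 0;	/* descriptors tallied in the current segment */

	for (size_t i = 0; i < ndesc; i++) {
		segdesc++;
		segsz += desc_lens[i];

		/* Budget exhausted before reaching MSS: the queue would freeze. */
		if (segsz < mss && segdesc >= max_segdesc)
			return (false);

		/*
		 * MSS reached mid-descriptor: leftover bytes roll into the
		 * next segment and this descriptor is counted again, mirroring
		 * the segsz/segdesc bookkeeping in i40e_lso_chain().
		 */
		if (segsz >= mss) {
			segsz %= mss;
			segdesc = (segsz == 0) ? 0 : 1;
		}
	}
	return (true);
}

For example, seven 100-byte descriptors against a 1448-byte MSS fail this check (the budget is spent at 700 bytes), which is exactly the situation the force_copy path avoids by folding the remaining bytes into the current TCB buffer.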
*** 2117,2235 **** (i40e->i40e_link_state != LINK_STATE_UP)) { freemsg(mp); return (NULL); } /* * Figure out the relevant context about this frame that we might need ! * for enabling checksum, lso, etc. This also fills in information that * we might set around the packet type, etc. */ ! if (i40e_tx_context(i40e, itrq, mp, &tctx) < 0) { freemsg(mp); itrq->itrq_txstat.itxs_err_context.value.ui64++; return (NULL); } /* * For the primordial driver we can punt on doing any recycling right * now; however, longer term we need to probably do some more pro-active ! * recycling to cut back on stalls in the tx path. */ ! /* ! * Do a quick size check to make sure it fits into what we think it ! * should for this device. Note that longer term this will be false, ! * particularly when we have the world of TSO. ! */ ! mpsize = 0; ! for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) { ! mpsize += MBLKL(nmp); ! } /* ! * First we allocate our tx control block and prepare the packet for ! * transmit before we do a final check for descriptors. We do it this ! * way to minimize the time under the tx lock. */ ! tcb = i40e_tcb_alloc(itrq); ! if (tcb == NULL) { txs->itxs_err_notcb.value.ui64++; goto txfail; } ! /* ! * For transmitting a block, we're currently going to use just a ! * single control block and bcopy all of the fragments into it. We ! * should be more intelligent about doing DMA binding or otherwise, but ! * for getting off the ground this will have to do. ! */ ! ASSERT(tcb->tcb_dma.dmab_len == 0); ! ASSERT(tcb->tcb_dma.dmab_size >= mpsize); ! for (nmp = mp; nmp != NULL; nmp = nmp->b_cont) { ! size_t clen = MBLKL(nmp); ! void *coff = tcb->tcb_dma.dmab_address + tcb->tcb_dma.dmab_len; ! ! bcopy(nmp->b_rptr, coff, clen); ! tcb->tcb_dma.dmab_len += clen; } - ASSERT(tcb->tcb_dma.dmab_len == mpsize); /* ! * While there's really no need to keep the mp here, but let's just do ! * it to help with our own debugging for now. */ - tcb->tcb_mp = mp; - tcb->tcb_type = I40E_TX_COPY; - I40E_DMA_SYNC(&tcb->tcb_dma, DDI_DMA_SYNC_FORDEV); - mutex_enter(&itrq->itrq_tx_lock); ! if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh) { txs->itxs_err_nodescs.value.ui64++; mutex_exit(&itrq->itrq_tx_lock); goto txfail; } /* ! * Build up the descriptor and send it out. Thankfully at the moment ! * we only need a single desc, because we're not doing anything fancy ! * yet. */ ! ASSERT(itrq->itrq_desc_free > 0); itrq->itrq_desc_free--; ! txdesc = &itrq->itrq_desc_ring[itrq->itrq_desc_tail]; ! itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb; ! itrq->itrq_desc_tail = i40e_next_desc(itrq->itrq_desc_tail, 1, itrq->itrq_tx_ring_size); ! /* ! * Note, we always set EOP and RS which indicates that this is the last ! * data frame and that we should ask for it to be transmitted. We also ! * must always set ICRC, because that is an internal bit that must be ! * set to one for data descriptors. The remaining bits in the command ! * descriptor depend on checksumming and are determined based on the ! * information set up in i40e_tx_context(). ! */ ! type = I40E_TX_DESC_DTYPE_DATA; ! cmd = I40E_TX_DESC_CMD_EOP | ! I40E_TX_DESC_CMD_RS | ! I40E_TX_DESC_CMD_ICRC | ! tctx.itc_cmdflags; ! txdesc->buffer_addr = ! CPU_TO_LE64((uintptr_t)tcb->tcb_dma.dmab_dma_address); ! txdesc->cmd_type_offset_bsz = CPU_TO_LE64(((uint64_t)type | ! ((uint64_t)tctx.itc_offsets << I40E_TXD_QW1_OFFSET_SHIFT) | ! ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) | ! 
((uint64_t)tcb->tcb_dma.dmab_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT))); /* * Now, finally, sync the DMA data and alert hardware. */ I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV); I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index), itrq->itrq_desc_tail); if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) != DDI_FM_OK) { /* * Note, we can't really go through and clean this up very well, * because the memory has been given to the device, so just --- 2823,2972 ---- (i40e->i40e_link_state != LINK_STATE_UP)) { freemsg(mp); return (NULL); } + if (mac_ether_offload_info(mp, &meo) != 0) { + freemsg(mp); + itrq->itrq_txstat.itxs_hck_meoifail.value.ui64++; + return (NULL); + } + /* * Figure out the relevant context about this frame that we might need ! * for enabling checksum, LSO, etc. This also fills in information that * we might set around the packet type, etc. */ ! if (i40e_tx_context(i40e, itrq, mp, &meo, &tctx) < 0) { freemsg(mp); itrq->itrq_txstat.itxs_err_context.value.ui64++; return (NULL); } + if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) { + use_lso = B_TRUE; + do_ctx_desc = B_TRUE; + } /* * For the primordial driver we can punt on doing any recycling right * now; however, longer term we need to probably do some more pro-active ! * recycling to cut back on stalls in the TX path. */ ! msglen = msgsize(mp); + if (do_ctx_desc) { /* ! * If we're doing tunneling or LSO, then we'll need a TX ! * context descriptor in addition to one or more TX data ! * descriptors. Since there's no data DMA block or handle ! * associated with the context descriptor, we create a special ! * control block that behaves effectively like a NOP. */ ! if ((tcb_ctx = i40e_tcb_alloc(itrq)) == NULL) { txs->itxs_err_notcb.value.ui64++; goto txfail; } + tcb_ctx->tcb_type = I40E_TX_DESC; + needed_desc++; + } ! if (!use_lso) { ! tcbhead = i40e_non_lso_chain(itrq, mp, &needed_desc); ! } else { ! tcbhead = i40e_lso_chain(itrq, mp, &meo, &tctx, &needed_desc); } + if (tcbhead == NULL) + goto txfail; + + tcbhead->tcb_mp = mp; + /* ! * The second condition ensures that 'itrq_desc_tail' never ! * equals 'itrq_desc_head'. This enforces the rule found in ! * the second bullet point of section 8.4.3.1.5 of the XL710 ! * PG, which declares the TAIL pointer in I40E_QTX_TAIL should ! * never overlap with the head. This means that we only ever ! * have 'itrq_tx_ring_size - 1' total available descriptors. */ mutex_enter(&itrq->itrq_tx_lock); ! if (itrq->itrq_desc_free < i40e->i40e_tx_block_thresh || ! (itrq->itrq_desc_free - 1) < needed_desc) { txs->itxs_err_nodescs.value.ui64++; mutex_exit(&itrq->itrq_tx_lock); goto txfail; } + if (do_ctx_desc) { /* ! * If we're enabling any offloads for this frame, then we'll ! * need to build up a transmit context descriptor, first. The ! * context descriptor needs to be placed in the TX ring before ! * the data descriptor(s). See section 8.4.2, table 8-16 */ ! uint_t tail = itrq->itrq_desc_tail; itrq->itrq_desc_free--; ! ctxdesc = (i40e_tx_context_desc_t *)&itrq->itrq_desc_ring[tail]; ! itrq->itrq_tcb_work_list[tail] = tcb_ctx; ! itrq->itrq_desc_tail = i40e_next_desc(tail, 1, itrq->itrq_tx_ring_size); ! /* QW0 */ ! type = I40E_TX_DESC_DTYPE_CONTEXT; ! ctxdesc->tunneling_params = 0; ! 
ctxdesc->l2tag2 = 0; + /* QW1 */ + ctxdesc->type_cmd_tso_mss = CPU_TO_LE64((uint64_t)type); + if (tctx.itc_ctx_cmdflags & I40E_TX_CTX_DESC_TSO) { + ctxdesc->type_cmd_tso_mss |= CPU_TO_LE64((uint64_t) + ((uint64_t)tctx.itc_ctx_cmdflags << + I40E_TXD_CTX_QW1_CMD_SHIFT) | + ((uint64_t)tctx.itc_ctx_tsolen << + I40E_TXD_CTX_QW1_TSO_LEN_SHIFT) | + ((uint64_t)tctx.itc_ctx_mss << + I40E_TXD_CTX_QW1_MSS_SHIFT)); + } + } + + tcb = tcbhead; + while (tcb != NULL) { + + itrq->itrq_tcb_work_list[itrq->itrq_desc_tail] = tcb; + if (tcb->tcb_type == I40E_TX_COPY) { + boolean_t last_desc = (tcb->tcb_next == NULL); + + i40e_tx_set_data_desc(itrq, &tctx, + (caddr_t)tcb->tcb_dma.dmab_dma_address, + tcb->tcb_dma.dmab_len, last_desc); + } else { + boolean_t last_desc = B_FALSE; + ASSERT3S(tcb->tcb_type, ==, I40E_TX_DMA); + + for (uint_t c = 0; c < tcb->tcb_bind_ncookies; c++) { + last_desc = (c == tcb->tcb_bind_ncookies - 1) && + (tcb->tcb_next == NULL); + + i40e_tx_set_data_desc(itrq, &tctx, + tcb->tcb_bind_info[c].dbi_paddr, + tcb->tcb_bind_info[c].dbi_len, + last_desc); + } + } + + tcb = tcb->tcb_next; + } + /* * Now, finally, sync the DMA data and alert hardware. */ I40E_DMA_SYNC(&itrq->itrq_desc_area, DDI_DMA_SYNC_FORDEV); I40E_WRITE_REG(hw, I40E_QTX_TAIL(itrq->itrq_index), itrq->itrq_desc_tail); + if (i40e_check_acc_handle(i40e->i40e_osdep_space.ios_reg_handle) != DDI_FM_OK) { /* * Note, we can't really go through and clean this up very well, * because the memory has been given to the device, so just
*** 2237,2249 **** */ ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED); atomic_or_32(&i40e->i40e_state, I40E_ERROR); } ! txs->itxs_bytes.value.ui64 += mpsize; txs->itxs_packets.value.ui64++; ! txs->itxs_descriptors.value.ui64++; mutex_exit(&itrq->itrq_tx_lock); return (NULL); --- 2974,2986 ---- */ ddi_fm_service_impact(i40e->i40e_dip, DDI_SERVICE_DEGRADED); atomic_or_32(&i40e->i40e_state, I40E_ERROR); } ! txs->itxs_bytes.value.ui64 += msglen; txs->itxs_packets.value.ui64++; ! txs->itxs_descriptors.value.ui64 += needed_desc; mutex_exit(&itrq->itrq_tx_lock); return (NULL);
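The descriptor-availability test in i40e_ring_tx() subtracts one from itrq_desc_free because of the head/tail rule cited above (XL710 PG section 8.4.3.1.5): TAIL may never be advanced onto HEAD, so only ring_size - 1 descriptors are ever usable. A minimal sketch of that test in isolation follows; the function and parameter names are hypothetical, not driver symbols.

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the check in i40e_ring_tx(): refuse the
 * send if free descriptors are below the blocking threshold, or if
 * accepting it would leave no slot in reserve between TAIL and HEAD.
 */
static bool
tx_ring_has_room(uint32_t desc_free, uint32_t needed_desc,
    uint32_t block_thresh)
{
	if (desc_free < block_thresh)
		return (false);

	/* Keep one descriptor in reserve so TAIL never catches HEAD. */
	return ((desc_free - 1) >= needed_desc);
}

For example, a frame needing one context descriptor and eight data descriptors (needed_desc == 9) is only accepted once at least ten descriptors are free.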
*** 2252,2265 **** * We ran out of resources. Return it to MAC and indicate that we'll * need to signal MAC. If there are allocated TCBs, return them now. * Make sure to reset their message blocks, since we'll return them * to MAC. */ ! if (tcb != NULL) { tcb->tcb_mp = NULL; i40e_tcb_reset(tcb); i40e_tcb_free(itrq, tcb); } mutex_enter(&itrq->itrq_tx_lock); itrq->itrq_tx_blocked = B_TRUE; mutex_exit(&itrq->itrq_tx_lock); --- 2989,3015 ---- * We ran out of resources. Return it to MAC and indicate that we'll * need to signal MAC. If there are allocated TCBs, return them now. * Make sure to reset their message blocks, since we'll return them * to MAC. */ ! if (tcb_ctx != NULL) { ! tcb_ctx->tcb_mp = NULL; ! i40e_tcb_reset(tcb_ctx); ! i40e_tcb_free(itrq, tcb_ctx); ! } ! ! tcb = tcbhead; ! while (tcb != NULL) { ! i40e_tx_control_block_t *next = tcb->tcb_next; ! ! ASSERT(tcb->tcb_type == I40E_TX_DMA || ! tcb->tcb_type == I40E_TX_COPY); ! tcb->tcb_mp = NULL; i40e_tcb_reset(tcb); i40e_tcb_free(itrq, tcb); + tcb = next; } mutex_enter(&itrq->itrq_tx_lock); itrq->itrq_tx_blocked = B_TRUE; mutex_exit(&itrq->itrq_tx_lock);
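The unwind loop at txfail is the third copy of the same pattern in this change; i40e_non_lso_chain() and i40e_lso_chain() use it in their fail paths as well. As a sketch only, assuming the driver's existing i40e_tcb_reset() and i40e_tcb_free() from i40e_sw.h, the loop could be shared; i40e_tcb_chain_unwind is a hypothetical name, not a symbol introduced by this change.

/*
 * Hypothetical helper: walk a TCB chain built by i40e_non_lso_chain()
 * or i40e_lso_chain(), reset each control block with i40e_tcb_reset(),
 * and return it to the free list with i40e_tcb_free().
 */
static void
i40e_tcb_chain_unwind(i40e_trqpair_t *itrq, i40e_tx_control_block_t *head)
{
	while (head != NULL) {
		i40e_tx_control_block_t *next = head->tcb_next;

		ASSERT(head->tcb_type == I40E_TX_DMA ||
		    head->tcb_type == I40E_TX_COPY);

		head->tcb_mp = NULL;
		i40e_tcb_reset(head);
		i40e_tcb_free(itrq, head);
		head = next;
	}
}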