9832 Original bug discovered as 9560 has friends IPv4 packets coming in as IPv6 creating chaos
Reviewed by: Robert Mustacchi <rm@joyent.com>
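
The core of this change is in the fanout hunk further down: a packet whose IP version field disagrees with the SAP it arrived on is now sent to the `OTH` soft ring at index 0 instead of being parsed as the wrong protocol. The following is a minimal standalone sketch of that check, not the code under review. The `sketch_` names and the result enum are hypothetical, and the assumption is only that the version is carried in the high nibble of the first header byte, which holds for both IPv4 and IPv6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ETHERTYPE_IP and ETHERTYPE_IPV6. */
#define	SKETCH_ETHERTYPE_IP	0x0800
#define	SKETCH_ETHERTYPE_IPV6	0x86dd

/* Hypothetical result codes; the real code sets *type = OTH and *indx = 0. */
typedef enum {
	SKETCH_OK,		/* version matches the SAP, keep parsing */
	SKETCH_BAD_VERSION,	/* mismatch, treat the packet as "other" */
	SKETCH_NOT_IP		/* SAP is neither IPv4 nor IPv6 */
} sketch_result_t;

/*
 * Compare the version nibble of the first header byte against the
 * version implied by the SAP (the EtherType, for Ethernet).
 */
static sketch_result_t
sketch_check_ip_version(uint16_t sap, const uint8_t *hdr, size_t len)
{
	uint8_t expected;

	assert(hdr != NULL);

	if (sap == SKETCH_ETHERTYPE_IP)
		expected = 4;
	else if (sap == SKETCH_ETHERTYPE_IPV6)
		expected = 6;
	else
		return (SKETCH_NOT_IP);

	/* A header too short to hold the version byte cannot match. */
	if (len < 1 || (hdr[0] >> 4) != expected)
		return (SKETCH_BAD_VERSION);

	return (SKETCH_OK);
}
```

In the diff itself the IPv6 side of this check appears to live inside `mac_ip_hdr_length_v6()`, which the new code treats as returning `EINVAL` for a bad version, while the IPv4 side is checked inline with `IPH_HDR_VERSION(ipha)`.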

          --- old/usr/src/uts/common/io/mac/mac_sched.c
          +++ new/usr/src/uts/common/io/mac/mac_sched.c
[13 lines elided]
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
  23   23   * Use is subject to license terms.
  24      - * Copyright 2017 Joyent, Inc.
  25   24   * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
       25 + * Copyright 2019 Joyent, Inc.
  26   26   */
  27   27  
  28   28  /*
  29   29   * MAC data path
  30   30   *
  31   31   * The MAC data path is concerned with the flow of traffic from mac clients --
  32   32   * DLS, IP, etc. -- to various GLDv3 device drivers -- e1000g, vnic, aggr,
  33   33   * ixgbe, etc. -- and from the GLDv3 device drivers back to clients.
  34   34   *
  35   35   * -----------
  36   36   * Terminology
  37   37   * -----------
  38   38   *
  39   39   * MAC uses a lot of different, but related terms that are associated with the
  40   40   * design and structure of the data path. Before we cover other aspects, first
  41   41   * let's review the terminology that MAC uses.
  42   42   *
  43   43   * MAC
  44   44   *
  45      - *      This driver. It interfaces with device drivers and provides abstractions
  46      - *      that the rest of the system consumes. All data links -- things managed
  47      - *      with dladm(1M), are accessed through MAC.
       45 + *      This driver. It interfaces with device drivers and provides abstractions
       46 + *      that the rest of the system consumes. All data links -- things managed
       47 + *      with dladm(1M), are accessed through MAC.
  48   48   *
  49   49   * GLDv3 DEVICE DRIVER
  50   50   *
  51      - *      A GLDv3 device driver refers to a driver, both for pseudo-devices and
  52      - *      real devices, which implement the GLDv3 driver API. Common examples of
  53      - *      these are igb and ixgbe, which are drivers for various Intel networking
  54      - *      cards. These devices may or may not have various features, such as
  55      - *      hardware rings and checksum offloading. For MAC, a GLDv3 device is the
  56      - *      final point for the transmission of a packet and the starting point for
  57      - *      the receipt of a packet.
       51 + *      A GLDv3 device driver refers to a driver, both for pseudo-devices and
       52 + *      real devices, which implement the GLDv3 driver API. Common examples of
       53 + *      these are igb and ixgbe, which are drivers for various Intel networking
       54 + *      cards. These devices may or may not have various features, such as
       55 + *      hardware rings and checksum offloading. For MAC, a GLDv3 device is the
       56 + *      final point for the transmission of a packet and the starting point for
       57 + *      the receipt of a packet.
  58   58   *
  59   59   * FLOWS
  60   60   *
  61      - *      At a high level, a flow refers to a series of packets that are related.
  62      - *      Often times the term is used in the context of TCP to indicate a unique
  63      - *      TCP connection and the traffic over it. However, a flow can exist at
  64      - *      other levels of the system as well. MAC has a notion of a default flow
  65      - *      which is used for all unicast traffic addressed to the address of a MAC
  66      - *      device. For example, when a VNIC is created, a default flow is created
  67      - *      for the VNIC's MAC address. In addition, flows are created for broadcast
  68      - *      groups and a user may create a flow with flowadm(1M).
       61 + *      At a high level, a flow refers to a series of packets that are related.
       62 + *      Often times the term is used in the context of TCP to indicate a unique
       63 + *      TCP connection and the traffic over it. However, a flow can exist at
       64 + *      other levels of the system as well. MAC has a notion of a default flow
       65 + *      which is used for all unicast traffic addressed to the address of a MAC
       66 + *      device. For example, when a VNIC is created, a default flow is created
       67 + *      for the VNIC's MAC address. In addition, flows are created for broadcast
       68 + *      groups and a user may create a flow with flowadm(1M).
  69   69   *
  70   70   * CLASSIFICATION
  71   71   *
  72      - *      Classification refers to the notion of identifying an incoming frame
  73      - *      based on its destination address and optionally its source addresses and
  74      - *      doing different processing based on that information. Classification can
  75      - *      be done in both hardware and software. In general, we usually only
  76      - *      classify based on the layer two destination, eg. for Ethernet, the
  77      - *      destination MAC address.
       72 + *      Classification refers to the notion of identifying an incoming frame
       73 + *      based on its destination address and optionally its source addresses and
       74 + *      doing different processing based on that information. Classification can
       75 + *      be done in both hardware and software. In general, we usually only
       76 + *      classify based on the layer two destination, eg. for Ethernet, the
       77 + *      destination MAC address.
  78   78   *
  79      - *      The system also will do classification based on layer three and layer
  80      - *      four properties. This is used to support things like flowadm(1M), which
  81      - *      allows setting QoS and other properties on a per-flow basis.
       79 + *      The system also will do classification based on layer three and layer
       80 + *      four properties. This is used to support things like flowadm(1M), which
       81 + *      allows setting QoS and other properties on a per-flow basis.
  82   82   *
  83   83   * RING
  84   84   *
  85      - *      Conceptually, a ring represents a series of framed messages, often in a
  86      - *      contiguous chunk of memory that acts as a circular buffer. Rings come in
  87      - *      a couple of forms. Generally they are either a hardware construct (hw
  88      - *      ring) or they are a software construct (sw ring) maintained by MAC.
       85 + *      Conceptually, a ring represents a series of framed messages, often in a
       86 + *      contiguous chunk of memory that acts as a circular buffer. Rings come in
       87 + *      a couple of forms. Generally they are either a hardware construct (hw
       88 + *      ring) or they are a software construct (sw ring) maintained by MAC.
  89   89   *
  90   90   * HW RING
  91   91   *
  92      - *      A hardware ring is a set of resources provided by a GLDv3 device driver
  93      - *      (even if it is a pseudo-device). A hardware ring comes in two different
  94      - *      forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
  95      - *      something that has a unique DMA (direct memory access) region and
  96      - *      generally supports some form of classification (though it isn't always
  97      - *      used), as well as a means of generating an interrupt specific to that
  98      - *      ring. For example, the device may generate a specific MSI-X for a PCI
  99      - *      express device. A tx ring is similar, except that it is dedicated to
 100      - *      transmission. It may also be a vector for enabling features such as VLAN
 101      - *      tagging and large transmit offloading. It usually has its own dedicated
 102      - *      interrupts for transmit being completed.
       92 + *      A hardware ring is a set of resources provided by a GLDv3 device driver
       93 + *      (even if it is a pseudo-device). A hardware ring comes in two different
       94 + *      forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
       95 + *      something that has a unique DMA (direct memory access) region and
       96 + *      generally supports some form of classification (though it isn't always
       97 + *      used), as well as a means of generating an interrupt specific to that
       98 + *      ring. For example, the device may generate a specific MSI-X for a PCI
       99 + *      express device. A tx ring is similar, except that it is dedicated to
      100 + *      transmission. It may also be a vector for enabling features such as VLAN
      101 + *      tagging and large transmit offloading. It usually has its own dedicated
      102 + *      interrupts for transmit being completed.
 103  103   *
 104  104   * SW RING
 105  105   *
 106      - *      A software ring is a construction of MAC. It represents the same thing
 107      - *      that a hardware ring generally does, a collection of frames. However,
 108      - *      instead of being in a contiguous ring of memory, they're instead linked
 109      - *      by using the mblk_t's b_next pointer. Each frame may itself be multiple
 110      - *      mblk_t's linked together by the b_cont pointer. A software ring always
 111      - *      represents a collection of classified packets; however, it varies as to
 112      - *      whether it uses only layer two information, or a combination of that and
 113      - *      additional layer three and layer four data.
      106 + *      A software ring is a construction of MAC. It represents the same thing
      107 + *      that a hardware ring generally does, a collection of frames. However,
      108 + *      instead of being in a contiguous ring of memory, they're instead linked
      109 + *      by using the mblk_t's b_next pointer. Each frame may itself be multiple
      110 + *      mblk_t's linked together by the b_cont pointer. A software ring always
      111 + *      represents a collection of classified packets; however, it varies as to
      112 + *      whether it uses only layer two information, or a combination of that and
      113 + *      additional layer three and layer four data.
 114  114   *
 115  115   * FANOUT
 116  116   *
 117      - *      Fanout is the idea of spreading out the load of processing frames based
 118      - *      on the source and destination information contained in the layer two,
 119      - *      three, and four headers, such that the data can then be processed in
 120      - *      parallel using multiple hardware threads.
      117 + *      Fanout is the idea of spreading out the load of processing frames based
      118 + *      on the source and destination information contained in the layer two,
      119 + *      three, and four headers, such that the data can then be processed in
      120 + *      parallel using multiple hardware threads.
 121  121   *
 122      - *      A fanout algorithm hashes the headers and uses that to place different
 123      - *      flows into a bucket. The most important thing is that packets that are
 124      - *      in the same flow end up in the same bucket. If they do not, performance
 125      - *      can be adversely affected. Consider the case of TCP.  TCP severely
 126      - *      penalizes a connection if the data arrives out of order. If a given flow
 127      - *      is processed on different CPUs, then the data will appear out of order,
 128      - *      hence the invariant that fanout always hash a given flow to the same
 129      - *      bucket and thus get processed on the same CPU.
      122 + *      A fanout algorithm hashes the headers and uses that to place different
      123 + *      flows into a bucket. The most important thing is that packets that are
      124 + *      in the same flow end up in the same bucket. If they do not, performance
      125 + *      can be adversely affected. Consider the case of TCP.  TCP severely
      126 + *      penalizes a connection if the data arrives out of order. If a given flow
      127 + *      is processed on different CPUs, then the data will appear out of order,
      128 + *      hence the invariant that fanout always hash a given flow to the same
      129 + *      bucket and thus get processed on the same CPU.
 130  130   *
 131  131   * RECEIVE SIDE SCALING (RSS)
 132  132   *
 133  133   *
 134      - *      Receive side scaling is a term that isn't common in illumos, but is used
 135      - *      by vendors and was popularized by Microsoft. It refers to the idea of
 136      - *      spreading the incoming receive load out across multiple interrupts which
 137      - *      can be directed to different CPUs. This allows a device to leverage
 138      - *      hardware rings even when it doesn't support hardware classification. The
 139      - *      hardware uses an algorithm to perform fanout that ensures the flow
 140      - *      invariant is maintained.
      134 + *      Receive side scaling is a term that isn't common in illumos, but is used
      135 + *      by vendors and was popularized by Microsoft. It refers to the idea of
      136 + *      spreading the incoming receive load out across multiple interrupts which
      137 + *      can be directed to different CPUs. This allows a device to leverage
      138 + *      hardware rings even when it doesn't support hardware classification. The
      139 + *      hardware uses an algorithm to perform fanout that ensures the flow
      140 + *      invariant is maintained.
 141  141   *
 142  142   * SOFT RING SET
 143  143   *
 144      - *      A soft ring set, commonly abbreviated SRS, is a collection of rings and
 145      - *      is used for both transmitting and receiving. It is maintained in the
 146      - *      structure mac_soft_ring_set_t. A soft ring set is usually associated
 147      - *      with flows, and coordinates both the use of hardware and software rings.
 148      - *      Because the use of hardware rings can change as devices such as VNICs
 149      - *      come and go, we always ensure that the set has software classification
 150      - *      rules that correspond to the hardware classification rules from rings.
      144 + *      A soft ring set, commonly abbreviated SRS, is a collection of rings and
      145 + *      is used for both transmitting and receiving. It is maintained in the
      146 + *      structure mac_soft_ring_set_t. A soft ring set is usually associated
      147 + *      with flows, and coordinates both the use of hardware and software rings.
      148 + *      Because the use of hardware rings can change as devices such as VNICs
      149 + *      come and go, we always ensure that the set has software classification
      150 + *      rules that correspond to the hardware classification rules from rings.
 151  151   *
 152      - *      Soft ring sets are also used for the enforcement of various QoS
 153      - *      properties. For example, if a bandwidth limit has been placed on a
 154      - *      specific flow or device, then that will be enforced by the soft ring
 155      - *      set.
      152 + *      Soft ring sets are also used for the enforcement of various QoS
      153 + *      properties. For example, if a bandwidth limit has been placed on a
      154 + *      specific flow or device, then that will be enforced by the soft ring
      155 + *      set.
 156  156   *
 157  157   * SERVICE ATTACHMENT POINT (SAP)
 158  158   *
 159      - *      The service attachment point is a DLPI (Data Link Provider Interface)
 160      - *      concept; however, it comes up quite often in MAC. Most MAC devices speak
 161      - *      a protocol that has some notion of different channels or message type
 162      - *      identifiers. For example, Ethernet defines an EtherType which is a part
 163      - *      of the Ethernet header and defines the particular protocol of the data
 164      - *      payload. If the EtherType is set to 0x0800, then it defines that the
 165      - *      contents of that Ethernet frame is IPv4 traffic. For Ethernet, the
 166      - *      EtherType is the SAP.
      159 + *      The service attachment point is a DLPI (Data Link Provider Interface)
      160 + *      concept; however, it comes up quite often in MAC. Most MAC devices speak
      161 + *      a protocol that has some notion of different channels or message type
      162 + *      identifiers. For example, Ethernet defines an EtherType which is a part
      163 + *      of the Ethernet header and defines the particular protocol of the data
      164 + *      payload. If the EtherType is set to 0x0800, then it defines that the
      165 + *      contents of that Ethernet frame is IPv4 traffic. For Ethernet, the
      166 + *      EtherType is the SAP.
 167  167   *
 168      - *      In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
 169      - *      and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
 170      - *      libdlpi(3LIB) user software can attach to arbitrary SAPs. With the
 171      - *      exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
 172      - *      consume the SAP; however, it uses that information as part of hashing
 173      - *      and it may be used as part of the construction of flows.
      168 + *      In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
      169 + *      and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
      170 + *      libdlpi(3LIB) user software can attach to arbitrary SAPs. With the
      171 + *      exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
      172 + *      consume the SAP; however, it uses that information as part of hashing
      173 + *      and it may be used as part of the construction of flows.
 174  174   *
 175  175   * PRIMARY MAC CLIENT
 176  176   *
 177      - *      The primary mac client refers to a mac client whose unicast address
 178      - *      matches the address of the device itself. For example, if the system has
 179      - *      instance of the e1000g driver such as e1000g0, e1000g1, etc., the
 180      - *      primary mac client is the one named after the device itself. VNICs that
 181      - *      are created on top of such devices are not the primary client.
      177 + *      The primary mac client refers to a mac client whose unicast address
      178 + *      matches the address of the device itself. For example, if the system has
      179 + *      instance of the e1000g driver such as e1000g0, e1000g1, etc., the
      180 + *      primary mac client is the one named after the device itself. VNICs that
      181 + *      are created on top of such devices are not the primary client.
 182  182   *
 183  183   * TRANSMIT DESCRIPTORS
 184  184   *
 185      - *      Transmit descriptors are a resource that most GLDv3 device drivers have.
 186      - *      Generally, a GLDv3 device driver takes a frame that's meant to be output
 187      - *      and puts a copy of it into a region of memory. Each region of memory
 188      - *      usually has an associated descriptor that the device uses to manage
 189      - *      properties of the frames. Devices have a limited number of such
 190      - *      descriptors. They get reclaimed once the device finishes putting the
 191      - *      frame on the wire.
      185 + *      Transmit descriptors are a resource that most GLDv3 device drivers have.
      186 + *      Generally, a GLDv3 device driver takes a frame that's meant to be output
      187 + *      and puts a copy of it into a region of memory. Each region of memory
      188 + *      usually has an associated descriptor that the device uses to manage
      189 + *      properties of the frames. Devices have a limited number of such
      190 + *      descriptors. They get reclaimed once the device finishes putting the
      191 + *      frame on the wire.
 192  192   *
 193      - *      If the driver runs out of transmit descriptors, for example, the OS is
 194      - *      generating more frames than it can put on the wire, then it will return
 195      - *      them back to the MAC layer.
      193 + *      If the driver runs out of transmit descriptors, for example, the OS is
      194 + *      generating more frames than it can put on the wire, then it will return
      195 + *      them back to the MAC layer.
 196  196   *
 197  197   * ---------------------------------
 198  198   * Rings, Classification, and Fanout
 199  199   * ---------------------------------
 200  200   *
 201  201   * The heart of MAC is made up of rings, and not those that Elven-kings wear.
 202  202   * When receiving a packet, MAC breaks the work into two different, though
 203  203   * interrelated phases. The first phase is generally classification and then the
 204  204   * second phase is generally fanout. When a frame comes in from a GLDv3 Device,
 205  205   * MAC needs to determine where that frame should be delivered. If it's a
[981 lines elided]
1187 1187                  (mac_srs)->srs_state |= SRS_POLLING;                    \
1188 1188                  (void) mac_hwring_disable_intr((mac_ring_handle_t)      \
1189 1189                      (mac_srs)->srs_ring);                               \
1190 1190                  (mac_srs)->srs_rx.sr_poll_on++;                         \
1191 1191          }                                                               \
1192 1192  }
1193 1193  
1194 1194  #define MAC_SRS_WORKER_POLLING_ON(mac_srs) {                            \
1195 1195          ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock));                       \
1196 1196          if (((mac_srs)->srs_state &                                     \
1197      -            (SRS_POLLING_CAPAB|SRS_WORKER|SRS_POLLING)) ==              \
     1197 +            (SRS_POLLING_CAPAB|SRS_WORKER|SRS_POLLING)) ==              \
1198 1198              (SRS_POLLING_CAPAB|SRS_WORKER)) {                           \
1199 1199                  (mac_srs)->srs_state |= SRS_POLLING;                    \
1200 1200                  (void) mac_hwring_disable_intr((mac_ring_handle_t)      \
1201 1201                      (mac_srs)->srs_ring);                               \
1202 1202                  (mac_srs)->srs_rx.sr_worker_poll_on++;                  \
1203 1203          }                                                               \
1204 1204  }
1205 1205  
1206 1206  /*
1207 1207   * MAC_SRS_POLL_RING
[2 lines elided]
1210 1210   * provided it wasn't already polling (SRS_GET_PKTS was set).
1211 1211   *
1212 1212   * Poll thread gets to run only from mac_rx_srs_drain() and only
1213 1213   * if the drain was being done by the worker thread.
1214 1214   */
1215 1215  #define MAC_SRS_POLL_RING(mac_srs) {                                    \
1216 1216          mac_srs_rx_t    *srs_rx = &(mac_srs)->srs_rx;                   \
1217 1217                                                                          \
1218 1218          ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock));                       \
1219 1219          srs_rx->sr_poll_thr_sig++;                                      \
1220      -        if (((mac_srs)->srs_state &                                     \
     1220 +        if (((mac_srs)->srs_state &                                     \
1221 1221              (SRS_POLLING_CAPAB|SRS_WORKER|SRS_GET_PKTS)) ==             \
1222 1222                  (SRS_WORKER|SRS_POLLING_CAPAB)) {                       \
1223 1223                  (mac_srs)->srs_state |= SRS_GET_PKTS;                   \
1224      -                cv_signal(&(mac_srs)->srs_cv);                          \
     1224 +                cv_signal(&(mac_srs)->srs_cv);                          \
1225 1225          } else {                                                        \
1226 1226                  srs_rx->sr_poll_thr_busy++;                             \
1227 1227          }                                                               \
1228 1228  }
1229 1229  
1230 1230  /*
1231 1231   * MAC_SRS_CHECK_BW_CONTROL
1232 1232   *
1233 1233   * Check to see if next tick has started so we can reset the
1234 1234   * SRS_BW_ENFORCED flag and allow more packets to come in the
1235 1235   * system.
1236 1236   */
1237 1237  #define MAC_SRS_CHECK_BW_CONTROL(mac_srs) {                             \
1238 1238          ASSERT(MUTEX_HELD(&(mac_srs)->srs_lock));                       \
1239 1239          ASSERT(((mac_srs)->srs_type & SRST_TX) ||                       \
1240 1240              MUTEX_HELD(&(mac_srs)->srs_bw->mac_bw_lock));               \
1241 1241          clock_t now = ddi_get_lbolt();                                  \
1242 1242          if ((mac_srs)->srs_bw->mac_bw_curr_time != now) {               \
1243 1243                  (mac_srs)->srs_bw->mac_bw_curr_time = now;              \
1244      -                (mac_srs)->srs_bw->mac_bw_used = 0;                     \
     1244 +                (mac_srs)->srs_bw->mac_bw_used = 0;                     \
1245 1245                  if ((mac_srs)->srs_bw->mac_bw_state & SRS_BW_ENFORCED)  \
1246 1246                          (mac_srs)->srs_bw->mac_bw_state &= ~SRS_BW_ENFORCED; \
1247 1247          }                                                               \
1248 1248  }
1249 1249  
1250 1250  /*
1251 1251   * MAC_SRS_WORKER_WAKEUP
1252 1252   *
1253 1253   * Wake up the SRS worker thread to process the queue as long as
1254 1254   * no one else is processing the queue. If we are optimizing for
[453 lines elided]
1708 1708          ip6_t           *ip6h;
1709 1709          ipha_t          *ipha;
1710 1710          uint8_t         *whereptr;
1711 1711          uint_t          hash;
1712 1712          uint16_t        remlen;
1713 1713          uint8_t         nexthdr;
1714 1714          uint16_t        hdr_len;
1715 1715          uint32_t        src_val, dst_val;
1716 1716          boolean_t       modifiable = B_TRUE;
1717 1717          boolean_t       v6;
     1718 +        int             errno;
1718 1719  
1719 1720          ASSERT(MBLKL(mp) >= hdrsize);
1720 1721  
1721 1722          if (sap == ETHERTYPE_IPV6) {
1722 1723                  v6 = B_TRUE;
1723 1724                  hdr_len = IPV6_HDR_LEN;
1724 1725          } else if (sap == ETHERTYPE_IP) {
1725 1726                  v6 = B_FALSE;
1726 1727                  hdr_len = IP_SIMPLE_HDR_LENGTH;
1727 1728          } else {
[54 lines elided]
1782 1783          if (v6) {
1783 1784                  remlen = ntohs(ip6h->ip6_plen);
1784 1785                  nexthdr = ip6h->ip6_nxt;
1785 1786                  src_val = V4_PART_OF_V6(ip6h->ip6_src);
1786 1787                  dst_val = V4_PART_OF_V6(ip6h->ip6_dst);
1787 1788                  /*
1788 1789                   * Do src based fanout if below tunable is set to B_TRUE or
1789 1790                   * when mac_ip_hdr_length_v6() fails because of malformed
1790 1791                   * packets or because mblks need to be concatenated using
1791 1792                   * pullupmsg().
1792      -                 *
1793      -                 * Perform a version check to prevent parsing weirdness...
1794 1793                   */
1795      -                if (IPH_HDR_VERSION(ip6h) != IPV6_VERSION ||
1796      -                    !mac_ip_hdr_length_v6(ip6h, mp->b_wptr, &hdr_len, &nexthdr,
1797      -                    NULL)) {
     1794 +                errno = mac_ip_hdr_length_v6(ip6h, mp->b_wptr, &hdr_len,
     1795 +                    &nexthdr, NULL);
     1796 +                switch (errno) {
     1797 +                case EINVAL:
     1798 +                        /* Bad version. */
     1799 +                        *indx = 0;
     1800 +                        *type = OTH;
     1801 +                        return (0);
     1802 +                case 0:
     1803 +                        break;
     1804 +                default:
1798 1805                          goto src_dst_based_fanout;
1799 1806                  }
1800 1807          } else {
     1808 +                if (IPH_HDR_VERSION(ipha) != IPV4_VERSION) {
     1809 +                        /* Bad version. */
     1810 +                        *indx = 0;
     1811 +                        *type = OTH;
     1812 +                        return (0);
     1813 +                }
1801 1814                  hdr_len = IPH_HDR_LENGTH(ipha);
1802 1815                  remlen = ntohs(ipha->ipha_length) - hdr_len;
1803 1816                  nexthdr = ipha->ipha_protocol;
1804 1817                  src_val = (uint32_t)ipha->ipha_src;
1805 1818                  dst_val = (uint32_t)ipha->ipha_dst;
1806 1819                  /*
1807 1820                   * Catch IPv4 fragment case here.  IPv6 has nexthdr == FRAG
1808 1821                   * for its equivalent case.
1809 1822                   */
1810 1823                  if ((ntohs(ipha->ipha_fragment_offset_and_flags) &
[389 lines elided]
2200 2213   * Rx ring to get a chain of packets. It can inline process that chain
2201 2214   * if mac_latency_optimize is set (default) or signal the SRS worker thread
2202 2215   * to do the remaining processing.
2203 2216   *
2204 2217   * Since packets come in the system via interrupt or poll path, we also
2205 2218   * update the stats and deal with promiscous clients here.
2206 2219   */
2207 2220  void
2208 2221  mac_rx_srs_poll_ring(mac_soft_ring_set_t *mac_srs)
2209 2222  {
2210      -        kmutex_t                *lock = &mac_srs->srs_lock;
2211      -        kcondvar_t              *async = &mac_srs->srs_cv;
     2223 +        kmutex_t                *lock = &mac_srs->srs_lock;
     2224 +        kcondvar_t              *async = &mac_srs->srs_cv;
2212 2225          mac_srs_rx_t            *srs_rx = &mac_srs->srs_rx;
2213      -        mblk_t                  *head, *tail, *mp;
2214      -        callb_cpr_t             cprinfo;
2215      -        ssize_t                 bytes_to_pickup;
2216      -        size_t                  sz;
     2226 +        mblk_t                  *head, *tail, *mp;
     2227 +        callb_cpr_t             cprinfo;
     2228 +        ssize_t                 bytes_to_pickup;
     2229 +        size_t                  sz;
2217 2230          int                     count;
2218 2231          mac_client_impl_t       *smcip;
2219 2232  
2220 2233          CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "mac_srs_poll");
2221 2234          mutex_enter(lock);
2222 2235  
2223 2236  start:
2224 2237          for (;;) {
2225 2238                  if (mac_srs->srs_state & SRS_PAUSE)
2226 2239                          goto done;
[246 lines elided]
2473 2486  /*
2474 2487   * mac_srs_pick_chain
2475 2488   *
2476 2489   * In Bandwidth control case, checks how many packets can be processed
2477 2490   * and return them in a sub chain.
2478 2491   */
2479 2492  static mblk_t *
2480 2493  mac_srs_pick_chain(mac_soft_ring_set_t *mac_srs, mblk_t **chain_tail,
2481 2494      size_t *chain_sz, int *chain_cnt)
2482 2495  {
2483      -        mblk_t                  *head = NULL;
2484      -        mblk_t                  *tail = NULL;
     2496 +        mblk_t                  *head = NULL;
     2497 +        mblk_t                  *tail = NULL;
2485 2498          size_t                  sz;
2486      -        size_t                  tsz = 0;
     2499 +        size_t                  tsz = 0;
2487 2500          int                     cnt = 0;
2488      -        mblk_t                  *mp;
     2501 +        mblk_t                  *mp;
2489 2502  
2490 2503          ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2491 2504          mutex_enter(&mac_srs->srs_bw->mac_bw_lock);
2492 2505          if (((mac_srs->srs_bw->mac_bw_used + mac_srs->srs_size) <=
2493 2506              mac_srs->srs_bw->mac_bw_limit) ||
2494 2507              (mac_srs->srs_bw->mac_bw_limit == 0)) {
2495 2508                  mutex_exit(&mac_srs->srs_bw->mac_bw_lock);
2496 2509                  head = mac_srs->srs_first;
2497 2510                  mac_srs->srs_first = NULL;
2498 2511                  *chain_tail = mac_srs->srs_last;
↓ open down ↓ 60 lines elided ↑ open up ↑
2559 2572   *
2560 2573   * There is a equivalent drain routine in bandwidth control mode
2561 2574   * mac_rx_srs_drain_bw. There is some code duplication between the two
2562 2575   * routines but they are highly performance sensitive and are easier
2563 2576   * to read/debug if they stay separate. Any code changes here might
2564 2577   * also apply to mac_rx_srs_drain_bw as well.
2565 2578   */
2566 2579  void
2567 2580  mac_rx_srs_drain(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2568 2581  {
2569      -        mblk_t                  *head;
     2582 +        mblk_t                  *head;
2570 2583          mblk_t                  *tail;
2571      -        timeout_id_t            tid;
     2584 +        timeout_id_t            tid;
2572 2585          int                     cnt = 0;
2573 2586          mac_client_impl_t       *mcip = mac_srs->srs_mcip;
2574 2587          mac_srs_rx_t            *srs_rx = &mac_srs->srs_rx;
2575 2588  
2576 2589          ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2577 2590          ASSERT(!(mac_srs->srs_type & SRST_BW_CONTROL));
2578 2591  
2579 2592          /* If we are blanked i.e. can't do upcalls, then we are done */
2580 2593          if (mac_srs->srs_state & (SRS_BLANK | SRS_PAUSE)) {
2581 2594                  ASSERT((mac_srs->srs_type & SRST_NO_SOFT_RINGS) ||
↓ open down ↓ 217 lines elided ↑ open up ↑
2799 2812   *
2800 2813   * There is a equivalent drain routine in non bandwidth control mode
2801 2814   * mac_rx_srs_drain. There is some code duplication between the two
2802 2815   * routines but they are highly performance sensitive and are easier
2803 2816   * to read/debug if they stay separate. Any code changes here might
2804 2817   * also apply to mac_rx_srs_drain as well.
2805 2818   */
2806 2819  void
2807 2820  mac_rx_srs_drain_bw(mac_soft_ring_set_t *mac_srs, uint_t proc_type)
2808 2821  {
2809      -        mblk_t                  *head;
     2822 +        mblk_t                  *head;
2810 2823          mblk_t                  *tail;
2811      -        timeout_id_t            tid;
     2824 +        timeout_id_t            tid;
2812 2825          size_t                  sz = 0;
2813 2826          int                     cnt = 0;
2814 2827          mac_client_impl_t       *mcip = mac_srs->srs_mcip;
2815 2828          mac_srs_rx_t            *srs_rx = &mac_srs->srs_rx;
2816 2829          clock_t                 now;
2817 2830  
2818 2831          ASSERT(MUTEX_HELD(&mac_srs->srs_lock));
2819 2832          ASSERT(mac_srs->srs_type & SRST_BW_CONTROL);
2820 2833  again:
2821 2834          /* Check if we are doing B/W control */
↓ open down ↓ 244 lines elided ↑ open up ↑
3066 3079  
3067 3080  /*
3068 3081   * mac_srs_worker
3069 3082   *
3070 3083   * The SRS worker routine. Drains the queue when no one else is
3071 3084   * processing it.
3072 3085   */
3073 3086  void
3074 3087  mac_srs_worker(mac_soft_ring_set_t *mac_srs)
3075 3088  {
3076      -        kmutex_t                *lock = &mac_srs->srs_lock;
3077      -        kcondvar_t              *async = &mac_srs->srs_async;
     3089 +        kmutex_t                *lock = &mac_srs->srs_lock;
     3090 +        kcondvar_t              *async = &mac_srs->srs_async;
3078 3091          callb_cpr_t             cprinfo;
3079 3092          boolean_t               bw_ctl_flag;
3080 3093  
3081 3094          CALLB_CPR_INIT(&cprinfo, lock, callb_generic_cpr, "srs_worker");
3082 3095          mutex_enter(lock);
3083 3096  
3084 3097  start:
3085 3098          for (;;) {
3086 3099                  bw_ctl_flag = B_FALSE;
3087 3100                  if (mac_srs->srs_type & SRST_BW_CONTROL) {
↓ open down ↓ 679 lines elided ↑ open up ↑
3767 3780   *
3768 3781   * In this mode, the SRS will have access to multiple Tx rings to send
3769 3782   * the packet out. The fanout hint that is passed as an argument is
3770 3783   * used to find an appropriate ring to fanout the traffic. Each Tx
3771 3784   * ring, in turn,  will have a soft ring associated with it. If a Tx
3772 3785   * ring runs out of Tx desc's the returned packet will be queued in
3773 3786   * the soft ring associated with that Tx ring. The srs itself will not
3774 3787   * queue any packets.
3775 3788   */
3776 3789  
3777      -#define MAC_TX_SOFT_RING_PROCESS(chain) {                               \
     3790 +#define MAC_TX_SOFT_RING_PROCESS(chain) {                               \
3778 3791          index = COMPUTE_INDEX(hash, mac_srs->srs_tx_ring_count),        \
3779 3792          softring = mac_srs->srs_tx_soft_rings[index];                   \
3780 3793          cookie = mac_tx_soft_ring_process(softring, chain, flag, ret_mp); \
3781 3794          DTRACE_PROBE2(tx__fanout, uint64_t, hash, uint_t, index);       \
3782 3795  }
3783 3796  
3784 3797  static mac_tx_cookie_t
3785 3798  mac_tx_fanout_mode(mac_soft_ring_set_t *mac_srs, mblk_t *mp_chain,
3786 3799      uintptr_t fanout_hint, uint16_t flag, mblk_t **ret_mp)
3787 3800  {
↓ open down ↓ 820 lines elided ↑ open up ↑
4608 4621          i_mac_notify(mip, MAC_NOTE_TX);
4609 4622  }
4610 4623  
4611 4624  /*
4612 4625   * RX SOFTRING RELATED FUNCTIONS
4613 4626   *
4614 4627   * These functions really belong in mac_soft_ring.c and here for
4615 4628   * a short period.
4616 4629   */
4617 4630  
4618      -#define SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) {             \
     4631 +#define SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) {             \
4619 4632          /*                                                              \
4620 4633           * Enqueue our mblk chain.                                      \
4621 4634           */                                                             \
4622 4635          ASSERT(MUTEX_HELD(&(ringp)->s_ring_lock));                      \
4623 4636                                                                          \
4624 4637          if ((ringp)->s_ring_last != NULL)                               \
4625 4638                  (ringp)->s_ring_last->b_next = (mp);                    \
4626 4639          else                                                            \
4627 4640                  (ringp)->s_ring_first = (mp);                           \
4628 4641          (ringp)->s_ring_last = (tail);                                  \
↓ open down ↓ 156 lines elided ↑ open up ↑
4785 4798          }
4786 4799  }
4787 4800  
4788 4801  /*
4789 4802   * TX SOFTRING RELATED FUNCTIONS
4790 4803   *
4791 4804   * These functions really belong in mac_soft_ring.c and here for
4792 4805   * a short period.
4793 4806   */
4794 4807  
4795      -#define TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) {          \
     4808 +#define TX_SOFT_RING_ENQUEUE_CHAIN(ringp, mp, tail, cnt, sz) {          \
4796 4809          ASSERT(MUTEX_HELD(&ringp->s_ring_lock));                        \
4797 4810          ringp->s_ring_state |= S_RING_ENQUEUED;                         \
4798 4811          SOFT_RING_ENQUEUE_CHAIN(ringp, mp_chain, tail, cnt, sz);        \
4799 4812  }
4800 4813  
4801 4814  /*
4802 4815   * mac_tx_sring_queued
4803 4816   *
4804 4817   * When we are out of transmit descriptors and we already have a
4805 4818   * queue that exceeds hiwat (or the client called us with
↓ open down ↓ 189 lines elided ↑ open up ↑