Kebe Says - Dan McDonald's Blog

Quick Reminder -- tcp_{xmit,recv}_hiwat and high-bandwidth*delay networks

I was recently working with a colleague on connecting two data centers via an IPsec tunnel. He was using iperf (coming soon to OmniOS bloody along with netperf) to test the bandwidth, and was disappointed in his results.

The amount of memory you need to hold a TCP connection's unacknowledged data is the Bandwidth-Delay product. For example, a 1 Gbit/s path with 40 ms of round-trip latency needs roughly 5 MB in flight to keep the pipe full. The defaults shipped in illumos are small on the receive side:


bloody(~)[0]% ndd -get /dev/tcp tcp_recv_hiwat
128000
bloody(~)[0]%
and even smaller on the transmit side:

bloody(~)[0]% ndd -get /dev/tcp tcp_xmit_hiwat
49152
bloody(~)[0]%

Even on platforms with automatic tuning, the maximums are often not set high enough.

Introducing IPsec into the picture adds latency (not so much for encryption, thanks to AES-NI and friends, as for the encapsulation and checks). This is often enough to take what are normally good-enough maximums and make them too small. To change these on illumos, you can use the ndd(1M) command shown above, or you can use the modern ipadm(1M) command, whose settings persist across reboots:


bloody(~)[1]% sudo ipadm set-prop -p recv_buf=1048576 tcp
bloody(~)[0]% sudo ipadm set-prop -p send_buf=1048576 tcp
bloody(~)[0]% ipadm show-prop -p send_buf tcp
PROTO PROPERTY PERM CURRENT PERSISTENT DEFAULT POSSIBLE
tcp send_buf rw 1048576 1048576 49152 4096-1048576
bloody(~)[0]% ipadm show-prop -p recv_buf tcp
PROTO PROPERTY PERM CURRENT PERSISTENT DEFAULT POSSIBLE
tcp recv_buf rw 1048576 1048576 128000 2048-1048576
bloody(~)[0]%

There's future work there, not only in increasing the upper bound (easy), but also in adopting automatic tuning so that a fixed maximum isn't simply grabbed right off the bat.
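
By the way, if you only need bigger buffers for one application, a program can also ask per-socket via SO_SNDBUF and SO_RCVBUF; the kernel still clamps the request to the maximums shown above. Here's a minimal sketch of the arithmetic, with a made-up 100 Mbit/s bandwidth and 70 ms RTT standing in for your own link's numbers (on illumos, link with -lsocket -lnsl):

/*
 * Size a TCP socket's buffers from the bandwidth-delay product.
 * The bandwidth and RTT below are placeholders; measure your own.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

int
main(void)
{
	long long bits_per_sec = 100000000LL;	/* 100 Mbit/s (placeholder) */
	int rtt_ms = 70;			/* 70 ms round trip (placeholder) */

	/* BDP in bytes: (bits/sec / 8) * RTT in seconds. */
	int bdp = (int)(bits_per_sec / 8 * rtt_ms / 1000);

	int s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1) {
		perror("socket");
		return (1);
	}

	/* Ask for send and receive buffers at least as big as the BDP. */
	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof (bdp)) == -1)
		perror("SO_SNDBUF");
	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof (bdp)) == -1)
		perror("SO_RCVBUF");

	printf("Requested %d-byte buffers (bandwidth-delay product).\n", bdp);
	return (0);
}

That's the same arithmetic that makes 1 MB a sensible ceiling: at 1 Gbit/s you hit it with only about 8 ms of round-trip time.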

Happy (early) 20th anniversary, IPv6

My first full-time job out of school was with the U.S. Naval Research Laboratory. It was a spectacular opportunity. I was going to be working on next-generation (at the time) Internet Protocol research and development.

When I joined in early 1994, the IPng proposals had been narrowed to three:

  • SIPP - Simple Internet Protocol Plus. 8-byte addresses, combined with a routing header that could, in theory, extend the space even further (inherited from IPng contender PIP).
  • TUBA - TCP and UDP with Bigger Addresses. The use of OSI's CLNP, with the proven IPv4 transports TCP and UDP running over it.
  • CATNIP - Common Architecture for the Internet. I never understood this proposal, to be honest, but I believe it was an attempt to merge CLNP and IPv4.

NRL (well, my part of NRL, anyway) placed its bet on SIPP. I was hired to help build SIPP for the then-nascent 4.4BSD. (The first 10 months were actually on 4.3 Net/2 as shipped by BSDI!) It was a great team to work with, and our 1995 USENIX paper showed off our good work.

Ooops... I'm getting a bit ahead of myself.

The announcement of the IPng winner was to be at the 30th IETF meeting in Toronto, late in July. Some of us were fortunate to find out early that what would become IPv6 was SIPP, but with 16-byte addresses. Since I was building this thing, I figured it was time to get to work before Toronto.

20 years ago today, I sent the mail below (with slightly reordered header fields) to a subset of people. I didn't use the public mailing list, because I couldn't disclose SIPP-16 (which became IPv6) before the Toronto meeting. I also ran into some issues that later implementors would discover, as you can see.


From: "Daniel L. McDonald" <danmcd>
Subject: SIPP-16 stuff
To: danmcd (Daniel L. McDonald), cmetz (Craig Metz), atkinson (Ran Atkinson),
deering@parc.xerox.com, Bob.Hinden@eng.sun.com,
bob.gilligan@eng.sun.com, francis@cactus.ntt.jp,
rxg@thumper.bellcore.com, set@thumper.bellcore.com, bound@zk3.dec.com,
christian.huitema@sophia.inria.fr, conta@lassie.ucx.lkg.dec.com,
grehan@flotsm.ozy.dec.com, nordmark@jurassic-248.Eng.Sun.COM,
bill.simpson@um.cc.umich.edu, rj@sgi.com
Date: Thu, 21 Jul 1994 19:20:33 -0500 (EST)
Cc: vjs@sgi.com
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID: <9407220020.aa02835@sundance.itd.nrl.navy.mil>
Content-Length: 1578
Status: RO
X-Status:
X-Keywords: NotJunk
X-UID: 155

SIPP folks,

Has anyone tried quick-n-dirty SIPP-16 mods yet?

We have managed to send/receive SIPP-16 pings across both Ethernet and
loopback. UDP was working with SIPP-8, and we're working on it for SIPP-16.
Minor multicast cases were working for SIPP-8 also, and will be moved to
SIPP-16. TCP will be forthcoming once we're comfortable with some of the
protocol control block changes.

My idea for the SIPP-16 sockaddr_sipp and sipp_addr is something like:

struct sipp_addr {
	u_long words[4];
};

struct sockaddr_sipp {
	u_char ss_len;		/* For BSD routing tree code. */
	u_char ss_family;
	u_short ss_port;
	u_long ss_reserved;
	struct sipp_addr ss_addr;
};

We've managed to use the above to configure our interfaces, and send raw
SIPP-16 ICMP pings. I've a feeling the routing tree will get hairy with the
new sockaddr_sipp. The size discrepancy between the sockaddr_sipp, and the
conventional sockaddr will cause other compatibility issues to arise.
(E.g. SIOCAIFADDR will not work with SIPP, but SIOCAIFADDR_SIPP will.)

We look forward to the implementors meeting, so we can talk about bloody
gory details, experience with certain internals (PCBs!), and to find out
how far behind we still are.

Dan McD, Craig Metz, & Ran Atkinson
--
Dan McDonald | Mail: {danmcd,mcdonald}@itd.nrl.navy.mil --------------+
Computer Scientist | WWW: http://wintermute.itd.nrl.navy.mil/danmcd.html |
Naval Research Lab | Phone: (202) 404-7122 #include <disclaimer.h> |
Washington, DC | "Rise from the ashes, A blaze of everyday glory" - Rush +

Funny how many defunct-or-at-least-renamed organizations (Sun, DEC, Bellcore) are in that mail. BTW, for Solarish systems, SIOCSLIFADDR (note the 'L') became the ioctl of choice for longer sockaddrs. Also, this was before I discovered the uintN_t data types.
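
For the curious, here is (roughly) where that structure ended up two decades later: the RFC 3493 sockaddr_in6, written out standalone for comparison with the 1994 sketch above. The sin6_len byte is a 4.4BSD-ism, sin6_scope_id was a later addition, and real programs of course pull these from <netinet/in.h>.

/* The IPv6 address structures that eventually shipped (RFC 3493 layout,
 * BSD flavor), shown standalone only for comparison with sockaddr_sipp. */
#include <stdint.h>

struct in6_addr {
	uint8_t s6_addr[16];		/* the 128-bit (16-byte) address */
};

struct sockaddr_in6 {
	uint8_t		sin6_len;	/* BSD systems only */
	uint8_t		sin6_family;	/* AF_INET6 */
	uint16_t	sin6_port;	/* transport-layer port */
	uint32_t	sin6_flowinfo;	/* flow label and traffic class */
	struct in6_addr	sin6_addr;	/* the address itself */
	uint32_t	sin6_scope_id;	/* added later, e.g. for link-locals */
};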

If it wasn't clear from the text of the mail, we actually transmitted IPv6 packets across an Ethernet that day. It's possible these were the first IPv6 packets ever sent on a wire. (Other early implementations used IPv6-in-IPv4 exclusively.) I won't fully claim that honor here, but I do believe it could be true.

A final suggested read

David Reed passed along a pointer to this paper by Dan Geer:

A Time for Choosing

Please read it, and understand the founding spirit of the Internet. And with that, I say goodbye to Oracle.

End-to-end Research Group is ending

Let me quote BBN's Craig Partridge on the Internet Research Task Force's end2end-interest mailing list:
Dear Friends and Colleagues:

After 26 years, the End-to-End Research Group has decided to cease existence
as of January 1st, 2010. While there is certainly still end-to-end research
to be done, the group had ceased to effectively serve as a forum for those
discussions.

The E2E group had a great run, serving as a place where many researchers
could bring their ideas for initial, informal, airing. The meetings could be
bruising. (At one meeting, a member tried to encourage a speaker by saying
"We're all friends here" only to pause and say, "No, I'm sorry, actually we
eat our young, but proceed anyway"). But the meetings usually also brought
insights.

Ideas that were tested in E2E meetings include slow start and improved
round-trip time estimation, Random Early Drop, Integrated and Differentiated
Services, Weighted Fair Queuing, PAWS, and Transaction TCP.

When I learned about the group (and its enlightening e-mail list), my networking professor described it as covering "End to end, and everything in between..." Now you half-dozen readers know the exact origin of my previous (was "this") blog's name.

Luckily, the mailing alias will still be around. Still, the cliche, "End of an era," really applies here. It's yet another sign of the Internet's maturity, and that the really new places for research are probably somewhere not a lot of people are examining.

Anyone else have something to say about the End-to-End Research Group going away?

How to tell when a performance project succeeds?

The Volo project is an effort to improve the interface between sockets and any socket-driven subsystem, including the TCP/IP stack. During their testing, they hit a kernel panic in some IPsec tests. See this bug for what they reported.

In our IPsec, we have LARVAL IPsec security associations (SAs). These are SAs that reserve a unique 32-bit Security Parameters Index (SPI), but have no other data. If a packet arrives for a LARVAL SA, we queue it up, so that once key management fills in the SA, the packet can go through. We do this because of the IKE Quick Mode exchange, which looks like this:

        INITIATOR                               RESPONDER
        ---------                               ---------

        IKE Quick Mode Packet #1  ------->

                                <----------     IKE Quick Mode Packet #2

        IKE Quick Mode Packet #3  -------->

Now once the initiator receives Quick Mode packet #2, it has enough information to transmit an IPsec-protected packet. Unfortunately, the responder cannot complete its Security Association entries until it receives packet #3. It is possible, then, that the initiator's IPsec packet may arrive before the responder has finished processing IKE. Let's look at the packets again:

        INITIATOR                               RESPONDER
        ---------                               ---------

        IKE Quick Mode Packet #1  ------->

                                <----------     IKE Quick Mode Packet #2

        ESP or AH packet        ---------->     Does this packet...

        IKE Quick Mode Packet #3  -------->     ... get processed after my
                                                receipt of #3, also after
                                                which I SADB_UPDATE my
                                                inbound SA, which changes it
                                                from LARVAL to MATURE?

Now the code that queues up an inbound IPsec packet for a LARVAL SA is sadb_set_lpkt(), as was shown in the bug's description. It turns out there was a locking bug in this function - and we even had an ASSERT()-ion that the SA manipulated by sadb_set_lpkt() was always larval. The problem was, we discounted the possibility of IKE finishing between the detection of a LARVAL SA and the actual call to sadb_set_lpkt().

The Volo project improved UDP latency enough that the IKE packet wormed its way up the stack and into in.iked faster than the concurrent ESP or AH packet. The aforementioned ASSERT() tripped during Volo testing, because we did not check the SA's state while holding its lock. Had we, we would have seen that the LARVAL SA had been promoted to MATURE, and we could have gone ahead and processed the packet.
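
The shape of the fix, then, is to take the SA's lock first and only then decide whether to queue the packet or process it. Below is an illustrative user-land sketch of that pattern; the type, field, and function names are placeholders of my own, not the real illumos ipsa_t or sadb_set_lpkt() definitions.

/*
 * Illustrative sketch only -- not the real sadb_set_lpkt().  The point is
 * to re-check the SA's state *under its lock* and fall through to normal
 * processing if key management already promoted it from LARVAL to MATURE.
 */
#include <pthread.h>
#include <stdio.h>

typedef enum { SA_LARVAL, SA_MATURE } sa_state_t;

typedef struct {
	pthread_mutex_t	sa_lock;
	sa_state_t	sa_state;
} ipsec_sa_t;				/* stand-in for the real SA structure */

typedef struct { int len; } packet_t;	/* stand-in for an inbound packet */

static void
process_inbound(ipsec_sa_t *sa, packet_t *pkt)
{
	printf("SA is MATURE: processing %d-byte packet now\n", pkt->len);
}

static void
stash_larval_packet(ipsec_sa_t *sa, packet_t *pkt)
{
	printf("SA still LARVAL: queueing %d-byte packet for later\n", pkt->len);
}

static void
larval_queue_or_process(ipsec_sa_t *sa, packet_t *pkt)
{
	pthread_mutex_lock(&sa->sa_lock);
	if (sa->sa_state != SA_LARVAL) {
		/* IKE packet #3 won the race; no ASSERT(), just process. */
		pthread_mutex_unlock(&sa->sa_lock);
		process_inbound(sa, pkt);
		return;
	}
	/* Still LARVAL: stash the packet until SADB_UPDATE fills in the SA. */
	stash_larval_packet(sa, pkt);
	pthread_mutex_unlock(&sa->sa_lock);
}

int
main(void)
{
	ipsec_sa_t sa;
	packet_t pkt = { 1500 };

	pthread_mutex_init(&sa.sa_lock, NULL);
	sa.sa_state = SA_MATURE;

	larval_queue_or_process(&sa, &pkt);	/* takes the "already MATURE" path */
	return (0);
}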

This race condition was present since sadb_set_lpkt() was introduced in Solaris 9, but it took Volo's improved performance to find it. So hats off to Volo for speeding things up enough to find long-dormant race conditions!



Addendum - IKEv2 does not have this problem, because its equivalent of IKEv1's Quick Mode is a simpler request/response exchange, so the responder is ready to receive by the time it sends its response back to the initiator.

We'll miss you, Itojun

Jun-Ichiro Hagino, better known to the IETF community as Itojun, died on October 29th at the age of 37.

Please visit this link for a message from his colleagues at Japan's WIDE Project.

He led the KAME/WIDE IPv6 work that now appears in all of the BSDs (including Mac OS X), and even its user-land tools appear in Linux. Having worked on one of the pre-KAME IPv6 implementations (NRL), I can tell you that Itojun pushed things along even further than I'd hoped.

I had the privilege to talk with him when I was attending IETF meetings, and even visited him on his home turf in Tokyo. He was very unassuming, enthusiastic, and an all-around Good Guy (TM). I'll miss him, and a LOT of other people will too. PLEASE follow the link I mentioned above to see what you can do to honor Itojun.

IPsec Tunnel Reform, IP Instances, and other new-in-S10 goodies

Solaris 10 Update 4 (or as marketing calls it, Solaris 10 08/07) contains some backported goodies we've had in Nevada/OpenSolaris for a while.

IPsec Tunnel Reform was one of the first big pieces of code to be dropped into the S10u4 codebase. It shores up our interoperability story, so you can now start constructing VPNs that tell IKE to negotiate Tunnel Mode (as opposed to IP-in-IP transport mode). Tunnels themselves are still network interfaces, but their IPsec configuration is now wholly in the purview of ipsecconf(1M). Modulo IKE (part of which we still OEM), we developed Tunnel Reform in the open with OpenSolaris.

Also new for S10u4 is IP Instances. Before u4, you could create non-global zones, but their network management (e.g. ifconfig(1M)) had to be done from the global zone. With u4, you can give a zone its own IP Instance, which means its own complete TCP/IP stack. The global zone only needs to assign a GLDv3-compatible interface (e.g. bge, nge, e1000g) to the zone to give it a unique IP Instance. You could have a single box be your router/firewall/NAT, your web server, and who knows what else, all while keeping those functions out of the fully-privileged global zone. It makes me think about upgrading to business-class Internet service at home, building my own box like Bart did, and getting a few extra Ethernet ports.

Oh, and if you want to do it all with fewer Ethernet ports, check out OpenSolaris's Crossbow and its VNIC abstraction!

Have fun moving your network bits in new and interesting ways!

Dan's blog is powered by blahgd