Kebe Says - Dan McDonald's Blog

All Your Base Are Belong to 20-Somethings, and Solaris 9

Two Decades Ago…

Someone pointed out recently that the famous Internet meme “All your base are belong to us” turned 20 this week. Boy do I feel old. I was still in California, but Wendy and I were plotting our move to Massachusetts.

In AD 2001, S9 Was Beginning

OF COURSE I watched the video back then. The original Shockwave/Flash version on a site that no longer exists. I used my then-prototype Sun Blade 1000 to watch it, on Netscape, on in-development Solaris 9.

I found a bug in the audio driver by watching it. Luckily for me, portions of the Sun bug database were archived and available for your browsing pleasure. Behold bug 4451857. I reported it, and all of the text there is younger me.

The analysis and solution are not in this version of the bug report, which is a shame, because the maintainer (one Brian Botton) was quite responsive, and appreciated the MDB output. He fixed the bug by moving around a not-shown-there am_exit_task() call.

Another thing missing from the bug report is my “Public Summary” which I thought would tie things up nicely. I now present it here:

In A.D. 2001
S9 was beginning.
Brian: What Happen?
Dan: Someone set up us the livelock
Dan: We get signal
Brian: What!
Dan: MDB screen turn on.
Brian: It’s YOU!
4451857: How are you gentleman?
4451857: All your cv_wait() are belong to us.
4451857: You are on the way to livelock.
Brian: What you say?
4451857: You have no chance to kill -9 make your time.
4451857: HA HA HA HA…
Brian: Take off every am_exit_task().
Dan: You know what you doing
Brian: Move am_exit_task().
Brian: For great bugfix!

I Have No Whistle to Blow, But I Must Scream

I'm sure all twelve of you readers out there know what's been going on with respect to recent revelations about NSA activity. Among other things is the unnerving discovery that NSA has been attempting to actively dumb-down security for the Internet.

In the second linked article, Bruce Schneier calls upon people to blow the whistle on, "how the NSA and other agencies are subverting routers, switches, the internet backbone, encryption technologies and cloud systems." Here's the deal:

I have never been asked to introduce back-doors or weaken security in Solaris, OpenSolaris, Oracle Solaris 11 (for the four months I worked on it post-barn-door-closing), or Illumos. If there are weaknesses there, it was not because of any deliberate effort on my part.

You can view the kernel IPsec protocol sources (AH & ESP) here, by looking at ipsec*.c, sadb.c, spd.c, spdsock.c, keysock.c and header files in the directory above it. You can see the IPsec management utilities here. According to at least one well-known security researcher, the Illumos (nee OpenSolaris) IPsec code isn't bollocks.

There is no open-source for IKE, because the libike.so.1 library was mostly OEM code, from a vendor whose technical lead let me co-write an RFC with him. You can use the various observability and debugging tools in Illumos to see how things work, however, if you wish.

If you want to write your own, better, key management application for Illumos (or even Oracle Solaris), you can use PF_KEY to control the IPsec SADB. I detail the subsequent additions to RFC 2367 on my day-one-of-OpenSolaris blog post. If you want to work on IPsec in totally-open-source Illumos, you have my blessing, and I'll definitely be reviewing (and maybe integrating if you pass code reviews) your code.

Broad-Spectrum Dogfooding, or Why I Miss Jurassic.

NOTE: I imported this from blogspot and the embedded tweet was nicely there. Not sure if other self-hosted entries will be that cool.

I think most of you dozen readers know what I mean, when I refer to dogfooding. Some people think of Microsoft when they hear the term, but I first heard it from the same person via his being a Sun customer, AND via my old roommate, who worked for him.

I saw this Tweet last week:

I then checked out the blog post. It dealt with how an iSCSI LAN can be a failure point, partially due to the weakness of the ones-complement TCP/IP checksum.

Reading this reminded me of an old bug we found at Sun with either NFS or an ethernet device driver, and the only way we caught it was by using IPsec (AH particularly) and seeing packets fail the authentication check. The corrupt NFS packets had 16 bits' worth of 1 (0xffff), where they should have had 16 bits' worth of 0 (0x0000). Using the standard TCP/IP checksum, there's no difference between those two values, no matter where they fall in the packet. Using IPsec, however, even with HMAC-MD5, exposed the corruption clearly when the packet authentication check failed. This bug wouldn't have been discovered were it not for the Solaris Team's big honking server, jurassic, and how its multiple concurrent uses interacted with each other.
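To see why the checksum is blind here, consider a minimal Python sketch, assuming an RFC 1071-style ones-complement checksum. The packet bytes, the key, and the helper names (internet_checksum, corrupt, mac) are made up for illustration and have nothing to do with the actual NFS traffic or the Solaris code. In ones-complement arithmetic 0xffff is just another representation of zero, so adding it to a nonzero running sum changes nothing, whether the flipped bits sit on a 16-bit boundary or straddle two words. (The one exception is a sum that is otherwise entirely zero, which real packets with real headers never are.)

```python
import hashlib
import hmac


def internet_checksum(data: bytes) -> int:
    """Ones-complement sum of 16-bit big-endian words, RFC 1071 style."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def corrupt(packet: bytes, offset: int) -> bytes:
    """Turn 16 bits of 0 into 16 bits of 1 at the given byte offset."""
    assert packet[offset:offset + 2] == b"\x00\x00"
    return packet[:offset] + b"\xff\xff" + packet[offset + 2:]


# Made-up packet: header-like bytes, a run of zeros, then payload.
good = bytes.fromhex("450000280001") + b"\x00" * 4 + b"NFS payload bytes"
bad_aligned = corrupt(good, 6)    # falls on a 16-bit word boundary
bad_unaligned = corrupt(good, 7)  # straddles two 16-bit words

key = b"hypothetical-ah-key"  # illustration only


def mac(packet: bytes) -> bytes:
    """HMAC-MD5 over the whole packet, standing in for the AH check."""
    return hmac.new(key, packet, hashlib.md5).digest()


checksums_match = (
    internet_checksum(good)
    == internet_checksum(bad_aligned)
    == internet_checksum(bad_unaligned)
)
macs_match = hmac.compare_digest(mac(good), mac(bad_aligned))
print("checksums match:", checksums_match)
print("HMAC-MD5 match:", macs_match)
```

The checksum cannot tell the three packets apart, while the keyed MAC differs as soon as a single bit changes, which is the effect the AH-protected traffic surfaced.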

Even before there was OpenSolaris, people knew about jurassic. Solaris people's (not any old Sun people... Solaris people) posts on IETF mailing lists often showed user@jurassic. Jurassic served as the NFS source of home directories, and until the early 2000s e-mail inboxes as well. Every two weeks the in-development Solaris build would be placed upon jurassic. As a Solaris developer, if your changes broke jurassic, you fixed those changes immediately, or risked getting your changes yanked out. Not breaking jurassic was a great motivator for code quality. Also, if you had a new feature, you wanted it used on jurassic, even if not by everyone.

Once the basic IPsec protocols - AH & ESP - went into Solaris 8, I convinced the jurassic maintainers to protect all traffic between jurassic and a couple of workstations. One was mine, naturally. I encrypted all of my traffic to jurassic. Since we only had 100Mbit in our building at that time, the performance hit wasn't too bad, relatively speaking. Another belonged to an NFS developer, who I'd somehow convinced to run AH, because I was already running ESP (and AH used fewer cycles for protection). It was this NFS developer, surprised he wasn't getting data corruption while others were, who helped suss out the bug in question.

At this point, I'd like to have a moment of silence for all of the made-public Solaris information that Oracle has since put back in its box. I could've had a bug id here, folks, A REAL BUG ID!!!

So for a few of us, jurassic also served as an IPsec testbed. It also was helpful in determining that nobody else's cleartext performance dropped while a few of us were running with IPsec-protected network traffic (put more succinctly, connection policy latching worked). Other services would run on jurassic as well: DNS, IMAP, and others I'm sure I'm forgetting. Jurassic core dumps eventually would be used to test out the then-new mdb (oh, those early ::findleaks results...), and I'm sure more than a few DTrace scripts helped diagnose some jurassic-discovered bugs.

At Nexenta, we make a dedicated storage appliance. Naturally, we use them inside where appropriate. We Nexentians (especially the ones in Lowell) use Illumos from other distributions for even greater effect. My Illumos Home Data Center talk touches upon these at about 10:43 in. We use Illumos to host VMs (Thank you Joyent), we use it for site-to-site VPNs, we will be using it for public services at some point, and everything I mentioned all runs on Illumos. It's not quite the magnifying glass Jurassic was, but we do what we can.

I believe Oracle still has jurassic around; I know it did prior to my 2011 departure. I suspect it's helping Oracle Solaris even today. I suspect, however, that a less dense, but more widely instantiated broad-spectrum dogfooding continues on in Illumos today.

MAC-then-encrypt - also harmful, also hard to do in Solaris

Hello again!

Kenny Paterson's once again turning the theoretical into practical. This time he's pointed out that if one configures IPsec to MAC-then-encrypt (do packet authentication first, THEN encrypt the packet), one is open to cryptographic attack. Here's a citation for his ACM CCS paper.

The good news is that we cannot configure the IPsec SPD to perform MAC-then-encrypt at all. One could configure transport mode to just MAC, then have the packet transit a tunnel that just encrypts, but then you'll see warnings about the encryption-only tunnel configuration. This has been true for a LONG time (starting with S9, maybe even S8).

So basically, we don't make it easy for you to shoot yourself in the foot this way. You really have to try, and as I pointed out earlier, the encryption-only part will warn you.

Do a "pkg image-update" with multiple zones!!!

Hello you half-dozen readers!

Recently I reinstalled my home server to OpenSolaris, build 130. I used zfs send and zfs recv to recover my relevant bits of data. I also constructed new zones, this time using ipkg zones.

Using ipkg zones takes a bit of acclimation. The biggest thing to note is that if you need a specific software package, you have to use pkg install in the zone you wish to have the software. For example, I have three zones:
  • The Global, internal-only, server zone - My global zone spends most of its time without a default route, serving NFSv4 and anything else I can think of only to my local LAN. If I need a new service, I temporarily add a global route, and pkg install away.
  • The Webserver zone - Just like it says. I needed Apache here, and had to pkg install Apache here.
  • The Router/NAT/IPsec-remote-access/Firewall zone - If you're going to put potential targets on the Internet, why put the global zone there? Especially with Crossbow VNICs and IP Instances!
So I've got all of these zones, and the global zone isn't even net-attached most of the time? More interesting still, I need to upgrade all of these zones.

I posed this problem to pkg-discuss@opensolaris.org. Right now, pkg image-update won't upgrade the non-global zones. Worse still, I need to upgrade a zone that's also acting as my NAT and router. Luckily for me, Ed Pilatowicz gave me some good advice:

"i do have one other workaround/suggestion you could try. after you do an image-update of your global zone. before rebooting, use beadm to mount the new image on /a. then you can try doing "pkg -R /a/path_to_your_zone/root image-update" for each of your zones. this will probably work as long as you're always image-update'ing to the latest bits in the repository (and no new images get pushed to the repository in between all the image-update operations.)"

So I took Ed at his word.

Even if you have an ultra-paranoid global zone, you need to get it talking to an IPS repository. Either temporarily add an off-link route like I do, or have a local repository handy. Proceed and pkg image-update your global zone. Make sure you use --be-name to pick a BE name that you'll remember.

Next, you literally beadm mount new-be-name /mnt and for each zone root directory (while still able to reach the repository from your global zone) do pkg -R zone-root-path image-update. For my own example, I did:
  • pkg image-update --be-name 132
  • beadm mount 132 /mnt
  • pkg -R /mnt/export/home/webserver/root image-update
  • pkg -R /mnt/export/home/router/root image-update
  • beadm umount 132
  • reboot
This worked quite well for me moving up from 130 to 132. Just make sure your global zone can reach the repository, and you should be golden.
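If you have more than a couple of zones, the sequence is easy to generate. Here is a minimal Python sketch that only builds the command list and prints it; nothing is executed. The function name upgrade_plan and its parameters are hypothetical, and it assumes, as in my example above, that each zone's root appears under the mount point at the same path it has on the running system.

```python
def upgrade_plan(be_name, zone_roots, mountpoint="/mnt"):
    """Return the upgrade steps as argv lists; this does not run anything."""
    plan = [
        ["pkg", "image-update", "--be-name", be_name],
        ["beadm", "mount", be_name, mountpoint],
    ]
    for root in zone_roots:
        # Inside the mounted boot environment, the zone root lives under
        # the mount point (assumption carried over from the example above).
        plan.append(["pkg", "-R", mountpoint + root, "image-update"])
    plan.append(["beadm", "umount", be_name])
    plan.append(["reboot"])
    return plan


plan = upgrade_plan(
    "132",
    ["/export/home/webserver/root", "/export/home/router/root"],
)
for argv in plan:
    print(" ".join(argv))
```

Printing instead of running keeps the sketch safe to try anywhere; review the output before pasting it into a root shell on the global zone.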

IKEv2 project page updated

The IKEv2 project page on opensolaris.org now has links to both an early-revision design document, and a webrev pointer.

OpenSolaris works out of the box with Amazon Virtual Private Cloud

Glenn Brunette asked me if OpenSolaris could access the Amazon Virtual Private Cloud or not. I told him it had better, or else there was a bug. He then did some scripting work, got some BGP help from Sowmini, and consulted Sebastien on some tunneling details. It's now up, running, and in a nice package, ready to use.

IKEv2 project now on OpenSolaris

The IKEv2 project page is now available here on OpenSolaris. There's mailing-list information and a brief hello. We are working on design-level issues right now and some larval code, so c'mon over as we start to fire this up.

New IPsec goodies in S10u7

Hello again. Pardon any latency. This whole Oracle thing has been a bit distracting. Never mind figuring out the hard way what limitations there are on racoon2 and what to do about them.

Anyway, Solaris 10 Update 7 (a.k.a. 5/09) is now out. It contains a few new IPsec features that have been in OpenSolaris for a bit. They include:
  • HMAC-SHA-2 support per RFC 4868 in all three sizes (SHA-256, SHA-384, and SHA-512) for IPsec and IKE.
  • 2048-bit (group 14), 3072-bit (group 15), and 4096-bit (group 16) Diffie-Hellman groups for IKE. (NOTE: Be careful running 3072 or 4096 bit on Niagara 1 hardware, see here for why. Niagara 2 works better, but not optimally, with those two groups.)
  • IKE Dead Peer Detection
  • SMF Management of IPsec. Four new services split out from network/initial:
    • svc:/network/ipsec/ipsecalgs:default -- Sets up IPsec kernel algorithm mappings.
    • svc:/network/ipsec/policy:default -- Sets up the IPsec SPD (reads /etc/inet/ipsecinit.conf).
    • svc:/network/ipsec/manual-key:default -- Reads any manually-added SAs (reads /etc/inet/secret/ipseckeys).
    • svc:/network/ipsec/ike:default -- Controls the IKE daemon.
  • The UDP_NAT_T_ENDPOINT socket option from OpenSolaris, so you can develop your own NAT-Traversing IPsec key management apps without relying on in.iked.
We've even more goodies in OpenSolaris, BTW.

How to tell when a performance project succeeds?

The Volo project is an effort to improve the interface between sockets and any socket-driven subsystems, including the TCP/IP stack. During their testing, they panicked during some IPsec tests. See this bug for what they reported.

In our IPsec, we have LARVAL IPsec security associations (SAs). These are SAs that reserve a unique 32-bit Security Parameters Index (SPI), but have no other data. If a packet arrives for a LARVAL SA, we queue it up, so that when it gets filled in by key management, the packet can go through. We do this because of the IKE Quick Mode Exchange, which looks like this:

        INITIATOR                               RESPONDER
        ---------                               ---------

        IKE Quick Mode Packet #1  ------->

                                <----------     IKE Quick Mode Packet #2

        IKE Quick Mode Packet #3  -------->
Now once the initiator receives Quick Mode packet #2, it has enough information to transmit an IPsec-protected packet. Unfortunately, the responder cannot complete its Security Association entries until it receives packet #3. It is possible, then, that the initiator's IPsec packet may arrive before the responder has finished processing IKE. Let's look at the packets again:
        INITIATOR                               RESPONDER
        ---------                               ---------

        IKE Quick Mode Packet #1  ------->

                                <----------     IKE Quick Mode Packet #2

        ESP or AH packet        ---------->     Does this packet...

        IKE Quick Mode Packet #3  -------->     ... get processed after my
                                                receipt of #3, also after
                                                which I SADB_UPDATE my
                                                inbound SA, which changes it
                                                from LARVAL to MATURE?

Now the code that queues up an inbound IPsec packet for a LARVAL SA is sadb_set_lpkt(), as was shown in the bug's description. It turns out there was a locking bug in this function - and we even had an ASSERT()-ion that the SA manipulated by sadb_set_lpkt() was always larval. The problem was, we discounted the possibility of IKE finishing between the detection of a LARVAL SA and the actual call to sadb_set_lpkt().

The Volo project improved UDP latency enough so that the IKE packet wormed its way up the stack and into in.iked faster than the concurrent ESP or AH packet. The aforementioned ASSERT() tripped during Volo testing, because we did not check the SA's state while holding its lock. Had we done so, we could have told that the LARVAL SA was promoted to MATURE, and we could have gone ahead and processed the packet.
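The shape of the fix is the classic "re-check state under the lock" pattern. Here is a toy Python model of it, offered as a sketch only: the SA class, promote(), and inbound() are hypothetical stand-ins and not the kernel's structures or the real sadb_set_lpkt(). The point is that the decision to queue a packet is made while holding the SA's lock, so a promotion that sneaks in just before the call is noticed instead of asserted away.

```python
import threading

LARVAL, MATURE = "LARVAL", "MATURE"


class SA:
    """Toy stand-in for a security association; not the kernel structure."""

    def __init__(self):
        self.lock = threading.Lock()
        self.state = LARVAL
        self.queued = []  # packets waiting for key management to finish


def promote(sa):
    """What key management does: LARVAL -> MATURE, releasing queued packets."""
    with sa.lock:
        sa.state = MATURE
        released, sa.queued = sa.queued, []
    return released


def inbound(sa, packet):
    """Queue the packet only if the SA is still LARVAL under the lock."""
    with sa.lock:
        if sa.state == LARVAL:
            sa.queued.append(packet)
            return "queued"
    # The SA matured before we took the lock: just process the packet.
    return "process"


sa = SA()
first = inbound(sa, "esp-1")   # arrives before IKE finishes
released = promote(sa)         # IKE finishes; the SA is filled in
second = inbound(sa, "esp-2")  # arrives after promotion
print(first, released, second)
```

An unlocked peek at the state followed by an assertion inside the queueing function is exactly the window that faster UDP delivery widened; checking under the lock closes it.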

This race condition was present since sadb_set_lpkt() was introduced in Solaris 9, but it took Volo's improved performance to find it. So hats off to Volo for speeding things up enough to find long-dormant race conditions!



Addendum - IKEv2 does not have this problem because its equivalent to v1's Quick Mode is a simpler request/response exchange, so the responder is ready to receive when it sends the response back to the initiator.
