Kebe Says - Dan McDonald's Blog

Broad-Spectrum Dogfooding, or Why I Miss Jurassic.

NOTE: I imported this from blogspot and the embedded tweet was nicely there. Not sure if other self-hosted entries will be that cool.

I think most of you dozen readers know what I mean when I refer to dogfooding. Some people think of Microsoft when they hear the term, but I first heard it from one person in two ways: through his being a Sun customer, AND through my old roommate, who worked for him.

I saw this Tweet last week:

I then checked out the blog post. It dealt with how an iSCSI LAN can be a failure point, partly due to the weakness of the ones-complement TCP/IP checksum.

Reading this reminded me of an old bug we found at Sun with either NFS or an ethernet device driver, and the only way we caught it was by using IPsec (AH particularly) and seeing packets fail the authentication check. The corrupt NFS packets had 16 bits worth of 1 (0xffff) where they should have had 16 bits worth of 0 (0x0000). Using the standard TCP/IP checksum, there's no difference between those two values, no matter where they fall in the packet. Using IPsec, however, even with HMAC-MD5, showed the failure clearly when the packet authentication check failed. This bug wouldn't have been discovered were it not for the Solaris Team's big honking server, jurassic, and how its multiple concurrent uses interacted with each other.
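The failure mode is easy to demonstrate. Here's a minimal sketch (my own code, not anything from Solaris) of the standard RFC 1071 ones-complement checksum. Because 0x0000 and 0xffff are both representations of zero in ones-complement arithmetic, corruption that flips one into the other is invisible to the checksum:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard ones-complement Internet checksum (RFC 1071) over 16-bit words. */
static uint16_t inet_cksum(const uint16_t *words, size_t nwords)
{
	uint32_t sum = 0;

	while (nwords-- > 0)
		sum += *words++;
	while (sum >> 16)	/* Fold carries back in (end-around carry). */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}
```

A packet with a 0x0000 word and the same packet with that word corrupted to 0xffff produce identical checksums, while any other single-word corruption changes the sum. A keyed integrity check like AH's HMAC-MD5, by contrast, mixes every bit of the payload into the ICV, so the same corruption fails authentication immediately.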

Even before there was OpenSolaris, people knew about jurassic. Solaris people's (not any old Sun people... Solaris people) posts on IETF mailing lists often showed user@jurassic. Jurassic served as the NFS source of home directories, and until the early 2000s e-mail inboxes as well. Every two weeks the in-development Solaris build would be placed upon jurassic. As a Solaris developer, if your changes broke jurassic, you fixed those changes immediately, or risked getting your changes yanked out. Not breaking jurassic was a great motivator for code quality. Also, if you had a new feature, you wanted it used on jurassic, even if not by everyone.

Once the basic IPsec protocols - AH & ESP - went into Solaris 8, I convinced the jurassic maintainers to protect all traffic between jurassic and a couple of workstations. One was mine, naturally. I encrypted all of my traffic to jurassic. Since we only had 100Mbit in our building at that time, the performance hit wasn't too bad, relatively speaking. Another belonged to an NFS developer, whom I'd somehow convinced to run AH, because I was already running ESP (and AH used fewer cycles for protection). It was this NFS developer, surprised he wasn't getting data corruption while others were, who helped suss out the bug in question.

At this point, I'd like to have a moment of silence for all of the made-public Solaris information that Oracle has since put back in its box. I could've had a bug id here, folks, A REAL BUG ID!!!

So for a few of us, jurassic also served as an IPsec testbed. It also was helpful in determining that nobody else's cleartext performance dropped while a few of us were running with network traffic (put more succinctly, connection policy latching worked). Other services would run on jurassic as well: DNS, IMAP, and others I'm sure I'm forgetting. Jurassic core dumps eventually would be used to test out the then-new mdb (oh, those early ::findleaks results...), and I'm sure more than a few DTrace scripts helped diagnose some jurassic-discovered bugs.

At Nexenta, we make a dedicated storage appliance. Naturally, we use them inside where appropriate. We Nexentians (especially the ones in Lowell) use Illumos from other distributions for even greater effect. My Illumos Home Data Center talk touches upon these at about 10:43 in. We use Illumos to host VMs (Thank you Joyent), we use it for site-to-site VPNs, we will be using it for public services at some point, and everything I mentioned all runs on Illumos. It's not quite the magnifying glass Jurassic was, but we do what we can.

I believe Oracle still has jurassic around, I know it did prior to my 2011 departure. I suspect it's helping Oracle Solaris even today. I suspect, however, that a less dense, but more widely instantiated broad-spectrum dogfooding continues on in Illumos today.

Delegated ZFS, cloning, and SCM

Well THAT was a long break from blogging...

One of the things that's happened in the illumos community is a subtle shift of the main illumos source repository from being primarily Mercurial to being primarily Git. This means I've had to learn Git. At first, I wasn't sure why people were so rabidly pro-Git. I found one of the big reasons:


everywhere(~/ws)[0]% /bin/time git clone git-illumos git-illumos.copy
Cloning into git-illumos.copy...
done.

real 11.8
user 4.7
sys 3.2
everywhere(~/ws)[0]% /bin/time hg clone illumos-clone illumos-clone.copy
updating working directory
44332 files updated, 0 files merged, 0 files removed, 0 files unresolved

real 1:52.6
user 28.9
sys 25.4
everywhere(~/ws)[0]%

Wow! Yeah, I can see why this would appeal to people. I'm still using Mercurial in a fair number of places, both for my illumos work and for Nexenta. I should show one other thing that both SCM cloning operations do: take up disk space.


everywhere(~/ws)[0]% zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool   298G   198G   100G         -    66%  1.00x  ONLINE  -
everywhere(~/ws)[0]% /bin/time git clone git-illumos git-illumos.copy

*** SNIP! ***

everywhere(~/ws)[0]% sync
everywhere(~/ws)[0]% zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool   298G   198G  99.6G         -    66%  1.00x  ONLINE  -
everywhere(~/ws)[0]% /bin/time hg clone illumos-clone illumos-clone.copy

*** SNIP! ***

everywhere(~/ws)[0]% sync
everywhere(~/ws)[0]% zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool   298G   199G  98.7G         -    66%  1.00x  ONLINE  -
everywhere(~/ws)[0]%

Git also takes up less disk space (roughly 0.4GB vs. 0.9GB in the example above), but still, that's approximately half a gig or more for an illumos workspace. If it's populated, say with a preinstalled proto area and compiled objects, it'll be even larger.

Consider one of the great strengths of ZFS: its copy-on-write architecture. Take a local, on-disk master repo, say one you're pulling directly from the source, and make it its own filesystem. Child/downstream workspaces from your on-disk master now can be created using low-latency ZFS operations. Only two problems need to be solved: non-privileged usage, and SCM correction to properly designate the parent/child or upstream/downstream relationship.

Another useful ZFS feature is administrative delegation. Put simply, an administrator can allow an ordinary user to perform selected ZFS primitives on a given filesystem, and its descendants in the ZFS filesystem tree. For example:


everywhere(~)[0]% zfs allow rpool/export/home/danmcd
everywhere(~)[0]% zfs allow rpool/export/home/danmcd/ws
---- Permissions on rpool/export/home/danmcd/ws ----------------------
Local+Descendent permissions:
user danmcd clone,create,destroy,mount,promote,snapshot
everywhere(~)[0]%

I (as root) delegated several permissions for a subdirectory of $HOME to me (as danmcd). From here, I can create new filesystems in ~/ws, as well as destroy them, clone them, mount, snapshot, and promote them. All of these are useful operations. The syntax for delegation is mostly straightforward: zfs allow -ld clone,create,destroy,mount,promote,snapshot rpool/export/home/danmcd/ws. The -ld flags enable local and descendant permission propagation.

First thing I did was zfs create rpool/export/home/danmcd/ws/illumos-clone, followed by hg clone ssh://anonhg@hg.illumos.org/illumos-gate illumos-clone. This populates my local Mercurial illumos repo. I can perform a similar operation with git. Per my above timing examples, I did so with git-illumos.

I wrote a script to clone, promote, and reparent Git and Mercurial workspaces using ZFS operations. It's called zclone and it's here for download. It's still a work in progress, and I'd like to maybe have it end up in usr/src/tools in illumos-gate someday. (I'll try and update this particular post as things evolve.)

Check out the times, and the disk space (not) used:


everywhere(~/ws)[0]% zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool   298G   198G   100G         -    66%  1.00x  ONLINE  -
everywhere(~/ws)[0]% /bin/time zclone git-illumos git-illumos.zc
Created rpool/export/home/danmcd/ws/git-illumos.zc,
a zfs clone of rpool/export/home/danmcd/ws/git-illumos

real 1.0
user 0.0
sys 0.0
everywhere(~/ws)[0]% /bin/time zclone illumos-clone illumos-clone.zc
Created rpool/export/home/danmcd/ws/illumos-clone.zc,
a zfs clone of rpool/export/home/danmcd/ws/illumos-clone

real 1.0
user 0.0
sys 0.0
everywhere(~/ws)[0]% zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rpool   298G   198G   100G         -    66%  1.00x  ONLINE  -
everywhere(~/ws)[0]%

These are constant-time operations, folks. And like I said earlier, I suppose it's possible to have the local master repos populated with pre-compiled objects, header files in proto areas (an illumos build trick), and other disk-intensive operations pre-performed.

A quick search didn't yield me any results in this area: using ZFS to help make source trees take up less space. I'm surprised nobody's blogged about this or documented it, but I may have missed something. Either way, it doesn't hurt to mention it again.

Finding Ada, but with better technology examples!

I found out thanks to Denny Gentry about Ada Lovelace Day today. Denny has a great blog post citing three engineers and their work with ATM.

The three engineers are wonderful examples of excellence, ones I'd gladly mention. What bugs me is that he cited... ewww.... ATM. His third paragraph mentioned why I go, "ewww..." over ATM. He didn't have to deal with (I think) some of the politics of ATM zealots, but that doesn't take away from Allyn's, Sally's, or Renee's abilities or contributions.

In fact, it's not difficult to cite further contributions from each of them... two of which I can further support with source code!

First off, Sally Floyd is well known for much TCP and congestion control goodness. If you followed the link to Sally's page you can see all (or at least most) of her work for yourself. I unfortunately don't know of any quickly-linkable code to cite, but I'll gladly accept suggestions.

Allyn Romanow was an engineer at Sun, and worked in my old group (Solaris Internet Engineering) while she was there. Her big contribution to the Solaris TCP/IP stack was the support for large, fast networks (aka. RFC 1323), which you can see scattered throughout the TCP code, particularly here.

Renee Danson (now Sommerfeld), also an engineer at Sun, escaped the world of ATM to join Internet Engineering later on. I was fortunate to have her land with Team IPsec for a while. As we were bringing up IKE for Solaris 9, I was hoping to have a command-line tool alter the running IKE daemon using the Solaris lightweight IPC mechanism known as doors. Renee made this happen. Because of a large OEM component, the IKE daemon source isn't available for browsing, but the control program, ikeadm(1M) is there for the world to see.

An unofficial IETF slogan was, "We believe in rough consensus and running code." I figured it's even better to find Ada with some running code to back it up.

WRITE_SAME support now in Illumos COMSTAR

The WRITE_SAME primitive is now available in Illumos as of this push:
13382:d84aa76f7cd2 Dan McDonald 
937 WRITE_SAME support for COMSTAR
Reviewed by: Gordon Ross 
Reviewed by: Richard Elling 
Reviewed by: Robert Gordon 
Approved by: Gordon Ross  
Sumit Gupta wrote the original contribution, and after a bit of my own massaging, it's now in Illumos. Unlike the UNMAP push, this one did not need a lot of rewhacking (in large part because it has less direct interaction with ZFS).

The WRITE_SAME primitive works pretty much as its name suggests. The iSCSI initiator passes in a WRITE_SAME command along with a single disk block. The iSCSI target then writes that same block over the range of logical block addresses specified in the command.

One set of experiments I did prior to integration was figuring out what size buffer to allocate for an I/O. In a perfect world, you don't want to do sbd_write() calls for every 512-byte block. On the other hand, you also don't want to force the kmem allocator to perform unholy tasks of allocation. I settled on a default of 128kbytes, which has a kmem_cache magazine backing it up (according to kmem stats). Users can experiment with this themselves by tweaking stmf_sbd's sbd_write_same_optimal_chunk variable. Every WRITE_SAME request, once it generates the data, consults this variable prior to allocating a block. Source-junkies can look here for the function in question.
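As a rough illustration of the buffer-size tradeoff, here's a hand-written sketch against a plain file, not the actual stmf_sbd code; the function and variable names are mine, chosen to parallel the real ones. The point is that one large replicated buffer turns a thousand 512-byte writes into a handful of chunk-sized ones:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	BLKSZ	512

/* Hypothetical stand-in for the sbd_write_same_optimal_chunk tunable. */
static size_t write_same_optimal_chunk = 128 * 1024;

/*
 * Expand a single source block across a range of logical blocks,
 * issuing one large write per chunk instead of one write per LBA.
 */
static int write_same(FILE *dev, long lba, long nblocks,
    const uint8_t blk[BLKSZ])
{
	size_t chunk_blocks = write_same_optimal_chunk / BLKSZ;
	uint8_t *chunk = malloc(chunk_blocks * BLKSZ);
	long n;

	if (chunk == NULL)
		return (-1);

	/* Replicate the source block to fill the chunk buffer once. */
	for (size_t i = 0; i < chunk_blocks; i++)
		memcpy(chunk + i * BLKSZ, blk, BLKSZ);

	if (fseek(dev, lba * BLKSZ, SEEK_SET) != 0) {
		free(chunk);
		return (-1);
	}
	while (nblocks > 0) {
		n = nblocks < (long)chunk_blocks ? nblocks :
		    (long)chunk_blocks;
		if (fwrite(chunk, BLKSZ, (size_t)n, dev) != (size_t)n) {
			free(chunk);
			return (-1);
		}
		nblocks -= n;
	}
	free(chunk);
	return (0);
}
```

With the 128K default, writing 1000 blocks takes four fwrite() calls instead of 1000; cranking the chunk size higher just shifts the cost onto the allocator, which is the tradeoff described above.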

Happy block-writing, folks!

For Illumos newbies: On developing small

I just finished a chat with a person who's doing a device driver, and he was worried that a certain header file wasn't available in his /usr/include. This struck me as odd, as I always get my headers from the workspace's proto area...

Then I realized I've had 15 years at Sun under my belt and this person's a complete newbie.

I haven't looked very closely at the Illumos build instructions, but I'm going to do some things now that will help kernel module writers (e.g. device drivers) get started without resorting to a full build right off the bat. I'll assume that you've installed the appropriate compilers and the "onbld" package so that you have a populated /opt/onbld/bin.

STEP 1: The /opt/onbld/bin/ws command:

When you go to work in an Illumos source base, you're best off "entering it" via the ws command. I've hacked my .tcshrc to print a different prompt when I'm in with ws. Here, check it out:
everywhere(~)[1]% ws ws/to_mhi

Workspace                    : /export/home/danmcd/ws/to_mhi
Workspace Parent             : /export/home/danmcd/ws/illumos-clone
Proto area ($ROOT)           : /export/home/danmcd/ws/to_mhi/proto/root_i386
Parent proto area ($PARENT_ROOT) : /export/home/danmcd/ws/illumos-clone/proto/root_i386
Root of source ($SRC)        : /export/home/danmcd/ws/to_mhi/usr/src
Root of test source ($TSRC)  : /export/home/danmcd/ws/to_mhi/usr/ontest
Current directory ($PWD)     : /export/home/danmcd/ws/to_mhi

WS-everywhere-WS(~/ws/to_mhi)[0]% 
You'll notice a few things got set in the environment. What I use to alter my .tcshrc is the CODEMGR_WS variable. You should do the same in your favorite shell's config.

UPDATE: You will need to set SPRO_ROOT and BUILD_TOOLS after invoking ws. I do this already in my .tcshrc, but forgot to report it. A newer tool, bldenv, fixes this, but currently at the cost of a configuration file. There's talk of merging ws's simplicity with bldenv's completeness.

One of the key concepts in building Illumos is the "proto area". This is a version of the root filesystem that lives within your source tree. You'll see it set above. There's one per basic architecture type (i386 or sparc). When a full "nightly" build happens, the proto area gets populated with headers, libraries, commands, kernel modules, etc., and then the packaging tools sweep up their input from the proto area. The proto area contains more than what is on a running system.

You need to populate your proto area with basics (directory structures, etc.) to start.

WS-everywhere-WS(~/ws/to_mhi)[1]% cd $SRC
WS-everywhere-WS(usr/src)[0]% pwd
/export/home/danmcd/ws/to_mhi/usr/src
WS-everywhere-WS(usr/src)[0]% dmake sgs
       < Go get a drink of water or coffee, it's gonna be a bit... >
WS-everywhere-WS(usr/src)[1]% 
The "sgs" target sets up the proto area completely.

If you're proceeding to build, say, kernel modules, you should populate the kernel include files in the proto area.

WS-everywhere-WS(~/ws/to_mhi)[0]% cd usr/src/uts
WS-everywhere-WS(src/uts)[0]% dmake install_h
    < TONS of output deleted... >
WS-everywhere-WS(src/uts)[0]% 
UPDATE: Fellow Illumos hacker Rich Lowe has informed me that "dmake setup" does both sgs and install_h in one fell swoop.

And then you can go and compile your kernel module. I'll use "ip" as an example:

WS-everywhere-WS(src/uts)[1]% cd intel/ip
WS-everywhere-WS(intel/ip)[0]% pwd
/export/home/danmcd/ws/to_mhi/usr/src/uts/intel/ip
WS-everywhere-WS(intel/ip)[0]% dmake
     < MORE output deleted... >
WS-everywhere-WS(intel/ip)[0]% 
If you want to lint-check your module, don't do the obvious "make lint" but instead do "make modlintlib". This will perform basic lint sanity without the overhead of a full crosscheck.

Now if you want to do something in userland, you'll need to do more than a simple header install. You MIGHT need to bring up libraries too, because it's possible your workspace's libraries have different versions than the machine you're actually building on.

WS-everywhere-WS(intel/ip)[0]% cd $SRC/lib
WS-everywhere-WS(src/lib)[0]% 
If you utter "dmake install", it's going to be a while. You can, if you know only a certain library was altered, cd into that library and utter "dmake install" in there. For example:

WS-everywhere-WS(src/lib)[0]% cd libipsecutil
WS-everywhere-WS(lib/libipsecutil)[0]% dmake install_h
     < output deleted... >
WS-everywhere-WS(lib/libipsecutil)[0]% dmake install
     < MORE output deleted... >
WS-everywhere-WS(lib/libipsecutil)[0]% 
Then you can go to, say, your new command, and start compiling and debugging there. Once you're done, you can exit this shell, and it will return you to your original pre-ws shell.

Hopefully this will lower some of the barriers to entry for budding Illumos hackers.

Finally unpacked

I think I've managed to move all of my old blog entries over from blogs.sun.com. Hopefully I'll be posting some Illumos-related technical content before too long. Stay tuned!

I'm leaving Oracle, and switching gears

15 years ago I was finishing up last-minute changes at NRL while getting ready to move coasts. While I'm not moving coasts, I'm at the point where I'm finishing up last-minute changes again.

I'm leaving Oracle this week, and will be trying something a bit different after that. I've been doing IPsec or at least TCP/IP related work for the entirety of my time at Sun. I expect to be back in TCP/IP-land relatively soon, but I will be learning some new-to-me technologies in the immediate future.

I've met and worked with some extraordinary people during my time at Sun. I hope to keep in touch with them after I depart. If any of you half-dozen readers wish to keep up, I'd suggest following my Twitter feed until I decide whether or not I find a new home for this blog. I'm also findable on Facebook and LinkedIn for those so inclined.

Kebe's Home Data Center (or f''(Bart's new home server))

A little over a year ago, Bart Smaalders blogged about his new home server. Subsequently Bill built a similarly-configured one. (I thought that he had blogged about his too, but he hadn't.)

I'd been toying with the idea of following in Bill's and Bart's footsteps for some time. A recent influx allowed me to upgrade lots of home technology (including a new Penryn-powered MacBook Pro), and finally allowed me to build out what I like to think of as my home data center. I mention f''(Bart's...) because this box really is the second-derivative of Bart's original box (with Bill's being the first-derivative).

And the starting lineup for this box is:
  • An AMD Opteron Model 185 - I was lucky enough to stumble across one of these. 2 cores of 2.6GHz AMD64 goodness.
  • A Tyan S2866 - I bought the one with two Ethernet ports - one nVidia (nge) and one Broadcom (bge). It has audio too, but I haven't tested it as I've my Macs for such things. It has all of the goodies Bart mentioned, but I *think* that the SATA might be native now. (Please comment if you know.)
  • 2GB ECC RAM - with room for two more if need be.
  • An old two-port Intel Pro Ethernet 10/100 - good thing the driver (iprb) for this is now open-source. I'll explain why I need four Ethernet ports in a bit.
  • Two Western Digital "green" 750GB SATA drives - Each drive has 32GB root partitions (yes, that's large; until Indiana matures, though, I'll stick with UFS roots), 4GB swap (for core dumps), and the remaining large areas combine to make one mirrored ZFS pool with ~700 decimal GB of storage.
  • A cheap MSI nVidia 8400GS - It's more than enough to drive my 1920x1200 display.
  • An overkill Antec 850W power supply - obtained for only $100 from the carcass of CompUSA.
  • A Lian Li U60 case - My brother-in-law, who has years in the trenches of PC care, feeding, and repair, recommended Lian Li to me. It has all the space I need and more for drives, and its fan layout is pretty comprehensive. Since this box lives in my office, noise isn't that much of an issue.
  • OpenSolaris build 83 - While I'm pumped about what's going on with Indiana, it's still under development, and I want something a bit more stable.

So why four ethernet ports (covering three drivers)? Well, like Indiana, Crossbow is exciting, but not yet integrated into the main OpenSolaris tree. I do, however, very much like the idea of Virtual Network Machines and I'll be using these four ports to build three such machines on this server using prerequisite-to-Crossbow IP Instances. Two ports will form the router zone. The router will also be a firewall, and maybe an IPsec remote-access server too. With Tunnel Reform in place, I can let my or my wife's notebook Macs access our internal home network from any location. One port will be the public web server, and assuming Comcast doesn't screw things up too badly on their business-class install, the new home of www.kebe.com. The last port will be the internal-server and global-zone/administrative station. All of that ZFS space needs to be accessible from somewhere, right?

I'd like to thank Bart and Bill for the hardware inspiration, and to my friends in OpenSolaris networking for offering up something I can exploit immediately to create my three machines in one OpenSolaris install. I'll keep y'all informed about how things are going.

ESP without authentication considered harmful

Hopefully you will read this and go "That's obvious". I'm writing this entry, however, for those who don't.

When IPsec was being specified over 10 years ago, attacks against cipher-block-chaining (CBC) encryption were understood. ESP has an authentication algorithm because AH had a vocal-enough opposition to merit having packet integrity in ESP also (there are also performance arguments for ESP-auth).

Now there are actual attacks with actual results. Kenny Paterson and Arnold Yau have published a paper with attacks against no-authentication ESP Tunnel Mode. I believe some of the techniques can be employed against Transport Mode as well, but again, only with no authentication present.

The simple solution, of course, is to employ your choice of ESP Authentication (encr_auth_algs in ipsecconf(1m) or ifconfig(1m)) or AH (auth_algs in ipsecconf(1m) or ifconfig(1m)) with your IPsec deployment. We warn users about such configurations with ifconfig(1m) today. There is an RFE to eliminate or make very difficult encryption-only configurations in Solaris. Maybe someone in the OpenSolaris community would like to take a stab at it?

Put IPsec to work in YOUR application

Hello coders!

Most people know that you can use ipsecconf(1m) to apply IPsec policy enforcement to an existing application. For example, if you wish to only allow inbound telnet traffic that's under IPsec protection, you'd put something like this into /etc/inet/ipsecinit.conf or other ipsecconf(1m) input:
# Inbound telnet traffic should be IPsec protected
{ lport 23 } ipsec { encr_algs any(128..) encr_auth_algs md5 sa shared}
    or ipsec { encr_algs any(128..) encr_auth_algs sha1 sa shared}


Combine that with appropriate IKE configuration or manual IPsec keys, and you can secure your telnet traffic against eavesdropping, connection hijacking, etc.

For existing services, using ipsecconf(1m) is the most expedient way to bring IPsec protection to bear on packets.

For new services, or services that are being modified anyway, consider using per-socket policy as an alternative. Some advantages to per-socket policy are:

  • Per-socket policy is stored internally in network session state (the conn_t structure in OpenSolaris). Entries from ipsecconf(1m) are stored in the global Security Policy Database (SPD). No global SPD entries means lower latency for fresh flow creation, and less lock acquisition.

  • Per-socket bypass means fewer bypass entries in global SPD. If I bypass remote-port 80 using ipsecconf(1m), I can, in theory, enter the system with a remote TCP packet with port=80. There's an RFE (6219908) to work around this, but per-socket is still quicker. I'd love a web proxy with the ability to set per-socket bypass.



The newly SMF-ized inetd(1m) would be a prime candidate for per-socket policy. See RFE 6226853, and this might be something someone in the OpenSolaris community would like to tackle!

Let's look at the ipsec_req_t structure that's been around since Solaris 8 in /usr/include/netinet/in.h:
/*
 * Different preferences that can be requested from IPSEC protocols.
 */

#define IP_SEC_OPT 0x22 /* Used to set IPSEC options */
#define IPSEC_PREF_NEVER 0x01
#define IPSEC_PREF_REQUIRED 0x02
#define IPSEC_PREF_UNIQUE 0x04
/*
 * This can be used with the setsockopt() call to set per socket security
 * options. When the application uses per-socket API, we will reflect
 * the request on both outbound and inbound packets.
 */

typedef struct ipsec_req {
	uint_t ipsr_ah_req; /* AH request */
	uint_t ipsr_esp_req; /* ESP request */
	uint_t ipsr_self_encap_req; /* Self-Encap request */
	uint8_t ipsr_auth_alg; /* Auth algs for AH */
	uint8_t ipsr_esp_alg; /* Encr algs for ESP */
	uint8_t ipsr_esp_auth_alg; /* Auth algs for ESP */
} ipsec_req_t;
The ipsec_req_t is a subset of what one can specify with ipsecconf(1m) in Solaris 9 or later, but it matched what one could do with Solaris 8's version. Algorithm values are derived from PF_KEY (see /usr/include/net/pfkeyv2.h for values), as below. One could also use getipsecalgbyname(3nsl). If I wish to set a socket to use ESP with AES and MD5, I'd set it up as follows:
	int s; /* Socket file descriptor... */

	ipsec_req_t ipsr;

 .....

	/* NOTE: Do this BEFORE calling connect() or accept() for TCP sockets. */
	ipsr.ipsr_ah_req = 0;
	ipsr.ipsr_esp_req = IPSEC_PREF_REQUIRED;

	ipsr.ipsr_self_encap_req = 0;
	ipsr.ipsr_auth_alg = 0;

	ipsr.ipsr_esp_alg = SADB_EALG_AES;
	ipsr.ipsr_esp_auth_alg = SADB_AALG_MD5HMAC;
	if (setsockopt(s, IPPROTO_IP, IP_SEC_OPT, &ipsr,
	    sizeof (ipsr)) == -1) {
		perror("setsockopt");
		bail(); /* Ugggh, we failed. */
	}
	/* You now have per-socket policy set. */
Notice I mentioned setting the socket option BEFORE calling connect() or accept? This is because of a phenomenon we implement called connection latching. Basically, connection latching means that once an endpoint is connect()-ed, the IPsec policy (whether set per-socket or inherited from the state of the global SPD at the time) latches in place. We made this decision to avoid keeping policy-per-datagram state for things like TCP retransmits.

One thing per-socket policy does not address is the case of unconnected datagram services. In a perfect world, we could have IPsec policy information percolate all the way to the socket layer, where an application can make fully-informed per-datagram decisions on whether or not a particular packet was secured or not. It's a hard problem, requiring XNET sockets (to use sendmsg() and recvmsg() with ancillary data).

BTW, if you want to bypass whatever global entries are in the SPD, you can zero out the structure, and set all three (ah, esp, self_encap) action indicators to IPSEC_PREF_NEVER. You need to be privileged (root or "sys_net_config") to use per-socket bypass, however.
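Here's a sketch of such a bypass request. The constants and struct layout are reproduced from the header excerpt above so the snippet compiles off-Solaris; the helper name is my own invention:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * On Solaris these come from <netinet/in.h>; they're reproduced here
 * (values as quoted above) so this sketch is self-contained.
 */
#define	IP_SEC_OPT		0x22
#define	IPSEC_PREF_NEVER	0x01

typedef unsigned int uint_t;	/* Solaris typedef. */

typedef struct ipsec_req {
	uint_t ipsr_ah_req;		/* AH request */
	uint_t ipsr_esp_req;		/* ESP request */
	uint_t ipsr_self_encap_req;	/* Self-Encap request */
	uint8_t ipsr_auth_alg;		/* Auth algs for AH */
	uint8_t ipsr_esp_alg;		/* Encr algs for ESP */
	uint8_t ipsr_esp_auth_alg;	/* Auth algs for ESP */
} ipsec_req_t;

/* Zero the structure, then mark all three actions as NEVER: full bypass. */
static void ipsec_bypass_req(ipsec_req_t *ipsr)
{
	memset(ipsr, 0, sizeof (*ipsr));
	ipsr->ipsr_ah_req = IPSEC_PREF_NEVER;
	ipsr->ipsr_esp_req = IPSEC_PREF_NEVER;
	ipsr->ipsr_self_encap_req = IPSEC_PREF_NEVER;
	/*
	 * A privileged process on Solaris would then apply it with:
	 * setsockopt(s, IPPROTO_IP, IP_SEC_OPT, ipsr, sizeof (*ipsr));
	 */
}
```

All three action fields must say NEVER; leaving any of them zero means "defer to the global SPD" for that protocol.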

So modulo the keying problem (setting up IKE or having both ends agree on IPsec manual keys), you can put IPsec to work right in your application. In fact, if you use IKE, you can let IKE sort out permissions and access control (by using PKI-issued certificates, self-signed certificates, or preshared keys) and have policy merely determine the details of the protection required.


Dan's blog is powered by blahgd