Kebe Says - Dan McDonald's Blog

WRITE_SAME support now in Illumos COMSTAR

The WRITE_SAME primitive is now available in Illumos as of this push:
13382:d84aa76f7cd2 Dan McDonald 
937 WRITE_SAME support for COMSTAR
Reviewed by: Gordon Ross 
Reviewed by: Richard Elling 
Reviewed by: Robert Gordon 
Approved by: Gordon Ross  
Sumit Gupta wrote the original contribution, and after a bit of my own massaging, it's now in Illumos. Unlike the UNMAP push, this one did not have a lot of rewhacking (in large part due to its lower amount of direct interaction with ZFS).

The WRITE_SAME primitive works pretty much like its name. The iSCSI initiator passes in a WRITE_SAME primitive along with a single disk block. The iSCSI target then writes the same block over the range of logical block addresses specified in the command.

One set of experiments I did prior to integration was figuring out what size buffer to allocate for an I/O. In a perfect world, you don't want to do sbd_write() calls for every 512-byte block. On the other hand, you also don't want to force the kmem allocator to perform unholy tasks of allocation. I settled on a default of 128kbytes, which has a kmem_cache magazine backing it up (according to kmem stats). Users can experiment with this themselves by tweaking stmf_sbd's sbd_write_same_optimal_chunk variable. Every WRITE_SAME request, once it generates the data, consults this variable prior to allocating a block. Source-junkies can look here for the function in question.

Happy block-writing, folks!

Showing your kids the Star Wars films - which order?

WARNING: Spoilers for the Star Wars movies. Here's some old-school spoiler space...










We finally finished (in fits and starts) showing our twin 8-year-olds all six Star Wars films. We showed them in the order Wendy and I saw them on the big screen: 4 (Star Wars), 5 (The Empire Strikes Back, 6 (Return of the Jedi), 1 (The Phantom Menace), 2 (Attack of the Clones), and finally 3 (Revenge of the Sith). Now I'll admit we skipped over char-broiled Anakin and Vader's suit-fitting during Sith, but they're 8, what would you expect?

As we finished up, something occurred to me. I remember reading my favorite online-exclusive film critic (and fellow parent) Drew McWeeney mentioning toward the bottom of this article that he was going to show them to his son in a slightly different order: 4, 5, 1, 2, 3, 6.

That's a fascinating way to show them. At the end of The Empire Strikes Back the first-time viewer may have a question about whether or not Darth Vader is Luke's father. Why not, at that point, show the first-time viewer the story of Anakin Skywalker? This works especially well now, where the special-edition Empire uses Ian McDiarmid's Emperor Palpatine, and an astute child will notice how much Darth Sidious resembles him (or even Senator Palpatine).

Commenters (oh gotta love Internet feedback... makes me glad I only have a half-dozen readers) mention a few other orders: 1, 2, 3, 4, 5, 6 ("to get the crap out of the way"), or the flip-flop 1, 4, 2, 5, 3, 6 (tracking both in single steps).

There's a little part of me that wishes we tried the flashback-in-the-middle approach, but the only thing that matters is that our girls enjoyed the movies, and now they get one or two more of the jokes Wendy and I make.

For Illumos newbies: On developing small

I just finished a chat with a person who's doing a device driver, and he was worried that a certain header file wasn't available in his /usr/include. This struck me as odd, as I always get my headers from the workspace's proto area...

Then I realized I've had 15 years at Sun under my belt and this person's a complete newbie.

I haven't looked very closely at the Illumos build instructions, but I'm going to do some things now that will help kernel module writers (e.g. device drivers) get started without resorting to a full build right off the bat. I'll assume that you've installed the appropriate compilers and the "onbld" package so that you have a populated /opt/onbld/bin.

STEP 1: The /opt/onbld/bin/ws command:

When you go to work in an Illumos source base, your best off "entering it" via the ws command. I've hacked my .tcshrc to print a different prompt when I'm in with ws. Here, check it out:
everywhere(~)[1]% ws ws/to_mhi

Workspace                    : /export/home/danmcd/ws/to_mhi
Workspace Parent             : /export/home/danmcd/ws/illumos-clone
Proto area ($ROOT)           : /export/home/danmcd/ws/to_mhi/proto/root_i386
Parent proto area ($PARENT_ROOT) : /export/home/danmcd/ws/illumos-clone/proto/root_i386
Root of source ($SRC)        : /export/home/danmcd/ws/to_mhi/usr/src
Root of test source ($TSRC)  : /export/home/danmcd/ws/to_mhi/usr/ontest
Current directory ($PWD)     : /export/home/danmcd/ws/to_mhi

WS-everywhere-WS(~/ws/to_mhi)[0]% 
You'll notice a few things got set in the environment. What I use to alter my .tcshrc is the CODEMGR_WS variable. You should do the same in your favorite shell's config.

UPDATE: You will need to set SPRO_ROOT and BUILD_TOOLS after invoking ws. I do this already in my .tcshrc, but forgot to report it. A newer tool: bldenv, fixes this, but currently at the cost of a configuration file. There's talk of merging ws's simplicity with bldenv's completeness.

One of the key concepts in building Illumos is the "proto area". This is a version of the root filesystem that lives within your source tree. You'll see it set above. There's one per basic architecture type (i386 or sparc). When a full "nightly" build happens, the proto area gets populated with headers, libraries, commands, kernel modules, etc., and then the packaging tools sweep up their input from the proto area. The proto area contains more than what is on a running system.

You need to populate your proto area with basics (directory structures, etc.) to start.

WS-everywhere-WS(~/ws/to_mhi)[1]% cd $SRC
WS-everywhere-WS(usr/src)[0]% pwd
/export/home/danmcd/ws/to_mhi/usr/src
WS-everywhere-WS(usr/src)[0]% dmake sgs
       < Go get a drink of water or coffee, it's gonna be a bit... >
WS-everywhere-WS(usr/src)[1]% 
The "sgs" target sets up the proto area completely.

If you're proceeding to build, say, kernel modules, you should populate the kernel include files in the proto area.

WS-everywhere-WS(~/ws/to_mhi)[0]% cd usr/src/uts
WS-everywhere-WS(src/uts)[0]% dmake install_h
    < TONS of output deleted... >
WS-everywhere-WS(src/uts)[0]% 
UPDATE Fellow Illumos hacker Rich Lowe has informed me that "dmake setup" does both sgs and install_h in one fell swoop.

And then you can go and compile your kernel module. I'll use "ip" as an example:

WS-everywhere-WS(src/uts)[1]% cd intel/ip
WS-everywhere-WS(intel/ip)[0]% pwd
/export/home/danmcd/ws/to_mhi/usr/src/uts/intel/ip
WS-everywhere-WS(intel/ip)[0]% dmake
     < MORE output deleted... >
WS-everywhere-WS(intel/ip)[0]% 
If you want to lint-check your module, don't do the obvious "make lint" but instead do "make modlintlib". This will perform basic lint sanity without the overhead of a full crosscheck.

Now if you want to do something in userland, you'll need to do more than a simple header install. You MIGHT need to bringup libraries too, because it's possible your workspace's libraries have different versions than the machine you're actually building on.

WS-everywhere-WS(intel/ip)[0]% cd $SRC/lib
WS-everywhere-WS(src/lib)[0]% 
If you utter "dmake install", it's going to be a while. You can, if you know only a certain library was altered, cd into that library and utter "dmake install" in there. For example:

WS-everywhere-WS(src/lib)[0]% cd libipsecutil
WS-everywhere-WS(lib/libipsecutil)[0]% dmake install_h
     < output deleted... >
WS-everywhere-WS(lib/libipsecutil)[0]% dmake install
     < MORE output deleted... >
WS-everywhere-WS(lib/libipsecutil)[0]% 
Then you can go to, say, your new command, and start compiling and debugging there. Once you're done, you can exit this shell, and it will return you to your original pre-ws shell.

Hopefully this will lower some of the barriers to entry for budding Illumos hackers.

Finally unpacked

I think I've managed to move all of my old blog entries over from blogs.sun.com. Hopefully I'll be posting some Illumos-related technical content before too long. Stay tuned!

Hello again, world!

At Wendy's and Garrett's advice, I've set up shop here on Blogger/Blogspot.

I plan on importing all of my old Sun blog posts here, but I exported in a non-blogger non-XML (ick) format. So I'll be backpatching by copy-and-paste when time allows.

Happy blog reading, you half-dozen readers! :)

A final suggested read

David Reed passed along a pointer to this paper by Dan Geer:

A Time for Choosing

Please read it, and understand the founding spirit of the Internet. And with that, I say goodbye to Oracle.

I'm leaving Oracle, and switching gears

15 years ago I was finishing up last-minute changes at NRL while getting ready to move coasts. While I'm not moving coasts, I'm at the point where I'm finishing up last-minute changes again.

I'm leaving Oracle this week, and will be trying something a bit different after that. I've been doing IPsec or at least TCP/IP related work for the entirety of my time at Sun. I expect to be back in TCP/IP-land relatively soon, but I will be learning some new-to-me technologies in the immediate future.

I've met and worked with some extraordinary people during my time at Sun. I hope to keep in touch with them after I depart. If any of you half-dozen readers wish to keep up, I'd suggest following my Twitter feed until I decide whether or not I find a new home for this blog. I'm also findable on Facebook and LinkedIn for those so inclined.

MAC-then-encrypt - also harmful, also hard to do in Solaris

Hello again!

Kenny Paterson's once again turning the theoretical into practical. This time he's pointed out that if one configures IPsec to MAC-then-encrypt (do packet authentication first, THEN encrypt the packet), one is open to cryptographic attack. Here's a citation for his ACM CCS paper.

The good news is that we cannot configure the IPsec SPD to perform MAC-then-encrypt at all. One could configure transport mode to just MAC, then have the packet transit a tunnel that just encrypts, but then you'll see warnings about the encryption-only tunnel configuration. This has been true for a LONG time (starting with S9, maybe even S8).

So basically, we don't make it easy for you to shoot yourself in the foot this way. You really have to try, and as I pointed out earlier, the encryption-only part will warn you.

Thinking about the Birthday Problem on my Birthday, as it applies to my Birthday Present

My birthday is upon me.

My birthday present was an iPhone 4. Yeah, I got it early, but it was nice to have for my just-finished vacation drive. I noticed that when I'd reshuffle the 1763 songs on there, I'd more often than not hit a collision with a song I swear I'd heard during the previous shuffling. Time for some math...

The Birthday Problem (or Birthday Paradox, not because it's a real paradox, but because it's counterintuitive) shows that it only takes 23 people to be in the same room before the chances that two of them share a birthday are equivalent to a coin flip. The link above shows how one derives this. Basically, as you keep adding people, the probability of there NOT being a shared birthday decreases. That probability hits near-enough to 50% at 23 people.

I figured if I would have listened/remembered 30 songs from a previous shuffle. That's 2-3 hours of music, not a lot when you're driving all day. So if I accidentally shook my iPhone and reshuffled the songs, how many would I need to hear until I had a coinflip's chance of hearing a repeat from the previous 30?

Basically, the probability of NOT hearing a previously-heard song was (1763-30) / 1763. If that wasn't a repeat, the probability of another non-collision would be (1762 - 30) / 1762. Note that unlike the birthday problem, I'm decrementing the denominator as well. This is because I'm not going to hear the same song twice in a random shuffle.

I wrote a C program (because I hack way too much ON code) to compute things. Here it is:

#include <stdio.h>

int
main(int argc, char *argv[])
{
        double p;
        int i, listened, total, tries;

        if (argc != 4) {
                fprintf(stderr,
                    "usage:  ipod [listened-songs] [total-songs] [tries]\n");
                return (1);
        }

        p = 1.0;
        listened = atoi(argv[1]);
        total = atoi(argv[2]);
        tries = atoi(argv[3]);

        for (i = 0; i < tries; i++)
                p *= (double)(total - listened - i) / (double)(total - i);

        printf("P(NO repeat for %d on the second playthough): %lf%%\n", tries,
            p * 100.0);
        printf("P(Repeat for %d on the second playthough): %lf%%\n", tries,
            (1 - p) * 100.0);
        return (0);
}
Turns out, I need to hear 40 songs to have a coinflip's chance of hearing one of the previous 30 songs I heard before reshuffling the 1763 total songs.

mactavish(~/sources)[0]% ./a.out 30 1763 40
P(NO repeat for 40 on the second playthough): 49.942794%
P(Repeat for 40 on the second playthough): 50.057206%
mactavish(~/sources)[0]% 
The above program should work for any sized iPod/iPhone collection, or any sized song-memory/patience. I really hope I got the math/derivation right. Any probability wizards in the audience can feel free to school me in the comments section.

Do a "pkg image-update" with multiple zones!!!

Hello you half-dozen readers!

Recently I reinstalled my home server to OpenSolaris, build 130. I used zfs send and zfs recv to recover my relevant bits of data. I also constructed new zones, this time using ipkg zones.

Using ipkg zones takes a bit of acclimation. The biggest thing to note is that if you need a specific software package, you have to use pkg install in the zone you wish to have the software. For example, I have three zones:
  • The Global, internal-only, server zone - My global zone spends most of its time without a default route, serving NFSv4 and anything else I can think of only to my local LAN. If I need a new service, I temporarily add a global route, and pkg install away.
  • The Webserver zone - Just like it says. I needed Apache here, and had to pkg install Apache here.
  • The Router/NAT/IPsec-remote-access/Firewall zone - If you're going to put potential targets on the Internet, why put the global zone there? Especially with Crossbow VNICs and IP Instances!
So I got all of these zones, and the global zone isn't even net-attached most of the time? More interesting still, I need to upgrade all of these zones.

I posed this problem to pkg-discuss@opensolaris.org. Right now, pkg image-update won't upgrade the non-global zones. Worse still, I need to upgrade a zone that's also acting as my NAT and router. Luckily for me, Ed Pilatowicz gave me some good advice: i do have one other workaround/suggestion you could try. after you do
an image-update of your global zone. before rebooting, use beadm to
mount the new image on /a. then you can try doing "pkg -R
/a/path_to_your_zone/root image-update" for each of your zones. this
will probably work as long as your always image-update'ing to the latest
bits in the repository (and no new images get pushed to the repository
in between all the image-update opreations.)
So I took Ed at his word.

Even if you have an ultra-paranoid global zone, you need to get it talking to an IPS repository. Either temporarily add an off-link route like I do, or have a local repository handy. Proceed and pkg image-update your global zone. Make sure you use --be-name to pick a BE name that you'll remember.

Next, you literally beadm mount new-be-name /mnt and for each zone root directory (while still able to reach the repository from your global zone) do pkg -R zone-root-path image-update. For my own example, I did:
  • pkg image-update --be-name 132
  • beadm mount 132 /mnt
  • pkg -R /mnt/export/home/webserver/root image-update
  • pkg -R /mnt/export/home/router/root image-update
  • beadm umount 132
  • reboot
This worked quite well for me moving up from 130 to 132. Just make sure your global zone can reach the repository, and you should be golden.

Dan's blog is powered by blahgd