Kebe Says - Dan McDonald's Blog

Now you can boot SmartOS off of a ZFS pool

Booting from a zpool

The most recent published biweekly release of SmartOS has a new feature I authored: the ability to manage and boot SmartOS-bootable ZFS pools.

A few people read about this feature, and jumped to the conclusion that the SmartOS boot philosophy, enumerated here:

  • The "/" filesystem is on a ramdisk
  • The "/usr" filesystem is read-only
  • All of the useful state is stored on the zones ZFS pool.

was suddenly thrown out the window. Nope.

This change is the first phase of a plan to free SmartOS, or Triton, from depending on ISO images or USB sticks to boot.

The primary thrust of this specific SmartOS change was to allow installation-time enabling of a bootable zones pool. The SmartOS installer now allows one to specify a bootable pool, either one created during the "create my special pools" shell escape, or just by specifying zones.

A secondary thrust of this change was to allow running SmartOS deployments to upgrade their zones pools to be BIOS bootable (if the pool structure allows booting), OR to create a new pool with new devices (and use zpool create -B) to be dedicated to boot. For example:

smartos# zpool create -B standalone c3t0d0
smartos# piadm bootable -e standalone
smartos# piadm bootable
standalone                     ==> BIOS and UEFI
zones                          ==> non-bootable
smartos# 

Under the covers

Most of what’s above can be gleaned from the manual page. This section will discuss what the layout of a bootable pool actually looks like, and how the piadm(1M) command both sets things up and expects things to BE set up.

Bootable pool basics

The piadm bootable command will indicate if a pool is bootable at all via the setting of the bootfs property on the pool. That gets you the BIOS bootability check, which admittedly is an assumption. The UEFI check happens by finding the disk's s0 slice, and seeing if it's formatted as pcfs, and if the proper EFI System Partition boot file is present.
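If you want to approximate those checks by hand, here is a rough sketch, reusing the standalone pool and c3t0d0 disk from the example above. The EFI/Boot/bootx64.efi path is my assumption about which boot file loader drops on the ESP, and /mnt is just a convenient mountpoint:

smartos# zpool get bootfs standalone
NAME        PROPERTY  VALUE            SOURCE
standalone  bootfs    standalone/boot  local
smartos# fstyp /dev/dsk/c3t0d0s0
pcfs
smartos# mount -F pcfs /dev/dsk/c3t0d0s0 /mnt
smartos# ls /mnt/EFI/Boot
bootx64.efi
smartos# umount /mnt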

bootfs layout

For standalone SmartOS booting, bootfs is supposed to be mounted on "/" with the pathname equal to the bootfs name. By convention, we prefer POOL/boot. Let’s take a look:

smartos# piadm bootable
zones                          ==> BIOS and UEFI
smartos# piadm list
PI STAMP           BOOTABLE FILESYSTEM            BOOT BITS?   NOW   NEXT  
20200810T185749Z   zones/boot                     none         yes   yes  
20200813T030805Z   zones/boot                     next         no    no   
smartos# cd /zones/boot
smartos# ls -lt
total 9
lrwxrwxrwx   1 root     root          27 Aug 25 15:58 platform -> ./platform-20200810T185749Z
lrwxrwxrwx   1 root     root          23 Aug 25 15:58 boot -> ./boot-20200813T030805Z
drwxr-xr-x   3 root     root           3 Aug 14 16:10 etc
drwxr-xr-x   4 root     root          15 Aug 13 06:07 boot-20200813T030805Z
drwxr-xr-x   4 root     root           5 Aug 13 06:07 platform-20200813T030805Z
drwxr-xr-x   4 1345     staff          5 Aug 10 20:30 platform-20200810T185749Z
smartos# 

Notice that the Platform Image stamp 20200810T185749Z is currently booted, and will be booted the next time. Notice, however, that there are no “BOOT BITS”, also known as the Boot Image, for 20200810T185749Z; instead, the 20200813T030805Z boot bits are employed. This allows a SmartOS bootable pool to update just the Platform Image (à la Triton) without altering loader. If one utters piadm activate 20200813T030805Z, then things will change:

smartos# piadm activate 20200813T030805Z
smartos# piadm list
PI STAMP           BOOTABLE FILESYSTEM            BOOT BITS?   NOW   NEXT  
20200810T185749Z   zones/boot                     none         yes   no   
20200813T030805Z   zones/boot                     next         no    yes  
smartos# ls -lt
total 9
lrwxrwxrwx   1 root     root          27 Sep  2 00:25 platform -> ./platform-20200813T030805Z
lrwxrwxrwx   1 root     root          23 Sep  2 00:25 boot -> ./boot-20200813T030805Z
drwxr-xr-x   3 root     root           3 Aug 14 16:10 etc
drwxr-xr-x   4 root     root          15 Aug 13 06:07 boot-20200813T030805Z
drwxr-xr-x   4 root     root           5 Aug 13 06:07 platform-20200813T030805Z
drwxr-xr-x   4 1345     staff          5 Aug 10 20:30 platform-20200810T185749Z
smartos# 

piadm(1M) manipulates symbolic links in the boot filesystem to set versions of both the Boot Image (i.e. loader) and the Platform Image.
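Since it's all just symlinks, here is a rough sketch of what that piadm activate amounted to. This is not the actual implementation (piadm does this, and its bookkeeping, for you):

smartos# cd /zones/boot
smartos# rm -f platform boot
smartos# ln -s ./platform-20200813T030805Z platform
smartos# ln -s ./boot-20200813T030805Z boot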

Home Data Center 3.0 -- Part 2: HDC's many uses

In the prior post, I mentioned a need for four active ethernet ports. These four ports are physical links to four distinct Ethernet networks. Joyent's SmartOS and Triton characterize these with NIC Tags. I just view them as distinct networks. They are all driven by the illumos igb(7d) driver (hmm, that man page needs updating) on HDC 3.0, and I'll specify them now:

  • igb0 - My home network.
  • igb1 - The external network. This port is directly attached to my FiOS Optical Network Terminal's Gigabit Ethernet port.
  • igb2 - My work network. Used for my workstation, and "external" NIC Tag for my work-at-home Triton deployment, Kebecloud.
  • igb3 - Mostly unused for now, but connected to Kebecloud's "admin" NIC Tag.
The zones abstraction in illumos allows not just containment, but a full TCP/IP stack to be assigned to each zone. This makes a zone feel more like a proper virtual machine in most cases. Many illumos distros are able to run a full VMM as the only process in a zone, which ends up delivering a proper virtual machine. As of this post's publication, however, I'm only running illumos zones, not full VM ones. Here's their list:
(0)# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              ipkg     shared
   1 webserver        running    /zones/webserver               lipkg    excl  
   2 work             running    /zones/work                    lipkg    excl  
   3 router           running    /zones/router                  lipkg    excl  
   4 calendar         running    /zones/calendar                lipkg    excl  
   5 dns              running    /zones/dns                     lipkg    excl  
(0)# 
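All of the non-global zones run exclusive (excl) IP stacks. For a flavor of how such a zone gets wired up, here's a hedged zonecfg(1M) sketch; the vnic name web0 and the lofs paths are illustrative, not my actual configuration:

(0)# zonecfg -z webserver
zonecfg:webserver> set ip-type=exclusive
zonecfg:webserver> add net
zonecfg:webserver:net> set physical=web0
zonecfg:webserver:net> end
zonecfg:webserver> add fs
zonecfg:webserver:fs> set type=lofs
zonecfg:webserver:fs> set special=/export/web
zonecfg:webserver:fs> set dir=/export/web
zonecfg:webserver:fs> end
zonecfg:webserver> commit
zonecfg:webserver> exit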
Their zone names correspond to their jobs:
  • global - The illumos global zone is what exists even in the absence of other zones. Some illumos distros, like SmartOS, encourage minimizing what a global zone has for services. HDC's global zone serves NFS and SMB/CIFS to my home network. The global zone has the primary link into the home network. HDC's global zone has no default route, so any operations that need out-of-the-house networking either go through another zone (e.g. DNS lookups) or require a default route to be temporarily added (e.g. NTP chimes, `pkg update`).
  • webserver - Just like the name says, this zone hosts the web server for kebe.com. This zone uses lofs(7FS), the loopback virtual file system, to inherit subdirectories from the global zone. I edit blog entries (like this one) for this zone via NFS from my laptop. The global zone serves NFS, but the files I'm editing are not only available in the global zone, but are also lofs-mounted into the webserver zone. The webserver zone has a vnic (see here for details about a vnic, the virtual network interface controller) link to the home network, but has a default route, and the router zone's NAT (more later) forwards ports 80 and 443 to this zone. Additionally, the home network DHCP server lives here, for no other reason than, "it's not the global zone."
  • work - The work zone is new in the past six years and, as of recently, eschews lofs(7FS) in favor of delegated ZFS datasets. A delegated ZFS dataset, a proper filesystem in this case, is assigned entirely to the zone. This zone also has the primary (and only) link to the work network, a physical connection (for now unused) to my work Triton's admin network, and an etherstub vnic (see here for details about an etherstub) link to the router zone. The work zone itself is a router for work network machines (as well as serving DNS for the work network), but since I only have one public IP address, I use the etherstub to link it to the router zone. The zone, as of recent illumos builds, can further serve its own NFS. This allows even less global-zone participation with work data, and it means work machines do not need backchannel paths to the global zone for NFS service. The work zone has a full illumos development environment on it, and performs builds of illumos rather quickly. It also has its own Unbound (see the DNS zone below) for the work network.
  • router - The router zone does what the name says. It has a vnic link to the home network and the physical link to the external network (see the dladm sketch after this list). It runs ipnat to NAT etherstub work traffic or home network traffic to the Internet, and redirects well-known ports to their respective zones. It does not use a proper firewall, but has IPsec policy in place to drop anything that isn't matched by ipnat, because in a no-policy situation, ipnat lets unmatched packets arrive on the local zone. The router zone also runs the (alas, still closed-source) IKEv1 daemon to allow me remote access to this server while I'm away. It uses an old test tool from the pre-Oracle Sun days that a few of you half-dozen readers will know by name. We have a larval IKEv2 out in the community, and I'll gladly switch to that once it's available.
  • calendar - Blogged about when first deployed, this zone's sole purpose is to serve our calendar both internally and externally. It uses the Radicale server. Many of my complaints from the prior post have been alleviated by subsequent updates. I wish the authors understood interface stability a bit better (jumping from 2.x to 3.0 was far more annoying than it needed to be), but it gets the job done. It has a vnic link to the home network, a default route, and gets calendaring packets shuffled to it by the router zone so my family can access the calendar wherever we are.
  • dns - A recent switch to OmniOSce-supported NSD and Unbound encouraged me to bring up a dedicated zone for DNS. I run both daemons here, and have the router zone redirect public kebe.com requests here to NSD. The Unbound server services all networks that can reach HDC. It has a vnic link to the home network, and a default route.
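The vnics and the etherstub mentioned above are created in the global zone with dladm(1M). As a minimal sketch (the link names stub0, work0, router0, and router1 are purely illustrative):

(0)# dladm create-etherstub stub0
(0)# dladm create-vnic -l stub0 work0
(0)# dladm create-vnic -l stub0 router1
(0)# dladm create-vnic -l igb0 router0

The etherstub acts as a private virtual switch: work0 goes to the work zone, router1 is the router zone's end of the etherstub, and router0 is the router zone's vnic onto the home network.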

The first picture shows HDC as a single entity, and its physical networks. The second picture shows the zones of HDC as Virtual Network Machines, which should give some insight into why I call my home server a Home Data Center.

HDC, physically
HDC, logically

Home Data Center 3.0 -- Part 1: Back to AMD

Twelve years ago I built my first Home Data Center (HDC). Six years ago I had OmniTI's Supermicro rep put together the second one.

Unlike last time, I'm not going to recap the entirety of HDC 2.0. I will mention briefly that since its 2014 inception, I've only upgraded its mirrored spinning-rust disk drives twice: once from 2TB to 6TB, and last year from 6TB to 14TB. I'll detail the current drives in the parts list.

Like last time, and the time before it, I started with a CPU in mind. AMD has been on a tear with Ryzen and EPYC. I still wanted low-ish power, but since I use some of HDC's resources for work or the illumos community, I figured a core-count bump would be worth the cost of some watts. Lucky me, the AMD Ryzen 7 3700x fit the bill nicely: Double the cores & threads with a 20W TDP increase.

Unlike last time, but like the time before it, I built this one from parts myself. It took a little digging, and I made one small mistake in parts selection, but otherwise it all came together nicely.

Parts

  • AMD Ryzen 7 3700x - It supports up to 128GB of ECC RAM, it's double the CPU of the old HDC for only 50% more TDP wattage. It's another good upgrade.
  • Noctua NH-U12S (AM4 edition) CPU cooler - I was afraid the stock cooler would cover the RAM slots on the motherboard. Research suggested the NH-U12S would prevent this problem, and the research panned out. Also Noctua's support email, in spite of COVID, has been quite responsive.
  • ASRock Rack X470D4U - While only having two Gigabit Ethernet (GigE) ports, this motherboard was the only purpose-built Socket AM4 server motherboard. It has IPMI/BMC on its own Ethernet port (but you'll have to double-check it doesn't "failover" to your first GigE port). It has four DIMM slots, and with the current BIOS (mine shipped with it), supports 128GB of RAM. There are variants with two 10 Gigabit Ethernet (10GigE) ports, but I opted for the less expensive GigE one. If I'd wanted to wait, there's a new, not yet available, X570 version, whose more expensive variant has both two 10GigE AND two GigE ports, which would have saved me from needing...
  • Intel I350 dual-port Gigabit Ethernet card - This old reliable is well supported and tested. It brings me up to the four ethernet ports I need.
  • Nemix RAM - 4x32GB PC3200 ECC Unbuffered DIMMs - Yep, like HDC 2.0, I maxxed out my RAM immediately. 6 years ago I'd said 32GB would be enough, and for the most part that's still true, except I sometimes wish to perform multiple concurrent builds, or memory-map large kernel dumps for debugging. The vendor is new to me, and did not have a lot of reviews on Newegg. I ran 2.5 passes of memtest86 against the memory, and it held up under those tests. Nightly builds aren't introducing bitflips, which I saw on HDC 1.0 when it ran mixed ECC/non-ECC RAM.
  • A pair of 500GB Samsung 860 EVO SATA SSDs - These are slightly used, but they are mirrored, and partitioned as follows:
    • s0 -- 256MB, EFI System Partition (ESP)
    • s1 -- 100GB, rpool for OmniOSce
    • s2 -- 64GB, ZFS intent log device (slog)
    • s3 -- 64GB, unused, possible future L2ARC
    • s4 -- 2GB, unused
    • The remaining 200-something GB is unassigned, and fodder for the wear-levellers. The motherboard HAS a pair of M.2 connectors for NVMe or SATA SSDs in that form-factor, but these were hand-me-downs, so free.
  • A pair of Western Digital Ultrastar (née HGST Ultrastar) HC530 14TB Hard Drives - These are beasts, and according to Backblaze stats released less than a week ago, their 12TB siblings hold up very well with respect to failure rates.
  • Fractal Design Meshify C case - I'd mentioned a small mistake, and this case was it. NOT because the case is bad... the case is quite good, but because I bought the case thinking I needed to optimize for the microATX form factor, and I really didn't need to. The price I paid for this was the inability to ever expand to four 3.5" drives if I so desire. In 12 years of HDC, though, I've never exceeded that. That's why this is only a small mistake. The airflow on this case is amazing, and there's room for more fans if I ever need them.
  • Seasonic Focus GX-550 power supply - In HDC 1.0, I had to burn through two power supplies. This one has a 10 year warranty, so I don't think I'll have to stress about it.
  • OmniOSce stable releases - Starting with HDC 2.0, I've been running OmniOS, and its community-driven successor, OmniOSce. The every-six-month stable releases strike a good balance between refreshes and stability.

I've given two talks on how I use HDC. Since the last of those was six years ago, I'm going to stop now, and dedicate the next post to how I use HDC 3.0.

Now self-hosted at kebe.com

Let's throw out the first pitch.

I've moved my blog over from blogspot to here at kebe.com. I've recently upgraded the hardware for my Home Data Center (the subject of a future post), and while running the Blahg software doesn't require ANY sort of hardware upgrade, I figured since I had the patient open I'd make the change now.

Yes it's been almost five years since last I blogged. Let's see, since the release of OmniOS r151016, I've:

  • Cut r151018, r151020, and r151022.
  • Got RIFfed from OmniTI.
  • Watched OmniOS become OmniOSce with great success.
  • Got hired at Joyent and made more contributions to illumos via SmartOS.
  • Tons more I either wouldn't blog about, or just plain forgot to mention.
So I'm here now, and maybe I'll pick up again? The most prolific year I had blogging was 2007 with 11 posts, with 2011 being 2nd place with 10. Not even sure if I *HAVE* a half-dozen readers anymore, but now I have far more control over the platform (and the truly wonderful software I'm using).

While Blahg supports comments, I've disabled them for now. I might re-enable them down the road, but for now, you can find me on one of the two socials on the right and comment there.

Goodbye blogspot

First off, long time no blog!

This is the last post I'm putting on the Blogspot site. In the spirit of eating my own dogfood, I've now set up a self-hosted blog on my HDC. I'm sure it won't be hard for all half-dozen of you readers to move over. I'll have new content over there, at the very least the Hello, World post, a catchup post, and a HDC 3.0 post to match the ones for 1.0 and 2.0.

Coming Soon (updated)

This is STILL only a test, but with an update. The big question remains: how quickly can I bring over my old Google-owned blog entries?

(BTW Jeff, I could get used to this LaTeX-like syntax…)

You’ll start to see stuff trickle in. I’m using the hopefully-pushed-back-soon enhancements to Blahg that allow simple raw HTML for entries. I’ve done some crazy things to extract them, and maybe the hello-world post here will explain that.

From 0-to-illumos on OmniOS r151016

Today we updated OmniOS to its next stable release: r151016. You can click the link to see its release notes, and you may notice a brief mention of the illumos-tools package.

I want to see more people working on illumos. A way to help that is to get people started on actually BUILDING illumos more quickly. To that end, r151016 contains everything to bring up an illumos development environment. You can develop small on it, but this post is going to discuss how we make building all of illumos-gate from scratch easier. (I plan on updating the older post on small/focused compilation after ws(1) and bldenv(1) effectively merge into one tool.)

The first thing you want to do is install OmniOS. The latest release media can be found here, on the Installation page.

After installation, your system is a blank slate. You'll need to set a root password, create a non-root user, and finally add networking parameters. The OmniOS wiki's General Administration Guide covers how to do this.
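As a hedged sketch of those first-boot steps (the user name, shell, and NIC here are illustrative; the wiki is the authoritative reference):

omnios# passwd root
omnios# useradd -m -d /export/home/build -s /bin/bash build
omnios# passwd build
omnios# ipadm create-if igb0
omnios# ipadm create-addr -T dhcp igb0/v4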

I've added a new building illumos page to the OmniOS wiki that should detail how straightforward the process is. You should be able to kick off a full nightly(1ONBLD) build quickly enough. If you don't want to edit one of the omnios-illumos-* samples in /opt/onbld/env, just make sure you have a $USER/ws directory, clone illumos-gate or illumos-omnios into $USER/ws/testws, and use the template /opt/onbld/env/omnios-illumos-* file corresponding to the gate you cloned. For example:


omnios(~)[0]% mkdir ws
omnios(~)[0]% cd ws
omnios(~/ws)[0]% git clone https://github.com/illumos/illumos-gate/ testws

omnios(~/ws)[0]% /bin/time /opt/onbld/bin/nightly /opt/onbld/env/omnios-illumos-gate
You can then look in testws/log/log-<date&time>/mail_msg to see how your build went.

Quick Reminder -- tcp_{xmit,recv}_hiwat and high-bandwidth*delay networks

I was recently working with a colleague on connecting two data centers via an IPsec tunnel. He was using iperf (coming soon to OmniOS bloody along with netperf) to test the bandwidth, and was disappointed in his results.

The amount of memory you need to hold a TCP connection's unacknowledged data is the Bandwidth-Delay product. The defaults shipped in illumos are small on the receive side:


bloody(~)[0]% ndd -get /dev/tcp tcp_recv_hiwat
128000
bloody(~)[0]%
and even smaller on the transmit side:

bloody(~)[0]% ndd -get /dev/tcp tcp_xmit_hiwat
49152
bloody(~)[0]%
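To put numbers on that: a gigabit path with 40 ms of round-trip time needs roughly 125 MB/s × 0.04 s ≈ 5 MB in flight to keep the pipe full. At that latency, the 49152-byte transmit default above sustains only about 1.2 MB/s (under 10 Mbit/s), and the 128000-byte receive default only about 25 Mbit/s.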

Even on platforms with automatic tuning, the maximums used are often not set high enough.

Introducing IPsec into the picture adds additional latency (if not so much for encryption, thanks to AES-NI & friends, then for the encapsulation and checks). This is often enough to take what are normally good-enough maximums and invalidate them as too small. To change these on illumos, you can use the ndd(1M) command shown above, OR you can use the modern, persists-across-reboots ipadm(1M) command:


bloody(~)[1]% sudo ipadm set-prop -p recv_buf=1048576 tcp
bloody(~)[0]% sudo ipadm set-prop -p send_buf=1048576 tcp
bloody(~)[0]% ipadm show-prop -p send_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   send_buf              rw   1048576      1048576      49152        4096-1048576
bloody(~)[0]% ipadm show-prop -p recv_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   recv_buf              rw   1048576      1048576      128000       2048-1048576
bloody(~)[0]%

There's future work there, not only in increasing the upper bound (easy), but also in adopting automatic tuning so the maximum isn't just taken right off the bat.

New HDC service: Calendaring (or, The Limitation Game)

I'll start by stating my biases: I don't like data bloat like ASN.1, XML, or even bloaty protocols like HTTP. (Your homework: Would a 1980s-developed WAN-scale RPC have obviated HTTP? Write a paper with your answer to that question, with support.) I understand the big problems they attempt to solve. I also still think not enough people in the business were paying attention in OS (or Networking) class when seeing the various attempts at data representation during the 80s and 90s. Also, I generally like pushing intelligence out to the end-nodes, and in client/server models, this means the clients. CalDAV rubs me the wrong way on the first bias, and MOSTLY the right way on my second bias, though the clients I use aren't very smart. I will admit near-complete ignorance of CalDAV. I poked a little at its RFC, looking up how Alarms are implemented, and discovered that mostly, Alarm processing is a client issue. ("This specification makes no attempt to provide multi-user alarms on group calendars or to find out for whom an alarm is intended.")

I've configured Radicale on my Home Data Center. I need to publicly thank Lauri Tirkkonen (aka. lotheac on Freenode) for the IPS publisher which serves me up Radicale. Since my target audience is my family-of-four, I wasn't particularly concerned with its reported lack of scalability. I also didn't want to have CalDAV be a supplicant of Apache or another web server for the time being. If I decide to revisit my web server choices, I may move CalDAV to that new webserver (likely nginx). I got TLS and four users configured on stock Radicale.
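For flavor, here is a sketch of the shape of that configuration. Option names vary across Radicale versions, and the paths are illustrative, not mine:

[server]
hosts = 0.0.0.0:5232
ssl = True
certificate = /path/to/server.crt
key = /path/to/server.key

[auth]
type = htpasswd
htpasswd_filename = /path/to/users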

My job was to make an electronic equivalent of our family paper calendar. We have seven (7) colors/categories for this calendar (names withheld from the search engines): Whole-Family, Parent1, Parent2, Both-Parents, Child1, Child2, Both-Children. I thought, given iCal (10.6), Calendar.app (10.10), or Calendar (iOS), it wouldn't be too hard for these to be created and shared. I was mildly wrong.

I'm not sure if what I had to do was a limitation of my clients, of Radicale, or of CalDAV itself, but I had to create seven (7) different accounts, each with a distinct ends-in-'/' URL:

  • https://.../Whole-Family.ics/
  • https://.../Parent1.ics/
  • https://.../Parent2.ics/
  • https://.../Both-Parents.ics/
  • https://.../Child1.ics/
  • https://.../Child2.ics/
  • https://.../Both-Children.ics/
I had to configure N (large N) devices or machine-logins with these seven (7) accounts.

Luckily, Radicale DID allow me to restrict Child1's and Child2's write access to just their own calendars. Apart from that, we want the whole family to read all of the calendars. This means the colors are uniform across all of our devices (stored on the server). It also means any alarms (per above) trigger on ALL of our devices. This makes alarms (something I really like in my own Calendar) useless. Modulo the alarms problem (which can be mitigated by judicious use of iOS's Reminders app and a daily glance at the calendar), this seems to end up working pretty well, so far.

Both children recently acquired iPhones, which means that if I open this service outside our internal home network, we can schedule calendars no matter where we are, and get up-to-date changes no matter where we are. That will be extremely convenient.

I somewhat hope that one of my half-dozen readers will find something so laughably wrong with how I configured things that any complaints I make will be rendered moot. I'm not certain, however, that will be the case.

Toolsmiths - since everything is software now anyway...

A recent twitter storm occurred in light of last week's #encryptnews event.

I was rather flattered when well-known whistleblower Thomas Drake retweeted this response of mine:

The mention of "buying usable software" probably makes sense to someone who's used to dealing with Commercial, Off-The-Shelf (COTS) software. We don't live in a world where COTS is necessarily safe anymore. There was a period (which I luckily lived and worked in), where Defense Department ARPA money was being directed specifically to make COTS software more secure and high-assurance. Given the Snowden revelations, however, COTS can possibly be a vulnerability as much as it could be a strength.

In the seminal Frederick Brooks book, The Mythical Man-Month, he describes one approach to software engineering: The Surgical Team. See here and scroll down for a proper description. Note the different roles for such a team.

Given that most media is equivalent to software (easily copied, distributed, etc.), I wonder if media organizations shouldn't adopt certain types of those organizational roles that have been until now the domain of traditional software. In particular, the role of the Toolsmith should be one that modern media organizations adopt. Ignoring traditional functions of "IT", a toolsmith for, say, an investigative organization should be well-versed in what military types like to call Defensive Information Warfare. Beyond just the mere use of encryption (NOTE: ANYONE who equates encryption with security should be shot, or at least distrusted), such Toolsmiths should enable their journalists (who would correspond to the surgeon or the assistant in the surgical team model) to do their job in the face of strong adversaries. An entity that needs a toolsmith will also need a software base, and unless the entity has resources enough to create an entire software stack, that entity will need Free Open-Source Software (for various definitions of Free and Open I won't get into for fear of derailing my point).

I haven't been working in security much since the Solaris Diaspora, so I'm a little out of touch with modern threat environments. I suspect it's everything I'd previously imagined, just more real, and where the word "foreign" can be dropped from "major foreign governments". Anyone who cares about keeping their information and themselves safe should, in my opinion, have at least a toolsmith on their staff. Several organizations do, or at least have technology experts, like the ACLU's Christopher Soghoian, for example. The analogy could probably extend beyond security, but I wanted to at least point out the use of an effective toolsmith.

Dan's blog is powered by blahgd