[this post is basically me venting after an all-nighter; it's much harsher than it should have been]
An alternative title to this could be "if something looks too good to be true, it usually is."
We needed to bring 80+ desktop PCs from a pile of cheap parts to a usable state in about two weeks. Nothing to it, right? Mass installations and cloning are done every day, right? Well, no. I basically had two projects to pick from - Clonezilla and FOG - and picked FOG because Clonezilla is old and clunky, while FOG has a relatively nice and powerful web interface. As it turned out, either of them would have collapsed under the task...
The basic setup was to install a template computer with Windows 7 and Ubuntu 10.04 dual-boot and clone it to the 80-some other computers. To do this, we figured two weeks would be more than enough. But no cigar.
Basically, we spent 4/5ths of the planned time installing only the Windows part of the system, with a huge amount of that time spent reconciling Windows 7 and our old Win2k3 Active Directory domain controller. Only then, basically two days before the deadline, did we install Ubuntu (thankfully, this was over in two afternoons - one for the OS and one to integrate it with the AD) and start working on imaging and distribution.
And this point - cloning the systems - is where we got such a hugely unexpected mess of problems that it left a "will not touch this in the next 10 years with a shitty stick" type of feeling for the whole process. I will simply skip the preamble dealing with configuring our network (AD-controlled) for PXE, reconfiguring firewalls to allow TFTP and NFS from the FOG server to the labs, hitting Samba and Winbind on the head to actually accept the AD etc. and go straight for the meat.
Well, taking an image shouldn't be a problem, right? Erm, no. Not if it turns out we have a network problem with packet loss, which turns the 1 Gbit/s streaming TCP imaging process into a basically sub-100 Mbit/s one. After fiddling around it was clear that debugging the network would take more time than we had to spare, so we sucked it up and went with it.
Well, we now have the image - it shouldn't be too hard to deploy it? After all, a test run we did earlier in the setup of the template system showed absolutely no problems? No way.
What would you think of the error message "Bad partition table - invalid partition signature 0" when we attempted to deploy the image? FOG spews this out in the "Checking disks" phase and refuses to consider any other option but stopping loudly.
It turns out it's a bug in FOG. I haven't found any explanation of it except for a few mysterious "HELP ME! IT DOESN'T WORK!" messages in forums which went unanswered. Here is what is going on: FOG creates its image (of type "single drive, all partitions, non-resizable") in the following way:
- record the MBR
- for each partition: read the free/allocated space map from the file system, dump the allocated data
The deployment process goes similarly:
- restore the MBR
- for each partition: dump the recorded data to the partition
Looks good but contains two vital flaws. This section is about Flaw #1: if the image contains an extended partition (basically, a nested partition table inside an MBR partition), then when the MBR is restored (with dd) on the deployment computer, it will contain a record of the extended partition, but the nested partition table to which the MBR now points will contain garbage. In particular, on a freshly minted hard drive, it will contain All Zeroes, which, once the MBR is written out and partprobe invoked, will cause partprobe to drop dead of consternation with the above error.
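To make the failure a bit more concrete, here is a minimal Python sketch of the kind of check that trips over such a disk. It is my own reconstruction and not FOG's or partprobe's actual code: it reads the MBR, finds the first extended partition entry, and looks for the 0x55 0xAA signature at the end of the sector that entry points to. The function names and the simulated disk (`blank_disk_with_restored_mbr`, an extended partition starting at sector 2048) are made up for illustration.

```python
import io
import struct

SECTOR = 512
EXTENDED_TYPES = {0x05, 0x0F, 0x85}  # CHS extended, LBA extended, Linux extended


def has_boot_signature(sector: bytes) -> bool:
    """True if the 512-byte sector ends with the 0x55 0xAA signature."""
    return len(sector) >= SECTOR and sector[510:512] == b"\x55\xaa"


def extended_partition_start(mbr: bytes):
    """Return the starting LBA of the first extended partition entry, or None."""
    for i in range(4):
        entry = mbr[446 + 16 * i: 446 + 16 * (i + 1)]
        if entry[4] in EXTENDED_TYPES:  # byte 4 of an entry is the partition type
            return struct.unpack_from("<I", entry, 8)[0]  # bytes 8..11: start LBA
    return None


def check_disk(disk) -> str:
    """disk is a seekable binary file object (a device node or an image file)."""
    disk.seek(0)
    mbr = disk.read(SECTOR)
    if not has_boot_signature(mbr):
        return "MBR has no valid signature"
    start = extended_partition_start(mbr)
    if start is None:
        return "no extended partition"
    disk.seek(start * SECTOR)
    if not has_boot_signature(disk.read(SECTOR)):
        return "extended partition table has invalid signature"
    return "ok"


def blank_disk_with_restored_mbr(ext_start_lba=2048, sectors=4096):
    """Simulate a zero-filled disk onto which only the MBR has been restored."""
    img = bytearray(sectors * SECTOR)
    entry = bytearray(16)
    entry[4] = 0x05  # extended partition
    struct.pack_into("<II", entry, 8, ext_start_lba, sectors - ext_start_lba)
    img[446:462] = entry
    img[510:512] = b"\x55\xaa"
    return io.BytesIO(bytes(img))


print(check_disk(blank_disk_with_restored_mbr()))
```

On the simulated blank disk this reports the invalid signature. Pointing `check_disk` at a real device (for example `open("/dev/sda", "rb")`, as root) should work the same way, but I have only sketched it against the in-memory image here.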
The solution to this is as follows: PXE boot the FOG menu, choose the "Debug" option, in the debug console you are given invoke "fdisk /dev/sda", watch gleefully as fdisk complains about the "Zero" problem, accept its offer to fix the problem for you and write the partition table back (with "w"). Now, when restoring the image, FOG will restore the MBR, which will again point to the same garbage but with one important difference: the signature bytes of the nested partition table will Not Be Zero and all will be well, for now.
Apparently, no one has caught this problem before because no one images with extended partitions onto empty disk drives?
Ok now, FOG restored the image. Surely the troubles are over, right? Of course not. As it turns out, the imaged computers are unbootable. No error messages, nothing from either GRUB or the Windows boot loader, just a lonely text mode cursor blinking on the screen. Changing the active partition on such systems has precisely no effect.
Which brings us to Flaw #2 of the FOG imaging process: it records *used blocks in the file system*. Unix boot loaders are actually written near the superblock (in front of or behind it), which is *not* counted by the used space bitmaps. So FOG, in all its wisdom, skipped recording GRUB, and since GRUB was responsible for the dual-boot magic, everything went "poof".
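Here is a toy Python model of why that loses the boot loader. It is not FOG's code, just the "copy only what the file system says is allocated" idea boiled down to a four-block fake partition; the block size, the contents, and the function names are all invented for the illustration.

```python
BLOCK = 4  # toy block size in bytes


def image_used_blocks(partition: bytes, used: list) -> dict:
    """Record only the blocks that the file system bitmap marks as used."""
    return {
        i: partition[i * BLOCK:(i + 1) * BLOCK]
        for i, flag in enumerate(used)
        if flag
    }


def restore(image: dict, nblocks: int) -> bytes:
    """Write the recorded blocks onto a zero-filled target of nblocks blocks."""
    target = bytearray(nblocks * BLOCK)
    for i, data in image.items():
        target[i * BLOCK:(i + 1) * BLOCK] = data
    return bytes(target)


# Block 0 holds boot code, which the file system does not count as allocated.
# Blocks 1 and 3 hold file data; block 2 is free.
source = b"BOOT" + b"data" + b"\x00\x00\x00\x00" + b"more"
used_bitmap = [False, True, False, True]

clone = restore(image_used_blocks(source, used_bitmap), len(used_bitmap))
print(clone[:BLOCK])           # the boot block on the clone: all zeroes
print(clone[BLOCK:2 * BLOCK])  # file data survives: b'data'
```

The file data comes through intact, which is why the restored file systems looked perfectly healthy, while the block the bitmap never mentioned comes back as zeroes.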
The solution is this: boot the imaged computer from a CD containing "Super GRUB2 CD", ask this recovery tool to do its magic and locate any possible GRUB2 configurations on the drive, watch happily as it does its job properly and reads the grub.cfg file from the imaged file system, presenting you with boot options. Next, boot the installed Linux from the hard drive into recovery mode, select "Fix GRUB" from the menu presented there, and *then*, in addition to that, activate the single-user root console (or netroot, if you need it), and do "grub-install /dev/sda".
NOW, reboot and all is well.
Apparently people have caught this problem earlier, but I could not find any details except a few forum posts saying "yeah, both FOG and Clonezilla suck if you need to clone dual-boot systems". It looks like this problem only affects dual-boot images for some reason. Maybe FOG tries to be overly clever instead of dumping the first few MBs of active partitions verbatim to the image?
Two things should make clear how badly all this went: 1) all production imaging was done unicast because we couldn't reconfigure our network for cross-IP-network multicast (we have a large network), with 1 Gbit/s uplinks to the labs and 100 Mbit/s switches inside the labs; and 2) the two recovery steps for FOG I've described above needed to be done for every single imaged machine, pre- and post-imaging. In addition to that, there were many broken runs: imaging 30 or so computers over 100 Mbit/s unicast, only to find the imaging had failed because of a combination of the above factors. Every boot device change from PXE to CD to HDD required going into the BIOS, which was password protected (because we were trying to implement a security policy). Tedious. It would have been easier to just buy Ghost or Acronis.
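For a feel of what one broken run costs, here is a back-of-the-envelope Python sketch. The link speeds are the ones mentioned above; the 20 GB image size is a made-up number (not our real image), and the model assumes all clients in a run share a single 1 Gbit/s uplink and ignores protocol overhead, packet loss, and server disk speed.

```python
def unicast_hours(image_gb, clients, uplink_mbit=1000.0, port_mbit=100.0):
    """Rough hours to push one image to `clients` machines at once over unicast."""
    # Each client is capped by its own switch port and by its share of the uplink.
    per_client_mbit = min(port_mbit, uplink_mbit / clients)
    image_mbit = image_gb * 8000.0  # 1 GB = 8000 Mbit (decimal units)
    return image_mbit / per_client_mbit / 3600.0


HYPOTHETICAL_IMAGE_GB = 20  # made-up size, for illustration only

for n in (1, 10, 30):
    hours = unicast_hours(HYPOTHETICAL_IMAGE_GB, n)
    print(n, "clients:", round(hours, 2), "hours")
```

Under these assumptions a 30-machine run takes roughly an hour and twenty minutes in the best case, and every failed run throws that away before the per-machine fixes even start.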
I truly believe all this is making me mentally ill. I literally visually parsed MBRs, debugged boot code in GRUB and other Linux code to find out what happened here and why it did not work. I know way too much about all this to keep my sanity.
I want to be a tourist guide in some nice scenic country.
Update: Although this post sounds pretty harsh, the FOG project is actually pretty nice, and once I learned what works and what doesn't, I have had no problems using it and even recommending it to friends. We originally chose it because, being Linux-based, it had a better chance of dealing with Linux, and (of course, no surprise here) to save per-seat money on a potentially large deployment of cloned machines (this is only the first wave). It ended up being pretty popular, and even my colleagues like it, so it is probably here to stay.