Archive for the 'Sysadmin' Category

Datacenters and power consumption

Thursday, July 30th, 2009

Now that we’re in 2009, we’ve been increasingly aware for a few years that energy is a valuable resource that should be saved. Where I work, I manage a medium-sized datacenter of 160 servers, which draw approximately 28 kW of power. 28,000 watts: that’s about 470 light bulbs of 60 W each, always on, burning day and night.

To this we can add the cooling system’s consumption, which we don’t monitor, but we can safely count 3 to 8 kW for the three cooling units, depending on the outside temperature. That’s what’s fun about datacenters: we have to burn electricity to cool down the room heated by the servers’ electricity consumption. Consider it the equivalent of putting the oven in your fridge when you bake a cake.

We’ve been trying to reduce this consumption a bit. There’s an article that discusses shutting down servers when they’re not in use, i.e. at night or during the week-end.

Well, we thought of it first! :-) In my opinion, it isn’t possible to shut down every server at night: the backups run during the night, and saving its data is much more important to a company than saving electricity (sadly, in some sense…). So we can’t shut down production servers with important data on them. Luckily, where I work, a very large part of the datacenter is used for development and testing: we do cluster-oriented storage, so we have clusters, and about 140 of our 160 servers are completely unused at night, when developers go home.

So, a year and a half ago already, I implemented a way for developers to tell whether their test servers could be shut down at night or not (in case they have long-running tests on them). It’s not a huge success, mainly because people choose 24/24 instead of 12/24 “just in case”, I think, but with approximately 30 servers down every night and during week-ends, we still save 8 kW more than half the time. Still better than nothing…
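
To put a number on “more than half the time”, here is a minimal sketch. It assumes that 12/24 means powered on from 08:00 to 20:00, Monday to Friday; the actual hours are not stated here, and the function names are made up for illustration.

```shell
#!/bin/sh
# Assumed schedule (not from the post): "12/24" servers are up 08:00-20:00,
# Monday to Friday, and off the rest of the time.
# dow: 1=Monday .. 7=Sunday; hour: 0..23
is_off_hours() {
    dow=$1; hour=$2
    [ "$dow" -ge 6 ] && return 0             # week-end: off all day
    [ "$hour" -lt 8 ] || [ "$hour" -ge 20 ]  # weekday: off outside 08-20
}

# Count the off hours in a full week by brute force.
count_off_hours() {
    off=0
    for d in 1 2 3 4 5 6 7; do
        h=0
        while [ "$h" -lt 24 ]; do
            if is_off_hours "$d" "$h"; then off=$((off + 1)); fi
            h=$((h + 1))
        done
    done
    echo "$off"
}
```

Under that assumption, count_off_hours prints 108, i.e. 108 of the 168 hours in a week (about 64%), and 8 kW saved during those hours comes to 8 × 108 = 864 kWh per week.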

Besides, at home, I now put my laptop to sleep when I’m not in front of it. The saving is much smaller, and completely nullified by the server in the cupboard, but it’s still better than the days when I had three servers in the cupboard and didn’t put my laptop to sleep!

RAID1 array enlarging

Wednesday, March 4th, 2009

Here’s a quick recipe to easily enlarge a RAID1 array with the least possible downtime, using Linux 2.6 and mdadm.

We’ll start with a two-disk setup, /dev/sda and /dev/sdb, containing two arrays, /dev/md0 and /dev/md1. /dev/md0 is mounted on / and /dev/md1 is mounted on /backup. We want to grow /dev/md1 from 230GB to 898GB (switching from 250GB disks to 1TB disks).

/dev/md0 has /dev/sda1 and /dev/sdb1, /dev/md1 has /dev/sda3 and /dev/sdb3, while swap partitions are on /dev/sda2 and /dev/sdb2.

Obligatory warning: Use your own brain when following this procedure. Don’t follow me blindly – it’s your data at stake.

Booting on degraded array: don’t shoot yourself in the foot.

When you remove one of the existing disks, your computer won’t be able to boot if grub isn’t installed on the other disk’s boot sector, so make sure that grub is installed on both disks’ MBRs:
#grub
grub> find /boot/grub/menu.lst
(hd0,0)
(hd1,0)
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
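
The same grub session can also be run non-interactively, which helps when it has to be repeated after each disk swap. This is only a sketch: it assumes GRUB legacy (the version that uses menu.lst) and that /boot is on the first partition of every disk, as in the session above. The function name is made up.

```shell
#!/bin/sh
# Print the GRUB legacy commands that install the boot loader on the MBR
# of disks hd0 .. hd(N-1). Assumes /boot is on partition 0 of each disk.
grub_setup_script() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf 'root (hd%d,0)\nsetup (hd%d)\n' "$i" "$i"
        i=$((i + 1))
    done
    echo quit
}

# Usage (as root), for the two-disk setup described here:
#   grub_setup_script 2 | grub --batch
```

Printing the commands first and piping them to grub only once they look right keeps the risky part explicit.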

Shut down the computer, remove sdb, put one of the new 1TB disks in its place, and reboot. Booting can take some time while the initrd’s mdadm tries to find the missing disk.

You’ll boot with degraded arrays, as shown here:
#cat /proc/mdstat
md0 : active raid1 sda1[0]
19534912 blocks [1/2] [U_]

md1 : active raid1 sda3[0]
223134720 blocks [1/2] [U_]

Now, we’ll dump sda’s partition table:
#sfdisk -d /dev/sda > partitions.txt

Edit the partitions.txt file to remove the size=xxxxxxx field on the sda3 line, so that the biggest possible partition size will be used. The file will look like:

# partition table of /dev/sda
unit: sectors
/dev/sda1 : start=       63, size= 39070017, Id=fd, bootable
/dev/sda2 : start= 39070080, size=  1959930, Id=82
/dev/sda3 : start= 41030010, Id=fd
/dev/sda4 : start=        0, size=        0, Id= 0
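
The manual edit described above can also be done with sed. A minimal sketch, assuming the dump uses exactly the field layout shown in the file above (the function name is hypothetical):

```shell
#!/bin/sh
# Remove the "size=..." field from the sda3 line of an sfdisk dump read on
# stdin, leaving every other line untouched.
strip_sda3_size() {
    sed '/^\/dev\/sda3 /s/ size= *[0-9]*,//'
}

# Usage:
#   sfdisk -d /dev/sda | strip_sda3_size > partitions.txt
```

It is still worth reading partitions.txt before feeding it to sfdisk, since a wrong partition table is exactly the kind of mistake this recipe warns about.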

Disk initialisation

Now partition sdb using this table:
#sfdisk /dev/sdb < partitions.txt

Recreate swap if needed:
#mkswap /dev/sdb2; swapon -a

Put sdb back in the arrays:
#mdadm --manage /dev/md0 --add /dev/sdb1
#mdadm --manage /dev/md1 --add /dev/sdb3

Wait until the array is resynchronised and clean. I use:
#watch cat /proc/mdstat #(quit with Ctrl-C)
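
For scripting, the waiting can be done without watching the screen. The helpers below are a sketch with made-up names. They only look for the words "resync" or "recovery" and for a "_" in the member status field of /proc/mdstat, which is an assumption about its layout based on the output shown above.

```shell
#!/bin/sh
# Both checks read /proc/mdstat by default; pass another file to try them out.

# Succeeds if some array has a missing member, e.g. [U_] or [_U].
md_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]' "${1:-/proc/mdstat}"
}

# Succeeds while a resync or recovery is in progress (or queued).
md_syncing() {
    grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"
}

# Block until no array is syncing any more, polling every POLL seconds.
wait_for_md() {
    while md_syncing "$@"; do sleep "${POLL:-30}"; done
}
```

A typical use would be `wait_for_md && ! md_degraded && echo "arrays are clean"` before moving on to the next disk.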

Install grub on the new disk as before (sdb is hd1 for grub), so that you’ll be able to boot from it.

Changing the second disk

Shut down, remove sda, put the second new disk in its place, and reboot. Make sure your BIOS is configured to try booting from both drives.

Now you’ll have degraded arrays again:
#cat /proc/mdstat
md0 : active raid1 sdb1[0]
19534912 blocks [1/2] [_U]

md1 : active raid1 sdb3[0]
223134720 blocks [1/2] [_U]

Redo the whole disk initialisation section, this time on sda instead of sdb. Don’t forget to reinstall grub on sda.

In the end your arrays will be clean as they were before, but /dev/md1 will still be 230GB instead of using all the available room on the disks’ third partitions.

Grow the things

Let’s ask mdadm to use the whole partition size for md1:
#mdadm --grow /dev/md1 --size=max

You’ll have to wait for synchronisation again (watch cat /proc/mdstat).

The only remaining thing is to grow the ext3 filesystem sitting on md1, and that’s where most of the downtime happens (your data won’t be available unless you do a live FS resize, which I didn’t want to test); these steps took about 30 minutes to complete for me:
#umount /dev/md1
#e2fsck -f /dev/md1 #(it’s better to force a check to avoid a resize failure)
#resize2fs /dev/md1 #(this makes the filesystem the biggest possible)
#e2fsck -f /dev/md1 #(verify that everything is OK)
#mount /dev/md1 #(and you’re done, as df -h should show you):

# df -h /dev/md1
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              898G  228G  634G  27% /backup
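
The offline resize steps above can be chained so that a failure stops the sequence before resize2fs touches a filesystem that didn't check cleanly. This is a sketch, not a tested tool: the function name and the RUN dry-run variable are made up, and it relies on e2fsck returning 0 for a clean filesystem and 1 when errors were corrected.

```shell
#!/bin/sh
# Offline resize of the filesystem on $1, stopping at the first failure.
# Set RUN=echo to print the commands instead of running them (dry run).
resize_offline() {
    dev=$1
    $RUN umount "$dev" || return 1
    $RUN e2fsck -f "$dev"
    [ $? -le 1 ] || return 1     # 0 = clean, 1 = errors corrected
    $RUN resize2fs "$dev" || return 1
    $RUN e2fsck -f "$dev"
    [ $? -le 1 ] || return 1
    $RUN mount "$dev"            # relies on the /etc/fstab entry
}
```

Running `RUN=echo resize_offline /dev/md1` first shows the five commands without executing anything.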

Rambling about half-finished RAID setups

One thing you may have noticed is that I’m installing grub on both drives. This may seem obvious, but most software RAID arrays I’ve seen couldn’t boot from the second disk for lack of an MBR. The RAID setup is then useful when your second disk fails, but if it’s the first, you have to resort to a rescue CD or PXE boot to restart your server. This makes things much harder to fix and causes cold sweats, downtime, and user annoyance. Install grub on both disks. Before going into production, check that the system boots with either disk removed, the first as well as the second.

Don’t mistake your RAID arrays for a backup system. A RAID array provides redundancy and makes recovering from a failed disk (a LOT) easier, but it doesn’t help you recover from two failed disks, and it doesn’t recover data lost to human mistakes either. Regarding failed disks, the best results are achieved by monitoring the disks, with smartd for example, and replacing suspicious disks too soon rather than too late.
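
Besides running the smartd daemon, a quick manual check is possible with smartctl from the same smartmontools package. The helper below is a sketch with a made-up name; it assumes ATA/SATA disks, for which `smartctl -H` prints an "overall-health self-assessment test result" line (SCSI disks word it differently).

```shell
#!/bin/sh
# Reads `smartctl -H` output on stdin and succeeds only if the drive
# reports PASSED for its overall health self-assessment.
smart_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage (as root):
#   for d in /dev/sda /dev/sdb; do
#       smartctl -H "$d" | smart_passed || echo "check $d"
#   done
```

The overall health flag is a coarse indicator, so this complements continuous monitoring rather than replacing it.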

Registrar switch

Tuesday, November 25th, 2008

Back in May 2001, I grabbed my first domain name, colino.net. I wanted to stop changing URLs each time I changed hosting. At the time I had no debit card, so I chose the first registrar I found that accepted payment by cheque, amen.fr.

Since then, I grumbled and grumbled each time I had to log in to their web administration interface for domain renewals, DNS glue record updates, etc.; but out of habit, I bought two more domain names from them: my wife’s, and a second one for me.

Finally, last week, after yet more failed glue record updates, I switched my domain and my wife’s domain to gandi.net. I left the third one at amen: I don’t plan on renewing it, as it’s useless for me to keep two domain names.

I knew Gandi beforehand as it’s the registrar for my work’s domain, and I’m glad I made the switch. Their interface is many times better: it does what it has to do, and doesn’t bug out without so much as an error message when you hit Submit.

How to change Dell’s BIOS settings from a Linux command-line

Wednesday, May 21st, 2008

To change BIOS settings from the command line on a Dell PowerEdge, you need the syscfg utility. It’s very useful when you want to change a setting on, for example, 32 nodes at once, without having to plug in a screen, plug in a keyboard, reboot, change the setting and reboot again, 32 times. Here is how I installed it on the CentOS 5 distribution:

# cd ; wget -q -O - http://linux.dell.com/repo/hardware/bootstrap.cgi | bash
# yum install srvadmin-hapi
# wget ftp://ftp.us.dell.com/sysman/dtk_2.5_80_Linux.iso
# mkdir dtk
# mount -o loop dtk_2.5_80_Linux.iso dtk/
# cd dtk/isolinux/
# cp  SA.2 ~/SA.2.gz
# cd; gunzip SA.2.gz
# mkdir stage2
# cd stage2
# cpio -i < ../SA.2
# cd lofs
# mkdir dell
# mount -o loop dell.cramfs dell/
# mkdir -p /usr/local/sbin ; cp dell/toolkit/bin/syscfg /usr/local/sbin/
# umount dell
# cd
# umount dtk

And voilà! You can now use syscfg:

# /usr/local/sbin/syscfg --biosver
biosver=1.5.1
# /usr/local/sbin/syscfg --virtualization=enable
virtualization=enable

I’d have preferred an easier way, but couldn’t find syscfg’s RPM.

When deploying that to a lot of nodes, you probably don’t want to go through all the associated network downloads of the first phase (wget of the yum repository, yum, and wget of the 230MB iso), so you can take shortcuts:

# for node in $(list_of_nodes); do scp /usr/local/sbin/syscfg /var/cache/yum/dell-hardware-auto/packages/srvadmin-*.rpm $node: ; ssh $node "mkdir -p /usr/local/sbin; mv syscfg /usr/local/sbin; rpm -ivh srvadmin-*.rpm"; done;
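
Once syscfg is deployed, the same kind of loop can apply one setting everywhere and report the nodes where it failed. A sketch only: the function name, the SSH override and the node names in the usage line are all hypothetical.

```shell
#!/bin/sh
# Run one syscfg option on every node given as argument and report failures
# on stderr. Set SSH=echo to print the commands instead of running them.
syscfg_all() {
    option=$1; shift
    for node in "$@"; do
        ${SSH:-ssh} "$node" /usr/local/sbin/syscfg "$option" \
            || echo "FAILED: $node" >&2
    done
}

# Usage:
#   syscfg_all --virtualization=enable node01 node02
```

A dry run with `SSH=echo` is a cheap way to check the node list before changing BIOS settings on all of them.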

Lazily testing memory

Tuesday, May 13th, 2008

I had, until recently, a problem when it came to testing memory on the nodes in my lab. Until now, I was able to PXE boot memtest+, but had to go down to the lab and plug in a screen to check the output. Multiple annoyances: first I had to move my ass to the lab room, then I had to do some difficult things to plug a screen into the node, then I had to come back from time to time and look at the output. All of this right in front of the cooling units, which blow some really cold air now that they work correctly.

This morning I dug into the source code of memtest+ and found out that it has recently gained support for output to serial consoles!

A little upgrade later, I can now boot memtest+ with the console=ttyS0,57600 command line parameter and just watch my serial line output, without moving at all! Yay!

DEFAULT memtest console=ttyS0,57600
LABEL memtest
KERNEL images/tools/memtest

Viva PXE!

(Btw, for those who find it funny to use DEFAULT memtest… PXE boot choices are updated via a CGI script called from an intranet tool, which itself is wrapped in a little GTK systray applet. This applet lets me reboot, shut down, power on, reinstall with various distributions, follow the serial line of, or open an ssh connection to the lab’s nodes. The tool is also usable from the command line, for scripting power.)
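
The CGI script itself isn’t shown here, but the core of such a tool can be sketched. PXELINUX looks for a per-node configuration file named after the node’s IPv4 address in uppercase hexadecimal, so switching one node to memtest amounts to writing the three lines above into that file. The function names, directory and addresses below are made up for illustration.

```shell
#!/bin/sh
# Convert a dotted IPv4 address to the uppercase hex file name that
# PXELINUX looks up in pxelinux.cfg/ (runs in a subshell to keep IFS local).
ip_to_pxe_name() (
    IFS=.
    set -- $1
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
)

# Write a per-node config booting memtest on the serial console.
# $1: the pxelinux.cfg directory, $2: the node's IP address
write_memtest_cfg() {
    dir=$1; ip=$2
    cat > "$dir/$(ip_to_pxe_name "$ip")" <<EOF
DEFAULT memtest console=ttyS0,57600
LABEL memtest
KERNEL images/tools/memtest
EOF
}
```

Removing the per-node file afterwards makes the node fall back to the shared default configuration.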

Stuff that happens to sysadmins

Wednesday, March 5th, 2008

Buy one 1U server from $supplier, specifically ask for a pair of rails, learn that “Of course it comes with rails!”.

Two weeks later. Buy a 42U rack, and eight 1U servers, all of these from the same $supplier, and at the same time. Receive your rack and 8 servers, without rails. Inquire by email: “No, servers don’t come automatically with rails, you didn’t ask for them”.

Thanks, Dell. It’s always a pleasure.

Geek pr0n again

Tuesday, January 22nd, 2008

Take one 3U server from SuperMicro, fitted with an Adaptec 31605 SATA RAID controller and 16 one-terabyte disks. Create a RAID 0+1 array for data storage, with 14 of these 16 disks. You get:

#available storage
[root@sam119 ~]# df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 6.3T 941M 6.0T 1% /data

#write test
[root@sam119 ~]# dd if=/dev/zero of=/data/toto.txt bs=4096 count=2621440
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 38.8372 seconds, 276 MB/s

#read test
[root@sam119 ~]# dd if=/data/toto.txt of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 31.7011 seconds, 339 MB/s
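
The figures dd reports can be cross-checked from the block sizes, counts and timings above, keeping in mind that dd counts 1 MB as 10^6 bytes. A small sketch with made-up helper names, using awk for the floating-point arithmetic:

```shell
#!/bin/sh
# Total bytes transferred for a given block size and record count.
total_bytes() {
    awk -v bs="$1" -v n="$2" 'BEGIN { printf "%.0f\n", bs * n }'
}

# Throughput in MB/s (1 MB = 10^6 bytes), rounded to the nearest integer.
mbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# Write test: 2621440 records of 4096 bytes in 38.8372 s
#   mbps "$(total_bytes 4096 2621440)" 38.8372
# Read test: 20971520 records of 512 bytes (dd's default) in 31.7011 s
#   mbps "$(total_bytes 512 20971520)" 31.7011
```

Both tests move the same 10737418240 bytes; the record counts differ only because the read test uses dd’s default 512-byte block size.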

(See also the previous Geek pr0n entry, complete with JPEGs!)

RPM Hell

Monday, October 15th, 2007

Actually, this is not RPM Hell but rather crappy packaging hell:

[root@sam70 ~]# rpmbuild --rebuild xen-3.1.0-10.fc8.src.rpm
Installing xen-3.1.0-10.fc8.src.rpm
error: Failed build dependencies:
transfig is needed by xen-3.1.0-10.x86_64
libidn-devel is needed by xen-3.1.0-10.x86_64
texi2html is needed by xen-3.1.0-10.x86_64
SDL-devel is needed by xen-3.1.0-10.x86_64
curl-devel is needed by xen-3.1.0-10.x86_64
libX11-devel is needed by xen-3.1.0-10.x86_64
python-devel is needed by xen-3.1.0-10.x86_64
ghostscript is needed by xen-3.1.0-10.x86_64
tetex-latex is needed by xen-3.1.0-10.x86_64
gtk2-devel is needed by xen-3.1.0-10.x86_64
libaio-devel is needed by xen-3.1.0-10.x86_64
/usr/include/gnu/stubs-32.h is needed by xen-3.1.0-10.x86_64
dev86 is needed by xen-3.1.0-10.x86_64
gnutls-devel is needed by xen-3.1.0-10.x86_64
openssl-devel is needed by xen-3.1.0-10.x86_64

Picking only one example from this list: why on earth does xen need gtk2-devel?

Let’s just grin and do as requested; after all, it’s not as if I had a choice. After figuring out that /usr/include/gnu/stubs-32.h is actually provided by glibc-devel (and the src.rpm building machine didn’t know it…):

[root@sam70 ~]# yum install transfig libidn-devel texi2html SDL-devel curl-devel libX11-devel python-devel ghostscript tetex-latex gtk2-devel libaio-devel glibc-devel dev86 gnutls-devel openssl-devel
Loading “installonlyn” plugin
Setting up Install Process
Setting up repositories
base 100% |=========================| 1.1 kB 00:00
updates 100% |=========================| 951 B 00:00
addons 100% |=========================| 951 B 00:00
extras 100% |=========================| 1.1 kB 00:00
Reading repository metadata in from local files
primary.xml.gz 100% |=========================| 341 kB 00:01
updates : ################################################## 817/817
Added 12 new packages, deleted 0 old in 1.03 seconds
[...]

Transaction Summary
=============================================================================
Install 76 Package(s)
Update 0 Package(s)
Remove 0 Package(s)

Total download size: 116 M
Is this ok [y/N]: y

Now everything’s downloaded…

Transaction Check Error:
file /usr/share/man/man1/asn1parse.1ssl.gz from install of openssl-0.9.8b-8.3.el5 conflicts with file from package openssl-0.9.8b-8.3.el5
file /usr/share/man/man1/nseq.1ssl.gz from install of openssl-0.9.8b-8.3.el5 conflicts with file from package openssl-0.9.8b-8.3.el5
file /usr/share/man/man1/ocsp.1ssl.gz from install of openssl-0.9.8b-8.3.el5 conflicts with file from package openssl-0.9.8b-8.3.el5
file /usr/share/man/man1/smime.1ssl.gz from install of openssl-0.9.8b-8.3.el5 conflicts with file from package openssl-0.9.8b-8.3.el5

That’s actually due to the openssl-devel package pulling in both the i386 and x86_64 versions of the openssl package, which happily conflict because both of them ship… the same manpages! I wonder how they manage to get automated builds working. (Solving that: rpm -ivh --force --nodeps /var/cache/yum/base/packages/openssl-* . Doesn’t look like chainsaw work at all…)

As a bonus, note how the error message doesn’t indicate at all that the two conflicting packages, even though they’re named the same, are not actually the same, since they are built for different architectures.

Sigh. Thanks, Red Hat.

Geek pr0n

Tuesday, June 19th, 2007

I can’t resist sharing a few pictures of my lab at work:

Total: 100 nodes…

Stuff that happens to cluster sysadmins

Friday, February 9th, 2007

Suddenly getting 60% packet loss between the LAN and the cluster, just because

kernel: ip_conntrack: table full, dropping packet.

Fix the problem by raising the limit from 65536 to 2097152 slots. If this limit is reached, that’ll eat 700MB of memory, ouch.
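
The memory figure can be sanity-checked with a one-line calculation. The 350 bytes per tracked connection used below is an assumption, back-derived from the 700MB quoted above rather than measured; the helper name is made up, and the sysctl names in the comments are those of 2.6 kernels using the ip_conntrack module (newer kernels use net.netfilter.nf_conntrack_max instead).

```shell
#!/bin/sh
# Worst-case conntrack table memory in MB (integer division).
# $1: number of slots, $2: assumed bytes per entry
conntrack_mem_mb() {
    echo $(( $1 * $2 / 1048576 ))
}

# Reading and raising the limit on a kernel of that era (as root):
#   sysctl net.ipv4.netfilter.ip_conntrack_max
#   sysctl -w net.ipv4.netfilter.ip_conntrack_max=2097152
```

With that assumption, `conntrack_mem_mb 2097152 350` gives 700, against about 21 for the old limit of 65536 slots.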
