Tuesday, February 5, 2013

DRBD mirrored disk

So today I'm going to talk about how to set up replication between two servers. In my case I've created an LVM volume on both servers that is effectively a RAID10, and now I want to use DRBD to keep the block devices synchronized and thus protect against a server failure.

We start by installing the DRBD packages. I've found it easiest to just use the elrepo repository: http://elrepo.org/. Follow the instructions there to set up the repository in yum.

You then install the userland package and the kernel module. If you have a recent enough kernel the kmod package may not be necessary; check the DRBD website.

yum install -y drbd84-utils kmod-drbd84

Next up we need to configure the drbd resource that we want to mirror. In our case we want to keep /home mirrored so we create a new resource file in /etc/drbd.d/ called home.res with the following contents:


resource home {
  device /dev/drbd1;
  disk /dev/vg0/home;
  meta-disk internal;
  on se3 {
    address 192.168.1.243:7789;
  }
  on se5 {
    address 192.168.1.245:7789;
  }
}

What this says is that DRBD will use the local device /dev/vg0/home and create a new block device, /dev/drbd1, which is what you should then use. In addition we ask for internal metadata, which means the metadata is kept on the same device as the data. This, however, means that you will have to re-create the filesystem on the device. If you want to create a DRBD mirror of an already existing filesystem, you have to use external metadata; I suggest you consult Chapter 17, DRBD internals, on the DRBD website, which discusses metadata usage. There are also downsides to keeping metadata on the same block device when that device is a single disk rather than a RAID volume, so do take this into consideration.
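
For illustration only (we stick with internal metadata here), a resource definition using external metadata on a hypothetical separate partition might look roughly like this:

resource home {
  device /dev/drbd1;
  disk /dev/vg0/home;
  meta-disk /dev/sdb1[0];   # hypothetical dedicated partition holding the DRBD metadata
  on se3 {
    address 192.168.1.243:7789;
  }
  on se5 {
    address 192.168.1.245:7789;
  }
}

Check the metadata sizing notes in the DRBD documentation before going that route, as the metadata partition has to be large enough for the backing device.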

So our home.res above describes a DRBD resource called home with two servers, their respective IPs and the port on which to talk to the other host. A dedicated replication/heartbeat link is preferred if possible, but it works quite well over a single network interface that is also used for other things.

Now that we have described the resource we have to initialize it for the first time. Choose one of the nodes and ask DRBD to create the resource's metadata:

# drbdadm create-md home

As we use internal metadata it'll warn you that you are possibly destroying existing data. Remember that you indeed are, if anything was previously stored on the device; in that case you'd better consult the DRBD manual on how to set things up with external metadata.

Now that we have the metadata created we can bring the resource up. For that we first load the drbd kernel module and then ask for the resource to be brought up:

# modprobe drbd
# drbdadm up home

We can now check the status:


[root@se5 drbd.d]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:2147483608


So right now it shows that it's waiting for connection to the other host (we've not yet installed/configured it there) and it considers itself inconsistent, which is normal at this point.

Now let's install drbd on the other server and copy over the configuration so that we can also bring up the other part of our distributed mirror.

rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-5-4.el5.elrepo.noarch.rpm
yum install -y drbd84-utils kmod-drbd84
scp se5:/etc/drbd.d/home.res /etc/drbd.d/home.res
drbdadm create-md home
modprobe drbd
drbdadm up home

Now that we've repeated the steps, the two nodes should be in communication, but still unable to decide who is the master from which to do the initial synchronization:


[root@se3 ~]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:2147483608

So now let's pick one server (in our case se3) to be the primary and force DRBD to consider it that:

drbdadm primary --force home

Now DRBD knows which side to copy the data from (even though we've not really created any data on it yet) and starts the synchronization:


[root@se3 ~]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:4025152 nr:0 dw:0 dr:4025152 al:0 bm:245 lo:1 pe:13 ua:0 ap:0 ep:1 wo:f oos:2143460504
[>....................] sync'ed:  0.2% (2093220/2097148)M
finish: 5:59:58 speed: 99,236 (82,104) K/sec

You can already create a filesystem on the new device, mount it and start using it, but remember that it's still syncing, so until that completes performance might be degraded.
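
Concretely, on the node that is currently primary, something like this is all it takes (ext4 and the mount point are just what we happen to use):

mkfs.ext4 /dev/drbd1
mount /dev/drbd1 /home

Note that the filesystem goes on /dev/drbd1, not on /dev/vg0/home; the backing LV should not be touched directly once DRBD sits on top of it.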

What I have not covered here are the default settings and what they mean. I recommend reading up on the DRBD website, but basically the default protocol that DRBD uses requires synchronous writes: as long as the devices are connected, any write to the primary device also initiates a write on the secondary, and the OK is given back only once both have completed. It can be set up for asynchronous writes if needed, but that requires changing the protocol. One can also set up an active-active (i.e. primary/primary) configuration allowing both devices to be used at the same time, but caution is needed in such setups to avoid split-brain situations. In our case we want an active-passive configuration with NFS server failover, but that we will discuss in a separate post.
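
For reference, the protocol is a per-resource setting; a rough sketch of what switching to asynchronous replication would involve (not something we use here, and check the DRBD 8.4 manual for the exact option placement) is adding to home.res:

  net {
    protocol A;   # A = asynchronous; the default, C, waits for the write to complete on both nodes
  }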


Optimizing for IOPS

We recently discovered that if a user accidentally forgets to make his/her jobs use the local scratch space on the worker nodes, it's not too hard to overload a shared /home filesystem with IOPS. Right now the home filesystem is a basic 2-drive RAID0 configuration that is mirrored to a backup host with DRBD (I might describe that setup in a later post). While the SATA drives with cache enabled seemed able to handle a huge number of IOPS (the cluster averaged 700-2000 IOPS at the time), the iostat output showed that the drives were saturated most of the time:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdx              49.00  3047.00  222.00  888.00     6.25    15.37    39.90     4.02    3.50   0.85  94.30
sdy              26.00  2930.00  226.00  825.00     6.16    14.67    40.59     2.30    2.25   0.77  81.20
md1               0.00     0.00  518.00 7690.00    12.23    30.04    10.55     0.00    0.00   0.00   0.00

You can see this output with iostat -dmx sdx sdy md1 1

where -d means disk utilization information, -m shows the numbers in MB instead of blocks and -x shows extended information. The list of drives restricts the output to those devices, otherwise it shows all of them (and this node had 24 drives), and the final 1 means it reports in 1s intervals. The first output is always the average since system boot; the following ones show results for each interval.

So while the shared home doesn't really need much capacity (a few TB are fine), it does need IOPS if it is to stay responsive while cluster jobs are on it and users are doing interactive things like compilation, git operations etc., which all generate a reasonable amount of IOPS without requiring much bandwidth. So we decided to repurpose a few new storage servers. Each server has 36 drives of 3TB (2.73TB usable), so the raw capacity is ca 98.3TB. The usual way we exploit this is to mount each drive separately and give it to Hadoop as a data directory; that way, if a single drive fails we only lose that part and Hadoop replicates the blocks elsewhere. We still want to do that, so we configured this using LVM.

First we create 36 PVs, which basically tell the volume manager which backing store units it can utilize.

for i in `cat list`; 
do 
  pvcreate /dev/$i; 
done

Here I used a file called list to hold all the device names (as in sdc, sdd, etc.), since there are 36 of them, to ease the creation; one possible way to build it is sketched below.
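
One way such a list could be generated (the exclusion of the system drives is just an example, adjust it to your own layout):

ls /sys/block | grep -E '^sd' | grep -vE '^sd[ab]$' > list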

We then create one large volume group on top of those 36 drives, allowing logical volumes to span the whole system.

vgcreate vg0 `cat list|xargs -i echo -n "/dev/{} "`

Again, we just parse the list file to add all the devices to the command. Basically it looks like:

vgcreate vg0 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi ...

So now we have one huge volume group that can take all that we need:


# vgs
  VG   #PV #LV #SN Attr   VSize  VFree 
  vg0   36   0   0 wz--n- 98.23T 98.23T

So let's first create the RAID10 mirror for high IOPS. As I want it to span all the drives, I have to tell lvcreate to keep the mirror log in memory, since there are no spare drives left to hold it. If you create a smaller LV, drop the --mirrorlog core part; it will then just use a few MB on some other drive in the VG to keep the log.

lvcreate --mirrors 1 --stripes 18 --mirrorlog core -L 2T -n home vg0

This creates a mirror with each side striped across 18 drives, effectively a RAID10. The size is 2TB, the mirror log is kept in memory, it's named home (i.e. /dev/vg0/home) and it's created in the volume group vg0.

One can see the status with the lvs command:

# lvs
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert
  home vg0  mwi-a- 2.00T                          0.01    

You'll notice from the Copy% column that it has started synchronizing the volume. It'll take a while, and during that time the drives will be somewhat busy, so if you want to measure any performance it's better to wait until the synchronization has completed.
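
If you want to keep an eye on the progress, something as simple as this works (the interval is arbitrary):

watch -n 60 lvs vg0/home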

For further details on the volume you can use the lvdisplay command:

# lvdisplay 
  --- Logical volume ---
  LV Name                /dev/vg0/home
  VG Name                vg0
  LV UUID                xhqUt8-vkbv-UUnZ-Hcvu-8ADX-8y8V-dtwucm
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                2.00 TB
  Current LE             524304
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

Once it has finished syncing we can do some performance measurements. A sample tool to create the load is iozone (first you have to make a filesystem on the new RAID and mount it, though).

So after mkfs.ext4 -m 0 /dev/vg0/home and mount /dev/vg0/home /mnt we have the setup we need.
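
For reference, the kind of iozone run used below can be as simple as this (the size cap and file path are just an example):

iozone -a -g 4g -f /mnt/iozone.tmp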

Running iozone in one window with the basic -a option (automatic mode, which runs through a variety of tests), we run iostat in another window. If you want to figure out which device to monitor in iostat for the RAID volume, look at the lvdisplay output: it says block device 253:2. You can then look at /proc/diskstats to find which name that corresponds to:

 253    2 dm-2 487 0 3884 22771 52117628 0 416941024 3330370634 0 477748 3330507046

So our 253:2 corresponds to the dm-2 device. You can then run, for example:

iostat -dmx sdd sdm sdal dm-2 10

This would show you three individual drives and the full volume in 10s intervals, with extended statistics and sizes in MB. Here's a sample output after letting it run for a few iterations:


Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd               0.00  3290.70  0.00 122.10     0.00    12.80   214.77    18.39  147.23   4.33  52.88
sdm               0.00  3521.30  0.80 121.50     0.07    13.70   230.47    17.91  143.31   4.42  54.08
sdal              0.00  3299.10  0.00 108.40     0.00    12.83   242.30    14.85  133.63   5.27  57.10
dm-2              0.00     0.00  0.50 65955.50     0.00   257.64     8.00 28801.15  400.63   0.01  60.08

As you can see I caught it during the write tests: the RAID wrote about 260MB/s and was about 60% utilized during the 10s measurement window. It averaged almost 66,000 write requests per second to the volume. The OS was able to merge most of those; in the wrqm/s column you can see 3300-3500 write requests per drive being merged down to the ca 110-120 actual writes shown in the w/s column. So thanks to request merging the volume was able to take almost 66,000 IOPS. The actual drives average around 120-150 IOPS, which is about what you'd expect from ordinary 7200 rpm SATA drives. So in a 36-drive volume you can expect read IOPS to go as high as 5400 to the metal (and thanks to merging probably 10-30x that) and about half of that for writes. In any case it gives you capacity AND IOPS. One doesn't necessarily need to invest in SSDs immediately, though we do have a few SSDs in our system as well.

Here's an example from running iozone on one of those SSDs:


Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00 30508.00  0.00 481.00     0.00   118.58   504.88     5.08    9.99   1.60  77.20

As you can see it was able to merge over 30k write requests per second into 480 writes/s all by itself, so SSDs are impressive, but we reached twice that in our RAID10 config using ordinary SATA drives.

I'll cover setting up the DRBD mirroring of this data and possibly a clustered NFS server in a separate post.




Thursday, December 13, 2012

gluster stuck with a lock

Today I ran into a situation where gluster was stuck. You could check info and status, but nothing more than that. When you tried enabling profiling or heal you'd get this in the log:

[2012-12-13 15:40:07.881363] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3ce6b9c-6297-4e77-924c-b44e2c13e58f, lock held by: c3ce6b9c-6297-4e77-924c-b44e2c13e58f

As can be seen, the node can't get a lock that it itself is holding. Googling around, it seems the error comes from a command that failed to complete correctly and didn't release the lock. Various posts say this lock should disappear on its own after 30 minutes, but that's not much of a solution. I was also able to get rid of the lock a few times with

gluster volume statedump home0

but not always. When I managed to clear the lock things seemed to work (profiling started and gave info etc.), but when I then ran

gluster volume heal home0 

it locked up again. In the end the only help I got from #gluster was to unmount all clients, stop all gluster processes and restart them. That indeed cleared the locks and the heal command works again, but it's not a solution I'd ever like to need again. I've written to the gluster-users list and will update this entry if I learn how this should actually have been cleared.
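
For completeness, what that boiled down to on each server (roughly, and only after all clients had been unmounted) was something like:

service glusterd stop
pkill glusterfs
pkill glusterfsd
service glusterd start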

Wednesday, December 5, 2012

Network debugging for 10Gbit/s

So today we decided to take the network testing seriously. We connected two servers to the generic infrastructure over 1G (to access software from the internet) and ran a direct cable between them for 10G. This way we can test purely at the OS + driver + NIC firmware level whether we can get to 10Gbit, and from there expand the test to include a switch.

To make sure we didn't have any stale software we reinstalled both nodes with CentOS 6.3 and no kernel tunings. We also compared the sysctl -a output on the two servers, and though there were minor differences none of them should be relevant.
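
The comparison itself is a quick one-liner run from one of the nodes (the hostname here is just the second test server):

diff <(sysctl -a | sort) <(ssh wn-d-98 'sysctl -a | sort')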

We then launched a few baseline tests. First of all, we ran iperf locally inside each server to see how much the server itself can handle. With a 256KB window size both did at least 17Gbit/s:


[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.77 GBytes  15.2 Gbits/sec
[  3]  1.0- 2.0 sec  2.37 GBytes  20.4 Gbits/sec
[  3]  2.0- 3.0 sec  2.42 GBytes  20.8 Gbits/sec
[  3]  3.0- 4.0 sec  2.40 GBytes  20.6 Gbits/sec
[  3]  4.0- 5.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  5.0- 6.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  6.0- 7.0 sec  2.39 GBytes  20.6 Gbits/sec
[  3]  7.0- 8.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  8.0- 9.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  9.0-10.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  0.0-10.0 sec  23.3 GBytes  20.0 Gbits/sec
[root@wn-d-01 ~]#

Running the test between them, first over the 1Gbit/s link to check that there is no generic foul play at work cutting speed to 25% or 33%, we nicely saw 942Mbit/s:


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   112 MBytes   943 Mbits/sec
[  3]  1.0- 2.0 sec   112 MBytes   942 Mbits/sec
[  3]  2.0- 3.0 sec   112 MBytes   942 Mbits/sec
[  3]  3.0- 4.0 sec   112 MBytes   942 Mbits/sec
[  3]  4.0- 5.0 sec   112 MBytes   942 Mbits/sec
[  3]  5.0- 6.0 sec   112 MBytes   941 Mbits/sec
[  3]  6.0- 7.0 sec   112 MBytes   942 Mbits/sec
[  3]  7.0- 8.0 sec   112 MBytes   942 Mbits/sec
[  3]  8.0- 9.0 sec   112 MBytes   942 Mbits/sec
[  3]  9.0-10.0 sec   112 MBytes   942 Mbits/sec
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec
[root@wn-d-01 ~]# 

So then we fired up the default kernel driver for our Mellanox 10G card and first tested the link with ping:


[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms

So an RTT of about 0.2ms means that with a 256KB window you should in theory get close to 10Gbit/s (window / RTT = 256KB / 0.2ms). So let's see if that actually works (remember, it's a direct attached cable, nothing else running on the servers):


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   290 MBytes  2.43 Gbits/sec
[  3]  1.0- 2.0 sec   296 MBytes  2.49 Gbits/sec
[  3]  2.0- 3.0 sec   191 MBytes  1.61 Gbits/sec
[  3]  3.0- 4.0 sec   320 MBytes  2.68 Gbits/sec
[  3]  4.0- 5.0 sec   232 MBytes  1.95 Gbits/sec
[  3]  5.0- 6.0 sec   161 MBytes  1.35 Gbits/sec
[  3]  6.0- 7.0 sec   135 MBytes  1.13 Gbits/sec
[  3]  7.0- 8.0 sec   249 MBytes  2.09 Gbits/sec
[  3]  8.0- 9.0 sec   224 MBytes  1.88 Gbits/sec
[  3]  9.0-10.0 sec   182 MBytes  1.53 Gbits/sec
[  3]  0.0-10.0 sec  2.23 GBytes  1.91 Gbits/sec
[root@wn-d-01 ~]#

Not even close. So next up we installed the official Mellanox kernel modules. With those we could also increase the window size to 1-2MB (which was somehow capped at 256KB before), though this shouldn't matter. The first test looked promising for the first few seconds:


[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1021 MBytes  8.57 Gbits/sec
[  3]  1.0- 2.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  2.0- 3.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  3.0- 4.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  4.0- 5.0 sec   933 MBytes  7.82 Gbits/sec
[  3]  5.0- 6.0 sec   278 MBytes  2.33 Gbits/sec
[  3]  6.0- 7.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  7.0- 8.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  8.0- 9.0 sec   276 MBytes  2.32 Gbits/sec
[  3]  9.0-10.0 sec   277 MBytes  2.33 Gbits/sec

No matter how hard we tried we couldn't repeat the 9.47Gbit/s speeds. Digging into the Mellanox network performance tuning guide, I first raised the relevant kernel parameters to the values they recommend, but that had absolutely no impact on throughput.

The tunings they recommend:

# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0

# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0

# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000

# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl  -w net.core.rmem_max=16777216
sysctl  -w net.core.wmem_max=16777216
sysctl  -w net.core.rmem_default=16777216
sysctl  -w net.core.wmem_default=16777216
sysctl  -w net.core.optmem_max=16777216

# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1
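
These sysctl -w calls only last until reboot; to make them stick, the same key = value pairs can go into /etc/sysctl.conf and be reloaded with:

sysctl -p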

What did have some impact, however, was turning off adaptive interrupt moderation, though only for a short time. We got 7Gbit/s from one node to the other, but the other direction managed 7Gbit/s only for a few seconds before hiccuping and dropping back to 2.3Gbit/s:


iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   856 MBytes  7.18 Gbits/sec
[  3]  1.0- 2.0 sec   855 MBytes  7.17 Gbits/sec
[  3]  2.0- 3.0 sec   879 MBytes  7.37 Gbits/sec
[  3]  3.0- 4.0 sec   902 MBytes  7.57 Gbits/sec
[  3]  4.0- 5.0 sec   854 MBytes  7.16 Gbits/sec
[  3]  5.0- 6.0 sec   203 MBytes  1.71 Gbits/sec
[  3]  6.0- 7.0 sec   306 MBytes  2.56 Gbits/sec
[  3]  7.0- 8.0 sec   852 MBytes  7.15 Gbits/sec
[  3]  8.0- 9.0 sec   799 MBytes  6.70 Gbits/sec
[  3]  9.0-10.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 10.0-11.0 sec   323 MBytes  2.71 Gbits/sec
[  3] 11.0-12.0 sec   278 MBytes  2.33 Gbits/sec
[  3] 12.0-13.0 sec   277 MBytes  2.32 Gbits/sec
[  3] 13.0-14.0 sec   277 MBytes  2.32 Gbits/sec
...

Reading further, I changed the interrupt coalescing settings to this:

ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
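
If you want to confirm the change took effect, the current coalescing settings can be read back with the query form of the same ethtool option:

ethtool -c eth2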

Running two parallel streams and testing both directions gives a pretty good result.

wn-d-98 as client, wn-d-01 as server:

# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[  3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

The other direction:

------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[  4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM] 10.0-15.0 sec  5.52 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 15.0-20.0 sec  5.52 GBytes  9.48 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.73 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

So with two streams we can saturate the network. Testing again with 1 stream we got d-98 -> d-01 at 9.5Gb/s, but the reverse was at 2.3 Gb/s. Running d-01 -> d-98 at -P 2 got to 9.5G again. Bizarre. The first test now is to see what happens after reboot.

After the reboot we see pretty much the same state: a single stream is poor, two streams get 9.5G at least initially, and speeds drop if you run them in both directions in parallel. The tuning script was deliberately not made reboot-safe, exactly to see whether it had any effect; same for the interrupt moderation settings. Setting both up again gets us back to the previous state, where single-stream speeds are 2-3Gbit/s when run simultaneously in both directions and two parallel streams get 9.5G in both directions.

Update: it seems the traffic speed is relatively unstable. I tried moving to the OpenVZ kernel as well, and with the same tunings and interrupt settings I got the following result (notice the varying speed and how it jumps around from the 25s mark onward, when the reverse-direction connections show up in the output):


# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[  3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM]  0.0- 5.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[  3]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[SUM]  5.0-10.0 sec  3.53 GBytes  6.06 Gbits/sec
[  4] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 10.0-15.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4] 15.0-20.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 15.0-20.0 sec   693 MBytes  1.16 Gbits/sec
[SUM] 15.0-20.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 20.0-25.0 sec   694 MBytes  1.17 Gbits/sec
[  3] 20.0-25.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 20.0-25.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[  6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[  4] 25.0-30.0 sec  1.85 GBytes  3.18 Gbits/sec
[  3] 25.0-30.0 sec  2.42 GBytes  4.16 Gbits/sec
[SUM] 25.0-30.0 sec  4.28 GBytes  7.35 Gbits/sec
[  3] 30.0-35.0 sec  2.36 GBytes  4.06 Gbits/sec
[  4] 30.0-35.0 sec  1.70 GBytes  2.93 Gbits/sec
[SUM] 30.0-35.0 sec  4.06 GBytes  6.98 Gbits/sec
[  3] 35.0-40.0 sec  2.83 GBytes  4.86 Gbits/sec
[  4] 35.0-40.0 sec  2.10 GBytes  3.61 Gbits/sec
[SUM] 35.0-40.0 sec  4.93 GBytes  8.47 Gbits/sec
[  4] 40.0-45.0 sec   820 MBytes  1.38 Gbits/sec
[  3] 40.0-45.0 sec  1.37 GBytes  2.36 Gbits/sec
[SUM] 40.0-45.0 sec  2.17 GBytes  3.73 Gbits/sec
[  4] 45.0-50.0 sec  2.30 GBytes  3.96 Gbits/sec
[  3] 45.0-50.0 sec  3.02 GBytes  5.20 Gbits/sec
[SUM] 45.0-50.0 sec  5.33 GBytes  9.15 Gbits/sec
[  4] 50.0-55.0 sec  1.37 GBytes  2.36 Gbits/sec
[  3] 50.0-55.0 sec  2.00 GBytes  3.43 Gbits/sec
[SUM] 50.0-55.0 sec  3.37 GBytes  5.79 Gbits/sec
[  4]  0.0-30.9 sec  12.4 GBytes  3.46 Gbits/sec
[  6]  0.0-30.9 sec  12.6 GBytes  3.50 Gbits/sec
[SUM]  0.0-30.9 sec  25.0 GBytes  6.96 Gbits/sec
[  4] 55.0-60.0 sec  2.63 GBytes  4.51 Gbits/sec
[  3] 55.0-60.0 sec  2.89 GBytes  4.96 Gbits/sec
[SUM] 55.0-60.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[  3] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[SUM] 60.0-65.0 sec  5.20 GBytes  8.94 Gbits/sec
[  4] 65.0-70.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 65.0-70.0 sec   696 MBytes  1.17 Gbits/sec
[SUM] 65.0-70.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[  3] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[SUM] 70.0-75.0 sec  1.67 GBytes  2.88 Gbits/sec
[  4] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 75.0-80.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 80.0-85.0 sec  5.19 GBytes  8.92 Gbits/sec
[  3] 85.0-90.0 sec   694 MBytes  1.16 Gbits/sec
[  4] 85.0-90.0 sec   697 MBytes  1.17 Gbits/sec
[SUM] 85.0-90.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 90.0-95.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 90.0-95.0 sec   694 MBytes  1.17 Gbits/sec
[SUM] 90.0-95.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 95.0-100.0 sec   696 MBytes  1.17 Gbits/sec
[  4]  0.0-100.0 sec  32.6 GBytes  2.80 Gbits/sec
[  3] 95.0-100.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 95.0-100.0 sec  1.36 GBytes  2.33 Gbits/sec
[  3]  0.0-100.0 sec  36.7 GBytes  3.15 Gbits/sec
[SUM]  0.0-100.0 sec  69.3 GBytes  5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]# 

Reducing rx-usecs and rx-frames to 0, as recommended in the performance guide, I can't really get past 3Gbit/s. So it does point towards some issue with interrupt handling.

So as a final test, since the Mellanox driver package provides scripts to set IRQ affinity for the Mellanox interfaces, I tried pinning the IRQs and retesting. On both nodes:


# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1

done.

And the result:


[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  5.51 GBytes  9.46 Gbits/sec
[  3]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 20.0-25.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 25.0-30.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 30.0-35.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 35.0-40.0 sec  5.51 GBytes  9.47 Gbits/sec
^C[  3]  0.0-41.6 sec  45.8 GBytes  9.47 Gbits/sec
[root@wn-d-01 ~]# 



It works with both single and multiple streams, in one and both directions. Yay! Now we just have to roll this out on ALL nodes :)
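
A rough sketch of how that rollout could be scripted, assuming a file with the 10G node hostnames and the same interface name everywhere (both are assumptions here):

for n in $(cat 10g-nodes.txt); do
  ssh $n 'ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32; set_irq_affinity_cpulist.sh 0 eth2'
done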

Tuesday, December 4, 2012

OpenVZ latest kernels screw networking?

We've been puzzled recently by the overall performance of our networking. We run a Mellanox fabric of four SX1016 switches, which are 64-port 10G switches. We trunk three of them to the fourth with eight ports each, meaning that within a switch you can get up to 560Gbit/s, but between switches you're limited to about 80Gbit/s. The way our environment is distributed, you should get the goods from both worlds.

However, in practice we see most nodes' traffic in the 0.5-1.5Gbit/s range, which is really odd. We had been suspecting the switches etc. for a long while and have about four different Mellanox tickets open; both the switch and 10G card firmwares now in production were largely created by Mellanox because of our issues.

But today as Ilja was debugging another issue at one of his deployments on 1G he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454

He repeated the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared with a kernel downgrade. He also tested speeds between 1G and 10G nodes and got really bad results. So next up we planned to test inside the 10G fabric. I ran a test between two 10G nodes and, no matter what I tried, I was hard pressed to see more than half a Gbit/s. I then decided to run direct HN <-> VZ container tests, as those have been shown to run without overhead, so over 10G we should easily get 9+ Gbit/s.

Well that's a nice thought, but this is the actual result:



[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec   235 MBytes   394 Mbits/sec
[  3]  5.0-10.0 sec   910 MBytes  1.53 Gbits/sec
[  3] 10.0-15.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 15.0-20.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 20.0-25.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 25.0-30.0 sec  1.05 GBytes  1.80 Gbits/sec
[  3]  0.0-30.0 sec  5.29 GBytes  1.52 Gbits/sec
[root@wn-v-5492 ~]# 

The interesting things to notice here: firstly, it takes a moment for the speed to pick up, but we can live with that. Then, however, it's pretty hard-capped at 1.8Gbit/s. The node was doing nothing else at the time, so it wasn't resource constrained.

Another interesting thing to note is that if you go back to Ilja's bugzilla post, written before I even started testing, there's a telling quote:

"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"


As he was testing on a 1Gbit/s network he saw a throughput of 18% of line speed. We now tested the exact same thing on totally different hardware and got 18% throughput as well. That's just too unlikely to be a coincidence. So we started to look into kernels. The one we were running at the time was 2.6.32-042stab059.7, which is a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, which is a rebase of 2.6.32-220.7.1.el6. Rerunning the test:


iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  1.24 GBytes  2.13 Gbits/sec
[  3]  5.0-10.0 sec  1.39 GBytes  2.39 Gbits/sec
[  3] 10.0-15.0 sec  1.42 GBytes  2.43 Gbits/sec
[  3] 15.0-20.0 sec  1.41 GBytes  2.42 Gbits/sec
[  3] 20.0-25.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3] 25.0-30.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3]  0.0-30.0 sec  8.30 GBytes  2.38 Gbits/sec

So, an improvement, but not quite line speed yet. Next up we booted two of the nodes into vanilla CentOS 6.3 live CDs to test the stock OS kernel and see if that gets the speed up. Our local RTT is about 0.05ms, and a simple calculation shows that a 64KB TCP window is already enough to reach 10G, so in theory no tuning is needed for the vanilla kernel:

Window = BW * RTT = 10Gbit/s * 0.05ms = 1.25GB/s * 0.05ms = 1280MB/s * 0.05ms = 1.28MB/ms * 0.05ms = 0.064MB ≈ 64KB.

The vanilla kernel usually has tcp_rmem and tcp_wmem defaults of around 4K 16K 4M, meaning that a 16K default window would indeed give about 2.4Gbit/s, but setting the window size larger should give you the full 10G. However, at least our first tests with the vanilla kernel came up with nothing promising: we couldn't get past about 2.5Gbit/s, and with multiple streams we were at best hitting 4.5Gbit/s. I'll continue the post when I continue the debugging tomorrow...

Slurm and health check

For Torque we used to run a cronjob every 30 minutes that checked whether the node was working properly and disabled it if not. With Slurm I finally took the time to look for a way to have the scheduler do this automatically, and discovered it was extremely easy. You just need to add two config lines:


HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300

Now Slurm runs the health check program every 5 minutes, and if it gets stuck it's killed within 60s. The script performs its checks, and if a check fails it has to take care of fixing the problem or disabling the node. It's done fairly simply. For example, we check for the presence of the /hdfs directory (access to storage) and if it's missing we ask Slurm to drain the node:


# Test HDFS; NODE is assumed to be set earlier in the script to this node's hostname
NHDFS=`ls /hdfs | wc -l`
if [ $NHDFS -eq 0 ]; then
  scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi

You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:


[root@slurm-1 ~]# sinfo -lN
Tue Dec  4 16:39:01 2012
NODELIST         NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON              
wn-v-[2036,..,5384]  8     main*   allocated   32   2:16:1  65536        0      1   (null) none                
wn-v-[2072,..,7039] 19     main*     drained  24+  2:12+:1 49152+        0      1   (null) HDFS mount lost     
wn-v-[2324,..,6428]  6     main*        idle   32   2:16:1  65536        0      1   (null) none 



As you can see, nodes that have lost the mount are automatically disabled and drained, to be taken care of by you at some point.
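
Once a node has been fixed, it can be put back into service with the matching scontrol call (the node name here is just one from the list above):

scontrol update NodeName=wn-v-2072 State=resume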


Monday, December 3, 2012

XFS vs EXT4

I've been hanging around the #gluster chat for the past week to make sure our gluster installation keeps running smoothly, and the one topic that keeps popping up is the underlying brick filesystem. The default recommendation from the gluster people is XFS, because of an EXT4 bug causing major mayhem that was introduced in recent kernels and backported to the RHEL 6 main branch. Read more about that here:


However, I wouldn't jump to XFS that fast myself. In August-September this year, after we rebuilt the datacenter with a new power grid, new interconnect and lots of new nodes, we were also planning to expand and change our storage to a more distributed model. The system we had been using was 15 storage nodes that provided RAID5 volumes totalling 750TB. Instead we planned to move to feeding plain disks directly to Hadoop as data directories and have every block in Hadoop replicated at least twice. That would cost us ca 30-40% of usable capacity (we'd regain the parity and hot spare drives, which alleviated it somewhat), but we'd also add all the worker node disks (each node has 3x 3TB drives, of which ca 1TB is dedicated to local scratch space + OS), so the overall raw capacity went up to 2PB, giving us 1PB of usable disk space in a distributed, RAID10-like environment.

All was good and nice except that the power grid wasn't initially well balanced. So when we cranked up the job counts we sometimes hit a region where a breaker would turn off a block of 16-25 servers. As all servers were now also part of the storage, that was annoying, especially because those were real breakers, so someone had to physically go on-site, re-enable the breaker and try to rebalance the power grid (luckily we had power phase usage history, so we could see which phases were over- or under-loaded).

The bad part was that XFS turned out not to be at all safe against power loss. I don't mind losing files that were actively being written (that means new files that would never get marked as completed, automatically triggering a redownload). The real problem was that every time we lost a block of nodes, Hadoop claimed to have lost anywhere from 300k to 1M blocks, and each block is 128MB. Most of the time this was just an annoyance, because every block is replicated twice, so it only meant a lot of inter-node traffic until the under-replicated blocks got re-replicated. But with that many lost blocks it also meant that at times BOTH copies of a block were killed, and when that happened we had data corruption and lost files.

Upon closer inspection we found that blocks were marked lost because the datanodes coming up after the reboot couldn't validate them. The reason was that either the data file itself or its metadata file had been replaced by a 0-byte file; if either was lost, the block was considered lost. In theory we might have been able to recover some of them, because when both replicas were lost it may well have been the data file on one copy and the metadata file on the other, but it was way too much manual labour to be worth it.

After about 10 power losses and hundreds of TB of lost data we decided to move away from XFS, even though this problem of 0-size files is not supposed to affect newer kernels and we were running CentOS 6.3 with an OpenVZ kernel (not the newest at the time, but still a 2.6.32-x kernel that should have had all the XFS fixes). So after about a month of migration, decommissioning a block of datanodes at a time, reformatting to ext4 and moving on, we now run everything on EXT4. So far we've not been hit by the above-mentioned EXT4 bug in Hadoop-related operations (we think), but it'll be something to keep in mind when operating glusterfs. Right now gluster runs on the storage nodes that still use Scientific Linux 5.7, with kernels that are not affected, but if we move gluster to a layout similar to Hadoop's it'll mean introducing bricks on ext4 partitions under newer kernels, and that may be an issue.