Thursday, December 13, 2012

gluster stuck with a lock

Today I ran into a situation where gluster was stuck. You could still check volume info and status, but nothing more involved. When you tried to enable profiling or run a heal, the log showed:

[2012-12-13 15:40:07.881363] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: c3ce6b9c-6297-4e77-924c-b44e2c13e58f, lock held by: c3ce6b9c-6297-4e77-924c-b44e2c13e58f

As you can see, the node can't obtain a lock that it is itself holding. Googling around, it seems the error comes from a command that failed to complete cleanly and never released the lock. Various posts claim the lock should expire on its own after 30 minutes, but waiting for that is hardly a solution. A few times I was also able to get rid of the lock with

gluster volume statedump home0

but not always. When I did manage to clear the lock, things seemed to work (profiling started and returned data, etc.), but when I then ran

gluster volume heal home0 

it locked up again. In the end the only advice I got from #gluster was to unmount all clients, stop all gluster processes and start them again. That did clear the locks, and the heal command works now, but it's not a procedure I'd like to ever repeat. I've written to the gluster-users list and will update this entry if I learn how this should actually have been cleared.
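
For reference, the full reset boiled down to roughly the following (a sketch from memory; the service and process names are the standard gluster ones on EL6, and /home is where the clients mount the volume):

# on every client: unmount the gluster volume
umount /home

# on every server: stop the management daemon and any leftover brick/NFS processes
service glusterd stop
pkill glusterfsd
pkill glusterfs

# on every server: start glusterd again, it respawns the brick and NFS processes
service glusterd start

# on every client: mount the volume again (assuming an fstab entry exists)
mount /home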

Wednesday, December 5, 2012

Network debugging for 10Gbit/s

So today we decided to take the network testing seriously. We connected two servers to the generic infrastructure over 1G (to be able to pull software from the internet) and ran a direct cable between them for 10G. This way we can test purely the OS + driver + NIC firmware level, confirm we can reach 10Gbit/s, and from there expand the test to include a switch.

To make sure we didn't have any stale software, we reinstalled both nodes with CentOS 6.3 and no kernel tunings. We also compared the sysctl -a output on the two servers; there were minor differences, but none of them should be relevant.

We then ran a few baseline tests. First, iperf locally inside the server (client and server on the same machine) to see how much the box itself can handle. With a 256KB window size both servers did at least 17Gbit/s:


[root@wn-d-01 ~]# iperf -w 256k -c 192.168.2.1 -i 1
------------------------------------------------------------
Client connecting to 192.168.2.1, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 55968 connected with 192.168.2.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1.77 GBytes  15.2 Gbits/sec
[  3]  1.0- 2.0 sec  2.37 GBytes  20.4 Gbits/sec
[  3]  2.0- 3.0 sec  2.42 GBytes  20.8 Gbits/sec
[  3]  3.0- 4.0 sec  2.40 GBytes  20.6 Gbits/sec
[  3]  4.0- 5.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  5.0- 6.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  6.0- 7.0 sec  2.39 GBytes  20.6 Gbits/sec
[  3]  7.0- 8.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  8.0- 9.0 sec  2.39 GBytes  20.5 Gbits/sec
[  3]  9.0-10.0 sec  2.38 GBytes  20.5 Gbits/sec
[  3]  0.0-10.0 sec  23.3 GBytes  20.0 Gbits/sec
[root@wn-d-01 ~]#

Next we ran the test between the two servers over the 1Gbit/s link, to rule out any generic foul play that cuts speeds to 25% or 33%; we saw a nice, steady 942Mbit/s:


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.2.98
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.2.1 port 53338 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   112 MBytes   943 Mbits/sec
[  3]  1.0- 2.0 sec   112 MBytes   942 Mbits/sec
[  3]  2.0- 3.0 sec   112 MBytes   942 Mbits/sec
[  3]  3.0- 4.0 sec   112 MBytes   942 Mbits/sec
[  3]  4.0- 5.0 sec   112 MBytes   942 Mbits/sec
[  3]  5.0- 6.0 sec   112 MBytes   941 Mbits/sec
[  3]  6.0- 7.0 sec   112 MBytes   942 Mbits/sec
[  3]  7.0- 8.0 sec   112 MBytes   942 Mbits/sec
[  3]  8.0- 9.0 sec   112 MBytes   942 Mbits/sec
[  3]  9.0-10.0 sec   112 MBytes   942 Mbits/sec
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec
[root@wn-d-01 ~]# 

Next we brought up the default kernel driver for our Mellanox 10G card and first tested the link with ping:


[root@wn-d-98 ~]# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.725 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.187 ms

So an RTT of 0.2ms means that even with a 256KB window size you should get 9.8Gbit/s. Let's see if that actually works (remember, this is a direct attached cable and nothing else is running on the servers):
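
As a quick sanity check of that window/RTT arithmetic (a rough calculation with bc; it just confirms that a 256KB window is not the limiting factor here):

# throughput ceiling imposed by a 256 KB window at 0.2 ms RTT, in Gbit/s
echo "256 * 1024 * 8 / 0.0002 / 10^9" | bc -l
# ~10.5, i.e. more than the link itself can carry

# window needed to keep a 10 Gbit/s link full at 0.2 ms RTT, in KB
echo "10 * 10^9 / 8 * 0.0002 / 1024" | bc -l
# ~244 KB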


[root@wn-d-01 ~]# iperf -w 256k -i 1 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 51131 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   290 MBytes  2.43 Gbits/sec
[  3]  1.0- 2.0 sec   296 MBytes  2.49 Gbits/sec
[  3]  2.0- 3.0 sec   191 MBytes  1.61 Gbits/sec
[  3]  3.0- 4.0 sec   320 MBytes  2.68 Gbits/sec
[  3]  4.0- 5.0 sec   232 MBytes  1.95 Gbits/sec
[  3]  5.0- 6.0 sec   161 MBytes  1.35 Gbits/sec
[  3]  6.0- 7.0 sec   135 MBytes  1.13 Gbits/sec
[  3]  7.0- 8.0 sec   249 MBytes  2.09 Gbits/sec
[  3]  8.0- 9.0 sec   224 MBytes  1.88 Gbits/sec
[  3]  9.0-10.0 sec   182 MBytes  1.53 Gbits/sec
[  3]  0.0-10.0 sec  2.23 GBytes  1.91 Gbits/sec
[root@wn-d-01 ~]#

Not even close. So next we installed Mellanox's official kernel modules. With those we could also raise the window size to 1-2MB (the default setup somehow capped it at 256KB), though that shouldn't really matter. The first test looked promising for the first few seconds:


[root@wn-d-01 ~]# iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58336 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  1021 MBytes  8.57 Gbits/sec
[  3]  1.0- 2.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  2.0- 3.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  3.0- 4.0 sec  1.10 GBytes  9.47 Gbits/sec
[  3]  4.0- 5.0 sec   933 MBytes  7.82 Gbits/sec
[  3]  5.0- 6.0 sec   278 MBytes  2.33 Gbits/sec
[  3]  6.0- 7.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  7.0- 8.0 sec   277 MBytes  2.32 Gbits/sec
[  3]  8.0- 9.0 sec   276 MBytes  2.32 Gbits/sec
[  3]  9.0-10.0 sec   277 MBytes  2.33 Gbits/sec

No matter how hard we tried we couldn't repeat the 9.47Gbit/s speeds. Digging into the Mellanox network performance tuning guide, I first raised the relevant kernel parameters to the values they recommend, but that had absolutely no impact on throughput.

The tunings they recommend:

# disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0

# Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0

# Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000

# Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl  -w net.core.rmem_max=16777216
sysctl  -w net.core.wmem_max=16777216
sysctl  -w net.core.rmem_default=16777216
sysctl  -w net.core.wmem_default=16777216
sysctl  -w net.core.optmem_max=16777216

# Increase memory thresholds to prevent packet dropping:
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

# Increase Linux’s auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable low latency mode for TCP:
sysctl -w net.ipv4.tcp_low_latency=1

What did have some impact was turning off adaptive interrupt moderation, though only briefly. One direction now ran at a steady 7Gbit/s, but the other direction could hold 7Gbit/s only for a few seconds before hiccuping and dropping back to 2.3Gbit/s:


iperf -w 1M -i 1 -t 30 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58341 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   856 MBytes  7.18 Gbits/sec
[  3]  1.0- 2.0 sec   855 MBytes  7.17 Gbits/sec
[  3]  2.0- 3.0 sec   879 MBytes  7.37 Gbits/sec
[  3]  3.0- 4.0 sec   902 MBytes  7.57 Gbits/sec
[  3]  4.0- 5.0 sec   854 MBytes  7.16 Gbits/sec
[  3]  5.0- 6.0 sec   203 MBytes  1.71 Gbits/sec
[  3]  6.0- 7.0 sec   306 MBytes  2.56 Gbits/sec
[  3]  7.0- 8.0 sec   852 MBytes  7.15 Gbits/sec
[  3]  8.0- 9.0 sec   799 MBytes  6.70 Gbits/sec
[  3]  9.0-10.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 10.0-11.0 sec   323 MBytes  2.71 Gbits/sec
[  3] 11.0-12.0 sec   278 MBytes  2.33 Gbits/sec
[  3] 12.0-13.0 sec   277 MBytes  2.32 Gbits/sec
[  3] 13.0-14.0 sec   277 MBytes  2.32 Gbits/sec
...

Reading further I changed the adaptive interrupt moderation to this:

ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32
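
The currently active coalescing values can be checked with ethtool as well, which is handy for confirming the change actually took effect:

ethtool -c eth2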

Running two parallel streams and a bidirectional test gives a pretty good result.

wn-d-98 as client, wn-d-01 as server:

# iperf -w 1M -i 5 -t 20 -c 192.168.10.1 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.1, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.2 port 56125 connected with 192.168.10.1 port 5001
[  3] local 192.168.10.2 port 56126 connected with 192.168.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

The other direction:

------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 58460 connected with 192.168.10.2 port 5001
[  4] local 192.168.10.1 port 58459 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  0.0- 5.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM]  0.0- 5.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3]  5.0-10.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4]  5.0-10.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  2.76 GBytes  4.74 Gbits/sec
[  4] 10.0-15.0 sec  2.76 GBytes  4.73 Gbits/sec
[SUM] 10.0-15.0 sec  5.52 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3]  0.0-20.0 sec  11.0 GBytes  4.74 Gbits/sec
[  4] 15.0-20.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 15.0-20.0 sec  5.52 GBytes  9.48 Gbits/sec
[  4]  0.0-20.0 sec  11.0 GBytes  4.73 Gbits/sec
[SUM]  0.0-20.0 sec  22.1 GBytes  9.47 Gbits/sec

So with two streams we can saturate the link. Testing again with a single stream we got d-98 -> d-01 at 9.5Gbit/s, but the reverse direction sat at 2.3Gbit/s. Running d-01 -> d-98 with -P 2 got to 9.5Gbit/s again. Bizarre. The next test is to see what happens after a reboot.

After the reboot we see pretty much the same state: a single stream sucks, two streams reach 9.5Gbit/s at least initially, and the speeds slow down if you run them in both directions in parallel. The tuning script was deliberately not made reboot safe, exactly so we could see whether it had any effect; the same goes for the adaptive interrupt settings. Applying both again brings us back to the previous state, where single streams run at 2-3Gbit/s when going simultaneously in both directions and two parallel streams get 9.5Gbit/s in both directions.
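
For reference, if we did want these settings to survive a reboot, the usual route on EL6 would be to put the same sysctl values into /etc/sysctl.conf and reload, and to re-run the ethtool command from rc.local; a sketch with the same values as above:

cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_low_latency = 1
EOF
sysctl -p

echo 'ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32' >> /etc/rc.local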

Update: the traffic speed seems to be fairly unstable. I also tried moving to the OpenVZ kernel and, with the same tunings and interrupt settings, got the following result (notice the varying speed and how it jumps around while traffic is being sent in the other direction at the same time):


# iperf -w 1M -i 5 -t 100 -c 192.168.10.2 -P 2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.10.1 port 52035 connected with 192.168.10.2 port 5001
[  3] local 192.168.10.1 port 52036 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3]  0.0- 5.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM]  0.0- 5.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[  3]  5.0-10.0 sec  1.76 GBytes  3.03 Gbits/sec
[SUM]  5.0-10.0 sec  3.53 GBytes  6.06 Gbits/sec
[  4] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 10.0-15.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 10.0-15.0 sec  5.19 GBytes  8.92 Gbits/sec
[  4] 15.0-20.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 15.0-20.0 sec   693 MBytes  1.16 Gbits/sec
[SUM] 15.0-20.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 20.0-25.0 sec   694 MBytes  1.17 Gbits/sec
[  3] 20.0-25.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 20.0-25.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47385
[  6] local 192.168.10.1 port 5001 connected with 192.168.10.2 port 47384
[  4] 25.0-30.0 sec  1.85 GBytes  3.18 Gbits/sec
[  3] 25.0-30.0 sec  2.42 GBytes  4.16 Gbits/sec
[SUM] 25.0-30.0 sec  4.28 GBytes  7.35 Gbits/sec
[  3] 30.0-35.0 sec  2.36 GBytes  4.06 Gbits/sec
[  4] 30.0-35.0 sec  1.70 GBytes  2.93 Gbits/sec
[SUM] 30.0-35.0 sec  4.06 GBytes  6.98 Gbits/sec
[  3] 35.0-40.0 sec  2.83 GBytes  4.86 Gbits/sec
[  4] 35.0-40.0 sec  2.10 GBytes  3.61 Gbits/sec
[SUM] 35.0-40.0 sec  4.93 GBytes  8.47 Gbits/sec
[  4] 40.0-45.0 sec   820 MBytes  1.38 Gbits/sec
[  3] 40.0-45.0 sec  1.37 GBytes  2.36 Gbits/sec
[SUM] 40.0-45.0 sec  2.17 GBytes  3.73 Gbits/sec
[  4] 45.0-50.0 sec  2.30 GBytes  3.96 Gbits/sec
[  3] 45.0-50.0 sec  3.02 GBytes  5.20 Gbits/sec
[SUM] 45.0-50.0 sec  5.33 GBytes  9.15 Gbits/sec
[  4] 50.0-55.0 sec  1.37 GBytes  2.36 Gbits/sec
[  3] 50.0-55.0 sec  2.00 GBytes  3.43 Gbits/sec
[SUM] 50.0-55.0 sec  3.37 GBytes  5.79 Gbits/sec
[  4]  0.0-30.9 sec  12.4 GBytes  3.46 Gbits/sec
[  6]  0.0-30.9 sec  12.6 GBytes  3.50 Gbits/sec
[SUM]  0.0-30.9 sec  25.0 GBytes  6.96 Gbits/sec
[  4] 55.0-60.0 sec  2.63 GBytes  4.51 Gbits/sec
[  3] 55.0-60.0 sec  2.89 GBytes  4.96 Gbits/sec
[SUM] 55.0-60.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[  3] 60.0-65.0 sec  2.60 GBytes  4.47 Gbits/sec
[SUM] 60.0-65.0 sec  5.20 GBytes  8.94 Gbits/sec
[  4] 65.0-70.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 65.0-70.0 sec   696 MBytes  1.17 Gbits/sec
[SUM] 65.0-70.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[  3] 70.0-75.0 sec   858 MBytes  1.44 Gbits/sec
[SUM] 70.0-75.0 sec  1.67 GBytes  2.88 Gbits/sec
[  4] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[  3] 75.0-80.0 sec  2.76 GBytes  4.74 Gbits/sec
[SUM] 75.0-80.0 sec  5.51 GBytes  9.47 Gbits/sec
[  4] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[  3] 80.0-85.0 sec  2.60 GBytes  4.46 Gbits/sec
[SUM] 80.0-85.0 sec  5.19 GBytes  8.92 Gbits/sec
[  3] 85.0-90.0 sec   694 MBytes  1.16 Gbits/sec
[  4] 85.0-90.0 sec   697 MBytes  1.17 Gbits/sec
[SUM] 85.0-90.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 90.0-95.0 sec   695 MBytes  1.17 Gbits/sec
[  3] 90.0-95.0 sec   694 MBytes  1.17 Gbits/sec
[SUM] 90.0-95.0 sec  1.36 GBytes  2.33 Gbits/sec
[  4] 95.0-100.0 sec   696 MBytes  1.17 Gbits/sec
[  4]  0.0-100.0 sec  32.6 GBytes  2.80 Gbits/sec
[  3] 95.0-100.0 sec   694 MBytes  1.16 Gbits/sec
[SUM] 95.0-100.0 sec  1.36 GBytes  2.33 Gbits/sec
[  3]  0.0-100.0 sec  36.7 GBytes  3.15 Gbits/sec
[SUM]  0.0-100.0 sec  69.3 GBytes  5.95 Gbits/sec
[root@wn-d-01 mlnx_en-1.5.9]# 

Reducing rx-usecs and rx-frames to 0, as recommended in the performance guide, I can't really get past 3Gbit/s. So it does point towards some issue with interrupt handling.

As a final test, since the Mellanox driver package provides scripts to set the IRQ affinity for Mellanox interfaces, I tried pinning the interrupts and retesting. On both nodes:


# set_irq_affinity_cpulist.sh 0 eth2
-------------------------------------
Optimizing IRQs for Single port traffic
-------------------------------------
Discovered irqs for eth2: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
Assign irq 120 mask 0x1
Assign irq 121 mask 0x1
Assign irq 122 mask 0x1
Assign irq 123 mask 0x1
Assign irq 124 mask 0x1
Assign irq 125 mask 0x1
Assign irq 126 mask 0x1
Assign irq 127 mask 0x1
Assign irq 128 mask 0x1
Assign irq 129 mask 0x1
Assign irq 130 mask 0x1
Assign irq 131 mask 0x1
Assign irq 132 mask 0x1
Assign irq 133 mask 0x1
Assign irq 134 mask 0x1
Assign irq 135 mask 0x1

done.

And the result:


[root@wn-d-01 ~]# iperf -w 1M -i 5 -t 100 -c 192.168.10.2
------------------------------------------------------------
Client connecting to 192.168.10.2, TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.10.1 port 52039 connected with 192.168.10.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  5.51 GBytes  9.46 Gbits/sec
[  3]  5.0-10.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 10.0-15.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 15.0-20.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 20.0-25.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 25.0-30.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 30.0-35.0 sec  5.51 GBytes  9.47 Gbits/sec
[  3] 35.0-40.0 sec  5.51 GBytes  9.47 Gbits/sec
^C[  3]  0.0-41.6 sec  45.8 GBytes  9.47 Gbits/sec
[root@wn-d-01 ~]# 



That works with single and multiple streams, in one direction and in both at once. Yay! Now we just have to solve this on ALL nodes :)
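
A rough sketch of how that roll-out could look (the host list is made up, and in reality this belongs in configuration management rather than an ssh loop):

for node in wn-d-01 wn-d-02 wn-d-98; do
  ssh "$node" 'ethtool -C eth2 adaptive-rx off rx-usecs 32 rx-frames 32; \
               set_irq_affinity_cpulist.sh 0 eth2'
done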

Tuesday, December 4, 2012

OpenVZ latest kernels screw networking?

We've been puzzled recently by the overall performance of our networking. We run a Mellanox fabric of four SX1016 switches, which are 64-port 10G switches. Three of them are trunked to the fourth with eight ports each, meaning that within a switch you can get up to 560Gbit/s, while between switches you're limited to about 80Gbit/s. The way our environment is laid out, you should get the benefits of both worlds.

However, in practice we see most nodes pushing traffic in the 0.5-1.5Gbit/s range, which is really, really odd. We had been suspecting the switches and the rest of the fabric for a long while and have about four different Mellanox tickets open; the switch and 10G card firmwares we now run in production were largely created by Mellanox in response to our issues.

But today, as Ilja was debugging another issue at one of his 1G deployments, he noticed a weird network performance drop. Even with basic tests he couldn't get 1G line speed, not even close. The bugzilla ticket in question: http://bugzilla.openvz.org/show_bug.cgi?id=2454

He repeated the test here in our datacenter on some spare older nodes with 1G networking and was able to reproduce the issue, which disappeared after a kernel downgrade. He also tested speeds between 1G and 10G nodes and was getting really bad results. So next we planned to test inside the 10G fabric. I ran a test between two 10G nodes and, no matter what I tried, I was hard pressed to see more than half a Gbit/s. I then decided to run a direct HN <-> VZ container test, as that path has been shown to run without overhead, so on 10G we should easily get 9+ Gbit/s.

Well that's a nice thought, but this is the actual result:



[root@wn-v-5492 ~]# iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 34441 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec   235 MBytes   394 Mbits/sec
[  3]  5.0-10.0 sec   910 MBytes  1.53 Gbits/sec
[  3] 10.0-15.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 15.0-20.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 20.0-25.0 sec  1.04 GBytes  1.79 Gbits/sec
[  3] 25.0-30.0 sec  1.05 GBytes  1.80 Gbits/sec
[  3]  0.0-30.0 sec  5.29 GBytes  1.52 Gbits/sec
[root@wn-v-5492 ~]# 

The interesting things to notice here: first, it takes a moment for the speed to pick up, which we can live with. Then, however, it's pretty hard capped at 1.8Gbit/s. The node was doing nothing else at the time, so it wasn't resource constrained.

Another interesting detail: if you go back to Ilja's bugzilla post, written before I even started testing, there's a telling quote:

"Another symptom - reversing the iperf testing direction (using VM as iperf client, and remote physical server as iperf server) results in even worse results, which are quite consistent: ~180Mbps"


As he was testing on a 1Gbit/s network, he saw 18% of line rate. We had just tested the exact same thing on completely different hardware and also got 18%. That's too much to believe to be a coincidence, so we started looking at kernels. The one we were running at the time was 2.6.32-042stab059.7, a rebase of 2.6.32-279.1.1.el6. We downgraded to 2.6.32-042stab053.5, a rebase of 2.6.32-220.7.1.el6, and reran the test:


iperf -w 256k -c 192.168.2.98 -i 5 -t 30
------------------------------------------------------------
Client connecting to 192.168.2.98, TCP port 5001
TCP window size:  512 KByte (WARNING: requested  256 KByte)
------------------------------------------------------------
[  3] local 10.10.23.164 port 33264 connected with 192.168.2.98 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  1.24 GBytes  2.13 Gbits/sec
[  3]  5.0-10.0 sec  1.39 GBytes  2.39 Gbits/sec
[  3] 10.0-15.0 sec  1.42 GBytes  2.43 Gbits/sec
[  3] 15.0-20.0 sec  1.41 GBytes  2.42 Gbits/sec
[  3] 20.0-25.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3] 25.0-30.0 sec  1.42 GBytes  2.44 Gbits/sec
[  3]  0.0-30.0 sec  8.30 GBytes  2.38 Gbits/sec

So, an improvement, but not quite line speed yet. Next we booted two of the nodes into vanilla CentOS 6.3 live CDs to test a vanilla OS kernel and see whether that gets the speed up. Our local RTT is about 0.05ms, and a simple calculation shows that a mere 64KB TCP window is enough to reach 10G, so in theory no tuning should be needed with the vanilla kernel:

Window = BW * RTT = 10 Gbit/s * 0.05 ms = 1.25 GB/s * 0.05 ms = 1280 MB/s * 0.05 ms = 1.28 MB/ms * 0.05 ms = 0.064 MB ≈ 65.5 KB

The vanilla kernel usually ships tcp_rmem and tcp_wmem around 4K / 16K / 4M, so a 16K default window would indeed give about 2.4Gbit/s, but setting a larger window should give you the full 10G. However, at least our first tests with the vanilla kernel came up with nothing promising: we couldn't get past about 2.5Gbit/s, and with multiple streams we were at best hitting 4.5Gbit/s. I'll continue this post when I continue debugging tomorrow...

Slurm and health check

For torque we used to run a cronjob every 30 minutes that checked whether the node was working properly and, if not, disabled it. With slurm I finally took the time to look for a way to have slurm do this automatically, and discovered it's extremely easy. You just need to add two config lines:


HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300

Now slurm runs the health check program every 5 minutes, and if it gets stuck it's killed within 60s. The script has to perform its checks and, if a check fails, either fix the problem or disable the node. That's fairly simple to do. For example, we check that the /hdfs directory (our storage mount) is populated, and if not we ask slurm to drain the node:


# The node's own short hostname (assuming slurm node names match hostnames)
NODE=`hostname -s`

# Test HDFS: an empty /hdfs means the mount is gone, so drain the node
NHDFS=`ls /hdfs | wc -l`
if [ "$NHDFS" -eq 0 ]; then
  scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi
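
Another check in the same spirit, just to illustrate the pattern (the /scratch path and the 10GB threshold here are made-up examples, not something we run):

# Make sure the local scratch area is mounted and has some free space
FREE=`df -P /scratch 2>/dev/null | awk 'NR==2 {print $4}'`
if [ -z "$FREE" ] || [ "$FREE" -lt 10485760 ]; then   # less than ~10GB free (df -P reports 1K blocks)
  scontrol update NodeName=$NODE State=drain Reason="scratch missing or full"
fi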

You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:


[root@slurm-1 ~]# sinfo -lN
Tue Dec  4 16:39:01 2012
NODELIST         NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON              
wn-v-[2036,..,5384]  8     main*   allocated   32   2:16:1  65536        0      1   (null) none                
wn-v-[2072,..,7039] 19     main*     drained  24+  2:12+:1 49152+        0      1   (null) HDFS mount lost     
wn-v-[2324,..,6428]  6     main*        idle   32   2:16:1  65536        0      1   (null) none 



As you can see, nodes that have lost the mount are automatically drained and taken out of service, to be dealt with at some point.
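
Once the underlying problem is fixed, the node can be put back into service with scontrol (the node name here is just one from the listing above):

scontrol update NodeName=wn-v-2072 State=resume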


Monday, December 3, 2012

XFS vs EXT4

I've been hanging around the #gluster channel for the past week to make sure our gluster installation runs smoothly, and the one topic that keeps popping up is the underlying brick filesystem. The default recommendation from the gluster people is XFS, because of an EXT4 bug that causes major mayhem and was introduced in recent kernels and backported to the RHEL 6 main branch. Read more about that here:


However, I wouldn't jump to XFS that fast myself. In August-September this year, after we rebuilt the datacenter for the new power grid, new interconnect and lots of new nodes, we were also planning to expand and move our storage to a more distributed model. The old system was 15 storage nodes providing RAID 5 volumes for a total of 750TB. Instead we planned to feed raw disks directly to Hadoop as data directories and have every block in Hadoop replicated at least twice. That would cost us roughly 30-40% of usable capacity (we regained the parity and hot spare drives, which alleviated it somewhat), but we would also add all the worker node disks (each node has 3x 3TB drives, of which about 1TB is dedicated to local scratch space + OS), so the overall raw capacity went up to 2PB, giving us about 1PB of usable space in a distributed, RAID 10-like setup.

All was good and nice, except that the power grid wasn't initially well balanced. When we cranked up the job counts we sometimes hit a point where a breaker would cut power to a block of 16-25 servers. Since all servers were now also part of the storage, this was doubly annoying, especially because these were physical breakers: someone had to go on-site, flip the breaker back and try to rebalance the power grid (luckily we had per-phase usage history, so we could see which phases were over- or under-loaded).

The really bad part was that XFS turned out to be not at all safe against power loss. I don't mind losing files that were actively being written (those are new files that never get marked as completed, which automatically triggers a redownload). The bad part was that every time we lost a block of nodes, Hadoop claimed to have lost anywhere from 300k to 1M blocks, each block being 128MB. Most of the time this was merely an annoyance, because every block is replicated twice, so it just meant a lot of inter-node traffic until the under-replicated blocks were re-replicated. But with that many lost blocks it also happened that at times BOTH copies of a block were gone, and then we had data corruption and lost files.

Upon closer inspection we found that blocks were marked lost because the datanodes coming up after the reboot couldn't validate them: either the datafile itself or its metadata file had been replaced by a 0-byte file, and if either was gone, the block was considered lost. In theory we might have been able to recover some of them, because when both replicas of a block were lost it could well be that we lost the datafile of one copy and only the metadata file of the other, but it was far too much manual labor to be worth it.

After about 10 power losses and hundreds of TB of lost data, we decided to move away from XFS, even though this problem of 0-size files is supposedly fixed in newer kernels and we were running CentOS 6.3 with an OpenVZ kernel (not the newest at the time, but still a 2.6.32-x kernel that should have had all the XFS fixes). So after about a month of migration, decommissioning a block of datanodes at a time, reformatting to ext4 and moving on, we now run everything on EXT4. So far we haven't been hit by the above-mentioned EXT4 bug in Hadoop-related operations (we think), but it will be something to keep in mind when operating glusterfs. Right now gluster runs on the storage nodes, which still use Scientific Linux 5.7 with kernels that are not affected, but if we move gluster to a Hadoop-like layout it will mean introducing bricks on ext4 partitions under newer kernels, and that may be an issue.

Lazy umount led to no NFS locking

Today one of our gluster NFS servers went wacko; no better word for it. On the console it was responsive and could talk to the outside world, but it seemed to have NFS issues. So we sent it off to reboot and remounted our /home directory from another gluster node. To do that I performed a lazy umount:

# umount -l /home

The idea, of course, is to detach the filesystem immediately, mount the correct one in its place, and as tasks finish their outstanding I/O they start using the new filesystem. However, that turned out not to be a good idea: even though most functionality recovered without users having to do anything, it somehow broke file locking. Any task that tried to take a lock on a file got an access denied error.
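
The whole switch-over was essentially just the following (the server name below is a placeholder for the surviving gluster node, and the mount options depend on the setup):

# detach the old mount without waiting for processes with open files
umount -l /home

# mount /home again from the other gluster NFS server
mount -t nfs gluster-2:/home /home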

Restarting the VM didn't help. On the headnode we saw in dmesg:


[1038502.797998] lockd: cannot monitor 192.168.1.244
[1038508.865280] lockd: cannot monitor 192.168.1.244

which quickly pointed to an nfslock service issue. Restarting it fixed everything. Let me now go and bang my head against the wall for a while.
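
For completeness, restarting the lock service on an EL5/EL6 machine is just:

service nfslock restart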