Burnt Chrome: iperf3 and microbursts

This is in the "Always know how your tools work" category.

Background

We were doing some end-to-end network testing at work using iperf3. While TCP would happily fill the link, UDP was giving us miserable results. It wouldn't achieve more than 8Mbps of 'goodput' on a 50Mbps link (whose capacity we'd just verified using TCP). Any traffic offered above 8Mbps was getting dropped.

The setup at the traffic generation end was:

PC -> switch1 -> switch2 -> 50mbps link to ISP

PC to switch1 was 1Gbps
switch1 to switch2 was 100Mbps

If we connected the PC running iperf3 directly to switch2, we'd get much better throughput (or, rather, much less packet loss for the same traffic load). But then the PC was transmitting into a 100Mbps port, not a 1Gbps port...

I thought that this sounded like perhaps packet bursts were exceeding some buffers, and got curious as to how iperf3 actually generates traffic and performs pacing with the -b option.

Packet Bursts

I personally didn't have access to the test setup, so I did some quick code inspection. What I found was that iperf3 wakes up on a 100ms timer and turns transmission on, then turns it back off once it has reached the right long-term average transmit rate (specified with -b).

What this ends up looking like for a 100Mbps rate transmitting into a much higher rate interface, using 1KB packets, is below. This is a graph of MB sent per 1ms period, scaled into Mbps.

Here's the first second of that, using the same Y scale (1Gbps):

So yeah, bursty. Not exactly "micro" bursts, either. More like millibursts.

iperf3 and Packet Pacing

iperf3 uses the fq socket pacing option by default when it's available (see the tc-fq(8) man page). But we were using OS X, where it's not available.

When it's not available, iperf3 uses the following algorithm to throttle the rate it transmits at:

while (testing)
    sleep 100ms
    while (total_bytes_sent / total_elapsed_time < target_rate)
        transmit buffer of data

This results in big bursts on the wire: every 100ms, a tenth of a second's worth of data (target_rate/10) goes out back-to-back at the local interface's line rate. If the local interface rate is the same as the end-to-end network path's rate, then there's no issue. If the local interface rate is wildly different, issues start to arise, such as at a switch that's doing a 10:1 rate change between ports.
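To get a feel for the burst pattern, here is a minimal Python sketch of the timer-driven loop. It is an idealized model rather than iperf3's actual code: the function name and parameters are made up for illustration, packets are assumed to leave back-to-back at line rate, and kernel and NIC buffering are ignored.

```python
def simulate_timer_pacing(target_bps, line_bps, timer_us,
                          duration_us=1_000_000, packet_bytes=1000):
    """Return bytes put on the wire in each 1ms bin (idealized model)."""
    # Time one packet occupies the wire at line rate, in whole microseconds.
    packet_us = packet_bytes * 8 * 1_000_000 // line_bps
    bytes_per_ms = [0] * (duration_us // 1000)
    total_bytes = 0
    now_us = 0
    for wake_us in range(timer_us, duration_us, timer_us):
        now_us = max(now_us, wake_us)  # sleep until the timer fires
        # Send back-to-back while the long-term average is below target:
        # total_bytes * 8 / (now_us / 1e6) < target_bps, in integer form.
        while (now_us < duration_us
               and total_bytes * 8_000_000 < target_bps * now_us):
            bytes_per_ms[now_us // 1000] += packet_bytes
            total_bytes += packet_bytes
            now_us += packet_us
    return bytes_per_ms


if __name__ == "__main__":
    for timer_us in (100_000, 1_000):
        bins = simulate_timer_pacing(100_000_000, 1_000_000_000, timer_us)
        peak_mbps = max(bins) * 8 / 1000  # bytes per ms -> Mbps
        print(f"timer {timer_us // 1000}ms: peak 1ms bin = {peak_mbps:.0f}Mbps")
```

In this model, a 100Mbps target on a 1Gbps interface with the 100ms timer produces 1ms bins that sit at the full line rate for roughly 10ms at a time, while a 1ms timer keeps every bin close to the target rate.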

Switches don't have 100ms of buffers, nor do we want them to (because that leads to TCP bufferbloat).
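A rough back-of-the-envelope estimate shows how much buffering one such burst would demand at the rate-changing port. This is a sketch under simple assumptions (one burst arriving at the ingress line rate, a constant drain at the egress rate, no other traffic), and the function name is hypothetical.

```python
def burst_backlog_bytes(target_bps, timer_ms, ingress_bps, egress_bps):
    """Peak queue depth (bytes) left behind by one timer-driven burst."""
    burst_bits = target_bps * timer_ms / 1000    # data released per timer tick
    if egress_bps >= ingress_bps:
        return 0.0                               # port drains as fast as it fills
    # While the burst arrives, the egress port drains a fraction
    # egress/ingress of it; the remainder has to sit in the buffer.
    backlog_bits = burst_bits * (1 - egress_bps / ingress_bps)
    return backlog_bits / 8


if __name__ == "__main__":
    # The setup from this post: 50Mbps target, 100ms timer, 1Gbps -> 100Mbps.
    print(round(burst_backlog_bytes(50_000_000, 100, 1_000_000_000, 100_000_000)))
```

For the numbers in this post, that works out to a backlog of about 562,500 bytes per burst at the 1Gbps-to-100Mbps hop, under the assumptions above. The actual buffer size of the switch in question was not measured.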

So I started experimenting with faster timers to see how smooth the results would be.

10ms timer

Using a 10ms timer brought the peaks down a little, and vastly shortened them in the process.

But it's still not "smooth".

1ms timer

The real fq packet scheduler sets a timer based on the number of µs between packets to achieve the right rate, to give a very smooth pacing. That's probably a better solution than the 1ms timer that I ended up using, but the 1ms timer works fairly well:

It's still not ideal, but it's quite good. In my testing, it has a minor impact on CPU load, but not enough to cause issues (I can get 45-50Gbps between processes running locally with either timer value).
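For comparison, the per-packet spacing that a pacer working at packet granularity would aim for can be worked out directly. The sketch below only does the arithmetic; it is not the fq scheduler's implementation, and the helper names are made up for illustration.

```python
def packet_gap_us(target_bps, packet_bytes):
    """Microseconds between packet starts for an evenly paced stream."""
    return packet_bytes * 8 * 1_000_000 / target_bps


def departure_times_us(target_bps, packet_bytes, count):
    """Ideal send times (in microseconds) for the first `count` packets."""
    gap = packet_gap_us(target_bps, packet_bytes)
    return [i * gap for i in range(count)]


if __name__ == "__main__":
    gap = packet_gap_us(100_000_000, 1000)  # 1000-byte packets at 100Mbps
    print(f"gap between packets: {gap}us")
    print(f"packets released per 1ms tick: {1000 / gap}")
```

With 1000-byte packets at 100Mbps, the ideal gap is 80µs, so a 1ms timer releases about 12.5 packets back-to-back per tick instead of spacing them out individually.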

Conclusion

Know what your tools actually do, especially in networking, where rate limiting is really an exercise in pulse-width modulation.


Tuesday, September 20, 2016
