Thursday, September 22, 2016

iperf3 and microbursts, part II

Previously, I talked about iperf3's microbursts and how changing the frequency of the pacing timer could smooth things out.

One of the first pieces of feedback I got, which lined up with some musings of my own, was whether it might be better to calculate the timer interval from the smoothest possible packet rate:

PPS = Rate / Packet Size

This is, btw, what iperf3 does automatically on Linux when the fq socket scheduler is available.  So this is really just seeing if we can fake it from user-land on systems that don't have it (like my OSX system).
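For the rates in this post, that calculation works out as follows.  (A quick sketch; I'm assuming "1K" means 1024-byte UDP payloads, i.e. 8192 bits per packet.)

```python
def pacing_interval_us(rate_bps, packet_bytes):
    """Microseconds between packets for perfectly smooth pacing."""
    pps = rate_bps / (packet_bytes * 8)  # PPS = Rate / Packet Size
    return 1_000_000 / pps

print(pacing_interval_us(100e6, 1024))  # 81.92 -> ~82us between packets at 100Mbps
print(pacing_interval_us(500e6, 1024))  # 16.384 -> ~16us between packets at 500Mbps
```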

Luckily, adding this to iperf3's code is pretty easy.


To recap, using a 1ms timer at 100Mbps with a 1KB UDP packet size results in the following:
Zooming in:

Switching from 1ms pacing to ~82µs pacing (the calculated inter-packet interval), nets:
And zooming in:

And I should be careful here, because I'm quantizing this into buckets that are the same size as the timer.  I should probably be subdividing more finely (or much more coarsely) to get a better view of what's going on.  But I'm going to stick with the 1ms buckets for this analysis, for consistency of data presentation.
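To illustrate why the bucket size matters, here's a toy example (hypothetical numbers, not my captured data): a sender that bursts 12500 bytes at the top of every 1ms tick averages exactly 100Mbps, and 1ms buckets make it look perfectly smooth, while 100µs buckets reveal the line-rate spikes.

```python
def bytes_per_bucket(events, bucket_us, total_us):
    """events: (time_us, nbytes) pairs; returns per-bucket byte counts."""
    buckets = [0] * (total_us // bucket_us)
    for t, nbytes in events:
        buckets[t // bucket_us] += nbytes
    return buckets

# Burst 12500 bytes at the start of each 1ms tick: 100Mbps average.
events = [(tick * 1000, 12500) for tick in range(100)]

for bucket_us in (1000, 100):
    peak = max(bytes_per_bucket(events, bucket_us, 100_000))
    # bytes * 8 / microseconds == megabits per second
    print(f"{bucket_us}us buckets: peak {peak * 8 / bucket_us:.0f} Mbps")
```

With 1000µs buckets the peak reads 100 Mbps; with 100µs buckets the same traffic peaks at 1000 Mbps.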


I've only been showing 100Mbps results, but I really should document how it works at higher speeds, especially something closer to the line rate.  So here's what 500Mbps looks like through all these changes.


1ms timer:

And then using a calculated 16µs timer:

Much, much smoother pacing.

However, there are still some big overshoots.  Those are due to how the iperf3 red-light/green-light algorithm reacts to stumbles (or late-firing timers): it sends more packets until it catches back up.  At the micro scale this isn't a big deal, but it can cause the tool to stick in "green-light" mode when testing through congested links, where it can't actually maintain the desired rate.
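That catch-up behavior is easy to model.  Below is a toy version (my own sketch, not iperf3's actual code) of a pacer that sends packets each tick until the long-term average reaches the target; when one tick fires 5ms late, it answers with a burst:

```python
PACKET_BITS = 1024 * 8      # 1KB packets
TARGET_BPS = 100e6          # 100Mbps target
TICK_S = 0.000082           # ~82us calculated timer

def packets_this_tick(now_s, bits_sent):
    """Send until the cumulative average rate catches up to the target."""
    n = 0
    while (bits_sent + n * PACKET_BITS) / now_s < TARGET_BPS:
        n += 1
    return n

counts, bits_sent, now = [], 0.0, 0.0
for tick in range(5):
    late = 0.005 if tick == 3 else 0.0   # tick 3 stumbles by 5ms
    now += TICK_S + late
    n = packets_this_tick(now, bits_sent)
    bits_sent += n * PACKET_BITS
    counts.append(n)

print(counts)  # the late tick triggers a catch-up burst of ~60 packets
```

On the ticks that fire on time, it sends one packet each; the late tick sends dozens back-to-back, which is exactly the overshoot visible in the graphs.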

I've set up a new branch on my GitHub fork to play around with this, including capping the maximum frequency (with a command-line parameter to change it).  The cap is specified in µs, as that's what the POSIX API lets you use.

Now to get some captures from a Linux box with fq pacing and show how well it performs.

Tuesday, September 20, 2016

iperf3 and microbursts

This is in the "Always know how your tools work" category.


We were doing some end-to-end network testing at work using iperf3.  While TCP would happily fill the link, UDP was giving us miserable results: it wouldn't achieve more than 8Mbps of 'goodput' on a 50Mbps link (which we'd just verified using TCP).  Extra packets over 8Mbps were getting dropped.

The setup at the traffic generation end was:

PC -> switch1 -> switch2 -> 50mbps link to ISP

  • PC to switch1 was 1Gbps
  • switch1 to switch2 was 100Mbps

If we connected the PC running iperf3 directly to switch2, we'd get much better throughput (or, rather, much less packet loss for the same traffic load).  But then the PC was transmitting into a 100Mbps port, not a 1Gbps port...

I thought that this sounded like perhaps packet bursts were exceeding some buffers, and got curious as to how iperf3 actually generates traffic and performs pacing with the -b option.

Packet Bursts

I personally didn't have access to the test setup, so I did some quick code inspection.  What I found was that iperf3 turns transmission on using a 100ms timer, turning it back off once it has achieved the right long-term average transmit rate (specified with -b).

What this ends up looking like for a 100Mbps rate transmitting into a much higher rate interface, using 1KB packets, is below.  This is a graph of MB sent per 1ms period, scaled into Mbps.
Here's the first second of that, using the same Y scale (1Gbps):

So yeah, bursty.  Not exactly "micro" bursts, either.  More like millibursts.

iperf3 and Packet Pacing

iperf3 uses the fq socket pacing option by default when it's available (see the tc-fq(8) man page).  But we were using OSX, where it's not available.

When it's not available, iperf3 uses the following algorithm to throttle the rate it transmits at:

while (testing)
    sleep 100ms
    while (total_bytes_sent / total_elapsed_time < target_rate)
        transmit buffer of data

This results in big bursts on the wire, each carrying rate/10 bits of data (a full 100ms worth of the target rate).  If the local interface rate is the same as the end-to-end network path's rate, then there's no issue.  If the local interface rate is wildly different, issues start to arise, like at a switch that's doing a 10:1 rate change between ports.
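Concretely, for the 100Mbps-into-1Gbps case graphed above, each 100ms window owes 10Mbit of data, which the interface drains at line rate in about 10ms, followed by roughly 90ms of silence.  A quick arithmetic sketch:

```python
def burst_shape(target_bps, line_bps, timer_s):
    """On/off pacing: bits per window, burst length, idle gap."""
    window_bits = target_bps * timer_s   # data owed per timer window
    burst_s = window_bits / line_bps     # time to blast it out at line rate
    return window_bits, burst_s, timer_s - burst_s

bits, on, off = burst_shape(100e6, 1e9, 0.1)
print(f"{bits / 1e6:.1f} Mbit burst: {on * 1e3:.1f}ms on, {off * 1e3:.1f}ms off")
```

A 10% duty cycle at 10x the target rate: the average is right, but nothing in between looks like 100Mbps.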

Switches don't have 100ms of buffers, nor do we want them to (because that leads to TCP bufferbloat).

So I started experimenting with faster timers to see how smooth the results would be.

10ms timer

Using a 10ms timer brought the peaks down a little, and vastly shortened them in the process.
But it's still not "smooth".

1ms timer

The real fq packet scheduler sets a timer based on the number of µs between packets needed to achieve the target rate, which gives very smooth pacing.  That's probably a better solution than the 1ms timer I ended up using, but the 1ms timer works fairly well:

It's still not ideal, but it's quite good.  And in my testing, it has only a minor impact on CPU load, not enough to cause issues (I can get 45-50Gbps between processes running locally with either timer value).
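One way to see why faster timers help: the worst-case buffering a slower downstream port needs scales linearly with the timer period.  A rough sketch (assuming the 100Mbps-target, 1Gbps-interface case, with a 100Mbps port downstream that has to queue the difference while a burst arrives):

```python
def peak_buffer_bytes(target_bps, line_bps, drain_bps, timer_s):
    """Worst-case queue depth at a port draining slower than bursts arrive."""
    burst_bits = target_bps * timer_s    # one full window sent back-to-back
    burst_s = burst_bits / line_bps      # burst duration at line rate
    # Queue grows at (arrival - drain) for the length of the burst.
    return (line_bps - drain_bps) * burst_s / 8

for timer_ms in (100, 10, 1):
    q = peak_buffer_bytes(100e6, 1e9, 100e6, timer_ms / 1e3)
    print(f"{timer_ms}ms timer: ~{q / 1024:.0f} KB of buffer needed")
```

Roughly a megabyte of switch buffer per flow for the 100ms timer, versus about 11KB for the 1ms timer, which lines up with the packet loss we saw through the small-buffered switch.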


Know what your tools actually do, especially in networking, where rate limiting is really an exercise in pulse-width modulation.