A Brief History of Bufferbloat

2013-02-27 10:07 | Source

Introduction

The “bufferbloat” issue has now been explained and documented in many places by many people (most recently, and famously, by Jim Gettys), but I’m going to present my own explanation by way of introduction. I’m going to consider the case of a home network with a single router connecting the LAN to the internet (most likely via an ADSL or cable internet connection); this is not the only place where the issue arises, but it is the situation that most people are familiar with.

Why buffer?

To understand the problem with buffering, we first have to understand why buffering is being done in the first place. Generally speaking, buffering at least one packet is necessary to successfully forward traffic from the LAN interface to the WAN interface; if you can’t buffer at least one packet, you can’t receive and route any packets because you don’t know where to send them until you’ve received and processed them. However, most routers will buffer far more than just one packet; and the reason for this is throughput. Incoming traffic does not always arrive at a steady rate, so by keeping a reasonably-sized buffer of incoming traffic, the router can provide a steady stream of outgoing traffic keeping the outgoing link at maximum utilization despite fluctuations on the input side. To some extent, the more you buffer, the better throughput you can achieve, and as historically the focus has been on maximum throughput on an internet connection, buffers have been sized very generously for some time now, to the point where they are frequently far larger than they have to be in order to achieve maximum throughput. This brings us to the next question:

Why not buffer?

The problem with large buffers is that while they may improve throughput, they also increase latency. Thus, while “bulk” flows (file uploads and downloads) experienced improved performance, “interactive” flows such as gaming, VoIP traffic, and so on suffers. To understand why this is, let us consider some example figures. The default settings for an ethernet interface on Linux are to use the “pfifo_fast” queueing discipline (which is basically just a first-in-first-out queue, as the name suggests), with a qlen of 1000 (so the queue can grow up to 1000 packets long). Standard Ethernet MTU is 1500 bytes, which means that if the queue fills with traffic from a file upload, we will have 1,500,000 bytes (~1.43 MB) of data in the queue. Ordinarily the LAN interface will be running at 100M or 1G while the outgoing ADSL / cable connection will be much slower (let’s use a 1Mbps uplink in this illustration), causing the queue to fill up under normal circumstances before packets start being dropped.

Now, let’s say that while this file upload is occurring, you are also trying to play a real-time strategy game. When you issue a command to some of your units, the game sends out a small command packet encoding the command that you just issued. This packet arrives at the router and joins the queue behind all of the traffic currently in the queue. Only once it reaches the head of the queue will it actually be transmitted across your internet connection, so there will be a delay before it is even sent out onto the internet; this delay will be cumulative with the normal latency between you and the server. How long will this delay be? Transferring 1,500,000 bytes at 1Mbps will take 12 seconds! Obviously this is a ridiculous amount of added latency, resulting in a completely unplayable game as many gamers can attest to.

This increase in latency can even affect throughput if you are trying to download and upload at the same time; the ACK traffic for the upload will get caught in the bufferbloat caused by the download, and the ACK traffic for the download will get caught in the bufferbloat caused by the upload, causing everything to slow down. (Attempting to use a bittorrent client without setting ratelimits often leads one to encounter this problem, although many modern torrent clients have “smart ratelimiting” to try to work around this problem)

Now what?

So, how do we solve this problem? We can reduce the size of the queue in the router, as often it is massively oversized, but beyond a certain point, making the queue smaller will start to hurt throughput (resulting in slower downloads and uploads), forcing us to make a trade-off between latency and throughput. In addition, manually adjusting the size of the queue is a very difficult task, especially in the face of changing network conditions; the bandwidth available to your ADSL connection can vary greatly depending on congestion at your ISP, “turbo boost” ISP products that allow you to temporarily burst above your normal bandwidth limit, and so on. Can we do better than this?

In fact, we can do better. The answer lies in more advanced queue management: we want to queue as much as necessary to maintain throughput, but no more. The latest and greatest in this field is the CoDel (“controlled delay”) queue management algorithm, designed by Kathleen Nichols and Van Jacobson, which aims to achieve reasonable behaviour with very little tuning; in other words, it can be deployed on ADSL/cable routers in a standard configuration with no end-user tuning required. In brief, CoDel looks at a sliding window of time (100ms by default), and determines the minimum delay experienced by packets traversing the queue; if this increases above a certain target value (5ms by default), CoDel enters a dropping mode. (Side note: “dropping” can mean actual packet dropping, or simply ECN marking; end-users will normally want dropping, but users in more controlled environments, like datacenters, may get more reliable behaviour with packet marking).

CoDel (implemented by the “codel” qdisc in the Linux kernel) thus allows us to manage the size of the queue, but we still have the problem of multiple flows interfering with each other; if I start a file upload, that will still interfere with my IRC connection or VoIP call. What we need is called “fair queuing”; we want to share the internet connection “fairly” between all of the flows going over the connection, rather than allowing a few of them to hog the connection at the expense of others. The “fq_codel” qdisc gives us a way to do this (although there are other ways to accomplish the same thing); essentially, it classifies the incoming packets into separate flows and maintains a separate queue for each flow, managed by CoDel. (Actually, it uses a hashing scheme, so the more flows you have, the more likely it is that some of them will share the same queue, but this is necessary to avoid out-of-control resource usage in the presence of many flows.) Traffic is drawn from each separate queue “fairly”, so essentially this allows your interactive traffic (games, IRC, VoIP, etc.) to “skip the queue” as their individual queues will be small, instead of being stuck in a queue behind bulk flows which can build up a longer queue.

Caveats

Unfortunately there are still some problems facing early adopters wanting to take advantage of these improvements in queue management algorithms.

The first problem is that queues will only build up at the hop where the path bottleneck is; if this occurs at one of your ISP’s routers rather than your own router, then any queue management you do on your own router will have little effect. In order to ensure that the bottleneck is your router, you will need to ratelimit traffic through your router to below the speeds available to you through your ISP, thus reducing available bandwidth by a small amount, and also making it impossible to take advantage of any kind of “bursting” capacity your ISP might make available to you. In addition to this, if the available bandwidth drops below your ratelimiting due to network conditions (which may vary based on time of day, phase of moon, etc.), your queue management will once again become ineffective. The real solution here is for your ISP to implement CoDel (or some other active queue management that accomplishes the same thing), but for most people, that is just a pipe dream that is unlikely to be realised in the near future.

The second problem is that there are actually many different buffers lurking in many different places such as your ethernet switch, ADSL modem, Linux ethernet driver, etc. Some work has been done recently (see BQL) to deal with the sources of additional bufferbloat in the Linux kernel, but many drivers have not yet been updated to support BQL, so the problem remains. If your queue management algorithm thinks the packet has actually been transmitted, but it’s just stuck in another buffer farther down the stack, then it is unlikely to perform as expected. Send / receive offloads can produce similar problems as suddenly you can be waiting for a “packet” (which will actually be split into many packets by the ethernet hardware) to be transmitted that is far larger than the normal MTU, producing a much longer delay than expected; thus turning these off is essential (and unlikely to have any downside at typical home internet speeds, or even 1Gbps ethernet speeds, on modern hardware).

The third problem is a little more fundamental. At any given time, even if a packet has been placed at the front of all of your queues, you may already be in the process of transmitting a packet; thus the minimum delay that can be achieved is constrained by the time taken to transmit a single packet across your internet connection. Assuming a standard ethernet MTU of 1500 bytes (this size will actually be slightly higher in practice, due to ethernet frame / ADSL / ATM / etc. overheads), on a 100Mbps uplink, this will be a delay of 0.12ms; this is unlikely to be of concern for many people. However, on slower uplinks, this starts to become more of a problem: at 10Mbps the delay increases to 1.2ms; at 1Mbps the delay is 12ms (requiring the CoDel target to be increased, as reducing the delay below the default target of 5ms is now impossible); and at 512kbps the delay is 24ms. This figure represents not only an increase in the maximum delay experienced, but also the variance between minimum and maximum delay. If you are playing a game on a server that is 25ms away, having your latency fluctuate between 25ms and 49ms (nearly doubling) in an unpredictable fashion is far harder to deal with than a stable, predictable delay of, say, 60ms would be. Thus people on slower uplinks have little recourse other than to hopefully upgrade to a faster internet connection.