Running away again

[I originally posted this on Facebook; I've disabled comments here because I don't think I can cope with them over here on my blog, but if you really need to comment you can find me on Facebook or elsewhere]

I've run away several times in my life.

The very first time was as a teenager; I don't even recall exactly what it was that triggered it, I think it was anxiety over a compulsory school camp that I didn't want to attend. I didn't have a plan, I didn't even have any real intention to actually run away for real, I just felt like I was trapped and needed to escape / lash out somehow. "Running away" amounted to declaring my intent, leaving the house, hiding somewhere nearby while everyone frantically searched for me, and eventually returning a few hours later once I had calmed down a bit. Of course, at the time, I wasn't really capable of fully understanding the situation; I didn't understand the destructive mental paths that lead to being unable to cope, I didn't really care how much emotional stress I was causing to others, and I didn't realise that had I been able to communicate better about how I was feeling, I could probably have dealt with things much better. In short: I was wrong to run away this time. Then again, I was just a teenager, so I think I can be forgiven for failing at those things.

Later on in life, running away took on a more sophisticated form. Once I gained sufficient control over my life (both in the legal and the financial sense), I took steps to cut certain people out of it. I have few regrets about the process (which was staggered, not something I did all at once); some relationships and interactions are simply not healthy or beneficial for the parties involved, and not having to deal with the constant stress of those interactions definitely improved my mental health and my life in general by quite a bit. In short: I think I was right to run away this time. But in the end, it was still running away; I couldn't cope with the situation, so I chose not to cope with it.

And now…recent events surrounding the 2016 US elections are slowly producing that rising sense of panic and inability to cope again. The elections themselves are not the trigger; rather, it is the conversations they have sparked which has lead me to realise certain things about relatives, acquaintainces, and friends. As most of you probably know, I'm no stranger to intellectual debate / conflict; this isn't about that. I can quite happily handle interactions with someone I disagree with on politics, economics, or what have you; most of the time, I'm happy to discuss / argue the matter for hours, unpacking our disagreement into differences in assumptions vs. differences in reasoning, and ending with a better understanding of both sides even if my mind remains unchanged.

But this is different: this is a conflict over whether or not we should care about other people. Caring about other people is something I learned as an incontrovertible truth from childhood, not a selective privilege to be extended to only some and not others. I now realise that many around me do not feel the same way; they feel that some people are not deserving of their care, for whatever reason, and…I don't know how to deal with this.

I feel like running away this time would be a mistake, that I need to stay and engage with these people, attempt to change their minds; but I have no idea how to do this, and I fear the poisoning of either my own mind, or my relationships, as a result of this. I fear that to maintain these relationships as-is, with no condemnation, will result in sending the message that I support something I do not. I fear that such condemnation would inevitably lead to the destruction of those relationships anyway, accomplishing nothing good in the end. I fear that by running away, I am losing the opportunity to do something about all of this. I feel guilty because of my fear; there are others who do not even have the option of running away, others who are the direct target of the uncaring and the hatred, rather than mere bystanders who can leave at any other time. How much worse must their fear be than mine?

And so here I am, trapped in a seemingly endless mental loop, afraid to run and afraid not to run.

Progress on txacme

The initial release of txacme was perhaps a quieter affair than I was hoping for, but the response has been enthusiastic from the few people that did take notice. Work has been progressing on the next release (which will be 0.9.1, I'm not at all ready to pronounce the API fully baked yet), which will have a number of minor improvements such as better documentation, but the headline feature is probably dns-01 challenge support.

There is currently a PR up with a libcloud-based implementation that works [1]; I also have plans for a twistd plugin to run a standalone issuing service, but I'm not yet sure if this will make it into 0.9.1.

Implementing a dns-01 challenge responder revealed some deficiencies in the IResponder interface that made it necessary to expand it; this is exactly the sort of API change that I was expecting to crop up, and one of the reasons why I'm holding off on a 1.0.0 release for the time being. My guess is that I'll make it to at least 0.9.5 or thereabouts before 1.0.0 comes out.

Footnotes

[1] Well... mostly. libcloud does synchronous network I/O so it needs to be invoked in a thread pool, and some of the DNS drivers are buggy (eg. the Route 53 driver does not format TXT records how Route 53 expects them), but I have a passing integration test using Rackspace Cloud DNS!

WordPress to Nikola

As you may or may not have noticed, my blog looks substantially different now. This is because I have migrated from WordPress to Nikola. Using the WordPress importer worked out reasonably well, although I needed to fix up the code blocks myself. Going forward I'm just going to use the standard ReST format for writing posts.

I'm also hosting the site on S3 with CloudFront in front, rather than a dinky Linode, so the site should be a lot faster now. However, there's probably some stuff broken that I didn't notice; please yell if you see anything.

Adventures in deployment with Propellor, Docker, and Fabric

Preface

After playing around with Docker a bit, I decided that it would make an ideal deployment platform for my work services (previously we were using some ad-hoc isolation using unix users and not much else). While Docker’s security is…suspect…compared to a complete virtualization solution (see Xen), I’m not so much worried about complete isolation between my services, as things like easy resource limits and imaging. You can build this yourself out of cgroups, chroot, etc. but in the end you’re just reinventing the Docker wheel, so I went with Docker instead.

However, Docker by itself is not a complete solution. You still need some way to configure the Docker host, and you also need to build Docker images, so I added Propellor (which I recently discovered) and Fabric to the mix.

Propellor

Propellor is a configuration management system (in the sense of Puppet, Chef, Salt, et al.) written in Haskell, where your configuration itself is Haskell code. For someone coming from a systems administration background, the flexibility and breadth offered by a real programming language like Haskell may be quite daunting, but as a programmer, I find it far more straightforward to just write code that does what I want, extracting common pieces into functions and so on. Our previous incarnation of things used Puppet for configuration management, but it always felt very awkward to work with; another problem is that Puppet was introduced after a bunch of the infrastructure was in place, meaning a lot of things were not actually managed by Puppet because somebody forgot. Propellor was used to configure a new server from scratch, ensuring that nothing was done ad-hoc, and while I won’t go into too much detail about Propellor, I am liking it a lot so far.

The role of Propellor in the new order is to configure things to provide the expected platform. This includes installing Docker, installing admin user accounts, SSH keys, groups, and so on.

Docker

The Docker workflow I adopted is based on the one described by Glyph. I would strongly recommend you go read his excellent post for the long explanation, but the short version is that instead of building a single container image, you instead build three: A “build” container used to produce the built artifacts from your sources (eg. Python wheels, Java/Clojure JARs), a “run” container which is built by installing the artifacts produced by running the “build” container, and thus does not need to contain your toolchain and -dev packages (keeping the size down), and a “base” container which contains the things shared by the “build” and “run” containers, allowing for even more efficiency of disk usage.

While I can’t show the Docker bits for our proprietary codebases, you can see the bits for one of our free software codebases, including instructions for building and running the images. The relative simplicity of the .docker files is no accident; rather than trying to shoehorn any complex build processes into the Docker image build, all of the heavy lifting is done by standard build and install tools (in the case of Documint: apt/dpkg, pip, and setuptools). Following this principal will save you a lot of pain and tears.

Fabric

The steps outlined for building the Docker images are relatively straightforward, but copy/pasting shell command lines from a README into a terminal is still not a great experience. In addition, our developers are typically working from internet connections where downloading multi-megabyte Docker images / packages / etc. is a somewhat tedious experience, and uploading the resulting images is ten times worse (literally ten times worse; my connection at home is 10M down / 1M up ADSL, for example). Rather than doing this locally, this should instead run on one of our servers which has much better connectivity and a stable / well-defined platform configuration (thanks to Propellor). So now the process would be “copy/paste shell command lines from a README into an ssh session” — no thanks. (For comparison, our current deployment processes use some ad-hoc shell scripts lying around on the relevant servers; a bit better than copy/pasting into an ssh session, but not by much.)

At this point, froztbyte reminded me of Fabric (which I knew about previously, but hadn’t thoughto f in this context). So instead I wrote some fairly simple Fabric tasks to automate the process of building new containers, and also deploying them. For final production use, I will probably be setting up a scheduled task that automatically deploys from our “prod” branch (much like our current workflow does), but for testing purposes, we want a deploy to happen whenever somebody merges something into our testing release branch (eventually I’d like to deploy test environments on demand for separate branches, but this poses some challenges which are outside of the scope of this blog post). I could build some automated deployment system triggered by webhooks from BitBucket (where our private source code is hosted), but since everyone with access to merge things into that branch also has direct SSH access to our servers, Fabric was the easiest solution; no need to add another pile of moving parts to the system.

My Fabric tasks look like this (censored slightly to remove hostnames):

 1 @hosts('my.uat.host')
 2 def build_uat_documint():
 3     with settings(warn_only=True):
 4         if run('test -d /srv/build/documint').failed:
 5             run('git clone --quiet -- https://github.com/fusionapp/documint.git /srv/build/documint')
 6     with cd('/srv/build/documint'):
 7         run('git pull --quiet')
 8         run('docker build --tag=fusionapp/documint-base --file=docker/base.docker .')
 9         run('docker build --tag=fusionapp/documint-build --file=docker/build.docker .')
10         run('docker run --rm --tty --interactive --volume="/srv/build/documint:/application" --volume="/srv/build/documint/wheelhouse:/wheelhouse" fusionapp/documint-build')
11         run('cp /srv/build/clj-neon/src/target/uberjar/clj-neon-*-standalone.jar bin/clj-neon.jar')
12         run('docker build --tag=fusionapp/documint --file=docker/run.docker .')
13 
14 
15 @hosts('my.uat.host')
16 def deploy_uat_documint():
17     with settings(warn_only=True):
18         run('docker stop --time=30 documint')
19         run('docker rm --volumes --force documint')
20     run('docker run --detach --restart=always --name=documint --publish=8750:8750 fusionapp/documint')

Developers can now deploy a new version of Documint (for example) by simply running fab build_uat_documint deploy_uat_documint. Incidentally, the unit tests are run during the container build (from the .docker file), so deploying a busted code version by accident shouldn’t happen.

Axiom benchmark results on PyPy 2.5.0

This is a followup to a post I made about 1.5 years ago, benchmarking Axiom on PyPy 2.1.0. Not too much has changed in Axiom since then (we fixed two nasty bugs that mainly affected PyPy, but I don’t expect those changes to have had much impact on performance), but PyPy (now at 2.5.0) has had plenty of work done on it since then, so let’s see what that means for Axiom performance!

Unlike my previous post, I’m basically just going to show the results here without much commentary:

Graph of Axiom performance

A few notes:

  • I didn’t redo the old benchmark results, but the hardware/software I ran the benchmarks on is not significantly different, so I think the results are still valid as far as broad comparisons go (as you can see, the CPython results match fairly closely).
  • The benchmark harness I’m using now is improved over the last time, using some statistical techniques to determine how long to run the benchmark, rather than relying on some hardcoded values to achieve JIT warmup and performance stability. Still could use some work (eg. outputting a kernel density estimate / error bars, rather than just a single mean time value).
  • There is one new benchmark relative to the last time, powerup-loading; PyPy really shines here, cutting out a ton of overhead. There’s still room for a few more benchmarks of critical functions such as actually running and loading query results (as opposed to just constructing query objects).
  • The branch I used to run these benchmarks is available on Github.
  • The horizontal axis is cut off at 1.0 so you can’t actually see how store-opening lines up, but the raw data shows that PyPy 2.1.0 took about 53% longer on this benchmark, whil PyPy 2.5.0 only takes about 2% longer.

BitBucket migration

At work, we are currently using Launchpad for project hosting of our proprietary codebase. Launchpad charges $250/year/project for hosting of proprietary projects which is a little steep, and Launchpad/bzr has been falling behind the alternatives in terms of tooling / development / support, so when our Launchpad subscription came up for renewal at the beginning of the month, this caused our somewhat vague plans to switch to something else to crystalize.

I initially assumed that Github/git would be the obvious way to go, but after looking into BitBucket/hg, I was pleasantly surprised to discover that things like hosted CI were available there too. Nobody on our team is much of a git enthusiast to begin with, so using hg seemed like a far more attractive option. This meant figuring out how to do two things: 1) migrate all of our existing bugs, and 2) migrate our existing bzr branches to hg. The former proved to be relatively straightforward: Jonathan wrote a script that used launchpadlib to access the Launchpad API, retrieve the bug data + metadata, and write it out in the BitBucket import format (more on this in another post, or on Jonathan's blog, depending on if I can convince him to write it up or not).

The bzr to hg conversion turned out to be a little more complex. A simple "hg convert" of our trunk branch worked surprisingly well; the trunk history converted correctly (at least as far as I could tell), but more (pleasantly) surprisingly, the branches which were merged into trunk were also reconstructed by the conversion, along with the merges. The conversion relies on the bzr branch nick; this works somewhat like hg branches (it's associated with a commit at the time that you commit), but as bzr does not place as much importance on this label as hg, it is more likely to be wrong by accident (by default the branch nick is just taken from the last component of the path to the branch you are committing in, I believe, and in our case I suspect nobody has ever set the branch nick manually). Among other things, this resulted in 4 different branch names for "trunk", as well as some other oddities like misspelled feature branch names.

(As an aside, I'd like to mention that hg log has far more utility than bzr log, mostly due to the "revsets" feature. Almost all of the inspection I did while debugging the conversion was done on the converted repo using hg, not on the original bzr repo, simply because it was far easier to get the information that way.)

A "branchmap" file solved the problem with differing branches; mapping the different names for "trunk" to "default" made the revision history graph look a lot more reasonable than when I originally did the conversion. I also switched to using --datesort for the conversion at this point; the documentation warns that this may produce a much larger repository than --branchsort (the default), but in my case, the size difference was trivial. I suspect this may only apply in scenarios with back-and-forth merges between long-lived branches, rather than the short-lived topic branches that form the majority of our workflow. I also created an "authormap" file at this point to reconcile differing author identities over the history of our repository. The bzr author identity is a full name/email (eg. "Tristan Seligmann <mithrandi@mithrandi.net>", but again, there were various historical oddities here; BitBucket also has the ability to map author identities to BitBucket users, but I decided normalizing during the conversion was a good idea anyway.

The biggest problem I had to deal with (although this was actually one of the first problems I noticed) was that all of these merged branches were still open. Mercurial has the concept of "open" and "closed" branches, with closed branches being hidden by default in most places. A "closed" branch is simply one whose head revision is marked as closing the branch; which of course, none of my branches had, due to being converted from bzr which does not have an equivalent concept. Committing a closing revision to each branch was simple enough to script, but that only lead to more difficulties: 1) a gigantic pile of noise revisions in the history, and 2) a bunch of dangling heads as the new "close" revision was not part of the merge to trunk. Scripting a merge of all of the dangling heads would have produced even more noise, so I looked for a different solution.

Eventually I ran across a patch on the Mercurial mailing list; unfortunately the thread in which it was posted never went anywhere, but the patch still worked. What this patch allowed me to do was after the initial conversion, run another hg-to-hg conversion in which I marked the last branch revision before the merge to trunk as closing the branch. The full conversion process now looked something like this:

hg convert --datesort --branchmap branchmap --authormap authormap Fusion Fusion-hg-unspliced
cd Fusion-hg-unspliced
hg log --template "{node} close\n" -r "head() and not branch(default)" > ../splicemap
cd ..
PYTHONPATH=$HOME/hg-patched python $HOME/hg-patched/hg convert --splicemap splicemap Fusion-hg-unspliced Fusion-hg

This was good enough for a trunk conversion, but what about open branches that aren't yet merged into trunk? We could have held off until we were able to merge all of these branches, but that seemed like a lot of work (although we did merge as many outstanding branches as possible). Fortunately hg convert can operate in an incremental way; during the conversion, the mapping from source revs to destination revs is stored in dest/.hg/shamap; the only wrinkle was my two stage-conversion process. What I needed was a way to map the original bzr revisions to the hg revisions in the second repository. In order to accomplish this, I wrote a small Python script to merge the two mappings:

With the help of this script, I could now convert other branches:

# Only take the bzr revisions
grep '@' Fusion-hg-unspliced/.hg/shamap > shamap
python mergemaps.py shamap Fusion-hg/.hg/shamap > shamap-spliced
mv shamap-spliced Fusion-hg/.hg/shamap

# Now let's convert a branch
hg convert --branchmap branchmap --authormap authormap Fusion-some-branch Fusion-hg

In summary, the process, while hardly trivial, worked out a lot better than I had initially expected.

EDIT: I forgot to mention in the original draft: We first started thinking about moving away from Launchpad at the beginning of September, and completed the migration in the last week, so the entire process took us less than a month of part-time discussion / work.

Axiom benchmark results on PyPy

EDIT: Updated version now available.

EDIT: Fixed the issue with the store-opening benchmark

Axiom conveniently includes a few microbenchmarks; I thought I’d use them to give an idea of the speed increase made possible by running Axiom on PyPy. In order to do this, however, I’m going to have to modify the benchmarks a little. To understand why this is necessary, one has to understand how PyPy achieves the speed it does: namely, through the use of JIT (Just-In-Time) compilation techniques. In short, these techniques mean that PyPy is compiling code during the execution of a program; it does this “just in time” to run the code (or actually, if I understand correctly, in some cases only after the code has been run). This means that when a PyPy program has just started up, there is a lot of performance overhead in the form of the time taken up by JIT compilation running, as well as time taken up by code being interpreted slowly because it has not yet been compiled. While this performance hit is quite significant for command-line tools and other short-lived programs, many applications making use of Axiom are long-lived server processes; for these, any startup overhead is mostly unimportant, the performance that interests us is the performance achieved once the startup cost has already been paid. The Axiom microbenchmarks mostly take the form of performing a certain operation N times, recording the time taken, then dividing that time by N to get an average time per single operation. I have made two modifications to the microbenchmarks in order to demonstrate the performance on PyPy; first, I have increased the value of “N”; second, I have modified the benchmarks to run the entire benchmark twice, throwing away the results from the first run and only reporting the second run. This serves to exclude startup/”warmup” costs from the benchmark.

All of the results below are from my desktop machine running Debian unstable on amd64, CPython 2.7.5, and PyPy 2.1.0 on a Core i7-2600K running at 3.40GHz. I tried to keep the system mostly quiet during benchmarking, but I did have a web browser and other typical desktop applications running at the same time. Here’s a graph of the results; see the rest of the post for the details, especially regarding the store-opening benchmark (which is actually slower on PyPy).

[graph removed, see the new post instead]

To get an example of how much of a difference this makes, let’s take a look at the first benchmark I’m going to run, item-creation 15. This benchmark constructs an Item type with 15 integer attributes, then runs 10 transactions where each transaction creates 1000 items of that type. In its initial form, the results look like this:

1 mithrandi@lorien&gt; python item-creation 15
2 0.000164939785004
3 mithrandi@lorien&gt; pypy item-creation 15
4 0.000301389718056

That’s about 165µs per item creation on CPython, and 301µs on PyPy, nearly 83% slower; not exactly what we were hoping for. If I increase the length of the outer loop (number of transactions) from 10 to 1000, and introduce the double benchmark run, the results look a lot more encouraging:

1 mithrandi@lorien&gt; python item-creation 15
2 0.000159110188484
3 mithrandi@lorien&gt; pypy item-creation 15
4 8.7410929203e-05

That’s about 159µs per item creation on CPython, and only 87µs on PyPy; that’s a 45% speed increase. The PyPy speed-up is welcome, but it’s also interesting to note that CPython benefits slightly from the changes to the benchmark. I don’t have any immediate explanation for why this might be, but the difference is only about 3%, so it doesn’t matter too much.

The second benchmark is inmemory-setting. This benchmark constructs 10,000 items with 5 inmemory attributes (actually, the number of attributes is hardcoded, due to a limitation in the benchmark code), and then times how long it takes to set all 5 attributes to new values on each of the 10,000 items. I decreased the number of items to 1000, wrapped a loop around the attribute setting to repeat it 1000 times, and introduced the double benchmark run:

1 mithrandi@lorien&gt; python inmemory-setting
2 4.86490821838e-07
3 mithrandi@lorien&gt; pypy inmemory-setting
4 1.28742599487e-07

That’s 486ns to set an attribute on CPython, and 129ns on PyPy, for a 74% speed increase. Note that this benchmark is extremely sensitive to small fluctuations since the operation being measured is such a fast one, so the results can vary a fair amount between benchmarks run. For interest’s sake, I repeated the benchmark except with a normal Python class substituted for Item, in order to compare the overhead of setting an inmemory attribute as compared with normal Python attribute access. The result was 61ns to set an attribute on CPython (making an inmemory attribute about 700% slower), and 2ns on PyPy (inmemory is 5700% slower). The speed difference on PyPy is more to do with how fast setting a normal attribute is on PyPy, than to do with Axiom being slow.

The third benchmark is integer-setting. This benchmark is similar to inmemory-setting except that it uses integer attributes instead of inmemory attributes. I performed the same modifications, except with an outer loop of 100 iterations:

1 mithrandi@lorien&gt; python integer-setting
2 1.23480038643e-05
3 mithrandi@lorien&gt; pypy integer-setting
4 3.80326986313e-06

That’s 12.3µs to set an attribute on CPython, and 3.8µs on PyPy, a 69% speed increase.

The fourth benchmark is item-loading 15. This benchmark creates 10,000 items with 15 integer attributes each, then times how long it takes to load an item from the database. On CPython, the items are deallocated and removed from the item cache immediately thanks to refcounting, but on PyPy a gc.collect() after creating the items is necessary to force them to be garbage collected. In addition, I increased the number of items to 100,000 and introduced the double benchmark run:

1 mithrandi@lorien&gt; python item-loading 15
2 9.09668397903e-05
3 mithrandi@lorien&gt; pypy item-loading 15
4 5.70205903053e-05

That’s 90µs to load an item on CPython, and 57µs on PyPy, for a modest 37% speed increase.

The fifth benchmark is multiquery-creation 5 15. This benchmark constructs (but does not run) an Axiom query involving 5 different types, each with 15 attributes (such a query requires Axiom to construct SQL that mentions each item table, and each column in those tables) 10,000 times. I increased the number of queries constructed to 100,000 and introduced the double benchmark run:

1 mithrandi@lorien&gt; python multiquery-creation 5 15
2 5.5426299572e-05
3 mithrandi@lorien&gt; pypy multiquery-creation 5 15
4 7.98981904984e-06

55µs to construct a query on CPython; 8µs on PyPy; 86% speed increase.

The sixth benchmark is query-creation 15. This benchmark is the same as multiquery-creation, except for queries involving only a single item type. I increased the number of queries constructed to 1,000,000 and introduced the double benchmark run:

1 mithrandi@lorien&gt; python query-creation 15
2 1.548528409e-05
3 mithrandi@lorien&gt; pypy query-creation 15
4 1.56546807289e-06

15.5µs to construct a query on CPython; 1.6µs on PyPy; 90% speed increase.

The final benchmark is store-opening 20 15. This benchmark simply times how long it takes to open a store containing 20 different item types, each with 15 attributes (opening a store requires Axiom to load the schema from the database, among other things). I increased the number of iterations from 100 to 10,000; due to a bug in Axiom, the benchmark will run out of file descriptors partway, so I had to work around this. I also introduced the double benchmark run:

1 mithrandi@lorien&gt; python store-opening 20 15
2 0.00140788140297
3 mithrandi@lorien&gt; pypy store-opening 20 15
4 0.00202187280655

1.41ms to open a store on CPython; 2.02ms on PyPy; 44% slowdown. I’m not sure what the cause of the slowdown is.

A bzr branch containing all of my modifications is available at lp:~mithrandi/divmod.org/pypy-benchmarking.

Divmod / PyPy status update

Just a quick status update:

  • Epsilon test suite passes on PyPy.
  • Nevow test suite passes on PyPy.
  • Axiom (trunk) test suite has two failures on PyPy, fixed by this branch (which just deletes the tests); I don’t expect this to affect any application code.
  • Mantissa test suite mostly fails. This is due to modules that indirectly import xmantissa.terminal which imports PyCrypto directly or indirectly (via twisted.conch) — PyCrypto does not build on PyPy.
  • I haven’t looked at Combinator; the test suite has a whole bunch of test failures on CPython due to a change in Subversion, and there’s no real reason to run Combinator with PyPy (I would expect it to be slower than on CPython).
  • I haven’t looked at Quotient or Imaginary yet, as they depend on Mantissa.
  • I haven’t looked at Hyperbola, Prime, Reverend, or Sine — I doubt anyone cares about these (I don’t even know what Prime does, and it doesn’t have a test suite).

The next thing I’m going to work on is making the Mantissa dependency on PyCrypto optional; while having an SSH server is nice, there is plenty of functionality in Mantissa that does not depend on or interact with the SSH server in any way, so it’ll still be useful for most applications. With any luck, once this is fixed, the entire test suite will pass; it’s hard to predict given how little of the test suite is currently runnable.

A Brief History of Bufferbloat

Introduction

The “bufferbloat” issue has now been explained and documented in many places by many people (most recently, and famously, by Jim Gettys), but I’m going to present my own explanation by way of introduction. I’m going to consider the case of a home network with a single router connecting the LAN to the internet (most likely via an ADSL or cable internet connection); this is not the only place where the issue arises, but it is the situation that most people are familiar with.

Why buffer?

To understand the problem with buffering, we first have to understand why buffering is being done in the first place. Generally speaking, buffering at least one packet is necessary to successfully forward traffic from the LAN interface to the WAN interface; if you can’t buffer at least one packet, you can’t receive and route any packets because you don’t know where to send them until you’ve received and processed them. However, most routers will buffer far more than just one packet; and the reason for this is throughput. Incoming traffic does not always arrive at a steady rate, so by keeping a reasonably-sized buffer of incoming traffic, the router can provide a steady stream of outgoing traffic keeping the outgoing link at  maximum utilization despite fluctuations on the input side. To some extent, the more you buffer, the better throughput you can achieve, and as historically the focus has been on maximum throughput on an internet connection, buffers have been sized very generously for some time now, to the point where they are frequently far larger than they have to be in order to achieve maximum throughput. This brings us to the next question:

Why not buffer?

The problem with large buffers is that while they may improve throughput, they also increase latency. Thus, while “bulk” flows (file uploads and downloads) experienced improved performance, “interactive” flows such as gaming, VoIP traffic, and so on suffers. To understand why this is, let us consider some example figures. The default settings for an ethernet interface on Linux are to use the “pfifo_fast” queueing discipline (which is basically just a first-in-first-out queue, as the name suggests), with a qlen of 1000 (so the queue can grow up to 1000 packets long). Standard Ethernet MTU is 1500 bytes, which means that if the queue fills with traffic from a file upload, we will have 1,500,000 bytes (~1.43 MB) of data in the queue. Ordinarily the LAN interface will be running at 100M or 1G while the outgoing ADSL / cable connection will be much slower (let’s use a 1Mbps uplink in this illustration), causing the queue to fill up under normal circumstances before packets start being dropped.

Now, let’s say that while this file upload is occurring, you are also trying to play a real-time strategy game. When you issue a command to some of your units, the game sends out a small command packet encoding the command that you just issued. This packet arrives at the router and joins the queue behind all of the traffic currently in the queue. Only once it reaches the head of the queue will it actually be transmitted across your internet connection, so there will be a delay before it is even sent out onto the internet; this delay will be cumulative with the normal latency between you and the server. How long will this delay be? Transferring 1,500,000 bytes at 1Mbps will take 12 seconds! Obviously this is a ridiculous amount of added latency, resulting in a completely unplayable game as many gamers can attest to.

This increase in latency can even affect throughput if you are trying to download and upload at the same time; the ACK traffic for the upload will get caught in the bufferbloat caused by the download, and the ACK traffic for the download will get caught in the bufferbloat caused by the upload, causing everything to slow down. (Attempting to use a bittorrent client without setting ratelimits often leads one to encounter this problem, although many modern torrent clients have “smart ratelimiting” to try to work around this problem)

Now what?

So, how do we solve this problem? We can reduce the size of the queue in the router, as often it is massively oversized, but beyond a certain point, making the queue smaller will start to hurt throughput (resulting in slower downloads and uploads), forcing us to make a trade-off between latency and throughput. In addition, manually adjusting the size of the queue is a very difficult task, especially in the face of changing network conditions; the bandwidth available to your ADSL connection can vary greatly depending on congestion at your ISP, “turbo boost” ISP products that allow you to temporarily burst above your normal bandwidth limit, and so on. Can we do better than this?

In fact, we can do better. The answer lies in more advanced queue management: we want to queue as much as necessary to maintain throughput, but no more. The latest and greatest in this field is the CoDel (“controlled delay”) queue management algorithm, designed by Kathleen Nichols and Van Jacobson, which aims to achieve reasonable behaviour with very little tuning; in other words, it can be deployed on ADSL/cable routers in a standard configuration with no end-user tuning required. In brief, CoDel looks at a sliding window of time (100ms by default), and determines the minimum delay experienced by packets traversing the queue; if this increases above a certain target value (5ms by default), CoDel enters a dropping mode. (Side note: “dropping” can mean actual packet dropping, or simply ECN marking; end-users will normally want dropping, but users in more controlled environments, like datacenters, may get more reliable behaviour with packet marking).

CoDel (implemented by the “codel” qdisc in the Linux kernel) thus allows us to manage the size of the queue, but we still have the problem of multiple flows interfering with each other; if I start a file upload, that will still interfere with my IRC connection or VoIP call. What we need is called “fair queuing”; we want to share the internet connection “fairly” between all of the flows going over the connection, rather than allowing a few of them to hog the connection at the expense of others. The “fq_codel” qdisc gives us a way to do this (although there are other ways to accomplish the same thing); essentially, it classifies the incoming packets into separate flows and maintains a separate queue for each flow, managed by CoDel. (Actually, it uses a hashing scheme, so the more flows you have, the more likely it is that some of them will share the same queue, but this is necessary to avoid out-of-control resource usage in the presence of many flows.) Traffic is drawn from each separate queue “fairly”, so essentially this allows your interactive traffic (games, IRC, VoIP, etc.) to “skip the queue” as their individual queues will be small, instead of being stuck in a queue behind bulk flows which can build up a longer queue.

Caveats

Unfortunately there are still some problems facing early adopters wanting to take advantage of these improvements in queue management algorithms.

The first problem is that queues will only build up at the hop where the path bottleneck is; if this occurs at one of your ISP’s routers rather than your own router, then any queue management you do on your own router will have little effect. In order to ensure that the bottleneck is your router, you will need to ratelimit traffic through your router to below the speeds available to you through your ISP, thus reducing available bandwidth by a small amount, and also making it impossible to take advantage of any kind of “bursting” capacity your ISP might make available to you. In addition to this, if the available bandwidth drops below your ratelimiting due to network conditions (which may vary based on time of day, phase of moon, etc.), your queue management will once again become ineffective. The real solution here is for your ISP to implement CoDel (or some other active queue management that accomplishes the same thing), but for most people, that is just a pipe dream that is unlikely to be realised in the near future.

The second problem is that there are actually many different buffers lurking in many different places such as your ethernet switch, ADSL modem, Linux ethernet driver, etc. Some work has been done recently (see BQL) to deal with the sources of additional bufferbloat in the Linux kernel, but many drivers have not yet been updated to support BQL, so the problem remains. If your queue management algorithm thinks the packet has actually been transmitted, but it’s just stuck in another buffer farther down the stack, then it is unlikely to perform as expected. Send / receive offloads can produce similar problems as suddenly you can be waiting for a “packet” (which will actually be split into many packets by the ethernet hardware) to be transmitted that is far larger than the normal MTU, producing a much longer delay than expected; thus turning these off is essential (and unlikely to have any downside at typical home internet speeds, or even 1Gbps ethernet speeds, on modern hardware).

The third problem is a little more fundamental. At any given time, even if a packet has been placed at the front of all of your queues, you may already be in the process of transmitting a packet; thus the minimum delay that can be achieved is constrained by the time taken to transmit a single packet across your internet connection. Assuming a standard ethernet MTU of 1500 bytes (this size will actually be slightly higher in practice, due to ethernet frame / ADSL / ATM / etc. overheads), on a 100Mbps uplink, this will be a delay of 0.12ms; this is unlikely to be of concern for many people. However, on slower uplinks, this starts to become more of a problem: at 10Mbps the delay increases to 1.2ms; at 1Mbps the delay is 12ms (requiring the CoDel target to be increased, as reducing the delay below the default target of 5ms is now impossible); and at 512kbps the delay is 24ms. This figure represents not only an increase in the maximum delay experienced, but also the variance between minimum and maximum delay. If you are playing a game on a server that is 25ms away, having your latency fluctuate between 25ms and 49ms (nearly doubling) in an unpredictable fashion is far harder to deal with than a stable, predictable delay of, say, 60ms would be. Thus people on slower uplinks have little recourse other than to hopefully upgrade to a faster internet connection.

(Fair) queuing, buffering, and "bufferbloat"

This is basically just a teaser, but I’ve been playing around with things on my home internet connection (PPPoE over ADSL) and have achieved some fairly good results (within the limitations of my internet connection). I’ve learned quite a lot about various bits and pieces along the way, so I decided I should blog it all for my own future reference as well as others. I’ll probably be duplicating some existing material along the way, but half of the challenge was just finding all the bits and pieces scattered around and putting them all together, so hopefully I’ll save somebody else some time in the future. I’ll also be uploading my actual test scripts, as well as the test results, for reference.