Network FAIL

2008-06-20 11:57 | Source

Colin already blogged a bit about this, but I thought I'd provide a more detailed description of the saga so far.

Wednesday, June 18 13:12:00: Hetzner (who our servers are colocated with) post a network notice saying that "Verizon are having intermittant (sic) problems with their international link betwen (sic) Cape Town and London", and identifying DDoS attacks as the cause of the problems.

Wednesday, June 18 15:30:00: We start getting complaints from some of our users that they cannot access our web application. Tests from my side and several other locations don't reveal any problems accessing our site, but the users report being able to access all other web sites; by the end of the day, I am still unable to trace the problem; connectivity between our servers and all of the clients reporting problems seems to be fine, but they continue to be unable to access the site. I call it a day after business hours end.

Thursday, June 19 08:00:00: The same users are still unable to access our site. I continue poking around, and run across traffic going between our servers and one of the client sites experiencing problems while checking things with tcpdump. Huh? If they can't access the site, why am I seeing web traffic? Then I notice a familiar pattern; TCP packets 1500 bytes in size are being retransmitted continually, before the connection is torn down from the other side — which is what you usually see when PMTU discovery is broken.

Thursday, June 19 09:08:43: I send an e-mail to Hetzner

support, briefly describing what I thought the problem was, and asking

them to look into it urgently.

Thursday, June 19 09:09:00: I receive an e-mail from their support autoresponding giving me a ticket reference number.

Thursday, June 19 09:15:00: Continuing to look into the problem myself, I do some test traceroutes and it seems that when going to local sites (all of the tests I did were to IS destinations, although I didn't catch onto this at the time), at a certain point in Verizon's network, packets larger than 1496 bytes are being silently dropped; no ICMP "Fragmentation Needed" response. This isn't happening for international routes. So, I ask around and get people to run some tests from other sites (which is where Colin came in), and confirm the same thing from the outside, arriving at the conclusion that ICMP filtering is breaking PMTU discovery, although I'm not sure exactly where the filtering is occurring (In hindsight, I'm not sure this conclusion was actually a correct assessment of the problem…)

Thursday, June 19 10:58:00: Still no response from Hetzner support; I have a quiet few minutes, so I decide to call the helpdesk; apparently nobody has picked the ticket up yet, because they're very busy. The guy puts me on hold while he speaks to someone else, then tells me that they're not aware of any network problems currently, and asks me to please send through any information I have about the problem.

Thursday, June 19 11:17:41: I finish putting together an e-mail containing traceroute output etc. and my commentary on the problems we're experiencing.

Thursday, June 19 11:45:13: Hetzner respond, saying that they're experiencing difficulties with the firewall at their JHB datacentre (where our servers are located), and that they're looking into it.

Thursday, June 19 12:00:55: Hetzner report that the firewall issue is resolved. They also mention that there is some kind of Verizon <-> IS peering issue. Some quick testing on my side shows that the problem has not gone away, but I can confirm that Verizon <-> SAIX is working; also, doing some tests against ADSL links, I can receive ICMP "Fragmentation Needed" packets just fine, putting a further dent into my previous hypothesis about ICMP filtering. At this stage, my best guess is that Verizon are doing some kind of overly-aggressive packet filtering in response to the DDoS attacks previously mentioned.

Thursday, June 19 12:14:51: Hetzner confirm that the issue has been escalated with Verizon, but are unable to provide me with an ETA for the resolution of the issue. Looking at the network notice posted about the issue, I notice that they are claiming "high packet loss"; I don't see any packet loss aside from packets larger than 1496 bytes which are still being dropped completely, which seems a bit strange to me.

Friday, June 20 08:00:00: The problem has still not been solved; my interim measure of dropping the MTU on our network interfaces to 1496 seems to be helping with most users, but there are still connectivity issues causing us major hassles. My phone is ringing off the hook (metaphorically speaking, since it's a Nokia E65) with people wanting to know when the problem is going to be resolved, and I still don't have much information to give them.

Friday, June 20 10:00:00: Still poking around, I'm starting to notice general packet loss along with the packets-larger-than-1496-bytes-being-dropped problem.

Friday, June 20 12:48:55: Doing some more testing, I notice that packets larger than 1496 bytes are now making it through the Verizon <-> IS route, although there is still generally high packet loss. Yay? Unfortunately the general packet loss is proving to be as much of a pain, causing large transfers to stall and so on.

Friday, June 20 14:36:00: Packet loss seems to have died down; checking to see if things are working again, fingers are crossed…

Friday, June 20 16:32:00: The nightmare continues. Most sites seem to be working now, but one isn't (an extremely important one); it seems to be a "large packets being dropped" problem again. When sending small amounts of data, everything is fine; when sending large amounts, the connection just hangs. However, my tests sending data from our servers to theirs (ie. Verizon -> IS) show no problems, and the tests I've been able to run from another IS site show no problems going IS -> Verizon there, so it seems to be limited in some fashion. Lowered MTUs to 1400 as a temporary measure, which seems to be working; tracking this down is going to be a nightmare, though. My only hope is that someone is already working on it somewhere…

EDIT: clarify "intermittant" and "betwen"