Ferry Boender thinks you shouldn't be using Google App Engine or Amazon S3, because of vendor lock-in. Here's why it doesn't matter: these services aren't about the API, for the most part, they're about the infrastructure. It's trivial to implement S3's API, probably not more than a few hours of work; heck, Apache with PUT support and a custom directory index is already 80% of the way there. Similarly, I understand that Google App Engine includes a test harness you can run your application on. However, for a real application, you need a scalable, robust, and highly-available implementation, and that's where the difficulty lies. The reason why you're "locked in" to S3 is because nobody else is offering such a service yet, not because of their API; porting your code to any comparable service would be trivial (if it's not trivial, then I doubt you can claim that it's a comparable service).
He also raises the issue of debugging: if someone else's service breaks, you have to wait for them to fix it instead of fixing it yourself, and that's a fair point. Much the same applies to proprietary software components; if you don't have the source code, you probably need to wait for someone else to fix it, instead of being able to debug it yourself (although you do have some recourse in the form of system debuggers and disassemblers). However, the converse situation also exists: if you implement and run the service yourself, then when it breaks, you *have* to fix it. In other words, this isn't so much about debugging as about who can do a better job. Think you can do a better job than Amazon? Really? Then go ahead and reimplement S3 yourself; maybe you should even think about going into competition with them. For most people, however, this is not the case; I probably have the technical skills to implement such a service, but I certainly don't have the resources and infrastructure, and I want to get on with implementing my own application, so I'm happy to pay Amazon to do the work for me instead. By all accounts, they're doing a pretty good job overall; and of course, bear in mind that Amazon are running their own applications on that infrastructure too.
Finally, if you want better support than you get for free on the AWS forums etc., you can always pay for better support.
I attended my first GeekDinner (the second one in JHB) this weekend. It was more of a "GeekLunch", starting at 14:00 in the afternoon, although some people hung around until late that night. The venue was Yusuf's house, with his wife Shehnaaz providing the awesome food. I arrived just after 14:00, one of the first to arrive, and the others trickled in over the course of the next hour or so. Yusuf was up first, speaking about the pervasive problems with CSS, and how in many ways it represents a step sideways or backwards relative to what came before it. I went next, slidelessly rambling on a bit about the history of scale in computing, before moving on to a more detailed description of the scaling architecture we've chosen to employ for our application. We broke for lunch then, and then resumed with Dominic talking about CSRF/XSS attacks, browser hijacking, and ways to protect yourself against the aforementioned. That brought the presentations to an end (they were each about 30 minutes long), but everyone stayed for a while, hung out / chatted, and generally had a good time.
The atmosphere was very relaxed and friendly, and with such a small group it was easy to take questions from the audience, so none of the presentations turned into a complete monologue. This was my first time presenting tech-related material like this, and I hadn't really prepared (I only volunteered on the mailing list a few days before the event); I don't think I did too badly, but some pretty slides and more preparation would have been nice. Answering questions from the audience was awesome, though; it's always good to know you haven't completely lost everyone.
All in all, it looks like we're definitely on-track for a thriving GeekDinner JHB event; I think the key will be to let things evolve naturally, rather than trying to shoehorn things forcibly into a canned recipe provided by someone else.
Wednesday, June 18 13:12:00: Hetzner (who our servers are colocated with) post a network notice saying that "Verizon are having intermittant (sic) problems with their international link betwen (sic) Cape Town and London", and identifying DDoS attacks as the cause of the problems.
Wednesday, June 18 15:30:00: We start getting complaints from some of our users that they cannot access our web application. Tests from my side and several other locations don't reveal any problems accessing our site, but the users report being able to access all other web sites; by the end of the day, I am still unable to trace the problem; connectivity between our servers and all of the clients reporting problems seems to be fine, but they continue to be unable to access the site. I call it a day after business hours end.
Thursday, June 19 08:00:00: The same users are still unable to access our site. I continue poking around, and run across traffic going between our servers and one of the client sites experiencing problems while checking things with tcpdump. Huh? If they can't access the site, why am I seeing web traffic? Then I notice a familiar pattern; TCP packets 1500 bytes in size are being retransmitted continually, before the connection is torn down from the other side — which is what you usually see when PMTU discovery is broken.
Thursday, June 19 09:08:43: I send an e-mail to Hetzner support, briefly describing what I thought the problem was, and asking them to look into it urgently.
Thursday, June 19 09:09:00: I receive an automated reply from their support system, giving me a ticket reference number.
Thursday, June 19 09:15:00: Continuing to look into the problem myself, I do some test traceroutes and it seems that when going to local sites (all of the tests I did were to IS destinations, although I didn't catch onto this at the time), at a certain point in Verizon's network, packets larger than 1496 bytes are being silently dropped; no ICMP "Fragmentation Needed" response. This isn't happening for international routes. So, I ask around and get people to run some tests from other sites (which is where Colin came in), and confirm the same thing from the outside, arriving at the conclusion that ICMP filtering is breaking PMTU discovery, although I'm not sure exactly where the filtering is occurring (In hindsight, I'm not sure this conclusion was actually a correct assessment of the problem…)
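The hunt for the largest packet size that survives a given path can be sketched as a binary search. This is only an illustration in Python 3, not the commands used during the incident: `probe` is a hypothetical callback that would send one packet of the given size with the don't-fragment bit set and report whether a reply came back, and `simulated_probe` merely imitates the 1496-byte cutoff described above.

```python
def largest_passing_size(probe, low=576, high=1500):
    """Binary-search the largest packet size for which probe(size) is True.

    Assumes probe is monotonic: every size up to some limit passes and
    every size above it is dropped. Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if probe(mid):
            low = mid        # mid got through, so the limit is at least mid
        else:
            high = mid - 1   # mid was dropped, so the limit is below mid
    return low


def simulated_probe(size, limit=1496):
    """Stand-in for a real don't-fragment ping: passes sizes up to `limit`."""
    return size <= limit


print(largest_passing_size(simulated_probe))  # 1496
```

With a real probe, each call costs a network round trip plus a timeout on failure, so a binary search over the 576 to 1500 range needs about ten probes rather than hundreds.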
Thursday, June 19 10:58:00: Still no response from Hetzner support; I have a quiet few minutes, so I decide to call the helpdesk; apparently nobody has picked the ticket up yet, because they're very busy. The guy puts me on hold while he speaks to someone else, then tells me that they're not aware of any network problems currently, and asks me to please send through any information I have about the problem.
Thursday, June 19 11:17:41: I finish putting together an e-mail containing traceroute output etc. and my commentary on the problems we're experiencing.
Thursday, June 19 11:45:13: Hetzner respond, saying that they're experiencing difficulties with the firewall at their JHB datacentre (where our servers are located), and that they're looking into it.
Thursday, June 19 12:00:55: Hetzner report that the firewall issue is resolved. They also mention that there is some kind of Verizon <-> IS peering issue. Some quick testing on my side shows that the problem has not gone away, but I can confirm that Verizon <-> SAIX is working; also, doing some tests against ADSL links, I can receive ICMP "Fragmentation Needed" packets just fine, putting a further dent into my previous hypothesis about ICMP filtering. At this stage, my best guess is that Verizon are doing some kind of overly-aggressive packet filtering in response to the DDoS attacks previously mentioned.
Thursday, June 19 12:14:51: Hetzner confirm that the issue has been escalated with Verizon, but are unable to provide me with an ETA for the resolution of the issue. Looking at the network notice posted about the issue, I notice that they are claiming "high packet loss"; I don't see any packet loss aside from packets larger than 1496 bytes which are still being dropped completely, which seems a bit strange to me.
Friday, June 20 08:00:00: The problem has still not been solved; my interim measure of dropping the MTU on our network interfaces to 1496 seems to be helping with most users, but there are still connectivity issues causing us major hassles. My phone is ringing off the hook (metaphorically speaking, since it's a Nokia E65) with people wanting to know when the problem is going to be resolved, and I still don't have much information to give them.
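Lowering the interface MTU plausibly helps because the maximum segment size (MSS) a host advertises when opening a TCP connection is typically derived from its interface MTU, so the other side sends smaller segments that fit under the 1496-byte cutoff. A small Python 3 sketch of the arithmetic, assuming plain IPv4 with no IP or TCP options (real connections often carry options, which shrink the usable payload further):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def tcp_mss(mtu):
    """Largest TCP payload that fits in one unfragmented packet of size mtu."""
    return mtu - IPV4_HEADER - TCP_HEADER


def ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for mtu in (1500, 1496):
    print(mtu, tcp_mss(mtu), ping_payload(mtu))
# 1500 1460 1472
# 1496 1456 1468
```

The ping payload figure is the size to pass to a ping tool when testing whether a full-size packet for a given MTU gets through.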
Friday, June 20 10:00:00: Still poking around, I'm starting to notice general packet loss along with the packets-larger-than-1496-bytes-being-dropped problem.
Friday, June 20 12:48:55: Doing some more testing, I notice that packets larger than 1496 bytes are now making it through the Verizon <-> IS route, although there is still generally high packet loss. Yay? Unfortunately the general packet loss is proving to be as much of a pain, causing large transfers to stall and so on.
Friday, June 20 14:36:00: Packet loss seems to have died down; checking to see if things are working again, fingers are crossed…
Friday, June 20 16:32:00: The nightmare continues. Most sites seem to be working now, but one isn't (an extremely important one); it seems to be a "large packets being dropped" problem again. When sending small amounts of data, everything is fine; when sending large amounts, the connection just hangs. However, my tests sending data from our servers to theirs (ie. Verizon -> IS) show no problems, and the tests I've been able to run from another IS site show no problems going IS -> Verizon there, so it seems to be limited in some fashion. Lowered MTUs to 1400 as a temporary measure, which seems to be working; tracking this down is going to be a nightmare, though. My only hope is that someone is already working on it somewhere…
EDIT: clarify "intermittant" and "betwen"
I just purchased a Spyder 3 Elite. The main reason I wanted one of these was to calibrate my Samsung 2232GW LCD monitor, as the one I have has an AU Optronics (aka "A") panel with a decidedly blue colour cast. Having successfully calibrated the display, I am quite pleased with the results, so I'm posting my colour profile here for others who aren't quite willing to splurge on a Spyder 3 or similar calibration device. If you have an AUO panel, my profile should work reasonably well for you: Samsung SyncMaster 2232GW A Internet.icm
Select the "Internet" preset on the monitor (down-arrow OSD menu) before loading this colour profile, as I calibrated from that preset; the default preset is just too bright for me. To find out which panel you have, you need to access the service menu: first lower contrast and brightness to 0, go into the main OSD menu, then hold the "Source" button for about five seconds. Now look at the last three letters in the "Version" field; if the middle letter is "L" (eg. my panel is "CLA"), then it is an AU Optronics panel. For reference, the other panel manufacturers are "A" for Samsung (which should be almost perfect out of the box), "D" for CMO / Chimei (which apparently has an even worse blue cast that can't be calibrated away), and "I" for CPT (which I have no information about).
…to Clarke's Third Law: Any sufficiently primitive technology is indistinguishable from BS.
Is it ok if I put some South African objects in?
Every now and then, I have to help someone understand some aspect of text encoding, Unicode, character sets, etc. and I’ve never come across a handy reference to which I could point people, so I figured I’d better write one myself.
The first thing to realise is that basically all data storage is about encoding. You have some kind of underlying layer (stone tablets, papyrus, a hard drive, whatever) and you want to manipulate it in a way that lets you (or someone else) examine those manipulations and reconstruct the data; the manipulation phase is called “encoding”, and the examination phase is called “decoding”. Of course, there are many different ways to stuff some information onto the papyrus (or whatever your medium is); for example, if I want to encode the number 2644 to store it on a piece of paper, I can use Arabic numerals in decimal (2644), Arabic numerals in hexadecimal (0xA54), Roman numerals (MMDCXLIV), and so on. The same applies to all sorts of other encodings of other kinds of data; for example, if I want to store a picture in a file, I have to choose between image encodings such as PNG and GIF.
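To make the "one number, many encodings" idea concrete, here is a small Python 3 sketch that writes 2644 in the three notations mentioned above and reads each one back. The helper names `to_roman` and `from_roman` are made up for this illustration, and the Roman numeral handling only covers well-formed numerals.

```python
# Numeral table used by both helpers, largest value first.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(number):
    """Encode a positive integer as a Roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def from_roman(text):
    """Decode a well-formed Roman numeral back into an integer."""
    values = {symbol: value for value, symbol in ROMAN if len(symbol) == 1}
    total = 0
    for index, symbol in enumerate(text):
        value = values[symbol]
        following = values[text[index + 1]] if index + 1 < len(text) else 0
        # A smaller symbol before a larger one is subtracted (IV, XL, CM).
        total += -value if following > value else value
    return total


number = 2644
encoded = {"decimal": str(number), "hexadecimal": hex(number), "roman": to_roman(number)}
print(encoded)  # {'decimal': '2644', 'hexadecimal': '0xa54', 'roman': 'MMDCXLIV'}

# Three different encodings, one decoded value.
decoded = [int(encoded["decimal"]), int(encoded["hexadecimal"], 16), from_roman(encoded["roman"])]
print(decoded)  # [2644, 2644, 2644]
```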
All of these involve a common idea of some “abstract idea” (such as a number, or a picture), and a concrete encoding that is used to store that idea, and communicate it with others. Of course, you cannot actually manipulate abstract ideas on a computer, so when you decode some data, in reality you are always encoding it into another encoding at the same time; otherwise you couldn’t do anything with it. This may make the process seem a bit pointless, but we tend to build all sorts of useful abstractions in computers, and decoding data often allows you to move to a higher level of abstraction. For example, if you decode an image stored in PNG or GIF format, the result is a whole bunch of image pixel values, which you must still store in memory somehow; but you can use the same format regardless of whether those values came from a PNG file, a GIF file, or even a JPEG file.
However, this post is about text, not other kinds of data, so let’s fast forward to the good part. Computer memory is, on a basic level, a physical encoding of numbers. The smallest addressable slice of memory is typically 8 bits, or a byte. (Some obscure architectures work differently, but I’ll exclude those from my discussion here, in the interests of sanity). As a collection of bits, the simplest way to treat a byte is as an 8-digit binary number, which gives us a range of values from 00000000 to 11111111 in binary, or 0 to 255 in decimal (0x00 to 0xFF in hex). From these simple building blocks, we can start building much larger structures; for example, if we wanted to store a larger number, we might use 32 bits (4 bytes), ordered in a pre-agreed fashion.
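As an illustration of "ordered in a pre-agreed fashion", this Python 3 sketch stores the number 2644 in 4 bytes using the two common byte orders. The `struct` format codes `>I` and `<I` mean big-endian and little-endian unsigned 32-bit integers; the choice of 2644 is arbitrary.

```python
import struct

value = 2644  # 0x00000A54

big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

print(big.hex())     # 00000a54
print(little.hex())  # 540a0000

# Decoding with the agreed order recovers the number...
print(struct.unpack(">I", big)[0])  # 2644
# ...while decoding with the wrong order silently yields a different one.
print(struct.unpack("<I", big)[0] == value)  # False
```

Nothing in the 4 bytes themselves says which order was used, which is why both sides have to agree on it beforehand.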
But we want to store text, not numbers, so various encodings for text have also been developed over time; the ASCII encoding is probably the most well-known text encoding. It is a 7-bit encoding, meaning that only values in the range 0 through 127 are used (due to historical reasons, when the 8th bit was being used for other purposes, and thus unavailable to encode character information). ASCII is nothing more than a table mapping characters to numbers; if you have a text string of 5 characters, you look up the number for each character, and end up with a sequence of 5 numbers, which can be stored in 5 bytes of memory. Something to note here is that ASCII is both a character set (the list of characters it encodes) and an encoding (because it specifies a way to encode those characters as bytes); these two concepts are not always lumped together, as we’ll see shortly.
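The table lookup described above can be tried directly. This sketch uses Python 3, where text lives in `str` and raw bytes in `bytes`; the sample strings are arbitrary.

```python
text = "Hello"

# Look up the number for each character...
numbers = [ord(ch) for ch in text]
print(numbers)  # [72, 101, 108, 108, 111]

# ...which is exactly what encoding to ASCII stores, one byte per character.
raw = text.encode("ascii")
print(list(raw) == numbers, len(raw))  # True 5

# A character outside the ASCII character set simply cannot be encoded.
try:
    "Weißes".encode("ascii")
except UnicodeEncodeError as error:
    print("cannot encode:", error.object[error.start])  # cannot encode: ß
```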
In a US/English-centric world, ASCII works pretty well, but once you go beyond that, you start running into difficulties: you need to use characters in your document that just aren’t available in ASCII, because the character set is too small. At this point in history, the constraints on using the 8th bit were no longer relevant, which freed up an extra 128 values (128 – 255) for use; thus, a variety of new encodings sprang up (the ISO-8859-* family) that were just ASCII + region specific characters. If you only use ASCII characters, your text would be compatible with any of these encodings, so they are all “backwards compatible” in that sense; but there isn’t generally any way to mix different encodings within the same document, so if you need to use extra characters from both ISO-8859-2 and ISO-8859-4, you still have problems. Also, there is still a vast host of characters (for example, the Chinese/Japanese/Korean characters) in use that aren’t representable in *any* of these encodings. Today, the ISO-8859-1 encoding is most common in software / documents using one of these encodings, and often software is misconfigured to decode text in this format, even when some other encoding has been used.
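A quick way to see both the backwards compatibility and the problem with the ISO-8859-* family is to decode the same bytes with several of its members. This is a Python 3 sketch; the byte 0xB1 and the three encodings were picked arbitrarily for illustration.

```python
encodings = ("iso-8859-1", "iso-8859-2", "iso-8859-5")

# Bytes below 128 mean the same thing everywhere: plain ASCII.
ascii_bytes = b"plain text"
print({ascii_bytes.decode(name) for name in encodings})  # {'plain text'}

# A byte in the upper half means something different in each encoding.
high_byte = b"\xb1"
for name in encodings:
    print(name, high_byte.decode(name))
# iso-8859-1 ±
# iso-8859-2 ą
# iso-8859-5 Б
```

The byte itself carries no hint about which table applies, so a document can only be read correctly if the encoding is known from somewhere else.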
Enter Unicode and the Universal Character Set standard; you can read about the differences between Unicode and UCS elsewhere, but I will just talk about Unicode here for the sake of simplicity. Among other things, the Unicode standard contains a vast character set; rather than the 128 or 256 characters of the limited character sets I’ve discussed so far, Unicode has over 100,000 characters, and specifies various attributes and properties of these characters which are often important when writing software to display them on screen or print them, as well as in other contexts. In addition, Unicode specifies several different encodings of this character set; unlike previous encodings I have mentioned, where character sets and encoding schemes went hand in hand, the Unicode character set simply assigns a number, or “codepoint” to each character, and then the various encoding schemes provide a mapping between codepoints and raw bytes.
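The split between "character", "codepoint" and "properties" can be inspected from Python 3, whose standard `unicodedata` module exposes part of the Unicode character database. The three sample characters are arbitrary.

```python
import unicodedata

for ch in "Aß\u3053":  # the last one is the hiragana letter "ko"
    codepoint = ord(ch)  # the number Unicode assigns to the character
    print(f"U+{codepoint:04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+0041 LATIN CAPITAL LETTER A Lu
# U+00DF LATIN SMALL LETTER SHARP S Ll
# U+3053 HIRAGANA LETTER KO Lo
```

Note that nothing here says how many bytes any of these characters occupy; that is decided separately, by whichever encoding is used to write the codepoints out.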
The main encodings associated with Unicode are UTF-8, UTF-16, and UTF-32. UTF-8 is a variable-length encoding, which means that the number of bytes corresponding to each character varies. UTF-32 (and the UCS-4 encoding, which is essentially equivalent) is a fixed-length encoding that uses 32-bit integers (4 bytes) for each character, and thus raises endianness issues (the order in which the 4 bytes are written). Finally, UTF-16 is a complete mess, where codepoints under 2 ** 16 are stored as a 16-bit integer, and codepoints over that are stored as a pair of special reserved characters (called a surrogate pair) in the range below 2 ** 16, each of which is then encoded like any other character in that range. UCS-2 is essentially the same, except that it simply does not allow for any characters outside of the 16-bit range.
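The differences between the three encodings show up clearly when the same codepoints are encoded with each. This Python 3 sketch uses the big-endian variants so that no byte order mark is added; the four sample characters are arbitrary, with U+1D11E (a musical clef symbol) chosen because it lies above 2 ** 16.

```python
samples = ["A", "\u00df", "\u3053", "\U0001d11e"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}", utf8.hex(), utf16.hex(), utf32.hex())
# U+0041 41 0041 00000041
# U+00DF c39f 00df 000000df
# U+3053 e38193 3053 00003053
# U+1D11E f09d849e d834dd1e 0001d11e

# Endianness: the same codepoint, two byte orders.
print("A".encode("utf-32-be").hex(), "A".encode("utf-32-le").hex())  # 00000041 41000000
```

UTF-8 grows from 1 to 4 bytes, UTF-32 always takes 4, and UTF-16 takes 2 bytes until the last sample, where it switches to the surrogate pair D834 DD1E.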
So, what does this all mean? Well, for one thing, if you’re writing an application that handles text of any kind, you will need to decode the incoming text, and in order to do that correctly, you will need to know what encoding it was encoded with. If you’re writing an HTTP server / web application, the information is provided in the Content-Type HTTP header (as the charset parameter); if you’re implementing some other protocol, hopefully it either specifies a particular encoding, or provides a mechanism for communicating the encoding to the other side. Also, if you’re sending text to other people, you need to make sure you’re encoding it with the correct encoding; if you say your HTML document is ISO-8859-1, but it’s encoded with UTF-8, then someone is going to get garbage in their browser.
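Here is a rough Python 3 sketch of both halves of that paragraph: reading the declared encoding from a Content-Type header value, and what happens when the declaration is wrong. The function name `charset_from_content_type` is made up for this example, and borrowing the standard library's e-mail header parser for an HTTP header is a convenience, not a recommendation; a real web framework will normally do this for you.

```python
from email.message import Message


def charset_from_content_type(header_value, fallback=None):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or fallback


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None

# The body really is UTF-8, but the sender claims ISO-8859-1.
body = "Weißes Fleisch".encode("utf-8")
claimed = charset_from_content_type("text/html; charset=ISO-8859-1")
garbage = body.decode(claimed)
print(garbage == "Weißes Fleisch")  # False: the two UTF-8 bytes of ß became two characters
print(body.decode("utf-8"))         # Weißes Fleisch
```

Decoding with the wrong ISO-8859-* encoding never raises an error, because every byte value maps to some character; the mistake only shows up as garbage on screen.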
There are different mechanisms for handling text in different languages / libraries, so consult the relevant documentation to find out what the correct way to do it is in your particular environment, but as a bonus, I’m going to give a brief rundown of how it all works in Python. In Python, the ‘str’ type contains raw bytes, not text. The name of the type that stores text is ‘unicode’; unsurprisingly, this type can only store characters that are present in the Unicode character set. Depending on how your Python interpreter was compiled, the unicode type uses either UTF-16 or UTF-32 internally to store the text, but you don’t generally have to worry about this. To turn a str object into a unicode object, you need to decode with the correct encoding; for example:
>>> print 'Wei\xc3\x9fes Fleisch'.decode('utf-8')
Weißes Fleisch
>>> print unicode('\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81B', 'shift-jis')
こんにちは。
(Both ways of decoding are essentially equivalent.) Likewise, to turn a unicode object into a str object, encode with the correct encoding:
>>> u'Weißes Fleisch'.encode('utf-8')
'Wei\xc3\x9fes Fleisch'
Unfortunately, Python will automatically encode and decode strings for you under some circumstances, using the “default encoding”, which should always be ascii. For example:
>>> 'foo' + u'bar'
u'foobar'
As you can see, Python has automatically decoded the first string before performing the concatenation. This is bad; if the string was not encoded in ASCII, then you will either get a UnicodeDecodeError exception, or garbage data:
>>> 'Wei\xc3\x9fes Fleisch' + u'haha'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
To avoid this kind of problem, always encode and decode explicitly. You generally want to do this at abstraction boundaries; when you’re handling incoming data, determine the correct encoding, and then decode there; then work with unicode objects within the guts of your application, and only encode the text again once you send it back out onto the network, or write it to a file, or otherwise output it in some fashion.
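As a sketch of the "decode on the way in, encode on the way out" pattern, here is a small function written for Python 3, where the raw-bytes type is called `bytes` and the text type is called `str` (the roles played by ‘str’ and ‘unicode’ in the examples above), and where mixing the two raises a TypeError instead of decoding implicitly. The function name `truncate_text` and its parameters are invented for this illustration.

```python
def truncate_text(raw, max_chars, in_encoding, out_encoding=None):
    """Shorten encoded text to max_chars characters without corrupting it."""
    text = raw.decode(in_encoding)                      # boundary: bytes in, text from here on
    shortened = text[:max_chars]                        # work on characters, not bytes
    return shortened.encode(out_encoding or in_encoding)  # boundary: text back out as bytes


incoming = b"Wei\xc3\x9fes Fleisch"  # UTF-8 bytes, as they might arrive from the network

print(truncate_text(incoming, 4, "utf-8"))                # b'Wei\xc3\x9f'
print(truncate_text(incoming, 4, "utf-8", "iso-8859-1"))  # b'Wei\xdf'

# Slicing the raw bytes instead cuts the two-byte character in half.
print(incoming[:4])  # b'Wei\xc3'
```

Because the middle of the function only ever sees text, it does not care which encoding the data arrived in or which one it leaves in; only the two boundary lines need to know.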
UPDATE: Fixed a few typos / errors, and added some headings.
Unfortunately, it seems like the author didn't have enough space to tell his story properly; many of his characters start developing, but then just get stranded as the narration is taken up by the details of the events in the story. The ending also seems somewhat abrupt; it is almost as if the storyteller suddenly realised he had to leave for an important meeting, and needed to end his story off as quickly as possible. Despite this, I found it an enjoyable read, and would recommend it to anyone else interested in this kind of science fiction.