by mithrandi on Jun.26, 2011
I often find myself helping newer Twisted users come to grips with arranging their flow control when they first start writing code that uses Deferreds. While the Deferred Reference does a reasonable job of covering all of the details, it is often difficult to make the intuitive leap from the synchronous patterns one is used to, to their asynchronous equivalents. To that end, I often find that comparing sync and async versions is illustrative; there are some examples of this nature in the Deferred Reference, but some patterns are missing, and I’ve never actually put all of the examples down in one place, so I thought I’d do that in my blog post. Without any further ado, here they are:
EDIT: Added composition example
Call a function, and use the result
# Synchronous version result = getSomething() doSomething(result) # Asynchronous version d = getSomething() d.addCallback(doSomething)
Call a function and use the result, catching a particular exception
# Synchronous version try: result = getSomething() doSomething(result) except SomeException as e: handleError(e) # Asynchronous version def _eb(f): f.trap(SomeException) handleError(f) d = getSomething() d.addCallback(doSomething) d.addErrback(_eb)
Call a function and use the result, catching any exception
# Synchronous version try: result = getSomething() doSomething(result) except: log.err() # Asynchronous version d = getSomething() d.addCallback(doSomething) d.addErrback(log.err)
Call a function and use the result, catching exceptions raised by that function
# Synchronous version try: result = getSomething() except: log.err() else: doSomething(result) # Asynchronous version d = getSomething() d.addCallbacks(doSomething, log.err)
Call a function and use the result, recovering from a particular exception raised by the function
# Synchronous version try: result = getSomething() except SomeException: result = 42 doSomething(result) # Asynchronous version def _eb(f): f.trap(SomeException) return 42 d = getSomething() d.addErrback(_eb) d.addCallback(doSomething)
Call a function and use the result, performing cleanup if an exception occurs
# Synchronous version try: result = getSomething() doSomething(result) finally: cleanStuffUp() # Asynchronous version d = getSomething() d.addCallback(doSomething) d.addBoth(lambda ignored: cleanStuffUp())
Compose several functions
# Synchronous version result = getSomething() result2 = doStuff(result) result3 = doMoreStuff(result2) # Asynchronous version d = getSomething() d.addCallback(doStuff) d.addCallback(doMoreStuff)
If anyone has any suggestions for other examples I should add to this list, feel free to leave a comment or drop me a note, and I’ll consider updating the post.
by mithrandi on Dec.13, 2008
Since my earlier post on this subject, someone has since brought to my attention this blog post by James Bennett. James writes well, cutting straight through to the real issues at hand, but in some places I think his facts are incorrect, and in other places I draw different conclusions to the ones he draws.
First up, Unicode strings vs. byte strings. In fact, these are handled in almost exactly the same fashion in Python 2 and in Python 3; both languages have a type for storing strings of characters, and a type for storing strings of arbitrary bytes (including things like data read from a network socket, and the encoded form of character strings). In Python 3, the str type is for storing character strings, and the bytes type is for storing byte strings. In Python 2, the unicode type is for storing character strings, and the str type is for storing byte strings. That’s really the only difference; the Python 2 str type has some methods that the Python 3 bytes type doesn’t, but that’s a relatively unimportant difference. The real problem in Python 2 is that many people have used the str type to store character strings, when they really should have been using the unicode type; this includes things built into the language (like attribute names or function names), various stdlib modules, and vast oceans of third-party code.
What does Python 3 do to solve this? Well, not all that much, except for completely breaking everyone’s existing string-handling code; I guess James assumes that in the process of fixing all of their string-handling code, they’ll get things right this time around, but I’m somewhat less optimistic. Still, I think it is important to point out that Python 3 does *not* give you any additional tools for dealing with character / byte strings, nor does it make it any easier to work with them; at best, it just fixes some of the broken character / byte string-handling code that was being distributed with Python.
With that out the way, I’ll move on to the “different conclusions” part. First up, the “Death by a thousand cuts”; I know many programmers feel similarly about the myriad minor issues he mentions, but I’m simply not one of them. Sure, there are all sorts of minor annoyances, and they do start to add up over time, but they’re simply irrelevant compared with the big issues. I might spend two weeks out of a whole year dealing with them, as opposed to months of time spent working around the lack of production-quality libraries for certain tasks, or the lack of higher-level programming constructs requiring me to write pages and pages of lower-level code to solve a certain problem. I’ll admit that I used to find these minor issues a great annoyance, but over time, they’ve just faded away to background noise, just like much of the supposedly major differentiating factors between different libraries and different programming languages. Once you see the forest, you stop caring so much about the trees.
Speaking of libraries, the new standard-library reorganisation is all very exciting; but I would really have liked it if they’d spent the time and energy on actually improving the code to a level suitable for production applications. It really doesn’t matter how most of the standard library is organised, if you’re not going to be using any of it anyway. In addition, projects reorganise APIs *all the time*, and there’s a perfectly straightforward way to do it in a backwards-compatible fashion. You introduce the new API or new location of the API, deprecate the old one, and then eventually remove it. No Python 3-style chasm-of-incompatibility required.
Of course, some of the standard library changes are actual functional improvements, not just rearranging the deck chairs; I haven’t looked at it yet myself, but I’ll take it on faith that the new I/O library is a vast improvement over the old Python 2 I/O facilities. Except… you don’t need to break backwards-compatibility to introduce a new I/O library; and I assume it’ll be ported to Python 2 sooner or later. Indeed, this is a common trend in Python 3 improvements; all the really interesting functional improvements are stuff that can and most likely will be ported to Python 2, if it has not already been ported.
If Python was a brand new language, being developed from scratch with a brand new community, I would be very happy about all of the changes made in Python 3; but since it’s not, I must repeat my claim that aside from things that can be backported to Python 2 in the first place, absolutely none of the Python 3 changes are worth making the jump to what is essentially a whole new programming language.
by mithrandi on Nov.20, 2008
If you're a Python coder, and you haven't been living under a rock for the past few years, then you've probably heard of a thing called Python 3000, or Python 3.0. Python 3.0 is presented as a "new version" of Python, but with a twist: it represents a complete backwards compatibility break with the Python 2.x series, in order to make a whole slew of backwards-incompatible changes with no direct migration path. Thus, in some sense, this is actually a brand new language that just looks a lot like Python 2.x.
Is this a problem? Superficially, it doesn't "feel" like, say, trying to migrate from Java to C#; after all, it's just Python with a few cleanups and new features, right? Unfortunately, any non-trivial codebase is going to be riddled with code that is affected by the migration, and so the effort involved is still quite substantial. The problem runs deeper than that, however; code doesn't exist in a vacuum, it exists in a community of other code. All of your Python code has to run in the same runtime, so if you port your code to Python 3.x, you need all of the libraries / frameworks that you depend on to be available for Python 3.x. Of course, if you port your library to Python 3.x, people still using Python 2.x can't use any newer versions of your library anymore, so you either leave those users completely stranded, or you have to maintain two versions of your library simultaneously; that's really not much fun.
The Python developers have a solution to this problem: a tool called 2to3. 2to3 attempts to automatically convert Python 2.x code to Python 3.x code. Obviously, it can't do the right thing in every case, so the idea is that you modify your Python 2.x code so that it remains valid and correct Python 2.x code, but is also in a form that 2to3 can automatically convert to valid and correct Python 3.x code. Thus, you can simply maintain your Python 2.x codebase, generating the Python 3.x version, until one day you decide you no longer need Python 2.x code; at this point, you would run 2to3 over your codebase one last time and commit it, thus dropping Python 2.x forever.
Sounds great, right? Unfortunately, 2to3 is still in a fairly immature state, with lots of bugs and issues. In addition, there are many situations where 2to3 simply can't do the right thing. For example, in code that handles strings correctly, any use of the
str type would simply be converted to use of the
bytes type, and
unicode would become
str. However, a lot of code incorrectly uses the
str type for strings of text, or because it uses APIs that are incorrectly written; in these cases, use of the
str type might need to be converted to the Python 3.x
str type, and 2to3 is never going to be able to figure this out on its own.
In the end, it's not yet clear that 2to3 will be a viable migration strategy, which opens the door to the possibility that we will see a schism in the Python ecosystem; with some people clinging to Python 2.x, while other people working in Python 3.x, isolated from their 2.x counterparts. Fortunately, the community has a lot of time to think about this: the 2.x series will probably continue until at least 2.7, if not beyond that. I'm personally still on Python 2.4 (my production servers run Debian stable), so I've got at least another 3 versions to go, which is quite a lot of time. If a viable migration strategy hasn't emerged by then, then I'm hoping that I'll either be able to switch from CPython to PyPy, or I'll already have made the jump to greener pastures, and thus not have to worry about Python at all.
by mithrandi on Apr.11, 2008
Every now and then, I have to help someone understand some aspect of text encoding, Unicode, character sets, etc. and I’ve never come across a handy reference to which I could point people, so I figured I’d better write one myself.
The first thing to realise is that basically all data storage is about encoding. You have some kind of underlying layer (stone tablets, papyrus, a hard drive, whatever) and you want to manipulate it in a way that lets you (or someone else) examine those manipulations and reconstruct the data; the manipulation phase is called “encoding”, and the examination phase is called “decoding”. Of course, there are many different ways to stuff some information onto the papyrus (or whatever your medium is); for example, if I want to encode the number 2644 to store it on a piece of paper, I can use Arabic numerals in decimal (2644), Arabic numerals in hexadecimal (0xA54), Roman numerals (MMDCXLIV), and so on. The same applies to all sorts of other encodings of other kinds of data; for example, if I want to store a picture in a file, I have to choose between image encodings such as PNG and GIF.
All of these involve a common idea of some “abstract idea” (such as a number, or a picture), and a concrete encoding that is used to store that idea, and communicate it with others — but of course, you cannot actually manipulate abstract ideas on a computer, so when you decode some data, in reality you are always encoding it into another encoding at the same time, otherwise you couldn’t do anything about it. This may make the process seem a bit pointless, but we tend to build all sorts of useful abstractions in computers, and decoding data often allows you to move to a higher level of abstraction. For example, if you decode an image stored in PNG or GIF format, the result is a whole bunch of image pixel values, which you must still store in memory somehow; but you can use the same format regardless of whether those values came from a PNG file, a GIF file, or even a JPEG file.
However, this post is about text, not other kinds of data, so let’s fast forward to the good part. Computer memory is, on a basic level, a physical encoding of numbers. The smallest addressable slice of memory is typically 8 bits, or a byte. (Some obscure architectures work differently, but I’ll exclude those from my discussion here, in the interests of sanity). As a collection of bits, the simplest way to treat a byte is as an 8-digit binary number, which gives us a range of values from 00000000 to 11111111 in binary, or 0 to 255 in decimal (0×00 to 0xFF in hex). From these simple building blocks, we can start building much larger structures; for example, if we wanted to store a larger number, we might use 32 bits (4 bytes), ordered in a pre-agreed fashion.
But we want to store text, not numbers, so various encodings for text have also been developed over time; the ASCII encoding is probably the most well-known text encoding. It is a 7-bit encoding, meaning that only values in the range 0 through 127 are used (due to historical reasons, when the 8th bit was being used for other purposes, and thus unavailable to encode character information). ASCII is nothing more than a table mapping characters to numbers; if you have a text string of 5 characters, you look up the number for each character, and end up with a sequence of 5 numbers, which can be stored in 5 bytes of memory. Something to note here is that ASCII is both a character set (the list of characters it encodes) and an encoding (because it specifies a way to encode those characters as bytes); these two concepts are not always lumped together, as we’ll see shortly.
In a US/English-centric world, ASCII works pretty well, but once you go beyond that, you start running into difficulties: you need to use characters in your document that just aren’t available in ASCII — the character set is too small. At this point in history, the constraints on using the 8th bit were no longer relevant, which freed up an extra 128 values (128 – 255) for use; thus, a variety of new encodings sprung up (the ISO-8859-* family) that were just ASCII + region specific characters. If you only use ASCII characters, your text would be compatible with any of these encodings, so they are all “backwards compatible” in that sense; but there isn’t generally any way to mix different encodings within the same document, so if you need to use extra characters from both ISO-8859-2 and ISO-8859-4, you still have problems. Also, there is still a vast host of characters (for example, the Chinese/Japanese/Korean characters) in use that aren’t representable in *any* of these encodings. Today, the ISO-8859-1 encoding is most common in software / documents using one of these encodings, and often software is misconfigured to decode text in this format, even when some other encoding has been used.
Enter Unicode and the Universal Character Set standard; you can read about the differences between Unicode and UCS elsewhere, but I will just talk about Unicode here for the sake of simplicity. Among other things, the Unicode standard contains a vast character set; rather than the 128 or 256 characters of the limited character sets I’ve discussed so far, Unicode has over 100,000 characters, and specifies various attributes and properties of these characters which are often important when writing software to display them on screen or print them, as well as in other contexts. In addition, Unicode specifies several different encodings of this character set; unlike previous encodings I have mentioned, where character sets and encoding schemes went hand in hand, the Unicode character set simply assigns a number, or “codepoint” to each character, and then the various encoding schemes provide a mapping between codepoints and raw bytes.
The main encodings associated with Unicode are UTF-8, UTF-16, and UTF-32. UTF-8 is a variable-length encoding, which means that the number of bytes corresponding to each character varies; UTF-32 (and the UCS-4 encoding, which is essentially equivalent) is a fixed-length encoding that uses 32-bit integers (4 bytes) for each character, and thus raises endianness issues (the order in which the 4 bytes are written; and finally, UTF-16 is a complete mess, where codepoints under 2 ** 16 are stored as a 16-bit integer, and codepoints over that are stored as a pair of special reserved characters (called a surrogate pair) in the range below 2 ** 16 and then encoded like any other character in that range (UCS-2 is essentially the same, except it simply does not allow for any characters outside of the 16-bit range).
So, what does this all mean? Well, for one thing, if you’re writing an application that handles text of any kind, you will need to decode the incoming text, and in order to do that correctly, you will need to know what encoding it was encoded with. If you’re writing an HTTP server / web application, the information is provided in the HTTP header; if you’re implementing some other protocol, hopefully it either specifies a particular encoding, or provides a mechanism for communicating the encoding to the other side. Also, if you’re sending text to other people, you need to make sure you’re encoding it with the correct encoding; if you say your HTML document is ISO-8859-1, but it’s encoded with UTF-8, then someone is going to get garbage in their browser.
There are different mechanisms for handling text in different languages / libraries, so consult the relevant documentation to find out what th
e correct way to do it is in your particular environment, but as a bonus, I’m going to give a brief rundown of how it all works in Python. In Python, the ‘str’ type contains raw bytes, not text. The name of the type that stores text is ‘unicode’; unsurprisingly, this type can only store characters that are present in the Unicode character set. Depending on how your Python interpreter was compiled, the unicode type uses either UTF-16 or UTF-32 internally to store the text, but you don’t generally have to worry about this. To turn a str object into a unicode object, you need to decode with the correct decoding; for example:
>>> print 'Wei\xc3\x9fes Fleisch'.decode('utf-8') Weißes Fleisch >>> print unicode('\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81B', 'shift-jis') こんにちは。
(Both ways of decoding are essentially equivalent.) Likewise, to turn a unicode object into a str object, encode with the correct encoding:
>>> u'Weißes Fleisch'.encode('utf-8') 'Wei\xc3\x9fes Fleisch'
Unfortunately, Python will automatically encode and decode strings for you under some circumstances, using the “default encoding”, which should always be ascii. For example:
>>> 'foo' + u'bar' u'foobar'
As you can see, Python has automatically decoded the first string before performing the concatenation. This is bad; if the string was not encoded in ASCII, then you will either get a UnicodeDecodeError exception, or garbage data:
>>> 'Wei\xc3\x9fes Fleisch' + u'haha' Traceback (most recent call last): File "", line 1, in ? UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
To avoid this kind of problem, always encode and decode explicitly. You generally want to do this at abstraction boundaries; when you’re handling incoming data, determine the correct encoding, and then decode there; then work with unicode objects within the guts of your application, and only encode the text again once you send it back out onto the network, or write it to a file, or otherwise output it in some fashion.
UPDATE: Fixed a few typos / errors, and added some headings.
by mithrandi on Apr.09, 2007
__init__ in Python kinda sucks. Actually, the linked example involves a method other than
__init__, but this is probably the most common situation in which this problem arises.
There is a pattern I sometimes use in this situation, which provides cooperative handling of kwargs, with the caveat that the argument namespace is shared across the whole inheritance hierarchy. The usage looks something like this:
class Base(object): def __init__(self, foo): self.foo = foo class A(Base): def __init__(self, bar=5, **kw): # consume bar parameter super(A, self).__init__(**kw) self.bar = bar # sample instantiations A(foo=10) A(foo=5, bar=10) class B(Base): def __init__(self, baz, **kw): # consume baz parameter super(B, self).__init__(**kw) self.baz = baz # sample instantiations B(foo=7, baz=9) class C(A, B): def __init__(self, foo, baz=10, quux, **kw): # munge foo, and make baz default to 10 # also consume quux super(B, self).__init__(foo=foo + 5, baz=baz, **kw) self.quux = quux # sample instantiations C(foo=2, bar=10, baz=7) C(foo=8)
I’m not going to debate the larger issues here; I’m just presenting this as something that hopefully be a useful tool for handling this kind of situation. In the cases where I’ve used it, it has worked quite well, although you will probably run into trouble when many parts of the inheritance tree are under different people’s control, due to the shared argument namespace.
by mithrandi on Jun.02, 2006
So, Colin has been ranting and raving about Django and templating and such again; subsequently, I had a brief conversation on IRC with someone else about the subject, and figured I’d do a bit of a braindump into my blog.
Like Colin, I have a strong dislike of using plaintext templating mechanisms for generating anything other than text/plain data. This is primarily due to the mixing of levels that this kind of usage entails; it is a design flaw very similar to the kind of design flaw that enables SQL injection attacks in poorly written applications. Your data should be filtering through a higher level abstraction which is then serialized by code specifically written to implement the output format; dumping it straight into a plaintext template is just begging for trouble, in the form of random garbage not conforming to the expected format. One only needs to look at all the malformed RSS feeds out there to see an example of how bad this can get.
I do think that using plaintext templating for emitting more structured data has its place in things like wikiish systems (eg. JotSpot) or CMSish systems, as an enabling technology for people who are nominally non-programmers; but for a “real” programmer with a full array of libraries and language at their disposal, it seems to just be bad system design.
In closing, I’ll note that I use Nevow for (X)HTML templating, making heavy use of patterns and templates, with a few render specials. If you’re not familiar with Nevow, that probably sounds like complete Greek; I suggest you do some more reading to get an idea of how Nevow templates work, but in very broad terms a pattern is an XML element marked with a
nevow:pattern="name" element; the element is then not directly present in the output, but is available to the presentation code, and is usually used for making N copies of a particular element structure. A slot, on the other hand, is a
element, which is “filled” by the presentation code; that is, replaced with content supplied at render time.