.. _overcoming-frustration: ========================================================== Overcoming frustration: Correctly using unicode in python2 ========================================================== In python-2.x, there's two types that deal with text. 1. :class:`str` is for strings of bytes. These are very similar in nature to how strings are handled in C. 2. :class:`unicode` is for strings of unicode :term:`code points`. .. note:: **Just what the dickens is "Unicode"?** One mistake that people encountering this issue for the first time make is confusing the :class:`unicode` type and the encodings of unicode stored in the :class:`str` type. In python, the :class:`unicode` type stores an abstract sequence of :term:`code points`. Each :term:`code point` represents a :term:`grapheme`. By contrast, byte :class:`str` stores a sequence of bytes which can then be mapped to a sequence of :term:`code points`. Each unicode encoding (:term:`UTF-8`, UTF-7, UTF-16, UTF-32, etc) maps different sequences of bytes to the unicode :term:`code points`. What does that mean to you as a programmer? When you're dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with :class:`unicode` strings as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte :class:`str` as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters. In the python2 world many APIs use these two classes interchangably but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other type, you may end up with an exception being raised (:exc:`UnicodeDecodeError` or :exc:`UnicodeEncodeError`). However, these exceptions aren't always raised because python implicitly converts between types... *sometimes*. ----------------------------------- Frustration #1: Inconsistent Errors ----------------------------------- Although converting when possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like: ``The quick brown fox jumped over the lazy dog`` and not encounter any issues. But when they release their software into the wild, someone enters the string: ``I sat down for coffee at the café`` and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with :term:`ASCII` characters. Once you throw non-:term:`ASCII` characters into your strings, you have to start dealing with the conversion manually. So, if I manually convert everything to either byte :class:`str` or :class:`unicode` strings, will I be okay? The answer is.... *sometimes*. --------------------------------- Frustration #2: Inconsistent APIs --------------------------------- The problem you run into when converting everything to byte :class:`str` or :class:`unicode` strings is that you'll be using someone else's API quite often (this includes the APIs in the |stdlib|_) and find that the API will only accept byte :class:`str` or only accept :class:`unicode` strings. Or worse, that the code will accept either when you're dealing with strings that consist solely of :term:`ASCII` but throw an error when you give it a string that's got non-:term:`ASCII` characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct type for that code. Thus the programmer that wants to proactively fix all unicode errors in their code needs to do two things: 1. You must keep track of what type your sequences of text are. Does ``my_sentence`` contain :class:`unicode` or :class:`str`? If you don't know that then you're going to be in for a world of hurt. 2. Anytime you call a function you need to evaluate whether that function will do the right thing with :class:`str` or :class:`unicode` values. Sending the wrong value here will lead to a :exc:`UnicodeError` being thrown when the string contains non-:term:`ASCII` characters. .. note:: There is one mitigating factor here. The python community has been standardizing on using :class:`unicode` in all its APIs. Although there are some APIs that you need to send byte :class:`str` to in order to be safe, (including things as ubiquitous as :func:`print` as we'll see in the next section), it's getting easier and easier to use :class:`unicode` strings with most APIs. ------------------------------------------------ Frustration #3: Inconsistent treatment of output ------------------------------------------------ Alright, since the python community is moving to using :class:`unicode` strings everywhere, we might as well convert everything to :class:`unicode` strings and use that by default, right? Sounds good most of the time but there's at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte :class:`str`. Python will try to implicitly convert from :class:`unicode` to byte :class:`str`... but it will throw an exception if the bytes are non-:term:`ASCII`:: >>> string = unicode(raw_input(), 'utf8') café >>> log = open('/var/tmp/debug.log', 'w') >>> log.write(string) Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128) Okay, this is simple enough to solve: Just convert to a byte :class:`str` and we're all set:: >>> string = unicode(raw_input(), 'utf8') café >>> string_for_output = string.encode('utf8', 'replace') >>> log = open('/var/tmp/debug.log', 'w') >>> log.write(string_for_output) >>> So that was simple, right? Well... there's one gotcha that makes things a bit harder to debug sometimes. When you attempt to write non-:term:`ASCII` :class:`unicode` strings to a file-like object you get a traceback everytime. But what happens when you use :func:`print`? The terminal is a file-like object so it should raise an exception right? The answer to that is.... *sometimes*: .. code-block:: pycon $ python >>> print u'café' café No exception. Okay, we're fine then? We are until someone does one of the following: * Runs the script in a different locale: .. code-block:: pycon $ LC_ALL=C python >>> # Note: if you're using a good terminal program when running in the C locale >>> # The terminal program will prevent you from entering non-ASCII characters >>> # python will still recognize them if you use the codepoint instead: >>> print u'caf\xe9' Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128) * Redirects output to a file: .. code-block:: pycon $ cat test.py #!/usr/bin/python -tt # -*- coding: utf-8 -*- print u'café' $ ./test.py >t Traceback (most recent call last): File "./test.py", line 4, in print u'café' UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128) Okay, the locale thing is a pain but understandable: the C locale doesn't understand any characters outside of :term:`ASCII` so naturally attempting to display those won't work. Now why does redirecting to a file cause problems? It's because :func:`print` in python2 is treated specially. Whereas the other file-like objects in python always convert to :term:`ASCII` unless you set them up differently, using :func:`print` to output to the terminal will use the user's locale to convert before sending the output to the terminal. When :func:`print` is not outputting to the terminal (being redirected to a file, for instance), :func:`print` decides that it doesn't know what locale to use for that file and so it tries to convert to :term:`ASCII` instead. So what does this mean for you, as a programmer? Unless you have the luxury of controlling how your users use your code, you should always, always, always convert to a byte :class:`str` before outputting strings to the terminal or to a file. Python even provides you with a facility to do just this. If you know that every :class:`unicode` string you send to a particular file-like object (for instance, :data:`~sys.stdout`) should be converted to a particular encoding you can use a :class:`codecs.StreamWriter` object to convert from a :class:`unicode` string into a byte :class:`str`. In particular, :func:`codecs.getwriter` will return a :class:`~codecs.StreamWriter` class that will help you to wrap a file-like object for output. Using our :func:`print` example: .. code-block:: python $ cat test.py #!/usr/bin/python -tt # -*- coding: utf-8 -*- import codecs import sys UTF8Writer = codecs.getwriter('utf8') sys.stdout = UTF8Writer(sys.stdout) print u'café' $ ./test.py >t $ cat t café ----------------------------------------- Frustrations #4 and #5 -- The other shoes ----------------------------------------- In English, there's a saying "waiting for the other shoe to drop". It means that when one event (usually bad) happens, you come to expect another event (usually worse) to come after. In this case we have two other shoes. Frustration #4: Now it doesn't take byte strings?! ================================================== If you wrap :data:`sys.stdout` using :func:`codecs.getwriter` and think you are now safe to print any variable without checking its type I am afraid I must inform you that you're not paying enough attention to :term:`Murphy's Law`. The :class:`~codecs.StreamWriter` that :func:`codecs.getwriter` provides will take :class:`unicode` strings and transform them into byte :class:`str` before they get to :data:`sys.stdout`. The problem is if you give it something that's already a byte :class:`str` it tries to transform that as well. To do that it tries to turn the byte :class:`str` you give it into :class:`unicode` and then transform that back into a byte :class:`str`... and since it uses the :term:`ASCII` codec to perform those conversions, chances are that it'll blow up when making them:: >>> import codecs >>> import sys >>> UTF8Writer = codecs.getwriter('utf8') >>> sys.stdout = UTF8Writer(sys.stdout) >>> print 'café' Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python2.6/codecs.py", line 351, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128) To work around this, kitchen provides an alternate version of :func:`codecs.getwriter` that can deal with both byte :class:`str` and :class:`unicode` strings. Use :func:`kitchen.text.converters.getwriter` in place of the :mod:`codecs` version like this:: >>> import sys >>> from kitchen.text.converters import getwriter >>> UTF8Writer = getwriter('utf8') >>> sys.stdout = UTF8Writer(sys.stdout) >>> print u'café' café >>> print 'café' café ------------------------------------------- Frustration #5: Inconsistent APIs Part deux ------------------------------------------- Sometimes you do everything right in your code but other people's code fails you. With unicode issues this happens more often than we want. A glaring example of this is when you get values back from a function that aren't consistently :class:`unicode` string or byte :class:`str`. An example from the |stdlib|_ is :mod:`gettext`. The :mod:`gettext` functions are used to help translate messages that you display to users in the users' native languages. Since most languages contain letters outside of the :term:`ASCII` range, the values that are returned contain unicode characters. :mod:`gettext` provides you with :meth:`~gettext.GNUTranslations.ugettext` and :meth:`~gettext.GNUTranslations.ungettext` to return these translations as :class:`unicode` strings and :meth:`~gettext.GNUTranslations.gettext`, :meth:`~gettext.GNUTranslations.ngettext`, :meth:`~gettext.GNUTranslations.lgettext`, and :meth:`~gettext.GNUTranslations.lngettext` to return them as encoded byte :class:`str`. Unfortunately, even though they're documented to return only one type of string or the other, the implementation has corner cases where the wrong type can be returned. This means that even if you separate your :class:`unicode` string and byte :class:`str` correctly before you pass your strings to a :mod:`gettext` function, afterwards, you might have to check that you have the right sort of string type again. .. note:: :mod:`kitchen.i18n` provides alternate gettext translation objects that return only byte :class:`str` or only :class:`unicode` string. --------------- A few solutions --------------- Now that we've identified the issues, can we define a comprehensive strategy for dealing with them? Convert text at the border ========================== If you get some piece of text from a library, read from a file, etc, turn it into a :class:`unicode` string immediately. Since python is moving in the direction of :class:`unicode` strings everywhere it's going to be easier to work with :class:`unicode` strings within your code. If your code is heavily involved with using things that are bytes, you can do the opposite and convert all text into byte :class:`str` at the border and only convert to :class:`unicode` when you need it for passing to another library or performing string operations on it. In either case, the important thing is to pick a default type for strings and stick with it throughout your code. When you mix the types it becomes much easier to operate on a string with a function that can only use the other type by mistake. .. note:: In python3, the abstract unicode type becomes much more prominent. The type named ``str`` is the equivalent of python2's :class:`unicode` and python3's ``bytes`` type replaces python2's :class:`str`. Most APIs deal in the unicode type of string with just some pieces that are low level dealing with bytes. The implicit conversions between bytes and unicode is removed and whenever you want to make the conversion you need to do so explicitly. When the data needs to be treated as bytes (or unicode) use a naming convention =============================================================================== Sometimes you're converting nearly all of your data to :class:`unicode` strings but you have one or two values where you have to keep byte :class:`str` around. This is often the case when you need to use the value verbatim with some external resource. For instance, filenames or key values in a database. When you do this, use a naming convention for the data you're working with so you (and others reading your code later) don't get confused about what's being stored in the value. If you need both a textual string to present to the user and a byte value for an exact match, consider keeping both versions around. You can either use two variables for this or a :class:`dict` whose key is the byte value. .. note:: You can use the naming convention used in kitchen as a guide for implementing your own naming convention. It prefixes byte :class:`str` variables of unknown encoding with ``b_`` and byte :class:`str` of known encoding with the encoding name like: ``utf8_``. If the default was to handle :class:`str` and only keep a few :class:`unicode` values, those variables would be prefixed with ``u_``. When outputting data, convert back into bytes ============================================= When you go to send your data back outside of your program (to the filesystem, over the network, displaying to the user, etc) turn the data back into a byte :class:`str`. How you do this will depend on the expected output format of the data. For displaying to the user, you can use the user's default encoding using :func:`locale.getpreferredencoding`. For entering into a file, you're best bet is to pick a single encoding and stick with it. .. warning:: When using the encoding that the user has set (for instance, using :func:`locale.getpreferredencoding`, remember that they may have their encoding set to something that can't display every single unicode character. That means when you convert from :class:`unicode` to a byte :class:`str` you need to decide what should happen if the byte value is not valid in the user's encoding. For purposes of displaying messages to the user, it's usually okay to use the ``replace`` encoding error handler to replace the invalid characters with a question mark or other symbol meaning the character couldn't be displayed. You can use :func:`kitchen.text.converters.getwriter` to do this automatically for :data:`sys.stdout`. When creating exception messages be sure to convert to bytes manually. When writing unittests, include non-ASCII values and both unicode and str type ============================================================================== Unless you know that a specific portion of your code will only deal with :term:`ASCII`, be sure to include non-:term:`ASCII` values in your unittests. Including a few characters from several different scripts is highly advised as well because some code may have special cased accented roman characters but not know how to handle characters used in Asian alphabets. Similarly, unless you know that that portion of your code will only be given :class:`unicode` strings or only byte :class:`str` be sure to try variables of both types in your unittests. When doing this, make sure that the variables are also non-:term:`ASCII` as python's implicit conversion will mask problems with pure :term:`ASCII` data. In many cases, it makes sense to check what happens if byte :class:`str` and :class:`unicode` strings that won't decode in the present locale are given. Be vigilant about spotting poor APIs ==================================== Make sure that the libraries you use return only :class:`unicode` strings or byte :class:`str`. Unittests can help you spot issues here by running many variations of data through your functions and checking that you're still getting the types of string that you expect. Example: Putting this all together with kitchen =============================================== The kitchen library provides a wide array of functions to help you deal with byte :class:`str` and :class:`unicode` strings in your program. Here's a short example that uses many kitchen functions to do its work:: #!/usr/bin/python -tt # -*- coding: utf-8 -*- import locale import os import sys import unicodedata from kitchen.text.converters import getwriter, to_bytes, to_unicode from kitchen.i18n import get_translation_object if __name__ == '__main__': # Setup gettext driven translations but use the kitchen functions so # we don't have the mismatched bytes-unicode issues. translations = get_translation_object('example') # We use _() for marking strings that we operate on as unicode # This is pretty much everything _ = translations.ugettext # And b_() for marking strings that we operate on as bytes. # This is limited to exceptions b_ = translations.lgettext # Setup stdout encoding = locale.getpreferredencoding() Writer = getwriter(encoding) sys.stdout = Writer(sys.stdout) # Load data. Format is filename\0description # description should be utf-8 but filename can be any legal filename # on the filesystem # Sample datafile.txt: # /etc/shells\x00Shells available on caf\xc3\xa9.lan # /var/tmp/file\xff\x00File with non-utf8 data in the filename # # And to create /var/tmp/file\xff (under bash or zsh) do: # echo 'Some data' > /var/tmp/file$'\377' datafile = open('datafile.txt', 'r') data = {} for line in datafile: # We're going to keep filename as bytes because we will need the # exact bytes to access files on a POSIX operating system. # description, we'll immediately transform into unicode type. b_filename, description = line.split('\0', 1) # to_unicode defaults to decoding output from utf-8 and replacing # any problematic bytes with the unicode replacement character # We accept mangling of the description here knowing that our file # format is supposed to use utf-8 in that field and that the # description will only be displayed to the user, not used as # a key value. description = to_unicode(description, 'utf-8').strip() data[b_filename] = description datafile.close() # We're going to add a pair of extra fields onto our data to show the # length of the description and the filesize. We put those between # the filename and description because we haven't checked that the # description is free of NULLs. datafile = open('newdatafile.txt', 'w') # Name filename with a b_ prefix to denote byte string of unknown encoding for b_filename in data: # Since we have the byte representation of filename, we can read any # filename if os.access(b_filename, os.F_OK): size = os.path.getsize(b_filename) else: size = 0 # Because the description is unicode type, we know the number of # characters corresponds to the length of the normalized unicode # string. length = len(unicodedata.normalize('NFC', description)) # Print a summary to the screen # Note that we do not let implici type conversion from str to # unicode transform b_filename into a unicode string. That might # fail as python would use the ASCII filename. Instead we use # to_unicode() to explictly transform in a way that we know will # not traceback. print _(u'filename: %s') % to_unicode(b_filename) print _(u'file size: %s') % size print _(u'desc length: %s') % length print _(u'description: %s') % data[b_filename] # First combine the unicode portion line = u'%s\0%s\0%s' % (size, length, data[b_filename]) # Since the filenames are bytes, turn everything else to bytes before combining # Turning into unicode first would be wrong as the bytes in b_filename # might not convert b_line = '%s\0%s\n' % (b_filename, to_bytes(line)) # Just to demonstrate that getwriter will pass bytes through fine print b_('Wrote: %s') % b_line datafile.write(b_line) datafile.close() # And just to show how to properly deal with an exception. # Note two things about this: # 1) We use the b_() function to translate the string. This returns a # byte string instead of a unicode string # 2) We're using the b_() function returned by kitchen. If we had # used the one from gettext we would need to convert the message to # a byte str first message = u'Demonstrate the proper way to raise exceptions. Sincerely, \u3068\u3057\u304a' raise Exception(b_(message)) .. seealso:: :mod:`kitchen.text.converters`