-----------------------
Kitchen.text.converters
-----------------------

.. automodule:: kitchen.text.converters

Byte Strings and Unicode in Python2
===================================

Python2 has two string types, :class:`str` and :class:`unicode`.
:class:`unicode` represents an abstract sequence of text characters. It can
hold any character that is present in the unicode standard. :class:`str` can
hold any byte of data. The operating system and python work together to
display these bytes as characters in many cases but you should always keep in
mind that the information is really a sequence of bytes, not a sequence of
characters. In python2 these types are interchangeable much of the time. They
are one of the few pairs of types that automatically convert when compared for
equality::

    >>> # string is converted to unicode and then compared
    >>> "I am a string" == u"I am a string"
    True
    >>> # Other types, like int, don't have this special treatment
    >>> 5 == "5"
    False

However, this automatic conversion tends to lull people into a false sense of
security. As long as you're dealing with :term:`ASCII` characters the
automatic conversion will save you from seeing any differences. Once you
start using characters that are not in :term:`ASCII`, you will start getting
:exc:`UnicodeError` and :exc:`UnicodeWarning` as the automatic conversions
between the types fail::

    >>> "I am an ñ" == u"I am an ñ"
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    False

Why do these conversions fail? The reason is that the python2
:class:`unicode` type represents an abstract sequence of unicode text known as
:term:`code points`. :class:`str`, on the other hand, really represents
a sequence of bytes. Those bytes are converted by your operating system to
appear as characters on your screen using a particular encoding (usually
with a default defined by the operating system and customizable by the
individual user). Although :term:`ASCII` characters are fairly standard in
what bytes represent each character, the bytes outside of the :term:`ASCII`
range are not. In general, each encoding will map a different character to
a particular byte. Newer encodings map individual characters to multiple
bytes (which the older encodings will instead treat as multiple characters).
In the face of these differences, python refuses to guess at an encoding and
instead issues a warning or exception and refuses to convert.

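
As a rough illustration of the paragraph above, the following sketch decodes
the same byte with three encodings that ship with Python. The variable names
are only for illustration, and the ``b''`` and ``u''`` prefixes are spelled
out so that the snippet should behave the same on Python 2.6+ and Python 3.3+:

.. code-block:: python

    # The same single byte means a different character in each encoding.
    raw = b'\xf1'
    as_latin1 = raw.decode('latin-1')   # LATIN SMALL LETTER N WITH TILDE
    as_cp437 = raw.decode('cp437')      # PLUS-MINUS SIGN
    as_cp1251 = raw.decode('cp1251')    # CYRILLIC SMALL LETTER ES

    # A newer encoding such as UTF-8 spends two bytes on the first character...
    utf8_bytes = as_latin1.encode('utf-8')
    # ...which an older single-byte encoding misreads as two characters.
    misread = utf8_bytes.decode('latin-1')

    # ASCII defines nothing above 0x7f, so decoding fails instead of guessing.
    try:
        raw.decode('ascii')
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

Nothing in the bytes themselves says which of these readings is the intended
one, which is why the encoding has to be supplied explicitly.
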
.. seealso::
    :ref:`overcoming-frustration`
        For a longer introduction on this subject.

Strategy for Explicit Conversion
================================

So what is the best method of dealing with this weltering babble of incoherent
encodings? The basic strategy is to explicitly turn everything into
:class:`unicode` when it first enters your program. Then, when you send it to
output, you can transform the unicode back into bytes. Doing this allows you
to control the encodings that are used and avoid getting tracebacks due to
:exc:`UnicodeError`. Using the functions defined in this module, that looks
something like this:

.. code-block:: pycon
    :linenos:

    >>> from kitchen.text.converters import to_unicode, to_bytes
    >>> name = raw_input('Enter your name: ')
    Enter your name: Toshio くらとみ
    >>> name
    'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
    >>> type(name)
    <type 'str'>
    >>> unicode_name = to_unicode(name)
    >>> type(unicode_name)
    <type 'unicode'>
    >>> unicode_name
    u'Toshio \u304f\u3089\u3068\u307f'
    >>> # Do a lot of other things before needing to save/output again:
    >>> output = open('datafile', 'w')
    >>> output.write(to_bytes(u'Name: %s\n' % unicode_name))

A few notes:

Looking at line 6, you'll notice that the input we took from the user was
a byte :class:`str`. In general, anytime we're getting a value from outside
of python (the filesystem, reading data from the network, interacting with an
external command, reading values from the environment) we are interacting with
something that will want to give us a byte :class:`str`. Some |stdlib|_
modules and third party libraries will automatically attempt to convert a byte
:class:`str` to :class:`unicode` strings for you. This is both a boon and
a curse. If the library can guess correctly about the encoding that the data
is in, it will return :class:`unicode` objects to you without you having to
convert. However, if it can't guess correctly, you may end up with one of
several problems:

:exc:`UnicodeError`
    The library attempted to decode a byte :class:`str` into
    a :class:`unicode` string, failed, and raised an exception.
Garbled data
    If the library returns the data after decoding it with the wrong encoding,
    the characters you see in the :class:`unicode` string won't be the ones
    that you expect.
A byte :class:`str` instead of :class:`unicode` string
    Some libraries will return a :class:`unicode` string when they're able to
    decode the data and a byte :class:`str` when they can't. This is
    generally the hardest problem to debug when it occurs. Avoid it in your
    own code and try to avoid or open bugs against upstreams that do this. See
    :ref:`DesigningUnicodeAwareAPIs` for strategies to do this properly.

On line 8, we convert from a byte :class:`str` to a :class:`unicode` string.
:func:`~kitchen.text.converters.to_unicode` does this for us. It has some
error handling and sane defaults that make this a nicer function to use than
calling :meth:`str.decode` directly:

* Instead of defaulting to the :term:`ASCII` encoding which fails with all
  but the simple American English characters, it defaults to :term:`UTF-8`.
* Instead of raising an error if it cannot decode a value, it will replace
  the value with the unicode "Replacement character" symbol (``�``).
* If you happen to call this function with something that is not
  a :class:`str` or :class:`unicode`, it will return an empty
  :class:`unicode` string.

All three of these can be overridden using different keyword arguments to the
function. See the :func:`to_unicode` documentation for more information.

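
To see what those three defaults amount to, here is a minimal sketch built
only on the standard ``decode`` method. ``decode_leniently`` is a hypothetical
name used for illustration; it is not kitchen's implementation of
:func:`to_unicode` and it ignores the other options that function accepts:

.. code-block:: python

    def decode_leniently(value, encoding='utf-8', errors='replace'):
        """Hypothetical stand-in that mimics the three defaults listed above."""
        if isinstance(value, bytes):
            # Bytes are decoded as UTF-8; undecodable bytes become U+FFFD.
            return value.decode(encoding, errors)
        if isinstance(value, type(u'')):
            # Text is passed through untouched.
            return value
        # Anything that is not a string yields an empty text string.
        return u''

    good = decode_leniently(b'caf\xc3\xa9')   # valid UTF-8
    bad = decode_leniently(b'caf\xe9')        # latin-1 bytes, not valid UTF-8
    other = decode_leniently(5)               # not a string at all

Here ``good`` is ``u'caf\xe9'``, ``bad`` ends in the replacement character
(``u'caf\ufffd'``), and ``other`` is the empty string.
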
On line 15 we push the data back out to a file. Two things you should note here:

1. We deal with the strings as :class:`unicode` until the last instant. The
   string format that we're using is :class:`unicode` and the variable also
   holds :class:`unicode`. People sometimes get into trouble when they mix
   a byte :class:`str` format with a variable that holds a :class:`unicode`
   string (or vice versa) at this stage.
2. :func:`~kitchen.text.converters.to_bytes` does the reverse of
   :func:`to_unicode`. In this case, we're using the default values which
   turn :class:`unicode` into a byte :class:`str` using :term:`UTF-8`. Any
   errors are replaced with a ``�`` and sending nonstring objects yields empty
   :class:`unicode` strings. Just like :func:`to_unicode`, you can look at
   the documentation for :func:`to_bytes` to find out how to override any of
   these defaults.

When to use an alternate strategy
---------------------------------

The default strategy of decoding to :class:`unicode` strings when you take
data in and encoding to a byte :class:`str` when you send the data back out
works great for most problems but there are a few times when you shouldn't:

* The values aren't meant to be read as text
* The values need to be byte-for-byte when you send them back out -- for
  instance if they are database keys or filenames.
* You are transferring the data between several libraries that all expect
  byte :class:`str`.

In each of these instances, there is a reason to keep around the byte
:class:`str` version of a value. Here are a few hints to keep your sanity in
these situations:

1. Keep your :class:`unicode` and :class:`str` values separate. Just like the
   pain caused when you have to use someone else's library that returns both
   :class:`unicode` and :class:`str` you can cause yourself pain if you have
   functions that can return both types or variables that could hold either
   type of value.
2. Name your variables so that you can tell whether you're storing byte
   :class:`str` or :class:`unicode` string. One of the first things you end
   up having to do when debugging is determine what type of string you have in
   a variable and what type of string you are expecting. Naming your
   variables consistently so that you can tell which type they are supposed to
   hold will save you from at least one of those steps.
3. When you get values initially, make sure that you're dealing with the type
   of value that you expect as you save it. You can use :func:`isinstance`
   or :func:`to_bytes` since :func:`to_bytes` doesn't do any modifications of
   the string if it's already a :class:`str`. When using :func:`to_bytes`
   for this purpose you might want to use::

        try:
            b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
        except:
            handle_errors_somehow()

   The reason is that the default of :func:`to_bytes` will take characters
   that are illegal in the chosen encoding and transform them to replacement
   characters. Since the point of keeping this data as a byte :class:`str` is
   to keep the exact same bytes when you send it outside of your code,
   changing things to replacement characters should be raising red flags that
   something is wrong. Setting :attr:`errors` to ``strict`` will raise an
   exception which gives you an opportunity to fail gracefully.
4. Sometimes you will want to print out the values that you have in your byte
   :class:`str`. When you do this you will need to make sure that you
   transform :class:`unicode` to :class:`str` before combining them. Also be
   sure that any other function calls (including :mod:`gettext`) are going to
   give you strings that are the same type. For instance::

        print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}

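
The hints above can be sketched with nothing but the standard library.
``require_bytes`` is a hypothetical helper named for this example only (it is
not part of kitchen); it uses :func:`isinstance` to reject anything that is
not already a byte string instead of silently converting it, and the ``b_``
prefix marks the names that hold bytes:

.. code-block:: python

    def require_bytes(value):
        """Return value unchanged if it is a byte string, else raise TypeError.

        Hypothetical helper for illustration; it is not part of kitchen.
        """
        if not isinstance(value, bytes):
            raise TypeError('expected a byte string, got %s'
                            % type(value).__name__)
        return value

    # The b_ prefix marks names that hold bytes, following hint 2.
    b_username = require_bytes(b'caf\xc3\xa9')
    # Bytes are only ever combined with other bytes, following hint 4.
    b_line = b'Username: ' + b_username

    # Text is rejected outright rather than converted behind your back.
    try:
        require_bytes(u'caf\xe9')
        rejected = False
    except TypeError:
        rejected = True

Because the bytes are never decoded and re-encoded, ``b_line`` carries exactly
the bytes that came in.
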
Gotchas and how to avoid them
=============================

Even when you have a good conceptual understanding of how python2 treats
:class:`unicode` and :class:`str` there are still some things that can
surprise you. In most cases this is because, as noted earlier, python or one
of the python libraries you depend on is trying to convert a value
automatically and failing. Explicit conversion at the appropriate place
usually solves that.

str(obj)
--------

One common idiom for getting a simple, string representation of an object is to use::

    str(obj)

Unfortunately, this is not safe. Sometimes ``str(obj)`` will return
:class:`unicode`. Sometimes it will return a byte :class:`str`. Sometimes,
it will attempt to convert from a :class:`unicode` string to a byte
:class:`str`, fail, and throw a :exc:`UnicodeError`. To be safe from all of
these, first decide whether you need :class:`unicode` or :class:`str` to be
returned. Then use :func:`to_unicode` or :func:`to_bytes` to get the simple
representation like this::

    u_representation = to_unicode(obj, nonstring='simplerepr')
    b_representation = to_bytes(obj, nonstring='simplerepr')

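
The failing case can be reproduced without kitchen. In python2, calling
``str()`` on a :class:`unicode` string encodes it with the :term:`ASCII`
codec behind the scenes; the sketch below performs that same encode step
explicitly so that it also runs on Python 3, and then shows the explicit
alternative. The variable names are illustrative only:

.. code-block:: python

    u_text = u'I am an \xf1'

    # The implicit conversion python2 attempts for str(u_text).
    try:
        u_text.encode('ascii')
        ascii_ok = True
    except UnicodeEncodeError:
        ascii_ok = False

    # Choosing the encoding yourself removes the surprise.
    b_text = u_text.encode('utf-8')
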
print
-----

python has a builtin :func:`print` statement that outputs strings to the
terminal. This originated in a time when python only dealt with byte
:class:`str`. When :class:`unicode` strings came about, some enhancements
were made to the :func:`print` statement so that it could print those as well.
The enhancements make :func:`print` work most of the time. However, the times
when it doesn't work tend to make for cryptic debugging.

The basic issue is that :func:`print` has to figure out what encoding to use
when it prints a :class:`unicode` string to the terminal. When python is
attached to your terminal (i.e., you're running the interpreter or running
a script that prints to the screen) python is able to take the encoding value
from your locale settings :envvar:`LC_ALL` or :envvar:`LC_CTYPE` and print the
characters allowed by that encoding. On most modern Unix systems, the
encoding is :term:`utf-8` which means that you can print any :class:`unicode`
character without problem.

There are two common cases of things going wrong:

1. Someone has a locale set that does not accept all valid unicode characters.
   For instance::

        $ LC_ALL=C python
        >>> print u'\ufffd'
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)

   This often happens when a script that you've written and debugged from the
   terminal is run from an automated environment like :program:`cron`. It
   also occurs when you have written a script using a :term:`utf-8` aware
   locale and released it for consumption by people all over the internet.
   Inevitably, someone is running with a locale that can't handle all unicode
   characters and you get a traceback reported.

2. You redirect output to a file. Python isn't using the values in
   :envvar:`LC_ALL` unconditionally to decide what encoding to use. Instead,
   it uses the encoding set for the terminal you are printing to, and it is
   that terminal encoding which :envvar:`LC_ALL` controls. If you redirect
   to a file, you are no longer printing to the terminal so :envvar:`LC_ALL`
   won't have any effect. At this point, python will decide it can't find an
   encoding and fall back to :term:`ASCII` which will likely lead to
   :exc:`UnicodeError` being raised. You can see this in a short script::

        #! /usr/bin/python -tt
        print u'\ufffd'

   And then look at the difference between running it normally and redirecting to a file:

   .. code-block:: console

        $ ./test.py
        �
        $ ./test.py > t
        Traceback (most recent call last):
          File "test.py", line 3, in <module>
            print u'\ufffd'
        UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)

The short answer to dealing with this is to always use bytes when writing
output. You can do this by explicitly converting to bytes like this::

    from kitchen.text.converters import to_bytes
    u_string = u'\ufffd'
    print to_bytes(u_string)

or you can wrap stdout and stderr with a :class:`~codecs.StreamWriter`.
A :class:`~codecs.StreamWriter` is convenient in that you can assign it to
encode for :data:`sys.stdout` or :data:`sys.stderr` and then have output
automatically converted, but it has the drawback of still being able to throw
:exc:`UnicodeError` if the writer can't encode all possible unicode
codepoints. Kitchen provides an alternate version which can be retrieved with
:func:`kitchen.text.converters.getwriter` which will not traceback in its
standard configuration.

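
The drawback of the standard :class:`~codecs.StreamWriter` can be seen with
the standard library alone. In the sketch below an :class:`io.BytesIO` object
stands in for a byte-oriented ``sys.stdout`` so the result can be inspected.
Passing ``errors='replace'`` is one way to keep the standard writer from
raising; it is shown as an illustration only and is not a description of how
kitchen's :func:`~kitchen.text.converters.getwriter` is implemented:

.. code-block:: python

    import codecs
    import io

    # A strict ASCII writer raises as soon as it meets a character it
    # cannot encode.
    strict_buf = io.BytesIO()
    strict_writer = codecs.getwriter('ascii')(strict_buf)
    try:
        strict_writer.write(u'\ufffd')
        strict_failed = False
    except UnicodeEncodeError:
        strict_failed = True

    # The same writer with errors='replace' substitutes '?' instead.
    lenient_buf = io.BytesIO()
    lenient_writer = codecs.getwriter('ascii')(lenient_buf, errors='replace')
    lenient_writer.write(u'caf\xe9\n')
    lenient_output = lenient_buf.getvalue()

After this runs, ``strict_failed`` is true and ``lenient_output`` holds the
bytes ``caf?`` followed by a newline.
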
.. _unicode-and-dict-keys:

Unicode, str, and dict keys
---------------------------

The :func:`hash` of the :term:`ASCII` characters is the same for
:class:`unicode` and byte :class:`str`. When you use them in :class:`dict`
keys, they evaluate to the same dictionary slot::

    >>> u_string = u'a'
    >>> b_string = 'a'
    >>> hash(u_string), hash(b_string)
    (12416037344, 12416037344)
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string] = 'bytes'
    >>> d
    {u'a': 'bytes'}

When you deal with key values outside of :term:`ASCII`, :class:`unicode` and
byte :class:`str` evaluate unequally no matter what their character content or
hash value::

    >>> u_string = u'ñ'
    >>> b_string = u_string.encode('utf-8')
    >>> print u_string
    ñ
    >>> print b_string
    ñ
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string] = 'bytes'
    >>> d
    {u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
    >>> b_string2 = '\xf1'
    >>> hash(u_string), hash(b_string2)
    (30848092528, 30848092528)
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string2] = 'bytes'
    >>> d
    {u'\xf1': 'unicode', '\xf1': 'bytes'}

How do you work with this one? Remember rule #1: Keep your :class:`unicode`
and byte :class:`str` values separate. That goes for keys in a dictionary
just like anything else.

* For any given dictionary, make sure that all your keys are either
  :class:`unicode` or :class:`str`. **Do not mix the two.** If you're being
  given both :class:`unicode` and :class:`str` but you don't need to preserve
  separate keys for each, I recommend using :func:`to_unicode` or
  :func:`to_bytes` to convert all keys to one type or the other like this::

        >>> from kitchen.text.converters import to_unicode
        >>> u_string = u'one'
        >>> b_string = 'two'
        >>> d = {}
        >>> d[to_unicode(u_string)] = 1
        >>> d[to_unicode(b_string)] = 2
        >>> d
        {u'two': 2, u'one': 1}

* These issues also apply to using dicts with tuple keys that contain
  a mixture of :class:`unicode` and :class:`str`. Once again the best fix
  is to standardise on either :class:`str` or :class:`unicode`.

* If you absolutely need to store values in a dictionary where the keys could
  be either :class:`unicode` or :class:`str` you can use
  :class:`~kitchen.collections.strictdict.StrictDict` which has separate
  entries for all :class:`unicode` and byte :class:`str` and deals correctly
  with any :class:`tuple` containing mixed :class:`unicode` and byte
  :class:`str`.

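
When a dictionary with mixed keys already exists, the first recommendation
above can be applied after the fact. The helper below is a hypothetical
sketch using only the standard ``decode`` method (``normalize_keys`` is not
a kitchen function), and it assumes the byte keys are :term:`UTF-8`:

.. code-block:: python

    def normalize_keys(mapping, encoding='utf-8'):
        """Return a copy of mapping in which every byte-string key is text.

        Hypothetical helper for illustration.  If a byte key and a text key
        normalize to the same text, whichever is visited last wins.
        """
        normalized = {}
        for key, value in mapping.items():
            if isinstance(key, bytes):
                key = key.decode(encoding, 'replace')
            normalized[key] = value
        return normalized

    mixed = {b'\xc3\xb1': 'from bytes', u'two': 2}
    u_keyed = normalize_keys(mixed)

After normalization every key in ``u_keyed`` is text, so a lookup with
``u'\xf1'`` finds the entry that was stored under the byte key.
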
---------
Functions
---------

Unicode and byte str conversion
===============================

.. autofunction:: kitchen.text.converters.to_unicode
.. autofunction:: kitchen.text.converters.to_bytes
.. autofunction:: kitchen.text.converters.getwriter
.. autofunction:: kitchen.text.converters.to_str
.. autofunction:: kitchen.text.converters.to_utf8

Transformation to XML
=====================

.. autofunction:: kitchen.text.converters.unicode_to_xml
.. autofunction:: kitchen.text.converters.xml_to_unicode
.. autofunction:: kitchen.text.converters.byte_string_to_xml
.. autofunction:: kitchen.text.converters.xml_to_byte_string
.. autofunction:: kitchen.text.converters.bytes_to_xml
.. autofunction:: kitchen.text.converters.xml_to_bytes
.. autofunction:: kitchen.text.converters.guess_encoding_to_xml
.. autofunction:: kitchen.text.converters.to_xml

Working with exception messages
===============================

.. autodata:: kitchen.text.converters.EXCEPTION_CONVERTERS
.. autodata:: kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS
.. autofunction:: kitchen.text.converters.exception_to_unicode
.. autofunction:: kitchen.text.converters.exception_to_bytes