-----------------------
Kitchen.text.converters
-----------------------

.. automodule:: kitchen.text.converters

Byte Strings and Unicode in Python2
===================================

Python2 has two string types, :class:`str` and :class:`unicode`.
:class:`unicode` represents an abstract sequence of text characters. It can
hold any character that is present in the unicode standard. :class:`str` can
hold any byte of data. The operating system and python work together to
display these bytes as characters in many cases but you should always keep in
mind that the information is really a sequence of bytes, not a sequence of
characters. In python2 these types are interchangeable much of the time.
They are one of the few pairs of types that automatically convert when
used in equality::

    >>> # string is converted to unicode and then compared
    >>> "I am a string" == u"I am a string"
    True
    >>> # Other types, like int, don't have this special treatment
    >>> 5 == "5"
    False

However, this automatic conversion tends to lull people into a false sense of
security. As long as you're dealing with :term:`ASCII` characters the
automatic conversion will save you from seeing any differences. Once you
start using characters that are not in :term:`ASCII`, you will start getting
:exc:`UnicodeError` and :exc:`UnicodeWarning` as the automatic conversions
between the types fail::

    >>> "I am an ñ" == u"I am an ñ"
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    False

Why do these conversions fail? The reason is that the python2
:class:`unicode` type represents an abstract sequence of unicode text known as
:term:`code points`. :class:`str`, on the other hand, really represents
a sequence of bytes. Those bytes are converted by your operating system to
appear as characters on your screen using a particular encoding (usually
with a default defined by the operating system and customizable by the
individual user). Although the bytes that represent each :term:`ASCII`
character are fairly standard, the bytes outside of the :term:`ASCII`
range are not. In general, each encoding maps a particular byte to
a different character. Newer encodings map individual characters to multiple
bytes (which the older encodings will instead treat as multiple characters).
In the face of these differences, python refuses to guess at an encoding and
instead issues a warning or exception and refuses to convert.
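
As a small illustration of the paragraph above, the snippet below decodes the
same two bytes with three different encodings. It is only a sketch: it uses
nothing but the builtin codecs, the variable names are made up for the
example, and it is written so that it should run unchanged on python2.7 and
python3.

.. code-block:: python

    # The UTF-8 encoding of u'\xf1' (LATIN SMALL LETTER N WITH TILDE) is two bytes.
    b_data = u'\xf1'.encode('utf-8')
    assert b_data == b'\xc3\xb1'

    # Decoded with the encoding that produced them, the bytes are one character.
    assert b_data.decode('utf-8') == u'\xf1'

    # An older single-byte encoding treats the same bytes as two characters.
    assert b_data.decode('latin-1') == u'\xc3\xb1'

    # ASCII assigns no meaning to bytes above 0x7f so decoding fails outright.
    try:
        b_data.decode('ascii')
    except UnicodeDecodeError:
        ascii_failed = True
    else:
        ascii_failed = False
    assert ascii_failed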

.. seealso::
    :ref:`overcoming-frustration`
        For a longer introduction on this subject.

Strategy for Explicit Conversion
================================

So what is the best method of dealing with this weltering babble of incoherent
encodings? The basic strategy is to explicitly turn everything into
:class:`unicode` when it first enters your program. Then, when you send it to
output, you can transform the unicode back into bytes. Doing this allows you
to control the encodings that are used and avoid getting tracebacks due to
:exc:`UnicodeError`. Using the functions defined in this module, that looks
something like this:

.. code-block:: pycon
    :linenos:

    >>> from kitchen.text.converters import to_unicode, to_bytes
    >>> name = raw_input('Enter your name: ')
    Enter your name: Toshio くらとみ
    >>> name
    'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
    >>> type(name)
    <type 'str'>
    >>> unicode_name = to_unicode(name)
    >>> type(unicode_name)
    <type 'unicode'>
    >>> unicode_name
    u'Toshio \u304f\u3089\u3068\u307f'
    >>> # Do a lot of other things before needing to save/output again:
    >>> output = open('datafile', 'w')
    >>> output.write(to_bytes(u'Name: %s\n' % unicode_name))

A few notes:

Looking at line 6, you'll notice that the input we took from the user was
a byte :class:`str`. In general, anytime we're getting a value from outside
of python (the filesystem, reading data from the network, interacting with an
external command, reading values from the environment) we are interacting with
something that will want to give us a byte :class:`str`. Some |stdlib|_
modules and third party libraries will automatically attempt to convert a byte
:class:`str` to :class:`unicode` strings for you. This is both a boon and
a curse. If the library can guess correctly about the encoding that the data
is in, it will return :class:`unicode` objects to you without you having to
convert. However, if it can't guess correctly, you may end up with one of
several problems:

:exc:`UnicodeError`
    The library attempted to decode a byte :class:`str` into
    a :class:`unicode` string, failed, and raised an exception.
Garbled data
    If the library returns the data after decoding it with the wrong encoding,
    the characters you see in the :class:`unicode` string won't be the ones that
    you expect.
A byte :class:`str` instead of :class:`unicode` string
    Some libraries will return a :class:`unicode` string when they're able to
    decode the data and a byte :class:`str` when they can't. This is
    generally the hardest problem to debug when it occurs. Avoid it in your
    own code and try to avoid or open bugs against upstreams that do this. See
    :ref:`DesigningUnicodeAwareAPIs` for strategies to do this properly.

On line 8, we convert from a byte :class:`str` to a :class:`unicode` string.
:func:`~kitchen.text.converters.to_unicode` does this for us. It has some
error handling and sane defaults that make this a nicer function to use than
calling :meth:`str.decode` directly:

* Instead of defaulting to the :term:`ASCII` encoding which fails with all
  but the simple American English characters, it defaults to :term:`UTF-8`.
* Instead of raising an error if it cannot decode a value, it will replace
  the value with the unicode "Replacement character" symbol (``�``).
* If you happen to call this function with something that is not a :class:`str`
  or :class:`unicode`, it will return an empty :class:`unicode` string.

All three of these can be overridden using different keyword arguments to the
function. See the :func:`to_unicode` documentation for more information.
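
To make the three defaults concrete, here is a rough sketch of the same
behaviour written with nothing but the builtin ``decode`` method.
``decode_input`` is a hypothetical helper invented for this example. It is
not kitchen's implementation of :func:`to_unicode` and it ignores the keyword
arguments that the real function accepts. It is written so that it should run
unchanged on python2.7 and python3.

.. code-block:: python

    def decode_input(value, encoding='utf-8', errors='replace'):
        """Return text for ``value`` using the defaults described above.

        Hypothetical helper for illustration only; it is not kitchen's code.
        """
        text_type = type(u'')  # unicode on python2, str on python3
        if isinstance(value, text_type):
            # Already text: hand it back untouched.
            return value
        if isinstance(value, bytes):
            # Bytes that are invalid in ``encoding`` become U+FFFD.
            return value.decode(encoding, errors)
        # Anything that is not a string at all becomes an empty text string.
        return u''


    assert decode_input(b'caf\xc3\xa9') == u'caf\xe9'
    assert decode_input(b'\xff') == u'\ufffd'
    assert decode_input(5) == u''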

On line 15 we push the data back out to a file. Two things you should note here:

1. We deal with the strings as :class:`unicode` until the last instant. The
   string format that we're using is :class:`unicode` and the variable also
   holds :class:`unicode`. People sometimes get into trouble when they mix
   a byte :class:`str` format with a variable that holds a :class:`unicode`
   string (or vice versa) at this stage.
2. :func:`~kitchen.text.converters.to_bytes` does the reverse of
   :func:`to_unicode`. In this case, we're using the default values which
   turn :class:`unicode` into a byte :class:`str` using :term:`UTF-8`. Any
   errors are replaced with a ``�`` and sending nonstring objects yields empty
   :class:`unicode` strings. Just like :func:`to_unicode`, you can look at
   the documentation for :func:`to_bytes` to find out how to override any of
   these defaults.
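
The whole round trip can also be sketched without kitchen. The example below
is an approximation of the strategy, assuming :term:`UTF-8` at both
boundaries. It uses the builtin ``decode`` and ``encode`` methods instead of
:func:`to_unicode` and :func:`to_bytes`, writes to a temporary directory
rather than a real ``datafile``, and should run unchanged on python2.7 and
python3.

.. code-block:: python

    import io
    import os
    import shutil
    import tempfile

    # Input boundary: bytes arrive from outside and are decoded once.
    b_name = b'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
    u_name = b_name.decode('utf-8', 'replace')

    # Inside the program both the format string and the value are text.
    u_line = u'Name: %s\n' % u_name

    # Output boundary: encode once, then write bytes to a file opened in
    # binary mode.
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, 'datafile')
    try:
        with io.open(path, 'wb') as output:
            output.write(u_line.encode('utf-8', 'replace'))
        with io.open(path, 'rb') as saved:
            b_saved = saved.read()
    finally:
        shutil.rmtree(tmpdir)

    # The bytes on disk match the bytes that came in.
    assert b_saved == b'Name: ' + b_name + b'\n'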

When to use an alternate strategy
---------------------------------

The default strategy of decoding to :class:`unicode` strings when you take
data in and encoding to a byte :class:`str` when you send the data back out
works great for most problems but there are a few times when you shouldn't
use it:

* The values aren't meant to be read as text.
* The values need to be byte-for-byte identical when you send them back out --
  for instance if they are database keys or filenames.
* You are transferring the data between several libraries that all expect
  byte :class:`str`.

In each of these instances, there is a reason to keep around the byte
:class:`str` version of a value. Here are a few hints to keep your sanity in
these situations:

1. Keep your :class:`unicode` and :class:`str` values separate. Just like the
   pain caused when you have to use someone else's library that returns both
   :class:`unicode` and :class:`str` you can cause yourself pain if you have
   functions that can return both types or variables that could hold either
   type of value.
2. Name your variables so that you can tell whether you're storing byte
   :class:`str` or :class:`unicode` string. One of the first things you end
   up having to do when debugging is determine what type of string you have in
   a variable and what type of string you are expecting. Naming your
   variables consistently so that you can tell which type they are supposed to
   hold will save you from at least one of those steps.
3. When you get values initially, make sure that you're dealing with the type
   of value that you expect as you save it. You can use :func:`isinstance`
   or :func:`to_bytes` since :func:`to_bytes` doesn't do any modifications of
   the string if it's already a :class:`str`. When using :func:`to_bytes`
   for this purpose you might want to use::

       try:
           b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
       except (UnicodeError, TypeError):
           handle_errors_somehow()

   The reason is that the default of :func:`to_bytes` will take characters
   that are illegal in the chosen encoding and transform them to replacement
   characters. Since the point of keeping this data as a byte :class:`str` is
   to keep the exact same bytes when you send it outside of your code,
   changing things to replacement characters should be raising red flags that
   something is wrong. Setting :attr:`errors` to ``strict`` will raise an
   exception which gives you an opportunity to fail gracefully.
4. Sometimes you will want to print out the values that you have in your byte
   :class:`str`. When you do this you will need to make sure that you
   transform :class:`unicode` to :class:`str` before combining them. Also be
   sure that any other function calls (including :mod:`gettext`) are going to
   give you strings that are the same type. For instance::

       print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}
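
The :func:`isinstance` check mentioned in hint 3 could look something like
the sketch below. ``require_bytes`` is a hypothetical helper made up for this
example and is not part of kitchen. It never converts anything. It either
hands the bytes back untouched or fails loudly. The code should run unchanged
on python2.7 and python3.

.. code-block:: python

    def require_bytes(value):
        """Return ``value`` untouched if it is a byte string, else raise TypeError.

        Hypothetical helper for illustration; not part of kitchen.
        """
        if not isinstance(value, bytes):
            raise TypeError('expected a byte string, got %s' % type(value).__name__)
        return value


    # Bytes pass through byte-for-byte, even if they are not valid UTF-8.
    b_key = require_bytes(b'\xff\xfekey')
    assert b_key == b'\xff\xfekey'

    # Text is rejected instead of being silently converted.
    try:
        require_bytes(u'key')
    except TypeError:
        rejected = True
    else:
        rejected = False
    assert rejected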

Gotchas and how to avoid them
=============================

Even when you have a good conceptual understanding of how python2 treats
:class:`unicode` and :class:`str` there are still some things that can
surprise you. In most cases this is because, as noted earlier, python or one
of the python libraries you depend on is trying to convert a value
automatically and failing. Explicit conversion at the appropriate place
usually solves that.

str(obj)
--------

One common idiom for getting a simple, string representation of an object is to use::

    str(obj)

Unfortunately, this is not safe. Sometimes ``str(obj)`` will return
:class:`unicode`. Sometimes it will return a byte :class:`str`. Sometimes,
it will attempt to convert from a :class:`unicode` string to a byte
:class:`str`, fail, and throw a :exc:`UnicodeError`. To be safe from all of
these, first decide whether you need :class:`unicode` or :class:`str` to be
returned. Then use :func:`to_unicode` or :func:`to_bytes` to get the simple
representation like this::

    u_representation = to_unicode(obj, nonstring='simplerepr')
    b_representation = to_bytes(obj, nonstring='simplerepr')

print
-----

Python has a builtin :func:`print` statement that outputs strings to the
terminal. This originated in a time when python only dealt with byte
:class:`str`. When :class:`unicode` strings came about, some enhancements
were made to the :func:`print` statement so that it could print those as well.
The enhancements make :func:`print` work most of the time. However, the times
when it doesn't work tend to make for cryptic debugging.

The basic issue is that :func:`print` has to figure out what encoding to use
when it prints a :class:`unicode` string to the terminal. When python is
attached to your terminal (i.e., you're running the interpreter or running
a script that prints to the screen) python is able to take the encoding value
from your locale settings :envvar:`LC_ALL` or :envvar:`LC_CTYPE` and print the
characters allowed by that encoding. On most modern Unix systems, the
encoding is :term:`utf-8` which means that you can print any :class:`unicode`
character without problem.
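
If you want to see which encoding python has picked up, something like the
following sketch can help. It only reads ``sys.stdout.encoding`` and
:func:`locale.getpreferredencoding` from the standard library. What it prints
depends on your system, your locale, and your python version, so no output is
shown here. Under python2, ``sys.stdout.encoding`` may be ``None`` when
output is not going to a terminal.

.. code-block:: python

    import locale
    import sys

    # Encoding python associated with stdout; may be None when output is
    # redirected under python2.
    stdout_encoding = getattr(sys.stdout, 'encoding', None)

    # Encoding implied by the user's locale settings.
    locale_encoding = locale.getpreferredencoding()

    print('stdout encoding: %r' % (stdout_encoding,))
    print('locale encoding: %r' % (locale_encoding,))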

There are two common cases of things going wrong:

1. Someone has a locale set that does not accept all valid unicode characters.
   For instance::

       $ LC_ALL=C python
       >>> print u'\ufffd'
       Traceback (most recent call last):
         File "<stdin>", line 1, in <module>
       UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)

   This often happens when a script that you've written and debugged from the
   terminal is run from an automated environment like :program:`cron`. It
   also occurs when you have written a script using a :term:`utf-8` aware
   locale and released it for consumption by people all over the internet.
   Inevitably, someone is running with a locale that can't handle all unicode
   characters and you get a traceback reported.
2. You redirect output to a file. Python isn't using the values in
   :envvar:`LC_ALL` unconditionally to decide what encoding to use. Instead
   it is using the encoding set for the terminal you are printing to which is
   set to accept different encodings by :envvar:`LC_ALL`. If you redirect
   to a file, you are no longer printing to the terminal so :envvar:`LC_ALL`
   won't have any effect. At this point, python will decide it can't find an
   encoding and fall back to :term:`ASCII` which will likely lead to
   :exc:`UnicodeError` being raised. You can see this in a short script::

       #! /usr/bin/python -tt
       print u'\ufffd'

   And then look at the difference between running it normally and redirecting to a file:

   .. code-block:: console

       $ ./test.py
       �
       $ ./test.py > t
       Traceback (most recent call last):
         File "test.py", line 3, in <module>
           print u'\ufffd'
       UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)

The short answer to dealing with this is to always use bytes when writing
output. You can do this by explicitly converting to bytes like this::

    from kitchen.text.converters import to_bytes
    u_string = u'\ufffd'
    print to_bytes(u_string)

or you can wrap stdout and stderr with a :class:`~codecs.StreamWriter`.
A :class:`~codecs.StreamWriter` is convenient in that you can assign it to
encode for :data:`sys.stdout` or :data:`sys.stderr` and then have output
automatically converted but it has the drawback of still being able to throw
:exc:`UnicodeError` if the writer can't encode all possible unicode
codepoints. Kitchen provides an alternate version which can be retrieved with
:func:`kitchen.text.converters.getwriter` which will not raise a traceback in
its standard configuration.
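
The drawback of a plain :class:`~codecs.StreamWriter` can be seen in the
sketch below. It uses only :func:`codecs.getwriter` from the standard library,
not :func:`kitchen.text.converters.getwriter`, and an in-memory
``io.BytesIO`` stands in for :data:`sys.stdout` so that the result can be
inspected. It should run unchanged on python2.7 and python3.

.. code-block:: python

    import codecs
    import io

    u_text = u'caf\xe9 \ufffd\n'

    # A strict ASCII writer raises as soon as it meets a character it can't
    # encode.
    strict_writer = codecs.getwriter('ascii')(io.BytesIO())
    try:
        strict_writer.write(u_text)
    except UnicodeEncodeError:
        strict_failed = True
    else:
        strict_failed = False
    assert strict_failed

    # With errors='replace' the same writer substitutes '?' and keeps going.
    b_stream = io.BytesIO()
    lenient_writer = codecs.getwriter('ascii')(b_stream, errors='replace')
    lenient_writer.write(u_text)
    assert b_stream.getvalue() == b'caf? ?\n'

Passing ``errors='replace'`` trades exact output for a writer that keeps
going, which is the same trade-off the paragraph above describes.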

.. _unicode-and-dict-keys:

Unicode, str, and dict keys
---------------------------

The :func:`hash` of the :term:`ASCII` characters is the same for
:class:`unicode` and byte :class:`str`. When you use them in :class:`dict`
keys, they evaluate to the same dictionary slot::

    >>> u_string = u'a'
    >>> b_string = 'a'
    >>> hash(u_string), hash(b_string)
    (12416037344, 12416037344)
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string] = 'bytes'
    >>> d
    {u'a': 'bytes'}

When you deal with key values outside of :term:`ASCII`, :class:`unicode` and
byte :class:`str` evaluate unequally no matter what their character content or
hash value::

    >>> u_string = u'ñ'
    >>> b_string = u_string.encode('utf-8')
    >>> print u_string
    ñ
    >>> print b_string
    ñ
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string] = 'bytes'
    >>> d
    {u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
    >>> b_string2 = '\xf1'
    >>> hash(u_string), hash(b_string2)
    (30848092528, 30848092528)
    >>> d = {}
    >>> d[u_string] = 'unicode'
    >>> d[b_string2] = 'bytes'
    >>> d
    {u'\xf1': 'unicode', '\xf1': 'bytes'}

How do you work with this one? Remember rule #1: Keep your :class:`unicode`
and byte :class:`str` values separate. That goes for keys in a dictionary
just like anything else.

* For any given dictionary, make sure that all your keys are either
  :class:`unicode` or :class:`str`. **Do not mix the two.** If you're being
  given both :class:`unicode` and :class:`str` but you don't need to preserve
  separate keys for each, I recommend using :func:`to_unicode` or
  :func:`to_bytes` to convert all keys to one type or the other like this::

      >>> from kitchen.text.converters import to_unicode
      >>> u_string = u'one'
      >>> b_string = 'two'
      >>> d = {}
      >>> d[to_unicode(u_string)] = 1
      >>> d[to_unicode(b_string)] = 2
      >>> d
      {u'two': 2, u'one': 1}

* These issues also apply to using dicts with tuple keys that contain
  a mixture of :class:`unicode` and :class:`str`. Once again the best fix
  is to standardise on either :class:`str` or :class:`unicode`.

* If you absolutely need to store values in a dictionary where the keys could
  be either :class:`unicode` or :class:`str` you can use
  :class:`~kitchen.collections.strictdict.StrictDict` which has separate
  entries for all :class:`unicode` and byte :class:`str` and deals correctly
  with any :class:`tuple` containing mixed :class:`unicode` and byte
  :class:`str`.
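
When you are handed a whole dictionary with mixed keys, the first bullet
above can be applied in one pass. The following is a minimal sketch, assuming
that every byte key is :term:`UTF-8`. ``normalize_keys`` is a hypothetical
helper made up for this example and is not part of kitchen. It should run
unchanged on python2.7 and python3.

.. code-block:: python

    def normalize_keys(mapping, encoding='utf-8'):
        """Return a copy of ``mapping`` whose byte string keys are decoded to text.

        Hypothetical helper for illustration; not part of kitchen.
        """
        normalized = {}
        for key, value in mapping.items():
            if isinstance(key, bytes):
                # Undecodable bytes become U+FFFD rather than raising.
                key = key.decode(encoding, 'replace')
            normalized[key] = value
        return normalized


    mixed = {b'one': 1, u'two': 2}
    assert normalize_keys(mixed) == {u'one': 1, u'two': 2}

Note that if a byte key and a text key decode to the same text, only one of
the two values survives in the result, so this only fits the case where you
don't need to preserve separate keys for each type.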

---------
Functions
---------

Unicode and byte str conversion
===============================

.. autofunction:: kitchen.text.converters.to_unicode
.. autofunction:: kitchen.text.converters.to_bytes
.. autofunction:: kitchen.text.converters.getwriter
.. autofunction:: kitchen.text.converters.to_str
.. autofunction:: kitchen.text.converters.to_utf8

Transformation to XML
=====================

.. autofunction:: kitchen.text.converters.unicode_to_xml
.. autofunction:: kitchen.text.converters.xml_to_unicode
.. autofunction:: kitchen.text.converters.byte_string_to_xml
.. autofunction:: kitchen.text.converters.xml_to_byte_string
.. autofunction:: kitchen.text.converters.bytes_to_xml
.. autofunction:: kitchen.text.converters.xml_to_bytes
.. autofunction:: kitchen.text.converters.guess_encoding_to_xml
.. autofunction:: kitchen.text.converters.to_xml

Working with exception messages
===============================

.. autodata:: kitchen.text.converters.EXCEPTION_CONVERTERS
.. autodata:: kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS
.. autofunction:: kitchen.text.converters.exception_to_unicode
.. autofunction:: kitchen.text.converters.exception_to_bytes