============
Unicode data
============

Django natively supports Unicode data everywhere. Provided your database can
somehow store the data, you can safely pass around Unicode strings to
templates, models and the database.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.

Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.

 * MySQL users, refer to the `MySQL manual`_ (section 9.1.3.2 for MySQL 5.1)
   for details on how to set or alter the database character set encoding.

 * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in
   PostgreSQL 8) for details on creating databases with the correct encoding.

 * SQLite users, there is nothing you need to do. SQLite always uses UTF-8
   for internal encoding.

.. _MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104
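To see why a restrictive encoding such as latin1 loses information, the
following plain-Python sketch may help. It does not touch a database; it only
encodes a sample string (chosen here for illustration) the way a latin1 column
would have to.

```python
# -*- coding: utf-8 -*-
# Illustration only: no database is involved, just Python's own codecs.
text = u'\u20ac100 caf\xe9'   # EURO SIGN is not in latin1; e-acute is

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# A strict latin1 conversion refuses the euro sign outright.
try:
    text.encode('latin-1')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Forcing the conversion replaces the unrepresentable character with '?'.
lossy = text.encode('latin-1', 'replace').decode('latin-1')
```

After this runs, ``round_trip`` equals the original text, while ``lossy`` has
a ``?`` where the euro sign used to be.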

All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.

For more, see the section "The database API" below.

General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use Unicode strings, or you can use normal strings (sometimes called
"bytestrings") that are encoded using UTF-8.

.. admonition:: Warning

    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use your normal strings,
passing them around at will, because ASCII is a subset of UTF-8.
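The points above can be checked with Python's codecs alone. This is a minimal
sketch, not Django code; ``decodes_as_utf8`` is a hypothetical helper written
for this illustration.

```python
# -*- coding: utf-8 -*-
ascii_bytes = b'Paris'
latin1_bytes = u'Orl\xe9ans'.encode('latin-1')   # one byte for e-acute
utf8_bytes = u'Orl\xe9ans'.encode('utf-8')       # two bytes for e-acute


def decodes_as_utf8(data):
    """Return True if ``data`` is a valid UTF-8 byte sequence."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII bytes mean the same thing under ASCII and under UTF-8.
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')
```

Here ``decodes_as_utf8(latin1_bytes)`` is false: nothing in the bytes says
they are latin1, so treating them as UTF-8 fails.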

Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
the result of template rendering (and e-mail). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
:setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a Unicode string back in the result.

Translated strings
------------------

Aside from Unicode strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the lazy translation as the argument will generate a
Unicode string in the current locale.

For more details about lazy translation objects, refer to the
:doc:`internationalization </topics/i18n/index>` documentation.
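The idea of deferring a translation until the string is used can be sketched
in a few lines. This is not how Django implements lazy translations;
``LazyText``, ``CATALOGS`` and ``active_locale`` are hypothetical names
invented for the illustration, and the sketch is written for Python 3, where
``str()`` plays the role that ``unicode()`` plays in the text above.

```python
# Hypothetical stand-in for a lazy translation; not Django's implementation.
CATALOGS = {'en': {'hello': u'Hello'}, 'fr': {'hello': u'Bonjour'}}
active_locale = {'code': 'en'}


class LazyText(object):
    """Defer the catalog lookup until the object is converted to text."""

    def __init__(self, key):
        self.key = key

    def __str__(self):
        # The lookup happens here, using whatever locale is active *now*.
        return CATALOGS[active_locale['code']][self.key]


greeting = LazyText('hello')      # created once, e.g. at import time
active_locale['code'] = 'fr'      # locale chosen later
in_french = str(greeting)
active_locale['code'] = 'en'
in_english = str(greeting)
```

The same ``greeting`` object produces different text depending on the locale
in effect when it is converted, which is the property the section describes.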

Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with Unicode and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between Unicode and bytestrings.

    * ``smart_unicode(s, encoding='utf-8', strings_only=False, errors='strict')``
      converts its input to a Unicode string. The ``encoding`` parameter
      specifies the input encoding. (For example, Django uses this internally
      when processing form input data, which might not be UTF-8 encoded.) The
      ``strings_only`` parameter, if set to True, will result in Python
      numbers, booleans and ``None`` not being converted to a string (they keep
      their original types). The ``errors`` parameter takes any of the values
      that are accepted by Python's ``unicode()`` function for its error
      handling.

      If you pass ``smart_unicode()`` an object that has a ``__unicode__``
      method, it will use that method to do the conversion.

    * ``force_unicode(s, encoding='utf-8', strings_only=False,
      errors='strict')`` is identical to ``smart_unicode()`` in almost all
      cases. The difference is when the first argument is a :ref:`lazy
      translation <lazy-translations>` instance. While ``smart_unicode()``
      preserves lazy translations, ``force_unicode()`` forces those objects to a
      Unicode string (causing the translation to occur). Normally, you'll want
      to use ``smart_unicode()``. However, ``force_unicode()`` is useful in
      template tags and filters that absolutely *must* have a string to work
      with, not just something that can be converted to a string.

    * ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
      is essentially the opposite of ``smart_unicode()``. It forces the first
      argument to a bytestring. The ``strings_only`` parameter has the same
      behavior as for ``smart_unicode()`` and ``force_unicode()``. This is
      slightly different semantics from Python's builtin ``str()`` function,
      but the difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``smart_unicode()``. Call it as early as
possible on any input data that might be either Unicode or a bytestring, and
from then on, you can treat the result as always being Unicode.
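The "convert early, then work in Unicode" advice can be sketched without
Django. The helper below is a simplified, hypothetical stand-in written for
Python 3 (where ``bytes`` is the bytestring type and ``str`` is the Unicode
type); it is not the implementation of ``smart_unicode()`` and ignores lazy
translations and ``strings_only``. ``normalise_name`` is likewise an invented
example caller.

```python
def to_text(value, encoding='utf-8', errors='strict'):
    """Hypothetical helper: return ``value`` as a Unicode (text) string."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    if isinstance(value, str):
        return value
    # Anything else (numbers, objects) is converted with str().
    return str(value)


def normalise_name(raw_name):
    # Convert once at the boundary; the rest of the function sees text only.
    name = to_text(raw_name)
    return name.strip().title()
```

With this pattern, ``normalise_name`` gives the same result whether it is
handed UTF-8 bytes or a Unicode string.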

.. _uri-and-iri-handling:

URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of URI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an IRI_ -- very loosely speaking, a URI that can contain Unicode
characters. Quoting and converting an IRI to a URI can be a little tricky, so
Django provides some assistance.

    * The function ``django.utils.encoding.iri_to_uri()`` implements the
      conversion from IRI to URI as required by the specification (`RFC
      3987`_).

    * The functions ``django.utils.http.urlquote()`` and
      ``django.utils.http.urlquote_plus()`` are versions of Python's standard
      ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
      characters. (The data is converted to UTF-8 prior to encoding.)

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::
    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> urlquote(u'Paris & Orléans')
    u'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

The ``iri_to_uri()`` function is also idempotent, which means the following is
always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)

So you can safely call it multiple times on the same IRI without risking
double-quoting problems.
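The quoting behaviour and the idempotence property can be approximated with
the Python 3 standard library. The sketch below is an assumption-laden
stand-in, not Django's ``iri_to_uri()``: ``sketch_iri_to_uri`` and
``SAFE_CHARS`` are names invented here, and the set of characters left
untouched is an illustrative choice.

```python
from urllib.parse import quote

# Assumption: ASCII characters treated as already legal in a URI. '%' is
# included so that existing percent-escapes are not escaped a second time.
SAFE_CHARS = "/#%[]=:;$&()+,!?*@'~"


def sketch_iri_to_uri(iri):
    """Percent-encode everything except unreserved characters and SAFE_CHARS.

    Non-ASCII characters are encoded as UTF-8 before being escaped.
    """
    return quote(iri, safe=SAFE_CHARS)


# Quote one path segment first, so that '&' and spaces are escaped ...
segment = quote(u'Paris & Orl\xe9ans', safe='/')
# ... then convert the whole IRI; the escapes in ``segment`` are left alone.
uri = sketch_iri_to_uri(u'/favorites/Fran\xe7ois/%s' % segment)
```

Because ``%`` is treated as safe and the output is pure ASCII, applying
``sketch_iri_to_uri`` to ``uri`` again returns it unchanged.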

.. _URI: http://www.ietf.org/rfc/rfc2396.txt
.. _IRI: http://www.ietf.org/rfc/rfc3987.txt
.. _RFC 3987: IRI_

Models
======

Because all strings are returned from the database as Unicode strings, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert them to Unicode when it needs to.

Choosing between ``__str__()`` and ``__unicode__()``
----------------------------------------------------

One consequence of using Unicode by default is that you have to take some care
when printing data from the model.

In particular, rather than giving your model a ``__str__()`` method, we
recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
method, you can quite safely return the values of all your fields without
having to worry about whether they fit into a bytestring or not. (The way
Python works, the result of ``__str__()`` is *always* a bytestring, even if you
accidentally try to return a Unicode object.)

You can still create a ``__str__()`` method on your models if you want, of
course, but you shouldn't need to do this unless you have a good reason.
Django's ``Model`` base class automatically provides a ``__str__()``
implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
This means you'll normally only need to implement a ``__unicode__()`` method
and let Django handle the coercion to a bytestring when required.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The ``django.db.models.permalink()`` decorator
handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``permalink()``
decorator), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
above_. For example::

    from django.utils.encoding import iri_to_uri
    from django.utils.http import urlquote

    def get_absolute_url(self):
        url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would already have been percent-encoded by ``urlquote()``
in the first line.)

.. _above: `URI and IRI handling`_

The database API
================

You can pass either Unicode strings or UTF-8 bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    qs = People.objects.filter(name__contains=u'Å')
    qs = People.objects.filter(name__contains='\xc3\x85') # UTF-8 encoding of Å
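The two arguments above describe the same character. A quick check with
Python's codecs, independent of Django, shows the correspondence:

```python
# -*- coding: utf-8 -*-
unicode_value = u'\xc5'        # LATIN CAPITAL LETTER A WITH RING ABOVE
utf8_value = b'\xc3\x85'       # the same character as UTF-8 bytes

encoded = unicode_value.encode('utf-8')
decoded = utf8_value.decode('utf-8')
```

``encoded`` equals ``utf8_value`` and ``decoded`` equals ``unicode_value``,
which is why either form can be used for the lookup.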

Templates
=========

You can use either Unicode or bytestrings when creating templates manually::

    from django.template import Template
    t1 = Template('This is a bytestring template.')
    t2 = Template(u'This is a Unicode template.')

But the common case is to read templates from the filesystem, and this creates
a slight complication: not all filesystems store their data encoded as UTF-8.
If your template files are not stored with a UTF-8 encoding, set the
:setting:`FILE_CHARSET` setting to the encoding of the files on disk. When
Django reads in a template file, it will convert the data from this encoding to
Unicode. (:setting:`FILE_CHARSET` is set to ``'utf-8'`` by default.)
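What "converting from the on-disk encoding to Unicode" means can be shown with
the standard library alone. In this sketch the ``FILE_CHARSET`` variable is a
local name that mirrors the setting for illustration; the file contents and
the choice of ISO-8859-1 are assumptions, and no Django template loader is
involved.

```python
import io
import os
import tempfile

FILE_CHARSET = 'iso-8859-1'   # assumed on-disk encoding for this sketch
source = u'<p>Bienvenue \xe0 Orl\xe9ans, {{ name }}</p>'

# Write a small "template" file in the assumed encoding.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(source.encode(FILE_CHARSET))

# Reading with the matching encoding recovers the original Unicode text.
with io.open(path, encoding=FILE_CHARSET) as f:
    template_source = f.read()

os.remove(path)
```

Reading the same file as UTF-8 would fail or garble the accented characters,
which is the situation the setting exists to avoid.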

The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.

Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

    * Always return Unicode strings from a template tag's ``render()`` method
      and from template filters.

    * Use ``force_unicode()`` in preference to ``smart_unicode()`` in these
      places. Tag rendering and filter calls occur as the template is being
      rendered, so there is no advantage to postponing the conversion of lazy
      translation objects into strings. It's easier to work solely with Unicode
      strings at that point.

E-mail
======

Django's e-mail framework (in ``django.core.mail``) supports Unicode
transparently. You can use Unicode data in the message bodies and any headers.
However, you're still obligated to respect the requirements of the e-mail
specifications, so, for example, e-mail addresses should use only ASCII
characters.

The following code example demonstrates that everything except e-mail addresses
can be non-ASCII::

    from django.core.mail import EmailMessage

    subject = u'My visit to Sør-Trøndelag'
    sender = u'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = u'...'
    EmailMessage(subject, body, sender, recipients).send()
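For a sense of what respecting the e-mail specifications involves, the sketch
below uses Python's standard ``email.header`` module to turn a non-ASCII
subject into an ASCII-only header value and back. It illustrates the general
mechanism (RFC 2047 encoded words); it is not a description of what
``django.core.mail`` does internally.

```python
from email.header import Header, decode_header

subject = u'My visit to S\xf8r-Tr\xf8ndelag'

# Encode the subject as an RFC 2047 "encoded word": ASCII on the wire,
# UTF-8 underneath.
wire_subject = Header(subject, 'utf-8').encode()

# A receiving client reverses the process.
raw, charset = decode_header(wire_subject)[0]
round_tripped = raw.decode(charset)
```

``wire_subject`` contains only ASCII characters, and ``round_tripped`` matches
the original Unicode subject.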

Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on an ``HttpRequest`` instance. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.
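Why the assumed encoding matters can be seen by decoding the same submitted
bytes two ways. This sketch uses Python 3's ``urllib.parse`` rather than
Django's request machinery; the field name ``greeting`` and the sample text
are made up for the illustration.

```python
from urllib.parse import parse_qs, quote

text = u'\u041f\u0440\u0438\u0432\u0435\u0442'   # a Russian greeting

# What a KOI8-R client might submit: percent-escaped KOI8-R bytes.
query_string = 'greeting=' + quote(text.encode('koi8-r'))

# Decoding with the wrong assumption (UTF-8, the default) garbles the value;
# invalid sequences are replaced rather than raising an error.
wrong = parse_qs(query_string)['greeting'][0]

# Decoding with the encoding the client actually used recovers the text.
right = parse_qs(query_string, encoding='koi8-r')['greeting'][0]
```

Only ``right`` matches the original text, which is the effect that setting
``request.encoding`` is meant to achieve for such clients.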

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.

Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.