Rather than mess with the encode and decode methods, I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.
Use the open function from the io module.
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then calling f.read() returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.
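For Python 3, the same approach can be sketched with the built-in open. The temporary file and its contents below are made up for this illustration so that the snippet is self-contained:

```python
import os
import tempfile

# Hypothetical sample file, created only so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "test")
with open(path, "wb") as raw:
    # "\u00e1" is a-acute; UTF-8 stores it as the two bytes 0xC3 0xA1.
    raw.write("Capit\u00e1l\n\n".encode("utf-8"))

# Python 3: the built-in open() accepts encoding= and read() returns str.
with open(path, mode="r", encoding="utf-8") as f:
    text = f.read()

print(repr(text))
```

The file is written in binary mode on purpose, so the bytes on disk are exactly the UTF-8 encoding and the decoding happens only in the second open call.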
Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.
Use the open function from the codecs module.
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
Then calling f.read() returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
If you know the encoding of a file, using the codecs package is going to be much less confusing.
See http://docs.python.org/library/codecs.html#codecs.open
To expand on the answers others have given:
We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
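As a small illustration of "a unique number per character", Python's built-in ord and chr convert between a character and its code point. The sample characters are arbitrary choices for this sketch:

```python
# ord() maps a character to its Unicode code point; chr() maps it back.
for ch in ["A", "\u00e1", "\u20ac"]:  # A, a-acute, euro sign
    print(f"U+{ord(ch):04X}", ch)

code_point = ord("\u00e1")     # the integer assigned to a-acute
round_trip = chr(code_point)   # back to the one-character string
```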
Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.
Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, notably ISO-8859-1 (also known as ISO-Latin-1 or latin1) and ISO-8859-15 (and yes, there are different versions of the ISO 8859 standard as well).
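To make the "same number, different character" problem concrete, here is a short sketch that decodes one byte above 127 with two of the ISO-8859 mappings. The byte value 0xA4 is picked for illustration because the two standards disagree on it:

```python
byte = b"\xa4"  # a value above 127, i.e. in the "extended" half of the byte

as_latin1 = byte.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
as_latin9 = byte.decode("iso-8859-15")  # U+20AC EURO SIGN

# Plain English text stays within the 7-bit ASCII range.
ascii_only = "English text".encode("ascii")

print(repr(as_latin1), repr(as_latin9), max(ascii_only))
```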
But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.
There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, 4 in UTF-32), but characters from other languages can occupy more: up to four bytes in UTF-8 as currently standardized (the original design allowed up to six).
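The unit counts described above can be checked with a rough sketch like the following. The helper name unit_counts and the sample characters are assumptions for this illustration; the "-le" codec variants are used so that no byte order mark is added to the output:

```python
samples = {
    "A": "A",                # U+0041
    "a-acute": "\u00e1",     # U+00E1
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600, beyond the 16-bit range
}

def unit_counts(ch):
    """Return (UTF-8 units, UTF-16 units, UTF-32 units) for one character."""
    utf8 = len(ch.encode("utf-8"))             # 1 byte per unit
    utf16 = len(ch.encode("utf-16-le")) // 2   # 2 bytes per unit
    utf32 = len(ch.encode("utf-32-le")) // 4   # 4 bytes per unit
    return utf8, utf16, utf32

for name, ch in samples.items():
    print(name, unit_counts(ch))

# In UTF-8 the high bits are the flags: the lead byte of a 3-byte
# sequence starts with 1110, and each continuation byte starts with 10.
euro_bits = [format(b, "08b") for b in "\u20ac".encode("utf-8")]
print(euro_bits)
```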
Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).
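A small sketch of why positions differ between encoded and decoded text, and of one such shortcut. The Norwegian sample word is an arbitrary choice. In UTF-8 a byte-wise substring search still finds the match, but the offset it returns counts bytes, not characters:

```python
text = "bl\u00e5b\u00e6r"        # "blåbær": 6 code points
encoded = text.encode("utf-8")   # 8 bytes, since å and æ take 2 bytes each

needle = "b\u00e6r"              # "bær"

char_index = text.find(needle)                     # position in code points
byte_index = encoded.find(needle.encode("utf-8"))  # position in bytes

print(len(text), len(encoded), char_index, byte_index)
```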
Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.
Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding: every code point fits in a single 32-bit unit, so you don't have to deal with multi-unit characters in UTF-32.
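A quick check of that limit, as a sketch assuming Python 3.3 or later (where sys.maxunicode is the same on every platform). The constant name MAX_CODE_POINT is made up for this example:

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines

print(sys.maxunicode == MAX_CODE_POINT)
fits_in_32_bits = MAX_CODE_POINT < 2**32

# Even the highest code point is a single 32-bit unit in UTF-32.
last = chr(MAX_CODE_POINT)
units = len(last.encode("utf-32-le")) // 4

# Values past the limit are rejected outright.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_ok = True
except ValueError:
    beyond_ok = False
```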
Hope that fills in some details.
Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.
On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe: the \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.
But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark).
There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
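The behaviour above is Java's. As a rough analogue only, and not the Clojure/Java code this answer is about, the same distinction can be sketched in Python 3. As far as I know the standard re module has no \p{L}, so the "category Letter" test is approximated here with unicodedata; the helper name is_letter and the sample word are assumptions for this illustration:

```python
import re
import unicodedata

text = "bl\u00e5b\u00e6rsyltet\u00f8y"  # Norwegian "blåbærsyltetøy"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so each extended
# character splits the word.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# For str patterns, Python 3's \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)

def is_letter(ch):
    """True if ch is in a Unicode Letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

print(ascii_words)
print(unicode_words)
print(is_letter("\u65e5"), is_letter("1"))  # a CJK ideograph vs. a digit
```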
Edit: More on Unicode in Java
A quick reference to other points of potential interest when working with Unicode.
Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.
This is all Java, most of this stuff does not have a Clojure wrapper (at least not yet).
Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so characters outside the Basic Multilingual Plane (some rarer CJK ideographs and emoji, for example) need two chars, a surrogate pair, to represent one symbol.
When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
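The difference between 16-bit chars and code points can be illustrated outside Java too. The following is a Python sketch, not the Java API: it counts UTF-16 code units (which is what a Java String's length() counts) against code points. The helper names and the sample string are made up for this example:

```python
import struct

def utf16_units(s):
    """Number of 16-bit code units s occupies when encoded as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    """The string as a list of integer code points."""
    return [ord(ch) for ch in s]

s = "a\U0001F600"  # a Latin letter followed by an emoji (U+1F600)

print(utf16_units(s))  # 16-bit units
print(len(s))          # code points (Python 3 str is indexed by code point)

# The emoji alone becomes a surrogate pair: two 16-bit units.
high, low = struct.unpack("<2H", "\U0001F600".encode("utf-16-le"))
print(hex(high), hex(low))
```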