Rather than mess with the encode and decode methods, I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.
Use the open function from the io module.
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then, after calling f's read() method, a decoded Unicode object is returned.
>>> f.read()
u'Capit\xe1l\n\n'
Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function supports the encoding argument only in Python 3, not in Python 2.
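As a minimal, self-contained sketch of the approach above: the snippet writes a small UTF-8 file and reads it back with io.open. The temporary path and the sample text are assumptions made for illustration; they are not part of the original answer.

```python
import io
import os
import tempfile

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "test")

# Write text: io.open encodes the Unicode string to UTF-8 bytes on disk.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(u"Capit\xe1l\n\n")

# Read text: io.open decodes the UTF-8 bytes back into a Unicode string.
with io.open(path, mode="r", encoding="utf-8") as f:
    text = f.read()

# Read the same file in binary mode to see the undecoded bytes.
with io.open(path, mode="rb") as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```

The text-mode read gives back the Unicode string that was written, while the binary read shows the two-byte UTF-8 sequence that stores the accented character.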
Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.
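For illustration, here is a short sketch of mixing readline() and read() on a file opened with io.open. The file path and contents are made up for the example, and it does not try to reproduce the codecs problem mentioned above; it only shows the io behaviour.

```python
import io
import os
import tempfile

# Hypothetical sample file with three lines, one containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(u"uno\ndos\ntr\xe9s\n")

with io.open(path, mode="r", encoding="utf-8") as f:
    first_line = f.readline()  # consumes only the first line
    rest = f.read()            # continues from where readline() stopped

print(repr(first_line))
print(repr(rest))
```

Here readline() returns just the first line and read() returns everything after it, with nothing lost or repeated.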
Use the open function from the codecs module.
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
Then, after calling f's read() method, a decoded Unicode object is returned.
>>> f.read()
u'Capit\xe1l\n\n'
If you know the encoding of a file, using the codecs module is going to be much less confusing.
See http://docs.python.org/library/codecs.html#codecs.open
Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.
Example:
accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
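If installing Unidecode is not an option, a rough standard-library approximation for accented Latin letters can be sketched with unicodedata. Note that this is a different technique from Unidecode (it only strips combining marks after decomposition and does not transliterate other scripts), and strip_accents is a hypothetical helper name chosen for this example.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (not a full transliteration)."""
    # NFKD splits a character such as u'\xe1' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(u"M\xe1laga"))  # prints Malaga
```

Characters with no decomposition (for example the German sharp s, or non-Latin scripts) pass through unchanged, which is where a transliteration library like Unidecode does more.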
Best Solution
A nice, lightweight library that I have used successfully is utf8proc.