Python Learning Notes: str / unicode / bytes
References
Unicode encoding and UTF-32, UTF-16, and UTF-8
https://timothybramlett.com/Strings_Bytes_and_Unicode_in_Python_2_and_3.html
https://nedbatchelder.com/text/unipain.html
Strings and byte strings
A string is a sequence of characters; strings are for humans to read.
A byte string is a sequence of bytes; it can be stored directly on disk, it is what the computer actually operates on, and it is for machines to read.
string --- encode --> byte string
byte string --- decode --> string
Distinguishing Unicode from encodings
Unicode:
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
The important thing to remember is that all Unicode does is to assign a numerical ID, called a code point, to each character for easy and unambiguous reference: a mapping from code point to character.
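For instance, Python 3's built-ins ord() and chr() expose exactly this mapping between characters and code points (a small sketch added here for illustration, not part of the original notes):
python
# Python 3: a code point is just a number attached to a character.
print(hex(ord('A')))    # 0x41    -> U+0041
print(hex(ord('你')))   # 0x4f60  -> U+4F60
print(hex(ord('😂')))   # 0x1f602 -> U+1F602
print(chr(0x1F602))     # 😂 -- chr() maps a code point back to its character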
Encodings:
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
- ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
- ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
- UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
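The UTF-8 byte sequences listed above can be checked directly; here is a small Python 3 sketch added for illustration:
python
# Python 3: verify the UTF-8 byte sequences from the examples above.
for ch in ('a', 'á', '≠', '😂'):
    encoded = ch.encode('utf-8')
    print(ch, '->', ' '.join('0x%02X' % b for b in encoded))
# a -> 0x61
# á -> 0xC3 0xA1
# ≠ -> 0xE2 0x89 0xA0
# 😂 -> 0xF0 0x9F 0x98 0x82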
Unicode and Encodings:
Encoding the character á with Latin-1 can be described as:
"encode the character á (code point U+00E1) as 11100001"
Encoding the character á with UTF-8 can be described as:
"encode the character á (code point U+00E1) as 11000011 10100001"
A point that is easy to confuse:
For some characters, the code point and the encoded bit pattern happen to have the same numeric value.
For example:
'a' encodes to 1100001 under ASCII, i.e. 0x61 in hex, and the Unicode code point of 'a' happens to be U+0061.
'á' encodes to 11100001 under Latin-1, i.e. 0xE1 in hex, and the Unicode code point of 'á' happens to be U+00E1.
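A minimal Python 3 sketch (added here) that makes the coincidence, and its limits, explicit:
python
# Python 3: the code point of 'á' and its Latin-1 byte are the same number...
print(hex(ord('á')))          # 0xe1    -> code point U+00E1
print('á'.encode('latin-1'))  # b'\xe1' -> the single byte 0xE1
# ...but that is a property of this particular encoding, not a rule:
print('á'.encode('utf-8'))    # b'\xc3\xa1' -> two completely different bytes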
python2
In Python 2:
The str type essentially means "text in some encoding". In the vast majority of cases, a quoted string literal is a str, and what it actually stores is bytes; in Python 2, str and bytes are equivalent types.
So which encoding are those bytes in?
If the code is typed into the interactive interpreter, s takes the encoding of the terminal; for the Windows cmd console that is GBK.
If the code is saved to a file first and then executed, for example saved as UTF-8, then when the interpreter loads the program, s is initialized as UTF-8-encoded bytes.
python
>>> import chardet
>>> s = "你好"
>>> print chardet.detect(s)
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'}
The unicode type means "a string represented as Unicode code points"; which internal data format Python uses to store those code points in memory does not matter to us here.
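A small Python 2 sketch (added here; it assumes a source file saved as UTF-8 with a coding declaration) showing that a unicode object holds code points rather than bytes:
python
# -*- coding: utf-8 -*-
u = u"你好"
print type(u)                 # <type 'unicode'>
print repr(u)                 # u'\u4f60\u597d' -- code points, not bytes
print len(u)                  # 2 characters
print len(u.encode("utf-8"))  # 6 bytes once encoded as UTF-8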
encode and decode
All of the tests below are run by saving the source file as UTF-8 and then executing it with the interpreter, i.e. every str literal is initialized as UTF-8-encoded bytes.
python
# -*- coding: utf-8 -*-
# the coding declaration is needed in Python 2 because of the non-ASCII literals below
import chardet
s = "你好"
print type(s)
print chardet.detect(s)
# s is a UTF-8-encoded str; decode() turns it into a unicode object
u = s.decode("UTF-8")
print type(u)
# encode the decoded unicode object with encode(), producing a str again, this time GBK-encoded
s = u.encode("gbk")
print type(s)
print chardet.detect(s)
'''
<type 'str'>
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
<type 'unicode'>
<type 'str'>
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'}
'''
Now look at the following case: as a str, s has not only a decode() method but also an encode() method.
python
# -*- coding: utf-8 -*-
s = "你好"
# str.decode: decode a str in some encoding into a unicode object
print hasattr(s, "decode")
# str.encode: re-encode a str into another encoding; under the hood this is two steps, a decode followed by an encode
print hasattr(s, "encode")
'''
True
True
'''
str.encode() saves us the explicit decode() step, but there is a pitfall here!
python
# -*- coding: utf-8 -*-
s = "你好"
s = s.encode("gbk")
'''
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
'''
Trying to convert the UTF-8 str directly to GBK raised an error: the 'ascii' codec cannot decode the byte at position 0 because its value is not in range(128).
As the traceback shows, Python tried to decode s with ASCII first; ASCII is used because it is the system's default encoding:
python
import sys
print sys.getdefaultencoding()
'''
ascii
'''
So if we set sys's default encoding to utf-8, the code runs normally:
python
# -*- coding: utf-8 -*-
import sys
import chardet
reload(sys)
sys.setdefaultencoding('utf-8')
s = "你好"
print chardet.detect(s)
s = s.encode("gbk")
print chardet.detect(s)
'''
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'}
'''
Some usage details in Python 2:
"xxx": str
b"xxx": bytes, which is the same type as str
u"xxx": unicode; convert it to str/bytes with u"xxx".encode('utf-8')
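A minimal Python 2 sketch (added for illustration) of the three literal forms:
python
s = "xxx"                      # <type 'str'>  -- raw bytes
b = b"xxx"                     # <type 'str'>  -- in Python 2, b"..." is just an alias for "..."
u = u"xxx"                     # <type 'unicode'>
print type(s), type(b), type(u)
print type(u.encode('utf-8'))  # encode() turns unicode back into str/bytes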
python3
In Python 3.x:
The definition of str changes to "a Unicode string": by default, a quoted string literal is a sequence of Unicode code points, not bytes.
In other words, str in Python 3 corresponds to unicode in Python 2.
Note that bytes and str are no longer equivalent types.
In Python 3:
"xxx": str (Unicode); it has no decode() method
str (Unicode) and bytes are converted into each other with encode() and decode(); see the diagram at
https://nedbatchelder.com/text/unipain.html
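A minimal Python 3 sketch (added here) of the str <-> bytes round trip:
python
s = "你好"                   # str: a sequence of Unicode code points
b = s.encode("utf-8")        # bytes: b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(type(s), type(b))      # <class 'str'> <class 'bytes'>
print(b.decode("utf-8"))     # back to str: 你好
print(hasattr(s, "decode"))  # False -- str has no decode() method in Python 3
print(hasattr(b, "encode"))  # False -- and bytes has no encode() method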
A typical problem
UnicodeEncodeError: 'ascii' codec can't encode character xxx
Unicode HOWTO
Pragmatic Unicode
Reproducing and fixing the problem
python
# unicode
a = u'bats\u00E0'
print a
=> batsà
# str() encodes with the default encoding; the default here is ascii, so it raises an error
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
# Fix: call .encode(<encoding>) to specify the encoding explicitly
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters.
To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.