Friday, August 6, 2004, 15:32
Hello everyone,

RFC 2060 specifies how international (including Chinese) mailbox names are encoded. The relevant section is quoted below:

5.1.3. Mailbox International Naming Convention

By convention, international mailbox names are specified using a modified version of the UTF-7 encoding described in [UTF-7]. The purpose of these modifications is to correct the following problems with UTF-7:

1) UTF-7 uses the "+" character for shifting; this conflicts with the common use of "+" in mailbox names, in particular USENET newsgroup names.

2) UTF-7's encoding is BASE64 which uses the "/" character; this conflicts with the use of "/" as a popular hierarchy delimiter.

3) UTF-7 prohibits the unencoded usage of "\"; this conflicts with the use of "\" as a popular hierarchy delimiter.

4) UTF-7 prohibits the unencoded usage of "~"; this conflicts with the use of "~" in some servers as a home directory indicator.

5) UTF-7 permits multiple alternate forms to represent the same string; in particular, printable US-ASCII characters can be represented in encoded form.

In modified UTF-7, printable US-ASCII characters except for "&" represent themselves; that is, characters with octet values 0x20-0x25 and 0x27-0x7e. The character "&" (0x26) is represented by the two-octet sequence "&-".

All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all Unicode 16-bit octets) are represented in modified BASE64, with a further modification from [UTF-7] that "," is used instead of "/". Modified BASE64 MUST NOT be used to represent any printing US-ASCII character which can represent itself.

"&" is used to shift to modified BASE64 and "-" to shift back to US-ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that is, a name that ends with a Unicode 16-bit octet MUST end with a "-").

For example, here is a mailbox name which mixes English, Japanese, and Chinese text: ~peter/mail/&ZeVnLIqe-/&U,BTFw-

I have studied this for a long time and still cannot work out how to encode and decode Chinese mailbox names in Python. For example, according to the rules above, "草稿箱" (Drafts) encodes to "&g0l6P3ux-" and "发件箱" (Outbox) encodes to "&U9FO9nux-".

How can this encoding and decoding be implemented? Could someone give an example?

One last question: Python is nice, but handling Chinese text is a real headache!

According to some references, UTF-8 encoding and decoding can be done like this:

s = u"社会主义中国"
u8 = s.encode("utf-8")    # convert to UTF-8
# The result displays as "脡莽禄谩脰梅脪氓脰脨鹿煤", whereas another application
# converting from GB2312 produces "绀句細涓讳箟涓浗"
u8.decode("utf-8")        # convert back to unicode

If I read the "绀句細涓讳箟涓浗" (UTF-8) produced by another system, decoding it with the method above raises an error :)

Sincerely,
Frank Ning
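The rules quoted from RFC 2060 section 5.1.3 can be turned into code fairly directly. The following is only a minimal sketch, written for a current Python 3 (an assumption; the thread itself uses Python 2). The function names `encode_mailbox` and `decode_mailbox` are hypothetical and not part of any library. The sketch relies on the standard `base64` module and the `utf-16-be` codec: runs of non-printable or non-ASCII characters are converted to big-endian UTF-16, base64-encoded without padding, and "/" is swapped for ",".

```python
import base64


def _encode_run(run: str) -> str:
    """Encode one run of non-ASCII text as '&<modified base64>-'."""
    raw = run.encode("utf-16-be")
    b64 = base64.b64encode(raw).decode("ascii")
    # Modified BASE64: no '=' padding, ',' instead of '/'.
    return "&" + b64.rstrip("=").replace("/", ",") + "-"


def _decode_run(payload: str) -> str:
    """Decode the text between '&' and '-' back to Unicode."""
    b64 = payload.replace(",", "/")
    b64 += "=" * (-len(b64) % 4)  # restore the padding base64 expects
    return base64.b64decode(b64).decode("utf-16-be")


def encode_mailbox(name: str) -> str:
    """Hypothetical helper: Unicode mailbox name -> modified UTF-7."""
    out = []
    pending = []  # characters waiting to be base64-encoded together
    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:  # printable US-ASCII
            if pending:
                out.append(_encode_run("".join(pending)))
                pending = []
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    if pending:
        out.append(_encode_run("".join(pending)))
    return "".join(out)


def decode_mailbox(encoded: str) -> str:
    """Hypothetical helper: modified UTF-7 -> Unicode mailbox name."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch != "&":
            out.append(ch)
            i += 1
            continue
        end = encoded.index("-", i + 1)  # ValueError if the shift is unterminated
        payload = encoded[i + 1:end]
        out.append("&" if payload == "" else _decode_run(payload))
        i = end + 1
    return "".join(out)


print(encode_mailbox("草稿箱"))   # &g0l6P3ux-
print(encode_mailbox("发件箱"))   # &U9FO9nux-
print(decode_mailbox("~peter/mail/&ZeVnLIqe-/&U,BTFw-"))
```

The two printed encodings match the values given in the question, and the last line decodes the RFC's own example name. Error handling is deliberately minimal: malformed input simply raises an exception.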
Friday, August 6, 2004, 15:46
#!/usr/bin/env python
# printu.py
import locale
encoding = locale.getdefaultlocale()[1]   # encoding of the current locale

P1 = """社会主义中国"""
s1 = unicode(P1, encoding)                # byte string -> unicode
#s2 = unicode(P1, "utf-8")
print s1
print s1.encode("utf-8")
print len(s1)

Output of "python printu.py":

社会主义中国
绀句細涓讳箟涓浗
6
Friday, August 6, 2004, 15:50
Thank you very much!
Friday, August 6, 2004, 16:07
It does not seem to work for me :)

>>> import locale
>>> encoding = locale.getdefaultlocale()[1]
>>> P1 = """社会主义中国"""
>>> s1 = unicode(P1, encoding)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: gb18030
>>> s1 = unicode(P1, "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

One more question: if I read "绀句細涓讳箟涓浗" from a text file, how do I convert it to "社会主义中国"?
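The second traceback above is what happens when bytes in the terminal's encoding (GBK here) are decoded as UTF-8. As a rough illustration, assuming a current Python 3 and a hypothetical temporary file named `sample.txt`, the sketch below reproduces that failure and then shows one way to read UTF-8 text from a file by naming the encoding explicitly.

```python
import os
import tempfile

text = "\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd"  # 社会主义中国

# Bytes as a GBK terminal would deliver them; they are not valid UTF-8.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print("GBK bytes decodable as UTF-8:", decoded_as_utf8)

# A file written as UTF-8 by another system: state the encoding when opening it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
    with open(path, encoding="utf-8") as f:
        read_back = f.read()

print(read_back == text)
```

The point is that the bytes on disk must be decoded with the encoding they were written in; the display encoding of the terminal is a separate matter.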
Friday, August 6, 2004, 16:24
What platform are you running on, and what is your locale?

Alternatively, you can try http://cjkpython.i18n.org/
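The "unknown encoding: gb18030" error earlier in the thread means the interpreter had no codec registered under the name the locale reported. One possible way to check this up front, sketched for a current Python 3 (an assumption), is to probe candidate names with `codecs.lookup`, which raises `LookupError` for unknown codecs. The helper name `first_available_codec` is hypothetical.

```python
import codecs
import locale


def first_available_codec(candidates):
    """Return the first name in `candidates` that Python has a codec for, else None."""
    for name in candidates:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # no codec registered under this name
        return name
    return None


# What the platform reports, followed by some fallbacks to try.
preferred = locale.getpreferredencoding(False)
print("locale encoding:", preferred)
print("usable codec:", first_available_codec([preferred, "gb18030", "gbk", "utf-8"]))
```

Modern Python versions ship the Chinese codecs in the standard library, so on such an interpreter the locale's own encoding is normally the one found.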
Friday, August 6, 2004, 17:37
UTF-8 encoding and decoding now work, thanks :)

I downloaded the CJKCodecs package from http://cjkpython.i18n.org/, then compiled and installed it.

>>> import locale
>>> encoding = locale.getdefaultlocale()[1]
>>> P1 = "社会主义中国"
>>> s1 = unicode(P1, encoding)
>>> s1
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> s = s1.encode("utf-8")
>>> print s
绀句細涓讳箟涓浗
>>> l = "绀句細涓讳箟涓浗"
>>> p = l.decode("utf-8")
>>> p
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> p.encode(encoding)
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'
>>> print p.encode(encoding)
社会主义中国
>>> P1
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'
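For comparison, the same round trip can be written with explicit bytes in a current Python 3 (an assumption; the session above is Python 2 with CJKCodecs installed). This is only a sketch, and `transcode` is a hypothetical helper name. The byte string and code points are the ones shown in the session.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from the `src` encoding and re-encode it as `dst`."""
    return data.decode(src).encode(dst)


# GBK bytes for 社会主义中国, as printed for P1 in the session above.
gbk_bytes = b"\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa"

text = gbk_bytes.decode("gbk")                 # bytes -> str (Unicode)
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
back_to_gbk = transcode(utf8_bytes, "utf-8", "gbk")

print(ascii(text))          # the \uXXXX escapes of the six characters
print(len(text), len(gbk_bytes), len(utf8_bytes))
print(back_to_gbk == gbk_bytes)
```

The garbled "绀句細涓讳箟涓浗" seen throughout the thread is what the UTF-8 bytes look like when a terminal renders them as GBK; the data itself is intact and decodes correctly once the right codec is named.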