Subject: [python-chinese] How do I decode encoded Chinese mailbox names?

Friday, 6 August 2004, 15:32

gavin gavin at sz.net.cn
Fri Aug 6 15:32:44 HKT 2004

Hi all,

RFC 2060 specifies how international (including Chinese) mailbox names are encoded. The relevant section is quoted below:

5.1.3. Mailbox International Naming Convention
By convention, international mailbox names are specified using a
modified version of the UTF-7 encoding described in [UTF-7]. The
purpose of these modifications is to correct the following problems
with UTF-7:

1) UTF-7 uses the "+" character for shifting; this conflicts with
the common use of "+" in mailbox names, in particular USENET
newsgroup names.

2) UTF-7’s encoding is BASE64 which uses the "/" character; this
conflicts with the use of "/" as a popular hierarchy delimiter.

3) UTF-7 prohibits the unencoded usage of "\"; this conflicts with
the use of "\" as a popular hierarchy delimiter.

4) UTF-7 prohibits the unencoded usage of "~"; this conflicts with
the use of "~" in some servers as a home directory indicator.

5) UTF-7 permits multiple alternate forms to represent the same
string; in particular, printable US-ASCII characters can be
represented in encoded form.

In modified UTF-7, printable US-ASCII characters except for "&"
represent themselves; that is, characters with octet values 0x20-0x25
and 0x27-0x7e. The character "&" (0x26) is represented by the two-octet
sequence "&-".

All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all
Unicode 16-bit octets) are represented in modified BASE64, with a
further modification from [UTF-7] that "," is used instead of "/".
Modified BASE64 MUST NOT be used to represent any printing US-ASCII
character which can represent itself.
"&" is used to shift to modified BASE64 and "-" to shift back to US-ASCII.
All names start in US-ASCII, and MUST end in US-ASCII (that
is, a name that ends with a Unicode 16-bit octet MUST end with a "-").

For example, here is a mailbox name which mixes English, Japanese,
and Chinese text: ~peter/mail/&ZeVnLIqe-/&U,BTFw-
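
The encoding direction of the rules quoted above can be sketched roughly as follows. This is a minimal illustration written for present-day Python 3 rather than the Python 2 used in this thread, and the function name `imap_utf7_encode` is made up for the example; it is not a standard-library API.

```python
import base64


def imap_utf7_encode(name: str) -> str:
    """Encode a mailbox name using the modified UTF-7 rules (sketch)."""
    out = []      # finished pieces of the encoded name
    pending = []  # characters waiting to be written as modified BASE64

    def flush():
        # Emit the pending run as "&" + modified BASE64 + "-".
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            # Modified BASE64: no "=" padding, and "," instead of "/".
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")   # "&" itself is written as "&-"
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)     # printable US-ASCII represents itself
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(imap_utf7_encode("草稿箱"))   # &g0l6P3ux-
print(imap_utf7_encode("发件箱"))   # &U9FO9nux-
```

Consecutive non-ASCII characters are collected into one run so that each run produces a single "&...-" section, which matches the shape of the RFC example.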


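I have read this several times and still cannot work out how to encode and decode Chinese mailbox names in Python. For example, under the rules above,
“草稿箱” (Drafts) encodes to "&g0l6P3ux-" and "发件箱" (Outbox) encodes to "&U9FO9nux-".

How can this encoding and decoding be implemented? Could someone post an example?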
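
For the decoding direction, a minimal sketch in present-day Python 3 might look like the following. The function name `imap_utf7_decode` is hypothetical, and the sketch assumes well-formed input: an unterminated "&" section raises `ValueError`, and no other validation is attempted.

```python
import base64


def imap_utf7_decode(encoded: str) -> str:
    """Decode a modified UTF-7 mailbox name (sketch, well-formed input only)."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch != "&":
            out.append(ch)               # plain US-ASCII represents itself
            i += 1
            continue
        end = encoded.index("-", i + 1)  # "-" shifts back to US-ASCII
        chunk = encoded[i + 1:end]
        if chunk == "":
            out.append("&")              # "&-" stands for a literal "&"
        else:
            b64 = chunk.replace(",", "/")   # undo the "," substitution
            b64 += "=" * (-len(b64) % 4)    # restore the stripped padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


print(imap_utf7_decode("&g0l6P3ux-"))   # 草稿箱
print(imap_utf7_decode("&U9FO9nux-"))   # 发件箱
```

Because "-" is not part of the BASE64 alphabet, the first "-" after an "&" always marks the end of that encoded section.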


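One last question: Python is nice, but handling Chinese text is a real headache!

According to some references, UTF-8 encoding and decoding works like this:
s=u"社会主义中国"
u8=s.encode("utf-8")  # convert to UTF-8
# The result displays as “脡莽禄谩脰梅脪氓脰脨鹿煤”, whereas another application converting from gb2312 produces "绀句細涓讳箟涓浗"
u8.decode("utf-8")    # convert back to unicode

If I read the “绀句細涓讳箟涓浗” (UTF-8) produced by the other system, decoding it with the method above fails :)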
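
The two garbled strings above look like the same text pushed through the wrong codecs. The sketch below, in present-day Python 3, shows one sequence of steps that reproduces the first string exactly; whether this is what actually happened on the poster's machine is an assumption. The second string is consistent with correct UTF-8 bytes being displayed by a GBK terminal.

```python
text = "社会主义中国"

gb_bytes = text.encode("gb2312")    # 12 bytes, two per character
utf8_bytes = text.encode("utf-8")   # 18 bytes, three per character

# Garbled form 1: GB2312 bytes mistaken for Latin-1, re-encoded as UTF-8,
# and then displayed by a terminal that expects GBK.
garbled_1 = gb_bytes.decode("latin-1").encode("utf-8").decode("gbk")
print(garbled_1)

# Garbled form 2: correct UTF-8 bytes displayed by a GBK terminal.
# Some byte pairs are not valid GBK, so this view loses information.
garbled_2 = utf8_bytes.decode("gbk", errors="replace")
print(garbled_2)

# The fix is to decode the raw bytes with the codec that produced them,
# not to decode the garbled text that the terminal displayed.
print(utf8_bytes.decode("utf-8"))

# Form 1 is fully reversible, because no step in it dropped any bytes.
repaired = (garbled_1.encode("gbk").decode("utf-8")
            .encode("latin-1").decode("gb2312"))
print(repaired)
```

The practical point is that the displayed characters are only a view of the underlying bytes. Text read from a UTF-8 file should be decoded from its bytes as UTF-8, whatever the terminal happens to show.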




Sincerely,

Frank Ning

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Friday, 6 August 2004, 15:46

gentoo.cn gentoo.cn at 126.com
Fri Aug 6 15:46:58 HKT 2004

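#!/usr/bin/env python
# printu.py (Python 2)
import locale
# Encoding name reported by the current locale
encoding = locale.getdefaultlocale()[1]

P1="""社会主义中国"""
s1 = unicode(P1, encoding)   # decode the byte string into a unicode object
#s2 = unicode(P1, "utf-8")
print s1
print s1.encode("utf-8")     # UTF-8 bytes; a GBK terminal shows them garbled
print len(s1)                # number of characters, not bytes

# python printu.py
Output: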
社会主义中国
绀句細涓讳箟涓浗
6



Friday, 6 August 2004, 15:50

gavin gavin at sz.net.cn
Fri Aug 6 15:50:15 HKT 2004

Thank you very much!


Friday, 6 August 2004, 16:07

gavin gavin at sz.net.cn
Fri Aug 6 16:07:26 HKT 2004

The script does not seem to work for me :)
>>> import locale
>>> encoding = locale.getdefaultlocale()[1]
>>> P1="""社会主义中国"""
>>> s1 = unicode(P1, encoding)
Traceback (most recent call last):
  File "", line 1, in ?
LookupError: unknown encoding: gb18030
>>> s1 = unicode(P1, "utf-8")
Traceback (most recent call last):
  File "", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data


One more question: if I read "绀句細涓讳箟涓浗" from a text file, how do I convert it back to “社会主义中国”?
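
The two tracebacks above have different causes. The LookupError means the interpreter has no codec registered under the name gb18030, while the UnicodeDecodeError means the bytes typed at the prompt were not UTF-8 in the first place. The sketch below, in present-day Python 3, checks which codec names are available before decoding; the helper name `first_available_codec` is made up for illustration.

```python
import codecs


def first_available_codec(candidates):
    """Return the first codec name this interpreter knows, or None."""
    for name in candidates:
        try:
            codecs.lookup(name)   # raises LookupError for unknown names
            return name
        except LookupError:
            continue
    return None


# The GB-encoded bytes of the test string, as a GBK console would supply them.
gb_bytes = b"\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa"

codec = first_available_codec(["gb18030", "gbk", "gb2312"])
if codec is not None:
    print(codec, gb_bytes.decode(codec))

# Decoding the same bytes as UTF-8 fails, as in the second traceback.
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc)
```

Current Python releases ship the GB codecs in the standard library, so the lookup normally succeeds on the first candidate. The fallback list only matters on interpreters that lack them.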



Friday, 6 August 2004, 16:24

gentoo.cn gentoo.cn at 126.com
Fri Aug 6 16:24:33 HKT 2004

Which platform are you running this on?
What is your locale?
Alternatively, you can try
http://cjkpython.i18n.org/
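
To answer the platform and locale questions on a given machine, a few standard-library calls are usually enough. This is a minimal sketch for present-day Python 3; the helper name `describe_environment` is hypothetical.

```python
import locale
import sys


def describe_environment():
    """Collect the settings that most often explain encoding surprises."""
    return {
        "platform": sys.platform,
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the preferred encoding and the terminal's encoding disagree, text that decodes correctly can still be displayed as garbage.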




Friday, 6 August 2004, 17:37

gavin gavin at sz.net.cn
Fri Aug 6 17:37:28 HKT 2004

UTF-8 encoding and decoding now works, thanks :)
I downloaded the CJKCodecs package from http://cjkpython.i18n.org/, then built and installed it.

>>> import locale
>>> encoding=locale.getdefaultlocale()[1]
>>> P1="社会主义中国"

>>> s1=unicode(P1,encoding)
>>> s1
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> s=s1.encode("utf-8")
>>> print s
绀句細涓讳箟涓浗
>>> l="绀句細涓讳箟涓浗"
>>> p=l.decode("utf-8")
>>> p
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> p.encode(encoding)
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'
>>> print p.encode(encoding)
社会主义中国
>>> P1
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'
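
For comparison, the same round trip can be sketched in present-day Python 3, where `str` is already Unicode and the GB codecs ship with the interpreter. The byte values are copied from the session above; using the codec name "gb18030" is an assumption based on the locale reported earlier in the thread.

```python
# GB-encoded bytes, as shown by the P1 value in the session above.
gb_bytes = b"\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa"

text = gb_bytes.decode("gb18030")    # bytes -> str (Unicode)
utf8_bytes = text.encode("utf-8")    # str -> UTF-8 bytes
back = utf8_bytes.decode("utf-8")    # UTF-8 bytes -> str again

print(ascii(text))            # the \uXXXX escapes of the decoded string
print(back == text)           # the round trip preserves the text
print(back.encode("gb18030") == gb_bytes)
```

In Python 3 the two kinds of value have separate types, `bytes` and `str`, so passing one where the other is expected raises an error immediately instead of producing garbled output later.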



