Subject: [python-chinese] Is this a bug, or am I using it wrong? The UTF-8 encoded character '钗'

徐继哲

Original poster, Wednesday, 2006-07-19 21:27

Ren Lifeng lfren at cad.zju.edu.cn
Wed Jul 19 21:27:36 HKT 2006

Here is a transcript of one session.

$ ipython
Python 2.3.5 (#2, Jun 13 2006, 23:12:55) 
Type "copyright", "credits" or "license" for more information.

IPython 0.7.2 -- An enhanced Interactive Python.

In [1]: import sys
In [2]: sys.getdefaultencoding()
Out[2]: 'utf-8'
In [3]: unicode('钗', 'utf-8')
---------------------------------------------------------------------------
exceptions.UnicodeDecodeError                        Traceback (most recent call last)

/home/rlf/prog/test/python/scripts/ 

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: unexpected code byte
> (1)

ipdb> quit
In [4]: unicode('头', 'utf-8')
Out[4]: u'\u5934'
In [5]: unicode('凤', 'utf-8')
Out[5]: u'\u51e4'
In [6]: 


The actual problem I ran into is:
$ ls -sh 钗头凤.mp3
3.4M 钗头凤.mp3
$ openfile.py 钗头凤.mp3
IOError: [Errno 2] No such file or directory: '    \x92\x97\xe5\xa4\xb4\xe5\x87\xa4.mp3'
$ cat openfile.py
#! /usr/bin/python
# -*- coding: utf-8; -*-
fl = open(sys.argv[1])
$
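
One way to narrow down a failure like this is to look at the raw bytes of the name the script actually received and compare them with what the shell displays. The sketch below assumes Python 3 (the session above is Python 2.3), and `describe_name` is a hypothetical helper, not part of the original openfile.py.

```python
import os


def describe_name(name):
    """Return the file-system bytes of `name` as hex, plus whether the path exists."""
    raw = os.fsencode(name)  # the bytes the operating system will actually see
    hex_bytes = " ".join("%02x" % b for b in raw)
    return hex_bytes, os.path.exists(name)


# Example with a plain ASCII name; in a script one would pass sys.argv[1] instead.
print(describe_name("a.mp3")[0])   # 61 2e 6d 70 33
```

On a UTF-8 system a correctly passed name beginning with 钗 would start with `e9 92 97`, whereas the IOError above shows the name starting with four 0x20 bytes followed by `92 97`.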

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

Reply (floor 0), Thursday, 2006-07-20 08:16

张骏 zhangj at foreseen-info.com
Thu Jul 20 08:16:14 HKT 2006

Python 2.4.3 

>>> '钗'
'\xee\xce'
>>> '钗'.decode( 'gbk' )
u'\u9497'
>>> '钗'.decode( 'gbk' ).encode( 'utf-8' )
'\xe9\x92\x97'
>>> '钗'.decode( 'gbk' ).encode( 'utf-8' ).decode( 'utf-8' )
u'\u9497'

I suspect the cjkcodec package you have installed has a bug. Is it the latest version?
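
For readers on Python 3, the same GBK to Unicode to UTF-8 round trip can be sketched as follows. This assumes Python 3, where byte strings and text are separate types; the input bytes are the ones shown in the Python 2.4.3 session above.

```python
# Bytes taken from the session above: '钗' as typed in a GBK terminal.
gbk_bytes = b"\xee\xce"

text = gbk_bytes.decode("gbk")            # bytes -> str (Unicode text)
utf8_bytes = text.encode("utf-8")         # str -> UTF-8 bytes
round_trip = utf8_bytes.decode("utf-8")   # UTF-8 bytes -> str again

print(ascii(text))          # '\u9497'
print(utf8_bytes)           # b'\xe9\x92\x97'
print(round_trip == text)   # True
```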


李迎辉

Reply (floor 0), Thursday, 2006-07-20 10:12

limodou limodou at gmail.com
Thu Jul 20 10:12:46 HKT 2006

On 7/19/06, Ren Lifeng <lfren at cad.zju.edu.cn> wrote:
> In [3]: unicode('钗', 'utf-8')

Is this '钗' actually UTF-8 encoded? Check what your sys.stdin.encoding is. It determines the encoding used for what you type at the command line.
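
A quick way to run that check is to print the relevant encodings side by side. This is a minimal sketch assuming Python 3; `report_encodings` is a hypothetical helper name, and the values it reports will differ between machines.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings that influence how typed text becomes bytes."""
    return {
        # Encoding Python associates with standard input (None if unavailable).
        "stdin": getattr(sys.stdin, "encoding", None),
        # Default used by str.encode() / bytes.decode() when none is given.
        "default": sys.getdefaultencoding(),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding suggested by the current locale settings.
        "locale": locale.getpreferredencoding(False),
    }


for name, value in report_encodings().items():
    print("%-10s %s" % (name, value))
```

If the terminal sends bytes in a different encoding from the one reported for stdin, text typed at the prompt will be misinterpreted.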

> IOError: [Errno 2] No such file or directory: '    \x92\x97\xe5\xa4\xb4\xe5\x87\xa4.mp3'

Why do there appear to be spaces at the front of this?


徐继哲

Reply (floor 0), Thursday, 2006-07-20 13:21

Ren Lifeng lfren at cad.zju.edu.cn
Thu Jul 20 13:21:20 HKT 2006

cjkcodecs:
 python-cjkcodecs          1.1.1-2

I would rather not install 2.4.


徐继哲

Reply (floor 0), Thursday, 2006-07-20 13:27

Ren Lifeng lfren at cad.zju.edu.cn
Thu Jul 20 13:27:16 HKT 2006

Replying to limodou <limodou at gmail.com>:

In [1]: import sys
In [2]: sys.stdin.encoding
Out[2]: 'UTF-8'
In [3]: 

In [3]: '钗'[0]
Out[3]: ' '
In [4]: ord('钗'[0])
Out[4]: 32
In [5]: len('钗')
Out[5]: 6
In [6]: 

Such a common character somehow takes 6 bytes to encode, and the first byte is 0x20 of all things.

Also, my guess is that sys.stdin.encoding should match sys.getdefaultencoding().
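
For comparison, here is what a healthy setup would report for the same checks. This sketch assumes Python 3, and `utf8_profile` is a hypothetical helper. In valid UTF-8, every byte of a multi-byte character is 0x80 or above, so a leading 0x20 cannot be part of the encoding of 钗.

```python
def utf8_profile(ch):
    """Return (byte count, first byte as hex) for the UTF-8 encoding of `ch`."""
    data = ch.encode("utf-8")
    return len(data), "0x%02x" % data[0]


# Escapes are used so the result does not depend on the source file's encoding:
# \u9497 is 钗, \u5934 is 头, \u51e4 is 凤.
for ch in "\u9497\u5934\u51e4":
    print(ascii(ch), utf8_profile(ch))
# '\u9497' (3, '0xe9')
# '\u5934' (3, '0xe5')
# '\u51e4' (3, '0xe5')
```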


李迎辉

Reply (floor 0), Thursday, 2006-07-20 13:36

limodou limodou at gmail.com
Thu Jul 20 13:36:24 HKT 2006

On 7/20/06, Ren Lifeng <lfren at cad.zju.edu.cn> wrote:
> Such a common character somehow takes 6 bytes to encode, and the first byte is 0x20 of all things.

I have no idea what is going on with your system.


大熊非熊

Reply (floor 0), Thursday, 2006-07-20 13:37

大熊 bearsprite at gmail.com
Thu Jul 20 13:37:19 HKT 2006

The UTF-8 encoding of 钗 should be 0xE9 0x92 0x97.
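
Those three bytes can be derived by hand from the code point U+9497 using the standard 3-byte UTF-8 bit pattern. The sketch below assumes Python 3; `encode_3byte_utf8` is a hypothetical helper that covers only the 3-byte range and is checked against the built-in codec.

```python
def encode_3byte_utf8(code_point):
    """Encode a code point in U+0800..U+FFFF by hand using the 3-byte pattern."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch only covers non-surrogate 3-byte code points")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


manual = encode_3byte_utf8(0x9497)
print(manual)                                # b'\xe9\x92\x97'
print(manual == "\u9497".encode("utf-8"))    # True
```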


徐继哲

Reply (floor 0), Thursday, 2006-07-20 13:44

Ren Lifeng lfren at cad.zju.edu.cn
Thu Jul 20 13:44:32 HKT 2006

Could you tell me what u'钗' evaluates to on your side?

In[1]: u'钗'

王颖奇

Reply (floor 0), Thursday, 2006-07-20 14:38

wang yingqi wangyingqi at gmail.com
Thu Jul 20 14:38:07 HKT 2006

I currently have the same problem as you here.
On my machine, u'钗' gives \u9497.
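
U+9497 is the expected code point for 钗, consistent with the \u9497 results earlier in the thread. A minimal way to confirm the identity of a code point, assuming Python 3, is:

```python
import unicodedata

# Written as an escape so the source file's encoding does not matter.
ch = "\u9497"

print(hex(ord(ch)))           # 0x9497
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-9497
```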




徐继哲

Reply (floor 0), Saturday, 2006-07-22 10:48

Carlos Liu about.linux at gmail.com
Sat Jul 22 10:48:32 HKT 2006

On 7/20/06, Ren Lifeng <lfren at cad.zju.edu.cn> wrote:
>
> In [1]: import sys
> In [2]: sys.getdefaultencoding()
> Out[2]: 'utf-8'
> In [3]: unicode('钗', 'utf-8')
> ---------------------------------------------------------------------------
> exceptions.UnicodeDecodeError                        Traceback (most recent call last)
>
> /home/rlf/prog/test/python/scripts/
>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: unexpected code byte
> > (1)
>

I tried it; this looks like an ipython bug. Using the plain python command line works fine.

-- 
 Best Regards
 Carlos

徐继哲

Reply (floor 0), Saturday, 2006-07-22 14:41

Ren Lifeng lfren at cad.zju.edu.cn
Sat Jul 22 14:41:51 HKT 2006

"Carlos Liu" <about.linux at gmail.com> writes:

> I tried it; this looks like an ipython bug. Using the plain python command line works fine.

It is a problem with the python shell itself. Below is a session of python running in interactive mode under rxvt/bash.

rlf at gforge:~$ python
Python 2.3.5 (#2, Jun 13 2006, 23:12:55) 
[GCC 4.1.2 20060613 (prerelease) (Debian 4.1.1-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ss = '    '
>>> hh = ['0x%x' % ord(s) for s in ss]
>>> hh
['0x20', '0x20', '0x20', '0x20', '0x92', '0x97']
>>> 

What shows up above as spaces is the character 钗 that I typed. The python shell inserts 4 spaces in front of the character's own encoding and swallows the 0xe9 byte.
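
The corruption described here can be imitated in a few lines to see why decoding fails exactly at position 4. This is a sketch assuming Python 3; it only mimics the reported byte sequence and says nothing about what the shell does internally.

```python
good = "\u9497".encode("utf-8")   # b'\xe9\x92\x97', the correct encoding of 钗

# Mimic the reported corruption: lead byte 0xe9 dropped, four spaces prepended.
bad = b"    " + good[1:]

print(["0x%x" % b for b in bad])
# ['0x20', '0x20', '0x20', '0x20', '0x92', '0x97']

try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x92 is a continuation byte with no lead byte in front of it.
    print(exc.start, hex(bad[exc.start]))   # 4 0x92
```

The four spaces decode fine as ASCII, so the first offending byte is 0x92 at index 4, which matches the position reported in the original UnicodeDecodeError.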

I am using the python 2.3.5 that ships with debian/testing.

For now, this is how I work around the problem:
In [3]: ed
IPython will make a temporary file named: /tmp/ipython_edit_a5wBIK.py
Editing...Waiting for Emacs...
 done. Executing edited code...
Out[3]: "# -*- coding: utf-8; -*-\nss = '\xe9\x92\x97'\n"
In [4]: !cat /tmp/ipython_edit_a5wBIK.py
# -*- coding: utf-8; -*-
ss = '钗'
That is, I edit and run a temporary file, and assign the string inside that file.
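
The same idea, keeping the non-ASCII literal in a file with a known encoding instead of typing it at the prompt, can be sketched without an editor. This assumes Python 3 and uses the standard `runpy` and `tempfile` modules; it is an illustration of the workaround, not what IPython's `ed` command does internally.

```python
import os
import runpy
import tempfile

# '\u9497' is 钗; the file on disk will contain the character itself as UTF-8.
source = "# -*- coding: utf-8 -*-\nss = '\u9497'\n"

# Write the assignment to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile(
    "w", suffix=".py", encoding="utf-8", delete=False
) as handle:
    handle.write(source)
    path = handle.name

try:
    namespace = runpy.run_path(path)   # execute the file and get its globals
finally:
    os.remove(path)

ss = namespace["ss"]
print(ascii(ss), ss.encode("utf-8"))   # '\u9497' b'\xe9\x92\x97'
```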

徐继哲

Reply (floor 0), Saturday, 2006-07-22 18:44

Carlos Liu about.linux at gmail.com
Sat Jul 22 18:44:40 HKT 2006

On 7/22/06, Ren Lifeng <lfren at cad.zju.edu.cn> wrote:
> It is a problem with the python shell itself. Below is a session of python running in interactive mode under rxvt/bash.

On my Debian sid, gnome-terminal/rxvt-unicode with python2.3.5/python2.4.3 all handle the character "钗" correctly; only ipython fails.


-- 
 Best Regards
 Carlos
