Python Forum post - Zeuux

Python Forum - Discussion board

Subject: [python-chinese] A question about regular expressions with Chinese text

徐继哲

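Original poster, Sunday, 2 December 2007, 00:20

Samuel samuel.yh.wu at gmail.com
Sun Dec 2 00:20:04 HKT 2007

I want to count how many times prices appear in a text, so I tried the following pattern:

(￥\s*)*\d+\s*[元米块]

At first I did not declare an encoding and the script failed to compile; following the error message, I then added coding= UTF-8.
The input file is UTF-8 encoded, and the system is openSUSE 10.3.
Here is my program: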

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
f=open(sys.argv[1],'r')
n=0
ps=''
p=re.compile('(￥\s*)*\d+\s*[元米块]')
for line in f:
    a=re.findall(p,line)
    for word in a:
        ps=ps+word
    if len(ps)*1.0/len(line)>(1*1.0/50):
        n=n+1
        print ps
        print line
    ps=''
n

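Could you please take a look and tell me what is wrong with this pattern?

This is my first post to the list, so any advice is welcome. Thanks in advance.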


-- 
Samuel Wu

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

tianjie

Reply 0, Sunday, 2 December 2007, 00:53

@@ askfor at gmail.com
Sun Dec 2 00:53:01 HKT 2007

p=re.compile(u'(￥\s*)*\d+\s*[元米块]', re.U)
Give this a try.

徐继哲

Reply 0, Sunday, 2 December 2007, 17:12

Samuel samuel.yh.wu at gmail.com
Sun Dec 2 17:12:48 HKT 2007

Hi @@,

Thanks for your quick response.

I tested your suggestion, but it failed again. Here is my test procedure:

I changed p to:


> p=re.compile(u'(￥\s*)*\d+\s*[元米块]', re.U)
>
Then I ran the script, and no match was found.
Next I changed re.findall(p,line) to re.findall(p,unicode(line,'UTF-8')).
Again, no match.

Actually, we can use a simple test sentence to test the pattern.
Suppose
test=u'120元 1111 145块 qwrer 34米'
p=re.compile(u'(￥\s*)*\d+\s*[元米块]', re.U)
then we can test the pattern with
re.match(p,test)

It failed to match what I need (120元, 145块, 34米).
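
One possible explanation for the result above, offered as a sketch rather than a confirmed diagnosis: re.match only tries the pattern at the start of the string, and re.findall returns the text of the capturing group (￥\s*) instead of the whole match whenever the pattern contains exactly one group. The example assumes Python 3, where str is already Unicode (the thread uses Python 2); the variable names are illustrative only.

```python
import re

# Python 3 sketch: str is Unicode, so no u'' prefix or re.U flag is needed.
pattern = re.compile(r'(￥\s*)*\d+\s*[元米块]')
test = '120元 1111 145块 qwrer 34米'

# re.match anchors at the start of the string, so it reports at most one match.
first = pattern.match(test)
first_text = first.group(0) if first else None

# With one capturing group, findall returns that group's text, which is empty
# here because no ￥ sign precedes any of the numbers.
group_only = pattern.findall(test)

# finditer plus group(0) yields the full text of every match.
full_matches = [m.group(0) for m in pattern.finditer(test)]

print(first_text)
print(group_only)
print(full_matches)
```

Under these assumptions the pattern itself does match all three prices; it is the choice of re.match or re.findall that hides them.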

BTW, if 'u' is not used (as in my earlier version,
p=re.compile('(￥\s*)*\d+\s*[元米块]')
), too much is matched; for example, 120多 is matched.
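
A likely reason for the over-matching, sketched under the assumption that Python 3 bytes behave like the Python 2 byte strings used in the thread: in a byte pattern, [元米块] is a class of nine individual UTF-8 bytes, not three characters. 多 starts with the same byte (0xE5) as 元 and 块, so its first byte alone satisfies the class.

```python
import re

# Assumption: Python 3 bytes objects stand in for Python 2 byte strings (str).
units = '元米块'.encode('utf-8')      # 9 bytes: 3 characters x 3 bytes each
byte_pattern = re.compile(rb'\d+\s*[' + units + rb']')

# 多 is not in the class, but its first UTF-8 byte (0xE5) is shared with 元 and 块.
sample = '120多'.encode('utf-8')
byte_matches = byte_pattern.findall(sample)

# The same class in a text (Unicode) pattern compares whole characters.
text_pattern = re.compile(r'\d+\s*[元米块]')
text_matches = text_pattern.findall('120多 145块')

print(byte_matches)
print(text_matches)
```

The byte match stops after a single byte of the Chinese character. Writing the alternatives out as 元|米|块 avoids this because each alternative is then a full three-byte sequence.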

So can anyone tell me what the problem is? It should be an encoding problem, but I
know little about encodings.

Thanks to @@ for the suggestion; I look forward to more help. Thanks in
advance.

My IME has problems, and I can't input Chinese for now. Sorry for
any inconvenience.

-- 
Samuel Wu

徐继哲

Reply 0, Sunday, 2 December 2007, 20:05

Chengjie Sun chjsun at gmail.com
Sun Dec 2 20:05:31 HKT 2007

Try the following to see if it is what you want.
I think you should pay attention to the meaning of the pattern.


#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
p=re.compile(u'\d+[元米块]', re.U)
test=u'120元 1111 145块 qwrer 34米'
a = re.findall(p,test)
for word in a:
     print word

徐继哲

Reply 0, Monday, 3 December 2007, 11:17

eho eho jiulang.eho at gmail.com
Mon Dec 3 11:17:46 HKT 2007

The file already declares its encoding as UTF-8, so the string literals do not need the "u" prefix:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
p=re.compile('\d+[元米块]')
test='120元 1111 145块 qwrer 34米'
a = re.findall(p,test)
if a:
     print a

Because the strings are UTF-8, the console prints them as escaped byte codes.
There is one problem, though: only one byte of each character's encoding is output.
Changing the regular expression to p=re.compile('(\d+元|\d+米|\d+块)') fixes it.

Criticism is welcome.

徐继哲

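Reply 0, Monday, 3 December 2007, 14:14

Samuel samuel.yh.wu at gmail.com
Mon Dec 3 14:14:51 HKT 2007

Thank you all very much for the replies. (At the office I can type Chinese.)

First, I also think there is no need to use u to specify the character encoding.

Following Chengjie's suggestion, I changed the code a little, but there are still problems, so I hope you can keep advising me: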
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
f=open(sys.argv[1],'r')
n=0
ps=''
p=re.compile('(￥\s*)?\d+\s*元|(￥\s*)?\d+\s*米|(￥\s*)?\d+\s*块')
for line in f:
    a=re.findall(p,line)
    for word in a:
        ps=ps+word
    if len(a) >5:
       print a
    if len(ps)*1.0/len(line)>(1*1.0/50):
        n=n+1
        print line
    ps=''
print n
f.close()

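I can also provide a test sentence:
test='掀影像风暴 诺基亚320万像素n73￥ 1580 元。 真倒霉买到了一部翻新机,前天我在北京买的,真是气死人了'

What finally needs to be matched is the part highlighted in red. ￥ is a full-width character.


Running the code above fails with the following error message:

Traceback (most recent call last):
  File "./pricespam.py", line 12, in 
    ps=ps+word
TypeError: cannot concatenate 'str' and 'tuple' objects

The pattern p does not match as intended. When I print a, it contains a nested list rather than a flat list of strings.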
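
The TypeError above is consistent with documented re.findall behaviour: when a pattern contains several capturing groups, findall returns one tuple of group texts per match instead of the matched string. A minimal sketch, assuming Python 3 (where str is Unicode) and using non-capturing groups (?:...) as one possible fix:

```python
import re

test = '掀影像风暴 诺基亚320万像素n73 ￥ 1580 元。 真倒霉买到了一部翻新机'

# Three capturing groups: findall returns one 3-tuple per match.
grouped = re.compile(r'(￥\s*)?\d+\s*元|(￥\s*)?\d+\s*米|(￥\s*)?\d+\s*块')
tuples = grouped.findall(test)

# (?:...) groups without capturing, so findall returns whole matches as strings.
ungrouped = re.compile(r'(?:￥\s*)?\d+\s*[元米块]')
strings = ungrouped.findall(test)

print(tuples)
print(strings)
```

With the non-capturing form, each element of the result is a plain string and can be concatenated as the original loop expects.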

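Also, as chengjie said, only one of the three bytes of each UTF-8 encoded character is printed; the whole character is not shown unless the alternatives are written out separately. I do not know why.
I would still rather not write them out separately, since that is tedious and error-prone. I hope you can keep pointing me in the right direction.

I have only recently started with Python, and encodings are giving me a headache. Also, are Python's regular expressions different from ordinary regular expressions?
I tested the expression in the shell and it matches the red text above perfectly; in fact, I extracted it with grep.

Finally, thank you all again.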

-- 
Samuel Wu

徐继哲

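Reply 0, Monday, 3 December 2007, 14:33

Samuel samuel.yh.wu at gmail.com
Mon Dec 3 14:33:08 HKT 2007

One correction:
I made the change following eho's suggestion, not Chengjie's. Sorry about that.

Thanks to both chengjie and eho.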

-- 
Samuel Wu

徐继哲

Reply 0, Monday, 3 December 2007, 15:48

eho eho jiulang.eho at gmail.com
Mon Dec 3 15:48:57 HKT 2007

You're welcome ^_^
If you want to print Chinese on the console,
add the code below at the end. If you are exchanging data with other modules and keeping it as unicode, do not change the encoding.

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
p=re.compile('(\d+元|\d+米|\d+块)')
test='120元 1111 145块 qwrer 34米'
a = re.findall(p,test)
if a:
     print a
     for word in a:
         print word.decode('cp936')

-- 
make simple things easy and complex things possible.

徐继哲

Reply 0, Monday, 3 December 2007, 16:17

Samuel samuel.yh.wu at gmail.com
Mon Dec 3 16:17:59 HKT 2007

Thank you very much.

If I use the following statements


>      for word in a:
>          print word.decode('cp936')
>
>
I get this error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x83 in position 6:
incomplete multibyte sequence

This is probably related to my platform: I am on Linux, where Python defaults to unicode encoding (not UTF-8).
I used to think that unicode and UTF-8 were the same thing.

The following statement does print correctly:
print unicode(word,'UTF-8')
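
The error message is what one would expect when UTF-8 bytes are decoded with the gbk codec: 元 is encoded in UTF-8 as the three bytes E5 85 83, gbk consumes E5 85 as one two-byte character, and the trailing 0x83 is left without a partner. A minimal sketch, assuming Python 3 bytes in place of the Python 2 byte string; the sample text '1580元' is a guess chosen so that 0x83 falls at position 6, not something taken from the thread:

```python
# Assumption: Python 3 bytes stand in for the Python 2 byte string in the thread.
raw = '1580元'.encode('utf-8')   # bytes as they would be read from a UTF-8 file

try:
    raw.decode('gbk')            # wrong codec for UTF-8 data
    gbk_failed = False
except UnicodeDecodeError:
    gbk_failed = True

decoded = raw.decode('utf-8')    # matching codec recovers the text

print(gbk_failed)
print(decoded)
```

The codec passed to decode has to match the encoding of the bytes, which here is the encoding of the source file and the input file, not the platform.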

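I am still hoping to solve my regular expression problem. The thread is long, so let me summarize:

1) The regular expression cannot be written as:

p=re.compile('\d+\s*[元米块]')
but has to be written as
p=re.compile('\d+\s*元|\d+\s*米|\d+\s*块')

2) The following expression fails to match:
p=re.compile('(￥\s*)?\d+\s*元|(￥\s*)?\d+\s*米|(￥\s*)?\d+\s*块')
Ideally I would like to write:
p=re.compile('(￥\s*)?\d+\s*[元米块]')

Please use my test sentence:
test='掀影像风暴 诺基亚320万像素n73 ￥ 1580 元。 真倒霉买到了一部翻新机,前天我在北京买的,真是气死人了'

Matching the red part is all that is needed (p=re.compile('￥ 1580 元') is not allowed, haha).

Thank you all for your patient guidance.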

-- 
Samuel Wu

徐继哲

Reply 0, Monday, 3 December 2007, 17:49

eho eho jiulang.eho at gmail.com
Mon Dec 3 17:49:06 HKT 2007

> > I get this error:
> > UnicodeDecodeError: 'gbk' codec can't decode byte 0x83 in position 6:
> > incomplete multibyte sequence

I am on Windows, where cp936, gbk, and gb2312 all work.
You are on Linux, so change the character encoding to gbk and it will be OK.

-- 
make simple things easy and complex things possible.

徐继哲

Reply 0, Monday, 3 December 2007, 18:22

Chengjie Sun chjsun at gmail.com
Mon Dec 3 18:22:21 HKT 2007

p=re.compile(r'￥\s*\d+\s*(?:元|米|块)')

test='掀影像风暴 诺基亚320万像素n73 ￥ 1580 元。 真倒霉买到了一部翻新机,前天我在北京买的,真是气死人了'
a = re.findall(p,test)
for word in a:
     print word


This gives the result you want. It is a bit strange: if "元米块" is put inside square brackets, the match result is incorrect.

徐继哲

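Reply 0, Monday, 3 December 2007, 18:35

Samuel samuel.yh.wu at gmail.com
Mon Dec 3 18:35:06 HKT 2007

Yes, exactly. It is very strange. The characters cannot be used inside square brackets; otherwise any number followed by any Chinese character matches.
It must still be an encoding problem.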

-- 
Samuel Wu

徐继哲

Reply 0, Monday, 3 December 2007, 19:25

eho eho jiulang.eho at gmail.com
Mon Dec 3 19:25:30 HKT 2007

All right, let me finish the code for you. The complete code is below :)

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import re
import sys
p=re.compile(r'((?:￥\s*)\d+\s*(?:元|米|块))')
test='120元 1111 145块 qwrer 34米'
test2='掀影像风暴 诺基亚320万像素n73 ￥ 1580 元。 真倒霉买到了一部翻新机,前天我在北京买的,真是气死人了'

a = re.findall(p,test2)
if a:
     for word in a:
         print word.decode('gbk')


-- 
make simple things easy and complex things possible.

cyt

Reply 0, Monday, 3 December 2007, 19:31

yuting cui yutingcui at gmail.com
Mon Dec 3 19:31:45 HKT 2007

re.compile(unicode('(￥\s*)*\d+\s*[元米块]','utf8'), re.U)
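
The one-liner above decodes the pattern to Unicode. For the character class to work, the text being searched probably needs the same treatment, and the capturing group is better made non-capturing so that findall returns whole matches. A rough Python 3 sketch of the original counting script under those assumptions; the function name count_price_lines and the sample lines are hypothetical, and the 1/50 threshold is taken from the original program:

```python
import re

# Non-capturing group so findall returns whole matches as strings.
PRICE = re.compile(r'(?:￥\s*)?\d+\s*[元米块]')

def count_price_lines(lines, ratio=1.0 / 50):
    """Count lines whose matched price text exceeds `ratio` of the line length."""
    count = 0
    for raw in lines:
        line = raw.decode('utf-8')          # bytes -> Unicode text before matching
        if not line:
            continue
        matched = ''.join(PRICE.findall(line))
        if len(matched) / len(line) > ratio:
            count += 1
    return count

# Hypothetical sample data standing in for lines read from a UTF-8 file.
sample = [
    '掀影像风暴 诺基亚320万像素n73 ￥ 1580 元。'.encode('utf-8'),
    '120多 没有价格'.encode('utf-8'),
]
print(count_price_lines(sample))
```

For a real file, the same function could be fed an object opened with open(path, 'rb'), which yields one bytes line per iteration.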

徐继哲

Reply 0, Saturday, 8 December 2007, 11:14

clfff.peter clfff.peter at gmail.com
Sat Dec 8 11:14:29 HKT 2007

Just passing by with a question. Given that:
     the file encoding is: # -*- coding: UTF-8 -*-
     the final conversion uses: print word.decode('gbk')
can I understand it this way:
     when Python interprets this program, it first reads the file as UTF-8, but all default strings are held in memory in 'gbk' encoding, so print word.decode('gbk') is needed at the end. I am not sure whether this understanding is correct.
Thanks.

