Python forum post - Zeuux

Subject: [python-chinese] Matching Chinese with regular expressions!

徐继哲

Original poster, Wednesday, April 6, 2005, 13:49

Carambo qutr at tjub.com.cn
Wed Apr 6 13:49:05 HKT 2005

python-chinese:

Hello!
>>> import re
>>> x = "中文"
>>> if re.match('[^\x00\xff][^\x00\xff]', x) != None:
...     print "OK"
...
OK
>>>


Carambo, qutr at tjub.com.cn
2005-4-6 

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]
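
The character class in the post above has no hyphen, so it excludes only the two characters `\x00` and `\xff` rather than a range. A minimal sketch of the difference follows, assuming Python 3 `str` objects (the thread itself uses Python 2 byte strings) and assuming the hyphenated form is what was intended, which the thread does not confirm. The names `POSTED`, `RANGED` and `matches` are illustrative only.

```python
import re

# The class as posted: it excludes exactly two characters, \x00 and \xff.
POSTED = re.compile(r'[^\x00\xff][^\x00\xff]')
# Hypothetical variant with a hyphen: it excludes the whole range \x00-\xff.
RANGED = re.compile(r'[^\x00-\xff][^\x00-\xff]')


def matches(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return pattern.match(text) is not None


for sample in ('中文', 'ab'):
    print(repr(sample), matches(POSTED, sample), matches(RANGED, sample))
```

Under these assumptions the posted class also accepts the plain ASCII string 'ab', while the hyphenated variant accepts only the two-character Chinese string.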

徐继哲

Reply 0, Wednesday, April 6, 2005, 14:28

Qiangning Hong hongqn at gmail.com
Wed Apr 6 14:28:58 HKT 2005

On Apr 6, 2005 1:49 PM, Carambo <qutr at tjub.com.cn> wrote:
> 
> >>> x = "中文"
> >>> if re.match('[^\x00\xff][^\x00\xff]', x) != None:
> ...     print "OK"
> ...
> OK
> 

I don't understand what you are trying to match.

-- 
Qiangning Hong

徐继哲

Reply 0, Wednesday, April 6, 2005, 15:10

cpunion cpunion at 263.net
Wed Apr 6 15:10:59 HKT 2005

>>> m = re.findall(u'[\u4e00-\u7fff]+', unicode('1234中华人民共和国万岁 China中文936中文China踩死小日本\t1234', 'cp936'))
>>> for i in m:
...     print i
...
中华人民共和国万岁
中文
中文
死小日本

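My range is accurate; the character “踩” is simply not covered by it. I have not yet found a reference on how Chinese characters are laid out in Unicode; once I have one, the range can be refined.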


-- 
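Motto: not knowing is no problem; not knowing how to search is!

A suggestion: please check the settings of your mail clients. Many messages on this list arrive garbled. If my messages look garbled to you, please let me know so that it does not get in the way of the discussion.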
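
The effect described in this post can be reproduced in Python 3, where string literals are already Unicode and no `unicode(..., 'cp936')` call is needed. The sketch below uses a shortened, hypothetical sample string rather than the one from the post, and the helper `outside_range` is an illustrative name, not something from the thread.

```python
import re

# Range used in the post above; it stops at U+7FFF.
NARROW = re.compile('[\u4e00-\u7fff]+')

# Shortened, hypothetical sample text (not the string from the post).
sample = 'China中文936踩点\t1234'


def outside_range(text, first=0x4E00, last=0x7FFF):
    """Return the non-ASCII characters of text whose code points fall outside first..last."""
    return [ch for ch in text if ord(ch) > 0x7F and not first <= ord(ch) <= last]


print(NARROW.findall(sample))   # ['中文', '点']
print(outside_range(sample))    # ['踩']
```

The second call lists exactly the characters the narrow range skips, which is a quick way to spot where a hand-picked range falls short.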


徐继哲

Reply 0, Wednesday, April 6, 2005, 15:12

cpunion cpunion at 263.net
Wed Apr 6 15:12:36 HKT 2005

I dropped the word “not”:
My range is not accurate.


徐继哲

Reply 0, Wednesday, April 6, 2005, 16:18

saddle saddle at gmail.com
Wed Apr 6 16:18:07 HKT 2005

Heh, interesting :)
>>> unicode('踩', 'cp936')
u'\u8e29'
So it really does fall outside 4e00 to 7fff. I wonder whether everything between 4e00 and 7fff is a Chinese character...

-- 
saddle <saddle at gmail.com>
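
Both points in this post can be checked with the standard library. The sketch below assumes Python 3 and relies on the Unicode character database bundled with the interpreter; `all_cjk_unified` is a hypothetical helper name. It decodes the cp936 bytes of “踩” to find its code point, then tests whether every code point from U+4E00 to U+7FFF carries a CJK unified ideograph name.

```python
import unicodedata

# Python 3 counterpart of unicode('踩', 'cp936'): decode the cp936 bytes to str.
ch = '踩'.encode('cp936').decode('cp936')
code_point = ord(ch)
print(hex(code_point))                  # 0x8e29
print(0x4E00 <= code_point <= 0x7FFF)   # False


def all_cjk_unified(first, last):
    """Check whether every code point in first..last is named as a CJK unified ideograph."""
    return all(
        unicodedata.name(chr(cp), '').startswith('CJK UNIFIED IDEOGRAPH')
        for cp in range(first, last + 1)
    )


print(all_cjk_unified(0x4E00, 0x7FFF))  # True
```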


徐继哲

Reply 0, Wednesday, April 6, 2005, 17:27

torres torreswang at gmail.com
Wed Apr 6 17:27:18 HKT 2005

Regular expression for matching Chinese characters: [\u4e00-\u9fa5]



-- 
yours friend
torreswang
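
A short sketch of how the range quoted in this post might be used, assuming Python 3 `str` input. The names `CHINESE_RUN`, `chinese_runs` and `is_all_chinese` are illustrative and do not come from the thread, and the sample text is a shortened stand-in for the one posted earlier.

```python
import re

# Range quoted in the post above.
CHINESE_RUN = re.compile('[\u4e00-\u9fa5]+')


def chinese_runs(text):
    """Return the maximal runs of characters in U+4E00..U+9FA5 found in text."""
    return CHINESE_RUN.findall(text)


def is_all_chinese(text):
    """Return True if text is non-empty and consists only of characters in the range."""
    return CHINESE_RUN.fullmatch(text) is not None


print(chinese_runs('China中文936踩点\t1234'))               # ['中文', '踩点']
print(is_all_chinese('中文'), is_all_chinese('中文abc'))    # True False
```

With this wider range the character “踩” (U+8E29) is no longer dropped.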

徐继哲

Reply 0, Wednesday, April 6, 2005, 23:27

Qiangning Hong hongqn at gmail.com
Wed Apr 6 23:27:36 HKT 2005

Which character set's Chinese characters does this correspond to? GB2312? GBK? It certainly can't be GB18030.



-- 
Qiangning Hong
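
One way to probe this question empirically is to count how many code points in U+4E00..U+9FA5 each codec can encode. The sketch below assumes Python 3 and its bundled `gb2312`, `gbk` and `gb18030` codecs; `count_encodable` is a hypothetical helper, and the counts reflect those codec tables rather than an authoritative reading of the standards.

```python
def count_encodable(codec, first=0x4E00, last=0x9FA5):
    """Count the code points in first..last that the given codec can encode."""
    count = 0
    for cp in range(first, last + 1):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        count += 1
    return count


total = 0x9FA5 - 0x4E00 + 1
for codec in ('gb2312', 'gbk', 'gb18030'):
    print(codec, count_encodable(codec), 'of', total)
```

A codec whose count equals the total covers the whole range; a smaller count means the range is wider than that character set.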

徐继哲

Reply 0, Wednesday, April 6, 2005, 23:45

Carlos Z.F. Liu carlosliu at users.sourceforge.net
Wed Apr 6 23:45:53 HKT 2005

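On Wed, Apr 06, 2005 at 11:27:36PM +0800, Qiangning Hong wrote:
> Which character set's Chinese characters does this correspond to? GB2312? GBK? It certainly can't be GB18030.
> 
> On Apr 6, 2005 5:27 PM, torres <torreswang at gmail.com> wrote:
> > 
> > Regular expression for matching Chinese characters: [\u4e00-\u9fa5]
> > 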

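This is the complete CJK Unified Ideographs block in Unicode. It can
basically be regarded as corresponding to the Chinese characters in
GB18030, at least 95% of them. There are a few more blocks, such as
CJK Compatibility Ideographs, but they contain very few characters,
all of which duplicate glyphs from the block above.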


-- 
 Best Regards,
 Carlos
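
The relationship between the two kinds of blocks mentioned in this post can be inspected with `unicodedata`. This is a minimal sketch assuming Python 3. It takes U+F900, the first code point of the CJK Compatibility Ideographs block (U+F900..U+FAFF), and normalizes it to its unified counterpart. Note also that the CJK Unified Ideographs block itself is defined as U+4E00..U+9FFF, and later Unicode versions assigned code points past U+9FA5, so the range from the thread may miss some newer characters.

```python
import unicodedata

# First code point of the CJK Compatibility Ideographs block.
compat = '\uf900'
# NFC normalization replaces it with the unified ideograph it duplicates.
unified = unicodedata.normalize('NFC', compat)

print(unicodedata.name(compat))            # CJK COMPATIBILITY IDEOGRAPH-F900
print(unicodedata.name(unified))           # CJK UNIFIED IDEOGRAPH-8C48
print(0x4E00 <= ord(unified) <= 0x9FA5)    # True
```

Normalizing text before applying a pattern such as `[\u4e00-\u9fa5]` is one possible way to fold such compatibility characters into the main range.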
