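Subject: [python-chinese] I want to write a script that reads a file in any encoding, converts it to unicode, and then pulls out one sentence at a time based on the punctuation. But I have a few questions: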

Original post, Sunday, October 28, 2007, 11:03

??? ?? clfff.peter at gmail.com
Sun Oct 28 11:03:41 HKT 2007

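     a>  Do full-width/half-width characters affect any of this? (I don't really understand full-width vs. half-width myself; I just suspect they might matter. Could someone explain them while we're at it?)
     b>  How can I correctly recognize a sentence? The result should feel like the way a person reads an article.
     c> ....
Has anyone built something similar? Could you give me some pointers... ( ~__~ )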
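
On point a>, one possible way to make full-width and half-width forms compare equal is Unicode NFKC normalization, which the standard library exposes as `unicodedata.normalize`. This is only a sketch; the helper name `to_halfwidth` is made up for illustration and is not something from the thread.

```python
import unicodedata

def to_halfwidth(text):
    """Fold full-width Latin letters, digits, and punctuation to their
    ordinary (half-width) forms via NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)

# Full-width letters, digits, comma, and exclamation mark become ASCII.
print(to_halfwidth("Ｐｙｔｈｏｎ２．５，ＯＫ！"))  # Python2.5,OK!
```

Note that NFKC also rewrites other compatibility characters (ligatures, superscripts, half-width katakana), and it leaves marks such as the ideographic full stop 。 unchanged, so it may be more or less than what a given script needs.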

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

Reply #0, Sunday, October 28, 2007, 11:19

??? ?? clfff.peter at gmail.com
Sun Oct 28 11:19:58 HKT 2007

Is there a way to get the set of all punctuation marks in a given encoding?
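
One way to approximate this, assuming "punctuation" means the Unicode general categories that start with `P`, is to walk the Basic Multilingual Plane and keep the characters the codec can encode. The function name `punctuation_in` is hypothetical.

```python
import unicodedata

def punctuation_in(encoding):
    """Return the set of BMP punctuation characters that `encoding` can represent."""
    found = set()
    for cp in range(0x10000):
        ch = chr(cp)
        # Unicode general categories starting with "P" are punctuation.
        if not unicodedata.category(ch).startswith("P"):
            continue
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # this codec has no byte sequence for the character
        found.add(ch)
    return found

gbk_punct = punctuation_in("gbk")
print(len(gbk_punct), "，" in gbk_punct)
```

Characters such as `<`, `=`, and `|` are classified as symbols (category `S`) rather than punctuation, so they would need to be added separately if the script treats them as separators.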


壳壳

Reply #0, Sunday, October 28, 2007, 13:20

Jiahua Huang jhuangjiahua at gmail.com
Sun Oct 28 13:20:16 HKT 2007

Here's a hint:


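#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

def zh2unicode(stri):
    """Decode a byte string to unicode by trying encodings in turn.

    Tries utf-8, gbk, big5, euc_jp, euc_kr, utf16 and utf32, in that order."""
    # encc records the encoding that succeeded, or 'unk' if none did
    global encc
    # 'jp' is not a registered Python codec name; 'euc_jp' is assumed here
    for c in ('utf-8', 'gbk', 'big5', 'euc_jp', 'euc_kr', 'utf16', 'utf32'):
        encc = c
        try:
            return stri.decode(c)
        except (UnicodeDecodeError, LookupError):
            pass
    encc = 'unk'
    return stri

# Separators: whitespace, ASCII punctuation, then full-width punctuation
# ("\u3000" is the full-width space)
seps = [" ", "\t", "\n", "\r", ",", "<", ">", "?", "!",
        ";", "#", ":", ".", "'", '"', "(", ")", "{", "}", "[", "]", "|", "_", "=",
        "\u3000", "，", "？", "。", "、", "“", "”", "《", "》", "［", "］", "！", "（", "）"]

# Python 3 string literals are already unicode, so no conversion step is needed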
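
For question b>, the separator list above mixes word-level separators (spaces, commas, brackets) with sentence-ending marks. For pulling out one sentence at a time, a smaller set of terminators may be enough. The following is a rough sketch, not a full solution: the name `split_sentences` and the choice of terminators are assumptions, and the ASCII period is deliberately left out because it also appears in decimals and abbreviations.

```python
import re

# A sentence: a run of non-terminators, then any terminators,
# then any closing quotes or brackets that belong to it.
_SENTENCE = re.compile(r'[^。！？!?\n]+[。！？!?]*[”’」』）)]*')

def split_sentences(text):
    """Split decoded text into sentences on Chinese and ASCII end marks."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]

for sentence in split_sentences("今天天气很好。你吃饭了吗？走吧！"):
    print(sentence)
```

This prints the three sentences on separate lines. Text that really should "read like a person reads" (ellipses, nested quotations, headings without end marks) would need additional rules.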


刘洪清

Reply #0, Sunday, October 28, 2007, 17:35

Davies Liu davies.liu at gmail.com
Sun Oct 28 17:35:23 HKT 2007

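This approach is not reliable: a byte stream in one encoding can happen to be valid in another encoding as well, in which case it decodes without error but the content is wrong.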
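
A small illustration of the problem, using the GBK bytes of 你好, which also happen to form a valid Big5 sequence. The helper `first_decodable` is hypothetical and only mimics the try-each-encoding loop.

```python
text = "你好"
data = text.encode("gbk")

# The GBK bytes are also a valid Big5 sequence, so decoding "succeeds"...
wrong = data.decode("big5")
# ...but it yields different characters than the original text.
print(wrong == text)  # False

def first_decodable(raw, candidates):
    """Return (encoding, text) for the first candidate that decodes raw."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, raw

# The order of the candidates decides the answer, not the data alone.
print(first_decodable(data, ("utf-8", "gbk", "big5"))[0])  # gbk
print(first_decodable(data, ("big5", "gbk"))[0])           # big5
```

So a successful `decode` only shows that the bytes are well-formed in that encoding, not that the encoding is the one the file was written in.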


徐继哲

Reply #0, Monday, October 29, 2007, 16:24

??? ?? clfff.peter at gmail.com
Mon Oct 29 16:24:35 HKT 2007

I'll give it a try first, thanks.
