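Subject: [python-chinese] I want to write a script that reads in a file of any encoding, converts it to Unicode, and then pulls out one sentence at a time based on the punctuation. But I have a few questions: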

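Sunday, October 28, 2007, 11:03

clfff.peter at gmail.com
Sun Oct 28 11:03:41 HKT 2007

     a>  Does running into full-width/half-width characters make any difference? (To be honest I don't understand full-width vs. half-width very well; I just suspect it might matter. Could someone explain full-width/half-width while they're at it?)
     b>  How can a sentence be recognized correctly? The effect should be like what a person perceives when reading an article.
     c> ....
Has anyone built something similar? Could you give me some pointers... ( ~__~ )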
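
For question a>, a minimal sketch of what the difference looks like once the text is Unicode: full-width forms such as the full-width comma (U+FF0C) are separate code points from their half-width ASCII counterparts, so a comparison against "," alone will miss them. The standard unicodedata module can report the width class and fold these forms with NFKC normalization. The sample string is made up for illustration and is written with escapes so the code points are unambiguous.

```python
import unicodedata

# Full-width "ABC", comma, "123" and exclamation mark (U+FF21.. etc.)
sample = "\uff21\uff22\uff23\uff0c\uff11\uff12\uff13\uff01"

# east_asian_width returns 'F' for full-width forms ('Na' for plain ASCII)
widths = [unicodedata.east_asian_width(ch) for ch in sample]

# NFKC normalization folds full-width forms to their half-width equivalents
folded = unicodedata.normalize("NFKC", sample)

print(widths)
print(folded)

# The ideographic full stop (U+3002) is not a full-width form of ".",
# so NFKC leaves it unchanged and it still has to be handled separately.
print(unicodedata.normalize("NFKC", "\u3002") == "\u3002")
```

Normalizing first means a separator list only needs the ASCII forms plus the CJK marks that have no ASCII equivalent.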

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Sunday, October 28, 2007, 11:19

clfff.peter at gmail.com
Sun Oct 28 11:19:58 HKT 2007

Is there a way to get the set of all punctuation marks for a given encoding?
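
One possible approach, sketched under the assumption that the text has already been decoded: punctuation is a property of Unicode characters rather than of a byte encoding, so the set can be built from the general category reported by the standard unicodedata module (categories starting with "P") and then narrowed to what a given codec can encode. The helper name punctuation_for is illustrative only.

```python
import sys
import unicodedata

# Every code point whose Unicode general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po) counts as punctuation.
ALL_PUNCT = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def punctuation_for(codec):
    """Return the punctuation characters that `codec` is able to encode."""
    result = set()
    for ch in ALL_PUNCT:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        result.add(ch)
    return result


gbk_punct = punctuation_for("gbk")
print(len(ALL_PUNCT), len(gbk_punct))
# Ideographic full stop and full-width comma are both encodable in GBK
print("\u3002" in gbk_punct, "\uff0c" in gbk_punct)
```

Characters such as < > = | are classified as math symbols (category Sm) rather than punctuation, so they would have to be added by hand if they should act as separators.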

Sunday, October 28, 2007, 13:20

Jiahua Huang jhuangjiahua at gmail.com
Sun Oct 28 13:20:16 HKT 2007

Here's a hint:


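#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

def zh2unicode(stri):
    """Decode a byte string by trying a list of encodings in turn.

    Tries utf-8, gbk, big5, euc_jp, euc_kr, utf-16 and utf-32, and leaves the
    name of the codec that worked in the global ``encc`` ('unk' if none did)."""
    global encc
    # 'euc_jp' is an assumption for Japanese; 'jp' itself is not a codec name Python recognizes
    for c in ('utf-8', 'gbk', 'big5', 'euc_jp', 'euc_kr', 'utf-16', 'utf-32'):
        encc = c
        try:
            return stri.decode(c)
        except UnicodeDecodeError:
            pass
    encc = 'unk'
    return stri  # undecodable input is returned unchanged, still as bytes

# ASCII separators first, then CJK and full-width ones: ideographic space (assumed),
# full-width comma and question mark, 。 、 curly quotes, 《 》, full-width [ ] ! ( )
seps = [" ", "\t", "\n", "\r", ",", "<", ">", "?", "!",
        ";", "#", ":", ".", "'", '"', "(", ")", "{", "}", "[", "]", "|", "_", "=",
        "\u3000", "\uff0c", "\uff1f", "\u3002", "\u3001", "\u201c", "\u201d",
        "\u300a", "\u300b", "\uff3b", "\uff3d", "\uff01", "\uff08", "\uff09"]

# str literals are already Unicode text, so the separators can be
# compared directly against the decoded input.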
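
A minimal sketch of how a separator list like the one above might be used to pull out one sentence at a time. It assumes the input is already decoded text and treats only a small, hypothetical set of sentence-ending marks (SENTENCE_END) as boundaries, since splitting on every separator would also cut at commas and spaces. The name iter_sentences is illustrative.

```python
import re

# Hypothetical set of sentence-ending marks: ASCII . ! ? plus the
# ideographic full stop and the full-width ! and ?
SENTENCE_END = ".!?\u3002\uff01\uff1f"

# A sentence is a run of non-terminators followed by any terminators;
# a trailing run without a terminator also counts as a sentence.
_SENTENCE_RE = re.compile("[^{0}]+[{0}]*".format(re.escape(SENTENCE_END)))


def iter_sentences(text):
    """Yield sentences from `text` one at a time, keeping their end marks."""
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            yield sentence


sample = "你好\u3002How are you? Fine!"
for s in iter_sentences(sample):
    print(s)
```

This is deliberately naive: a period inside an abbreviation or a decimal number would be treated as a sentence end, so real text usually needs extra rules on top.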

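Sunday, October 28, 2007, 17:35

Davies Liu davies.liu at gmail.com
Sun Oct 28 17:35:23 HKT 2007

This approach is incomplete: a byte stream in one encoding may happen to be valid in another encoding as well, so it decodes without error but the content is wrong.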
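
A small illustration of this point, using only the standard codecs: the UTF-8 bytes of 你好 also happen to form valid GBK byte pairs, so decoding them as GBK raises no error and silently yields different characters. The sample text is arbitrary; any trial-decoding loop that tried gbk before utf-8 would accept this wrong result.

```python
text = "你好"
raw = text.encode("utf-8")      # six bytes, three per character

# The same six bytes are also three valid GBK byte pairs,
# so this decode succeeds without raising UnicodeDecodeError.
misread = raw.decode("gbk")

print(repr(misread))
print(misread == text)          # False: valid bytes, wrong content
print(len(text), len(misread))  # 2 3
```

Because a successful decode proves only that the bytes are legal in that codec, the order in which encodings are tried ends up deciding the answer for ambiguous input.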

Monday, October 29, 2007, 16:24

clfff.peter at gmail.com
Mon Oct 29 16:24:35 HKT 2007

I'll give it a try first. Thanks.
