Subject: [python-chinese] Making the Chinese character encoding of emails consistent

徐继哲

Original post, Friday, June 23, 2006, 16:24

Jason Liu telecomliu at gmail.com
Fri Jun 23 16:24:48 HKT 2006

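The starting point is this: the Chinese text in a batch of emails may use different encodings, such as gbk, unicode, utf8, and so on, but for some reason the encoding of any particular email is unknown. I would like to convert the text in all of these emails to one specified encoding. Is that possible, and how can it be done?

Discussion welcome, thanks!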

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

徐继哲

Reply #0, Friday, June 23, 2006, 17:37

Xiao Lei Wu xiaoleiw at cn.ibm.com
Fri Jun 23 17:37:35 HKT 2006

What do you all usually write in this file?

Best Regards,

Zachary Wu (吴皛磊)
Software Engineer, Enterprise Content Management FVT, IBM China Software
Development Lab
Tel: +86 10 82782244-3235. Fax: 82782244-2886 Tie Line: 915-2244-3235
Internet: xiaoleiw at cn.ibm.com
Notes ID: Xiao Lei Wu/China/Contr/IBM at IBMCN
Address: 8/F, Block A, Power Creative Building, No.1, East Road, Shang Di,
Beijing 100085, P.R. China

Reply #0, Friday, June 23, 2006, 18:55

Robert Chen search.pythoner at gmail.com
Fri Jun 23 18:55:44 HKT 2006

I usually don't write anything in it, haha~~

On 6/23/06, Xiao Lei Wu <xiaoleiw at cn.ibm.com> wrote:
>
>  What do you all usually write in this file?
>


-- 
Robert
Python Source Code Analysis (Python源码剖析): http://blog.donews.com/lemur/

徐继哲

Reply #0, Friday, June 23, 2006, 19:06

Jason Liu telecomliu at gmail.com
Fri Jun 23 19:06:46 HKT 2006

On 06-6-23, Xiao Lei Wu <xiaoleiw at cn.ibm.com> wrote:
>
>  What do you all usually write in this file?
>

Mostly things like adding paths and importing packages.

徐继哲

Reply #0, Friday, June 23, 2006, 19:10

swordsp sparas2006 at gmail.com
Fri Jun 23 19:10:46 HKT 2006

http://chardet.feedparser.org/
Universal Encoding Detector
Character encoding auto-detection in Python. As smart as your browser. Open
source.

It is probably a general-purpose brute-force approach, and its accuracy seems decent.
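
A minimal sketch of how this suggestion could be applied, assuming the third-party chardet package is installed; its `chardet.detect` function returns a dict with `encoding` and `confidence` keys. The helper names `guess_encoding` and `to_utf8`, the pluggable `detector` parameter, and the gb18030 fallback are illustrative assumptions, not something stated in the thread.

```python
from typing import Callable, Optional


def guess_encoding(raw: bytes) -> Optional[str]:
    """Ask chardet for its best guess; return None if chardet is unavailable."""
    try:
        import chardet  # third-party package, assumed to be installed
    except ImportError:
        return None
    return chardet.detect(raw)["encoding"]  # chardet may also report None


def to_utf8(
    raw: bytes,
    detector: Callable[[bytes], Optional[str]] = guess_encoding,
    fallback: str = "gb18030",  # assumption: covers gb2312/gbk Chinese text
) -> bytes:
    """Decode raw with the detected encoding and re-encode it as UTF-8."""
    name = detector(raw) or fallback
    try:
        text = raw.decode(name)
    except (LookupError, UnicodeDecodeError):
        # The guess was wrong or unknown to Python: decode leniently instead.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

Since a detector only guesses, it is probably wise to prefer any charset the message declares itself and use detection only when that information is missing.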


Reply #0, Friday, June 23, 2006, 19:22

Zoom.Quiet zoom.quiet at gmail.com
Fri Jun 23 19:22:44 HKT 2006

On 6/23/06, Xiao Lei Wu <xiaoleiw at cn.ibm.com> wrote:
>
>
>
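> What do you all usually write in this file?
It's usually left empty, unless there is a special task, such as automatically adding certain files to the environment dictionary.
A lot of software does it this way; limodou's newEdit also uses it to add the mix and plugin components to the object tree ^^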



-- 
"""Time is unimportant, only life important!
blogging  :  http://blog.zoomquiet.org/pyblosxom/
wiki enter:   http://wiki.woodpecker.org.cn/moin/ZoomQuiet
in douban:  http://www.douban.com/people/zoomq/
"""


李迎辉

Reply #0, Friday, June 23, 2006, 19:36

limodou limodou at gmail.com
Fri Jun 23 19:36:59 HKT 2006

On 6/23/06, Zoom. Quiet <zoom.quiet at gmail.com> wrote:
> On 6/23/06, Xiao Lei Wu <xiaoleiw at cn.ibm.com> wrote:
> >
> >
> >
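> > What do you all usually write in this file?
> It's usually left empty, unless there is a special task, such as automatically adding certain files to the environment dictionary.
> A lot of software does it this way; limodou's newEdit also uses it to add the mix and plugin components to the object tree ^^
>
Because __init__.py is executed when a package is imported, whether you put anything in it depends on your design. Say I have a package that contains many modules. One way to use it is to have importing the package automatically import the relevant submodules; in that case __init__.py can import them. The other way is to reach the submodules through the package, without that shortcut; the package then only serves to organize things, so __init__.py can be empty.

Either way works. Many of django's packages, for example, consist of just an __init__.py file that does all the work, with nothing else in them.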
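
A small sketch of the two styles described above. It writes two throwaway packages to a temporary directory and imports them; the names `eager_pkg`, `plain_pkg`, and `helpers` are hypothetical and chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

HELPERS_SOURCE = "def greet():\n    return 'hello'\n"

# Build two throwaway packages in a temporary directory (not cleaned up here).
root = Path(tempfile.mkdtemp())
for name, init_source in [
    ("eager_pkg", "from . import helpers\n"),  # __init__.py imports the submodule
    ("plain_pkg", ""),                          # empty __init__.py
]:
    package_dir = root / name
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text(init_source)
    (package_dir / "helpers.py").write_text(HELPERS_SOURCE)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

eager_pkg = importlib.import_module("eager_pkg")
plain_pkg = importlib.import_module("plain_pkg")

# Importing eager_pkg ran its __init__.py, which already pulled in helpers.
eager_has_helpers = hasattr(eager_pkg, "helpers")
# plain_pkg only organizes its modules; helpers has not been loaded yet.
plain_has_helpers = hasattr(plain_pkg, "helpers")

# The submodule of plain_pkg has to be imported explicitly before use.
plain_helpers = importlib.import_module("plain_pkg.helpers")
```

With the first style a caller can write `eager_pkg.helpers.greet()` after a single import; with the second, each submodule is imported only when it is asked for.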

-- 
I like python!
My Blog: http://www.donews.net/limodou
My Django Site: http://www.djangocn.org
NewEdit Maillist: http://groups.google.com/group/NewEdit


徐继哲

Reply #0, Friday, June 23, 2006, 22:32

shhgs shhgs.efhilt at gmail.com
Fri Jun 23 22:32:12 HKT 2006

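It should be possible. Python's email module parses and generates mail messages well, and Python also supports many Chinese encodings. The problem itself is somewhat difficult, though, so you should first read RFC 822 and RFC 2822.

Mail comes in many formats. The troublesome one here is multipart. You first have to decode the multipart content, which is usually base64 or quoted-printable, then go through what is inside, find the parts whose MIME type is text/xxx, and convert those.

Note that multipart can be nested without limit, so once you get a multipart item you must unpack it and check its contents one by one.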
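
The steps above can be sketched with the standard library's email package. This is only an illustration under assumptions: the function name `text_parts_as_utf8` and the gb18030 fallback for parts that declare no usable charset are hypothetical choices, not something from the thread.

```python
import email
from typing import List


def text_parts_as_utf8(raw_message: bytes, fallback: str = "gb18030") -> List[bytes]:
    """Return every text/* part of a message, re-encoded as UTF-8."""
    message = email.message_from_bytes(raw_message)
    converted = []
    # walk() visits the message and all nested multipart containers depth-first.
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and non-text attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback  # declared charset, if any
        try:
            text = payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Declared charset is unknown or wrong: decode leniently instead.
            text = payload.decode(fallback, errors="replace")
        converted.append(text.encode("utf-8"))
    return converted
```

When a part declares its charset correctly, no guessing is needed at all; detection only has to cover parts whose declaration is missing or wrong.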



邱英波

Reply #0, Saturday, June 24, 2006, 21:56

Yingbo Qiu qiuyingbo at gmail.com
Sat Jun 24 21:56:24 HKT 2006

On 06-6-23, swordsp <sparas2006 at gmail.com> wrote:
> http://chardet.feedparser.org/
> Universal Encoding Detector
> Character encoding auto-detection in Python. As smart as your browser. Open
> source.
>
> It is probably a general-purpose brute-force approach, and its accuracy seems decent.

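I read its introduction; the algorithm comes from Mozilla's work.

We have also used this Mozilla algorithm for detection. When the text contains many characters the accuracy is passable, but when there are only a few it is not very high. For example, given the 4 bytes of "手机" ("mobile phone"), it seemed to classify them as Japanese.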
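
One way to see why short inputs are hard: a few bytes are often valid under more than one codec, so byte-level validity alone cannot settle the question and a detector has to lean on frequency statistics that need more text. The sketch below only lists which candidate codecs accept the bytes; the function name `plausible_encodings` and the candidate list are assumptions for illustration, and it is not how the Mozilla algorithm works.

```python
def plausible_encodings(raw: bytes, candidates=("utf-8", "gbk", "big5", "euc_jp")):
    """Return the candidate codecs under which raw decodes without error."""
    accepted = []
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted
```

For the gbk bytes of "手机", UTF-8 is ruled out, but other East Asian double-byte codecs may well accept the same 4 bytes, which is consistent with the misdetection described above.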

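Its algorithm is computed from the frequently used characters and words of each language. I think some of its word tables may need updating to suit Chinese better. Also, as far as email is concerned, this library cannot recognize the HZ encoding, so you have to patch it yourself.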
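
As a possible workaround rather than the patch mentioned above, HZ can be checked for before running a detector: it is a 7-bit encoding that switches into GB mode with the escape `~{`, and Python ships an `hz` codec. The function name `decode_if_hz` is hypothetical, and the check is a heuristic that plain ASCII text containing `~{` could fool.

```python
from typing import Optional


def decode_if_hz(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw looks like HZ-GB-2312, else None."""
    if b"~{" not in raw:
        return None  # no shift into GB mode, so nothing HZ-specific to decode
    if any(byte > 0x7F for byte in raw):
        return None  # HZ only uses 7-bit bytes
    try:
        return raw.decode("hz")
    except UnicodeDecodeError:
        return None
```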

Some people also recommend the enca library, but I haven't used it. http://packages.debian.org/unstable/text/enca


徐继哲

Reply #0, Sunday, June 25, 2006, 10:54

netkiller openunix at 163.com
Sun Jun 25 10:54:05 HKT 2006

1. From the mail subject:

 =?gb2312?b? xxxxxxxxxxxxxxx ?=

2. From the body:
Content-Type: text/plain;
 charset="gb2312"
Content-Transfer-Encoding: base64
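
A short sketch of the first point, using `email.header.decode_header` from the standard library to read the charset out of an encoded subject and decode it. The function name `decode_subject` and the gb18030 fallback are assumptions for illustration.

```python
from email.header import decode_header


def decode_subject(raw_subject: str, fallback: str = "gb18030") -> str:
    """Decode a Subject header that may contain =?charset?b?...?= words."""
    pieces = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            try:
                # charset is None for unencoded fragments between encoded words.
                pieces.append(value.decode(charset or "ascii"))
            except (LookupError, UnicodeDecodeError):
                # Declared charset is unknown or wrong: decode leniently instead.
                pieces.append(value.decode(fallback, errors="replace"))
        else:
            pieces.append(value)  # header had no encoded words at all
    return "".join(pieces)
```

The charset named in the subject is also a reasonable hint for a body that declares none, although nothing guarantees the two match.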



徐继哲

Reply #0, Monday, June 26, 2006, 12:16

Xiao Lei Wu xiaoleiw at cn.ibm.com
Mon Jun 26 12:16:07 HKT 2006

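python-chinese-bounces at lists.python.cn wrote on 2006-06-23 19:36:59:

> On 6/23/06, Zoom. Quiet <zoom.quiet at gmail.com> wrote:
> > On 6/23/06, Xiao Lei Wu <xiaoleiw at cn.ibm.com> wrote:
> > >
> > > What do you all usually write in this file?
> > It's usually left empty, unless there is a special task, such as automatically adding certain files to the environment dictionary.
> > A lot of software does it this way; limodou's newEdit also uses it to add the mix and plugin components to the object tree ^^
> >
> Because __init__.py is executed when a package is imported, whether you put anything in it depends on your design. Say I have a package that contains many modules. One way to use it is to have importing the package automatically import the relevant submodules; in that case __init__.py can import them. The other way is to reach the submodules through the package, without that shortcut; the package then only serves to organize things, so __init__.py can be empty.
>
> Either way works. Many of django's packages, for example, consist of just an __init__.py file that does all the work, with nothing else in them.

That isn't so good. The main __init__.py of SPE does it better: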
##import info
##INFO=info.copy()
##INFO['description']=\
##"""This is the main spe application."""
##__doc__=INFO['doc']%INFO

def main():
    import SPE

if __name__ == '__main__': main()


徐继哲

Reply #0, Monday, June 26, 2006, 16:12

Jason Liu telecomliu at gmail.com
Mon Jun 26 16:12:17 HKT 2006

Thanks for the pointers, everyone. The problem is basically solved.