Friday, March 12, 2004, 18:35
I am trying to process a huge Chinese document. The single document is in pure text format and it's nearly 4 MB. I always get an "incomplete multibyte sequence" error when I try to convert the sentences to Unicode. I think the reason is that the Chinese document uses both ASCII punctuation and 2-byte Chinese punctuation. For example, the same document can contain both the ASCII comma "," and the full-width comma "，", and both < > and 《》. Is there any way I can get around this? Don't ask me to fix the Chinese document!
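A minimal sketch of one way to simply get past the bad spots, rather than fix them (assuming the file is mostly GBK-encoded; the file name here is invented), is to decode with a lenient error handler, which substitutes or drops any broken multibyte sequence instead of raising an error:

data = open('huge_chinese.txt', 'rb').read()
text = data.decode('gbk', 'replace')    # or 'ignore' to silently drop bad bytes
print 'decoded', len(text), 'characters'

(Python 2 syntax, matching the rest of the thread. Whether 'replace' or 'ignore' is acceptable depends on whether losing the odd character matters for the later sentence splitting.)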
Friday, March 12, 2004, 21:51
Hello Anthony,

Can you try processing it line by line, rather than all at once?

--
Best regards,
Zoom.Quiet

/=======================================\
]Time is unimportant, only life important![
\=======================================/
Saturday, March 13, 2004, 00:37
My program blocks forever on a line like the following: s.accept(). Ctrl-C does not break out of it, and closing the socket from another thread does not work either. How can I end my program "politely"? Thanks.
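A minimal sketch of one common way out (this is an illustration, not code from the thread; the port number and loop are invented): give the listening socket a timeout, so that accept() returns control periodically and can check a shutdown flag that another thread or a signal handler sets.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', 8888))          # example port
s.listen(5)
s.settimeout(1.0)           # accept() now raises socket.timeout after 1 second

running = True              # another thread sets this to False to stop us
while running:
    try:
        conn, addr = s.accept()
    except socket.timeout:
        continue            # nobody connected yet; re-check the flag
    conn.close()            # handle the client here
s.close()

Another approach is to have the shutting-down thread make a throwaway connection to the listening port, so that accept() returns once and the loop can notice the flag.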
Saturday, March 13, 2004, 02:52
Yes, I can read it sentence by sentence, but the problem is how I can tell in advance where the Unicode problem is going to occur.

--- "Zoom.Quiet" <zoomq at infopro.cn> wrote:
> Can you try processing it line by line, rather than all at once?
Saturday, March 13, 2004, 04:02
It's a problem with either the Unicode converter or the source text. To find out which, analyze the bytes that cause the problem: what are their decimal or hex values? Then we can see whether it is in fact a valid or an invalid sequence.

Also, please fix your Chinese emails. If Yahoo is the problem, please consider switching to another free email service. Thanks!

John
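A quick sketch of how to get those values (the file name is invented, and GBK is only a guess at the encoding; Python 2.3+ exposes the failure position on the exception object): catch the UnicodeDecodeError and dump the offset and the surrounding bytes in hex.

data = open('huge_chinese.txt', 'rb').read()
try:
    text = data.decode('gbk')
except UnicodeDecodeError, e:
    print 'decode failed at byte offset', e.start, '-', e.reason
    context = data[max(0, e.start - 10):e.start + 10]
    print ' '.join(['%02X' % ord(c) for c in context])

The hex dump around e.start shows whether the offending bytes are a truncated GBK pair, a stray single byte, or text in some other encoding.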
Saturday, March 13, 2004, 06:17
Can you see the Chinese characters now?

Yes, what you say makes very good sense. The following two lines attempt to break the Chinese sentence apart at the punctuation marks:

str = "世界名著《红楼梦》的作者曹雪芹是前清有名的才子。"
alist = re.split('《|》|。', str)

It works fine, and alist contains the three chunks of the sentence, as expected. But if I convert str to Unicode before I call re.split, like so:

str = unicode(str, 'gbk')

then the regular expression passed to re.split won't match anything. I tried converting the punctuation in the regular expression to Unicode as well, like so:

leftbk = '《'
rightbk = '》'
fullstop = '。'

pattern = '\'' + leftbk + '|' + rightbk + '|' + fullstop + '\''
alist = re.split(pattern, str)

It does not work. I am kind of at my wit's end.

--- John Li <johnli at ahlt.net> wrote:
> It's a problem with either the Unicode converter or the source text.
Saturday, March 13, 2004, 13:00
> Can you see the Chinese characters now?

Thanks!

----------------------------------------
import wx, re

str = '世界名著《红楼梦》的作者曹雪芹是前清有名的才子。'
str = unicode(str, 'gbk')

leftbk = unicode('《', 'gbk')
rightbk = unicode('》', 'gbk')
fullstop = unicode('。', 'gbk')

pattern = leftbk + u'|' + rightbk + u'|' + fullstop
alist = re.split(pattern, str)

for x in alist:
    print x.encode('gbk')

--John
Saturday, March 13, 2004, 13:04
> import wx, re

Sorry! There is no need to import wx.

John
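For what it's worth, the same split can be written a touch more compactly with a character class instead of the a|b|c alternation; this is just a variation on John's sketch, with the same assumption that the source file is saved as GBK:

# -*- coding: gbk -*-
import re

text = unicode('世界名著《红楼梦》的作者曹雪芹是前清有名的才子。', 'gbk')
seps = unicode('[《》。]', 'gbk')     # one character class instead of 《|》|。
for chunk in re.split(seps, text):
    print chunk.encode('gbk')

The key point is unchanged: once the text is Unicode, the pattern must be Unicode too, so that each punctuation mark is a single character rather than a pair of GBK bytes.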
Saturday, March 13, 2004, 16:18
Yes, John, thank you very much for your hint and the sample code. I did not realize that I needed to use u'|' for the "or" operator.
Sunday, March 14, 2004, 00:46
> Yes, John, thank you very much for your hint and the sample code. I did not realize that I needed to use u'|' for the "or" operator.

Actually, I don't think that's essential; I was just being careful. I think that if any one of the strings is unicode, then the rest will be automatically converted to unicode, so that the resulting string is unicode.
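That automatic conversion is Python 2's implicit coercion, and a small caveat is worth adding (this example is an illustration, not from the thread): mixing str and unicode only works when the byte string is pure ASCII, which '|' is; a raw GBK byte string mixed into a unicode expression raises UnicodeDecodeError instead.

left = unicode('《', 'gbk')
pattern = left + '|' + unicode('》', 'gbk')   # '|' is plain ASCII, so it is coerced silently
print type(pattern)                           # <type 'unicode'>

try:
    left + '》'              # raw GBK bytes are not ASCII, so coercion fails
except UnicodeDecodeError:
    print 'mixing unicode with non-ASCII bytes raises UnicodeDecodeError'

So writing u'|' explicitly, as John did, is harmless belt-and-braces; the case that really needs care is concatenating unicode with non-ASCII byte strings.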
Sunday, March 14, 2004, 02:07
> I think that if any one of the strings is unicode, then the rest will be automatically converted to unicode, so that the resulting string is unicode.

Hi John, are you serious about the above?