2004-03-13 (Saturday) 08:24
First, how do we read one byte at a time in Python?

Second, the text file I am processing contains both Chinese characters and ASCII characters, so I want to know when I hit a Chinese character. Since a Chinese character takes 2 bytes, my plan is to check whether the eighth (highest) bit of a byte is set. So here is the second question: how do I check whether the 8th bit is set in Python? Does Python have bit-operation functions for this?

Thanks.
2004-03-13 (Saturday) 08:59
Tell us what processing you want to perform on the document, then we can help you.

>From: Anthony Liu <antonyliu2002 at yahoo.com>
>Reply-To: python-chinese at lists.python.cn
>To: pycn <python-chinese at lists.python.cn>
>Subject: [python-chinese] incomplete multibyte sequence
>Date: Fri, 12 Mar 2004 02:35:51 -0800 (PST)
>
>I am trying to process a huge Chinese document. The
>single document is in plain-text format and it's nearly
>4M.
>
>I always get an "incomplete multibyte sequence" error
>when I try to convert the sentences to unicode.
>
>I think the reason is that the Chinese document
>uses both ASCII punctuation and 2-byte Chinese
>punctuation.
>
>For example, the single document can contain both , and
>, and both < > and 《》.
>
>Is there any way I can get around this? Don't ask me
>to fix the Chinese document!
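A side note on the error itself, as a minimal sketch rather than a diagnosis of the poster's file. It assumes the document is GBK-encoded (the thread never names the encoding), and the helper names `decode_or_reason` and `decode_in_chunks` are made up for illustration. Under that assumption, mixing ASCII and 2-byte punctuation decodes without trouble; one way to get a decode error is to cut the byte string in the middle of a 2-byte character, for example by slicing or reading in fixed-size chunks.

```python
import codecs

# Assumption: the document is GBK-encoded; the thread never names the encoding.
ENCODING = "gbk"


def decode_or_reason(data, encoding=ENCODING):
    """Decode bytes, or return the codec's complaint instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


def decode_in_chunks(chunks, encoding=ENCODING):
    """Decode an iterable of byte chunks that may split a character in two."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Flush the decoder; this raises if a partial character is left over.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# ASCII and full-width punctuation side by side, as in the document described above.
sample = "你好,世界,《书》<a>"
data = sample.encode(ENCODING)

print(decode_or_reason(data) == sample)                   # the mix itself decodes
print(decode_or_reason(data[:3]))                         # cut inside the 2nd character
print(decode_in_chunks([data[:3], data[3:]]) == sample)   # chunked decoding recovers
```

The incremental decoder holds back a dangling lead byte until the next chunk arrives, which is why the chunked version succeeds where the plain slice fails.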
2004-03-13 (Saturday) 09:07
>From: Anthony Liu <antonyliu2002 at yahoo.com>
>Reply-To: python-chinese at lists.python.cn
>To: pycn <python-chinese at lists.python.cn>
>Subject: [python-chinese] How to check if the 8th bit of a byte is set?
>Date: Fri, 12 Mar 2004 16:24:58 -0800 (PST)
>
>First, how do we read one byte each time in Python?

    s = 'abc'
    s[0]    # gets the first byte

>Second, the text file I am processing has both Chinese
>characters and ascii characters. So, I wanna know
>when I hit a Chinese character. The way to detect is
>to check if the eighth bit of a byte is set since a
>Chinese character has 2 bytes. So, here is the 2nd
>question, how to check if the 8th bit is set in
>Python? Does python have such bit operation
>functions?

    a = 'a'
    ord(a) & 0x80

>Thanks.
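The two answers above can be combined into a short sketch. It is written for Python 3, where indexing a `bytes` object already yields an integer, so `ord()` is only needed on one-character strings as in the original reply. It assumes a GBK/GB2312-style encoding in which every non-ASCII character takes exactly 2 bytes; the function names are made up for illustration.

```python
def is_high_bit_set(byte_value):
    """True if the 8th bit (mask 0x80) of an integer byte value is set."""
    return (byte_value & 0x80) != 0


def split_chars(data):
    """Split encoded bytes into per-character byte strings.

    Assumption: a byte with the high bit set starts a 2-byte character,
    and every other byte is a 1-byte ASCII character.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_high_bit_set(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


sample = "ab中c".encode("gbk")  # assumption: the file is GBK-encoded
for chunk in split_chars(sample):
    print(chunk, "Chinese" if len(chunk) == 2 else "ASCII")
```

Skipping the second byte after a lead byte matters: in GBK the trailing byte of a character does not always have its high bit set, so testing every byte on its own could misread it as ASCII.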
2004-03-13 (Saturday) 16:17
--- Who Bruce <whoonline at msn.com> wrote:
> >First, how do we read one byte each time in Python?
>
> s='abc'
> s[0] get the first byte

Thank you Bruce, yes, this helps. I can probably just read line by line with readline() and then check each character in the line. A good hint.

> >Second, the text file I am processing has both Chinese
> >characters and ascii characters. So, I wanna know
> >when I hit a Chinese character. The way to detect is
> >to check if the eighth bit of a byte is set since a
> >Chinese character has 2 bytes. So, here is the 2nd
> >question, how to check if the 8th bit is set in
> >Python? Does python have such bit operation
> >functions?
>
> a='a'
> ord(a) & 0x80

This helps a lot. I believe you are right, but I don't yet understand why you AND with 0x80. Could you explain, please? Don't laugh at me; I don't know much about the internal representation of a character.
2004-03-13 (Saturday) 16:25
Bruce,

Thank you. Actually, I just want to break every Chinese sentence apart, using the punctuation marks as delimiters, and then get the initial and final characters of each clause.

For example, suppose we have a Chinese sentence like this (where c1 ... cn stand for the 1st through the nth characters of the sentence):

    c1c2c3c4,c5c6c7"c8c9c10c11.

I want to break it apart so that I get 3 clauses in this case:

    c1c2c3c4
    c5c6c7
    c8c9c10c11

And then I want the initial and final characters of each clause, i.e., in this case, c1, c4, c5, c7, c8, c11.

I guess I will probably just read line by line and check each character individually. When I hit the characters on either side of a punctuation mark, I store them. What do you think?

--- Who Bruce <whoonline at msn.com> wrote:
> tell us what process you want to perform to the
> document, then we can help you.
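The clause-splitting idea described above could be sketched as follows. This is one possible approach, not something proposed in the thread: it assumes the text has already been decoded to a Python 3 `str`, and both the delimiter set and the name `clause_edges` are illustrative choices that would need adjusting for a real document.

```python
import re

# Assumption: an illustrative delimiter set mixing ASCII and full-width
# punctuation; extend it to match whatever the real document uses.
DELIMITERS = ",.;:!?\"'<>" + ",。;:!?、“”‘’《》"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def clause_edges(sentence):
    """Return (initial, final) character pairs, one per clause.

    Clauses are the non-empty pieces left after splitting on punctuation.
    """
    clauses = [c for c in _SPLIT_RE.split(sentence) if c]
    return [(c[0], c[-1]) for c in clauses]


# Same shape as c1c2c3c4,c5c6c7"c8c9c10c11. in the message above.
print(clause_edges("甲乙丙丁,戊己庚\"辛壬癸子。"))
# [('甲', '丁'), ('戊', '庚'), ('辛', '子')]
```

Working on decoded text sidesteps the byte-level bookkeeping entirely: each Chinese character is one element of the string, so `c[0]` and `c[-1]` are whole characters.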
2004-03-13 (Saturday) 17:15
> a='a'
> ord(a) & 0x80

So 0x80 is 10000000 in binary, and ord(a) & 0x80 = 0 if 'a' is an ASCII character. Otherwise, if 'a' is the first byte of a Chinese character, ord(a) & 0x80 = 128.

Right?
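A small sketch that prints the bit patterns makes the masking visible. The byte 0xD6 is used here only as an example of a value above 127; `describe` is a made-up helper name.

```python
MASK = 0x80  # binary 10000000: only the 8th (highest) bit is set


def describe(byte_value):
    """Return the byte, the mask, and their AND as 8-digit binary strings."""
    return (
        format(byte_value, "08b"),
        format(MASK, "08b"),
        format(byte_value & MASK, "08b"),
    )


print(describe(ord("a")))  # ('01100001', '10000000', '00000000')
print(describe(0xD6))      # ('11010110', '10000000', '10000000')
print(ord("a") & MASK, 0xD6 & MASK)  # 0 128
```

AND keeps a bit only where both operands have a 1, so every bit except the highest is cleared. The result is therefore either 0 (high bit clear, an ASCII byte) or 128 (high bit set).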
2004-03-14 (Sunday) 12:34
I would very much like to learn the Python language. Can anyone tell me where I can buy or download it?
2004-03-15 (Monday) 10:09
Hello 靳云,

Faint!!!! Come on, do a quick search first!
Python is an open-source project!
www.python.org
Free to download, learn, and use!

=== [ 12:34 ; 04-03-14 ] you wrote:
> I would very much like to learn the Python language. Can anyone tell me where I can buy or download it?

--
Best regards,
Zoom.Quiet
Time is unimportant, only life important!