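2007-05-08 (Tue) 10:02
Hello, experts:

I'm not a professional programmer. I recently became interested in Python and am using it to write a small program that scrapes some operational data from the Baidu Tieba rankings. I've run into the problem below and would appreciate your advice.

I want to extract the part marked in red from the HTML below. The result I'm hoping for looks like this:

百度贴吧_魔兽世界吧, topics 316223, posts 3042967, members 2038.

SGMLParser doesn't seem able to solve this. Everything I've seen SGMLParser parse is of one kind [tag example stripped by the archive], but here the class="pad10L" element is not unique, and its line number is not fixed either. The match has to be on class="pad10L">共有主题数 to work.

Any pointers would be appreciated.

[HTML sample as rendered by the archive; markup stripped]
百度贴吧_魔兽世界吧
............................................
请问是狂暴战士输出高还还是武器战士输出高? 125.64.55.* 08:41 121.201.44.*
共有主题数316223个,贴子数3042967篇,会员数2038 (the 会员 link points to http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7)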
2007-05-08 (Tue) 10:06
On 07-5-8, icekernel <icekernel at gmail.com> wrote:
> SGMLParser doesn't seem able to solve this. [...] here the class="pad10L"
> element is not unique, and its line number is not fixed either. The match
> has to be on class="pad10L">共有主题数 to work.
> _______________________________________________
> python-chinese
> Post: send python-chinese at lists.python.cn
> Subscribe: send subscribe to python-chinese-request at lists.python.cn
> Unsubscribe: send unsubscribe to python-chinese-request at lists.python.cn
> Detail Info: http://python.cn/mailman/listinfo/python-chinese

using HTMLParser

--
GoogleTalk: qcxhome at gmail.com
MSN: qcxhome at hotmail.com
My Space: tkdchen.spaces.live.com
BOINC: boinc.berkeley.edu
中国分布式计算总站 (China distributed computing site): www.equn.com
2007-05-08 (Tue) 10:28
The examples all seem to be of this kind [tag example stripped by the archive]. I can't find a sample program that handles a table.

On 07-5-8, 麦田守望者 <qcxhome at gmail.com> wrote:
> using HTMLParser

--
gtalk: icekernel at gmail.com
blog: http://www.bulaoge.com/?icekernel
2007-05-08 (Tue) 10:38
I haven't done this in Python. When I scraped pages with PHP I always used regex.
If nothing else works, just use regex.

On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> The examples all seem to be of this kind. I can't find a sample program that handles a table.
2007-05-08 (Tue) 10:42
On 07-5-8, Tian <askfor at gmail.com> wrote:
> I haven't done this in Python. When I scraped pages with PHP I always used regex.
> If nothing else works, just use regex.

See the documentation for the HTMLParser package.

--
GoogleTalk: qcxhome at gmail.com
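For readers who, like the original poster, cannot find a table-oriented example, here is a minimal sketch of the HTMLParser approach. It uses Python 3, where the module is named `html.parser`. The `SAMPLE` markup is an assumption rebuilt from the fragments that survive in the thread (the decoy cell and the exact tag nesting are invented for illustration), and `StatsCellParser` is a hypothetical name, not something from the thread.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup rebuilt from fragments quoted in the thread; the
# decoy cell and the exact tag nesting are assumptions.
SAMPLE = (
    '<table><tr><td class="pad10L">another pad10L cell</td></tr></table>'
    '<table width="80%" height="20" border="0" cellpadding="0" '
    'cellspacing="0" bgcolor="#FFFFFF"><tr>'
    '<td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇,'
    '<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" '
    'target="_blank">会员</a>数<font color=red>2038</font></td>'
    '</tr></table>'
)


class StatsCellParser(HTMLParser):
    """Keep the text of the first pad10L cell that starts with LABEL."""

    LABEL = "共有主题数"

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while inside a <td class="pad10L">
        self.chunks = []       # text pieces collected from the current cell
        self.result = None     # full text of the matching cell, once found

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "pad10L":
            self.in_cell = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_cell:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            text = "".join(self.chunks)
            # The class is not unique, so the label decides which cell wins.
            if self.result is None and text.startswith(self.LABEL):
                self.result = text
            self.in_cell = False


parser = StatsCellParser()
parser.feed(SAMPLE)
parser.close()

cell_text = parser.result
counts = [int(n) for n in re.findall(r"\d+", cell_text or "")]
print(cell_text)
print(counts)
```

The sketch assumes the target cell contains no nested table; a page with nested cells would need a depth counter instead of the single `in_cell` flag.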
2007-05-08 (Tue) 10:59
On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> SGMLParser doesn't seem able to solve this. [...] here the class="pad10L"
> element is not unique, and its line number is not fixed either. The match
> has to be on class="pad10L">共有主题数 to work.
> Any pointers would be appreciated.

I suggest using beautifulsoup; it should be very convenient.

--
I like python!
UliPad: http://wiki.woodpecker.org.cn/moin/UliPad
My Blog: http://www.donews.net/limodou
2007-05-08 (Tue) 11:06
Thanks, limodou. I've had a look at it, and it really is quite convenient.

On 07-5-8, limodou <limodou at gmail.com> wrote:
> I suggest using beautifulsoup; it should be very convenient.

--
gtalk: icekernel at gmail.com
blog: http://www.bulaoge.com/?icekernel
2007-05-08 (Tue) 14:05
Why not do this with a regexp? Is it because of encoding problems? Regex processing should be faster than having beautifulsoup parse out a whole tree, shouldn't it?

Regex:
共有主题数<[^>]*>(\d+)
Then just take $1.

On 07-5-8, icekernel <icekernel at gmail.com> wrote:
> Thanks, limodou. I've had a look at it, and it really is quite convenient.
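2007-05-08 (Tue) 14:13
On 5/8/07, Rodin <schludern at gmail.com> wrote:
> Why not do this with a regexp? Is it because of encoding problems? Regex processing should be faster than having beautifulsoup parse out a whole tree, shouldn't it?
>
> Regex:
> 共有主题数<[^>]*>(\d+)
> Then just take $1.

That comes down to personal preference. Sometimes speed isn't the main concern; convenience and simplicity may be what matters most.

--
I like python!
UliPad: http://wiki.woodpecker.org.cn/moin/UliPad
My Blog: http://www.donews.net/limodou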
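Rodin's pattern can be tried in Python roughly as follows. This is only a sketch: the `SAMPLE` string is an assumption rebuilt from fragments quoted in the thread, `first_int` is a hypothetical helper, and the second pattern for the post count is an extension of the same idea rather than something Rodin posted. In Python the captured group is read with `match.group(1)` rather than `$1`.

```python
import re

# Hypothetical fragment rebuilt from pieces quoted in the thread; the real
# page markup may differ.
SAMPLE = (
    '<td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇</td>'
)

# Rodin's pattern: the label, exactly one tag, then the digits to capture.
TOPICS_RE = re.compile(r"共有主题数<[^>]*>(\d+)")
# Same idea applied to the post count (an extension, not from the thread).
POSTS_RE = re.compile(r"贴子数<[^>]*>(\d+)")


def first_int(pattern, html):
    """Return the first captured group as an int, or None if no match."""
    match = pattern.search(html)
    return int(match.group(1)) if match else None


topics = first_int(TOPICS_RE, SAMPLE)
posts = first_int(POSTS_RE, SAMPLE)
print(topics, posts)
```

Because the pattern anchors on the label text, it is unaffected by the `pad10L` class appearing elsewhere on the page, but it breaks if the markup between the label and the number changes to anything other than a single tag.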
2007-05-08 (Tue) 15:18
Right. If more complex analysis is needed, regex becomes a struggle and Beautifulsoup is the better choice; working with it is somewhat like working with the HTML DOM.

On 07-5-8, limodou <limodou at gmail.com> wrote:
> That comes down to personal preference. Sometimes speed isn't the main concern; convenience and simplicity may be what matters most.
2007-05-08 (Tue) 20:10
On 5/8/07, Rodin <schludern at gmail.com> wrote:
> Right. If more complex analysis is needed, regex becomes a struggle and Beautifulsoup is the better choice; working with it is somewhat like working with the HTML DOM.

Nod. BeautifulSoup itself can also be combined with regex, which gives a great deal of flexibility and search power.

--
I like Python & Linux.
Blog: http://recordus.cublog.cn
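One way to combine the two, sketched under assumptions: this uses the current `bs4` package (third-party, installed as `beautifulsoup4`) with the standard-library `html.parser` backend, not the 2007-era release the thread would have used. The `SAMPLE` markup is rebuilt from fragments quoted in the thread, and the decoy cell is invented for illustration.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup rebuilt from fragments quoted in the thread; the
# decoy cell and the exact tag nesting are assumptions.
SAMPLE = (
    '<table><tr><td class="pad10L">another pad10L cell</td></tr></table>'
    '<table><tr>'
    '<td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇,'
    '<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" '
    'target="_blank">会员</a>数<font color=red>2038</font></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# A regex selects the text node carrying the label; the tree then gives
# access to the enclosing cell, so the non-unique class is no obstacle.
label = soup.find(string=re.compile("共有主题数"))
cell = label.find_parent("td") if label is not None else None

cell_text = cell.get_text() if cell is not None else ""
topics, posts, members = (int(n) for n in re.findall(r"\d+", cell_text))
print(topics, posts, members)
```

The regex only locates the cell; the numbers are then read from the cell's text, so the code does not depend on which tags wrap each number.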