Python Forum - Discussion Board

Subject: [python-chinese] How to extract the contents of a specific table from HTML

Tuesday, May 8, 2007 10:02

icekernel icekernel at gmail.com
Tue May 8 10:02:01 HKT 2007

Hello, experts:
        I am not a professional programmer, but I have recently become interested in Python, and I am using it to write a small program that scrapes some operational statistics from the Baidu Tieba rankings. I have run into the following problem and would appreciate your advice.
I want to extract the part marked in red from the HTML below.
   The result I have in mind is:
           百度贴吧_魔兽世界吧 ,主题数316223 贴子数3042967篇 会员数2038.
           (Baidu Tieba_World of Warcraft bar, topics 316223, posts 3042967, members 2038.)

  SGMLParser does not seem able to solve this. The SGMLParser examples I have found all seem to parse things of this kind, whereas here <td class="pad10L"> is not unique, and its line number is not fixed either. I need to match <td class="pad10L">共有主题数 ("total number of topics") specifically.
  Any pointers would be appreciated.

百度贴吧_魔兽世界吧
............................................

请问是狂暴战士输出高还还是武器战士输出高?
125.64.55.*
08:41  121.201.44.*

<td class="pad10L">共有主题数<font color=red>316223</font>个,贴子数<font color=red>3042967</font>篇,<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" target="_blank">会员</a>数<font color=red>2038</font></td>

-- 
gtalk: icekernel at gmail.com
blog: http://www.bulaoge.com/?icekernel

[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Tuesday, May 8, 2007 10:06

麦田守望者 qcxhome at gmail.com
Tue May 8 10:06:03 HKT 2007

On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> SGMLParser does not seem able to solve this. Here <td class="pad10L">
> is not unique, and its line number is not fixed either. I need to match
> <td class="pad10L">共有主题数 specifically.
>
Use HTMLParser.

-- 
GoogleTalk: qcxhome at gmail.com
MSN: qcxhome at hotmail.com
My Space: tkdchen.spaces.live.com
BOINC: boinc.berkeley.edu
中国分布式计算总站 (China distributed computing site): www.equn.com

Tuesday, May 8, 2007 10:28

icekernel icekernel at gmail.com
Tue May 8 10:28:44 HKT 2007

The examples all seem to be of this kind; I cannot find a sample program that handles a table.

On 5/8/07, 麦田守望者 <qcxhome at gmail.com> wrote:
> Use HTMLParser.

Tuesday, May 8, 2007 10:38

Tian askfor at gmail.com
Tue May 8 10:38:52 HKT 2007

I have not done this in Python. When I scraped pages with PHP, I always used regular expressions.
If nothing else works, just use a regex.

On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> The examples all seem to be of this kind; I cannot find a sample
> program that handles a table.

Tuesday, May 8, 2007 10:42

麦田守望者 qcxhome at gmail.com
Tue May 8 10:42:57 HKT 2007

On 5/8/07, Tian <askfor at gmail.com> wrote:
> I have not done this in Python. When I scraped pages with PHP, I always
> used regular expressions.
> If nothing else works, just use a regex.
>
See the documentation for the HTMLParser package.
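
The HTMLParser suggestion above can be sketched roughly as follows. This is only an illustration, assuming Python 3, where the module lives at html.parser; the names CellTextCollector, find_stats_cell and SAMPLE_HTML are made up for this sketch, and the sample fragment is modelled on the cell described in the question rather than on the real page.

```python
from html.parser import HTMLParser

# Hypothetical fragment modelled on the cell described in the thread.
# The first table shows that class="pad10L" alone is not unique.
SAMPLE_HTML = (
    '<table><tr><td class="pad10L">unrelated cell</td></tr></table>'
    '<table><tr><td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇,'
    '<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" target="_blank">会员</a>数'
    '<font color=red>2038</font></td></tr></table>'
)


class CellTextCollector(HTMLParser):
    """Collect the text of every <td class="pad10L"> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text of each finished cell
        self._parts = None   # pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Start collecting when a matching cell opens.
        if tag == "td" and dict(attrs).get("class") == "pad10L":
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <font> and <a> tags also arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the current cell and store its joined text.
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts))
            self._parts = None


def find_stats_cell(html, label="共有主题数"):
    """Return the text of the pad10L cell that starts with label, or None."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    for text in parser.cells:
        if text.startswith(label):
            return text
    return None


print(find_stats_cell(SAMPLE_HTML))
```

Filtering on the label after collecting the cells is what handles the "not unique" problem: every pad10L cell is gathered, and only the one whose text begins with 共有主题数 is returned. The sketch does not handle tables nested inside the cell.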

Tuesday, May 8, 2007 10:59

limodou limodou at gmail.com
Tue May 8 10:59:15 HKT 2007

On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> SGMLParser does not seem able to solve this. Here <td class="pad10L">
> is not unique, and its line number is not fixed either. I need to match
> <td class="pad10L">共有主题数 specifically.
> Any pointers would be appreciated.
>
I suggest using BeautifulSoup; it should be very convenient.

-- 
I like python!
UliPad <>: http://wiki.woodpecker.org.cn/moin/UliPad
My Blog: http://www.donews.net/limodou
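
A rough sketch of the BeautifulSoup suggestion, under some assumptions: it uses the current bs4 package (the thread predates it and would have used BeautifulSoup 3), the stdlib "html.parser" backend, and a made-up fragment modelled on the cell from the question. The function name forum_stats is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the cell described in the thread.
SAMPLE_HTML = (
    '<table><tr><td class="pad10L">unrelated cell</td></tr></table>'
    '<table><tr><td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇,'
    '<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" target="_blank">会员</a>数'
    '<font color=red>2038</font></td></tr></table>'
)


def forum_stats(html, label="共有主题数"):
    """Return the numbers in the pad10L cell that starts with label."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ is the bs4 keyword for filtering on the class attribute.
    for cell in soup.find_all("td", class_="pad10L"):
        if cell.get_text().startswith(label):
            # In the assumed markup each number sits in its own <font> tag.
            return [int(font.get_text()) for font in cell.find_all("font")]
    return None


print(forum_stats(SAMPLE_HTML))
```

With the assumed markup the three values come back in page order: topics, posts, members.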

Tuesday, May 8, 2007 11:06

icekernel icekernel at gmail.com
Tue May 8 11:06:44 HKT 2007

Thanks, limodou. I took a look at it, and it really is quite convenient.

On 5/8/07, limodou <limodou at gmail.com> wrote:
> I suggest using BeautifulSoup; it should be very convenient.

Tuesday, May 8, 2007 14:05

Rodin schludern at gmail.com
Tue May 8 14:05:21 HKT 2007

Why not do this with a regexp? Is it because of encoding problems? Matching with a regex should be faster than having BeautifulSoup parse out a whole tree, shouldn't it?

Regex:
共有主题数<[^>]*>(\d+)
Then just take $1.

On 5/8/07, icekernel <icekernel at gmail.com> wrote:
> Thanks, limodou. I took a look at it, and it really is quite convenient.
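
In Python, the pattern from the post above could be applied roughly like this. It is a sketch under assumptions: PAGE_BYTES is a made-up stand-in for the downloaded page, and GBK is a guess at the page encoding (the escaped URL in the question looks GBK-encoded). The function name topic_count is hypothetical.

```python
import re

# Hypothetical bytes standing in for the downloaded page; GBK is an assumption.
PAGE_BYTES = (
    '<td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇</td>'
).encode("gbk")

# The pattern from the post: the label, exactly one tag, then the digits.
TOPIC_COUNT = re.compile(r"共有主题数<[^>]*>(\d+)")


def topic_count(page_bytes, encoding="gbk"):
    """Return the topic count as an int, or None if the label is absent."""
    # Decode first so the pattern and the page are both Unicode text.
    text = page_bytes.decode(encoding, errors="replace")
    match = TOPIC_COUNT.search(text)
    # group(1) plays the role of $1 in the post.
    return int(match.group(1)) if match else None


print(topic_count(PAGE_BYTES))
```

Decoding before matching is one way to sidestep the encoding question raised in the post: once the bytes are text, the Chinese label in the pattern compares directly.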

Tuesday, May 8, 2007 14:13

limodou limodou at gmail.com
Tue May 8 14:13:34 HKT 2007

On 5/8/07, Rodin <schludern at gmail.com> wrote:
> Why not do this with a regexp? Is it because of encoding problems? Matching
> with a regex should be faster than having BeautifulSoup parse out a whole
> tree, shouldn't it?
>
> Regex:
> 共有主题数<[^>]*>(\d+)
> Then just take $1.
>
That comes down to personal preference. Sometimes speed is not the main concern; convenience and simplicity may matter most.

Tuesday, May 8, 2007 15:18

Rodin schludern at gmail.com
Tue May 8 15:18:33 HKT 2007

Right. If more complex analysis is needed, regexes become a struggle, and BeautifulSoup is the better choice; working with it is somewhat like working with the HTML DOM.

On 5/8/07, limodou <limodou at gmail.com> wrote:
> That comes down to personal preference. Sometimes speed is not the main
> concern; convenience and simplicity may matter most.

Tuesday, May 8, 2007 20:10

Xupeng Yun recordus at gmail.com
Tue May 8 20:10:28 HKT 2007

On 5/8/07, Rodin <schludern at gmail.com> wrote:
>
> Right. If more complex analysis is needed, regexes become a struggle, and
> BeautifulSoup is the better choice; working with it is somewhat like
> working with the HTML DOM.
>
Agreed. BeautifulSoup can itself be combined with regular expressions, which gives you a great deal of flexibility and search power.

-- 
I like Python & Linux.
Blog: http://recordus.cublog.cn
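
One way the combination mentioned above might look, again as a sketch with assumptions: it uses the current bs4 package with the stdlib "html.parser" backend, the sample fragment is made up to resemble the cell in the question, and stats_near_label is a hypothetical name. The regex locates the label text, BeautifulSoup walks up to the enclosing cell, and a second regex pulls out the digits.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the cell described in the thread.
SAMPLE_HTML = (
    '<table><tr><td class="pad10L">unrelated cell</td></tr></table>'
    '<table><tr><td class="pad10L">共有主题数<font color=red>316223</font>个,'
    '贴子数<font color=red>3042967</font>篇,'
    '<a href="http://hi.baidu.com/q/%C4%A7%CA%DE%CA%C0%BD%E7" target="_blank">会员</a>数'
    '<font color=red>2038</font></td></tr></table>'
)


def stats_near_label(html, label_pattern="共有主题数"):
    """Return every number in the <td> whose text matches label_pattern."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as string= is searched against each text node.
    label = soup.find(string=re.compile(label_pattern))
    if label is None:
        return None
    # Climb from the text node to the table cell that contains it.
    cell = label.find_parent("td")
    if cell is None:
        return None
    return [int(n) for n in re.findall(r"\d+", cell.get_text())]


print(stats_near_label(SAMPLE_HTML))
```

Compared with a regex over the raw page, this does not depend on how many tags sit between the label and each number.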
