2007-03-13 Tuesday 14:47

This was the first-round question from my recent interview at douban.com; after I got home I went ahead and implemented it: a site-specific crawler targeting dangdang.com. I started out using pysqlite2 with SQLite as the database, but I could not sort out the concurrent-access problems, so I switched to BerkeleyDB, i.e. the dbhash module. The BerkeleyDB data-model part was written in a basement, so don't blame me, heh.

Anyone who is interested can email me and I will reply with a compressed dump of the crawler's SVN repository. It is at SVN revision 1.3.2. I strongly advise against reading only the source, because many of the problems found while debugging are written straight into the commit logs.

Current status: 2 threads and roughly 3,000 URLs, and it is fairly slow. It was even slower back when I used SQLite, although by then it had collected more than 20,000 URLs.

One more thing I hope the experts here can look at: since switching to BerkeleyDB, using threading.Lock() goes wrong after the program has run for a while — no exception is raised, the Python interpreter simply aborts.

--
Once there was a very cold caterpillar who longed for a little warmth. His only chance of getting it was to drop from a tree into someone's collar: a moment of warmth, and then his life is over. Yet many of his kind never even get that moment...
Will I find warmth? I try ever so carefully, yet still get hurt.
I would fight for that one moment of warmth, but who is willing to accept it?

My blog:
http://blog.csdn.net/gashero
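One way to sidestep this kind of interpreter abort is to stop sharing the BerkeleyDB handle between threads altogether: a single writer thread owns the dbhash handle and the crawler threads hand it work through a Queue, so no threading.Lock() is needed around database calls. Below is a minimal sketch under that assumption, written against Python 2's dbhash, Queue and threading modules; the file name 'urls.db' and the key/value layout are made up for illustration and are not taken from gashero's crawler.

import dbhash
import Queue
import threading

write_queue = Queue.Queue()

def db_writer(path):
    # this thread is the only one that ever touches the BerkeleyDB handle,
    # so no lock around the database calls is needed
    db = dbhash.open(path, 'c')
    while True:
        item = write_queue.get()
        if item is None:          # sentinel value tells the writer to stop
            break
        key, value = item
        db[key] = value
        db.sync()
    db.close()

writer = threading.Thread(target=db_writer, args=('urls.db',))
writer.start()

# a crawler thread queues a write instead of touching the database itself:
write_queue.put(('http://www.dangdang.com/', '0'))

# at shutdown, push the sentinel and wait for the writer to drain the queue:
write_queue.put(None)
writer.join()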
2007-03-13 Tuesday 15:10

Whoa........ an interview exam on how to write a crawler?
2007-03-13 Tuesday 15:47

I'd like to study it too, thank you.
2007-03-13 Tuesday 16:52

I'm very interested in studying it. My learning lately has been aimless and I don't feel I'm getting much out of it, and I can't think of anything to build either. The problem is I can't see your original email address from the list post, so I have no way to mail you :( My address is zbbstar在gmail.com — please send me a copy when you see this. Thanks.
2007-03-13 Tuesday 17:10

I'm also very interested in learning from it. Could you send a copy of the source code to my mailbox? Thank you.
2007-03-13 Tuesday 17:25

Please send me a copy as well, thanks.

--
Best Regards
JesseZhao (ZhaoGuang)
Blog: Http://JesseZhao.cnblogs.com
E-Mail: Prolibertine在gmail.com
IM (Live Messenger): Prolibertine在gmail.com
2007-03-13 Tuesday 17:27

Please send me a copy of the source code too. Thank you.
2007-03-13 Tuesday 22:22

Please send me a copy as well — I'd like to study it. Thanks.
2007-03-13 Tuesday 22:29

I'm interested — thanks for providing a copy!

--
Honker Network: http://www.allhonker.com
The first hacker-oriented site in China built on wiki technology!
2007-03-13 Tuesday 23:03

Here is a quick & dirty spider I wrote last year for crawling a specified site. I've forgotten most of it by now, but perhaps it can serve as a reference for everyone.
# -*- coding: utf-8 -*-
from twisted.python import threadable
threadable.init()
from twisted.internet import reactor, threads
import urllib2
import urllib
import urlparse
import time
from sgmllib import SGMLParser
from usearch import USearch # this part handles the database operations; its source cannot be published
class URLLister(SGMLParser):
def reset(self):
SGMLParser.reset(self)
self.urls = []
def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
if href:
self.urls.extend(href)
class Filter:
def __init__(self, Host, denys=None, allows=None):
self.deny_words = denys
self.allow_words = allows
# Check url is valid or not.
def verify(self, url):
for k in self.deny_words:
if url.find(k) != -1:
return False
for k in self.allow_words:
if url.find(k) !=-1:
return True
return True
class Host:
def __init__(self, hostname, entry_url=None, description=None,
encoding=None, charset=None):
self.hostname = hostname
self.entry_url = entry_url
self.encoding = encoding
self.charset = charset
self.description = description
def configxml(self):
import elementtree.ElementTree as ET
root = ET.Element("config")
en = ET.SubElement(root, "encoding")
en.text = self.encoding
ch = ET.SubElement(root, "charset")
ch.text = self.charset
entry = ET.SubElement(root, "entry_url")
entry.text = self.entry_url
return ET.tostring(root)
def parse_config(self, configstring):
import elementtree.ElementTree as ET
from StringIO import StringIO
tree = ET.parse(StringIO(configstring))
self.encoding = tree.findtext(".//encoding")
self.charset = tree.findtext(".//charset")
self.entry_url = tree.findtext(".//entry_url")
def create(self):
u = USearch()
self.configs = self.configxml()
ret = u.CreateDomain(self.hostname,self.description, self.configs)
#print ret
def load(self, flag='A'): # 'A' means all, 0 means unvisited, 1 == visiting, 2 == visited.
# TODO: load domain data from backend database.
u = USearch()
try:
ret = u.ListDomain(flag)['result']
for d in ret:
if d.domain == self.hostname:
self.parse_config(d.parse_config)
self.description = d.description
return True
except:
pass
return False
class Page:
def __init__(self, url, host, description=None):
self.url = url
self.description = description
self.host = host
self.page_request = None
self.content = None
self.status_code = None
self.encoding = None
self.charset = None
self.length = 0
self.md5 = None
self.urls = []
# Read web page.
def get_page(self, url=None):
if not url: url = self.url
type = get_type(self.host.hostname,url)
if type != 0: return None
try:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
self.page_request = opener.open(urllib.unquote(url))
#self.page_request = urllib2.urlopen(url)
self.content = self.page_request.read()
self.status_code = self.page_request.code
return self.status_code
except:
self.status_code = 500
print "ERROR READING: %s" % self.url
return None
def get_header(self):
if not self.page_request:
self.get_page()
header = self.page_request.info()
try:
self.length = header['Content-Length']
content_type = header['Content-Type']
#if content_type.find('charset') == -1:
self.charset = self.host.charset
self.encoding = self.host.encoding
except:
pass
def get_urls(self):
if not self.page_request:
self.get_page()
if self.status_code != 200:
return
parser = URLLister()
try:
parser.feed(self.content)
except:
print "ERROR: Parse urls error!"
return
#print "URLS: ", parser.urls
#self.urls = parser.urls
if not self.charset: self.charset = "gbk"
for i in parser.urls:
try:
type = get_type(self.host.hostname,i)
if type == 4:
i = join_url(self.host.hostname, self.url, i)
if type == 0 or type ==4:
if i:
i = urllib.quote(i)
self.urls.append(i.decode(self.charset).encode('utf-8'))
except:
pass
parser.close()
self.page_request.close()
def save_header(self):
# Save header info into db.
pass
def save_current_url(self):
save_url = urllib.quote(self.url)
usearch = USearch()
usearch.CreateUrl( domain=self.host.hostname, url=save_url,
length=self.length, status_code=self.status_code)
# Set URL's flag
def flag_url(self, flag):
usearch = USearch()
usearch.UpdateUrl(status=flag)
def save_urls(self):
# Save all the founded urls into db
print "RELEATED_URLS:", len(self.urls)
usearch = USearch()
usearch.CreateRelateUrl(urllib.quote(self.url), self.urls)
def save_page(self):
usearch = USearch()
import cgi
try:
content = self.content.decode(self.charset).encode('utf-8')
usearch.CreateSearchContent(self.url.decode(self.charset).encode('utf-8'),
content)
except:
print "ERROR to save page"
return -1
print "SAVE PAGE Done", self.url
return 0
def get_type(domain, url):
if not url: return 5
import urlparse
tup = urlparse.urlparse(url)
if tup[0] == "http":
# check if the same domain
if tup[1] == domain: return 0
else: return 1 # outside link
if tup[0] == "javascript":
return 2
if tup[0] == "ftp":
return 3
if tup[0] == "mailto":
return 5
return 4 # internal link
def join_url(domain, referral, url):
if not url or len(url) ==0: return None
tup = urlparse.urlparse(url)
if not tup: return None
if tup[0] == "javascript" or tup[0] == "ftp": return None
else:
if url[0] == "/": # means root link begins
newurl = "http://%s%s" % ( domain, url)
return newurl
if url[0] == ".": return None # ignore relative link at first.
else:
# if referral.rfind("/") != -1:
# referral = referral[0:referral.rfind("/")+1]
# newurl = "%s%s" % (referral, url)
newurl = urlparse.urljoin(referral, url)
return newurl
if __name__ == '__main__':
def done(x):
u = USearch()
x = urllib.quote(x.decode('gbk').encode('utf-8'))
u.SetUrlStatus(x, '2')
time.sleep(2)
print "DONE: ",x
url = next_url(h)
if not url: reactor.stop()
else:threads.deferToThread(spider, h, url ).addCallback(done)
def next_url(host):
u = USearch()
ret = u.GetTaskUrls(host.hostname,'0',1)['result']
try:
url = urllib.unquote(ret[0].url)
except:
return None
if urlparse.urlparse(url)[1] != host.hostname: return next_url(host) # skip URLs that belong to a different host
return urllib.unquote(ret[0].url)
def spider(host, surf_url):
#surf_url = surf_url.decode(host.charset).encode('utf-8')
surf_url = urllib.unquote(surf_url)
p = Page(surf_url, host)
#try:
if not p.get_page():
print "ERROR: GET %s error!" % surf_url
return surf_url # Something Wrong!
p.get_header() # Get page's header
p.get_urls() # Get all the urls in page
#print p.urls
p.save_current_url() # Save current page's url info into DB
p.save_urls()
p.save_page()
#except:
# pass
return surf_url
import sys
#host = Host("www.chilema.cn", "/Eat/", "Shenzhen Local", "","gb2312")
#host.create()
#~ h = Host("www.chilema.cn")
#~ h.load()
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/")
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/canyin/")
#~ #reactor.callInThread(Spider, h, "http://beta.u2m.cn/fb/")
#~ threads.deferToThread(spider, h, "http://www.chilema.cn/Eat/").addCallback(done)
#host = Host("www.ziye114.com", "", "Beijing Local", "gb2312")
#host.create()
hostname = sys.argv[1]
entry_url = ""
if len(sys.argv) == 3: entry_url = sys.argv[2]
h = Host(hostname)
hostname_url = "http://%s/%s" % (hostname,entry_url)
h.load()
threads.deferToThread(spider, h, hostname_url).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
threads.deferToThread(spider, h, next_url(h)).addCallback(done)
reactor.run()
------------------------------
Best Regards,
Devin Deng
2007-03-14 Wednesday 01:30

I find the original poster's willingness to share his code truly admirable, and the thirst for knowledge of everyone replying to ask for it is just as commendable — but it would be even better if those requests were sent directly to his email address.
2007-03-14 Wednesday 09:21

The key is to find the performance bottleneck. If it is the database, you can switch to an sqlrelay connection pool or to a MySQL database; if the thread count is too low, add threads or use a thread pool; if the Python-side logic is slow, move the hot spots out into a C module; if it is the network that is slow.......
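Before choosing among these fixes, it may help to measure where the time actually goes. A minimal sketch using the standard cProfile and pstats modules; crawl_once() is a hypothetical stand-in for one fetch-parse-store cycle of the crawler, not a function from the code in this thread.

import cProfile
import pstats

def crawl_once():
    # hypothetical stand-in: fetch one page, parse its links, store the results
    pass

cProfile.run('crawl_once()', 'crawl.prof')
stats = pstats.Stats('crawl.prof')
stats.sort_stats('cumulative').print_stats(20)   # 20 most expensive call paths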
2007-03-14 Wednesday 09:50

If the spider hits too hard, it will probably get itself blocked by the target site.
2007-03-14 Wednesday 10:47

Then better to crawl slowly — more haste, less speed... haha...
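If getting banned is the worry, the simplest throttle is a fixed delay between consecutive requests. A minimal single-threaded sketch assuming Python 2's urllib2; the 2-second default and the example URLs are arbitrary choices, not something the crawlers in this thread actually do.

import time
import urllib2

_last_request = 0.0   # time of the previous request

def polite_fetch(url, delay=2.0):
    # leave at least `delay` seconds between consecutive requests
    global _last_request
    wait = _last_request + delay - time.time()
    if wait > 0:
        time.sleep(wait)
    _last_request = time.time()
    return urllib2.urlopen(url).read()

# example: fetch two pages at least 2 seconds apart
page1 = polite_fetch('http://www.dangdang.com/')
page2 = polite_fetch('http://www.dangdang.com/book/')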
2007-03-14 Wednesday 11:14

I'd like to study it too — please send me a copy by replying directly to this email. Thank you.
2007-03-14 Wednesday 11:31

On 3/13/07, Devin Deng <deng.devin在gmail.com> wrote:
> Here is a quick & dirty spider I wrote last year for crawling a specified site.
> I've forgotten most of it by now, but perhaps it can serve as a reference for everyone.

Collected!
http://wiki.woodpecker.org.cn/moin/MicroProj/2007-03-14

--
'''Time is unimportant, only life important!
http://zoomquiet.org
'''
2007-03-14 Wednesday 23:15

The bottleneck is most likely the CPU, or else the network.
2007-03-14 Wednesday 23:15

Correction — it's the database; things like indexes need to be set up properly.
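For the SQLite variant mentioned at the start of the thread, the lookup an index has to cover is "have we already seen this URL". A minimal sketch using the sqlite3 module; the database file, table and column names are made up for illustration.

import sqlite3

conn = sqlite3.connect('crawler.db')           # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT, status INTEGER)')
# a unique index keeps the "was this URL crawled already?" lookup fast
# and doubles as a duplicate guard when inserting newly discovered links
conn.execute('CREATE UNIQUE INDEX IF NOT EXISTS idx_urls_url ON urls (url)')
conn.commit()

# the dedup check then stays an index lookup instead of a table scan:
row = conn.execute('SELECT status FROM urls WHERE url = ?',
                   ('http://www.dangdang.com/',)).fetchone()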
2007-03-15 Thursday 00:05

Please send me a copy of the source code too. I'd like to study it, thanks.

--
sleepy right brain
2007-03-15 Thursday 10:39

This is a mailing list. If you want a copy, just send your request to his **personal** mailbox. What is the point of posting to the list so that a whole crowd of people receive it? I also don't know what gashero, who started this thread, was thinking: if you want to share code, wouldn't it be better to paste it to the list directly? This looks just like the spammers who post in order to harvest email addresses.
2007-03-15 Thursday 10:56

Heh — everyone here is simply eager to learn. But besides asking "send me the code", it would be better to also share your own impressions and actually discuss the problem gashero ran into. His approach may not have been ideal, but gashero's willingness to share is still admirable.